The following table describes each of the workflows available in the De novo genome assembly and editing tab of the Workflow screen.
Group | Workflow | Description |
---|---|---|
ABI / Sanger | De novo assembly | Fast, accurate trimming and assembly of Sanger trace data, creating a project file that can be edited in SeqMan Ultra or SeqMan Pro. This non-templated assembly supports up to 30 million sequence reads and a combined contig length of up to 50 Mbase; the actual capacity depends on the amount of available RAM. When assembling a data set de novo, we recommend using paired-end data, if available. |
Genome finishing – refinement | Align Sanger data to a draft sequence for further refinement of small errors. (Note: Use the Variant Analysis/Resequencing workflow if your primary intent is SNP analysis.) This workflow is most frequently used for extending off the ends of saved contig consensus sequences and correcting small errors within the contigs. This type of assembly can include up to 10 million reads and a genome of up to 100 Mbase. It can be edited at a later time using a utility like SeqMan Ultra or SeqMan Pro. | |
NGS-based | De novo assembly | Assembly of Sanger, Illumina and long-read sequencing data that produces a file that can be edited with SeqMan Ultra or SeqMan Pro. This non-templated assembly supports up to 30 million sequence reads and a combined contig length of up to 50 Mbase; the actual capacity depends on the amount of available RAM. When assembling a data set de novo, we recommend using paired-end data, if available. |
Genome finishing – initial error correction | Align NGS data to a draft genome or contigs to correct large misalignments and smaller errors. This option uses both reference-guided and de novo assembly steps to resolve single-nucleotide and small multibase replacements (indels) as well as three types of larger structural variation (SV): insertions, deletions and large indels, all with minimal user intervention. In this workflow, your data should be from a haploid genome and include at least one mate-pair data set with read lengths of 100 bases or greater. Your total number of reads should be 10 million or fewer. If you use a larger data set, only the first 10 million reads will be used. For mate-pair data, equal numbers of matching forward and reverse reads are processed. The SQD-formatted assembly can be edited at a later time using SeqMan Ultra or SeqMan Pro. When opened in either application, contigs will already be organized into scaffolds in the Explorer panel. This workflow replaces the “gap closure workflow” from Lasergene 16.0 and earlier. The newer version adds a “refinement” stage before the “gap closure” stage and some additional “finishing” steps after gap closure. During assembly, data is processed in several stages: | |
Genome finishing – refinement | Align NGS data to a draft genome for further refinement of small errors and closing small gaps between contigs. This workflow is most frequently used for extending off the ends of saved contig consensus sequences and correcting small errors within the contigs. This type of assembly, which uses mate-pair data, can include up to 10 million reads and up to a 100 Mbase genome. It can be edited at a later time using a utility like SeqMan Ultra or SeqMan Pro. | |
Combined reference-guided/de novo assembly | This workflow aligns paired-end NGS data from a new strain/isolate to a closely related reference genome (>90% identity) and replaces SNVs, small indels and larger structural variants in the reference with the sequences of the new organism. This workflow is analogous to the Genome finishing – initial error correction workflow above and uses the same series of stages to construct the new sequence from the starting reference. | |
PacBio/Nanopore | De novo assembly (beta) | De novo assembly of long-read-only data sets with an option to first “correct” a genome-spanning set of overlapping reads prior to assembly. This workflow is designed to work with Oxford Nanopore and PacBio CLR & HiFi (AKA “CCS”) reads. It typically produces more contigs than the standard single-stage de novo assembly, but consensus sequences are usually of higher base-level accuracy. The optional “correct first” mode is initiated from the Preassembly Options screen by selecting the Run a first-pass correction assembly option. The “correct first” mode consists of two stages. First, the set of primary overlapping reads covering each contig from end to end is identified and combined with their overlapping and containment reads in a series of mini-assemblies, the consensus sequences of which represent “corrected” sequence reads. Second, the corrected read sequences are de novo assembled into a final assembly from which new consensus sequences are determined. In the Post Assembly Options screen, you can optionally specify a reference sequence to use for ordering contigs into scaffolds. Note: If your genome is over 15 Mbase in length, you can only use this workflow if you use PacBio HiFi data and specify that read technology in the wizard. |
De novo assembly and polishing (beta) | De novo assembly of long-read-only data sets followed by NGS polishing to correct assembly errors. This workflow is useful for error-prone first-generation long-read data but is not necessary for PacBio HiFi data or newer-generation Oxford Nanopore data. Choosing this workflow will first de novo assemble a long-read data set and then automatically run the Genome finishing – initial error correction workflow (above, this table) starting from the de novo assembled consensus sequence(s). | |
NGS polishing of draft genome (beta) | This workflow is also known as the “Illumina correction” workflow. This option takes an existing set of long-read-assembled contig consensus sequences (AKA a “draft genome”) together with an NGS paired-end data set from the same organism and runs the Genome finishing – initial error correction workflow (above, this table). The draft genome is often from HGap, Canu or Unicycler, but can also be a genome that was “sloppily” assembled in the past from 454 data; the NGS paired-end data is typically Illumina data. | |
Genome finishing using long read data | This workflow is intended for unordered contigs that require reference-guided scaffolding, and is similar to the Genome finishing – initial error correction workflow for NGS data described above. As an example of when you might want to use this workflow, imagine you have an Illumina data set of some new Drosophila species that is different enough from D. melanogaster to make templated assembly problematic. You therefore follow a de novo assembly workflow that results in hundreds of unordered contigs. If you are using one of the older long-read technologies, one possibility would be to add long-read data to extend the ends. However, this will lead to numerous mis-joins and an unreliable assembly. If you are using PacBio HiFi or ONT Duplex data, a better option would be to de novo assemble the long reads without genome polishing. This technology will likely produce a reasonable assembly. In both cases, though, a better solution would be to use this genome finishing workflow to align the contigs to a related reference genome. This ensures that at least some of the contigs will be assigned to the correct chromosome and will be ordered and oriented properly. If desired, you can then open the results in SeqMan Ultra and use “Align end-to-end” to join adjacent contigs. | |
Scaffolding / Order Contigs | Reference-guided scaffolding of existing contigs | This option is useful if you have a set of unordered de novo assembled contigs and a new related reference genome becomes available. This workflow can order the contigs and assign them to the correct chromosomes. |
Reference-guided scaffolding and contig joining | This option starts just like the option in the previous row. It then automatically attempts an end-to-end contig alignment across all adjacent contigs. |
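
The read-count limits quoted in the table can be checked before a workflow is launched. The following is a minimal sketch in Python and is not part of Lasergene: the names `READ_LIMITS`, `count_fastq_reads` and `check_read_limit` are hypothetical, the limits are simply copied from the table above, and the counter assumes unwrapped FASTQ files with exactly four lines per record.

```python
import gzip

# Read-count limits copied from the table above (hypothetical keys).
READ_LIMITS = {
    "de_novo": 30_000_000,
    "finishing_initial_error_correction": 10_000_000,
    "finishing_refinement": 10_000_000,
}


def count_fastq_reads(path):
    """Count records in a FASTQ file, assuming four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4


def check_read_limit(workflow, read_count):
    """Return (within_limit, limit) for one of the keys in READ_LIMITS."""
    limit = READ_LIMITS[workflow]
    return read_count <= limit, limit
```

For paired data, count the forward and reverse files separately and pass the sum to `check_read_limit`; the table notes that the initial error correction workflow uses only the first 10 million reads of a larger data set.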