De novo genome assembling and editing workflows - User Guide to SeqMan NGen - 17.4



The following table describes each of the workflows available in the De novo genome assembly and editing tab of the Workflow screen.

Group	Workflow	Description
ABI / Sanger	De novo assembly	Fast, accurate trimming and assembly of Sanger trace data, creating a project file that can be edited in SeqMan Ultra or SeqMan Pro. This is a non-templated assembly of up to 30 million sequence reads and up to 50 Mbase total length for all contigs combined. The actual capacity is determined by the amount of available RAM. When assembling a data set de novo, we recommend using paired-end data, if available.
ABI / Sanger	Genome finishing – refinement	Align Sanger data to a draft sequence for further refinement of small errors. Note: Use the Variant Analysis/Resequencing workflow if your primary intent is SNP analysis. This workflow is most frequently used for extending off the ends of saved contig consensus sequences and correcting small errors within the contigs. This type of assembly can include up to 10 million reads and up to a 100 Mbase genome. It can be edited at a later time using a utility like SeqMan Ultra or SeqMan Pro.
NGS-based	De novo assembly	Assembly of Sanger, Illumina and long-read sequencing data that produces a file that can be edited with SeqMan Ultra or SeqMan Pro. This is a non-templated assembly of up to 30 million sequence reads and up to 50 Mbase total length for all contigs combined. The actual capacity is determined by the amount of available RAM. When assembling a data set de novo, we recommend using paired-end data, if available.
	Genome finishing – initial error correction	Align NGS data to a draft genome or contigs to correct large misalignments and smaller errors. This option uses both reference-guided and de novo assembly steps to resolve single-nucleotide and small multibase replacements (indels) as well as three types of larger structural variation (SV): insertions, deletions and large indels, all with minimal user intervention. In this workflow, your data should be from a haploid genome with at least one mate pair data set with read lengths of 100 bases or greater. Your total number of reads should be 10 million or less. If you use a larger data set, only the first 10 million reads will be used. For mate pair data, equal numbers of matching forward and reverse reads are processed. The SQD-formatted assembly can be edited at a later time using SeqMan Ultra or SeqMan Pro. When opened in either application, contigs will already be organized into scaffolds in the Explorer panel. This workflow replaces the “gap closure workflow” from Lasergene 16.0 and before. This newer version features an additional “refinement” stage before the “gap closure” stage and some additional “finishing” steps after the gap closure portion takes place. During assembly, data is processed in several stages: (1) Data is mapped and aligned to a user-defined set of consensus sequences from which a new consensus sequence is determined. Five rounds of this consensus refinement process are performed to remove the majority of single-nucleotide and small multibase errors. (2) Data is mapped and aligned to the refined consensus sequence(s) from stage 1 and then analyzed for characteristic SV motifs. (3) The reference sequence is split at the detected SV sites, forming a series of ordered contigs. (4) Mate pair and split reads from each SV event are collected in site-specific pools and assembled de novo. Deletions are detected using three types of data: split reads, spanning paired-end reads, and sequence coverage information. For insertions and replacements, mate pair reads corresponding to the new sequence are collected from the unassembled read pool. Only reads anchored by mates flanking the SV in the main assembly are used at this stage. (5) The de novo assembled contigs are then brought into the main assembly and positioned consistently with the mate pair information. (6) For SVs where the gap is not completely covered by the de novo assembled contigs (e.g., insertions longer than twice the size of the insert library), additional reads from the unassembled read pool that match and extend the ends of the joining contigs are added in an attempt to “walk” across the gap. This walk is terminated when either no new reads are found or a repeated element is encountered. Note that the final de novo assembly performed in stage 6 typically results in additional contigs being added to the final assembly project. These are often small contigs with redundant sequences of chromosomal segments. However, they can also represent plasmids, for example, that were not present in the input consensus sequences. Benchmarks comparing SeqMan NGen with three open-source tools are also available.
	Genome finishing – refinement	Align NGS data to a draft genome for further refinement of small errors and closing small gaps between contigs. This workflow is most frequently used for extending off the ends of saved contig consensus sequences and correcting small errors within the contigs. This type of assembly, which uses mate-pair data, can include up to 10 million reads and up to a 100 Mbase genome. It can be edited at a later time using a utility like SeqMan Ultra or SeqMan Pro.
	Combined reference-guided/de novo assembly	This workflow aligns paired-end NGS data from a new strain/isolate to a closely related reference genome (>90% identity) to replace SNVs and small indels, as well as larger structural variants, in the reference with the sequences of the new organism. This workflow is analogous to the Genome finishing – initial error correction workflow above and uses the same series of stages to construct the new sequence from the starting reference.
PacBio/Nanopore	De novo assembly (beta)	De novo assembly of long-read-only data sets with an option to first “correct” a genome-spanning set of overlapping reads prior to assembly. This workflow is designed to work with Oxford Nanopore and PacBio CLR & HiFi (AKA “CCS”) reads. This workflow typically produces more contigs than the standard single-stage de novo assembly, but consensus sequences are usually of higher base-level accuracy. The optional “correct first” mode of this workflow is initiated from the Preassembly Options screen by selecting the Run a first-pass correction assembly option. The “correct first” mode consists of two stages. First, the set of primary overlapping reads covering each contig from end to end is identified and combined with their overlapping and containment reads in a series of mini assemblies, the consensus sequences of which represent “corrected” sequence reads. Second, the corrected read sequences are de novo assembled into a final assembly from which new consensus sequences are determined. In the Post Assembly Options screen, you can optionally specify a reference sequence to use for ordering contigs into scaffolds. Note: If your genome is over 15 Mbase in length, you can only use this workflow if you use PacBio HiFi data and specify that read technology in the wizard.
	De novo assembly and polishing (beta)	De novo assembly of long-read-only data sets followed by NGS polishing to correct assembly errors. This workflow is useful for error-prone first-generation long-read data but is not necessary for PacBio HiFi data or newer-generation Oxford Nanopore data. Choosing this workflow will first de novo assemble a long-read data set and then automatically run the Genome finishing – initial error correction workflow (above, this table) starting from the de novo assembled consensus sequence(s).
	NGS polishing of draft genome (beta)	This workflow is also known as the “Illumina correction” workflow. This option takes an existing set of long-read assembled contig consensus sequences (AKA a “draft genome”) together with an NGS paired-end data set from the same organism and runs the Genome finishing – initial error correction workflow (above, this table). The draft genome is often from HGap, Canu or Unicycler, but can also be a genome that was “sloppily” assembled in the past from 454 data; the NGS paired-end data is typically Illumina data.
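
The read and size limits quoted in the table can be checked before you start the wizard. The following Python sketch is only an illustration and is not part of SeqMan NGen: the dictionary keys, the `WorkflowLimit` class and the `check_dataset` function are hypothetical names, and the figures are simply transcribed from the table above. Actual capacity also depends on available RAM, which this sketch does not model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WorkflowLimit:
    max_reads: int               # maximum number of sequence reads
    max_mbases: Optional[float]  # size limit in Mbase; None if the table gives no figure


# Figures transcribed from the table above. The keys are hypothetical labels,
# not identifiers used by SeqMan NGen.
LIMITS = {
    # up to 30 million reads, 50 Mbase total for all contigs combined
    "de_novo_assembly": WorkflowLimit(30_000_000, 50.0),
    # up to 10 million reads, 100 Mbase genome
    "genome_finishing_refinement": WorkflowLimit(10_000_000, 100.0),
    # 10 million reads or less; no size figure is stated
    "genome_finishing_initial_error_correction": WorkflowLimit(10_000_000, None),
}


def check_dataset(workflow: str, n_reads: int,
                  mbases: Optional[float] = None) -> List[str]:
    """Return a list of messages for every documented limit the data set exceeds."""
    limit = LIMITS[workflow]
    problems: List[str] = []
    if n_reads > limit.max_reads:
        problems.append(
            f"{n_reads:,} reads exceeds the documented limit of {limit.max_reads:,}"
        )
    if (mbases is not None and limit.max_mbases is not None
            and mbases > limit.max_mbases):
        problems.append(
            f"{mbases:g} Mbase exceeds the documented limit of {limit.max_mbases:g} Mbase"
        )
    return problems


if __name__ == "__main__":
    for message in check_dataset("genome_finishing_refinement", 12_000_000, 120.0):
        print(message)
```

An empty list means the data set is within the documented figures for that workflow. For the initial error correction workflow, exceeding the read limit is not fatal, since only the first 10 million reads are used.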

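The gap “walk” described for the Genome finishing – initial error correction workflow can be pictured with a toy example. The sketch below is a simplified greedy extension written for illustration only; it is not SeqMan NGen's implementation. The function name `walk_gap`, the `min_overlap` and `max_steps` parameters, and the use of a previously seen end sequence as a stand-in for “a repeated element” are all assumptions. It extends a contig end with reads whose start overlaps the current end, and stops under the two conditions named in the table: no new reads are found, or a repeat is encountered.

```python
from typing import List, Tuple


def walk_gap(contig_end: str, reads: List[str],
             min_overlap: int = 3, max_steps: int = 100) -> Tuple[str, str]:
    """Greedily extend contig_end with overlapping reads.

    Returns the extended sequence and the reason the walk stopped.
    """
    seq = contig_end
    unused = list(reads)
    # Crude repeat detector: remember every end sequence seen so far.
    seen_ends = {seq[-min_overlap:]}

    for _ in range(max_steps):
        best = None  # (overlap length, read)
        for read in unused:
            # Longest suffix of seq that is a proper prefix of this read.
            longest = min(len(seq), len(read) - 1)
            for k in range(longest, min_overlap - 1, -1):
                if seq.endswith(read[:k]):
                    if best is None or k > best[0]:
                        best = (k, read)
                    break
        if best is None:
            return seq, "no new reads"

        k, read = best
        unused.remove(read)
        seq += read[k:]  # append only the part that extends the contig

        end = seq[-min_overlap:]
        if end in seen_ends:
            return seq, "repeat encountered"
        seen_ends.add(end)

    return seq, "step limit reached"


if __name__ == "__main__":
    print(walk_gap("ACGTAC", ["GTACGG", "CGGTTA"]))
    print(walk_gap("AAACGT", ["CGTACGT"]))
```

In the first call, both reads are consumed and the walk ends because the pool is exhausted. In the second, the extension recreates an end sequence already seen, so the walk stops as it would at a repeated element.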
Create a reference-guided assembly to use in the “SNP to Structure” workflow
