Transcript Annotation (StarBlast) Workflow

The “transcript annotation” (or “StarBlast”) workflow is another name for the de novo Transcriptome/RNA-Seq workflow. To specify this path, choose Transcriptome/RNA-Seq in the Choose Assembly Workflow screen, and De novo assembly in the Choose Assembly Type screen.

 

In the past, de novo assembly of RNA-Seq data could result in thousands of contigs representing the expressed transcripts, without any context or labels. In Lasergene 13.0 and later, SeqMan NGen automatically attempts to group contigs from the same gene, and then name and annotate them based on the best match to a collection of annotated reference sequences. Two SeqMan NGen assembly engines, XNG and SNG, are used to optimize your results. Note that results from this workflow are non-quantitative.

 

Result files for this workflow are described in detail in Transcript Annotation (StarBlast) Workflow Output.

 

The de novo assembly and annotation of the RNA-Seq data occur in a series of steps performed automatically by SeqMan NGen. If you are interested in a “behind the scenes” look at the assembly process, the steps are described below:

 

1)  Perform read clustering with the XNG assembler. Any RNA-Seq data can be used, though this workflow is designed for Illumina data, preferably with reads ≥ 100 bases in length.

 

2)  Perform de novo assembly of each cluster with the SNG assembler.

 

3)  Compare contig consensus sequences to the specified set of reference sequences (the “database”). Licensed users can use the SeqMan NGen wizard to access DNASTAR’s database of transcript annotations extracted from data on NCBI’s RefSeq website. In addition to the complete collection, subsets of the data are available:

 

Archaea

Bacteria

Fungi

Invertebrate

Mitochondrion

Plant

Plasmid

Plastid

Protozoa

Vertebrate_mammalian

Vertebrate_other

Viral

 

Alternatively, you may create a custom database for use in this step, as described in Creating a Custom Transcription Annotation Database.

 

4)  Identify and merge contigs belonging to the same gene.

 

5)  Perform a second de novo assembly for the grouped contig sequences using the SNG assembler. The goal is to produce the most complete assembly possible for each transcript in the data set.

 

6)  Compare the updated contig sequences to the same database as in Step 3. The best matching database entry for each contig is used to label that contig at the gene level and provide summary statistics on the match.
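
For readers who want a concrete picture of the "best match" labeling and gene-level grouping in Steps 3, 4 and 6, the following is a minimal conceptual sketch in Python. It is not SeqMan NGen's implementation: the `Hit` record, the numeric `score` field, the function names and the sample contig and gene names are all hypothetical, introduced only for illustration. The sketch assumes each comparison against the database yields a contig name, a gene label and a score where higher is better.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One contig-to-database match (hypothetical record layout)."""
    contig: str   # contig name
    gene: str     # gene label of the matched database entry
    score: float  # assumed match score; higher is better


def best_hit_per_contig(hits):
    """Keep only the highest-scoring database match for each contig."""
    best = {}
    for hit in hits:
        current = best.get(hit.contig)
        if current is None or hit.score > current.score:
            best[hit.contig] = hit
    return best


def group_contigs_by_gene(best_hits):
    """Group contig names under the gene of their best match."""
    groups = defaultdict(list)
    for contig, hit in best_hits.items():
        groups[hit.gene].append(contig)
    return {gene: sorted(contigs) for gene, contigs in groups.items()}


# Made-up example data: contig_1 matches two genes, GAPDH more strongly.
hits = [
    Hit("contig_1", "GAPDH", 410.0),
    Hit("contig_1", "ACTB", 95.0),
    Hit("contig_2", "GAPDH", 388.0),
    Hit("contig_3", "ACTB", 512.0),
]

best = best_hit_per_contig(hits)
groups = group_contigs_by_gene(best)
print(groups)
```

In this sketch, contigs that share a best-matching gene end up in the same group, which loosely corresponds to the set of contigs that would be merged and reassembled together in Steps 4 and 5.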

 

See Using Transcript Annotation (StarBlast) Workflow Output as the Template to learn how to use the output of a de novo transcriptome/RNA-Seq assembly as input for the templated transcriptome/RNA-Seq workflow.