Transcript Annotation (StarBlast) Workflow Output

If you are following the transcript annotation workflow (i.e., if you chose Transcriptome/RNA-Seq Assembly in the Choose Assembly Workflow screen, and de novo assembly in the Choose Assembly Type screen), output results are saved in a folder called [project name] De Novo Transcriptome Assembly. This folder contains the following subfolders and files:

 

Subfolder

File/Folder Name

Description

 

[project name]_rnaAssemble.script

Input script used to create the assembly results. This file can be opened in SeqMan Pro to examine isoforms using the Feature Table.

Assemblies

[project name]_novel_transcripts.sqd

SQD assembly of all contigs that did not have a database match.

[project name]_unassembled.fastq

Multi-sequence FASTQ file with all unclustered and unassembled sequences.

sub_0 (folder)

Folder containing sub-folders (sub_0, sub_1, etc.) with a separate .sqd document for each final assembly. If available, gene and organism names are used to create the file names.

Intermediate Assembly Results

cluster (folder)

Intermediate results are deleted by default at the end of the assembly, but can be retained by setting the input script parameter deleteIntermediate to ‘false’.

combine (folder)

intermediateFiles (folder)

Reports

[project name].AllTranscripts.SearchResults.txt

Excel file containing summary information for each of the final assembled contigs. The table automatically opens for viewing when you open a .Transcriptome package in SeqMan Pro. The table, known in SeqMan Pro as the “All Transcripts” table, contains the following columns:

 

Assembly ID

Name assigned to the assembled sequence, using the criteria specified in the wizard.

Gene name,

Custom column #1*

Best matching gene meeting criteria defined in the wizard.

Organism name,

Custom column #2*

Organism from which the best matching gene came.

Accession number,

Custom column #3*

Accession number of the best match.

Description,

Custom column #4*

Description of the best match.

Database

Database (e.g., RefSeq, Custom) from which the best matching gene came.

Transcript length

Length of the assembled sequence, in bases.

Transcript start

Position in the assembled sequence where the match begins.

Transcript end

Position in the assembled sequence where the match ends.

%Transcript match

Length of the matching segment in the transcript x 100, divided by the total length of the transcript.

Gene length

Length of the database entry, in bases.

% of Full length

Length of the assembled sequence x 100, divided by the length of the corresponding database entry. Values greater than 100% indicate that the assembled sequence is longer than the database entry.

Gene start

Position in the database entry where the match begins.

Gene end

Position in the database entry where the match ends.

% Gene match

Length of the matching segment in the database entry x 100, divided by the total length of the database entry.

% Identity

Total number of identical bases in the matching region x 100, divided by the total number of bases in the matching region.

Bit score

Normalized value calculated from the raw score and expressed in units of “bits,” a common measure in information theory.

eValue

“Expectation value,” an estimate of the probability of obtaining the observed alignment score with two random sequences. Expectation values are less sensitive to length than Bit scores and are therefore generally a better measure of alignment quality.

Assembled reads

Total number of assembled reads for that sequence.

 

*Custom columns: These four columns use default names (e.g., Gene name, Organism name) if one of the default RefSeq databases was used in the SeqMan NGen assembly. However, if you used a custom GREP expression or a custom database that did not include these fields, these columns may have different names or be absent from the table.
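The three percentage columns described above can be recomputed from the coordinate and length columns. The following Python sketch is illustrative only and is not part of SeqMan NGen. The function name is hypothetical, and it assumes that start and end positions are 1-based and inclusive, which the table description does not state.

```python
def percent_metrics(transcript_start, transcript_end, transcript_length,
                    gene_start, gene_end, gene_length):
    """Recompute the percentage columns from the coordinate columns.

    Assumption (not confirmed by the documentation): positions are
    1-based and inclusive, so a match spanning 1..100 has length 100.
    abs() is used so that a match reported in reverse orientation
    still yields a positive length.
    """
    transcript_match_len = abs(transcript_end - transcript_start) + 1
    gene_match_len = abs(gene_end - gene_start) + 1
    return {
        # Matching segment in the transcript x 100 / transcript length
        "% Transcript match": 100.0 * transcript_match_len / transcript_length,
        # Matching segment in the database entry x 100 / entry length
        "% Gene match": 100.0 * gene_match_len / gene_length,
        # Assembled sequence length x 100 / database entry length
        "% of Full length": 100.0 * transcript_length / gene_length,
    }


# Example: a 1200-base transcript matched against a 1000-base database entry
example = percent_metrics(1, 600, 1200, 101, 700, 1000)
```

In this example, "% of Full length" exceeds 100 because the assembled sequence is longer than the database entry.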

[project name].AllTranscripts.Table.txt

Excel file containing summary information for each of the final assembled contigs. The table contains the following columns:

 

Assembly ID

Name assigned to the assembled sequence, using the criteria specified in the wizard.

Type

Type of matching gene (e.g., mRNA, tRNA, rRNA).

Gene length

Length of the database entry, in bases.

% of Full length

Length of the assembled sequence x 100, divided by the length of the corresponding database entry. Values greater than 100% indicate that the assembled sequence is longer than the database entry.

Assembled reads

Total number of assembled reads for that sequence.

Depth

Average depth of coverage.
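As an illustration of working with this table outside SeqMan Pro, the Python sketch below lists the assemblies whose "% of Full length" value exceeds 100. It assumes the .txt file is tab-delimited with a header row that uses the column names listed above. Neither assumption is confirmed here, so check the file before relying on it. The function name is hypothetical.

```python
import csv
import io


def longer_than_reference(table_text, column="% of Full length"):
    """Return Assembly IDs whose value in `column` is greater than 100.

    Assumption: the table is tab-delimited text with a header row.
    Rows with an empty or non-numeric value are skipped.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    hits = []
    for row in reader:
        # Tolerate surrounding spaces and a trailing percent sign
        raw = (row.get(column) or "").strip().rstrip("%")
        try:
            value = float(raw)
        except ValueError:
            continue
        if value > 100.0:
            hits.append(row.get("Assembly ID", ""))
    return hits
```

To use it, read the contents of the table file into a string and pass that string to the function.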

Transcripts

[project name]_identified_transcripts.fas

Multi-sequence .fasta file containing the consensus sequences from all the assembled contigs that had a database match. Header lines for each entry contain the name and sequence length.

[project name]_novel_transcripts.fas

Multi-sequence .fasta file containing the consensus sequences from all the assembled contigs that did not have a database match. Header lines for each entry contain the name and sequence length.
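The two .fas files are ordinary multi-sequence FASTA files, so they can be read with a few lines of code. The Python sketch below counts the sequences and totals their lengths. It is a generic FASTA reader, not a DNASTAR tool, and it makes no assumption about how the name and length are laid out in the header line. The function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize(lines):
    """Return the number of sequences and their total length in bases."""
    lengths = [len(seq) for _, seq in read_fasta(lines)]
    return {"count": len(lengths), "total_bases": sum(lengths)}
```

An open file handle can be passed directly as `lines`, for example the handle for a [project name]_novel_transcripts.fas file.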