By Matthew Keyser
De novo assembly of whole transcriptome sequencing data from non-model organisms can be very challenging for a variety of reasons. Factors such as computing limitations and unseen problems lurking in the sequence data can quickly derail your results and your timeline. For example, the presence of untrimmed Illumina adapter sequences, ribosomal RNA, and genomic repetitive sequences often leads to failed or poor quality assemblies.
Furthermore, even a successful assembly will typically yield thousands of unannotated contigs. In order to produce a complete and annotated mRNA set from the contig assembly, you may need to master multiple assembly and annotation pipelines.
Fortunately, by using DNASTAR’s novel assembly and annotation algorithms, and following a few tips before and after sequencing, you can greatly increase your chance for success.
Below, we’ve compiled our top recommendations for de novo transcriptome sequencing and assembly. These recommendations are based on extensive testing using a variety of transcriptome data sets from the NCBI Short Read Archive, as well as years of experience working with DNASTAR customers on transcriptome assembly and annotation projects.
Check your transcriptome sequence data prior to assembly
Transcriptome sequencing data commonly contains untrimmed adapters that severely impair de novo assembly. Prior to assembly, I always use FASTQC (free download Babraham Institute) to scan the fastq input data files for the presence of adapters. I then use the adapter scan in the Assembly Options page of SeqMan NGen to remove them.
In my experience, more than 50% of the transcriptome data sets downloaded from the NCBI’s Short Read Archive contain unacceptably high levels of the “Illumina Universal Adapter.” Once trimmed, these data sets assemble in a fraction of the time, with longer and more completely assembled mRNAs.
Read length matters
The algorithms used for de novo transcriptome assembly and RNA-seq are quite different. While it might be tempting to de novo assemble your 500 million, 2x76bp read Illumina RNA-seq data set, most assembled transcripts will be truncated (shorter) compared to an assembly run with longer reads. Here are results from de novo assembly of transcriptomic data from 20 different organisms and 7 different Illumina paired read lengths:
While 50-100bp paired Illumina reads are perfectly suitable for RNA-seq analysis, de novo transcriptome assembly is greatly improved when using paired Illumina reads 150bp or greater.
Number of cores matters
The DNASTAR de novo assembly algorithms are optimized to use the maximum number of available cores in your computer. While a standard Intel i7, 6-core desktop computer will perform well, assembly times on the largest data sets are significantly reduced using a computer such as the 16-core AMD Ryzen “Threadripper”. If powerful desktop computers are not your thing, you can acheive similarly fast results using DNASTAR Cloud Assemblies to run your assembly on an Amazon 16-core computer. The graph below shows assembly times, in hours, on three different hardware configurations, including the DNASTAR Cloud. For data sets with over 50 million reads, there is a significant time savings when using a computer with 16 cores.
Use transcript databases for auto-annotation
During assembly setup, DNASTAR provides users with mRNA RefSeq database options that our assembly algorithms utilize for both assembly and auto-annotation based on matching. The use of RefSeq mRNA is done automatically at assembly setup. The process is seamless and automated. All you need to do is select an mRNA database from related taxa or select the entire RefSeq database and the process is seamless and automated. The output includes an interactive table of annotations and an individual assembly file for each transcript allowing you to closely evaluate the quality of the output.
By comparison, BLAST-based annotation approaches for data sets containing thousands of query sequences can be extremely time consuming (taking multiple days) and difficult to manage.
For many non-model organisms, DNASTAR’s auto-annotation approach results in many identified transcripts:
Organism | Identified transcripts | Novel transcripts |
Hogfish | 31,402 | 37,142 |
Brassica napus | 63,808 | 16,434 |
Orchid | 31,059 | 19,758 |
Bent Grass | 5,052 | 11,136 |
Gecko | 4,843 | 3,345 |
Atlantic salmon | 26,285 | 7,331 |
Giardia intestinalis | 9,338 | 4,644 |
Holstein cow | 40,856 | 4,687 |
Neurospora crassa | 20,799 | 2,810 |
Novel transcripts can sometimes be identified using additional annotation options, like DNA to protein database matching . Please contact DNASTAR for more information on this topic.
While de novo transcriptome assemblies can be challenging, and often fail due to untrimmed linkers and computing shortcomings (too many short Illumina reads for too few processor cores), these issues are easily overcome by making some preparations before and after sequencing, and by using the SeqMan NGen assembler in the DNASTAR Genomics Suite.
Want to learn more about transcriptome analysis using DNASTAR Genomics? Try our next generation sequence analysis tools for yourself by requesting a fully functional trial of Lasergene
Leave a Reply
Your email is safe with us.