Chapter 1: Step-by-Step Variant Analysis Procedure
There is no universally agreed upon number of steps involved in variant analysis, and some projects require more steps than others. In this ebook, we have divided the bioinformatics portion of the workflow into seven steps based on when you would typically need to switch from one tool to the next to keep advancing through the pipeline.
If you have bioinformatics expertise, you can build these steps into an automated or semi-automated pipeline using scripting and open-source tools. Note that some commercial variant analysis software packages combine two or more of these steps into a single process. If you are working with a core facility, they may have their own pipeline that covers some or all of these steps before delivering the data to you.
If you are new to variant analysis, be sure to do your research and have a plan in place to make sure that the output from one step is readable by the next tool along the path. If a core facility or bioinformatics group is doing some of this work for you, you’ll want to understand what data they will be delivering to you so you can plan your next steps accordingly.
Variant Analysis Overview
Preliminary Steps
Choose a sequencing strategy and prepare samples
While this ebook focuses primarily on the bioinformatics and data analysis steps involved in studying human variants, there are many important decisions you will need to make before you can dive into the data. How you design and execute your sequencing strategy will greatly impact the quality of your data, so be sure to consider your analysis goals before beginning your experiment.
Before obtaining DNA samples, first determine whether you will be doing a germline or somatic analysis. Somatic analysis compares samples from related tissues (e.g., tumor and matched normal) from a single individual and is often used in cancer research. Germline analysis, by contrast, compares samples from different individuals, who may be heterozygous or homozygous for a trait.
Another consideration is whether you will be doing whole-genome, whole-exome, or RNA sequencing. All of these can yield important results and help elucidate the genetic cause of a disease. RNA-Seq is useful when you want to study differences in gene expression rather than simply the presence or absence of a variant in the DNA sequence. Whole-genome sequencing (WGS) is not typically used with human subjects. One reason is cost: although prices have come down substantially, WGS remains prohibitively expensive for most researchers. Another concern is that the significance of non-coding variants is typically much more difficult to determine than that of coding changes, and their abundance (~400x more than in coding regions) makes data handling and analysis more cumbersome and inefficient. That said, continued technological improvements and cost reductions are making WGS a more viable option when needed.
Because of these issues, whole-exome sequencing, which considers only the ~1-2% of DNA that codes for proteins, is much more common than whole-genome sequencing for variant analysis studies. One caveat to be aware of is that exome capture kits can miss some regions of interest, so important variants may be overlooked.
Once you have obtained the appropriate human DNA, each sequencing platform (e.g., Sanger, Illumina, Ion Torrent, PacBio, Oxford Nanopore) has its own wet-lab protocols for sample preparation specific to the technology. For example, short-read sample preparation typically requires fragmentation and PCR amplification steps, while long-read preparation avoids fragmenting the DNA so that long template molecules are preserved. If you want to enhance mapping accuracy and identify structural rearrangements with more clarity, consider a technology that produces long reads, paired-end reads, or mate-pair reads.
The final preliminary step is the actual sequencing, which is typically performed at a dedicated sequencing facility.
Bioinformatics Steps
After sequencing is complete and you have sequence data in FASTQ format or variant call format (VCF), you are ready to start your analysis. The remaining steps all come down to your bioinformatics software and your ability to utilize it. Keep in mind that some of these steps may have been completed before you receive your data.
Step 1: Import & clean up sequencing data
If your data files are in FASTQ format, this step is where you’ll jump in and upload the file(s) into a software application for assessment and possible clean-up.
NGS data files (Illumina and Ion Torrent) are typically cleaned up using the pipeline tools associated with the sequencing instrument. This is normally sufficient; however, some output sequence files can benefit from scanning with a third-party tool like FastQC.
Sanger data, by contrast, is not typically cleaned up during sequencing. Sanger data usually contains many base-calling errors at the 5’ and 3’ ends, where the chromatogram peaks are not high quality. This type of data requires a high-quality software program that can accurately trim the sequence ends.
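If you are assembling this step yourself from open-source tools, a minimal sketch might look like the following. File names are placeholders, and fastp (not mentioned above) is just one commonly used open-source trimmer:

```bash
# Generate per-file quality reports with FastQC
mkdir -p qc_reports
fastqc sample_R1.fastq.gz sample_R2.fastq.gz -o qc_reports/

# Trim adapters and low-quality bases from paired-end reads with fastp
fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz \
      -o trimmed_R1.fastq.gz -O trimmed_R2.fastq.gz
```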
Step 2: Align reads to a reference genome
This step involves using a computer program to align/map the read sequences to an existing reference genome; this is usually followed by a local realignment. The two most widely used public sources for human reference genomes are the Genome Reference Consortium (GRC), which provides builds GRCh37 and GRCh38, and the University of California, Santa Cruz (UCSC), which provides the corresponding hg19 and hg38 versions.
Some popular tools used for alignment include SOAP, Bowtie/Bowtie2, BWA, and MOSAIK. The results of the alignment are generally saved as BAM or CRAM formatted files.
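As an illustration, a minimal open-source alignment sketch using BWA-MEM and SAMtools might look like this. File names are placeholders, and a real pipeline would normally also attach read-group information for downstream tools:

```bash
# Index the reference genome (done once per reference)
bwa index GRCh38.fa

# Align paired-end reads, then coordinate-sort and index the result
bwa mem -t 8 GRCh38.fa trimmed_R1.fastq.gz trimmed_R2.fastq.gz \
    | samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam
```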
Step 3: Remove PCR duplicates
When performing whole-genome or whole-exome sequencing, PCR duplicates should be removed immediately after the alignment step. This prevents duplicate reads that originate from a single template molecule from skewing later variant-calling statistics. Commercial solutions may detect and remove duplicate reads automatically before variant calling. One tool that can be used for this purpose is Picard Tools, though it works only with BAM files.
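For example, a sketch of duplicate removal with Picard Tools, assuming a coordinate-sorted BAM from Step 2 (invocation details vary somewhat by Picard version):

```bash
# Mark PCR duplicates and remove them from the output BAM
java -jar picard.jar MarkDuplicates \
    I=sample.sorted.bam \
    O=sample.dedup.bam \
    M=duplicate_metrics.txt \
    REMOVE_DUPLICATES=true
```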
Step 4: Call variants and INDELs
Variant calling is the process of comparing the aligned read sequences to the reference sequence in order to find locations where they disagree. The four main types of variants are:
- Single-nucleotide polymorphisms (SNPs)
- Small insertions and deletions (INDELs)
- Larger indels, known as “structural variations” (SVs)
- Copy number variations (CNVs), in which a section of DNA is repeated multiple times
Scores of variant callers are available and are generally classified into four groups:
- Germline callers (including CRISP, SAMtools, GATK), typically used for elucidating the causes of rare diseases.
- Somatic callers (including GATK, SomaticSniper), the choice used in most cancer studies.
- Copy number variation (CNV) finders (including CONTRA, ExomeCNV, CNVnator, RDXplorer). CNVs can be detected in both whole-genome and whole-exome assemblies, which is not true of other structural variations.
- Non-CNV structural variation (SV) finders (including BreakDancer, Pindel) are used to find inversions, translocations, or large INDELs.
It is critically important to choose the right variant caller for your data. Using the wrong variant calling pipeline can lead to missed variant calls. In addition, different variant callers have been shown to work better with different types of sequencing technology.
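For instance, a minimal germline-calling sketch with GATK’s HaplotypeCaller might look like this. File names are placeholders, and production pipelines usually add refinements such as base-quality recalibration first:

```bash
# Call germline SNPs and INDELs against an indexed reference FASTA,
# using the duplicate-marked BAM from Step 3
gatk HaplotypeCaller \
    -R GRCh38.fa \
    -I sample.dedup.bam \
    -O sample.vcf.gz
```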
After alignment and variant calling, the few applications that can perform sample cross-normalization, or that can use a BED file to restrict the variant list to target regions, will perform those steps at this point.
The output from GATK and other variant-calling pipelines is a Variant Call Format (VCF) file. VCF is a scalable, tab-delimited text format with specific rules governing the order and contents of its columns. Each row represents a single-nucleotide variant (SNV), insertion, deletion, or other sequence variation. Each variant is identified by a unique combination of letters and numbers, and the file may contain identifiers from multiple databases.
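To give a feel for the format, here is a hypothetical VCF data row (all values are illustrative only):

```
#CHROM  POS      ID  REF  ALT  QUAL  FILTER  INFO
chr1    1234567  .   G    A    612   PASS    DP=74;AF=0.51
```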
Step 5: Filter data to discover important variants
If Steps 1-4 didn’t sound familiar or relevant to your experience, that’s because they are frequently performed by a core facility or bioinformatics group. This step, which may start with the import of one or more VCF files, is the first one in which most researchers become directly involved.
All sequencing platforms produce sequence reads that contain many base-level errors, and it is crucial to apply filtering that separates “sequencing noise” from the variant “signal.” Much of this baseline filtering can be done in an automated fashion, preferably using statistical confidence measures, but in some cases you may need to run initial noise-filtering tools yourself.
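For instance, a baseline noise filter can be applied with bcftools, a widely used open-source toolkit not mentioned above. The thresholds here are illustrative, and the input is assumed to be a bgzipped, tabix-indexed VCF:

```bash
# Keep only calls with quality >= 30 and read depth >= 10
bcftools view -i 'QUAL>=30 && INFO/DP>=10' sample.vcf.gz -Oz -o sample.q30.vcf.gz
```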
Even after filtering out sequencing errors, a typical human exome data set will still contain thousands of variants. You must now rely on additional criteria, typically imported into the analysis pipeline, to differentiate the uninteresting variants from those with functional or clinical significance. Open-source tools that support filtering include VCFtools and SnpSift.
Many filtering strategies can be used, but here are four examples:
- Filter out synonymous SNPs. Since these SNPs still encode the normal amino acid, they are unlikely to have a deleterious effect.
- In the case of a whole-genome study, filter out variants located in non-coding regions.
- Filter out common alleles. For example, we can reasonably assume that a mutation that causes a rare kidney disease will be rare in the human population. Thus, if a SNP is common (i.e., has a high allele frequency) across a population, it is unlikely to be the cause of the rare disease and can be removed from consideration.
- Some cloud-based tools (e.g., the Integrative Genomics Viewer) and standalone applications (CLC Bio, Lasergene) let you compare variants from multiple VCF files. For instance, to find variants related to a particular form of brain cancer, you could filter out variants found in both brain cancer patients and controls and focus on variants found only in the cancer patients. Both of the last two strategies are sketched after this list.
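Here is a hedged command-line sketch of the last two strategies, using SnpSift (mentioned above) and bcftools. It assumes an AF allele-frequency annotation is present in the INFO column and that the VCFs are bgzipped and indexed; file names and the 1% threshold are illustrative:

```bash
# Filter out common alleles: keep variants with allele frequency below 1%
zcat sample.q30.vcf.gz | java -jar SnpSift.jar filter "(AF < 0.01)" > rare.vcf

# Compare cases against controls: isec_out/ will contain the variants
# private to each file as well as those shared by both
bcftools isec -p isec_out cases.vcf.gz controls.vcf.gz
```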
Step 6: Determine how variants affect genes
The goal of this step is to determine the functional and clinical consequences of variants and how they impact genes. This is best accomplished with data tables that contain both variant and functional information for one or more samples, and that let you verify variant calls by visualizing the assembled sequence reads. This is most often done using genome browsers, which are available in standalone and web-based versions.
A benefit of standalone browsers is that most provide a graphical user interface (GUI) supporting zooming, easier visualization, and interactive browsing. One drawback, however, is that you may be responsible for importing a large number of different data tracks for each of your samples. In addition, cross-comparison and analysis of multiple samples may be challenging, depending on the analysis tools available.
One advantage of web-based genome browsers (e.g., Ensembl, UCSC Genome Browser) and web-accessible annotation tools (e.g., ANNOVAR, AnnTools, VariantAnnotation, NGS-SNP, and snpEff) is that you do not usually need to download or update a human variant annotation database. On the other hand, you must upload your own proprietary data to a remote server, which is not a viable option for all organizations owing to data privacy policies. Furthermore, some of these tools are command-line driven and can require typing multiple long strings of text into the command-line interface.
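As a small example of the command-line style these annotation tools use, a basic snpEff run might look like this. The database name is illustrative; snpEff’s own databases command lists the options available for download:

```bash
# Annotate each variant with its predicted effect on overlapping genes
java -jar snpEff.jar GRCh38.99 rare.vcf > annotated.vcf
```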
At a minimum, a genome browser should allow you to display the aligned reads, see variants clearly (e.g., by displaying them in a different color), view annotation information, and visit the corresponding public database (e.g., Ensembl, the GWAS Catalog, EVA, UniProt, dbSNP, ClinVar) entries online via hyperlinks. Most tools only allow visualization of SNPs, while a few also support viewing CNVs and SVs. Some genome visualization tools allow you to compare sequences from multiple individuals or even multiple organisms.
Once you have tabulated a list of variants, you may identify some for which there is limited data. Most search engines cannot dig any deeper than paper titles and abstracts, where an estimated 85% of variants are never mentioned. This represents an enormous lost opportunity. While the crowd-sourced ClinVar platform is a useful repository of variant interpretations and related information, it is far from complete: an average of 31% of submitted references per variant are false positives, and 30% of submissions have no references at all. Together, the downstream effects of these statistics can be serious, potentially resulting in erroneous conclusions and/or a missed diagnosis. To mitigate these information-quality issues, a deeper analysis can be performed using the Mastermind Genomic Search Engine, which is discussed in detail in Chapter 3.
Step 7 (Optional): Visualize the impact of the variant on the 3D protein structure
An optional step in the variant analysis pipeline is to attempt to determine the effect of a given variant on the 3D structure of the protein it codes for. Sometimes the 3D structure is minimally affected, if at all. In other cases, the variant causes a dramatic change in protein structure, profoundly impacting protein function.
Tools like I-Mutant and SDM can predict how a variant might affect protein stability, while structure-prediction tools like I-TASSER and Phyre2 can predict its effect on protein structure. However, the latter tools only work if a structural homolog is available. In some cases, you may find that the structure of the variant protein has already been determined through X-ray crystallography and is available for download from the Protein Data Bank (PDB). You can view these structures in various programs, including the LiteMol viewer.
An easier option is to use DNASTAR’s Protean 3D application to accomplish all these tasks. Protean 3D’s “Protein Design” workflow lets you use a wizard to easily “mutate” one or more residues and then calculates whether the changes are likely to be stabilizing or destabilizing compared to the original structure. NovaFold, a separately-licensed application that runs through the Protean 3D interface and is powered by the award-winning I-TASSER algorithm, can be used to predict the new protein structure caused by the mutated residue(s). Protean 3D lets you view the original and mutated proteins as fully customizable and rotatable 3D structures.