
Phased Variant (Haplotype) Analysis for Whole Genome Sequencing

By Matt Keyser, DNASTAR Senior Product Manager
November 19, 2024 | Lasergene Genomics

As the Senior Product Manager for Lasergene, Matt Keyser works with scientists, software developers and support staff at DNASTAR to create sequence analysis software that meets the current needs of researchers and that is ready to support future challenges and changing technology. In his 20 years (and counting) at DNASTAR, Matt has advised numerous customers on a wide array of sequencing and analysis projects, giving him a unique understanding of the challenges faced by scientists today.

What is haplotype analysis and why is it useful?

Haplotype analysis, also called phased variant analysis, looks at each variant locus on a pair of chromosomes and aims to determine the arrangement of alleles on each chromosome.

For example, consider two heterozygous loci on the same pair of chromosomes, one with alleles A and a and the other with alleles B and b. Genotyping alone shows only that the individual is Aa at the first locus and Bb at the second. Haplotype phasing determines whether A and B lie on the same chromosome copy, with a and b on the other (cis configuration, haplotypes AB and ab), or whether A and B lie on opposite copies (trans configuration, haplotypes Ab and aB).
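Phase only becomes meaningful once two or more heterozygous loci are considered together. The following minimal sketch enumerates the two ways a pair of heterozygous genotypes can be arranged on a chromosome pair. The allele labels (A/a, B/b) and the function name are illustrative assumptions, not part of Lasergene.

```python
def possible_phasings(locus1, locus2):
    """Enumerate the two ways a pair of heterozygous genotypes can be phased.

    Each locus is a 2-tuple of alleles, e.g. ("A", "a"). A phasing is an
    unordered pair of haplotypes, one per chromosome copy.
    """
    a1, a2 = locus1
    b1, b2 = locus2
    cis = frozenset({a1 + b1, a2 + b2})    # first-listed alleles share a copy
    trans = frozenset({a1 + b2, a2 + b1})  # first-listed alleles on opposite copies
    return {"cis": cis, "trans": trans}


phasings = possible_phasings(("A", "a"), ("B", "b"))
for name, haplotypes in phasings.items():
    print(name, sorted(haplotypes))
```

Unphased genotype calls cannot distinguish these two arrangements; reads that span both loci can.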

A complete haplotype analysis workflow is now available in Lasergene 18.

Haplotype analysis is useful in a variety of research fields. For example, it can be used to:

– Trace the ancestry of individuals and populations.

– Study the genetic diversity of populations and track the evolution of genetic variations over time.

– Determine whether two variants in the same gene sit on the same chromosome copy, leaving one functional allele, or on opposite copies (compound heterozygosity).

– Predict how a patient will respond to certain medications, leading to more personalized and effective treatments.

This PacBio blog post discusses several studies in which haplotype analysis has proved invaluable.

What basic steps are involved?

In haplotype analysis, a DNA assembly is created using long read data from Pacific Biosciences (PacBio) or Oxford Nanopore Technologies (ONT), which provide the long-range connectivity needed for our novel haplotype phasing algorithm. Long reads are multiple kilobases in length (versus a couple hundred bases for short-read sequencing) and therefore contain multiple heterozygous positions that allow the reads to be phased into haplotypes.
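To illustrate why reads covering several heterozygous positions can be sorted into haplotypes, here is a deliberately simple greedy sketch. It is not the Lasergene algorithm; the read names, positions, and the `phase_reads` function are assumptions made for the example, and real phasers also have to cope with sequencing errors and ties.

```python
def phase_reads(reads):
    """Greedily split reads into two haplotype groups.

    `reads` maps a read name to {position: allele} for the heterozygous
    positions that the read covers. Returns {read name: 0 or 1}.
    """
    haplotypes = [{}, {}]  # alleles assigned so far to each haplotype
    assignment = {}
    for name, alleles in reads.items():
        # Score each haplotype: +1 per agreeing position, -1 per conflict.
        scores = []
        for hap in haplotypes:
            score = 0
            for pos, allele in alleles.items():
                if pos in hap:
                    score += 1 if hap[pos] == allele else -1
            scores.append(score)
        chosen = 0 if scores[0] >= scores[1] else 1
        assignment[name] = chosen
        # Extend the chosen haplotype with any positions it has not seen yet.
        for pos, allele in alleles.items():
            haplotypes[chosen].setdefault(pos, allele)
    return assignment


reads = {
    "r1": {100: "A", 250: "G"},
    "r2": {100: "C", 250: "T"},
    "r3": {250: "G", 900: "T"},
    "r4": {250: "T", 900: "C"},
}
assignment = phase_reads(reads)
print(assignment)
```

Reads r1 and r3 share the G allele at position 250, so they fall into the same group even though r3 never covers position 100. That overlap is what carries phase along the chromosome.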

During sequence assembly, statistical methods are used to infer the specific combinations of variations (haplotypes) carried by individuals. In regions of the genome where adjacent heterozygotes are more than a read length apart, the phasing gets broken up into “phase blocks.” These blocks can be joined using genome-wide association studies (GWAS), and trio analysis (i.e., sequencing the subject’s parents) can also be used to identify the parental source of a particular variant and allow adjacent blocks to be assembled together. The frequency of different haplotypes can then be compared between individuals with and without a particular trait or disease.
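The way phase blocks arise can be sketched in a few lines: two neighboring heterozygous positions stay in the same block only if at least one read spans both. The coordinates and the `phase_blocks` function below are hypothetical, and the sketch ignores read quality and coverage thresholds that a real assembler would apply.

```python
def phase_blocks(het_positions, read_spans):
    """Group heterozygous positions into phase blocks.

    `read_spans` is a list of (start, end) reference coordinates, inclusive.
    Returns a list of (first_position, last_position) tuples, one per block.
    """
    positions = sorted(het_positions)
    if not positions:
        return []
    blocks = [[positions[0]]]
    for prev, cur in zip(positions, positions[1:]):
        # Neighbors are linked when a single read covers both of them.
        linked = any(start <= prev and cur <= end for start, end in read_spans)
        if linked:
            blocks[-1].append(cur)
        else:
            blocks.append([cur])  # no spanning read: start a new block
    return [(block[0], block[-1]) for block in blocks]


hets = [1000, 4000, 9000, 30000, 33000]
spans = [(500, 5000), (3500, 10000), (29000, 34000)]
blocks = phase_blocks(hets, spans)
print(blocks)
```

Here no read bridges positions 9000 and 30000, so phasing breaks into two blocks whose relative orientation is unknown without extra information such as parental data.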

Figure 1. When a phased assembly is open in GenVision Pro, the Genome view shows phase blocks in alternating bars of blue and green.

How was this workflow accomplished prior to Lasergene 18?

Haplotype analysis has generally required users to string together multiple unsupported open-source software applications, requiring extensive bioinformatics expertise.

A common pipeline uses minimap2 or bwa-mem2 to map and align the reads to the reference genome, a tool like WhatsHap for haplotype phasing, GATK or DeepVariant to call allele-specific variants, and finally, a tool like ANNOVAR to annotate the variants.
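As a rough illustration of how much glue such a pipeline needs, the sketch below only assembles command lines for three of the steps and does not run them. The file names and the `build_pipeline` helper are hypothetical. The flags are the commonly documented ones for minimap2, samtools, and WhatsHap, and should be checked against the installed versions. WhatsHap phases variants from an existing VCF, so that file is assumed to come from a separate variant caller; the calling and annotation commands are left out rather than guessed.

```python
def build_pipeline(reference, reads, sample="sample", preset="map-ont"):
    """Return shell command strings for mapping, indexing, and phasing.

    Assumes `<sample>.vcf` already holds unphased calls from a variant caller.
    """
    bam = f"{sample}.sorted.bam"
    return [
        # Map long reads to the reference and write a coordinate-sorted BAM.
        f"minimap2 -a -x {preset} {reference} {reads} | samtools sort -o {bam} -",
        # Index the BAM so downstream tools can fetch regions.
        f"samtools index {bam}",
        # Phase the existing variant calls using the aligned reads.
        f"whatshap phase --reference {reference} -o {sample}.phased.vcf {sample}.vcf {bam}",
    ]


commands = build_pipeline("ref.fa", "reads.fastq", sample="na12878")
for command in commands:
    print(command)
```

Each tool has its own installation, input conventions, and failure modes, which is where most of the required bioinformatics expertise goes.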

Most open-source applications require a Linux computer for at least part of the pipeline. A commercial all-in-one solution requires an HPC server costing almost $400,000 to achieve haplotype phasing and both SNV and structural variation analysis.

This combination of unrelated open-source applications and costly computers has limited haplotype analysis to a small number of bioinformatics experts. For example, an assembly and annotation pipeline developed by Goenka et al. (see reference below) required at least 10 different pieces of software and a high-performance grid running on the Google Cloud consisting of 33 instances equipped with multiple high-end NVIDIA GPUs and hundreds of CPUs.

Reference: Goenka SD, Gorzynski JE, Shafin K, et al. Accelerated identification of disease-causing variants with ultra-rapid nanopore genome sequencing. Nat Biotechnol. 2022 Jul;40(7):1035-1041. doi: 10.1038/s41587-022-01221-5. Epub 2022 Mar 28.

How does Lasergene 18 make haplotype analysis easy and accessible?

With the recent release of Lasergene 18, Lasergene Genomics features a new algorithm for automated phasing of long read data from diploid organisms. Furthermore, all the steps for the haplotype phasing workflow can be completed on a standard Windows or Mac computer, quickly taking you from raw sequencing data to results.

Using this algorithm, our SeqMan NGen assembler identifies putative heterozygous positions using a Bayesian proportion detector and then evaluates those positions in each read to find the consistent set of heterozygous columns from which to separate or “phase” the reads into the two haplotype groups. Each set of phased reads represents a haplotype block that is realigned as two parallel sequences from which allele-specific variants are called. Haplotype phased alignments and variant calls are saved as part of the SeqMan NGen .assembly package, and downstream analysis is performed in GenVision Pro and/or SeqMan Ultra.
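The details of the Bayesian proportion detector are not described here, so the following is only a generic textbook-style sketch of the idea: compare how well the observed fraction of alternate-allele reads fits a homozygous-reference, heterozygous, or homozygous-alternate genotype. The error rate, the priors, and the function name are all assumptions chosen for illustration.

```python
from math import comb


def genotype_posteriors(alt_reads, total_reads, error_rate=0.05,
                        priors=(0.998, 0.001, 0.001)):
    """Posterior probability of each diploid genotype at one position.

    Uses a binomial likelihood for the number of reads showing the
    alternate allele. `priors` are ordered hom_ref, het, hom_alt.
    """
    # Expected fraction of alternate-allele reads under each genotype.
    alt_fraction = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1 - error_rate}
    weighted = {}
    for (genotype, p), prior in zip(alt_fraction.items(), priors):
        likelihood = (comb(total_reads, alt_reads)
                      * p ** alt_reads
                      * (1 - p) ** (total_reads - alt_reads))
        weighted[genotype] = prior * likelihood
    total = sum(weighted.values())
    return {genotype: w / total for genotype, w in weighted.items()}


balanced = genotype_posteriors(alt_reads=14, total_reads=30)
noisy = genotype_posteriors(alt_reads=1, total_reads=30)
print(max(balanced, key=balanced.get), max(noisy, key=noisy.get))
```

A position with roughly half its reads supporting each allele is flagged as putatively heterozygous, while a single discordant read is attributed to sequencing error.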

In Lasergene, the haplotype analysis workflow takes place in four simple steps.

Step 1: Assemble sequencing data

Use SeqMan NGen to assemble sequence reads to the human genome template. To create an assembly suitable for exploring phasing, you need to use ONT or PacBio long-read data and check the Diploid – Phased button in the Analysis Options screen of the setup wizard (Figure 2).

Figure 2. In the SeqMan NGen setup wizard, check Diploid-Phased in the Analysis Options screen to trigger haplotype phasing during assembly.

If you are working with human data, we highly recommend checking the Variant Annotation Database (VAD) box to enable enriched annotation of called SNVs/small indels. Annotations fall into five broad categories: 1) Allele and genotype frequencies from the 1000 Genomes Project, 2) Functional impact predictions from three different methods (LRT, MutationTaster and SIFT), 3) Evolutionary conservation scores from four different algorithms (GERP++, SiPhy, PhyloP and PhastCons), 4) Pathogenicity information from ClinVar and 5) General information from a variety of sources. Annotation results can later be viewed and used as a basis for filtering variants in GenVision Pro.

During assembly, SeqMan NGen automatically calls and annotates small variants, characterizes structural variations (SVs), and separates results by haplotype. Each called variant is “decorated” with information such as its impact on corresponding coding regions, and phased assemblies also present the variants in an allele-specific manner. Finally, SeqMan NGen uses alignment signatures in long read data, such as multiple split reads, to identify positions of likely SVs and then analyzes that data to 1) determine whether the event is an insertion or deletion and 2) estimate the length of the variant. Called SVs are decorated with statistics and affected feature information.
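The split-read signature can be illustrated with a small sketch. When one read aligns as two segments, comparing the gap between the segments on the reference with the gap on the read suggests both the event type and its length. The `Segment` type, the thresholds, and the function names are hypothetical, and the sketch assumes both segments map to the same chromosome and strand.

```python
from statistics import median
from typing import NamedTuple


class Segment(NamedTuple):
    ref_start: int
    ref_end: int
    read_start: int
    read_end: int


def classify_split(first, second, min_length=50):
    """Classify one split read as an insertion or deletion, or return None."""
    ref_gap = second.ref_start - first.ref_end     # reference bases skipped
    read_gap = second.read_start - first.read_end  # read bases left unaligned
    size = ref_gap - read_gap
    if size >= min_length:
        return ("deletion", size)
    if -size >= min_length:
        return ("insertion", -size)
    return None


def call_sv(split_reads, min_support=2):
    """Combine several split reads into one call with a median length."""
    calls = [c for c in (classify_split(a, b) for a, b in split_reads) if c]
    for kind in ("deletion", "insertion"):
        sizes = [size for k, size in calls if k == kind]
        if len(sizes) >= min_support:
            return kind, median(sizes)
    return None


split_reads = [
    (Segment(10000, 15000, 0, 5000), Segment(15800, 20000, 5000, 9200)),
    (Segment(11000, 15005, 0, 4005), Segment(15795, 19000, 4005, 7210)),
    (Segment(12000, 14990, 0, 2990), Segment(15800, 18000, 2990, 5190)),
]
sv_call = call_sv(split_reads)
print(sv_call)
```

Requiring support from more than one read, as the `min_support` parameter does here, is one simple way to keep a single chimeric read from producing a call.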

When the assembly concludes, just click a button (Figure 3) to launch the results in GenVision Pro for downstream analysis.

Figure 3. When assembly is complete, click the “Analyze and compare variants in multiple samples” button to open the results in GenVision Pro.

Step 2: View and filter variants

GenVision Pro correlates phase blocks to one or more genes of interest to identify compound heterozygous variants.
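The logic behind a compound heterozygote check can be sketched as follows: within a single phase block, a gene is a candidate when damaging variants are found on both haplotypes. The record layout, gene names, and function name are assumptions for illustration and do not reflect GenVision Pro's internal data model.

```python
from collections import defaultdict


def compound_het_genes(variants):
    """Return genes with damaging variants on both haplotypes of one phase block.

    Each variant is a dict with keys: gene, phase_block, haplotype (1 or 2),
    and damaging (bool). Haplotype labels are only comparable within a block.
    """
    hits = defaultdict(set)
    for variant in variants:
        if variant["damaging"]:
            key = (variant["gene"], variant["phase_block"])
            hits[key].add(variant["haplotype"])
    return sorted({gene for (gene, _), haps in hits.items() if haps == {1, 2}})


variants = [
    {"gene": "GENE_A", "phase_block": 7, "haplotype": 1, "damaging": True},
    {"gene": "GENE_A", "phase_block": 7, "haplotype": 2, "damaging": True},
    {"gene": "GENE_B", "phase_block": 9, "haplotype": 1, "damaging": True},
    {"gene": "GENE_B", "phase_block": 9, "haplotype": 1, "damaging": True},
    {"gene": "GENE_C", "phase_block": 3, "haplotype": 1, "damaging": True},
    {"gene": "GENE_C", "phase_block": 4, "haplotype": 2, "damaging": True},
]
candidates = compound_het_genes(variants)
print(candidates)
```

GENE_B has both variants on the same haplotype (cis), and the GENE_C variants fall in different phase blocks, so neither can be called a compound heterozygote from this data alone.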

When you click the command to create a variants table, GenVision Pro’s computational algorithms separate the parent chromosomes and analyze the genetic variants on each. The resulting table can be customized to show any of dozens of columns of statistical information. If you enabled the VAD (see above), many columns contain links to database entries on websites like dbSNP or Mastermind.

Use column sorting along with advanced filtering to find variants and genes of relevance to your research. There are many useful ways to filter, from baseline comparisons and statistical thresholds to using a Venn diagram to select intersections of interest.

Step 3: Visualize results

The GenVision Pro views let you easily assess the quality of the target capture, navigate to called SNV positions, and view the phased alignments of the corresponding haplotypes. Small variants and structural variations are shown in customizable tables. The Analysis view shows the alignment of reads phased into the two homologous chromosomes. Single nucleotide variants (SNVs) are indicated as red tick marks (Figure 4). Alternating phase blocks are shown in shades of blue and green, with light and dark versions of each color denoting the two haplotypes.

Apply analysis tracks such as Variants, Phase block, or Phase consistency to dive deeper into genes and variants of interest.

Figure 4. A phased genome assembly shown in the Variant table and Analysis view. This Variant table can be used to apply filters and locate variants in the Analysis view. As with all Lasergene applications, selections in any view are synchronized across all views.

Different zoom levels reveal different aspects of the data, as shown in Figures 5-8.

Figure 5. This zoomed-out Analysis view shows an example of a phased compound heterozygote in the CCHCR1 gene on chromosome 6. Note that each copy of the gene has an affected allele (boxed in black).

Figure 6. When zoomed out, larger variants and structural variations can clearly be seen in the phased data.

Figure 7. By contrast, zooming in shows a more granular view of the alignment.

Figure 8. At certain zoom levels, the Variants track in the Analysis view is bifurcated at each phase block to show the variants on each of the two haplotypes (parental chromosomes).

Downstream analysis for this workflow can also be performed in SeqMan Ultra. In Figure 9, SeqMan Ultra’s multi-pane displays let you simultaneously assess the quality of the target capture (A), interrogate and navigate to the called SNV positions (B) and view the phased alignments of the corresponding haplotypes (C).

Figure 9. An example of a multi-pane display in SeqMan Ultra. “A” is the coverage overview of the chromosome 7 segment containing TRB; “B” is the small-variant table; “C” shows the Alignment view with the phased reads separated into their respective haplotypes using heterozygous SNPs, shown in blue lettering. A pseudo-consensus of each haplotype is shown with a goldenrod background.

Step 4 (optional): Import and export

Import and export BAM alignments and VCF files for straightforward comparison of sequence data processed in other assembly and analysis pipelines. You can also use GenVision Pro’s search functionality to search NCBI and download sequence matches (BLAST) or text matches (Entrez). GenVision Pro also supports the creation and management of local sequence databases.
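When comparing phased calls across pipelines, it helps to know how phase is encoded in VCF: a `|` separator in the GT field marks a phased genotype, and the PS field names the phase set (block) it belongs to. The sketch below parses a single-sample record; the record itself is made up for illustration, and `parse_phased_call` is a hypothetical helper rather than part of any Lasergene tool.

```python
def parse_phased_call(vcf_line):
    """Extract phasing information from one single-sample VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    # FORMAT keys (column 9) describe the sample values (column 10).
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    genotype = sample["GT"]
    phased = "|" in genotype
    alleles = [ref] + alt.split(",")
    indexes = genotype.replace("/", "|").split("|")
    haplotypes = [alleles[int(i)] if i != "." else None for i in indexes]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "phased": phased,
        "phase_set": sample.get("PS") if phased else None,
        "haplotypes": haplotypes,
    }


# Hypothetical record: heterozygous G>A call, phased, in phase set 5000.
line = "chr1\t5120\t.\tG\tA\t50\tPASS\t.\tGT:PS\t0|1:5000"
call = parse_phased_call(line)
print(call)
```

Two heterozygous calls can only be compared as cis or trans when they share the same phase set value.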

Conclusion

Haplotype analysis is a valuable tool in genetics, providing insights into disease risk, population history, drug response, and evolutionary processes. By understanding the patterns of genetic variation inherited together, researchers can make significant advances in medicine and our understanding of human biology.

With the recent release of Lasergene 18, a complete haplotype analysis workflow is included in Lasergene Genomics. This workflow can be easily run on a standard Windows or Mac computer and requires no specialized knowledge of bioinformatics.
