Variant Discovery Methods for SeqMan Pro Assemblies

Note: This topic is not applicable to BAM-based projects.

This topic describes how SeqMan Pro identifies putative variant bases in sample sequences.

Note that SeqMan Pro considers sample sequence pairs in its variant discovery method. See Pair Specifier Parameters for information on identifying pairs in your sample data. See Variant Discovery Parameters for information on how pairs are used in variant discovery.

Identifying variants with trace data:

For each base that has trace data, SeqMan Pro will do the following to determine if it is a putative variant:

1) Examine the shapes of the four traces to determine which ones exhibit a peak. A trace exhibits a peak where its curve has negative curvature (that is, where it is concave down).

2) Of the traces that exhibit a peak, find the highest trace peak intensity.

3) Identify which trace peaks have an intensity that is at or above the variant threshold percentage of the highest trace peak intensity. For example, if the highest trace peak has an intensity of 900 and the threshold is set to 50%, every other trace peak with an intensity of at least 450 qualifies.

4) Determine the ambiguity code that represents all of the traces with peak intensities that are at or above the variant threshold. Consider an example in which the peak intensities of the T, G, and A traces are 900, 700, and 100, respectively, the C trace does not exhibit a peak, and the variant threshold is 50%. The ambiguity code for this base is K (G or T). This code includes the above-threshold intensities of the T and G traces, ignores the below-threshold intensity of the A trace, and ignores the C trace, which does not have a peak. Note that if only one trace has an intensity that is at or above the threshold, the code is unambiguous (A, C, G, or T).

5) If the base is in a sample sequence that is paired with a forward or reverse sequence as specified by the Pair Specifier Parameters, compare its code with the code for its paired sequence. If the codes do not agree and neither code encompasses the other (for example, K encompasses T), the base is not a putative variant.

6) Otherwise, if the base is aligned to a reference sequence base, compare the ambiguity code to the reference sequence base. If it does not agree, it is a putative variant base.

7) Otherwise, if the base is not aligned to a reference sequence base, compare the ambiguity code to those from all other sample sequences in the aligned column. If all codes agree, no putative variants are identified. If the codes do not all agree, putative variant bases are identified in the column according to the following rules:

If the column contains at least one unambiguous code (A, C, G, or T), and the base is not the most frequently occurring unambiguous base, it is a putative variant base. For example, consider a column of five aligned sample sequences containing the following base codes: A, A, A, T, W. The T and W bases in the column are putative variant bases and the A bases (the most frequently occurring) are not.

Otherwise, if all bases in the column are identified by ambiguous codes (not A, C, G, or T), and the code for the base is not the most frequently occurring, it is a putative variant base. For example, consider a column of five aligned sample sequences containing the following base codes: K, K, K, W, W. The W bases in the column are putative variant bases and the K bases (the most frequently occurring) are not.
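The trace-based steps above can be sketched in code. The following Python example is an illustration only, not SeqMan Pro's implementation: the function names (ambiguity_code, pair_conflict, column_variants) and the input format (a dictionary of peak intensities for the traces that exhibit a peak) are hypothetical. Peak detection itself (step 1) is not shown, and the way ties between equally frequent codes are broken is an assumption, since this topic does not specify it.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, keyed by the set of bases each represents.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}
BASES_FOR = {code: bases for bases, code in IUPAC.items()}
UNAMBIGUOUS = {"A", "C", "G", "T"}


def ambiguity_code(peaks, threshold_pct):
    """Steps 2-4: peaks maps a base to its peak intensity and contains
    only the traces that exhibit a peak (hypothetical input format)."""
    if not peaks:
        return None
    cutoff = max(peaks.values()) * threshold_pct / 100.0
    kept = frozenset(b for b, v in peaks.items() if v >= cutoff)
    return IUPAC[kept]


def encompasses(outer, inner):
    """True if every base represented by inner is also represented by outer."""
    return BASES_FOR[inner] <= BASES_FOR[outer]


def pair_conflict(code, mate_code):
    """Step 5: True when the codes disagree and neither encompasses the other."""
    return not (encompasses(code, mate_code) or encompasses(mate_code, code))


def column_variants(codes):
    """Step 7: return one flag per code; True marks a putative variant."""
    if len(set(codes)) <= 1:
        return [False] * len(codes)  # all codes agree
    unambiguous = [c for c in codes if c in UNAMBIGUOUS]
    # Use the most frequent unambiguous base if there is one; otherwise
    # the most frequent ambiguous code. Ties go to the first code seen
    # (an assumption of this sketch).
    pool = unambiguous if unambiguous else codes
    majority = Counter(pool).most_common(1)[0][0]
    return [c != majority for c in codes]


print(ambiguity_code({"T": 900, "G": 700, "A": 100}, 50))
print(column_variants(["A", "A", "A", "T", "W"]))
print(column_variants(["K", "K", "K", "W", "W"]))
```

With the example values used in this topic, the T/G/A peaks of 900, 700, and 100 at a 50% threshold yield K, and in both example columns only the last two bases are flagged.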

Identifying variants with sequence data only:

For each base that has sequence data only (i.e. no trace data), SeqMan Pro will do the following to determine if it is a putative variant:

1) If the base is in a sample sequence that is paired with a forward or reverse sequence as specified in the Pair Specifier Parameters, compare its code with the code for its paired sequence. If the codes do not agree, the base is not a putative variant.

2) Otherwise, if the base is aligned to a reference sequence base, compare the called base to the reference sequence base. If it does not agree, it is a putative variant base.

3) Otherwise, if the base is not aligned to a reference sequence base, compare the called base to those from all other sample sequences in the aligned column. If the called bases do not all agree, and the base is not the most frequently occurring base, it is a putative variant base. For example, consider a column of five aligned sample sequences containing the following base codes: A, A, A, T, T. The T bases in the column are putative variant bases and the A bases (the most frequently occurring) are not.
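The three sequence-only steps form a short decision cascade, sketched below in Python. This is an illustration under stated assumptions rather than SeqMan Pro's actual code: the function name is_putative_variant and its parameters are hypothetical, a paired base that agrees with its mate is assumed to continue on to steps 2 and 3, and ties between equally frequent bases go to the first base seen.

```python
from collections import Counter


def is_putative_variant(base, mate_base=None, ref_base=None, column=None):
    """Decide whether a called base (no trace data) is a putative variant.

    mate_base: the called base in the paired sequence, or None if unpaired.
    ref_base:  the aligned reference base, or None if there is none.
    column:    called bases of all sample sequences in the aligned column,
               including this one (used only when ref_base is None).
    """
    # Step 1: a base that disagrees with its mate is not a putative variant.
    if mate_base is not None and mate_base != base:
        return False
    # Step 2: with a reference, any disagreement is a putative variant.
    if ref_base is not None:
        return base != ref_base
    # Step 3: without a reference, compare against the column majority.
    column = column or []
    if len(set(column)) <= 1:
        return False  # all called bases agree
    majority = Counter(column).most_common(1)[0][0]
    return base != majority


column = ["A", "A", "A", "T", "T"]
print([is_putative_variant(b, column=column) for b in column])
```

For the example column A, A, A, T, T, the sketch flags only the two T bases, which matches the description in step 3.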