Variant Call Format (VCF) files have multiple uses. For instance, they can provide a way to flag previously known SNPs and to filter them in SNP tables. In DNASTAR’s SeqMan NGen, these SNPs are called “annotated SNPs”; in ArrayStar, they are referred to as “user variants.” VCF files can also be used to keep track of previously identified variants so that they can be verified in a new assembly or experiment.
VCF files can be custom-made or automatically generated by sequencing software. For instance, you can create a VCF file using software such as SeqMan Pro, SeqMan Ultra or ArrayStar. Certain SeqMan NGen assemblies also output a VCF file called [assembly_name].sample.vcf. VCF files are also available from other sources, such as the UCSC Genome Browser and the Genome in a Bottle project. For a description of various VCF version specifications, see the SourceForge VCF Specification page.
The first two columns (#CHROM and POS) are REQUIRED and must be in the order shown. All cells in these columns must be filled. The next four columns (ID, REF, ALT and INFO) are OPTIONAL. If optional columns are present, the assembler will check the length of the string and compare it against the length of the called variant. The base identities will not be checked. Columns 7 and beyond are allowed, but will be ignored.

#CHROM | POS | ID | REF | ALT | INFO | (Misc.) |
---|---|---|---|---|---|---|
Chromosome identifier. Numbers are preferred, but chr or ch prefixes are allowed. All cells in this column must be filled. | Position in the reference sequence. All cells in this column must be filled. | For known dbSNP entries, the rs ID. The valid format is rs followed by a series of digits. For unknown or nonexistent IDs, a period (.) | The reference base(s). For unknown bases, a period (.) | The variant base(s). For unknown bases, a period (.) | User ID and source assembly information. If unknown, a period (.) | These columns may contain data, but they will be ignored by the SeqMan NGen assembler. |
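
The column rules above can be checked before loading a file. The following is a minimal sketch, not part of any DNASTAR software: the function name `check_vcf_line` and its messages are hypothetical, and it assumes the standard tab-delimited VCF layout. It only tests the rules stated in the table (filled #CHROM and POS, and an ID that is either an rs ID or a period).

```python
import re

# "rs" followed by a series of digits, as described for the ID column.
RS_ID = re.compile(r"rs[0-9]+")


def check_vcf_line(line):
    """Return a list of problems found in one tab-delimited VCF data line.

    Hypothetical helper for illustration; it does not reproduce the
    SeqMan NGen assembler's own checks.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        return ["fewer than two columns (#CHROM and POS are required)"]

    problems = []
    chrom, pos = fields[0], fields[1]
    if not chrom.strip():
        problems.append("#CHROM is empty")
    if not re.fullmatch(r"[0-9]+", pos):
        problems.append("POS is not a whole number")
    # ID is optional; when present it must be an rs ID or a period.
    if len(fields) > 2 and fields[2] != "." and not RS_ID.fullmatch(fields[2]):
        problems.append("ID is neither an rs ID nor a period (.)")
    # Columns 7 and beyond are deliberately not inspected.
    return problems
```

An empty list means the line passed these particular checks; header lines beginning with # should be skipped before calling the function.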
- The table portion of the file must be sorted numerically, first by #CHROM, and then by POS. Make sure the values are sorted numerically (1, 2, 3…) and not alphabetically (1, 11, 12…). If you attempt to run the assembly after loading an improperly sorted VCF file, multiple red error messages will be displayed during the assembly.
- When you try to open extremely large VCF files in a spreadsheet program or text editor for sorting purposes, you may receive an “insufficient memory” warning. If you need to sort a VCF file that is too big to open on your machine, we recommend using SourceForge’s VCFtools.
- If quotation marks are used anywhere in the VCF file, they must be straight quotes, not curly or “smart” quotes. In addition, quotation marks should not be used in lines beginning with ##contig, ##UnifiedGenotyper, or ##INFO. If these rules are not followed, an error message will appear during assembly stating that “the VCF file has an incorrect or missing header.” Though the assembly will continue, the VCF SNP file that is output will be empty.
- Chromosome names are captured from genome template packages and used to assign contig IDs to entries from BED, VCF and manifest files. SeqMan NGen can read and produce output using the common “chr” and “ch” prefix conventions as well as Arabic numerals. It understands that chr1, ch1, or 1 can all be used to represent “the first template in the index,” and so on. In addition, genome template packages sometimes internally define “short names” for particular chromosomes. For example, the C. elegans template package names its chromosomes using the standard convention for that organism: “I”, “II”, “III”, “IV”, “V”, “X”, and “M”. SeqMan NGen does not normally recognize Roman numerals, but can in this case, because the numerals are “short names” that have been mapped to specific chromosomes.
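
The sorting and quotation-mark rules in the list above can be sketched in a few lines of Python. This is an illustrative example under stated assumptions, not DNASTAR's implementation: the names `chrom_key`, `sort_vcf_lines` and `find_curly_quotes` are hypothetical, the file is assumed to be tab-delimited, and placing non-numeric chromosome names (such as chrX) after the numbered ones is a choice made here rather than something the documentation specifies. The sketch holds all lines in memory, so it may not suit very large files.

```python
import re

# Matches 1, ch1 or chr1 and captures the number.
_NUMBERED = re.compile(r"(?:chr|ch)?([0-9]+)", re.IGNORECASE)


def chrom_key(chrom):
    """Sort key that orders numbered chromosomes numerically (1, 2, 11)."""
    match = _NUMBERED.fullmatch(chrom)
    if match:
        return (0, int(match.group(1)), "")
    # Assumption: names without a number sort after the numbered ones.
    return (1, 0, chrom)


def sort_vcf_lines(lines):
    """Return header lines unchanged, followed by data lines sorted
    by #CHROM and then by POS, both numerically."""
    header = [ln for ln in lines if ln.startswith("#")]
    data = [ln for ln in lines if ln.strip() and not ln.startswith("#")]

    def key(ln):
        chrom, pos = ln.split("\t")[:2]
        return (chrom_key(chrom), int(pos))

    return header + sorted(data, key=key)


def find_curly_quotes(lines):
    """Return 1-based numbers of lines containing curly ("smart") quotes."""
    curly = "\u201c\u201d\u2018\u2019"
    return [i for i, ln in enumerate(lines, start=1)
            if any(ch in curly for ch in ln)]
```

A typical use would be to read the file with `readlines()`, check that `find_curly_quotes` returns an empty list, and write the result of `sort_vcf_lines` to a new file rather than overwriting the original.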