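Understanding VCF Files

VCF (Variant Call Format) is a file format for storing variant information, including SNPs (single-nucleotide polymorphisms, i.e. single-base variants), INDELs (insertions or deletions, i.e. short-fragment variants), SVs (structural variants, i.e. long-fragment variants), and CNVs (copy number variants).

I. Contents of a VCF File

A VCF file has two parts: the header (meta-information) and the body. Header lines begin with ##, as shown below: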

##fileformat=VCFv4.2
# VCF version number
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
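# The lines above all explain the body: every abbreviation used in the body can be looked up in the header. For example, the ##ALT line defines the symbolic allele NON_REF, which “Represents any possible alternative allele not already represented at this location by REF and ALT”. The ##FORMAT lines that follow define AD, DP, GQ, GT, MIN_DP, PGT, PID, and so on. For example, AD means “Allelic depths for the ref and alt alleles in the order listed”, i.e. the read depth supporting the reference allele and each alternate allele at the site.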

##GATKCommandLine=<ID=CNNScoreVariants,CommandLine="CNNScoreVariants --output /gatk/my_data/wes/sample/annotate/annotated.vcf --disable-avx-check true --variant /gatk/my_data/wes/sample/caller/output.vcf.gz --reference /gatk/my_data/ncbi/hg38/chroms/hg38.fa --tensor-type reference --window-size 128 --read-limit 128 --filter-symbolic-and-sv false --info-annotation-keys MQ --info-annotation-keys DP --info-annotation-keys SOR --info-annotation-keys FS --info-annotation-keys QD --info-annotation-keys MQRankSum --info-annotation-keys ReadPosRankSum --inference-batch-size 256 --transfer-batch-size 512 --inter-op-threads 0 --intra-op-threads 0 --output-tensor-dir --enable-journal false --keep-temp-file false --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-index true --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false --add-output-sam-program-record true --add-output-vcf-command-line true --cloud-prefetch-buffer 40 --cloud-index-prefetch-buffer -1 --disable-bam-index-caching false --sites-only-vcf-output false --help false --version false --showHidden false --verbosity INFO --QUIET false --use-jdk-deflater false --use-jdk-inflater false --gcs-max-retries 20 --gcs-project-for-requester-pays --disable-tool-default-read-filters false --read-group-black-list ID:ArtificialHaplotypeRG --read-group-black-list ID:ArtificialHaplotype",Version="4.1.3.0",Date="November 12, 2020 1:29:06 AM UTC">
# Next come the programs that were run, their full command lines, and even the run date and time, in great detail

##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=CNN_1D,Number=1,Type=Float,Description="Log odds of being a true variant versus being false under the trained 1D Convolutional Neural Network">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
# Then come the INFO definitions, including tags such as AC, AF, AN, DP, and FS

##contig=<ID=chrX,length=156040895>
##contig=<ID=chrX_KI270880v1_alt,length=284869>
##contig=<ID=chrX_KI270881v1_alt,length=144206>
##contig=<ID=chrX_KI270913v1_alt,length=274009>
##contig=<ID=chrY,length=57227415>
##contig=<ID=chrY_KI270740v1_random,length=37240>
# Next is a long list of contig lines recording each chromosome or contig and its length, followed by the source lines

##source=CNNScoreVariants
##source=GenotypeGVCFs
##source=HaplotypeCaller
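
To make the header structure concrete, here is a minimal Python sketch that reads ## lines into a dictionary. It is only an illustration under simplifying assumptions, not a full VCF parser: the function names parse_meta_line and parse_header are made up for this example, and the regular expression only handles the key=value and key="quoted value" forms seen in the excerpt above.

```python
import re

def parse_meta_line(line):
    """Parse one '##key=value' header line.

    Returns (key, value). The value is a dict for structured lines such as
    ##INFO=<ID=...,...> and a plain string for simple lines such as
    ##fileformat=VCFv4.2.
    """
    if not line.startswith("##"):
        raise ValueError("not a meta-information line")
    key, _, value = line[2:].rstrip("\n").partition("=")
    if value.startswith("<") and value.endswith(">"):
        # Match key=value pairs; a value is either a double-quoted string
        # (which may contain commas) or a run of characters up to the next comma.
        pairs = re.findall(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)', value[1:-1])
        return key, {k: v.strip('"') for k, v in pairs}
    return key, value

def parse_header(lines):
    """Group all ## lines by key; each key maps to a list of values."""
    header = {}
    for line in lines:
        if line.startswith("##"):
            key, value = parse_meta_line(line)
            header.setdefault(key, []).append(value)
    return header

# A few header lines copied from the excerpt above.
sample = [
    '##fileformat=VCFv4.2',
    '##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">',
    '##contig=<ID=chrX,length=156040895>',
    '##source=HaplotypeCaller',
]
header = parse_header(sample)
print(header["fileformat"])
print(header["FORMAT"][0]["Description"])
print(header["contig"][0])
```

Values are kept as strings here (for example the contig length), so any numeric conversion is left to the caller.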

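The data lines of the body have no leading # (only the column header line starts with a single #). In this example there are 10 columns: 8 fixed columns, FORMAT, and one column for the single sample: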

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1
chr1 15903 . G GC 59.28 . AC=2;AF=1.00;AN=2;CNN_1D=-0.800;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=30.13;QD=29.64;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:71,6,0
chr1 16495 . G C 36.65 . AC=1;AF=0.500;AN=2;BaseQRankSum=-9.670e-01;CNN_1D=-4.292;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=22.00;MQRankSum=0.00;QD=12.22;ReadPosRankSum=0.967;SOR=1.179 GT:AD:DP:GQ:PL 0/1:1,2:3:18:44,0,18
chr1 16734 . TG T 31.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-6.740e-01;CNN_1D=-6.239;DP=10;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=32.17;MQRankSum=-1.400e-01;QD=3.51;ReadPosRankSum=-9.210e-01;SOR=0.132 GT:AD:DP:GQ:PL 0/1:7,2:9:39:39,0,197

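# Three records are shown here
# CHROM: chromosome
# POS: position, “with the 1st base having position 1”
# ID: identifier, e.g. the corresponding dbSNP ID; "." when not set, as here
# REF: reference allele: A, C, G, T, N
# ALT: alternate allele(s): A, C, G, T, N, *
# QUAL: quality score. This value deserves some explanation. According to the VCF specification: “Phred-scaled quality score for the assertion made in ALT. i.e. −10log10 prob(call in ALT is wrong). If ALT is ‘.’ (no variant) then this is −10log10 prob(variant), and if ALT is not ‘.’ this is −10log10 prob(no variant). If unknown, the missing value should be specified. (Numeric)”. That is, QUAL = −10 × log10(prob), where prob is the probability that the call is wrong. For example, if prob = 0.01, then QUAL = −10 × log10(0.01) = 20. So QUAL = 20 means the ALT call has a 0.01 probability of being wrong, i.e. a 0.99 probability of being correct. The quality score is a main reference when judging whether a variant at a site is real, and it is used again in the later filtering steps.
# FILTER: whether the site passed the filters. The specification says: “PASS if this position has passed all filters, i.e., a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g. “q10;s50” might indicate that at this site the quality is below 10 and the number of samples with data is below 50% of the total number of samples. ‘0’ is reserved and should not be used as a filter String. If filters have not been applied, then this field should be set to the missing value. (String, no white-space or semi-colons permitted)”. If all filters pass, the field is PASS; if any filter fails, the field lists the codes of the failed filters, separated by semicolons. In this example FILTER is ".", meaning that no filters have been applied yet.
# INFO: additional information, made up of multiple fields separated by semicolons, such as “AC=2;AF=1.00;AN=2” in this example.
# The INFO fields are described in detail below: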
• AA : ancestral allele
• AC : allele count in genotypes, for each ALT allele, in the same order as listed
• AF : allele frequency for each ALT allele in the same order as listed: use this when estimated from primary data, not called genotypes
• AN : total number of alleles in called genotypes
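# AC=2;AF=1.00;AN=2 describes a diploid sample: two alleles were called at this site (AN=2), both are the ALT allele (AC=2), so the ALT allele frequency is 100% (AF=1.00), i.e. a homozygous variant.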
• BQ : RMS base quality at this position
• CIGAR : cigar string describing how to align an alternate allele to the reference allele
• DB : dbSNP membership
• DP : combined depth across samples, e.g. DP=154
• END : end position of the variant described in this record (for use with symbolic alleles)
• H2 : membership in hapmap2
• H3 : membership in hapmap3
• MQ : RMS mapping quality, e.g. MQ=52
• MQ0 : Number of MAPQ == 0 reads covering this record
• NS : Number of samples with data
• SB : strand bias at this position
• SOMATIC : indicates that the record is a somatic mutation, for cancer genomics
• VALIDATED : validated by follow-up experiment
• 1000G : membership in 1000 Genomes
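
The column descriptions above can be sketched in Python as follows. This is a simplified illustration rather than a complete parser: the names parse_info, parse_record, and qual_to_error_prob are hypothetical helpers introduced for this example, values are kept as strings unless noted, and lines are split on any whitespace because the rows above are shown space-separated (the VCF specification itself uses tabs).

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_info(info):
    """Split an INFO string such as 'AC=2;AF=1.00;AN=2' into a dict.

    Flag-style entries without '=' (for example DB) are stored as True.
    """
    result = {}
    if info == ".":
        return result
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        result[key] = value if sep else True
    return result

def qual_to_error_prob(qual):
    """Invert QUAL = -10 * log10(prob) to recover the error probability."""
    return 10 ** (-qual / 10)

def parse_record(line):
    """Parse one body line into a dict keyed by column name."""
    cols = line.split()
    record = dict(zip(FIXED_COLUMNS, cols[:8]))
    record["POS"] = int(record["POS"])
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    record["ALT"] = record["ALT"].split(",")  # several ALT alleles are comma-separated
    record["INFO"] = parse_info(record["INFO"])
    if len(cols) > 9:
        # Column 9 (FORMAT) names the per-sample fields; columns 10+ hold the values.
        keys = cols[8].split(":")
        record["samples"] = [dict(zip(keys, s.split(":"))) for s in cols[9:]]
    return record

# First record from the example above.
line = ("chr1 15903 . G GC 59.28 . AC=2;AF=1.00;AN=2;CNN_1D=-0.800;DP=3;"
        "ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=30.13;QD=29.64;SOR=2.303 "
        "GT:AD:DP:GQ:PL 1/1:0,2:2:6:71,6,0")
rec = parse_record(line)
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"])
print(rec["INFO"]["AC"], rec["INFO"]["AN"], rec["samples"][0]["GT"])
print(qual_to_error_prob(20))           # the QUAL = 20 case worked through above
print(qual_to_error_prob(rec["QUAL"]))  # error probability implied by QUAL = 59.28
```

For this record the genotype 1/1 agrees with AC=2 and AN=2: both called alleles are the ALT allele.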

Reference (official documentation): VCF

  • Author: 括囊无誉
  • Link: WES/vcf_file/
  • Copyright notice: All articles on this blog are original works. Please credit the source when reposting!