Whole-Exome Sequencing Analysis Tutorial

1. Downloading the Raw Data

If the data comes from a sequencing company, it usually arrives as processed FASTQ files; here, instead, we use SRA as the source of raw data. Taking project PRJNA669198 as an example, we download run SRR12846241: 2.76 GB of raw data, sampled from the colon of a 31-year-old male with Lynch syndrome (also known as HNPCC, hereditary non-polyposis colorectal cancer), sequenced paired-end on an Illumina platform. That is the basic information.

prefetch SRR12846241

Note: prefetch is a program in the SRA Toolkit that downloads sequencing data from the SRA database. The command above downloads a single run; to download several, get the Accession List (SRR_Acc_List.txt) from the project's SRA page and download in batch:

prefetch --option-file SRR_Acc_List.txt

Files downloaded with prefetch are stored by default under ~/ncbi/public/sra/.

2. Converting the Raw Data

After the download finishes, the raw data we obtain is in SRA format (SRR12846241.sra) and must first be converted to FASTQ with the SRA Toolkit.

fastq-dump --split-3 ./SRR12846241.sra
#Read 31651170 spots for ./SRR12846241.sra
#Written 31651170 spots for ./SRR12846241.sra
#two files are produced:
#SRR12846241_1.fastq SRR12846241_2.fastq

To explain: for paired-end data, the --split-3 option produces the two files _1.fastq and _2.fastq; for single-end data, drop the option and a single .fastq file is produced.
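If several runs were downloaded with --option-file, the conversion can simply be looped over every .sra file. Below is a minimal dry-run sketch: the directory and run names are stand-ins (not from this project), and the fastq-dump invocations are only echoed so the loop logic can be checked safely.

```shell
# Dry-run sketch: build one fastq-dump command per downloaded .sra file.
# SRA_DIR and the SRR names are placeholders; point SRA_DIR at ~/ncbi/public/sra/
# and delete the 'echo' to actually run the conversions.
SRA_DIR=$(mktemp -d)
touch "$SRA_DIR/SRR0000001.sra" "$SRA_DIR/SRR0000002.sra"  # stand-ins for real downloads

for sra in "$SRA_DIR"/*.sra; do
    echo "fastq-dump --split-3 $sra"
done
```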

3. Quality Control of the Sequencing Data

Before moving on, create a few directories so the downstream files stay organized:

mkdir {fastq,qc,bam,nodup}
#move the generated .fastq files into the fastq directory

To check the data quality, we can use the classic tool FastQC; the syntax is:

fastqc -o ./qc -t 4 ./fastq/*.fastq
# -o: output directory; -t: number of threads; the input files come last
# this analysis is quick and reports its progress
Started analysis of SRR12846241_1.fastq
Started analysis of SRR12846241_2.fastq
Approx 5% complete for SRR12846241_1.fastq
Approx 5% complete for SRR12846241_2.fastq
Approx 10% complete for SRR12846241_1.fastq
Approx 10% complete for SRR12846241_2.fastq
...
Analysis complete for SRR12846241_1.fastq
Analysis complete for SRR12846241_2.fastq
# when it finishes, the following files are produced:
SRR12846241_1_fastqc.html SRR12846241_1_fastqc.zip SRR12846241_2_fastqc.html SRR12846241_2_fastqc.zip

FastQC is for inspecting data quality. Opening the generated .html files shows that this dataset's quality is quite good, thanks to the great advances sequencing technology has made in recent years.

For the QC step itself we can use fastp. This program filters reads (removing low-quality reads and reads with many N bases), trims adapters, performs base correction, and outputs a QC report.

For single-end data:

fastp -i input.fastq -o output.fastq

For paired-end data:

fastp -i input.1.fastq -I input.2.fastq -o output.1.fastq -O output.2.fastq

Most of this program's features are enabled by default and need no extra settings.

fastp -i ./fastq/SRR12846241_1.fastq -I ./fastq/SRR12846241_2.fastq -o ./qc/cleaned_1.fastq -O ./qc/cleaned_2.fastq -w 4
# --adapter_sequence: adapter for read1, auto-detected if not specified;
# --adapter_sequence_r2: adapter for read2, auto-detected if not specified;
# -c: paired-end only; corrects bases by finding the overlap of each read pair and, at mismatches, replacing the low-quality base with its high-quality counterpart;
# -q: quality threshold below which a base counts as low quality, default 15;
# -u: maximum percentage of low-quality bases allowed in a read, default 40%;
# -n: if a read contains more than this many N bases, the read pair is discarded, default 5;
# -w: number of threads, default 2; set it according to your CPU;
# -5/-3: sliding-window trimming (a threshold must be given); -5 slides the window from the 5' end toward the 3' end, -3 from the 3' end toward the 5' end;
# results:
Read1 before filtering:
total reads: 31651170
total bases: 4779326670
Q20 bases: 4680761815(97.9377%)
Q30 bases: 4513495612(94.4379%)

Read2 before filtering:
total reads: 31651170
total bases: 4779326670
Q20 bases: 4651248152(97.3202%)
Q30 bases: 4453204808(93.1764%)

Read1 after filtering:
total reads: 31297805
total bases: 4676179403
Q20 bases: 4589169104(98.1393%)
Q30 bases: 4429126929(94.7168%)

Read2 aftering filtering:
total reads: 31297805
total bases: 4676179403
Q20 bases: 4574607381(97.8279%)
Q30 bases: 4387146408(93.819%)

Filtering result:
reads passed filter: 62595610
reads failed due to low quality: 696646
reads failed due to too many N: 10084
reads failed due to too short: 0
reads with adapter trimmed: 6762458
bases trimmed due to adapters: 99756614

Duplication rate: 14.5952%

Insert size peak (evaluated by paired-end reads): 189

JSON report: fastp.json
HTML report: fastp.html

fastp -i ./fastq/SRR12846241_1.fastq -I ./fastq/SRR12846241_2.fastq -o ./qc/cleaned_1.fastq -O ./qc/cleaned_2.fastq -w 4
fastp v0.20.1, time used: 642 seconds
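The per-read filtering criteria can be mimicked on a toy FASTQ to make the -q/-u/-n logic concrete. This is an illustrative awk sketch, not fastp itself; the thresholds mirror fastp's defaults quoted above (Q15, 40%, 5 Ns), and the reads are made up.

```shell
# Toy FASTQ: read1 is high quality, read2 is all Q0 ('!') and has 6 Ns.
cat > toy.fastq <<'EOF'
@read1
ACGTACGTAC
+
IIIIIIIIII
@read2
ACGNNNNNNT
+
!!!!!!!!!!
EOF

# For each record: count N bases (-n) and the share of bases below Q15 (-q/-u).
# In Phred+33 encoding, Q15 corresponds to '0' (ASCII 48), so any quality
# character that sorts before "0" is a low-quality base.
awk 'NR % 4 == 2 { seq = $0; ncount = gsub(/N/, "", seq) }
     NR % 4 == 0 {
       low = 0
       for (i = 1; i <= length($0); i++) if (substr($0, i, 1) < "0") low++
       frac = 100 * low / length($0)
       verdict = (ncount > 5 || frac > 40) ? "DROP" : "KEEP"
       printf "N=%d lowQ=%.0f%% -> %s\n", ncount, frac, verdict
     }' toy.fastq | tee qc_verdicts.txt
# -> N=0 lowQ=0% -> KEEP
# -> N=6 lowQ=100% -> DROP
```

read2 fails on both counts (6 Ns > 5, and 100% of its bases are below Q15), matching fastp's "too many N" and "low quality" failure categories in the report above.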

4. Alignment to the Genome

For DNA alignment we use BWA. Before it can be used, index files must be built for the reference; the command is:

bwa index hg38.fa
# the process looks like this and takes roughly 100 minutes
[bwa_index] Pack FASTA... 45.94 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=6418572210, availableWord=463634060
[BWTIncConstructFromPacked] 10 iterations done. 99999986 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 199999986 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 299999986 characters processed.
...
[BWTIncConstructFromPacked] 710 iterations done. 6418572210 characters processed.
[bwt_gen] Finished constructing BWT in 710 iterations.
[bwa_index] 4475.92 seconds elapse.
[bwa_index] Update BWT... 34.89 sec
[bwa_index] Pack forward-only FASTA... 30.80 sec
[bwa_index] Construct SA from BWT and Occ... 1760.91 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index hg38.fa
[main] Real time: 6467.646 sec; CPU: 6348.453 sec
# when finished, the following files are produced:
hg38.fa.amb hg38.fa.ann hg38.fa.bwt hg38.fa.pac hg38.fa.sa
# (hg38.fa.fai is not created by bwa index; it is generated separately with samtools faidx, see Section 7)
# without these index files, bwa exits at run time with an error saying the reference genome cannot be found

Next, map the reads with BWA:

#    bwa mem \
# -t $NUM_THREADS \  number of threads
# -B $MISMATCH_PENALTY \  mismatch penalty
# -R '@RG\tID:lane1\tSM:sample1\tLB:genome\tPL:ILLUMINA' \  read-group header (required by GATK downstream); ID: lane or sample name used for grouping, must be unique; SM: sample name; LB: library name; PL: platform [ILLUMINA, SOLID]
# -M \  mark shorter split hits as secondary, for compatibility with Picard MarkDuplicates
# $INDEX \  reference genome
# ./$sample\_1.fastq \  input fastq, read1
# ./$sample\_2.fastq \  input fastq, read2
# | samtools view -q4 -bS -o - - \  convert the SAM from the previous step to BAM, keeping alignments with MAPQ >= 4
# | samtools sort -@4 -o ./bam_files/$sample.bam -  sort and write the BAM
bwa mem -t 4 -B 4 -M ~/ncbi/hg38/chroms/hg38.fa /mnt/d/wes/qc/cleaned_1.fastq /mnt/d/wes/qc/cleaned_2.fastq | samtools view -q4 -bS -o - - | samtools sort -@4 -o /mnt/d/wes/bam/clean.bam -
# this step is very slow: on an ordinary PC it can take about 10 hours, so a high-performance machine is well worth it for this kind of analysis, otherwise most of the time is spent waiting
# below is one typical alignment batch; the whole run consists of hundreds of these
[M::process] read 267710 sequences (40000156 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 90279, 1, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 198, 232)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (43, 358)
[M::mem_pestat] mean and std.dev: (202.73, 45.61)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 421)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 267710 reads in 105.719 CPU sec, 61.174 real sec
# when the run finishes, a clean.bam file is produced
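To run the same alignment for several samples, the pipeline can be wrapped in a small function. Below is a sketch that only assembles and prints the command (a dry run); the paths and sample names are placeholders following the layout used above. Note that this sketch supplies the read group with -R up front, which would make a separate read-group-fixing step unnecessary.

```shell
# Assemble the bwa-mem | samtools pipeline for one sample (dry run: prints only).
# Paths follow the directory layout used above; adjust ref/outdir as needed.
align_cmd() {
    sample=$1; ref=$2; outdir=$3; threads=${4:-4}
    printf '%s\n' "bwa mem -t $threads -M -R '@RG\tID:$sample\tSM:$sample\tLB:genome\tPL:ILLUMINA' $ref ./qc/${sample}_1.fastq ./qc/${sample}_2.fastq | samtools view -q 4 -b - | samtools sort -@ $threads -o $outdir/$sample.bam -"
}

align_cmd SRR12846241 hg38.fa ./bam
```

Piping the printed line to sh (or removing the printf wrapper) would execute it for real.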

5. Adding Read-Group Information

This step is a remedial one: because the -R option was not given to BWA during alignment, the BAM lacks read-group (RG) information. It can be added with AddOrReplaceReadGroups from Picard; since Picard is bundled into GATK, we can call it through GATK:

gatk AddOrReplaceReadGroups -I /mnt/d/wes/bam/clean.bam -O /mnt/d/wes/bam/clean.RG.bam -LB genome -PL ILLUMINA -PU unit1 -SM sample1
# options
--RGLB,-LB <String> Read-Group library Required.
--RGPL,-PL <String> Read-Group platform (e.g. ILLUMINA, SOLID) Required.
--RGPU,-PU <String> Read-Group platform unit (eg. run barcode) Required.
--RGSM,-SM <String> Read-Group sample name Required.

6. Removing PCR Duplicates

gatk MarkDuplicates -I /mnt/d/wes/bam/clean.RG.bam -O /mnt/d/wes/nodup/nodup.bam -M /mnt/d/wes/nodup/metrics.txt --REMOVE_DUPLICATES true --ASSUME_SORTED true 
# options
--METRICS_FILE,-M <File> file to write the duplication metrics to
--REMOVE_DUPLICATES default false; if true, duplicates are removed from the output
--ASSUME_SORTED default false

# build the BAM index
samtools index /mnt/d/wes/nodup/nodup.bam

7. BQSR (Base Quality Score Recalibration)

Next comes variant calling. First we need to generate a .dict file for hg38.fa, which is done with the CreateSequenceDictionary module of Picard, callable either directly or through GATK.

java -jar /home/eric/picard/build/libs/picard.jar CreateSequenceDictionary R=hg38.fa O=hg38.dict
gatk CreateSequenceDictionary -R hg38.fa -O hg38.dict && echo "** dict done **" # these two commands are equivalent; the && echo "** dict done **" merely signals completion and can be dropped

In addition, the hg38.fa.fai file is generated with samtools faidx:

samtools faidx hg38.fa

Both of these files are generated very quickly, within a few seconds.
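Missing .fai or .dict files next to the reference are a common source of cryptic GATK errors, so a quick existence check is worthwhile before launching long jobs. A plain-shell sketch (the touch line creates empty stand-ins so it runs anywhere; in real use the files already exist and that line should be removed):

```shell
# Check that the reference and its auxiliary files are all present.
# File names follow the conventions used above (hg38.fa, hg38.fa.fai, hg38.dict).
REF=hg38.fa
touch "$REF" "$REF.fai" "${REF%.fa}.dict"   # stand-ins; remove in real use

for f in "$REF" "$REF.fai" "${REF%.fa}.dict"; do
    if [ -e "$f" ]; then echo "ok: $f"; else echo "MISSING: $f"; fi
done
# -> ok: hg38.fa
# -> ok: hg38.fa.fai
# -> ok: hg38.dict
```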

BQSR proceeds in two steps, using the BaseRecalibrator and ApplyBQSR tools.

Step 1: BaseRecalibrator builds, from the species' known variant sites, the table needed to recalibrate base quality scores. Basic usage:

gatk BaseRecalibrator \
-I my_reads.bam \
-R reference.fasta \
--known-sites sites_of_variation.vcf \
--known-sites another/optional/setOfSitesToMask.vcf \
-O recal_data.table

For our example, run the following command:

gatk BaseRecalibrator -I /mnt/d/wes/nodup/nodup.bam -R /home/eric/ncbi/hg38/chroms/hg38.fa --known-sites /home/eric/ncbi/hg38/vcf/dbsnp_138.hg38.vcf.gz --known-sites /home/eric/ncbi/hg38/vcf/1000G_phase1.snps.high_confidence.hg38.vcf.gz --known-sites /home/eric/ncbi/hg38/vcf/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz -L /home/eric/ncbi/hg38/vcf/hg38_v0_HybSelOligos_whole_exome_illumina_coding_v1_whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list -O /mnt/d/wes/bqsr/recal_data2.table

Step 2: ApplyBQSR performs the second half of BQSR. Specifically, it recalibrates the base qualities of the input BAM according to the recalibration table produced by BaseRecalibrator in step 1 and writes a recalibrated BAM. Basic usage:

gatk ApplyBQSR \
-R reference.fasta \
-I input.bam \
--bqsr-recal-file recalibration.table \
-O output.bam

For our example, run the following command:

gatk ApplyBQSR -I /mnt/d/wes/nodup/nodup.bam -R /home/eric/ncbi/hg38/chroms/hg38.fa --bqsr-recal-file /mnt/d/wes/bqsr/recal_data2.table -O /mnt/d/wes/bqsr/bqsr.bam

The resulting BAM can then be checked with the ValidateSamFile tool; if it reports no errors found, we can proceed to variant calling.

gatk ValidateSamFile -I /mnt/d/wes/bqsr/bqsr.bam

If errors are found, they can be repaired with the FixMateInformation tool:

gatk FixMateInformation -I /mnt/d/wes/bqsr/bqsr.bam -O /mnt/d/wes/bqsr/fixed_bqsr.bam

Variant calling uses the HaplotypeCaller tool:

gatk HaplotypeCaller \
-R Homo_sapiens_assembly38.fasta \
-I input.bam \
-O output.g.vcf.gz \
-ERC GVCF

For our example, use the options below. This step is extremely time-consuming: about 10 hours on my low-end machine.

gatk HaplotypeCaller -R /home/eric/ncbi/hg38/chroms/hg38.fa -I /mnt/d/wes/bqsr/bqsr.bam -O /mnt/d/wes/caller/output.g.vcf.gz -ERC GVCF

Once the g.vcf.gz file is generated, the GenotypeGVCFs tool performs joint genotyping, the goal being to merge multiple samples into a single VCF.

gatk GenotypeGVCFs \
-R Homo_sapiens_assembly38.fasta \
-V input.g.vcf.gz \
-O output.vcf.gz

gatk GenotypeGVCFs -R /home/eric/ncbi/hg38/chroms/hg38.fa -V /mnt/d/wes/caller/output.g.vcf.gz -O /mnt/d/wes/caller/output.vcf.gz
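When there is more than one sample, GATK4's GenotypeGVCFs takes a single input, so the per-sample GVCFs are merged first, for example with CombineGVCFs (GenomicsDBImport is the other common option). A dry-run sketch with hypothetical sample names; the commands are only printed, not executed:

```shell
# Build the two commands for joint genotyping of several samples (dry run).
REF=hg38.fa
args=""
for s in sampleA sampleB sampleC; do        # hypothetical sample names
    args="$args -V $s.g.vcf.gz"
done

echo "gatk CombineGVCFs -R $REF$args -O combined.g.vcf.gz"
echo "gatk GenotypeGVCFs -R $REF -V combined.g.vcf.gz -O joint.vcf.gz"
```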

To extract SNVs, use the SelectVariants tool:

gatk SelectVariants \
-R Homo_sapiens_assembly38.fasta \
-V input.vcf \
--select-type-to-include SNP \
-O output.vcf
# --select-type-to-include accepts: NO_VARIATION, SNP, MNP, INDEL, SYMBOLIC, MIXED

For our example, use the following command:

gatk SelectVariants -R /home/eric/ncbi/hg38/chroms/hg38.fa -V /mnt/d/wes/caller/output.vcf.gz --select-type-to-include SNP -O /mnt/d/wes/caller/snp.vcf

This step is very fast, finishing in a few seconds. Next, extract the INDELs, again with the SelectVariants tool, just selecting the INDEL type:

gatk SelectVariants -R /home/eric/ncbi/hg38/chroms/hg38.fa -V /mnt/d/wes/caller/output.vcf.gz --select-type-to-include INDEL -O /mnt/d/wes/caller/indel.vcf

Variant Quality Score Recalibration (VQSR)

This step applies quality control to the called variant sites: by intersecting the calls with known variant sites, it estimates the reliability of each call. The known resources include HapMap, Omni, 1000G, and dbSNP.

Step 1: VQSR filtering of SNPs

gatk VariantRecalibrator \
-R Homo_sapiens_assembly38.fasta \
-V input.vcf.gz \
--resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.hg38.sites.vcf.gz \
--resource:omni,known=false,training=true,truth=false,prior=12.0 1000G_omni2.5.hg38.sites.vcf.gz \
--resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.hg38.vcf.gz \
--resource:dbsnp,known=true,training=false,truth=false,prior=2.0 Homo_sapiens_assembly38.dbsnp138.vcf.gz \
-an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
-mode SNP \
-O output.recal \
--tranches-file output.tranches \
--rscript-file output.plots.R
# -mode accepts SNP, INDEL, or BOTH (BOTH is for testing purposes only)

# For our example, the command is:
gatk VariantRecalibrator -R /home/eric/ncbi/hg38/chroms/hg38.fa -V /mnt/d/wes/caller/output.vcf.gz --resource:hapmap,known=false,training=true,truth=true,prior=15.0 /home/eric/ncbi/hg38/vcf/hapmap_3.3.hg38.vcf.gz --resource:omni,known=false,training=true,truth=false,prior=12.0 /home/eric/ncbi/hg38/vcf/1000G_omni2.5.hg38.vcf.gz --resource:1000G,known=false,training=true,truth=false,prior=10.0 /home/eric/ncbi/hg38/vcf/1000G_phase1.snps.high_confidence.hg38.vcf.gz --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /home/eric/ncbi/hg38/vcf/dbsnp_138.hg38.vcf.gz -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode SNP -O /mnt/d/wes/caller/output.snp.recal --tranches-file /mnt/d/wes/caller/output.snp.tranches --rscript-file /mnt/d/wes/caller/output.snp.plots.R

The VQSR filtering for INDELs changes a few options, and the --resource files differ as well:

gatk VariantRecalibrator -R /home/eric/ncbi/hg38/chroms/hg38.fa -V /mnt/d/wes/caller/output.vcf.gz --max-gaussians 4 --resource:mills,known=false,training=true,truth=true,prior=12.0 /home/eric/ncbi/hg38/vcf/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /home/eric/ncbi/hg38/vcf/dbsnp_146.hg38.vcf.gz -an QD -an DP -an MQRankSum -an ReadPosRankSum -an FS -an SOR -mode INDEL -O /mnt/d/wes/caller/output.indel.recal --tranches-file /mnt/d/wes/caller/output.indel.tranches --rscript-file /mnt/d/wes/caller/output.indel.plots.R

Step 2: ApplyVQSR for SNPs (apply a score cutoff to filter variants based on a recalibration table). This applies the tranche thresholds produced in step 1 to the vcf.gz and writes a new vcf.gz in which variants below the threshold are flagged in the FILTER column (they are labeled as filtered rather than physically removed).

gatk ApplyVQSR \
-R Homo_sapiens_assembly38.fasta \
-V input.vcf.gz \
-O output.vcf.gz \
--truth-sensitivity-filter-level 99.0 \
--tranches-file output.tranches \
--recal-file output.recal \
-mode SNP
# -V: the vcf.gz produced by GenotypeGVCFs
# --tranches-file: the snp.tranches file produced by VariantRecalibrator
# --recal-file: the snp.recal file produced by VariantRecalibrator
# -mode: SNP, INDEL, or BOTH

# For our example, the command is:
gatk ApplyVQSR -R /home/eric/ncbi/hg38/chroms/hg38.fa -V /mnt/d/wes/caller/output.vcf.gz -O /mnt/d/wes/caller/output.vqsr.snp.vcf.gz --truth-sensitivity-filter-level 99.0 --tranches-file /mnt/d/wes/caller/output.snp.tranches --recal-file /mnt/d/wes/caller/output.snp.recal -mode SNP

The same goes for INDELs; applying the recalibration to INDELs, the command is:

gatk ApplyVQSR -R /home/eric/ncbi/hg38/chroms/hg38.fa -V /mnt/d/wes/caller/output.vcf.gz -O /mnt/d/wes/caller/output.vqsr.indel.vcf.gz --truth-sensitivity-filter-level 99.0 --tranches-file /mnt/d/wes/caller/output.indel.tranches --recal-file /mnt/d/wes/caller/output.indel.recal -mode INDEL
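Because ApplyVQSR labels failing sites in the FILTER column rather than deleting them, a final extraction of PASS records is often wanted. The toy VCF below (hand-made, not from this run) shows the idea with plain awk; on real data, GATK's SelectVariants offers an --exclude-filtered option for the same purpose.

```shell
# Build a two-site toy VCF: one PASS, one flagged with a VQSR tranche label.
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\n'
  printf 'chr1\t200\t.\tC\tT\t10\tVQSRTrancheSNP99.00to99.90\t.\n'
} > toy.vcf

# Keep header lines and records whose FILTER column is PASS.
awk -F'\t' '/^#/ || $7 == "PASS"' toy.vcf > toy.pass.vcf
grep -vc '^#' toy.pass.vcf   # -> 1 (only the PASS site survives)
```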

Variant Annotation

gatk CNNScoreVariants \
-V vcf_to_annotate.vcf.gz \
-R reference.fasta \
-O annotated.vcf
# annotate the SNP file
gatk CNNScoreVariants -V /mnt/d/wes/caller/output.vqsr.snp.vcf.gz -R /home/eric/ncbi/hg38/chroms/hg38.fa -O /mnt/d/wes/annotate/annotate.snp.vcf --disable-avx-check true
# annotate the INDEL file
gatk CNNScoreVariants -V /mnt/d/wes/caller/output.vqsr.indel.vcf.gz -R /home/eric/ncbi/hg38/chroms/hg38.fa -O /mnt/d/wes/annotate/annotate.indel.vcf --disable-avx-check true

Appendix: the GATK reference files can be downloaded from gatk/bundle/hg38. These reference files are mirrored in several places, but some copies fail with compression-format errors; the files at this location have been verified to work.

  • Author: 括囊无誉
  • Link: WES/wes01/
  • Copyright: all articles on this blog are original; please credit the source when reposting!