肿瘤基因组分析教程:五、检测变异

关于变异检测,目前有多个程序可供使用,但很难说哪个程序更好,需要用实验的手段来验证。目前,TCGA采取4款软件,varscan,MuTect,MuSE,SomaticSniper。在这个教程中,我们会陆续介绍这几款主流的程序的使用方法。

一、Varscan

1
2
3
4
5
6
7
8
9
10
11
12
$ samtools mpileup -d 1000 -B -q 1 -f /mnt/d/ncbi/hg19/BWAIndex/hg19.fa /mnt/d/cancer/bqsr/SRR5478492C.bqsr.nodup.bam /mnt/d/cancer/bqsr/SRR5478491C.bqsr.nodup.bam > /mnt/d/cancer/variant/normal-tumor.mpileup

# 关于生成mpileup文件,varscan推荐生成normal-tumor.mpileup,同时也有助于缩小空间占用

# -d, --max-depth INT max per-file depth; avoids excessive memory usage [8000]
# -B, --no-BAQ disable BAQ (per-Base Alignment Quality)
# -q, --min-MQ INT skip alignments with mapQ smaller than INT [0]
# -f, --fasta-ref FILE faidx indexed reference sequence file

# mpileup文件很大
total 9.7G
-rwxrwxrwx 1 eric eric 9.7G Nov 24 11:15 normal-tumor.mpileup

现在,可以使用进行VARIANT CALLING

varscan是基于启发式的算法,并通过对一组样本进行统计检验来检测变异,可用于检测somatic/germline variants或CNV

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
***NON-COMMERCIAL VERSION***

USAGE: java -jar VarScan.jar [COMMAND] [OPTIONS]

COMMANDS:
pileup2snp Identify SNPs from a pileup file
pileup2indel Identify indels a pileup file
pileup2cns Call consensus and variants from a pileup file
mpileup2snp Identify SNPs from an mpileup file
mpileup2indel Identify indels an mpileup file
mpileup2cns Call consensus and variants from an mpileup file

somatic Call germline/somatic variants from tumor-normal pileups
copynumber Determine relative tumor copy number from tumor-normal pileups
readcounts Obtain read counts for a list of variants from a pileup file

filter Filter SNPs by coverage, frequency, p-value, etc.
somaticFilter Filter somatic variants for clusters/indels
fpfilter Apply the false-positive filter

processSomatic Isolate Germline/LOH/Somatic calls from output
copyCaller GC-adjust and process copy number changes from VarScan copynumber output
compare Compare two lists of positions/variants
limit Restrict pileup/snps/indels to ROI positions

对于somatic/germline variants,可以使用varscan somatic命令

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
USAGE: VarScan somatic [normal_pileup] [tumor_pileup] [Opt: output] OPTIONS
normal_pileup - The SAMtools pileup file for Normal
tumor_pileup - The SAMtools pileup file for Tumor
output - Output base name for SNP and indel output

OPTIONS:
--output-snp - Output file for SNP calls [output.snp]
--output-indel - Output file for indel calls [output.indel]
--min-coverage - Minimum coverage in normal and tumor to call variant [8]
--min-coverage-normal - Minimum coverage in normal to call somatic [8]
--min-coverage-tumor - Minimum coverage in tumor to call somatic [6]
--min-var-freq - Minimum variant frequency to call a heterozygote [0.20]
--min-freq-for-hom Minimum frequency to call homozygote [0.75]
--normal-purity - Estimated purity (non-tumor content) of normal sample [1.00]
--tumor-purity - Estimated purity (tumor content) of tumor sample [1.00]
--p-value - P-value threshold to call a heterozygote [0.99]
--somatic-p-value - P-value threshold to call a somatic site [0.05]
--strand-filter - If set to 1, removes variants with >90% strand bias [0]
--validation - If set to 1, outputs all compared positions even if non-variant
--output-vcf - If set to 1, output VCF instead of VarScan native format

# 在本例中,使用如下参数:
$ $ varscan somatic /mnt/d/cancer/variant/normal-tumor.mpileup /mnt/d/cancer/variant/varscan/varscan --mpileup 1 --output-vcf 1 --strand-filter 1 --somatic-p-value 0.001

Min coverage: 8x for Normal, 6x for Tumor
Min reads2: 2
Min strands2: 1
Min var freq: 0.2
Min freq for hom: 0.75
Normal purity: 1.0
Tumor purity: 1.0
Min avg qual: 15
P-value thresh: 0.99
Somatic p-value: 0.001
Reading input from /mnt/d/cancer/variant/normal-tumor.mpileup
Reading mpileup input...
87538592 positions in mpileup file
43808558 had sufficient coverage for comparison
43766686 were called Reference
0 were mixed SNP-indel calls and filtered
0 were removed by the strand filter
41675 were called Germline
16 were called LOH
181 were called Somatic
0 were called Unknown
0 were called Variant

# 在mpileup文件中有87538592个位置,其中43808558个位置有足够的测序深度进行对比,43766686个位置是Reference,41675个是Germline,16个LOH,181个是Somatic。我们来看一下生成的VCF文件的内容,前面已经介绍过对于VCF文件的操作

total 5.9M
-rwxrwxrwx 1 eric eric 305K Nov 24 12:18 varscan.indel.vcf
-rwxrwxrwx 1 eric eric 5.6M Nov 24 12:18 varscan.snp.vcf

$ wc -l *
1974 varscan.indel.vcf
37579 varscan.snp.vcf
39553 total

$ grep -v '^#' varscan.indel.vcf | wc -l
1956
$ grep -v '^#' varscan.snp.vcf | wc -l
37561

$ grep -v '^#' varscan.indel.vcf | head
chr1 866511 . C CCCCT . PASS DP=22;SS=1;SSC=3;GPV=5.3885E-2;SPV=4.4962E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:9:8:1:11.11%:3,5,0,1 0/1:.:13:10:3:23.08%:4,6,2,1
chr1 948846 . T TA . PASS DP=26;SS=1;SSC=4;GPV=1.0968E-2;SPV=3.202E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:13:10:2:16.67%:10,0,2,0 0/1:.:13:8:4:33.33%:8,0,4,0
chr1 1110987 . T TC . PASS DP=109;SS=1;SSC=10;GPV=5.5481E-19;SPV=9.2343E-2 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:50:31:19:38%:27,4,18,1 0/1:.:59:28:31:52.54%:28,0,31,0
chr1 1177918 . CT C . PASS DP=81;SS=1;SSC=2;GPV=5.0311E-18;SPV=5.4934E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:40:18:22:55%:18,0,21,1 0/1:.:41:18:23:56.1%:18,0,23,0
chr1 1276930 . GCA G . PASS DP=111;SS=1;SSC=2;GPV=3.1066E-64;SPV=5.8559E-1 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:46:0:46:100%:0,0,46,0 1/1:.:65:1:64:98.46%:1,0,64,0
chr1 1276973 . G GACAC . PASS DP=344;SS=1;SSC=2;GPV=2.5102E-171;SPV=5.1236E-1 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:147:6:133:95.68%:6,0,125,8 1/1:.:197:7:178:96.22%:6,1,164,14
chr1 1289367 . CTG C . PASS DP=45;SS=1;SSC=6;GPV=1.6658E-22;SPV=2.3087E-1 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:28:3:25:89.29%:0,3,19,6 1/1:.:17:0:17:100%:0,0,15,2
chr1 1647893 . C CTTTCTT . PASS DP=148;SS=1;SSC=1;GPV=1.5376E-11;SPV=6.6262E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:61:46:14:23.33%:27,19,6,8 0/1:.:87:68:19:21.84%:41,27,11,8
chr1 1887091 . CG C . PASS DP=83;SS=1;SSC=9;GPV=5.6303E-12;SPV=1.1544E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:34:24:10:29.41%:18,6,8,2 0/1:.:49:27:22:44.9%:26,1,21,1
chr1 1887111 . GC G . PASS DP=101;SS=1;SSC=5;GPV=3.6902E-16;SPV=2.8695E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:42:26:16:38.1%:20,6,12,4 0/1:.:59:32:27:45.76%:27,5,23,4

$ grep -v '^#' varscan.snp.vcf | head
chr1 69511 . A G . PASS DP=60;SS=1;SSC=0;GPV=1.035E-35;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:36:0:36:100%:0,0,33,3 1/1:.:24:0:24:100%:0,0,22,2
chr1 762273 . G A . PASS DP=66;SS=1;SSC=0;GPV=2.6498E-39;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:37:0:37:100%:0,0,12,25 1/1:.:29:0:29:100%:0,0,14,15
chr1 777232 . C T . PASS DP=28;SS=1;SSC=0;GPV=2.3037E-6;SPV=9.2937E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:18:7:11:61.11%:7,0,9,2 0/1:.:10:6:4:40%:6,0,3,1
chr1 792480 . C T . PASS DP=27;SS=1;SSC=0;GPV=5.1363E-16;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:16:0:16:100%:0,0,6,10 1/1:.:11:0:11:100%:0,0,5,6
chr1 866319 . G A . PASS DP=49;SS=1;SSC=0;GPV=3.925E-29;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:21:0:21:100%:0,0,21,0 1/1:.:28:0:28:100%:0,0,28,0
chr1 876499 . A G . PASS DP=31;SS=1;SSC=0;GPV=2.1486E-18;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:19:0:19:100%:0,0,18,1 1/1:.:12:0:12:100%:0,0,10,2
chr1 880238 . A G . PASS DP=78;SS=1;SSC=0;GPV=1.7165E-46;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:56:0:56:100%:0,0,37,19 1/1:.:22:0:22:100%:0,0,14,8
chr1 881627 . G A . PASS DP=57;SS=1;SSC=0;GPV=6.4572E-34;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:22:0:22:100%:0,0,19,3 1/1:.:35:0:35:100%:0,0,32,3
chr1 883625 . A G . PASS DP=87;SS=1;SSC=0;GPV=6.9142E-52;SPV=1E0 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:32:0:32:100%:0,0,26,6 1/1:.:55:0:55:100%:0,0,42,13
chr1 884101 . A C . PASS DP=117;SS=1;SSC=0;GPV=7.359E-48;SPV=9.3085E-1 GT:GQ:DP:RD:AD:FREQ:DP4 1/1:.:61:7:54:88.52%:4,3,21,33 1/1:.:56:11:45:80.36%:9,2,20,25

$ grep -v '^#' varscan.snp.vcf | cut -f 1 | sort | uniq -c
3781 chr1
1640 chr10
2783 chr11
1844 chr12
476 chr13
1289 chr14
1210 chr15
2049 chr16
2432 chr17
489 chr18
3112 chr19
2336 chr2
1091 chr20
572 chr21
1110 chr22
1720 chr3
1239 chr4
1509 chr5
1492 chr6
2162 chr7
1263 chr8
1635 chr9
322 chrX
5 chrY
$ grep -v '^#' varscan.indel.vcf | cut -f 1 | sort | uniq -c
189 chr1
69 chr10
137 chr11
116 chr12
20 chr13
76 chr14
67 chr15
100 chr16
162 chr17
25 chr18
140 chr19
102 chr2
56 chr20
40 chr21
58 chr22
90 chr3
57 chr4
77 chr5
91 chr6
115 chr7
59 chr8
83 chr9
27 chrX

# chr1没有INDEL,但有最多的SNP

# 只抓取SOMATIC
$ grep SOMATIC varscan.indel.vcf
##INFO=<ID=SOMATIC,Number=0,Type=Flag,Description="Indicates if record is a somatic mutation">
chr1 67154730 . CAAA C . PASS DP=21;SOMATIC;SS=2;SSC=9;GPV=1E0;SPV=1.1905E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:9:3:0:0%:3,0,0,0 0/1:.:12:2:4:66.67%:2,0,4,0
chr1 116940496 . TG T . PASS DP=30;SOMATIC;SS=2;SSC=9;GPV=1E0;SPV=1.0345E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:20:20:0:0%:18,2,0,0 0/1:.:10:8:2:20%:8,0,2,0
chr11 134054705 . TAC T . PASS DP=17;SOMATIC;SS=2;SSC=7;GPV=1E0;SPV=1.9231E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:9:7:0:0%:4,3,0,0 0/1:.:8:4:2:33.33%:4,0,2,0
chr12 104732900 . AT A . PASS DP=18;SOMATIC;SS=2;SSC=7;GPV=1E0;SPV=1.8301E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:10:10:0:0%:10,0,0,0 0/1:.:8:6:2:25%:6,0,2,0
chr12 122064778 . CA C . PASS DP=116;SOMATIC;SS=2;SSC=7;GPV=1E0;SPV=1.8182E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:43:6:0:0%:4,2,0,0 0/1:.:73:3:2:40%:2,1,1,1
chr12 132445122 . AT A . PASS DP=22;SOMATIC;SS=2;SSC=9;GPV=1E0;SPV=1.2121E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:14:14:0:0%:11,3,0,0 0/1:.:8:6:2:25%:6,0,2,0
chr14 50623681 . GTA G . PASS DP=18;SOMATIC;SS=2;SSC=9;GPV=1E0;SPV=1.0294E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:9:9:0:0%:9,0,0,0 0/1:.:9:6:3:33.33%:6,0,3,0
chr15 50782434 . GT G . PASS DP=20;SOMATIC;SS=2;SSC=6;GPV=1E0;SPV=2.3684E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:10:10:0:0%:10,0,0,0 0/1:.:10:8:2:20%:8,0,2,0
chr17 54981571 . CA C . PASS DP=24;SOMATIC;SS=2;SSC=5;GPV=1E0;SPV=2.7368E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:9:7:0:0%:7,0,0,0 0/1:.:15:11:3:21.43%:11,0,3,0
chr19 6469042 . CT C . PASS DP=35;SOMATIC;SS=2;SSC=10;GPV=1E0;SPV=9.4721E-2 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:15:14:0:0%:10,4,0,0 0/1:.:20:15:4:21.05%:8,7,3,1
chr19 56000261 . G GTACA . PASS DP=16;SOMATIC;SS=2;SSC=6;GPV=1E0;SPV=2.3333E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:8:8:0:0%:5,3,0,0 0/1:.:8:6:2:25%:4,2,1,1
chr2 116593690 . TGTGA T . PASS DP=20;SOMATIC;SS=2;SSC=8;GPV=1E0;SPV=1.4737E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:12:12:0:0%:11,1,0,0 0/1:.:8:6:2:25%:6,0,2,0
chr3 10183699 . CG C . PASS DP=87;SOMATIC;SS=2;SSC=37;GPV=1E0;SPV=1.9809E-4 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:31:31:0:0%:30,1,0,0 0/1:.:56:39:17:30.36%:37,2,17,0
chr3 186393073 . GGTGT G . PASS DP=16;SOMATIC;SS=2;SSC=5;GPV=1E0;SPV=2.6667E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:8:7:0:0%:3,4,0,0 0/1:.:8:6:2:25%:4,2,1,1
chr4 4190507 . C CTTGTTAGTCA . PASS DP=24;SOMATIC;SS=2;SSC=10;GPV=1E0;SPV=9.4203E-2 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:10:10:0:0%:10,0,0,0 0/1:.:14:10:4:28.57%:10,0,4,0
chr4 4190511 . G GTTA . PASS DP=27;SOMATIC;SS=2;SSC=16;GPV=1E0;SPV=2.19E-2 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:10:10:0:0%:10,0,0,0 0/1:.:17:10:7:41.18%:9,1,4,3
chr4 88048315 . GC G . PASS DP=27;SOMATIC;SS=2;SSC=12;GPV=1E0;SPV=5.641E-2 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:16:16:0:0%:0,16,0,0 0/1:.:11:8:3:27.27%:4,4,2,1
chr7 56087291 . AC A . PASS DP=24;SOMATIC;SS=2;SSC=6;GPV=1E0;SPV=2.248E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:9:9:0:0%:9,0,0,0 0/1:.:15:12:3:20%:11,1,3,0
chr7 114293360 . AT A . PASS DP=27;SOMATIC;SS=2;SSC=9;GPV=1E0;SPV=1.2E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:17:16:0:0%:16,0,0,0 0/1:.:10:7:2:22.22%:7,0,2,0
chr8 145738581 . TGGG T . PASS DP=22;SOMATIC;SS=2;SSC=7;GPV=1E0;SPV=1.8947E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:13:11:0:0%:10,1,0,0 0/1:.:9:7:2:22.22%:7,0,1,1
chr9 125932152 . CAT C . PASS DP=24;SOMATIC;SS=2;SSC=6;GPV=1E0;SPV=2.1053E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:15:10:0:0%:9,1,0,0 0/1:.:9:7:2:22.22%:7,0,2,0
chrX 47107214 . TC T . PASS DP=165;SOMATIC;SS=2;SSC=98;GPV=1E0;SPV=1.4237E-10 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:59:58:0:0%:37,21,0,0 0/1:.:106:63:43:40.57%:42,21,30,13
chrX 100666907 . CT C . PASS DP=21;SOMATIC;SS=2;SSC=7;GPV=1E0;SPV=1.75E-1 GT:GQ:DP:RD:AD:FREQ:DP4 0/0:.:12:9:0:0%:9,0,0,0 0/1:.:9:5:2:28.57%:5,0,2,0

$ grep SOMATIC varscan.indel.vcf | wc -l
24
$ grep 'SS=2' varscan.indel.vcf |wc -l # 体会一下这两个参数的区别,都是指SOMATIC
23

$ grep SOMATIC varscan.snp.vcf | wc -l
108
$ grep 'SS=2' varscan.snp.vcf | wc -l
107

# 同样,可以看一下Germline有多少
$ grep 'SS=1' varscan.indel.vcf | wc -l
1933
$ grep 'SS=1' varscan.snp.vcf | wc -l
37442

# 看一下LOH
$ grep 'SS=3' varscan.indel.vcf | wc -l
0
$ grep 'SS=3' varscan.snp.vcf | wc -l
12
$ grep 'SS=3' varscan.snp.vcf | head
chr1 45218895 . A T . PASS DP=52;SS=3;SSC=35;GPV=1E0;SPV=3.0463E-4 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:28:20:8:28.57%:17,3,8,0 1/1:.:24:5:19:79.17%:5,0,18,1
chr11 72316344 . G T . PASS DP=63;SS=3;SSC=32;GPV=1E0;SPV=5.3356E-4 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:44:25:19:43.18%:9,16,7,12 1/1:.:19:2:17:89.47%:1,1,10,7
chr14 105935577 . C A . PASS DP=75;SS=3;SSC=31;GPV=1E0;SPV=6.9789E-4 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:33:24:8:25%:24,0,0,8 0/0:.:42:42:0:0%:38,4,0,0
chr16 71956593 . T G . PASS DP=151;SS=3;SSC=66;GPV=1E0;SPV=2.2714E-7 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:74:37:37:50%:10,27,8,29 0/0:.:77:64:8:11.11%:17,47,3,5
chr16 71956596 . A G . PASS DP=146;SS=3;SSC=66;GPV=1E0;SPV=2.2714E-7 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:74:37:37:50%:10,27,8,29 0/0:.:72:64:8:11.11%:17,47,3,5
chr16 71956597 . A T . PASS DP=146;SS=3;SSC=37;GPV=1E0;SPV=1.6498E-4 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:74:51:23:31.08%:24,27,8,15 0/0:.:72:67:5:6.94%:20,47,3,2
chr3 19959756 . G A . PASS DP=90;SS=3;SSC=37;GPV=1E0;SPV=1.6442E-4 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:61:25:36:59.02%:19,6,28,8 0/0:.:29:24:5:17.24%:18,6,2,3
chr3 48732390 . A G . PASS DP=31;SS=3;SSC=34;GPV=1E0;SPV=3.5706E-4 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:19:7:12:63.16%:4,3,11,1 0/0:.:12:12:0:0%:8,4,0,0
chr3 52474480 . A G . PASS DP=69;SS=3;SSC=39;GPV=1E0;SPV=1.2352E-4 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:37:21:16:43.24%:20,1,16,0 1/1:.:32:4:28:87.5%:4,0,28,0
chr3 52860936 . C T . PASS DP=153;SS=3;SSC=46;GPV=1E0;SPV=2.3344E-5 GT:GQ:DP:RD:AD:FREQ:DP4 0/1:.:91:43:48:52.75%:25,18,26,22 0/0:.:62:50:12:19.35%:35,15,5,7
# 这些VCF文件我们后面要作图进行可视化,先将VCF文件压缩并生成索引
$ bgzip varscan.indel.vcf
$ bgzip varscan.snp.vcf
$ bcftools index -t varscan.indel.vcf.gz
$ bcftools index -t varscan.snp.vcf.gz
# 查看
total 1.6M
-rwxrwxrwx 1 eric eric 86K Nov 24 22:33 varscan.indel.vcf.gz
-rwxrwxrwx 1 eric eric 26K Nov 24 22:34 varscan.indel.vcf.gz.tbi
-rwxrwxrwx 1 eric eric 1.4M Nov 24 22:33 varscan.snp.vcf.gz
-rwxrwxrwx 1 eric eric 175K Nov 24 22:34 varscan.snp.vcf.gz.tbi

二、MuTect2

MuTect2使用起来非常繁琐,软件更新十分频繁,因此流程变来变去,不推荐使用。详情敬请移步官方教程:
Call somatic mutations using GATK4 Mutect2
Mutect2;
Somatic short variant discovery (SNVs + Indels)

2.1 创建PON (panel of normals)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
$ gatk Mutect2 -R /mnt/d/ncbi/hg19/BWAIndex/hg19.fa -I /mnt/d/cancer/bqsr/SRR5478492C.bqsr.nodup.bam --max-mnp-distance 0 -O /mnt/d/cancer/variant/mutect/normal.vcf.gz

# 运行完成后会生成3个文件
# Step 1: Run Mutect2 in tumor-only mode for each normal sample
total 5.2M
-rwxrwxrwx 1 eric eric 4.8M Nov 26 22:01 normal.vcf.gz
-rwxrwxrwx 1 eric eric 37 Nov 26 22:01 normal.vcf.gz.stats
-rwxrwxrwx 1 eric eric 414K Nov 26 22:01 normal.vcf.gz.tbi

# Step 2: Create a GenomicsDB from the normal Mutect2 calls
$ gatk GenomicsDBImport -R /mnt/d/ncbi/hg19/BWAIndex/hg19.fa -L /mnt/d/ncbi/wxs/agilent/sureSelect_All_Exon_V5/S04380110_Padded.bed --genomicsdb-workspace-path pon_db -V /mnt/d/cancer/variant/mutect/normal.vcf.gz
# 注意,此处"--genomicsdb-workspace-path pon_db"是为GenomicsDB提供工作目录,即在运行目录下会新建一个pon_db目录

# Step 3: Combine the normal calls using CreateSomaticPanelOfNormals
$ gatk CreateSomaticPanelOfNormals -R /mnt/d/ncbi/hg19/BWAIndex/hg19.fa --germline-resource /mnt/d/ncbi/wxs/gnomAD/af-only-gnomad.hg19.vcf.gz -V gendb://pon_db -O /mnt/d/cancer/variant/mutect/pon.vcf.gz
# 现在,得到了PON文件,这个文件将作为下一步的输入文件

第三步中,需要用到一个af-only-gnomad.hg19.vcf.gz文件,如果使用hg38参考基因组,这个文件可以从gatk的resource of bundle上下载,但是对于hg19,并不提供这个文件。但是,我们可以找到gnomAD的完全版,8.3G,我们可以使用bcftools来提取AF信息,并生成新的文件,过程如下:

下载地址: hg19

1
2
3
4
5
6
7
8
9
10
11
12
# 下载后的文件
-rwxrwxrwx 1 eric eric 8.3G Nov 27 09:45 gnomad.exomes.r2.0.2.sites.vcf.gz
-rwxrwxrwx 1 eric eric 902K Nov 27 10:13 gnomad.exomes.r2.0.2.sites.vcf.gz.tbi

# 提取AF,压缩并索引
$ bcftools annotate -x ^INFO/AF gnomad.exomes.r2.0.2.sites.vcf.gz > af-only-gnomad.hg19.vcf
$ bgzip af-only-gnomad.hg19.vcf
$ bcftools index -t af-only-gnomad.hg19.vcf.gz

total 212M
-rwxrwxrwx 1 eric eric 211M Nov 27 22:13 af-only-gnomad.hg19.vcf.gz
-rwxrwxrwx 1 eric eric 534K Nov 27 22:15 af-only-gnomad.hg19.vcf.gz.tbi

2.2 获取原始的候选变异

1
gatk Mutect2 -R /mnt/d/ncbi/hg19/BWAIndex/hg19.fa -I /mnt/d/cancer/bqsr/SRR5478491C.bqsr.nodup.bam -I /mnt/d/cancer/bqsr/SRR5478492C.bqsr.nodup.bam -normal SRR5478492C.bqsr --germline-resource af-only-gnomad.vcf.gz --panel-of-normals pon.vcf.gz -O somatic.vcf.gz

https://hgdownload.soe.ucsc.edu/gbdb/hg19/gnomAD/vcf/

https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-b37;tab=objects?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false
https://gatk.broadinstitute.org/hc/en-us/articles/360035890711-GRCh37-hg19-b37-humanG1Kv37-Human-Reference-Discrepancies
https://gatk.broadinstitute.org/hc/en-us/articles/360035890951-Human-genome-reference-builds-GRCh38-or-hg38-b37-hg19
https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-
https://blog.csdn.net/viancheng/article/details/106332585

拷贝数变异

sequenza

生成GC content文件

1
$ sequenza-utils gc_wiggle --fasta /mnt/d/ncbi/hg19/BWAIndex/hg19.fa -w 50 -o /mnt/d/ncbi/wxs/sequenza/hg19_gc50.wig.gz

sequenza-utils bam2seqz –normal normal_sample.bam –tumor tumor_sample.bam \
–fasta genome.fa.gz -gc genome_gc50.wig.gz –output sample.seqz.gz

sequenza bam2seqz \
-n data/normal.chr5.60Mb.bam \
-t data/tumour.chr5.60Mb.bam \
–fasta assets/human_g1k_v37.fasta \
-gc assets/human_g1k_v37.gc50Base.txt.gz \
-C 5:1-60000000 | gzip > stage1.seqz.gz

-n: the normal BAM
-t: the tumour BAM
–fasta: the reference genome used for mapping (b37 here)
-gc: GC content as windows through the genome (pre-generated and downloadable from the Sequenza website)
-C: specifies the genomic location to process

$ sequenza-utils bam2seqz –normal /mnt/d/cancer/bqsr/SRR5478492C.bqsr.nodup.bam -t /mnt/d/cancer/bqsr/SRR5478491C.bqsr.nodup.bam –fasta /mnt/d/ncbi/hg19/BWAIndex/hg19.fa -gc /mnt/d/ncbi/wxs/sequenza/hg19_gc50.wig.gz –output /mnt/d/cancer/cnv/stage1.seqz.gz

$ sequenza-utils bam2seqz –normal /mnt/d/cancer/bqsr/SRR5478492C.bqsr.nodup.bam -t /mnt/d/cancer/bqsr/SRR5478491C.bqsr.nodup.bam –fasta /mnt/d/ncbi/hg19/BWAIndex/hg19.fa -gc /mnt/d/ncbi/wxs/sequenza/hg19_gc50.wig.gz –output /mnt/d/cancer/cnv/stage1.seqz.gz
[mpileup] 1 samples in 1 input files
[mpileup] 1 samples in 1 input files

total 75M
-rwxrwxrwx 1 eric eric 75M Nov 27 22:47 stage1.seqz.gz
-rwxrwxrwx 1 eric eric 126K Nov 27 22:47 stage1.seqz.gz.tbi

$ zcat stage1.seqz.gz | head -n 20
chromosome position base.ref depth.normal depth.tumor depth.ratio Af Bf zygosity.normal GC.percent good.reads AB.normal AB.tumor tumor.strand
chr1 69457 C 13 7 0.538 1.0 0 hom 48 7 C . 0
chr1 69458 T 13 7 0.538 1.0 0 hom 48 7 T . 0
chr1 69459 A 13 7 0.538 1.0 0 hom 48 7 A . 0
chr1 69460 C 14 8 0.571 1.0 0 hom 48 8 C . 0
chr1 69461 A 14 8 0.571 1.0 0 hom 48 8 A . 0
chr1 69462 C 14 8 0.571 1.0 0 hom 48 8 C . 0
chr1 69463 T 14 8 0.571 1.0 0 hom 48 8 T . 0
chr1 69464 A 15 8 0.533 1.0 0 hom 48 8 A . 0
chr1 69465 C 16 8 0.5 1.0 0 hom 48 8 C . 0
chr1 69466 A 17 9 0.529 1.0 0 hom 48 9 A . 0
chr1 69467 C 17 9 0.529 1.0 0 hom 48 9 C . 0
chr1 69468 T 17 9 0.529 1.0 0 hom 48 9 T . 0
chr1 69469 A 16 9 0.562 1.0 0 hom 48 9 A . 0
chr1 69470 C 18 9 0.5 1.0 0 hom 48 9 C . 0
chr1 69471 A 20 9 0.45 1.0 0 hom 48 9 A . 0
chr1 69472 A 22 9 0.409 1.0 0 hom 48 9 A . 0
chr1 69473 T 22 11 0.5 1.0 0 hom 48 11 T . 0
chr1 69474 T 22 14 0.636 1.0 0 hom 48 14 T . 0
chr1 69475 A 22 14 0.636 1.0 0 hom 48 14 A . 0

sequenza-utils seqz-binning \
-w 200 \
-s stage1.seqz.gz | gzip > stage2.seqz.gz

sequenza-utils seqz_binning -w 200 -s stage1.seqz.gz -o stage2.seqz.gz

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
> library(sequenza)
> data.file <- "./stage2.seqz.gz"
> seqzdata <- sequenza.extract(data.file)
Collecting GC information . done

Processing chr1:
1 variant calls.
2 copy-number segments.
2255 heterozygous positions.
107658 homozygous positions.
Processing chr10:
1 variant calls.
2 copy-number segments.
1005 heterozygous positions.
38009 homozygous positions.
Processing chr11:
2 variant calls.
3 copy-number segments.
1744 heterozygous positions.
71520 homozygous positions.
  • 本文作者:括囊无誉
  • 本文链接: WES/cancer_seq5/
  • 版权声明: 本博客所有文章均为原创作品,转载请注明出处!
------ 本文结束 ------
坚持原创文章分享,您的支持将鼓励我继续创作!