外显子测序分析二

参考资料:Germline short variant discovery (SNPs + Indels)

下面这张图是单样本种系突变的主要分析步骤(Main steps for Germline Single-Sample Data)

一、数据质控

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
fastp -i /mnt/d/wes/sample/fastq/sam01_1.fastq -I /mnt/d/wes/sample/fastq/sam01_2.fastq -o /mnt/d/wes/sample/qc/clean_sam01_1.fastq -O /mnt/d/wes/sample/qc/clean_sam01_2.fastq

# 耗时30秒
Read1 before filtering:
total reads: 1582559
total bases: 238966409
Q20 bases: 234035553(97.9366%)
Q30 bases: 225671450(94.4365%)

Read2 before filtering:
total reads: 1582559
total bases: 238966409
Q20 bases: 232564639(97.3211%)
Q30 bases: 222663993(93.1779%)

Read1 after filtering:
total reads: 1564770
total bases: 233787024
Q20 bases: 229439315(98.1403%)
Q30 bases: 221439194(94.7183%)

Read2 aftering filtering:
total reads: 1564770
total bases: 233787024
Q20 bases: 228711758(97.8291%)
Q30 bases: 219341614(93.8211%)

Filtering result:
reads passed filter: 3129540
reads failed due to low quality: 35106
reads failed due to too many N: 472
reads failed due to too short: 0
reads with adapter trimmed: 339368
bases trimmed due to adapters: 4994344

Duplication rate: 0.84949%

Insert size peak (evaluated by paired-end reads): 190

JSON report: fastp.json
HTML report: fastp.html

fastp -i /mnt/d/wes/sample/fastq/sam01_1.fastq -I /mnt/d/wes/sample/fastq/sam01_2.fastq -o /mnt/d/wes/sample/qc/clean_sam01_1.fastq -O /mnt/d/wes/sample/qc/clean_sam01_2.fastq
fastp v0.20.1, time used: 30 seconds

二、比对至基因组

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
bwa mem -t 4 -B 4 -R '@RG\tID:lane1\tSM:sample1\tLB:genome\tPL:ILLUMINA' -M ~/ncbi/hg38/chroms/hg38.fa /mnt/d/wes/sample/qc/clean_sam01_1.fastq /mnt/d/wes/sample/qc/clean_sam01_2.fastq | samtools view -q4 -bS -o - - | samtools sort -@4 -o /mnt/d/wes/sample/bam/clean_sam01.bam -
# 比对耗时30分钟
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 267772 sequences (40000138 bp)...
[M::process] read 267742 sequences (40000164 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 90228, 1, 2)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 197, 231)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (45, 355)
[M::mem_pestat] mean and std.dev: (202.01, 45.34)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 417)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 267772 reads in 272.156 CPU sec, 11919.176 real sec
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 90153, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 198, 231)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (45, 355)
[M::mem_pestat] mean and std.dev: (202.42, 45.38)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 417)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 267742 reads in 144.141 CPU sec, 120.016 real sec
[M::process] read 267734 sequences (40000200 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 90334, 1, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 198, 232)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (43, 358)
[M::mem_pestat] mean and std.dev: (202.52, 45.61)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 421)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::process] read 267748 sequences (40000292 bp)...
[M::mem_process_seqs] Processed 267734 reads in 151.547 CPU sec, 143.398 real sec
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 90216, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 198, 231)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (45, 355)
[M::mem_pestat] mean and std.dev: (202.43, 45.49)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 417)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 267748 reads in 146.859 CPU sec, 111.532 real sec
[M::process] read 267744 sequences (40000056 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (4, 90236, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 197, 231)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (45, 355)
[M::mem_pestat] mean and std.dev: (202.11, 45.29)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 417)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 267744 reads in 123.953 CPU sec, 90.588 real sec
[M::process] read 267718 sequences (40000024 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 90313, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 197, 231)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (45, 355)
[M::mem_pestat] mean and std.dev: (202.03, 45.17)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 417)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 267718 reads in 141.703 CPU sec, 111.135 real sec
[M::process] read 267686 sequences (40000084 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 90081, 1, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 198, 231)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (45, 355)
[M::mem_pestat] mean and std.dev: (202.38, 45.37)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 417)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 267686 reads in 134.469 CPU sec, 95.876 real sec
[M::process] read 267726 sequences (40000068 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (3, 90134, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 198, 231)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (45, 355)
[M::mem_pestat] mean and std.dev: (202.15, 45.43)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 417)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 267726 reads in 134.438 CPU sec, 100.282 real sec
[M::process] read 267692 sequences (40000016 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 90379, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (170, 198, 232)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (46, 356)
[M::mem_pestat] mean and std.dev: (202.61, 45.49)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 418)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 267692 reads in 134.969 CPU sec, 100.242 real sec
[M::process] read 267726 sequences (40000064 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 90356, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 197, 231)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (45, 355)
[M::mem_pestat] mean and std.dev: (202.19, 45.30)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 417)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::process] read 267698 sequences (40000206 bp)...
[M::mem_process_seqs] Processed 267726 reads in 141.531 CPU sec, 116.261 real sec
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 90064, 1, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 197, 231)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (45, 355)
[M::mem_pestat] mean and std.dev: (202.27, 45.37)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 417)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 267698 reads in 149.391 CPU sec, 125.276 real sec
[M::process] read 184554 sequences (27572736 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 62039, 1, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (169, 197, 231)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (45, 355)
[M::mem_pestat] mean and std.dev: (201.96, 45.23)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 417)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 184554 reads in 102.375 CPU sec, 77.923 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 4 -B 4 -R @RG\tID:lane1\tSM:sample1\tLB:genome\tPL:ILLUMINA -M /home/eric/ncbi/hg38/chroms/hg38.fa /mnt/d/wes/sample/qc/clean_sam01_1.fastq /mnt/d/wes/sample/qc/clean_sam01_2.fastq
[main] Real time: 13158.133 sec; CPU: 1804.688 sec
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
# 加上索引文件
samtools index /mnt/d/wes/sample/bam/clean_sam01.bam
# 查看文件大小
total 159M
-rwxrwxrwx 1 xxx xxx 155M Nov 6 18:45 clean_sam01.bam
-rwxrwxrwx 1 xxx xxx 3.1M Nov 6 19:12 clean_sam01.bam.bai

三、PCR去重

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
gatk MarkDuplicates -I /mnt/d/wes/sample/bam/clean_sam01.bam -O /mnt/d/wes/sample/nodup/nodup_sam01.bam -M /mnt/d/wes/sample/nodup/metrics_sam01.txt --REMOVE_DUPLICATES true --ASSUME_SORTED true

Using GATK jar /home/eric/miniconda3/share/gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/eric/miniconda3/share/gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar MarkDuplicates -I /mnt/d/wes/sample/bam/clean_sam01.bam -O /mnt/d/wes/sample/nodup/nodup_sam01.bam -M /mnt/d/wes/sample/nodup/metrics_sam01.txt --REMOVE_DUPLICATES true --ASSUME_SORTED true
19:17:37.667 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/eric/miniconda3/share/gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Nov 06 19:17:37 CST 2020] MarkDuplicates --INPUT /mnt/d/wes/sample/bam/clean_sam01.bam --OUTPUT /mnt/d/wes/sample/nodup/nodup_sam01.bam --METRICS_FILE /mnt/d/wes/sample/nodup/metrics_sam01.txt --REMOVE_DUPLICATES true --ASSUME_SORTED true --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --ADD_PG_TAG_TO_READS true --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
Nov 06, 2020 7:17:38 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
[Fri Nov 06 19:17:38 CST 2020] Executing as eric@LAPTOP-90TPP2NM on Linux 4.4.0-18362-Microsoft amd64; OpenJDK 64-Bit Server VM 1.8.0_152-release-1056-b12; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.1.9.0
INFO 2020-11-06 19:17:38 MarkDuplicates Start of doWork freeMemory: 430727552; totalMemory: 446169088; maxMemory: 1883242496
INFO 2020-11-06 19:17:38 MarkDuplicates Reading input file and constructing read end information.
INFO 2020-11-06 19:17:38 MarkDuplicates Will retain up to 6823342 data points before spilling to disk.
WARNING 2020-11-06 19:17:38 AbstractOpticalDuplicateFinderCommandLineProgram A field field parsed out of a read name was expected to contain an integer and did not. Read name: SRR12846241.10302161. Cause: String 'SRR12846241.10302161' did not start with a parsable number.
INFO 2020-11-06 19:17:49 MarkDuplicates Read 1,000,000 records. Elapsed time: 00:00:10s. Time for last 1,000,000: 10s. Last read position: chr17:17,169,090
INFO 2020-11-06 19:17:49 MarkDuplicates Tracking 2670 as yet unmatched pairs. 78 records in RAM.
INFO 2020-11-06 19:17:58 MarkDuplicates Read 2,000,000 records. Elapsed time: 00:00:19s. Time for last 1,000,000: 8s. Last read position: chr5:170,878,110
INFO 2020-11-06 19:17:58 MarkDuplicates Tracking 3897 as yet unmatched pairs. 134 records in RAM.
INFO 2020-11-06 19:18:02 MarkDuplicates Read 2540180 records. 4018 pairs never matched.
INFO 2020-11-06 19:18:03 MarkDuplicates After buildSortedReadEndLists freeMemory: 720026592; totalMemory: 1068498944; maxMemory: 1883242496
INFO 2020-11-06 19:18:03 MarkDuplicates Will retain up to 58851328 duplicate indices before spilling to disk.
INFO 2020-11-06 19:18:03 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2020-11-06 19:18:04 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2020-11-06 19:18:04 MarkDuplicates Sorting list of duplicate records.
INFO 2020-11-06 19:18:05 MarkDuplicates After generateDuplicateIndexes freeMemory: 1016276296; totalMemory: 1506803712; maxMemory: 1883242496
INFO 2020-11-06 19:18:05 MarkDuplicates Marking 22740 records as duplicates.
INFO 2020-11-06 19:18:05 MarkDuplicates Found 0 optical duplicate clusters.
INFO 2020-11-06 19:18:05 MarkDuplicates Reads are assumed to be ordered by: coordinate
INFO 2020-11-06 19:18:23 MarkDuplicates Writing complete. Closing input iterator.
INFO 2020-11-06 19:18:23 MarkDuplicates Duplicate Index cleanup.
INFO 2020-11-06 19:18:23 MarkDuplicates Getting Memory Stats.
INFO 2020-11-06 19:18:23 MarkDuplicates Before output close freeMemory: 1489644800; totalMemory: 1512046592; maxMemory: 1883242496
INFO 2020-11-06 19:18:23 MarkDuplicates Closed outputs. Getting more Memory Stats.
INFO 2020-11-06 19:18:23 MarkDuplicates After output close freeMemory: 1519717896; totalMemory: 1537736704; maxMemory: 1883242496
[Fri Nov 06 19:18:23 CST 2020] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.77 minutes.
Runtime.totalMemory()=1537736704
Tool returned:
0
# 加上索引文件
samtools index /mnt/d/wes/sample/nodup/nodup_sam01.bam
# 查看文件大小
total 216M
-rwxrwxrwx 1 xxx xxx 3.5K Nov 6 19:18 metrics_sam01.txt
-rwxrwxrwx 1 xxx xxx 216M Nov 6 19:18 nodup_sam01.bam

四、BQSR

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
# 第一步
gatk BaseRecalibrator -I /mnt/d/wes/sample/nodup/nodup_sam01.bam -R /home/eric/ncbi/hg38/chroms/hg38.fa --known-sites /home/eric/ncbi/hg38/vcf/dbsnp_138.hg38.vcf.gz --known-sites /home/eric/ncbi/hg38/vcf/1000G_phase1.snps.high_confidence.hg38.vcf.gz --known-sites /home/eric/ncbi/hg38/vcf/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz -L /home/eric/ncbi/hg38/vcf/hg38_v0_HybSelOligos_whole_exome_illumina_coding_v1_whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list -O /mnt/d/wes/sample/bqsr/recal.table

Using GATK jar /home/eric/miniconda3/share/gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/eric/miniconda3/share/gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar BaseRecalibrator -I /mnt/d/wes/sample/nodup/nodup_sam01.bam -R /home/eric/ncbi/hg38/chroms/hg38.fa --known-sites /home/eric/ncbi/hg38/vcf/dbsnp_138.hg38.vcf.gz --known-sites /home/eric/ncbi/hg38/vcf/1000G_phase1.snps.high_confidence.hg38.vcf.gz --known-sites /home/eric/ncbi/hg38/vcf/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz -L /home/eric/ncbi/hg38/vcf/hg38_v0_HybSelOligos_whole_exome_illumina_coding_v1_whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list -O /mnt/d/wes/sample/bqsr/recal.table
19:30:17.520 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/eric/miniconda3/share/gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Nov 06, 2020 7:30:18 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
19:30:18.041 INFO BaseRecalibrator - ------------------------------------------------------------
19:30:18.042 INFO BaseRecalibrator - The Genome Analysis Toolkit (GATK) v4.1.9.0
19:30:18.044 INFO BaseRecalibrator - For support and documentation go to https://software.broadinstitute.org/gatk/
19:30:18.046 INFO BaseRecalibrator - Executing as eric@LAPTOP-90TPP2NM on Linux v4.4.0-18362-Microsoft amd64
19:30:18.048 INFO BaseRecalibrator - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
19:30:18.050 INFO BaseRecalibrator - Start Date/Time: November 6, 2020 7:30:17 PM CST
19:30:18.052 INFO BaseRecalibrator - ------------------------------------------------------------
19:30:18.053 INFO BaseRecalibrator - ------------------------------------------------------------
19:30:18.057 INFO BaseRecalibrator - HTSJDK Version: 2.23.0
19:30:18.058 INFO BaseRecalibrator - Picard Version: 2.23.3
19:30:18.060 INFO BaseRecalibrator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
19:30:18.061 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
19:30:18.063 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
19:30:18.064 INFO BaseRecalibrator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
19:30:18.067 INFO BaseRecalibrator - Deflater: IntelDeflater
19:30:18.069 INFO BaseRecalibrator - Inflater: IntelInflater
19:30:18.071 INFO BaseRecalibrator - GCS max retries/reopens: 20
19:30:18.073 INFO BaseRecalibrator - Requester pays: disabled
19:30:18.074 INFO BaseRecalibrator - Initializing engine
19:30:18.976 INFO FeatureManager - Using codec VCFCodec to read file file:///home/eric/ncbi/hg38/vcf/dbsnp_138.hg38.vcf.gz
19:30:19.256 INFO FeatureManager - Using codec VCFCodec to read file file:///home/eric/ncbi/hg38/vcf/1000G_phase1.snps.high_confidence.hg38.vcf.gz
19:30:19.439 INFO FeatureManager - Using codec VCFCodec to read file file:///home/eric/ncbi/hg38/vcf/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
19:30:19.691 INFO FeatureManager - Using codec IntervalListCodec to read file file:///home/eric/ncbi/hg38/vcf/hg38_v0_HybSelOligos_whole_exome_illumina_coding_v1_whole_exome_illumina_coding_v1.Homo_sapiens_assembly38.targets.interval_list
19:30:24.305 INFO IntervalArgumentCollection - Processing 38692858 bp from intervals
19:30:24.465 INFO BaseRecalibrator - Done initializing engine
19:30:24.491 INFO BaseRecalibrationEngine - The covariates being used here:
19:30:24.493 INFO BaseRecalibrationEngine - ReadGroupCovariate
19:30:24.496 INFO BaseRecalibrationEngine - QualityScoreCovariate
19:30:24.497 INFO BaseRecalibrationEngine - ContextCovariate
19:30:24.499 INFO BaseRecalibrationEngine - CycleCovariate
19:30:24.520 INFO ProgressMeter - Starting traversal
19:30:24.524 INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
19:30:34.676 INFO ProgressMeter - chr1:37721636 0.2 45000 266009.9
19:30:44.837 INFO ProgressMeter - chr1:153937400 0.3 108000 319023.2
19:30:54.949 INFO ProgressMeter - chr1:232471505 0.5 164000 323428.9
19:31:05.042 INFO ProgressMeter - chr10:96206871 0.7 224000 331712.6
19:31:15.085 INFO ProgressMeter - chr11:57300855 0.8 282000 334651.9
19:31:25.278 INFO ProgressMeter - chr12:2688514 1.0 343000 338748.7
19:31:35.402 INFO ProgressMeter - chr12:100262225 1.2 402000 340307.9
19:31:45.512 INFO ProgressMeter - chr13:111202974 1.3 456000 337832.0
19:31:55.612 INFO ProgressMeter - chr15:28228255 1.5 514000 338581.1
19:32:05.652 INFO ProgressMeter - chr16:1495337 1.7 572000 339375.2
19:32:15.750 INFO ProgressMeter - chr16:88639021 1.9 634000 342012.5
19:32:25.807 INFO ProgressMeter - chr17:58316342 2.0 700000 346300.4
19:32:35.823 INFO ProgressMeter - chr18:79451785 2.2 756000 345473.7
19:32:45.892 INFO ProgressMeter - chr19:40938908 2.4 822000 348881.6
19:32:56.008 INFO ProgressMeter - chr2:46623835 2.5 884000 350138.3
19:33:06.010 INFO ProgressMeter - chr2:169735426 2.7 946000 351489.9
19:33:16.139 INFO ProgressMeter - chr20:20047689 2.9 1003000 350670.7
19:33:26.142 INFO ProgressMeter - chr22:23061695 3.0 1058000 349526.8
19:33:36.257 INFO ProgressMeter - chr3:48672416 3.2 1113000 348300.5
19:33:46.412 INFO ProgressMeter - chr3:171093877 3.4 1168000 347126.6
19:33:56.453 INFO ProgressMeter - chr4:99342797 3.5 1220000 345400.3
19:34:06.491 INFO ProgressMeter - chr5:79030744 3.7 1273000 344106.8
19:34:16.538 INFO ProgressMeter - chr6:10898046 3.9 1327000 343170.4
19:34:26.588 INFO ProgressMeter - chr6:136558746 4.0 1379000 341811.8
19:34:36.722 INFO ProgressMeter - chr7:87836098 4.2 1434000 341161.9
19:34:46.961 INFO ProgressMeter - chr8:24321118 4.4 1490000 340654.5
19:34:57.099 INFO ProgressMeter - chr9:5811124 4.5 1536000 338110.0
19:35:07.197 INFO ProgressMeter - chr9:132014280 4.7 1593000 338130.4
19:35:16.870 INFO BaseRecalibrator - 0 read(s) filtered by: MappingQualityNotZeroReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
425 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: WellformedReadFilter
425 total reads filtered
19:35:16.886 INFO ProgressMeter - chrY:13260230 4.9 1646850 337976.0
19:35:16.888 INFO ProgressMeter - Traversal complete. Processed 1646850 total reads in 4.9 minutes.
19:35:17.006 INFO BaseRecalibrator - Calculating quantized quality scores...
19:35:17.053 INFO BaseRecalibrator - Writing recalibration report...
19:35:17.663 INFO BaseRecalibrator - ...done!
19:35:17.712 INFO BaseRecalibrator - BaseRecalibrator was able to recalibrate 1646850 reads
19:35:17.753 INFO BaseRecalibrator - Shutting down engine
[November 6, 2020 7:35:17 PM CST] org.broadinstitute.hellbender.tools.walkers.bqsr.BaseRecalibrator done. Elapsed time: 5.01 minutes.
Runtime.totalMemory()=876609536
Tool returned:
SUCCESS

# 第二步
gatk ApplyBQSR -I /mnt/d/wes/sample/nodup/nodup_sam01.bam -R /home/eric/ncbi/hg38/chroms/hg38.fa --bqsr-recal-file /mnt/d/wes/sample/bqsr/recal.table -O /mnt/d/wes/sample/bqsr/bqsr_sam01.bam

Using GATK jar /home/eric/miniconda3/share/gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/eric/miniconda3/share/gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar ApplyBQSR -I /mnt/d/wes/sample/nodup/nodup_sam01.bam -R /home/eric/ncbi/hg38/chroms/hg38.fa --bqsr-recal-file /mnt/d/wes/sample/bqsr/recal.table -O /mnt/d/wes/sample/bqsr/bqsr_sam01.bam
19:38:44.345 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/eric/miniconda3/share/gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Nov 06, 2020 7:38:44 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
19:38:44.928 INFO ApplyBQSR - ------------------------------------------------------------
19:38:44.929 INFO ApplyBQSR - The Genome Analysis Toolkit (GATK) v4.1.9.0
19:38:44.931 INFO ApplyBQSR - For support and documentation go to https://software.broadinstitute.org/gatk/
19:38:44.934 INFO ApplyBQSR - Executing as eric@LAPTOP-90TPP2NM on Linux v4.4.0-18362-Microsoft amd64
19:38:44.936 INFO ApplyBQSR - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
19:38:44.938 INFO ApplyBQSR - Start Date/Time: November 6, 2020 7:38:44 PM CST
19:38:44.939 INFO ApplyBQSR - ------------------------------------------------------------
19:38:44.942 INFO ApplyBQSR - ------------------------------------------------------------
19:38:44.949 INFO ApplyBQSR - HTSJDK Version: 2.23.0
19:38:44.950 INFO ApplyBQSR - Picard Version: 2.23.3
19:38:44.957 INFO ApplyBQSR - HTSJDK Defaults.COMPRESSION_LEVEL : 2
19:38:44.959 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
19:38:44.961 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
19:38:44.962 INFO ApplyBQSR - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
19:38:44.963 INFO ApplyBQSR - Deflater: IntelDeflater
19:38:44.965 INFO ApplyBQSR - Inflater: IntelInflater
19:38:44.967 INFO ApplyBQSR - GCS max retries/reopens: 20
19:38:44.972 INFO ApplyBQSR - Requester pays: disabled
19:38:44.973 INFO ApplyBQSR - Initializing engine
19:38:45.850 INFO ApplyBQSR - Done initializing engine
19:38:45.930 INFO ProgressMeter - Starting traversal
19:38:45.937 INFO ProgressMeter - Current Locus Elapsed Minutes Reads Processed Reads/Minute
19:38:55.958 INFO ProgressMeter - chr1:235328572 0.2 249000 1491315.6
19:39:05.958 INFO ProgressMeter - chr12:96684881 0.3 597000 1789300.2
19:39:15.963 INFO ProgressMeter - chr17:4901462 0.5 969000 1936450.8
19:39:25.972 INFO ProgressMeter - chr2:96187143 0.7 1348000 2020383.7
19:39:35.978 INFO ProgressMeter - chr3:142790913 0.8 1735000 2080377.3
19:39:45.992 INFO ProgressMeter - chr7:21588589 1.0 2123000 2121126.3
19:39:55.994 INFO ProgressMeter - chrX:153961533 1.2 2509000 2148883.0
19:39:56.224 INFO ApplyBQSR - 0 read(s) filtered by: WellformedReadFilter

19:39:56.227 INFO ProgressMeter - chrY:25168958 1.2 2517440 2148964.3
19:39:56.228 INFO ProgressMeter - Traversal complete. Processed 2517440 total reads in 1.2 minutes.
19:39:56.277 INFO ApplyBQSR - Shutting down engine
[November 6, 2020 7:39:56 PM CST] org.broadinstitute.hellbender.tools.walkers.bqsr.ApplyBQSR done. Elapsed time: 1.20 minutes.
Runtime.totalMemory()=617086976
# 查看文件
total 421M
-rwxrwxrwx 1 xxx xxx 3.8M Nov 6 19:39 bqsr_sam01.bai
-rwxrwxrwx 1 xxx xxx 417M Nov 6 19:39 bqsr_sam01.bam
# 现在,BQSR完成后,就得到Analysis-Ready Reads

五、变异检测

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
gatk HaplotypeCaller -R /home/eric/ncbi/hg38/chroms/hg38.fa -I /mnt/d/wes/sample/bqsr/bqsr_sam01.bam -O /mnt/d/wes/sample/caller/output.g.vcf.gz -ERC GVCF
# 这一步比较耗时
22:06:51.680 INFO HaplotypeCaller - 17268 read(s) filtered by: MappingQualityReadFilter
0 read(s) filtered by: MappingQualityAvailableReadFilter
0 read(s) filtered by: MappedReadFilter
1538 read(s) filtered by: NotSecondaryAlignmentReadFilter
0 read(s) filtered by: NotDuplicateReadFilter
0 read(s) filtered by: PassesVendorQualityCheckReadFilter
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter
0 read(s) filtered by: GoodCigarReadFilter
0 read(s) filtered by: WellformedReadFilter
18806 total reads filtered
22:06:51.872 INFO ProgressMeter - chrY_KI270740v1_random:36001 86.3 10857084 125749.4
22:06:51.881 INFO ProgressMeter - Traversal complete. Processed 10857084 total regions in 86.3 minutes.
22:06:52.209 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 298.5258293
22:06:52.233 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 131.45 sec
22:06:52.234 INFO HaplotypeCaller - Shutting down engine
[November 6, 2020 10:06:52 PM CST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 86.37 minutes.
Runtime.totalMemory()=1270349824
# 下面转化为vcf.gz
gatk GenotypeGVCFs -R /home/eric/ncbi/hg38/chroms/hg38.fa -V /mnt/d/wes/sample/caller/output.g.vcf.gz -O /mnt/d/wes/sample/caller/output.vcf.gz
# 这个过程很快
Using GATK jar /home/eric/miniconda3/share/gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/eric/miniconda3/share/gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar GenotypeGVCFs -R /home/eric/ncbi/hg38/chroms/hg38.fa -V /mnt/d/wes/sample/caller/output.g.vcf.gz -O /mnt/d/wes/sample/caller/output.vcf.gz
22:14:25.026 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/eric/miniconda3/share/gatk4-4.1.9.0-0/gatk-package-4.1.9.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Nov 06, 2020 10:14:25 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
22:14:25.679 INFO GenotypeGVCFs - ------------------------------------------------------------
22:14:25.687 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.1.9.0
22:14:25.688 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
22:14:25.691 INFO GenotypeGVCFs - Executing as eric@LAPTOP-90TPP2NM on Linux v4.4.0-18362-Microsoft amd64
22:14:25.693 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_152-release-1056-b12
22:14:25.697 INFO GenotypeGVCFs - Start Date/Time: November 6, 2020 10:14:24 PM CST
22:14:25.700 INFO GenotypeGVCFs - ------------------------------------------------------------
22:14:25.703 INFO GenotypeGVCFs - ------------------------------------------------------------
22:14:25.705 INFO GenotypeGVCFs - HTSJDK Version: 2.23.0
22:14:25.706 INFO GenotypeGVCFs - Picard Version: 2.23.3
22:14:25.709 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
22:14:25.711 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
22:14:25.717 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
22:14:25.720 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
22:14:25.722 INFO GenotypeGVCFs - Deflater: IntelDeflater
22:14:25.724 INFO GenotypeGVCFs - Inflater: IntelInflater
22:14:25.728 INFO GenotypeGVCFs - GCS max retries/reopens: 20
22:14:25.729 INFO GenotypeGVCFs - Requester pays: disabled
22:14:25.734 INFO GenotypeGVCFs - Initializing engine
22:14:26.642 INFO FeatureManager - Using codec VCFCodec to read file file:///mnt/d/wes/sample/caller/output.g.vcf.gz
22:14:27.083 INFO GenotypeGVCFs - Done initializing engine
22:14:27.249 INFO ProgressMeter - Starting traversal
22:14:27.261 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
22:14:27.459 WARN ReferenceConfidenceVariantContextMerger - Detected invalid annotations: When trying to merge variant contexts at location chr1:14653 the annotation MLEAC=[0, 0] was not a numerical value and was ignored
22:14:27.707 WARN InbreedingCoeff - InbreedingCoeff will not be calculated; at least 10 samples must have called genotypes
22:14:37.332 INFO ProgressMeter - chr1:19155008 0.2 82000 488725.5
22:14:47.369 INFO ProgressMeter - chr1:81936862 0.3 255000 761042.6
22:14:57.397 INFO ProgressMeter - chr1:201005243 0.5 520000 1035478.4
22:15:07.551 INFO ProgressMeter - chr10:48156213 0.7 728000 1084247.6
22:15:17.572 INFO ProgressMeter - chr11:10800804 0.8 959000 1143777.2
22:15:27.591 INFO ProgressMeter - chr11:129436951 1.0 1238000 1231330.3
22:15:37.603 INFO ProgressMeter - chr12:125957935 1.2 1553000 1324746.2
22:15:47.616 INFO ProgressMeter - chr14:80761356 1.3 1831000 1367251.2
22:15:57.654 INFO ProgressMeter - chr16:10649222 1.5 2159000 1433139.0
22:16:07.676 INFO ProgressMeter - chr17:21318468 1.7 2465000 1472946.2
22:16:17.746 INFO ProgressMeter - chr18:31086752 1.8 2744000 1490211.0
22:16:27.768 INFO ProgressMeter - chr19:46527144 2.0 3066000 1526601.0
22:16:37.774 INFO ProgressMeter - chr2:100864558 2.2 3365000 1547019.7
22:16:47.780 INFO ProgressMeter - chr21:10473621 2.3 3816000 1629434.6
22:16:57.794 INFO ProgressMeter - chr3:151326235 2.5 4320000 1721927.3
22:17:07.829 INFO ProgressMeter - chr5:108575621 2.7 4796000 1792182.6
22:17:17.874 INFO ProgressMeter - chr7:17796878 2.8 5264000 1851250.5
22:17:27.887 INFO ProgressMeter - chr8:47824115 3.0 5638000 1872861.6
22:17:37.893 INFO ProgressMeter - chrX:130165728 3.2 6174000 1943261.2
22:17:38.980 INFO ProgressMeter - chrY:56836134 3.2 6220164 1946690.9
22:17:38.981 INFO ProgressMeter - Traversal complete. Processed 6220164 total variants in 3.2 minutes.
22:17:39.095 INFO GenotypeGVCFs - Shutting down engine
[November 6, 2020 10:17:39 PM CST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 3.24 minutes.
Runtime.totalMemory()=534773760

# 为理解下面的步骤,我们来看一下这个文件中的第一条记录:
CHROM: chr1
POS: 15903
ID: .
REF: G
ALT: GC
QUAL: 59.28
FILTER: .
INFO: AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=30.13;QD=29.64;SOR=2.303
FORMAT: GT:AD:DP:GQ:PL
sample1: 1/1:0,2:2:6:71,6,0

六、位点注释

位点注释需要用到三个工具:CNNScoreVariantsFilterVariantTranches,以及Funcotator

CNNScoreVariants是指使用CNN (Convolutional Neural Network)的分数对VCF文件进行注释,这个工具需要调用PYTHON程序,因此,需要预先搭建环境,最简单不易出错的方法是使用Docker,关于Docker的搭建及测试会专门写一篇文章介绍。

6.1 启动Docker

1
2
3
4
# 由于在Docker内部无法调用外部文件,因此,需要将外部文件挂载在Docker中,如果把Docker看作是一个独立的操作系统,挂载后的外部文件即相当于Docker内部的一个盘符,方便调用,在此,使用-V参数将D盘挂载到Docker下,并命名为my_data目录,路径为"/gatk/my_data/";同时,使用-it参数启动gatk。
$ docker run -v /mnt/d:/gatk/my_data -it broadinstitute/gatk:4.1.3.0

(gatk) root@9f0ac1b25dec:/gatk#

6.2 使用CNNScoreVariants进行评分

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
gatk CNNScoreVariants -V /gatk/my_data/wes/sample/caller/output.vcf.gz -R /gatk/my_data/ncbi/hg38/chroms/hg38.fa -O /gatk/my_data/wes/sample/annotate/annotated.vcf --disable-avx-check true
# --disable-avx-check true这个参数根据需要使用,如果系统中AVX的版本低于1.6,可以使用此参数跳过检查,否则会报错。

Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.3.0-local.jar CNNScoreVariants -V /gatk/my_data/wes/sample/caller/output.vcf.gz -R /gatk/my_data/ncbi/hg38/chroms/hg38.fa -O /gatk/my_data/wes/sample/annotate/annotated.vcf --disable-avx-check true
22:13:19.307 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Nov 11, 2020 10:13:21 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
22:13:21.208 INFO CNNScoreVariants - ------------------------------------------------------------
22:13:21.209 INFO CNNScoreVariants - The Genome Analysis Toolkit (GATK) v4.1.3.0
22:13:21.209 INFO CNNScoreVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
22:13:21.212 INFO CNNScoreVariants - Executing as root@9f0ac1b25dec on Linux v4.19.128-microsoft-standard amd64
22:13:21.213 INFO CNNScoreVariants - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12
22:13:21.213 INFO CNNScoreVariants - Start Date/Time: November 11, 2020 10:13:19 PM UTC
22:13:21.214 INFO CNNScoreVariants - ------------------------------------------------------------
22:13:21.214 INFO CNNScoreVariants - ------------------------------------------------------------
22:13:21.217 INFO CNNScoreVariants - HTSJDK Version: 2.20.1
22:13:21.220 INFO CNNScoreVariants - Picard Version: 2.20.5
22:13:21.220 INFO CNNScoreVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
22:13:21.220 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
22:13:21.220 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
22:13:21.220 INFO CNNScoreVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
22:13:21.220 INFO CNNScoreVariants - Deflater: IntelDeflater
22:13:21.220 INFO CNNScoreVariants - Inflater: IntelInflater
22:13:21.220 INFO CNNScoreVariants - GCS max retries/reopens: 20
22:13:21.220 INFO CNNScoreVariants - Requester pays: disabled
22:13:21.221 INFO CNNScoreVariants - Initializing engine
22:13:22.011 INFO FeatureManager - Using codec VCFCodec to read file file:///gatk/my_data/wes/sample/caller/output.vcf.gz
22:13:22.334 INFO CNNScoreVariants - Done initializing engine
22:13:37.467 INFO CNNScoreVariants - Using key:CNN_1D for CNN architecture:/tmp/1d_cnn_mix_train_full_bn.4399861944292417098.json and weights:/tmp/1d_cnn_mix_train_full_bn.6480348361258943885.hd5
22:13:42.239 INFO ProgressMeter - Starting traversal
22:13:42.239 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
22:13:42.247 INFO CNNScoreVariants - Starting pass 0 through the variants
22:14:02.248 INFO ProgressMeter - chr1:87045715 0.3 3000 9000.0
22:14:13.936 INFO ProgressMeter - chr1:147029254 0.5 4000 7571.9
22:14:25.604 INFO ProgressMeter - chr1:169397278 0.7 5000 6918.2
22:14:38.391 INFO ProgressMeter - chr1:214663767 0.9 6000 6411.3
22:14:49.750 INFO ProgressMeter - chr10:3082324 1.1 7000 6221.3
22:15:02.307 INFO ProgressMeter - chr10:45759533 1.3 8000 5995.1
22:15:15.155 INFO ProgressMeter - chr10:93404283 1.5 9000 5811.8
22:15:26.166 INFO ProgressMeter - chr10:133168333 1.7 10000 5773.3
22:15:37.367 INFO ProgressMeter - chr11:18255670 1.9 11000 5732.8
22:15:49.865 INFO ProgressMeter - chr11:63490944 2.1 12000 5641.5
22:16:02.650 INFO ProgressMeter - chr11:101593990 2.3 13000 5555.2
22:16:13.848 INFO ProgressMeter - chr12:2786919 2.5 14000 5540.6
22:16:25.342 INFO ProgressMeter - chr12:31102025 2.7 15000 5518.0
22:16:41.195 INFO ProgressMeter - chr12:71769943 3.0 16000 5364.5
22:16:53.911 INFO ProgressMeter - chr12:122986425 3.2 17000 5321.6
22:17:05.693 INFO ProgressMeter - chr13:35044829 3.4 18000 5308.4
22:17:17.926 INFO ProgressMeter - chr14:18656315 3.6 19000 5285.5
22:17:29.413 INFO ProgressMeter - chr14:61865740 3.8 20000 5282.3
22:17:40.900 INFO ProgressMeter - chr14_GL000225v1_random:23601 4.0 21000 5279.5
22:17:52.338 INFO ProgressMeter - chr15:51215311 4.2 22000 5277.9
22:18:03.518 INFO ProgressMeter - chr15:78796665 4.4 23000 5281.7
22:18:14.921 INFO ProgressMeter - chr16:3483577 4.5 24000 5280.9
22:18:26.399 INFO ProgressMeter - chr16:34594921 4.7 25000 5278.7
22:18:40.819 INFO ProgressMeter - chr16:84976364 5.0 26000 5224.7
22:18:57.583 INFO ProgressMeter - chr17:10640146 5.3 27000 5137.3
22:19:09.159 INFO ProgressMeter - chr17:40023414 5.4 28000 5138.9
22:19:20.628 INFO ProgressMeter - chr17:70095791 5.6 29000 5142.0
22:19:35.491 INFO ProgressMeter - chr18:11639357 5.9 30000 5095.5
01:21:13.590 INFO ProgressMeter - chr19:966458 187.5 31000 165.3
01:21:28.856 INFO ProgressMeter - chr19:12440901 187.8 32000 170.4
01:21:41.329 INFO ProgressMeter - chr19:38505335 188.0 33000 175.5
01:21:52.685 INFO ProgressMeter - chr19:51594208 188.2 34000 180.7
01:22:04.454 INFO ProgressMeter - chr2:23887741 188.4 35000 185.8
01:22:16.139 INFO ProgressMeter - chr2:72465143 188.6 36000 190.9
01:22:27.642 INFO ProgressMeter - chr2:112531455 188.8 37000 196.0
01:22:39.186 INFO ProgressMeter - chr2:165155693 188.9 38000 201.1
01:22:50.654 INFO ProgressMeter - chr2:208338389 189.1 39000 206.2
01:23:02.084 INFO ProgressMeter - chr20:1578455 189.3 40000 211.3
01:23:13.819 INFO ProgressMeter - chr20:38000451 189.5 41000 216.3
01:23:25.157 INFO ProgressMeter - chr21:14304122 189.7 42000 221.4
01:23:37.058 INFO ProgressMeter - chr22:36261582 189.9 44000 231.7
01:23:48.698 INFO ProgressMeter - chr3:15475403 190.1 45000 236.7
01:24:00.525 INFO ProgressMeter - chr3:57577393 190.3 46000 241.7
01:24:12.212 INFO ProgressMeter - chr3:117027527 190.5 47000 246.7
01:24:23.981 INFO ProgressMeter - chr3:171792652 190.7 48000 251.7
01:24:35.067 INFO ProgressMeter - chr4:7025314 190.9 49000 256.7
01:24:47.070 INFO ProgressMeter - chr4:68063022 191.1 50000 261.7
01:24:58.504 INFO ProgressMeter - chr4:125476979 191.3 51000 266.6
01:25:13.077 INFO ProgressMeter - chr5:6611870 191.5 52000 271.5
01:25:25.667 INFO ProgressMeter - chr5:79730716 191.7 53000 276.4
01:25:50.363 INFO ProgressMeter - chr5:149028538 192.1 54000 281.1
01:26:03.247 INFO ProgressMeter - chr6:10047168 192.4 55000 285.9
01:26:16.475 INFO ProgressMeter - chr6:52813539 192.6 56000 290.8
01:26:28.586 INFO ProgressMeter - chr6:131950089 192.8 57000 295.7
01:26:40.160 INFO ProgressMeter - chr7:2278738 193.0 58000 300.6
01:26:51.860 INFO ProgressMeter - chr7:47885733 193.2 59000 305.4
01:27:03.179 INFO ProgressMeter - chr7:88163268 193.3 60000 310.3
01:27:14.805 INFO ProgressMeter - chr7:129734366 193.5 61000 315.2
01:27:26.079 INFO ProgressMeter - chr8:8217007 193.7 62000 320.0
01:27:37.491 INFO ProgressMeter - chr8:43510803 193.9 63000 324.9
01:27:49.555 INFO ProgressMeter - chr8:136245034 194.1 64000 329.7
01:28:01.077 INFO ProgressMeter - chr9:39809861 194.3 65000 334.5
01:28:12.564 INFO ProgressMeter - chr9:93613446 194.5 66000 339.3
01:28:23.919 INFO ProgressMeter - chr9:131070323 194.7 67000 344.1
01:28:35.122 INFO ProgressMeter - chrUn_KI270754v1:33254 194.9 68000 348.9
01:28:47.983 INFO ProgressMeter - chrX:145716091 195.1 69000 353.7
01:28:48.562 INFO CNNScoreVariants - Finished pass 0 through the variants
01:29:06.929 INFO CNNScoreVariants - Starting pass 1 through the variants
01:29:07.160 INFO ProgressMeter - chr1:13198782 195.4 70000 358.2
01:29:11.852 INFO CNNScoreVariants - Finished pass 1 through the variants
01:29:11.853 INFO CNNScoreVariants - No variants filtered by: AllowAllVariantsVariantFilter
01:29:11.856 INFO CNNScoreVariants - No reads filtered by: AllowAllReadsReadFilter
01:29:11.861 INFO ProgressMeter - chrX:101883466 195.5 138600 709.0
01:29:11.862 INFO ProgressMeter - Traversal complete. Processed 138600 total variants in 195.5 minutes.
01:29:11.863 INFO CNNScoreVariants - Done scoring variants with CNN.
01:29:12.315 INFO CNNScoreVariants - Shutting down engine
[November 12, 2020 1:29:12 AM UTC] org.broadinstitute.hellbender.tools.walkers.vqsr.CNNScoreVariants done. Elapsed time: 195.88 minutes.
Runtime.totalMemory()=241696768

# 此步生成的annotated.vcf作为下一步的输入文件,这个注释过的文件是在INFO字段中加入CNN_1D的评分,如"CNN_1D=-0.800"
# 跟输入的文件作个对比看看:
CHROM: chr1
POS: 15903
ID: .
REF: G
ALT: GC
QUAL: 59.28
FILTER: .
INFO(输入): AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=30.13;QD=29.64;SOR=2.303
INFO(输出): AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=30.13;QD=29.64;SOR=2.303;*CNN_1D=-0.800;*
FORMAT: GT:AD:DP:GQ:PL
sample1: 1/1:0,2:2:6:71,6,0
# 两个文件的区别在于,输出文件在INFO字段加上CNN_1D评分,下一步将以此评分进行过滤

6.3 使用FilterVariantTranches过滤

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
# 这个工具根据注释后的VCF中的INFO字段的分数进行过滤,注释可以是CNNScoreVariants(CNNLOD)或VQSR (VQSLOD)

gatk FilterVariantTranches \
-V input.vcf.gz \
--resource hapmap.vcf \
--resource mills.vcf \
--info-key CNN_1D \
--snp-tranche 99.95 \
--indel-tranche 99.4 \
-O filtered.vcf

# 在Docker中运行

gatk FilterVariantTranches -V /gatk/my_data/wes/sample/annotate/annotated.vcf --resource /gatk/my_data/ncbi/hg38/vcf/hapmap_3.3.hg38.vcf.gz --resource /gatk/my_data/ncbi/hg38/vcf/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --info-key CNN_1D --snp-tranche 99.95 --indel-tranche 99.4 -O /gatk/my_data/wes/sample/annotate/annotated.filtered.vcf

Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.3.0-local.jar FilterVariantTranches -V /gatk/my_data/wes/sample/annotate/annotated.vcf --resource /gatk/my_data/ncbi/hg38/vcf/hapmap_3.3.hg38.vcf.gz --resource /gatk/my_data/ncbi/hg38/vcf/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz --info-key CNN_1D --snp-tranche 99.95 --indel-tranche 99.4 -O /gatk/my_data/wes/sample/annotate/annotated.filtered.vcf
01:40:26.170 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Nov 12, 2020 1:40:28 AM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
01:40:28.454 INFO FilterVariantTranches - ------------------------------------------------------------
01:40:28.455 INFO FilterVariantTranches - The Genome Analysis Toolkit (GATK) v4.1.3.0
01:40:28.455 INFO FilterVariantTranches - For support and documentation go to https://software.broadinstitute.org/gatk/
01:40:28.456 INFO FilterVariantTranches - Executing as root@9f0ac1b25dec on Linux v4.19.128-microsoft-standard amd64
01:40:28.456 INFO FilterVariantTranches - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12
01:40:28.456 INFO FilterVariantTranches - Start Date/Time: November 12, 2020 1:40:26 AM UTC
01:40:28.457 INFO FilterVariantTranches - ------------------------------------------------------------
01:40:28.457 INFO FilterVariantTranches - ------------------------------------------------------------
01:40:28.458 INFO FilterVariantTranches - HTSJDK Version: 2.20.1
01:40:28.458 INFO FilterVariantTranches - Picard Version: 2.20.5
01:40:28.458 INFO FilterVariantTranches - HTSJDK Defaults.COMPRESSION_LEVEL : 2
01:40:28.458 INFO FilterVariantTranches - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
01:40:28.458 INFO FilterVariantTranches - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
01:40:28.458 INFO FilterVariantTranches - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
01:40:28.459 INFO FilterVariantTranches - Deflater: IntelDeflater
01:40:28.459 INFO FilterVariantTranches - Inflater: IntelInflater
01:40:28.459 INFO FilterVariantTranches - GCS max retries/reopens: 20
01:40:28.459 INFO FilterVariantTranches - Requester pays: disabled
01:40:28.459 INFO FilterVariantTranches - Initializing engine
01:40:29.423 INFO FeatureManager - Using codec VCFCodec to read file file:///gatk/my_data/ncbi/hg38/vcf/hapmap_3.3.hg38.vcf.gz
01:40:30.099 INFO FeatureManager - Using codec VCFCodec to read file file:///gatk/my_data/ncbi/hg38/vcf/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
01:40:30.362 INFO FeatureManager - Using codec VCFCodec to read file file:///gatk/my_data/wes/sample/annotate/annotated.vcf
01:40:30.630 INFO FilterVariantTranches - Done initializing engine
01:40:30.796 INFO ProgressMeter - Starting traversal
01:40:30.797 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
01:40:30.801 INFO FilterVariantTranches - Starting pass 0 through the variants
01:40:40.840 INFO ProgressMeter - chr17:40023414 0.2 28000 167330.7
01:40:50.889 INFO ProgressMeter - chr9:131070323 0.3 67000 200099.6
01:40:51.287 INFO FilterVariantTranches - Finished pass 0 through the variants
01:40:51.287 INFO FilterVariantTranches - Found 57803 SNPs and 11479 indels with INFO score key:CNN_1D.
01:40:51.287 INFO FilterVariantTranches - Found 26323 SNPs and 3646 indels in the resources.
01:40:51.324 INFO FilterVariantTranches - Starting pass 1 through the variants
01:40:54.831 INFO FilterVariantTranches - Finished pass 1 through the variants
01:40:54.834 INFO FilterVariantTranches - No variants filtered by: AllowAllVariantsVariantFilter
01:40:54.836 INFO FilterVariantTranches - No reads filtered by: AllowAllReadsReadFilter
01:40:54.837 INFO ProgressMeter - chrX:101883466 0.4 138600 345952.2
01:40:54.838 INFO ProgressMeter - Traversal complete. Processed 138600 total variants in 0.4 minutes.
01:40:54.839 INFO FilterVariantTranches - Filtered 1760 SNPs out of 57803 and filtered 188 indels out of 11479 with INFO score: CNN_1D.
01:40:55.436 INFO FilterVariantTranches - Shutting down engine
[November 12, 2020 1:40:55 AM UTC] org.broadinstitute.hellbender.tools.walkers.vqsr.FilterVariantTranches done. Elapsed time: 0.49 minutes.
Runtime.totalMemory()=663748608

# 再看看输出文件的第一条记录,与前面的两条记录作对比:
CHROM: chr1
POS: 15903
ID: .
REF: G
ALT: GC
QUAL: 59.28
FILTER: PASS
INFO(Step1): AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=30.13;QD=29.64;SOR=2.303
INFO(输入): AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=30.13;QD=29.64;SOR=2.303;CNN_1D=-0.800;
INFO(输出): AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=30.13;QD=29.64;SOR=2.303;CNN_1D=-0.800;
FORMAT: GT:AD:DP:GQ:PL
sample1: 1/1:0,2:2:6:71,6,0
# INFO字段没有变化,但FILTER字段由"."变为"PASS"

6.4 使用Funcotator进行功能注释

在使用Funcotator进行功能注释前,需要下载Data Sources。GATK有两套预封装数据源,分别对应germline和somatic,可以使用gatk FuncotatorDataSourceDownloader工具进行下载

1
2
3
4
# For somatic data sources: (29GB)
gatk FuncotatorDataSourceDownloader --somatic --validate-integrity --extract-after-download
# For germline data sources:
gatk FuncotatorDataSourceDownloader --germline --validate-integrity --extract-after-download

数据源下载完成后即可进行分析,语法示例:

1
2
3
4
5
6
7
./gatk Funcotator \
-R reference.fasta \
-V input.vcf \
-O outputFile \
--output-file-format MAF \ #可以选择VCF或MAF
--data-sources-path dataSourcesFolder/ \
--ref-version hg19

在本例中,使用如下命令:

1
2
3
4
5
# Germline Mutation
gatk Funcotator -R /home/eric/ncbi/hg38/chroms/hg38.fa -V /mnt/d/wes/sample/annotate/annotated.filtered.vcf -O /mnt/d/wes/sample/annotate/annotated.filtered.funcotator.germline.vcf --output-file-format VCF --data-sources-path /mnt/d/ncbi/funcotator/funcotator_dataSources.v1.7.20200521g/ --ref-version hg38

# Somatic Mutation
gatk Funcotator -R /home/eric/ncbi/hg38/chroms/hg38.fa -V /mnt/d/wes/sample/annotate/annotated.filtered.vcf -O /mnt/d/wes/sample/annotate/annotated.filtered.funcotator.somatic.vcf --output-file-format VCF --data-sources-path /mnt/d/ncbi/funcotator/funcotator_dataSources.v1.7.20200521s/ --ref-version hg38

至此,我们得到了注释好的VCF文件,下面,就可以进行后续的功能分析了。

  • 本文作者:括囊无誉
  • 本文链接: WES/wes02/
  • 版权声明: 本博客所有文章均为原创作品,转载请注明出处!
------ 本文结束 ------
坚持原创文章分享,您的支持将鼓励我继续创作!