肿瘤基因组分析教程:四、BAM文件预处理

本教程目的是对上一步生成的BAM文件进行质控,并检测变异,以及对变异进行注释,参考教程:Single Nucleotide Variant Calling and AnnotationData pre-processing for variant discovery,以及Somatic short variant discovery (SNVs + Indels)

一、查看BAM信息

我们先来具体看一下BAM文件,下面是BAM文件主体内容的前4行

1
2
3
4
5
6
7
$ samtools view SRR5478491C.bam | head -n 4

SRR5478491.1 163 chr1 10037 19 13M1I87M = 10064 102 TAACCCTAACCCTAAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAGCC ?@@DDD?BBBAF?GFBAA<AAGEFHIJGGGE?DFIIBHFBBDF;F?FC8B=BFH=CD===E;CEHBBDFE66?5;;38?CC5?9?ADD(8?<<B1(2<A@C NM:i:2 MD:Z:97A2 MC:Z:75M AS:i:90 XS:i:91 RG:Z:SRR5478491
SRR5478491.2 163 chr1 10057 35 59M2D39M3S = 10081 103 ACCCTCACCCTCACCCTCACCCTCACCCTAACCCTAACCCTAACCCTAACCCAACCCTACCCAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTA CCCFFFFFHHGHHJJJIJIGJGEIGIIIIJJIJIIJJIDHGIEHGHIJIHAAFHGIHHECD>DFDCCECCBDDBCCBBD<<CDBBDCDBD<5<@<ADD@@< NM:i:7 MD:Z:5A5A5A5A35^AC2T36 MC:Z:13S35M2D36M1I6M AS:i:65
XS:i:61 RG:Z:SRR5478491
SRR5478491.1 83 chr1 10064 19 75M = 10037 -102 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAGCCCT 9?;55-;(A=;.>@?6=)HAC==7CAF@=.>CCB?98@?;B?AFC?*IGHGC9HHE@IIGGFAHFHFD?DDD?;= NM:i:1 MD:Z:70A4 MC:Z:13M1I87M AS:i:70 XS:i:65 RG:Z:SRR5478491
SRR5478491.2 83 chr1 10081 35 13S35M2D36M1I6M = 10057 -103 CCCCCTCCCCCTCACCCTAACCCTAACCCTAACCCTAACCCAACCCTACCCAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCCTAA 0&-@<90&?DBDBC@?@B@DDCBBBABCAAACBEDB?BFHHHEHIGF@DGGDBEJIIHFIIHFC?JIHFCEIHEEIJIHHHFFDFFFFBB@ NM:i:4 MD:Z:35^AC2T39 MC:Z:59M2D39M3S AS:i:61 XS:i:58 RG:Z:SRR5478491

SAM/BAM文件至少有11列,内容如下

1
2
3
4
5
6
7
8
9
10
11
# 1. QNAME, Query template NAME
# 2. FLAG, bitwise FLAG
# 3. RNAME, Reference sequence NAME
# 4. POS, 1-based
# 5. MAPQ, MAPping Quality
# 6. CIGAR,
# 7. RNEXT, Reference name of the mate/next read
# 8. PNEXT, Position of the mate/next read
# 9. TLEN, observed Template LENgth
# 10. SEQ, segment SEQuence
# 11. QUAL, ASCII of Phred-scaled base QUALity+33

特别关注一下第6列,此列名为CIGAR,比如,13M1I87M,代表如下信息:匹配13个碱基,第14个碱基是插入,从15至101(87个)碱基匹配,当然,M可以表示匹配也可表示SNP,即SNP也会被标记为M

1
2
3
4
M == base aligns but doesn’t have to be a match. A SNP will have an M even if it disagrees with the reference.
I == Insertion
D == Deletion
S == soft-clips. These are handy to find un removed adapters, viral insertions, etc.

关于SAM//BAM文件的格式,可参见其 说明文件SAM;关于FLAG,可参看FLAG,通过查询,可知上面的163代表:

1
2
3
4
read paired (0x1)
read mapped in proper pair (0x2)
mate reverse strand (0x20)
second in pair (0x80)

查看不配对的reads,可以使用samtools view -c -f4实现,若查看配对的reads,可以使用samtools view -c -F4实现

1
2
3
4
5
6
7
8
# -c       print only the count of matching records # 打印数目
# -f INT only include reads with all of the FLAGs in INT present [0] # 包含指定FLAG的reads,在此指定4,即read unmapped (0x4)
# -F INT only include reads with none of the FLAGS in INT present [0] # 不含指定FLAG的reads,在此指定4,即read unmapped (0x4)

$ samtools view -c -f4 SRR5478491C.bam
# 0
$ samtools view -c -F4 SRR5478491C.bam
# 36533039

二、BAM去除DUPLICATES

现在,我们得到的BAM文件称为RAW BAM,需要进行预处理,主要包括去掉PCR重复和BQSR,如下图

PCR去重可以通过MarkDuplicates来实现,即使用MarkDuplicates来标记重复,并将其直接去掉,不输出到BAM文件中,再使用samtools index生成索引文件。如果电脑性能比较强悍,建议使用最新的MarkDuplicatesSpark

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
$ gatk MarkDuplicates -I /mnt/d/cancer/bam/SRR5478491C.bam -O /mnt/d/cancer/nodup/SRR5478491C.nodup.bam -M /mnt/d/cancer/nodup/SRR5478491C.metrics.txt --REMOVE_DUPLICATES true --ASSUME_SORTED true --CREATE_INDEX true
$ gatk MarkDuplicates -I /mnt/d/cancer/bam/SRR5478492C.bam -O /mnt/d/cancer/nodup/SRR5478492C.nodup.bam -M /mnt/d/cancer/nodup/SRR5478492C.metrics.txt --REMOVE_DUPLICATES true --ASSUME_SORTED true --CREATE_INDEX true

# 记录一下文件大小
total 5.5G
-rwxrwxrwx 1 eric eric 3.4K Nov 22 15:21 SRR5478491C.metrics.txt
-rwxrwxrwx 1 eric eric 2.9G Nov 22 15:21 SRR5478491C.nodup.bam
-rwxrwxrwx 1 eric eric 4.1M Nov 22 15:27 SRR5478491C.nodup.bam.bai
-rwxrwxrwx 1 eric eric 3.4K Nov 22 15:21 SRR5478492C.metrics.txt
-rwxrwxrwx 1 eric eric 2.7G Nov 22 15:21 SRR5478492C.nodup.bam
-rwxrwxrwx 1 eric eric 4.0M Nov 22 15:27 SRR5478492C.nodup.bam.bai

# metrics.txt记录duplicates的信息,查看可知,重复率0.000003
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
P1T 5 18202826 127381 0 1 62 0 0.000003 2672112597763

三、校正碱基质量

现在,我们得到的是去掉PCR重复的BAM文件,下面进行Base (Quality Score) Recalibration (BQSR),即碱基质量校正,原因在于,测序过程中不同厂商的碱基读取有不同的倾向,需要重新校正,此步是通过机器学习来对测序结果中的碱基质量进行校正,通过两步实现:

  1. Build covariates based on context and known SNP sites. 根据物种已知SNP,生成用于质量校准的metrics
  2. Correct the reads based on these metrics. 根据metrics文件生成新的BAM文件

首先,需要先为hg19.fa生成dict文件,这可以通过调用CreateSequenceDictonary模块来实现

1
$ gatk CreateSequenceDictionary -R hg19.fa -O hg19.dict

3.1 BQSR step1,BaseRecalibrator

1
2
3
$ gatk BaseRecalibrator -I /mnt/d/cancer/nodup/SRR5478491C.nodup.bam -R /mnt/d/ncbi/hg19/BWAIndex/hg19.fa --known-sites /mnt/d/ncbi/hg19/bqsr/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz --known-sites /mnt/d/ncbi/hg19/bqsr/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz  --known-sites /mnt/d/ncbi/hg19/bqsr/dbsnp_138.hg19.vcf.gz -L /mnt/d/ncbi/wxs/agilent/sureSelect_All_Exon_V5/S04380110_Padded.bed -O /mnt/d/cancer/bqsr/recal_SRR5478491C.table

$ gatk BaseRecalibrator -I /mnt/d/cancer/nodup/SRR5478492C.nodup.bam -R /mnt/d/ncbi/hg19/BWAIndex/hg19.fa --known-sites /mnt/d/ncbi/hg19/bqsr/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz --known-sites /mnt/d/ncbi/hg19/bqsr/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz --known-sites /mnt/d/ncbi/hg19/bqsr/dbsnp_138.hg19.vcf.gz -L /mnt/d/ncbi/wxs/agilent/sureSelect_All_Exon_V5/S04380110_Padded.bed -O /mnt/d/cancer/bqsr/recal_SRR5478492C.table

–known-sites中用到的三个VCF文件,用于提供物种已知的多态性位点,这些位点将被分析程序跳过,下载地址resources bundle of hg19,下载好的文件可以解压后使用,但解压文件空间占用太大,建议重新压缩并生成索引后使用

1
2
3
4
5
6
7
8
9
10
11
12
$ bgzip -d ***.vcf.gz
$ bgzip ***.vcf # 压缩比约4:1 - 8:1,必须使用bgzip压缩
$ bcftools index -t ***.vcf.gz # 生成.tbi格式索引

# VCF文件如下
total 3.3G
-rwxrwxrwx 1 eric eric 1.8G Nov 22 19:03 1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
-rwxrwxrwx 1 eric eric 2.1M Nov 22 19:10 1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz.tbi
-rwxrwxrwx 1 eric eric 20M Nov 22 18:48 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz
-rwxrwxrwx 1 eric eric 1.4M Nov 22 19:06 Mills_and_1000G_gold_standard.indels.hg19.sites.vcf.gz.tbi
-rwxrwxrwx 1 eric eric 1.5G Nov 22 18:20 dbsnp_138.hg19.vcf.gz
-rwxrwxrwx 1 eric eric 2.2M Nov 22 18:26 dbsnp_138.hg19.vcf.gz.tbi

3.2 BQSR step2,ApplyBQSR

1
2
$ gatk ApplyBQSR -I /mnt/d/cancer/nodup/SRR5478491C.nodup.bam -R /mnt/d/ncbi/hg19/BWAIndex/hg19.fa --bqsr-recal-file /mnt/d/cancer/bqsr/recal_SRR5478491C.table -L /mnt/d/ncbi/wxs/agilent/sureSelect_All_Exon_V5/S04380110_Padded.bed -O /mnt/d/cancer/bqsr/SRR5478491C.bqsr.nodup.bam --create-output-bam-index true
$ gatk ApplyBQSR -I /mnt/d/cancer/nodup/SRR5478492C.nodup.bam -R /mnt/d/ncbi/hg19/BWAIndex/hg19.fa --bqsr-recal-file /mnt/d/cancer/bqsr/recal_SRR5478492C.table -L /mnt/d/ncbi/wxs/agilent/sureSelect_All_Exon_V5/S04380110_Padded.bed -O /mnt/d/cancer/bqsr/SRR5478492C.bqsr.nodup.bam --create-output-bam-index true

四、BAM文件质控

4.1 计算覆盖度

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# 使用[DepthOfCoverage](https://gatk.broadinstitute.org/hc/en-us/articles/360046789152-DepthOfCoverage-BETA-)可以计算BAM文件的覆盖度
gatk \
DepthOfCoverage \
-R reference.fasta \
-O file_name_base \
-I input_bams.list
[-geneList refSeq.sorted.refseq] \
[-pt readgroup] \
[-ct 4 -ct 6 -ct 10] \
[-L my_capture_genes.interval_list]
# 在本例中,使用如下参数
$ gatk DepthOfCoverage -R /mnt/d/ncbi/hg19/BWAIndex/hg19.fa -O /mnt/d/cancer/bqsr/SRR5478491C.bqsr.nodup.coverage -I /mnt/d/cancer/bqsr/SRR5478491C.bqsr.nodup.bam -L /mnt/d/ncbi/wxs/agilent/sureSelect_All_Exon_V5/S04380110_Padded.bed --omit-depth-output-at-each-base true --summary-coverage-threshold 10 --summary-coverage-threshold 25 --summary-coverage-threshold 50 --summary-coverage-threshold 100

$ gatk DepthOfCoverage -R /mnt/d/ncbi/hg19/BWAIndex/hg19.fa -O /mnt/d/cancer/bqsr/SRR5478492C.bqsr.nodup.coverage -I /mnt/d/cancer/bqsr/SRR5478492C.bqsr.nodup.bam -L /mnt/d/ncbi/wxs/agilent/sureSelect_All_Exon_V5/S04380110_Padded.bed --omit-depth-output-at-each-base true --summary-coverage-threshold 10 --summary-coverage-threshold 25 --summary-coverage-threshold 50 --summary-coverage-threshold 100

运行后看一下覆盖度

1
2
3
4
5
6
7
8
9
10
11
$ less SRR5478491C.bqsr.nodup.coverage.sample_summary
sample_id,total,mean,granular_third_quartile,granular_median,granular_first_quartile,%_bases_above_10,%_bases_above_25,%_bases_above_50,%_bases_above_100
ccRCC,3065081044,34.26,43,12,2,52.9,36.2,21.7,9.6
Total,3065081044,34.26,N/A,N/A,N/A,N/A,N/A,N/A,N/A

$ less SRR5478492C.bqsr.nodup.coverage.sample_summary
sample_id,total,mean,granular_third_quartile,granular_median,granular_first_quartile,%_bases_above_10,%_bases_above_25,%_bases_above_50,%_bases_above_100
normal,2945165227,32.92,48,18,4,60.3,42.0,23.8,7.5
Total,2945165227,32.92,N/A,N/A,N/A,N/A,N/A,N/A,N/A

# 平均覆盖度 > 30X,然而,中位数比均值小很多,数据并非特别理想

4.2 计算Insert Size

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
gatk CollectInsertSizeMetrics \
I=input.bam \
O=insert_size_metrics.txt \
H=insert_size_histogram.pdf \
M=0.5

# 在本例中,使用如下参数:
$ gatk CollectInsertSizeMetrics -I /mnt/d/cancer/bqsr/SRR5478491C.bqsr.nodup.bam -O /mnt/d/cancer/bqsr/SRR5478491C.insert_size_metrics.txt -H /mnt/d/cancer/bqsr/SRR5478491C.insert_size_histogram.pdf --VALIDATION_STRINGENCY SILENT -R /mnt/d/ncbi/hg19/BWAIndex/hg19.fa

$ gatk CollectInsertSizeMetrics -I /mnt/d/cancer/bqsr/SRR5478492C.bqsr.nodup.bam -O /mnt/d/cancer/bqsr/SRR5478492C.insert_size_metrics.txt -H /mnt/d/cancer/bqsr/SRR5478492C.insert_size_histogram.pdf --VALIDATION_STRINGENCY SILENT -R /mnt/d/ncbi/hg19/BWAIndex/hg19.fa

# 查看一下insert size,Normal的insert size比ccRCC小一点(123 vs 134)

$ less SRR5478491C.insert_size_metrics.txt
## METRICS CLASS picard.analysis.InsertSizeMetrics
MEDIAN_INSERT_SIZE MODE_INSERT_SIZE MEDIAN_ABSOLUTE_DEVIATION MIN_INSERT_SIZE MAX_INSERT_SIZE MEAN_INSERT_SIZE
STANDARD_DEVIATION READ_PAIRS PAIR_ORIENTATION WIDTH_OF_10_PERCENT WIDTH_OF_20_PERCENT WIDTH_OF_30_PERCENT WIDTH_OF_40_PERCENT WIDTH_OF_50_PERCENT WIDTH_OF_60_PERCENT WIDTH_OF_70_PERCENT WIDTH_OF_80_PERCENT WIDTH_OF_90_PERCENT WIDTH_OF_95_PERCENT WIDTH_OF_99_PERCENT SAMPLE LIBRARY READ_GROUP
123 117 15 7 210906619 126.082217 22.010779 15536547 FR 7 13 19 25 31 37 45 53 69 91 131

$ less SRR5478492C.insert_size_metrics.txt
## METRICS CLASS picard.analysis.InsertSizeMetrics
MEDIAN_INSERT_SIZE MODE_INSERT_SIZE MEDIAN_ABSOLUTE_DEVIATION MIN_INSERT_SIZE MAX_INSERT_SIZE MEAN_INSERT_SIZE
STANDARD_DEVIATION READ_PAIRS PAIR_ORIENTATION WIDTH_OF_10_PERCENT WIDTH_OF_20_PERCENT WIDTH_OF_30_PERCENT WIDTH_OF_40_PERCENT WIDTH_OF_50_PERCENT WIDTH_OF_60_PERCENT WIDTH_OF_70_PERCENT WIDTH_OF_80_PERCENT WIDTH_OF_90_PERCENT WIDTH_OF_95_PERCENT WIDTH_OF_99_PERCENT SAMPLE LIBRARY READ_GROUP
134 129 18 9 71334417 135.454785 24.140099 14838912 FR 7 15 21 29 37 45 53 63 77 89 125

4.3 计算Alignment metrics

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
gatk CollectAlignmentSummaryMetrics \
R=reference_sequence.fasta \
I=input.bam \
O=output.txt

$ gatk CollectAlignmentSummaryMetrics -R /mnt/d/ncbi/hg19/BWAIndex/hg19.fa -I /mnt/d/cancer/bqsr/SRR5478491C.bqsr.nodup.bam -O /mnt/d/cancer/bqsr/SRR5478491C.align.txt --VALIDATION_STRINGENCY SILENT

$ gatk CollectAlignmentSummaryMetrics -R /mnt/d/ncbi/hg19/BWAIndex/hg19.fa -I /mnt/d/cancer/bqsr/SRR5478492C.bqsr.nodup.bam -O /mnt/d/cancer/bqsr/SRR5478492C.align.txt --VALIDATION_STRINGENCY SILENT

# 查看一下,比对质量相当不错
$ less -S SRR5478491C.align.txt
## METRICS CLASS picard.analysis.AlignmentSummaryMetrics
CATEGORY TOTAL_READS PF_READS PCT_PF_READS PF_NOISE_READS PF_READS_ALIGNED PCT_PF_READS_ALIGNED P
FIRST_OF_PAIR 15538245 15538245 1 0 15538245 1 1542727791 15532149 154218009
SECOND_OF_PAIR 15537614 15537614 1 0 15537614 1 1533251102 15531520 153271206
PAIR 31075859 31075859 1 0 31075859 1 3075978893 31063669 3074892161 3

$ less -S SRR5478492C.align.txt
## METRICS CLASS picard.analysis.AlignmentSummaryMetrics
CATEGORY TOTAL_READS PF_READS PCT_PF_READS PF_NOISE_READS PF_READS_ALIGNED PCT_PF_READS_ALIGNED P
FIRST_OF_PAIR 14839431 14839431 1 0 14839431 1 1480977340 14834355 148050594
SECOND_OF_PAIR 14838958 14838958 1 0 14838958 1 1473988914 14833866 147352039
PAIR 29678389 29678389 1 0 29678389 1 2954966254 29668221 2954026333 2

关于BAM文件的处理与质控就到此为止,我们现在得到了“Analysis-Ready Reads”的BAM文件,下一步就可以进行MUTATION CALLING。

  • 本文作者:括囊无誉
  • 本文链接: WES/cancer_seq4/
  • 版权声明: 本博客所有文章均为原创作品,转载请注明出处!
------ 本文结束 ------
坚持原创文章分享,您的支持将鼓励我继续创作!