肿瘤基因组分析教程:三、比对至基因组

本教程主要目的是下载测序数据,并将其比对至基因组上,参考教程:Read Alignment

主要流程见下图

一、下载数据

本次分析使用的数据来自SRA数据库,患有ccRCC的病人,分析其中的两个样本,SRR5478491 (matched primary ccRCC)与 SRR5478492 (matched normal),双端测序,WXS,平台是ILLUMINA,Library Name是P1T与P1N,参考基因组为GRCh37。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# 下载数据
$ prefetch SRR5478491
$ prefetch SRR5478491

# 下载完成后提取FASTQ文件
$ fastq-dump --split-3 --gzip SRR5478491.sra
# Rejected 14481 READS because of filtering out non-biological READS
# Read 18433992 spots for SRR5478491.sra
# Written 18433992 spots for SRR5478491.sra

$ fastq-dump --split-3 --gzip SRR5478492.sra
# Rejected 6955 READS because of filtering out non-biological READS
# Read 17390617 spots for SRR5478492.sra
# Written 17390617 spots for SRR5478492.sra

二、FASTQ质控

获得FASTQ文件之后,按老规矩做个质控

1
2
$ fastqc -f fastq SRR5478491_1.fastq.gz
$ fastqc -f fastq SRR5478492_1.fastq.gz

记录一下文件大小,后面对比一下

1
2
3
4
5
6
7
8
9
10
$ ls -lh
total 3.8G
-rwxrwxrwx 1 eric eric 995M Nov 21 18:51 SRR5478491_1.fastq.gz
-rwxrwxrwx 1 eric eric 216K Nov 21 19:06 SRR5478491_1_fastqc.html
-rwxrwxrwx 1 eric eric 225K Nov 21 19:06 SRR5478491_1_fastqc.zip
-rwxrwxrwx 1 eric eric 992M Nov 21 18:51 SRR5478491_2.fastq.gz
-rwxrwxrwx 1 eric eric 937M Nov 21 18:50 SRR5478492_1.fastq.gz
-rwxrwxrwx 1 eric eric 212K Nov 21 19:06 SRR5478492_1_fastqc.html
-rwxrwxrwx 1 eric eric 220K Nov 21 19:06 SRR5478492_1_fastqc.zip
-rwxrwxrwx 1 eric eric 930M Nov 21 18:50 SRR5478492_2.fastq.gz

来看一下FASTQ文件的质量,

根据统计结果,测序数据质量还是不错的,长度在36-101之间,质量也很不错,而且没有接头,可见已经做过质控,这里作为演示,我们使用fastp做个质控。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
$ fastp -z 4 -i ./SRR5478491_1.fastq.gz -I ./SRR5478491_2.fastq.gz -o ./qc/SRR5478491C_1.fastq.gz -O ./qc/SRR5478491C_2.fastq.gz

Read1 before filtering:
total reads: 18419511
total bases: 1839620558
Q20 bases: 1827378397(99.3345%)
Q30 bases: 1763281066(95.8503%)

Read2 before filtering:
total reads: 18419511
total bases: 1829856220
Q20 bases: 1814586330(99.1655%)
Q30 bases: 1746573463(95.4487%)

Read1 after filtering:
total reads: 18418466
total bases: 1831827347
Q20 bases: 1819758154(99.3411%)
Q30 bases: 1756297954(95.8768%)

Read2 aftering filtering:
total reads: 18418466
total bases: 1820984863
Q20 bases: 1805979814(99.176%)
Q30 bases: 1738919113(95.4933%)

Filtering result:
reads passed filter: 36836932
reads failed due to low quality: 2070
reads failed due to too many N: 20
reads failed due to too short: 0
reads with adapter trimmed: 3515078
bases trimmed due to adapters: 16455158

Duplication rate: 0.00384464%

Insert size peak (evaluated by paired-end reads): 114

JSON report: fastp.json
HTML report: fastp.html

fastp -z 4 -i ./SRR5478491_1.fastq.gz -I ./SRR5478491_2.fastq.gz -o ./qc/SRR5478491C_1.fastq.gz -O ./qc/SRR5478491C_2.fastq.gz
fastp v0.20.1, time used: 1030 seconds

# 只过滤除掉一点点,这个质控是为演示,样品二也是如此

$ fastp -z 4 -i ./SRR5478492_1.fastq.gz -I ./SRR5478492_2.fastq.gz -o ./qc/SRR5478492C_1.fastq.gz -O ./qc/SRR5478492C_2.fastq.gz

Read1 before filtering:
total reads: 17383662
total bases: 1739308879
Q20 bases: 1729287310(99.4238%)
Q30 bases: 1675903679(96.3546%)

Read2 before filtering:
total reads: 17383662
total bases: 1731815560
Q20 bases: 1719451643(99.2861%)
Q30 bases: 1663768392(96.0708%)

Read1 after filtering:
total reads: 17383562
total bases: 1734964983
Q20 bases: 1725008250(99.4261%)
Q30 bases: 1671980399(96.3697%)

Read2 aftering filtering:
total reads: 17383562
total bases: 1726945775
Q20 bases: 1714689404(99.2903%)
Q30 bases: 1659508087(96.095%)

Filtering result:
reads passed filter: 34767124
reads failed due to low quality: 186
reads failed due to too many N: 14
reads failed due to too short: 0
reads with adapter trimmed: 2069808
bases trimmed due to adapters: 9193707

Duplication rate: 0.00294159%

Insert size peak (evaluated by paired-end reads): 129

JSON report: fastp.json
HTML report: fastp.html

fastp -z 4 -i ./SRR5478492_1.fastq.gz -I ./SRR5478492_2.fastq.gz -o ./qc/SRR5478492C_1.fastq.gz -O ./qc/SRR5478492C_2.fastq.gz
fastp v0.20.1, time used: 981 seconds

质控后文件变大一点

1
2
3
4
-rwxrwxrwx 1 eric eric 1.1G Nov 21 20:25 SRR5478491C_1.fastq.gz
-rwxrwxrwx 1 eric eric 1.1G Nov 21 20:25 SRR5478491C_2.fastq.gz
-rwxrwxrwx 1 eric eric 983M Nov 21 20:25 SRR5478492C_1.fastq.gz
-rwxrwxrwx 1 eric eric 981M Nov 21 20:25 SRR5478492C_2.fastq.gz

二、使用BWA比对至基因组

BWA适合将低差异度的序列比对到大型基因组上,有三种算法: BWA-backtrack, BWA-SW 和 BWA-MEM。第一种算法用于比对100bp以内的ILLUMINA序列,后两种算法可以比对更长的序列(70bp-1Mbp),后两者相比,BWA-MEM更新、更快、更准确。而且,即使对于70bp-100bp的序列,BWA-MEM相比BWA-backtrack也有更好的表现。

BWA使用索引基因组来进行比对,以占用更小的内存,对于特定的基因组版本,如需产生索引,可以使用如下命令:

1
2
3
4
5
6
7
8
$ bwa index hg19.fa
# 运行完成后会生成6个文件
-rwxrwxrwx 1 eric eric 8.4K Nov 21 22:12 hg19.fa.amb
-rwxrwxrwx 1 eric eric 4.0K Nov 21 22:12 hg19.fa.ann
-rwxrwxrwx 1 eric eric 3.0G Nov 21 22:09 hg19.fa.bwt
-rwxrwxrwx 1 eric eric 3.5K Sep 27 20:34 hg19.fa.fai
-rwxrwxrwx 1 eric eric 748M Nov 21 22:12 hg19.fa.pac
-rwxrwxrwx 1 eric eric 1.5G Nov 21 22:41 hg19.fa.sa

生成索引文件之后,我们开始使用BWA进行比对

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
$ mkdir bam # 创建存放BAM文件的目录

# 下面正式使用BWA进行比对,关于bwa mem的用法,之前的教程中多次有介绍,这是一个组合用法,方便快捷且不输出中间文件,节省空间
$ bwa mem -M -t 4 -R '@RG\tSM:ccRCC\tID:SRR5478491\tLB:P1T\tPL:ILLUMINA' /mnt/d/ncbi/hg19/BWAIndex/hg19.fa /mnt/d/cancer/qc/SRR5478491C_1.fastq.gz /mnt/d/cancer/qc/SRR5478491C_2.fastq.gz | samtools view -q4 -bS -o - - | samtools sort -@2 -o /mnt/d/cancer/bam/SRR5478491C.bam -
$ bwa mem -M -t 4 -R '@RG\tSM:normal\tID:SRR5478492\tLB:P1N\tPL:ILLUMINA' /mnt/d/ncbi/hg19/BWAIndex/hg19.fa /mnt/d/cancer/qc/SRR5478492C_1.fastq.gz /mnt/d/cancer/qc/SRR5478492C_2.fastq.gz | samtools view -q4 -bS -o - - | samtools sort -@2 -o /mnt/d/cancer/bam/SRR5478492C.bam -

# 比对部分参数如下,以下如不特殊说明,皆针对样本SRR5478491:
mem: fast mode of high quality input such the Illumina # 使用BWA-MEM算法,最快最准确
-M: flags extra hits as secondary. This is needed for compatibility with other tools downstream. # 为了兼容下游的程序如picard
-t: Number of threads. # 线程数,根据电脑性能设置
-R: Complete read group header line. # Read Group信息后面需要用到,如果不加-R参数,后面还需要重新生成Read Group,增加麻烦
# 比对完成后,实际上生成的是SAM文件,SAM指Sequence Alignment/Map,顾名思义,是存储大量比对数据的标准格式。SAM文件以TAB分隔,包括header section与alignment section两个部分,header section是可选的,以@开头,比如,
@SQ SN:GL000192.1 LN:547496
@RG SM:Blood ID:ERR059356 LB:lb PL:ILLUMINA
@PG ID:bwa PN:bwa VN:0.7.15-r1140 CL:bwa mem -M -t 4 -R '@RG\tSM:Blood\tID:ERR059356\tLB:lb\tPL:ILLUMINA' /mnt/d/ncbi/hg19/BWAIndex/hg19.fa /mnt/d/cancer/qc/SRR5478491C_1.fastq.gz /mnt/d/cancer/qc/SRR5478491C_2.fastq.gz

# 各个字段的含义如下:
@HD: Header line; VN: Format version; SO: the sort order of alignments.
@SQ: Reference sequence information; SN: reference sequence name; LN: reference sequence length.
@PG: Program; ID: Program record identifier; VN: Program version; CL: the command line that produces the alignment.

# alignment section内容如下:
HWI-ST478_0133:3:1101:1374:2056#0 147 11
HWI-ST478_0133:3:1101:1352:2070#0 163 14

比对生成SAM文件,但是SAM文件非常大,太耗空间,因此,将其转化为等价的二进制BAM文件,使用命令samtools view实现

1
2
3
4
# -q,线程数
# -S,the input is in SAM format
# -b,the output is in BAM format
# -o,输出文件

此步最耗时间,运行记录如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
$ bwa mem -M -t 4 -R '@RG\tSM:ccRCC\tID:SRR5478491\tLB:P1T\tPL:ILLUMINA' /mnt/d/ncbi/hg19/BWAIndex/hg19.fa /mnt/d/cancer/qc/SRR5478491C_1.fastq.gz /mnt/d/cancer/qc/SRR5478491C_2.fastq.gz | samtools view -q4 -bS -o - - | samtools sort -@2 -o /mnt/d/cancer/bam/SRR5478491C.bam -
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 404750 sequences (40000081 bp)...
[M::process] read 403546 sequences (40000135 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 193935, 1, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (108, 122, 139)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (46, 201)
[M::mem_pestat] mean and std.dev: (124.92, 21.87)
[M::mem_pestat] low and high boundaries for proper pairs: (15, 232)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 404750 reads in 194.728 CPU sec, 99.446 real sec
.....
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -M -t 4 -R @RG\tSM:ccRCC\tID:SRR5478491\tLB:P1T\tPL:ILLUMINA /mnt/d/ncbi/hg19/BWAIndex/hg19.fa /mnt/d/cancer/qc/SRR5478491C_1.fastq.gz /mnt/d/cancer/qc/SRR5478491C_2.fastq.gz
[main] Real time: 12756.180 sec; CPU: 15585.349 sec
[bam_sort_core] merging from 12 files and 2 in-memory blocks...

$ bwa mem -M -t 4 -R '@RG\tSM:normal\tID:SRR5478492\tLB:P1N\tPL:ILLUMINA' /mnt/d/ncbi/hg19/BWAIndex/hg19.fa /mnt/d/cancer/qc/SRR5478492C_1.fastq.gz /mnt/d/cancer/qc/SRR5478492C_2.fastq.gz | samtools view -q4 -bS -o - - | samtools sort -@2 -o /mnt/d/cancer/bam/SRR5478492C.bam -
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 403032 sequences (40000144 bp)...
[M::process] read 402398 sequences (40000087 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 192948, 0, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (116, 133, 151)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (46, 221)
[M::mem_pestat] mean and std.dev: (134.36, 24.18)
[M::mem_pestat] low and high boundaries for proper pairs: (11, 256)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 403032 reads in 242.385 CPU sec, 126.808 real sec
......
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -M -t 4 -R @RG\tSM:normal\tID:SRR5478492\tLB:P1N\tPL:ILLUMINA /mnt/d/ncbi/hg19/BWAIndex/hg19.fa /mnt/d/cancer/qc/SRR5478492C_1.fastq.gz /mnt/d/cancer/qc/SRR5478492C_2.fastq.gz
[main] Real time: 12491.684 sec; CPU: 15071.044 sec
[bam_sort_core] merging from 12 files and 2 in-memory blocks...

生成的BAM文件需要索引,而建立索引则需要对BAM文件按染色体进行排序,可以使用samtools view -h来查看BAM文件头部,确定是否已经排序

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
$ samtools view -h SRR5478491C.bam | less

@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr11_gl000202_random LN:40103
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr17_ctg5_hap1 LN:1680828
@SQ SN:chr17_gl000203_random LN:37498
@SQ SN:chr17_gl000204_random LN:81310
@SQ SN:chr17_gl000205_random LN:174588
@SQ SN:chr17_gl000206_random LN:41001
@SQ SN:chr18 LN:78077248
@SQ SN:chr18_gl000207_random LN:4262
@SQ SN:chr19 LN:59128983
@SQ SN:chr19_gl000208_random LN:92689
@SQ SN:chr19_gl000209_random LN:159169
@SQ SN:chr1_gl000191_random LN:106433
@SQ SN:chr1_gl000192_random LN:547496
@SQ SN:chr2 LN:243199373
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr21_gl000210_random LN:27682
@SQ SN:chr22 LN:51304566
@SQ SN:chr3 LN:198022430

# 可以看出,染色体已经排序,可以生成索引文件

$ samtools index SRR5478491C.bam
$ samtools index SRR5478492C.bam

看一下文件大小

1
2
3
4
5
total 4.1G
-rwxrwxrwx 1 eric eric 2.2G Nov 22 02:57 SRR5478491C.bam
-rwxrwxrwx 1 eric eric 4.0M Nov 22 08:00 SRR5478491C.bam.bai
-rwxrwxrwx 1 eric eric 2.0G Nov 22 07:08 SRR5478492C.bam
-rwxrwxrwx 1 eric eric 3.9M Nov 22 08:03 SRR5478492C.bam.bai

使用samtools flagstat来对Flag values做个统计,每一项统计数据都由两部分组成,即QC-passed与QC-failed,每一项都有详细说明,不重复列举

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
$ samtools flagstat SRR5478491C.bam
36533039 + 0 in total (QC-passed reads + QC-failed reads)
127381 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
36533039 + 0 mapped (100.00% : N/A)
36405658 + 0 paired in sequencing
18202897 + 0 read1
18202761 + 0 read2
36400546 + 0 properly paired (99.99% : N/A)
36405653 + 0 with itself and mate mapped
5 + 0 singletons (0.00% : N/A)
4611 + 0 with mate mapped to a different chr
4590 + 0 with mate mapped to a different chr (mapQ>=5)

$ samtools flagstat SRR5478492C.bam
34403886 + 0 in total (QC-passed reads + QC-failed reads)
6054 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
34403886 + 0 mapped (100.00% : N/A)
34397832 + 0 paired in sequencing
17198921 + 0 read1
17198911 + 0 read2
34397521 + 0 properly paired (100.00% : N/A)
34397829 + 0 with itself and mate mapped
3 + 0 singletons (0.00% : N/A)
224 + 0 with mate mapped to a different chr
222 + 0 with mate mapped to a different chr (mapQ>=5)
  • 本文作者:括囊无誉
  • 本文链接: WES/cancer_seq3/
  • 版权声明: 本博客所有文章均为原创作品,转载请注明出处!
------ 本文结束 ------
坚持原创文章分享,您的支持将鼓励我继续创作!