使用MAFTOOLS分析肿瘤变异数据

我们以TCGA-KIRC数据为例,来介绍肿瘤突变数据与临床数据的下载方法、预处理及可视化

一、数据下载与处理

TCGA-GDC官网,在CASES选项下,选择CTGA-KIRC,这里共包含537个CASE,回到FILES选项下,选择Data Category[simple nucleotide variation]+Data Type[Masked Somatic Mutation]+Workflow Type[MuTect2 Variant Aggregation and Masking],至此,数据选择就完成了,点击Add Files to Cart,回到Cart,FILE=1,CASES=339,FILE SIZE=7.12 MB。下载Clinical/TSV文件,下载Download/Cart文件,前一个文件包含临床数据,后一个文件包含突变数据。

将这两个文件移至R的工作目录之下,使用命令对这两个tat.gz文件进行解压,

1
2
tar -zxvf clinical.cart.2020-10-27.tar.gz # 解压后,得到三个文件:clinical.tsv,family_history.tsv,exposure.tsv
tar -zxvf gdc_download_20201027_190832.301544.tar.gz # 解压后,得到maf.gz文件,再使用gzip -d filename.gz进行解压,得到我们需要的.maf文件,重命名为mutect.maf

在下一步操作之前,我们需要对下载好的数据作预处理,以使临床数据与突变数据能够通过Tumor_Sample_Barcode联系起来,分两步进行:

第一、打开clinical.tsv,将第二列“case_submitter_id”修改为“Tumor_Sample_Barcode”;
第二、打开mutect.maf,将“Tumor_Sample_Barcode”这一列中的内容修改为与clinical.tsv中的”Tumor_Sample_Barcode”一致,即只保留前12个字符,在EXCEL中可以新建一列,并使用[LEFT(text,12)]这个函数进行操作,操作完成后,使用CTRL+SHIFT+DOWN向下选择,再使用CTRL+D快速填充,填充完毕,将新列命名为”Tumor_Sample_Barcode”,并将旧的”Tumor_Sample_Barcode”删除,保存退出。

至此,数据下载与预处理就完成了,后面的过程会比较简单。

二、读取MAF文件并进行统计

2.1 MAF文件读取

下面的操作都将在RStudio/R中进行,代码如下:

1
2
3
> setwd("D:/mutation/KIRC/mutect") # 切换至工作目录下
> library(maftools) # 加载maftools
> kirc <- read.maf(maf="mutect.maf", clinicalData="clinical.tsv")

文件读取过程

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
-Reading
|--------------------------------------------------|
|==================================================|
|--------------------------------------------------|
|==================================================|
-Validating
--Removed 1021876 duplicated variants
--Found 1 variants with no Gene Symbols
--Annotating them as 'UnknownGene' for convenience
--Non MAF specific values in Variant_Classification column:

--Non MAF specific values in Variant_Type column:

-Silent variants: 8384
-Summarizing
--Mutiple centers found
BI;BCM--Possible FLAGS among top ten genes:
TTN
MUC16
HMCN1
-Processing clinical data
--Annotation missing for below samples in MAF:

-Finished in 39.7s elapsed (16.7s cpu)

2.2 MAF文件统计

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
> kirc # 查看MAF文件基本信息
An object of class MAF
ID summary Mean Median
1: NCBI_Build GRCh38 NA NA
2: Center BI;BCM NA NA
3: Samples 337 NA NA
4: nGenes 9444 NA NA
5: Frame_Shift_Del 1732 5.155 4
6: Frame_Shift_Ins 1201 3.574 1
7: In_Frame_Del 238 0.708 0
8: In_Frame_Ins 350 1.042 0
9: Missense_Mutation 12997 38.682 36
10: Nonsense_Mutation 1259 3.747 2
11: Nonstop_Mutation 18 0.054 0
12: Splice_Site 490 1.458 1
13: Translation_Start_Site 25 0.074 0
14: total 18310 54.494 47
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
> getSampleSummary(kirc) # 对每一个样本的情况进行统计(336个样本)
Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
1: TCGA-B8-4143 39 721 13
2: TCGA-B0-5098 88 25 5
3: TCGA-A3-A8OV 7 3 1
4: TCGA-CJ-4920 16 4 0
5: TCGA-CZ-5468 4 2 0
---
332: TCGA-B0-5117 0 1 0
333: TCGA-BP-4760 1 0 0
334: TCGA-DV-5576 0 0 0
335: TCGA-B8-5546 0 0 0
336: TCGA-DV-A4VZ 1 0 0
In_Frame_Ins Missense_Mutation Nonsense_Mutation Nonstop_Mutation
1: 305 201 320 1
2: 1 390 20 0
3: 0 99 7 0
4: 0 87 6 0
5: 0 91 4 0
---
332: 0 7 0 0
333: 0 7 0 0
334: 0 7 1 0
335: 0 7 0 0
336: 0 4 0 0
Splice_Site Translation_Start_Site total
1: 11 0 1611
2: 20 1 550
3: 3 0 120
4: 2 2 117
5: 1 0 102
---
332: 0 0 8
333: 0 0 8
334: 0 0 8
335: 0 0 7
336: 0 0 5
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
> getGeneSummary(kirc) # 对每一个基因进行统计(9444个基因)
Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
1: VHL 41 22 2 0
2: PBRM1 64 12 1 1
3: TTN 4 4 0 1
4: SETD2 12 2 0 0
5: BAP1 13 1 0 0
---
9440: ZWINT 0 0 0 0
9441: ZXDA 0 0 0 0
9442: ZXDB 0 0 0 0
9443: ZXDC 0 0 0 0
9444: ZYX 0 0 0 0
Missense_Mutation Nonsense_Mutation Nonstop_Mutation Splice_Site
1: 60 27 1 16
2: 19 38 0 13
3: 58 9 0 1
4: 10 18 0 4
5: 8 8 0 4
---
9440: 1 0 0 0
9441: 1 0 0 0
9442: 0 1 0 0
9443: 0 1 0 0
9444: 1 0 0 0
Translation_Start_Site total MutatedSamples AlteredSamples
1: 0 169 166 166
2: 0 148 141 141
3: 0 77 59 59
4: 0 46 42 42
5: 3 37 35 35
---
9440: 0 1 1 1
9441: 0 1 1 1
9442: 0 1 1 1
9443: 0 1 1 1
9444: 0 1 1 1
1
> getClinicalData(kirc) # 临床信息,即clinical.tsv文件中的内容,相同的case_id会输出一行,内容很长,不展示
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
> getFields(kirc) # 显示MAF文件中的所有字段,第16列就是我们预先修改过的"Tumor_Sample_Barcode"
[1] "Hugo_Symbol" "Entrez_Gene_Id"
[3] "Center" "NCBI_Build"
[5] "Chromosome" "Start_Position"
[7] "End_Position" "Strand"
[9] "Variant_Classification" "Variant_Type"
[11] "Reference_Allele" "Tumor_Seq_Allele1"
[13] "Tumor_Seq_Allele2" "dbSNP_RS"
[15] "dbSNP_Val_Status" "Tumor_Sample_Barcode"
[17] "Matched_Norm_Sample_Barcode" "Match_Norm_Seq_Allele1"
[19] "Match_Norm_Seq_Allele2" "Tumor_Validation_Allele1"
[21] "Tumor_Validation_Allele2" "Match_Norm_Validation_Allele1"
[23] "Match_Norm_Validation_Allele2" "Verification_Status"
[25] "Validation_Status" "Mutation_Status"
[27] "Sequencing_Phase" "Sequence_Source"
[29] "Validation_Method" "Score"
[31] "BAM_File" "Sequencer"
[33] "Tumor_Sample_UUID" "Matched_Norm_Sample_UUID"
[35] "HGVSc" "HGVSp"
[37] "HGVSp_Short" "Transcript_ID"
[39] "Exon_Number" "t_depth"
[41] "t_ref_count" "t_alt_count"
[43] "n_depth" "n_ref_count"
[45] "n_alt_count" "all_effects"
[47] "Allele" "Gene"
[49] "Feature" "Feature_type"
[51] "One_Consequence" "Consequence"
[53] "cDNA_position" "CDS_position"
[55] "Protein_position" "Amino_acids"
[57] "Codons" "Existing_variation"
[59] "ALLELE_NUM" "DISTANCE"
[61] "TRANSCRIPT_STRAND" "SYMBOL"
[63] "SYMBOL_SOURCE" "HGNC_ID"
[65] "BIOTYPE" "CANONICAL"
[67] "CCDS" "ENSP"
[69] "SWISSPROT" "TREMBL"
[71] "UNIPARC" "RefSeq"
[73] "SIFT" "PolyPhen"
[75] "EXON" "INTRON"
[77] "DOMAINS" "GMAF"
[79] "AFR_MAF" "AMR_MAF"
[81] "ASN_MAF" "EAS_MAF"
[83] "EUR_MAF" "SAS_MAF"
[85] "AA_MAF" "EA_MAF"
[87] "CLIN_SIG" "SOMATIC"
[89] "PUBMED" "MOTIF_NAME"
[91] "MOTIF_POS" "HIGH_INF_POS"
[93] "MOTIF_SCORE_CHANGE" "IMPACT"
[95] "PICK" "VARIANT_CLASS"
[97] "TSL" "HGVS_OFFSET"
[99] "PHENO" "MINIMISED"
[101] "ExAC_AF" "ExAC_AF_Adj"
[103] "ExAC_AF_AFR" "ExAC_AF_AMR"
[105] "ExAC_AF_EAS" "ExAC_AF_FIN"
[107] "ExAC_AF_NFE" "ExAC_AF_OTH"
[109] "ExAC_AF_SAS" "GENE_PHENO"
[111] "FILTER" "CONTEXT"
[113] "src_vcf_id" "tumor_bam_uuid"
[115] "normal_bam_uuid" "case_id"
[117] "GDC_FILTER" "COSMIC"
[119] "MC3_Overlap" "GDC_Validation_Status"
1
> write.mafSummary(maf=kirc, basename="kirc") # 输出一个kirc_sampleSummary.txt文件,内容就是getSampleSummary()输出的内容

三、突变的可视化

3.1 MAF文件汇总统计图

1
> plotmafSummary(maf=kirc, rmOutlier=FALSE, addStat="median", dashboard=TRUE, titvRaw = FALSE)

3.2 ONCOPLOT(瀑布图)

1
oncoplot(maf=kirc,top=10,fontSize = 0.6,showTumorSampleBarcodes = F)

若对某些基因感兴趣,则使用genes参数指定基因,如下:

1
oncoplot(maf=kirc,genes=c("VHL","PBRM1","SETD2","BAP1","KDM5C","TP53","ATM","ARID1A","PTEN","TSC1","NF2","CDKN2A","CDKN2B"),fontSize = 0.6,showTumorSampleBarcodes = F)

3.3 基因突变共存与排斥

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
> somaticInteractions(maf=kirc,top=10,pvalue = c(0.01,0.05),fontSize = 0.5)
gene1 gene2 pValue oddsRatio 00 11 01 10 Event
1: VHL PBRM1 0.0006345551 2.1499701 115 85 56 81 Co_Occurence
2: SETD2 PBRM1 0.0069684692 2.5362176 180 26 115 16 Co_Occurence
3: BAP1 MUC16 0.0177253442 3.6761737 286 6 16 29 Co_Occurence
4: TTN HMCN1 0.0231208832 3.2520719 267 7 11 52 Co_Occurence
5: MTOR TTN 0.0811830227 2.3527658 263 7 52 15 Co_Occurence
6: DNAH9 BAP1 0.0869868888 2.8556913 289 4 31 13 Co_Occurence
7: DNAH9 MTOR 0.0902657408 3.3752125 301 3 19 14 Co_Occurence
8: TTN BAP1 0.0963286074 2.0600380 253 10 25 49 Co_Occurence
9: TTN SETD2 0.1287443528 1.8222905 247 11 31 48 Co_Occurence
10: MUC16 MTOR 0.1649934265 2.4505615 296 3 19 19 Co_Occurence
11: DNAH9 TTN 0.1924226420 2.0471964 266 5 54 12 Co_Occurence
12: BAP1 PBRM1 0.2090438361 0.6072762 172 11 130 24 Mutually_Exclusive
13: MUC16 TTN 0.2426284078 1.8496958 262 6 53 16 Co_Occurence
14: KDM5C BAP1 0.2453372474 0.0000000 282 0 35 20 Mutually_Exclusive
15: DNAH9 SETD2 0.2460022133 2.2760862 282 4 38 13 Co_Occurence
16: HMCN1 SETD2 0.2586292953 2.1068377 281 4 38 14 Co_Occurence
17: VHL SETD2 0.3232858924 1.4350657 153 24 18 142 Co_Occurence
18: MTOR HMCN1 0.3314867116 1.8644128 299 2 16 20 Co_Occurence
19: MUC16 HMCN1 0.3314867116 1.8644128 299 2 16 20 Co_Occurence
20: KDM5C TTN 0.3647966906 1.6207908 263 5 54 15 Co_Occurence
21: MUC16 KDM5C 0.3814379116 1.6470297 297 2 18 20 Co_Occurence
22: SETD2 BAP1 0.4145364481 1.5265571 266 6 29 36 Co_Occurence
23: HMCN1 BAP1 0.4153981852 1.7899322 287 3 32 15 Co_Occurence
24: PBRM1 DNAH9 0.4501215062 1.5999432 188 9 8 132 Co_Occurence
25: BAP1 MTOR 0.4880885198 1.3948288 283 3 19 32 Co_Occurence
26: KDM5C SETD2 0.4881372857 0.3550657 276 1 41 19 Mutually_Exclusive
27: DNAH9 VHL 0.6206385050 0.7095260 161 7 159 10 Mutually_Exclusive
28: TTN PBRM1 0.6649326479 0.8666566 160 23 118 36 Mutually_Exclusive
29: MUC16 SETD2 0.7451002458 1.1170630 276 3 39 19 Co_Occurence
30: HMCN1 VHL 0.8097726647 0.8156840 161 8 158 10 Mutually_Exclusive
31: PBRM1 HMCN1 0.8113574779 1.1184198 186 8 10 133 Co_Occurence
32: MTOR PBRM1 0.8239686926 1.1699269 184 10 131 12 Co_Occurence
33: MUC16 PBRM1 0.8239686926 1.1699269 184 10 131 12 Co_Occurence
34: MTOR VHL 0.8264941510 0.8497706 159 10 156 12 Mutually_Exclusive
35: TTN VHL 1.0000000000 0.9949054 141 29 137 30 Mutually_Exclusive
36: BAP1 VHL 1.0000000000 0.9698796 153 17 149 18 Mutually_Exclusive
37: MUC16 VHL 1.0000000000 1.0321697 160 11 155 11 Co_Occurence
38: KDM5C VHL 1.0000000000 1.0319633 161 10 156 10 Co_Occurence
39: KDM5C PBRM1 1.0000000000 0.9225070 184 8 133 12 Mutually_Exclusive
40: MTOR SETD2 1.0000000000 0.6881771 275 2 40 20 Mutually_Exclusive
41: KDM5C MTOR 1.0000000000 0.7424513 296 1 21 19 Mutually_Exclusive
42: DNAH9 MUC16 1.0000000000 0.8901653 299 1 21 16 Mutually_Exclusive
43: HMCN1 KDM5C 1.0000000000 0.9289862 300 1 19 17 Mutually_Exclusive
44: DNAH9 KDM5C 1.0000000000 0.9901605 301 1 19 16 Mutually_Exclusive
45: DNAH9 HMCN1 1.0000000000 1.1135821 303 1 17 16 Co_Occurence
gene1 gene2 pValue oddsRatio 00 11 01 10 Event
pair event_ratio
1: PBRM1, VHL 85/137
2: PBRM1, SETD2 26/131
3: BAP1, MUC16 6/45
4: HMCN1, TTN 7/63
5: MTOR, TTN 7/67
6: BAP1, DNAH9 4/44
7: DNAH9, MTOR 3/33
8: BAP1, TTN 10/74
9: SETD2, TTN 11/79
10: MTOR, MUC16 3/38
11: DNAH9, TTN 5/66
12: BAP1, PBRM1 11/154
13: MUC16, TTN 6/69
14: BAP1, KDM5C 0/55
15: DNAH9, SETD2 4/51
16: HMCN1, SETD2 4/52
17: SETD2, VHL 24/160
18: HMCN1, MTOR 2/36
19: HMCN1, MUC16 2/36
20: KDM5C, TTN 5/69
21: KDM5C, MUC16 2/38
22: BAP1, SETD2 6/65
23: BAP1, HMCN1 3/47
24: DNAH9, PBRM1 9/140
25: BAP1, MTOR 3/51
26: KDM5C, SETD2 1/60
27: DNAH9, VHL 7/169
28: PBRM1, TTN 23/154
29: MUC16, SETD2 3/58
30: HMCN1, VHL 8/168
31: HMCN1, PBRM1 8/143
32: MTOR, PBRM1 10/143
33: MUC16, PBRM1 10/143
34: MTOR, VHL 10/168
35: TTN, VHL 29/167
36: BAP1, VHL 17/167
37: MUC16, VHL 11/166
38: KDM5C, VHL 10/166
39: KDM5C, PBRM1 8/145
40: MTOR, SETD2 2/60
41: KDM5C, MTOR 1/40
42: DNAH9, MUC16 1/37
43: HMCN1, KDM5C 1/36
44: DNAH9, KDM5C 1/35
45: DNAH9, HMCN1 1/33
pair event_ratio

同理,也可以指定感兴趣的基因

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
> somaticInteractions(maf=kirc,genes=c("VHL","PBRM1","SETD2","BAP1","KDM5C","TP53","ATM","ARID1A","PTEN","TSC1","NF2","CDKN2A","CDKN2B"),pvalue = c(0.01,0.05),fontSize = 0.5)
gene1 gene2 pValue oddsRatio 00 11 01 10 Event
1: VHL PBRM1 0.0006345551 2.1499701 115 85 56 81 Co_Occurence
2: SETD2 PBRM1 0.0069684692 2.5362176 180 26 115 16 Co_Occurence
3: PBRM1 ATM 0.0358546719 3.2210326 191 11 5 130 Co_Occurence
4: NF2 VHL 0.0607849993 0.0000000 166 0 166 5 Mutually_Exclusive
5: TP53 VHL 0.0673114551 0.1426375 164 1 165 7 Mutually_Exclusive
6: NF2 PBRM1 0.0774205829 0.0000000 191 0 141 5 Mutually_Exclusive
7: VHL PTEN 0.0826438335 3.1999405 168 9 3 157 Co_Occurence
8: NF2 TP53 0.1138366801 11.3372992 325 1 7 4 Co_Occurence
9: TP53 PBRM1 0.1457726032 0.1935716 189 1 140 7 Mutually_Exclusive
10: ATM TSC1 0.1775445900 6.9716044 318 1 3 15 Co_Occurence
11: PTEN SETD2 0.1778634395 2.4357280 286 3 39 9 Co_Occurence
12: BAP1 PBRM1 0.2090438361 0.6072762 172 11 130 24 Mutually_Exclusive
13: KDM5C BAP1 0.2453372474 0.0000000 282 0 35 20 Mutually_Exclusive
14: TP53 PTEN 0.2541098432 4.0970807 318 1 11 7 Co_Occurence
15: VHL ARID1A 0.2853701129 1.8996157 166 9 5 157 Co_Occurence
16: ATM VHL 0.3144443178 1.7598874 165 10 156 6 Co_Occurence
17: VHL SETD2 0.3232858924 1.4350657 153 24 18 142 Co_Occurence
18: TSC1 SETD2 0.4143213860 2.3659439 292 1 41 3 Co_Occurence
19: SETD2 BAP1 0.4145364481 1.5265571 266 6 29 36 Co_Occurence
20: KDM5C SETD2 0.4881372857 0.3550657 276 1 41 19 Mutually_Exclusive
21: SETD2 NF2 0.4881907608 1.7706936 291 1 4 41 Co_Occurence
22: KDM5C PTEN 0.5261326770 1.4621664 306 1 11 19 Co_Occurence
23: PTEN PBRM1 0.5665516486 1.4059303 190 6 135 6 Co_Occurence
24: ARID1A PBRM1 0.5860353120 1.4089576 189 7 134 7 Co_Occurence
25: ATM KDM5C 0.6110121591 0.0000000 301 0 20 16 Mutually_Exclusive
26: TSC1 VHL 0.6229248936 0.3403639 168 1 165 3 Mutually_Exclusive
27: ARID1A BAP1 0.6463374918 1.4627394 290 2 33 12 Co_Occurence
28: ATM BAP1 0.6765457171 1.2458698 288 2 33 14 Co_Occurence
29: SETD2 ARID1A 0.6890948911 1.1785374 283 2 12 40 Co_Occurence
30: ATM SETD2 0.7037518548 0.4560963 280 1 41 15 Mutually_Exclusive
31: TSC1 BAP1 1.0000000000 0.0000000 298 0 35 4 Mutually_Exclusive
32: TP53 BAP1 1.0000000000 0.0000000 294 0 35 8 Mutually_Exclusive
33: ARID1A KDM5C 1.0000000000 0.0000000 303 0 20 14 Mutually_Exclusive
34: TSC1 KDM5C 1.0000000000 0.0000000 313 0 20 4 Mutually_Exclusive
35: TP53 ATM 1.0000000000 0.0000000 313 0 16 8 Mutually_Exclusive
36: TP53 ARID1A 1.0000000000 0.0000000 315 0 14 8 Mutually_Exclusive
37: NF2 ARID1A 1.0000000000 0.0000000 318 0 14 5 Mutually_Exclusive
38: BAP1 NF2 1.0000000000 0.0000000 297 0 5 35 Mutually_Exclusive
39: KDM5C NF2 1.0000000000 0.0000000 312 0 5 20 Mutually_Exclusive
40: PTEN NF2 1.0000000000 0.0000000 320 0 5 12 Mutually_Exclusive
41: PTEN TSC1 1.0000000000 0.0000000 321 0 4 12 Mutually_Exclusive
42: TP53 TSC1 1.0000000000 0.0000000 325 0 4 8 Mutually_Exclusive
43: BAP1 VHL 1.0000000000 0.9698796 153 17 149 18 Mutually_Exclusive
44: KDM5C VHL 1.0000000000 1.0319633 161 10 156 10 Co_Occurence
45: KDM5C PBRM1 1.0000000000 0.9225070 184 8 133 12 Mutually_Exclusive
46: TSC1 PBRM1 1.0000000000 1.3942563 194 2 139 2 Co_Occurence
47: TP53 SETD2 1.0000000000 1.0034740 288 1 41 7 Co_Occurence
48: PTEN BAP1 1.0000000000 0.7786031 291 1 34 11 Mutually_Exclusive
49: TP53 KDM5C 1.0000000000 0.0000000 309 0 20 8 Mutually_Exclusive
50: ARID1A ATM 1.0000000000 0.0000000 307 0 16 14 Mutually_Exclusive
51: PTEN ATM 1.0000000000 0.0000000 309 0 16 12 Mutually_Exclusive
52: NF2 ATM 1.0000000000 0.0000000 316 0 16 5 Mutually_Exclusive
53: PTEN ARID1A 1.0000000000 0.0000000 311 0 14 12 Mutually_Exclusive
54: TSC1 ARID1A 1.0000000000 0.0000000 319 0 14 4 Mutually_Exclusive
55: TSC1 NF2 1.0000000000 0.0000000 328 0 5 4 Mutually_Exclusive
gene1 gene2 pValue oddsRatio 00 11 01 10 Event
pair event_ratio
1: PBRM1, VHL 85/137
2: PBRM1, SETD2 26/131
3: ATM, PBRM1 11/135
4: NF2, VHL 0/171
5: TP53, VHL 1/172
6: NF2, PBRM1 0/146
7: PTEN, VHL 9/160
8: NF2, TP53 1/11
9: PBRM1, TP53 1/147
10: ATM, TSC1 1/18
11: PTEN, SETD2 3/48
12: BAP1, PBRM1 11/154
13: BAP1, KDM5C 0/55
14: PTEN, TP53 1/18
15: ARID1A, VHL 9/162
16: ATM, VHL 10/162
17: SETD2, VHL 24/160
18: SETD2, TSC1 1/44
19: BAP1, SETD2 6/65
20: KDM5C, SETD2 1/60
21: NF2, SETD2 1/45
22: KDM5C, PTEN 1/30
23: PBRM1, PTEN 6/141
24: ARID1A, PBRM1 7/141
25: ATM, KDM5C 0/36
26: TSC1, VHL 1/168
27: ARID1A, BAP1 2/45
28: ATM, BAP1 2/47
29: ARID1A, SETD2 2/52
30: ATM, SETD2 1/56
31: BAP1, TSC1 0/39
32: BAP1, TP53 0/43
33: ARID1A, KDM5C 0/34
34: KDM5C, TSC1 0/24
35: ATM, TP53 0/24
36: ARID1A, TP53 0/22
37: ARID1A, NF2 0/19
38: BAP1, NF2 0/40
39: KDM5C, NF2 0/25
40: NF2, PTEN 0/17
41: PTEN, TSC1 0/16
42: TP53, TSC1 0/12
43: BAP1, VHL 17/167
44: KDM5C, VHL 10/166
45: KDM5C, PBRM1 8/145
46: PBRM1, TSC1 2/141
47: SETD2, TP53 1/48
48: BAP1, PTEN 1/45
49: KDM5C, TP53 0/28
50: ARID1A, ATM 0/30
51: ATM, PTEN 0/28
52: ATM, NF2 0/21
53: ARID1A, PTEN 0/26
54: ARID1A, TSC1 0/18
55: NF2, TSC1 0/9
pair event_ratio

3.4 文字云

1
> geneCloud(input=kirc,top=10)

3.5 ONCOSTRIP

1
> oncostrip(maf=kirc,top=10) # oncostrip与oncoplot非常相似,但没有右侧的统计结果,按需选择吧

或按一组基因作图

1
oncostrip(maf=kirc,genes=c("VHL","PBRM1","SETD2","BAP1","KDM5C","TP53","ATM","ARID1A","PTEN","TSC1","NF2","CDKN2A","CDKN2B"))

3.6 转换(transition)和颠换(transversion)

1
> titv(maf=kirc, plot=TRUE, useSyn=TRUE) # titv函数的结果是对SNV_Class的另外一种诠释

3.7 LOLLIPOPPLOT

1
2
3
4
5
6
7
8
9
10
11
12
> lollipopPlot(maf=kirc, gene="TP53", AACol="HGVSp_Short", showMutationRate=TRUE)
8 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
HGNC refseq.ID protein.ID aa.length
1: TP53 NM_000546 NP_000537 393
2: TP53 NM_001126112 NP_001119584 393
3: TP53 NM_001126118 NP_001119590 354
4: TP53 NM_001126115 NP_001119587 261
5: TP53 NM_001126113 NP_001119585 346
6: TP53 NM_001126117 NP_001119589 214
7: TP53 NM_001126114 NP_001119586 341
8: TP53 NM_001126116 NP_001119588 209
Using longer transcript NM_000546 for now.

上面提示有报错信息,原因是TP53这个基因有8个已知的转录本,程序自动选择了最长的转录本,即NM_000546,我们可以通过refSeqID或proteinID手动指定转录本,如:

1
lollipopPlot(maf=kirc, gene="TP53",refSeqID="NM_001126115", AACol="HGVSp_Short", showMutationRate=TRUE)

3.8 RAINFALLPLOT(降雨图)

1
> rainfallPlot(maf=kirc, detectChangePoints=TRUE, pointSize=0.6) # 通过绘制染色体尺度的突变间距来展示hyper mutated genomic region

3.9 TMB(肿瘤突变负荷)

1
tcgaCompare(maf=kirc, cohortName="kirc") # cohortName是自定义的名称

3.10 VAF(Variant Allele Frequency,变异等位基因频率)

1
plotVaf(maf=kirc,top=10) # 默认top=10,可以手动指定,也可以使用genes指定基因名称

四、生存分析

生存分析是基于MAF文件中给定基因或样本的突变状态进行预测,可以通过参数genes指定一个或一组基因,可以通过参数samples指定样本,参数中的status需要特别注意,我这里使用的是”vital_status”,这个参数必须是“TRUE”/“FALSE”或者”1”/“0”,因此,需要在clinical.tsv中对相应的列进行修改,具体到不同的分析软件这一列的名称会大不相同。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
> mafSurvival(maf=kirc, genes=c("VHL"), time="days_to_last_follow_up", Status="vital_status", isTCGA=TRUE)
Looking for clinical data in annoatation slot of MAF..
Number of mutated samples for given genes:
VHL
166
Removed 69 samples with NA's
Median survival..
Group medianTime N
1: Mutant 1261.5 130
2: WT 1217.0 137
Warning messages:
1: In survival::Surv(time = Time, event = Status) :
Invalid status value, converted to NA
2: In survival::Surv(time = Time, event = Status) :
Invalid status value, converted to NA
3: In survival::Surv(time = Time, event = Status) :
Invalid status value, converted to NA

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
> mafSurvGroup(maf=kirc,geneSet=c("SETD2","PBRM1"), time="days_to_last_follow_up", Status="vital_status")
Looking for clinical data in annoatation slot of MAF..
Removed 69 samples with NA's
Median survival..
Group medianTime N
1: Mutant 2231 19
2: WT 1176 248
> prog_geneset = survGroup(maf = kirc, top = 20, minSamples=5,geneSetSize = 2, time = "days_to_last_follow_up", Status="vital_status", verbose = TRUE)
------------------
genes: 20
geneset size: 2
190 combinations
Looking for clinical data in annoatation slot of MAF..
Removed 69 samples with NA's
Geneset: VHL,PBRM1 [N= 85 ]
Geneset: VHL,TTN [N= 29 ]
Geneset: VHL,SETD2 [N= 24 ]
Geneset: VHL,BAP1 [N= 17 ]
Geneset: VHL,MUC16 [N= 11 ]
Geneset: VHL,MTOR [N= 10 ]
Geneset: VHL,KDM5C [N= 10 ]
Geneset: VHL,HMCN1 [N= 8 ]
Geneset: VHL,DNAH9 [N= 7 ]
Geneset: VHL,LRP2 [N= 7 ]
Geneset: VHL,ATM [N= 10 ]
Geneset: VHL,ARID1A [N= 9 ]
Geneset: VHL,CSMD3 [N= 8 ]
Geneset: VHL,DST [N= 6 ]
Geneset: VHL,KMT2C [N= 7 ]
Geneset: VHL,ERBB4 [N= 7 ]
Geneset: VHL,SMARCA4 [N= 6 ]
Geneset: VHL,USH2A [N= 10 ]
Geneset: VHL,PCLO [N= 5 ]
Geneset: PBRM1,TTN [N= 23 ]
Geneset: PBRM1,SETD2 [N= 26 ]
Geneset: PBRM1,BAP1 [N= 11 ]
Geneset: PBRM1,MUC16 [N= 10 ]
Geneset: PBRM1,MTOR [N= 10 ]
Geneset: PBRM1,KDM5C [N= 8 ]
Geneset: PBRM1,HMCN1 [N= 8 ]
Geneset: PBRM1,DNAH9 [N= 9 ]
Geneset: PBRM1,LRP2 [N= 11 ]
Geneset: PBRM1,ATM [N= 11 ]
Geneset: PBRM1,ARID1A [N= 7 ]
Geneset: PBRM1,CSMD3 [N= 8 ]
Geneset: PBRM1,KMT2C [N= 7 ]
Geneset: PBRM1,ERBB4 [N= 6 ]
Geneset: PBRM1,SMARCA4 [N= 6 ]
Geneset: PBRM1,USH2A [N= 10 ]
Geneset: PBRM1,PCLO [N= 5 ]
Geneset: TTN,SETD2 [N= 11 ]
Geneset: TTN,BAP1 [N= 10 ]
Geneset: TTN,MUC16 [N= 6 ]
Geneset: TTN,MTOR [N= 7 ]
Geneset: TTN,KDM5C [N= 5 ]
Geneset: TTN,HMCN1 [N= 7 ]
Geneset: TTN,DNAH9 [N= 5 ]
Geneset: TTN,USH2A [N= 5 ]
Geneset: SETD2,BAP1 [N= 6 ]
Geneset: SETD2,LRP2 [N= 5 ]
Geneset: BAP1,MUC16 [N= 6 ]
There were 34 warnings (use warnings() to see them)
> print(prog_geneset)
Gene_combination P_value hr WT Mutant
1: VHL_USH2A 0.000465 9.42e+00 259 8
2: TTN_USH2A 0.001190 1.36e+01 264 3
3: PBRM1_USH2A 0.001380 8.17e+00 259 8
4: VHL_KMT2C 0.026500 7.30e+00 263 4
5: PBRM1_ERBB4 0.029000 7.12e+00 263 4
6: VHL_PBRM1 0.046000 3.14e+00 198 69
7: VHL_ERBB4 0.065300 5.58e+00 261 6
8: PBRM1_MTOR 0.085400 5.11e+00 261 6
9: VHL_MTOR 0.156000 3.97e+00 261 6
10: VHL_ARID1A 0.523000 3.89e-08 260 7
11: PBRM1_MUC16 0.556000 3.94e-08 258 9
12: PBRM1_ATM 0.558000 3.94e-08 258 9
13: TTN_BAP1 0.559000 3.92e-08 262 5
14: VHL_ATM 0.562000 3.93e-08 260 7
15: PBRM1_ARID1A 0.563000 3.93e-08 261 6
16: VHL_MUC16 0.568000 3.95e-08 260 7
17: VHL_BAP1 0.585000 1.08e-07 260 7
18: PBRM1_LRP2 0.596000 1.08e-07 259 8
19: PBRM1_SMARCA4 0.603000 1.08e-07 262 5
20: TTN_SETD2 0.608000 1.08e-07 260 7
21: VHL_DST 0.636000 1.09e-07 261 6
22: PBRM1_BAP1 0.637000 1.09e-07 262 5
23: PBRM1_CSMD3 0.643000 1.09e-07 261 6
24: TTN_MUC16 0.646000 1.09e-07 264 3
25: PBRM1_KMT2C 0.647000 1.09e-07 262 5
26: VHL_DNAH9 0.656000 1.09e-07 261 6
27: VHL_KDM5C 0.660000 1.09e-07 261 6
28: VHL_CSMD3 0.665000 1.09e-07 259 8
29: VHL_LRP2 0.666000 1.09e-07 262 5
30: PBRM1_KDM5C 0.671000 1.09e-07 262 5
31: VHL_SETD2 0.672000 1.55e+00 251 16
32: VHL_SMARCA4 0.672000 1.10e-07 263 4
33: PBRM1_DNAH9 0.683000 1.10e-07 260 7
34: SETD2_LRP2 0.684000 1.10e-07 264 3
35: VHL_PCLO 0.689000 1.10e-07 263 4
36: SETD2_BAP1 0.693000 1.10e-07 264 3
37: TTN_MTOR 0.697000 1.10e-07 263 4
38: VHL_HMCN1 0.715000 1.10e-07 263 4
39: TTN_DNAH9 0.724000 1.10e-07 264 3
40: PBRM1_PCLO 0.727000 1.10e-07 263 4
41: TTN_KDM5C 0.727000 1.10e-07 264 3
42: TTN_HMCN1 0.743000 3.01e-07 263 4
43: PBRM1_TTN 0.751000 1.39e+00 250 17
44: BAP1_MUC16 0.793000 3.02e-07 266 1
45: PBRM1_HMCN1 0.898000 8.26e-07 265 2
46: VHL_TTN 0.913000 1.12e+00 246 21
47: PBRM1_SETD2 0.955000 1.06e+00 248 19
Gene_combination P_value hr WT Mutant

# 接着,可以使用mafSurvGroup进行作图

mafSurvGroup(maf=kirc,geneSet=c("VHL","USH2A"), time="days_to_last_follow_up", Status="vital_status")

五、对比两个MAF文件

这一功能使得我们可以通过对临床数据进行分类而比较其差异,比如同一疾病中不同性别是否有显著差异,比如不同种族是否有显著差异,或者不同年龄阶段是否有显著差异,等等。下面,我们以性别将KIRC分为两类,

1
2
3
4
5
6
7
8
9
kirc <- read.maf(maf="mutect.maf")
require(data.table)

clin<-as.data.frame(fread("clinical.tsv"))
clin.female <- subset(clin, gender=="female")$Tumor_Sample_Barcode
clin.male <- subset(clin, gender=="male")$Tumor_Sample_Barcode

kirc.male <- subsetMaf(maf=kirc, tsb=clin.male, isTCGA=TRUE)
kirc.female <- subsetMaf(maf=kirc, tsb=clin.female, isTCGA=TRUE)

至此,基于性别分类的两个MAF文件就准备好了,下面可以对这两组数据进行作图,

5.1 FORESTPLOT(森林图)

1
2
comp <- mafCompare(m1=kirc.male, m2=kirc.female, m1Name="Male", m2Name="Female", minMut=5)
forestPlot(mafCompareRes=comp, pVal=0.05, color=c("maroon", "royalblue"), geneFontSize=0.8)

5.2 ONCOPLOT(双MAF瀑布图)

oncoplot中基因的选择,可根据mafCompare()的结果,选择有显著差异的基因,进行作图

1
2
genes <- c("ZFHX4", "ERBB4", "TGM5", "TENM1", "BRWD3", "ARID1A", "KDM5C")
coOncoplot(m1=kirc.female, m2=kirc.male, m1Name="Female", m2Name="male", genes=genes)

5.3 LOLLIPOPOLOT2(双MAF棒棒糖图)

1
2
lollipopPlot2(m1=luad.female, m2=luad.male, m1_name="Female", m2_name="Male", gene="IQGAP2", AACol1 = "HGVSp_Short", AACol2 = "HGVSp_Short")
# 同前面的棒棒糖图一样,直接运行会提示有多个转录本,可以通过refSeqID/ proteinID来手动指定转录本进行绘图

六、信号通路富集

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
> OncogenicPathways(maf = laml)
# 这个功能可说得上是简单粗暴,没有可选参数,只需要一个INPUT即可
Pathway alteration fractions
Pathway N n_affected_genes fraction_affected
1: RTK-RAS 85 60 0.7058824
2: NOTCH 71 43 0.6056338
3: WNT 68 38 0.5588235
4: Hippo 38 25 0.6578947
5: PI3K 29 21 0.7241379
6: MYC 13 7 0.5384615
7: TP53 6 5 0.8333333
8: Cell_Cycle 15 5 0.3333333
9: NRF2 3 3 1.0000000
10: TGF-Beta 7 3 0.4285714
# 同时绘制图形

从上面输出的统计结果,可以选择感兴趣的信号通路进行作图,注意,图中的红色表示抑癌基因,蓝色表示致癌基因。

1
> PlotOncogenicPathways(maf=kirc, pathways = "RTK-RAS")

七、临床富集分析

clinicalEnrichment可以对指定的临床特征进行富集分析,比如指定”ajcc_pathologic_stage”,命令如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
> stage = clinicalEnrichment(maf=kirc, clinicalFeature = 'ajcc_pathologic_stage')

Sample size per factor in ajcc_pathologic_stage:

'-- Stage I Stage II Stage III Stage IV
2 192 33 68 41

# 查看符合p_value < 0.01时的富集情况

> stage$groupwise_comparision[p_value < 0.01]
Hugo_Symbol Group1 Group2 n_mutated_group1 n_mutated_group2
1: SETD2 Stage IV Rest 13 of 41 29 of 295
2: CSMD3 Stage III Rest 8 of 68 6 of 268
3: COL1A2 Stage II Rest 4 of 33 3 of 303
4: PKHD1L1 Stage IV Rest 5 of 41 5 of 295
5: ANKS1B Stage IV Rest 4 of 41 3 of 295
6: STAG2 Stage III Rest 6 of 68 4 of 268
7: PRKDC Stage IV Rest 4 of 41 4 of 295
p_value OR_low OR_high fdr
1: 0.0004091851 0 0.4862509 0.4828384
2: 0.0020655103 0 0.4967966 0.8691319
3: 0.0022096573 0 0.3539644 0.8691319
4: 0.0033918860 0 0.4593002 1.0000000
5: 0.0051372257 0 0.4581794 1.0000000
6: 0.0059701059 0 0.5596025 1.0000000
7: 0.0093632230 0 0.5595471 1.0000000

# 通过函数plotEnrichmentResults来作图

> plotEnrichmentResults(enrich_res=stage, pVal = 0.01)

同理,也可对其他临床特征进行富集分析,比如性别

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
> gender = clinicalEnrichment(maf=kirc, clinicalFeature = 'gender')
Sample size per factor in gender:

female male
120 216
> gender$groupwise_comparision[p_value < 0.01]
Hugo_Symbol Group1 Group2 n_mutated_group1 n_mutated_group2 p_value
1: ARID1A female Rest 11 of 120 3 of 216 0.001073289
2: DYNC2H1 female Rest 7 of 120 1 of 216 0.003695386
3: KDM5C male Rest 18 of 216 2 of 120 0.008638251
4: KIAA1549L female Rest 6 of 120 1 of 216 0.009419279
5: TENM1 female Rest 6 of 120 1 of 216 0.009419279
OR_low OR_high fdr
1: 0 0.4594307 0.5065922
2: 0 0.4787417 0.8721112
3: 0 0.6794989 0.8891799
4: 0 0.5892996 0.8891799
5: 0 0.5892996 0.8891799
> plotEnrichmentResults(enrich_res=gender, pVal = 0.01)

八、肿瘤异质性分析

一般而言,肿瘤是异质的,包含不同的亚型的细胞,可以通过inferHeterogeneity来对VAF进行评估以推断肿瘤异质性

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
> inferHeterogeneity(maf=kirc)
t_vaf field is missing, but found t_ref_count & t_alt_count columns. Estimating vaf..
Processing TCGA-B8-4143..
Processing TCGA-B0-5098..
Processing TCGA-A3-A8OV..
Processing TCGA-CJ-4920..
Processing TCGA-CZ-5468..
$clusterData
Hugo_Symbol Chromosome Start_Position End_Position
1: CDK11B 1 1638569 1638569
2: CDK11B 1 1638570 1638570
3: PLOD1 1 11949869 11949869
4: PLOD1 1 11949870 11949871
5: DNAJC16 1 15565965 15565966
---
2495: UBOX5 20 3122571 3122571
2496: PHF20 20 35871808 35871808
2497: LIPI 21 14206820 14206820
2498: MARS 12 57488179 57488179
2499: MXRA5 X 3317692 3317692
Tumor_Sample_Barcode t_vaf cluster MATH
1: TCGA-B8-4143 0.07500000 1 41.61684
2: TCGA-B8-4143 0.07500000 1 41.61684
3: TCGA-B8-4143 0.07500000 1 41.61684
4: TCGA-B8-4143 0.07692308 1 41.61684
5: TCGA-B8-4143 0.06217617 1 41.61684
---
2495: TCGA-CZ-5468 0.18994413 1 54.05899
2496: TCGA-CZ-5468 0.08011869 1 54.05899
2497: TCGA-CZ-5468 0.11620795 1 54.05899
2498: TCGA-CZ-5468 0.78947368 outlier 54.05899
2499: TCGA-CZ-5468 0.71794872 outlier 54.05899
MedianAbsoluteDeviation
1: 1.538092
2: 1.538092
3: 1.538092
4: 1.538092
5: 1.538092
---
2495: 10.526316
2496: 10.526316
2497: 10.526316
2498: 10.526316
2499: 10.526316

$clusterMeans
Tumor_Sample_Barcode cluster meanVaf
1: TCGA-B8-4143 1 0.05095599
2: TCGA-B8-4143 2 0.11837474
3: TCGA-B8-4143 3 0.30865454
4: TCGA-B8-4143 outlier 0.60606061
5: TCGA-B0-5098 3 0.53075245
6: TCGA-B0-5098 1 0.10022037
7: TCGA-B0-5098 2 0.32285970
8: TCGA-B0-5098 outlier 0.64120134
9: TCGA-A3-A8OV 2 0.37256626
10: TCGA-A3-A8OV 1 0.06902849
11: TCGA-A3-A8OV 3 0.71787855
12: TCGA-A3-A8OV outlier 0.58955224
13: TCGA-CJ-4920 2 0.27004853
14: TCGA-CJ-4920 1 0.06939873
15: TCGA-CJ-4920 3 0.43948759
16: TCGA-CJ-4920 outlier 0.13793103
17: TCGA-CZ-5468 2 0.35860649
18: TCGA-CZ-5468 1 0.14729743
19: TCGA-CZ-5468 outlier 0.75371120

上面,对5个barcode进行了整体分析,下面,可以指定某一个Barcode来具体看一下,比如指定”TCGA-B8-4143”

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
> barcode1 <- inferHeterogeneity(maf=kirc, tsb="TCGA-B8-4143")
t_vaf field is missing, but found t_ref_count & t_alt_count columns. Estimating vaf..
Processing TCGA-B8-4143..

# 可以通过$clusterMeans来查看聚类情况与VAF均值,或通过$clusterData来查看基因的详细信息
> barcode1$clusterMeans
Tumor_Sample_Barcode cluster meanVaf
1: TCGA-B8-4143 1 0.05095599
2: TCGA-B8-4143 2 0.11837474
3: TCGA-B8-4143 3 0.30865454
4: TCGA-B8-4143 outlier 0.60606061

# 使用plotClusters来作图
> plotClusters(clusters=barcode1)
# 这个肿瘤异质性聚类结果是比较差的,一方面主峰的频率只有5%,另一方面MATH scores太高,分数越高结果越差

  • 本文作者:括囊无誉
  • 本文链接: R/maltools/
  • 版权声明: 本博客所有文章均为原创作品,转载请注明出处!
------ 本文结束 ------
坚持原创文章分享,您的支持将鼓励我继续创作!