TCGA学习笔记-使用R语言中的RTCGAToolbox分析TCGA数据

Firehose是由Broad institute开发的,提供经过预处理后的TCGA数据,在R语言中,RTCGAToolbox可用于查询、下载和分析Firehose的数据。

1
2
3
4
5
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
BiocManager::install("RTCGAToolbox")

library(RTCGAToolbox)

在通过Firehose获取数据之前,需要检查有效的数据集,“getFirehoseDatasets”可以列出所有的数据集,而“getFirehoseRunningDates”与“getFirehoseAnalyzeDates”给出数据库更新日期。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
> getFirehoseDatasets()
[1] "ACC" "BLCA" "BRCA" "CESC" "CHOL" "COADREAD" "COAD"
[8] "DLBC" "ESCA" "FPPP" "GBMLGG" "GBM" "HNSC" "KICH"
[15] "KIPAN" "KIRC" "KIRP" "LAML" "LGG" "LIHC" "LUAD"
[22] "LUSC" "MESO" "OV" "PAAD" "PCPG" "PRAD" "READ"
[29] "SARC" "SKCM" "STAD" "STES" "TGCT" "THCA" "THYM"
[36] "UCEC" "UCS" "UVM"
> getFirehoseRunningDates()
[1] "20160128" "20151101" "20150821" "20150601" "20150402" "20150204" "20141206"
[8] "20141017" "20140902" "20140715" "20140614" "20140518" "20140416" "20140316"
[15] "20140215" "20140115" "20131210" "20131114" "20131010" "20130923" "20130809"
[22] "20130715" "20130623" "20130606" "20130523" "20130508" "20130421" "20130406"
[29] "20130326" "20130309" "20130222" "20130203" "20130116" "20121221" "20121206"
[36] "20121114" "20121102" "20121024" "20121020" "20121018" "20121004" "20120913"
[43] "20120825" "20120804" "20120725" "20120707" "20120623" "20120606" "20120525"
[50] "20120515" "20120425" "20120412" "20120321" "20120306" "20120217" "20120124"
[57] "20120110" "20111230" "20111206" "20111128" "20111115" "20111026"
> getFirehoseAnalyzeDates()
[1] "20160128" "20150821" "20150402" "20141017" "20140715" "20140416" "20140115"
[8] "20130923" "20130523" "20130421" "20130326" "20130222" "20130116" "20121221"
[15] "20121024" "20120913" "20120825" "20120725" "20120623" "20120525" "20120425"
[22] "20120321" "20120217" "20120124" "20111230" "20111128" "20111026" "20110921"
[29] "20110728" "20110525" "20110421" "20110327" "20110217" "20110114" "20101223"

在确定数据集名称与更新日期后,就可以使用“getFirehoseData()”来获取数据了,下面来下载数据,以READ(直肠腺癌)为例:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
> brcaData <- getFirehoseData(dataset="READ", runDate="20160128",
forceDownload=TRUE, clinical=TRUE, Mutation=TRUE)

gdac.broadinstitute.org_READ.Clinical_Pick_Tier1.Level_4.2016012800.0.0.tar.gz
trying URL 'http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/READ/20160128/gdac.broadinstitute.org_READ.Clinical_Pick_Tier1.Level_4.2016012800.0.0.tar.gz'
Content type 'application/x-gzip' length 31541 bytes (30 KB)
downloaded 30 KB

gdac.broadinstitute.org_READ.Clinical_Pick_Tier1.Level_4.2016012800.0.0
gdac.broadinstitute.org_READ.Mutation_Packager_Calls.Level_3.2016012800.0.0.tar.gz
trying URL 'http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/READ/20160128/gdac.broadinstitute.org_READ.Mutation_Packager_Calls.Level_3.2016012800.0.0.tar.gz'
Content type 'application/x-gzip' length 896526 bytes (875 KB)
downloaded 875 KB

> brcaData
READ FirehoseData objectStandard run date: 20160128
Analysis running date: NA
Available data types:
clinical: A data frame of phenotype data, dim: 171 x 19
Mutation: A data.frame, dim: 22075 x 39
To export data, use the 'getData' function.

对于getFirehoseData(),以下几个参数是必选的:

1)dataset,可以通过getFirehoseDatasets() 获取;
2)tunData,可以通过getFirehoseRunningDates()获取;
3)gistic2Date,如要获取copy number data,则必须设定这个参数,可以通过getFirehoseAnalyzeDates()获取;

可选参数如下:

1)forceDownload,是否强制下载或使用工作目录中以前下载的旧数据;
2)fileSizeLimit,默认为500MB;
3)getUUIDs,获取UUIDs条形码;

除Clinical之外,还有如下参数可选:

RNAseqGene
clinical
miRNASeqGene
RNAseq2GeneNorm
CNASNP
CNVSNP
CNASeq
CNACGH
Methylation
Mutation
mRNAArray
miRNAArray
RPPAArray

下载完成后,可以使用biocExtract()来提取数据,如提取clinical data,

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
> clinicData=biocExtract(brcaData,'clinical')
working on: clinical

> head(clinicData,3)
Composite Element REF years_to_birth vital_status days_to_death
tcga.af.3913 value 60 1 316
tcga.ag.3580 value 67 0 <NA>
tcga.ag.3611 value 72 0 <NA>
days_to_last_followup tumor_tissue_site pathologic_stage pathology_T_stage
tcga.af.3913 <NA> rectum stage iv t3
tcga.ag.3580 244 rectum stage i t1
tcga.ag.3611 424 rectum stage iia t3
pathology_N_stage pathology_M_stage gender
tcga.af.3913 n1 m1 male
tcga.ag.3580 n0 m0 male
tcga.ag.3611 n0 m0 male
date_of_initial_pathologic_diagnosis days_to_last_known_alive
tcga.af.3913 2009 <NA>
tcga.ag.3580 2007 <NA>
tcga.ag.3611 2009 31
radiation_therapy histological_type tumor_stage residual_tumor
tcga.af.3913 no rectal adenocarcinoma <NA> r0
tcga.ag.3580 no rectal adenocarcinoma <NA> r0
tcga.ag.3611 no rectal adenocarcinoma <NA> r0
number_of_lymph_nodes ethnicity
tcga.af.3913 1 not hispanic or latino
tcga.ag.3580 0 <NA>
tcga.ag.3611 0 <NA>

> colnames(clinicData) #查看列名
[1] "Composite Element REF" "years_to_birth"
[3] "vital_status" "days_to_death"
[5] "days_to_last_followup" "tumor_tissue_site"
[7] "pathologic_stage" "pathology_T_stage"
[9] "pathology_N_stage" "pathology_M_stage"
[11] "gender" "date_of_initial_pathologic_diagnosis"
[13] "days_to_last_known_alive" "radiation_therapy"
[15] "histological_type" "tumor_stage"
[17] "residual_tumor" "number_of_lymph_nodes"
[19] "ethnicity"

> length(clinicData) #查看长度
[1] 19

> class(clinicData) #数据类型是data frame
[1] "data.frame"

下面提取mutation data

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
> mutationData  = biocExtract(brcaData,'Mutation')
working on: Mutation

> head(mutationData,3)
class: RaggedExperiment
dim: 3 69
assays(34): Hugo_Symbol Entrez_Gene_Id ... COSMIC_Gene Drug_Target
rownames: NULL
colnames(69): TCGA-AF-2689-01A-01W-0831-10 TCGA-AF-2691-01A-01W-0831-10 ...
TCGA-AG-A032-01 TCGA-AG-A036-01
colData names(0):

> colnames(mutationData)
[1] "TCGA-AF-2689-01A-01W-0831-10" "TCGA-AF-2691-01A-01W-0831-10"
[3] "TCGA-AF-2692-01A-01W-0831-10" "TCGA-AF-3400-01A-01W-0831-10"
[5] "TCGA-AF-3913-01" "TCGA-AG-3574-01A-01W-0831-10"
[7] "TCGA-AG-3575-01A-01W-0831-10" "TCGA-AG-3578-01A-01W-0831-10"
[9] "TCGA-AG-3580-01A-01W-0831-10" "TCGA-AG-3581-01A-01W-0831-10"
[11] "TCGA-AG-3582-01A-01W-0831-10" "TCGA-AG-3583-01A-01W-0831-10"
[13] "TCGA-AG-3584-01A-01W-0831-10" "TCGA-AG-3586-01A-02W-0831-10"
[15] "TCGA-AG-3587-01A-01W-0831-10" "TCGA-AG-3593-01A-01W-0831-10"
[17] "TCGA-AG-3594-01A-02W-0831-10" "TCGA-AG-3598-01A-01W-0833-10"
[19] "TCGA-AG-3599-01A-02W-0833-10" "TCGA-AG-3600-01A-01W-0833-10"
[21] "TCGA-AG-3601-01A-01W-0833-10" "TCGA-AG-3602-01A-02W-0833-10"
[23] "TCGA-AG-3605-01A-01W-0833-10" "TCGA-AG-3608-01A-01W-0833-10"
[25] "TCGA-AG-3609-01A-02W-0833-10" "TCGA-AG-3611-01A-01W-0833-10"
[27] "TCGA-AG-3612-01A-01W-0833-10" "TCGA-AG-3726-01"
[29] "TCGA-AG-3727-01" "TCGA-AG-3878-01"
[31] "TCGA-AG-3881-01" "TCGA-AG-3882-01"
[33] "TCGA-AG-3883-01" "TCGA-AG-3887-01"
[35] "TCGA-AG-3890-01" "TCGA-AG-3892-01"
[37] "TCGA-AG-3893-01" "TCGA-AG-3894-01"
[39] "TCGA-AG-3896-01" "TCGA-AG-3898-01"
[41] "TCGA-AG-3901-01" "TCGA-AG-3902-01"
[43] "TCGA-AG-3909-01" "TCGA-AG-3999-01"
[45] "TCGA-AG-4001-01" "TCGA-AG-4005-01"
[47] "TCGA-AG-4007-01" "TCGA-AG-4008-01"
[49] "TCGA-AG-4015-01" "TCGA-AG-A002-01"
[51] "TCGA-AG-A008-01A-01W-A005-10" "TCGA-AG-A00C-01A-01W-A005-10"
[53] "TCGA-AG-A00H-01" "TCGA-AG-A00Y-01A-02W-A005-10"
[55] "TCGA-AG-A011-01" "TCGA-AG-A014-01"
[57] "TCGA-AG-A015-01A-01W-A005-10" "TCGA-AG-A016-01A-01W-A005-10"
[59] "TCGA-AG-A01L-01" "TCGA-AG-A01W-01"
[61] "TCGA-AG-A01Y-01" "TCGA-AG-A020-01"
[63] "TCGA-AG-A025-01" "TCGA-AG-A026-01"
[65] "TCGA-AG-A02G-01" "TCGA-AG-A02N-01"
[67] "TCGA-AG-A02X-01" "TCGA-AG-A032-01"
[69] "TCGA-AG-A036-01"

> length(mutationData)
[1] 22075

> class(mutationData)
[1] "RaggedExperiment"
attr(,"package")
[1] "RaggedExperiment"

下面,可以先看一下突变情况

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
> mt=getMutationRate(brcaData)

> head(mt)
Genes MutationRatio
OR5B12 OR5B12 0.04347826
KRT17 KRT17 0.02898551
SLC10A4 SLC10A4 0.01449275
C7orf58 C7orf58 0.04347826
KLF14 KLF14 0.02898551
ZNF133 ZNF133 0.01449275

> class(mt)
[1] "data.frame"

> tail(mt[order(mt$MutationRatio),])
Genes MutationRatio
MUC16 MUC16 0.1884058
SYNE1 SYNE1 0.2028986
TTN TTN 0.3043478
KRAS KRAS 0.5507246
TP53 TP53 0.6666667
APC APC 0.8405797
#APC,KRAS和TP53在READ中的突变率是真的高

绘制生存曲线

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
> survData <- data.frame(Samples=rownames(clinicData),
+ Time=as.numeric(clinicData[,4]),
+ Censor=as.numeric(clinicData[,3])
+ )

> class(survData)
[1] "data.frame"

> nrow(survData)
[1] 171

> survData
Samples Time Censor
1 tcga.af.3913 316 1
2 tcga.ag.3580 NA 0
3 tcga.ag.3611 NA 0
4 tcga.ag.3612 NA 0
5 tcga.ag.3890 NA 0
6 tcga.ag.4008 NA 0
7 tcga.ag.a00h NA 0
8 tcga.ag.a014 NA 0
9 tcga.ag.a015 NA 0
10 tcga.ag.a01n NA 0
11 tcga.ag.a023 1581 1
12 tcga.bm.6198 NA 0
13 tcga.g5.6233 556 1
14 tcga.g5.6235 NA 0
15 tcga.g5.6641 NA 0
16 tcga.af.2687 NA 0
17 tcga.af.2689 1201 1
18 tcga.af.2690 524 1
19 tcga.af.2691 NA 0
20 tcga.af.2692 NA 0
21 tcga.af.2693 NA 0
22 tcga.af.3400 NA 0
23 tcga.af.3911 NA 0
24 tcga.af.3914 NA 0
25 tcga.af.4110 NA 0
26 tcga.af.5654 512 1
27 tcga.af.6136 NA 0
28 tcga.af.6655 NA 0
29 tcga.af.6672 NA 0
30 tcga.af.a56k NA 0
31 tcga.af.a56l NA 0
32 tcga.af.a56n NA 0
33 tcga.ag.3574 1096 1
34 tcga.ag.3575 NA 0
35 tcga.ag.3578 NA 0
36 tcga.ag.3581 NA 0
37 tcga.ag.3582 1096 1
38 tcga.ag.3583 610 1
39 tcga.ag.3584 730 1
40 tcga.ag.3586 NA 0
41 tcga.ag.3587 NA 0
42 tcga.ag.3591 NA 0
43 tcga.ag.3592 NA 0
44 tcga.ag.3593 NA 0
45 tcga.ag.3594 61 1
46 tcga.ag.3598 NA 0
47 tcga.ag.3599 NA 0
48 tcga.ag.3600 NA 0
49 tcga.ag.3601 NA 0
50 tcga.ag.3602 NA 0
51 tcga.ag.3605 NA 0
52 tcga.ag.3608 NA 0
53 tcga.ag.3609 NA 0
54 tcga.ag.3725 NA 0
55 tcga.ag.3726 NA 0
56 tcga.ag.3727 NA 0
57 tcga.ag.3728 NA 0
58 tcga.ag.3731 NA 0
59 tcga.ag.3732 NA 0
60 tcga.ag.3742 NA 0
61 tcga.ag.3878 NA 0
62 tcga.ag.3881 NA 0
63 tcga.ag.3882 NA 0
64 tcga.ag.3883 NA 0
65 tcga.ag.3885 NA 0
66 tcga.ag.3887 NA 0
67 tcga.ag.3891 NA 0
68 tcga.ag.3892 NA 0
69 tcga.ag.3893 NA 0
70 tcga.ag.3894 NA 0
71 tcga.ag.3896 NA 0
72 tcga.ag.3898 NA 0
73 tcga.ag.3901 NA 0
74 tcga.ag.3902 NA 0
75 tcga.ag.3906 NA 0
76 tcga.ag.3909 NA 0
77 tcga.ag.3999 NA 0
78 tcga.ag.4001 NA 0
79 tcga.ag.4005 NA 0
80 tcga.ag.4007 NA 0
81 tcga.ag.4009 NA 0
82 tcga.ag.4015 NA 0
83 tcga.ag.4021 121 1
84 tcga.ag.4022 NA 0
85 tcga.ag.a002 NA 0
86 tcga.ag.a008 NA 0
87 tcga.ag.a00c NA 0
88 tcga.ag.a00y NA 0
89 tcga.ag.a011 NA 0
90 tcga.ag.a016 NA 0
91 tcga.ag.a01j NA 0
92 tcga.ag.a01l NA 0
93 tcga.ag.a01w NA 0
94 tcga.ag.a01y NA 0
95 tcga.ag.a020 NA 0
96 tcga.ag.a025 NA 0
97 tcga.ag.a026 59 1
98 tcga.ag.a02g 1185 1
99 tcga.ag.a02n NA 0
100 tcga.ag.a02x NA 0
101 tcga.ag.a032 NA 0
102 tcga.ag.a036 NA 0
103 tcga.ah.6544 NA 0
104 tcga.ah.6547 76 1
105 tcga.ah.6549 NA 0
106 tcga.ah.6643 1314 1
107 tcga.ah.6644 NA 0
108 tcga.ah.6897 NA 0
109 tcga.ah.6903 NA 0
110 tcga.ci.6619 NA 0
111 tcga.ci.6620 NA 0
112 tcga.ci.6621 NA 0
113 tcga.ci.6622 NA 0
114 tcga.ci.6623 NA 0
115 tcga.ci.6624 NA 0
116 tcga.cl.4957 NA 1
117 tcga.cl.5917 NA 0
118 tcga.cl.5918 NA 0
119 tcga.dc.4745 NA 0
120 tcga.dc.4749 NA 0
121 tcga.dc.5337 NA 0
122 tcga.dc.5869 NA 0
123 tcga.dc.6154 NA 0
124 tcga.dc.6155 NA 0
125 tcga.dc.6156 NA 0
126 tcga.dc.6157 NA 0
127 tcga.dc.6158 334 1
128 tcga.dc.6160 NA 0
129 tcga.dc.6681 NA 0
130 tcga.dc.6682 NA 0
131 tcga.dc.6683 NA 0
132 tcga.dt.5265 NA 0
133 tcga.dy.a0xa NA 0
134 tcga.dy.a1dc 1258 1
135 tcga.dy.a1dd 1741 1
136 tcga.dy.a1de NA 0
137 tcga.dy.a1df 734 1
138 tcga.dy.a1dg 1566 1
139 tcga.dy.a1h8 992 1
140 tcga.ef.5830 NA 0
141 tcga.ef.5831 NA 0
142 tcga.ei.6506 NA 0
143 tcga.ei.6507 NA 0
144 tcga.ei.6508 NA 0
145 tcga.ei.6509 NA 0
146 tcga.ei.6510 NA 0
147 tcga.ei.6511 NA 0
148 tcga.ei.6512 NA 0
149 tcga.ei.6513 NA 0
150 tcga.ei.6514 NA 0
151 tcga.ei.6881 NA 0
152 tcga.ei.6882 NA 0
153 tcga.ei.6883 NA 0
154 tcga.ei.6884 NA 0
155 tcga.ei.6885 NA 0
156 tcga.ei.6917 NA 0
157 tcga.ei.7002 NA 0
158 tcga.ei.7004 NA 0
159 tcga.f5.6464 303 1
160 tcga.f5.6465 NA 0
161 tcga.f5.6571 NA 0
162 tcga.f5.6702 869 1
163 tcga.f5.6810 NA 0
164 tcga.f5.6811 NA 0
165 tcga.f5.6812 NA 0
166 tcga.f5.6813 598 1
167 tcga.f5.6814 NA 0
168 tcga.f5.6861 NA 0
169 tcga.f5.6863 361 1
170 tcga.f5.6864 NA 0
171 tcga.g5.6572 1432 1

> library(survival)
Warning message:
package ‘survival’ was built under R version 3.6.3

> table(survData$Censor)

0 1
143 28

> attach(survData)

> my.surv <- Surv(Time,Censor)

> my.surv
[1] 316 NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ 1581 NA+ 556 NA+
[15] NA+ NA+ 1201 524 NA+ NA+ NA+ NA+ NA+ NA+ NA+ 512 NA+ NA+
[29] NA+ NA+ NA+ NA+ 1096 NA+ NA+ NA+ 1096 610 730 NA+ NA+ NA+
[43] NA+ NA+ 61 NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+
[57] NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+
[71] NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ 121 NA+
[85] NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ 59 1185
[99] NA+ NA+ NA+ NA+ NA+ 76 NA+ 1314 NA+ NA+ NA+ NA+ NA+ NA+
[113] NA+ NA+ NA+ NA NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+
[127] 334 NA+ NA+ NA+ NA+ NA+ NA+ 1258 1741 NA+ 734 1566 992 NA+
[141] NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+ NA+
[155] NA+ NA+ NA+ NA+ 303 NA+ NA+ 869 NA+ NA+ NA+ 598 NA+ NA+
[169] 361 NA+ 1432

> kmfit1 <- survfit(my.surv~1)

> plot(kmfit1)

  • 本文作者:括囊无誉
  • 本文链接: TCGA/RTCGAToolbox/
  • 版权声明: 本博客所有文章均为原创作品,转载请注明出处!
------ 本文结束 ------
坚持原创文章分享,您的支持将鼓励我继续创作!