TCGA Study Notes 8: Extracting Clinical Information

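In the previous notes, we ran a differential expression analysis on the KIRC expression data, then performed principal component analysis and drew a clustered heatmap for the differentially expressed genes. Before moving on to the next analysis, we need to extract the clinical data. As a quick recap: in Study Notes 1, the expression data we downloaded from TCGA comprised 611 FILES from 530 CASES. In other words, the data come from 530 patients, but because more than one sample was collected from some patients, the number of samples exceeds the number of cases.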
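To make the gap between the sample count and the case count concrete, here is a minimal sketch in R. It relies on the TCGA barcode convention that the first 12 characters identify the patient. The sample barcodes below are made up for illustration and are not taken from the downloaded data.

```r
# Hypothetical sample barcodes, made up for illustration only.
sample_barcodes <- c(
  "TCGA-CZ-5986-01A",
  "TCGA-CZ-5986-11A",
  "TCGA-B8-5551-01A"
)

# The first 12 characters of a TCGA barcode identify the patient (case).
case_ids <- substr(sample_barcodes, 1, 12)

n_samples <- length(sample_barcodes)
n_cases <- length(unique(case_ids))

# Number of samples contributed by each case.
samples_per_case <- table(case_ids)
print(samples_per_case)
```

Here two of the three samples share one patient, so there are 3 samples but only 2 cases. The same logic explains 611 files versus 530 cases.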

> library(rjson)
> clinical_trials <- fromJSON(file="clinical.cart.2020-06-25.json") # the result is a list
> n= length(clinical_trials)
> n
[1] 530 # 530 CASES, as expected
> id=classification_of_tumor=tumor_stage=gender=year_of_birth=year_of_death=year_of_diagnosis=days_to_death=age=deadORlive=race=alcohol=years_smoked=rep(0,n) # initialize the variables
> age
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[33] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[65] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[97] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[129] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[161] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[193] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[225] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[257] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[289] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[321] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[353] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[385] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[417] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[449] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[481] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[513] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> for(i in 1:n){
+ id[i]=clinical_trials[[i]]$diagnoses[[1]]$submitter_id
+ classification_of_tumor[i]=clinical_trials[[i]]$diagnoses[[1]]$classification_of_tumor
+ tumor_stage[i]=clinical_trials[[i]]$diagnoses[[1]]$tumor_stage
+ gender[i]=clinical_trials[[i]]$demographic$gender
+ year_of_birth[i]=ifelse(is.null(clinical_trials[[i]]$demographic$year_of_birth),"not reported",clinical_trials[[i]]$demographic$year_of_birth)
+ year_of_death[i]=ifelse(is.null(clinical_trials[[i]]$demographic$year_of_death),"not reported",clinical_trials[[i]]$demographic$year_of_death) # ifelse() is used here because this field is not always present; for some CASES it shows up as "Pairlist of length 0". The same applies to the lines below
+ year_of_diagnosis[i]=ifelse(is.null(clinical_trials[[i]]$diagnoses[[1]]$year_of_diagnosis),"not reported",clinical_trials[[i]]$diagnoses[[1]]$year_of_diagnosis)
+ days_to_death[i]=ifelse(is.null(clinical_trials[[i]]$demographic$days_to_death),"not reported",clinical_trials[[i]]$demographic$days_to_death)
+ age[i]=ifelse(is.null(clinical_trials[[i]]$demographic$age_at_index),"not reported",clinical_trials[[i]]$demographic$age_at_index)
+ deadORlive[i]=ifelse(is.null(clinical_trials[[i]]$demographic$vital_status),"not reported",clinical_trials[[i]]$demographic$vital_status)
+ race[i]=ifelse(is.null(clinical_trials[[i]]$demographic$race),"not reported",clinical_trials[[i]]$demographic$race)
+ alcohol[i]=ifelse(is.null(clinical_trials[[i]]$exposures[[1]]$alcohol_history),"not reported",clinical_trials[[i]]$exposures[[1]]$alcohol_history)
+ years_smoked[i]=ifelse(is.null(clinical_trials[[i]]$exposures[[1]]$years_smoked),"not reported",clinical_trials[[i]]$exposures[[1]]$years_smoked)
+ }
> kidney_clinic <- data.frame(
+ id,
+ classification_of_tumor,
+ tumor_stage,
+ gender,
+ year_of_birth,
+ year_of_death,
+ year_of_diagnosis,
+ days_to_death,
+ age,
+ deadORlive,
+ race,
+ alcohol,
+ years_smoked
+ )
> dim(kidney_clinic)
[1] 530 13
> head(kidney_clinic)
                      id classification_of_tumor tumor_stage gender
1 TCGA-CZ-5986_diagnosis            not reported     stage i   male
2 TCGA-CZ-4858_diagnosis            not reported    stage ii   male
3 TCGA-B8-5551_diagnosis            not reported     stage i female
4 TCGA-B0-4817_diagnosis            not reported   stage iii   male
5 TCGA-BP-4325_diagnosis            not reported     stage i female
6 TCGA-B0-4698_diagnosis            not reported    stage iv   male
  year_of_birth year_of_death year_of_diagnosis days_to_death age
1          1945  not reported              2006  not reported  61
2          1966  not reported              2005          2105  39
3          1945  not reported              2010  not reported  65
4          1921          2004              2002          1019  81
5          1937  not reported              2001  not reported  64
6          1928          2003              2003            42  75
  deadORlive                      race      alcohol years_smoked
1      Alive                     white Not Reported not reported
2       Dead                     white Not Reported not reported
3      Alive black or african american Not Reported not reported
4       Dead                     white Not Reported not reported
5      Alive                     white Not Reported not reported
6       Dead                     white Not Reported not reported
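The loop above repeats the same `is.null()` check for every field, and the first few fields (such as `id` and `tumor_stage`) have no check at all. One possible way to tidy this up is a small helper that walks a path through the nested list and falls back to "not reported" when anything is missing. This is only a sketch: `get_field` and `extract` are hypothetical helpers, not part of rjson, and the two records below are made up to mimic the shape of the GDC JSON.

```r
# Hypothetical helper: follow `path` through a nested list and return the
# value as a string, or `default` when any level is missing or empty.
get_field <- function(x, path, default = "not reported") {
  for (key in path) {
    if (is.numeric(key)) {
      # Positional index: guard against an empty or too-short list.
      x <- if (length(x) >= key) x[[key]] else NULL
    } else {
      # Named lookup: a missing name in a list yields NULL.
      x <- if (is.list(x)) x[[key]] else NULL
    }
    if (is.null(x) || length(x) == 0) return(default)
  }
  as.character(x)[1]
}

# Made-up records imitating the structure returned by fromJSON().
cases <- list(
  list(
    diagnoses = list(list(submitter_id = "CASE-0001_diagnosis",
                          tumor_stage = "stage i")),
    demographic = list(gender = "male", vital_status = "Alive")
  ),
  list(
    diagnoses = list(),  # no diagnosis record at all
    demographic = list(gender = "female", vital_status = "Dead",
                       days_to_death = 42)
  )
)

# Apply the helper to every case for a given path.
extract <- function(path) {
  vapply(cases, get_field, character(1), path = path)
}

clinic <- data.frame(
  id            = extract(list("diagnoses", 1, "submitter_id")),
  tumor_stage   = extract(list("diagnoses", 1, "tumor_stage")),
  gender        = extract(list("demographic", "gender")),
  days_to_death = extract(list("demographic", "days_to_death")),
  stringsAsFactors = FALSE
)
print(clinic)
```

With a helper like this, a case that lacks a `diagnoses` entry yields "not reported" instead of stopping the loop with an error.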

At this point the clinical information has been extracted. We will need this data in later analyses, for example when plotting survival curves.
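Before a table like this can feed a survival analysis, the "not reported" placeholders have to become missing values and the vital status has to become an event indicator. The sketch below shows one way to do that with base R. The rows are made up in the same shape as `kidney_clinic`, and the column name `status` is my own choice.

```r
# Made-up rows in the same shape as the extracted table.
clinic <- data.frame(
  id = c("CASE-1", "CASE-2", "CASE-3"),
  deadORlive = c("Alive", "Dead", "Dead"),
  days_to_death = c("not reported", "2105", "42"),
  stringsAsFactors = FALSE
)

# Turn the "not reported" placeholder into NA before converting to numbers.
days <- clinic$days_to_death
days[days == "not reported"] <- NA
clinic$days_to_death <- as.numeric(days)

# Event indicator: 1 = death observed, 0 = censored (still alive).
clinic$status <- ifelse(clinic$deadORlive == "Dead", 1, 0)
print(clinic)
```

Note that living patients have no `days_to_death`, so a survival curve also needs a follow-up time for them. The GDC clinical data normally provides this in a field such as `days_to_last_follow_up`; that field is not extracted by the loop above, so treat it as an assumption to check against your own JSON file.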

  • Author: 括囊无誉
  • Link to this post: TCGA/TCGA8Clinic/
  • Copyright: All posts on this blog are original works. Please credit the source when reposting!