TCGA学习笔记14-基因集富集分析GSEA

GSEA(Gene Set Enrichment Analysis),是基因集富集分析,也是软件的名称,顾名思议,GSEA用于分析给定分组中差异表达的基因是否在给定的基因集中有富集。

下载GSEA软件:根据操作系统下载相应的版本并安装

软件主界面如下图:

这个软件操作十分容易,简单明了,唯一的难度在于输入数据的准备,简单来说,有两个数据是必须提前准备好的,即表达数据与表型注释

表达数据格式:可以有多种输入格式,这里以TXT为例,我们先在R中将我们需要的数据提取并保存为CSV格式,处理完成后再保存为TXT即可。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
# 回顾一下前面的数据,差异表达筛选之后有4913个基因(result_select),注释后剩下4863个(result_select_annot)
> x1 <- result_select_annot$hgnc_symbol
> length(x1)
[1] 4863
> tail(x1)
[1] "" "SLFNL1-AS1" "HELLPAR" "SNHG4" "LINC01176"
[6] "CICP14"
> x1[x1==""]=NA #把""转换为NA,方便逻辑判断
> tail(x1)
[1] NA "SLFNL1-AS1" "HELLPAR" "SNHG4" "LINC01176"
[6] "CICP14"
> table(is.na(x1))
FALSE TRUE
4259 604 #4863个注释后的基因,居然只有4259个有名字,下面就取出这些基因的表达数据来做富集分析

> write.csv(result_select_annot,"result_select_annot.csv") # 这是差异分析并注释过后的4863个基因
> # Remove BLANKs in EXCEL, rename as "unique_annot.csv"
> unique_annot <- read.csv("unique_annot.csv")
> dim(unique_annot)
[1] 4259 9 # 把没有名字的在EXCEL中去掉,剩下4259个
> head(unique_annot)
ï.. ensembl_gene_id baseMean log2FoldChange lfcSE stat
1 1 ENSG00000000938 1056.5419 1.507987 0.09991137 15.093251
2 2 ENSG00000001617 6403.7959 1.138046 0.08970727 12.686216
3 3 ENSG00000001630 218.0296 -1.154463 0.12005336 -9.616248
4 4 ENSG00000002586 9712.0493 1.265136 0.08533961 14.824725
5 5 ENSG00000002746 240.4362 -3.725467 0.22413035 -16.621875
6 6 ENSG00000002933 29346.9573 1.273440 0.12591051 10.113849
pvalue padj hgnc_symbol
1 1.79e-51 2.15e-50 FGR
2 7.05e-37 4.97e-36 SEMA3F
3 6.83e-22 2.56e-21 CYP51A1
4 1.01e-49 1.14e-48 CD99
5 4.84e-62 8.09e-61 HECW1
6 4.80e-24 1.98e-23 TMEM176A
> data_annot <- data_select[match(unique_annot$ensembl_gene_id,rownames(data_select)),] # 提取表达数据
> dim(data_annot)
[1] 4259 611
> head(data_annot)[,1:3]
TCGA-CZ-5465-01A-01R-1503 TCGA-BP-4355-01A-01R-1289
ENSG00000000938 899 1211
ENSG00000001617 16862 10888
ENSG00000001630 310 77
ENSG00000002586 10729 8611
ENSG00000002746 7 37
ENSG00000002933 60355 29448
TCGA-CZ-5451-01A-01R-1503
ENSG00000000938 1100
ENSG00000001617 9299
ENSG00000001630 305
ENSG00000002586 7618
ENSG00000002746 7
ENSG00000002933 47857
> data_annot$symbol <- unique_annot$hgnc_symbol # 加入基因名称
> dim(data_annot)
[1] 4259 612
> head(data_annot)[,1:3]
TCGA-CZ-5465-01A-01R-1503 TCGA-BP-4355-01A-01R-1289
ENSG00000000938 899 1211
ENSG00000001617 16862 10888
ENSG00000001630 310 77
ENSG00000002586 10729 8611
ENSG00000002746 7 37
ENSG00000002933 60355 29448
TCGA-CZ-5451-01A-01R-1503
ENSG00000000938 1100
ENSG00000001617 9299
ENSG00000001630 305
ENSG00000002586 7618
ENSG00000002746 7
ENSG00000002933 47857
> data_annot2 <- data_annot[,c(612,1:611)] # 把基因名称放到第一列
> dim(data_annot2)
[1] 4259 612
> head(data_annot2)[,1:3]
symbol TCGA-CZ-5465-01A-01R-1503
ENSG00000000938 FGR 899
ENSG00000001617 SEMA3F 16862
ENSG00000001630 CYP51A1 310
ENSG00000002586 CD99 10729
ENSG00000002746 HECW1 7
ENSG00000002933 TMEM176A 60355
TCGA-BP-4355-01A-01R-1289
ENSG00000000938 1211
ENSG00000001617 10888
ENSG00000001630 77
ENSG00000002586 8611
ENSG00000002746 37
ENSG00000002933 29448
> write.csv(data_annot2,"data_annot2.csv") # 这个文件就是GSEA的表达文件,但是在输入GSEA前需要处理,第一列改为基因名称,新加第二列DESCRIPTION,内容为na即可;
> data_annot2$symbol[duplicated(data_annot2$symbol)] # 检查一下基因名称是否有重复
factor(0)
4259 Levels: A4GNT AACSP1 AATBC AATK ABAT ABCA1 ABCA12 ... ZYG11A

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# 下面准备表型文件
> phetype <- data.frame(colnames(data_annot2)[2:612],group)
> dim(phetype)
[1] 611 2
> head(phetype)
colnames.data_annot2..2.612. group
1 TCGA-CZ-5465-01A-01R-1503 cancer
2 TCGA-BP-4355-01A-01R-1289 cancer
3 TCGA-CZ-5451-01A-01R-1503 cancer
4 TCGA-B0-5081-01A-01R-1334 cancer
5 TCGA-CZ-5454-11A-01R-1503 normal
6 TCGA-B0-5697-01A-11R-1541 cancer
> phetypeOrder <- phetype[order(phetype$group),]
> head(phetypeOrder)
colnames.data_annot2..2.612. group
5 TCGA-CZ-5454-11A-01R-1503 normal
10 TCGA-B0-5691-11A-01R-1541 normal
21 TCGA-B0-5701-11A-01R-1541 normal
41 TCGA-B2-5636-11A-01R-1541 normal
44 TCGA-B0-5703-11A-01R-1541 normal
45 TCGA-CZ-4864-11A-01R-1503 normal
> tail(phetypeOrder)
colnames.data_annot2..2.612. group
606 TCGA-BP-5200-01A-01R-1426 cancer
607 TCGA-BP-4999-01A-01R-1334 cancer
608 TCGA-GK-A6C7-01A-11R-A33J cancer
609 TCGA-B0-5113-01A-01R-1420 cancer
610 TCGA-CJ-4893-01A-01R-1305 cancer
611 TCGA-CZ-4856-01A-02R-1426 cancer
> write.csv(phetypeOrder,"phetypeOrder.csv") # 表型文件,排序是为方便在GSEA软件中输入

如上图,输入表达文件和表型文件,并且将下载好的基因集文件导入,就可以运行了,如果跑所有的基因集,需要耗费几分钟时间,运行完成后,生成一个网页,里面是对结果的描述。

程序共运行了12388个基因集,使用了4259个基因。那么,富集结果怎么读呢?一般认为,|NES|>1,NOM p-val<0.05,FDR q-val<0.25的通路是显著富集的。

根据这个标准,富集结果并不理想,不论肿瘤或是正常组织中,都不存在符合以上标准的基因集。

当然,以上富集分析有一个硬伤,即只选取了差异表达的基因,并且由于注释的缘故,一部分基因被滤掉了,这可能导致富集结果出现偏差。我们也可以尝试使用所有的基因来做个分析。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
> dim(data_select)
[1] 18129 611 # 原始数据经筛选后留下18129个基因
> library(biomaRt)
> geneName_select2 <- rownames(data_select)
> length(geneName_select2)
[1] 18129
ensemb <- useDataset("hsapiens_gene_ensembl",mart=useMart("ensembl"))
ensemb
annot2 <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol"),
filters = "ensembl_gene_id",values = geneName_select2, mart=ensemb)
class(annot2)
> dim(annot2)
[1] 17972 2 # 注释后剩下17192个基因
> head(annot2)
ensembl_gene_id hgnc_symbol
1 ENSG00000000419 DPM1
2 ENSG00000000457 SCYL3
3 ENSG00000000460 C1orf112
4 ENSG00000000938 FGR
5 ENSG00000000971 CFH
6 ENSG00000001036 FUCA2
> annot2$hgnc_symbol[duplicated(annot2$hgnc_symbol)]
[1] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[22] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[43] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[64] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[85] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[106] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[127] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[148] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[169] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[190] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[211] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[232] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[253] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[274] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[295] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[316] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[337] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[358] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[379] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[400] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[421] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[442] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[463] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[484] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[505] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[526] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[547] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[568] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[589] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[610] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[631] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[652] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[673] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[694] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[715] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[736] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[757] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[778] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[799] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[820] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[841] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[862] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[883] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[904] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[925] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[946] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[967] "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
[988] "" "" "" "" "" "" "" "" "" "" "" "" ""
[ reached getOption("max.print") -- omitted 589 entries ] # 查重,居然有1589个是"",这些需要去掉
annot2_unique <- annot2[match(unique(annot2$hgnc_symbol),annot2$hgnc_symbol),]
> dim(annot2_unique)
[1] 16383 2 # 去掉重复后剩下16383个基因,其实还有一行""剩下,需要去掉
> head(annot2_unique)
ensembl_gene_id hgnc_symbol
1 ENSG00000000419 DPM1
2 ENSG00000000457 SCYL3
3 ENSG00000000460 C1orf112
4 ENSG00000000938 FGR
5 ENSG00000000971 CFH
6 ENSG00000001036 FUCA2
data_select_annot <- data_select[match(annot2_unique$ensembl_gene_id,rownames(data_select)),]
> dim(data_select_annot)
[1] 16383 611 # 其实应当是16382行,有一行是"",没有去掉
head(data_select_annot)[,1:3]
Descriptions <- rep("NA",16383) # 加一列
data_select_annot2 <- cbind(symbol=annot2_unique$hgnc_symbol,Descriptions,data_select_annot)
dim(data_select_annot2)
head(data_select_annot2)[,1:3]
write.csv(data_select_annot2,"data_select_annot2.csv")
# 在EXCEL中去掉""这一列,并保存为TXT文件
> data_select_annot2$symbol[duplicated(data_select_annot2$symbol)] # 这回没有重复
factor(0)
16383 Levels: A1BG-AS1 A1CF A2M A2M-AS1 A4GALT A4GNT AAAS ... ZZEF1
# 上面这个TXT文件即为表达文件
phetype <- data.frame(colnames(data_select_annot2)[3:613],group)
dim(phetype)
head(phetype)
phetypeOrder <- phetype[order(phetype$group),]
head(phetypeOrder)
tail(phetypeOrder)
write.csv(phetypeOrder,"phetypeOrder.csv") # 表型文件

运行之后,结果是一致的,即没有显著富集的基因集。

  • 本文作者:括囊无誉
  • 本文链接: TCGA/TCGA14GSEA/
  • 版权声明: 本博客所有文章均为原创作品,转载请注明出处!
------ 本文结束 ------
坚持原创文章分享,您的支持将鼓励我继续创作!