TCGA学习笔记9-基因注释

我们前面分析得到的4913个差异表达基因的名称使用的是ENSEMBL的ID,比如ENSG00000001630,这类ID是不易记忆的,需要转化成我们熟悉的基因名称,即SYMBOL。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
> library(biomaRt)
> geneName_select <- rownames(result_select) # 提取差异表达基因的ENSEMBL ID,4913个
> length(geneName_select)
[1] 4913
> ensemb <- useDataset("hsapiens_gene_ensembl",mart=useMart("ensembl"))
> ensemb
Object of class 'Mart':
Using the ENSEMBL_MART_ENSEMBL BioMart database
Using the hsapiens_gene_ensembl dataset
> annot <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol"),
+ filters = "ensembl_gene_id",values = geneName_select, mart=ensemb) # 注释结果
Cache found
> class(annot)
[1] "data.frame"
> dim(annot) # 注释后有4863行
[1] 4863 2
> head(annot)
ensembl_gene_id hgnc_symbol
1 ENSG00000000938 FGR
2 ENSG00000001617 SEMA3F
3 ENSG00000001630 CYP51A1
4 ENSG00000002586 CD99
5 ENSG00000002746 HECW1
6 ENSG00000002933 TMEM176A
> dim(result_select) # 原本有4913行,即50个基因无法注释
[1] 4913 6
> result_select$ensembl_gene_id <- rownames(result_select) # 把ensembl_gene_id加进来,为合并提供锚点
> dim(result_select)
[1] 4913 7
> head(result_select)
log2 fold change (MLE): group cancer vs normal
Wald test p-value: group cancer vs normal
DataFrame with 6 rows and 7 columns
baseMean log2FoldChange lfcSE
<numeric> <numeric> <numeric>
ENSG00000000938 1056.54189255002 1.50798734334441 0.0999113699084657
ENSG00000001617 6403.79587078303 1.13804574850297 0.0897072677370336
ENSG00000001630 218.029634030269 -1.15446295077517 0.120053363145966
ENSG00000002586 9712.04932748132 1.26513626330331 0.0853396109411807
ENSG00000002746 240.436190572231 -3.72546669707377 0.224130353161834
ENSG00000002933 29346.9573200702 1.27343989833302 0.125910512841406
stat pvalue
<numeric> <numeric>
ENSG00000000938 15.0932505952622 1.79382718013598e-51
ENSG00000001617 12.6862157014861 7.05112262109708e-37
ENSG00000001630 -9.61624831260688 6.82760868085803e-22
ENSG00000002586 14.8247249940627 1.01394301829956e-49
ENSG00000002746 -16.6218749246506 4.83951527252971e-62
ENSG00000002933 10.1138488724688 4.7962181780376e-24
padj ensembl_gene_id
<numeric> <character>
ENSG00000000938 2.14797179317603e-50 ENSG00000000938
ENSG00000001617 4.97005451002601e-36 ENSG00000001617
ENSG00000001630 2.56374726129402e-21 ENSG00000001630
ENSG00000002586 1.13889547575916e-48 ENSG00000002586
ENSG00000002746 8.09368748853239e-61 ENSG00000002746
ENSG00000002933 1.97615089431008e-23 ENSG00000002933
> class(result_select) # 不是数据框,合并前要先转化
[1] "DESeqResults"
attr(,"package")
[1] "DESeq2"
> result_select_annot <- merge(as.data.frame(result_select),annot,by="ensembl_gene_id")
> dim(result_select_annot) # 合并后多出一列,即hgnc_symbol
[1] 4863 8
> head(result_select_annot)
ensembl_gene_id baseMean log2FoldChange lfcSE stat
1 ENSG00000000938 1056.5419 1.507987 0.09991137 15.093251
2 ENSG00000001617 6403.7959 1.138046 0.08970727 12.686216
3 ENSG00000001630 218.0296 -1.154463 0.12005336 -9.616248
4 ENSG00000002586 9712.0493 1.265136 0.08533961 14.824725
5 ENSG00000002746 240.4362 -3.725467 0.22413035 -16.621875
6 ENSG00000002933 29346.9573 1.273440 0.12591051 10.113849
pvalue padj hgnc_symbol
1 1.793827e-51 2.147972e-50 FGR
2 7.051123e-37 4.970055e-36 SEMA3F
3 6.827609e-22 2.563747e-21 CYP51A1
4 1.013943e-49 1.138895e-48 CD99
5 4.839515e-62 8.093687e-61 HECW1
6 4.796218e-24 1.976151e-23 TMEM176A

注意,在注释过程中,有少部分基因因无法注释而被丢弃,如果这些基因恰恰是表达差异较大的,则会对后面的分析产生重要影响。

  • 本文作者:括囊无誉
  • 本文链接: TCGA/TCGA9Annot/
  • 版权声明: 本博客所有文章均为原创作品,转载请注明出处!
------ 本文结束 ------
坚持原创文章分享,您的支持将鼓励我继续创作!