TCGA Study Notes 6: Principal Component Analysis (PCA)

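In the previous note we ran a differential expression analysis on the RNA expression data. Here we run a principal component analysis (PCA) to see how well the samples can be separated and grouped.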

> pcaData <- data[rownames(result_select),] # start from the differentially expressed genes: expression levels of the 4913 selected genes across 611 samples
> class(pcaData)
[1] "data.frame"
> dim(pcaData)
[1] 4913 611
> head(pcaData)[,1:3]
TCGA-CZ-5465-01A-01R-1503 TCGA-BP-4355-01A-01R-1289
ENSG00000000938 899 1211
ENSG00000001617 16862 10888
ENSG00000001630 310 77
ENSG00000002586 10729 8611
ENSG00000002746 7 37
ENSG00000002933 60355 29448
TCGA-CZ-5451-01A-01R-1503
ENSG00000000938 1100
ENSG00000001617 9299
ENSG00000001630 305
ENSG00000002586 7618
ENSG00000002746 7
ENSG00000002933 47857
> pcaDataT <- as.data.frame(t(pcaData)) # transpose so that each row is a sample and each column is a gene
> class(pcaDataT)
[1] "data.frame"
> dim(pcaDataT)
[1] 611 4913
> head(pcaDataT)[,1:3]
ENSG00000000938 ENSG00000001617 ENSG00000001630
TCGA-CZ-5465-01A-01R-1503 899 16862 310
TCGA-BP-4355-01A-01R-1289 1211 10888 77
TCGA-CZ-5451-01A-01R-1503 1100 9299 305
TCGA-B0-5081-01A-01R-1334 2065 9886 103
TCGA-CZ-5454-11A-01R-1503 783 7150 815
TCGA-B0-5697-01A-11R-1541 1857 5288 146
> pcaDataTGroup <- data.frame(pcaDataT,Group=group) # add the group labels so the plot can be colored by group later
> dim(pcaDataTGroup)
[1] 611 4914
> head(pcaDataTGroup)[,4912:4914]
ENSG00000281404 ENSG00000281490 Group
TCGA-CZ-5465-01A-01R-1503 73 446 cancer
TCGA-BP-4355-01A-01R-1289 209 1208 cancer
TCGA-CZ-5451-01A-01R-1503 17 126 cancer
TCGA-B0-5081-01A-01R-1334 180 626 cancer
TCGA-CZ-5454-11A-01R-1503 36 240 normal
TCGA-B0-5697-01A-11R-1541 42 232 cancer
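
The `group` vector used above was built in an earlier note and is not shown here. As a rough sketch of one way such labels could be derived, the sample type code in characters 14-15 of a TCGA barcode distinguishes tumor (01-09) from normal (10-19) samples. The names `barcodes`, `sample_code` and `group_toy` are illustrative assumptions, not the author's code:

```r
# Three barcodes copied from the output above
barcodes <- c("TCGA-CZ-5465-01A-01R-1503",
              "TCGA-BP-4355-01A-01R-1289",
              "TCGA-CZ-5454-11A-01R-1503")

# Characters 14-15 hold the sample type code (e.g. "01" or "11")
sample_code <- as.integer(substr(barcodes, 14, 15))

# Codes below 10 are tumor samples; 10-19 are normal samples
group_toy <- factor(ifelse(sample_code < 10, "cancer", "normal"))

data.frame(barcode = barcodes, code = sample_code, Group = group_toy)
```

The labels this produces agree with the Group column printed above for the same three samples.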

> pca <- prcomp(pcaDataT,scale.=TRUE) # use R's built-in prcomp(); scale.=TRUE standardizes each gene to unit variance before the PCA
> summary(pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 30.5406 21.06796 17.82409 15.56497 13.67455 11.96656 11.39468
PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15
Standard deviation 10.52719 9.12744 8.76353 7.25503 6.82239 6.62584 6.35338 6.16851
PC16 PC17 PC18 PC19 PC20 PC21 PC22 PC23
Standard deviation 6.10537 5.99082 5.87263 5.73438 5.61851 5.53551 5.40397 5.20705
PC24 PC25 PC26 PC27 PC28 PC29 PC30 PC31
Standard deviation 5.07502 4.94755 4.77045 4.73712 4.66466 4.60046 4.50403 4.48497
PC32 PC33 PC34 PC35 PC36 PC37 PC38 PC39
Standard deviation 4.41300 4.40078 4.24311 4.21938 4.11894 4.10427 4.03498 4.00425
PC40 PC41 PC42 PC43 PC44 PC45 PC46 PC47
Standard deviation 3.95901 3.92036 3.85311 3.81686 3.76335 3.72471 3.67439 3.65233
PC48 PC49 PC50 PC51 PC52 PC53 PC54 PC55
Standard deviation 3.62492 3.58215 3.53086 3.51850 3.49220 3.46807 3.40228 3.38611
PC56 PC57 PC58 PC59 PC60 PC61 PC62 PC63
Standard deviation 3.34341 3.33820 3.32907 3.2906 3.26280 3.23725 3.22031 3.19124
PC64 PC65 PC66 PC67 PC68 PC69 PC70 PC71
Standard deviation 3.14392 3.10767 3.09295 3.07358 3.02365 3.01153 2.99903 2.98530
PC72 PC73 PC74 PC75 PC76 PC77 PC78 PC79
Standard deviation 2.96770 2.95324 2.94226 2.92450 2.91141 2.8906 2.87615 2.84936
PC80 PC81 PC82 PC83 PC84 PC85 PC86 PC87
Standard deviation 2.82099 2.78651 2.78489 2.77050 2.76802 2.74951 2.73087 2.72602
PC88 PC89 PC90 PC91 PC92 PC93 PC94 PC95
Standard deviation 2.70287 2.69798 2.66500 2.65009 2.62908 2.6260 2.61238 2.59281
PC96 PC97 PC98 PC99 PC100 PC101 PC102 PC103
Standard deviation 2.59100 2.57340 2.55959 2.54633 2.53601 2.5261 2.50279 2.48348
PC104 PC105 PC106 PC107 PC108 PC109 PC110 PC111
Standard deviation 2.47543 2.45592 2.45236 2.4317 2.41468 2.41164 2.38742 2.37613
PC112 PC113 PC114 PC115 PC116 PC117 PC118 PC119 PC120
Standard deviation 2.35853 2.35056 2.33463 2.3295 2.3250 2.30730 2.30597 2.29750 2.29033
PC121 PC122 PC123 PC124 PC125 PC126 PC127 PC128
Standard deviation 2.26585 2.25312 2.24279 2.23134 2.22793 2.2158 2.20129 2.19728
PC129 PC130 PC131 PC132 PC133 PC134 PC135 PC136 PC137
Standard deviation 2.19296 2.17398 2.16039 2.14977 2.14574 2.13250 2.11185 2.1077 2.1031
PC138 PC139 PC140 PC141 PC142 PC143 PC144 PC145
Standard deviation 2.08261 2.07534 2.07042 2.05222 2.04793 2.04577 2.03694 2.02771
PC146 PC147 PC148 PC149 PC150 PC151 PC152 PC153 PC154
Standard deviation 2.01782 2.00092 1.99539 1.98983 1.9861 1.9781 1.96918 1.95985 1.95552
PC155 PC156 PC157 PC158 PC159 PC160 PC161 PC162
Standard deviation 1.94804 1.94291 1.93641 1.92242 1.92021 1.90568 1.89667 1.89451
PC163 PC164 PC165 PC166 PC167 PC168 PC169 PC170 PC171
Standard deviation 1.88673 1.88252 1.87848 1.87202 1.86426 1.8584 1.8548 1.84692 1.84509
PC172 PC173 PC174 PC175 PC176 PC177 PC178 PC179
Standard deviation 1.83931 1.83135 1.82669 1.82031 1.81080 1.80627 1.80070 1.79712
PC180 PC181 PC182 PC183 PC184 PC185 PC186 PC187
Standard deviation 1.78550 1.77936 1.77550 1.76623 1.75981 1.75181 1.74749 1.74488
PC188 PC189 PC190 PC191 PC192 PC193 PC194 PC195
Standard deviation 1.73577 1.73361 1.7181 1.70630 1.70290 1.69972 1.69536 1.68780
PC196 PC197 PC198 PC199 PC200 PC201 PC202 PC203
Standard deviation 1.67993 1.67835 1.67086 1.66480 1.66048 1.65500 1.65183 1.64799
PC204 PC205 PC206 PC207 PC208 PC209 PC210 PC211
Standard deviation 1.64225 1.64076 1.63780 1.63251 1.62520 1.61368 1.61132 1.60909
PC212 PC213 PC214 PC215 PC216 PC217 PC218 PC219 PC220
Standard deviation 1.60139 1.59764 1.59602 1.59042 1.58228 1.57580 1.5733 1.5654 1.5610
PC221 PC222 PC223 PC224 PC225 PC226 PC227 PC228
Standard deviation 1.55563 1.55465 1.54655 1.54196 1.53549 1.53234 1.52645 1.52422
PC229 PC230 PC231 PC232 PC233 PC234 PC235 PC236
Standard deviation 1.51790 1.51457 1.51286 1.50343 1.50315 1.49795 1.49454 1.48776
PC237 PC238 PC239 PC240 PC241 PC242 PC243 PC244
Standard deviation 1.48438 1.47875 1.47257 1.46959 1.46717 1.45983 1.45757 1.45308
PC245 PC246 PC247 PC248 PC249 PC250 PC251 PC252
Standard deviation 1.44916 1.44823 1.44553 1.44368 1.43596 1.42866 1.42567 1.42135
PC253 PC254 PC255 PC256 PC257 PC258 PC259 PC260 PC261
Standard deviation 1.41682 1.41351 1.41118 1.4086 1.4026 1.3972 1.3949 1.38863 1.38630
PC262 PC263 PC264 PC265 PC266 PC267 PC268 PC269
Standard deviation 1.38335 1.37999 1.37650 1.37295 1.36460 1.36280 1.35885 1.35620
PC270 PC271 PC272 PC273 PC274 PC275 PC276 PC277
Standard deviation 1.34918 1.34286 1.34120 1.33772 1.33425 1.32969 1.32843 1.32246
PC278 PC279 PC280 PC281 PC282 PC283 PC284 PC285
Standard deviation 1.32087 1.32038 1.31198 1.30642 1.30391 1.30088 1.29874 1.29487
PC286 PC287 PC288 PC289 PC290 PC291 PC292 PC293
Standard deviation 1.29066 1.28536 1.28144 1.27949 1.27251 1.27007 1.26710 1.26414
PC294 PC295 PC296 PC297 PC298 PC299 PC300 PC301
Standard deviation 1.26220 1.26014 1.25852 1.25420 1.25096 1.24993 1.24498 1.23962
PC302 PC303 PC304 PC305 PC306 PC307 PC308 PC309 PC310
Standard deviation 1.23400 1.23255 1.22667 1.2240 1.2231 1.2229 1.2214 1.2143 1.2102
PC311 PC312 PC313 PC314 PC315 PC316 PC317 PC318 PC319
Standard deviation 1.2100 1.2048 1.19990 1.19701 1.19515 1.18987 1.18491 1.18245 1.18014
PC320 PC321 PC322 PC323 PC324 PC325 PC326 PC327
Standard deviation 1.17643 1.17621 1.17231 1.16879 1.16163 1.16122 1.15969 1.15713
PC328 PC329 PC330 PC331 PC332 PC333 PC334 PC335
Standard deviation 1.15556 1.15240 1.14952 1.14577 1.14304 1.14126 1.13763 1.13693
PC336 PC337 PC338 PC339 PC340 PC341 PC342 PC343
Standard deviation 1.13375 1.13120 1.12788 1.12463 1.12139 1.11960 1.11934 1.11628
PC344 PC345 PC346 PC347 PC348 PC349 PC350 PC351
Standard deviation 1.11177 1.10918 1.10520 1.10440 1.10241 1.09819 1.09693 1.09243
PC352 PC353 PC354 PC355 PC356 PC357 PC358 PC359
Standard deviation 1.08730 1.08648 1.08409 1.08042 1.07850 1.07407 1.07129 1.06942
PC360 PC361 PC362 PC363 PC364 PC365 PC366 PC367
Standard deviation 1.06691 1.06346 1.06228 1.05964 1.05558 1.05401 1.05002 1.04714
PC368 PC369 PC370 PC371 PC372 PC373 PC374 PC375
Standard deviation 1.04667 1.04370 1.04314 1.03839 1.03722 1.03520 1.02846 1.02695
PC376 PC377 PC378 PC379 PC380 PC381 PC382 PC383
Standard deviation 1.02537 1.02305 1.02137 1.02052 1.01721 1.01152 1.01111 1.00876
PC384 PC385 PC386 PC387 PC388 PC389 PC390 PC391 PC392
Standard deviation 1.00659 1.00372 1.0015 1.0003 0.9959 0.9945 0.9933 0.9895 0.9853
PC393 PC394 PC395 PC396 PC397 PC398 PC399 PC400 PC401
Standard deviation 0.9813 0.9801 0.97715 0.97527 0.97116 0.96965 0.96598 0.96351 0.96042
PC402 PC403 PC404 PC405 PC406 PC407 PC408 PC409
Standard deviation 0.95959 0.95777 0.95501 0.95072 0.94820 0.94598 0.94302 0.94201
PC410 PC411 PC412 PC413 PC414 PC415 PC416 PC417
Standard deviation 0.93933 0.93682 0.93610 0.93316 0.93053 0.92868 0.92602 0.92544
PC418 PC419 PC420 PC421 PC422 PC423 PC424 PC425
Standard deviation 0.91855 0.91759 0.91581 0.91240 0.91112 0.91077 0.90678 0.90618
PC426 PC427 PC428 PC429 PC430 PC431 PC432 PC433
Standard deviation 0.90249 0.89949 0.89669 0.89384 0.89288 0.88850 0.88718 0.88584
PC434 PC435 PC436 PC437 PC438 PC439 PC440 PC441
Standard deviation 0.88341 0.87924 0.87845 0.87462 0.87209 0.86974 0.86603 0.86518
PC442 PC443 PC444 PC445 PC446 PC447 PC448 PC449
Standard deviation 0.86083 0.86061 0.85783 0.85555 0.85338 0.85200 0.85090 0.84969
PC450 PC451 PC452 PC453 PC454 PC455 PC456 PC457
Standard deviation 0.84627 0.84291 0.84245 0.83924 0.83537 0.83425 0.83233 0.82828
PC458 PC459 PC460 PC461 PC462 PC463 PC464 PC465
Standard deviation 0.82678 0.82414 0.82309 0.82012 0.81847 0.81767 0.81494 0.81351
PC466 PC467 PC468 PC469 PC470 PC471 PC472 PC473
Standard deviation 0.80988 0.80492 0.80108 0.80043 0.79918 0.79730 0.79388 0.79141
PC474 PC475 PC476 PC477 PC478 PC479 PC480 PC481
Standard deviation 0.78887 0.78738 0.78554 0.78497 0.78066 0.78024 0.77470 0.77313
PC482 PC483 PC484 PC485 PC486 PC487 PC488 PC489
Standard deviation 0.77203 0.77000 0.76624 0.76326 0.76087 0.75905 0.75893 0.75472
PC490 PC491 PC492 PC493 PC494 PC495 PC496 PC497
Standard deviation 0.75419 0.75251 0.74832 0.74672 0.74510 0.74337 0.74174 0.74037
PC498 PC499 PC500 PC501 PC502 PC503 PC504 PC505
Standard deviation 0.73802 0.73601 0.73324 0.73171 0.72897 0.72684 0.72589 0.72385
PC506 PC507 PC508 PC509 PC510 PC511 PC512 PC513 PC514
Standard deviation 0.72215 0.7172 0.7142 0.7132 0.7091 0.7087 0.7080 0.7049 0.7013
PC515 PC516 PC517 PC518 PC519 PC520 PC521 PC522 PC523
Standard deviation 0.6982 0.6953 0.6942 0.6924 0.6904 0.6880 0.6864 0.68305 0.68213
PC524 PC525 PC526 PC527 PC528 PC529 PC530 PC531
Standard deviation 0.67959 0.67589 0.67421 0.67178 0.66804 0.66633 0.66601 0.66377
PC532 PC533 PC534 PC535 PC536 PC537 PC538 PC539
Standard deviation 0.66191 0.65937 0.65725 0.65584 0.65419 0.65095 0.64724 0.64534
PC540 PC541 PC542 PC543 PC544 PC545 PC546 PC547
Standard deviation 0.64276 0.63951 0.63786 0.63523 0.63117 0.62784 0.62692 0.62319
PC548 PC549 PC550 PC551 PC552 PC553 PC554 PC555
Standard deviation 0.62051 0.61775 0.61483 0.61238 0.61000 0.60674 0.60389 0.60303
PC556 PC557 PC558 PC559 PC560 PC561 PC562 PC563
Standard deviation 0.60151 0.60121 0.59484 0.59220 0.58997 0.58819 0.58589 0.58175
PC564 PC565 PC566 PC567 PC568 PC569 PC570 PC571
Standard deviation 0.57990 0.57774 0.57318 0.57104 0.56929 0.56368 0.56112 0.55675
PC572 PC573 PC574 PC575 PC576 PC577 PC578 PC579
Standard deviation 0.55548 0.55188 0.55145 0.54780 0.54517 0.53832 0.53484 0.53255
PC580 PC581 PC582 PC583 PC584 PC585 PC586 PC587
Standard deviation 0.52787 0.52379 0.52309 0.51867 0.51252 0.51120 0.50921 0.50802
PC588 PC589 PC590 PC591 PC592 PC593 PC594 PC595
Standard deviation 0.50304 0.49706 0.49622 0.48912 0.48581 0.48163 0.48029 0.47163
PC596 PC597 PC598 PC599 PC600 PC601 PC602 PC603
Standard deviation 0.46913 0.46281 0.45966 0.45449 0.45346 0.44524 0.44136 0.43279
PC604 PC605 PC606 PC607 PC608 PC609 PC610 PC611
Standard deviation 0.41556 0.39920 0.39283 0.38120 0.37238 0.35365 0.31898 6.713e-15
[ reached getOption("max.print") -- omitted 2 rows ]
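
The two rows that `max.print` cut off are the "Proportion of Variance" and "Cumulative Proportion" rows. They can be recomputed from `sdev`. A minimal sketch on a small simulated matrix (toy data standing in for `pcaDataT`, not the TCGA values):

```r
# Toy stand-in for pcaDataT: 20 samples x 50 genes (simulated)
set.seed(1)
toy <- matrix(rnorm(20 * 50), nrow = 20,
              dimnames = list(paste0("S", 1:20), paste0("G", 1:50)))

pca_toy <- prcomp(toy, scale. = TRUE)

# The variance of each PC is its squared standard deviation
pc_var <- pca_toy$sdev^2
var_explained <- pc_var / sum(pc_var)
cum_explained <- cumsum(var_explained)

# With scale. = TRUE the total variance equals the number of genes
sum(pc_var)

# Proportion and cumulative proportion for the first five PCs
round(head(var_explained, 5), 4)
round(head(cum_explained, 5), 4)
```

Applying the same arithmetic to the output above, where the 4913 scaled genes give a total variance of 4913, PC1 explains about 30.5406^2 / 4913 ≈ 19.0% of the variance and PC2 about 21.06796^2 / 4913 ≈ 9.0%.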

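> plot(pca,type="l") # scree plot of the variances of the first 10 PCs; PC1 and PC2 are the largest, at roughly 19% and 9% of the total variance (computed from the standard deviations above)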

> str(pca) # inspect the structure of the result
List of 5
$ sdev : num [1:611] 30.5 21.1 17.8 15.6 13.7 ...
$ rotation: num [1:4913, 1:611] -0.0177 -0.0113 0.0172 -0.0142 0.0154 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4913] "ENSG00000000938" "ENSG00000001617" "ENSG00000001630" "ENSG00000002586" ...
.. ..$ : chr [1:611] "PC1" "PC2" "PC3" "PC4" ...
$ center : Named num [1:4913] 1086 6694 235 9934 241 ...
..- attr(*, "names")= chr [1:4913] "ENSG00000000938" "ENSG00000001617" "ENSG00000001630" "ENSG00000002586" ...
$ scale : Named num [1:4913] 743 3689 183 5885 619 ...
..- attr(*, "names")= chr [1:4913] "ENSG00000000938" "ENSG00000001617" "ENSG00000001630" "ENSG00000002586" ...
$ x : num [1:611, 1:611] -19.09 -35.24 9.44 -58.72 91.79 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:611] "TCGA-CZ-5465-01A-01R-1503" "TCGA-BP-4355-01A-01R-1289" "TCGA-CZ-5451-01A-01R-1503" "TCGA-B0-5081-01A-01R-1334" ...
.. ..$ : chr [1:611] "PC1" "PC2" "PC3" "PC4" ...
- attr(*, "class")= chr "prcomp"
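
The five components fit together in a simple way: `x` holds the sample scores, obtained by centering and scaling the data with `center` and `scale` and then multiplying by `rotation`, and `sdev` is the standard deviation of each column of `x`. A small sketch on simulated data (the toy matrix is an assumption, not the TCGA data) to check this:

```r
# Toy data: 10 samples x 6 genes (simulated)
set.seed(2)
toy <- matrix(rnorm(10 * 6), nrow = 10,
              dimnames = list(paste0("S", 1:10), paste0("G", 1:6)))
pca_toy <- prcomp(toy, scale. = TRUE)

# Center and scale each gene with the stored values, then rotate
scaled <- scale(toy, center = pca_toy$center, scale = pca_toy$scale)
scores <- scaled %*% pca_toy$rotation

# The recomputed scores match pca_toy$x
all.equal(scores, pca_toy$x, check.attributes = FALSE)

# sdev is the per-column standard deviation of the scores
all.equal(unname(apply(pca_toy$x, 2, sd)), pca_toy$sdev)
```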

> data_pca <- cbind(pcaDataTGroup,pca$x[,1:2]) # append the PC1 and PC2 scores
> dim(data_pca)
[1] 611 4916
> head(data_pca)[,4913:4916]
ENSG00000281490 Group PC1 PC2
TCGA-CZ-5465-01A-01R-1503 446 cancer -19.085520 10.325704
TCGA-BP-4355-01A-01R-1289 1208 cancer -35.244522 34.570355
TCGA-CZ-5451-01A-01R-1503 126 cancer 9.439234 -23.982777
TCGA-B0-5081-01A-01R-1334 626 cancer -58.723514 23.949627
TCGA-CZ-5454-11A-01R-1503 240 normal 91.788022 78.416490
TCGA-B0-5697-01A-11R-1541 232 cancer -1.465493 -2.567302

# PLOT WITH GGPLOT -----
> library(ggplot2)
> ggplot(data_pca,aes(PC1,PC2,col=Group,fill=Group))+
+ stat_ellipse(geom="polygon",col="black",alpha=0.5)+
+ geom_point(shape=21,col="black",size=1.2)+
+ theme(panel.background = element_rect(fill="transparent",color="black"),
+ panel.grid.minor = element_blank(),
+ panel.grid.major = element_blank())

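As the plot shows, the tumor and normal samples fall into two clearly separated groups. This suggests the same projection could be used to predict whether an unknown sample is tumor or normal.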
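
The prediction idea can be sketched with `predict()`, which projects new samples onto the principal components of an existing `prcomp` fit. Everything below is simulated, and the nearest-centroid rule is only an illustrative assumption; the original analysis does not specify a classifier:

```r
set.seed(3)
# Simulated training data: 30 "normal" and 30 "cancer" samples, 40 genes
n_genes <- 40
normal <- matrix(rnorm(30 * n_genes, mean = 0), nrow = 30)
cancer <- matrix(rnorm(30 * n_genes, mean = 2), nrow = 30)
train <- rbind(normal, cancer)
colnames(train) <- paste0("G", seq_len(n_genes))
group <- factor(rep(c("normal", "cancer"), each = 30))

pca_fit <- prcomp(train, scale. = TRUE)

# A hypothetical new sample; it needs the same gene columns as the training data
new_sample <- matrix(rnorm(n_genes, mean = 2), nrow = 1,
                     dimnames = list("new", colnames(train)))

# predict() reuses the stored center, scale and rotation
new_scores <- predict(pca_fit, newdata = new_sample)[, 1:2, drop = FALSE]

# Group centroids in the PC1/PC2 plane (rows: groups, columns: PCs)
centroids <- apply(pca_fit$x[, 1:2], 2, function(pc) tapply(pc, group, mean))

# Euclidean distance from the new sample to each centroid
dists <- apply(centroids, 1, function(ctr) sqrt(sum((ctr - new_scores[1, ])^2)))
predicted <- names(which.min(dists))
predicted
```

With real data the new sample would need to be processed the same way as the training samples and restricted to the same 4913 genes before projection.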

  • Author: 括囊无誉
  • Link: TCGA/TCGA6PCA/
  • Copyright: All articles on this blog are original work; please credit the source when reposting.