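This article mainly follows the official vignette: https://xlucpu.github.io/MOVICS/MOVICS-VIGNETTE.html
Introduction
Installation
GET Module
Preparing the data
Feature selection (dimensionality reduction)
Determining the optimal number of subtypes
Subtyping with a single algorithm
Running multiple clustering algorithms at once
Integrating the results of multiple algorithms
Assessing clustering quality
Multi-omics heatmaps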
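Introduction
Molecular subtyping has long been a popular technique in bioinformatics data mining. Many algorithms can be used for it, such as the familiar non-negative matrix factorization (NMF), consensus clustering, and PCA. We covered consensus clustering in an earlier post: Molecular subtyping based on immune infiltration results
Today we introduce a one-stop R package for molecular subtyping: MOVICS.
What sets it apart from other subtyping packages is that it can use multi-omics data at the same time. Ordinary subtyping packages work from a single omics layer, for example only an mRNA expression matrix. MOVICS can cluster jointly on, say, mRNA, lncRNA, methylation and mutation data.
In addition, it supports exploring each subtype after clustering and running analyses within each subtype, which is why we call it a one-stop package. Its functionality falls into three parts, shown in the schematic below:
[Figure: schematic of the three MOVICS modules]
The first part clusters samples based on the different omics data. The second part compares the resulting subtypes. The third part explores each subtype and identifies its subtype-specific molecules.
The main functions in each part are listed below and are introduced later: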
GET Module: get subtypes through multi-omics integrative clustering
getElites(): get elites, which are those features that pass the filtering procedure and are used for analyses
getClustNum(): get optimal cluster number by calculating clustering prediction index (CPI) and Gap-statistics
get%algorithm_name%(): get results from one specific multi-omics integrative clustering algorithm with detailed parameters (e.g. getiClusterBayes())
getMOIC(): get a list of results from multiple multi-omics integrative clustering algorithms with parameters by default
getConsensusMOIC(): get a consensus matrix that indicates the clustering robustness across different clustering algorithms and generate a consensus heatmap
getSilhouette(): get quantification of sample similarity using the silhouette score approach
getStdiz(): get standardized data for generating a comprehensive multi-omics heatmap
getMoHeatmap(): get a comprehensive multi-omics heatmap based on clustering results
COMP Module: compare subtypes from multiple perspectives
compSurv(): compare survival outcome and generate a Kaplan-Meier curve with pairwise comparison if possible
compClinvar(): compare and summarize clinical features among different identified subtypes
compMut(): compare mutational frequency and generate an OncoPrint with significant mutations
compTMB(): compare total mutation burden among subtypes and generate distribution of Transitions and Transversions
compFGA(): compare fraction genome altered among subtypes and generate a barplot for distribution comparison
compDrugsen(): compare estimated half maximal inhibitory concentration (IC50) for drug sensitivity and generate a boxviolin for distribution comparison
compAgree(): compare agreement of current subtypes with other pre-existing classifications and generate an alluvial diagram and an agreement barplot
RUN Module: run marker identification and verify subtypes
runDEA(): run differential expression analysis with three popular methods to choose from, including edgeR, DESeq2, and limma
runMarker(): run biomarker identification to determine uniquely and significantly differentially expressed genes for each subtype
runGSEA(): run gene set enrichment analysis (GSEA), calculate activity of functional pathways and generate a pathway-specific heatmap
runGSVA(): run gene set variation analysis to calculate enrichment score of each sample based on given gene set list of interest
runNTP(): run nearest template prediction based on identified biomarkers to evaluate subtypes in external cohorts
runPAM(): run partition around medoids classifier based on discovery cohort to predict subtypes in external cohorts
runKappa(): run consistency evaluation using Kappa statistics between two appraisements that identify or predict current subtypes
The package has been published; remember to cite it when you use it:
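Lu, X., Meng, J., Zhou, Y., Jiang, L., and Yan, F. (2020). MOVICS: an R package for multi-omics integration and visualization in cancer subtyping. bioRxiv, 2020.2009.2015.297820. [doi.org/10.1101/2020.09.15.297820]
Installation
The package is currently hosted on GitHub and can only be installed as shown below. It is best to install the dependencies first: the package has a great many of them, and the installation fails easily. For beginners, installing this package is not very friendly.
# install from GitHub
devtools::install_github("xlucpu/MOVICS")
# or download the archive and install it locally
devtools::install_local("E:/R/R包/MOVICS-master.zip")
GET Module
Preparing the data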
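Let's first look at the example data.
library(MOVICS)
We use the data bundled with the package for the demonstration; it has already been cleaned. A separate post on how to prepare such data will follow in a few days.
# TCGA breast cancer data
load(system.file("extdata", "brca.tcga.RData", package = "MOVICS", mustWork = TRUE))
load(system.file("extdata", "brca.yau.RData", package = "MOVICS", mustWork = TRUE))
brca.tcga contains data from several omics layers, such as mRNA, lncRNA, methylation and mutation data, plus clinical information such as survival time, survival status and the PAM50 classification of breast cancer.
For demonstration purposes, the data were pre-filtered by MAD to keep only a subset of features:
500 mRNAs, 500 lncRNAs, and 1,000 promoter CGI probes/genes with high variation
30 genes that are mutated in at least 3% of the entire cohort
Note the most important point here: the number, names and order of the samples must be exactly the same in every omics dataset. You can inspect the data yourself to see what it looks like.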
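The alignment requirement above can be checked mechanically before any clustering. The following is a minimal sketch on toy matrices, not part of MOVICS: the helper name same.samples and the objects mo.toy, mo.bad and mo.fixed are hypothetical and introduced only for illustration.

```r
# Toy omics list (hypothetical data): two matrices, features x samples
sam <- paste0("S", 1:4)
mo.toy <- list(
  mRNA = matrix(rnorm(12), nrow = 3,
                dimnames = list(paste0("g", 1:3), sam)),
  mut  = matrix(rbinom(8, 1, 0.5), nrow = 2,
                dimnames = list(c("TP53", "PIK3CA"), sam))
)

# Hypothetical helper: TRUE only if every matrix has the same
# sample names in the same order as the first matrix
same.samples <- function(omics) {
  ref <- colnames(omics[[1]])
  all(vapply(omics, function(m) identical(colnames(m), ref), logical(1)))
}

ok.toy <- same.samples(mo.toy)

# Shuffling the columns of one matrix breaks the alignment ...
mo.bad <- mo.toy
mo.bad$mut <- mo.bad$mut[, rev(sam)]
ok.bad <- same.samples(mo.bad)

# ... and reindexing every matrix by the reference names restores it
mo.fixed <- lapply(mo.bad, function(m) m[, sam, drop = FALSE])
ok.fixed <- same.samples(mo.fixed)
```

The same helper could presumably be applied to a list of real omics matrices (such as the first four elements of brca.tcga) before passing them to the clustering functions.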
names(brca.tcga)
## [1] "mRNA.expr"   "lncRNA.expr" "meth.beta"   "mut.status"  "count"
## [6] "fpkm"        "maf"         "segment"     "clin.info"
names(brca.yau)
## [1] "mRNA.expr" "clin.info"
# extract "mRNA.expr", "lncRNA.expr", "meth.beta" and "mut.status"
mo.data <- brca.tcga[1:4]
# extract raw count data
count <- brca.tcga$count
# extract fpkm data
fpkm <- brca.tcga$fpkm
# extract maf
maf <- brca.tcga$maf
# extract segmented copy number
segment <- brca.tcga$segment
# extract survival information
surv.info <- brca.tcga$clin.info
Feature selection (dimensionality reduction)
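getElites, as the name suggests, picks out the "elites", that is, the best features. In other words, this function handles preprocessing and filtering and helps you prepare the data.
It mainly supports the following preprocessing:
Missing values: remove features containing NA, or impute them with KNN
Feature filtering: by mad, sd, pca, cox, or freq (for binary data)
Strictly speaking this is not the first step: you should first clean the data yourself, for example by log-transforming the expression matrix.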
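As a small illustration of the cleaning step mentioned above, here is a minimal sketch of a log transformation on a toy matrix. The values and the pseudocount of 1 are assumptions made for the example; the right transformation depends on your own data.

```r
# Toy FPKM-like matrix (hypothetical values): 3 genes x 4 samples
expr <- matrix(c(  0,   1,    3,    7,
                  15,  31,   63,  127,
                 255, 511, 1023, 2047),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("gene", 1:3), paste0("S", 1:4)))

# log2 with a pseudocount of 1 so that zeros stay finite
log.expr <- log2(expr + 1)
```

The pseudocount keeps zero expression values at 0 after the transformation instead of producing -Inf.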
Below are a few demonstrations; the function is quite powerful.
Handling missing values:
# scenario 1: handle missing values
tmp <- brca.tcga$mRNA.expr # get expression data
dim(tmp) # check data dimension
## [1] 500 643
tmp[1,1] <- tmp[2,2] <- NA # insert a few NA values
tmp[1:3,1:3] # check data
##         BRCA-A03L-01A BRCA-A04R-01A BRCA-A075-01A
## SCGB2A2            NA          1.42          7.24
## SCGB1D2         10.11            NA          5.88
## PIP              4.54          2.59          4.35
elite.tmp <- getElites(dat       = tmp,
                       method    = "mad",
                       na.action = "rm", # remove features containing NA
                       elite.pct = 1) # keep 100% of the features
## --2 features with NA values are removed.
## missing elite.num then use elite.pct
dim(elite.tmp$elite.dat)
## [1] 498 643
elite.tmp <- getElites(dat       = tmp,
                       method    = "mad",
                       na.action = "impute", # impute with KNN
                       elite.pct = 1)
## missing elite.num then use elite.pct
dim(elite.tmp$elite.dat)
## [1] 500 643
elite.tmp$elite.dat[1:3,1:3] # NA values have been imputed
##         BRCA-A03L-01A BRCA-A04R-01A BRCA-A075-01A
## SCGB2A2         6.867         1.420          7.24
## SCGB1D2        10.110         4.739          5.88
## PIP             4.540         2.590          4.35
Filtering features by MAD:
# scenario 2: filter by MAD (median absolute deviation)
tmp <- brca.tcga$mRNA.expr
elite.tmp <- getElites(dat       = tmp,
                       method    = "mad",
                       elite.pct = 0.1) # keep the top 10% of genes by MAD
## missing elite.num then use elite.pct
dim(elite.tmp$elite.dat) # 10% of 500 is 50
## [1] 50 643
elite.tmp <- getElites(dat       = tmp,
                       method    = "sd",
                       elite.num = 100, # keep the top 100 genes by SD
                       elite.pct = 0.1) # this argument is ignored in this case
## elite.num has been provided then discards elite.pct.
dim(elite.tmp$elite.dat)
## [1] 100 643
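To make the idea behind MAD-based filtering concrete, here is a minimal base-R sketch on simulated data. It only illustrates the general technique (rank features by median absolute deviation and keep the top fraction) and is not taken from the MOVICS source; all object names are hypothetical.

```r
set.seed(1)
# Toy matrix (hypothetical): 20 features x 10 samples
x <- matrix(rnorm(200), nrow = 20,
            dimnames = list(paste0("gene", 1:20), paste0("S", 1:10)))
# Inflate the spread of the first two features so they rank on top
x[1:2, ] <- x[1:2, ] * 50

row.mad <- apply(x, 1, mad)        # per-feature median absolute deviation
n.keep  <- ceiling(0.1 * nrow(x))  # keep the top 10% of features
keep    <- names(sort(row.mad, decreasing = TRUE))[seq_len(n.keep)]
x.elite <- x[keep, , drop = FALSE]
dim(x.elite)
```

Replacing mad with sd in the apply() call gives the analogous SD-based ranking.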
Filtering by PCA, which requires some background knowledge of PCA; see the earlier post: Principal component analysis in R
# scenario 3: filter by PCA
tmp <- brca.tcga$mRNA.expr # get expression data with 500 features
elite.tmp <- getElites(dat       = tmp,
                       method    = "pca",
                       pca.ratio = 0.95) # proportion of variance to be explained by the retained PCs
## --the ratio used to select principal component is set as 0.95
dim(elite.tmp$elite.dat) # get 204 elite (PCs) left
## [1] 204 643
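The following sketch shows, under stated assumptions, what selecting principal components by a variance ratio can look like in base R. It assumes that the ratio is applied to the cumulative proportion of explained variance; whether getElites() centers, scales or selects PCs in exactly this way is not confirmed here, and the data are simulated.

```r
set.seed(1)
# Toy matrix (hypothetical): 50 features x 30 samples
x <- matrix(rnorm(50 * 30), nrow = 50)

# prcomp expects samples in rows, so transpose first
pca <- prcomp(t(x), center = TRUE, scale. = TRUE)

# Cumulative proportion of variance explained by the PCs
var.ratio <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of PCs reaching the 0.95 threshold
n.pc <- which(var.ratio >= 0.95)[1]

# Retained PCs arranged as "features" x samples
pcs <- t(pca$x[, seq_len(n.pc), drop = FALSE])
dim(pcs)
```

Note that the rows of the result are principal components rather than original genes, which matches the "elite (PCs)" wording in the output above.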
Filtering by univariate Cox regression, that is, fitting a univariate Cox model for each feature and keeping the significant ones. Survival information must be provided:
# scenario 4: filter by Cox regression
tmp <- brca.tcga$mRNA.expr # get expression data
elite.tmp <- getElites(dat       = tmp,
                       method    = "cox",
                       surv.info = surv.info, # survival information; must contain columns 'futime' and 'fustat'
                       p.cutoff  = 0.05,
                       elite.num = 100) # this argument is ignored here as well
## --all sample matched between omics matrix and survival data.
## 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
dim(elite.tmp$elite.dat) # get 125 elites
## [1] 125 643
table(elite.tmp$unicox$pvalue < 0.05) # 125 genes have nominal pvalue < 0.05
##
## FALSE  TRUE
##   375   125
tmp <- brca.tcga$mut.status # get mutation data
elite.tmp <- getElites(dat       = tmp,
                       method    = "cox",
                       surv.info = surv.info,
                       p.cutoff  = 0.05,
                       elite.num = 100)
## --all sample matched between omics matrix and survival data.
## 7% 13% 20% 27% 33% 40% 47% 53% 60% 67% 73% 80% 87% 93% 100%
dim(elite.tmp$elite.dat) # get 3 elites
## [1] 3 643
table(elite.tmp$unicox$pvalue < 0.05) # 3 mutations have nominal pvalue < 0.05
##
## FALSE  TRUE
##    27     3
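For readers who want to see what per-feature univariate Cox screening looks like, here is a minimal sketch using the survival package on simulated data. It is an illustration of the general approach, not the MOVICS implementation: the use of the Wald p-value from summary() is an assumption, and all data are random.

```r
library(survival)  # recommended package shipped with standard R installs

set.seed(1)
n <- 100
# Hypothetical survival table with the column names mentioned above
surv <- data.frame(futime = rexp(n, rate = 0.1),
                   fustat = rbinom(n, 1, 0.7))
# Toy expression matrix: 5 features x 100 samples
expr <- matrix(rnorm(5 * n), nrow = 5,
               dimnames = list(paste0("gene", 1:5), NULL))

# Fit one Cox model per feature and collect its Wald-test p-value
pvalue <- apply(expr, 1, function(g) {
  fit <- coxph(Surv(futime, fustat) ~ g, data = cbind(surv, g = g))
  summary(fit)$coefficients[1, "Pr(>|z|)"]
})

# Keep features with a nominal p-value below the cutoff
elite <- names(pvalue)[pvalue < 0.05]
```

Because the toy features are pure noise, few or none of them are expected to pass the cutoff; with real data the retained set depends entirely on the cohort.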
Filtering by mutation frequency, which is intended specifically for binary data such as a 0/1 mutation matrix:
# scenario 5: filter by mutation frequency
tmp <- brca.tcga$mut.status # get mutation data
rowSums(tmp)
## PIK3CA   TP53    TTN   CDH1  GATA3   MLL3  MUC16 MAP3K1  SYNE1  MUC12    DMD
##    208    186    111     83     58     49     48     38     33     32     31
##  NCOR1    FLG   PTEN   RYR2  USH2A  SPTA1 MAP2K4  MUC5B    NEB   SPEN  MACF1
##     31     30     29     27     27     25     25     24     24     23     23
##   RYR3    DST  HUWE1  HMCN1  CSMD1  OBSCN   APOB  SYNE2
##     23     22     22     22     21     21     21     21
elite.tmp <- getElites(dat       = tmp,
                       method    = "freq", # must set as 'freq'
                       elite.num = 80, # here this is the cutoff on the number of mutated samples
                       elite.pct = 0.1) # this argument is ignored in this case
## --method of 'freq' only supports binary omics data (e.g., somatic mutation matrix), and in this manner, elite.pct and elite.num are used to cut frequency.
## elite.num has been provided then discards elite.pct.
rowSums(elite.tmp$elite.dat) # only genes mutated in 80 or more samples are kept
## PIK3CA   TP53    TTN   CDH1
##    208    186    111     83
elite.tmp <- getElites(dat       = tmp,
                       method    = "freq",
                       elite.pct = 0.2)
## --method of 'freq' only supports binary omics data (e.g., somatic mutation matrix), and in this manner, elite.pct and elite.num are used to cut frequency.
## missing elite.num then use elite.pct
rowSums(elite.tmp$elite.dat) # only genes mutated in more than 0.2*643=128.6 samples are kept
## PIK3CA   TP53
##    208    186
Determining the optimal number of subtypes
Samples are clustered according to their molecular profiles, that is, the mRNA, lncRNA, miRNA, methylation matrices and so on obtained in the previous step.
First use CPI and Gap-statistics to decide how many subtypes to use:
optk.brca <- getClustNum(data        = mo.data, # 4 omics datasets
                         is.binary   = c(F,F,F,T), # the first 3 are not binary, the last one is
                         try.N.clust = 2:8, # try 2 to 8 subtypes
                         fig.name    = "CLUSTER NUMBER OF TCGA-BRCA") # name of the saved figure
## calculating Cluster Prediction Index...
## 5% complete
## (intermediate progress messages up to 95% omitted)
## 100% complete
## calculating Gap-statistics...
## visualization done...
## --the imputed optimal cluster number is 3 arbitrarily, but it would be better referring to other priori knowledge.
[Figure: CPI and Gap-statistics across candidate cluster numbers]
A figure in PDF format is generated automatically in the current working directory.
The function suggests 3, but considering the PAM50 classification of breast cancer we choose k=5, that is, 5 subtypes.
So the final number of subtypes depends on your own needs; adjust it flexibly.
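As background for the Gap-statistics mentioned above, the sketch below computes a gap statistic for k-means on simulated single-omics data with the cluster package. It is only an illustration of the statistic itself; it does not reproduce how getClustNum() combines CPI and Gap-statistics across several omics layers, and the data and parameter values are assumptions.

```r
library(cluster)  # recommended package shipped with standard R installs

set.seed(1)
# Toy data (hypothetical): 3 well-separated groups of 20 samples, 2 features
x <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2),
           matrix(rnorm(40, mean = 20), ncol = 2))

# Gap statistic for k = 1..6 using k-means, with 50 reference datasets
gap <- clusGap(x, FUNcluster = kmeans, nstart = 10, K.max = 6, B = 50)

# One common rule for picking k from the gap curve
k.hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "firstSEmax")
```

As in the main text, a statistically suggested k is only a starting point; prior knowledge such as an established classification can justify a different choice.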
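Subtyping with a single algorithm
Once the number of subtypes is decided, the clustering itself can be run. Many methods are provided, including the familiar non-negative matrix factorization and consensus clustering.
For example, clustering with the Bayesian method (iClusterBayes):
# perform iClusterBayes (may take a while)
iClusterBayes.res <- getiClusterBayes(data        = mo.data,
                                      N.clust     = 5,
                                      type        = c("gaussian","gaussian","gaussian","binomial"),
                                      n.burnin    = 1800,
                                      n.draw      = 1200,
                                      prior.gamma = c(0.5, 0.5, 0.5, 0.5),
                                      sdev        = 0.05,
                                      thin        = 3)
## clustering done...
## feature selection done...
Alternatively, use the unified function getMOIC() and specify the algorithm yourself; the two approaches give the same result:
iClusterBayes.res <- getMOIC(data        = mo.data,
                             N.clust     = 5,
                             methodslist = "iClusterBayes", # specify the algorithm
                             type        = c("gaussian","gaussian","gaussian","binomial"), # data type corresponding to the list
                             n.burnin    = 1800,
                             n.draw      = 1200,
                             prior.gamma = c(0.5, 0.5, 0.5, 0.5),
                             sdev        = 0.05,
                             thin        = 3)
The returned result contains a clust.res object with two columns: clust indicates the subtype each sample belongs to, and samID records the corresponding sample name. For algorithms that include a feature selection procedure (such as iClusterBayes, CIMLR and MoCluster), the result also contains a feat.res object storing the information from that procedure. For algorithms that involve hierarchical clustering (e.g. COCA, ConsensusClustering), the corresponding sample dendrogram is also returned as clust.dend, which is useful if you want to show it in a heatmap.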
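A quick way to inspect such a result is to tabulate the subtype sizes and list the samples in one subtype. The sketch below uses a small mock data frame that only mimics the two columns described above (samID and clust); it is a hypothetical stand-in, not real MOVICS output.

```r
# Hypothetical stand-in for a clust.res object: one row per sample
clust.res <- data.frame(samID = paste0("S", 1:8),
                        clust = c(1, 1, 2, 3, 3, 3, 4, 5),
                        stringsAsFactors = FALSE)

# Number of samples in each subtype
sizes <- table(clust.res$clust)

# Sample names belonging to subtype 3
cs3 <- split(clust.res$samID, clust.res$clust)[["3"]]
```

With a real result the same two lines would presumably be run on the clust.res element of the returned list, for example iClusterBayes.res$clust.res.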
Running multiple clustering algorithms at once
Several algorithms can be run together and their results then integrated into a final result, which is remarkably powerful:
# perform multi-omics integrative clustering with the rest of 9 algorithms
moic.res.list <- getMOIC(data        = mo.data,
                         methodslist = list("SNF", "PINSPlus", "NEMO", "COCA", "LRAcluster", "ConsensusClustering", "IntNMF", "CIMLR", "MoCluster"), # 9 algorithms
                         N.clust     = 5,
                         type        = c("gaussian", "gaussian", "gaussian", "binomial"))
## --you choose more than 1 algorithm and all of them shall be run with parameters by default.
## SNF done...
## Clustering method: kmeans
## Perturbation method: noise
## PINSPlus done...
## NEMO done...
## COCA done...
## LRAcluster done...
## end fraction
## clustered
## ConsensusClustering done...
## IntNMF done...
## clustering done...
## feature selection done...
## CIMLR done...
## clustering done...
## feature selection done...
## MoCluster done...
Now add the iClusterBayes result as well, which gives 10 algorithms in total:
moic.res.list <- append(moic.res.list,
                        list("iClusterBayes" = iClusterBayes.res))
# save the results
save(moic.res.list, file = "moic.res.list.rda")
Integrating the results of multiple algorithms
This step borrows the idea of consensus ensembles to integrate the results of multiple clustering algorithms.
A consensus heatmap can be drawn:
load(file = "moic.res.list.rda")
cmoic.brca <- getConsensusMOIC(moic.res.list = moic.res.list,
                               fig.name      = "CONSENSUS HEATMAP",
                               distance      = "euclidean",
                               linkage       = "average")
[Figure: consensus heatmap]
The result is saved in the current working directory.
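The consensus idea can be sketched in a few lines of base R: for every pair of samples, count the fraction of algorithms that put them in the same cluster, then cluster that matrix. The labels below are made up, and using 1 - consensus as the distance is a simplifying assumption of this sketch; getConsensusMOIC() itself takes the distance and linkage arguments shown above.

```r
# Hypothetical cluster labels for 6 samples from 3 algorithms
labels <- list(algoA = c(1, 1, 1, 2, 2, 2),
               algoB = c(1, 1, 2, 2, 2, 2),
               algoC = c(2, 2, 2, 1, 1, 1))
sam <- paste0("S", 1:6)

# Per algorithm: 1 if two samples share a cluster, 0 otherwise
co.list <- lapply(labels, function(cl) outer(cl, cl, "==") * 1)

# Average over algorithms: fraction of algorithms agreeing on each pair
consensus <- Reduce(`+`, co.list) / length(co.list)
dimnames(consensus) <- list(sam, sam)

# Hierarchical clustering on 1 - consensus yields the consensus subtypes
hc    <- hclust(as.dist(1 - consensus), method = "average")
final <- cutree(hc, k = 2)
```

Note that the label values themselves do not matter (algoC swaps 1 and 2 relative to algoA); only co-membership enters the consensus matrix, which is what makes results from different algorithms comparable.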
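Assessing clustering quality
Besides inspecting the heatmap above, the silhouette criterion can be used to judge clustering quality.
The following explanation comes from the internet:
The silhouette criterion is an evaluation method for cluster analysis. It measures clustering quality by comparing, for each data point, its distance to the other points in its own cluster with its distance to points in other clusters. It can help determine a suitable number of clusters and so make the analysis more reliable.
The silhouette coefficient is computed as follows. For each data point i, compute the average distance ai to the other points in the same cluster, and the average distance bi to the points in the nearest other cluster. The silhouette coefficient of the point is then defined as:
s(i) = (bi - ai) / max(ai, bi)
The silhouette coefficient ranges from -1 to 1. Negative values suggest that the point may have been assigned to the wrong cluster, while values close to 1 indicate that the point fits its own cluster well. The average silhouette coefficient can be used to evaluate the clustering as a whole, so the goal is to maximize this average.
As the number of clusters grows, the average silhouette coefficient often rises first and then falls, so we look for the number of clusters at which the average reaches its maximum. A common way to do this is to plot the average silhouette coefficient (vertical axis) against the number of clusters (horizontal axis).
Keep the following points in mind when using the silhouette criterion:
The silhouette coefficient depends on the distance measure used. It is most commonly computed with Euclidean distance, and values obtained under different measures are not directly comparable.
Computing it requires all pairwise distances, so computational cost needs attention on large datasets.
It is not the only evaluation metric; specific clustering problems may call for other metrics.
The result is saved in the current working directory: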
getSilhouette(sil      = cmoic.brca$sil, # a sil object returned by getConsensusMOIC()
              fig.path = getwd(),
              fig.name = "SILHOUETTE",
              height   = 5.5,
              width    = 5)
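To make the formula s(i) = (bi - ai) / max(ai, bi) concrete, here is a minimal hand-rolled computation on four one-dimensional points. It is a teaching sketch with made-up data, not what getSilhouette() runs internally, and it assumes every cluster has at least two members (for a singleton cluster ai is undefined and s(i) is conventionally set to 0).

```r
# Toy 1-D data (hypothetical): two tight, well-separated clusters
x  <- c(0, 1, 10, 11)
cl <- c(1, 1, 2, 2)
d  <- as.matrix(dist(x))  # pairwise Euclidean distances

sil <- sapply(seq_along(x), function(i) {
  own   <- cl == cl[i] & seq_along(x) != i  # same cluster, excluding i itself
  other <- cl != cl[i]                      # points in all other clusters
  a <- mean(d[i, own])                             # mean distance within own cluster
  b <- min(tapply(d[i, other], cl[other], mean))   # mean distance to nearest other cluster
  (b - a) / max(a, b)
})

avg.sil <- mean(sil)
```

For the first point, ai = 1 and bi = (10 + 11) / 2 = 10.5, so s = 9.5 / 10.5, roughly 0.905; all four values are close to 1, as expected for well-separated clusters.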
[Figure: silhouette plot]
## png
##   2
Multi-omics heatmaps
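After subtyping, you will naturally want heatmaps that show, for each omics layer, how the features behave across the subtypes.
Some preparation is needed first.
Convert the methylation β-value matrix to an M-value matrix; the authors recommend this because it displays better
Standardize the data; this is routinely done before drawing a heatmap and is essentially performed with scale, followed by truncation, for example limiting all values to [-2, 2], with anything above 2 shown as 2 and anything below -2 shown as -2
# convert the β-value matrix to an M-value matrix
indata <- mo.data
indata$meth.beta <- log2(indata$meth.beta / (1 - indata$meth.beta))
# standardize the data
plotdata <- getStdiz(data       = indata,
                     halfwidth  = c(2,2,2,NA), # no truncation for mutation
                     centerFlag = c(T,T,T,F), # no center for mutation
                     scaleFlag  = c(T,T,T,F)) # no scale for mutation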
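The two preparation steps can be illustrated in base R on a tiny matrix. This is a sketch under assumptions: it standardizes each feature (row) with scale() and then clips at a half-width, which is my reading of the description above rather than a confirmed copy of getStdiz(). The toy uses a half-width of 1 instead of 2 so that clipping actually triggers with only four samples.

```r
# Toy beta-value matrix (hypothetical): 2 probes x 4 samples
beta <- matrix(c(0.5, 0.8, 0.2, 0.5,
                 0.1, 0.9, 0.5, 0.5),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("cg1", "cg2"), paste0("S", 1:4)))

# Beta to M value: beta = 0.5 maps to M = 0
m.val <- log2(beta / (1 - beta))

# Center and scale each feature (row), then clip to [-halfwidth, halfwidth]
halfwidth <- 1
z <- t(scale(t(m.val), center = TRUE, scale = TRUE))
z[z >  halfwidth] <-  halfwidth
z[z < -halfwidth] <- -halfwidth
```

Clipping keeps a few extreme values from dominating the color scale, which is why it is usually skipped for binary layers such as the mutation matrix.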
Here we use the iClusterBayes result for display. First extract the selected features for each omics layer, then pick the top 10 features of each layer to annotate:
feat  <- iClusterBayes.res$feat.res
feat1 <- feat[which(feat$dataset == "mRNA.expr"),][1:10,"feature"]
feat2 <- feat[which(feat$dataset == "lncRNA.expr"),][1:10,"feature"]
feat3 <- feat[which(feat$dataset == "meth.beta"),][1:10,"feature"]
feat4 <- feat[which(feat$dataset == "mut.status"),][1:10,"feature"]
annRow <- list(feat1, feat2, feat3, feat4)
All that remains is to draw the figure. Under the hood this relies on ComplexHeatmap, with much of the process simplified for you. The result is saved automatically in the current working directory. The default MOVICS figures look quite good, possibly better than ones you would draw yourself.
# custom colors for the heatmap of each omics layer (optional)
mRNA.col   <- c("#00FF00", "#008000", "#000000", "#800000", "#FF0000")
lncRNA.col <- c("#6699CC", "white"  , "#FF3C38")
meth.col   <- c("#0074FE", "#96EBF9", "#FEE900", "#F00003")
mut.col    <- c("grey90" , "black")
col.list   <- list(mRNA.col, lncRNA.col, meth.col, mut.col)
# comprehensive heatmap (may take a while)
getMoHeatmap(data          = plotdata,
             row.title     = c("mRNA","lncRNA","Methylation","Mutation"),
             is.binary     = c(F,F,F,T), # the 4th data is mutation which is binary
             legend.name   = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
             clust.res     = iClusterBayes.res$clust.res, # cluster results
             clust.dend    = NULL, # no dendrogram
             show.rownames = c(F,F,F,F), # specify for each omics data
             show.colnames = FALSE, # show no sample names
             annRow        = annRow, # mark selected features
             color         = col.list,
             annCol        = NULL, # no annotation for samples
             annColors     = NULL, # no annotation color
             width         = 10, # width of each subheatmap
             height        = 5, # height of each subheatmap
             fig.name      = "COMPREHENSIVE HEATMAP OF ICLUSTERBAYES")
[Figure: comprehensive heatmap of the iClusterBayes result]
The above shows the result of the iClusterBayes clustering. You can pick any other algorithm as well; after all, we have 10 of them.
For example, to display the COCA result the usage is exactly the same, and the figure is again saved automatically:
# comprehensive heatmap (may take a while)
getMoHeatmap(data        = plotdata,
             row.title   = c("mRNA","lncRNA","Methylation","Mutation"),
             is.binary   = c(F,F,F,T), # the 4th data is mutation which is binary
             legend.name = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
             clust.res   = moic.res.list$COCA$clust.res, # cluster results
             clust.dend  = moic.res.list$COCA$clust.dend, # show dendrogram for samples
             color       = col.list,
             width       = 10, # width of each subheatmap
             height      = 5, # height of each subheatmap
             fig.name    = "COMPREHENSIVE HEATMAP OF COCA")
[Figure: comprehensive heatmap of the COCA result]
To display several clinical variables, simply add them as sample annotations. Note that the custom color mapping for a continuous variable (age here) is built with circlize:
# extract PAM50, pathologic stage and age for sample annotation
annCol <- surv.info[,c("PAM50", "pstage", "age"), drop = FALSE]
# generate corresponding colors for sample annotation
annColors <- list(age    = circlize::colorRamp2(breaks = c(min(annCol$age),
                                                           median(annCol$age),
                                                           max(annCol$age)),
                                                colors = c("#0000AA", "#555555", "#AAAA00")),
                  PAM50  = c("Basal"  = "blue",
                             "Her2"   = "red",
                             "LumA"   = "yellow",
                             "LumB"   = "green",
                             "Normal" = "black"),
                  pstage = c("T1" = "green",
                             "T2" = "blue",
                             "T3" = "red",
                             "T4" = "yellow",
                             "TX" = "black"))
# comprehensive heatmap (may take a while)
getMoHeatmap(data          = plotdata,
             row.title     = c("mRNA","lncRNA","Methylation","Mutation"),
             is.binary     = c(F,F,F,T), # the 4th data is mutation which is binary
             legend.name   = c("mRNA.FPKM","lncRNA.FPKM","M value","Mutated"),
             clust.res     = cmoic.brca$clust.res, # consensusMOIC results
             clust.dend    = NULL, # show no dendrogram for samples
             show.rownames = c(F,F,F,F), # specify for each omics data
             show.colnames = FALSE, # show no sample names
             show.row.dend = c(F,F,F,F), # show no dendrogram for features
             annRow        = NULL, # no selected features
             color         = col.list,
             annCol        = annCol, # annotation for samples
             annColors     = annColors, # annotation color
             width         = 10, # width of each subheatmap
             height        = 5, # height of each subheatmap
             fig.name      = "COMPREHENSIVE HEATMAP OF CONSENSUSMOIC")
[Figure: comprehensive heatmap of the consensus result with clinical annotations]
Pretty impressive, isn't it?
That concludes the first part. Next comes exploring and comparing the different subtypes.