R语言简介和肿瘤基因组学常见图的R实现

2016年6月3号

希望达到的目标：

通过R语言来重现2016年年初在Cancer Research发表的544例胃癌基因组文章。

Xiangchun Li, William K.K.Wu, Rui Xing et al. Distinct Subtypes of Gastric Cancer Defined by Molecular Characterization Include Novel Mutational Signatures with Prognostic Capability. Cancer Research 2016; doi:1158/0008-5472.CAN-15-2443.)

Outline

R语言基础
Identification of significantly mutated genes (SMGs)
Mutational landscape of SMGs.
Mutational signature analysis & visualization
Kaplan-Meier survival & Cox regression analyses
Consensus clustering based on nonnegative matrix factorization (NMF)

The R codes in the PPT have been tested with R 3.2.2. You may need to install R 3.2 or newer to proceed with this tutorial.

For toolkits contributed by me visit https://github.com/lixiangchun.

R语言编程环境和资料

R Project: https://www.r-project.org/
Rstudio: https://www.rstudio.com/ (非常好用的R编程环境)
CRAN: https://cran.r-project.org/
Bioconductor: https://www.bioconductor.org/
Github: https://github.com/

操作：下载安装Rstudio

安装必要的软件包

## On MacOS or Linux
install.packages("devtools")
## On Windows, you need to install additional packages
install.packages("devtools")
install.packages("RCurl")
install.packages("httr")
install.packages("rgl")
library(RCurl)
library(httr)
set_config( config( ssl_verifypeer = 0L ) )
options(download.file.method = "wininet") ## Windows only

library(devtools)
install_github("lixiangchun/lxctk")
install_github("lixiangchun/plotr")
library("lxctk")
library("plotr")

R语言简介

初识R语言

1.1 交互式方式使用R (Unix环境下)

$ mkdir work  ## 新建一个目录
$ cd work     ## 进入到新建目录
$ R           ## 在Terminal中运行R
> q()         ## 退出R

1.2 获取函数的帮助文档

> ?lm                    ## 查看lm的帮助文档(同上)
> library(survival)      ## 加载一个R包
> help(package=survival) ## 查看survival包的整体概要

1.3 查看和删除内存中的变量

> x = c(1,2,3)
> ls()           ## 查看有哪些变量
> rm(x, y)       ## 删除指定变量

第2部分 - 简单操作，数字和向量

2.1 向量和赋值(Vectors and assignment)

> x <- c(10.4, 5.6, 3.1, 6.4, 21.7)
> x = c(10.4, 5.6, 3.1, 6.4, 21.7)
> assign("x", c(10.4, 5.6, 3.1, 6.4, 21.7))
> c(10.4, 5.6, 3.1, 6.4, 21.7) -> x
> y <- c(x, 0, x)  # 使用已有向量构建新的向量

2.2 向量运算

> v <- 2*x + y + 1
> cv <- sd(v) / mean(v) ## 变异系数

第2部分 - 简单操作，数字和向量

2.3 产生规则序列(Generating regular sequences)

> seq(-5, 5, by=.2) -> s3
> s4 <- seq(length=51, from=-5, by=.2)
> ?seq ## more information about seq(...)
> s5 <- rep(x, times=5) ##  five copies of x end-to-end in s4
> s6 <- rep(x, each=5) ## repeat each element in x five times

第2部分 - 简单操作，数字和向量

2.4 逻辑型向量(Logical vectors)

x <- c(1,2,5,8,20)
temp <- x > 13
print(temp)

## [1] FALSE FALSE FALSE FALSE  TRUE

print( (7==9) | (7>0)  )

## [1] TRUE

print(( 7==9) & (7>0)  )## 常见运算符：<, <=, >, >=, ==, &, |, !

## [1] FALSE

第2部分 - 简单操作，数字和向量

2.5 缺失值(Missing values)

z <- c(1:3,NA)   # NA: Not Available
ind <- is.na(z)
print(z)

## [1]  1  2  3 NA

print(ind)

## [1] FALSE FALSE FALSE  TRUE

print(0/0)  # NaN: Not a Number. print(Inf - Inf)??

## [1] NaN

第2部分 - 简单操作，数字和向量

2.6 字符向量的建立

Z <- c("green", "blue", "red")
print(Z)

## [1] "green" "blue"  "red"

labs <- paste(LETTERS[1:8], letters[1:8], sep="->")
print(labs)

## [1] "A->a" "B->b" "C->c" "D->d" "E->e" "F->f" "G->g" "H->h"

n <- length(labs) # 获取元素个数
print(n)

## [1] 8

第2部分 - 简单操作，数字和向量

2.7 因子向量的建立

a <- c("green","green","red","blue","green","red")
a <- factor(a)
print(a)

## [1] green green red   blue  green red  
## Levels: blue green red

b <- c(1,1,4,4,3,2,3,2,4)
b <- factor(b)
print(b)

## [1] 1 1 4 4 3 2 3 2 4
## Levels: 1 2 3 4

第2部分 - 简单操作，数字和向量

2.8 动态改变对象（数组）长度

e <- numeric()
print(e)

## numeric(0)

print(length(e))

## [1] 0

e[1] <- 1
e[3] <- 3
e[8] <- 8
print(e)

## [1]  1 NA  3 NA NA NA NA  8

第2部分 - 简单操作，数字和向量

2.9 数据框的建立

a1 <- c(5,6,8)
a2 <- 1:3
a3 <- c("a","b","c")
d <- data.frame(a1, a2, a3)
d$a4 <- c("A","B","C") # 给数据框添加新变量
d$a5 <- c(3, NA, 5)
print(d)

##   a1 a2 a3 a4 a5
## 1  5  1  a  A  3
## 2  6  2  b  B NA
## 3  8  3  c  C  5

更多演示：更改数据框的行名和列名

第2部分 - 简单操作，数字和向量

2.10 文件读取

infile <- system.file('data/stad_regular_GC_originalGenomes.txt', 
              package = "lxctk")
d <- read.table(infile)

查看d的属性

print(mode(d))

## [1] "list"

查看d的行数和列数

print(dim(d))

## [1]  96 455

重现文章中的部分分析和图

加载数据和R包

## Loading all required package in this section
# lxctk is my personal toolkit that has been 
#+tested and widely used in my group. It can 
#+be installed from Github
library(lxctk)

# Misc R codes with my modification
library(plotr)

# Loading required data set in my R package. 
# This dataset includes clinicopathological 
#+information for regular-mutated gastric 
#+cancer and mutation status of SMGs. 
data("RegularMutatedGC")

Identification of SMGs - basic principles

Identification of SMGs

MutSigCV
Implemented in MATLAB, use 8Gb memory.
MutSigCL - MutSig by clustered mutations

# Available in my R package
mutsigclfn(
  bkgrSQLiteDB,
  obs_data,
  outfile = 'out.txt',
  type = "CL",
  nperm = 10000,
  mc.cores = 4,
  bkgr_data = dbConnect(dbDriver('SQLite'), bkgrSQLiteDB)
  )

MutSigFN - MutSig by functional mutation

## Available in my R package
mutsigclfn(
  bkgrSQLiteDB,
  obs_data,
  outfile = 'out.txt',
  type = 'FN',
  nperm = 10000,
  mc.cores = 4,
  bkgr_data = dbConnect(dbDriver('SQLite'), bkgrSQLiteDB)
  )

Mutational landscape of SMGs

Prepare required inputs

# Define a legend panel, different integer represents different
#+mutation types. E.g. 2 - Synonymous, 3 - Missense, ...
# '0' is used by default to represent synonymous mutation.
legend.panel <- data.frame(V1=c(7,6,5,4,3,2), 
V2=c("Nonsense","Splice site","Frame shift",
     "Inframe indel","Missense","Syn."))
# The molecular subtype
nmfsubtypes <- RegularMutatedGC$nmf_clustid
# Mutation matrix of SMGs across samples.
x <- RegularMutatedGC[, 15:45] # Columns are SMGs.
# Sort x column-by-column for better visualization
x <- sortDataFrame(x, decreasing = TRUE)
# Color codes will be used in visualizing matrix 'x'.
cols <- c('white','grey88','#644B39','forestgreen',
  '#FF8B00','#9867CC','#DB1C00')

Mutational landscape of SMGs

Plot mutational landscape of SMGs

# Passed processed inputs to oncoprinter(...)
# Parameters:
#  x:  matrix recording SMG mutation category (column) in tumor
#+samples (row).
#  cols: colors used for mutation categories
#  legend.panel: legend panel mapping mutation category to 
#+mutation type (or name)

oncoprinter(x, cols, legend.panel)

Mutational landscape of SMGs

An example of SMG mutation landscape

oncoprinter(x, cols, legend.panel)

Mutational landscape of SMGs

How to visualize SMG mutation landscape by subtypes?

xx <- sort.data.frame.by.index(x, nmfsubtypes)
oncoprinter(xx, cols, legend.panel)

Kaplan-Meier survial & Cox analyses

Prepare inputs for KM analysis

d <- RegularMutatedGC
# Select specific items
d <- d[d$Sex!="" & d$Stage!="" & (d$Lauren_subtype == "diffuse" |
                        d$Lauren_subtype == 'intestinal'),]
# Make sure that the data type is appropriate
d$Sex <- factor(d$Sex, levels=c('F','M')) # Use female as ref group
# Use intestinal as reference group
d$Lauren_subtype <- factor(d$Lauren_subtype, 
                      levels=c('intestinal', 'diffuse'))
d$Stage <- factor(d$Stage, levels=c('Early','Late'))
d$nmf_clustid <- factor(d$nmf_clustid, levels=c(1,2))
d$cohort <- d$Study
# Only TCGA and Tianjin cohorts have both Surv_time and vital_status. 
d$cohort[d$cohort!='TCGA'] <- 'Asian'
d$cohort <- factor(d$cohort,levels=c('Asian','TCGA'))

Kaplan-Meier survival analysis

## Colors used for C1/2
C1.col <- '#BE5C81'
C2.col <- '#495456'
## Survival analysis
plot.survfit.lixc(Surv(Surv_time, vital_status) ~ nmf_clustid, d)

Kaplan-Meier survival analysis

## More controls on KM survival analysis
## Parameters:
##  legend.labels:  legends for survival curves
##  lwd:  line width
##  legend.xycoord: coordinates of the legend
##  ci.lab.xycoord: coordinates of p-value
##  col: colors
## For more info, please read its manpage.
plot.survfit.lixc(Surv(Surv_time, vital_status) ~ nmf_clustid, d,
    legend.labels = c("C1","C2"), ci.lab.xycoord=c(30,0.15),
    legend.xycoord=c(15,0.4), no.ci.lab = TRUE, lwd=3, 
    col = c(C1.col, C2.col), xlim=c(0,80), ylim=c(0,1.1))

Kaplan-Meier survival analysis

Kaplan-Meier survival analysis with risk table

## Survival fit
fit <- survfit(Surv(Surv_time, vital_status) ~ nmf_clustid, d)
## ggsurv
ggsurv(fit)
## More controls on ggsurv 
##  legend.labels:  legend labels
##  col.surv:       colors for survival curves
##  confin:         remove survival curve confidence intervals
##  pval:           coordinate to place p-value
##  pval.size:      font size of p-value
##  ggdefault:      use ggplot default color scheme
##  grid:           add grid
ggsurv(fit, legend.labels = c("C1","C2"), col.surv = c(C1.col,
    C2.col), confin = FALSE, pval = c(9, 0.3), pval.size = 4,
    ggdefault = TRUE, grid = TRUE)

Univariate Cox regression analysis

## Univariate Cox analysis in a single function. A forest plot is
#+produced to exhibit hazards ratio, associated 95% CIs and p-values.
unicoxph(Surv(Surv_time, vital_status) ~ Age + Sex + Lauren_subtype
         + Stage + nmf_clustid, d)

## More control in unicoxph
## Variable names to be displayed on the forest plot
variables_to_display <- c("Age", "Sex (Male vs. Female)", 
                          "Lauren's subtype (D vs. I)",
                          "Stage (III/IV vs. I/II",
                          "Mol. subtype (C2 vs. C1)")
unicoxph(Surv(Surv_time, vital_status) ~ Age + Sex + Lauren_subtype
         + Stage + nmf_clustid, d, variables_to_display)

Univariate Cox regression analysis

Multivariate Cox regression analysis

## Use coxph to perform multivariate Cox regression analysis
mcox <- coxph(Surv(Surv_time, vital_status) ~ Age + Sex +
                Lauren_subtype + Stage + nmf_clustid, d)

## Use forest plot to visualize mcox
plot.coxph(coxObj = mcox)

## Use user-specified variable names
plot.coxph(coxObj = mcox, variable.display = variables_to_display)

## For more info about coxph, see its manpage in R survival package.

Multivariate Cox regression analysis

Mutational signature deciphering

An example of mutational signature deciphering can be found at http://lixiangchun.github.io/msig/

Mutational signature deciphering

msig_cols <- c('#702E33','#BE5C81','#836E8D','#EC8C70','#596C22',
               '#A7B762')

data('plot.mutation.signature.ex')

# If you want to save figure to file, supply filename to `figpdf`.
plot.mutation.signature(df, figpdf=NULL, color=msig_cols)

# 
plot.mutational.exposures(NULL,
    system.file('data/Rank_eq_9.exposures.txt', package = 'lxctk'))

# legoplot representation of 96 mutational categories
legoplot(system.file('data/stad_regular_GC_originalGenomes.txt'))

Plotting mutational signatures

msig_cols <- c('#702E33','#BE5C81','#836E8D','#EC8C70',
               '#596C22','#A7B762')
data('plot.mutation.signature.ex')
plot.mutation.signature(df, figpdf=NULL, color=msig_cols)

Mutational signature deciphering

plot.mutational.exposures(NULL,
    system.file('data/Rank_eq_9.exposures.txt', package = 'lxctk'))

希望达到的目标：

Outline

R语言编程环境和资料

安装必要的软件包

R语言简介

初识R语言

第2部分 - 简单操作，数字和向量

第2部分 - 简单操作，数字和向量

第2部分 - 简单操作，数字和向量

第2部分 - 简单操作，数字和向量

第2部分 - 简单操作，数字和向量

第2部分 - 简单操作，数字和向量

第2部分 - 简单操作，数字和向量

第2部分 - 简单操作，数字和向量

第2部分 - 简单操作，数字和向量

重现文章中的部分分析和图

加载数据和R包

Identification of SMGs - basic principles

Identification of SMGs

Mutational landscape of SMGs

Prepare required inputs

Mutational landscape of SMGs

Plot mutational landscape of SMGs

Mutational landscape of SMGs

An example of SMG mutation landscape

Mutational landscape of SMGs

How to visualize SMG mutation landscape by subtypes?

Kaplan-Meier survial & Cox analyses

Prepare inputs for KM analysis

Kaplan-Meier survival analysis

Kaplan-Meier survival analysis

Kaplan-Meier survival analysis

Kaplan-Meier survival analysis

Kaplan-Meier survival analysis with risk table

Univariate Cox regression analysis

Univariate Cox regression analysis

Multivariate Cox regression analysis

Multivariate Cox regression analysis

Mutational signature deciphering

Mutational signature deciphering

Plotting mutational signatures

Mutational signature deciphering

谢谢大家！