Title: | Linear Predictor Score, for Binary Inference from Multiple Continuous Variables |
---|---|
Description: | An implementation of the Linear Predictor Score approach, as initiated by Radmacher et al. (J Comput Biol 2001) and enhanced by Wright et al. (PNAS 2003) for gene expression signatures. Several tools for unsupervised clustering of gene expression data are also provided. |
Authors: | Sylvain Mareschal |
Maintainer: | Sylvain Mareschal <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.0.17 |
Built: | 2024-11-01 11:15:29 UTC |
Source: | https://github.com/maressyl/r.lps |
This function draws a heat map ordered according to hierarchical clusterings, similarly to heatmap
. It offers more control on layout and allows multiple row annotations.
hclust.ward
is derivated from 'stats' package hclust
, with an alternative default (as arguments can not be passed to it).
dist.COR
mimics 'stats' package dist
, computing distances as 1 - Pearson's correlation coefficient.
clusterize(expr, side = NULL, cex.col = NA, cex.row = NA, mai.left = NA, mai.bottom = NA, mai.right = 0.1, mai.top = 0.1, side.height = 1, side.col = NULL, side.srt = 0, side.cex = 1, col.heatmap = heat(), zlim = "0 centered", zlim.trim = 0.02, norm = c("rows", "columns", "none"), norm.clust = TRUE, norm.robust = FALSE, customLayout = FALSE, getLayout = FALSE, plot = TRUE, widths = c(1, 4), heights = c(1, 4), order.genes = NULL, order.samples = NULL, fun.dist = dist.COR, fun.hclust = hclust.ward, clust.genes = NULL, clust.samples = NULL) dist.COR(input) hclust.ward(input)
clusterize(expr, side = NULL, cex.col = NA, cex.row = NA, mai.left = NA, mai.bottom = NA, mai.right = 0.1, mai.top = 0.1, side.height = 1, side.col = NULL, side.srt = 0, side.cex = 1, col.heatmap = heat(), zlim = "0 centered", zlim.trim = 0.02, norm = c("rows", "columns", "none"), norm.clust = TRUE, norm.robust = FALSE, customLayout = FALSE, getLayout = FALSE, plot = TRUE, widths = c(1, 4), heights = c(1, 4), order.genes = NULL, order.samples = NULL, fun.dist = dist.COR, fun.hclust = hclust.ward, clust.genes = NULL, clust.samples = NULL) dist.COR(input) hclust.ward(input)
expr |
A numeric matrix, holding features (genes) in columns and observations (samples) in rows. Rows and columns will be ordered according to hierarchical clustering results. |
side |
To be passed to |
cex.col |
To be passed to |
cex.row |
To be passed to |
mai.left |
To be passed to |
mai.bottom |
To be passed to |
mai.right |
To be passed to |
mai.top |
To be passed to |
side.height |
To be passed to |
side.col |
To be passed to |
side.srt |
To be passed to |
side.cex |
To be passed to |
col.heatmap |
To be passed to |
zlim |
To be passed to |
zlim.trim |
To be passed to |
norm |
To be passed to |
norm.clust |
Single logical value, whether to apply normalization before clustering or after. Normalization applied depends on |
norm.robust |
To be passed to |
customLayout |
Single logical value, as |
getLayout |
Single logical value, whether to only return the |
plot |
To be passed to |
widths |
To be passed to |
heights |
To be passed to |
order.genes |
A function taking the gene dendrogram and |
order.samples |
A function taking the sample dendrogram and |
fun.dist |
A function to be used for distance computation in clustering. Default value uses 1 - Pearson's correlation as distance. See |
fun.hclust |
A function to be used for agglomeration in clustering. See |
clust.genes |
If not |
clust.samples |
If not |
input |
clusterize
invisibly returns the same list as heat.map
, plus :
genes |
The gene dendrogram. |
samples |
The sample dendrogram. |
See hclust
and dist
respectively for the other functions.
Sylvain Mareschal
heat.map
, heatmap
, hclust
, dist
# Data with features in columns data(rosenwald) group <- rosenwald.cli$group expr <- t(rosenwald.expr)[,1:100] # NA imputation (feature's mean to minimize impact) f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x } expr <- apply(expr, 2, f) # Simple heat map clusterize(expr) # With annotation (row named data.frame) side <- data.frame(group, row.names=rownames(expr)) clusterize(expr, side=side)
# Data with features in columns data(rosenwald) group <- rosenwald.cli$group expr <- t(rosenwald.expr)[,1:100] # NA imputation (feature's mean to minimize impact) f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x } expr <- apply(expr, 2, f) # Simple heat map clusterize(expr) # With annotation (row named data.frame) side <- data.frame(group, row.names=rownames(expr)) clusterize(expr, side=side)
This function generates a ramp of colors for heat.map
derivated functions.
heat(colors = c("#8888FF", "#000000", "#FF4444"), n = 256, shapeFun = heat.exp, ...) heat.exp(n, part, base = 1.015) heat.lin(n, part)
heat(colors = c("#8888FF", "#000000", "#FF4444"), n = 256, shapeFun = heat.exp, ...) heat.exp(n, part, base = 1.015) heat.lin(n, part)
colors |
Character vector of length 3, determining starting, middle and final colors. |
n |
Single integer value, amount of colors / values to generate. |
shapeFun |
Function taking at least 2 arguments : |
... |
Further arguments to |
part |
Single integer, defined as 1 while generating colors between the first two boundaries, and 2 otherwise. |
base |
Single numeric value, base for exponential slope. |
heat
returns a character vector of colors in hexadecimal representation.
heat.lin
and heat.expr
return n
numeric values, defining a curve whose slope will be mimiced during color interpolation.
Sylvain Mareschal
heat.map
, clusterize
, predict.LPS
# Classical heatmap colors palette <- heat(c("green", "black", "red")) heat.scale(zlim=c(-2,2), col.heatmap=palette) # Two distinct shapes provided heat.scale(zlim=c(-2,2), col.heatmap=heat(shapeFun=heat.lin)) heat.scale(zlim=c(-2,2), col.heatmap=heat(shapeFun=heat.exp))
# Classical heatmap colors palette <- heat(c("green", "black", "red")) heat.scale(zlim=c(-2,2), col.heatmap=palette) # Two distinct shapes provided heat.scale(zlim=c(-2,2), col.heatmap=heat(shapeFun=heat.lin)) heat.scale(zlim=c(-2,2), col.heatmap=heat(shapeFun=heat.exp))
This function draws a heatmap from a matrix, similarly to image
. It also offers normalization and annotation features, with more control than heatmap
.
side
can provide multiple sample annotations, and are handled differently depending on their class :
are attributed grey shades from the minimum to the maximum, which are provided in the legend
have their levels attributed colors using a default or custom palette. Hexadecimal color codes starting with #
and color names known by R are used "as is".
are printed as is in a blank cell. Hexadecimal color codes starting with #
and color names known by R are used as background colors instead of text.
are ploted in dark (TRUE) or light (FALSE) gray, leaving NAs in white.
heat.map(expr, side = NULL, cex.col = NA, cex.row = NA, mai.left = NA, mai.bottom = NA, mai.right = 0.1, mai.top = 0.1, side.height = 1, side.col = NULL, side.srt = 0, side.cex = 1, col.heatmap = heat(), zlim = "0 centered", zlim.trim = 0.02, norm = c("rows", "columns", "none"), norm.robust = FALSE, customLayout = FALSE, getLayout = FALSE, font = c(1, 3), xaxt = "s", yaxt = "s")
heat.map(expr, side = NULL, cex.col = NA, cex.row = NA, mai.left = NA, mai.bottom = NA, mai.right = 0.1, mai.top = 0.1, side.height = 1, side.col = NULL, side.srt = 0, side.cex = 1, col.heatmap = heat(), zlim = "0 centered", zlim.trim = 0.02, norm = c("rows", "columns", "none"), norm.robust = FALSE, customLayout = FALSE, getLayout = FALSE, font = c(1, 3), xaxt = "s", yaxt = "s")
expr |
A numeric matrix, holding features (genes) in columns and observations (samples) in rows. Column and row order will not be altered. |
side |
An annotation |
cex.col |
Single numeric value, character exapansion factor for column names. |
cex.row |
Single numeric value, character exapansion factor for row names. |
mai.left |
Single numeric value, left margin in inches (for row names). Use |
mai.bottom |
Single numeric value, bottom margin in inches (for column names). Use |
mai.right |
Single numeric value, right margin in inches (for higher level functions). See |
mai.top |
Single numeric value, top margin in inches. See |
side.height |
Single numeric value, scaling factor for annotation track. |
side.col |
A function returning as many colors as requested by its sole argument, defining the colors to be used for |
side.srt |
Single numeric value, determining the string rotation angle when writing character side columns (default is 0, horizontal, 90 is suggested for vertical text on busy heat maps). |
side.cex |
Single numeric value, the character expansion factor to use for character side columns. |
col.heatmap |
Character vector of colors, to be used for the cells of the heat map. |
zlim |
Numeric vector of length two, defining minimal and maximal |
zlim.trim |
Single numeric value between 0 and 1, defining the proportion of extreme values (equally split on both sides) to remove before computing "0 centered" or "range" |
norm |
Single character value, normalization to be performed (use "none" to perform no normalization). "rows" will center and scale genes, while "columns" will center and scale samples. The functions used depend on |
norm.robust |
Single logical value, if |
customLayout |
Single logical value, as |
getLayout |
Single logical value, whether to only return the |
font |
Integer vector of length two, the |
xaxt |
Single letter, whether to print column labels ("s") or not ("n"). |
yaxt |
Single letter, whether to print row labels ("s") or not ("n"). |
Invisibly returns a named list :
zlim |
Final value of the |
col.heatmap |
Final value of the |
legend |
If |
cex.col |
Final value of the |
cex.row |
Final value of the |
mai.left |
Final value of the |
mai.bottom |
Final value of the |
Sylvain Mareschal
# Data with features in columns data(rosenwald) group <- rosenwald.cli$group expr <- t(rosenwald.expr)[,1:100] # NA imputation (feature's mean to minimize impact) f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x } expr <- apply(expr, 2, f) # Simple heat map heat.map(expr) # With annotation (row named data.frame) side <- data.frame(group, row.names=rownames(expr)) heat.map(expr, side=side)
# Data with features in columns data(rosenwald) group <- rosenwald.cli$group expr <- t(rosenwald.expr)[,1:100] # NA imputation (feature's mean to minimize impact) f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x } expr <- apply(expr, 2, f) # Simple heat map heat.map(expr) # With annotation (row named data.frame) side <- data.frame(group, row.names=rownames(expr)) heat.map(expr, side=side)
This function plots a color scale using a custom color palette, to legend heat.map
derivated functions.
heat.scale(zlim, col.heatmap, at = -10:10, labels = NULL, horiz = TRUE, robust = FALSE, customMar = FALSE, title=NA)
heat.scale(zlim, col.heatmap, at = -10:10, labels = NULL, horiz = TRUE, robust = FALSE, customMar = FALSE, title=NA)
zlim |
Numeric vector of length 2, minimum and maximum of values in the palette. Should correspond to |
col.heatmap |
Character vector of colors used in the heat map. Should correspond to |
at |
Numeric vector, values shown in the axis. |
labels |
Character vector as long as |
horiz |
Single logical value, whether to plot an horizontal or a vertical scale. |
robust |
Single logical value, whether to legend |
customMar |
Single logical value, whether to skip the call to |
title |
Single character value, the axis title to use ( |
Sylvain Mareschal
heat.map
, clusterize
, predict.LPS
This function trains a Linear Predictor Score model, given pre-computed coefficients. It uses data with known classes to fit the model.
It has numerous way to be called, and all the arguments are not mandatory. See the 'Examples' section.
LPS(data, coeff, response, k, threshold, formula, method = "fdr", ...)
LPS(data, coeff, response, k, threshold, formula, method = "fdr", ...)
data |
Continuous data used to retrieve classes, as a |
coeff |
Pre-computed coefficients for the model, as returned by |
response |
Already known classes for the samples provided in |
k |
Single |
threshold |
Single |
formula |
A |
method |
Single character value, to be passed to |
... |
Further arguments are passed to |
An object of (S3) class "LPS" :
coeff |
Named numeric vector, the coefficients used in the model. |
classes |
Character vector, the labels of the two groups to be predicted. |
scores |
List of two numeric vectors, training dataset scores sorted by group. |
means |
Numeric vector, score means of each group in the training dataset. |
sds |
Numeric vector, score |
ovl |
Numeric value, overlapping coefficient as returned by |
k |
Integer value, amount of features selected in the model (if relevant). |
p.threshold |
Numeric value, threshold used for feature selection (if relevant). |
p.method |
Character value, p-value correction used for feature selection (if relevant). |
As expression values are directly used in the score, gene centering and scaling are strongly recommended. For Affymetrix raw expression values (strictly positive, linear and absolute), Wright et al. suggests a multiplicative centering on a median of 1000 followed by a log2 transformation. For log-ratio, gene centering and scaling should not be necessary, as they are naturally 0-centered.
Using a numeric matrix as data
and a factor as response
is the fastest way to compute coefficients, if time consumption matters (as in cross-validation schemes). formula
is there only for consistency with R modeling functions, and to provide response
, k
or threshold
in a single way.
Sylvain Mareschal
Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. J Comput Biol. 2002;9(3):505-11.
Wright G, Tan B, Rosenwald A, Hurt EH, Wiestner A, Staudt LM. A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc Natl Acad Sci U S A. 2003 Aug 19;100(17):9991-6.
Bohers E, Mareschal S, Bouzelfen A, Marchand V, Ruminy P, Maingonnat C, Menard AL, Etancelin P, Bertrand P, Dubois S, Alcantara M, Bastard C, Tilly H, Jardin F. Targetable activating mutations are very frequent in GCB and ABC diffuse large B-cell lymphoma. Genes Chromosomes Cancer. 2014 Feb;53(2):144-53.
# Data with features in columns data(rosenwald) group <- rosenwald.cli$group expr <- t(rosenwald.expr) # NA imputation (feature's mean to minimize impact) f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x } expr <- apply(expr, 2, f) # Coefficients coeff <- LPS.coeff(data=expr, response=group) # 10 best features (straightforward) m <- LPS(data=expr, coeff=coeff, response=group, k=10) # 10 best features (formula) ### 'k' MUST be an integer, or will be understood as a 'threshold' ### Numbers are "numeric", enforce integer with "L" or "as.integer" m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~10L) k <- as.integer(10) m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~k) # FDR threshold thr <- 0.01 m <- LPS(data=expr, coeff=coeff, response=group, threshold=thr) m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~0.01) m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~thr) # Custom model m <- LPS(data=expr, coeff=coeff[ c("27481","17013") ,], response=group, k=2) m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~`27481`+`17013`) ### Notice backticks in formula for syntactically invalid names # Complete model m <- LPS(data=expr, coeff=coeff, response=group, k=ncol(expr)) m <- LPS(data=expr, coeff=coeff, response=group, threshold=1) ### m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~.) ### The last is correct but (really) slow on large datasets
# Data with features in columns data(rosenwald) group <- rosenwald.cli$group expr <- t(rosenwald.expr) # NA imputation (feature's mean to minimize impact) f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x } expr <- apply(expr, 2, f) # Coefficients coeff <- LPS.coeff(data=expr, response=group) # 10 best features (straightforward) m <- LPS(data=expr, coeff=coeff, response=group, k=10) # 10 best features (formula) ### 'k' MUST be an integer, or will be understood as a 'threshold' ### Numbers are "numeric", enforce integer with "L" or "as.integer" m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~10L) k <- as.integer(10) m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~k) # FDR threshold thr <- 0.01 m <- LPS(data=expr, coeff=coeff, response=group, threshold=thr) m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~0.01) m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~thr) # Custom model m <- LPS(data=expr, coeff=coeff[ c("27481","17013") ,], response=group, k=2) m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~`27481`+`17013`) ### Notice backticks in formula for syntactically invalid names # Complete model m <- LPS(data=expr, coeff=coeff, response=group, k=ncol(expr)) m <- LPS(data=expr, coeff=coeff, response=group, threshold=1) ### m <- LPS(data=as.data.frame(expr), coeff=coeff, formula=group~.) ### The last is correct but (really) slow on large datasets
As Linear Predictor Score coefficients are genuinely t statistics, this function provides a faster implementation for large datasets than using t.test
.
LPS.coeff(data, response, formula = ~1, type = c("t", "limma"), p.value = TRUE, log = FALSE, weighted = FALSE, ...)
LPS.coeff(data, response, formula = ~1, type = c("t", "limma"), p.value = TRUE, log = FALSE, weighted = FALSE, ...)
data |
Continuous data used to retrieve classes, as a |
response |
Already known classes for the samples provided in |
formula |
A |
type |
Single character value, "t" to compute genuine t statistics (unequal variances and unpaired samples) or "limma" to use the lmFit() and eBayes() t statistics from this microarray oriented Bioconductor package. |
p.value |
Single logical value, whether to compute (two-sided) p-values or not. |
log |
Single logical value, whether to log-transform t or not (sign will be preserved). Original description of the LPS does not include log-transformation, but it may be useful to not over-weight discriminant genes in large series. Values between -1 and 1 are transformed to 0 to avoid sign shifting, as it generally comes with non significant p-values. |
weighted |
Single logical value, whether to divide t (or log-transformed t) by gene mean or not. We recommend to normalize data only by samples and use |
... |
Further arguments are passed to |
Always returns a row named numeric matrix, with a "t" column holding statistics computed. If p.value
is TRUE, a second "p.value" column is added.
Using a numeric matrix as data
and a factor as response
is the fastest way to compute coefficients, if time consumption matters (as in cross-validation schemes). formula
was added only for consistency with other R modeling functions, and eventually to subset features to compute coefficients for.
Sylvain Mareschal
http://www.bioconductor.org/packages/release/bioc/html/limma.html
# Data with features in columns data(rosenwald) group <- rosenwald.cli$group expr <- t(rosenwald.expr) # All features, all samples k <- LPS.coeff(data=expr, response=group) k <- LPS.coeff(formula=group~1, data=as.data.frame(expr)) ### LPS.coeff(formula=group~., data=as.data.frame(expr), na.action=na.pass) ### The last is correct but (really) slow on large datasets # Feature subset, all samples k <- LPS.coeff(data=expr[, c("27481","17013") ], response=group) k <- LPS.coeff(formula=group~`27481`+`17013`, data=as.data.frame(expr)) ### Notice backticks in formula for syntactically invalid names # All features, sample subset training <- rosenwald.cli$set == "Training" ### training <- sample.int(nrow(expr), 10) ### training <- which(rosenwald.cli$set == "Training") ### training <- rownames(subset(rosenwald.cli, set == "Training")) k <- LPS.coeff(data=expr, response=group, subset=training) k <- LPS.coeff(formula=group~1, data=as.data.frame(expr), subset=training) # NA handling by model.frame() k <- LPS.coeff(formula=group~1, data=as.data.frame(expr), na.action=na.omit)
# Data with features in columns data(rosenwald) group <- rosenwald.cli$group expr <- t(rosenwald.expr) # All features, all samples k <- LPS.coeff(data=expr, response=group) k <- LPS.coeff(formula=group~1, data=as.data.frame(expr)) ### LPS.coeff(formula=group~., data=as.data.frame(expr), na.action=na.pass) ### The last is correct but (really) slow on large datasets # Feature subset, all samples k <- LPS.coeff(data=expr[, c("27481","17013") ], response=group) k <- LPS.coeff(formula=group~`27481`+`17013`, data=as.data.frame(expr)) ### Notice backticks in formula for syntactically invalid names # All features, sample subset training <- rosenwald.cli$set == "Training" ### training <- sample.int(nrow(expr), 10) ### training <- which(rosenwald.cli$set == "Training") ### training <- rownames(subset(rosenwald.cli, set == "Training")) k <- LPS.coeff(data=expr, response=group, subset=training) k <- LPS.coeff(formula=group~1, data=as.data.frame(expr), subset=training) # NA handling by model.frame() k <- LPS.coeff(formula=group~1, data=as.data.frame(expr), na.action=na.omit)
Quantify the overlap between gaussian distributions of the two group scores, to assess model efficiency (best models should not overlap, to prevent from false discovery).
OVL(means, sds, cutoff=1e-4, n=1e4)
OVL(means, sds, cutoff=1e-4, n=1e4)
means |
Numeric vector of two values, the means of the gaussian distributions. |
sds |
Numeric vector of two values, the standard deviations of the gaussian distributions. |
cutoff |
Single numeric value, minimal quantile for integration range definition (distributions will be considered between their |
n |
Single integer value, the amount of equi-distant points to use for the computation. The greater it is, the more precise the returned value will be. |
Returns the proportion of the overlap between the two gaussian distributions N1 and N2, i.e. min(N1, N2) / (N1 + N2)
.
Sylvain Mareschal
# Full overlap between identical distributions OVL(c(0,0), c(1,1)) # Increasing shift OVL(c(0,1), c(1,1)) OVL(c(0,2), c(1,1)) OVL(c(0,3), c(1,1)) OVL(c(0,10), c(1,1))
# Full overlap between identical distributions OVL(c(0,0), c(1,1)) # Increasing shift OVL(c(0,1), c(1,1)) OVL(c(0,2), c(1,1)) OVL(c(0,3), c(1,1)) OVL(c(0,10), c(1,1))
This function plots the distributions of the LPS scores in each group for a fitted LPS
object.
## S3 method for class 'LPS' plot(x, y, method=c("Wright", "Radmacher", "exact"), threshold = 0.9, values = FALSE, col.classes = c("#FFCC00", "#1144CC"), xlim, yaxt = "s", xlab = "LPS", ylab, las = 0, lwd = 2,...)
## S3 method for class 'LPS' plot(x, y, method=c("Wright", "Radmacher", "exact"), threshold = 0.9, values = FALSE, col.classes = c("#FFCC00", "#1144CC"), xlim, yaxt = "s", xlab = "LPS", ylab, las = 0, lwd = 2,...)
x |
An object of class |
y |
Single character value defining y axis : "density" or (bayesian) "probability". |
method |
Single character value, the method to use for predictions. See |
threshold |
Single numeric value, the confidence threshold to use for the "gray zone" (scores for which none of the two groups can be assigned with a probability greater than this threshold). See |
values |
Single logical value, whether to plot individual scores from the training series or not. |
col.classes |
Character vector of two values giving to each class a distinct color. |
xlim |
To be passed to |
yaxt |
|
xlab |
To be passed to |
ylab |
To be passed to |
las |
|
lwd |
|
... |
Sylvain Mareschal
# Data with features in columns data(rosenwald) group <- rosenwald.cli$group expr <- t(rosenwald.expr) # NA imputation (feature's mean to minimize impact) f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x } expr <- apply(expr, 2, f) # Coefficients coeff <- LPS.coeff(data=expr, response=group) # 10 best features model m <- LPS(data=expr, coeff=coeff, response=group, k=10) # Distributions of scores in each group plot(m, "density") # Probability for each group along the score axis plot(m, "probability", yaxt="s")
# Data with features in columns data(rosenwald) group <- rosenwald.cli$group expr <- t(rosenwald.expr) # NA imputation (feature's mean to minimize impact) f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x } expr <- apply(expr, 2, f) # Coefficients coeff <- LPS.coeff(data=expr, response=group) # 10 best features model m <- LPS(data=expr, coeff=coeff, response=group, k=10) # Distributions of scores in each group plot(m, "density") # Probability for each group along the score axis plot(m, "probability", yaxt="s")
This function allow predictions to be made from a fitted LPS
model and a new dataset.
It can also plot a gene expression heatmap to visualize results of the prediction.
## S3 method for class 'LPS' predict(object, newdata, type=c("class", "probability", "score"), method = c("Wright", "Radmacher", "exact"), threshold = 0.9, na.rm = TRUE, subset = NULL, col.lines = "#FFFFFF", col.classes = c("#FFCC00", "#1144CC"), plot = FALSE, side = NULL, cex.col = NA, cex.row = NA, mai.left = NA, mai.bottom = NA, mai.right = 1, mai.top = 0.1, side.height = 1, side.col = NULL, col.heatmap = heat(), zlim = "0 centered", norm = c("rows", "columns", "none"), norm.robust = FALSE, customLayout = FALSE, getLayout = FALSE, ...)
## S3 method for class 'LPS' predict(object, newdata, type=c("class", "probability", "score"), method = c("Wright", "Radmacher", "exact"), threshold = 0.9, na.rm = TRUE, subset = NULL, col.lines = "#FFFFFF", col.classes = c("#FFCC00", "#1144CC"), plot = FALSE, side = NULL, cex.col = NA, cex.row = NA, mai.left = NA, mai.bottom = NA, mai.right = 1, mai.top = 0.1, side.height = 1, side.col = NULL, col.heatmap = heat(), zlim = "0 centered", norm = c("rows", "columns", "none"), norm.robust = FALSE, customLayout = FALSE, getLayout = FALSE, ...)
object |
An object of class |
newdata |
Continuous data used to retrieve classes, as a |
type |
Single character value, return type of the predictions to be made ("class", "probability" or "score"). See 'Value' section. |
method |
Single character value, the method to use to make predictions ("Wright", "Radmacher" or "exact"). See 'Details' section. |
threshold |
Threshold to use for class prediction. "Wright" method was designed with 0.9, "Radmacher" method makes no use of the threshold. |
na.rm |
Single logical value, if TRUE samples with one or many |
subset |
A subsetting vector to apply on |
col.lines |
If |
col.classes |
If |
plot |
To be passed to |
side |
To be passed to |
cex.col |
To be passed to |
cex.row |
To be passed to |
mai.left |
To be passed to |
mai.bottom |
To be passed to |
mai.right |
To be passed to |
mai.top |
To be passed to |
side.height |
To be passed to |
side.col |
To be passed to |
col.heatmap |
To be passed to |
zlim |
To be passed to |
norm |
To be passed to |
norm.robust |
To be passed to |
customLayout |
To be passed to |
getLayout |
To be passed to |
... |
Ignored, just there to match the |
The "Compound covariate predictor" from Radmacher et al. (method
= "Radmacher") simply assign each sample to the closest group (comparing the sample score to the mean scores of each group in the training dataset).
The "Linear Predictor Score" from Wright et al. (method
= "Wright") modelizes scores in each training sub-group with a distinct gaussian distribution, and computes the probability for a sample to be in one of them or the other using a bayesian rule.
The "exact" mode is still under development and should not be used.
For a "class" type
, returns a character vector with group assignment for each new sample (possibly NA
), named according to data
row names.
For a "probability" type
, returns a numeric matrix with two columns (probabilities to be in each group) and a row for each new sample, row named according to data
row names and column named according to the group labels.
For a "score" type
, returns a numeric vector with LPS score for each new sample, named according to data
row names. Notice the score is the same for all method
s.
If plot
is TRUE
, returns the list returned by heat.map
, with data described above in the first unammed element.
Sylvain Mareschal
Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. J Comput Biol. 2002;9(3):505-11.
Wright G, Tan B, Rosenwald A, Hurt EH, Wiestner A, Staudt LM. A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc Natl Acad Sci U S A. 2003 Aug 19;100(17):9991-6.
# Data with features in columns data(rosenwald) group <- rosenwald.cli$group expr <- t(rosenwald.expr) # NA imputation (feature's mean to minimize impact) f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x } expr <- apply(expr, 2, f) # Coefficients coeff <- LPS.coeff(data=expr, response=group) # 10 best features model m <- LPS(data=expr, coeff=coeff, response=group, k=10) # Class prediction plot predict(m, expr, plot=TRUE) # Wright et al. class prediction table( group, prediction = predict(m, expr), exclude = NULL ) # More stringent threshold table( group, prediction = predict(m, expr, threshold=0.99), exclude = NULL ) # Radmacher et al. class prediction table( group, prediction = predict(m, expr, method="Radmacher"), exclude = NULL ) # Probabilities predict(m, expr, type="probability", method="Wright") predict(m, expr, type="probability", method="Radmacher") predict(m, expr, type="probability", method="exact") # Probability plot predict(m, expr, type="probability", plot=TRUE) # Annotated probability plot side <- data.frame(group, row.names=rownames(expr)) predict(m, expr, side=side, type="probability", plot=TRUE) # Score plot predict(m, expr, type="score", plot=TRUE)
# Data with features in columns data(rosenwald) group <- rosenwald.cli$group expr <- t(rosenwald.expr) # NA imputation (feature's mean to minimize impact) f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x } expr <- apply(expr, 2, f) # Coefficients coeff <- LPS.coeff(data=expr, response=group) # 10 best features model m <- LPS(data=expr, coeff=coeff, response=group, k=10) # Class prediction plot predict(m, expr, plot=TRUE) # Wright et al. class prediction table( group, prediction = predict(m, expr), exclude = NULL ) # More stringent threshold table( group, prediction = predict(m, expr, threshold=0.99), exclude = NULL ) # Radmacher et al. class prediction table( group, prediction = predict(m, expr, method="Radmacher"), exclude = NULL ) # Probabilities predict(m, expr, type="probability", method="Wright") predict(m, expr, type="probability", method="Radmacher") predict(m, expr, type="probability", method="exact") # Probability plot predict(m, expr, type="probability", plot=TRUE) # Annotated probability plot side <- data.frame(group, row.names=rownames(expr)) predict(m, expr, side=side, type="probability", plot=TRUE) # Score plot predict(m, expr, type="score", plot=TRUE)
This dataset contains 60 Diffuse Large B-Cell Lymphomas analysed on Lymphochip microarrays, as published by Rosenwald et al. The "Germinal Center B-cell like" and "Activated B-Cell like" subtypes, as determined by hierarchical clustering, were predicted by a LPS approach in Wright et al.
To minimize package size, values were rounded at 3 decimals and only 60 DLBCL from the 240 series were randomly selected (40 from the "Training" set, 20 from the "Validation" set), excluding "Type III" sub-types.
data(rosenwald)
data(rosenwald)
rosenwald.expr
is a numeric matrix of expression values, with probes in rows and samples in columns. Both dimensions are named, probes by there "UNIQID" and samples by there "LYM numbers". Many NA
values are present.
rosenwald.cli
is a data.frame with a row for each sample, and 4 factor
columns described below. Rows are named by samples "LYM numbers", in the same order than rosenwald.expr
.
the "Training" or "Validation" set the sample comes from.
the DLBCL sub-type that is to be predicted ("GCB" or "ABC").
follow-up of the patient, in years.
status of the patient at the end of the follow-up ("Dead" or "Alive").
Rosenwald A et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N Engl J Med. 2002 Jun 20;346(25):1937-47.
Wright G et al. A gene expression-based method to diagnose clinically distinct subgroups of diffuse large B cell lymphoma. Proc Natl Acad Sci U S A. 2003 Aug 19;100(17):9991-6.
This function generates color shades for each individual, according to their respective right-censored survival data (event occurred or not, after which follow-up time). This can prove useful to annotate heat maps with survival data.
Two color scales are used, one for right-censored individuals (lost of sight before the event occurs, yellow with default colors) and an other for individual with observed events (death, relapse ... black in default colors). Shades are generated according to their impact : fast events and long follow-ups without event have strong colors, while late events and short follow-up without event are light-colored.
surv.colors(time, event, eventColors = c("#000000", "#CCCCCC"), censColors = c("#FFFFEE", "#FFDD00"))
surv.colors(time, event, eventColors = c("#000000", "#CCCCCC"), censColors = c("#FFFFEE", "#FFDD00"))
time |
Numeric vector, the follow-up times of each individual (see |
event |
Logical vector, whether an event (death, relapse ...) occured at the end of each individual follow-up or not (see |
eventColors |
Character vector of length 2, the boundaries of the color scale to generate for individuals with events. |
censColors |
Character vector of length 2, the boundaries of the color scale to generate for right-censored individuals. |
Returns a character vector, named according to time
names.
Sylvain Mareschal
# Rosenwald's dataset (hand-picked prognostic probes) data(rosenwald) probes <- c("30580", "16006", "32315", "16978", "26588") expr <- t(rosenwald.expr[ probes ,]) # NA imputation (feature's mean to minimize impact) f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x } expr <- apply(expr, 2, f) # Survival colors surv <- with(rosenwald.cli, surv.colors(time=follow.up, event=status=="Dead")) # Color scale legend with(rosenwald.cli, surv.scale(time=follow.up, event=status=="Dead")) # Annotated clustering side <- data.frame(OS=surv, row.names=rownames(rosenwald.cli)) clusterize(expr, side=side)
# Rosenwald's dataset (hand-picked prognostic probes) data(rosenwald) probes <- c("30580", "16006", "32315", "16978", "26588") expr <- t(rosenwald.expr[ probes ,]) # NA imputation (feature's mean to minimize impact) f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x } expr <- apply(expr, 2, f) # Survival colors surv <- with(rosenwald.cli, surv.colors(time=follow.up, event=status=="Dead")) # Color scale legend with(rosenwald.cli, surv.scale(time=follow.up, event=status=="Dead")) # Annotated clustering side <- data.frame(OS=surv, row.names=rownames(rosenwald.cli)) clusterize(expr, side=side)
This function plots a color scale using a custom color palette, to legend surv.colors
annotations.
surv.scale(time, event, eventColors = c("#000000", "#CCCCCC"), censColors = c("#FFFFEE", "#FFDD00"))
surv.scale(time, event, eventColors = c("#000000", "#CCCCCC"), censColors = c("#FFFFEE", "#FFDD00"))
time |
Numeric vector, the follow-up times of each individual (see |
event |
Logical vector, whether an event (death, relapse ...) occured at the end of each individual follow-up or not (see |
eventColors |
Character vector of length 2, the boundaries of the color scale to generate for individuals with events. |
censColors |
Character vector of length 2, the boundaries of the color scale to generate for right-censored individuals. |
Sylvain Mareschal
surv.colors
, survival::Surv
# Rosenwald's dataset (hand-picked prognostic probes) data(rosenwald) probes <- c("30580", "16006", "32315", "16978", "26588") expr <- t(rosenwald.expr[ probes ,]) # NA imputation (feature's mean to minimize impact) f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x } expr <- apply(expr, 2, f) # Survival colors surv <- with(rosenwald.cli, surv.colors(time=follow.up, event=status=="Dead")) # Annotated clustering side <- data.frame(OS=surv, row.names=rownames(rosenwald.cli)) clusterize(expr, side=side) # Color scale legend with(rosenwald.cli, surv.scale(time=follow.up, event=status=="Dead"))
# Rosenwald's dataset (hand-picked prognostic probes) data(rosenwald) probes <- c("30580", "16006", "32315", "16978", "26588") expr <- t(rosenwald.expr[ probes ,]) # NA imputation (feature's mean to minimize impact) f <- function(x) { x[ is.na(x) ] <- round(mean(x, na.rm=TRUE), 3); x } expr <- apply(expr, 2, f) # Survival colors surv <- with(rosenwald.cli, surv.colors(time=follow.up, event=status=="Dead")) # Annotated clustering side <- data.frame(OS=surv, row.names=rownames(rosenwald.cli)) clusterize(expr, side=side) # Color scale legend with(rosenwald.cli, surv.scale(time=follow.up, event=status=="Dead"))