The Unofficial ENAR 2015 Itinerary Maker

It’s almost ENAR 2015! The final program is out with all the sessions. The last conference I went to, the International Stroke Conference, had a program planner hosted by abstracts online.

Although there are parts of this system I would like to change, I believe it is helpful for looking up sessions, presenters, and especially posters. Therefore, I introduce the Unofficial ENAR 2015 Itinerary Maker.

Functions and How to Use

Here is an example screen shot of the shiny app:

The functions are as follows:

• Type of Session – you can choose from different session types, whether you want to limit to posters or short courses
• Select the day to subset data
• Select a specific session
• Search – this text field uses grep (after lowercasing the field) to search the title and author fields for relevant text.
• Download – this will download a CSV file of the subsetted table. Note that this will not be exactly the displayed table, but a Google Calendar-friendly format. This will be discussed in the next section.
• Donate button – I put a good deal of work into this app and I believe it improves the conference. If you agree and would like to donate some money and/or a beer at ENAR 2015, I’d appreciate it.

Individual Talks

Each individual session can be added to a Google Calendar using the Add to my Google calendar button next to each session. For standard posters, this will add the poster as the entire poster session. For specific talks, it will not add the complete session, but simply that talk.

Feedback

If any information is incorrect, please let me know, either at @StrictlyStat or muschellij2@gmail.com. I spent a good deal of time cleaning the text from the PDF, so I believe it should be mostly correct, but I obviously did not capture any last-minute changes.

My Sessions

Please stop by the poster session, Poster Number 2b (PDF of poster), and if you’re interested in neuroimaging processing and using R, please sign up for the “T4: A Tutorial for Multisequence Clinical Structural Brain MRI” that we are running.

Code for app

The app is hosted on my GitHub along with the data used to run the app.

If the app crashes

A backup (or mirror) shiny app is located at http://162.129.13.127/ENAR_2015/.

Using Tables for Statistics on Large Vectors

This is the first post I’ve written in a while. I have been somewhat radio silent on social media, but I’m jumping back in.

Now, I work with brain images, which can have millions of elements (referred to as voxels). Many of these elements are zero (the background). We usually want to calculate basic statistics on the data, and I want to describe how you can speed up operations, or reduce memory requirements, when calculating many statistics on a large integer-valued vector by using summary tables.

Why to use Tables

Tables are relatively computationally expensive to calculate. They must operate over the entire vector, find the unique values, and bin the data into these values. Let $n$ be the length of the vector. For integer vectors (i.e., whole numbers), the number of unique values is typically much less than $n$. Therefore, the table can be stored much more efficiently than the entire vector.
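To illustrate the storage argument, here is a small Python sketch using collections.Counter as the analogue of R's table() (the specific vector is made up):

```python
from collections import Counter
import random

random.seed(20150301)
# an integer-valued "vector" with many repeats, like voxel intensities
vec = [random.randint(-10, 100) for _ in range(100_000)]

tab = Counter(vec)  # value -> frequency, the analogue of R's table()

# the table has at most 111 bins (values -10..100), however long the vector grows
assert len(tab) <= 111
assert sum(tab.values()) == len(vec)
```

The table's size is bounded by the number of distinct values, not the length of the data, which is the whole point for integer-valued vectors.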

Tables are sufficient statistics

You can think of the frequencies and bins as summary statistics for the entire distribution of the data. I will not discuss a formal proof here, but you can easily re-create the entire vector using the table (see epitools::expand.table for a function to do this), and thus the table is a sufficient (but not likely a minimal) statistic.
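A quick sketch of that reconstruction, as a Python analogue of epitools::expand.table (the toy vector is made up):

```python
from collections import Counter

vec = [3, 1, 3, 2, 3, 1]
tab = Counter(vec)

# rebuild the data from the frequency table: repeat each value by its count
expanded = [val for val, freq in tab.items() for _ in range(freq)]

# the multiset of values is recovered exactly (order is lost, as expected)
assert sorted(expanded) == sorted(vec)
```

Order is the only information the table discards, and order does not matter for the statistics discussed here.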

As a sufficient statistic, we can create any statistic that we’d like relatively easily. Now, R has very efficient functions for many statistics, such as the median and quantiles, so it may not be obvious why we’d want to rewrite some of these functions using tables.

I can think of 2 reasons: 1) you want to calculate many statistics on the data and don’t want to pass the vector in multiple times, and 2) you want to preprocess the data to summarize the data into tables to only use these in memory versus the entire vector.

Here are some examples where this question has been asked on StackOverflow: 1, 2 and on the R list-serv: 1. What we’re going to do is show some basic operations on tables to get summary statistics and show they agree.

R Implementation

Let’s make a large vector:

set.seed(20150301)
vec = sample(-10:100, size= 1e7, replace = TRUE)


Quantile function for tables

I implemented a quantile function for tables (of only type 1). The code takes in a table, creates the cumulative sum, extracts the unique values of the table, then computes and returns the quantiles.

quantile.table = function(tab, probs = c(0, 0.25, 0.5, 0.75, 1)){
  n = sum(tab)
  #### get CDF
  cs = cumsum(tab)
  ### get values (x)
  uvals = unique(as.numeric(names(tab)))

  # can add different types of quantile, but using default
  m = 0
  qs = sapply(probs, function(prob){
    np = n * prob
    j = floor(np) + m
    g = np + m - j
    # type == 1
    gamma = as.numeric(g != 0)
    quant = uvals[min(which(cs >= j))]
    return(quant)
  })
  dig <- max(2L, getOption("digits"))
  names(qs) <- paste0(if (length(probs) < 100)
    formatC(100 * probs, format = "fg", width = 1, digits = dig)
    else format(100 * probs, trim = TRUE, digits = dig),
    "%")
  return(qs)
}
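The same idea in a Python sketch, so the logic is easy to check by hand (the helper name quantile_from_table and the toy data are mine; a type-1 quantile is just the order statistic at index ceiling(n * p)):

```python
from collections import Counter
import math

def quantile_from_table(tab, probs=(0, 0.25, 0.5, 0.75, 1)):
    """Type-1 quantiles computed only from a value -> frequency table."""
    vals = sorted(tab)
    n = sum(tab.values())
    # cumulative counts over the sorted unique values (the CDF on a grid)
    cum, total = [], 0
    for v in vals:
        total += tab[v]
        cum.append(total)
    out = {}
    for p in probs:
        idx = max(1, math.ceil(n * p))  # 1-based order statistic for type 1
        out[p] = next(v for v, c in zip(vals, cum) if c >= idx)
    return out

vec = [5, 1, 9, 1, 7, 3, 3, 3, 8, 2]
q = quantile_from_table(Counter(vec))
assert q == {0: 1, 0.25: 2, 0.5: 3, 0.75: 7, 1: 9}
```

Because the cumulative counts already encode the sorted order, no pass over the raw data is needed once the table exists.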


Quantile Benchmarks

Let’s benchmark the quantile functions: 1) creating the table and then getting the quantiles, 2) creating an empirical CDF function and then getting the quantiles, 3) computing the quantiles on the original data.

library(microbenchmark)
options(microbenchmark.unit='relative')
qtab = function(vec){
  tab = table(vec)
  quantile.table(tab)
}
qcdf = function(vec){
  cdf = ecdf(vec)
  quantile(cdf, type = 1)
}
# quantile(vec, type = 1)
microbenchmark(qtab(vec), qcdf(vec), quantile(vec, type = 1), times = 10L)

Unit: relative
                    expr       min        lq     mean    median       uq      max neval cld
               qtab(vec) 12.495569 12.052644 9.109178 11.589662 7.499691 5.481202    10   c
               qcdf(vec)  5.407606  5.802752 4.375459  5.553492 3.708795 2.653728    10  b
 quantile(vec, type = 1)  1.000000  1.000000 1.000000  1.000000 1.000000 1.000000    10 a


More realistic benchmarks

Not surprisingly, simply running quantile on the vector beats the other 2 methods, by far. So computational speed alone is not a reason to use a table. But if tables or CDFs are already created in a previous processing step, we should compare that procedure:

options(microbenchmark.unit = "relative")
tab = table(vec)
cdf = ecdf(vec)
all.equal(quantile.table(tab), quantile(cdf, type=1))

[1] TRUE

all.equal(quantile.table(tab), quantile(vec, type=1))

[1] TRUE

microbenchmark(quantile.table(tab), quantile(cdf, type=1), quantile(vec, type = 1), times = 10L)

Unit: relative
                    expr      min       lq     mean   median       uq      max neval cld
     quantile.table(tab)    1.000    1.000   1.0000    1.000   1.0000   1.0000    10  a
 quantile(cdf, type = 1)  774.885 1016.172 596.3217 1144.063 868.8105 198.1590    10   b
 quantile(vec, type = 1) 1029.696 1122.550 653.2146 1199.143 910.3743 206.5936    10   b


As we can see, if you had already computed the table, then you get the same quantiles as performing the operation on the vector, with much faster results. Using quantile on an ecdf object is not much better, mainly because the quantile function reconstructs the data vector and then calculates the quantiles:

stats:::quantile.ecdf

function (x, ...)
quantile(evalq(rep.int(x, diff(c(0, round(nobs * y)))), environment(x)),
    ...)
<bytecode: 0x107493e28>
<environment: namespace:stats>


Median for tables

Above we showed the quantile.table function, so a median function is trivial: it is just quantile.table with probs = 0.5:

median.table = function(tab){
  quantile.table(tab, probs = 0.5)
}


Mean of a table

Other functions can be used to calculate statistics on the table, such as the mean:

mean.table = function(tab){
  uvals = unique(as.numeric(names(tab)))
  sum(uvals * tab) / sum(tab)
}
mean.table(tab)

[1] 44.98991

mean(tab)

[1] 44.98991

mean(cdf)

Warning in mean.default(cdf): argument is not numeric or logical:
returning NA

[1] NA


As we see, we can simply use mean and do not need to define a new function for tables.

mean(vec)

[1] 44.98991

all.equal(mean(tab), mean(vec))

[1] TRUE
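For reference, the same weighted-mean trick in a Python sketch (Counter standing in for the R table; the data are made up):

```python
from collections import Counter
from statistics import mean

vec = [4, 4, 1, 9, 9, 9, 2]
tab = Counter(vec)

# mean from the table: sum(value * frequency) / total count
tab_mean = sum(v * f for v, f in tab.items()) / sum(tab.values())
assert abs(tab_mean - mean(vec)) < 1e-12
```

Each distinct value is multiplied by its frequency once, instead of being summed repeat by repeat.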


Subsetting tables

One problem with using mean vs. mean.table is when you subset the table or perform an operation that causes it to lose the attribute of the class of table. For example, let’s say I want to estimate the mean of the data for values $> 0$:

mean(vec[vec > 0])

[1] 50.50371

over0 = tab[as.numeric(names(tab)) > 0]
mean(over0)

[1] 90065.98

mean.table(over0)

[1] 50.50371

class(over0)

[1] "array"


We see that after subsetting, over0 is an array and not a table, so mean computes the mean using the array method, treating the frequencies as data, and the estimated mean is not correct. mean.table calculates the correct value, as it does not depend on the class of tab. Another way to circumvent this is to reassign the class of table to over0:

class(over0) = "table"
mean(over0)

[1] 50.50371


This process requires the user to know what the class is of the object passed to mean, and may not be correct if the user changes the class of the object.

Aside on NA values

Let’s see what happens when there are NAs in the vector. We’ll put in 20 NA values:

navec = vec
navec[sample(length(navec), 20)] = NA
natab = table(navec, useNA = "ifany")
nacdf = ecdf(navec)
mean(navec)

[1] NA

mean(natab)

[1] NA

# mean(nacdf)


We see that if we table the data with NA as a category, then any operation that returns NA when NAs are present will return NA. For example, if we tabulate the data with useNA="always", then the mean will be NA even though no NAs are present in the original vector. Also, ecdf objects do not keep track of NA values after they are computed.

tab2 = table(vec, useNA = "always")
mean(tab2)

[1] NA

nonatab = table(navec, useNA = "no")
mean(nonatab)

[1] 44.98993

mean(navec, na.rm=TRUE)

[1] 44.98993


If you are using tables for statistics, the equivalent of na.rm=FALSE is table(..., useNA="ifany"), and na.rm=TRUE is table(..., useNA="no"). We also see that ecdf objects never show NAs. Although we said tables are sufficient statistics, that may not be entirely correct depending on how you make the table when the data have missing values.
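The same distinction in a Python sketch, with None playing the role of NA (names and data are mine):

```python
from collections import Counter

navec = [1, 2, None, 2, None, 3]

# useNA = "ifany": missing values get their own bin, so the mean stays "NA"
natab = Counter(navec)
# useNA = "no" (the na.rm = TRUE analogue): drop missings before tabulating
nonatab = Counter(x for x in navec if x is not None)

assert None in natab and None not in nonatab
m = sum(v * f for v, f in nonatab.items()) / sum(nonatab.values())
assert m == (1 + 2 + 2 + 3) / 4
```

The choice of how to tabulate missing values is exactly the choice of na.rm, which is why it has to be made at table-creation time.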

Mean benchmark

Let’s benchmark the mean function, assuming we have pre-computed the table:

options(microbenchmark.unit=&quot;relative&quot;)
microbenchmark(mean(tab), mean(vec), times = 10L)

Unit: relative
expr      min       lq     mean   median       uq      max neval cld
mean(tab)   1.0000   1.0000   1.0000   1.0000   1.0000  1.00000    10  a
mean(vec) 374.0648 132.3851 111.2533 104.7355 112.7517 75.21185    10   b


Again, if we have the table pre-computed, then estimating means is much faster using the table.

Getting standard deviation

The mean example may be misleading when we try sd on the table:

sd(vec)

[1] 32.04476

sd(tab)

[1] 302.4951


These are not even remotely close. This is because sd operates on the table as if it were a vector of counts, not a frequency table.

Note, we cannot calculate sd from the ecdf object:

sd(cdf)

Error in as.double(x): cannot coerce type 'closure' to vector of type 'double'


SD and Variance for frequency table

We will create a function to run sd on a table:

var.table = function(tab){
  m = mean(tab)
  uvals = unique(as.numeric(names(tab)))
  n = sum(tab)
  sq = (uvals - m)^2
  ## sum of squared terms
  var = sum(sq * tab) / (n - 1)
  return(var)
}
sd.table = function(tab){
  sqrt(var.table(tab))
}
sd.table(tab)

[1] 32.04476


We compute the mean, take the squared differences, sum them weighted by the frequencies (sum(sq * tab)), and divide by n - 1 to get the variance; the sd is the square root of the variance.
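A Python sketch of the same computation, checked against the standard library's sample variance (the function name sd_from_table is mine):

```python
from collections import Counter
from statistics import variance
from math import sqrt, isclose

def sd_from_table(tab):
    n = sum(tab.values())
    m = sum(v * f for v, f in tab.items()) / n
    # frequency-weighted sum of squared deviations, divided by n - 1
    var = sum((v - m) ** 2 * f for v, f in tab.items()) / (n - 1)
    return sqrt(var)

vec = [2, 2, 5, 7, 7, 7, 11]
assert isclose(sd_from_table(Counter(vec)), sqrt(variance(vec)))
```

The squared deviation for each distinct value is computed once and weighted by its count, so the cost scales with the number of bins, not the length of the data.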

Benchmarking SD

Let’s similarly benchmark the data for sd:

options(microbenchmark.unit=&quot;relative&quot;)
microbenchmark(sd.table(tab), sd(vec), times = 10L)

Unit: relative
          expr      min       lq    mean   median       uq      max neval cld
 sd.table(tab)   1.0000   1.0000   1.000    1.000   1.0000   1.0000    10  a
       sd(vec) 851.8676 952.7785 847.225 1142.225 732.3427 736.2757    10   b


Mode of distribution

Another statistic we may want for tabular data is the mode. We can simply find the maximum frequency in the table. The multiple option returns multiple values if there is a tie for the maximum frequency.

mode.table = function(tab, multiple = TRUE){
  uvals = unique(as.numeric(names(tab)))
  ind = which.max(tab)
  if (multiple){
    ind = which(tab == max(tab))
  }
  uvals[ind]
}
mode.table(tab)

[1] 36
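The same logic in a Python sketch (the helper name table_mode is mine; Counter.most_common could also be used):

```python
from collections import Counter

def table_mode(tab, multiple=True):
    top = max(tab.values())
    modes = sorted(v for v, f in tab.items() if f == top)
    return modes if multiple else modes[0]

assert table_mode(Counter([1, 2, 2, 3, 3])) == [2, 3]   # tie -> both returned
assert table_mode(Counter([1, 2, 2]), multiple=False) == 2
```

The mode is the one statistic that is actually easier to get from a table than from the raw vector, since the frequencies are already computed.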


Memory of each object

We simply show the memory profile for using a table versus the entire vector:

format(object.size(vec), "Kb")

[1] "39062.5 Kb"

format(object.size(tab), "Kb")

[1] "7.3 Kb"

round(as.numeric(object.size(vec) / object.size(tab)))

[1] 5348


We see that the table is much smaller than the vector. Therefore, computing and storing summary tables for integer data can be much more efficient.

Conclusion

Tables are computationally expensive to create. If tables are pre-computed for integer data, however, then statistics can be calculated quickly and accurately, even if NAs are present. These tables are also much smaller in memory, so they can be stored with less space. This may be an important consideration for the computation and storage of large vectors in the future.

The average Stripe employee! Congrats to Alyssa!

Recently, my colleague and fellow blogger Alyssa Frazee accepted a job at Stripe. All of us at JHU Biostat are happy for her, yet sad to see her go.

While perusing Stripe’s website, I found the About page, where each employee has a photo of themselves. I’ve been playing around with some PCA and decompositions, so I figured I’d play around with these photos and make some principal components/eigenfaces. (I think it’s funny when people use the SVD/Eigenvalue decomposition in a new field and name the new thing the eigen-whatever.)

Extracting the HTML

Let’s note that Stripe uses https and not http for their website (not surprising, as they do secure payment systems).

library(RCurl)
library('httr')
library('XML')

url.stub = 'https://stripe.com/'


As they use https, you cannot simply read the data into R using readLines or other functions. For this, I used curl via the RCurl package. I defined my certificate file, got the page, extracted the content as a character vector (imaginatively named x), then parsed the HTML using the XML package.

cafile <- system.file('CurlSSL', 'cacert.pem', package = 'RCurl')
page <- GET(
url.stub,
config(cainfo = cafile)
)

x <- content(page, as='text')

#########################
# Parse HTML
#########################
doc <- htmlTreeParse(x, asText=TRUE, useInternal=TRUE)


Now that I have parsed the HTML document, I can use XPath. If you look at the source of the HTML, there is a div with the id of about, which contains all the links. The xpathSApply function takes the document, the XPath query, which says I want to go to that div, grab all img tags and then get the src.

#########################
# Get face URLs
#########################
stub = "//div[@id='about']"
urls = xpathSApply(doc,
  path = paste0(stub, '//img'),
  xmlGetAttr, 'src')
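The XPath step, sketched self-contained in Python on a made-up fragment shaped like the page (xml.etree stands in for the XML package; the real page needs the HTTPS fetch above):

```python
import xml.etree.ElementTree as ET

# a made-up fragment shaped like the about page
html = """<html><body>
  <div id="about">
    <img src="/img/about/alice.png"/>
    <img src="/img/about/bob.png"/>
  </div>
  <div id="jobs"><img src="/img/jobs/banner.png"/></div>
</body></html>"""

root = ET.fromstring(html)
# grab the div with id "about", then every img src under it
about = root.find(".//div[@id='about']")
srcs = [img.get("src") for img in about.iter("img")]
assert srcs == ["/img/about/alice.png", "/img/about/bob.png"]
```

Restricting to the about div first is what keeps unrelated images (like the one in the jobs div here) out of the result.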


I then created an output directory imgdir where I’ll store the images (stored as pngs). Below is just some checking to see if I have already downloaded (in case I had to re-run the code) and only downloads images I don’t already have.

img.urls = paste0(url.stub, urls)
out.imgs = file.path(imgdir, basename(img.urls))

stopifnot(!any(duplicated(img.urls)))
have = file.exists(out.imgs)
img.urls = img.urls[!have]
out.imgs = out.imgs[!have]
###########
##########
for (iimg in seq_along(img.urls)){
  download.file(img.urls[iimg], destfile = out.imgs[iimg],
    method = 'curl')
}


Again, since Stripe uses https, we cannot just use download.file with the default method. I again used curl to get the images. I (manually) downloaded and cropped the image from Alyssa’s biostat page to add her to the Stripe set.

Analyze the Images

I now take all the images and read them in using readPNG. readPNG returns an array, and the first 3 dimensions are the RGB values if the image is color; grayscale images are not 3D arrays, but none in this set are grayscale. The 4th dimension is the alpha level if there is opacity, but this information is discarded by the readPNG(img.f)[, , 1:3] statement.

library(png)
library(pixmap)
library(matrixStats)
imgs = list.files(imgdir, pattern = '.png$', full.names = TRUE)
n_imgs = length(imgs)
img.list = vector(mode = 'list', length = n_imgs)
for (iimg in seq(n_imgs)){
  img.f = imgs[iimg]
  img.list[[iimg]] = readPNG(img.f)[, , 1:3]
}


Same Image Size

To make things easier, I only kept images that were 200 pixels by 200 pixels, so each image was the same size. If you had images of different sizes, you may want to interpolate them to the same size and resolution.

dims = lapply(img.list, dim)
################################
# Don't feel like interpolating - only keeping 200x200x3
################################
dimg = c(200, 200, 3)
keep = sapply(dims, function(x) all(x == dimg))
img.list = img.list[keep]
imgs = imgs[keep]
dims = dims[keep]


We then make a matrix of 120,000 by N (N = 167), where the rows are the concatenated values from the red, green, and blue channels.

################################
# Making Matrix: P x N
################################
mat = t(sapply(img.list, c))
cmeans = colMeans(mat)
sds = colSds(mat)


Mean Image

A small function makeimg takes in a vector/matrix, creates an array of $200\times200\times3$, and plots the image using pixmapRGB from the pixmap package. Here we plot the “Average Striper”.

makeimg = function(x, trunc = FALSE, ...){
  x = array(x, dim = dimg)
  if (trunc){
    x[x < 0] = 0
    x[x > 1] = 1
  }
  plot(pixmapRGB(x), ...)
}
makeimg(cmeans, main = 'Average Striper')


PCA

Although this is what’s in common for Stripe pictures, let’s do a quick PCA (or equivalently SVD) to get the principal components after centering and scaling our data, to see what’s different:

#############
# Centering and scaling matrix
#############
X = t(t(mat) - cmeans)
X = t(t(X) / sds)
pca = prcomp(X, center = FALSE)


We can get the percent variance explained from the standardized eigenvalues (proportional to the squared deviances), or just use screeplot:

devs = pca$sdev^2 / sum(pca$sdev^2)
plot(1 - devs, main = 'Percent Variance Explained', type = 'l')

screeplot(pca)


Plot the PCs

We would need many components to recover a large percent of the variance of the data, but for illustration, we plot the mean image and the first 9 principal components (PCs).

V <- pca$rotation # sample PCs from PCA
################################
# Plotting Mean Image and PCs
################################

par(mfrow=c(2, 5))
par(oma = rep(2, 4), mar=c(0, 0, 3, 0))
makeimg(cmeans, main = 'Average Striper')

for (i in 1:9){
  makeimg(V[, i], main = paste0('PC ', i))  # PCs from sample data
}


Conclusion

This post was more about congratulating Alyssa with some analysis, but I still want to discuss the results.

We can see some pattern in the data from the PCs, but you need many PCs to explain a larger percent of the variance in the data. That is not surprising; this data is not standardized in the way people took the pictures, such as front-facing, with different backgrounds, and I’m using the color information rather than black and white.

We would likely also have more interpretable results if we registered images. In neuroimaging, we register brains to each other and average them to make a template image. We could do that in this analysis and should do so if this was a real project and not a post.

Moreover, we are doing a PCA on non-negative values bounded between 0 and 1. I think this is one of the most interesting aspects of the data. In many analyses using PCA we actually have all-positive values. People’s food choices are one example where non-negative matrix factorization is used; you can’t eat negative calories…if only. I think this is something to look into for people who are doing PCA on strictly positive values. Although you can demean and scale the data, making some values negative, reconstructing data from these components and their scores can give uninterpretable values, such as those outside [0, 1]. I’m looking into the nsprcomp package for non-negative PCA for future research work.

Adding Cost Functions to ROCR performance objects

In my last post, I gave an introduction of the ROCR package and how to use it for ROC analysis.

In the ROCR reference manual, it states “new performance measures can be added using a standard interface”, but I have not found that to be so. I may have missed some crucial step, but others have tried to adapt new performance measures. One example I came across had “patched” the performance code to use a new performance measure wss (Work Saved over Sampling). I liked some parts of what they did, but wanted to add my own measure and allow for a user to pass a new measure into a function without having to re-copy all the code.

Dice

I wanted to add an overlap measure known as the Dice coefficient, aka Dice Similarity Index (DSI), or Sorensen-Dice Coefficient. Let's define TP to be the number of true positives, TN to be true negatives, FP to be false positives, and FN to be false negatives, and RN/RP to be row negatives/positives and CN/CP be column negatives/positives. Then our 2-by-2 table of predictions vs. true values is:

           0    1    Total
0          TN   FP   RN
1          FN   TP   RP
Total      CN   CP   N

Let's say rows are predictions and columns are true values, but these are interchangeable.

The Dice coefficient is defined in terms of these values as:
$\frac{2 \times TP}{2\times TP + FP + FN} = \frac{2 \times TP}{RP + CP}$

In every prediction object, there are slots for tp, tn, fp, and fn (see ?"prediction-class" for more). Therefore, I can simply take these slots to make my Dice coefficient. Here's how I did it:

dice <- function(prediction.obj){
  if (class(prediction.obj) != "prediction") {
    stop(paste("Wrong argument types: First argument must be of type",
      "'prediction'"))
  }
  argnames <- c()
  x.values <- list()
  y.values <- list()
  for (i in 1:length(prediction.obj@predictions)) {
    fp = prediction.obj@fp[[i]]
    tp = prediction.obj@tp[[i]]
    fn = prediction.obj@fn[[i]]
    tn = prediction.obj@tn[[i]]
    cutoffs = prediction.obj@cutoffs[[i]]
    meas_dice = 2 * tp / (2 * tp + fp + fn)
    x.values <- c(x.values, list(cutoffs))
    y.values <- c(y.values, list(meas_dice))
  }
  if (!(length(x.values) == 0 || length(x.values) == length(y.values))) {
    stop("Consistency error.")
  }
  return(new("performance", x.name = "cutoff",
    y.name = "dice",
    alpha.name = "none",
    x.values = x.values, y.values = y.values,
    alpha.values = list())
  )
}


Essentially, I copied the performance function from ROCR, made some adjustments, and put my calculation (in the object meas_dice) in there. That’s great! Now I have this handy function to use when I want.
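Outside of ROCR, the core computation is tiny; here is a hedged Python sketch of Dice at a given cutoff (the helper names are mine):

```python
def confusion(preds, labels, cutoff):
    # predictions at or above the cutoff are called positive
    tp = sum(p >= cutoff and l == 1 for p, l in zip(preds, labels))
    fp = sum(p >= cutoff and l == 0 for p, l in zip(preds, labels))
    fn = sum(p < cutoff and l == 1 for p, l in zip(preds, labels))
    return tp, fp, fn

def dice_at(preds, labels, cutoff):
    tp, fp, fn = confusion(preds, labels, cutoff)
    return 2 * tp / (2 * tp + fp + fn)

preds = [0.9, 0.8, 0.4, 0.2]
labels = [1, 1, 0, 0]
assert dice_at(preds, labels, 0.5) == 1.0    # this cutoff separates perfectly
assert dice_at(preds, labels, 0.85) == 2 / 3 # one positive missed
```

ROCR does the same thing, but computes tp/fp/fn at every observed cutoff at once, which is why the slots are vectors.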

A more general solution

Although this solved my current problem, I thought more about how to add more cost functions in a more general way.

Here is my solution:

# copied from original function
myperformance <- function (prediction.obj, measure,
x.measure = "cutoff", ...)
{
envir.list <- my.define.environments(...)
long.unit.names <- envir.list$long.unit.names
function.names <- envir.list$function.names
obligatory.x.axis <- envir.list$obligatory.x.axis
optional.arguments <- envir.list$optional.arguments
default.values <- envir.list$default.values
if (class(prediction.obj) != "prediction"){
  stop(paste("Wrong argument types: First argument must be of type",
    "'prediction'"))
}
if (!exists(measure, where = long.unit.names, inherits = FALSE)){
  stop(paste("Measure", measure, "not found"))
}
if (!exists(x.measure, where = long.unit.names, inherits = FALSE)){
  stop(paste("Measure", measure, "not found"))
}
if (exists(x.measure, where = obligatory.x.axis, inherits = FALSE)) {
  message <- paste("The performance measure", x.measure,
    "can only be used as 'measure', because it has",
    "the following obligatory 'x.measure':\n",
    get(x.measure, envir = obligatory.x.axis))
  stop(message)
}
if (exists(measure, where = obligatory.x.axis, inherits = FALSE)) {
  x.measure <- get(measure, envir = obligatory.x.axis)
}
if (x.measure == "cutoff" || exists(measure, where = obligatory.x.axis,
  inherits = FALSE)) {
  optional.args <- list(...)
  argnames <- c()
  if (exists(measure, where = optional.arguments, inherits = FALSE)) {
    argnames <- get(measure, envir = optional.arguments)
    default.arglist <- list()
    for (i in 1:length(argnames)) {
      default.arglist <- c(default.arglist,
        get(paste(measure, ":", argnames[i], sep = ""),
          envir = default.values, inherits = FALSE))
    }
    names(default.arglist) <- argnames
    for (i in 1:length(argnames)) {
      templist <- list(optional.args, default.arglist[[i]])
      names(templist) <- c("arglist", argnames[i])
      optional.args <- do.call(".farg", templist)
    }
  }
  optional.args <- .select.args(optional.args, argnames)
  function.name <- get(measure, envir = function.names)
  x.values <- list()
  y.values <- list()
  for (i in 1:length(prediction.obj@predictions)) {
    argumentlist <- .sarg(optional.args,
      predictions = prediction.obj@predictions[[i]],
      labels = prediction.obj@labels[[i]],
      cutoffs = prediction.obj@cutoffs[[i]],
      fp = prediction.obj@fp[[i]],
      tp = prediction.obj@tp[[i]],
      fn = prediction.obj@fn[[i]],
      tn = prediction.obj@tn[[i]],
      n.pos = prediction.obj@n.pos[[i]],
      n.neg = prediction.obj@n.neg[[i]],
      n.pos.pred = prediction.obj@n.pos.pred[[i]],
      n.neg.pred = prediction.obj@n.neg.pred[[i]])
    ans <- do.call(function.name, argumentlist)
    if (!is.null(ans[[1]]))
      x.values <- c(x.values, list(ans[[1]]))
    y.values <- c(y.values, list(ans[[2]]))
  }
  if (!(length(x.values) == 0 || length(x.values) == length(y.values))) {
    stop("Consistency error.")
  }
  return(new("performance",
    x.name = get(x.measure, envir = long.unit.names),
    y.name = get(measure, envir = long.unit.names),
    alpha.name = "none",
    x.values = x.values, y.values = y.values,
    alpha.values = list()))
} else {
  perf.obj.1 <- myperformance(prediction.obj, measure = x.measure, ...)
  perf.obj.2 <- myperformance(prediction.obj, measure = measure, ...)
  return(.combine.performance.objects(perf.obj.1, perf.obj.2))
}
}


What is all this code? First off, myperformance is exactly the code from the performance function in ROCR, except the first line is:

envir.list <- my.define.environments(...)

instead of this line from ROCR::performance:

envir.list <- .define.environments()

Note that my.define.environments takes arguments, whereas .define.environments does not. This is a crucial difference; this is where you put your measure’s code.

New Environments

If you look at the code for .define.environments:

library(ROCR)
head(.define.environments)

1 function ()
2 {
3 long.unit.names <- new.env()
4 assign("none", "None", envir = long.unit.names)
5 assign("cutoff", "Cutoff", envir = long.unit.names)
6 assign("acc", "Accuracy", envir = long.unit.names)

we see the code new.env() being executed. At the beginning of the function, it defines the long.unit.names environment as well as other environments. So every time ROCR::performance is called, it creates a new environment with the names of the measures and functions ROCR uses. This is important since we cannot assign new names and objects to an existing, fixed environment or namespace like we could in other scenarios.
Hence why I created my.define.environments:

my.define.environments <- function(funnames = NULL, # name of measure
longnames = NULL, # name of actual thing
exprs = NULL, # list of 2 character vectors to be expressed
optargs, # list
default.vals,
xaxis
) {
# get original environments
envir.list <- ROCR::.define.environments()
long.unit.names = envir.list$long.unit.names
function.names = envir.list$function.names
obligatory.x.axis = envir.list$obligatory.x.axis
optional.arguments = envir.list$optional.arguments
default.values = envir.list$default.values

.performance.dice <- function (predictions, labels, cutoffs, fp,
tp, fn, tn, n.pos,
n.neg, n.pos.pred, n.neg.pred) {
list(cutoffs, 2 * tp / (2*tp + fp + fn))

}

assign("dice", .performance.dice,
envir=function.names)

assign("dice", "Sorensen-Dice coefficient",
envir=long.unit.names)

#######################################
#######################################
if (!is.null(funnames)){
stopifnot(
length(funnames) == length(longnames) &&
length(funnames) == length(exprs)
)
if (!missing(optargs)){
stopifnot(length(optargs) == length(funnames))
}
if (!missing(default.vals)){
stopifnot(length(default.vals) == length(funnames))
}
if (!missing(xaxis)){
stopifnot(length(xaxis) == length(funnames))
}

for (iname in seq_along(funnames)){
ie1 = exprs[[iname]][[1]]
ie2 = exprs[[iname]][[2]]
funcname = paste0("func <- function (predictions, labels,
cutoffs, fp,
tp, fn, tn, n.pos,
n.neg, n.pos.pred, n.neg.pred) {
list(", ie1, ", ", ie2, ") }")
eval(parse(text=funcname))

assign(funnames[iname], func,
envir=function.names)
assign(funnames[iname], longnames[iname],
envir=long.unit.names)

#############
# optional arguments
#############
if (!missing(optargs)){
oargs = optargs[[iname]]
for (ioarg in seq_along(oargs)){
assign(oargs[[ioarg]][[1]], oargs[[ioarg]][[2]],
envir=optional.arguments)
}
}

#############
# Default values
#############
if (!missing(default.vals)){
oargs = default.vals[[iname]]
for (ioarg in seq_along(oargs)){
assign(oargs[[ioarg]][[1]], oargs[[ioarg]][[2]],
envir=default.values)
}
}

#############
# Obligatory x-axis
#############
if (!missing(xaxis)){
oargs = xaxis[[iname]]
for (ioarg in seq_along(oargs)){
assign(oargs[[ioarg]][[1]], oargs[[ioarg]][[2]],
envir=obligatory.x.axis)
}
}

}
} # if (!is.null(funnames))

list(
long.unit.names = long.unit.names,
function.names = function.names,
obligatory.x.axis = obligatory.x.axis,
optional.arguments = optional.arguments,
default.values = default.values
)
}


We see that my.define.environments creates new environments too! Yes, it essentially does the same thing, but I can add my dice function inside my.define.environments, and this measure can then be reused in other projects with the same code. Moreover, the fact that arguments can be passed into my.define.environments allows you to create a measure on the fly.

Below is an example of how you can use custom measures based on the code above.

Example

Here I will plot the Jaccard Index, which is not implemented in the performance function. The Jaccard index formula is similar to dice and is represented as:

$\frac{TP}{TP + FP + FN}$

We can implement this cost function by supplying our measure, which must match our function name in funnames, the human-readable name in longnames and a list of 2-element character vectors in exprs. For scalar measures, the first element is "cutoffs" and the second element is the expression (to be evaluated by R) of the measure to be used.

data(ROCR.simple)
pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
perf.jaccard = myperformance(pred,
  measure = "jaccard",
  funnames = "jaccard",
  longnames = "Jaccard Index",
  exprs = list(c("cutoffs", "tp / (tp + fp + fn)")))
plot(perf.jaccard)


Voila! We now have a way to create any cost function (that can be formulated in terms of the slots of a prediction object).
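As a sanity check on the two measures, Dice and Jaccard are monotone transforms of each other, J = D / (2 - D), which a few lines of Python can confirm (the confusion counts are made up):

```python
def dice(tp, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)

def jaccard(tp, fp, fn):
    return tp / (tp + fp + fn)

# J = D / (2 - D) for any confusion counts
for tp, fp, fn in [(5, 2, 3), (1, 0, 0), (10, 4, 6)]:
    d, j = dice(tp, fp, fn), jaccard(tp, fp, fn)
    assert abs(j - d / (2 - d)) < 1e-12
```

So the two curves over cutoffs carry the same ranking information; they just scale differently.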

Here is the example with using Dice:

perf.dice = myperformance(pred, measure = "dice")
plot(perf.dice)


As we already added .performance.dice to my.define.environments, we can simply call it as a measure.

Passing in 2 new measures:

The length of funnames must be the same as that of longnames and exprs (exprs must be a list). You can pass in vectors of funnames and longnames and a list of exprs so that you define multiple measures.
And we can pass in 2 new measures and get a performance object of them. In these cases, you will likely only want to pass in a maximum of 2 measures as a performance object will only compute 2 outputs for the x.values and y.values slots.

perf.both = myperformance(pred, x.measure = "dice",
                          measure = "jaccard",
                          funnames = c("dice", "jaccard"),
                          longnames = c("Dice Index", "Jaccard Index"),
                          exprs = list(c("cutoffs", "2 * tp / (2*tp + fp + fn)"),
                                       c("cutoffs", "tp / (tp + fp + fn)")))

plot(perf.both)


If you look closely, you'll see there is some odd plotting in the upper right tail of the function. The functions may not be monotonic when you get them out of the performance object, so you may want to sort by the x measure first for plotting purposes:

both = data.frame(cbind(x= perf.both@x.values[[1]], y = perf.both@y.values[[1]]))
both = both[ order(both$x), ]
colnames(both) = c(perf.both@x.name, perf.both@y.name)
plot(both, type="l")

Conclusion

Overall, you can add new measures to the performance object in R using the code above. It's a shame that the package is orphaned; I like using it for many ROC functions and measure computations. Then again, I'm not volunteering to maintain it. Although the package says new performance measures can be added using "a standard interface", I could not find a way to do so. Hopefully the code above allows you to implement a new measure if you ever choose to do so. Have fun ROC'ing around the Christmas tree! Boom! – You just got punned.

A small introduction to the ROCR package

I've been doing some classification with logistic regression in brain imaging recently. I have been using the ROCR package, which is helpful for estimating performance measures and plotting these measures over a range of cutoffs. The prediction and performance functions are the workhorses of most of the analyses in ROCR I've been doing. For those who haven't used ROCR before, the format of the prediction function is:

prediction(predictions, labels, label.ordering = NULL)

where predictions are some predicted measure (usually continuous) for the "truth", which are the labels. In many applications, predictions are estimated probabilities (or log odds) and the labels are binary values. Both arguments can take a vector, matrix, or data.frame, but dim(predictions) must equal dim(labels). In this post, I'll go through creating prediction and performance objects and extracting the results.

Prediction objects

Simple example: one set of predictions and labels

Let's show a simple example from the prediction help file that uses a prediction and label vector (i.e. not a matrix).
We see the data is some continuous prediction and a binary label:

library(ROCR)
data(ROCR.simple)
head(cbind(ROCR.simple$predictions, ROCR.simple$labels), 5)

          [,1] [,2]
[1,] 0.6125478    1
[2,] 0.3642710    1
[3,] 0.4321361    0
[4,] 0.1402911    0
[5,] 0.3848959    0

Now, let's make the prediction object and show its contents:

pred <- prediction(ROCR.simple$predictions, ROCR.simple$labels)
class(pred)

[1] "prediction"
attr(,"package")
[1] "ROCR"

slotNames(pred)

 [1] "predictions" "labels"      "cutoffs"     "fp"          "tp"
 [6] "tn"          "fn"          "n.pos"       "n.neg"       "n.pos.pred"
[11] "n.neg.pred"

We see that the returned result of prediction is an object of class prediction, which is an S4 object with a series of slots. Let's look at the length and class of each slot:

sn = slotNames(pred)
sapply(sn, function(x) length(slot(pred, x)))

predictions      labels     cutoffs          fp          tp          tn
          1           1           1           1           1           1
         fn       n.pos       n.neg  n.pos.pred  n.neg.pred
          1           1           1           1           1

sapply(sn, function(x) class(slot(pred, x)))

predictions      labels     cutoffs          fp          tp          tn
     "list"      "list"      "list"      "list"      "list"      "list"
         fn       n.pos       n.neg  n.pos.pred  n.neg.pred
     "list"      "list"      "list"      "list"      "list"

We see that each slot has length 1 and is a list.

Example: multiple sets of predictions and labels

Let's use the ROCR.hiv dataset to show how this works if more than one set of predictions and labels are supplied. Here we pass a list of $$10$$ predictions and a list of labels to the prediction function:

data(ROCR.hiv)
manypred = prediction(ROCR.hiv$hiv.nn$predictions, ROCR.hiv$hiv.nn$labels)
sapply(sn, function(x) length(slot(manypred, x)))

predictions      labels     cutoffs          fp          tp          tn
         10          10          10          10          10          10
         fn       n.pos       n.neg  n.pos.pred  n.neg.pred
         10          10          10          10          10

sapply(sn, function(x) class(slot(manypred, x)))

predictions      labels     cutoffs          fp          tp          tn
     "list"      "list"      "list"      "list"      "list"      "list"
         fn       n.pos       n.neg  n.pos.pred  n.neg.pred
     "list"      "list"      "list"      "list"      "list"

We see that all the slots are still lists, but now they have length $$10$$, corresponding to the $$10$$ sets of predictions/labels.
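To make those slots a bit more concrete, here is a small sketch cross-checking ROCR's counts against a manual tally at one cutoff (using pred and ROCR.simple from above; the index 50 is arbitrary, and my assumption is that ROCR calls a prediction positive when it is at or above the cutoff):

```r
cuts = pred@cutoffs[[1]]
i = 50  # an arbitrary cutoff index
# manual true-positive count at this cutoff vs. the value stored in the tp slot
c(manual.tp = sum(ROCR.simple$predictions >= cuts[i] & ROCR.simple$labels == 1),
  rocr.tp = pred@tp[[1]][i])
```

If the assumption holds, the two numbers agree, which shows that the fp/tp/tn/fn slots are simply confusion-matrix counts indexed by cutoff.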
We would get the same result if the 2 arguments were matrices, but that would require all predictions and labels to have the same length. Using a list of predictions/labels is a bit more flexible.

Performance objects

From the help file of performance, the syntax for this function is:

performance(prediction.obj, measure, x.measure="cutoff", ...)

We see that the first argument is a prediction object, and the second is a measure. If you run ?performance, you can see all the performance measures implemented. We will do examples of some commonly estimated measures: receiver operating characteristic (ROC) curves, accuracy, area under the curve (AUC), and partial AUC (pAUC).

ROC Curve

Simple example: one set of predictions and labels

We will do an ROC curve, which plots the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis:

roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc.perf)
abline(a=0, b= 1)

At every cutoff, the TPR and FPR are calculated and plotted. The smoother the graph, the more cutoffs the predictions have. We also plotted a 45-degree line, which represents, on average, the performance of a Uniform(0, 1) random variable. The further away from the diagonal line, the better. Overall, we see gains in sensitivity (true positive rate, > 80%), trading off a false positive rate (1 - specificity), up until about 15% FPR. After an FPR of 15%, we don't see significant gains in TPR for a tradeoff of increased FPR.

Example: multiple sets of predictions and labels

The same can be done if you have many predictions and labels:

many.roc.perf = performance(manypred, measure = "tpr", x.measure = "fpr")
plot(many.roc.perf, col=1:10)
abline(a=0, b= 1)

Essentially, the plot function on a performance object with multiple predictions and labels will loop over the lists and plot the ROC for each one. Overall, we see the performance of each prediction is similar.
The pROC package, described in the conclusion, can test the performance between ROC curves.

Advanced: if you want to see how performance objects are plotted, use getMethod("plot", signature = c(x="performance", y="missing")) and ROCR:::.plot.performance.

Limiting to a FPR: partial ROC curve

You may only want to accept a false positive rate of a certain level, let's say 10%. The function pROC below will only keep values less than or equal to the FPR you set.

pROC = function(pred, fpr.stop){
  perf <- performance(pred, "tpr", "fpr")
  for (iperf in seq_along(perf@x.values)){
    ind = which(perf@x.values[[iperf]] <= fpr.stop)
    perf@y.values[[iperf]] = perf@y.values[[iperf]][ind]
    perf@x.values[[iperf]] = perf@x.values[[iperf]][ind]
  }
  return(perf)
}

Let's use this on the simple case and plot the partial ROC curve:

proc.perf = pROC(pred, fpr.stop=0.1)
plot(proc.perf)
abline(a=0, b= 1)

Thus, if we can only accept a FPR of 10%, the model is only giving 50% sensitivity (TPR) at 10% FPR (1 - specificity).

Getting an "optimal" cut point

In some applications of ROC curves, you want the point closest to a TPR of $$1$$ and an FPR of $$0$$. This cut point is "optimal" in the sense that it weighs both sensitivity and specificity equally. To determine this cutoff, you can use the code below. The code takes in BOTH the performance object and the prediction object and gives the optimal cutoff value of your predictions:

opt.cut = function(perf, pred){
  cut.ind = mapply(FUN=function(x, y, p){
    d = (x - 0)^2 + (y - 1)^2
    ind = which(d == min(d))
    c(sensitivity = y[[ind]], specificity = 1 - x[[ind]],
      cutoff = p[[ind]])
  }, perf@x.values, perf@y.values, pred@cutoffs)
}
print(opt.cut(roc.perf, pred))

                 [,1]
sensitivity 0.8494624
specificity 0.8504673
cutoff      0.5014893

Now, there is a cost measure in the ROCR package that you can use to create a performance object. If you use it to find the minimum cost, it will give you the same cutoff as opt.cut, but it will not give you the sensitivity and specificity.
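As a sanity-check sketch, you can also compute a misclassification cost by hand from the prediction slots. Here I assume total cost = cost.fp * FP + cost.fn * FN; my understanding is that ROCR's cost measure is this quantity normalized by the sample size, which does not change which cutoff minimizes it:

```r
# hand-rolled cost curve from the prediction object's slots
fp = pred@fp[[1]]
fn = pred@fn[[1]]
total.cost = 1 * fp + 1 * fn   # equal costs for FP and FN
pred@cutoffs[[1]][which.min(total.cost)]
```

With equal costs, this should land on the same cutoff as opt.cut above.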
cost.perf = performance(pred, "cost")
pred@cutoffs[[1]][which.min(cost.perf@y.values[[1]])]

[1] 0.5014893

Different costs for FP and FN

The output from opt.cut and a performance object with measure cost are NOT equivalent if false positives and false negatives are not weighted equally. The cost.fn and cost.fp arguments can be passed to performance, corresponding to the cost of a false negative and false positive, respectively. Let's say false positives are twice as costly as false negatives, and let's get a cut point:

cost.perf = performance(pred, "cost", cost.fp = 2, cost.fn = 1)
pred@cutoffs[[1]][which.min(cost.perf@y.values[[1]])]

[1] 0.5294022

Thus, we have a different "optimal" cut point with this changed cost function. In many real-life applications of biomarkers, the costs of a false positive and a false negative are not the same. For example, missing someone with a disease based on a test may cost a hospital $1,000,000 in lawsuits, but treating someone who did not have the disease may cost $100,000 in treatments. In that case, the cost of a false negative is 10 times that of a false positive, strictly in monetary measures. No cost analysis is this simple and is usually based on many factors, but most analyses do not have equal cost for a false positive versus a false negative.

The code is the same for the optimal cutoff for the multiple prediction data:

print(opt.cut(many.roc.perf, manypred))

                  [,1]       [,2]       [,3]       [,4]       [,5]
sensitivity  0.8076923  0.8205128  0.7692308  0.8205128  0.7564103
specificity  0.7902622  0.7827715  0.8501873  0.8164794  0.8464419
cutoff      -0.5749773 -0.5640632 -0.4311301 -0.5336958 -0.4863360
                  [,6]       [,7]       [,8]       [,9]      [,10]
sensitivity  0.7820513  0.7948718  0.7820513  0.7435897  0.7435897
specificity  0.8089888  0.8314607  0.8089888  0.8352060  0.8501873
cutoff      -0.5364402 -0.4816705 -0.5388664 -0.4777073 -0.4714354

Accuracy

Simple example: one set of predictions and labels

Another cost measure that is popular is overall accuracy.
This measure optimizes for correct results, but may be skewed if there are many more negatives than positives, or vice versa. Let's get the overall accuracy for the simple predictions and plot it:

acc.perf = performance(pred, measure = "acc")
plot(acc.perf)

What if we actually want to extract the maximum accuracy and the cutoff corresponding to it? In the performance object, we have the slot x.values, which corresponds to the cutoff in this case, and y.values, which corresponds to the accuracy at each cutoff. We'll grab the index for maximum accuracy and then grab the corresponding cutoff:

ind = which.max( slot(acc.perf, "y.values")[[1]] )
acc = slot(acc.perf, "y.values")[[1]][ind]
cutoff = slot(acc.perf, "x.values")[[1]][ind]
print(c(accuracy = acc, cutoff = cutoff))

 accuracy    cutoff
0.8500000 0.5014893

Hooray! Then you can go forth and threshold your model using this cutoff for (in hopes) maximum accuracy in your test data.

Example: multiple sets of predictions and labels

Again, we will do the same with many predictions and labels, but must iterate over the results (using a mapply statement):

many.acc.perf = performance(manypred, measure = "acc")
sapply(manypred@labels, function(x) mean(x == 1))

 [1] 0.226087 0.226087 0.226087 0.226087 0.226087 0.226087 0.226087
 [8] 0.226087 0.226087 0.226087

mapply(function(x, y){
  ind = which.max( y )
  acc = y[ind]
  cutoff = x[ind]
  return(c(accuracy = acc, cutoff = cutoff))
}, slot(many.acc.perf, "x.values"), slot(many.acc.perf, "y.values"))

               [,1]         [,2]      [,3]       [,4]      [,5]       [,6]
accuracy 0.86376812  0.881159420 0.8666667  0.8724638 0.8724638  0.8753623
cutoff   0.02461465 -0.006091327 0.2303707 -0.1758013 0.1251976 -0.2153779
               [,7]      [,8]      [,9]      [,10]
accuracy  0.8753623 0.8724638 0.8637681 0.86376812
cutoff   -0.2066697 0.1506282 0.2880392 0.06536471

We see that these cutoffs are not the same as those from opt.cut above. This is due to the fact that the proportion of positive cases is much less than 50%.
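Because plain accuracy can be misleading with unbalanced classes, one workaround is balanced accuracy, the average of sensitivity and specificity. A sketch using ROCR's "sens" and "spec" measures (both are built-in measures; I assume the cutoff grids line up because both come from the same prediction object):

```r
# balanced accuracy = (sensitivity + specificity) / 2, over all cutoffs
sens.perf = performance(pred, measure = "sens")
spec.perf = performance(pred, measure = "spec")
bal.acc = (sens.perf@y.values[[1]] + spec.perf@y.values[[1]]) / 2
best = which.max(bal.acc)
c(balanced.accuracy = bal.acc[best],
  cutoff = sens.perf@x.values[[1]][best])
```

This weighs the positive and negative classes equally, so it behaves much like the opt.cut distance criterion above.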
Area under the curve (AUC) and partial AUC (pAUC)

Simple example: one set of predictions and labels

The area under the curve summarizes the ROC curve by taking the area between the curve and the x-axis. Let's get the area under the curve for the simple predictions:

auc.perf = performance(pred, measure = "auc")
auc.perf@y.values

[[1]]
[1] 0.8341875

As you can see, the result is a scalar number, the area under the curve (AUC). This number ranges from $$0$$ to $$1$$, with $$1$$ indicating 100% specificity and 100% sensitivity.

As before, if you only want to accept a fixed FPR, we can calculate a partial AUC using the fpr.stop argument:

pauc.perf = performance(pred, measure = "auc", fpr.stop=0.1)
pauc.perf@y.values

[[1]]
[1] 0.02780625

Now, we see the pAUC to be much lower. Note that this value can range from $$0$$ to whatever fpr.stop is. In order to standardize it to $$1$$, you can divide it by fpr.stop to give a $$[0, 1]$$ measure:

pauc.perf@y.values = lapply(pauc.perf@y.values, function(x) x / 0.1)
pauc.perf@y.values

[[1]]
[1] 0.2780625

Although this measure is more comparable to the full AUC, it is still low. Note, there is no single cutoff for AUC or pAUC, as they measure performance over all cutoffs. Also, plotting functions for scalar outcome measures (such as AUC) do not work for performance objects.

The code for the multiple predictions is the same:

manypauc.perf = performance(manypred, measure = "auc", fpr.stop=0.1)
manypauc.perf@y.values = lapply(manypauc.perf@y.values, function(x) x / 0.1)
manypauc.perf@y.values

[[1]]
[1] 0.500048

[[2]]
[1] 0.5692404

[[3]]
[1] 0.5182944

[[4]]
[1] 0.5622299

[[5]]
[1] 0.5379814

[[6]]
[1] 0.5408624

[[7]]
[1] 0.5509939

[[8]]
[1] 0.5334678

[[9]]
[1] 0.4979353

[[10]]
[1] 0.4870354

Note: use sapply instead of lapply if you want the result to be a vector.

Conclusion

For ROC analysis, the ROCR package has good methods and many built-in measures.
Other packages, such as the pROC package, can be useful for many functions and analyses, especially testing the difference between ROC curves. In some ways, you may want to use pROC over ROCR, especially because (when I checked Dec 18, 2014) the ROCR package was orphaned. But if you are working in ROCR, I hope this gives you some examples of how to fit the objects and extract the results.

My Commonly Done ggplot2 graphs: Part 2

In my last post I described some of my commonly done ggplot2 graphs. It seems as though some people are interested in these, so I am following up with other plots I make frequently.

Scatterplot colored by continuous variable

The setup of the data for the scatterplots will be the same as in the previous post: one x variable and one y variable.

library(ggplot2)
set.seed(20141106)
data = data.frame(x = rnorm(1000, mean=6),
                  batch = factor(rbinom(1000, size=4, prob = 0.5)))
data$group1 = 1-rbeta(1000, 10, 2)
mat = model.matrix(~ batch, data=data)
mat = mat[, !colnames(mat) %in% "(Intercept)"]
betas = rbinom(ncol(mat), size=20, prob = 0.5)
data$quality = rowSums(t(t(mat) * sample(-2:2)))
data$dec.quality = cut(data$quality,
                       breaks = unique(quantile(data$quality,
                                                probs = seq(0, 1, by=0.1))),
                       include.lowest = TRUE)

batch.effect = t(t(mat) * betas)
batch.effect = rowSums(batch.effect)
data$y = data$x * 5 + rnorm(1000) + batch.effect +
  data$quality * rnorm(1000, sd = 2)
data$group2 = runif(1000)


I have added 2 important new variables, quality and batch. The motivation for these variables is akin to an RNAseq analysis set where you have a quality measure like read depth, and where the data were processed in different batches. The y variable is based both on the batch effect and the quality.

We construct the ggplot2 object for plotting x against y as follows:

g = ggplot(data, aes(x = x, y=y)) + geom_point()
print(g)


Coloring by a 3rd Variable (Discrete)

Let's plot the x and y data by the different batches:

print({ g + aes(colour=batch)})


We can see from this example how to color by a third, discrete variable. In this example, we see that the relationship is different for each batch (each is shifted).

Coloring by a 3rd Variable (Continuous)

Let's color by quality, which is continuous:

print({ gcol = g + aes(colour=quality)})


The default is to use black as the low value and blue as the high value. I don't always want this option, as I cannot always see differences clearly. Let's change the gradient of low to high values using scale_colour_gradient:

print({ gcol + scale_colour_gradient(low = "red", high="blue") })


This isn't much better. Let's call the middle quality gray and see if we can see better separation:

print({ gcol_grad = gcol +
  scale_colour_gradient2(low = "red", mid = "gray", high="blue") })


Scatterplot with Coloring by a 3rd Variable (Continuous broken into Discrete)

Another option is to break the quality into deciles (before plotting) and then coloring by these as a discrete variable:

print({ gcol_dec = g + aes(colour=dec.quality) })


Scatterplot with Coloring by 3rd Continuous Variable Faceted by a 4th Discrete Variable

We can combine these to show each batch in different facets and coloring by quality:

print({ gcol_grad + facet_wrap(~ batch )})


We can compound all these operations by passing transformations to scale_colour_gradient such as scale_colour_gradient(trans = "sqrt"). But enough with scatterplots.

Distributions! Lots of them.

One of the gaping holes in my last post was that I did not do any plots of distributions/densities of data. I ran the same code from the last post to get the longitudinal data set named dat.

Histograms

Let's say I want a distribution of my yij variable for each person across times:

library(plyr)
g = ggplot(data=dat, aes(x=yij, fill=factor(id))) +   guides(fill=FALSE)
ghist = g+ geom_histogram(binwidth = 3)
print(ghist)


Hmm, that's not too informative. By default, the histograms stack on top of each other. We can change this by setting position to be identity:

ghist = g+ geom_histogram(binwidth = 3, position = "identity")
print(ghist)


There are still too many histograms. Let's plot a subset.

Aside: Using the %+% operator

The %+% operator allows you to reset what dataset is in the ggplot2 object. The data must have the same components (e.g. variable names); I think this is most useful for plotting subsets of data.

nobs = 10
npick = 5


Let's plot the distribution for $$5$$ people with $$10$$ or more observations, both using geom_density and geom_line(stat = "density"). We will also change the binwidth:

tab = table(dat$id)
ids = names(tab)[tab >= nobs]
ids = sample(ids, npick)
sub = dat[ dat$id %in% ids, ]
ghist = g+ geom_histogram(binwidth = 5, position = "identity")
ghist %+% sub


Overlaid Histograms with Alpha Blending

Let's alpha blend these histograms to see the differences:

ggroup = ggplot(data=sub, aes(x=yij, fill=factor(id))) + guides(fill=FALSE)
grouphist = ggroup+ geom_histogram(binwidth = 5, position = "identity", alpha = 0.33)
grouphist


Similarly, we can plot over the 3 groups in our data:

ggroup = ggplot(data=dat, aes(x=yij, fill=factor(group))) + guides(fill=FALSE)
grouphist = ggroup+ geom_histogram(binwidth = 5, position = "identity", alpha = 0.33)
grouphist


These histograms are something I commonly do when I want to overlay the data in some way. More commonly though, especially with MANY distributions, I plot densities.

Densities

We can again plot the distribution of $$y_{ij}$$ for each person by using kernel density estimates, filled with a different color for each person:

g = ggplot(data=dat, aes(x=yij, fill=factor(id))) +   guides(fill=FALSE)
print({gdens = g+ geom_density() })


As the filling overlaps a lot and blocks out other densities, we can use just different colors per person/id/group:

g = ggplot(data=dat, aes(x=yij, colour=factor(id))) +   guides(colour=FALSE)
print({gdens = g+ geom_density() })


I'm not a fan that the default for stat_density is that the geom = "area". This creates a line on the x-axis that closes the object. This is very important if you want to fill the density with different colors. Most times though, I want simply the line of the density with no bottom line. We can achieve this with:

print({gdens2 = g+ geom_line(stat = "density")})


What if we set the option to fill the lines now? Well lines don't take fill, so it will not colour each line differently.

gdens3 = ggplot(data=dat, aes(x=yij, fill=factor(id))) + geom_line(stat = "density") + guides(colour=FALSE)
print({gdens3})


Now, regardless of the coloring, you can't really see the difference in people since there are so many. What if we want to do the plot with a subset of the data and the object is already constructed? Again, use the %+% operator.
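For example, a sketch of swapping in the smaller dataset (assuming the sub subset and the gdens2 line-density object created above):

```r
# %+% resets the data in an existing ggplot2 object; sub has the same columns as dat
print({ gdens2 %+% sub })
```

The same template plot is reused, but now only the sampled ids are drawn.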

Overlaid Densities with Alpha Blending

Let's take just different subsets of groups, not people, and plot the densities, with alpha blending:

print({ group_dens = ggroup+ geom_density(alpha = 0.3) })


That looks much better than the histogram example for groups. Now let's show these with lines:

print({group_dens2 = ggroup+ geom_line(stat = "density")})


What happened? Again, lines don't take fill, they take colour:

print({group_dens2 = ggroup+ geom_line(aes(colour=group), stat = "density")})


I'm a firm believer that legends belong IN plots, so let's push the legend in there and make it blend in:

print({
group_dens3 =
group_dens2 + theme(legend.position = c(.75, .75),
  legend.background = element_rect(fill="transparent"),
  legend.key = element_rect(fill="transparent",
                            color="transparent"))
})


Lastly, I'll create a dataset of the means of the datasets and put vertical lines for the mean:

gmeans = ddply(dat, .(group), summarise,
mean = mean(yij))
group_dens3 + geom_vline(data=gmeans,
aes(colour = group, xintercept = mean))


Conclusion

Overall, this post should give you a few ways to play around with densities and such for plotting. All the same changes as in the previous scatterplot examples, such as faceting, can be used with these distribution plots. Many times, you want to look at the data in very different ways. Histograms can let you see outliers in ways that densities do not, because densities smooth over the data. Either way, the mixture of alpha blending, coloring, and filling (though filling is less useful for many distributions) can give you a nice description of what's going on at a distributional level in your data.

PS: Boxplots

You can also do boxplots for each group, but these tend to separate well and colour relatively well using defaults, so I will not discuss them here. My only note is that you can (and should) overlay points on the boxplot rather than plotting just the boxplot. You may need to jitter the points, alpha blend them, subsample the number of points, or clean it up a bit, but I try to display more DATA whenever possible.
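A sketch of that boxplot-with-points idea, assuming the dat longitudinal data and its group variable from above (the jitter width and alpha are arbitrary choices you would tune):

```r
# boxplot per group with the raw data jittered on top;
# outliers are suppressed from the boxplot since the points show them anyway
ggplot(dat, aes(x = factor(group), y = yij)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.2)
```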

My Commonly Done ggplot2 graphs

In my last post, I discussed how ggplot2 is not always the answer to the question “How should I plot this” and that base graphics were still very useful.

Why Do I use ggplot2 then?

The overall question still remains: why (do I) use ggplot2?

ggplot2 vs lattice

For one, ggplot2 replaced the lattice package for many plot types for me. The lattice package is a great system, but if you are plotting multivariate data, I believe you should choose lattice or ggplot2. I chose ggplot2 for the syntax, added capabilities, and the philosophy behind it. The fact Hadley Wickham is the developer never hurts either.

Having multiple versions of the same plot, with slight changes

Many times I want to do the same plot over and over, but vary one aspect of it, such as color of the points by a grouping variable, and then switch the color to another grouping variable. Let me give a toy example, where we have an x and a y with two grouping variables: group1 and group2.

library(ggplot2)
set.seed(20141016)
data = data.frame(x = rnorm(1000, mean=6))
data$group1 = rbinom(n = 1000, size = 1, prob = 0.5)
data$y = data$x * 5 + rnorm(1000)
data$group2 = runif(1000) > 0.2


We can construct the ggplot2 object as follows:

g = ggplot(data, aes(x = x, y=y)) + geom_point()


The ggplot command takes the data.frame you want to use and an aes argument specifying which aesthetics to map; here we specify the x and y. Some aesthetics are optional depending on the plot, some are not. I think it’s safe to say you always need an x. I then “add” (using +) a “layer” to this object: I want a geometric “thing”, and that thing is a set of points, hence I use geom_point. I’m doing a scatterplot.

If you just call the object g, print is called by default, which plots the object and we see our scatterplot.

g


I can color by a grouping variable and we can add that aesthetic:

g + aes(colour = group1)


g + aes(colour = factor(group1))


g + aes(colour = group2)


Note, g is the original plot, and I can add aes to this plot, which is the same as if I did ggplot2(data, aes(...)) in the original call that generated g. NOTE if the aes you are adding was not a column of the data.frame when you created the plot, you will get an error. For example, let’s add a new column to the data and then add it to g:

data$newcol = rbinom(n = nrow(data), size=2, prob = 0.5)
g + aes(colour=factor(newcol))

Error: object 'newcol' not found

This fails because of the way the ggplot2 object was created. If we had added this column to the data, created the plot, and then added newcol as an aes, the command would work fine.

g2 = ggplot(data, aes(x = x, y=y)) + geom_point()
g2 + aes(colour=factor(newcol))

We see in the first plot with colour = group1 that ggplot2 sees a numeric variable group1, so it tries a continuous mapping scheme for the color. The default is a range of blue colors denoting intensity. If we want to force a discrete mapping, we can turn it into a factor: colour = factor(group1). We see the colors are very different and are not a continuum of blue, but colors that separate the groups better. The third plot illustrates that when ggplot2 takes logical vectors for mappings, it factors them and maps the groups to discrete colors.

Slight Changes with additions

In practice, I do this iterative process many times, and the addition of elements to a common template plot is very helpful for speed and for reproducing the same plot with minor tweaks. In addition to doing similar plots with slight grouping changes, I also add different lines/fits on top. In the previous example, we colored the points by different grouping variables. In other examples, I tend to change little things, e.g. a different smoother, a different subset of points, constraining the values to a certain range, etc. I believe in this way ggplot2 allows us to create plots in a more structured way, without copying and pasting the entire command or creating a user-written wrapper function as you would in base. This should increase reproducibility by decreasing copy-and-paste errors. I tend to have many copy-paste errors, so I want to limit them as much as possible.
ggplot2 reduces the number of commands I have to do

One example plot I make frequently is a scatterplot with a smoother to estimate the shape of bivariate data. In base, I usually have to run at least 3 commands to do this, e.g. loess, plot, and lines. In ggplot2, geom_smooth() takes care of this for you. Moreover, it does the smoothing for each different aesthetic (aka smoothing per group), which is usually what I want to do as well (and takes more than 3 lines in base, usually a for loop or apply statement).

g2 + geom_smooth()

By default, geom_smooth includes the standard error of the estimated relationship, but I usually only look at the estimate for a rough sketch of the relationship. Moreover, if the data are correlated (such as in longitudinal data), the standard errors given by default methods are usually not accurate anyway.

g2 + geom_smooth(se = FALSE)

Therefore, on top of the lack of copying and pasting, you can reduce the number of lines of code. Some say “but I can make a function that does that” – yes you can. Or you can use the commands already there in ggplot2.

Faceting

The other reason I frequently use ggplot2 is faceting. You can do the same graph, conditioned on levels of a variable, which I frequently use. Many times you want to do a graph, subset by another variable, such as treatment/control, male/female, cancer/control, etc.

g + facet_wrap(~ group1)

g + facet_wrap(~ group2)

g + facet_wrap(group2 ~ group1)

g + facet_wrap( ~ group2 + group1)

g + facet_grid(group2 ~ group1)

Spaghetti plot with Overall smoother

I also frequently have longitudinal data and make spaghetti plots of per-person trajectories over time. For this example I took code from StackOverflow to create some longitudinal data.
library(MASS)
library(nlme)

### set number of individuals
n <- 200

### average intercept and slope
beta0 <- 1.0
beta1 <- 6.0

### true autocorrelation
ar.val <- .4

### true error SD, intercept SD, slope SD, and intercept-slope cor
sigma <- 1.5
tau0 <- 2.5
tau1 <- 2.0
tau01 <- 0.3

### maximum number of possible observations
m <- 10

### simulate number of observations for each individual
p <- round(runif(n,4,m))

### simulate observation moments (assume everybody has 1st obs)
obs <- unlist(sapply(p, function(x) c(1, sort(sample(2:m, x-1, replace=FALSE)))))

### set up data frame
dat <- data.frame(id=rep(1:n, times=p), obs=obs)

### simulate (correlated) random effects for intercepts and slopes
mu <- c(0,0)
S <- matrix(c(1, tau01, tau01, 1), nrow=2)
tau <- c(tau0, tau1)
S <- diag(tau) %*% S %*% diag(tau)
U <- mvrnorm(n, mu=mu, Sigma=S)

### simulate AR(1) errors and then the actual outcomes
dat$eij <- unlist(sapply(p, function(x) arima.sim(model=list(ar=ar.val), n=x) * sqrt(1-ar.val^2) * sigma))
dat$yij <- (beta0 + rep(U[,1], times=p)) + (beta1 + rep(U[,2], times=p)) * log(dat$obs) + dat$eij

I will first add an alpha level to the plotting lines for the next plot (remember this must be done before the original plot is created). tspag will be the template plot, and I will create a spaghetti plot (spag) where each colour represents an id:

library(plyr)
dat = ddply(dat, .(id), function(x){
  x$alpha = ifelse(runif(n = 1) > 0.9, 1, 0.1)
x\$grouper = factor(rbinom(n=1, size =3 ,prob=0.5), levels=0:3)
x
})
tspag = ggplot(dat, aes(x=obs, y=yij)) +
geom_line() + guides(colour=FALSE) + xlab("Observation Time Point") +
ylab("Y")
spag = tspag + aes(colour = factor(id))
spag


Many other times I want to group by id but plot just a few lines (let’s say 10% of them) dark and the others light, and not colour them:

bwspag = tspag + aes(alpha=alpha, group=factor(id)) + guides(alpha=FALSE)
bwspag


Overall, these 2 plots are useful when you have longitudinal data and don’t want to loop over ids or use lattice. The great addition is that all the faceting and such above can be used in conjunction with these plots to get spaghetti plots by subgroup.

spag + facet_wrap(~ grouper)


Spaghetti plot with overall smoother

If you want a smoother for the overall group in addition to the spaghetti plot, you can just add geom_smooth:

sspag = spag + geom_smooth(se=FALSE, colour="black", size=2)
sspag


sspag + facet_wrap(~ grouper)


bwspag + facet_wrap(~ grouper)


Note that the group aesthetic and colour aesthetic do not perform the same way for some operations. For example, let’s try to smooth bwspag:

bwspag + facet_wrap(~ grouper) + geom_smooth(se=FALSE, colour="red")


We see that it smooths each id, which is not what we want. We can achieve the desired result by setting the group aesthetic:

bwspag + facet_wrap(~ grouper) +
geom_smooth(aes(group=1), se=FALSE, colour="red", size =2)


I hope that this demonstrates some of the simple yet powerful commands ggplot2 allows users to execute. I agree that some behavior may not seem straightforward at first glance, but becomes more understandable as one uses ggplot2 more.

Another Case this is Useful – Save Plot Twice

Another (non-plotting) example I want to show is how saving ggplot2 objects can make saving duplicate plots much easier. In many cases for making plots to show to others, I open up a PDF device using pdf, make a series of plots, and then close the PDF. There may be 1-2 plots that are the real punchline, and I want to make a high-res PNG of them separately. Let me be clear, I want both – the PDF and PNG. For example:

pdf(tempfile())
print({g1 = g + aes(colour = group1)})
print({g1fac = g + aes(colour = factor(group1))})
print({g2 = g + aes(colour = group2)})
dev.off()


png(tempfile(), res = 300, height =7, width= 7, units = "in")
print(g2)
dev.off()


I am printing the objects while assigning them. (I have to use the { brackets because I use = for assignment, and print would evaluate that as arguments without {}.) As g2 is already saved as an object, I can close the PDF, open a PNG, and then print it again.

No copying and pasting was needed for remaking the plot, nor some weird turning off and on devices.
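A related sketch: since the objects are saved, ggsave (which infers the graphics device from the file extension) can write a single saved plot in one line, so for the high-res PNG of the punchline plot you can skip the device bookkeeping entirely. The file name here is just a placeholder; the multi-plot PDF still wants the pdf()/dev.off() approach:

```r
# one-line high-res PNG of the already-saved g2 object
ggsave("punchline.png", plot = g2, width = 7, height = 7, dpi = 300)
```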

Conclusion

I (and think you should) use ggplot2 for a lot of reasons, especially the grammar and philosophy. The plots I have used here are some powerful representations of data that are simple to execute. I believe using this system reflects and helps the true iterative process of making figures. The syntax may not seem intuitive to a long-time R user, but I believe the startup cost is worth the effort. Let me know if you’d like to see any other plots that you commonly use.