Adding Cost Functions to ROCR performance objects

In my last post, I gave an introduction of the ROCR package and how to use it for ROC analysis.

In the ROCR reference manual, it states “new performance measures can be added using a standard interface”, but I have not found that to be so. I may have missed some crucial step, but others have tried to adapt new performance measures. One example I came across had “patched” the performance code to use a new performance measure wss (Work Saved over Sampling). I liked some parts of what they did, but wanted to add my own measure and allow for a user to pass a new measure into a function without having to re-copy all the code.

Dice

I wanted to add an overlap measure known as the Dice coefficient, aka Dice Similarity Index (DSI), or Sorensen-Dice Coefficient. Let's define TP to be the number of true positives, TN to be true negatives, FP to be false positives, and FN to be false negatives, and RN/RP to be row negatives/positives and CN/CP be column negatives/positives. Then our 2-by-2 table of predictions vs. true values is:

0 1 Total
0 TN FP RN
1 FN TP RP
Total CN CP N

Let's say rows are predictions and columns are true values, but these are interchangeable.

The Dice coefficient is defined in terms of these values as:
\frac{2 \times TP}{2\times TP + FP + FN} = \frac{2 \times TP}{RP + CP}

In every prediction object, there are slots for tp, tn, fp, and fn (see ?"prediction-class" for more). Therefore, I can simply take these slots to make my Dice coefficient. Here's how I did it:

dice <- function(prediction.obj){
    if (class(prediction.obj) != "prediction") {
        stop(paste("Wrong argument types: First argument must be of type", 
            "'prediction'"))    
    }
    argnames <- c()
    x.values <- list()
    y.values <- list()
    for (i in 1:length(prediction.obj@predictions)) {
      fp = prediction.obj@fp[[i]] 
        tp = prediction.obj@tp[[i]]
        fn = prediction.obj@fn[[i]]
        tn = prediction.obj@tn[[i]]
        cutoffs = prediction.obj@cutoffs[[i]]
        meas_dice = 2 * tp / (2*tp + fp + fn)
        x.values <- c(x.values, list(cutoffs))
        y.values <- c(y.values, list(meas_dice))
    }
    if (!(length(x.values) == 0 || length(x.values) == length(y.values))) {
        stop("Consistency error.")
    }
    return(new("performance", x.name = "cutoff", 
        y.name = "dice", 
        alpha.name = "none", 
        x.values = x.values, y.values = y.values, 
        alpha.values = list())
    )
}

Essentially, I copied the performance function from ROCR, made some adjustments and put in my calculation (into the object meas_dice) in there. That's great! Now I have this handy function to use when I want.

A more general solution

Although this solved my current problem, I thought more about how to add more cost functions in a more general way.

Here is my solution:

# copied from original function
myperformance <- function (prediction.obj, measure, 
    x.measure = "cutoff", ...)
{
    envir.list <- my.define.environments(...)
    long.unit.names <- envir.list$long.unit.names
    function.names <- envir.list$function.names
    obligatory.x.axis <- envir.list$obligatory.x.axis
    optional.arguments <- envir.list$optional.arguments
    default.values <- envir.list$default.values
    if (class(prediction.obj) != "prediction"){
              stop(paste("Wrong argument types: First argument must be of type",
            "'prediction'"))
    }
    if (!exists(measure,  where = long.unit.names, inherits = FALSE)){
      stop(paste("Measure", measure, "not found"))
    }
    if (!exists(x.measure,  where = long.unit.names, inherits = FALSE)){
      stop(paste("Measure", measure, "not found"))
    }    
    if (exists(x.measure, where = obligatory.x.axis, inherits = FALSE)) {
        message <- paste("The performance measure", x.measure,
            "can only be used as 'measure', because it has",
            "the following obligatory 'x.measure':\n", get(x.measure,
                envir = obligatory.x.axis))
        stop(message)
    }
    if (exists(measure, where = obligatory.x.axis, inherits = FALSE)) {
        x.measure <- get(measure, envir = obligatory.x.axis)
    }
    if (x.measure == "cutoff" || exists(measure, where = obligatory.x.axis,
        inherits = FALSE)) {
        optional.args <- list(...)
        argnames <- c()
        if (exists(measure, where = optional.arguments, inherits = FALSE)) {
            argnames <- get(measure, envir = optional.arguments)
            default.arglist <- list()
            for (i in 1:length(argnames)) {
                default.arglist <- c(default.arglist, get(paste(measure,
                  ":", argnames[i], sep = ""), envir = default.values,
                  inherits = FALSE))
            }
            names(default.arglist) <- argnames
            for (i in 1:length(argnames)) {
                templist <- list(optional.args, default.arglist[[i]])
                names(templist) <- c("arglist", argnames[i])
                optional.args <- do.call(".farg", templist)
            }
        }
        optional.args <- .select.args(optional.args, argnames)
        function.name <- get(measure, envir = function.names)
        x.values <- list()
        y.values <- list()
        for (i in 1:length(prediction.obj@predictions)) {
            argumentlist <- .sarg(optional.args, predictions = prediction.obj@predictions[[i]],
                labels = prediction.obj@labels[[i]], cutoffs = prediction.obj@cutoffs[[i]],
                fp = prediction.obj@fp[[i]], tp = prediction.obj@tp[[i]],
                fn = prediction.obj@fn[[i]], tn = prediction.obj@tn[[i]],
                n.pos = prediction.obj@n.pos[[i]], n.neg = prediction.obj@n.neg[[i]],
                n.pos.pred = prediction.obj@n.pos.pred[[i]],
                n.neg.pred = prediction.obj@n.neg.pred[[i]])
            ans <- do.call(function.name, argumentlist)
            if (!is.null(ans[[1]]))
                x.values <- c(x.values, list(ans[[1]]))
            y.values <- c(y.values, list(ans[[2]]))
        }
        if (!(length(x.values) == 0 || length(x.values) == length(y.values))) {
            stop("Consistency error.")
        }
        return(new("performance", x.name = get(x.measure, envir = long.unit.names),
            y.name = get(measure, envir = long.unit.names), alpha.name = "none",
            x.values = x.values, y.values = y.values, alpha.values = list()))
    }
    else {
        perf.obj.1 <- myperformance(prediction.obj, measure = x.measure,
            ...)
        perf.obj.2 <- myperformance(prediction.obj, measure = measure,
            ...)
        return(.combine.performance.objects(perf.obj.1, perf.obj.2))
    }
}

What is all this code?

First off, myperformance is exactly the code from the performance function in ROCR, except the first line is:

envir.list <- my.define.environments(...)

instead of this line from ROCR::performance

envir.list <- .define.environments()

Note that my.define.environments takes arguments, whereas .define.environments does not. This is a crucial difference; this is where you put your measure's code.

New Environments

If you look at the code for .define.environments:

library(ROCR)
head(.define.environments)
1 function ()                                            
2 {                                                      
3     long.unit.names <- new.env()                       
4     assign("none", "None", envir = long.unit.names)    
5     assign("cutoff", "Cutoff", envir = long.unit.names)
6     assign("acc", "Accuracy", envir = long.unit.names) 

we see the code new.env() being executed. In the beginning of the function, it defines the long.unit.names environment as well as other environments. So every time ROCR::performance is called, it creates a new environment with the names of the measures and functions ROCR uses. This is important since we cannot assign new names and objects to an existing, fixed environment or namespace like we could in other scenarios. Hence why I created my.define.environments:

my.define.environments <- function(funnames = NULL, # name of measure
    longnames = NULL, # name of actual thing
    exprs = NULL, # list of 2 character vectors to be expressed
    optargs, # list
    default.vals,
    xaxis
    )
{
  # get original environments
  envir.list <- ROCR::.define.environments()
  long.unit.names = envir.list$long.unit.names
  function.names = envir.list$function.names   
  obligatory.x.axis = envir.list$obligatory.x.axis
  optional.arguments = envir.list$optional.arguments         
  default.values = envir.list$default.values

    .performance.dice <- function (predictions, labels, cutoffs, fp, 
        tp, fn, tn, n.pos, 
        n.neg, n.pos.pred, n.neg.pred) {
        list(cutoffs, 2 * tp / (2*tp + fp + fn))

    }  

    assign("dice", .performance.dice, 
        envir=function.names)

    assign("dice", "Sorensen-Dice coefficient", 
        envir=long.unit.names)    

    #######################################
    # Allow for general adding
    #######################################
    if (!is.null(funnames)){
        stopifnot(
            length(funnames) == length(longnames) &&
            length(funnames) == length(exprs)
            )
        if (!missing(optargs)){
          stopifnot(length(optargs) == length(funnames))
        }
        if (!missing(optargs)){
          stopifnot(length(default.vals) == length(funnames))
        } 
        if (!missing(xaxis)){
          stopifnot(length(xaxis) == length(funnames))
        }       

        for (iname in seq_along(funnames)){  
          ie1 = exprs[[iname]][[1]]
          ie2 = exprs[[iname]][[2]]
          funcname = paste0("func <- function (predictions, labels, 
                            cutoffs, fp, 
                            tp, fn, tn, n.pos, 
                            n.neg, n.pos.pred, n.neg.pred) {
            list(", ie1, ", ", ie2, ") }")
          eval(parse(text=funcname))

            assign(funnames[iname], func, 
                envir=function.names)
            assign(funnames[iname], longnames[iname],
                envir=long.unit.names)

            #############
            # optional arguments
            #############
            if (!missing(optargs)){
              oargs = optargs[[iname]]
              for (ioarg in seq_along(oargs)){
                assign(oargs[[ioarg]][[1]], oargs[[ioarg]][[2]],
                  envir=optional.arguments)
              }
            }

            #############
            # Default values
            #############
            if (!missing(default.vals)){
              oargs = default.vals[[iname]]
              for (ioarg in seq_along(oargs)){
                assign(oargs[[ioarg]][[1]], oargs[[ioarg]][[2]],
                  envir=default.values)
              }
            }

            if (!missing(default.vals)){
              oargs = default.vals[[iname]]
              for (ioarg in seq_along(oargs)){
                assign(oargs[[ioarg]][[1]], oargs[[ioarg]][[2]],
                  envir=obligatory.x.axis)
              }
            }            

        }
    } # is is.null

  list(
    long.unit.names = long.unit.names,
    function.names = function.names,
    obligatory.x.axis = obligatory.x.axis,
    optional.arguments = optional.arguments,
    default.values = default.values
  )
}

We see that my.define.environments creates new environments too though! Yes, my.define.environments essentially does the same thing, but I can add my dice functiont inside my.define.environments and this measure can then be used in future work in other projects by using the same code. Moreover, the fact that arguments can be passed into my.define.environments allows you to create a measure on-the-fly.

Below is an example of how you can use custom measures based on the code above.

Example

Here I will plot the Jaccard Index, which is not implemented in the performance function. The Jaccard index formula is similar to dice and is represented as:

\frac{TP}{TP + FP + FN}

We can implement this cost function by supplying our measure, which must match our function name in funnames, the human-readable name in longnames and a list of 2-element character vectors in exprs. For scalar measures, the first element is "cutoffs" and the second element is the expression (to be evaluated by R) of the measure to be used.

data(ROCR.simple)
pred <- prediction(ROCR.simple$predictions,ROCR.simple$labels)
perf.jaccard = myperformance(pred, 
                             measure = "jaccard", 
              funnames = "jaccard", 
              longnames="Jaccard Index", 
              exprs = list(c("cutoffs", "tp / (tp + fp + fn)")))
plot(perf.jaccard)

plot of chunk unnamed-chunk-6

Viola! We now have a way to create any cost function (that can be formulated in the terms of the objects of a prediction object).

Here is the example with using Dice:

perf.dice = myperformance(pred, measure = "dice")
plot(perf.dice)

plot of chunk unnamed-chunk-7

As we already added .performance.dice to my.define.environments, we can simply call it as a measure.

Passing in 2 new measures:

The length of funnames must be the same as that of longnames and exprs (exprs must be a list). You can pass in vectors of funnames and longnames and a list of exprs so that you define multiple measures.
And we can pass in 2 new measures and get a performance object of them. In these cases, you will likely only want to pass in a maximum of 2 measures as a performance object will only compute 2 outputs for the x.values and y.values slots.

perf.both = myperformance(pred, x.measure = "dice", 
                             measure = "jaccard", 
                             funnames = c("dice", "jaccard"), 
                            longnames=c("Dice Index", "Jaccard Index"), 
              exprs = list(c("cutoffs", "2 * tp / (2*tp + fp + fn)"),
                           c("cutoffs", "tp / (tp + fp + fn)")))

plot(perf.both)

plot of chunk unnamed-chunk-8

If you look closely, you'll see there is some odd plotting in the upper right tail of the function. The functions may not be monotonic when you get them out of the performance object, so you may want to sort by the x measure first for plotting purposes:

both = data.frame(cbind(x= perf.both@x.values[[1]], y = perf.both@y.values[[1]]))
both = both[ order(both$x), ]
colnames(both) = c(perf.both@x.name, perf.both@y.name)
plot(both, type="l")

plot of chunk unnamed-chunk-9

Conclusion

Overall, you can add new measures to the performance object in R using the code above. It's a shame that the package is orphaned; I like using it for many ROC functions and measure computations. Then again, I'm not volunteering to maintain it. Although the package says new performance measures can be added using a standard interface”, I could not find a way to do so. Hopefully the code above allows you to implement a new measure if you ever choose to do so. Have fun ROC’ing around the Christmas tree! Boom! – You just got punned.

Advertisements

A small introduction to the ROCR package

I've been doing some classification with logistic regression in brain imaging recently. I have been using the ROCR package, which is helpful at estimating performance measures and plotting these measures over a range of cutoffs.

The prediction and performance functions are the workhorses of most of the analyses in ROCR I've been doing. For those who haven't used ROCR before, the format of the prediction function is:

prediction(predictions, labels, label.ordering = NULL)

where predictions are some predicted measure (usually continuous) for the “truth”, which are the labels. In many applications, predictions are estimated probabilities (or log odds) and the labels are binary values. Both arguments can take a vector, matrix, or data.frame for prediction, but dim(predictions) must equal dim(labels).

In this post, I'll go through creating prediction and performance objects and extracting the results.

Prediction objects

Simple example: one set of prediction and labels

Let's show a simple example from the prediction help file, that uses a prediction and label vector (i.e. not a matrix). We see the data is some continuous prediction and binary label:

library(ROCR)
data(ROCR.simple)
head(cbind(ROCR.simple$predictions, ROCR.simple$labels), 5)
          [,1] [,2]
[1,] 0.6125478    1
[2,] 0.3642710    1
[3,] 0.4321361    0
[4,] 0.1402911    0
[5,] 0.3848959    0

Now, let's makde the prediction object and show its contents:

pred <- prediction(ROCR.simple$predictions,ROCR.simple$labels)
class(pred)
[1] "prediction"
attr(,"package")
[1] "ROCR"
slotNames(pred)
 [1] "predictions" "labels"      "cutoffs"     "fp"          "tp"         
 [6] "tn"          "fn"          "n.pos"       "n.neg"       "n.pos.pred" 
[11] "n.neg.pred" 

We see the the returned result of prediction is an object of class prediction, which an S4 object with a series of slots. Let's look at the length of each slot and the class:

sn = slotNames(pred)
sapply(sn, function(x) length(slot(pred, x)))
predictions      labels     cutoffs          fp          tp          tn 
          1           1           1           1           1           1 
         fn       n.pos       n.neg  n.pos.pred  n.neg.pred 
          1           1           1           1           1 
sapply(sn, function(x) class(slot(pred, x)))
predictions      labels     cutoffs          fp          tp          tn 
     "list"      "list"      "list"      "list"      "list"      "list" 
         fn       n.pos       n.neg  n.pos.pred  n.neg.pred 
     "list"      "list"      "list"      "list"      "list" 

We see that each slot has length 1 and is a list.

Example: multiple sets of prediction and labels

Let's use the ROCR.hiv dataset to show how this works if more than one set of predictions and labels are supplied. Here we pass a list of \(10\) predictions and a list of labels to the prediction function:

data(ROCR.hiv)
manypred = prediction(ROCR.hiv$hiv.nn$predictions, ROCR.hiv$hiv.nn$labels)
sapply(sn, function(x) length(slot(manypred, x)))
predictions      labels     cutoffs          fp          tp          tn 
         10          10          10          10          10          10 
         fn       n.pos       n.neg  n.pos.pred  n.neg.pred 
         10          10          10          10          10 
sapply(sn, function(x) class(slot(manypred, x)))
predictions      labels     cutoffs          fp          tp          tn 
     "list"      "list"      "list"      "list"      "list"      "list" 
         fn       n.pos       n.neg  n.pos.pred  n.neg.pred 
     "list"      "list"      "list"      "list"      "list" 

We see that all the slots are still lists, but now they have length \(10\), corresponding to the \(10\) predictions/labels. We would get the same result if the 2 arguments were matrices, but that would require all predictions and labels to have the same length. Using a list of predictions/labels is a bit more flexible.

Performance objects

From the help file of performance, the syntax for this function is:

performance(prediction.obj, measure, x.measure="cutoff", ...)

We see that the first argument is a prediction object, and the second is a measure. If you run ?performance, you can see all the performance measures implemented.

We will do example of some commonly estimated measures: receiver operating characteristic (ROC) curves, accuracy, area under the curve (AUC), and partial AUC (pAUC).

ROC Curve

Simple example: one set of prediction and labels

We will do an ROC curve, which plots the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis:

roc.perf = performance(pred, measure = "tpr", x.measure = "fpr")
plot(roc.perf)
abline(a=0, b= 1)

plot of chunk roc_pred

At every cutoff, the TPR and FPR are calculated and plotted. The smoother the graph, the more cutoffs the predictions have. We also plotted a 45-degree line, which represents, on average, the performance of a Uniform(0, 1) random variable. The further away from the diagonal line, the better. Overall, we see that we see gains in sensitivity (true positive rate, > 80%), trading off a false positive rate (1- specificity), up until about 15% FPR. After an FPR of 15%, we don't see significant gains in TPR for a tradeoff of increased FPR.

Example: multiple sets of prediction and labels

The same can be done if you have many predictions and labels:

many.roc.perf = performance(manypred, measure = "tpr", x.measure = "fpr")
plot(many.roc.perf, col=1:10)
abline(a=0, b= 1)

plot of chunk roc_preds

Essentially, the plot function on a performance object with multiple predictions and labels will loop over the lists and plot the ROC for each one.

Overall, we see the performance of each prediction is similar. The pROC package, described in the conclusion, can test the performance between ROC curves.

Advanced: If you want to see how performance objects are plotted, use getMethod("plot", signature = c(x="performance",y="missing")) and ROCR:::.plot.performance.

Limiting to a FPR: partial ROC curve

You may only want to accept a false positive rate of a certain level, let's say 10%. The function pROC below will only keep values less than or equal to the FPR you set.

pROC = function(pred, fpr.stop){
    perf <- performance(pred,"tpr","fpr")
    for (iperf in seq_along(perf@x.values)){
        ind = which(perf@x.values[[iperf]] <= fpr.stop)
        perf@y.values[[iperf]] = perf@y.values[[iperf]][ind]
        perf@x.values[[iperf]] = perf@x.values[[iperf]][ind]
    }
    return(perf)
}

Let's use this on the simple cases and plot the partial ROC curve:

proc.perf = pROC(pred, fpr.stop=0.1)
plot(proc.perf)
abline(a=0, b= 1)

plot of chunk pROC

Thus, if we can only accept a FPR of 10%, the model is only giving 50% sensitivity (TPR) at 10% FPR (1-specificity).

Getting an “optimal” cut point

In some applications of ROC curves, you want the point closest to the TPR of \(1\) and FPR of \(0\). This cut point is “optimal” in the sense it weighs both sensitivity and specificity equally. To deterimine this cutoff, you can use the code below. The code takes in BOTH the performance object and prediction object and gives the optimal cutoff value of your predictions:

opt.cut = function(perf, pred){
    cut.ind = mapply(FUN=function(x, y, p){
        d = (x - 0)^2 + (y-1)^2
        ind = which(d == min(d))
        c(sensitivity = y[[ind]], specificity = 1-x[[ind]], 
            cutoff = p[[ind]])
    }, perf@x.values, perf@y.values, pred@cutoffs)
}
print(opt.cut(roc.perf, pred))
                 [,1]
sensitivity 0.8494624
specificity 0.8504673
cutoff      0.5014893

Now, there is a cost measure in the ROCR package that you can use to create a performance object. If you use it to find the minimum cost, then it will give you the same cutoff as opt.cut, but not give you the sensitivity and specificity.

cost.perf = performance(pred, "cost")
pred@cutoffs[[1]][which.min(cost.perf@y.values[[1]])]
[1] 0.5014893

Different costs for FP and FN

The output from opt.cut and a performance object with measure cost are NOT equivalent if false positives and false negatives are not weighted equally. The cost.fn and cost.fp arguments can be passed to performance, corresponding to the cost of a false negative and false positive, respectively. Let's say false positives are twice as costly as false negatives, and let's get a cut point:

cost.perf = performance(pred, "cost", cost.fp = 2, cost.fn = 1)
pred@cutoffs[[1]][which.min(cost.perf@y.values[[1]])]
[1] 0.5294022

Thus, we have a different “optimal” cut point with this changed cost function. In many real-life applications of biomarkers, the cost of a false positive and false negative are not the same. For example, missing someone with a disease based on a test may cost a hospital $1,000,000 in lawsuits, but treating someone who did not have the disease may cost $100,000 in treatments. In that case, the cost of a false negative is 10 times that of a false positive, strictly in monetary measures. No cost analysis is this simple and is usually based on many factors, but most analyses do not have equal cost for a false positive versus a false negative.

The code is the same for the optimal cutoff for the multiple prediction data:

print(opt.cut(many.roc.perf, manypred))
                  [,1]       [,2]       [,3]       [,4]       [,5]
sensitivity  0.8076923  0.8205128  0.7692308  0.8205128  0.7564103
specificity  0.7902622  0.7827715  0.8501873  0.8164794  0.8464419
cutoff      -0.5749773 -0.5640632 -0.4311301 -0.5336958 -0.4863360
                  [,6]       [,7]       [,8]       [,9]      [,10]
sensitivity  0.7820513  0.7948718  0.7820513  0.7435897  0.7435897
specificity  0.8089888  0.8314607  0.8089888  0.8352060  0.8501873
cutoff      -0.5364402 -0.4816705 -0.5388664 -0.4777073 -0.4714354

Accuracy

Simple example: one set of prediction and labels

Another cost measure that is popular is overall accuracy. This measure optimizes the correct results, but may be skewed if there are many more negatives than positives, or vice versa. Let's get the overall accuracy for the simple predictions and plot it:

acc.perf = performance(pred, measure = "acc")
plot(acc.perf)

plot of chunk accc_pred

What if we actually want to extract the maximum accuracy and the cutoff corresponding to that? In the performance object, we have the slot x.values, which corresponds to the cutoff in this case, and y.values, which corresponds to the accuracy of each cutoff. We'll grab the index for maximum accuracy and then grab the corresponding cutoff:

ind = which.max( slot(acc.perf, "y.values")[[1]] )
acc = slot(acc.perf, "y.values")[[1]][ind]
cutoff = slot(acc.perf, "x.values")[[1]][ind]
print(c(accuracy= acc, cutoff = cutoff))
 accuracy    cutoff 
0.8500000 0.5014893 

Hooray! Then you can go forth and threshold your model using the cutoff for (in hopes) maximum accuracy in your test data.

Example: multiple sets of prediction and labels

Again, we will do the same with many predictions and labels, but must iterate over the results (using a mapply statement):

many.acc.perf = performance(manypred, measure = "acc")
sapply(manypred@labels, function(x) mean(x == 1))
 [1] 0.226087 0.226087 0.226087 0.226087 0.226087 0.226087 0.226087
 [8] 0.226087 0.226087 0.226087
mapply(function(x, y){
  ind = which.max( y )
  acc = y[ind]
  cutoff = x[ind]
  return(c(accuracy= acc, cutoff = cutoff))
}, slot(many.acc.perf, "x.values"), slot(many.acc.perf, "y.values"))
               [,1]         [,2]      [,3]       [,4]      [,5]       [,6]
accuracy 0.86376812  0.881159420 0.8666667  0.8724638 0.8724638  0.8753623
cutoff   0.02461465 -0.006091327 0.2303707 -0.1758013 0.1251976 -0.2153779
               [,7]      [,8]      [,9]      [,10]
accuracy  0.8753623 0.8724638 0.8637681 0.86376812
cutoff   -0.2066697 0.1506282 0.2880392 0.06536471

We see that these cutoffs are not the same as those using the opt.cut from above. This is due to the the fact that the proportion of positive cases is much less than 50%.

Area under the curve (AUC) and partial AUC (pAUC)

Simple example: one set of prediction and labels

The area under curve summarizes the ROC curve just by taking the area between the curve and the x-axis. the Let's get the area under the curve for the simple predictions:

auc.perf = performance(pred, measure = "auc")
auc.perf@y.values
[[1]]
[1] 0.8341875

As you can see, the result is a scalar number, the area under the curve (AUC). This number ranges from \(0\) to \(1\) with \(1\) indicating 100% specificity and 100% sensitivity.

As before, if you only want to accept a fixed FPR, we can calculate a partial AUC, using the fpr.stop argument:

pauc.perf = performance(pred, measure = "auc", fpr.stop=0.1)
pauc.perf@y.values
[[1]]
[1] 0.02780625

Now, we see the pAUC to be much lower. It is of note that this value can range from \(0\) to whatever fpr.stop is. In order to standardize it to \(1\), you can divide it by fpr.stop to give a \([0, 1]\) measure:

pauc.perf@y.values = lapply(pauc.perf@y.values, function(x) x / 0.1)
pauc.perf@y.values
[[1]]
[1] 0.2780625

Although this measure is more comparable to the full AUC measure, it is still low. Note, there is no “one” cutoff for AUC or pAUC, as it measures the performance over all cutoffs. Also, plotting functions for scalar outcome measures (such as AUC) do not work for performance objects. The code for the multiple predictions is the same.

manypauc.perf = performance(manypred, measure = "auc", fpr.stop=0.1)
manypauc.perf@y.values = lapply(manypauc.perf@y.values, function(x) x / 0.1)
manypauc.perf@y.values
[[1]]
[1] 0.500048

[[2]]
[1] 0.5692404

[[3]]
[1] 0.5182944

[[4]]
[1] 0.5622299

[[5]]
[1] 0.5379814

[[6]]
[1] 0.5408624

[[7]]
[1] 0.5509939

[[8]]
[1] 0.5334678

[[9]]
[1] 0.4979353

[[10]]
[1] 0.4870354

Note, use sapply instead of lapply if you want the result to be a vector.

Conclusion

For ROC analysis the ROCR package has good methods and many built in measures. Other packages, such as the pROC package, can be useful for many functions and analyses, especially testing the difference between ROC and pROC curves. In some ways, you may want to use proc admissibly over ROCR, especially because (when I checked Dec 18, 2014) the ROCR package was orphaned. But if you are working in ROCR, I hope this give you some examples of how to fit the objects and extract the results.

My Commonly Done ggplot2 graphs: Part 2

In my last post I described some of my commonly done ggplot2 graphs. It seems as though some people are interested in these, so I was going to follow this up with other plots I make frequently.

Scatterplot colored by continuous variable

The setup of the data for the scatterplots will be the same as the previous post, one x variable and one y variable.

library(ggplot2)
set.seed(20141106)
data = data.frame(x = rnorm(1000, mean=6), 
                  batch = factor(rbinom(1000, size=4, prob = 0.5)))
data$group1 = 1-rbeta(1000, 10, 2)
mat = model.matrix(~ batch, data=data)
mat = mat[, !colnames(mat) %in% &quot;(Intercept)&quot;]
betas = rbinom(ncol(mat), size=20, prob = 0.5)
data$quality = rowSums(t(t(mat) * sample(-2:2)))
data$dec.quality = cut(data$quality, 
                       breaks = unique(quantile(data$quality, 
                                         probs = seq(0, 1, by=0.1))),
                       include.lowest = TRUE)

batch.effect = t(t(mat) * betas)
batch.effect = rowSums(batch.effect)
data$y = data$x * 5 + rnorm(1000) + batch.effect  + 
  data$quality * rnorm(1000, sd = 2)

data$group2 = runif(1000)

I have added 2 important new variables, quality and batch. The motivation for these variables is akin to an RNAseq analysis set where you have a quality measure like read depth, and where the data were processed in different batches. The y variable is based both on the batch effect and the quality.

We construct the ggplot2 object for plotting x against y as follows:

g = ggplot(data, aes(x = x, y=y)) + geom_point()
print(g)

plot of chunk g_create

Coloring by a 3rd Variable (Discrete)

Let's plot the x and y data by the different batches:

print({ g + aes(colour=batch)})

plot of chunk gbatch

We can see from this example how to color by another third discrete variable. In this example, we see that the relationship is different by each batch (each are shifted).

Coloring by a 3rd Variable (Continuous)

Let's color by quality, which is continuous:

print({ gcol = g + aes(colour=quality)})

plot of chunk quality

The default option is to use black as a low value and blue to be a high value. I don't always want this option, as I cannot always see differences clearly. Let's change the gradient of low to high values using scale_colour_gradient:

print({ gcol + scale_colour_gradient(low = &quot;red&quot;, high=&quot;blue&quot;) })

plot of chunk qual_col

This isn't much better. Let's call the middle quality gray and see if we can see better separation:

print({ gcol_grad = gcol + 
          scale_colour_gradient2(low = &quot;red&quot;, mid = &quot;gray&quot;, high=&quot;blue&quot;) })

plot of chunk qual_col_mid

Scatterplot with Coloring by a 3rd Variable (Continuous broken into Discrete)

Another option is to break the quality into deciles (before plotting) and then coloring by these as a discrete variable:

print({ gcol_dec = g + aes(colour=dec.quality) })

plot of chunk decqual

Scatterplot with Coloring by 3rd Continuous Variable Faceted by a 4th Discrete Variable

We can combine these to show each batch in different facets and coloring by quality:

print({ gcol_grad + facet_wrap(~ batch )})

plot of chunk decqual_batch

We can compound all these operations by passing transformations to scale_colour_gradient such as scale_colour_gradient(trans = "sqrt"). But enough with scatterplots.

Distributions! Lots of them.

One of the gaping holes in my last post was that I did not do any plots of distributions/densities of data. I ran the same code from the last post to get the longitudinal data set named dat.

Histograms

Let's say I want a distribution of my yij variable for each person across times:

library(plyr)
g = ggplot(data=dat, aes(x=yij, fill=factor(id))) +   guides(fill=FALSE)
ghist = g+ geom_histogram(binwidth = 3) 
print(ghist)

plot of chunk unnamed-chunk-1

Hmm, that's not too informative. By default, the histograms stack on top of each other. We can change this by setting position to be identity:

ghist = g+ geom_histogram(binwidth = 3, position =&quot;identity&quot;) 
print(ghist)

plot of chunk ghist_ident

There are still too many histograms. Let's plot a subset.

Aside: Using the %+% operator

The %+% operator allows you to reset what dataset is in the ggplot2 object. The data must have the same components (e.g. variable names); I think this is most useful for plotting subsets of data.

nobs = 10
npick = 5

Let's plot the density of \(5\) people people with \(10\) or more observations both using geom_density and geom_line(stat = "density"). We will also change the binwidth:

tab = table(dat$id)
ids = names(tab)[tab &gt;= nobs]
ids = sample(ids, npick)
sub = dat[ dat$id %in% ids, ]
ghist = g+ geom_histogram(binwidth = 5, position =&quot;identity&quot;) 
ghist %+% sub

plot of chunk sub

Overlaid Histograms with Alpha Blending

Let's alpha blend these histograms to see the differences:

ggroup = ggplot(data=sub, aes(x=yij, fill=factor(id))) + guides(fill=FALSE)
grouphist = ggroup+ geom_histogram(binwidth = 5, position =&quot;identity&quot;, alpha = 0.33) 
grouphist

plot of chunk unnamed-chunk-2

Similarly, we can plot over the 3 groups in our data:

ggroup = ggplot(data=dat, aes(x=yij, fill=factor(group))) + guides(fill=FALSE)
grouphist = ggroup+ geom_histogram(binwidth = 5, position =&quot;identity&quot;, alpha = 0.33) 
grouphist

plot of chunk unnamed-chunk-3

These histograms are something I commonly do when I want overlay the data in some way. More commonly though, espeically with MANY distributions, I plot densities.

Densities

We can again plot the distribution of \(y_{ij}\) for each person by using kernel density estimates, filled a different color for each person:

g = ggplot(data=dat, aes(x=yij, fill=factor(id))) +   guides(fill=FALSE)
print({gdens = g+ geom_density() })

plot of chunk gdens

As the filling overlaps a lot and blocks out other densities, we can use just different colors per person/id/group:

g = ggplot(data=dat, aes(x=yij, colour=factor(id))) +   guides(colour=FALSE)
print({gdens = g+ geom_density() })

plot of chunk colourdens

I'm not a fan that the default for stat_density is that the geom = "area". This creates a line on the x-axis that closes the object. This is very important if you want to fill the density with different colors. Most times though, I want simply the line of the density with no bottom line. We can achieve this with:

print({gdens2 = g+ geom_line(stat = &quot;density&quot;)})

plot of chunk gdens2

What if we set the option to fill the lines now? Well lines don't take fill, so it will not colour each line differently.

gdens3 = ggplot(data=dat, aes(x=yij, fill=factor(id))) + geom_line(stat = &quot;density&quot;) +  guides(colour=FALSE)
print({gdens3})

plot of chunk gdens3

Now, regardless of the coloring, you can't really see the difference in people since there are so many. What if we want to do the plot with a subset of the data and the object is already constructed? Again, use the %+% operator.

Overlaid Densities with Alpha Blending

Let's take just different subsets of groups, not people, and plot the densities, with alpha blending:

print({ group_dens = ggroup+ geom_density(alpha = 0.3) })

plot of chunk group_dens

That looks much better than the histogram example for groups. Now let's show these with lines:

print({group_dens2 = ggroup+ geom_line(stat = &quot;density&quot;)})

plot of chunk group_dens2

What happened? Again, lines don't take fill, they take colour:

print({group_dens2 = ggroup+ geom_line(aes(colour=group), stat = &quot;density&quot;)})

plot of chunk group_dens3

I'm a firm believer of legends begin IN plots, so let's push that in there and make it blend in:

print({
  group_dens3 =
  group_dens2 +  theme(legend.position = c(.75, .75),
        legend.background = element_rect(fill=&quot;transparent&quot;),
        legend.key = element_rect(fill=&quot;transparent&quot;, 
                                  color=&quot;transparent&quot;))
})

plot of chunk group_dens_legend

Lastly, I'll create a dataset of the means of the datasets and put vertical lines for the mean:

gmeans = ddply(dat, .(group), summarise,
              mean = mean(yij))
group_dens3 + geom_vline(data=gmeans, 
                         aes(colour = group, xintercept = mean))

plot of chunk group_mean

Conclusion

Overall, this post should give you a few ways to play around with densities and such for plotting. All the same changes as the previous examples with scatterplots, such as facetting, can be used with these distribution plots. Many times, you want to look at the data in very different ways. Histograms can allow you to see outliers in some ways that densities do not because they smooth over the data. Either way, the mixture of alpha blending, coloring, and filling (though less useful for many distributions) can give you a nice description of what's going on a distributional level in your data.

PS: Boxplots

You can also do boxplots for each group, but these tend to separate well and colour relatively well using defaults, so I wil not discuss them here. My only note is that you can (and should) overlay points on the boxplot rather than just plot the histogram. You may need to jitter the points, alpha blend them, subsample the number of points, or clean it up a bit, but I try to display more DATA whenever possible.

My Commonly Done ggplot2 graphs

In my last post, I discussed how ggplot2 is not always the answer to the question “How should I plot this” and that base graphics were still very useful.

Why Do I use ggplot2 then?

The overall question still remains: why (do I) use ggplot2?

ggplot2 vs lattice

For one, ggplot2 replaced the lattice package for many plot types for me. The lattice package is a great system, but if you are plotting multivariate data, I believe you should choose lattice or ggplot2. I chose ggplot2 for the syntax, added capabilities, and the philosophy behind it. The fact Hadley Wickham is the developer never hurts either.

Having multiple versions of the same plot, with slight changes

Many times I want to do the same plot over and over, but vary one aspect of it, such as color of the points by a grouping variable, and then switch the color to another grouping variable. Let me give a toy example, where we have an x and a y with two grouping variables: group1 and group2.

library(ggplot2)
set.seed(20141016)
data = data.frame(x = rnorm(1000, mean=6))
data$group1 = rbinom(n = 1000, size =1 , prob =0.5)
data$y = data$x * 5 + rnorm(1000)
data$group2 = runif(1000) > 0.2

We can construct the ggplot2 object as follows:

g = ggplot(data, aes(x = x, y=y)) + geom_point()

The ggplot command takes the data.frame you want to use and use the aes to specify which aesthetics we want to specify, here we specify the x and y. Some aesthetics are optional depending on the plot, some are not. I think it’s safe to say you always need an x. I then “add” (using +) to this object a “layer”: I want a geometric “thing”, and that thing is a set of points, hence I use geom_points. I’m doing a scatterplot.

If you just call the object g, print is called by default, which plots the object and we see our scatterplot.

g

plot of chunk print_g

I can color by a grouping variable and we can add that aesthetic:

g + aes(colour = group1)

plot of chunk color_group

g + aes(colour = factor(group1))

plot of chunk color_group

g + aes(colour = group2)

plot of chunk color_group

Note, g is the original plot, and I can add aes to this plot, which is the same as if I did ggplot2(data, aes(...)) in the original call that generated g. NOTE if the aes you are adding was not a column of the data.frame when you created the plot, you will get an error. For example, let’s add a new column to the data and then add it to g:

data$newcol = rbinom(n = nrow(data), size=2, prob = 0.5)
g + aes(colour=factor(newcol))
Error: object 'newcol' not found

This fails because the way the ggplot2 object was created. If we had added this column to the data, created the plot, then added the newcol as an aes, the command would work fine.

g2 = ggplot(data, aes(x = x, y=y)) + geom_point()
g2 + aes(colour=factor(newcol))

plot of chunk g2_create2

We see in the first plot with colour = group1, ggplot2 sees a numeric variable group1, so tries a continuous mapping scheme for the color. The default is to do a range of blue colors denoting intensity. If we want to force it to a discrete mapping, we can turn it into a factor colour = factor(group1). We see the colors are very different and are not a continuum of blue, but colors that separate groups better.

The third plot illustrates that when ggplot2 takes logical vectors for mappings, it factors them, and maps the group to a discrete color.

Slight Changes with additions

In practice, I do this iterative process many times and the addition of elements to a common template plot is very helpful for speed and reproducing the same plot with minor tweaks.

In addition to doing similar plots with slight grouping changes I also add different lines/fits on top of that. In the previous example, we colored by points by different grouping variables. In other examples, I tend to change little things, e.g. a different smoother, a different subset of points, constraining the values to a certain range, etc. I believe in this way, ggplot2 allows us to create plots in a more structured way, without copying and pasting the entire command or creating a user-written wrapper function as you would in base. This should increase reproducibility by decreasing copy-and-paste errors. I tend to have many copy-paste errors, so I want to limit them as much as possible.

ggplot reduces number of commands I have to do

One example plot I make frequently is a scatterplot with a smoother to estimate the shape of bivariate data.

In base, I usually have to run at least 3 commands to do this, e.g. loess, plot, and lines. In ggplot2, geom_smooth() takes care of this for you. Moreover, it does the smoothing by each different aesthetics (aka smoothing per group), which is usually what I want do as well (and takes more than 3 lines in base, usually a for loop or apply statement).

g2 + geom_smooth()

plot of chunk smooth
By default in geom_smooth, it includes the standard error of the estimated relationship, but I usually only look at the estimate for a rough sketch of the relationship. Moreover, if the data are correlated (such as in longitudinal data), the standard errors given by default methods are usually are not accurate anyway.

g2 + geom_smooth(se = FALSE)

plot of chunk smooth_no_se

Therefore, on top of the lack of copying and pasting, you can reduce the number of lines of code. Some say “but I can make a function that does that” – yes you can. Or you can use the commands already there in ggplot2.

Faceting

The other reason I frequently use ggplot2 is for faceting. You can do the same graph, conditioned on levels of a variable, which I frequently used. Many times you want to do a graph, subset by another variable, such as treatment/control, male/female, cancer/control, etc.

g + facet_wrap(~ group1)

plot of chunk facet_group

g + facet_wrap(~ group2)

plot of chunk facet_group

g + facet_wrap(group2 ~ group1)

plot of chunk facet_group

g + facet_wrap( ~ group2 + group1)

plot of chunk facet_group

g + facet_grid(group2 ~ group1)

plot of chunk facet_group

Spaghetti plot with Overall smoother

I also frequently have longitudinal data and make spaghetti plot for a per-person trajectory over time.

For this example I took code from StackOverflow to create some longitudinal data.

library(MASS)
library(nlme)

### set number of individuals
n <- 200

### average intercept and slope
beta0 <- 1.0
beta1 <- 6.0

### true autocorrelation
ar.val <- .4

### true error SD, intercept SD, slope SD, and intercept-slope cor
sigma <- 1.5
tau0  <- 2.5
tau1  <- 2.0
tau01 <- 0.3

### maximum number of possible observations
m <- 10

### simulate number of observations for each individual
p <- round(runif(n,4,m))

### simulate observation moments (assume everybody has 1st obs)
obs <- unlist(sapply(p, function(x) c(1, sort(sample(2:m, x-1, replace=FALSE)))))

### set up data frame
dat <- data.frame(id=rep(1:n, times=p), obs=obs)

### simulate (correlated) random effects for intercepts and slopes
mu  <- c(0,0)
S   <- matrix(c(1, tau01, tau01, 1), nrow=2)
tau <- c(tau0, tau1)
S   <- diag(tau) %*% S %*% diag(tau)
U   <- mvrnorm(n, mu=mu, Sigma=S)

### simulate AR(1) errors and then the actual outcomes
dat$eij <- unlist(sapply(p, function(x) arima.sim(model=list(ar=ar.val), n=x) * sqrt(1-ar.val^2) * sigma))
dat$yij <- (beta0 + rep(U[,1], times=p)) + (beta1 + rep(U[,2], times=p)) * log(dat$obs) + dat$eij

I will first add an alpha level to the plotting lines for the next plot (remember this must be done before the original plot is created). tspag will be the template plot, and I will create a spaghetti plot (spag) where each colour represents an id:

library(plyr)
dat = ddply(dat, .(id), function(x){
  x$alpha = ifelse(runif(n = 1) > 0.9, 1, 0.1)
  x$grouper = factor(rbinom(n=1, size =3 ,prob=0.5), levels=0:3)
  x
})
tspag = ggplot(dat, aes(x=obs, y=yij)) + 
  geom_line() + guides(colour=FALSE) + xlab("Observation Time Point") +
  ylab("Y")
spag = tspag + aes(colour = factor(id))
spag

plot of chunk spag

Many other times I want to group by id but plot just a few lines (let’s say 10% of them) dark and the other light, and not colour them:

bwspag = tspag + aes(alpha=alpha, group=factor(id)) + guides(alpha=FALSE)
bwspag

plot of chunk unnamed-chunk-1

Overall, these 2 plots are useful when you have longitudinal data and don’t want to loop over ids or use lattice. The great addition is that all the faceting and such above can be used in conjunction with these plots to get spaghetti plots by subgroup.

spag + facet_wrap(~ grouper)

plot of chunk spag_facet

Spaghetti plot with overall smoother

If you want a smoother for the overall group in addition to the spaghetti plot, you can just add geom_smooth:

sspag = spag + geom_smooth(se=FALSE, colour="black", size=2)
sspag

plot of chunk spag_smooth

sspag + facet_wrap(~ grouper)

plot of chunk spag_smooth

bwspag + facet_wrap(~ grouper)

plot of chunk spag_smooth

Note that the group aesthetic and colour aesthetic do not perform the same way for some operations. For example, let’s try to smooth bwswag:

bwspag + facet_wrap(~ grouper) + geom_smooth(se=FALSE, colour="red")

plot of chunk spag_smooth_bad

We see that it smooths each id, which is not what we want. We can achieve the desired result by setting the group aesthetic:

bwspag + facet_wrap(~ grouper) + 
  geom_smooth(aes(group=1), se=FALSE, colour="red", size =2)

plot of chunk spag_smooth_corr

I hope that this demonstrates some of the simple yet powerful commands ggplot2 allows users to execute. I agree that some behavior may not seem straightforward at first glance, but becomes more understandable as one uses ggplot2 more.

Another Case this is Useful – Save Plot Twice

Another (non-plotting) example I want to show is how saving ggplot2 objects can make saving duplicate plots much easier. In many cases for making plots to show to others, I open up a PDF device using pdf, make a series of plots, and then close the PDF. There may be 1-2 plots that are the real punchline, and I want to make a high-res PNG of them separately. Let me be clear, I want both – the PDF and PNG. For example:

pdf(tempfile())
print({g1 = g + aes(colour = group1)})
print({g1fac = g + aes(colour = factor(group1))})
print({g2 = g + aes(colour = group2)})
dev.off()

plot of chunk pdf_and_pngs

png(tempfile(), res = 300, height =7, width= 7, units = "in")
print(g2)
dev.off()

I am printing the objects, while assigning them. (I have to use the { brackets because I use = for assignment and print would evaluate that as arguments without {}). As g2 is already saved as an object, I can close the PDF, open a png and then print that again.

No copying and pasting was needed for remaking the plot, nor some weird turning off and on devices.

Conclusion

I (and think you should) use ggplot2 for a lot of reasons, especially the grammar and philosophy. The plots I have used here are some powerful representations of data that are simple to execute. I believe using this system reflects and helps the true iterative process of making figures. The syntax may not seem intuitive to a long-time R user, but I believe the startup cost is worth the effort. Let me know if you’d like to see any other plots that you commonly use.

Working with NIfTI images in R

The oro.nifti package is awesome for NeuRoimaging (couldn't help myself). It has functions to read/write images, introduces the S4 nifti class, and has useful plotting functions. There are some limitations and some gotchas that are important to discuss if you are working with these objects in R.

Dataset Creation

We'll read in some data (a template, example taken from readNIfTI in oro.nifti. We have to download it as most packages aren't allowed to include data (Bioconductor has mostly remedied this for bioinformatics).

library(oro.nifti)
URL <- "http://imaging.mrc-cbu.cam.ac.uk/downloads/Colin/colin_1mm.tgz"
urlfile <- "colin_1mm.tgz"
if (!file.exists(urlfile)){
  download.file(URL, dest=urlfile, quiet=TRUE)
  untar(urlfile)
}
img <- readNIfTI("colin_1mm")
img = cal_img(img)

The img object read in is of class nifti, has a series of slots (remember S4 object, not S3), and the print method produces information about the NIfTI header.

class(img)
[1] "nifti"
attr(,"package")
[1] "oro.nifti"
print(img)
NIfTI-1 format
  Type            : nifti
  Data Type       : 2 (UINT8)
  Bits per Pixel  : 8
  Slice Code      : 0 (Unknown)
  Intent Code     : 0 (None)
  Qform Code      : 2 (Aligned_Anat)
  Sform Code      : 2 (Aligned_Anat)
  Dimension       : 181 x 217 x 181 x 1
  Pixel Dimension : 1 x 1 x 1 x 0
  Voxel Units     : mm
  Time Units      : sec
slotNames(img)
 [1] ".Data"          "sizeof_hdr"     "data_type"      "db_name"       
 [5] "extents"        "session_error"  "regular"        "dim_info"      
 [9] "dim_"           "intent_p1"      "intent_p2"      "intent_p3"     
[13] "intent_code"    "datatype"       "bitpix"         "slice_start"   
[17] "pixdim"         "vox_offset"     "scl_slope"      "scl_inter"     
[21] "slice_end"      "slice_code"     "xyzt_units"     "cal_max"       
[25] "cal_min"        "slice_duration" "toffset"        "glmax"         
[29] "glmin"          "descrip"        "aux_file"       "qform_code"    
[33] "sform_code"     "quatern_b"      "quatern_c"      "quatern_d"     
[37] "qoffset_x"      "qoffset_y"      "qoffset_z"      "srow_x"        
[41] "srow_y"         "srow_z"         "intent_name"    "magic"         
[45] "extender"       "reoriented"    
hist(img)

plot of chunk printimg

Array operation

Now, we can do simple operations on this object like any other array. The problem is that certain operations do not (some rightfully so) return a nifti object.

For example, let's get an indicator if the value is greater than 400000. (This is an MRI and has arbitrary units).

mask = img > 400000
mask
NIfTI-1 format
  Type            : nifti
  Data Type       : 2 (UINT8)
  Bits per Pixel  : 8
  Slice Code      : 0 (Unknown)
  Intent Code     : 0 (None)
  Qform Code      : 2 (Aligned_Anat)
  Sform Code      : 2 (Aligned_Anat)
  Dimension       : 181 x 217 x 181 x 1
  Pixel Dimension : 1 x 1 x 1 x 0
  Voxel Units     : mm
  Time Units      : sec
class(mask)
[1] "nifti"
attr(,"package")
[1] "oro.nifti"

We see a nifti object is returned, with all the header information contained. Great! Let's get an indicator of values greater than 400000 but less than 600000.

mask_2 = img > 400000 & img < 600000
class(mask_2)
[1] "array"

We see that this operation returned an array, not a nifti object. That's kind of unexpected, but probably conservative. For example, if this had been 2 different nifti objects and you were performing the subsetting, which object's information should you use?

We see the same behavior when multiplying (or subtracting, etc) two nifti objects.

img_masked = img * mask
class(img_masked)
[1] "array"

We know the object is an array but we may want a nifti object.

fslr Helper functions

The niftiarr function in fslr inputs a nifti object and an array, and returns a nifti object with the array in the @.Data slot, copying over the image header information from the input nifti object.

### need the development version
# library(devtools)
# devtools::install_github("muschellij2/fslr")
library(fslr)
img_masked = niftiarr(img, img * mask)
class(img_masked)
[1] "nifti"
attr(,"package")
[1] "oro.nifti"

Other ways to skin the cat

Another way of masking the image is to subset the values of the image that are not in the mask and setting those values to \(0\) (or some other value).

img_masked_2 = img
img_masked_2[mask == 0] = 0
class(img_masked_2)
[1] "nifti"
attr(,"package")
[1] "oro.nifti"

We see that this correctly returns an object of class nifti. One potential issue with this direction is that cal_min and cal_max slots on the resulting nifti object not the range of the image. This is problematic because orthographic and image.nifti both use this slot to determine grayscale ranges for plotting and writeNIfTI errors if these values don't match. The cal_img function in fslr is a simple helper function that will set these values to the respective values of the range of the data.

range(img_masked_2)
[1]      0 791972
c(cal.min(img_masked_2), cal.max(img_masked_2))
[1]      0 791972
img_masked_2 = cal_img(img_masked_2)
range(img_masked_2)
[1]      0 791972

Note, sometimes readNIfTI sets the cal_min and cal_max slots both to \(0\), which may be a result of how the nifti was created.

Equivalent Operations

We see that after these operations done 2 different ways, the resulting nifti objects are equivalent.

all.equal(img_masked, img_masked_2)
[1] TRUE

Data Typing and Writing nifti objects

I've worked with oro.nifti for some time, and I wanted to also discuss writing nifti objects. Many operations preserve the data type of the array in the nifti object, such as if the data were integers, an array of integers returned. Other times, the data type changes, such as if you smoothed an image of integers, the resulting array would have values that are not integers, but probably continuous values. When plotting the data or operating with the R object, this is not generally an issue because any extraction of the information will take the nifti object's @.Data slot and return that array.

This presents a problem when you write nifti objects using the writeNIfTI function.

Slots datatype and bitpix

In a nifti object, the datatype and bitpix fields specify what type the data is and the level of precision stored in the NIfTI file. For example, if you have binary data, the
datatype would be a value of 2 for UINT8 (unsigned integer) and the bitpix would be 8 for 8 bits per voxel. These must correspond to each other (i.e. no mismatching). Also, storing data in a type that is what I call “over-storing” (e.g storing binary data in a FLOAT32 format) may make files larger than necessary. Under-storing is much more problematic, however.

Why you should care about under-storing

If you read in data as integers, did some operations that returned continuous (decimal) objects, and then write the nifti object using the original header information would truncate the data in ways you wouldn't want. In R, the object would look fine, but when you wrote it to disk, it would be truncated.

img@bitpix
[1] 8
img@datatype
[1] 2

Let's say we scaled the data by its maximum:

pct = img / max(img)
pct = cal_img(pct)
pct@bitpix
[1] 8
pct@datatype
[1] 2
hist(pct)

plot of chunk pct

### we will explain this in the next section
pct = drop_img_dim(pct)
pct = zero_trans(pct)
pct = onefile(pct)

If we write this to disk, and read the result back in, we see:

par(mfrow=c(3,1))
hist(pct, main="Correct Data")

### Writing out pct without any datatype change - still UINT8
tfile = tempfile()
writeNIfTI(pct, filename = tfile, verbose=TRUE)
  writing data at byte = 352
pct2 = readNIfTI(paste0(tfile, ".nii.gz"))
hist(pct2, main="Writing out with no change of datatype/bitpix")

### Writing out pct with datatype change - to FLOAT32
pct = datatype(pct, type_string = "FLOAT32")
writeNIfTI(pct, filename = tfile)
pct3 = readNIfTI(paste0(tfile, ".nii.gz"))
hist(pct3, main="Writing out with changing of datatype/bitpix")

plot of chunk readwrite

As you can see, you need to change the bitpix and datatype when writing out NIfTI files if you are changing the data type in the array of data in R. We had changed integers to percentages/proportions, and we changed the output to FLOAT32 (for floating point data). Take a look at convert.datatype() and convert.bitpix() for examples of types.

Other functions for changinag data

We used 3 functions in the last section which changed around pct: drop_img_dim, zero_trans, and onefile.

I want to explain what each of these steps are doing.

Dropping dimensions

When you read in some data (especially from SPM) into R using readNIfTI, the data can be read in as a 4-dimensional image, with one dimension only being one. For example, if we look at the dimensions of img:

dim(img)
[1] 181 217 181   1
img@dim_
[1]   4 181 217 181   1   0   0   0
pixdim(img)
[1] 1 1 1 1 0 0 0 0

We see that the 4th dimension is 1 and the slot dim_ denotes it's a 4-dimensional object (4 is in the first index). Also, we note that values in the dim_ slot are \(0\), which is not allowed when writing out:

writeNIfTI(img, filename = tempfile())
Error: all dim elements > dim[1] must be 1
dim/pixdim mismatch

In order to force it to be a 3-dimensional object, drop_img_dim changes all these aspects for the user.

img_3d = drop_img_dim(img)
dim(img_3d)
[1] 181 217 181
img_3d@dim_
[1]   3 181 217 181   1   1   1   1
pixdim(img_3d)
[1] 1 1 1 1 0 0 0 0
tfile = tempfile()
writeNIfTI(img_3d, filename = tfile)

Unscaling the data

Many times this is not necessary, but some NIfTI files have the scl_slope and scl_inter slots defined. These denote what to multiply and then add to the data to get it in the correct values. zero_trans simply makes scl_slope = 0 and scl_inter = 1.

Changing NIfTI magic slots

In each NIfTI image, there is a magic slot (see http://nifti.nimh.nih.gov/nifti-1/documentation/nifti1fields/nifti1fields_pages/magic.html):

img_3d@magic
[1] "ni1"
img_3d@vox_offset
[1] 0

The NIfTI documentation states:

“ni1” means that the image data is stored in the “.img” file corresponding
to the header file (starting at file offset 0).

We previously wrote out the 3D image, but if we try to read it back in, readNIfTI fails:

readNIfTI(fname = tfile)
Error: This is not a one-file NIfTI format

We can change the default in readNIfTI to onefile = FALSE, but nothing changes:

writeNIfTI(img_3d, filename = tfile, onefile = FALSE)
readNIfTI(fname = tfile)
Error: This is not a one-file NIfTI format

If we look in oro.nifti:::.read.nifti.content we see

if (!onefile) {
    if (nim@magic != "ni1") {
        stop("This is not a two-file NIfTI format")
    }

We can change this magic to n+1, but we need to change the vox_offset = 352 and that's exactly what onefile() does:

img_3d = onefile(img_3d)
writeNIfTI(img_3d, filename = tfile, onefile = FALSE)
readNIfTI(fname = tfile)
NIfTI-1 format
  Type            : nifti
  Data Type       : 2 (UINT8)
  Bits per Pixel  : 8
  Slice Code      : 0 (Unknown)
  Intent Code     : 0 (None)
  Qform Code      : 2 (Aligned_Anat)
  Sform Code      : 2 (Aligned_Anat)
  Dimension       : 181 x 217 x 181
  Pixel Dimension : 1 x 1 x 1
  Voxel Units     : mm
  Time Units      : sec

Viola! All these fun little quirks that I hope you don't have to encounter. But if you do, I hope this page helps.

Sorted HTML Tables and Javascript Libraries

A few days ago StatsInTheWild asked the following question

So we had a few exchanges where I thought you could use sprintf and be done but it didn't seem to work:

After a bit more discourse, StatsInTheWild shared some data with me:

and I went down the rabbit hole of trying to find out what was going on. Here is the code to make the table:

require(SortableHTMLTables)
myfile = "openWAR.csv"
if (!file.exists(myfile)) {
  download.file("https://dl.dropboxusercontent.com/u/35094868/openWAR.csv", myfile, method="wget")
  }
openWAR<-read.csv(myfile, stringsAsFactors = FALSE);
sortable.html.table(openWAR,"openWAR2014.html")

And as you can see in the output table here the column RAA.pitch does not sort correctly.

Attempts at, and then finding, a Solution

I tried a few things such as changing numeric to string, seeing if missing data was a problem, trying some things where I make the numbers all positive, but the problem persisted.

As StackOverflow usually does, it had insight into an answer. Essentially, prior to version 2.0.5, jquery.tablesorter.js didn't sort numbers exactly correctly. The problem is that SortableHTMLTables ships with version 2.0.3:

head(readLines(system.file("assets/jquery.tablesorter.js", package = "SortableHTMLTables"))) 
[1] "/*"                                                       
[2] " * "                                                      
[3] " * TableSorter 2.0 - Client-side table sorting with ease!"
[4] " * Version 2.0.3"                                         
[5] " * @requires jQuery v1.2.3"                               
[6] " * "                                                      

and uses this version for the table output. Now, if you wanted to fix this, you'd have some css to your file or some other route. Or, you can just update jquery.tablesorter.js. I went to http://tablesorter.com/docs/, downloaded the new js plugin.

But I want this automatic!

If you're using R and don't want to play around with JavaScript, that's the whole point of these functions. Saying you have to edit css or something of the like defeats that purpose. But for this fix to be automatically'' done, you either have to 1) copy the .js file every time you run thesortable.html.table command as it re-copies the files over, 2) wait for the maintainer to update (out of your control), 3) change your css, or 4) copy a new .js file with different name and edit the html file after running to make sure it uses your new js file. I'll implement 4).

require(SortableHTMLTables)
outfile = "openWAR2014_fixed.html"
sortable.html.table(openWAR,outfile)
change_js = function(f, newjs = "jquery.tablesorter_v2.0.5.js"){
  x = readLines(f)
  x = gsub("jquery.tablesorter.js", newjs, x, fixed=TRUE)
  writeLines(x, con = f)
}
change_js(outfile)

(I named my file jquery.tablesorter_v2.0.5.js). Now, you see here, the table works! Hope this helps.

Maintainer

Note, I contacted the maintainer and I'm sure he'll fix this in the next update (he does a LOT of awesome work and development).

Sometimes Table is not the Answer – a Faster 2×2 Table

The table command is great in its simplicity for cross tabulations. I have run into some settings where it is slow and I wanted to demonstrate one simple example here of why you may want to use other functions or write your own tabler. This example is a specific case where, for some examples and functions, you don't need all the good error-checking or flexibility that a function contains, but you want to do something specific and can greatly speed up computation.

Setup of example

I have some brain imaging data. I have a gold standard, where an expert hand-traced (on a computer) a brain scan delineating the brain. I'll refer to this as a brain “mask”. (We use the word mask in imaging to denote a segmented image – either done manually or automatically and I generally reserve the word mask for binary 0/1 values, but others use the term more broadly.)

Using automated methods, I can try to re-create this mask automatically. This image is also binary. I want to simply get a \(2\times2\) contingency table of the automated versus manual masks so I can get sensitivity/specificity/accuracy/etc.

The data

For simplicity and computation, let's consider the images as just a really long vectors. I'll call them manual and auto for the manual and automatic masks, respectively.

These are long logical vectors (9 million elements):

length(manual)
[1] 9175040
length(auto)
[1] 9175040
head(manual)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
head(auto)
[1] FALSE FALSE FALSE FALSE FALSE FALSE

Naturally, you can run table on this data:

stime = system.time({ tab = table(manual, auto) })
print(stime)
   user  system elapsed 
  3.294   0.082   3.386 
print(tab)
       auto
manual    FALSE    TRUE
  FALSE 7941541   11953
  TRUE    15384 1206162

The computation took about 3.4 seconds on my MacBook Pro (2013, 16Gb RAM, 2.8GHz Intel i7), which isn't that bad. Realize, though, that I could have hundreds or thousands of these images. We need to speed this up.

What is the essense of what we're doing?

Taking 3.4 seconds to get 4 numbers seems a bit long. As the data is binary, we can simply compute these with the sum command and logical operators.

Let's make the twoXtwo command:

twoXtwo = function(x, y, dnames=c("x", "y")){
    tt <- sum( x &  y)
        tf <- sum( x & !y)
        ft <- sum(!x &  y)
        ff <- sum(!x & !y)
        tab = matrix(c(ff, tf, ft, tt), ncol=2)
    n = list(c("FALSE", "TRUE"), c("FALSE", "TRUE"))
    names(n) = dnames
    dimnames(tab) = n
    tab = as.table(tab)
    dim
        tab
}

And let's see how fast this is (and confirm the result is the same):

stime2 = system.time({ twotab = twoXtwo(manual, auto, dnames=c("manual", "auto")) })
print(stime2)
   user  system elapsed 
  0.828   0.006   0.835 
print(twotab)
       auto
manual    FALSE    TRUE
  FALSE 7941541   11953
  TRUE    15384 1206162
identical(tab, twotab)
[1] TRUE

Viola, twoXtwo runs about 4.06 times faster than table, largely because we knew we did not have to check certain characteristics of the data and that it's a specific example of a table.

More speed captain!

This isn't something astronomical such as a 100-fold increase, but we can increase the speed by not doing all the logical operations on the vectors, but taking differences from the margin sums.

Let's confirm this is faster and accurate by running it on our data:

stime3 = system.time({ twotab2 = twoXtwo2(manual, auto, dnames=c("manual", "auto")) })
print(stime3)
   user  system elapsed 
  0.198   0.001   0.200 
print(twotab2)
       auto
manual    FALSE    TRUE
  FALSE 7941541   11953
  TRUE    15384 1206162
identical(tab, twotab2)
[1] TRUE

Now, if I were going for speed, this code is good enough for me: it runs about 16.93 times faster than table. The one downside is that it is not as readable as twoXtwo. For even greater speed, I could probably move into C++ using the Rcpp package, but that seems overkill for a two by two table.

Other examples of speeding up the calculation can be found here.

Finishing up

I said I wanted sensitivity/specificity/accuracy/etc. so I will show how to get these. I'm going to use prop.table, which I didn't know about for a while when I first started using R (see margin.table too).

ptab = prop.table(twotab)
rowtab = prop.table(twotab, margin=1)
coltab = prop.table(twotab, margin=2)

As you can see, like the apply command, the prop.table command can either take no margin or take the dimension to divide over (1 for rows, 2 for columns). This means that in ptab, each cell of twotab was divided by the grand total (or sum(tab)). For rowtab, each cell was divided by the rowSums(tab) to get a proportion, and similarly cells in coltab were divided by colSums(tab). After the end of the post, I can show you these are the same.

Getting Performance Measures

Accuracy

Getting the accuracy is very easy:

accur = sum(diag(ptab))
print(accur)
[1] 0.997

Sensitivity/Specificity

For sensitivity/specificity, the “truth” is the rows of the table, so we want the row percentages:

sens = rowtab["TRUE", "TRUE"]
spec = rowtab["FALSE", "FALSE"]
print(sens)
[1] 0.9874
print(spec)
[1] 0.9985

Positive/Negative Predictive Value

We can also get the positive predictive value (PPV) and negative predictive value (NPV) from the column percentages:

ppv = coltab["TRUE", "TRUE"]
npv = coltab["FALSE", "FALSE"]
print(ppv)
[1] 0.9902
print(npv)
[1] 0.9981

Conclusions

After using R for years, I find things to still be very intuitive. Sometimes, though, for large data sets or specific examples, you may want to write your own function for speed, checking against the base functions for a few iterations as a double-check. I have heard this to be a nuisance for those who dislike R, as well as hampering reproducibility in some cases. Overall, I find that someone has made a vetted package that does what you want, but there are still simple cases (such as above) where “rolling your own” is OK and easier than adding a dependency to your code.

Aside: How prop.table works

Over all margins

For just doing prop.table without a margin, you can think of the table being divided by its sum.

print(round(ptab, 3))
       auto
manual  FALSE  TRUE
  FALSE 0.866 0.001
  TRUE  0.002 0.131
print(round(twotab / sum(twotab), 3))
       auto
manual  FALSE  TRUE
  FALSE 0.866 0.001
  TRUE  0.002 0.131

Over row margins

For the margin=1, or row percentages, you can think of dividing the table by the row sums.

print(round(rowtab, 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.987
print(round(twotab / rowSums(twotab), 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.987

Over column margins

Now for column percentages, you can think of R dividing each cell by its column's sum. This is what prop.table does.

Let's look at the result we should get:

print(round(coltab, 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

But in R, when you take a matrix and then add/divide/subtract/multiply it by a vector, R does the operation column-wise. So when you take column sums on the table, you get a vector with the same number of columns as the table:

print(colSums(twotab))
  FALSE    TRUE 
7956925 1218115 

If you try to divide the table by this value, it will not give you the desired result:

print(round( twotab / colSums(twotab), 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.990

R operations with matrices and vectors

This is because R thinks of a vector as a column vector (or a matrix of 1 column). R then takes the first column of the table and divides the first element from the first column sum (which is correct), but take the second element of the first column and divides it by the second column sum (which is not correct).
A detailed discussion on SO is located here of how to do row-wise operations on matrices.

Back to column percentages

We can use the t( t() ) operation to get the correct answer:

print(round( t( t(twotab) / colSums(twotab)), 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

You can think of R implicitly making the matrix of the correct size with the correct values and then dividing:

cs = colSums(twotab)
cs = matrix(cs, nrow=nrow(tab), ncol=ncol(tab), byrow=TRUE)
print(cs)
        [,1]    [,2]
[1,] 7956925 1218115
[2,] 7956925 1218115
print(round( twotab/cs, 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

Happy tabling!