A small neuroimage interactive plotter

Manipulate Package

The manipulate package from RStudio allows you to create simple interactive controls, such as sliders, for visualization. I will use it to make a slider for viewing different slices of an image.

library(manipulate)
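
For anyone new to manipulate, here is a minimal, non-imaging example of my own (the histogram and its settings are just for illustration) showing how a slider drives a plot:

manipulate(
  hist(rnorm(1000), breaks = bins, col = "gray",
       main = paste(bins, "bins")),
  bins = slider(5, 50, initial = 20)
)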

fslr package

I'm calling the fslr package because I know that if you have it installed, you will likely have FSL and a 1mm T1 template from MNI in a specific location. fslr also loads the oro.nifti package so that readNIfTI is accessible after loading fslr. You can download a test NIfTI image here if you don't have access to any images and don't have FSL downloaded.

Here I will read in the template image:

library(fslr)
options(fsl.path='/usr/local/fsl')
template = file.path(fsldir(), "data/standard", 
                     "MNI152_T1_1mm_brain.nii.gz")
img = readNIfTI(template)

The iplot function

The iplot function defined below takes in a nifti object, the plane to be plotted, and additional options to be passed to oro.nifti::image. The function is located on my GitHub here.

iplot = function(img, plane = c("axial", 
                                "coronal", "sagittal"), ...){
  ## pick the plane
  plane = match.arg(plane, c("axial", 
                             "coronal", "sagittal"))
  # Get the max number of slices in that plane for the slider
  ns=  switch(plane,
              "axial"=dim(img)[3],
              "coronal"=dim(img)[2],
              "sagittal"=dim(img)[1])
  ## run the manipulate command
  manipulate({
    image(img, z = z, plot.type= "single", plane = plane, ...)
    # this will return mouse clicks (future experimental work)
    pos <- manipulatorMouseClick()
    if (!is.null(pos)) {
      print(pos)
    }
  },
  ## make the slider
  z = slider(1, ns, step=1, initial = ceiling(ns/2))
  )
}

Example plots

Here are some examples of how this iplot function would be used:

iplot(img)
iplot(img, plane = "coronal")
iplot(img, plane = "sagittal")

The result will be a plotted image of the slice with a slider. This is most useful if you run it within RStudio.

Below are 2 example outputs of what you see in RStudio:

Slice 91:

Slice 145:

Conclusions

The iplot function allows users to interactively explore neuroimages. The plotting is not as fast as I'd like; I may try to speed up the oro.nifti::image command or implement some subsampling. It does, however, show a proof of concept for how interactive neuroimaging visualization can be done in R.

Note

manipulate must be run in RStudio for the interactivity to work. The fslr function fslview will call FSLView from FSL for interactive visualization. This is one option for interactive neuroimaging "in R", but not a real or satisfactory implementation for me (even though I use it frequently). If anyone has implemented such a solution in R, I'd love to hear about it.

matlabr: a Package for Calling MATLAB from R with system

In my research, I primarily use R, but I try to use existing code if available. In neuroimaging and other areas, that means calling MATLAB code. There are some existing solutions for the problem of calling MATLAB from R: namely the R.matlab package and the RMatlab package (which can also call R from MATLAB). I do not usually use these solutions, though.

Previously, Mandy Mejia wrote “THREE WAYS TO USE MATLAB FROM R”. Option 2 is about how to use R.matlab, and Mandy gives an example with some code. She also describes in Options 1 and 3 how to use the system command to call MATLAB commands.

I like this strategy because:

  1. I didn’t take the time to learn R.matlab.
  2. It worked for me.
  3. I wrote a package to wrap the options Mandy described: matlabr.

matlabr: Wrapping together system calls to MATLAB

The matlabr package is located in GitHub and you can install it with the following command:

devtools::install_github("muschellij2/matlabr")

It has a very small set of functions and I will go through each function and describe what they do:

  1. get_matlab: Mostly internal command that will return a character string that will be passed to system. If matlab is in your PATH (bash variable), and you are using R based on the terminal, the command would return "matlab". If MATLAB is not in your PATH or using a GUI-based system like RStudio, you must set options(matlab.path='/your/path/to/matlab').
  2. have_matlab: Wrapper for get_matlab to return a logical if matlab is found.
  3. run_matlab_script: This will pass a .m file to MATLAB. It also wraps the command in a try-catch statement in MATLAB so that if it fails, it will print the error message. Without this try-catch, if MATLAB errors, then running the command will remain in MATLAB and not return to R.
  4. run_matlab_code: This takes a character vector of MATLAB code, ends lines with ;, writes it to a temporary .m file, and then runs run_matlab_script on that temporary file (a rough sketch of this strategy appears after this list).
  5. rvec_to_matlab: Takes in a numeric R vector and creates a MATLAB column matrix.
  6. rvec_to_matlabclist: Takes in a vector from R (usually a character vector) and quotes these strings with single quotes and places them in a MATLAB cell using curly braces: { and }. It then stacks these cells into a “matrix” of cells.
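
To make the system-call strategy concrete, here is a rough sketch of how such a call could be constructed by hand. This is an illustration of the approach, not matlabr's actual internals; the function name run_matlab_sketch and the MATLAB command-line flags shown are my own choices.

# Illustration only -- not matlabr's internal code
run_matlab_sketch = function(lines_of_code, matlab_cmd = "matlab") {
  # if MATLAB is not in your PATH, pass the full path to the binary instead,
  # e.g. file.path(getOption("matlab.path"), "matlab")
  script = tempfile(fileext = ".m")
  # wrap the code in try/catch so a MATLAB error still exits back to R
  body = c("try;", paste0(lines_of_code, ";"),
           "catch err; disp(err.message); end;", "exit;")
  writeLines(body, script)
  system(paste(shQuote(matlab_cmd), "-nodesktop -nosplash -nodisplay -r",
               shQuote(paste0("run('", script, "');"))))
}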

Setting up matlabr

Let’s set up the matlab.path as I’m running in RStudio:

library(matlabr)
options(matlab.path = "/Applications/MATLAB_R2014b.app/bin")
have_matlab()

The result from have_matlab() indicates that the matlab command can be called.

Let’s write some code to test it

Here we will create some code that takes values for x, y, and z (scalars) and a matrix named a, and then saves x, a, and z to a text file:

code = c("x = 10", 
         "y=20;", 
         "z=x+y", 
         "a = [1 2 3; 4 5 6; 7 8 10]",
         "save('test.txt', 'x', 'a', 'z', '-ascii')")
res = run_matlab_code(code)
/var/folders/1s/wrtqcpxn685_zk570bnx9_rr0000gr/T//RtmpHnOinq/file2f8352c04937.m

Output

First off, we see that test.txt indeed was written to disk.

file.exists("test.txt")
[1] TRUE

We can read in test.txt using readLines:

output = readLines(con = "test.txt")
print(output)
[1] "   1.0000000e+01"                                
[2] "   1.0000000e+00   2.0000000e+00   3.0000000e+00"
[3] "   4.0000000e+00   5.0000000e+00   6.0000000e+00"
[4] "   7.0000000e+00   8.0000000e+00   1.0000000e+01"
[5] "   3.0000000e+01"                                

Conclusions

matlabr isn’t fancy and most likely has some drawbacks, as using system can have some quirks. However, these functions have been helpful for me to use some SPM routines and other MATLAB commands while remaining “within R”. R.matlab has a better framework, but it may not be as straightforward for batch processing. Also, matlabr has some wrappers that will do a try-catch so that you don’t get stuck in MATLAB after calling system.

Let me know if this was helpful or if you have ideas on how to make this better. Or better yet, give a pull request.

White Matter Segmentation in R

Goals and Overall Approach

We will use multiple packages and pieces of software for white matter (and gray matter/cerebrospinal fluid (CSF)) segmentation.

The overall approach will be, with the required packages in parentheses:

  1. N4 Inhomogeneity Bias-Field Correction (extrantsr and ANTsR)
  2. Brain extraction using BET and additional tools (extrantsr and fslr)
  3. FAST for tissue-class segmentation. (fslr)

Installing Packages

Below is a script to install the current development versions of all packages. The current fslr package depends on oro.nifti (>= 0.5.0), which is located at muschellij2/oro.nifti or bjw34032/oro.nifti.

Note, the ITKR and ANTsR packages can take a long time to compile. The extrantsr package builds on ANTsR and makes some convenience wrapper functions.

devtools::install_github("muschellij2/oro.nifti")
devtools::install_github("muschellij2/fslr")
devtools::install_github("stnava/cmaker")
devtools::install_github("stnava/ITKR")
devtools::install_github("stnava/ANTsR")
devtools::install_github("muschellij2/extrantsr")
install.packages("scales")

Load in the packages

Here we will load in the required packages. The scales package is imported just for the alpha function, used below in plotting.

rm(list=ls())
library(fslr)
library(extrantsr)
library(scales)

Specifying FSL path

For fslr to work, FSL must be installed. If R is run from the Terminal, the FSLDIR environment variable should be found automatically; you can check it with Sys.getenv("FSLDIR").

If run in an IDE (such as RStudio or the R GUI), R must know the path of FSL, as set by the following code:

options(fsl.path="/usr/local/fsl/")

Image Filenames

Here we will set the image name. The nii.stub function will strip off the .nii.gz from img.name.

img.name = "SUBJ0001-01-MPRAGE.nii.gz"
img.stub = nii.stub(img.name)

N4 Bias Field Correction

The first step in most MRI analysis is performing inhomogeneity correction. The extrantsr function bias_correct can perform N3 or N4 bias correction from the ANTsR package.

n4img = bias_correct( img.name, correction = "N4", 
                      outfile = paste0(img.stub, "_N4.nii.gz") )
ortho2(n4img)

plot of chunk biascorrection_plot

Let us note that the image is of the head and a bit of the neck. We wish to perform white matter segmentation only on the brain tissues, so we will do brain extraction.

Brain Extraction

The extrantsr function fslbet_robust performs brain extraction. It relies on the fslr function fslbet, which calls bet from FSL. It also performs neck removal (remove.neck = TRUE) and will run BET once, estimate a new center of gravity (COG), and then re-run BET. Some of these options are implemented in fslbet itself, but they have been re-implemented in fslbet_robust in a slightly different way. fslbet_robust will also perform N4 inhomogeneity correction, but as this has already been performed above, we will set correct = FALSE.

For neck removal, a template brain and mask must be specified. We will use the T1, 1mm resolution, MNI brain included with FSL's installation.

bet = fslbet_robust(img = n4img, 
                    retimg = TRUE,
                    remove.neck = TRUE,
                    robust.mask = FALSE,
                    template.file = file.path( fsldir(), 
                                               "data/standard", 
                                               "MNI152_T1_1mm_brain.nii.gz"),
                    template.mask = file.path( fsldir(), 
                                               "data/standard", 
                                               "MNI152_T1_1mm_brain_mask.nii.gz"), 
                    outfile = "SUBJ0001-01-MPRAGE_N4_BET", 
                    correct = FALSE)

The results look good: only the brain tissue (in red) is kept. Not much brain tissue is discarded, nor is much non-brain tissue included.

ortho2(n4img, bet > 0, 
       col.y=alpha("red", 0.5))

plot of chunk bet_plot

FAST Image Segmentation

Now that we have a brain image, we can use FAST for image segmentation. We will use the fslr function fast, which calls fast from FSL. We will pass the -N option so that FAST does not perform its own inhomogeneity correction (which differs from N4 and N3), because we have already performed this step.

fast = fast(file = bet, 
            outfile = paste0(img.stub, "_BET_FAST"), 
            opts = '-N')

White Matter Results

By default, FAST assumes 3 tissue classes, generally white matter, gray matter, and CSF. These are generally ordered by the mean intensity of the class. For T1-weighted images, white matter has the highest intensity and is assigned class 3. Let's see the results:

ortho2(bet, fast == 3, 
       col.y=alpha("red", 0.5))

plot of chunk fast_plot
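
As a quick follow-up (a sketch of my own, not part of the original analysis), the segmentation can be turned into approximate tissue volumes by counting voxels in each class and multiplying by the voxel volume; elements 2 through 4 of the NIfTI pixdim hold the voxel dimensions in millimeters:

# approximate volume (in mL) of each tissue class: 1 = CSF, 2 = GM, 3 = WM
vox_vol_mm3 = prod(pixdim(fast)[2:4])
counts = table(fast[fast > 0])
counts * vox_vol_mm3 / 1000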

Gray Matter / CSF Results

We can also visualize the classes for 1 and 2 for CSF and gray matter, respectively.

ortho2(bet, fast == 1, col.y=alpha("red", 0.5), text="CSF Results")

plot of chunk fast_plot_csf_gm

ortho2(bet, fast == 2, col.y=alpha("red", 0.5), text="Gray Matter\nResults")

plot of chunk fast_plot_csf_gm

The results indicate good segmentation of the T1 image. The fslr function fast results in more output than just the tissue-class segmentation; see the other files output:

list.files(pattern=paste0(img.stub, "_BET_FAST"))
[1] "SUBJ0001-01-MPRAGE_BET_FAST_mixeltype.nii.gz"
[2] "SUBJ0001-01-MPRAGE_BET_FAST_pve_0.nii.gz"    
[3] "SUBJ0001-01-MPRAGE_BET_FAST_pve_1.nii.gz"    
[4] "SUBJ0001-01-MPRAGE_BET_FAST_pve_2.nii.gz"    
[5] "SUBJ0001-01-MPRAGE_BET_FAST_pveseg.nii.gz"   
[6] "SUBJ0001-01-MPRAGE_BET_FAST_seg.nii.gz"      

Conclusions

It's an exciting time to be working in neuroimaging in R. The fslr and ANTsR packages provide functionality to perform operations for neuroimaging processing. I will be doing a series on some of the options for analysis in the coming weeks. The code for this analysis (and the data) is located at https://github.com/muschellij2/HopStat/blob/gh-pages/White_Matter_Segmentation_in_R/

Notes

The fslr function ortho2 is a rewrite of the oro.nifti::orthographic function, but with different defaults, and it sets values of 0 in the second image (the y argument) to NA.

The Unofficial ENAR 2015 Itinerary Maker

It’s almost ENAR 2015! The final program is out with all the sessions. The last conference I went to, the International Stroke Conference, had a program planner hosted by abstracts online.

Although there are parts of this system I would like to change, I believe it is helpful for looking up sessions, presenters, and especially posters. Therefore, I introduce the Unofficial ENAR 2015 Itinerary Maker.

Functions and How to Use

Here is an example screen shot of the shiny app:

The app's functions are as follows:

  • Type of Session – you can choose from different session types, whether you want to limit to posters or short courses
  • Select the day to subset data
  • Select a specific session
  • Search – this text field uses grep (after lower-casing the field) to search the title and author fields for relevant text.
  • Download – this will download a CSV file of the subsetted table. Note that this will not be exactly the displayed table, but a Google Calendar-friendly format. This will be discussed in the next section.
  • Donate button: I put a good deal of work into this app and I believe it improves the conference. If you agree and would like to donate some money and/or a beer at ENAR 2015, I’d appreciate it.

Export to Google Calendar

Individual Talks

Each individual session can be added to a Google Calendar using the Add to my Google calendar button next to each session. For standard posters, this will add the poster as the entire poster session. For specific talks, it will not add the complete session, but simply that talk.

Exporting CSV and Uploading to Google Calendar

Create a new Google calendar; let’s call it “ENAR 2015”. Download a CSV of the sessions you would like to attend, or download the entire table and filter it in R/Excel. In Google Calendar, go to Other Calendars, click the down arrow, select ‘Import Calendar’, upload the CSV, and select your new ENAR 2015 calendar; the records should then be imported. If this is unclear, I made a 3-minute YouTube video of this step-by-step process.

Feedback

If any information is incorrect, please let me know, either at @StrictlyStat or muschellij2@gmail.com. I spent a good deal of time cleaning the text from the PDF, so I believe it should be mostly correct, but obviously any last-minute changes I did not capture.

My Sessions

Please stop by my poster (Poster Number 2b, PDF of poster), and if you’re interested in neuroimaging processing and using R, please sign up for the “T4: A Tutorial for Multisequence Clinical Structural Brain MRI” tutorial that we are running.

Code for app

The app is hosted on my GitHub along with the data used to run the app.

If the app crashes

A backup (or mirror) shiny app is located at http://162.129.13.127/ENAR_2015/.

Using Tables for Statistics on Large Vectors

This is the first post I’ve written in a while. I have been somewhat radio silent on social media, but I’m jumping back in.

Now, I work with brain images, which can have millions of elements (referred to as voxels). Many of these elements are zero (background). We usually want to calculate basic statistics on the data, and I wanted to describe how you can speed up operations or reduce memory requirements when calculating many statistics on a large vector of integer values by using summary tables.

Why to use Tables

Tables are relatively computationally expensive to calculate. They must operate over the entire vector, find the unique values, and bin the data into these values. Let n be the length of the vector. For integer vectors (i.e., whole numbers), the number of unique values is much less than n. Since a table only stores each unique value and its frequency, the table is stored much more efficiently than the entire vector.

Tables are sufficient statistics

You can think of the frequencies and bins as summary statistics for the entire distribution of the data. I will not discuss a formal proof here, but you can easily re-create the entire vector using the table (see epitools::expand.table for a function to do this), and thus the table is a sufficient (but likely not minimal) statistic.
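
For intuition, here is a small base-R illustration (my own, not from epitools) showing that the sorted data can be recreated from a table by repeating each unique value by its frequency:

small_x = c(5, 1, 1, 3, 3, 3)
small_tab = table(small_x)
# repeat each unique value by its frequency to recover the sorted data
rep(as.numeric(names(small_tab)), times = as.integer(small_tab))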

As a sufficient statistic, we can compute any statistic we’d like from it relatively easily. Now, R has very efficient functions for many statistics, such as the median and quantiles, so it may not be obvious why we’d want to rewrite some of these functions using tables.

I can think of 2 reasons: 1) you want to calculate many statistics on the data and don’t want to pass the vector in multiple times, and 2) you want to pre-process and summarize the data into tables so that only the tables, rather than the entire vector, need to be kept in memory.

Here are some examples of when this question has been asked on StackOverflow (1, 2) and the R list-serv (1). What we’re going to do is show some basic operations on tables to get summary statistics and show that they agree.

R Implementation

Let’s make a large vector:

set.seed(20150301)
vec = sample(-10:100, size= 1e7, replace = TRUE)

Quantile function for tables

I implemented a quantile function for tables (of only type 1). The code takes in a table, creates the cumulative sum, extracts the unique values of the table, then computes and returns the quantiles.

quantile.table = function(tab, probs = c(0, 0.25, 0.5, 0.75, 1)){
  n = sum(tab)
  #### get CDF
  cs = cumsum(tab)
  ### get values (x)
  uvals = unique(as.numeric(names(tab)))

  #  can add different types of quantile, but using default (type 1)
  m = 0
  qs = sapply(probs, function(prob){
    np = n * prob
    j = floor(np) + m
    g = np + m - j
    # type == 1: take the j-th order statistic if g == 0, otherwise the (j+1)-th
    gamma = as.numeric(g != 0)
    quant = uvals[min(which(cs >= j + gamma))]
    return(quant)
  })
  dig <- max(2L, getOption("digits"))
  names(qs) <- paste0(if (length(probs) < 100) 
    formatC(100 * probs, format = "fg", width = 1, digits = dig)
    else format(100 * probs, trim = TRUE, digits = dig), 
    "%")
  return(qs) 
}
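
As a quick sanity check (my own small example, separate from the benchmarks below), the table-based quantiles should agree with quantile(..., type = 1) on a small integer vector:

small = c(1, 1, 2, 3, 3, 3, 10)
quantile.table(table(small))
quantile(small, type = 1)  # should match the line above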

Quantile Benchmarks

Let’s benchmark the quantile functions: 1) creating the table and then getting the quantiles, 2) creating an empirical CDF function and then getting the quantiles, and 3) computing the quantiles on the original data.

library(microbenchmark)
options(microbenchmark.unit='relative')
qtab = function(vec){
  tab = table(vec)
  quantile.table(tab)
}
qcdf = function(vec){
  cdf = ecdf(vec)
  quantile(cdf, type=1)
}
# quantile(vec, type = 1)
microbenchmark(qtab(vec), qcdf(vec), quantile(vec, type = 1), times = 10L)
Unit: relative
                     expr       min        lq     mean    median       uq      max neval cld
                qtab(vec) 12.495569 12.052644 9.109178 11.589662 7.499691 5.481202    10   c
                qcdf(vec)  5.407606  5.802752 4.375459  5.553492 3.708795 2.653728    10  b 
 quantile(vec, type = 1)  1.000000  1.000000 1.000000  1.000000 1.000000 1.000000    10 a  

More realistic benchmarks

Not surprisingly, simply running quantile on the vector beats the other 2 methods by far. So if you only need quantiles once, there is no computational benefit to using a table. But if tables or CDFs have already been created in a previous processing step, we should compare starting from those objects:

options(microbenchmark.unit="relative")
tab = table(vec)
cdf = ecdf(vec)
all.equal(quantile.table(tab), quantile(cdf, type=1))
[1] TRUE
all.equal(quantile.table(tab), quantile(vec, type=1))
[1] TRUE
microbenchmark(quantile.table(tab), quantile(cdf, type=1), quantile(vec, type = 1), times = 10L)
Unit: relative
                    expr      min       lq     mean   median       uq      max neval cld
     quantile.table(tab)    1.000    1.000   1.0000    1.000   1.0000   1.0000    10 a  
 quantile(cdf, type = 1)  774.885 1016.172 596.3217 1144.063 868.8105 198.1590    10  b 
 quantile(vec, type = 1) 1029.696 1122.550 653.2146 1199.143 910.3743 206.5936    10  b 

As we can see, if you have already computed the table, then you get the same quantiles as performing the operation on the vector, with much faster results. Using quantile on an ecdf object is not much better, mainly because the quantile.ecdf method reconstructs the full data vector and then calculates the quantiles:

stats:::quantile.ecdf
function (x, ...) 
quantile(evalq(rep.int(x, diff(c(0, round(nobs * y)))), environment(x)), 
    ...)
<bytecode: 0x107493e28>
<environment: namespace:stats>

Median for tables

Given the quantile.table function above, a median function for tables is trivial: just set probs = 0.5:

median.table = function(tab){
  quantile.table(tab, probs = 0.5)
}

Mean of a table

Other functions can be used to calculate statistics on the table, such as the mean:

mean.table = function(tab){
  uvals = unique(as.numeric(names(tab)))
  sum(uvals * tab)/sum(tab)
}
mean.table(tab)
[1] 44.98991
mean(tab)
[1] 44.98991
mean(cdf)
Warning in mean.default(cdf): argument is not numeric or logical:
returning NA
[1] NA

As we see, because tab has class table and we have defined mean.table, the generic mean dispatches to our method, so we can simply call mean on the table.

mean(vec)
[1] 44.98991
all.equal(mean(tab), mean(vec))
[1] TRUE

Subsetting tables

One problem with using mean vs. mean.table arises when you subset the table or perform an operation that causes it to lose its table class. For example, let’s say I want to estimate the mean of the data for values > 0:

mean(vec[vec > 0])
[1] 50.50371
over0 = tab[as.numeric(names(tab)) > 0]
mean(over0)
[1] 90065.98
mean.table(over0)
[1] 50.50371
class(over0)
[1] "array"

We see that after subsetting, over0 is an array and not a table, so mean falls back to the default method, treating the frequencies as data, and the estimated mean is not correct. mean.table calculates the correct value, as it does not depend on the class of the object. Another way to circumvent this is to reassign the class of table to over0:

class(over0) = "table"
mean(over0)
[1] 50.50371

This process requires the user to know what the class is of the object passed to mean, and may not be correct if the user changes the class of the object.

Aside on NA values

Let’s see what happens when there are NAs in the vector. We’ll put in 20 NA values:

navec = vec
navec[sample(length(navec), 20)] = NA
natab = table(navec, useNA="ifany")
nacdf = ecdf(navec)
mean(navec)
[1] NA
mean(natab)
[1] NA
# mean(nacdf)

We see that if we table the data with NA as a category, then any operation that returns NA when NAs are present will return NA. For example, if we table the data with useNA="always", then the mean will be NA even though no NAs are present in the original vector. Also, ecdf objects do not keep track of NA values after they are computed.

tab2 = table(vec, useNA="always")
mean(tab2)
[1] NA
nonatab = table(navec, useNA="no")
mean(nonatab)
[1] 44.98993
mean(navec, na.rm=TRUE)
[1] 44.98993

If you are using tables for statistics, the equivalent of na.rm=FALSE is table(..., useNA="ifany") and the equivalent of na.rm=TRUE is table(..., useNA="no"). We also see that ecdf objects never show NAs. So although we said tables are sufficient statistics, that may not be entirely correct, depending on how you make the table when the data have missing values.

Mean benchmark

Let’s benchmark the mean function, assuming we have pre-computed the table:

options(microbenchmark.unit="relative")
microbenchmark(mean(tab), mean(vec), times = 10L)
Unit: relative
      expr      min       lq     mean   median       uq      max neval cld
 mean(tab)   1.0000   1.0000   1.0000   1.0000   1.0000  1.00000    10  a 
 mean(vec) 374.0648 132.3851 111.2533 104.7355 112.7517 75.21185    10   b

Again, if we have the table pre-computed, then estimating means is much faster using the table.

Getting standard deviation

The mean example may be misleading when we try sd on the table:

sd(vec)
[1] 32.04476
sd(tab)
[1] 302.4951

These are not even remotely close. This is because sd operates on the table as if it were a vector of frequencies rather than a frequency table.

Note, we cannot calculate sd from the ecdf object:

sd(cdf)
Error in as.double(x): cannot coerce type 'closure' to vector of type 'double'

SD and Variance for frequency table

We will create a function to run sd on a table:

var.table = function(tab){
  m = mean(tab)
  uvals = unique(as.numeric(names(tab)))
  n = sum(tab)
  sq = (uvals - m)^2
  ## sum of squared terms
  var = sum(sq * tab) / (n-1)
  return(var)
}
sd.table = function(tab){
  sqrt(var.table(tab))
}
sd.table(tab)
[1] 32.04476

We compute the mean, get the squared differences from it, take the frequency-weighted sum (sum(sq * tab)), and divide by n - 1 to get the variance; the SD is the square root of the variance.

Benchmarking SD

Let’s similarly benchmark the data for sd:

options(microbenchmark.unit="relative")
microbenchmark(sd.table(tab), sd(vec), times = 10L)
Unit: relative
          expr      min       lq    mean   median       uq      max neval cld
 sd.table(tab)   1.0000   1.0000   1.000    1.000   1.0000   1.0000    10  a 
       sd(vec) 851.8676 952.7785 847.225 1142.225 732.3427 736.2757    10   b

Mode of distribution

Another statistic we may want for tabular data is the mode. We can simply find the maximum frequency in the table. The multiple option returns multiple values if there is a tie for the maximum frequency.

mode.table = function(tab, multiple = TRUE){
  uvals = unique(as.numeric(names(tab)))
  ind = which.max(tab)
  if (multiple){
    ind = which(tab == max(tab))
  }
  uvals[ind]
}
mode.table(tab)
[1] 36

Memory of each object

We wish to simply show the memory profile for using a table versus the entire vector:

format(object.size(vec), "Kb")
[1] "39062.5 Kb"
format(object.size(tab), "Kb")
[1] "7.3 Kb"
round(as.numeric(object.size(vec) / object.size(tab)))
[1] 5348

We see that the table is much smaller than the vector. Therefore, computing and storing summary tables for integer data can be much more efficient.

Conclusion

Tables are computationally expensive to create. If tables are pre-computed for integer data, however, then statistics can be calculated quickly and accurately, even if NAs are present. These tables are also much smaller in memory, so they can be stored with less space. This may be an important consideration for computing and storing large vectors in the future.

The average Stripe employee! Congrats to Alyssa!

Recently, my colleague and fellow blogger Alyssa Frazee accepted a job at Stripe. All of us at JHU Biostat are happy for her, yet sad to see her go.

While perusing Stripe’s website, I found the About page, where each employee has a photo of themselves. I’ve been playing around with some PCA and decompositions, so I figured I’d play around with these photos and make some principal components/eigenfaces. (I think it’s funny when people use the SVD/Eigenvalue decomposition in a new field and name the new thing the eigen-whatever.)

Extracting the HTML

In order to get the images, I had to scrape the about page for links.

Let’s note that stripe uses https and not http for their website (not surprisingly as they do secure payment systems).

library(RCurl)
library('httr')
library('XML')

url.stub = 'https://stripe.com/'

As they use https, you cannot simply read the data into R using readLines or similar functions. For this, I used curl via the RCurl and httr packages. I pointed to a certificate bundle, got the page, extracted the content as a character vector (imaginatively named x), then parsed the HTML using the XML package.

cafile <- system.file('CurlSSL', 'cacert.pem', package = 'RCurl')
# Read page
page <- GET(
  url.stub, 
  path='about', 
  config(cainfo = cafile)
)

x <- content(page, as='text')

#########################
# Parse HTML
#########################
doc <- htmlTreeParse(x, asText=TRUE, useInternal=TRUE)

Extracting Image Links

Now that I have parsed the HTML document, I can use XPath. If you look at the source of the HTML, there is a div with the id of about, which contains all the links. The xpathSApply function takes the document and the XPath query, which says: go to that div, grab all img tags, and then get the src attribute.

#########################
# Get face URLs
#########################
stub = "//div[@id = 'about']"
urls = xpathSApply(doc, 
  path=paste0(stub, '//img'), 
    xmlGetAttr, 'src')

I then created an output directory imgdir where I'll store the images (as PNGs). Below is some checking to see which images I have already downloaded (in case I had to re-run the code), so that only images I don’t already have are downloaded.

img.urls = paste0(url.stub, urls)
out.imgs = file.path(imgdir, basename(img.urls))

stopifnot(!any(duplicated(img.urls)))
have = file.exists(out.imgs)
img.urls = img.urls[!have]
out.imgs = out.imgs[!have]
###########
# Download images
##########
for (iimg in seq_along(img.urls)){
    download.file(url=img.urls[iimg], destfile = out.imgs[iimg], 
        method='curl')
}

Again, since Stripe uses https, we cannot just use download.file with the default method. I again used curl to get the images. I (manually) downloaded and cropped the image from Alyssa’s biostat page to add her to the Stripe set.

Analyze the Images

I now take all the images and read them in using readPNG. For color images, readPNG returns an array whose third dimension indexes the red, green, and blue channels; grayscale images come back as 2D matrices, but none in this set are grayscale. A 4th channel holds the alpha (opacity) level if there is one, but this information is discarded by the readPNG(img.f)[, , 1:3] statement.

library(png)
library(pixmap)
library(matrixStats)
imgs = list.files(imgdir, pattern='.png$', full.names = TRUE)
n_imgs = length(imgs)

img.list = vector(mode= 'list', length = n_imgs)
iimg = 2
for ( iimg in seq(n_imgs)){
  img.f = imgs[iimg]
    img.list[[iimg]] = readPNG(img.f)[, , 1:3]
}

Same Image Size

To make things easier, I only kept images that were 200 pixels by 200 pixels, so each image was the same size. If you had images of different sizes, you may want to do interpolation to get the same size and resolution.

dims = lapply(img.list, dim)

################################
# Don't feel like interpolating - only keeping 200x200x3
################################
dimg = c(200, 200, 3)

keep = sapply(dims, function(x) all(x == dimg))
img.list = img.list[keep]
imgs = imgs[keep]
dims = dims[keep]

We then make a matrix of N × 120000 (N = 167 here; 120000 = 200 × 200 × 3), where each row holds the concatenated red, green, and blue values for one image.

################################
# Making Matrix: P x N
################################
mat = t(sapply(img.list, c))
cmeans = colMeans(mat)
sds = colSds(mat)

Mean Image

A small function makeimg takes in a vector/matrix, creates an array of dimension 200 × 200 × 3, and plots the image using pixmapRGB from the pixmap package. Here we plot the “Average Striper”.

makeimg = function(x, trunc = FALSE, ...){
  x = array(x, dim = dimg)
  if (trunc){
    x[x < 0] = 0
    x[x > 1] = 1
    }
    plot(pixmapRGB(x), ...)
}
makeimg(cmeans, main = 'Average Striper')

plot of chunk unnamed-chunk-9

PCA

Although this shows what the Stripe pictures have in common, let’s do a quick PCA (or, equivalently, an SVD) after centering and scaling our data to see what differs:

# #############
# # Centering and scaling matrix
# #############
X = t(t(mat) - cmeans)
X = t(t(X) / sds)
pca = prcomp(X, center=FALSE)

We can get the percent variance explained from the standardized eigenvalues (the squared standard deviations divided by their sum), or just use screeplot:

devs = pca$sdev^2 / sum(pca$sdev^2)
plot(1-devs, main='Percent Variance Explained', type='l')

plot of chunk unnamed-chunk-11

screeplot(pca)

plot of chunk unnamed-chunk-11

Plot the PCs

Although we would need about 3 components to recover a large percentage of the variance in the data, for illustration we plot the mean image and the first 9 principal components (PCs).

V <- pca$rotation #sample PCs from PCA
################################
# Plotting Mean Image and PCs
################################

par(mfrow=c(2, 5))
par(oma = rep(2, 4), mar=c(0, 0, 3, 0))
makeimg(cmeans, main = 'Average Striper')

for (i in 1:9){
    makeimg(V[,i],main = paste0('PC ', i))  #PCs from sample data
}

plot of chunk unnamed-chunk-12
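
As a follow-up (a sketch of my own, not part of the original analysis), you can also reconstruct an approximate face from the top k PCs by projecting onto them and then undoing the scaling and centering:

k = 9
scores = X %*% V[, 1:k]            # PC scores (equivalently, pca$x[, 1:k])
recon = scores %*% t(V[, 1:k])     # back to the centered/scaled pixel space
recon = t(t(recon) * sds + cmeans) # undo the scaling and centering
makeimg(recon[1, ], trunc = TRUE, main = paste0("Reconstruction with ", k, " PCs"))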

Conclusion

This post was more about congratulating Alyssa with some analysis, but I still want to discuss the results.

We can see some pattern in the data from the PCs, but you need many PCs to explain a large percentage of the variance in the data. That is not surprising; the pictures are not standardized in how they were taken (e.g. front-facing or not, different backgrounds), and I’m using the color information rather than black and white.

We would likely also have more interpretable results if we registered images. In neuroimaging, we register brains to each other and average them to make a template image. We could do that in this analysis and should do so if this was a real project and not a post.

Moreover, we are doing a PCA on non-negative values bounded between 0 and 1. I think this is one of the most interesting aspects of the data. In many analyses using PCA we actually have strictly positive values. For example, people’s food choices are one area where non-negative matrix factorization is used; you can’t eat negative calories…if only. I think this is something to look into for people who are doing PCA on strictly positive values. Although you de-mean and scale the data, which makes some values negative, reconstructing data from these components and their scores can give non-interpretable values, such as those outside [0, 1]. I’m looking into the nsprcomp package for non-negative PCA for future research work.

Adding Cost Functions to ROCR performance objects

In my last post, I gave an introduction to the ROCR package and how to use it for ROC analysis.

In the ROCR reference manual, it states “new performance measures can be added using a standard interface”, but I have not found that to be so. I may have missed some crucial step, but others have tried to adapt new performance measures. One example I came across had “patched” the performance code to use a new performance measure wss (Work Saved over Sampling). I liked some parts of what they did, but wanted to add my own measure and allow for a user to pass a new measure into a function without having to re-copy all the code.

Dice

I wanted to add an overlap measure known as the Dice coefficient, aka Dice Similarity Index (DSI), or Sorensen-Dice Coefficient. Let's define TP to be the number of true positives, TN to be true negatives, FP to be false positives, and FN to be false negatives, and RN/RP to be row negatives/positives and CN/CP be column negatives/positives. Then our 2-by-2 table of predictions vs. true values is:

          0    1    Total
  0       TN   FP   RN
  1       FN   TP   RP
  Total   CN   CP   N

Let's say rows are predictions and columns are true values, but these are interchangeable.

The Dice coefficient is defined in terms of these values as:
\frac{2 \times TP}{2\times TP + FP + FN} = \frac{2 \times TP}{RP + CP}

In every prediction object, there are slots for tp, tn, fp, and fn (see ?"prediction-class" for more). Therefore, I can simply take these slots to make my Dice coefficient. Here's how I did it:

dice <- function(prediction.obj){
    if (class(prediction.obj) != "prediction") {
        stop(paste("Wrong argument types: First argument must be of type", 
            "'prediction'"))    
    }
    argnames <- c()
    x.values <- list()
    y.values <- list()
    for (i in 1:length(prediction.obj@predictions)) {
      fp = prediction.obj@fp[[i]] 
        tp = prediction.obj@tp[[i]]
        fn = prediction.obj@fn[[i]]
        tn = prediction.obj@tn[[i]]
        cutoffs = prediction.obj@cutoffs[[i]]
        meas_dice = 2 * tp / (2*tp + fp + fn)
        x.values <- c(x.values, list(cutoffs))
        y.values <- c(y.values, list(meas_dice))
    }
    if (!(length(x.values) == 0 || length(x.values) == length(y.values))) {
        stop("Consistency error.")
    }
    return(new("performance", x.name = "cutoff", 
        y.name = "dice", 
        alpha.name = "none", 
        x.values = x.values, y.values = y.values, 
        alpha.values = list())
    )
}

Essentially, I copied the performance function from ROCR, made some adjustments, and put my calculation into the object meas_dice. That's great! Now I have this handy function to use whenever I want.

A more general solution

Although this solved my current problem, I thought more about how to add more cost functions in a more general way.

Here is my solution:

# copied from original function
myperformance <- function (prediction.obj, measure, 
    x.measure = "cutoff", ...)
{
    envir.list <- my.define.environments(...)
    long.unit.names <- envir.list$long.unit.names
    function.names <- envir.list$function.names
    obligatory.x.axis <- envir.list$obligatory.x.axis
    optional.arguments <- envir.list$optional.arguments
    default.values <- envir.list$default.values
    if (class(prediction.obj) != "prediction"){
              stop(paste("Wrong argument types: First argument must be of type",
            "'prediction'"))
    }
    if (!exists(measure,  where = long.unit.names, inherits = FALSE)){
      stop(paste("Measure", measure, "not found"))
    }
    if (!exists(x.measure,  where = long.unit.names, inherits = FALSE)){
      stop(paste("Measure", measure, "not found"))
    }    
    if (exists(x.measure, where = obligatory.x.axis, inherits = FALSE)) {
        message <- paste("The performance measure", x.measure,
            "can only be used as 'measure', because it has",
            "the following obligatory 'x.measure':\n", get(x.measure,
                envir = obligatory.x.axis))
        stop(message)
    }
    if (exists(measure, where = obligatory.x.axis, inherits = FALSE)) {
        x.measure <- get(measure, envir = obligatory.x.axis)
    }
    if (x.measure == "cutoff" || exists(measure, where = obligatory.x.axis,
        inherits = FALSE)) {
        optional.args <- list(...)
        argnames <- c()
        if (exists(measure, where = optional.arguments, inherits = FALSE)) {
            argnames <- get(measure, envir = optional.arguments)
            default.arglist <- list()
            for (i in 1:length(argnames)) {
                default.arglist <- c(default.arglist, get(paste(measure,
                  ":", argnames[i], sep = ""), envir = default.values,
                  inherits = FALSE))
            }
            names(default.arglist) <- argnames
            for (i in 1:length(argnames)) {
                templist <- list(optional.args, default.arglist[[i]])
                names(templist) <- c("arglist", argnames[i])
                optional.args <- do.call(".farg", templist)
            }
        }
        optional.args <- .select.args(optional.args, argnames)
        function.name <- get(measure, envir = function.names)
        x.values <- list()
        y.values <- list()
        for (i in 1:length(prediction.obj@predictions)) {
            argumentlist <- .sarg(optional.args, predictions = prediction.obj@predictions[[i]],
                labels = prediction.obj@labels[[i]], cutoffs = prediction.obj@cutoffs[[i]],
                fp = prediction.obj@fp[[i]], tp = prediction.obj@tp[[i]],
                fn = prediction.obj@fn[[i]], tn = prediction.obj@tn[[i]],
                n.pos = prediction.obj@n.pos[[i]], n.neg = prediction.obj@n.neg[[i]],
                n.pos.pred = prediction.obj@n.pos.pred[[i]],
                n.neg.pred = prediction.obj@n.neg.pred[[i]])
            ans <- do.call(function.name, argumentlist)
            if (!is.null(ans[[1]]))
                x.values <- c(x.values, list(ans[[1]]))
            y.values <- c(y.values, list(ans[[2]]))
        }
        if (!(length(x.values) == 0 || length(x.values) == length(y.values))) {
            stop("Consistency error.")
        }
        return(new("performance", x.name = get(x.measure, envir = long.unit.names),
            y.name = get(measure, envir = long.unit.names), alpha.name = "none",
            x.values = x.values, y.values = y.values, alpha.values = list()))
    }
    else {
        perf.obj.1 <- myperformance(prediction.obj, measure = x.measure,
            ...)
        perf.obj.2 <- myperformance(prediction.obj, measure = measure,
            ...)
        return(.combine.performance.objects(perf.obj.1, perf.obj.2))
    }
}

What is all this code?

First off, myperformance is exactly the code from the performance function in ROCR, except the first line is:

envir.list <- my.define.environments(...)

instead of this line from ROCR::performance

envir.list <- .define.environments()

Note that my.define.environments takes arguments, whereas .define.environments does not. This is a crucial difference; this is where you put your measure's code.

New Environments

If you look at the code for .define.environments:

library(ROCR)
head(.define.environments)
1 function ()                                            
2 {                                                      
3     long.unit.names <- new.env()                       
4     assign("none", "None", envir = long.unit.names)    
5     assign("cutoff", "Cutoff", envir = long.unit.names)
6     assign("acc", "Accuracy", envir = long.unit.names) 

we see the code new.env() being executed. At the beginning of the function, it defines the long.unit.names environment as well as other environments. So every time ROCR::performance is called, it creates a new environment with the names of the measures and functions ROCR uses. This is important: we cannot assign new names and objects to an existing, fixed environment or namespace as we could in other scenarios. That is why I created my.define.environments:

my.define.environments <- function(funnames = NULL, # name of measure
    longnames = NULL, # name of actual thing
    exprs = NULL, # list of 2 character vectors to be expressed
    optargs, # list
    default.vals,
    xaxis
    )
{
  # get original environments
  envir.list <- ROCR::.define.environments()
  long.unit.names = envir.list$long.unit.names
  function.names = envir.list$function.names   
  obligatory.x.axis = envir.list$obligatory.x.axis
  optional.arguments = envir.list$optional.arguments         
  default.values = envir.list$default.values

    .performance.dice <- function (predictions, labels, cutoffs, fp, 
        tp, fn, tn, n.pos, 
        n.neg, n.pos.pred, n.neg.pred) {
        list(cutoffs, 2 * tp / (2*tp + fp + fn))

    }  

    assign("dice", .performance.dice, 
        envir=function.names)

    assign("dice", "Sorensen-Dice coefficient", 
        envir=long.unit.names)    

    #######################################
    # Allow for general adding
    #######################################
    if (!is.null(funnames)){
        stopifnot(
            length(funnames) == length(longnames) &&
            length(funnames) == length(exprs)
            )
        if (!missing(optargs)){
          stopifnot(length(optargs) == length(funnames))
        }
        if (!missing(optargs)){
          stopifnot(length(default.vals) == length(funnames))
        } 
        if (!missing(xaxis)){
          stopifnot(length(xaxis) == length(funnames))
        }       

        for (iname in seq_along(funnames)){  
          ie1 = exprs[[iname]][[1]]
          ie2 = exprs[[iname]][[2]]
          funcname = paste0("func <- function (predictions, labels, 
                            cutoffs, fp, 
                            tp, fn, tn, n.pos, 
                            n.neg, n.pos.pred, n.neg.pred) {
            list(", ie1, ", ", ie2, ") }")
          eval(parse(text=funcname))

            assign(funnames[iname], func, 
                envir=function.names)
            assign(funnames[iname], longnames[iname],
                envir=long.unit.names)

            #############
            # optional arguments
            #############
            if (!missing(optargs)){
              oargs = optargs[[iname]]
              for (ioarg in seq_along(oargs)){
                assign(oargs[[ioarg]][[1]], oargs[[ioarg]][[2]],
                  envir=optional.arguments)
              }
            }

            #############
            # Default values
            #############
            if (!missing(default.vals)){
              oargs = default.vals[[iname]]
              for (ioarg in seq_along(oargs)){
                assign(oargs[[ioarg]][[1]], oargs[[ioarg]][[2]],
                  envir=default.values)
              }
            }

            if (!missing(default.vals)){
              oargs = default.vals[[iname]]
              for (ioarg in seq_along(oargs)){
                assign(oargs[[ioarg]][[1]], oargs[[ioarg]][[2]],
                  envir=obligatory.x.axis)
              }
            }            

        }
    } # is is.null

  list(
    long.unit.names = long.unit.names,
    function.names = function.names,
    obligatory.x.axis = obligatory.x.axis,
    optional.arguments = optional.arguments,
    default.values = default.values
  )
}

We see that my.define.environments creates new environments too! Yes, my.define.environments essentially does the same thing, but I can add my dice function inside my.define.environments, and this measure can then be used in future work in other projects using the same code. Moreover, the fact that arguments can be passed into my.define.environments allows you to create a measure on the fly.

Below is an example of how you can use custom measures based on the code above.

Example

Here I will plot the Jaccard Index, which is not implemented in the performance function. The Jaccard index formula is similar to dice and is represented as:

\frac{TP}{TP + FP + FN}

We can implement this cost function by supplying our measure, which must match our function name in funnames, the human-readable name in longnames and a list of 2-element character vectors in exprs. For scalar measures, the first element is "cutoffs" and the second element is the expression (to be evaluated by R) of the measure to be used.

data(ROCR.simple)
pred <- prediction(ROCR.simple$predictions,ROCR.simple$labels)
perf.jaccard = myperformance(pred, 
                             measure = "jaccard", 
              funnames = "jaccard", 
              longnames="Jaccard Index", 
              exprs = list(c("cutoffs", "tp / (tp + fp + fn)")))
plot(perf.jaccard)

plot of chunk unnamed-chunk-6

Voila! We now have a way to create any cost function (that can be formulated in terms of the slots of a prediction object).

Here is the example with using Dice:

perf.dice = myperformance(pred, measure = "dice")
plot(perf.dice)

plot of chunk unnamed-chunk-7

As we already added .performance.dice to my.define.environments, we can simply call it as a measure.

Passing in 2 new measures:

The length of funnames must be the same as that of longnames and exprs (and exprs must be a list), so you can pass in vectors for funnames and longnames and a list for exprs to define multiple measures. Here we pass in 2 new measures and get a performance object of them. In this case, you will likely only want to pass in a maximum of 2 measures, as a performance object will only compute 2 outputs, for the x.values and y.values slots.

perf.both = myperformance(pred, x.measure = "dice", 
                             measure = "jaccard", 
                             funnames = c("dice", "jaccard"), 
                            longnames=c("Dice Index", "Jaccard Index"), 
              exprs = list(c("cutoffs", "2 * tp / (2*tp + fp + fn)"),
                           c("cutoffs", "tp / (tp + fp + fn)")))

plot(perf.both)

plot of chunk unnamed-chunk-8

If you look closely, you'll see there is some odd plotting in the upper right tail of the function. The functions may not be monotonic when you get them out of the performance object, so you may want to sort by the x measure first for plotting purposes:

both = data.frame(cbind(x= perf.both@x.values[[1]], y = perf.both@y.values[[1]]))
both = both[ order(both$x), ]
colnames(both) = c(perf.both@x.name, perf.both@y.name)
plot(both, type="l")

plot of chunk unnamed-chunk-9

Conclusion

Overall, you can add new measures to the performance object in R using the code above. It's a shame that the package is orphaned; I like using it for many ROC functions and measure computations. Then again, I'm not volunteering to maintain it. Although the package says “new performance measures can be added using a standard interface”, I could not find a way to do so. Hopefully the code above allows you to implement a new measure if you ever choose to do so. Have fun ROC’ing around the Christmas tree! Boom! – You just got punned.