SMART Hackathon 2014

Today is our SMART group's 2014 Hackathon! We will be at an undisclosed location, just plugging away at code/packages/apps. The mission statement:

There's no structure except that you have to work on a software
project and you have to present what you accomplished or didn't at a
subsequent SMART meeting.

We will be tweeting at #JHUSMARThack for the next two days.

I personally will be working on a suite of R imaging tools for MS lesion imaging, white matter segmentation, and CT image analysis, as well as a package integrating FSL and R: fslr.

I hope to get a lot of commits in and some feedback on these packages from the group. Let me know if you want to check any of them out or if you want any functionality added!

brainR: Put your brain on the Cloud!

In my work, we have come across problems where we wanted to visualize data in 4 dimensions (4D). The data came as brains, which are represented as 3D volumes, and we wanted to visualize them at different time points (hence 4D). We were using images acquired once per visit (structural CT scans, in particular), where each patient had a maximum of 10 time points. (The methods below have not been tested or adapted for fMRI scans, which have hundreds of time points.)

As a result, we created the package brainR (CRAN package) to fit these needs. We strive to make figures such as this 4D example.

This figure can be rotated and zoomed. The slider provides a changing opacity for the overall brain (in order to see the structures inside). The buttons for each “visit” represent enhanced lesions in patients with MS detected by SuBLIME (Sweeney et al. 2013). I feel that these figures can convey results that an orthographic figure such as

[Orthographic figure]

cannot, nor can a single slice over time as these enhancing lesions are all over the brain.

The steps to make images like this in brainR are as follows (a short sketch of the full pipeline follows the list):

  1. Read the (neuro)imaging data (usually using the oro.dicom or oro.nifti packages).
  2. Have 1) a brain to plot and 2) a region of interest (ROI) in the same space as the brain.
  3. Create a scene using contour3d from the misc3d package (Feng & Tierney, 2008).
  4. Write the contents of this scene into 3D objects in STL or OBJ format using wrappers for writeSTL or writeOBJ from the rgl package (Adler et al. 2014).
  5. Use the X Toolkit (XTK) and simple JavaScript controls to visualize these structures and switch between them.
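
Here is a minimal sketch of that pipeline, loosely based on the example shipped with brainR; the ROI filename, the contour levels, and the output names are illustrative, not prescriptive:

library(oro.nifti)  # readNIfTI
library(misc3d)     # contour3d
library(brainR)     # write4D

# steps 1-2: a template brain and an ROI in the same space
# (lesion_roi.nii.gz is a placeholder filename)
template <- readNIfTI(system.file("MNI152_T1_2mm_brain.nii.gz", package = "brainR"),
    reorient = FALSE)
roi <- readNIfTI("lesion_roi.nii.gz", reorient = FALSE)

# step 3: build surfaces; draw = FALSE returns the triangles instead of plotting
brain <- contour3d(template, level = 4500, alpha = 0.1, draw = FALSE)
lesion <- contour3d(roi, level = 0.5, color = "red", draw = FALSE)

# steps 4-5: write STL files plus an HTML page that renders them with XTK
scene <- list(brain, lesion)
write4D(scene = scene, fnames = c("brain.stl", "lesion.stl"), outfile = "index.html")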

The last step is what I call 4D visualization: being able to move around space with rotation, zooming, and translation while moving through another dimension, usually different time points, ROIs, structures, etc. We went this route for a few reasons, particularly:

  1. I know R
  2. rgl and misc3d already exist
  3. JavaScript is “accessible” and has a lot of docs/support
  4. Going further into 4D visualization in software is essentially video game programming or using a language I don’t know

Number 4 was interesting: I looked into tools like Blender, Paraview, and CINEMA 4D, which are highly powerful. Many of them have a GUI-only interface, and some are geared toward visualizations far more extensive than (and not adapted so well for) biomedical data, such as full scenes and landscapes. Most solutions, such as those for MATLAB, require the user either to have MATLAB programming experience or to download software, which we found to be a large and unnecessary burden – so we use JavaScript in a browser. Granted, to run brainR you need to know R, but the end user just needs to know how to open a browser. If the end user cannot do that, I don’t think 4D visualization is for them anyway.

Now, let’s run down some of the functions:

  • write4D takes in a “scene”, which is a collection of surfaces, and writes them to an HTML file, along with the STL or OBJ files needed to render it.
  • makeScene is an adaptation of contour3d from misc3d that does mutually exclusive leveling. For example, if you have an image with values {1, 2, 3}, you can create an individual surface for 1, one for 2, and one for 3. With contour3d, a surface contains the values \(\geq\) its level, which means the surface for 2 would contain both 2 and 3 (a small sketch of this idea follows the list).
  • writeWebGL_split is an adaptation of rgl's writeWebGL, but it cuts the objects into individual objects of at most 65535 vertices (see here for a discussion of the WebGL vertex limit, and here for a discussion with Duncan Murdoch). This is a limitation of WebGL rendering, and we simply split the data using the code given by Duncan Murdoch in the second link. (writeWebGL has a note about this in its details section, but I missed it.) Why is this important? If you don't do the splitting, your surface will not render properly, if at all.
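
To make the mutually exclusive leveling concrete, here is a small sketch using contour3d directly on binary masks of a made-up label image (makeScene wraps up this kind of bookkeeping for you):

library(misc3d)

# a hypothetical 3D label image with values {0, 1, 2, 3}
img <- array(0, dim = c(30, 30, 30))
img[5:12, 5:12, 5:12] <- 1
img[15:22, 5:12, 5:12] <- 2
img[5:12, 15:22, 15:22] <- 3

# contour3d(img, level = 2) keeps voxels with values >= 2, so label 3 leaks in;
# for mutually exclusive surfaces, contour each label's binary mask at 0.5 instead
surfs <- lapply(1:3, function(lev) {
    contour3d((img == lev) * 1, level = 0.5, draw = FALSE)
})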

Overall, this is my first package and the paper is submitted to R Journal. The admins at CRAN were very helpful in getting the package ready for production. It wasn’t easy, but it’s done.

The development version of brainR is located at this GitHub repository.

Note, there has been some interesting recent work embedding 3D objects (and some 4D functionality) into PDFs, and there is an interesting overview of the methods tried located here if you’re interested. We chose the web because it rendered faster and integrated more easily with JavaScript.

Citations in this post were done using knitcitations (Boettiger 2014).

Creating Smaller PDFs in R

Making plots with many points

Whenever I make a lot of plots in R, I tend to make a multi-page PDF document. PDF is usually great: it's vectorized, which means it will scale no matter how much I zoom in or out. The problem is that for “large” plots, which have a lot of points or lines, or just generally have a lot going on, the size of the PDF can become very big. I got a new laptop this year and Preview still pinwheels (for Windows people, hour-glasses) and freezes for some PDFs that are 16 Mb in size.

Alternatively, for one-off plots, I use PNGs. What if I want multiple PNGs but in one file? I have found that for most purposes, creating a bunch of PNG files, then concatenating them into a PDF gives smaller-size PDFs that do not cause Preview or Adobe Reader to choke, while still giving good-enough quality plots.

Quick scatterplot example: using pdf

For example, let's say you wanted to make 5 different scatterplots of 1,000,000 bivariate normal points each. You can do something like:

library(rbenchmark)  # for benchmark()

nplots = 5
print(nplots)
[1] 5
x = matrix(rnorm(1e+06 * nplots), ncol = nplots)
y = matrix(rnorm(1e+06 * nplots), ncol = nplots)

tm1 <- benchmark({
    pdfname = "mypdf.pdf"

    pdf(file = pdfname)
    for (idev in seq(nplots)) {
        plot(x[, idev], y[, idev])
    }
    dev.off()
}, replications = 1)

The syntax is easy: open a PDF device with pdf(), run your plotting commands, and then close your pdf device using dev.off(). If you've ever forgotten to close your device, you know that you cannot open the PDF and you will be told it is corrupted.

The plot is relatively simple, and there are other packages, such as bigvis, that can do better visualization based on the number of pixels plotted and on overplotting, but let's say this is my plot to make.

Let's look at the size of the PDF.

fsize = file.info(pdfname)$size/(1024 * 1024)
cat(fsize, "Mb\n")
32.49 Mb

As we see, the file is about 32 Mb. I probably don't need a file that large, nor that level of granularity for zooming and vectorization.

Quick scatterplot example: using multiple png devices

I could also make a series of 5 PNG files and put them into a folder. I could then open Preview, drag and drop them into a PDF, and save them, or scroll through them in a quick preview. This is not a terrible solution, but it's not very reproducible, especially in a larger framework of creating multi-page PDFs.

One alternative is to give R a temporary filename to give to the set of PNGs, create them in a temporary directory, and then concatenate the PNGs using ImageMagick.

tm2 <- benchmark({
    pdfname2 = "mypdf2.pdf"
    tdir = tempdir()
    mypattern = "MYTEMPPNG"
    # C-style integer format: R substitutes the page number for %05d
    fname = paste0(mypattern, "%05d.png")
    gpat = paste0(mypattern, ".*\\.png")
    # remove any leftover PNGs from previous runs
    takeout = list.files(path = tdir, pattern = gpat, full.names = TRUE)
    if (length(takeout) > 0) 
        file.remove(takeout)
    pngname = file.path(tdir, fname)
    png(pngname, res = 600, height = 7, width = 7, units = "in")
    for (idev in seq(nplots)) {
        plot(x[, idev], y[, idev])
    }
    dev.off()

    # stitch the PNGs into a single PDF using ImageMagick's convert
    pngs = list.files(path = tdir, pattern = gpat, full.names = TRUE)
    mystr = paste(pngs, collapse = " ")
    system(sprintf("convert %s -quality 100 %s", mystr, pdfname2))
}, replications = 1)

One thing of note is that I visited the png help page many times, but never stopped to see:

The page number is substituted if a C integer format is included in the character string, as in the default.

which tells me that I don't need to change around the filename for each plot – R will do that automatically.
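
A quick illustration of that behavior (the file name pattern here is made up):

# "%03d" is replaced by the page number, so each page gets its own file
png("scatter%03d.png")
plot(rnorm(100), rnorm(100))  # writes scatter001.png
plot(rnorm(100), rnorm(100))  # writes scatter002.png
dev.off()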

Let's look at how big this file is:

fsize = file.info(pdfname2)$size/(1024 * 1024)
cat(fsize, "Mb\n")
4.94 Mb

We see that there is a significant reduction in size. The resolution of the PNGs is 600 ppi, which is sufficient (actually good) for most applications and for a lot of journal requirements.

So what are the downsides?

  1. You have to run more code.
  2. You can't simply swap in the pdf() and dev.off() syntax.

To combat these two downsides, I wrapped these into functions mypdf and mydev.off.

mypdf = function(pdfname, mypattern = "MYTEMPPNG", ...) {
    fname = paste0(mypattern, "%05d.png")
    gpat = paste0(mypattern, ".*\\.png")
    takeout = list.files(path = tempdir(), pattern = gpat, full.names = TRUE)
    if (length(takeout) > 0) 
        file.remove(takeout)
    pngname = file.path(tempdir(), fname)
    png(pngname, ...)
    return(list(pdfname = pdfname, mypattern = mypattern))
}
# copts are options to send to convert
mydev.off = function(pdfname, mypattern, copts = "") {
    dev.off()
    gpat = paste0(mypattern, ".*\\.png")
    pngs = list.files(path = tempdir(), pattern = gpat, full.names = TRUE)
    mystr = paste(pngs, collapse = " ")
    # convert options go before the output filename, which must come last
    system(sprintf("convert %s -quality 100 %s %s", mystr, copts, pdfname))
}

mypdf opens the device and sets up the format for the PNGs (allowing options to be passed to png). It returns the pdfname and the regular expression pattern for the PNG files. The mydev.off function takes these two arguments, plus any options to ImageMagick's convert, closes the device, and concatenates the PNGs into a multi-page PDF.

Let's see how we could implement this.

tm3 <- benchmark({
    res = mypdf("mypdf3.pdf", res = 600, height = 7, width = 7, units = "in")
    for (idev in seq(nplots)) {
        plot(x[, idev], y[, idev])
    }
    mydev.off(pdfname = res$pdfname, mypattern = res$mypattern)
}, replications = 1)

And just for good measure, show that this PDF is the same size as before:

fsize = file.info("mypdf3.pdf")$size/(1024 * 1024)
cat(fsize, "Mb\n")
4.94 Mb

Of note, the main syntactic difference between this approach and pdf is that dev.off() usually doesn't take an argument (it defaults to the current device), whereas mydev.off does.

This process is slow

Let's look at how long it takes to create the PDF for each scenario, using benchmark from the rbenchmark package.

print(tm1$elapsed)
[1] 16.74
print(tm2$elapsed)
[1] 294.7
print(tm3$elapsed)
[1] 276.3

We see that it takes longer (by a factor of around 17) to make the PDF with PNGs and then concatenate them. This is likely because 1) there may be some overhead in creating multiple PNGs versus one device and 2) there is the added step of concatenating the PNGs into a PDF.

But the files are smaller and quicker to render

######### Ratio of file sizes
ratio = file.info("mypdf.pdf")$size/file.info("mypdf2.pdf")$size
print(ratio)
[1] 6.577

Here we see the gain in file size (and rendering speed) is about a factor of 7, but that gain is traded off against the longer run time of the code. You can see the result of using pdf() and using mypdf.

Post-hoc compression

Obviously I'm not the only one who has had this problem; others have created tools to make smaller PDFs. For example, tools::compactPDF, which uses qpdf or GhostScript, compresses already-made PDFs. There are also reasons to use other formats, such as TIFF (which many journals prefer), but I'm just using PNG as my preference; JPEG, BMP, TIFF, etc. should work equally well in the approach above.
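
For example, assuming qpdf or GhostScript is installed and on your path, something like this should shrink an already-written file in place (the gs_quality setting is just one reasonable choice):

# compress mypdf.pdf after the fact; requires qpdf and/or GhostScript
tools::compactPDF("mypdf.pdf", gs_quality = "ebook")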

BONUS!

Here are some helper functions that I made to make it easier to view PDFs directly from R (calling bash). Other functions exist in packages such as openPDF from Biobase, but these are simple to implement. (Note, I use xpdf as my PDF viewer, but getOption("pdfviewer") points to a different viewer that failed on our cluster.) The first two are viewers for PDFs and PNGs, the third tries to guess the viewer from the file extension, and the fourth, open.dev, uses the filename to open the appropriate graphics device. This lets you switch the filename from .pdf to .png and run the same plotting code.

view.pdf = function(fname, viewer = getOption("pdfviewer")) {
    stopifnot(length(fname) == 1)
    if (is.null(viewer)) {
        viewer = getOption("pdfviewer")
    }
    system(sprintf("%s %s&", viewer, fname))
}

view.png = function(fname, viewer = "display") {
    stopifnot(length(fname) == 1)
    system(sprintf("%s %s&", viewer, fname))
}

view = function(fname, viewer = NULL) {
    stopifnot(length(fname) == 1)
    get.ext = gsub("(.*)\\.(.*)$", "\\2", fname)
    stopifnot(get.ext %in% c("pdf", "bmp", "svg", "png", "jpg", "jpeg", "tiff"))
    if (get.ext == "pdf") {
        if (is.null(viewer)) {
            viewer = getOption("pdfviewer")
        }
    }
    if (is.null(viewer)) {
        warning("No viewer given, trying open")
        viewer = "open"
    }
    system(sprintf("%s %s&", viewer, fname))
}

#### open a device from the filename extension
open.dev = function(file, type = "cairo", ...) {
    get.ext = gsub("(.*)\\.(.*)$", "\\2", file)
    stopifnot(get.ext %in% c("pdf", "bmp", "svg", "png", "jpg", "jpeg", "tiff"))

    ## device is jpeg
    if (get.ext == "jpg") 
        get.ext = "jpeg"
    ### different arguments for different devices
    if (get.ext %in% c("pdf")) {
        do.call(get.ext, list(file = file, ...))
    } else if (get.ext %in% c("bmp", "jpeg", "png", "tiff", "svg")) {
        do.call(get.ext, list(filename = file, type = type, ...))
    }
}
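
For example, switching output formats is just a matter of changing the extension (the file names here are placeholders):

# same plotting code, two different devices chosen from the extension
open.dev("myplot.pdf")
plot(rnorm(100), rnorm(100))
dev.off()

open.dev("myplot.png", res = 600, height = 7, width = 7, units = "in")
plot(rnorm(100), rnorm(100))
dev.off()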