ggplot2 is not ALWAYS the answer: it’s not supposed to be

Recently, the topic was propose at tea time:

Why should I switch over to ggplot2? I can do everything in base graphics.

I have heard this argument before and I understand it for the most part. Many people have learned R before ggplot2 came on the scene (as well as many other packages) and learned to do all the things they needed to. Some don’t understand the new syntax for plotting and argue the learning curve is not worth the effort. Also, many say it’s not straightforward to customize.

As the discussion progressed, we discussed 10 commonly made base plots, and I would recreate them in ggplot2. The goal was to help base users break into ggplot2 if they wanted, and also see if they were “as easy” as doing them in base. I want to discuss my results and the fact that ggplot2 is not ALWAYS the answer, nor was it supposed to be.

First Plot – Heatmap

The first example discussed was a heatmap that had 100,000 rows and 100 columns with a good default color scheme. As a comparison, I used heatmap in base R. Let’s do only 10,000 to start:

N = 1e4
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
colnames(mat) = paste0("Col", seq(ncol(mat)))
rownames(mat) = paste0("Row", seq(nrow(mat)))

   user  system elapsed 
 29.996   1.032  31.212 

For a heatmap in ggplot2, I used geom_tile and wanted to look at the results. Note, ggplot2 requires the data to be in “long” format, so I had to reshape the data.

df = melt(mat, varnames = c("row", "col"))
    g= ggplot(df, aes(x = col, y = row, fill = value)) + 

   user  system elapsed 
  9.519   0.675  10.394 

One of the problems is that heatmap does clustering by default, so the comparison is not really fair (but still using “defaults” – as was specified in the discussion). Let’s do the heatmap again without the clustering.

system.time({heatmap(mat, Rowv = NA, Colv = NA)})

   user  system elapsed 
  0.563   0.035   0.605 

Which one looks better? I’m not really sure, but I do like the red/orange coloring scheme – I can differences a bit better. Granted, there shouldn’t be many differences as the data is random. The ggplot graph does have a built-in legend of the values, which is many times necessary. Note, the rows of the heatmap are shown as columns and the columns shown as rows, a usually well-known fact for users of the image function. The ggplot graph plots as you would a scatterplot – up on x-axis and right on the y-axis are increasing. If we want to represent rows as rows and columns as columns, we can switch the x and y aesthetics, but I wanted as close to heatmap as possible.

Don’t factor the rows

Note, I named the matrix rows and columns, if I don’t do this, the plotting will be faster, albeit slightly.

N = 1e4
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
df = melt(mat, varnames = c("row", "col"))

system.time({heatmap(mat, Rowv = NA, Colv = NA)})

   user  system elapsed 
  0.642   0.017   0.675 
    g= ggplot(df, aes(x = col, y = row, fill = value)) + 

   user  system elapsed 
  9.814   1.260  11.141 

20,000 Observations

I’m going to double it and do 20,000 observations and again not do any clustering:

N = 2e4
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
colnames(mat) = paste0("Col", seq(ncol(mat)))
rownames(mat) = paste0("Row", seq(nrow(mat)))
df = melt(mat, varnames = c("row", "col"))

system.time({heatmap(mat, Rowv = NA, Colv = NA)})

   user  system elapsed 
  1.076   0.063   1.144 
    g= ggplot(df, aes(x = col, y = row, fill = value)) + 

   user  system elapsed 
 17.799   1.336  19.204 

100,000 Observations

Let’s scale up to the 100,000 observations requested.

N = 1e5
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
colnames(mat) = paste0("Col", seq(ncol(mat)))
rownames(mat) = paste0("Row", seq(nrow(mat)))
df = melt(mat, varnames = c("row", "col"))

system.time({heatmap(mat, Rowv = NA, Colv = NA)})

   user  system elapsed 
  5.999   0.348   6.413 
    g= ggplot(df, aes(x = col, y = row, fill = value)) + 

   user  system elapsed 
104.659   6.977 111.796 

We see that heatmap and geom_tile() scale with the number observations on how long it takes to plot, but heatmap being much quicker. There may be better ways to do this in ggplot2, but after looking around, it looks like geom_tile() is the main recommendation. Overall, for doing heatmaps of this size, I would use the heatmap, after transposing the matrix and use something like seq_gradient_pal from the scales package or RColorBrewer for color mapping.

For smaller dimensions, I’d definitely use geom_tile(), especially if I wanted to do something like map text to the blocks as well. The other benefit is that no transposition needs to be done, but the data does need to be reshaped explicitly.

Like I said, ggplot2 is not ALWAYS the answer; nor was it supposed to be.

ggplot2 does not make publication-ready figures

Another reason for the lengthy discussion is that many argued ggplot2 does not make publication-ready figures. I agree. Many new users need a large

ggplot2 does not make publication-ready figures

message with their first install. But neither does base graphics. Maybe they need that message for their first R install. R would boot up, and you can’t start until you answer “Does R (or any statistical software) give publication-ready figures by default”. Maybe, if you answer yes, R self-destructs.

Overall, a publication-ready figure takes time, customization, consideration about point size, color, line types, and other aspects of the plot, and usually stitching together multiple plots. ggplot2 has a lot of default features that have given considerate thought to color and these other factors, but one size does not fit all. The default color scheme may not be consistent with your other plots or the data. If it doesn’t work, change it.

And in this I agree with this statement: many people make a plot in ggplot2 and it looks good and they do not put the work in to make it publication-ready. How many times have you seen the muted green and red/pink color that is the default first 2 colors in a ggplot2 plot?

Why all the ggplot2 hate?

I’m not hating on ggplot2; I use it often and value it. My next post will be about plots I commonly use in ggplot2 and why I chose to use it. I think it makes graphs that by default look better than many base alternatives. It does things many things well and easier than base. It has good defaults for some things. It’s a different grammar that, when learned, is easier for plotting. However, I do not think that ggplot2 figures are ready-set-go for publication.

Although I think ggplot2 is great, I most-definitely think base R graphics are useful.

ggplot2 is not ALWAYS the answer. Again, it was never supposed to be.

BibLaTeX with elsarticle


I like BibLaTeX and StackOverflow presents some ways reasons to switch to BibLaTeX. I like it, if nothing else, I can have multiple bibliographies easily. I can also use natbib citation style and also limit which fields are displayed in my bibliography. One problem with BibLaTeX is that it does not work well with Elsevier articles (class elsarticle), since natbib is loaded by default. There are some other options to induce compatibility are presented here and here.


So I edited the elsarticle.cls file to make it work. See the diff between the 2 files here:

diff elsarticle.cls elsarticle_nonatbib.cls
<  \def\RCSfile{elsarticle}%
>  \def\RCSfile{elsarticle_nonatbib}%
<  \def\@shortjid{elsarticle}
>  \def\@shortjid{elsarticle_nonatbib}
< \newcounter{author}
< \def\author{\@ifnextchar[{\@@author}{\@author}}
> \newcounter{auth}
> \def\auth{\@ifnextchar[{\@@auth}{\@auth}}
< \def\@@author[#1]#2{\g@addto@macro\elsauthors{%
> \def\@@auth[#1]#2{\g@addto@macro\elsauthors{%
< \def\@author#1{\g@addto@macro\elsauthors{\normalsize%
> \def\@auth#1{\g@addto@macro\elsauthors{\normalsize%
< \RequirePackage[\@biboptions]{natbib}
> %\RequirePackage[\@biboptions]{natbib}

If you've never read a diff output, the < means what's written in the first file (elsarticle.cls) and > means the second file (elsarticle_nonatbib.cls). The numbers correspond to the line in each file, e.g 192c193 means line 192 in file 1 and 193 in file 2.
Overall, I changed the type of article, commented out the natbib requirement, and changed the author field to auth (see below for why). The edited cls is located here.

Things you need to change

The \author field conflicts with biblatex and elsarticle, so you must chagne the \author definitions to \auth instead. This is a minor change, but important one. You can change that field to anything you want in the elsarticle_nonatbib.cls file (such as elsauthor).

Minimal Working Example (MWE)

I tried it with Elsevier's sample manuscript, changing the author fields to auth, and adding a biblatex-type heading:

  natbib = true,
    sortcites = true



at the end, and the manuscript came out as if using natbib. The MWE is located here and output PDF is located here.


You can use elsarticle with biblatex, with some minor changes. You may have to include this cls with the rest of your LaTeX files for them to compile on the editors machine. Maybe Elsevier will change the LaTeX for more flexibility, but the StackOverflow question was asked 3 years ago and not much seems to have changed, so I like my solution.

Working with NIfTI images in R

The oro.nifti package is awesome for NeuRoimaging (couldn't help myself). It has functions to read/write images, introduces the S4 nifti class, and has useful plotting functions. There are some limitations and some gotchas that are important to discuss if you are working with these objects in R.

Dataset Creation

We'll read in some data (a template, example taken from readNIfTI in oro.nifti. We have to download it as most packages aren't allowed to include data (Bioconductor has mostly remedied this for bioinformatics).

URL <- ""
urlfile <- "colin_1mm.tgz"
if (!file.exists(urlfile)){
  download.file(URL, dest=urlfile, quiet=TRUE)
img <- readNIfTI("colin_1mm")
img = cal_img(img)

The img object read in is of class nifti, has a series of slots (remember S4 object, not S3), and the print method produces information about the NIfTI header.

[1] "nifti"
[1] "oro.nifti"
NIfTI-1 format
  Type            : nifti
  Data Type       : 2 (UINT8)
  Bits per Pixel  : 8
  Slice Code      : 0 (Unknown)
  Intent Code     : 0 (None)
  Qform Code      : 2 (Aligned_Anat)
  Sform Code      : 2 (Aligned_Anat)
  Dimension       : 181 x 217 x 181 x 1
  Pixel Dimension : 1 x 1 x 1 x 0
  Voxel Units     : mm
  Time Units      : sec
 [1] ".Data"          "sizeof_hdr"     "data_type"      "db_name"       
 [5] "extents"        "session_error"  "regular"        "dim_info"      
 [9] "dim_"           "intent_p1"      "intent_p2"      "intent_p3"     
[13] "intent_code"    "datatype"       "bitpix"         "slice_start"   
[17] "pixdim"         "vox_offset"     "scl_slope"      "scl_inter"     
[21] "slice_end"      "slice_code"     "xyzt_units"     "cal_max"       
[25] "cal_min"        "slice_duration" "toffset"        "glmax"         
[29] "glmin"          "descrip"        "aux_file"       "qform_code"    
[33] "sform_code"     "quatern_b"      "quatern_c"      "quatern_d"     
[37] "qoffset_x"      "qoffset_y"      "qoffset_z"      "srow_x"        
[41] "srow_y"         "srow_z"         "intent_name"    "magic"         
[45] "extender"       "reoriented"    

plot of chunk printimg

Array operation

Now, we can do simple operations on this object like any other array. The problem is that certain operations do not (some rightfully so) return a nifti object.

For example, let's get an indicator if the value is greater than 400000. (This is an MRI and has arbitrary units).

mask = img > 400000
NIfTI-1 format
  Type            : nifti
  Data Type       : 2 (UINT8)
  Bits per Pixel  : 8
  Slice Code      : 0 (Unknown)
  Intent Code     : 0 (None)
  Qform Code      : 2 (Aligned_Anat)
  Sform Code      : 2 (Aligned_Anat)
  Dimension       : 181 x 217 x 181 x 1
  Pixel Dimension : 1 x 1 x 1 x 0
  Voxel Units     : mm
  Time Units      : sec
[1] "nifti"
[1] "oro.nifti"

We see a nifti object is returned, with all the header information contained. Great! Let's get an indicator of values greater than 400000 but less than 600000.

mask_2 = img > 400000 & img < 600000
[1] "array"

We see that this operation returned an array, not a nifti object. That's kind of unexpected, but probably conservative. For example, if this had been 2 different nifti objects and you were performing the subsetting, which object's information should you use?

We see the same behavior when multiplying (or subtracting, etc) two nifti objects.

img_masked = img * mask
[1] "array"

We know the object is an array but we may want a nifti object.

fslr Helper functions

The niftiarr function in fslr inputs a nifti object and an array, and returns a nifti object with the array in the @.Data slot, copying over the image header information from the input nifti object.

### need the development version
# library(devtools)
# devtools::install_github("muschellij2/fslr")
img_masked = niftiarr(img, img * mask)
[1] "nifti"
[1] "oro.nifti"

Other ways to skin the cat

Another way of masking the image is to subset the values of the image that are not in the mask and setting those values to \(0\) (or some other value).

img_masked_2 = img
img_masked_2[mask == 0] = 0
[1] "nifti"
[1] "oro.nifti"

We see that this correctly returns an object of class nifti. One potential issue with this direction is that cal_min and cal_max slots on the resulting nifti object not the range of the image. This is problematic because orthographic and image.nifti both use this slot to determine grayscale ranges for plotting and writeNIfTI errors if these values don't match. The cal_img function in fslr is a simple helper function that will set these values to the respective values of the range of the data.

[1]      0 791972
c(cal.min(img_masked_2), cal.max(img_masked_2))
[1]      0 791972
img_masked_2 = cal_img(img_masked_2)
[1]      0 791972

Note, sometimes readNIfTI sets the cal_min and cal_max slots both to \(0\), which may be a result of how the nifti was created.

Equivalent Operations

We see that after these operations done 2 different ways, the resulting nifti objects are equivalent.

all.equal(img_masked, img_masked_2)
[1] TRUE

Data Typing and Writing nifti objects

I've worked with oro.nifti for some time, and I wanted to also discuss writing nifti objects. Many operations preserve the data type of the array in the nifti object, such as if the data were integers, an array of integers returned. Other times, the data type changes, such as if you smoothed an image of integers, the resulting array would have values that are not integers, but probably continuous values. When plotting the data or operating with the R object, this is not generally an issue because any extraction of the information will take the nifti object's @.Data slot and return that array.

This presents a problem when you write nifti objects using the writeNIfTI function.

Slots datatype and bitpix

In a nifti object, the datatype and bitpix fields specify what type the data is and the level of precision stored in the NIfTI file. For example, if you have binary data, the
datatype would be a value of 2 for UINT8 (unsigned integer) and the bitpix would be 8 for 8 bits per voxel. These must correspond to each other (i.e. no mismatching). Also, storing data in a type that is what I call “over-storing” (e.g storing binary data in a FLOAT32 format) may make files larger than necessary. Under-storing is much more problematic, however.

Why you should care about under-storing

If you read in data as integers, did some operations that returned continuous (decimal) objects, and then write the nifti object using the original header information would truncate the data in ways you wouldn't want. In R, the object would look fine, but when you wrote it to disk, it would be truncated.

[1] 8
[1] 2

Let's say we scaled the data by its maximum:

pct = img / max(img)
pct = cal_img(pct)
[1] 8
[1] 2

plot of chunk pct

### we will explain this in the next section
pct = drop_img_dim(pct)
pct = zero_trans(pct)
pct = onefile(pct)

If we write this to disk, and read the result back in, we see:

hist(pct, main="Correct Data")

### Writing out pct without any datatype change - still UINT8
tfile = tempfile()
writeNIfTI(pct, filename = tfile, verbose=TRUE)
  writing data at byte = 352
pct2 = readNIfTI(paste0(tfile, ".nii.gz"))
hist(pct2, main="Writing out with no change of datatype/bitpix")

### Writing out pct with datatype change - to FLOAT32
pct = datatype(pct, type_string = "FLOAT32")
writeNIfTI(pct, filename = tfile)
pct3 = readNIfTI(paste0(tfile, ".nii.gz"))
hist(pct3, main="Writing out with changing of datatype/bitpix")

plot of chunk readwrite

As you can see, you need to change the bitpix and datatype when writing out NIfTI files if you are changing the data type in the array of data in R. We had changed integers to percentages/proportions, and we changed the output to FLOAT32 (for floating point data). Take a look at convert.datatype() and convert.bitpix() for examples of types.

Other functions for changinag data

We used 3 functions in the last section which changed around pct: drop_img_dim, zero_trans, and onefile.

I want to explain what each of these steps are doing.

Dropping dimensions

When you read in some data (especially from SPM) into R using readNIfTI, the data can be read in as a 4-dimensional image, with one dimension only being one. For example, if we look at the dimensions of img:

[1] 181 217 181   1
[1]   4 181 217 181   1   0   0   0
[1] 1 1 1 1 0 0 0 0

We see that the 4th dimension is 1 and the slot dim_ denotes it's a 4-dimensional object (4 is in the first index). Also, we note that values in the dim_ slot are \(0\), which is not allowed when writing out:

writeNIfTI(img, filename = tempfile())
Error: all dim elements > dim[1] must be 1
dim/pixdim mismatch

In order to force it to be a 3-dimensional object, drop_img_dim changes all these aspects for the user.

img_3d = drop_img_dim(img)
[1] 181 217 181
[1]   3 181 217 181   1   1   1   1
[1] 1 1 1 1 0 0 0 0
tfile = tempfile()
writeNIfTI(img_3d, filename = tfile)

Unscaling the data

Many times this is not necessary, but some NIfTI files have the scl_slope and scl_inter slots defined. These denote what to multiply and then add to the data to get it in the correct values. zero_trans simply makes scl_slope = 0 and scl_inter = 1.

Changing NIfTI magic slots

In each NIfTI image, there is a magic slot (see

[1] "ni1"
[1] 0

The NIfTI documentation states:

“ni1” means that the image data is stored in the “.img” file corresponding
to the header file (starting at file offset 0).

We previously wrote out the 3D image, but if we try to read it back in, readNIfTI fails:

readNIfTI(fname = tfile)
Error: This is not a one-file NIfTI format

We can change the default in readNIfTI to onefile = FALSE, but nothing changes:

writeNIfTI(img_3d, filename = tfile, onefile = FALSE)
readNIfTI(fname = tfile)
Error: This is not a one-file NIfTI format

If we look in we see

if (!onefile) {
    if (nim@magic != "ni1") {
        stop("This is not a two-file NIfTI format")

We can change this magic to n+1, but we need to change the vox_offset = 352 and that's exactly what onefile() does:

img_3d = onefile(img_3d)
writeNIfTI(img_3d, filename = tfile, onefile = FALSE)
readNIfTI(fname = tfile)
NIfTI-1 format
  Type            : nifti
  Data Type       : 2 (UINT8)
  Bits per Pixel  : 8
  Slice Code      : 0 (Unknown)
  Intent Code     : 0 (None)
  Qform Code      : 2 (Aligned_Anat)
  Sform Code      : 2 (Aligned_Anat)
  Dimension       : 181 x 217 x 181
  Pixel Dimension : 1 x 1 x 1
  Voxel Units     : mm
  Time Units      : sec

Viola! All these fun little quirks that I hope you don't have to encounter. But if you do, I hope this page helps.

Converting JPEG2000 DICOMs to Uncompressed DICOM

TL;DR – Neuroimaging-specific post. Some files can't be converted by programs if they are encoded one way – present Matlab program to fix that.

JPEG2000 DICOM data

Recently, I had some DICOM data that is stored in the JPEG2000 format from OsiriX. I wanted to convert these to a NIfTI format using dcm2nii but it was not a fan of the compression, specifically the transfer syntax (which tells you how the data is encoded).

It spit out the error:

Unsupported Transfer Syntax 1.2.840.10008. Solution: use MRIcro

Using Matlab's Image Processing Toolbox

For most of my endeavors, I try to use R for everything, but that's not always possible. For example the oro.dicom package is great for working with DICOM data, but the JPEG2000 compression format is not supported. Therefore, I went into Matlab, which has the image processing toolbox. You will need an updated version, as described here, Matlab 2009b version will not work.

The Code for Conversion

Overall, the code takes a directory, lists all the files, removing directories. . The for loop goes through each DICOM file, reads in the header information with dicominfo, reads in the data matrix with dicomread. The transfer syntax is changed using newsyntax, creates a new filename (within a specified output directory outdir) and writes the DICOM file using dicomwrite. The createmode can either be Create or Copy. Using Create is the default option and has error and missing checking. The Copy mode bypasses many of these checks, but this may be required for certain DICOM files to be written.

rundir = pwd;
updir = fileparts(rundir);
outdir = fullfile(updir, 'Converted');
addon = '';
newsyntax = '1.2.840.10008.1.2';
createmode = 'Copy'; % could be 'Create';
x = dir(rundir);
nodir = ~[x.isdir]';
x = x(nodir);
x = {}';
ifile = 1;
for ifile = 1:length(x)
    name = x{ifile};
    stub = regexprep(name, '(.*)[.].*', '$1');
    stub = [stub addon '.dcm']; %#ok<AGROW>
    name = fullfile(rundir, name);
    stub = fullfile(outdir, stub);
    dinfo = dicominfo(name);
    dinfo.TransferSyntaxUID = newsyntax;
    X = dicomread(name);
    dicomwrite(X, stub, dinfo, 'CreateMode', createmode);

After this conversion, dcm2nii worked as usual for converting.

Other Options

Other options exist such as the DICOM JPEG 2000 Module, but this must be licensed and Matlab has implemented this in imread. Just figured if you ran into this problem, this may help.

Writing Accountability Groups (WAGs)

Recently, 7 other students and myself joined the first writing accountability group (WAG) for students in our biostatistics department. Johns Hopkins has a set of WAGs all around campus, but we are the first student-only group.

What's a WAG?

I posted before about strategies of How to Write a Lot based on Paul Silvia's Book of the same name. One of his recommendations (especially for junior faculty) was to start what he calls an “agraphia group”, where you discuss goals, go on your way, and then come back and discuss if you met those goals or not. He (and I believe in this) seemed rigorous in this regard: if you came 2 times saying “I didn't do anything”- you were out.

The WAG is the implementation of such a group. The group consists of 3 parts:

  1. 15 minutes of discussing previous goals and goals for the 30 minute writing session.
  2. 30 minutes of “writing” – where writing means accomplishing the goals set in 1.
  3. 15 minutes of discussing if the goals were met from the 30 minute session and setting goals for the next week.

The WAG Makeup

The group consists of 4-8 participants. Don't have more than that – you won't have enough time in an hour and it will become unruly. Also, it's good when you can spot in a second who is missing. It runs for 10 weeks (it can run more), but gives a good concrete timeframe. This inhibits participants from saying “I'll come next time”. There are only 10 sessions, show up – a maximum of 2 misses are allowed.

Why did we start the group?

Now, why did we start this group – why not “just write”?

  1. Peer pressure can be used for good. No one wants to say “I've done nothing”. Even if you get 1 paragraph done as a last-ditch effort before the meetings, it's > 0 paragraphs.
  2. Goals are hard to set; it's easier with others flushing out your ideas.
  3. You are writing for 30 minutes even if you get nothing else done that week.

Overall Goals

Types of overall goals are group are trying to accomplish:
1. Write a paper.
2. Edit a paper for review/submission.
3. Write background/significance section on a grant.
4. Write more in a blog

30-minute Session Goals

These are large and somewhat abstract. In a goal-setting session for the 30 minute writing, we have more concrete goals such as:
1. Write 2 paragraphs of the results section for a paper.
2. Make patient demographics table
3. Incorporate all comments from collaborator 1.
4. Get 1 polished paragraph for background section
5. Write 3 sections of blog post, post by end of day

15-minute End Goal Creation

Examples of these goals are:
1. Write 2 pages of paper 1.
2. Make patient demographics table and 3 figures.
3. Incorporate all comments from all collaborators and format for journal.
4. “Lock down” background section as done.
5. Post blog post, write second for editing next Monday.

How do I start one?

If you are at Hopkins, please visit the WAG facebook group above. If not, just get a few people together and follow the above setup. Make people commit to a timeline (maybe even sign a pseudo contract so they know what they're getting into) that is short, concrete, and has a seeable end. I recommend buying a few of the Write a Lot books (as well as others) to share and discuss.

Additional WAG-related Time

The students in our group have also discussed that the 30 minutes in our WAG is helpful, and an additional 30 minutes would be helpful. We are testing an additional writing “meeting” and we will try to determine its effect. We will adapt based on our WAG (like 10 min goals, 45 mins writing, 5 end goals).

Statisticians in Neuroimaging Need to Learn Preprocessing

I have worked in neuroimaging for the past 4-5 years. I work on CT scans for stroke patients, but have also worked on fMRI and some structural image analyses. One important thing I have learned: (pre?)processing means a lot.

Take a note from Bioinformatics

In some respects, Bioinformaticians had the best opportunity when sequencing became more affordable and the data exploded. I say they had it good because they were the ones who got the raw (mostly) data and had to figure out how to analyze it. That's not to say, in any way, it was easy to figure out correct analysis methods, develop an entire industry from the ground up, or jump into big datasets that required memory far beyond the range laptops could handle. The reason I say they have it good is because the expectation for those working on the data (e.g. (bio)statisticians) to know (and usually agree) with how the data was (pre)processed.

You trust that data?

I remember a distinct conversation at a statistics conference when I spoke to a post-doc, trained in statistics, who worked in imaging. I sat next to him/her and asked “how do you preprocess your data”? The response: “Oh, I don't know, my collaborator does it and I work on the processed data.” I was confused. “You trust that data?” I thought. I have heard that more times since then, but increasingly hear more people getting involved in the whole pipeline: from data collection to analysis.

An analogy to standard datasets

To those who don't work in bioinformatics or imaging, I'll make this analogy: someone gives you dataset and one column is transformed in a non-linear way, but those giving you the data can't really tell you how it was transformed. I think for many it'd be hard to trust and accept that data. My biostat training has forever given me data trust issues. It's hard for me to trust people who give me data.

Questions I usually rattle off rapid-fire:
1. How was it collected?
2. Why is this missing?
3. Why is this point really weird?
4. What does -9/999/. mean?
5. Where is your codebook?
6. Is that patient information!? Ugh. I'm deleting this email and you can remove that and resend. Better yet, use DropBox. NO – don't keep the original with the patient info in there!

It sounds more like an investigation rather than a collaboration – I'm working on changing that. But I was trained to do that because those things are important.

Back to imaging

Many times though, this is exactly how a dataset is given to a statistician. The images were processed in a way, and sometimes registered to a “template” image in a non-linear way.

Why do I think that this happens more often in neuroimaging? (It probably happens in bioinformatics too but I won't speak to that). I think it's because

Preprocessing is uninteresting/hard/non-rewarding/time-consuming

Moreover, I believe there is a larger

MISCONCEPTION that Preprocessing is not important


When I started my lab in fMRI, they had me preprocess data BY HAND (well by click, but you get the idea). They had me go through each step so that I understood what that step did to the data. It made me realize why and where things would go wrong and also taught me an important lesson: decisions upstream in the processing can have tremendous effects downstream. I am forever grateful for that.

It also taught me that preprocessing can be a boring and pain-staking business. Even after I got to scripting the preprocessing, there were still manual steps to check that are inherent. For example, if you co-register (think matching my brain and your brain together) images, you want to make sure it works right. Did this brain really match up way to that brain? There are some methods to try to estimate quality, but almost everyone has to look at the images.

Statisticians are trained to look at data, so we should be USED TO THIS PRACTICE. The problem is 1) if it works, the response is “OK next” and feel like time was wasted (it wasn't) or 2) if it doesn't you have to fix it or throw away the data, which can be painful and long.

How long do you think looking at one scan would take? OK, looking at 1-2 scans is quick, but what about 100? What if I said 1000?

Before I discuss trusting collaborators let me make my message clear:

Even if you don't do the preprocessing yourself, you should know what preprocessing was done on your data. Period.

In my eyes, speakers lose a lot of credibility if they can't answer a few simple questions about how their data got to the analysis stage. Now I haven't remembered every flip angle we've used, but I for sure knew if the data was band-pass filtered.

Trusting Collaborators

Here comes the dilemma: get in the trenches and do hours of work preprocessing the data or get the data preprocessed from collaborators. I say a little of both. Sit down with one of the people that do the preprocessing and watch them/go over their scripts with them. Ask many questions – people may ask you these questions later.

The third option, and one I believe we strive for in our group, is to develop methods that require “light preprocessing”. That is, do things on a per-image or per-person level, derive measures, and then analyze these (usually low-dimensional) measures for groups/populations.

There are some steps that are unavoidable. If you want population information on a spatial brain level, you'll likely have to register/warp images to a template. But if this is the case, do some “procedure sensitivity analysis” – try a couple different registration/warping procedures and see how sensitive your results are. If they are highly dependent on registration, you should be sure the one you chose is “correct”. Dr. Ani Eloyan just had a paper accepted on this very topic titled “Health effects of lesion localization in multiple sclerosis: Spatial registration and confounding adjustment” to come out in PLoS ONE in the next month. If others are doing the processing, and you don't know how to, this can be hard to figure out the right questions to ask. So learn.

Moreover, ask collaborators about the data they threw out along the way. Was it all the females? All under 5 years old? All the people who move too much? Don't stop asking the questions about missing data and potential biases lurking in those discarded (costly) images.

Large Benefits of Learning Preprocessing

Each pre-processing is used for some sort of goal: to correct this or to normalize that, etc. Thus, there is a industry of developing and checking preprocessing steps. So not only can you help to develop statistical models for the data, you can also develop methods that may improve processing, check whether preprocessing steps are good, or test whether one preprocessing method is better than the other (this would be huge). If you don't know about the processing you're missing out on a large piece of the methodological work that can be done.


Learn about preprocessing. It's part of the game with imaging. This may scare some people – good. Let them leave; there are plenty of questions and problems for the rest of us. Looking at brain images (and showing your friends) is still pretty cool to me, that's why I'm still in the imaging game. But I needed to learn the basics.

One warning: if you do know how to process data, people will want you to do it for them. Try as best you can to fight this and instead train others how to do their own processing and convince them why it is useful.

Warning: Shameless Plug

At ENAR 2015, Ciprian Crainiceanu, Ani Eloyan, Elizabeth Sweeney, Taki Shinohara and myself will be presenting a 1 hour 45-minute tutorial on converting raw images, reading data into R, and some basic preprocessing methods.

Converting LaTeX to MS Word

Last year, Elizabeth Sweeney wrote about how she converts LaTeX to Word. If you're trying all open-source solutions to this problem, visit there.

In my experience, I was writing in LaTeX as well. I had a journal that only accepted Word documents. I had to convert from LaTeX to Word.

Same story, different day

I tried a lot of the solutions from StackExchange: latex2rtf, pandoc, TeX2Word, etc.

I think the best quote there is

There is no pain-free way to do this. Really.

And no, nothing really worked VERY well straight out of the box. My solution was hackey as well, but it worked the best for me with a reasonable amount of formatting for me. The biggest problems were, not surprisingly, equations. Some garbled the text everywhere, others created image files that were included in the file.

What I used

My pipeline converted PDF to Word (.docx) using Acrobat Adobe Pro. This is relatively cheap for our students and is a solid program, though somewhat pricey. The conversion was similar for headings/sections to those above, but the equations were “converted”. The equations were converted to some pseudo-equations, but when highlighted in the Word doc, and then clicking Insert > Equation, and viola! The equations looked pretty good (aka usable).

I would say, though, the conversion was not perfect. I had some odd problems with superscripts and such, and ended up uploading the .docx to Google Drive as we were going to try editing together, but it never worked. I did noticed the Google Drive document fixed many formatting issues the OCR had caused from Acrobat Pro. I downloaded it from Drive and the formatting issues were fixed…but equations were a mess again! I ended up just copying and pasting the equations from the pre-uploaded .docx into the Google-Drive-converted .docx. That's where I had my best results.

Someone Please Stop the Madness!

Is this a good pipeline? No. Did it technically “work”? Yes. Why did I go through LaTeX in the first place. Well, 1) I didn't know they only took Word. This is my fault, but could have easily been my 2nd submission to a journal and the other journal accept PDF. 2) I had equations. MS equations, even though they “accept” LaTeX – no. 3) I know how to get LaTeX to format things good enough.

I will go on 1 rant and then discuss some light in this darkness.


Seriously journals, you only accept Word documents? What is this bullshit. The journal even accepted PDFs in Supplements. Everyone can read and annotate PDFs nowadays; get rid of Word requirements.
I imagine this perpetuates because 1) it's easy to use Word's word count and say “that's how many words you sent”, even though that's ridiculous (you included references in your word count!?), and 2) the editors/typesetters have used it for print journals, 3) some reviewers only use Word and cannot annotate PDF.

You know what – get rid of the reviewers from 3). You're reviewing cutting edge research and didn't keep up with a technology that pretty much every journal uses for papers. Maybe you aren't the best for that job.

Light up the Darkness

RStudio released the Knit to Word button in their new versions. Now, many people who use pandoc as discussed before, knew how to do this on some level. The big difference for me is that 1) I never thought to say only in R Markdown and skip LaTeX altogether, 2) It's click-button in RStudio which means more will use it, and 3) I can switch between PDF and Word with one click. With citation style files and knitcitations, I think I can get close to LaTeX references and automated reporting.

Next post to follow up on this.