Statisticians in Neuroimaging Need to Learn Preprocessing

I have worked in neuroimaging for the past 4-5 years. I work on CT scans for stroke patients, but have also worked on fMRI and some structural image analyses. One important thing I have learned: (pre?)processing means a lot.

Take a note from Bioinformatics

In some respects, bioinformaticians had the best opportunity when sequencing became more affordable and the data exploded. I say they had it good because they were the ones who got the (mostly) raw data and had to figure out how to analyze it. That's not to say, in any way, that it was easy to figure out correct analysis methods, develop an entire industry from the ground up, or jump into big datasets that required memory far beyond what laptops could handle. The reason I say they have it good is that those working on the data (e.g. (bio)statisticians) are expected to know, and usually agree with, how the data was (pre)processed.

You trust that data?

I remember a distinct conversation at a statistics conference when I spoke to a post-doc, trained in statistics, who worked in imaging. I sat next to him/her and asked, “How do you preprocess your data?” The response: “Oh, I don't know, my collaborator does it and I work on the processed data.” I was confused. “You trust that data?” I thought. I have heard that more times since then, but I increasingly hear of more people getting involved in the whole pipeline: from data collection to analysis.

An analogy to standard datasets

To those who don't work in bioinformatics or imaging, I'll make this analogy: someone gives you a dataset and one column is transformed in a non-linear way, but those giving you the data can't really tell you how it was transformed. I think for many it'd be hard to trust and accept that data. My biostat training has forever given me data trust issues. It's hard for me to trust people who give me data.

Questions I usually rattle off rapid-fire:
1. How was it collected?
2. Why is this missing?
3. Why is this point really weird?
4. What does -9/999/. mean?
5. Where is your codebook?
6. Is that patient information!? Ugh. I'm deleting this email and you can remove that and resend. Better yet, use DropBox. NO – don't keep the original with the patient info in there!
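
Question 4 is the one that bites most often in practice. Here is a minimal R sketch of what I do once the codebook answers it, assuming (hypothetically) that -9 and 999 both mean "missing". The column names and values are made up for illustration.

```r
# Toy data: values and column names are hypothetical.
dat <- data.frame(
  age = c(34, -9, 51, 999),
  sbp = c(120, 135, 999, -9)
)

# Codes that the (assumed) codebook says mean "missing".
missing_codes <- c(-9, 999)

# Replace each sentinel code with NA, one column at a time.
recode_missing <- function(x, codes) {
  x[x %in% codes] <- NA
  x
}
clean <- as.data.frame(lapply(dat, recode_missing, codes = missing_codes))

# Count how many values are now missing in each column.
colSums(is.na(clean))
```

If the data come from a text file, the na.strings argument of read.csv (e.g. na.strings = c("-9", "999", ".")) can do the same recoding at import time, but only after someone has confirmed what those codes mean.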

It sounds more like an investigation rather than a collaboration – I'm working on changing that. But I was trained to do that because those things are important.

Back to imaging

Many times though, this is exactly how a dataset is given to a statistician. The images were processed in some way, and sometimes registered to a “template” image in a non-linear way.

Why do I think that this happens more often in neuroimaging? (It probably happens in bioinformatics too but I won't speak to that). I think it's because

Preprocessing is uninteresting/hard/non-rewarding/time-consuming

Moreover, I believe there is a larger

MISCONCEPTION that Preprocessing is not important


When I started in my lab in fMRI, they had me preprocess data BY HAND (well, by click, but you get the idea). They had me go through each step so that I understood what that step did to the data. It made me realize why and where things would go wrong and also taught me an important lesson: decisions upstream in the processing can have tremendous effects downstream. I am forever grateful for that.

It also taught me that preprocessing can be a boring and painstaking business. Even after I got to scripting the preprocessing, there were still inherent manual steps to check. For example, if you co-register (think matching my brain and your brain together) images, you want to make sure it works right. Did this brain really match up well to that brain? There are some methods to try to estimate quality, but almost everyone has to look at the images.

Statisticians are trained to look at data, so we should be USED TO THIS PRACTICE. The problem is 1) if it works, the response is “OK next” and you feel like time was wasted (it wasn't) or 2) if it doesn't, you have to fix it or throw away the data, which can be painful and long.

How long do you think looking at one scan would take? OK, looking at 1-2 scans is quick, but what about 100? What if I said 1000?

Before I discuss trusting collaborators let me make my message clear:

Even if you don't do the preprocessing yourself, you should know what preprocessing was done on your data. Period.

In my eyes, speakers lose a lot of credibility if they can't answer a few simple questions about how their data got to the analysis stage. Now I haven't remembered every flip angle we've used, but I for sure knew if the data was band-pass filtered.

Trusting Collaborators

Here comes the dilemma: get in the trenches and do hours of work preprocessing the data or get the data preprocessed from collaborators. I say a little of both. Sit down with one of the people that do the preprocessing and watch them/go over their scripts with them. Ask many questions – people may ask you these questions later.

The third option, and one I believe we strive for in our group, is to develop methods that require “light preprocessing”. That is, do things on a per-image or per-person level, derive measures, and then analyze these (usually low-dimensional) measures for groups/populations.

There are some steps that are unavoidable. If you want population information on a spatial brain level, you'll likely have to register/warp images to a template. But if this is the case, do some “procedure sensitivity analysis” – try a couple different registration/warping procedures and see how sensitive your results are. If they are highly dependent on registration, you should be sure the one you chose is “correct”. Dr. Ani Eloyan just had a paper accepted on this very topic titled “Health effects of lesion localization in multiple sclerosis: Spatial registration and confounding adjustment” to come out in PLoS ONE in the next month. If others are doing the processing, and you don't know how to, it can be hard to figure out the right questions to ask. So learn.
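
To make “procedure sensitivity analysis” concrete, here is a rough R sketch. Everything in it is an assumption for illustration: the data are simulated, and the three procedure names are placeholders for whatever registration/warping options you actually reran. The idea is only to fit the same model to the measure derived under each procedure and look at how much the estimate moves.

```r
set.seed(1)
n <- 50
group <- rep(c(0, 1), each = n / 2)

# Hypothetical per-subject measure derived after three registration procedures.
# Simulated here; in practice each column comes from rerunning the pipeline.
signal <- 0.5 * group + rnorm(n)
measures <- data.frame(
  affine      = signal + rnorm(n, sd = 0.2),
  nonlinear_a = signal + rnorm(n, sd = 0.2),
  nonlinear_b = signal + rnorm(n, sd = 0.2)
)

# Fit the same model under each procedure and keep the group effect and its SE.
fits <- sapply(measures, function(y) {
  fit <- lm(y ~ group)
  c(estimate = unname(coef(fit)["group"]),
    se       = coef(summary(fit))["group", "Std. Error"])
})
round(t(fits), 3)

# Spread of the estimates across procedures; compare it to the standard errors.
diff(range(fits["estimate", ]))
```

If the spread across procedures is large relative to the standard errors, the conclusion depends on the registration choice, and that choice needs to be defended.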

Moreover, ask collaborators about the data they threw out along the way. Was it all the females? All under 5 years old? All the people who move too much? Don't stop asking the questions about missing data and potential biases lurking in those discarded (costly) images.
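
A quick way to start that conversation is to tabulate who was dropped. The sketch below assumes a hypothetical subject table with an excluded flag; the columns and values are invented, and in real life you would have to ask your collaborator for this table.

```r
# Hypothetical subject table: one row per scanned subject.
subjects <- data.frame(
  sex      = c("F", "F", "M", "M", "F", "M", "F", "M"),
  age      = c(4, 30, 45, 3, 52, 61, 2, 38),
  excluded = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Cross-tabulate exclusion by sex.
table(sex = subjects$sex, excluded = subjects$excluded)

# Compare mean age of kept (FALSE) vs. discarded (TRUE) scans.
age_by_status <- tapply(subjects$age, subjects$excluded, mean)
age_by_status
```

In this toy table every discarded scan belongs to a child under 5, which is exactly the kind of pattern that should trigger more questions.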

Large Benefits of Learning Preprocessing

Each preprocessing step is used for some sort of goal: to correct this or to normalize that, etc. Thus, there is an industry of developing and checking preprocessing steps. So not only can you help to develop statistical models for the data, you can also develop methods that may improve processing, check whether preprocessing steps are good, or test whether one preprocessing method is better than the other (this would be huge). If you don't know about the processing you're missing out on a large piece of the methodological work that can be done.


Learn about preprocessing. It's part of the game with imaging. This may scare some people – good. Let them leave; there are plenty of questions and problems for the rest of us. Looking at brain images (and showing your friends) is still pretty cool to me, that's why I'm still in the imaging game. But I needed to learn the basics.

One warning: if you do know how to process data, people will want you to do it for them. Try as best you can to fight this and instead train others how to do their own processing and convince them why it is useful.

Warning: Shameless Plug

At ENAR 2015, Ciprian Crainiceanu, Ani Eloyan, Elizabeth Sweeney, Taki Shinohara and I will be presenting a 1 hour 45 minute tutorial on converting raw images, reading data into R, and some basic preprocessing methods.

Converting LaTeX to MS Word

Last year, Elizabeth Sweeney wrote about how she converts LaTeX to Word. If you're trying all open-source solutions to this problem, visit there.

In my experience, I was writing in LaTeX as well. I had a journal that only accepted Word documents. I had to convert from LaTeX to Word.

Same story, different day

I tried a lot of the solutions from StackExchange: latex2rtf, pandoc, TeX2Word, etc.

I think the best quote there is

There is no pain-free way to do this. Really.

And no, nothing really worked VERY well straight out of the box. My solution was hacky as well, but it worked the best for me, with a reasonable amount of formatting. The biggest problems were, not surprisingly, equations. Some tools garbled the text everywhere, others created image files that were included in the file.

What I used

My pipeline converted PDF to Word (.docx) using Adobe Acrobat Pro. This is relatively cheap for our students and is a solid program, though somewhat pricey otherwise. The conversion was similar for headings/sections to those above, but the equations were “converted”. They came out as pseudo-equations, but when I highlighted them in the Word doc and clicked Insert > Equation, voilà! The equations looked pretty good (aka usable).

I would say, though, the conversion was not perfect. I had some odd problems with superscripts and such, and ended up uploading the .docx to Google Drive as we were going to try editing together, but it never worked. I did notice that the Google Drive document fixed many formatting issues the OCR from Acrobat Pro had caused. I downloaded it from Drive and the formatting issues were fixed…but equations were a mess again! I ended up just copying and pasting the equations from the pre-uploaded .docx into the Google-Drive-converted .docx. That's where I had my best results.

Someone Please Stop the Madness!

Is this a good pipeline? No. Did it technically “work”? Yes. Why did I go through LaTeX in the first place? Well, 1) I didn't know they only took Word. This is my fault, but it could have easily been my 2nd submission to a journal, with the other journal accepting PDF. 2) I had equations. MS equations, even though they “accept” LaTeX – no. 3) I know how to get LaTeX to format things well enough.

I will go on 1 rant and then discuss some light in this darkness.


Seriously journals, you only accept Word documents? What is this bullshit? The journal even accepted PDFs in Supplements. Everyone can read and annotate PDFs nowadays; get rid of Word requirements.
I imagine this perpetuates because 1) it's easy to use Word's word count and say “that's how many words you sent”, even though that's ridiculous (you included references in your word count!?), 2) the editors/typesetters have used it for print journals, and 3) some reviewers only use Word and cannot annotate PDFs.

You know what – get rid of the reviewers from 3). You're reviewing cutting edge research and didn't keep up with a technology that pretty much every journal uses for papers. Maybe you aren't the best for that job.

Light up the Darkness

RStudio released the Knit to Word button in their new versions. Now, many people who use pandoc, as discussed before, knew how to do this on some level. The big difference for me is that 1) I never thought to stay only in R Markdown and skip LaTeX altogether, 2) it's click-button in RStudio, which means more people will use it, and 3) I can switch between PDF and Word with one click. With citation style files and knitcitations, I think I can get close to LaTeX references and automated reporting.
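
For those who prefer the console to the button, here is a small sketch of the same idea using rmarkdown::render. The file name and document contents are made up; the render call only runs if rmarkdown and pandoc are available, and the PDF line is commented out because it also needs a LaTeX installation.

```r
# Write a tiny R Markdown file to a temporary location (contents are illustrative).
rmd <- file.path(tempdir(), "paper.Rmd")
writeLines(c(
  "---",
  "title: \"My paper\"",
  "output: word_document",
  "---",
  "",
  "Some text with an equation: $y = \\beta_0 + \\beta_1 x$."
), rmd)

# Render the same source to Word (and, optionally, PDF).
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd, output_format = "word_document", quiet = TRUE)
  # rmarkdown::render(rmd, output_format = "pdf_document", quiet = TRUE)
}
```

Switching the output_format argument is the scripted equivalent of switching between the Knit to Word and Knit to PDF buttons.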

Next post to follow up on this.

Your Research is a Pain-in-the-ass Unicorn

I'm submitting another paper and I've come to a similar spot: I've edited and read the paper 15 times and don't even want to look at it. Things aren't exactly the same as before: I did a lot more writing, had a better writing schedule, which involved a more positive editing schedule, and I'm still relatively excited about the results.


The key word in that paragraph is relatively. I have talked with many students and faculty about how, after you do all the analysis, the last push to write seems 10 times harder than all the analysis. I'm sure there are names and psychological theories about why this happens and all the stuff that goes on in your head. The How to Write A Lot book discusses many reasons why we don't like writing. But these feelings don't happen only when you're writing: they also happen when you have worked on one project for a long time and want to bash it with a hammer sometimes.

Pain-in-the-ass Unicorn

This is natural. I would like to make the following analogy: your research/thesis/dissertation is a pain-in-the-ass unicorn. Think about what would happen if you got a unicorn.

You'd be like OMFG a unicorn!! Everyone look – it's a unicorn. It's MY unicorn! Woo! Anything the unicorn did, you'd be excited. Even if the unicorn took shit all over your bathroom. It's like oh man, I've never seen unicorn poop! You would tell everyone about it and be super excited. After a few months, you'd still be excited to see the unicorn when you came home and take it for unicorn rides and such. You'd mention it to new friends and they'd react like: OH MY GOD NO WAY, that's SOOOOOOO COOL! To which, you'd just shrug and say “Yeah, it's pretty cool. I like it.” and “Oh, you didn't know unicorns could do that?”.

And when they came over they'd play with your unicorn and you'd sit and watch. They would then say something like “I wish I had a unicorn.” and it'd hit you: NO YOU DON'T. The unicorn shit on the floor a year after you got it and you can't hide your frustration and anger. You complain to your friends about your pain-in-the-ass unicorn. Why do I even have this thing?

I'm so glad for this post I found after writing the post. Especially this figure (no explanation needed):

Dealing with my unicorn

This is what I see with a lot of my work (and observing others). Initial excitement, lulled into complacency with the problem, thinking it's common and not novel/exciting, and frustration/complaining. I'm not sure if this is “natural”, but I would say it's common. What I try to remind other students and myself time and time again:


Most of the projects you work on in grad school/academia are the cutting edge of research. Even if it's common to you, it sure isn't common to many other people, most importantly to you 1/2/4 years ago. If you existed then, didn't know the solution, and were excited about the solution, then so will others. Granted, maybe not droves of people, but some. It's your unicorn and realize it's still special and awesome. Yeah it may shit on the couch, but it's a FREAKING UNICORN.

Combatting this I-hate-my-unicorn Feeling

I try strategies to avoid this pattern to keep me excited.

  1. This answer depends. I like to have more than 1 project. Too many projects can hinder progress, but a few can make you realize good aspects of each one. Remember, your unicorns may play well with each other. If they don't, you have 2 unicorns shitting on the floor.
  2. Talk to new people/collaborators about your work. Excitement in others revives excitement in yourself. Talk to your advisor/mentor/collaborator. Many times they reinforce why you are doing the work.
  3. Listen to others about their unicorns. It can help you put your research in perspective because you'll think about your unicorn that way. Or their unicorn/data/collaborator is even worse and you will feel grateful for your unicorn.
  4. Talk to people outside of academia about your unicorn every now and then. This importantly includes not downplaying (but not boasting about) your research when people note that it's cool/exciting/hard/important. They gave you a compliment, don't say “no it's not” – that's rude.
  5. Read reddit – especially r/futurology's science summary of the week.
  6. Make the coolest/craziest figure you can of your research for a talk. Something people will remember and talk about, for better or worse. Interactive – yes; 3D boxplot – NO. Maybe put in unicorns.

We've all been there. Ask around. Some people have even toilet trained their unicorns. Talk to them.

Sorted HTML Tables and Javascript Libraries

A few days ago StatsInTheWild asked the following question

So we had a few exchanges where I thought you could use sprintf and be done but it didn't seem to work:

After a bit more discourse, StatsInTheWild shared some data with me:

and I went down the rabbit hole of trying to find out what was going on. Here is the code to make the table:

# Download the data once; the URL is blank in the source and is left as-is.
myfile <- "openWAR.csv"
if (!file.exists(myfile)) {
  download.file("", myfile, method = "wget")
}
openWAR <- read.csv(myfile, stringsAsFactors = FALSE)

And as you can see in the output table here the column RAA.pitch does not sort correctly.

Attempts at, and then finding, a Solution

I tried a few things such as changing numeric to string, seeing if missing data was a problem, trying some things where I make the numbers all positive, but the problem persisted.

As StackOverflow usually does, it had insight into an answer. Essentially, prior to version 2.0.5, jquery.tablesorter.js didn't sort numbers exactly correctly. The problem is that SortableHTMLTables ships with version 2.0.3:

head(readLines(system.file("assets/jquery.tablesorter.js", package = "SortableHTMLTables"))) 
[1] "/*"                                                       
[2] " * "                                                      
[3] " * TableSorter 2.0 - Client-side table sorting with ease!"
[4] " * Version 2.0.3"                                         
[5] " * @requires jQuery v1.2.3"                               
[6] " * "                                                      

and uses this version for the table output. Now, if you wanted to fix this, you'd have to add some CSS to your file or take some other route. Or, you can just update jquery.tablesorter.js. I went and downloaded the new js plugin.
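
If you want to check the bundled version programmatically rather than by eye, here is a small sketch. It works on the header lines printed above, typed in as a character vector so the snippet runs without the package installed; in practice you would pass it the result of readLines() on the .js file. The 2.0.5 cutoff is the one from the StackOverflow answer.

```r
# Header lines as printed above; in practice use readLines() on the .js file.
header <- c(
  "/*",
  " * ",
  " * TableSorter 2.0 - Client-side table sorting with ease!",
  " * Version 2.0.3",
  " * @requires jQuery v1.2.3",
  " * "
)

# Find the "Version x.y.z" line and extract the version string.
version_line <- grep("Version [0-9.]+", header, value = TRUE)[1]
bundled <- package_version(sub(".*Version ([0-9.]+).*", "\\1", version_line))

# TRUE means the bundled plugin predates the release with the sorting fix.
needs_update <- bundled < package_version("2.0.5")
needs_update
```

Using package_version compares the version components numerically, rather than comparing the strings character by character.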

But I want this automatic!

If you're using R and don't want to play around with JavaScript, that's the whole point of these functions. Saying you have to edit CSS or something of the like defeats that purpose. But for this fix to be “automatically” done, you either have to 1) copy the .js file every time you run the sortable.html.table command, as it re-copies the files over, 2) wait for the maintainer to update (out of your control), 3) change your CSS, or 4) copy a new .js file with a different name and edit the HTML file after running to make sure it uses your new js file. I'll implement 4).

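outfile <- "openWAR2014_fixed.html"
# Point the generated HTML at the newer tablesorter file instead of the bundled one.
change_js <- function(f, newjs = "jquery.tablesorter_v2.0.5.js") {
  x <- readLines(f)
  # fixed = TRUE treats the file name literally rather than as a regex
  x <- gsub("jquery.tablesorter.js", newjs, x, fixed = TRUE)
  writeLines(x, con = f)
}
change_js(outfile)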

(I named my file jquery.tablesorter_v2.0.5.js). Now, you see here, the table works! Hope this helps.
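
To see what the replacement does without generating the full table, here is a self-contained sketch that runs the same function on a stand-in HTML file. The two script tags are made up for illustration; the real file is the one written by sortable.html.table.

```r
change_js <- function(f, newjs = "jquery.tablesorter_v2.0.5.js") {
  x <- readLines(f)
  # fixed = TRUE matches the file name literally (the dots are not wildcards).
  x <- gsub("jquery.tablesorter.js", newjs, x, fixed = TRUE)
  writeLines(x, con = f)
}

# A stand-in HTML file with made-up script tags.
html <- tempfile(fileext = ".html")
writeLines(c(
  "<script src=\"jquery.js\"></script>",
  "<script src=\"jquery.tablesorter.js\"></script>"
), html)

change_js(html)
readLines(html)
```

After the call, the second script tag points at jquery.tablesorter_v2.0.5.js and the first is untouched. The new .js file still has to sit next to the HTML file for the page to load it.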


Note, I contacted the maintainer and I'm sure he'll fix this in the next update (he does a LOT of awesome work and development).