Converting LaTeX to MS Word

Last year, Elizabeth Sweeney wrote about how she converts LaTeX to Word. If you want to try all-open-source solutions to this problem, start there.

I was writing in LaTeX as well, but the journal I was submitting to only accepted Word documents, so I had to convert from LaTeX to Word.

Same story, different day

I tried a lot of the solutions from StackExchange: latex2rtf, pandoc, TeX2Word, etc.

I think the best quote there is

There is no pain-free way to do this. Really.

And no, nothing really worked VERY well straight out of the box. My solution was hacky as well, but it worked best for me and preserved a reasonable amount of formatting. The biggest problems were, not surprisingly, equations: some tools garbled the text everywhere, others converted equations into image files embedded in the document.

What I used

My pipeline converted the PDF to Word (.docx) using Adobe Acrobat Pro. It is a solid program, and while somewhat pricey, it is relatively cheap for students. The conversion handled headings/sections about as well as the tools above, but the equations were only “converted” into pseudo-equations. However, if you highlight one of these in the Word doc and click Insert > Equation, voilà! The equations looked pretty good (i.e., usable).

The conversion was not perfect, though. I had some odd problems with superscripts and such, and I ended up uploading the .docx to Google Drive because we were going to try editing it together (that never worked out). I did notice that the Google Drive document fixed many formatting issues the OCR in Acrobat Pro had caused. When I downloaded it from Drive, the formatting issues were fixed…but the equations were a mess again! I ended up copying and pasting the equations from the pre-upload .docx into the Google-Drive-converted .docx. That's where I got my best results.

Someone Please Stop the Madness!

Is this a good pipeline? No. Did it technically “work”? Yes. Why did I use LaTeX in the first place? Well, 1) I didn't know the journal only took Word. That's my fault, but this could easily have been a second-choice submission after another journal that accepted PDF. 2) I had equations. MS Word equations, even though Word “accepts” LaTeX – no. 3) I know how to get LaTeX to format things well enough.

I will go on 1 rant and then discuss some light in this darkness.

Rant

Seriously, journals, you only accept Word documents? What is this bullshit? The journal even accepted PDFs in the Supplements. Everyone can read and annotate PDFs nowadays; get rid of Word requirements.
I imagine this persists because 1) it's easy to use Word's word count and say “that's how many words you sent”, even though that's ridiculous (you included references in the word count!?), 2) the editors/typesetters have always used it for print journals, and 3) some reviewers only use Word and cannot annotate PDFs.

You know what – get rid of the reviewers from 3). You're reviewing cutting edge research and didn't keep up with a technology that pretty much every journal uses for papers. Maybe you aren't the best for that job.

Light up the Darkness

RStudio released the Knit to Word button in its newer versions. Many people who use pandoc, as discussed before, knew how to do this on some level. The big differences for me are that 1) I had never thought to write only in R Markdown and skip LaTeX altogether, 2) it's a click of a button in RStudio, which means more people will use it, and 3) I can switch between PDF and Word with one click. With citation style files and knitcitations, I think I can get close to LaTeX-quality references and automated reporting.
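
As a rough illustration (the file name is hypothetical, not from the post), this is essentially what the Knit button does behind the scenes: render the same R Markdown file to Word or to PDF.

library(rmarkdown)
# one .Rmd source, two output formats: switch with a single call (or the Knit button)
render("paper.Rmd", output_format = "word_document")
render("paper.Rmd", output_format = "pdf_document")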

Next post to follow up on this.

Your Research is a Pain-in-the-ass Unicorn

I'm submitting another paper and I've come to a similar spot: I've edited and read the paper 15 times and don't even want to look at it anymore. Things aren't exactly the same as before: I did a lot more writing, had a better writing schedule with a more positive editing routine, and I'm still relatively excited about the results.

Relatively

The key word in that paragraph is relatively. I have discussed with many students and faculty how, after you do all the analysis, the last push to write seems 10 times harder than the analysis itself. I'm sure there are names and psychological theories for why this happens and for all the stuff that goes on in your head. The book How to Write a Lot discusses many reasons why we don't like writing. But these feelings don't happen only when you're writing: they also come up when you have worked on one project for a long time and sometimes want to bash it with a hammer.

Pain-in-the-ass Unicorn

This is natural. I would like to make the following analogy: your research/thesis/dissertation is a pain-in-the-ass unicorn. Think about what would happen if you got a unicorn.

You'd be like OMFG a unicorn!! Everyone look – it's a unicorn. It's MY unicorn! Woo! Anything the unicorn did, you'd be excited. Even if the unicorn took shit all over your bathroom. It's like oh man, I've never seen unicorn poop! You would tell everyone about it and be super excited. After a few months, you'd still be excited to see the unicorn when you came home and take it for unicorn rides and such. You'd mention it to new friends and they'd react like: OH MY GOD NO WAY, that's SOOOOOOO COOL! To which, you'd just shrug and say “Yeah, it's pretty cool. I like it.” and “Oh, you didn't know unicorns could do that?”.

And when they came over they'd play with your unicorn and you'd sit and watch. They would then say something like “I wish I had a unicorn,” and it'd hit you: NO YOU DON'T. The unicorn shit on the floor a year after you got it, and you can't hide your frustration and anger. You complain to your friends about your pain-in-the-ass unicorn. Why do I even have this thing?

I'm so glad I found this post after writing my own – especially this figure (no explanation needed):

Dealing with my unicorn

This is what I see with a lot of my work (and observing others). Initial excitement, lulled into complacency with the problem, thinking it's common and not novel/exciting, and frustration/complaining. I'm not sure if this is “natural”, but I would say it's common. What I try to remind other students and myself time and time again:

YOU HAVE A FREAKING UNICORN

Most of the projects you work on in grad school/academia are at the cutting edge of research. Even if your work seems common to you, it sure isn't common to many other people – most importantly, to the you of ½ to 4 years ago. If that earlier you didn't know the solution and was excited about it, then so will others be. Granted, maybe not droves of people, but some. It's your unicorn: realize it's still special and awesome. Yeah, it may shit on the couch, but it's a FREAKING UNICORN.

Combatting this I-hate-my-unicorn Feeling

Here are some strategies I try in order to avoid this pattern and keep myself excited.

  1. Have more than one project (how many depends on you). Too many projects can hinder progress, but a few can make you appreciate the good aspects of each one. Remember, your unicorns may play well with each other. If they don't, you have two unicorns shitting on the floor.
  2. Talk to new people/collaborators about your work. Excitement in others revives excitement in yourself. Talk to your advisor/mentor/collaborator. Many times they reinforce why you are doing the work.
  3. Listen to others about their unicorns. It can help you put your research in perspective because you'll think about your unicorn that way. Or their unicorn/data/collaborator is even worse and you will feel grateful for your unicorn.
  4. Talk to people outside of academia about your unicorn every now and then. Importantly, this includes not downplaying (though not boasting about) your research when people note that it's cool/exciting/hard/important. They gave you a compliment; don't say “no it's not” – that's rude.
  5. Read reddit – especially r/futurology's summary science of the week.
  6. Make the coolest/craziest figure you can of your research for a talk. Something people will remember and talk about, for better or worse. Interactive – yes; 3D boxplot – NO. Maybe put in unicorns.

We've all been there. Ask around. Some people have even toilet trained their unicorns. Talk to them.

Sorted HTML Tables and Javascript Libraries

A few days ago StatsInTheWild asked the following question

So we had a few exchanges where I thought you could use sprintf and be done, but it didn't seem to work:

After a bit more discourse, StatsInTheWild shared some data with me:

and I went down the rabbit hole of trying to find out what was going on. Here is the code to make the table:

require(SortableHTMLTables)
# download the data if we don't already have it locally
myfile = "openWAR.csv"
if (!file.exists(myfile)) {
  download.file("https://dl.dropboxusercontent.com/u/35094868/openWAR.csv", myfile, method = "wget")
}
openWAR = read.csv(myfile, stringsAsFactors = FALSE)
# write out a sortable HTML table of the data
sortable.html.table(openWAR, "openWAR2014.html")

And as you can see in the output table here, the column RAA.pitch does not sort correctly.

Attempts at, and then finding, a Solution

I tried a few things, such as changing numeric columns to strings, seeing if missing data was the problem, and making the numbers all positive, but the problem persisted.

As usual, StackOverflow had insight into an answer. Essentially, prior to version 2.0.5, jquery.tablesorter.js didn't sort numbers exactly correctly. The problem is that SortableHTMLTables ships with version 2.0.3:

head(readLines(system.file("assets/jquery.tablesorter.js", package = "SortableHTMLTables"))) 
[1] "/*"                                                       
[2] " * "                                                      
[3] " * TableSorter 2.0 - Client-side table sorting with ease!"
[4] " * Version 2.0.3"                                         
[5] " * @requires jQuery v1.2.3"                               
[6] " * "                                                      

and uses this version for the table output. Now, if you wanted to fix this, you could add some CSS to your file or go some other route. Or you can just update jquery.tablesorter.js. I went to http://tablesorter.com/docs/ and downloaded the new js plugin.

But I want this automatic!

The whole point of these functions is that you can stay in R and not play around with JavaScript; saying you have to edit CSS or the like defeats that purpose. But for this fix to be “automatically” done, you either have to 1) copy the .js file every time you run the sortable.html.table command, as it re-copies the assets over, 2) wait for the maintainer to update (out of your control), 3) change your CSS, or 4) copy a new .js file with a different name and edit the HTML file after running to make sure it uses your new js file. I'll implement 4).

require(SortableHTMLTables)
outfile = "openWAR2014_fixed.html"
sortable.html.table(openWAR, outfile)
# swap the bundled tablesorter script for the updated one in the output HTML
change_js = function(f, newjs = "jquery.tablesorter_v2.0.5.js"){
  x = readLines(f)
  x = gsub("jquery.tablesorter.js", newjs, x, fixed = TRUE)
  writeLines(x, con = f)
}
change_js(outfile)

(I named my file jquery.tablesorter_v2.0.5.js.) Now, as you can see here, the table works! Hope this helps.

Maintainer

Note, I contacted the maintainer and I'm sure he'll fix this in the next update (he does a LOT of awesome work and development).

Sometimes Table is not the Answer – a Faster 2×2 Table

The table command is great in its simplicity for cross-tabulations. I have run into some settings where it is slow, and I want to demonstrate one simple example of why you may want to use other functions or write your own tabler. This is a specific case where you don't need all the error-checking or flexibility that a general function provides; you want to do something specific, and you can greatly speed up the computation.

Setup of example

I have some brain imaging data. I have a gold standard, where an expert hand-traced (on a computer) a brain scan delineating the brain. I'll refer to this as a brain “mask”. (We use the word mask in imaging to denote a segmented image, either done manually or automatically. I generally reserve the word mask for binary 0/1 values, but others use the term more broadly.)

Using automated methods, I can try to re-create this mask automatically. This image is also binary. I want to simply get a \(2\times2\) contingency table of the automated versus manual masks so I can get sensitivity/specificity/accuracy/etc.

The data

For simplicity and computation, let's consider the images as just really long vectors. I'll call them manual and auto for the manual and automatic masks, respectively.

These are long logical vectors (9 million elements):

length(manual)
[1] 9175040
length(auto)
[1] 9175040
head(manual)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
head(auto)
[1] FALSE FALSE FALSE FALSE FALSE FALSE

Naturally, you can run table on this data:

stime = system.time({ tab = table(manual, auto) })
print(stime)
   user  system elapsed 
  3.294   0.082   3.386 
print(tab)
       auto
manual    FALSE    TRUE
  FALSE 7941541   11953
  TRUE    15384 1206162

The computation took about 3.4 seconds on my MacBook Pro (2013, 16GB RAM, 2.8GHz Intel i7), which isn't that bad. Realize, though, that I could have hundreds or thousands of these images. We need to speed this up.

What is the essence of what we're doing?

Taking 3.4 seconds to get 4 numbers seems a bit long. As the data is binary, we can simply compute these with the sum command and logical operators.

Let's make the twoXtwo command:

twoXtwo = function(x, y, dnames = c("x", "y")){
  tt <- sum( x &  y)   # both TRUE
  tf <- sum( x & !y)   # x TRUE, y FALSE
  ft <- sum(!x &  y)   # x FALSE, y TRUE
  ff <- sum(!x & !y)   # both FALSE
  tab = matrix(c(ff, tf, ft, tt), ncol = 2)
  n = list(c("FALSE", "TRUE"), c("FALSE", "TRUE"))
  names(n) = dnames
  dimnames(tab) = n
  tab = as.table(tab)
  tab
}

And let's see how fast this is (and confirm the result is the same):

stime2 = system.time({ twotab = twoXtwo(manual, auto, dnames=c("manual", "auto")) })
print(stime2)
   user  system elapsed 
  0.828   0.006   0.835 
print(twotab)
       auto
manual    FALSE    TRUE
  FALSE 7941541   11953
  TRUE    15384 1206162
identical(tab, twotab)
[1] TRUE

Voilà, twoXtwo runs about 4.06 times faster than table, largely because we knew we did not have to check certain characteristics of the data and because it handles only this specific case of a table.

More speed, captain!

This isn't an astronomical speedup, such as a 100-fold increase, but we can increase the speed further by not doing all the logical operations on the vectors and instead taking differences from the margin sums.
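
The twoXtwo2 function used below wasn't shown above, so here is a minimal sketch consistent with that idea (my reconstruction, not necessarily the original code): count the TRUE/TRUE cell once, then get the other cells by subtraction from the margin totals and the vector length.

twoXtwo2 = function(x, y, dnames = c("x", "y")){
  n  = length(x)
  tt = sum(x & y)    # the only logical operation on the full vectors
  tf = sum(x) - tt   # x TRUE total minus both-TRUE
  ft = sum(y) - tt   # y TRUE total minus both-TRUE
  ff = n - tt - tf - ft
  tab = matrix(c(ff, tf, ft, tt), ncol = 2)
  dn = list(c("FALSE", "TRUE"), c("FALSE", "TRUE"))
  names(dn) = dnames
  dimnames(tab) = dn
  as.table(tab)
}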

Let's confirm this is faster and accurate by running it on our data:

stime3 = system.time({ twotab2 = twoXtwo2(manual, auto, dnames=c("manual", "auto")) })
print(stime3)
   user  system elapsed 
  0.198   0.001   0.200 
print(twotab2)
       auto
manual    FALSE    TRUE
  FALSE 7941541   11953
  TRUE    15384 1206162
identical(tab, twotab2)
[1] TRUE

Now, if I were going for speed, this code is good enough for me: it runs about 16.93 times faster than table. The one downside is that it is not as readable as twoXtwo. For even greater speed, I could probably move into C++ using the Rcpp package, but that seems overkill for a two by two table.
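
For the curious, here is a hedged sketch of what an Rcpp version could look like (not benchmarked here, and admittedly overkill): a single pass over the two logical vectors that counts all four cells at once.

library(Rcpp)
cppFunction('
IntegerVector tab2x2(LogicalVector x, LogicalVector y) {
  int n = x.size();
  int tt = 0, tf = 0, ft = 0, ff = 0;
  for (int i = 0; i < n; i++) {
    if (x[i] && y[i])       tt++;   // both TRUE
    else if (x[i] && !y[i]) tf++;   // x TRUE, y FALSE
    else if (!x[i] && y[i]) ft++;   // x FALSE, y TRUE
    else                    ff++;   // both FALSE
  }
  return IntegerVector::create(ff, tf, ft, tt);  // same cell order as twoXtwo
}')

You would still wrap the four counts in a table with the right dimnames, as twoXtwo does.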

Other examples of speeding up the calculation can be found here.

Finishing up

I said I wanted sensitivity/specificity/accuracy/etc. so I will show how to get these. I'm going to use prop.table, which I didn't know about for a while when I first started using R (see margin.table too).

ptab = prop.table(twotab)               # cell proportions of the grand total
rowtab = prop.table(twotab, margin=1)   # row percentages
coltab = prop.table(twotab, margin=2)   # column percentages

As you can see, like the apply command, the prop.table command can either take no margin or take the dimension to divide over (1 for rows, 2 for columns). This means that in ptab, each cell of twotab was divided by the grand total (sum(twotab)). For rowtab, each cell was divided by its row sum (rowSums(twotab)) to get a proportion, and similarly cells in coltab were divided by their column sums (colSums(twotab)). At the end of the post, I show that these give the same results.

Getting Performance Measures

Accuracy

Getting the accuracy is very easy:

accur = sum(diag(ptab))
print(accur)
[1] 0.997

Sensitivity/Specificity

For sensitivity/specificity, the “truth” is the rows of the table, so we want the row percentages:

sens = rowtab["TRUE", "TRUE"]
spec = rowtab["FALSE", "FALSE"]
print(sens)
[1] 0.9874
print(spec)
[1] 0.9985

Positive/Negative Predictive Value

We can also get the positive predictive value (PPV) and negative predictive value (NPV) from the column percentages:

ppv = coltab["TRUE", "TRUE"]
npv = coltab["FALSE", "FALSE"]
print(ppv)
[1] 0.9902
print(npv)
[1] 0.9981

Conclusions

After using R for years, I still find things to be very intuitive. Sometimes, though, for large data sets or specific cases, you may want to write your own function for speed, checking it against the base functions for a few iterations as a double check. I have heard this is a nuisance for those who dislike R, and it can hamper reproducibility in some cases. Usually someone has made a vetted package that does what you want, but there are still simple cases (such as the one above) where “rolling your own” is OK and easier than adding a dependency to your code.

Aside: How prop.table works

Over all margins

For just doing prop.table without a margin, you can think of the table being divided by its sum.

print(round(ptab, 3))
       auto
manual  FALSE  TRUE
  FALSE 0.866 0.001
  TRUE  0.002 0.131
print(round(twotab / sum(twotab), 3))
       auto
manual  FALSE  TRUE
  FALSE 0.866 0.001
  TRUE  0.002 0.131

Over row margins

For the margin=1, or row percentages, you can think of dividing the table by the row sums.

print(round(rowtab, 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.987
print(round(twotab / rowSums(twotab), 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.987

Over column margins

Now for column percentages, you can think of R dividing each cell by its column's sum. This is what prop.table does.

Let's look at the result we should get:

print(round(coltab, 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

But in R, when you take a matrix and then add/divide/subtract/multiply it by a vector, R does the operation column-wise. So when you take column sums on the table, you get a vector with the same number of columns as the table:

print(colSums(twotab))
  FALSE    TRUE 
7956925 1218115 

If you try to divide the table by this value, it will not give you the desired result:

print(round( twotab / colSums(twotab), 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.002
  TRUE  0.013 0.990

R operations with matrices and vectors

This is because R recycles the vector column-wise (think of it as a column vector, or a matrix with 1 column). R takes the first element of the first column and divides it by the first column sum (which is correct), but then takes the second element of the first column and divides it by the second column sum (which is not correct).
A detailed discussion of how to do row-wise operations on matrices is located here on SO.

Back to column percentages

We can use the t( t() ) operation to get the correct answer:

print(round( t( t(twotab) / colSums(twotab)), 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

You can think of R implicitly making the matrix of the correct size with the correct values and then dividing:

cs = colSums(twotab)
cs = matrix(cs, nrow=nrow(tab), ncol=ncol(tab), byrow=TRUE)
print(cs)
        [,1]    [,2]
[1,] 7956925 1218115
[2,] 7956925 1218115
print(round( twotab/cs, 3))
       auto
manual  FALSE  TRUE
  FALSE 0.998 0.010
  TRUE  0.002 0.990

Happy tabling!

R is a Language: Treat it Like One

I'm helping out with teaching a class on an introduction to R for students this week. I figured it'd be a good time to discuss my thoughts on programming in R and how a newcomer should feel about learning the language.

Those Who Teach R, Should Use R

Many of the students in the class are overwhelmed at the beginning. They see a bunch of unfamiliar symbols and syntax; being overwhelmed is understandable. Moreover, those teaching you R can seem lightning fast when running code, moving around screens, or figuring out problems – and they usually are. Do you know why?

Those who are teaching you R use R. They use R a lot. They use it daily, usually for hours a day. Don't try to be them immediately; get the basics.

I hear students saying “Oh, you're fast” when power users help them, usually laced with an undertone of low self-esteem. That's like me going to a basketball camp run by Michael Jordan and saying “Oh man Michael, you're really good on the court”. Be like Mike: work hard to learn the basics like the back of your hand. You'll be dunking in no time.

And, of course, those who teach you are fast. Why would you take a class from someone who is slow or unsure while doing the thing they're teaching you? Yes, your teachers may be fast, but that's the point. Learn from those who do.

I feel as though I'm relatively fast on my machine now, and I felt this way starting out. What's more, I didn't even know what tab completion was during my first class; I thought my professor could actually type that fast when writing variable names or directory paths. I thought it was magic. If I hadn't stuck it out, I wouldn't have figured out how to make that magic myself.

R is a language

I took Spanish in high school for 4 years. I remember some vocabulary words and some conjugation rules, but I am far from conversational. I've never taken French – I don't know French. Now, if I took an introduction to French class for 4 hours, do you think I could speak (or write) French fluently? No, of course not. Yet students think they can do the equivalent with R. My Spanish is like a background in Stata: some words/phrases/commands are similar, others are misleading and can be confusing.

R is a programming language. Just like a foreign language, R has syntax and grammar. You must learn simple punctuation such as placement of commas, assignment using “=” or “<-”, where to close parentheses, and how to notice when a quote is unmatched. My overall message is:

R is a language, treat it like one.

Remind students to hold themselves to the same level of comprehension as they would for a spoken (or signed) or written language. Hopefully that puts learning R in perspective, even if it does not make it any less overwhelming. I wonder if Rosetta Stone will make an R module one of these days.

Be Careful with Using Model Design in R

In R, useful functions for making design matrices are model.frame and model.matrix. I will discuss some of the differences in behavior across and within the two functions. I also have an example where I ran into this problem and it cost me some time.

Using model.frame for a design matrix

Whenever I use the word “design” I mean the systematic part of a model; in this case, a linear model. For example, if you write

\[ Y = X\beta + \varepsilon \]

I'm referring to the X as the design.

model.frame creates a design data.frame of the covariates given, keeping any factor variables as factors with the same levels. Let's create a toy data.frame called df, where Y is a normal random variable linearly related to two variables in the dataset:

n = 100
df = data.frame(X1 = rnorm(n), 
                X2 = rpois(n, lambda = 5), 
                X3= rnorm(100, mean = 4, sd = 2), 
                Sex = factor(rep(c("Male", "Female"), each = 50)))
df$Y = with(df, X1 + 3*X2 + rnorm(100, sd = 10))

Now, if Y is included on the left hand side of the formula, then it is included in the output of model.frame as such:

model.df = model.frame(Y ~ X1 + X2 + X3 + Sex, data=df)
head(model.df, 2)
       Y      X1 X2    X3  Sex
1  9.223  0.3849  2 5.960 Male
2 12.467 -0.5061  5 1.651 Male

This gives you a data.frame with the outcome and the covariates fitting that outcome (not including an intercept).

If Y is not included on the left hand side of the formula:

model.df2 = model.frame(~ X1 + X2 + X3 + Sex, data=df)
head(model.df2, 2)
       X1 X2    X3  Sex
1  0.3849  2 5.960 Male
2 -0.5061  5 1.651 Male

then we see that Y is not included in the output of model.frame. Thus, if you want to create a “design data.frame”, then you likely will want to remove Y from the formula.

Note, in both cases, we see that there is no intercept term added to the data.frame and nothing is done to factor variables.

Using model.matrix

In most cases where I'm making model design elements, I use model.matrix so that I can then use matrix multiplications to make procedures faster or do “smarter” (i.e. fewer) computations. I will discuss the differences between model.frame and model.matrix using our toy dataset and also discuss one gotcha for using model.matrix with lm.

Let's use model.matrix with and without Y on the left hand side of the formula.

model.mat = model.matrix(Y ~ X1 + X2 + X3 + Sex, data=df)
model.mat2 = model.matrix(~ X1 + X2 + X3 + Sex, data=df)
all.equal(model.mat, model.mat2)
[1] TRUE

We see that using any element on the left hand side doesn't affect the output of model.matrix. Difference #1 from model.frame.

Let's look at the output from model.matrix.

head(model.mat, 3)
  (Intercept)      X1 X2    X3 SexMale
1           1  0.3849  2 5.960       1
2           1 -0.5061  5 1.651       1
3           1 -1.3739  3 3.197       1

We see a column named (Intercept) was added, containing a column of ones for the \(\beta_0\) usually included in a model. Difference #2 from model.frame. Also, we see that our factor Sex was converted to an indicator (numeric) variable. Difference #3 from model.frame. We only have 2 levels of Sex in this example; in general, a factor with \(L\) levels will generate \(L - 1\) indicator variables using model.matrix (with the default treatment contrasts).
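
As a small illustration (hypothetical data, not from the original post), a factor with three levels generates two indicator columns:

grp = factor(c("a", "b", "c", "a"))
model.matrix(~ grp)
  (Intercept) grpb grpc
1           1    0    0
2           1    1    0
3           1    0    1
4           1    0    0

(The assign and contrasts attributes printed after the matrix are omitted here.)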

Review over, how did this affect me?

I wanted to note the differences above in case you haven't seen them before. I also want to show that using model.matrix together with a -1 or 0 in a formula can affect how some of your results are calculated in linear models with lm. Running the model with our now-ready model matrix:

mod = lm(df$Y ~ model.mat)
summary(mod)
Call:
lm(formula = df$Y ~ model.mat)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.644  -8.617   0.448   7.648  30.245 

Coefficients: (1 not defined because of singularities)
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -3.1871     3.9533   -0.81     0.42    
model.mat(Intercept)       NA         NA      NA       NA    
model.matX1            1.1894     1.0987    1.08     0.28    
model.matX2            3.6243     0.5790    6.26  1.1e-08 ***
model.matX3            0.0164     0.5422    0.03     0.98    
model.matSexMale       2.6174     2.2726    1.15     0.25    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.2 on 95 degrees of freedom
Multiple R-squared:  0.312, Adjusted R-squared:  0.283 
F-statistic: 10.8 on 4 and 95 DF,  p-value: 3.03e-07

We see that the intercept term created by model.matrix was set to NA because it's identical to the intercept term R generates by default, so the two are linearly dependent. This is also flagged by the note “(1 not defined because of singularities)”. This is good to know, but not revelatory or new; just be aware.

When model.matrix goes … differently

Well, model.mat already has an intercept, so why not just take out R's intercept term with a -1? The model should be the same, right? I would assume so, but let's try it:

mod.noint = lm(df$Y ~ model.mat - 1)
summary(mod.noint)
Call:
lm(formula = df$Y ~ model.mat - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.644  -8.617   0.448   7.648  30.245 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
model.mat(Intercept)  -3.1871     3.9533   -0.81     0.42    
model.matX1            1.1894     1.0987    1.08     0.28    
model.matX2            3.6243     0.5790    6.26  1.1e-08 ***
model.matX3            0.0164     0.5422    0.03     0.98    
model.matSexMale       2.6174     2.2726    1.15     0.25    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.2 on 95 degrees of freedom
Multiple R-squared:  0.717, Adjusted R-squared:  0.702 
F-statistic: 48.1 on 5 and 95 DF,  p-value: <2e-16

We see the estimates look exactly the same (except the NA row is gone). But note the R-squared, adjusted R-squared, and F-statistic values!

Let's focus on \(R^2\):

summary(mod)$r.squared
[1] 0.3121
summary(mod.noint)$r.squared
[1] 0.7168

These are different – way different – which seems off. Why? If you look into the summary.lm code, you will notice that some statements involve the expression:

attr(z$terms, "intercept")

and calculate quantities differently depending on whether that attribute evaluates to TRUE (1) or FALSE (0).

Let's look at our two models from model.matrix again:

attr(mod$terms, "intercept")
[1] 1
attr(mod.noint$terms, "intercept")
[1] 0

We see that when you construct the intercept yourself and remove R's with -1, this attribute evaluates to FALSE (0), even though the model has an “intercept”. The model has an intercept, but R hasn't assigned it that attribute. This affects the calculation of the model sum of squares (from summary.lm):

mss <- if (attr(z$terms, "intercept")) 
    sum((f - mean(f))^2)
else sum(f^2)

as well as others. So be aware of this behavior.

Conclusion

I was writing something for a linear model that allowed me to compute a large number of regressions (> 1,000,000) on a matrix of outcomes with a fixed adjustment matrix and changing 1 piece of the design matrix. I was doing a voxel-on-scalar regression with covariate adjustment, but also wanted to incorporate the ability to compute the results on a matrix of permutations of Y.

Either way, I ran into a problem checking my results against the output from lm, and it took a while to see why the r.squared values were different while all other elements were the same. I realized it was because I was constructing my own design matrix using model.matrix and using -1 when running lm, so those quantities were not being calculated as I expected. I hope you never run into this problem.

Aside: What I wanted to do

Just to be precise, my model was:

\[ Y = X\beta + Z\theta + \varepsilon \]

where \(X\) was \(n \times V\) (with \(n\) = 100 and \(V\) = 100,000), \(Z\) was \(n \times p\) (\(p\) = 5), and \(Y\) was \(n \times 1\), but I wanted to run this 1000 times with different permutations of \(Y\). If I simply wanted p-values, I could switch \(X\) and \(Y\) to get those and run 1,000 lm commands instead of 100,000 lm commands. (Doing this is not efficient – solving matrix inversions is time-consuming and should not be redundant.) I wanted \(R^2\) values and \(\beta\) coefficients as well, so I needed something more powerful. I know packages such as vows or limma can do these regressions, but they usually assume the design is fixed (not changing for every voxel) and the end result is usually a p-value.
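
This isn't my implementation, but a minimal sketch of the fixed-design piece of the idea: given one design matrix \(X\) (a column of ones plus covariates) and a matrix \(Y\) whose columns are outcomes (e.g. permutations), all coefficients and \(R^2\) values come from a few matrix operations instead of repeated lm calls.

fit_many = function(X, Y) {
  XtX_inv = solve(crossprod(X))            # (X'X)^{-1}, computed once
  beta    = XtX_inv %*% crossprod(X, Y)    # p x m matrix of coefficients
  fitted  = X %*% beta
  rss     = colSums((Y - fitted)^2)        # residual sum of squares per outcome
  tss     = colSums(scale(Y, scale = FALSE)^2)  # total SS about each column mean
  list(beta = beta, r.squared = 1 - rss / tss)
}

The changing-column-of-the-design part takes more care, since \((X'X)^{-1}\) differs for each voxel.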

I have it working and may release it into the wild soon. Let me know if you know of anything that will do this, including covariate adjustment, and that can run on a matrix of permuted Y values.

fslr: An R Package Interfacing with FSL for Neuroimaging Analysis

I use a set of neuroimaging tools, but my language of choice is R. One tool I commonly use is FSL (FMRIB Software Library), from the University of Oxford's Functional MRI of the Brain (FMRIB) group. I wrote some wrapper functions into an R package called fslr and wanted to discuss some of the functions.

FSL – what is it?

FSL has a command-line interface with shell-type syntax as well as a GUI (which I generally don't use). It has a lot of functions that are good for general imaging purposes – fslstats and fslmaths have loads of functionality. Using FSL in your pipeline is fine, but I don't like switching between the shell and R too much in an analysis, and I like to have scripts that don't jump around too much, so I created fslr.

One of the main problems I have is that I read an image into R, manipulate it in some way, and then want to use an FSL function on that manipulated image. This is a problem because I have an R object and not a NIfTI file, which is what FSL expects. I also want the result back in R, not in a file that I then have to find and read. fslr bridges the gap between R and FSL.

fslr functionality

fslr is built around the oro.nifti package by Brandon Whitcher. The workflow is this:
1. Read NIfTI data into R using readNIfTI from oro.nifti. This is now an R object of class nifti.
2. Manipulate the data in some way. Maybe you applied a mask or changed some values.
3. Pass this nifti object to a function in fslr, which will write the object to a temporary NIfTI file, run the FSL command, and read the result back into an R nifti object using readNIfTI.

From the user's perspective, it's how things always are in R: pass an R object to an R function and get an R object back.

Alternatively, if you specify the filename of a NIfTI file instead of passing in an R object, fslr will simply run the FSL command without the file ever having to be read into R, yet the result can still be returned as an R object. This can be much faster, especially if the image the function operates on is already on disk (otherwise it would be read in and written back out before the FSL command was run).
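
For instance (the filename here is hypothetical, not from the post), you can hand fslsmooth a path rather than a nifti object and still get a nifti object back:

# the file is passed straight to FSL; only the result is read into R
smoothed = fslsmooth("subject01_T1.nii.gz", sigma = 5, retimg = TRUE)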

outfile and retimg options

Any function that has an image output should have two arguments: outfile and retimg. The outfile argument defaults to NULL, which creates a temporary file using tempfile() that is deleted when your R session is terminated. Alternatively, you can specify a path for the output file if you want it saved to disk.

The retimg argument is a logical indicator for whether you want the output read into R and returned (get it – return image). If retimg=FALSE and outfile=NULL, the function will throw an error, since the output would be a temporary file that gets deleted and no image would be returned. In that case the user could never use the output image, which I assume is a mistake, so the function stops.
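
A hedged example of the two modes (the output filename is hypothetical, and img is a nifti object like the one read in below):

# write the smoothed image to disk without returning it into R
fslsmooth(img, sigma = 5, outfile = "img_smooth", retimg = FALSE)
# or return the result directly as a nifti object (outfile left as a tempfile)
smoothed = fslsmooth(img, sigma = 5, retimg = TRUE)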

Function call-outs

All functions use the system command in R to execute FSL commands. If you are using an R GUI instead of the command line in a shell, then you will want to specify options(fsl.path=). If you are using the command line and FSL is in your PATH, the path to FSL will be found using Sys.getenv("FSLDIR"), as your shell environment variables will be available. The function have.fsl() provides a logical check of whether you can run an fslr command. I have not tested this on a Windows machine. I did not use system2 because I ran into some problems with it, but I may change this in a future release for portability. Let's look at an example of using fslr.

Example of fslr commands

Let's check to make sure we have FSL in our PATH:

library(fslr)
options(fsl.path="/usr/local/fsl")
have.fsl()
## [1] TRUE

Similar to fsl.path, you should specify an output type, usually NIFTI_GZ. See here for more information about output types.

options(fsl.outputtype = "NIFTI_GZ")

Reading in Data

Here I'm going to read in a template T1 MRI brain image (no skull), with 1mm resolution and visualize it.

fname = file.path( getOption("fsl.path"), "data", "standard", "MNI152_T1_1mm_brain.nii.gz")
img = readNIfTI(fname)
print(img)
## NIfTI-1 format
##   Type            : nifti
##   Data Type       : 4 (INT16)
##   Bits per Pixel  : 16
##   Slice Code      : 0 (Unknown)
##   Intent Code     : 0 (None)
##   Qform Code      : 4 (MNI_152)
##   Sform Code      : 4 (MNI_152)
##   Dimension       : 182 x 218 x 182
##   Pixel Dimension : 1 x 1 x 1
##   Voxel Units     : mm
##   Time Units      : sec
orthographic(img)

plot of chunk readin

Smoothing

Let's smooth the data using a 5mm Gaussian kernel and view the results:

smooth = fslsmooth(img, sigma = 5, retimg=TRUE)
orthographic(smooth)

plot of chunk smooth

That example is a little boring, in that it could be done just as easily in FSL directly as with fslr. If the data is manipulated beforehand in an exploratory way, fslr's usefulness is easier to see. Let's Z-score the image and then keep the z-scores above 2.

Z-score and threshold

thresh.img = img
thresh.img = (thresh.img - mean(thresh.img))/sd(thresh.img)  # Z-score the image
thresh.img[thresh.img < 2] = NA                              # keep z-scores above 2
orthographic(thresh.img)

plot of chunk theshimg
What? An empty picture? That doesn't make sense. Looking at the histogram of thresh.img, we see there is data:

hist(thresh.img)

plot of chunk thresh_hist

So what gives? The orthographic function from oro.nifti uses the cal_min and cal_max slots to determine the grayscale range for the picture. fslr has some helper functions that make it easier to rescale these values so that you can visualize the images again.

Setting cal_min and cal_max

thresh.img = cal_img(thresh.img)
orthographic(thresh.img)

plot of chunk thresh_ortho
There we go. Looks like some white matter regions. Now let's smooth these z-scores using a 4mm Gaussian kernel.

smooth.thresh = fslsmooth(thresh.img, sigma = 4, retimg=TRUE)
orthographic(smooth.thresh)

plot of chunk smooth.thresh

Now you can do all your fun statistics, call FSL, and keep everything in R! Also, for most of the functions there is a corresponding FUNCTION.help function that will print out FSL's help file:

fslhd.help()
## 
## Usage: fslhd [-x] <input>
##        -x : instead print an XML-style NIFTI header

But other packages exist!!

Some may think, “Hmm, I had a package with that functionality decades ago.” Other R packages exist that have functions for images. I know this. But I believe:

  1. R should be able to integrate other established and available neuroimaging software. Python does this everywhere, and it seems to benefit that part of the community.
  2. Some of the functions FSL has are not implemented in R. What package do I call for skull-stripping an image, like BET in FSL? (You can call fslbet in fslr, by the way.)
  3. Some functions are faster in FSL than their counterparts in R. Large 3D Gaussian smoothers take a long time in some R packages.
  4. It allows you to still have your pipeline written in R, even if it calls FSL or other software.
  5. If someone wrote it before, I don't want to write it again.

I am hopefully writing up a more comprehensive white paper in the coming weeks. Don't forget to check out other packages like ANTsR, which is a wrapper for ANTs in R. I'm excited to start using it, especially for its MRI inhomogeneity correction.

Work to be done

I haven't implemented much of the functionality for FEAT, which is part of FSL's fMRI analysis pipeline. I hope to do that in the future, but I would love feedback on whether people want that integration. As always, feedback is welcome on other parts as well.