My Commonly Done ggplot2 graphs

Posted on October 30, 2014 by strictlystat

In my last post, I discussed how ggplot2 is not always the answer to the question “How should I plot this” and that base graphics were still very useful.

Why Do I use ggplot2 then?

The overall question still remains: why (do I) use ggplot2?

ggplot2 vs lattice

For one, ggplot2 replaced the lattice package for many plot types for me. The lattice package is a great system, but if you are plotting multivariate data, I believe you should choose lattice or ggplot2. I chose ggplot2 for the syntax, added capabilities, and the philosophy behind it. The fact Hadley Wickham is the developer never hurts either.

Having multiple versions of the same plot, with slight changes

Many times I want to do the same plot over and over, but vary one aspect of it, such as color of the points by a grouping variable, and then switch the color to another grouping variable. Let me give a toy example, where we have an x and a y with two grouping variables: group1 and group2.

library(ggplot2)
set.seed(20141016)
data = data.frame(x = rnorm(1000, mean=6))
data$group1 = rbinom(n = 1000, size =1 , prob =0.5)
data$y = data$x * 5 + rnorm(1000)
data$group2 = runif(1000) > 0.2

We can construct the ggplot2 object as follows:

g = ggplot(data, aes(x = x, y=y)) + geom_point()

The ggplot command takes the data.frame you want to use and use the aes to specify which aesthetics we want to specify, here we specify the x and y. Some aesthetics are optional depending on the plot, some are not. I think it’s safe to say you always need an x. I then “add” (using +) to this object a “layer”: I want a geometric “thing”, and that thing is a set of points, hence I use geom_points. I’m doing a scatterplot.

If you just call the object g, print is called by default, which plots the object and we see our scatterplot.

plot of chunk print_g

I can color by a grouping variable and we can add that aesthetic:

g + aes(colour = group1)

plot of chunk color_group

g + aes(colour = factor(group1))

plot of chunk color_group

g + aes(colour = group2)

plot of chunk color_group

Note, g is the original plot, and I can add aes to this plot, which is the same as if I did ggplot2(data, aes(...)) in the original call that generated g. NOTE if the aes you are adding was not a column of the data.frame when you created the plot, you will get an error. For example, let’s add a new column to the data and then add it to g:

data$newcol = rbinom(n = nrow(data), size=2, prob = 0.5)
g + aes(colour=factor(newcol))

Error: object 'newcol' not found

This fails because the way the ggplot2 object was created. If we had added this column to the data, created the plot, then added the newcol as an aes, the command would work fine.

g2 = ggplot(data, aes(x = x, y=y)) + geom_point()
g2 + aes(colour=factor(newcol))

plot of chunk g2_create2

We see in the first plot with colour = group1, ggplot2 sees a numeric variable group1, so tries a continuous mapping scheme for the color. The default is to do a range of blue colors denoting intensity. If we want to force it to a discrete mapping, we can turn it into a factor colour = factor(group1). We see the colors are very different and are not a continuum of blue, but colors that separate groups better.

The third plot illustrates that when ggplot2 takes logical vectors for mappings, it factors them, and maps the group to a discrete color.

Slight Changes with additions

In practice, I do this iterative process many times and the addition of elements to a common template plot is very helpful for speed and reproducing the same plot with minor tweaks.

In addition to doing similar plots with slight grouping changes I also add different lines/fits on top of that. In the previous example, we colored by points by different grouping variables. In other examples, I tend to change little things, e.g. a different smoother, a different subset of points, constraining the values to a certain range, etc. I believe in this way, ggplot2 allows us to create plots in a more structured way, without copying and pasting the entire command or creating a user-written wrapper function as you would in base. This should increase reproducibility by decreasing copy-and-paste errors. I tend to have many copy-paste errors, so I want to limit them as much as possible.

ggplot reduces number of commands I have to do

One example plot I make frequently is a scatterplot with a smoother to estimate the shape of bivariate data.

In base, I usually have to run at least 3 commands to do this, e.g. loess, plot, and lines. In ggplot2, geom_smooth() takes care of this for you. Moreover, it does the smoothing by each different aesthetics (aka smoothing per group), which is usually what I want do as well (and takes more than 3 lines in base, usually a for loop or apply statement).

g2 + geom_smooth()

plot of chunk smooth
By default in geom_smooth, it includes the standard error of the estimated relationship, but I usually only look at the estimate for a rough sketch of the relationship. Moreover, if the data are correlated (such as in longitudinal data), the standard errors given by default methods are usually are not accurate anyway.

g2 + geom_smooth(se = FALSE)

plot of chunk smooth_no_se

Therefore, on top of the lack of copying and pasting, you can reduce the number of lines of code. Some say “but I can make a function that does that” – yes you can. Or you can use the commands already there in ggplot2.

Faceting

The other reason I frequently use ggplot2 is for faceting. You can do the same graph, conditioned on levels of a variable, which I frequently used. Many times you want to do a graph, subset by another variable, such as treatment/control, male/female, cancer/control, etc.

g + facet_wrap(~ group1)

plot of chunk facet_group

g + facet_wrap(~ group2)

plot of chunk facet_group

g + facet_wrap(group2 ~ group1)

plot of chunk facet_group

g + facet_wrap( ~ group2 + group1)

plot of chunk facet_group

g + facet_grid(group2 ~ group1)

plot of chunk facet_group

Spaghetti plot with Overall smoother

I also frequently have longitudinal data and make spaghetti plot for a per-person trajectory over time.

For this example I took code from StackOverflow to create some longitudinal data.

library(MASS)
library(nlme)

### set number of individuals
n <- 200

### average intercept and slope
beta0 <- 1.0
beta1 <- 6.0

### true autocorrelation
ar.val <- .4

### true error SD, intercept SD, slope SD, and intercept-slope cor
sigma <- 1.5
tau0  <- 2.5
tau1  <- 2.0
tau01 <- 0.3

### maximum number of possible observations
m <- 10

### simulate number of observations for each individual
p <- round(runif(n,4,m))

### simulate observation moments (assume everybody has 1st obs)
obs <- unlist(sapply(p, function(x) c(1, sort(sample(2:m, x-1, replace=FALSE)))))

### set up data frame
dat <- data.frame(id=rep(1:n, times=p), obs=obs)

### simulate (correlated) random effects for intercepts and slopes
mu  <- c(0,0)
S   <- matrix(c(1, tau01, tau01, 1), nrow=2)
tau <- c(tau0, tau1)
S   <- diag(tau) %*% S %*% diag(tau)
U   <- mvrnorm(n, mu=mu, Sigma=S)

### simulate AR(1) errors and then the actual outcomes
dat$eij <- unlist(sapply(p, function(x) arima.sim(model=list(ar=ar.val), n=x) * sqrt(1-ar.val^2) * sigma))
dat$yij <- (beta0 + rep(U[,1], times=p)) + (beta1 + rep(U[,2], times=p)) * log(dat$obs) + dat$eij

I will first add an alpha level to the plotting lines for the next plot (remember this must be done before the original plot is created). tspag will be the template plot, and I will create a spaghetti plot (spag) where each colour represents an id:

library(plyr)
dat = ddply(dat, .(id), function(x){
  x$alpha = ifelse(runif(n = 1) > 0.9, 1, 0.1)
  x$grouper = factor(rbinom(n=1, size =3 ,prob=0.5), levels=0:3)
  x
})
tspag = ggplot(dat, aes(x=obs, y=yij)) + 
  geom_line() + guides(colour=FALSE) + xlab("Observation Time Point") +
  ylab("Y")
spag = tspag + aes(colour = factor(id))
spag

plot of chunk spag

Many other times I want to group by id but plot just a few lines (let’s say 10% of them) dark and the other light, and not colour them:

bwspag = tspag + aes(alpha=alpha, group=factor(id)) + guides(alpha=FALSE)
bwspag

plot of chunk unnamed-chunk-1

Overall, these 2 plots are useful when you have longitudinal data and don’t want to loop over ids or use lattice. The great addition is that all the faceting and such above can be used in conjunction with these plots to get spaghetti plots by subgroup.

spag + facet_wrap(~ grouper)

plot of chunk spag_facet

Spaghetti plot with overall smoother

If you want a smoother for the overall group in addition to the spaghetti plot, you can just add geom_smooth:

sspag = spag + geom_smooth(se=FALSE, colour="black", size=2)
sspag

plot of chunk spag_smooth

sspag + facet_wrap(~ grouper)

plot of chunk spag_smooth

bwspag + facet_wrap(~ grouper)

plot of chunk spag_smooth

Note that the group aesthetic and colour aesthetic do not perform the same way for some operations. For example, let’s try to smooth bwswag:

bwspag + facet_wrap(~ grouper) + geom_smooth(se=FALSE, colour="red")

plot of chunk spag_smooth_bad

We see that it smooths each id, which is not what we want. We can achieve the desired result by setting the group aesthetic:

bwspag + facet_wrap(~ grouper) + 
  geom_smooth(aes(group=1), se=FALSE, colour="red", size =2)

plot of chunk spag_smooth_corr

I hope that this demonstrates some of the simple yet powerful commands ggplot2 allows users to execute. I agree that some behavior may not seem straightforward at first glance, but becomes more understandable as one uses ggplot2 more.

Another Case this is Useful – Save Plot Twice

Another (non-plotting) example I want to show is how saving ggplot2 objects can make saving duplicate plots much easier. In many cases for making plots to show to others, I open up a PDF device using pdf, make a series of plots, and then close the PDF. There may be 1-2 plots that are the real punchline, and I want to make a high-res PNG of them separately. Let me be clear, I want both – the PDF and PNG. For example:

pdf(tempfile())
print({g1 = g + aes(colour = group1)})
print({g1fac = g + aes(colour = factor(group1))})
print({g2 = g + aes(colour = group2)})
dev.off()

plot of chunk pdf_and_pngs

png(tempfile(), res = 300, height =7, width= 7, units = "in")
print(g2)
dev.off()

I am printing the objects, while assigning them. (I have to use the { brackets because I use = for assignment and print would evaluate that as arguments without {}). As g2 is already saved as an object, I can close the PDF, open a png and then print that again.

No copying and pasting was needed for remaking the plot, nor some weird turning off and on devices.

Conclusion

I (and think you should) use ggplot2 for a lot of reasons, especially the grammar and philosophy. The plots I have used here are some powerful representations of data that are simple to execute. I believe using this system reflects and helps the true iterative process of making figures. The syntax may not seem intuitive to a long-time R user, but I believe the startup cost is worth the effort. Let me know if you’d like to see any other plots that you commonly use.

ggplot2 is not ALWAYS the answer: it’s not supposed to be

Posted on October 16, 2014 by strictlystat

Recently, the topic was propose at tea time:

Why should I switch over to ggplot2? I can do everything in base graphics.

I have heard this argument before and I understand it for the most part. Many people have learned R before ggplot2 came on the scene (as well as many other packages) and learned to do all the things they needed to. Some don’t understand the new syntax for plotting and argue the learning curve is not worth the effort. Also, many say it’s not straightforward to customize.

As the discussion progressed, we discussed 10 commonly made base plots, and I would recreate them in ggplot2. The goal was to help base users break into ggplot2 if they wanted, and also see if they were “as easy” as doing them in base. I want to discuss my results and the fact that ggplot2 is not ALWAYS the answer, nor was it supposed to be.

First Plot – Heatmap

The first example discussed was a heatmap that had 100,000 rows and 100 columns with a good default color scheme. As a comparison, I used heatmap in base R. Let’s do only 10,000 to start:

N = 1e4
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
colnames(mat) = paste0(&amp;amp;quot;Col&amp;amp;quot;, seq(ncol(mat)))
rownames(mat) = paste0(&amp;amp;quot;Row&amp;amp;quot;, seq(nrow(mat)))
system.time({heatmap(mat)})

   user  system elapsed 
 29.996   1.032  31.212

For a heatmap in ggplot2, I used geom_tile and wanted to look at the results. Note, ggplot2 requires the data to be in “long” format, so I had to reshape the data.

library(reshape2)
library(ggplot2)
df = melt(mat, varnames = c(&amp;amp;quot;row&amp;amp;quot;, &amp;amp;quot;col&amp;amp;quot;))
system.time({
  print({ 
    g= ggplot(df, aes(x = col, y = row, fill = value)) + 
      geom_tile()}) 
  })

   user  system elapsed 
  9.519   0.675  10.394

One of the problems is that heatmap does clustering by default, so the comparison is not really fair (but still using “defaults” – as was specified in the discussion). Let’s do the heatmap again without the clustering.

system.time({heatmap(mat, Rowv = NA, Colv = NA)})

   user  system elapsed 
  0.563   0.035   0.605

Which one looks better? I’m not really sure, but I do like the red/orange coloring scheme – I can differences a bit better. Granted, there shouldn’t be many differences as the data is random. The ggplot graph does have a built-in legend of the values, which is many times necessary. Note, the rows of the heatmap are shown as columns and the columns shown as rows, a usually well-known fact for users of the image function. The ggplot graph plots as you would a scatterplot – up on x-axis and right on the y-axis are increasing. If we want to represent rows as rows and columns as columns, we can switch the x and y aesthetics, but I wanted as close to heatmap as possible.

Don’t factor the rows

Note, I named the matrix rows and columns, if I don’t do this, the plotting will be faster, albeit slightly.

N = 1e4
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
df = melt(mat, varnames = c(&amp;amp;quot;row&amp;amp;quot;, &amp;amp;quot;col&amp;amp;quot;))

system.time({heatmap(mat, Rowv = NA, Colv = NA)})

   user  system elapsed 
  0.642   0.017   0.675

system.time({
  print({ 
    g= ggplot(df, aes(x = col, y = row, fill = value)) + 
      geom_tile()}) 
  })

   user  system elapsed 
  9.814   1.260  11.141

20,000 Observations

I’m going to double it and do 20,000 observations and again not do any clustering:

N = 2e4
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
colnames(mat) = paste0(&amp;amp;quot;Col&amp;amp;quot;, seq(ncol(mat)))
rownames(mat) = paste0(&amp;amp;quot;Row&amp;amp;quot;, seq(nrow(mat)))
df = melt(mat, varnames = c(&amp;amp;quot;row&amp;amp;quot;, &amp;amp;quot;col&amp;amp;quot;))

system.time({heatmap(mat, Rowv = NA, Colv = NA)})

   user  system elapsed 
  1.076   0.063   1.144

system.time({
  print({ 
    g= ggplot(df, aes(x = col, y = row, fill = value)) + 
      geom_tile()}) 
  })

   user  system elapsed 
 17.799   1.336  19.204

100,000 Observations

Let’s scale up to the 100,000 observations requested.

N = 1e5
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
colnames(mat) = paste0(&amp;amp;quot;Col&amp;amp;quot;, seq(ncol(mat)))
rownames(mat) = paste0(&amp;amp;quot;Row&amp;amp;quot;, seq(nrow(mat)))
df = melt(mat, varnames = c(&amp;amp;quot;row&amp;amp;quot;, &amp;amp;quot;col&amp;amp;quot;))

system.time({heatmap(mat, Rowv = NA, Colv = NA)})

   user  system elapsed 
  5.999   0.348   6.413

system.time({
  print({ 
    g= ggplot(df, aes(x = col, y = row, fill = value)) + 
      geom_tile()}) 
  })

   user  system elapsed 
104.659   6.977 111.796

We see that heatmap and geom_tile() scale with the number observations on how long it takes to plot, but heatmap being much quicker. There may be better ways to do this in ggplot2, but after looking around, it looks like geom_tile() is the main recommendation. Overall, for doing heatmaps of this size, I would use the heatmap, after transposing the matrix and use something like seq_gradient_pal from the scales package or RColorBrewer for color mapping.

For smaller dimensions, I’d definitely use geom_tile(), especially if I wanted to do something like map text to the blocks as well. The other benefit is that no transposition needs to be done, but the data does need to be reshaped explicitly.

Like I said, ggplot2 is not ALWAYS the answer; nor was it supposed to be.

ggplot2 does not make publication-ready figures

Another reason for the lengthy discussion is that many argued ggplot2 does not make publication-ready figures. I agree. Many new users need a large

ggplot2 does not make publication-ready figures

message with their first install. But neither does base graphics. Maybe they need that message for their first R install. R would boot up, and you can’t start until you answer “Does R (or any statistical software) give publication-ready figures by default”. Maybe, if you answer yes, R self-destructs.

Overall, a publication-ready figure takes time, customization, consideration about point size, color, line types, and other aspects of the plot, and usually stitching together multiple plots. ggplot2 has a lot of default features that have given considerate thought to color and these other factors, but one size does not fit all. The default color scheme may not be consistent with your other plots or the data. If it doesn’t work, change it.

And in this I agree with this statement: many people make a plot in ggplot2 and it looks good and they do not put the work in to make it publication-ready. How many times have you seen the muted green and red/pink color that is the default first 2 colors in a ggplot2 plot?

Why all the ggplot2 hate?

I’m not hating on ggplot2; I use it often and value it. My next post will be about plots I commonly use in ggplot2 and why I chose to use it. I think it makes graphs that by default look better than many base alternatives. It does things many things well and easier than base. It has good defaults for some things. It’s a different grammar that, when learned, is easier for plotting. However, I do not think that ggplot2 figures are ready-set-go for publication.

Although I think ggplot2 is great, I most-definitely think base R graphics are useful.

ggplot2 is not ALWAYS the answer. Again, it was never supposed to be.

BibLaTeX with elsarticle

Posted on October 9, 2014 by strictlystat

Problem

I like BibLaTeX and StackOverflow presents some ways reasons to switch to BibLaTeX. I like it, if nothing else, I can have multiple bibliographies easily. I can also use natbib citation style and also limit which fields are displayed in my bibliography. One problem with BibLaTeX is that it does not work well with Elsevier articles (class elsarticle), since natbib is loaded by default. There are some other options to induce compatibility are presented here and here.

Solution

So I edited the elsarticle.cls file to make it work. See the diff between the 2 files here:

diff elsarticle.cls elsarticle_nonatbib.cls

27c27
<  \def\RCSfile{elsarticle}%
---
>  \def\RCSfile{elsarticle_nonatbib}%
33c33
<  \def\@shortjid{elsarticle}
---
>  \def\@shortjid{elsarticle_nonatbib}
192,193c192,193
< \newcounter{author}
< \def\author{\@ifnextchar[{\@@author}{\@author}}
---
> \newcounter{auth}
> \def\auth{\@ifnextchar[{\@@auth}{\@auth}}
196c196
< \def\@@author[#1]#2{\g@addto@macro\elsauthors{%
---
> \def\@@auth[#1]#2{\g@addto@macro\elsauthors{%
211c211
< \def\@author#1{\g@addto@macro\elsauthors{\normalsize%
---
> \def\@auth#1{\g@addto@macro\elsauthors{\normalsize%
642c642
< \RequirePackage[\@biboptions]{natbib}
---
> %\RequirePackage[\@biboptions]{natbib}

If you've never read a diff output, the < means what's written in the first file (elsarticle.cls) and > means the second file (elsarticle_nonatbib.cls). The numbers correspond to the line in each file, e.g 192c193 means line 192 in file 1 and 193 in file 2.
Overall, I changed the type of article, commented out the natbib requirement, and changed the author field to auth (see below for why). The edited cls is located here.

Things you need to change

The \author field conflicts with biblatex and elsarticle, so you must chagne the \author definitions to \auth instead. This is a minor change, but important one. You can change that field to anything you want in the elsarticle_nonatbib.cls file (such as elsauthor).

Minimal Working Example (MWE)

I tried it with Elsevier's sample manuscript, changing the author fields to auth, and adding a biblatex-type heading:

\usepackage[
  natbib = true,
    backend=bibtex,
    isbn=false,
    url=false,
    doi=false,
    eprint=false,
    style=numeric,
    sorting=nyt,
    sortcites = true
]{biblatex}
\bibliography{mybibfile}

and

\printbibliography

at the end, and the manuscript came out as if using natbib. The MWE is located here and output PDF is located here.

Conclusion

You can use elsarticle with biblatex, with some minor changes. You may have to include this cls with the rest of your LaTeX files for them to compile on the editors machine. Maybe Elsevier will change the LaTeX for more flexibility, but the StackOverflow question was asked 3 years ago and not much seems to have changed, so I like my solution.

Working with NIfTI images in R

Posted on October 1, 2014 by strictlystat

The oro.nifti package is awesome for NeuRoimaging (couldn't help myself). It has functions to read/write images, introduces the S4 nifti class, and has useful plotting functions. There are some limitations and some gotchas that are important to discuss if you are working with these objects in R.

Dataset Creation

We'll read in some data (a template, example taken from readNIfTI in oro.nifti. We have to download it as most packages aren't allowed to include data (Bioconductor has mostly remedied this for bioinformatics).

library(oro.nifti)
URL <- "http://imaging.mrc-cbu.cam.ac.uk/downloads/Colin/colin_1mm.tgz"
urlfile <- "colin_1mm.tgz"
if (!file.exists(urlfile)){
  download.file(URL, dest=urlfile, quiet=TRUE)
  untar(urlfile)
}
img <- readNIfTI("colin_1mm")
img = cal_img(img)

The img object read in is of class nifti, has a series of slots (remember S4 object, not S3), and the print method produces information about the NIfTI header.

class(img)

[1] "nifti"
attr(,"package")
[1] "oro.nifti"

print(img)

NIfTI-1 format
  Type            : nifti
  Data Type       : 2 (UINT8)
  Bits per Pixel  : 8
  Slice Code      : 0 (Unknown)
  Intent Code     : 0 (None)
  Qform Code      : 2 (Aligned_Anat)
  Sform Code      : 2 (Aligned_Anat)
  Dimension       : 181 x 217 x 181 x 1
  Pixel Dimension : 1 x 1 x 1 x 0
  Voxel Units     : mm
  Time Units      : sec

slotNames(img)

 [1] ".Data"          "sizeof_hdr"     "data_type"      "db_name"       
 [5] "extents"        "session_error"  "regular"        "dim_info"      
 [9] "dim_"           "intent_p1"      "intent_p2"      "intent_p3"     
[13] "intent_code"    "datatype"       "bitpix"         "slice_start"   
[17] "pixdim"         "vox_offset"     "scl_slope"      "scl_inter"     
[21] "slice_end"      "slice_code"     "xyzt_units"     "cal_max"       
[25] "cal_min"        "slice_duration" "toffset"        "glmax"         
[29] "glmin"          "descrip"        "aux_file"       "qform_code"    
[33] "sform_code"     "quatern_b"      "quatern_c"      "quatern_d"     
[37] "qoffset_x"      "qoffset_y"      "qoffset_z"      "srow_x"        
[41] "srow_y"         "srow_z"         "intent_name"    "magic"         
[45] "extender"       "reoriented"

hist(img)

plot of chunk printimg

Array operation

Now, we can do simple operations on this object like any other array. The problem is that certain operations do not (some rightfully so) return a nifti object.

For example, let's get an indicator if the value is greater than 400000. (This is an MRI and has arbitrary units).

mask = img > 400000
mask

NIfTI-1 format
  Type            : nifti
  Data Type       : 2 (UINT8)
  Bits per Pixel  : 8
  Slice Code      : 0 (Unknown)
  Intent Code     : 0 (None)
  Qform Code      : 2 (Aligned_Anat)
  Sform Code      : 2 (Aligned_Anat)
  Dimension       : 181 x 217 x 181 x 1
  Pixel Dimension : 1 x 1 x 1 x 0
  Voxel Units     : mm
  Time Units      : sec

class(mask)

[1] "nifti"
attr(,"package")
[1] "oro.nifti"

We see a nifti object is returned, with all the header information contained. Great! Let's get an indicator of values greater than 400000 but less than 600000.

mask_2 = img > 400000 & img < 600000
class(mask_2)

[1] "array"

We see that this operation returned an array, not a nifti object. That's kind of unexpected, but probably conservative. For example, if this had been 2 different nifti objects and you were performing the subsetting, which object's information should you use?

We see the same behavior when multiplying (or subtracting, etc) two nifti objects.

img_masked = img * mask
class(img_masked)

[1] "array"

We know the object is an array but we may want a nifti object.

`fslr` Helper functions

The niftiarr function in fslr inputs a nifti object and an array, and returns a nifti object with the array in the @.Data slot, copying over the image header information from the input nifti object.

### need the development version
# library(devtools)
# devtools::install_github("muschellij2/fslr")
library(fslr)
img_masked = niftiarr(img, img * mask)
class(img_masked)

[1] "nifti"
attr(,"package")
[1] "oro.nifti"

Other ways to skin the cat

Another way of masking the image is to subset the values of the image that are not in the mask and setting those values to \(0\) (or some other value).

img_masked_2 = img
img_masked_2[mask == 0] = 0
class(img_masked_2)

[1] "nifti"
attr(,"package")
[1] "oro.nifti"

We see that this correctly returns an object of class nifti. One potential issue with this direction is that cal_min and cal_max slots on the resulting nifti object not the range of the image. This is problematic because orthographic and image.nifti both use this slot to determine grayscale ranges for plotting and writeNIfTI errors if these values don't match. The cal_img function in fslr is a simple helper function that will set these values to the respective values of the range of the data.

range(img_masked_2)

[1]      0 791972

c(cal.min(img_masked_2), cal.max(img_masked_2))

[1]      0 791972

img_masked_2 = cal_img(img_masked_2)
range(img_masked_2)

[1]      0 791972

Note, sometimes readNIfTI sets the cal_min and cal_max slots both to \(0\), which may be a result of how the nifti was created.

Equivalent Operations

We see that after these operations done 2 different ways, the resulting nifti objects are equivalent.

all.equal(img_masked, img_masked_2)

[1] TRUE

Data Typing and Writing `nifti` objects

I've worked with oro.nifti for some time, and I wanted to also discuss writing nifti objects. Many operations preserve the data type of the array in the nifti object, such as if the data were integers, an array of integers returned. Other times, the data type changes, such as if you smoothed an image of integers, the resulting array would have values that are not integers, but probably continuous values. When plotting the data or operating with the R object, this is not generally an issue because any extraction of the information will take the nifti object's @.Data slot and return that array.

This presents a problem when you write nifti objects using the writeNIfTI function.

Slots `datatype` and `bitpix`

In a nifti object, the datatype and bitpix fields specify what type the data is and the level of precision stored in the NIfTI file. For example, if you have binary data, the
datatype would be a value of 2 for UINT8 (unsigned integer) and the bitpix would be 8 for 8 bits per voxel. These must correspond to each other (i.e. no mismatching). Also, storing data in a type that is what I call “over-storing” (e.g storing binary data in a FLOAT32 format) may make files larger than necessary. Under-storing is much more problematic, however.

Why you should care about under-storing

If you read in data as integers, did some operations that returned continuous (decimal) objects, and then write the nifti object using the original header information would truncate the data in ways you wouldn't want. In R, the object would look fine, but when you wrote it to disk, it would be truncated.

img@bitpix

[1] 8

img@datatype

[1] 2

Let's say we scaled the data by its maximum:

pct = img / max(img)
pct = cal_img(pct)
pct@bitpix

[1] 8

pct@datatype

[1] 2

hist(pct)

plot of chunk pct

### we will explain this in the next section
pct = drop_img_dim(pct)
pct = zero_trans(pct)
pct = onefile(pct)

If we write this to disk, and read the result back in, we see:

par(mfrow=c(3,1))
hist(pct, main="Correct Data")

### Writing out pct without any datatype change - still UINT8
tfile = tempfile()
writeNIfTI(pct, filename = tfile, verbose=TRUE)

  writing data at byte = 352

pct2 = readNIfTI(paste0(tfile, ".nii.gz"))
hist(pct2, main="Writing out with no change of datatype/bitpix")

### Writing out pct with datatype change - to FLOAT32
pct = datatype(pct, type_string = "FLOAT32")
writeNIfTI(pct, filename = tfile)
pct3 = readNIfTI(paste0(tfile, ".nii.gz"))
hist(pct3, main="Writing out with changing of datatype/bitpix")

plot of chunk readwrite

As you can see, you need to change the bitpix and datatype when writing out NIfTI files if you are changing the data type in the array of data in R. We had changed integers to percentages/proportions, and we changed the output to FLOAT32 (for floating point data). Take a look at convert.datatype() and convert.bitpix() for examples of types.

Other functions for changinag data

We used 3 functions in the last section which changed around pct: drop_img_dim, zero_trans, and onefile.

I want to explain what each of these steps are doing.

Dropping dimensions

When you read in some data (especially from SPM) into R using readNIfTI, the data can be read in as a 4-dimensional image, with one dimension only being one. For example, if we look at the dimensions of img:

dim(img)

[1] 181 217 181   1

img@dim_

[1]   4 181 217 181   1   0   0   0

pixdim(img)

[1] 1 1 1 1 0 0 0 0

We see that the 4th dimension is 1 and the slot dim_ denotes it's a 4-dimensional object (4 is in the first index). Also, we note that values in the dim_ slot are \(0\), which is not allowed when writing out:

writeNIfTI(img, filename = tempfile())

Error: all dim elements > dim[1] must be 1
dim/pixdim mismatch

In order to force it to be a 3-dimensional object, drop_img_dim changes all these aspects for the user.

img_3d = drop_img_dim(img)
dim(img_3d)

[1] 181 217 181

img_3d@dim_

[1]   3 181 217 181   1   1   1   1

pixdim(img_3d)

[1] 1 1 1 1 0 0 0 0

tfile = tempfile()
writeNIfTI(img_3d, filename = tfile)

Unscaling the data

Many times this is not necessary, but some NIfTI files have the scl_slope and scl_inter slots defined. These denote what to multiply and then add to the data to get it in the correct values. zero_trans simply makes scl_slope = 0 and scl_inter = 1.

Changing NIfTI magic slots

In each NIfTI image, there is a magic slot (see http://nifti.nimh.nih.gov/nifti-1/documentation/nifti1fields/nifti1fields_pages/magic.html):

img_3d@magic

[1] "ni1"

img_3d@vox_offset

[1] 0

The NIfTI documentation states:

“ni1” means that the image data is stored in the “.img” file corresponding
to the header file (starting at file offset 0).

We previously wrote out the 3D image, but if we try to read it back in, readNIfTI fails:

readNIfTI(fname = tfile)

Error: This is not a one-file NIfTI format

We can change the default in readNIfTI to onefile = FALSE, but nothing changes:

writeNIfTI(img_3d, filename = tfile, onefile = FALSE)
readNIfTI(fname = tfile)

Error: This is not a one-file NIfTI format

If we look in oro.nifti:::.read.nifti.content we see

if (!onefile) {
    if (nim@magic != "ni1") {
        stop("This is not a two-file NIfTI format")
    }

We can change this magic to n+1, but we need to change the vox_offset = 352 and that's exactly what onefile() does:

img_3d = onefile(img_3d)
writeNIfTI(img_3d, filename = tfile, onefile = FALSE)
readNIfTI(fname = tfile)

NIfTI-1 format
  Type            : nifti
  Data Type       : 2 (UINT8)
  Bits per Pixel  : 8
  Slice Code      : 0 (Unknown)
  Intent Code     : 0 (None)
  Qform Code      : 2 (Aligned_Anat)
  Sform Code      : 2 (Aligned_Anat)
  Dimension       : 181 x 217 x 181
  Pixel Dimension : 1 x 1 x 1
  Voxel Units     : mm
  Time Units      : sec

Viola! All these fun little quirks that I hope you don't have to encounter. But if you do, I hope this page helps.

A HopStat and Jump Away

Trying to at least Doggie Paddle through the Sea of Data, Contributor to http://bmorebiostat.com

Monthly Archives: October 2014

My Commonly Done ggplot2 graphs

Why Do I use ggplot2 then?

ggplot2 vs lattice

Having multiple versions of the same plot, with slight changes

Slight Changes with additions

ggplot reduces number of commands I have to do

Faceting

Spaghetti plot with Overall smoother

Spaghetti plot with overall smoother

Another Case this is Useful – Save Plot Twice

Conclusion

ggplot2 is not ALWAYS the answer: it’s not supposed to be

First Plot – Heatmap

Don’t factor the rows

20,000 Observations

100,000 Observations

ggplot2 does not make publication-ready figures

Why all the ggplot2 hate?

BibLaTeX with elsarticle

Problem

Solution

Things you need to change

Minimal Working Example (MWE)

Conclusion

Working with NIfTI images in R

Dataset Creation

Array operation

`fslr` Helper functions

Other ways to skin the cat

Equivalent Operations

Data Typing and Writing `nifti` objects

Slots `datatype` and `bitpix`

Why you should care about under-storing

Other functions for changinag data

Dropping dimensions

Unscaling the data

Changing NIfTI magic slots

A HopStat and Jump Away

Trying to at least Doggie Paddle through the Sea of Data, Contributor to http://bmorebiostat.com

Why Do I use ggplot2 then?

ggplot2 vs lattice

Having multiple versions of the same plot, with slight changes

Slight Changes with additions

ggplot reduces number of commands I have to do

Faceting

Spaghetti plot with Overall smoother

Spaghetti plot with overall smoother

Another Case this is Useful – Save Plot Twice

Conclusion

First Plot – Heatmap

Don’t factor the rows

20,000 Observations

100,000 Observations

ggplot2 does not make publication-ready figures

Why all the ggplot2 hate?

Problem

Solution

Things you need to change

Minimal Working Example (MWE)

Conclusion

Dataset Creation

Array operation

fslr Helper functions

Other ways to skin the cat

Equivalent Operations

Data Typing and Writing nifti objects

Slots datatype and bitpix

Why you should care about under-storing

Other functions for changinag data

Dropping dimensions

Unscaling the data

Changing NIfTI magic slots

`fslr` Helper functions

Data Typing and Writing `nifti` objects

Slots `datatype` and `bitpix`