Recommendations for First Year Graduate Students

This blog post is a little late; I wanted to get it out sooner.

As new students have flooded the halls for the new terms at JHU Biostat, I figured I would give some recommendations to our new students, and biostatistics students in general. Some of these things may be specific to our department, but others are general, so the title should be fitting. Let's dive in!

First Term Things

Don't buy books

Some books are good references; many are not. Much of the information is available through Google or elsewhere on the internet, and you will check that 98% of the time rather than going to a book. That being said, many students have these good reference books and will be willing to let you borrow them. Also, the library in your department or school will likely have them.


The full recommendation for books is this:

  1. Borrow books you need for class, especially from current (not new) students. Sharing books with current students is good except if you both need it during crucial times (like exams/comprehensive exams). Everyone has Chung or Billingsley.
  2. Of those you can't borrow, go to class for a week or two and see if you actually need it. Some professors go straight off their lecture notes. Your school bookstore doesn't just go and send back all their copies when school starts, so you can still get it. Also, I heard this new website Amazon has books.
  3. If you think a book is a really good reference, buy a copy. Better yet, buy a digital copy so you can digitally search and annotate it.

Get new gear/electronics

You will be spending the majority of your time on your laptop, so it better work and be fast. Most new programs will have some money for books and a laptop. If you read above, you saved some money on books, so use it to buy a new laptop. If your laptop is less than 2 years old, you can save that money (if PhD) or buy other electronics such as an iPad for notetaking (if Master's).

Have the tools that make your work easy, because nothing is worse than not getting work done due to factors outside your control.


Get a Unix-like machine (aka Mac). Others say you can do stuff in Windows, but it's easier for some software in Unix. Cluster computing (see below) will be easier as well.

Side note: if you buy a new computer, do not open it until Friday afternoon/Saturday as you will likely spend a whole day playing with your new gear.

Learn who the staff is

I find many students know who the faculty are and what research they do, but have no idea about who the staff is. These people know almost everything you need to know for non-research help. They schedule meetings with the chair, organize events, schedule rooms, and, very importantly, know how to get you paid/your stipend. These people are the glue that makes everything run and are a great resource.


Go into the office, introduce yourself, and ask which person you should go to for what. Then they will know who you are when you email.

Research and Organizational Tips

Start doing research NOW

If you want to learn what research is all about, get involved early. Even if you don't feel like you know anything, waiting to get involved on a research project will not help. It can hinder you. I'm not saying work 10 hours per week on a project; you have classes.

Recommendation: Visit all the working group meetings before choosing

Attending research meetings of a few working groups can help you 1) get information on the group and how it's run, 2) meet the group members, 3) choose what you may want to focus on, and 4) get you a small-scale project to start on.

This small project is not set in stone. It is not your thesis. The project contact doesn't have to be your thesis advisor. You will likely be working on this “for free” (unless it's under a training grant mentor, technically). Therefore, you don't “owe” anyone anything if you decide in a month you hate that field or project. Don't take it lightly to abandon a project, but do use it as a feeler in that area.

Let me reiterate (at least in our department): Your academic advisor doesn't need to be your research advisor.

Recommendation: Learn how to code

Learn how to program as soon as possible. Some good resources are Code School or Codecademy. If using R, I recommend starting with Try R from Code School. I would then move on to swirl. Getting up to speed or learning how to do something new with programming will never be a waste of time. If you already feel great with R, you can try Python or move deeper into R.

Learn how to use a computing cluster

This may be necessary later in your program, but try to do it before it's “necessary”.

You will invariably work on some project that 1) uses simulations or 2) requires intense computation. A computing cluster is made specifically for these scenarios. Learn how to use it. If you're not going to use it now for research, at least get your login and try it briefly for a class project.

Learn Modern Note-Taking Utilities and Back up your work

Condense your note-taking into one app. I like using Evernote as it syncs with my phone and Mac.

Use Dropbox or Google Drive to have a “hands-free” syncing service. Also, think about investing in an external hard drive, maybe as your new gear, to doubly back up your system/data. Laptops can be (and have been) stolen. Although Google Drive/Dropbox are likely to be around for some time, you always want something in your control (external HDD) in case something goes wrong on a server. GitHub is great for version control, and some people use it as a backup of sorts, but it's not really for that and not a “hands-free” rsync-based solution.

Learn a Markdown/Markup Language

Learn a Markdown language. Yihui has a good description of Markdown vs. LaTeX. You will need to know both. Think about learning some basics of HTML as well.

Make a Webpage

With your newfound HTML skills from above, build a webpage for yourself. Some use GitHub or WordPress. Many options exist, depending on your level of expertise, blogging capability, and desired level of control.

Why do I need a webpage? You work on a computer (after classes) like 98% of the day. You should have a web presence.

What about my LinkedIn profile? That's good for a resume-like page, but not great for your opinions, picture uploads, general ideas. Also, your webpage allows you to control completely what you put out there. Remember, if you don't make the content, Google will pick what people see.

Check out student websites and ask the student whose you like best how they did it.

Student Life

Ask other students questions

One of my rules is to never be scared to ask a stupid question. I ask questions all the time. Some of them are stupid. I know that I won't get an answer if I don't ask though.

We have offices. Students are in those offices. Ask them questions. It's that simple.

Many students say “well I don't want to bother them”. I learned how to code by bothering people. I bothered them very much so. I thought I was annoying, but I didn't care because I didn't know what the hell I was doing.

Does that mean I want questions all day from new students? No. Read that again. No. But I do try to pay information forward to new students, just as others did for me. If a student is curt or makes you feel stupid about asking a question, stop talking to them. They forgot what it was like when they were lost and confused and are likely now severely delusional.

Your questions are usually not new. We've asked them likely ourselves. We either have the answer or know who does. Ask.

Go to student-led meetings

No one in my office knows anything!!? Who do I ask now? Well, there are student-led meetings. These have a lot of information and … other students! Go there, ask questions. If the topic is not what you need to know, wait until the end of the meeting when the structure breaks down and ask someone then.

Student-led meetings put a lot less pressure on asking the “stupid questions”, offer a safer environment, and will likely lead to answers that you understand, because they come from other students.

Work with your cohort

Get chummy with your cohort. You don't have to be best friends forever, but you will talk with them, have class with them, and likely work with them. Stop doing everything on your own; that way you are not leveraging other people's brains as well.

These are other smart people (they were smarter than me). Why not work with them and grab some of that braininess floating around? You will feel dumb for a while, but you'll figure it out. If you don't work with a group in the beginning, it may be too late later when people have grouped up.

They are not your competition, though many departments make it seem like that. The next stage of your career will be a mix of team projects and the rare projects where you are alone (aka thesis). Learn how to play with, and more importantly listen to, others.

Grades don't matter that much, learning the material does

“I came to grad school to get a 4.0” said no one ever. Grades are important for somewhat narrow things: as a fallback if the comprehensive exams go badly, as “an assessment” of your learning, or if you apply to a job with a Master's and they ask for your transcript (and for some reason care).

But good grades are not the goal of grad school. Learning is. Learn and understand the material. Learn how to learn new material. Those are the goals. Grades matter in the sense that they will let you know quite glaringly when you really don't know something. Remember, learning is improving yourself, and that should make a project easier to do than just doing it “because someone told you to”.

Update: As pointed out in the comments below, grades can matter greatly if you plan to apply to another program after your degree (e.g. PhD after doing your Master’s). If this may be in your future, make sure to keep an eye on your grades as well.

Life Recommendations

Take at least one day off per 7-day week

You need rest. Take it. A day off can clarify things later. Sometimes it's only when you stop hitting your head against the wall that you realize that what you're doing doesn't work. That's not to say you won't still work like 60 hours a week for a while, but make sure you have some protected time for your banging head.

Explore restaurants/food/night life

One of the best pieces of advice I've ever gotten for grad school was: “find a place you want to spend the next 5 years of your life” in reference to your department AND city. Whatever city you're in has fun things to do. Find them. Explore your city and area. People tend to hate the places they live in during grad school if they don't associate anything with them other than working in a hole. Which leads me to…

Don't work in a hole; Find a happy place to work

Find a place where you are productive and like to go. I like the office; others don't. Find a coffee shop near you for days without class or when you are done with classes. Use the reading room or other areas as your go-to. Again, working somewhere you don't like is one more hurdle to getting things done. Get rid of such hurdles; you will make enough of your own.

A better interactive neuroimage plotter in R

In a previous post, I described how you can interactively explore a 3D nifti object in R. I used the manipulate package, but the overall results were sluggish and not really usable.

I was introduced to a good neuroimaging viewer called Mango by a friend or two, and I use it somewhat inconsistently. One major advantage of this software is that it has been converted to a pure JavaScript library called Papaya. As such, you can create simple HTML files that have an embedded interactive image viewer.


That's all fine and good, but I like my things in R. I understand many can easily write bash scripts that perform the same operations and use the HTML builder provided by papaya.

I want the operation in R for the same reasons I make many things for R:

  1. I use R
  2. Many statisticians like imaging but need tools they understand and they understand R
  3. I like writing pipelines and scripts in one language

My answer: the papayar package.

Install papayar

To install papayar, you can simply install from GitHub using the devtools package.


Papayar functions

The main function is papaya. Let's look at the arguments:

[1] "images" "outdir" "..."   

The images argument can be a list of nifti objects or a vector of character filenames. The outdir is where the HTML file is written. The default is a temporary directory that will be trashed after the session is over. The additional arguments are passed to the lower-level function pass_papaya, which in turn passes them to the functions httd and daemon_stop in the servr package. The pass_papaya function is also useful on its own: invoking pass_papaya() opens a blank papaya session.

Papayar Example

As the httd function starts a server, the images can be rendered (and will be by default) in the RStudio viewer! In the terminal, it opens your default web browser. Here's a basic example:

library(oro.nifti) # provides nifti()
library(papayar)   # provides papaya()
x = nifti(img = array(rnorm(100^3), dim = rep(100, 3)), dim = rep(100, 3), datatype = 16)
y = nifti(img = array(rbinom(100^3, prob = 0.5, size = 10), dim = rep(100, 3)), dim = rep(100, 3), datatype = 16)
index.file = papaya(list(x, y))

The two nifti calls make some random arrays, from a normal and a binomial distribution, and put them into nifti objects. The list of these nifti objects is passed to papaya. The first image is displayed in grayscale and the second image is overlaid using red-hot colors; the opacity of this overlay can be changed. The object index.file will be a character filename where the HTML file is stored. The data and this HTML file are written to outdir (which, again, is a temporary directory by default).


Below is a series of screen shots I took from the code above. You should be able to see this in RStudio or your browser:


The main reason to use this is that you can click different areas for the crosshairs and move to a different point in axial, coronal, and sagittal space. Thus, this is truly interactive.

Here we can show there are limited (but useful) controls for the overlay. You can change the mapping of the values in the image and the overlay and the opacity of the overlay.

Brain Example

The above data has been used since everyone could test it, but it's just random numbers and doesn't look very compelling. Here I will show you the hyperintense voxels, which correspond mainly to the white matter, overlaid on the MNI 152 1mm T1 brain image:


Hopefully you can see how this can be useful for exploring data and results.

ITK-SNAP and itksnapr

Some of my colleagues are more partial to using ITK-SNAP for viewing images interactively. I have bundled the executables for ITK-SNAP into the R package itksnapr. The main function is itksnap, to which you can pass images for the different ITK-SNAP options.

Install itksnapr

To install itksnapr, you can simply install from GitHub using the devtools package.

itksnap(grayscale = x, overlay = y)

I haven't used ITK-SNAP much, but I hear many good things about it. There are many better resources than this blog on how to use it and how to use it well. If you are interested in a good image viewer, I implore you to google around for some of these resources. If anyone has an intense interest in image viewers and wants added functionality, don't hesitate to file an issue on GitHub.


Although it was included in my fslr package by default and I never discussed it in detail, FSLView is included with the distribution of FSL and is a viewer I use heavily. The fslr function is fslview. One specific advantage of using FSLView is that it can pass through X11 forwarding, so you can remotely view images from a cluster, though it may be slow.


Although I use the orthographic, image.nifti, and overlay functions from oro.nifti for many of my figures, these can be somewhat slow for interactive exploration of results, where you want a large-scale view rather than a specific slice view. Therefore, a fully interactive neuroimaging plotter is necessary. Here are 3 options that can be accessed “within” R.

Rendering LaTeX Math Equations in GitHub Markdown

The Problem: GitHub won't render LaTeX

I have many times wondered about getting LaTeX math to render in a README file on GitHub. Apparently, many others (1, 2, 3) have asked the same question.

The common answers are:

  1. It cannot (and in some cases, shouldn't) be done. GitHub parsing is done by SunDown and is secure, therefore won't do LaTeX.
  2. Use or iTex2Img. These are good options, but 1) they may go away at any time, and 2) require you to rewrite your md file.
  3. Use unicode if possible.
  4. Use LaTeXIt (for Mac OS) or other converter to make your equations and embed them.

A hacky, but working solution

I opted to try a more generic solution for (4.) using some very hacky text parsing. I have done a bit of parsing in the past, but here I was either too lazy to think of the right regex, couldn't think of it easily, or thought my solution was sufficient even if not elegant.


Four main caveats abound:

  1. This only works for inline equations marked with dollar signs ($) or equations marked by double dollar signs ($$). I could incorporate other delimiters such as \[, but I did not. I only had a bit of time on Wednesday.
  2. I assume any code that involves dollar signs is demarcated by chunks starting with three backticks (```). I wrote this for R code, which can use dollar signs for referencing and never has double dollar signs. If your code does, no guarantees.
  3. This generally assumes you have a GitHub repository (have no idea what others use), and that you're OK with the figures being located in that GitHub repository. I didn't allow options for putting them in a sub-folder, but may incorporate that.
  4. Some text won't be sized correctly.

How do I do it already

I wrote an R package that would parse a (or README.rmd if it's RMarkdown). The package is located at

You can install the package using:


You would then load the package:


The main function is parse_latex. It's not the best function name for what it does, but I don't really care. Let's see its arguments:


You must put in a README file as the rmd argument. If the README has an rmd or Rmd extension, the README is first knitted using knit(rmd) and then the resultant md file is used. This md is located in a temporary directory and won't write to the directory of the README. The new_md is the filename for the output md file that you wish to create. One example would be rmd = "" and md = "". The git_username and git_reponame must be specified with your username and repository name, respectively. The git_branch allows you to specify which branch you are on, if necessary. If you don't know what that means, just leave as master.

The rest of the arguments are for inserting the LaTeX into the document. The text_height is how large the LaTeX should be (this may be bad for your document), the insert_string is the HTML the LaTeX is subbed for, the raw_git_site uses to reference the figures directly with proper content-type headers (so that they show up). The bad_string is something I'm using in the code. You only need to change bad_string if you happen to have text in your README that matches this (should be rare as they are a bunch of Z's, unless you write like someone sleeping). I'll get to the ... in a minute.

I still don't get it – show me an example

I thought you'd never ask. The parse_latex command has an example from one of my other repos and you can run it as follows (need curl):

rmd = file.path(tempdir(), "README_unparse.rmd")
destfile = rmd, method = "curl")
new_md = file.path(tempdir(), "")
            git_username = "muschellij2",
            git_reponame = "Github_Markdown_LaTeX")
new_html = pandoc(new_md, format = "html")

And you can view the html using browseURL:


You can see the output of the example (only a little bit of LaTeX) at this repo: or at Kristin Linn's README, which was used as an example here:

What is the function actually doing

So what is the function actually doing? Something convoluted I can assure you. The process is as follows:

  1. Find the equations (marked by $$ and $) and parse them out, throwing out any code demarcated with backticks (```).
  2. Put this LaTeX into a simple LaTeX document with \begin{document}. Note, the ... argument can be a character vector of other packages to load in that document. See png_latex documentation.
  3. Run pdflatex on the document. Note, this must be in your path. This creates a PDF.
  4. Run knitr::plot_crop on this document. This will crop out anything that's not the LaTeX equation you wanted.
  5. Convert the PDF to a PNG using animation::im.convert. This is so that they will render in the README. The file will be something like eq_no_01.png in the same folder as the rmd argument.
  6. Replace all the LaTeX with the insert_string, which is raw HTML now.
  7. Write out the parsed md file, which was named using new_md.
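
Step 1 of the list above can be sketched in base R. This is only an illustration of the idea, not the package's actual code: the helper name find_equations, the regular expressions, and the toy input are all assumptions made up for this example.

```r
# Hypothetical helper (not the package's code): pull $$...$$ and $...$
# equations out of markdown lines, ignoring fenced code chunks.
find_equations <- function(lines) {
  fence_mark <- strrep("`", 3)
  # Lines that open or close a code chunk
  fence <- grepl(paste0("^[[:space:]]*", fence_mark), lines)
  # A line is code if it sits after an odd number of fences, or is a fence
  in_code <- (cumsum(fence) %% 2 == 1) | fence
  text <- paste(lines[!in_code], collapse = "\n")
  # Display equations first, then blank them out so the inline
  # pattern cannot match their delimiters
  display_pat <- "\\$\\$[^$]+\\$\\$"
  display <- regmatches(text, gregexpr(display_pat, text))[[1]]
  rest <- gsub(display_pat, "", text)
  inline <- regmatches(rest, gregexpr("\\$[^$\n]+\\$", rest))[[1]]
  list(display = display, inline = inline)
}

fence_mark <- strrep("`", 3)
md <- c("Inline $a^2 + b^2$ here.",
        paste0(fence_mark, "r"),
        "x <- df$col",   # dollar sign inside code, must be ignored
        fence_mark,
        "$$\\int_0^1 x \\, dx$$")
eqs <- find_equations(md)
```

Handling display equations before inline ones keeps the single-dollar pattern from matching pieces of a double-dollar equation, which is the same ordering concern the caveats above hint at.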

Wow – that IS convoluted

My best shot in one day. If you have better solutions, please post in the comments.

Nothing shows up! Read this

NB: The replacement looks for equations (noted by eq_noSOMETHING.png) in your online GitHub repository. If you run this command and don't push these png files, then nothing will show up.


You can have LaTeX “rendered” in a GitHub README file! The sizes of the text may be weird. This is due to the cropping. I could probably use some bounding box or better way to get only the equations, but I didn't. If you want to help, please submit a Pull Request to my repository and I'd gladly merge it if it works.

NB: GitHub may override a if a README.rmd (or README.Rmd) exists. I'm not 100% sure on that, but if that's the case, rename the Rmd and just have

Happy parsing!

#rstats Make arrays into vectors before running table

Setup of Problem

While working with nifti objects from the oro.nifti package, I tried to table the values of the image. The table took a long time to compute. I thought this was due to the added information about a medical image, but I found that the same sluggishness happened when coercing the nifti object to an array as well.

Quick, illustrative simulation

But, if I coerced the data to a vector using the c function, things were much faster. Here's a simple example of the problem.

library(microbenchmark)
dim1 = 30
n = dim1 ^ 3
vec = rbinom(n = n, size = 15, prob = 0.5)
arr = array(vec, dim = c(dim1, dim1, dim1))
microbenchmark(table(vec), table(arr), table(c(arr)), times = 100)
Unit: milliseconds
          expr       min        lq      mean    median        uq      max
    table(vec)  5.767608  5.977569  8.052919  6.404160  7.574409 51.13589
    table(arr) 21.780273 23.515651 25.050044 24.367534 25.753732 68.91016
 table(c(arr))  5.803281  6.070403  6.829207  6.786833  7.374568  9.69886
 neval cld
   100  a 
   100   b
   100  a 

As you can see, running table on the vector is roughly four times faster than running it on the array. Coercing the array to a vector first adds little overhead: table(c(arr)) is comparable in speed to table(vec).

Explanation of simulation

If the code above is clear, you can skip this section. I created a 30 × 30 × 30 array of random binomial variables, each with 15 Bernoulli trials and success probability 0.5. To keep things on the same playing field, the array (arr) and the vector (vec) have the same values in them. The microbenchmark function (from the package of the same name) runs each command 100 times and displays summary statistics of the run times.

Why, oh why?

I've looked into the table function, but cannot seem to find where the bottleneck occurs. Now, for an array of 30 × 30 × 30, it takes less than a tenth of a second to compute. The problem is that when the data is 512 × 512 × 30 (such as CT data), the tabulation using the array form can be very time consuming.

I reduced the number of replicates, but let's see this in an example with a reasonable image dimension:

dims = c(512, 512, 30)
n = prod(dims)
vec = rbinom(n = n, size = 15, prob = 0.5)
arr = array(vec, dim = dims)
microbenchmark(table(vec), table(arr), table(c(arr)), times = 10)
Unit: seconds
          expr      min       lq     mean    median        uq       max
    table(vec) 1.871762 1.898383 1.990402  1.950302  1.990898  2.299721
    table(arr) 8.935822 9.355209 9.990732 10.078947 10.449311 11.594772
 table(c(arr)) 1.925444 1.981403 2.127866  2.018741  2.222639  2.612065
 neval cld
    10  a 
    10   b
    10  a 


I can't figure out why right now, but it seems that coercing an array (or nifti image) to a vector before running table can significantly speed up the procedure. If anyone has any intuition why this is, I'd love to hear it. Hope that helps your array tabulations!
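
As a small sanity check that nothing is lost by flattening first, here is a sketch on a plain array. The wrapper name table_vec is hypothetical and not part of any package; it only bundles the c() coercion with the call to table.

```r
# Hypothetical convenience wrapper: flatten array-like input, then tabulate
table_vec <- function(x, ...) {
  table(c(x), ...)
}

set.seed(20)
arr <- array(rbinom(30^3, size = 15, prob = 0.5), dim = c(30, 30, 30))

tab_arr <- table(arr)      # tabulate the array directly (the slow route)
tab_vec <- table_vec(arr)  # tabulate the flattened values (the fast route)

# Same categories and same counts; only the route differs
same_counts <- identical(names(tab_arr), names(tab_vec)) &&
  all(as.vector(tab_arr) == as.vector(tab_vec))
```

The dimensions of the array are dropped by c(), which is fine here because table only cares about the values.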

Line plots of longitudinal summary data in R using ggplot2

I recently had an email from a colleague asking me to make a figure like this in ggplot2 or trellis in R:

plot of chunk final_plot

As I know more about how to do things in ggplot2, I chose to use that package (if it wasn't obvious from the plot or other posts).

Starting Point

Cookbook for R has a great starting point for making this graph. The solution there is not sufficient for the desired graph, though it may not be clear why. I will go through most of the customization steps needed to get the desired plot.

Creating Data

To illustrate this, I will create a sample dataset:

N <- 30
id <- as.character(1:N) # create ids
sexes = c("male", "female")
sex <- sample(sexes, size = N/2, replace = TRUE) # create a sample of sex
diseases = c("low", "med", "high")
disease <- rep(diseases, each = N/3) # disease severity 
times = c("Pre", "0", "30", "60")
time <- rep(times, times = N) # times measured 
t <- 0:3
ntimes = length(t)
y1 <- c(replicate(N/2, rnorm(ntimes, mean = 10+2*t)), 
        replicate(N/2, rnorm(ntimes, mean = 10+4*t)))
y2 <- c(replicate(N/2, rnorm(ntimes, mean = 10-2*t)), 
        replicate(N/2, rnorm(ntimes, mean = 10-4*t)))
y3 <- c(replicate(N/2, rnorm(ntimes, mean = 10+t^2)), 
        replicate(N/2, rnorm(ntimes, mean = 10-t^2)))

data <- data.frame(id=rep(id, each=ntimes), sex=rep(sex, each=ntimes), 
                   severity=rep(disease, each=ntimes), time=time, 
                   Y1=c(y1), Y2=c(y2), Y3=c(y3)) # create data.frame
#### factor the variables so in correct order
data$sex = factor(data$sex, levels = sexes)
data$time = factor(data$time, levels = times)
data$severity = factor(data$severity, levels = diseases)
head(data)
  id    sex severity time        Y1        Y2        Y3
1  1 female      low  Pre  9.262417 11.510636  9.047127
2  1 female      low    0 10.223988  8.592833 11.570381
3  1 female      low   30 13.650680  5.696405 13.954316
4  1 female      low   60 15.528288  5.313968 18.631744
5  2 female      low  Pre  9.734716 11.190081 10.086104
6  2 female      low    0 12.892207  7.897296  9.794494

We have a longitudinal dataset with 30 different people/units with different ID. Each ID has a single sex and disease severity. Each ID has 4 replicates, measuring 3 separate variables (Y1, Y2, and Y3) at each time point. The 4 time points are previous (Pre)/baseline, time 0, 30, and 60, which represent follow-up.

Reformatting Data

In ggplot2, if you want to plot all 3 Y variables, you must have them in the same column, with another column indicating which variable you want to plot. Essentially, I need to make the data “longer”. For this, I will reshape the data using the reshape2 package and the function melt.

library(reshape2)
long = melt(data, measure.vars = c("Y1", "Y2", "Y3"))
head(long)
  id    sex severity time variable     value
1  1 female      low  Pre       Y1  9.262417
2  1 female      low    0       Y1 10.223988
3  1 female      low   30       Y1 13.650680
4  1 female      low   60       Y1 15.528288
5  2 female      low  Pre       Y1  9.734716
6  2 female      low    0       Y1 12.892207

It may not be clear what has been reshaped, but reordering the data.frame can illustrate that each Y variable is now a separate row:

head(long[ order(long$id, long$time, long$variable),], 10)
    id    sex severity time variable     value
1    1 female      low  Pre       Y1  9.262417
121  1 female      low  Pre       Y2 11.510636
241  1 female      low  Pre       Y3  9.047127
2    1 female      low    0       Y1 10.223988
122  1 female      low    0       Y2  8.592833
242  1 female      low    0       Y3 11.570381
3    1 female      low   30       Y1 13.650680
123  1 female      low   30       Y2  5.696405
243  1 female      low   30       Y3 13.954316
4    1 female      low   60       Y1 15.528288

Creating Summarized data frame

We will make a data.frame with the means and standard deviations for each group, for each sex, for each Y variable, for separate time points. I will use plyr to create this data.frame, using ddply (first d representing I'm putting in a data.frame, and the second d representing I want data.frame out):

library(plyr)
agg = ddply(long, .(severity, sex, variable, time), function(x){
  c(mean = mean(x$value), sd = sd(x$value))
})
head(agg)
  severity  sex variable time      mean        sd
1      low male       Y1  Pre  9.691420 1.1268324
2      low male       Y1    0 12.145178 1.1218897
3      low male       Y1   30 14.304611 0.3342055
4      low male       Y1   60 15.885740 1.7616423
5      low male       Y2  Pre  9.653853 0.7404102
6      low male       Y2    0  7.652401 0.7751223

There is nothing special about means/standard deviations. It could be any summary measures you are interested in visualizing.

We will also create the bounds at the mean ± 1 standard deviation. We could have used the standard error or a confidence interval, etc.

agg$lower = agg$mean - agg$sd
agg$upper = agg$mean + agg$sd

Now, agg contains the data we wish to plot.
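
If a confidence interval is preferred over ± 1 standard deviation, the bounds could be computed along these lines. This is a minimal sketch under assumptions of my own: the helper name mean_ci is hypothetical, the interval is the usual t-based one for a mean, and the five input values are made up.

```r
# Hypothetical helper: mean with a t-based confidence interval
mean_ci <- function(v, level = 0.95) {
  n <- length(v)
  m <- mean(v)
  se <- sd(v) / sqrt(n)  # standard error of the mean
  half <- qt(1 - (1 - level) / 2, df = n - 1) * se  # half-width of the interval
  c(mean = m, lower = m - half, upper = m + half)
}

ci <- mean_ci(c(9.2, 10.1, 10.7, 9.8, 10.4))
```

A function like this could be called on x$value inside the ddply step in place of the mean/sd vector, so that the lower and upper columns come out of the summary directly.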

Time is not on your side

Time as a factor

If you look at the plot we wish to make, we want the lines to be connected for times 0, 30, 60, but not for the previous data. Let's try using the time variable, which is a factor. We create pd, a position adjustment from ggplot2, which says that I wish to plot the means and error bars slightly offset from each other.

class(agg$time)
[1] "factor"
library(ggplot2)
pd <- position_dodge(width = 0.2) # move them .2 to the left and right

gbase  = ggplot(agg, aes(y=mean, colour=severity)) + 
  geom_errorbar(aes(ymin=lower, ymax=upper), width=.3, position=pd) +
  geom_point(position=pd) + facet_grid(variable ~ sex)
gline = gbase + geom_line(position=pd) 
print(gline + aes(x=time))

plot of chunk gbase

None of the lines are connected! This is because time is a factor. We will use gbase and gline with different times to show how the end result can be achieved.

Time as a numeric

We can make time a numeric variable, and simply replace Pre with -1 so that it can be plotted as well.

agg$num_time = as.numeric(as.character(agg$time)) # "Pre" becomes NA (with a warning)
agg$num_time[ is.na(agg$num_time) ] = -1
unique(agg$num_time)
[1] -1  0 30 60

In a previous post, I discussed, as an aside, creating a plot in ggplot2 and then adding data to the data.frame afterwards. You must use the %+% operator to update the data in the plot object.

gline = gline %+% agg
print(gline + aes(x=num_time))

plot of chunk plus

If you look closely, you can see that Pre and time 0 are very close and not labeled, but also connected. As the scale on the x-axis has changed, the width of the error bar (set to 0.3), now is too small and should be changed if using this solution.

Although there can be a discussion about whether the Pre data should even be on the same plot or the same timeframe, I will leave that for you to dispute. I don't think it's a terrible idea, and I think the plot works when the Pre and time 0 data are not connected. There was nothing special about -1, and here we use -30 to make it evenly spaced:

agg$num_time[ agg$num_time == -1 ] = -30
gline = gline %+% agg
print(gline + aes(x=num_time))

plot of chunk create_time_neg

That looks similar to what we want. Again, Pre is connected to the data, but we also now have a labeling problem with the x-axis somewhat. We still must change the width of the error bar in this scenario as well.

Time as a numeric, but not the actual time point

In the next case, we simply apply as.numeric to the factor to create a variable new_time that runs from 1 for the first level of time (in this case Pre) up to the number of time points, in this case 4.

agg$new_time = as.numeric(agg$time)
[1] 1 2 3 4
gline = gline %+% agg
print(gline + aes(x = new_time))

plot of chunk new_time

Here we have something similar with the spacing, but now the labels are not what we want. Also, Pre is still connected. The width of the error bars is now on a scale from 1-4, so they look appropriate.
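The difference between the two conversions used so far is easy to check on a toy factor. The sketch below uses a made-up factor named time with the same four levels; it is not the agg data itself.

```r
# Toy factor with the same levels as the time variable (illustrative only)
time = factor(c("Pre", "0", "30", "60"), levels = c("Pre", "0", "30", "60"))

# Converting through character keeps the actual time values;
# "Pre" is not a number, so it becomes NA (with a coercion warning)
num_time = suppressWarnings(as.numeric(as.character(time)))
num_time[is.na(num_time)] = -1

# Converting the factor directly returns the integer level codes
new_time = as.numeric(time)

print(num_time)
print(new_time)
```

The first conversion keeps the real spacing between time points, while the second gives evenly spaced positions that need to be relabeled afterwards.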

Creating a Separate data.frame

Here, we will create a separate data.frame for the data whose points we want to connect. We want the times 0-60 to be connected and the Pre time point to be separate.

sub_no_pre = agg[ agg$time != "Pre",]

Multiple data sets in plot function

Note, previously we did:

gline = gbase + geom_line(position=pd) 

This assumes that geom_line uses the same data.frame as the rest of the plot (agg). We can fully specify the arguments in geom_line so that the line is only for the non-Pre data:

gbase = gbase %+% agg
gline = gbase + geom_line(data = sub_no_pre, position=pd, 
                          aes(x = new_time, y = mean, colour=severity)) 
print(gline + aes(x = new_time))

plot of chunk non_conn
Note, the arguments in aes should match the rest of the plot for this to work smoothly and correctly.
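
The same idea can be tried on a small stand-alone example. This is only a sketch with made-up numbers in a data.frame called toy (not the agg data), and the plotting part runs only if ggplot2 is installed.

```r
# Toy summary data (made-up numbers, for illustration only)
toy = data.frame(
  new_time = 1:4,
  time = factor(c("Pre", "0", "30", "60"), levels = c("Pre", "0", "30", "60")),
  mean = c(5, 6, 7, 6.5)
)
# Only the rows whose points should be joined by a line
toy_no_pre = toy[toy$time != "Pre", ]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Points use the full data; the line layer gets its own data argument
  g = ggplot(toy, aes(x = new_time, y = mean)) +
    geom_point() +
    geom_line(data = toy_no_pre)
  print(g)
}
```

Because the line layer inherits the x and y mappings from the plot, the subset only needs the same column names as the full data.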

Relabeling the axes

Now, we simply need to re-label the x-axis so that it corresponds to the correct times:

g_final = gline + aes(x=new_time) +
  scale_x_continuous(breaks=c(1:4), labels=c("Pre", "0", "30", "60"))

We could be more robust in this code, using the levels of the factor:

time_levs = levels(agg$time)
g_final = gline + aes(x=new_time) +
  scale_x_continuous(
    breaks= 1:length(time_levs), 
    labels = time_levs)

plot of chunk relabel2

Give me a break

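My colleague also wanted to separate the panels a bit. We will use the panel.margin arguments of theme and the unit function from the grid package to define how far apart we want the panels. (In ggplot2 2.2.0 and later, these arguments are named panel.spacing.x and panel.spacing.y.)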

g_final = g_final + theme(panel.margin.x = unit(1, "lines"), 
                          panel.margin.y = unit(0.5, "lines"))

plot of chunk final

Additional options and conclusion

I believe legends should be inside a plot for many reasons (I may write about that). Colors can be changed (see scale_colour_manual). Axis labels should be changed, and the Y should be labeled to what they are (this is a toy example).

Overall, this plot seems to be what they wanted and the default options work okay. I hope this illustrates how to customize a ggplot to your needs and how you may need to use multiple data.frames to achieve your desired result.

A small neuroimage interactive plotter

Manipulate Package

The manipulate package from RStudio allows you to create simple Tcl/Tk operators for interactive visualization. I will use it for a simple slider to view different slices of an image.


fslr package

I'm calling the fslr package because I know that if you have it installed, you will likely have FSL and have a 1mm T1 template from MNI in a specific location. fslr also loads the oro.nifti package so that readNIfTI is accessible after loading fslr. You can download a test NIfTI image here if you don't have access to any and don't have FSL downloaded.

Here I will read in the template image:

template = file.path(fsldir(), "data/standard", 
img = readNIfTI(template)

The iplot function

The iplot function defined below takes in a nifti object, the specific plane to be plotted and additional options to be passed to oro.nifti::image. The function is located on my GitHub here.

iplot = function(img, plane = c("axial", 
                                "coronal", "sagittal"), ...){
  ## pick the plane
  plane = match.arg(plane, c("axial", 
                             "coronal", "sagittal"))
  # Get the max number of slices in that plane for the slider
  # (assumed mapping of planes to image dimensions)
  ns=  switch(plane,
              "axial" = dim(img)[3],
              "coronal" = dim(img)[2],
              "sagittal" = dim(img)[1])
  ## run the manipulate command
  manipulate({
    image(img, z = z, plot.type= "single", plane = plane, ...)
    # this will return mouse clicks (future experimental work)
    pos <- manipulatorMouseClick()
    if (!is.null(pos)) {
      print(pos)  # assumed handling of the click
    }
  },
  ## make the slider
  z = slider(1, ns, step=1, initial = ceiling(ns/2))
  )
}

Example plots

Here are some examples of how this iplot function would be used:

iplot(img, plane = "coronal")
iplot(img, plane = "sagittal")

The result will be a plotted image of the slice with a slider. This is most useful if you run it within RStudio.

Below are 2 example outputs of what you see in RStudio:

Slice 91:
Slice 1

Slice 145:

Slice 2


The iplot function allows users to interactively explore neuroimages. The plotting is not as fast as I'd like; I may try to speed up the oro.nifti::image command or implement some subsampling. It does, however, show a proof of concept of how interactive neuroimaging visualization can be done in R.


manipulate must be run in RStudio for manipulation. The fslr function fslview will call FSLView from FSL for interactive visualization. This is an option of interactive neuroimaging “in R”, but not a real or satisfactory implementation for me (even though I use it frequently). If anyone has implemented such a solution in R, I'd love to hear about it.

matlabr: a Package for Calling MATLAB from R with system

In my research, I primarily use R, but I try to use existing code if available. In neuroimaging and other areas, that means calling MATLAB code. There are some existing solutions for the problem of R to MATLAB: namely the R.matlab package and the RMatlab package (which can call R from MATLAB as well). I do not usually use these solutions, though.

Previously, Mandy Mejia wrote “THREE WAYS TO USE MATLAB FROM R”. Option 2 is about how to use R.matlab, and Mandy gives an example with some code. She also describes in Options 1 and 3 how to use the system command to call MATLAB commands.

I like the system-based options because:

  1. I didn’t take the time to learn R.matlab.
  2. It worked for me.
  3. I wrote a package to wrap the options Mandy described: matlabr.

matlabr: Wrapping together system calls to MATLAB

The matlabr package is located in GitHub and you can install it with the following command:


It has a very small set of functions and I will go through each function and describe what they do:

  1. get_matlab: Mostly internal command that will return a character string that will be passed to system. If matlab is in your PATH (bash variable), and you are using R based on the terminal, the command would return "matlab". If MATLAB is not in your PATH or using a GUI-based system like RStudio, you must set options(matlab.path='/your/path/to/matlab').
  2. have_matlab: Wrapper for get_matlab to return a logical if matlab is found.
  3. run_matlab_script: This will pass a .m file to MATLAB. It also wraps the command in a try-catch statement in MATLAB so that if it fails, it will print the error message. Without this try-catch, if MATLAB errors, then running the command will remain in MATLAB and not return to R.
  4. run_matlab_code: This takes a character vector of MATLAB code, ends lines with ;, writes it to a temporary .m file, and then runs run_matlab_script on the temporary .m file.
  5. rvec_to_matlab: Takes in a numeric R vector and creates a MATLAB column matrix.
  6. rvec_to_matlabclist: Takes in a vector from R (usually a character vector) and quotes these strings with single quotes and places them in a MATLAB cell using curly braces: { and }. It then stacks these cells into a “matrix” of cells.
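
To make the ideas in this list concrete, here is a rough base-R sketch of the string handling involved. The helper names to_matlab_col, to_matlab_cell, and write_matlab_script are hypothetical and are not matlabr's actual implementation; the sketch only builds the MATLAB text and writes it to a temporary .m file, without calling MATLAB.

```r
# Hypothetical helper: numeric R vector -> MATLAB column vector literal
to_matlab_col = function(x) {
  paste0("[", paste(x, collapse = "; "), "]")
}

# Hypothetical helper: character vector -> MATLAB cell array literal
to_matlab_cell = function(x) {
  paste0("{", paste0("'", x, "'", collapse = "; "), "}")
}

# Hypothetical helper: wrap MATLAB code in try/catch and write it to a .m file
write_matlab_script = function(code) {
  code = paste0(code, ";")  # end each line with a semicolon
  script = c("try",
             paste0("  ", code),
             "catch err",
             "  disp(err.message);",  # print the error instead of hanging
             "end",
             "exit;")                 # always leave MATLAB and return to R
  fname = tempfile(fileext = ".m")
  writeLines(script, fname)
  fname
}

print(to_matlab_col(c(1, 2, 3)))
print(to_matlab_cell(c("a.nii", "b.nii")))
fname = write_matlab_script(c("x = 10", "y = 2 * x"))
cat(readLines(fname), sep = "\n")
```

The point of the try/catch plus a final exit is that MATLAB returns control to R whether or not the code errors.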

Setting up matlabr

Let’s set up the matlab.path as I’m running in RStudio:

options(matlab.path = "/Applications/")

The result from have_matlab() indicates that the matlab command can be called.

Let’s write some code to test it

Here we will create some code to take a value for x, y, z (scalars) and a matrix named a and then save x, a, z to a text file:

code = c("x = 10", 
         "a = [1 2 3; 4 5 6; 7 8 10]",
         "save('test.txt', 'x', 'a', 'z', '-ascii')")
res = run_matlab_code(code)


First off, we see that test.txt indeed was written to disk.

[1] TRUE

We can read in test.txt using readLines:

output = readLines(con = "test.txt")
[1] "   1.0000000e+01"                                
[2] "   1.0000000e+00   2.0000000e+00   3.0000000e+00"
[3] "   4.0000000e+00   5.0000000e+00   6.0000000e+00"
[4] "   7.0000000e+00   8.0000000e+00   1.0000000e+01"
[5] "   3.0000000e+01"                                


matlabr isn’t fancy and most likely has some drawbacks, as using system can have some quirks. However, these functions have been helpful for me to use some SPM routines and other MATLAB commands while remaining “within R”. R.matlab has a better framework, but it may not be as straightforward for batch processing. Also, matlabr has some wrappers that will do a try-catch so that you don’t get stuck in MATLAB after calling system.

Let me know if this was helpful or if you have ideas on how to make this better. Or better yet, give a pull request.

White Matter Segmentation in R

Goals and Overall Approach

We will use multiple packages and pieces of software for white matter (and gray matter/cerebrospinal fluid (CSF)) segmentation.

The overall approach will be, with the required packages in parentheses:

  1. N4 Inhomogeneity Bias-Field Correction (extrantsr and ANTsR)
  2. Brain extraction using BET and additional tools (extrantsr and fslr)
  3. FAST for tissue-class segmentation. (fslr)

Installing Packages

Below is a script to install the current development versions of all packages. The current fslr package depends on oro.nifti (>= 0.5.0), which is located at muschellij2/oro.nifti or bjw34032/oro.nifti.

Note, the ITKR and ANTsR packages can take a long time to compile. The extrantsr package builds on ANTsR and makes some convenience wrapper functions.


Load in the packages

Here we will load in the required packages. The scales package is imported just for the alpha function, used below in plotting.


Specifying FSL path

For fslr to work, FSL must be installed. If run in the Terminal, the FSLDIR environmental variable should be found using R's Sys.getenv("FSLDIR") function.

If run in an IDE (such as RStudio or the R GUI), R must know the path of FSL, as set by the following code:
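
A minimal sketch of such a setup, assuming that fslr reads the fsl.path option and that FSL lives at /usr/local/fsl (a placeholder path; use your own install location):

```r
# Look up FSLDIR from the environment; it is "" when not set (e.g. in a GUI)
fsldir_env = Sys.getenv("FSLDIR")

if (fsldir_env == "") {
  # Assumed install location; change to wherever FSL lives on your machine
  options(fsl.path = "/usr/local/fsl")
} else {
  options(fsl.path = fsldir_env)
}
print(getOption("fsl.path"))
```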


Image Filenames

Here we will set the image name. The nii.stub function will strip off the .nii.gz from = "SUBJ0001-01-MPRAGE.nii.gz"
img.stub = nii.stub(

N4 Bias Field Correction

The first step in most MRI analysis is performing inhomogeneity correction. The extrantsr function bias_correct can perform N3 or N4 bias correction from the ANTsR package.

n4img = bias_correct(, correction = "N4", 
                      outfile = paste0(img.stub, "_N4.nii.gz") )

plot of chunk biascorrection_plot

Let us note that the image is of the head and a bit of the neck. We wish to perform white matter segmentation only on the brain tissues, so we will do brain extraction.

Brain Extraction

The extrantsr function fslbet_robust performs brain extraction. It relies on the fslr function fslbet which calls bet from FSL. It also performs neck removal (remove.neck = TRUE) and will perform BET once and then estimate a new center of gravity (COG) and then re-run BET. These functions are implemented in fslbet specifically, but these have been re-implemented in fslbet_robust in a slightly different way. fslbet_robust will also perform N4 inhomogeneity correction, but as this has already been performed above, we will set correct = FALSE.

For neck removal, a template brain and mask must be specified. We will use the T1, 1mm resolution, MNI brain included with FSL's installation.

bet = fslbet_robust(img = n4img, 
                    retimg = TRUE,
                    remove.neck = TRUE,
                    robust.mask = FALSE,
                    template.file = file.path( fsldir(), 
                    template.mask = file.path( fsldir(), 
                    outfile = "SUBJ0001-01-MPRAGE_N4_BET", 
                    correct = FALSE)

The results look good: only the brain tissue is kept (in red). Not much brain tissue is discarded, and not much non-brain tissue is included.

ortho2(n4img, bet > 0, 
       col.y=alpha("red", 0.5))

plot of chunk bet_plot

FAST Image Segmentation

Now that we have a brain image, we can use FAST for image segmentation. We will use the fslr function fast, which calls fast from FSL. We will pass the -N option so that FAST will not perform inhomogeneity correction (different from N4 and N3), because we had performed this before.

fast = fast(file = bet, 
            outfile = paste0(img.stub, "_BET_FAST"), 
            opts = '-N')

White Matter Results

By default, FAST assumes 3 tissue classes, generally white matter, gray matter, and CSF. These are generally ordered by the mean intensity of the class. For T1-weighted images, white matter is the highest intensity, and assigned class 3. Let's see the results:

ortho2(bet, fast == 3, 
       col.y=alpha("red", 0.5))

plot of chunk fast_plot
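
The claim that classes are ordered by mean intensity can be checked with a per-class summary. The sketch below uses simulated labels and intensities (not real MRI data, and not the output of FAST) just to show the pattern of the check and of building a binary mask such as fast == 3.

```r
set.seed(1)
# Simulated labels and intensities for a toy 3-class image (not real MRI data)
labels = sample(1:3, size = 1000, replace = TRUE)
intensity = rnorm(1000, mean = c(20, 60, 100)[labels], sd = 5)

# Mean intensity per class; the brightest class should be class 3
class_means = tapply(intensity, labels, mean)
print(class_means)

# Binary mask for one class, analogous to fast == 3
wm_mask = labels == 3
print(sum(wm_mask))
```

With real data, the same tapply call on the voxel intensities and the segmentation labels would show whether class 3 is indeed the brightest tissue.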

Gray Matter / CSF Results

We can also visualize the classes for 1 and 2 for CSF and gray matter, respectively.

ortho2(bet, fast == 1, col.y=alpha("red", 0.5), text="CSF Results")

plot of chunk fast_plot_csf_gm

ortho2(bet, fast == 2, col.y=alpha("red", 0.5), text="Gray Matter\nResults")

plot of chunk fast_plot_csf_gm

The results indicate good segmentation of the T1 image. The fslr function fast produces more than the tissue-class segmentation; see the other output files:

list.files(pattern=paste0(img.stub, "_BET_FAST"))
[1] "SUBJ0001-01-MPRAGE_BET_FAST_mixeltype.nii.gz"
[2] "SUBJ0001-01-MPRAGE_BET_FAST_pve_0.nii.gz"    
[3] "SUBJ0001-01-MPRAGE_BET_FAST_pve_1.nii.gz"    
[4] "SUBJ0001-01-MPRAGE_BET_FAST_pve_2.nii.gz"    
[5] "SUBJ0001-01-MPRAGE_BET_FAST_pveseg.nii.gz"   
[6] "SUBJ0001-01-MPRAGE_BET_FAST_seg.nii.gz"      


It's an exciting time to be working in neuroimaging in R. The fslr and ANTsR packages provide functionality to perform operations for neuroimaging processing. I will be doing a series on some of the options for analysis in the coming weeks. The code for this analysis (and the data) is located at


The fslr function ortho2 is a rewrite of the oro.nifti::orthographic function, but with different defaults and will set values of 0 in the second image (y argument) to NA.

The Unofficial ENAR 2015 Itinerary Maker

It’s almost ENAR 2015! The final program is out with all the sessions. The last conference I went to, the International Stroke Conference, had a program planner hosted by abstracts online.

Although there are parts of this system I would like to change, I believe it is helpful for looking up sessions, presenters, and especially posters. Therefore, I introduce

Functions and How to Use

Here is an example screen shot of the shiny app:

The functions are as follows:

  • Type of Session – you can choose from different session types, whether you want to limit to posters or short courses
  • Select the day to subset data
  • Select a specific session
  • Search – this text field uses grep (after lower casing the field) to search the title and author fields for relevant text.
  • Download – this will download a CSV file of the subsetted table. Note that this will not be exactly the table, but a Google Calendar-friendly format. This will be discussed in the next section.
  • Donate button: I spent a good deal of work on this app and I believe it improves the conference. If you agree and would like to donate some money and/or a beer at ENAR 2015, I’d appreciate it.

Export to Google Calendar

Individual Talks

Each individual session can be added to a Google Calendar using the Add to my Google calendar button next to each session. For standard posters, this will add the poster as the entire poster session. For specific talks, it will not add the complete session, but simply that talk.

Exporting CSV and Uploading to Google Calendar

Create a new Google calendar; let’s call it “ENAR 2015”. Download a CSV of the sessions you would like to attend, or download the entire table and filter it in R/Excel. In Google Calendar, go to Other Calendars, click the down arrow, and select ‘Import Calendar’. Upload the CSV and select your new calendar, ENAR 2015, and the records should be imported. If this is unclear, I made a 3 minute YouTube video of this step-by-step process.
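
If you would rather build such a file yourself, the sketch below writes a one-row CSV, assuming the column headers that Google Calendar's CSV import expects (Subject, Start Date, Start Time, End Date, End Time, Description, Location). The session, date, times, and file name are made up for illustration and are not taken from the app or the program.

```r
# A made-up session; the app's download contains the actual program entries
sessions = data.frame(
  Subject = "Example session title",
  `Start Date` = "03/16/2015",
  `Start Time` = "8:30 AM",
  `End Date` = "03/16/2015",
  `End Time` = "10:15 AM",
  Description = "Example description",
  Location = "Example room",
  check.names = FALSE,       # keep the spaces in the column names
  stringsAsFactors = FALSE
)

# Hypothetical output file in the temporary directory
outfile = file.path(tempdir(), "enar_example.csv")
write.csv(sessions, outfile, row.names = FALSE)
print(readLines(outfile)[1])
```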


If any information is incorrect, please let me know at @StrictlyStat. I spent a good deal of time cleaning the text from the PDF, so I believe it should be mostly correct, but obviously I did not capture any last-minute changes.

My Sessions

Please stop by the poster session Poster Number 2b. (PDF of poster) and if you’re interested in neuroimaging processing and using R, please sign up for the “T4: A Tutorial for Multisequence Clinical Structural Brain MRI” that we are running.

Code for app

The app is hosted on my GitHub along with the data used to run the app.

If the app crashes

A backup (or mirror) shiny app is located at

Using Tables for Statistics on Large Vectors

This is the first post I’ve written in a while. I have been somewhat radio silent on social media, but I’m jumping back in.

Now, I work with brain images, which can have millions of elements (referred to as voxels). Many of these elements are zero (for background). We usually want to calculate basic statistics on the data. Here I describe how you can speed up operations or reduce memory requirements by using summary tables when you want to calculate many statistics on a large vector of integer values.

Why to use Tables

Tables are relatively computationally expensive to calculate. They must operate over the entire vector, find the unique values, and bin the data into these values. Let n be the length of the vector. For integer vectors (i.e. whole numbers), the number of unique values is usually much less than n. Therefore, the table can be stored much more efficiently than the entire vector.

Tables are sufficient statistics

You can think of the frequencies and bins as summary statistics for the entire distribution of the data. I will not discuss a formal proof here, but you can easily re-create the entire vector using the table (see epitools::expand.table for a function to do this), and thus the table is a sufficient (but not likely a minimal) statistic.
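
As a quick illustration of that re-creation step, here is a sketch that uses base R's rep instead of epitools::expand.table, on a small made-up vector called small_vec:

```r
set.seed(20150301)
small_vec = sample(-2:5, size = 50, replace = TRUE)
small_tab = table(small_vec)

# Repeat each unique value by its frequency to rebuild the (sorted) data
vals = as.numeric(names(small_tab))
rebuilt = rep(vals, times = as.integer(small_tab))

# The rebuilt vector matches the original up to ordering
print(all.equal(rebuilt, as.numeric(sort(small_vec))))
```

Only the ordering of the original vector is lost, which does not matter for statistics such as the mean, quantiles, or variance.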

As the table is a sufficient statistic, we can compute any statistic we’d like from it relatively easily. Now, R has very efficient functions for many statistics, such as the median and quantiles, so it may not be obvious why we’d want to rewrite some of these functions using tables.

I can think of 2 reasons: 1) you want to calculate many statistics on the data and don’t want to pass the vector in multiple times, and 2) you want to preprocess the data to summarize the data into tables to only use these in memory versus the entire vector.

Here are some examples when this question has been asked on stackoverflow: 1, 2 and the R list-serv: 1. What we’re going to do is show some basic operations on tables to get summary statistics and show they agree.

R Implementation

Let’s make a large vector:

vec = sample(-10:100, size= 1e7, replace = TRUE)

Quantile function for tables

I implemented a quantile function for tables (of only type 1). The code takes in a table, creates the cumulative sum, extracts the unique values of the table, then computes and returns the quantiles.

quantile.table = function(tab, probs = c(0, 0.25, 0.5, 0.75, 1)){
  n = sum(tab)
  #### get CDF
  cs = cumsum(tab)
  ### get values (x)
  uvals = unique(as.numeric(names(tab)))

  #  can add different types of quantile, but using default
  m = 0
  qs = sapply(probs, function(prob){
    np = n * prob
    j = floor(np) + m
    g = np + m - j
    # type == 1: take order statistic j if g == 0, else j + 1
    gamma = as.numeric(g != 0)
    quant = uvals[min(which(cs >= j + gamma))]
    return(quant)
  })
  dig <- max(2L, getOption("digits"))
  names(qs) <- paste0(if (length(probs) < 100) 
    formatC(100 * probs, format = "fg", width = 1, digits = dig)
    else format(100 * probs, trim = TRUE, digits = dig), 
    "%")
  return(qs)
}

Quantile Benchmarks

Let’s benchmark the quantile functions: 1) creating the table and then getting the quantiles, 2) creating an empirical CDF function and then getting the quantiles, 3) computing the quantiles on the original data.

library(microbenchmark)
qtab = function(vec){
  tab = table(vec)
  quantile.table(tab)
}
qcdf = function(vec){
  cdf = ecdf(vec)
  quantile(cdf, type=1)
}
# quantile(vec, type = 1)
microbenchmark(qtab(vec), qcdf(vec), quantile(vec, type = 1), times = 10L)
Unit: relative
                    expr       min        lq     mean    median       uq
               qtab(vec) 12.495569 12.052644 9.109178 11.589662 7.499691
               qcdf(vec)  5.407606  5.802752 4.375459  5.553492 3.708795
 quantile(vec, type = 1)  1.000000  1.000000 1.000000  1.000000 1.000000
      max neval cld
 5.481202    10   c
 2.653728    10  b 
 1.000000    10 a  

More realistic benchmarks

Not surprisingly, simply running quantile on the vector beats the other 2 methods, by far. So if the table has to be built first, it offers no speed benefit. But if tables or CDFs are already created in a previous processing step, we should compare that procedure:

tab = table(vec)
cdf = ecdf(vec)
all.equal(quantile.table(tab), quantile(cdf, type=1))
[1] TRUE
all.equal(quantile.table(tab), quantile(vec, type=1))
[1] TRUE
microbenchmark(quantile.table(tab), quantile(cdf, type=1), quantile(vec, type = 1), times = 10L)
Unit: relative
                    expr      min       lq     mean   median       uq
     quantile.table(tab)    1.000    1.000   1.0000    1.000   1.0000
 quantile(cdf, type = 1)  774.885 1016.172 596.3217 1144.063 868.8105
 quantile(vec, type = 1) 1029.696 1122.550 653.2146 1199.143 910.3743
      max neval cld
   1.0000    10  a 
 198.1590    10   b
 206.5936    10   b

As we can see, if you have already computed the table, you get the same quantiles as from the full vector, and much faster. Using quantile on an ecdf object is not much better than using the vector, mainly because the quantile method for ecdf objects rebuilds the full vector and then calculates the quantiles:

function (x, ...) 
quantile(evalq(rep.int(x, diff(c(0, round(nobs * y)))), environment(x)), 
    ...)
<bytecode: 0x107493e28>
<environment: namespace:stats>

Median for tables

Above we show the quantile.table function, so the median function is trivial where probs = 0.5:

median.table = function(tab){
  quantile.table(tab, probs = 0.5)
}

Mean of a table

Other functions can be used to calculate statistics on the table, such as the mean:

mean.table = function(tab){
  uvals = unique(as.numeric(names(tab)))
  sum(uvals * tab)/sum(tab)
}
[1] 44.98991
[1] 44.98991
Warning in mean.default(cdf): argument is not numeric or logical:
returning NA
[1] NA

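As we see, we can simply call mean on the table. Because the function is named mean.table, R's S3 dispatch uses it automatically for objects of class table, so we do not need to call it by its full name.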

[1] 44.98991
all.equal(mean(tab), mean(vec))
[1] TRUE

Subsetting tables

One problem with using mean vs. mean.table is when you subset the table or perform an operation that causes it to lose the attribute of the class of table. For example, let’s say I want to estimate the mean of the data for values > 0:

mean(vec[vec > 0])
[1] 50.50371
over0 = tab[as.numeric(names(tab)) > 0]
[1] 90065.98
[1] 50.50371
[1] "array"

We see that after subsetting, over0 is an array and not a table, so mean no longer dispatches to mean.table; it treats the frequencies as data, and the estimated mean is not correct. Calling mean.table directly gives the correct value, as it does not depend on the class of tab. Another way to circumvent this is to reassign a class of table to over0:

class(over0) = "table"
[1] 50.50371

This process requires the user to know what the class is of the object passed to mean, and may not be correct if the user changes the class of the object.

Aside on NA values

Let’s see what happens when there are NAs in the vector. We’ll put in 20 NA values:

navec = vec
navec[sample(length(navec), 20)] = NA
natab = table(navec, useNA="ifany")
nacdf = ecdf(navec)
[1] NA
[1] NA
# mean(nacdf)

We see that if we table the data with NA being a category, then any operation that returns NA if NA are present will return NA. For example, if we do a table on the data with the table option useNA="always", then the mean will be NA even though no NA are present in the original vector. Also, ecdf objects do not keep track of NA values after they are computed.

tab2 = table(vec, useNA="always")
[1] NA
nonatab = table(navec, useNA="no")
[1] 44.98993
mean(navec, na.rm=TRUE)
[1] 44.98993

If you are using tables for statistics, the equivalent of na.rm=FALSE is table(..., useNA="ifany") and the equivalent of na.rm=TRUE is table(..., useNA="no"). We also see that ecdf objects never show NAs. Although we said tables are sufficient statistics, that may not be entirely correct, depending on how you make the table when the data have missing values.
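
A small sketch of this correspondence on a toy vector. The helper tab_mean is a hypothetical stand-in for the table mean and is given a different name so that it does not rely on S3 dispatch:

```r
x = c(1, 2, 2, 3, NA)

tab_ifany = table(x, useNA = "ifany")  # NA gets its own bin
tab_no    = table(x, useNA = "no")     # NA values are dropped

# Hypothetical helper: weighted mean of the bin values by their counts
tab_mean = function(tab) {
  vals = as.numeric(names(tab))
  sum(vals * as.numeric(tab)) / sum(tab)
}

print(tab_mean(tab_ifany))  # NA, like mean(x)
print(tab_mean(tab_no))     # 2, like mean(x, na.rm = TRUE)
```

The NA bin has a missing name, so the product of values and counts is missing and the result propagates NA just as mean(x) would.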

Mean benchmark

Let’s benchmark the mean function, assuming we have pre-computed the table:

microbenchmark(mean(tab), mean(vec), times = 10L)
Unit: relative
      expr      min       lq     mean   median       uq      max neval cld
 mean(tab)   1.0000   1.0000   1.0000   1.0000   1.0000  1.00000    10  a 
 mean(vec) 374.0648 132.3851 111.2533 104.7355 112.7517 75.21185    10   b

Again, if we have the table pre-computed, then estimating means is much faster using the table.

Getting standard deviation

The mean example may be misleading when we try sd on the table:

[1] 32.04476
[1] 302.4951

These are not even remotely close. This is because sd is operating on the table as if it were a vector and not a frequency table.

Note, we cannot calculate sd from the ecdf object:

Error in as.double(x): cannot coerce type 'closure' to vector of type 'double'

SD and Variance for frequency table

We will create a function to run sd on a table:

var.table = function(tab){
  m = mean(tab)
  uvals = unique(as.numeric(names(tab)))
  n = sum(tab)
  sq = (uvals - m)^2
  ## sum of squared terms
  var = sum(sq * tab) / (n-1)
  return(var)
}
sd.table = function(tab){
  sqrt(var.table(tab))
}
[1] 32.04476

We create the mean, get the squared differences, and sum these up (sum(sq * tab)) , divide by n-1 to get the variance and the sd is the square root of the variance.
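
The arithmetic can be verified by hand on a tiny vector. This sketch uses a made-up vector y and plain variables rather than the functions above:

```r
y = c(1, 1, 2, 4, 4, 4)
ytab = table(y)

vals = as.numeric(names(ytab))  # unique values: 1, 2, 4
freq = as.numeric(ytab)         # their counts:  2, 1, 3
n = sum(freq)

m = sum(vals * freq) / n                 # frequency-weighted mean
v = sum(freq * (vals - m)^2) / (n - 1)   # sample variance from the table

print(all.equal(v, var(y)))
print(all.equal(sqrt(v), sd(y)))
```

Each squared deviation is counted once per occurrence of its value, which is why multiplying by the frequencies reproduces the usual sum over all observations.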

Benchmarking SD

Let’s similarly benchmark the data for sd:

microbenchmark(sd.table(tab), sd(vec), times = 10L)
Unit: relative
          expr      min       lq    mean   median       uq      max neval
 sd.table(tab)   1.0000   1.0000   1.000    1.000   1.0000   1.0000    10
       sd(vec) 851.8676 952.7785 847.225 1142.225 732.3427 736.2757    10

Mode of distribution

Another statistic we may want for tabular data is the mode. We can simply find the maximum frequency in the table. The multiple option returns multiple values if there is a tie for the maximum frequency.

mode.table = function(tab, multiple = TRUE){
  uvals = unique(as.numeric(names(tab)))
  ind = which.max(tab)
  if (multiple){
    ind = which(tab == max(tab))
  }
  uvals[ind]
}
[1] 36

Memory of each object

We wish to simply show the memory profile for using a table versus the entire vector:

format(object.size(vec), "Kb")
[1] "39062.5 Kb"
format(object.size(tab), "Kb")
[1] "7.3 Kb"
round(as.numeric(object.size(vec) / object.size(tab)))
[1] 5348

We see that the table is much smaller than the vector. Therefore, computing and storing summary tables for integer data can be much more efficient.


Tables are computationally expensive to create. If tables are pre-computed for integer data, however, then statistics can be calculated quickly and accurately, even if NAs are present. These tables are also much smaller in memory, so they can be stored with less space. This may be important to keep in mind for computing on and storing large vectors in the future.