A Faster Scale Function

Problem Setup

In recent question on LinkedIn’s R user group, a user asked “How to normalize by the row sums of the variable?”. Now first, we must define what we mean by “normalize” a matrix/data.frame.

One way to standardize/normalize a row is to subtract by the mean and divide by the max to put the data into the [0, 1] domain. Many times, however, users want to perform z-score normalization by doing:
(x – μ)/σ
where μ/σ are the mean/standard deviation of the column (typically the column).

The answer on that forum eventually came down to using the scale command. The scale function is great. It is simple and has 2 options other than passing in the matrix:

  1. Center – should the columns have their mean subracted off?
  2. Scale – should the columns have their standard deviation divided after/not centering?

Overall, the function is fast, but it can always be faster.

The matrixStats package has some very quick operations for row/column operations and computing statistics along these margins.

Creating Some Data

Here we will create a random normal matrix with each column having a mean of 14 and a standard deviation of 5.

set.seed(1)
mat = matrix(rnorm(1e7, mean = 14, sd = 5),
    nrow = 1e5)
dim(mat)

How fast is scale?

Let’s see how fast the scale function is:

system.time({
    sx = scale(mat)
    })
   user  system elapsed
  0.971   0.285   1.262

That’s pretty fast! Overall, there is not much room for improvement in this case, but it may be relevant if you have a lot of matrices or ones bigger than the one defined here in mat.

Defining a new function

First, we must load in the matrixStats package and the only function we really are using is the colSds.

library(matrixStats)

colScale = function(x,
    center = TRUE,
    scale = TRUE,
    add_attr = TRUE,
    rows = NULL,
    cols = NULL) {

    if (!is.null(rows) && !is.null(cols)) {
        x <- x[rows, cols, drop = FALSE]
    } else if (!is.null(rows)) {
        x <- x[rows, , drop = FALSE]
    } else if (!is.null(cols)) {
        x <- x[, cols, drop = FALSE]
    }

  ################
  # Get the column means
  ################
    cm = colMeans(x, na.rm = TRUE)
  ################
  # Get the column sd
  ################
    if (scale) {
        csd = colSds(x, center = cm)
    } else {
        # just divide by 1 if not
        csd = rep(1, length = length(cm))
    }
    if (!center) {
        # just subtract 0
        cm = rep(0, length = length(cm))
    }
    x = t( (t(x) - cm) / csd )
    if (add_attr) {
        if (center) {
            attr(x, "scaled:center") <- cm
        }
        if (scale) {
            attr(x, "scaled:scale") <- csd
        }
    }
    return(x)
}

Let’s break down what the function is doing:

  1. The function takes in a matrix x with options:
    1. subsetting rows or cols
    2. center each column (by the mean) or not
    3. scale each column (by the standard deviation) or not
    4. Add the attributes of center/scale, so they match the scale output.
  2. The functions subsets the matrix if options are passed.
  3. Column means are calculated
  4. Column standard deviations are calculated (using the colmeans) if scale = TRUE or simply set to 1 if scale = FALSE.
  5. If the data is not to be centered, the centers are set to 0.
  6. The data is transposed and the mean is subtracted then the result is divded by the standard deviation. The data is transposed back.
    • The reason this is done is because R operates column-wise. Let p be the number of columns. The column means/sds are of length p. If one simply subtracted the column means, R would try to do this to each individual column. For instance, it would recycle the p numbers to get to length n (number of rows), and do that subtraction, which is not what we want.
  7. The attributes are added to the matrix to match scale output.

colScale timing

Now we can see how fast the colScale command would take:

system.time({
    csx = colScale(mat)
})
   user  system elapsed
  0.231   0.039   0.271

This is a lot faster than the scale function. First and foremost, let us make sure that these give the same results:

all.equal(sx, csx)
[1] TRUE

Better benchmarking

OK, we found that we can speed up this operation, but maybe this was a one-off event. Let’s use the microbenchmark package to

library(microbenchmark)
mb = microbenchmark(colScale(mat), scale(mat), times = 20, unit = "s")
print(mb)
Unit: seconds
          expr       min        lq      mean    median        uq      max
 colScale(mat) 0.2738255 0.3426157 0.4682762 0.3770815 0.4872505 1.844507
    scale(mat) 1.2945400 1.5511671 1.9378106 1.9226087 2.2731682 2.601223
 neval cld
    20  a
    20   b

We can visualize the results using ggplot2 and some violin plots.

library(ggplot2)
g = ggplot(data = mb, aes(y = time / 1e9, x = expr)) + geom_violin() + theme_grey(base_size = 20) + xlab("Method") + ylab("Time (seconds)")
print(g)

plot of chunk gg

What about scaling rows!

If you note above, we did not standardize the matrix with respect to the rows, but rather the columns. We can perform this simply by transposing the matrix, running scale and then transposing the matrix back:

system.time({
  scaled_row = t( scale(t(mat)) )
})
   user  system elapsed
  2.165   0.624   3.398
all(abs(rowMeans(scaled_row)) < 1e-15)
[1] TRUE

Again, we can do the same thing with colScale:

system.time({
  colscaled_row = t( colScale(t(mat)) )
})
   user  system elapsed
  0.426   0.097   0.542
all(abs(rowMeans(colscaled_row)) < 1e-15)
[1] TRUE
all.equal(colscaled_row, scaled_row)
[1] TRUE

And we see the results are identical

Creating rowScale

The above results are good for what we would like to do. We may want to define the rowScale function (as below), where we do not have to do the transposing and transposing back, as this takes may take some extra time.

Again, if we’re about improving speed, this may help.

rowScale = function(x,
    center = TRUE,
    scale = TRUE,
    add_attr = TRUE,
    rows = NULL,
    cols = NULL) {

    if (!is.null(rows) && !is.null(cols)) {
        x <- x[rows, cols, drop = FALSE]
    } else if (!is.null(rows)) {
        x <- x[rows, , drop = FALSE]
    } else if (!is.null(cols)) {
        x <- x[, cols, drop = FALSE]
    }

  ################
  # Get the column means
  ################
    cm = rowMeans(x, na.rm = TRUE)
  ################
  # Get the column sd
  ################
    if (scale) {
        csd = rowSds(x, center = cm)
    } else {
        # just divide by 1 if not
        csd = rep(1, length = length(cm))
    }
    if (!center) {
        # just subtract 0
        cm = rep(0, length = length(cm))
    }
    x = (x - cm) / csd
    if (add_attr) {
        if (center) {
            attr(x, "scaled:center") <- cm
        }
        if (scale) {
            attr(x, "scaled:scale") <- csd
        }
    }
    return(x)
}

Now let’s see how we do with rowScale:

system.time({
  rowscaled_row = rowScale(mat)
})
   user  system elapsed
  0.174   0.016   0.206
all(abs(rowMeans(rowscaled_row)) < 1e-15)
[1] TRUE
all.equal(rowscaled_row, scaled_row)
[1] TRUE

Let’s look at the times for this breakdown using microbenchmark:

mb_row = microbenchmark(t( colScale(t(mat)) ),
                        t( scale(t(mat)) ),
                        rowScale(mat),
                        times = 20, unit = "s")
print(mb_row)
Unit: seconds
                expr       min        lq      mean    median        uq
 t(colScale(t(mat))) 0.4009850 0.4653892 0.6303221 0.6659232 0.7422429
    t(scale(t(mat))) 1.7889625 2.0211590 2.4763732 2.1928348 2.6543272
       rowScale(mat) 0.1665216 0.1789968 0.2688652 0.2228373 0.3413327
       max neval cld
 0.9008130    20  a
 5.0518348    20   b
 0.5138103    20  a

and visualize the results:

g %+% mb_row

plot of chunk gg_row

Conclusions

Overall, normalizing a matrix using a z-score transformation can be very fast and efficient. The scale function is well suited for this purpose, but the matrixStats package allows for faster computation done in C. The scale function will have different behavior as the code below from base::scale.default:

f <- function(v) {
  v <- v[!is.na(v)]
  sqrt(sum(v^2)/max(1, length(v) - 1L))
}
scale <- apply(x, 2L, f)

If the data is not centered and center = FALSE, the data will be divided by the squared sum of each column (divided by n-1). This may be the desired behavior, but the user may want to divide by the standard deviation and not this squared sum and colScale/rowScale can do that if necessary. I will talk to Henrik Bengtsson (matrixStats author/maintainer) about incorporating these into matrixStats for general use. But for now, you can use the above code.

How I build up a ggplot2 figure

Recently, Jeff Leek at Simply Statistics discussed why he does not use ggplot2. He notes “The bottom line is for production graphics, any system requires work.” and describes a default plot that needs some work:

library(ggplot2)
ggplot(data = quakes, aes(x = lat,y = long,colour = stations)) + geom_point()

plot of chunk plot

To break down what is going on, here is what R interprets (more or less):

  1. Make a container for data ggplot.
  2. Use the quakes data.frame: data = quakes.
  3. Map certain “aesthetics” with the aes to three different aesthetics (x, y, z) to certain variables from the dataset lat, long, stations, respectively.
  4. Add a layer of geometric things, in this case points (geom_point).

Implicitly, ggplot2 notes that all 3 aesthetics are continuous, so maps them onto the plot using a “continuous” scale (color bar). If stations were a factor or character column, the plot would not have a color bar but a “discrete” scale.

Now, Jeff goes on to describe elements he believes required to make this plot “production ready”:

  1. make the axes bigger
  2. make the labels bigger
  3. make the labels be full names (latitude and longitude, ideally with units when variables need them
  4. make the legend title be number of stations reporting

As such, I wanted to go through each step and show how you can do each of these operations

Make the Axes/Labels Bigger

First off, let’s assign this plot to an object, called g:

g = ggplot(data = quakes,
           aes(x = lat,y = long,colour = stations)) +
  geom_point()

Now, you can simply call print(g) to show the plot, but the assignment will not do that by default. If you simply call g, it will print/show the object (as other R objects do), and plot the graph.

Theme – get to know it

One of the most useful ggplot2 functions is theme. Read the documentation (?theme). There is a slew of options, but we will use a few of them for this and expand on them in the next sections.

Setting a global text size

We can use the text argument to change ALL the text sizes to a value. Now this is where users who have never used ggplot2 may be a bit confused. The text argument (input) in the theme command requires that text be an object of class element_text. If you look at the theme help it says “all text elements (element_text)”. This means you can’t just say text = 5, you must specify text = element_text().

As text can have multiple properties (size, color, etc.), element_text can take multiple arguments for these properties. One of these arguments is size:

g + theme(text = element_text(size = 20))

plot of chunk bigger_axis

Again, note that the text argument/property of theme changes all the text sizes. Let’s say we want to change the axis tick text (axis.text), legend header/title (legend.title), legend key text (legend.text), and axis label text (axis.title) to each a different size:

gbig = g + theme(axis.text = element_text(size = 18),
                 axis.title = element_text(size = 20),
                 legend.text = element_text(size = 15),
                 legend.title = element_text(size = 15))
gbig

plot of chunk bigger_axis2

Now, we still have the plot g stored, but we make a new version of the graph, called gbig.

Make the Labels to be full names

To change the x or y labels, you can just use the xlab/ylab functions:

gbig = gbig + xlab(&quot;Latitude&quot;) + ylab(&quot;Longitude&quot;)
gbig

plot of chunk lab_full

We want to keep these labels, so we overwrote gbig.

Maybe add a title

Now, one may assume there is a main() function from ggplot2 to give the title of the graph, but that function is ggtitle(). Note, there is a title command in base R, so this was not overwritten. It can be used by just adding this layer:

gbig + ggtitle(&quot;Spatial Distribution of Stations&quot;)

plot of chunk title

Note, the title is smaller than the specified axes label sizes by default. Again if we wanted to make that title bigger, we can change that using theme:

gbig +
  ggtitle(&quot;Spatial Distribution of Stations&quot;) +
  theme(title = element_text(size = 30))

plot of chunk big_title

I will not reassign this to a new graph as in some figures for publications, you make the title in the figure legend and not the graph itself.

Making a better legend

Now let’s change the header/title of the legend to be number of stations. We can do this using the guides function:

gbigleg_orig = gbig + guides(colour = guide_colorbar(title = &quot;Number of Stations Reporting&quot;))
gbigleg_orig

plot of chunk leg

Here, guides takes arguments that are the same as the aesthetics from before in aes. Also note, that color and colour are aliased so that you can spell it either way you want.

I like the size of the title, but I don’t like how wide it is. We can put line breaks in there as well:

gbigleg = gbig + guides(colour = guide_colorbar(title = &quot;Number\nof\nStations\nReporting&quot;))
gbigleg

plot of chunk leg2

Ugh, let’s also adjust the horizontal justification, so the title is centered:

gbigleg = gbigleg +
  guides(colour = guide_colorbar(title = &quot;Number\nof\nStations\nReporting&quot;,
                                 title.hjust = 0.5))
gbigleg

plot of chunk leg_adjust

That looks better for the legend, but we still have a lot of wasted space.

Legend IN the plot

One of the things I believe is that the legend should be inside the plot. In order to do this, we can use the legend.position from the themes:

gbigleg +
  theme(legend.position = c(0.3, 0.35))

plot of chunk leg_inside

Now, there seems can be a few problems here:

  1. There may not be enough place to put the legend
  2. The legend may mask out points/data

For problem 1., we can either 1) make the y-axis bigger or the legend smaller or a combination of both. In this case, we do not have to change the axes, but you can use ylim to change the y-axis limits:

gbigleg +
  theme(legend.position = c(0.3, 0.35)) +
  ylim(c(160, max(quakes$long)))

plot of chunk change_ylim

I try to not do this as area has been added with no data information. We have enough space, but let’s make the legend “transparent” so we can at least see if any points are masked out and to make the legend look a more inclusive part of the plot.

Making a transparent legend

I have a helper “function” transparent_legend that will make the box around the legend (legend.background) transparent and the boxes around the keys (legend.key) transparent as well. Like text before, we have to specify boxes/backgrounds as an element type, but these are rectangles (element_rect) compared to text (element_text).

transparent_legend =  theme(
  legend.background = element_rect(fill = &quot;transparent&quot;),
  legend.key = element_rect(fill = &quot;transparent&quot;,
                            color = &quot;transparent&quot;)
)

One nice thing is that we can save this as an object and simply “add” it to any plot we want a transparent legend. Let’s add this to what we had and see the result:

gtrans_leg = gbigleg +
  theme(legend.position = c(0.3, 0.35)) +
  transparent_legend
gtrans_leg

plot of chunk leg_inside2

Moving the title of the legend

Now, everything in gtrans_leg looks acceptable (to me) except for the legend title. We can move the title of the legend to the left hand side:

gtrans_leg + guides(colour = guide_colorbar(title.position = &quot;left&quot;))

plot of chunk leg_left

Damnit! Note, that if you respecify the guides, you must make sure you do it all in one shot (easiest way):

gtrans_leg + guides(
  colour = guide_colorbar(title = &quot;Number\nof\nStations\nReporting&quot;,
                          title.hjust = 0.5,
                          title.position = &quot;left&quot;))

plot of chunk leg_left_correct

A little more advanced

The last statement is not entirely true, as we could dig into the ggplot2 object and assign a different title.position property to the object after the fact.

gtrans_leg$guides$colour$title.position = &quot;left&quot;
gtrans_leg

plot of chunk respec

“I don’t like that theme”

Many times, I have heard people who like the grammar of ggplot2 but not the specified theme that is default. The ggthemes package has some good extensions of theme from ggplot2, but there are also a bunch of themes included in ggplot2, which should be specified before changing specific elements of theme as done above:

g + theme_bw()

plot of chunk themes

g + theme_dark()

plot of chunk themes

g + theme_minimal()

plot of chunk themes

g + theme_classic()

plot of chunk themes

Conclusions

I agree that ggplot2 can deceive new users by making graphs that look “good”-ish. This may be a detriment as they may believe they are good enough, when they truly need to be changed. The changes are available in base or ggplot2 and the overall goal was to show how the recommendations can be achieved using ggplot2 commands.

Below, I discuss some other aspects of the post, where you can use ggplot2 to make quick-ish exploratory plots. I believe, however, that ggplot2 is not the fastest for quick basic exploratory plots. What is is better than base graphics is for making slightly more complex exploratory plots that are necessary for analysis, where base can take more code to do.

How to make quick exploratory plots

I agree with Jeff that the purpose of exploratory plots should be done quickly and a broad range of plots done with minimal code.

Now, I agree that plot is a great function. I do believe that you can create many quick plots using ggplot2 and can be faster than base in some instances. A specific case would be that you have a binary y variable and multiple continous x variables. Let’s say I want to plot jittered points, a fit from a binomial glm (logistic regression), and one from a loess.

Here we will use mtcars and say if the car is automatic or manual (am variable) is our outcome.

g = ggplot(aes(y = am), data = mtcars) +
  geom_point(position = position_jitter(height = 0.2)) +
  geom_smooth(method = &quot;glm&quot;,
              method.args = list(family = &quot;binomial&quot;), se = FALSE) +
  geom_smooth(method = &quot;loess&quot;, se = FALSE, col = &quot;red&quot;)

Then we can simply add the x variables as aesthetics to look at each of these:

g + aes(x = mpg)

plot of chunk unnamed-chunk-2

g + aes(x = drat)

plot of chunk unnamed-chunk-2

g + aes(x = qsec)

plot of chunk unnamed-chunk-2

Yes, you can create a function to do the operations above in base, but that’s 2 sides of the same coin: function versus ggplot2 object.

 

All code is located on my GitHub for my blog

Dealing with Imposter Syndrome in Graduate School

In my post of recommendations for first-year students, I discussed some tips and viewpoints to help the practical, pragmatic aspects about being a first year student. In this post, I'd like to discuss the common misconceptions/viewpoints that are destructive to new students.

The Dunning-Kruger effect

I know something, so everyone else is dumb

You just learn about p-values and their problems. OMG someone over there uses them? They are so dumb and don't understand anything. Why can't everyone be as smart as you? Whey can't people just “get it”? Have you ever felt this or known someone who sounds like this?

Let me introduce the Dunning-Kruger effect. In short, it describes that the unskilled are unable to:

recognize their own ineptitude and evaluate their own ability accurately.

Therefore, you learn something new about a field (e.g. statistics), and you feel pretty confident when talking to others. I'm not saying you're unskilled, but you may not know what's common in practice or the merits/pitfalls of a method or other methods. Many times new students will learn one thing, usually not that in-depth, and incorrectly think they've mastered that area. Moreover, they usually cite the same piece of information over and over, as they have few pieces of information to draw upon. This sometimes happens with newer students, but fades relatively quickly.

Everyone else is a genius

The equally important converse to the Dunning-Kruger effect is that:

highly skilled individuals may underestimate their relative competence, erroneously assuming that tasks that are easy for them also are easy for others.

After the first feeling fades, a student realizes how out of depth they were. This is the more damaging effect because then you start to feel like…

You are an Imposter; So am I

You are not good enough for your program. Everyone is better prepared and smarter than you. You don't deserve to be here. You're stupid. You are never going to get it. You should quit. Everyone is going to find you out. Then they'll make you leave your program.

If you read all of these statements and some rang true, let me introduce you to Imposter Syndrome. Most students feel this way their first year. I felt this way. Many of my classmates felt this way. Some of us may still feel this way.

Why? Many new students have done previous programs with relative ease. They have been, or at least felt, like the smartest person in the room before. Now, in this new and highly-selective program, you are not the smartest. You may not even be close to the smartest.

Aside: when I say “smart”, I mean whatever criteria you're using for self-worth with respect to intelligence. I think work gets done by banging your head against the wall until something comes out. Being talented is helpful, but hard work gets results. But talent before may have been enough.

Also, if you have tied your superior intelligence to your identity, you've now lost it. If people are smarter than you, then who are you compared to them? How can you combat this feeling? Spoiler: stop comparing to the wrong people.

Comparing Yourself to the Correct Distribution

Many times, people forget what got them there. They worked hard (even relatively) to do the prerequisites, fill out the forms, do the undergraduate research, and make the move to the new department. Much of that hard work is overlooked by new (and even some more senior) students. They are much like Ricky Bobby: “If you're not first your last”. But that's ridiculous “you can be "second, third, fourth, hell, even fifth”.

The wrong distribution

I'm not saying to not strive for being the best. Strive for that, but compare yourselves to the right distribution. Many students compare themselves like this:

plot of chunk unnamed-chunk-1

As seen from this, you are at the low end of the distribution. You are likely not the best first year, feel miles behind a 5th-year student, and a lifetime behind a faculty member. How can you ever become like these people?

There's this saying “if you see it you can be it”. Conversely, if you can't see it, then you don't think you will be it. This is important because if you want to be a faculty member, you must realize every year you are getting one step closer to that role. You have to see yourself in that role.

But right now, you feel like you don't know anything about your field. But this is the WRONG DISTRIBUTION for comparison.

The correct distribution

As said above in the plot title, this is a conditional distribution of knowledge in your field. This distribution compares you as a brand-new first-year student to those who have worked in the field for an entire undergrad degree (the top first-year students), at least 5 years of work and research (5th-year student), or for > 10 years/a lifetime of work (most faculty). Of course you're going to feel inadequate. By construction, you are near the lowest part of the distribution. You're setting yourself up to be the worst.

You should compare yourself to full distribution:

plot of chunk correct_distribution

That's more like it! This is more representative comparison of your skills. The average graduate student likely knows a bit more about your field than everyone else, but notice where YOU are in this distribution. Now yes, the faculty is still out there, but it's relatively closer. With each year, you get closer to that upper tail of the distribution. Also, most people keep concentrating on the right-hand tail whereas they forget about the majority of the distribution is to your left. You know things about your field, more than the majority of people.

At the end of the day, you need to compare yourself to the full distribution, not the conditional one. Just because you are not the smartest/best when you start, that's to be expected. You can't know everything over night. The most important message is to not get discouraged when you first start. Things are confusing and hard, but they get better. Just keep going.

And remember one thing about this whole mental exercise of comparison:

It's not to make you feel better than others. It's to make you feel adequate about yourself and your skills.

Making others feel less worthy or make yourself feel as though you're “better” than another will inevitably cause this same crisis of self when that identity is challenged. Stop doing that. Just do you.

Adapt a different mindset

Many of the issues above are discussed in the book Mindset: The New Psychology of Success. I have some copies of the book if you'd like to read it. I highly recommend it.

The long and short of the book's message is that overcoming ideas such as imposter syndrome comes out of adapting a new mindset. Here are some examples of how the mindsets differ: link 1, link 2 from here, and link 3

In a fixed mindset, someone believes that talents are fixed and unchangeable. Either you are smart or your not. If you aren't smart, then sucks for you because you can't change it. The other, and recommended viewpoint of the author, is the growth mindset, where one believes:

their qualities as things that can be developed through their dedication and effort.

You're not good at one area? Well try hard and change it. Stop worrying you're not good enough and get to work.

Andrew Gelman had an interesting post recently about the book's replicability, so I suggest reading over there for some details. If the books helps you cognitively break down some walls I think that's great, but I always would like to see the evidence. If you want to track your progress post reading, I'd love to hear it!

The battle is not only with new students

Let me be clear that this is not an issue for only new students. Although older students get better at determining their position in the distribution, they then fall to the same issue comparing themselves to faculty in the tails.

I hear students claim many times that they're not ready to become professors. I've spoken to many tenured professors and a lot of them say that a post-doc position is a great gig (at least in biostats). The arguments are that you get the freedom that a assistant faculty has, but not as much of the responsibilites. I agree with this sentiment, and think those are great reasons to do a post-doc.

But I feel as though some students do not think they can be a faculty member because of their CV and number of publications. Not wanting to advise students, not ready to build a class, not ready to write grants, not sure about a new city for long-term – these are great reasons to not want to become a faculty member. Not having “enough” publications is not necessarily a good one. Yes, some places may not invite you to speak based on your CV. Some will invite you but not give you an offer. If there is a penalty for trying to get an interview, not getting it, doing a post-doc, and then re-applying in 2 years, then that system is broken. The only penalty should be that there are no positions and the timing was no longer right. (Assuming you didn't do something off the rails the first interview).

Compare yourself to the correct distribution

One of the reasons I feel that people say they do not have “enough” publications is either because they are 1) comparing themselves to people other than assistant professors, or 2) they are comparing themselves to assistant professors NOW compared to when they started.

Here's an exercise I like to go through. Go to a website of a place you want to apply to, find some assistant professors. Go to their CVs and look the year they graduated with their PhD. Go to their publications and look at those that came out in that year + 1. That's them when they graduated with their PhD, including published, in-press, and current work. If you are greatly behind them, then yes, you may not have compared well against them. But remember, that's them compared to you on the same playing field. And remember, they were like you and they got the job. So give it a shot if that's what you want, but don't let your incorrect comparison cripple you with feeling inadequate. That's not what you do, that's some imposter.

Recommendations for First Year Graduate Students

This blog post is a little late; I wanted to get it out sooner.

As new students have flooded the halls for the new terms at JHU Biostat, I figured I would give some recommendations to our new students, and biostatistics students in general. Some of these things may be specific to our department, but others are general, so the title should be fitting. Let's dive in!

First Term Things

Don't buy books

Some books are good for a reference, many are not. I say this because much of the information is available on Google or the internet and you will check that 98% of the time compared to going to a book. That being said, many students have these good reference books and will be willing to let you borrow them. Also, the library in your department or school will likely have them.

Recommendation

The full recommendation for books is this:

  1. Borrow books you need for class, especially from current (not new) students. Sharing books with current students is good except if you both need it during crucial times (like exams/comprehensive exams). Everyone has Chung or Billingsley.
  2. Of those you can't borrow, go to class for a week or two and see if you actually need it. Some professors go straight off their lecture notes. Your school bookstore doesn't just go and send back all their copies when school starts, so you can still get it. Also, I heard this new website Amazon has books.
  3. If you think a book is a really good reference, buy a copy. Better yet, buy a digital copy so you can digitally search and annotate it.

Get new gear/electronics

You will be spending the majority of your time on your laptop, so it better work and be fast. Most new programs will have some money for books and a laptop. If you read above, you saved some money on books, so use it to buy a new laptop. If your laptop is less than 2 years old, you can save that money (if PhD) or buy other electronics such as an iPad for notetaking (if Master's).

Have the tools to make your work easy because nothing is worse than you not getting work done due to other factors than yourself.

Recommendation

Get a Unix-like machine (aka Mac). Others say you can do stuff in Windows, but it's easier for some software in Unix. Cluster computing (see below) will be easier as well.

Side note: if you buy a new computer, do not open it until Friday afternoon/Saturday as you will likely spend a whole day playing with your new gear.

Learn who the staff is

I find many students know who the faculty are and what research they do, but have no idea about who the staff is. These people know almost everything you need to know for non-research help. They schedule meetings with the chair, organize events, schedule rooms, and, very importantly, know how to get you paid/your stipend. These people are the glue that makes everything run and are a great resource.

Recommendation

Go into the office and introduce yourself and ask what you should go to person X for. They will know you then when you email.

Research and Organizational Tips

Start doing research NOW

If you want to learn what research is all about, get involved early. Even if you don't feel like you know anything, waiting to get involved on a research project will not help. It can hinder you. I'm not saying work 10 hours per week on a project; you have classes.

Recommendation: Visit all the working group meetings before choosing

Attending research meetings of a few working groups can help you 1) get information on the group and how it's run, 2) meet the group members, 3) choose what you may want to focus on, and 4) get you a small-scale project to start on.

This small project is not set in stone. It is not your thesis. The project contact doesn't have to be your thesis advisor. You will likely be working on this “for free” (unless it's under a training grant mentor, technically). Therefore, you don't “owe” anyone anything if you decide in a month you hate that field or project. Don't take it lightly to abandon a project, but do use it as a feeler in that area.

Let me reiterate (at least in our department): Your academic advisor doesn't need to be your research advisor.

Recommendation: Learn how to code

Learn how to program as soon as possible. Some good resources are codeschool or code academy. If using R, I recommend first Try R from Codeschool. I would then move on to Swirl. It will never be a waste of time getting up to speed or learning how to do something new with programming. If you already feel great with R, you can try Python or move deeper into R.

Learn how to use a computing cluster

This may be necessary later in your program, but try to do it before it's “necessary”.

You are going to work on some project invariably that 1) will use simulations or 2) requires intense computation. As such, a computing cluster is made specifically for these scenarios. Learn how to use it. If you're not going to use it now for research, at least get your login and try it briefly for a class project.

Learn Modern Note-Taking Utilities and Back up your work

Condense your note-taking into one app. I like using Evernote as it syncs with my phone and Mac.

Use Dropbox or Google Drive to have a “hands-free” syncing service. Also, think about investing in an External Hard Drive, maybe as your new gear, to doubly back-up your system/data. Laptops can (and have been) stolen. Although Google Drive/DropBox are likely to be around for some time, you always want something in your control (external HDD) in case something goes wrong on a server. GitHub is great for version control, and some people use it as a back up of sorts, but it's not really for that and not a “hands free” rsync-based solution.

Learn a Markdown/Markup Language

Learn a Markdown language. Yihui has a good description of Markdown vs. LaTeX. You will need to know both. Think about learning some basics of HTML as well.

Make a Webpage

With your newfound HTML skills from above, build a webpage for yourself. Some use GitHub or WordPress. Many options exist, depending on your level of expertise, blogging capability, level of control.

Why do I need a webpage? You work on a computer (after classes) like 98% of the day. You should have a web presence.

What about my LinkedIn profile? That's good for a resume-like page, but not great for your opinions, picture uploads, general ideas. Also, your webpage allows you to control completely what you put out there. Remember, if you don't make the content, Google will pick what people see.

Check out student websites and ask the student whose you like best how they did it.

Student Life

Ask other students questions

One of my rules is to never be scared to ask a stupid question. I ask questions all the time. Some of them are stupid. I know that I won't get an answer if I don't ask though.

We have offices. Students are in those offices. Ask them questions. It's that simple.

Many students say “well I don't want to bother them”. I learned how to code by bothering people. I bothered them very much so. I thought I was annoying, but I didn't care because I didn't know what the hell I was doing.

Does that mean I want questions all day by new students? No. Read that again. No. But I do try to pay forward information to new students just like others paid towards me. If a student is curt or makes you feel stupid about asking a question, stop talking to them. They forgot what it was like when they were lost and confused and are likely now severely delusional.

Your questions are usually not new. We've asked them likely ourselves. We either have the answer or know who does. Ask.

Go to student-lead meetings

No one in my office knows anything!!? Who do I ask now? Well there are student-lead meetings. These have a lot of information and … other students! Go there, ask questions. If the topic is not what you need to know, wait until the end of the meeting when the structure breaks down and ask someone then.

Student-lead meetings have a lot less pressure to ask the “stupid questions” in a safer environment and will likely lead to answers that you understand. Because they are from other students.

Work with your cohort

Get chummy with your cohorts. You don't have to be best friends forever, but you will talk with them, have class with them, and likely work with them. Stop doing things on your own, that's not leveraging other people's brain for you as well.

These are other smart people (they were smarter than me). Why not work with them and grab some of that brainyness floating around. You will feel dumb for a while, but you'll figure it out. If you don't work with a group in the beginning, it may be too late later when people have grouped up.

They are not your competition, though many departments make it seem like that. The next stage of your career will be mixed with projects on team and the rare projects where you are alone (aka thesis). Learn how to play with, and more imporantly listen to, others.

Grades don't matter that much, learning the material does

“I came to grad school to get a 4.0” said no one ever. Grades are important for somewhat narrow things such as if the comprehensive exams go badly, are “an assessment” of your learning, or if you apply to a job with a Master's and they ask for your transcript (and for some reason care).

But good grades are not the goal of grad school. It's learning. Learn and understand the material. Learn how to learn new material. That's the goals. Grades matter in the sense they will let you know quite glaringly when you really don't know something. Remember learning is improving yourself and that should make it easier to do a project than just doing it “because someone told you to”.

Update: As pointed in the comments below, grades can matter greatly if you plan to apply to another program after your degree (e.g. PhD after doing your Master’s). If this may be in your future, make sure to keep an eye on your grades as well.

Life Recommendations

Take at least one day off per 7-day week

You need rest. Take it. A day off can clarify things later. Sometimes it's only when you stop hitting your head against the wall when you realize that what you're doing doesn't work. That's not to say you still won't work like 60 hours a week for a while, but make sure you have some protected time for your banging head.

Explore restaurants/food/night life

One of the best pieces of advice I've ever gotten for grad school was: “find a place you want to spend the next 5 years of your life” in reference to your department AND city. Whatever city your in has fun things to do. Find them. Explore your city and area. People tend to hate places they live in grad school if they don't associate anything with it other than working in a hole. Which leads me to…

Don't work in a hole; Find a happy place to work

Find a place where you are productive and like to go. I like the office; others don't. Find a coffee shop near you for days without class or when you are done classes. Use the reading room or other areas as your go to. Again, working somewhere you don't like is one more hurdle to getting things done. Get rid of such hurdles, you will have enough of them to make your own.

A better interactive neuroimage plotter in R

In a previous post, I described how you can interactively explore a 3D nifti object in R. I used the manipulate package, but the overall results were sluggish and not really usable.

I was introduced to a a good neuroimaging viewer called Mango, by a friend or two and use it somewhat inconsistently. One major advantage of this software is that it has been converted to a pure JavaScript library called Papaya. As such, you can create simple HTML files that have an embedded interactive image viewer.

Papayar

That's all fine and good, but I like my things in R. I understand many can easily write bash scripts that perform the same operations and use the HTML builder provided by papaya.

I want the operation in R for the same reasons I make many things for R:

  1. I use R
  2. Many statisticians like imaging but need tools they understand and they understand R
  3. I like writing pipelines and scripts in one language

My answer: the papayar package.

Install papayar

To install papayar, you can simply install from GitHub using the devtools package.

require(devtools)
devtools::install_github("muschellij2/papayar")

Papayar functions

The main function is papaya. Let's look at the arguments:

library(papayar)
formalArgs(papaya)
[1] "images" "outdir" "..."   

The images argument can be a list of nifti objects or a vector of character filenames. The outdir is where the HTML file is written. The default is to a temporary directory that will be trashed after the session is over. The additional arguments are passed to the lower-level function pass_papaya, which in turn are passed to functions httd and daemon_stop in the servr package. The pass_papaya function is useful, however, to open a blank papaya session by just invoking pass_papaya()

Papayar Example

As the httd function starts a server, the images can be rendered (and will be by default) in the RStudio viewer! In the terminal, it opens your default web browser. Here's a basic example:

library(oro.nifti)
x = nifti(img = array(rnorm(100^3), dim= rep(100, 3)), dim=rep(100, 3), datatype=16)
y = nifti(img = array(rbinom(100^3, prob = 0.5, size = 10), dim= rep(100, 3)), dim=rep(100, 3), datatype=16)
index.file = papaya(list(x, y))

The first 3 lines make some random arrays, from a normal and binomial distribution and puts them into a nifti object. The list of these nifti objects is passed in. The first image is displayed in grayscale and the second image is overlaid using red-hot colors and the opacity of this image can be changed. The object index.file will be a character filename where the HTML file is stored. The data and this HTML file is written to outdir (which, again, is a temporary directory by default).

Output

Below is a series of screen shots I took from the code above. You should be able to see this in RStudio or your browser:

Overlay

The main reason to use this is that you can click different areas for the crosshairs and move to a different point in axial, coronal, and sagittal space. Thus, this is truly interactive.

Here we can show there are limited (but useful) controls for the overlay. You can change the mapping of the values in the image and the overlay and the opacity of the overlay.
Overlay_controls

Brain Example

The above data has been used since everyone could test it, but it's just random numbers and doesn't look very compelling. Here I will show you the hyperintense voxels overlaid on the MNI 152 1mm T1 brain image click here for description, which correspond mainly to the white matter:

Brain_Overlay

Hopefully you can see how this can be useful for exploring data and results.

ITK-SNAP and itksnapr

Some of my colleagues are more partial to using ITK-SNAP for viewing images interactively. I have bundled the executables for ITK-SNAP into the R package itksnapr. The main function is itksnap, which you can specify images to different options to ITK-SNAP.

Install itksnapr

To install itksnapr, you can simply install from GitHub using the devtools package.

require(devtools)
devtools::install_github("muschellij2/itsnapr")
library(itksnapr)
itksnap(grayscale = x, overlay = y)

I haven't used ITK-SNAP much, but hear many good things about it. There are many better resources than this blog on how to use it and how to use it well. If interested in a good image viewer, I implore you to google around for some of these resources. If anyone has intense interest of image viewers and wants added functionality, don't hesitate to file an inssue on GitHub.

FSLView

Although it was included in my fslr package by default and I never discussed it in detail, FSLView is included with the distribution of FSL and is a viewer I use heavily. The fslr function is fslview. One specific advantage of using FSLView is that it can pass through X11 forwarding, so you can remotely view image from a cluster, though it may be slow.

Conclusion

Although I use the orthographic,image.nifti and overlay functions from oro.nifti for many of my figures, for interactive exploring of results, these can be somewhat slow for getting a large-scale view, and not a specific slice view. Therefore, a fully interactive neuroimaging plotter is necessary. Here are 3 options that can be accessed “within” R.

Rendering LaTeX Math Equations in GitHub Markdown

The Problem: GitHub README.md won't render LaTeX

I have many times wondered about getting LaTeX math to render in a README file on GitHub. Apparently, many others ( 1, 2, 3 ), have asked the same question.

The common answers are:

  1. It cannot (and in some cases, shouldn't) be done. GitHub parsing is done by SunDown and is secure, therefore won't do LaTeX.
  2. Use http://latex.codecogs.com/ or iTex2Img. These are good options, but 1) they may go away at any time, and 2) require you to rewrite your md file.
  3. Use unicode if possible.
  4. Use LaTeXIt (for Mac OS) or other converter to make your equations and embed them.

A hackey, but working solution

I opted to try a more generic solution for (4.) using some very hackey text parsing. I have done a bit of parsing in the past, but I was either too lazy to think about the right regex to do, couldn't think of it easily, or thought my solution was sufficient even if not elegant.

Caveat

Two main caveats abound:

  1. This only works for inline equations marked with dollar signs ($) or equations marked by double dollar signs ($$). I could encorporate other delimiters such as \[, but I did not. I only had a bit of time on Wednesday.
  2. I assume any code that involves dollar signs be demarcated by chunks starting with three backticks (“). I wrote this for R code, which can use dollar signs for referencing and never has double dollar signs. If your code does, no guarantees.
  3. This generally assumes you have a GitHub repository (have no idea what others use), and that you're OK with the figures being located in that GitHub repository. I didn't allow options for putting them in a sub-folder, but may incorporate that.
  4. Some text won't be sized correctly.

How do I do it already

I wrote an R package that would parse a README.md (or README.rmd if it's RMarkdown). The package is located at https://github.com/muschellij2/latexreadme.

You can install the package using:

library(devtools)
install_github("muschellij2/latexreadme")

You would then load the package:

library(latexreadme)

The main function is parse_latex. It's not the best function name for what it does, but I don't really care. Let's see it's arguments:

args(parse_latex)

You must put in a README file as the rmd argument. If the README has an rmd or Rmd extension, the README is first knitted using knit(rmd) and then the resultant md file is used. This md is located in a temporary directory and won't write to the directory of the README. The new_md is the filename for the output md file that you wish to create. One example would be rmd = "README_with_latex.md" and md = "README.md". The git_username and git_reponame must be specified with your username and repository name, respectively. The git_branch allows you to specify which branch you are on, if necessary. If you don't know what that means, just leave as master.

The rest of the arguments are for inserting the LaTeX into the document. The text_height is how large the LaTeX should be (this may be bad for your document), the insert_string is the HTML the LaTeX is subbed for, the raw_git_site uses https://rawgit.com to reference the figures directly with proper content-type headers (so that they show up). The bad_string is something I'm using in the code. You only need to change bad_string if you happen to have text in your README that matches this (should be rare as they are a bunch of Z's, unless you write like someone sleeping). I'll get to the ... in a minute.

I still don't get it – show me an example

I thought you'd never ask. The parse_latex command has an example from one of my other repos and you can run it as follows (need curl):

rmd = file.path(tempdir(), "README_unparse.rmd")
download.file(
"https://raw.githubusercontent.com/muschellij2/Github_Markdown_LaTeX/master/README_unparse.rmd",
destfile = rmd, method = "curl")
new_md = file.path(tempdir(), "README.md")
parse_latex(rmd,
            new_md,
            git_username = "muschellij2",
            git_reponame = "Github_Markdown_LaTeX")
library(knitr)
new_html = pandoc(new_md, format = "html")

And you can view the html using browseURL:

browseURL(new_html)

You can see the output of the example (only a little bit of LaTeX) at this repo: https://github.com/muschellij2/Github_Markdown_LaTeX or at Kristin Linn's README, which was used as an example here: https://github.com/kalinn/IPW-SVM/blob/master/README_2.md

What is the function actually doing

So what is the function actually doing? Something convoluted I can assure you. The process is as follows:

  1. Find the equations using ($$ and $) parse them out, throwing out any code demarcated with backticks (”).
  2. Put this LaTeX into a simple LaTeX document with \begin{document}. Note, the ... argument can be a character vector of other packages to load in that document. See png_latex documentation.
  3. Run pdflatex on the document. Note, this must be in your path. This creates a PDF.
  4. Run knitr::plot_crop on this document. This will crop out anything that's not the LaTeX equation you wanted.
  5. Convert the PDF to a PNG using animation::im.convert. This is so that they will render in the README. The file will be something like eq_no_01.png in the same folder as the rmd argument.
  6. Replace all the LaTeX with the insert_string, which is raw HTML now.
  7. Write out the parsed md file, which was named using new_md.

Wow – that IS convoluted

My best shot in one day. If you have better solutions, please post in the comments.

Nothing shows up! Read this

NB: The replacement looks for equations (noted by eq_noSOMETHING.png) in your online GitHub repository. If you run this command and don't push these png files, then nothing will show up.

Conclusions

You can have LaTeX “rendered” in a GitHub README file! The sizes of the text may be weird. This is due to the cropping. I could probably use some bounding box or better way to get only the equations, but I didn't. If you want to help, please sumbit a Pull Request to my repository and I'd gladly merge it if it works.

NB: GitHub may override a README.md if a README.rmd (or README.Rmd) exists. I'm not 100% sure on that, but if that's the case, rename the Rmd and just have README.md.

Happy parsing!

#rstats Make arrays into vectors before running table

Setup of Problem

While working with nifti objects from the oro.nifti, I tried to table the values of the image. The table took a long time to compute. I thought this was due to the added information about a medical image, but I found that the same sluggishness happened when coercing the nifti object to an array as well.

Quick, illustrative simulation

But, if I coerced the data to a vector using the c function, things were much faster. Here's a simple example of the problem.

library(microbenchmark)
dim1 = 30
n = dim1 ^ 3
vec = rbinom(n = n, size = 15, prob = 0.5)
arr = array(vec, dim = c(dim1, dim1, dim1))
microbenchmark(table(vec), table(arr), table(c(arr)), times = 100)
Unit: milliseconds
          expr       min        lq      mean    median        uq      max
    table(vec)  5.767608  5.977569  8.052919  6.404160  7.574409 51.13589
    table(arr) 21.780273 23.515651 25.050044 24.367534 25.753732 68.91016
 table(c(arr))  5.803281  6.070403  6.829207  6.786833  7.374568  9.69886
 neval cld
   100  a 
   100   b
   100  a 

As you can see, it's much faster to run table on the vector than the array, and the coercion of an array to a vector doesn't take much time compared to the tabling and is comparable in speed.

Explanation of simulation

If the code above is clear, you can skip this section. I created an array that was 30 × 30 × 30 from random binomial variables with half probabily and 15 Bernoulli trials. To keep things on the same playing field, the array (arr) and the vector (vec) have the same values in them. The microbenchmark function (and package of the same name) will run the command 100 times and displays the statistics of the time component.

Why, oh why?

I've looked into the table function, but cannot seem to find where the bottleneck occurs. Now, for and array of 30 × 30 × 30, it takes less than a tenth of a second to compute. The problem is when the data is 512 × 512 × 30 (such as CT data), the tabulation using the array form can be very time consuming.

I reduced the replicates, but let's show see this in a reasonable image dimension example:

library(microbenchmark)
dims = c(512, 512, 30)
n = prod(dims)
vec = rbinom(n = n, size = 15, prob = 0.5)
arr = array(vec, dim = dims)
microbenchmark(table(vec), table(arr), table(c(arr)), times = 10)
Unit: seconds
          expr      min       lq     mean    median        uq       max
    table(vec) 1.871762 1.898383 1.990402  1.950302  1.990898  2.299721
    table(arr) 8.935822 9.355209 9.990732 10.078947 10.449311 11.594772
 table(c(arr)) 1.925444 1.981403 2.127866  2.018741  2.222639  2.612065
 neval cld
    10  a 
    10   b
    10  a 

Conclusion

I can't figure out why right now, but it seems that coercing an array (or nifti image) to a vector before running table can significantly speed up the procedure. If anyone has any intuition why this is, I'd love to hear it. Hope that helps your array tabulations!