ggplot2 is not ALWAYS the answer: it’s not supposed to be

Recently, the topic was propose at tea time:

Why should I switch over to ggplot2? I can do everything in base graphics.

I have heard this argument before and I understand it for the most part. Many people have learned R before ggplot2 came on the scene (as well as many other packages) and learned to do all the things they needed to. Some don’t understand the new syntax for plotting and argue the learning curve is not worth the effort. Also, many say it’s not straightforward to customize.

As the discussion progressed, we discussed 10 commonly made base plots, and I would recreate them in ggplot2. The goal was to help base users break into ggplot2 if they wanted, and also see if they were “as easy” as doing them in base. I want to discuss my results and the fact that ggplot2 is not ALWAYS the answer, nor was it supposed to be.

First Plot – Heatmap

The first example discussed was a heatmap that had 100,000 rows and 100 columns with a good default color scheme. As a comparison, I used heatmap in base R. Let’s do only 10,000 to start:

N = 1e4
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
colnames(mat) = paste0("Col", seq(ncol(mat)))
rownames(mat) = paste0("Row", seq(nrow(mat)))
system.time({heatmap(mat)})

   user  system elapsed 
 29.996   1.032  31.212 

For a heatmap in ggplot2, I used geom_tile and wanted to look at the results. Note, ggplot2 requires the data to be in “long” format, so I had to reshape the data.

library(reshape2)
library(ggplot2)
df = melt(mat, varnames = c("row", "col"))
system.time({
  print({ 
    g= ggplot(df, aes(x = col, y = row, fill = value)) + 
      geom_tile()}) 
  })

   user  system elapsed 
  9.519   0.675  10.394 

One of the problems is that heatmap does clustering by default, so the comparison is not really fair (but still using “defaults” – as was specified in the discussion). Let’s do the heatmap again without the clustering.

system.time({heatmap(mat, Rowv = NA, Colv = NA)})

   user  system elapsed 
  0.563   0.035   0.605 

Which one looks better? I’m not really sure, but I do like the red/orange coloring scheme – I can differences a bit better. Granted, there shouldn’t be many differences as the data is random. The ggplot graph does have a built-in legend of the values, which is many times necessary. Note, the rows of the heatmap are shown as columns and the columns shown as rows, a usually well-known fact for users of the image function. The ggplot graph plots as you would a scatterplot – up on x-axis and right on the y-axis are increasing. If we want to represent rows as rows and columns as columns, we can switch the x and y aesthetics, but I wanted as close to heatmap as possible.

Don’t factor the rows

Note, I named the matrix rows and columns, if I don’t do this, the plotting will be faster, albeit slightly.

N = 1e4
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
df = melt(mat, varnames = c("row", "col"))

system.time({heatmap(mat, Rowv = NA, Colv = NA)})

   user  system elapsed 
  0.642   0.017   0.675 
system.time({
  print({ 
    g= ggplot(df, aes(x = col, y = row, fill = value)) + 
      geom_tile()}) 
  })

   user  system elapsed 
  9.814   1.260  11.141 

20,000 Observations

I’m going to double it and do 20,000 observations and again not do any clustering:

N = 2e4
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
colnames(mat) = paste0("Col", seq(ncol(mat)))
rownames(mat) = paste0("Row", seq(nrow(mat)))
df = melt(mat, varnames = c("row", "col"))

system.time({heatmap(mat, Rowv = NA, Colv = NA)})

   user  system elapsed 
  1.076   0.063   1.144 
system.time({
  print({ 
    g= ggplot(df, aes(x = col, y = row, fill = value)) + 
      geom_tile()}) 
  })

   user  system elapsed 
 17.799   1.336  19.204 

100,000 Observations

Let’s scale up to the 100,000 observations requested.

N = 1e5
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
colnames(mat) = paste0("Col", seq(ncol(mat)))
rownames(mat) = paste0("Row", seq(nrow(mat)))
df = melt(mat, varnames = c("row", "col"))

system.time({heatmap(mat, Rowv = NA, Colv = NA)})

   user  system elapsed 
  5.999   0.348   6.413 
system.time({
  print({ 
    g= ggplot(df, aes(x = col, y = row, fill = value)) + 
      geom_tile()}) 
  })

   user  system elapsed 
104.659   6.977 111.796 

We see that heatmap and geom_tile() scale with the number observations on how long it takes to plot, but heatmap being much quicker. There may be better ways to do this in ggplot2, but after looking around, it looks like geom_tile() is the main recommendation. Overall, for doing heatmaps of this size, I would use the heatmap, after transposing the matrix and use something like seq_gradient_pal from the scales package or RColorBrewer for color mapping.

For smaller dimensions, I’d definitely use geom_tile(), especially if I wanted to do something like map text to the blocks as well. The other benefit is that no transposition needs to be done, but the data does need to be reshaped explicitly.

Like I said, ggplot2 is not ALWAYS the answer; nor was it supposed to be.

ggplot2 does not make publication-ready figures

Another reason for the lengthy discussion is that many argued ggplot2 does not make publication-ready figures. I agree. Many new users need a large

ggplot2 does not make publication-ready figures

message with their first install. But neither does base graphics. Maybe they need that message for their first R install. R would boot up, and you can’t start until you answer “Does R (or any statistical software) give publication-ready figures by default”. Maybe, if you answer yes, R self-destructs.

Overall, a publication-ready figure takes time, customization, consideration about point size, color, line types, and other aspects of the plot, and usually stitching together multiple plots. ggplot2 has a lot of default features that have given considerate thought to color and these other factors, but one size does not fit all. The default color scheme may not be consistent with your other plots or the data. If it doesn’t work, change it.

And in this I agree with this statement: many people make a plot in ggplot2 and it looks good and they do not put the work in to make it publication-ready. How many times have you seen the muted green and red/pink color that is the default first 2 colors in a ggplot2 plot?

Why all the ggplot2 hate?

I’m not hating on ggplot2; I use it often and value it. My next post will be about plots I commonly use in ggplot2 and why I chose to use it. I think it makes graphs that by default look better than many base alternatives. It does things many things well and easier than base. It has good defaults for some things. It’s a different grammar that, when learned, is easier for plotting. However, I do not think that ggplot2 figures are ready-set-go for publication.

Although I think ggplot2 is great, I most-definitely think base R graphics are useful.

ggplot2 is not ALWAYS the answer. Again, it was never supposed to be.

Advertisements

One thought on “ggplot2 is not ALWAYS the answer: it’s not supposed to be

  1. Pingback: My Commonly Done ggplot2 graphs | A HopStat and Jump Away

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s