Recently, the topic was propose at tea time:
Why should I switch over to ggplot2? I can do everything in base graphics.
I have heard this argument before and I understand it for the most part. Many people have learned R
before ggplot2
came on the scene (as well as many other packages) and learned to do all the things they needed to. Some don’t understand the new syntax for plotting and argue the learning curve is not worth the effort. Also, many say it’s not straightforward to customize.
As the discussion progressed, we discussed 10 commonly made base plots, and I would recreate them in ggplot2
. The goal was to help base users break into ggplot2
if they wanted, and also see if they were “as easy” as doing them in base. I want to discuss my results and the fact that ggplot2
is not ALWAYS the answer, nor was it supposed to be.
First Plot – Heatmap
The first example discussed was a heatmap that had 100,000 rows and 100 columns with a good default color scheme. As a comparison, I used heatmap
in base R
. Let’s do only 10,000 to start:
N = 1e4 mat = matrix(rnorm(100*N), nrow=N, ncol=100) colnames(mat) = paste0("Col", seq(ncol(mat))) rownames(mat) = paste0("Row", seq(nrow(mat))) system.time({heatmap(mat)})
user system elapsed 29.996 1.032 31.212
For a heatmap in ggplot2
, I used geom_tile
and wanted to look at the results. Note, ggplot2
requires the data to be in “long” format, so I had to reshape the data.
library(reshape2) library(ggplot2) df = melt(mat, varnames = c("row", "col")) system.time({ print({ g= ggplot(df, aes(x = col, y = row, fill = value)) + geom_tile()}) })
user system elapsed 9.519 0.675 10.394
One of the problems is that heatmap
does clustering by default, so the comparison is not really fair (but still using “defaults” – as was specified in the discussion). Let’s do the heatmap
again without the clustering.
system.time({heatmap(mat, Rowv = NA, Colv = NA)})
user system elapsed 0.563 0.035 0.605
Which one looks better? I’m not really sure, but I do like the red/orange coloring scheme – I can differences a bit better. Granted, there shouldn’t be many differences as the data is random. The ggplot
graph does have a built-in legend of the values, which is many times necessary. Note, the rows of the heatmap
are shown as columns and the columns shown as rows, a usually well-known fact for users of the image
function. The ggplot
graph plots as you would a scatterplot – up on x-axis and right on the y-axis are increasing. If we want to represent rows as rows and columns as columns, we can switch the x
and y
aesthetics, but I wanted as close to heatmap
as possible.
Don’t factor the rows
Note, I named the matrix rows and columns, if I don’t do this, the plotting will be faster, albeit slightly.
N = 1e4 mat = matrix(rnorm(100*N), nrow=N, ncol=100) df = melt(mat, varnames = c("row", "col")) system.time({heatmap(mat, Rowv = NA, Colv = NA)})
user system elapsed 0.642 0.017 0.675
system.time({ print({ g= ggplot(df, aes(x = col, y = row, fill = value)) + geom_tile()}) })
user system elapsed 9.814 1.260 11.141
20,000 Observations
I’m going to double it and do 20,000 observations and again not do any clustering:
N = 2e4 mat = matrix(rnorm(100*N), nrow=N, ncol=100) colnames(mat) = paste0("Col", seq(ncol(mat))) rownames(mat) = paste0("Row", seq(nrow(mat))) df = melt(mat, varnames = c("row", "col")) system.time({heatmap(mat, Rowv = NA, Colv = NA)})
user system elapsed 1.076 0.063 1.144
system.time({ print({ g= ggplot(df, aes(x = col, y = row, fill = value)) + geom_tile()}) })
user system elapsed 17.799 1.336 19.204
100,000 Observations
Let’s scale up to the 100,000 observations requested.
N = 1e5 mat = matrix(rnorm(100*N), nrow=N, ncol=100) colnames(mat) = paste0("Col", seq(ncol(mat))) rownames(mat) = paste0("Row", seq(nrow(mat))) df = melt(mat, varnames = c("row", "col")) system.time({heatmap(mat, Rowv = NA, Colv = NA)})
user system elapsed 5.999 0.348 6.413
system.time({ print({ g= ggplot(df, aes(x = col, y = row, fill = value)) + geom_tile()}) })
user system elapsed 104.659 6.977 111.796
We see that heatmap
and geom_tile()
scale with the number observations on how long it takes to plot, but heatmap
being much quicker. There may be better ways to do this in ggplot2
, but after looking around, it looks like geom_tile()
is the main recommendation. Overall, for doing heatmaps of this size, I would use the heatmap
, after transposing the matrix and use something like seq_gradient_pal
from the scales
package or RColorBrewer
for color mapping.
For smaller dimensions, I’d definitely use geom_tile()
, especially if I wanted to do something like map text to the blocks as well. The other benefit is that no transposition needs to be done, but the data does need to be reshaped explicitly.
Like I said, ggplot2
is not ALWAYS the answer; nor was it supposed to be.
ggplot2 does not make publication-ready figures
Another reason for the lengthy discussion is that many argued ggplot2 does not make publication-ready figures. I agree. Many new users need a large
ggplot2 does not make publication-ready figures
message with their first install. But neither does base graphics. Maybe they need that message for their first R
install. R would boot up, and you can’t start until you answer “Does R (or any statistical software) give publication-ready figures by default”. Maybe, if you answer yes, R self-destructs.
Overall, a publication-ready figure takes time, customization, consideration about point size, color, line types, and other aspects of the plot, and usually stitching together multiple plots. ggplot2
has a lot of default features that have given considerate thought to color and these other factors, but one size does not fit all. The default color scheme may not be consistent with your other plots or the data. If it doesn’t work, change it.
And in this I agree with this statement: many people make a plot in ggplot2
and it looks good and they do not put the work in to make it publication-ready. How many times have you seen the muted green and red/pink color that is the default first 2 colors in a ggplot2
plot?
Why all the ggplot2 hate?
I’m not hating on ggplot2
; I use it often and value it. My next post will be about plots I commonly use in ggplot2
and why I chose to use it. I think it makes graphs that by default look better than many base alternatives. It does things many things well and easier than base. It has good defaults for some things. It’s a different grammar that, when learned, is easier for plotting. However, I do not think that ggplot2
figures are ready-set-go for publication.
Although I think ggplot2
is great, I most-definitely think base R
graphics are useful.
ggplot2
is not ALWAYS the answer. Again, it was never supposed to be.
Pingback: My Commonly Done ggplot2 graphs | A HopStat and Jump Away