Rendering LaTeX Math Equations in GitHub Markdown

Posted on July 31, 2015 by strictlystat

The Problem: GitHub README.md won't render LaTeX

I have many times wondered about getting LaTeX math to render in a README file on GitHub. Apparently, many others ( 1, 2, 3 ), have asked the same question.

The common answers are:

It cannot (and in some cases, shouldn't) be done. GitHub parsing is done by SunDown and is secure, therefore won't do LaTeX.
Use http://latex.codecogs.com/ or iTex2Img. These are good options, but 1) they may go away at any time, and 2) require you to rewrite your md file.
Use unicode if possible.
Use LaTeXIt (for Mac OS) or other converter to make your equations and embed them.

A hackey, but working solution

I opted to try a more generic solution for (4.) using some very hackey text parsing. I have done a bit of parsing in the past, but I was either too lazy to think about the right regex to do, couldn't think of it easily, or thought my solution was sufficient even if not elegant.

Caveat

Two main caveats abound:

This only works for inline equations marked with dollar signs ($) or equations marked by double dollar signs ($$). I could encorporate other delimiters such as \[, but I did not. I only had a bit of time on Wednesday.
I assume any code that involves dollar signs be demarcated by chunks starting with three backticks (“). I wrote this for R code, which can use dollar signs for referencing and never has double dollar signs. If your code does, no guarantees.
This generally assumes you have a GitHub repository (have no idea what others use), and that you're OK with the figures being located in that GitHub repository. I didn't allow options for putting them in a sub-folder, but may incorporate that.
Some text won't be sized correctly.

How do I do it already

I wrote an R package that would parse a README.md (or README.rmd if it's RMarkdown). The package is located at https://github.com/muschellij2/latexreadme.

You can install the package using:

library(devtools)
install_github("muschellij2/latexreadme")

You would then load the package:

library(latexreadme)

The main function is parse_latex. It's not the best function name for what it does, but I don't really care. Let's see it's arguments:

args(parse_latex)

You must put in a README file as the rmd argument. If the README has an rmd or Rmd extension, the README is first knitted using knit(rmd) and then the resultant md file is used. This md is located in a temporary directory and won't write to the directory of the README. The new_md is the filename for the output md file that you wish to create. One example would be rmd = "README_with_latex.md" and md = "README.md". The git_username and git_reponame must be specified with your username and repository name, respectively. The git_branch allows you to specify which branch you are on, if necessary. If you don't know what that means, just leave as master.

The rest of the arguments are for inserting the LaTeX into the document. The text_height is how large the LaTeX should be (this may be bad for your document), the insert_string is the HTML the LaTeX is subbed for, the raw_git_site uses https://rawgit.com to reference the figures directly with proper content-type headers (so that they show up). The bad_string is something I'm using in the code. You only need to change bad_string if you happen to have text in your README that matches this (should be rare as they are a bunch of Z's, unless you write like someone sleeping). I'll get to the ... in a minute.

I still don't get it – show me an example

I thought you'd never ask. The parse_latex command has an example from one of my other repos and you can run it as follows (need curl):

rmd = file.path(tempdir(), "README_unparse.rmd")
download.file(
"https://raw.githubusercontent.com/muschellij2/Github_Markdown_LaTeX/master/README_unparse.rmd",
destfile = rmd, method = "curl")
new_md = file.path(tempdir(), "README.md")
parse_latex(rmd,
            new_md,
            git_username = "muschellij2",
            git_reponame = "Github_Markdown_LaTeX")
library(knitr)
new_html = pandoc(new_md, format = "html")

And you can view the html using browseURL:

browseURL(new_html)

You can see the output of the example (only a little bit of LaTeX) at this repo: https://github.com/muschellij2/Github_Markdown_LaTeX or at Kristin Linn's README, which was used as an example here: https://github.com/kalinn/IPW-SVM/blob/master/README_2.md

What is the function actually doing

So what is the function actually doing? Something convoluted I can assure you. The process is as follows:

Find the equations using ($$ and $) parse them out, throwing out any code demarcated with backticks (”).
Put this LaTeX into a simple LaTeX document with \begin{document}. Note, the ... argument can be a character vector of other packages to load in that document. See png_latex documentation.
Run pdflatex on the document. Note, this must be in your path. This creates a PDF.
Run knitr::plot_crop on this document. This will crop out anything that's not the LaTeX equation you wanted.
Convert the PDF to a PNG using animation::im.convert. This is so that they will render in the README. The file will be something like eq_no_01.png in the same folder as the rmd argument.
Replace all the LaTeX with the insert_string, which is raw HTML now.
Write out the parsed md file, which was named using new_md.

Wow – that IS convoluted

My best shot in one day. If you have better solutions, please post in the comments.

Nothing shows up! Read this

NB: The replacement looks for equations (noted by eq_noSOMETHING.png) in your online GitHub repository. If you run this command and don't push these png files, then nothing will show up.

Conclusions

You can have LaTeX “rendered” in a GitHub README file! The sizes of the text may be weird. This is due to the cropping. I could probably use some bounding box or better way to get only the equations, but I didn't. If you want to help, please sumbit a Pull Request to my repository and I'd gladly merge it if it works.

NB: GitHub may override a README.md if a README.rmd (or README.Rmd) exists. I'm not 100% sure on that, but if that's the case, rename the Rmd and just have README.md.

Happy parsing!

#rstats Make arrays into vectors before running table

Posted on July 30, 2015 by strictlystat

Setup of Problem

While working with nifti objects from the oro.nifti, I tried to table the values of the image. The table took a long time to compute. I thought this was due to the added information about a medical image, but I found that the same sluggishness happened when coercing the nifti object to an array as well.

Quick, illustrative simulation

But, if I coerced the data to a vector using the c function, things were much faster. Here's a simple example of the problem.

library(microbenchmark)
dim1 = 30
n = dim1 ^ 3
vec = rbinom(n = n, size = 15, prob = 0.5)
arr = array(vec, dim = c(dim1, dim1, dim1))
microbenchmark(table(vec), table(arr), table(c(arr)), times = 100)

Unit: milliseconds
          expr       min        lq      mean    median        uq      max
    table(vec)  5.767608  5.977569  8.052919  6.404160  7.574409 51.13589
    table(arr) 21.780273 23.515651 25.050044 24.367534 25.753732 68.91016
 table(c(arr))  5.803281  6.070403  6.829207  6.786833  7.374568  9.69886
 neval cld
   100  a 
   100   b
   100  a

As you can see, it's much faster to run table on the vector than the array, and the coercion of an array to a vector doesn't take much time compared to the tabling and is comparable in speed.

Explanation of simulation

If the code above is clear, you can skip this section. I created an array that was 30 × 30 × 30 from random binomial variables with half probabily and 15 Bernoulli trials. To keep things on the same playing field, the array (arr) and the vector (vec) have the same values in them. The microbenchmark function (and package of the same name) will run the command 100 times and displays the statistics of the time component.

Why, oh why?

I've looked into the table function, but cannot seem to find where the bottleneck occurs. Now, for and array of 30 × 30 × 30, it takes less than a tenth of a second to compute. The problem is when the data is 512 × 512 × 30 (such as CT data), the tabulation using the array form can be very time consuming.

I reduced the replicates, but let's show see this in a reasonable image dimension example:

library(microbenchmark)
dims = c(512, 512, 30)
n = prod(dims)
vec = rbinom(n = n, size = 15, prob = 0.5)
arr = array(vec, dim = dims)
microbenchmark(table(vec), table(arr), table(c(arr)), times = 10)

Unit: seconds
          expr      min       lq     mean    median        uq       max
    table(vec) 1.871762 1.898383 1.990402  1.950302  1.990898  2.299721
    table(arr) 8.935822 9.355209 9.990732 10.078947 10.449311 11.594772
 table(c(arr)) 1.925444 1.981403 2.127866  2.018741  2.222639  2.612065
 neval cld
    10  a 
    10   b
    10  a

Conclusion

I can't figure out why right now, but it seems that coercing an array (or nifti image) to a vector before running table can significantly speed up the procedure. If anyone has any intuition why this is, I'd love to hear it. Hope that helps your array tabulations!

Line plots of longitudinal summary data in R using ggplot2

Posted on July 9, 2015 by strictlystat

I recently had an email for a colleague asking me to make a figure like this in ggplot2 or trellis in R:

plot of chunk final_plot

As I know more about how to do things in ggplot2, I chose to use that package (if it wasn't obvious from the plot or other posts).

Starting Point

Cookbook R/) has a great starting point for making this graph. The solution there is not sufficient for the desired graph, but that may not be clear why that is. I will go through most of the steps of customization on how to get the desired plot.

Creating Data

To illustrate this, I will create some sample dataset:

N <- 30
id <- as.character(1:N) # create ids
sexes = c("male", "female")
sex <- sample(sexes, size = N/2, replace = TRUE) # create a sample of sex
diseases = c("low", "med", "high")
disease <- rep(diseases, each = N/3) # disease severity 
times = c("Pre", "0", "30", "60")
time <- rep(times, times = N) # times measured 
t <- 0:3
ntimes = length(t)
y1 <- c(replicate(N/2, rnorm(ntimes, mean = 10+2*t)), 
        replicate(N/2, rnorm(ntimes, mean = 10+4*t)))
y2 <- c(replicate(N/2, rnorm(ntimes, mean = 10-2*t)), 
        replicate(N/2, rnorm(ntimes, mean = 10-4*t)))
y3 <- c(replicate(N/2, rnorm(ntimes, mean = 10+t^2)), 
        replicate(N/2, rnorm(ntimes, mean = 10-t^2)))

data <- data.frame(id=rep(id, each=ntimes), sex=rep(sex, each=ntimes), 
                   severity=rep(disease, each=ntimes), time=time, 
                   Y1=c(y1), Y2=c(y2), Y3=c(y3)) # create data.frame
#### factor the variables so in correct order
data$sex = factor(data$sex, levels = sexes)
data$time = factor(data$time, levels = times)
data$severity = factor(data$severity, levels = diseases)
head(data)

  id    sex severity time        Y1        Y2        Y3
1  1 female      low  Pre  9.262417 11.510636  9.047127
2  1 female      low    0 10.223988  8.592833 11.570381
3  1 female      low   30 13.650680  5.696405 13.954316
4  1 female      low   60 15.528288  5.313968 18.631744
5  2 female      low  Pre  9.734716 11.190081 10.086104
6  2 female      low    0 12.892207  7.897296  9.794494

We have a longitudinal dataset with 30 different people/units with different ID. Each ID has a single sex and disease severity. Each ID has 4 replicates, measuring 3 separate variables (Y1, Y2, and Y3) at each time point. The 4 time points are previous (Pre)/baseline, time 0, 30, and 60, which represent follow-up.

Reformatting Data

In ggplot2, if you want to plot all 3 Y variables, you must have them in the same column, with another column indicating which variable you want plot. Essentially, I need to make the data “longer”. For this, I will reshape the data using the reshape2 package and the function melt.

library(reshape2)
long = melt(data, measure.vars = c("Y1", "Y2", "Y3") )
head(long)

  id    sex severity time variable     value
1  1 female      low  Pre       Y1  9.262417
2  1 female      low    0       Y1 10.223988
3  1 female      low   30       Y1 13.650680
4  1 female      low   60       Y1 15.528288
5  2 female      low  Pre       Y1  9.734716
6  2 female      low    0       Y1 12.892207

It may not be clear what has been reshaped, but reordering the data.frame can illustrate that each Y variable is now a separate row:

head(long[ order(long$id, long$time, long$variable),], 10)

    id    sex severity time variable     value
1    1 female      low  Pre       Y1  9.262417
121  1 female      low  Pre       Y2 11.510636
241  1 female      low  Pre       Y3  9.047127
2    1 female      low    0       Y1 10.223988
122  1 female      low    0       Y2  8.592833
242  1 female      low    0       Y3 11.570381
3    1 female      low   30       Y1 13.650680
123  1 female      low   30       Y2  5.696405
243  1 female      low   30       Y3 13.954316
4    1 female      low   60       Y1 15.528288

Creating Summarized data frame

We will make a data.frame with the means and standard deviations for each group, for each sex, for each Y variable, for separate time points. I will use plyr to create this data.frame, using ddply (first d representing I'm putting in a data.frame, and the second d representing I want data.frame out):

library(plyr)
agg = ddply(long, .(severity, sex, variable, time), function(x){
  c(mean=mean(x$value), sd = sd(x$value))
})
head(agg)

  severity  sex variable time      mean        sd
1      low male       Y1  Pre  9.691420 1.1268324
2      low male       Y1    0 12.145178 1.1218897
3      low male       Y1   30 14.304611 0.3342055
4      low male       Y1   60 15.885740 1.7616423
5      low male       Y2  Pre  9.653853 0.7404102
6      low male       Y2    0  7.652401 0.7751223

There is nothing special about means/standard deviations. It could be any summary measures you are interested in visualizing.

We will also create the Mean + 1 standard deviation. We could have done standard error or a confidence interval, etc.

agg$lower = agg$mean + agg$sd
agg$upper = agg$mean - agg$sd

Now, agg contains the data we wish to plot.

Time is not on your side

Time as a factor

If you look at the plot we wish to make, we want the lines to be connected for times 0, 30, 60, but not for the previous data. Let's try using the time variable, which is a factor. We create pd, which will be a ggplot2 object, which tells that I wish to plot the means + error bars slightly next to each other.

class(agg$time)

[1] "factor"

pd <- position_dodge(width = 0.2) # move them .2 to the left and right

gbase  = ggplot(agg, aes(y=mean, colour=severity)) + 
  geom_errorbar(aes(ymin=lower, ymax=upper), width=.3, position=pd) +
  geom_point(position=pd) + facet_grid(variable ~ sex)
gline = gbase + geom_line(position=pd) 
print(gline + aes(x=time))

plot of chunk gbase

None of the lines are connected! This is because time is a factor. We will use gbase and gline with different times to show how the end result can be achieved.

Time as a numeric

We can make time a numeric variable, and simply replace Pre with -1 so that it can be plotted as well.

agg$num_time = as.numeric(as.character(agg$time))
agg$num_time[ is.na(agg$num_time) ] = -1
unique(agg$num_time)

[1] -1  0 30 60

In a previous post, I have discussed as an aside of creating a plot in ggplot2 and then creating adding data to the data.frame. You must use the %+% to update the data in the object.

gline = gline %+% agg
print(gline + aes(x=num_time))

plot of chunk plus

If you look closely, you can see that Pre and time 0 are very close and not labeled, but also connected. As the scale on the x-axis has changed, the width of the error bar (set to 0.3), now is too small and should be changed if using this solution.

Although there can be a discussion if the Pre data should be even on the same plot or the same timeframe, I will leave that for you to dispute. I don't think it's a terrible idea, and I think the plot works because the Pre and 0 time point data are not connected. There was nothign special about -1, and here we use -30 to make it evenly spaced:

agg$num_time[ agg$num_time == -1 ] = -30
gline = gline %+% agg
print(gline + aes(x=num_time))

plot of chunk create_time_neg

That looks similar to what we want. Again, Pre is connected to the data, but we also now have a labeling problem with the x-axis somewhat. We still must change the width of the error bar in this scenario as well.

Time as a numeric, but not the actual time point

In the next case, we simply use as.numeric to the factor to create a variable new_time that will be 1 for the first level of time (in this case Pre) to the number of time points, in this case 4.

agg$new_time = as.numeric(agg$time)
unique(agg$new_time)

[1] 1 2 3 4

gline = gline %+% agg
print(gline + aes(x = new_time))

plot of chunk new_time

Here we have something similar with the spacing, but now the labels are not what we want. Also, Pre is still connected. The width of the error bars is now on a scale from 1-4, so they look appropriate.

Creating a Separate data.frame

Here, we will create a separate data.frame for the data that we want to connect the points. We want the times 0-60 to be connected and the Pre time point to be separate.

sub_no_pre = agg[ agg$time != "Pre",]

Mulitple data sets in plot function

Note, previously we did:

gline = gbase + geom_line(position=pd)

This assumes that geom_line uses the same data.frame as the rest of the plot (agg). We can fully specify the arguments in geom_line so that the line is only for the non-Pre data:

gbase = gbase %+% agg
gline = gbase + geom_line(data = sub_no_pre, position=pd, 
                          aes(x = new_time, y = mean, colour=severity)) 
print(gline + aes(x = new_time))

plot of chunk non_conn
Note, the arguments in aes should match the rest of the plot for this to work smoothly and correctly.

Relabeling the axes

Now, we simply need to re-label the x-axis so that it corresponds to the correct times:

g_final = gline + aes(x=new_time) +
  scale_x_continuous(breaks=c(1:4), labels=c("Pre", "0", "30", "60"))

We could be more robust in this code, using the levels of the factor:

time_levs = levels(agg$time)
g_final = gline + aes(x=new_time) +
  scale_x_continuous(
    breaks= 1:length(time_levs), 
    labels = time_levs)
print(g_final)

plot of chunk relabel2

Give me a break

My colleague also wanted to separate the panels a bit. We will use the panel.margin arguments and use the unit function from the grid package to define how far apart we want the axes.

library(grid)
g_final = g_final + theme(panel.margin.x = unit(1, "lines"), 
                          panel.margin.y = unit(0.5, "lines"))
print(g_final)

plot of chunk final

Additional options and conclusoin

I believe legends should be inside a plot for many reasons (I may write about that). Colors can be changed (see scale_colour_manual). Axis labels should be changed, and the Y should be labeled to what they are (this is a toy example).

Overall, this plot seems to be what they wanted and the default options work okay. I hope this illustrates how to customize a ggplot to your needs and how you may need to use multiple data.frames to achieve your desired result.

A HopStat and Jump Away

Trying to at least Doggie Paddle through the Sea of Data, Contributor to http://bmorebiostat.com

Monthly Archives: July 2015