# R is a Language: Treat it Like One

I'm helping out with teaching a class on an introduction to R for students this week. I figured it'd be a good time to discuss my thoughts on programming in R and how a newcomer should feel about learning the language.

## Those Who Teach R, Should Use R

Many of the students in the class at the beginning are overwhelmed. They see a bunch of different symbols and syntax; being overwhelmed is understandable. Moreover, those teaching you R can seem lightning fast when running code, moving around screens, or figuring out problems, and they usually are. Do you know why?

Those who are teaching you R, use R. They use R a lot. They use it daily, and for hours a day usually. Don't try to be them immediately, get the basics.

I hear students saying “Oh, you're fast” when power users help them, usually lined with an undertone of low self-esteem. That's like me going to a basketball camp run by Michael Jordan and saying “Oh man Michael, you're really good on the court”. Be like Mike: work hard to learn the basics like the back of your hand. You'll be dunking in no time.

And, of course, those who teach you are fast. Why would you take class from someone slow or unsure while doing what they're teaching you? Yes, your teachers may be fast, but that's the point. Learn from those who do.

I feel as though I'm relatively fast on my machine and I felt this starting out. Even more, I didn't know what tab completion was when starting out my first class. I thought my professor could actually type that fast when writing variables or directory paths. I thought it was magic. If I didn't stick it out, I wouldn't figure out how to make that magic myself.

## R is a language

I took Spanish in high school for 4 years. I remember some vocabulary words and some conjugation rules, but am far from conversational. I've never taken French – I don't know French. Now, if I took an introduction to French class for 4 hours, do you think I could speak (or write) French fluently? No, of course not. Yet students think they can with R. My Spanish is like a background in Stata: some words/phrases/commands are similar, other are misleading and can be confusing.

R is a programming language. Just like a foreign language, R has syntax and grammar. You must learn simple punctuation such as placement of commas, assigment using “=” or “<-”, where to close parentheses, and when to notice when a quote is unmatched. My overall message is:

R is a language, treat it like one.

Remember to tell students to hold themselves to the same level of comprehension as they would for a spoken (or signed) or written language. Hopefully, that will put learning R in perspective, even if it does not make it any less overwhelming. I wonder if Rosetta Stone will make an R module one of these days.

In R, useful functions for making design matrices are model.frame and model.matrix. I will to discuss some of the differences of behavior across and within the two functions. I also have an example where I have run into this problme and it caused me to lose time.

## Using model.frame for a design matrix

Whenever I use the word “design” I mean the sytematic part of a model; in this case, linear models. For example, if you say

$\displaystyle Y = X\beta + \varepsilon$

I'm referring to the $X$ as the design.

model.frame creates a design data.frame of the covariates given, keeping any factor variables as factors with the same levels. Let's create a toy data.frame called df, where Y is a normal random variable linearly related to two variables in the dataset:

n = 100
df = data.frame(X1 = rnorm(n),
X2 = rpois(n, lambda = 5),
X3= rnorm(100, mean = 4, sd = 2),
Sex = factor(rep(c("Male", "Female"), each = 50)))
df$Y = with(df, X1 + 3*X2 + rnorm(100, sd = 10)) Now, if Y is included on the left hand side of the formula, then it is included in the output of model.frame as such: model.df = model.frame(Y ~ X1 + X2 + X3 + Sex, data=df) head(model.df, 2) Y X1 X2 X3 Sex 1 9.223 0.3849 2 5.960 Male 2 12.467 -0.5061 5 1.651 Male This gives you a data.frame with the outcome and the covariates fitting that outcome (not including an intercept). If Y is not included on the left hand side of the formula: model.df2 = model.frame(~ X1 + X2 + X3 + Sex, data=df) head(model.df2, 2) X1 X2 X3 Sex 1 0.3849 2 5.960 Male 2 -0.5061 5 1.651 Male then we see that Y is not included in the output of model.frame. Thus, if you want to create a “design data.frame”, then you likely will want to remove Y from the formula. Note, in both cases, we see that there is no intercept term added to the data.frame and nothing is done to factor variables. ## Using model.matrix Most cases I'm making model design elements is using model.matrix to then use matrix multiplications to make procedures faster or do “smarter” (i.e. fewer) computations. I will discuss the differences between model.frame and model.matrix using our toy dataset and also dicuss one gotcha) for using model.matrix and lm. Let's use model.matrix with and without Y on the left hand side of the formula. model.mat = model.matrix(Y ~ X1 + X2 + X3 + Sex, data=df) model.mat2 = model.matrix(~ X1 + X2 + X3 + Sex, data=df) all.equal(model.mat, model.mat2) [1] TRUE We see that using any element on the left hand side doesn't affect the output of model.matrix. Difference #1 from model.frame. Let's look at the output from model.matrix. head(model.mat, 3) (Intercept) X1 X2 X3 SexMale 1 1 0.3849 2 5.960 1 2 1 -0.5061 5 1.651 1 3 1 -1.3739 3 3.197 1 We see a column was added named (Intercept) with a column of ones for the $\beta_0$ usually in a model. Difference #2 from model.frame. Also, we see that our factor Sex was converted to an indicator (numeric) variable. Difference #3 from model.frame. We only have 2 levels in Sex in this example. In general, a factor with L levels will generate L - 1 indicator variables using model.matrix. ## Review over, how did this affect me? I wanted to discuss the differences above to note them if you haven't seen them before. Also, I want to show that using model.matrix and a -1 or 0 in a formula can affect how some of your results are calculated using linear models with lm. Running the model with our now-ready model matrix: mod = lm(df$Y ~ model.mat)
summary(mod)
Call:
lm(formula = df$Y ~ model.mat) Residuals: Min 1Q Median 3Q Max -25.644 -8.617 0.448 7.648 30.245 Coefficients: (1 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) -3.1871 3.9533 -0.81 0.42 model.mat(Intercept) NA NA NA NA model.matX1 1.1894 1.0987 1.08 0.28 model.matX2 3.6243 0.5790 6.26 1.1e-08 *** model.matX3 0.0164 0.5422 0.03 0.98 model.matSexMale 2.6174 2.2726 1.15 0.25 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 11.2 on 95 degrees of freedom Multiple R-squared: 0.312, Adjusted R-squared: 0.283 F-statistic: 10.8 on 4 and 95 DF, p-value: 3.03e-07 We see that the intercept term created in model.matrix was made NA because it's identical to the intercept term inherently generated by R and is linearly dependent. This is also seen with the warning: “(1 not defined because of singularities)”. This is good to know, but not revelatory or new; just be aware. ### When model.matrix goes … differently Well model.mat already has an intercept, so why not just take out the intercept term with a -1? The model should be the same, right? I would assume this is the case, but let's do it: mod.noint = lm(df$Y ~ model.mat - 1)
summary(mod.noint)
Call:
lm(formula = df$Y ~ model.mat - 1) Residuals: Min 1Q Median 3Q Max -25.644 -8.617 0.448 7.648 30.245 Coefficients: Estimate Std. Error t value Pr(>|t|) model.mat(Intercept) -3.1871 3.9533 -0.81 0.42 model.matX1 1.1894 1.0987 1.08 0.28 model.matX2 3.6243 0.5790 6.26 1.1e-08 *** model.matX3 0.0164 0.5422 0.03 0.98 model.matSexMale 2.6174 2.2726 1.15 0.25 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 11.2 on 95 degrees of freedom Multiple R-squared: 0.717, Adjusted R-squared: 0.702 F-statistic: 48.1 on 5 and 95 DF, p-value: <2e-16 We see the intercepts look exactly the same (except we have removed the NA). But note the r.squared, adjusted.r.squared and F-statistic values! Let's focus on $R^2$: summary(mod)$r.squared
[1] 0.3121
summary(mod.noint)$r.squared [1] 0.7168 These are different – way different – which seems off. Why? If you look into the summary.lm code, you will notice a some of statements involve the expression: attr(z$terms, "intercept")

and calculate quantities differently depending on whether it flags that test as TRUE or FALSE.

Let's look our two models again from model.matrix:

attr(mod$terms, "intercept") [1] 1 attr(mod.noint$terms, "intercept")
[1] 0

We see that when you construct the intercept yourself, this code evaluates to FALSE, even though the model has an “intercept”. The model has an intercept, but R hasn't assigned it that attribute. This effects the calculation of the model sum of squares (from summary.lm):

geom_histogram( aes(x = x2, y = -..density..), binwidth = diff(range(df$x))/30, fill= "green") print(g) ## Warning: Stacking not well defined when ymin != 0 I simply simulated 2 normal distributions of 100 points and then plotted them. Not the ..density call in the aes for the histograms. This just scales the histogram to a density and not a count. The -..density.. flips the second histogram around zero so that they are back-to-back. We see that ggplot doesn't like stacking when you have negative data, but it's ok for this exmaple and don't overlap. print(g + coord_flip()) ## Warning: Stacking not well defined when ymin != 0 Using coord_flip plots back-to-back histograms horizontally. This code can easily be extended using geom_density and actually a volcano plot version is in the help for stat_density. ### Base implementation Not everyone likes ggplot2 so I figured I would provide in implementation in base graphics. ## using base h1 = hist(df$x, plot=FALSE)
h2 = hist(df$x2, plot=FALSE) h2$counts = - h2$counts hmax = max(h1$counts)
hmin = min(h2$counts) X = c(h1$breaks, h2\$breaks)
xmax = max(X)
xmin = min(X)
plot(h1, ylim=c(hmin, hmax), col="green", xlim=c(xmin, xmax))
lines(h2, col="blue")

The code calculates the histograms for each distribution and stores the information. I simply take the negative number of counts to flip the histogram over the x-axis.

### Go forth and prosper

You can adjust the axes to positive numbers, make more implementations with densities/etc, but this is a simple graphic I've seen people use. Hope this helps someone out.

# SMART Hackathon: Day 2: Writing Packages in RStudio

So day 2 of the #JHUSMARTHack was last week, but I figured this would be a good time to discuss what was accomplished. I created some packages that are somewhat specialized and aren't fully finished yet, so I'll hold off. What I really want to discuss though is why I like using RStudio for making packages.

# What was accomplished?

I had spoken before about the repositories I had worked on, and also about Developing Packages in RStudio. I'll discuss the workflow I settled into for for making a package.

## Workflow for an R package

I'm assuming your R code is already available, presumably functions you had created during a project or analysis. If code is not available, GREAT! You can start your workflow for your new package or product all the same. I'll try to put command-line equivalents in double brackets [[ ]].

Workflow:

1. Start RStudio.
2. Go to File -> New Project. (Save any unsaved work).
3. Select New Directory. Now this may be counterintuitive if you have work saved, but if you're creating a package choose this. This will setup a new folder and copy over any code you have already created.
4. Select R Package. This will allow you to name your package (it will be used in the library statement unless changed later); let's call it mypckg. You can also choose code you have to operationalize, e.g. put into a package. Also – select you want to create a git repository.
5. Voila! The folder is created where you have the components of an R package, such as the documentation (man), extra install dir (inst), the R code (R), etc. [[library(devtools); create("mypckg");]]

Now, here is one of the main reasons I like using RStudio for projects: the .Rproj file. The .Rproj file is an RStudio project file. It allows you to work on stuff in multiple tabs/scripts, then close the project, and pop up the other tabs/scripts you were working on before opening up that project. If you are in RStudio, the top right should show a Project: None if you don't have a project loaded. These project files allows me to segregate my workflows and scripts, and they help me organize a bit more. I highly recommend checking out Hilary Parker's post before continuing, especially if you're not an RStudio fan.

### Using RStudio Build Tools

Now, when I say RStudio Build Tools, I essentially mean wrappers for the devtools package. The package is amazing (hardly shocking since Hadley Wickham is the main author), and along with the roxygen2 package, allow package creation to be as easy as possible.

Now, let's set up our options for Build Tools. In RStudio, go to Build -> Configure Build Tools (again you must be in an RStudio project). For Check Package, I recommend putting the --as-cran option (especially if you plan to submit to CRAN. You should also see a checkbox saying Generate documentation using Roxygen. If this is not available, run install.packages("roxygen2"), close and reopen the project. Check this box, and click the Configure button and I usually click all options.

### Setting up a remote git repository

Before, we checked for a git repository to be created. Now, you can create a new repository in your favorite GitHub remote repository. Mine is GitHub. You can use the GUIs such as the GitHub GUI or SourceTree, but I generally set this up using the Terminal by just adding the remote. (Here is a link to create ssh keys so you don't have to type in passwords for git). Now, if you restarted the RStudio project, go to Build -> Configure Build Tools and you should see the remote repository if you click the Git/SVN tab.

Now that the repository is set up (even if you don't use a remote repository), you can go to Tools --> Commit to commit to the repository. This allows you to add and stage the changed files while adding a commit comment. You can also see a visual history of the differences and changes as well as do much of what you would need to from the command line. Again, I like the Terminal, but I like having this all in one program and not having to switch back and forth.

### DOCUMENTATION! EXAMPLES! VIGNETTES!

Now that you have everything set up, you have to do the big things that differentiate a bunch of functions from a package: documentation and examples (including vignettes). Again, for documentation, we'll be using the roxygen2 package. Roxygen is essentially a format that starts with a line with #' followed by @ followed by a “tag”. The tags can be found at ?rd_roclet. Now, I highly recommend vignettes, but I'm not an expert on these and think we'll just stick to function docs right now.

#### Jumping to Sublime Text

Before we start documentation, let me again tell about MY workflow rather than Roxygen. My workflow now jumps to Sublime Text. I have Line-by-line installed (which you will need), which allows me to run a script to parse an R function and create the necessary Roxygen tags. See Alyssa's post for the description and a more command-line workflow for R packages.

Now that we're in Sublime Text, I open the .R files from my packages R directory. Select the function definition such as x = function(z, y, l=4, ...){ and use CMD+D to create Roxygen tags! This is like meta-programming for documentation: running scripts to make documentation (granted it's in Python). As an aside, one powerful feature of this documentation is that if you have code as:

LFPCAg <- function(
Y,# an n x D matrix
# Y is assumed to be centered by its mean function
gridpoints = 1:ncol(Y),       # a vector of grid points along curves, corresponding to columns of Y
Zlist=NA,
G=NA,
Ivec=NA,
...){

this will parse the Roxygen tags as the comments for each argument/parameter (even if multi-line!):

#' @title <brief desc>
#'
#' @description <full description>
#' @param Y an n x D matrix Y is assumed to be centered by its mean function
#' @param gridpoints a vector of grid points along curves corresponding to columns of Y
#' @param Zlist
#' @param G
#' @param Ivec
#' @param ...
#' @export
#' @keywords
#' @seealso
#' @return
#' @aliases
#' @examples \dontrun{
#'
#'}

This also puts in your mind, even if you're only creating functions and not a package, that you'll almost have documentation ready made when using this function format from day 1.

#### Jumping Back to RStudio

Now, opening these Roxygen-tagged functions, I can fill in the rest in RStudio. One thing to note is that RStudio will assume you're trying to stay in Roxygen notation with a return of line (which is great for multi-line descriptions/titles/etc). Also, if you have #' @ starting a line, then RStudio will do tab completion of Roxygen tags. Not leaps and bounds saved on time, but hey, I like tab completion.

Now you have to write your examples, the description of arguments (denoted as parameters), the overall function description and title, and use the @export to allow this function accessible to the user. One note is that if you depend on another function or package, use the @import pkgname or @importsFrom pkgname::funcName tags. R CMD check will warn you if you don't have anything in @keywords, @aliases, or @examples, so remove these if not necessary.

### Just let me check my functions!

If you're still working on the package and want to play with functions and no so much the documentation, you can use Build -> Load All [[devtools::load_all]] to load the functions (even those not exported) into memory.

Now let's fast-forward to when you have created the the documentation for your functions. While still in your project, go to Build -> Build and Reload to get your package loaded into memory [[devtools::build then devtools::install]]. Roxygen will create the docs. FYI – if you change around function names and recompile, the man folder may have obsolete .Rd files, so you can delete old ones.

You should edit the DESCRIPTION file to change some specifications, such as Depends: fields for package dependencies. That's documented many places on the web to find about what goes in there.

Now edit your functions and docs, push to the remote repository and then allow people install your package by using:

library(devtools)

and there you have it – you've released software. Build -> Check Package is good for testing your package (will tests your examples) and make sure everything looks OK.

# Conclusion

R package creation seems like a daunting task. You can use tools in RStudio such as Code -> Extract Function to take loads of code to try to functional-ize it. When you have a collection of functions, creating an RStudio project allows you to separate your package creation process from regular RStudio analysis and use, let's you have a one-stop shop for git version control, building, and checking of packages. It let's you get over any hurdles of learning new functions in devtools (which may not be a good thing) and get you running in a short amount of time. The Sublime Text plugin is not a crucial step, but can allow you to parse semi-documented functions and create a Roxygen header that's partially filled in. This workflow allowed me to develop multiple projects and get them documented quickly at the hackathon.

Hopefully this helps and good luck packaging!

# SMART Hackathon: Day 1

So day 1 of the #JHUSMARTHack 2014 is over and day 2 is underway. It's been awesome. One student estimated he did “one week's worth of work”. Some interesting (I think) things I thought about:

1. Flooding rain is good weather for coding. (As long as no sinkholes arise)
2. The Admiral Fell Inn is a pretty sweet place to host this event (and supposedly reasonably priced).
3. A lot of other conferences have Hackathons, such as OHBM. We discussed the potential for these at ENAR and JSM.
4. Headphones are good and bad at times. They can block out any talking but they also block out any cool discussions.
5. I'm used to coding by myself a lot, which is hard to break out of and when knowing to ask for help.
6. Shiny apps are easy to implement and cool products. (Not really new for me, but worthwhile to say).
7. It's good to have someone that knows languages you don't, such as Python.
8. You want to have example data before you come for your packages to be able to use, i.e. is IRB approved for distribution.
9. Doing some prep before, like reading through Developing Packages with RStudio, is helpful beforehand. (I think Roxygen2 should be default ON)
10. Using -- in Mac OSX when trying to put in options for R CMD check, such as --as-cran, turns -- into (which fails on the check).

Check out my github repos (posted yesterday) for progress. I've added the most functionality to WhiteStripe (white matter “segmentation”) and fslr (a wrapper for FSL).

Today, I'm looking at the knitr vignettes, amongst others, to make my first vignette for these. I'm very excited I can write in R Markdown with these. Happy hacking!

# SMART Hackathon 2014

Today is our SMART group's 2014 Hackathon! We will be at an undisclosed location, just plugging away at code/packages/apps. The mission statement:

There's no structure except that you have to work on a software
project and you have to present what you accomplished or didn't at a
subsequent SMART meeting.

We will be tweeting at #JHUSMARThack for the next two days.

I personally will be working on a suite of imaging tools for R for MS lesion imaging, white matter segmentation, CT image analysis, and a package integrating FSL and R: fslr.

I hope to get a lot of commits and some feedback on the package from the group. Let me know if you want to check any of them out and if you want any functionality with them!