# R projects may make large files

## Introduction

I have an “old” MacBook: a late-2013 MacBook Pro. I haven't upgraded because I wasn't a fan of the butterfly keyboards and the Touch Bar. I'm glad to hear you can now get new MacBooks with the “old” keyboards. I also don't see large advances in the specs of the machines, but I'll stay with Mac because I love the OS and the integration.

That being said, one of the downsides of having an old MacBook is that I struggle with space at times. I offload a lot of things to the cloud and my external drive, but I like having things locally. I'm also a huge fan of the RStudio Projects framework. I would say the RStudio IDE is a must for using R nowadays, at least if you're a new user. RStudio Projects alleviate a lot of the problems of working outside of an IDE: switching directories (opening an .Rproj file starts the session in the project root, which here::here uses), keeping multiple unrelated scripts open (each project gets its own session/window), and the IDE adds build tools for package development.

## How the RStudio IDE integrates with Package Development

Using RStudio Projects for package development is great. The tools integrate with devtools, which changed the game for making a package. RStudio additionally wrapped this functionality in keyboard shortcuts and GUI clicks, along with Git integration. When you are compiling and building a package, the RStudio IDE knows it should restart the R session, because all the packages (and options) you previously loaded should be reset. It doesn't want you losing any saved work, though, so all the objects are cached, the session is restarted, and the cache is restored.

## The issue

One of the downsides of this strategy is that I'm impatient. Sometimes, especially with large packages or objects, the RStudio IDE will freeze. I will wait, get annoyed, and kill the process. The overall issue is that the cached data is then not cleared away. The data is stored in the .Rproj.user folder and can be quite big (hundreds of MB) depending on what you had in memory. A lot of other files related to your user state live in there too (think the 10 Untitled files you just haven't saved yet, which scripts were open, what was in the Viewer). Most of the time for projects that are packages, I don't need this information, so I delete the folder. Don't worry, it'll get regenerated when you open that project again.
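If you'd rather hunt these folders down from the command line, here's a minimal sketch; ~/projects is an assumption, so point find at wherever your projects actually live:

```shell
# List every .Rproj.user folder under ~/projects with a human-readable
# size; review the output, then delete the large ones by hand.
# ~/projects is an assumption: substitute your own projects directory.
find ~/projects -type d -name ".Rproj.user" -prune -exec du -sh {} + 2> /dev/null
```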

## What's the point

If you're doing some house cleaning for hard drive space, take a look at the .Rproj.user hidden folders and see how large they are. They shouldn't be much more than 1 MB, and even that is pretty big depending on how much code you have. Either way, I hope it gives you some “free” space. I guess I could buy another MacBook, but this one still works perfectly well.

Here's a simple script that shows the overall size of these directories. There were some things I couldn't find using file.size or file.info after recursively listing the files, so I just used du.

# find all .Rproj.user directories under the current directory
x = list.files(
  pattern = "[.]Rproj[.]user",
  all.files = TRUE,
  include.dirs = TRUE,
  recursive = TRUE,
  no.. = TRUE)

# total size of a directory, summed from du's per-entry block counts
dir.size = function(path) {
  res = system(paste0("du ", shQuote(path)), intern = TRUE)
  ss = strsplit(res, "\t")
  ss = sapply(ss, function(x) as.numeric(x[1]))
  sum(ss)
}
sizes = sapply(x, dir.size)
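To see which projects are the worst offenders, you can sort those sizes and convert them to MB. A minimal sketch; the sizes values below are a toy stand-in for the vector computed above, and the 512-byte block size is a macOS assumption (du's default block size varies by platform):

```r
# Toy stand-in for the `sizes` vector from the script above
# (du block counts per .Rproj.user folder).
sizes = c(pkgA = 204800, pkgB = 1024)

# du reports disk blocks; block size varies by platform
# (512-byte blocks on macOS by default, 1 KB on many Linux systems).
block_kb = 0.5  # assume 512-byte blocks; use 1 on Linux
sizes_mb = sort(sizes * block_kb / 1024, decreasing = TRUE)
round(sizes_mb, 1)  # largest folders first, in MB
```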


# Tips for a Job Search (Academic Edition)

After going to a few interviews last cycle for assistant professor positions, I figured I should write on some of the points that I found relevant and general. Some of these were tips given to me, some of them are my own. All these represent my opinions and mine alone.

This will be at least a 2-part series, so I will have an update in the coming week or so.

Full disclosure, I did not receive a tenure-track position offer, so take these with a grain of salt. Most of the materials I had sent out are located on my GitHub and website. I found that most people ask previous applicants/students of your advisor/fellow graduate students for copies of their statements, but I feel like these should be more open and editable, so I published them online for our (and other) students.

# My Packets/Materials

My CV is located here, and my research/teaching statements and my cover letter for academia are located here.

# Step 0: Academic or Industry

One of the first things you should think about or know is whether you are looking for academic or industry tech. I chose to apply for academic positions as well as biotech/tech jobs. Not all my peers chose this, but I will tell you why I did:

1. Academic and industry turnaround times are different.
    1. Academic applications are due around November, but you should be applying in early or mid-October, as November is somewhat late for applications.
    2. Although you apply in the fall for these jobs, you have time, as institutions will not get back to you until January through March.
2. Both academia and industry offer good jobs. They have pros and cons (which I won’t list here), but both afford a solid lifestyle and usually some variety at work.
3. There were a lot of positions open in academia (2016).

Let’s say you want to apply for academia. The rest of this post will be discussing an academic job search and future posts will be on industry searches and also aspects of an interview.

Now that you’ve kept the door open for academia, you should know whether you are looking for a teaching gig or a research gig. They have vastly different responsibilities and soft/hard money ratios. No one has defined FTE (full-time effort) explicitly for me, but I have heard it ranges from 50-60 hours/week for an assistant professor.

From what I have seen, a “soft” money department requires 70-80% of your salary (FTE) to be covered by grants and the rest from the department (20-30%). These tend to be 12-month appointments. Many biostatistics departments, especially those in a school of public health, fall into this category.

A “hard” money department can range from 50-100% FTE covered by the school, which generally comes from teaching more courses. These are generally 9-month appointments, where the remaining salary (roughly 25%, for the summer) comes from grants. Statistics departments and biostatistics departments in schools of medicine can fall into these categories.

Each type of department has their own pros and cons and each department is different.

## Timeline

You should probably start applying to academic institutions around mid-October to early November. Mid-November (although applications will likely not be reviewed for quite some time) is a little late in the game. A lot of planning for visits goes on, you want to be invited, and it’s hard for a place to invite you if they’ve already filled a lot of their “yes” invitation slots.

Let’s assume you are applying to academia. First thing you need to do is make your packet.

Here’s what you’ll need:

### Curriculum Vitae

This is the most important part. This is your “abstract” or your first impression on a committee most times. It must be updated, formatted well, and have all the relevant information. I remember a professor noting “Someone is going to take 2-3 minutes on a CV. They will go through 20-30 in a session. You need yours to be top-notch.”

Let’s look at the items.

1. Name, website, email, phone number (optional but recommended). Any blog/social media (Twitter)/GitHub pages. You may want to include preferred communication (phone vs. email).
2. Education: What school and department you are attending. Include advisor(s), expected graduation date and whether you have defended (people will ask), areas of research, and dates attended. I have seen people include GPAs and others have not.
3. Previous education: Master’s/Bachelor’s – same as above but I have seen GPAs more commonly here.
4. Relevant (Research) Experience – not cutting grass in 11th grade. Also, usually limit to the last 10 years unless you have done a lot before that. Put pointed deliverables in the text about what you did/how you added value/why it matters to the reader.
5. Teaching Experience. Classes you’ve taught, TA-ed, or helped create. Include the professor you worked with (people love talking about mutual acquaintances) and your role. Include short courses/tutorials/workshops you led, created, or participated in as a teacher.
6. Published Publications. Generally descending/ascending by year. Make sure you highlight your name in the author order – people are looking for it. You may want to number them; mine are grouped by year but not numbered. These include in-press publications without the full citation. Make sure you update the citation when the article does get published.
7. Submitted Publications – these are submitted or under review but not accepted yet. These give people an idea of what you are currently working on and how many projects you work on at a time, though it may be a bad measure of that.
8. Talks/Presentations/Posters. Include all the talks you have given, including those at your own institution. Gave a talk at a conference? Include it. Working group? Include it. Journal club? Computing club? If it’s a presentation in front of others and the projector is on, include it. Some separate posters and talks, but I included them together. Include the conference name or event.
9. Working Groups. Maybe you work with a group or are on a training grant, but you don’t have publications from it – still add it. People many times know your group.
10. Honors/Awards. Win an award for a comprehensive exam or paper in your department? Put it down! ENAR poster award – umm yeah, that’s awesome to put down. If someone shook your hand and you got anything from a shout out in a room, certificate, to money, put it down.
11. Software. Do you release software – make that clear! Put links, say what it does. This includes web (Shiny) applications or any type of apps. Was it in undergrad? So what – put it down. I have had discussions with faculty the entire time allotted just about a web application I did one random weekend not related to my main research.
12. Skills – everyone knows Microsoft Word. Programming languages or other spoken/written languages. Wrote 1-2 scripts in Python? You’re a beginner, not an intermediate. Can you read someone else’s code and know what it does? You’re an intermediate at least (there are other criteria; it’s not set in stone). Do you feel like this is your language and you can speak in it as well as your native verbal language? You’re an expert.
13. Academic Service – I volunteer, and I put that stuff down. Any academic job requires “service” (usually of a different sort), but showing you do service outside of reviewing papers fits in line with many university missions. Moreover, if you start or run a club in your department, that is full-blown academic service.
14. Additional Experience – things that don’t fit above. Do a hackathon? Put it down here. Say the cool project you did and link to it.

There may be other sections for a CV, but those are mine, save for one. I have a “Research Interests” section at the top that says what I’m interested in/want to do research in. This may be good or bad depending on your view. It may become a reason to put you in the no pile before reading, but I think it’s useful.

Remember, academic search committees are looking for someone who can 1) do research, 2) teach, and 3) perform academic service, e.g. mentoring students, serving on thesis committees, and serving on other committees (seminar committee, student recruitment, job search committee).

### Research Statement

Depending on the position you are interviewing for, the teaching statement or research statement is likely to be the first thing read after your CV. That means you should spend a bit of time on it. Like grants, I hear the best way to write one is to get someone else’s who has been successful. Ask previous post docs, your advisor (though it may be dated), and previous students who have graduated for their statements.
I do not think I have been overwhelmingly successful in getting job offers, but I put my research and teaching statement on GitHub.

I have a few guidelines for what I would include in your research statement:

1. What you want to do in the next 5 years.
2. Why institution X and position Y is the place to do it.

“The Professor is in” has some good points in this and this. Check it out.

### Teaching Statement

In an academic tenure-track job (and most research-track), you will teach. Teaching can afford you discretionary money in research track and is expected in tenure-track, as a portion of your salary generally comes from teaching. If you haven’t taught a course, were you a graduate assistant, or did you design something for high school students? Put that down. You should highlight any teaching awards you have received in the past and how they have helped you or what led to your receiving them.

Overall, you should have a philosophy for teaching (at least loosely). As a biostatistician who works with a lot of colleagues who are not statisticians or biostatisticians, I (like to) think that I have the skills to bring the material “down a level” into more understandable terms. I believe there should be transparency in grading and an up-front level of expectation on each side of the classroom. Although I don’t find myself to be the most organized while teaching, I feel that organization is an important factor, because without it the goals of the class can become out of reach or unclear. Anecdotes may be OK, but only when directly relevant.

It seemed to me that many research institutions “assume” you are a good teacher and they focus more on discussing your research. There are zero places that will say that consistently good teaching is not essential to their program and your success.

If you don’t have any experience teaching, you should 1) consider getting some and 2) consider again that your job is going to require you to do this. Also, although conferences and presentations are not exactly teaching, you can maybe pepper something in there about feeling comfortable in front of a room of your (1-year-junior) peers.

### Cover Letter

Not all places require a cover letter, but some do and it’s a nice touchpoint to start your packet. Some professors have told me they don’t read them (if they don’t require them) and others do.

I think it’s good to include:

1. Which position you are applying for (most places have multiple).
2. Where you are graduating from.
3. Why you are applying there.
4. How you are qualified.

### Letters of Reference/Recommendation

For the people you choose to write you letters of recommendation/reference (I’ll refer to as “letters”), there are no hard and fast guidelines. Except that your advisor should be one of them. This person is the person you (presumably) worked the closest with in the past 4-5 years and they know your strengths and weaknesses best. Moreover, they have likely sat on a committee to hire a new person like you and know how to present your strengths (truthfully) in the best light.

Overall, the goal is to ask people early. If a number of students in your department are graduating, your letter writers may have more requests than they can handle in a reasonable timeframe. Some may ask which places you plan on applying to so that they can make some specific remarks (or a call or 2). Others may ask to see your CV to talk a little bit more about specifics (or to remember exactly what you did).

I worked closely with a non-biostatistician collaborator, and I applied to many departments where I’d be working with non-biostatistician collaborators, so I thought he was crucial for a letter. I also chose a previous advisor and professors with whom I did extensive projects. You should know whom you’ve done work with. If not, check your defense committee again.

Most places you will need 3-4 letters, but have about 5 people you have asked as some places will ask for 5 and some will “allow up to 5”. Make sure you have a file of their full name, email, address, phone number, position, and relation to you (aka advisor/collaborator/etc.).

# Step 2: Figure out where the jobs are

For Biostatistics and Statistics, there are some great places to look for jobs online:

If you have a place in mind, check out the website for the department. They will have it advertised. Does your membership organization have a magazine? It sounds dated, but a lot of universities still advertise there.

You can also email any of your previous colleagues to ask if their department is hiring. This should be a person you would feel comfortable emailing for other reasons.

Check Twitter and social media. Some departments have these and use them to disseminate information. Check them out.

# Step 3: Where do you want to live for the next 6 years?

### The number one question you should be able to answer is “Why do you want to work here”

You should have a solid answer for that question. Period. Everything else is ancillary to that point.

In many tenure-track positions, it’s 6 years to tenure. If you’re doing well, that is. Leaving a position after 3 years is reasonable, but it may not reflect well on you, and you will inevitably get asked “Why?”. Moreover, it may seem as though you hadn’t thought the position through thoroughly. While most of these concerns may be ridiculous, because people move jobs for a multitude of reasons (partners/family/area/weather/…life), the thoughts will exist.

So ask yourself: “Would I be comfortable/happy living in this town for the next 6 years?” Yes? Great. Geographic location and type of living (city vs. suburb vs. rural) are real factors in making your decision. They also go into offering someone a job. If an applicant seems great on paper and in the interview, but seems to hate the surrounding area or “could never see themself living there”, that may tip the decision over to a “no”. You’re not a robot and you have preferences; remember that.

After that question is answered, you more importantly need to answer: “Would I be comfortable/happy working in this place for the next 6 years?” – that’s a bit harder to know, but if there is a “No” creeping around there for some reason, that’s not a great sign. That’s not a dealbreaker for not applying, but remember one thing: interviews are draining. You don’t want to put all your eggs in one basket, but you don’t want a big basket of slightly-cracked eggs. Eggs in this metaphor are your “best self” and cracked eggs are OK, but not so great.

# Step 4: Filling out an Unholy amount of forms/Sending Emails

Applications are about dotting i’s and crossing t’s. They have some automation, but a lot of it is still very manual in its entry. You will have to write and copy and paste many documents over and over. Some systems use optical character recognition (OCR) to extract information from your CV. If you have a “standard” CV, this will work. Otherwise, you’ll likely get a bunch of misformatted text you need to delete.

You will need a separate account for each university, as they do not share information, even though most of them use Taleo as a backend. More are using LinkedIn as a resource, which may be a good reason to update your LinkedIn to look like your CV. Many of these systems have places for you to put information about your references, so remember to have that text file open with each reference’s information.

If the university you are applying to doesn’t have an automated system set up, you may have to send your packet to a search committee chair or an administrator who is listed on the posting. So you’ll email them and you’ll likely forget something, format something wrong, or forget to say what position you’re applying for, so you’ll get to answer a lot of emails.

Regardless, after the packet is signed off and in, you should (in like 3 weeks) send an email just confirming that everything is there. This is especially important if you don’t get confirmation when your letters of reference are submitted. Applications do fall through the cracks and emails do get overlooked. Do not trust any system in place and always double check your confirmation.

# Conclusions

This is one post in hopefully a few on some of my (hopefully useful) insights on the process of applying and interviewing for academic and industry positions as a quant/data scientist/data analyst/research professor. Overall, there is a lot of prep you need to do (and it’s already October 5). Some of it will be out of your hands (like letters of reference), which is why it’s so important to be ahead of schedule. Much of it is writing and revising, writing and revising, which you should be good at by now. The one take-home message is:

Don’t sell yourself short. You just finished a long, grueling process which at times you probably thought you’d fail at. But you didn’t. Maybe not all the things you’ve done are glamorous or earth-shattering, but you did interesting things. You did things that mattered. Remember that, and make others see it and believe it.

# Tips for First Year Comprehensive Exams

During our program, like most others, you have to take written comprehensive exams (“comps”) at the end of your first year of coursework. For many students it's a time of stress, which can be mitigated with some long-term planning. I wanted to make some suggestions on how to go about this for our (and other) PhD students.

## Start the week after spring break

Again, comps are stressful. You can be tested on anything from the material (ideally) from your first year. Professors can throw in problems that seem to come from left field, which you did not study or prep for. How can you learn or study all the material?

The way to make comps more manageable is to have a long-term studying trajectory. We have 2 weeks after the last exam to study and prep, and that is crunch time. In my opinion, that time should be spent working on the topics you're struggling with, annotating books for crucial theorems (if you're allowed them in the exam), and doing a bunch of problems. Those 2 weeks are not the time to cover everything from day one. That time comes before those 2 weeks.

The week after spring break (the week before this was published) is a good time to start your timeline. That gives you about 10 weeks to study and prep. You can work from the beginning of the year to the current time, or work backward. If you do nothing else in the first week, make a timeline of what topics or terms you will cover over what time frame. This will reduce stress by breaking the test into discrete chunks of time and discrete courses.

## Get Past Exams

What's the best preparation for the comprehensive exam? A comprehensive exam. This may be a bit self-evident, but I know I had the feeling of not knowing where to start. Our department sends us the previous exams from the past 5-7 years. Some may not be comparable with respect to difficulty or concepts covered, but I believe more questions are always better.

Vanderbilt has some great exams, as does the University of New Mexico, and Villanova. You can go to the reference textbooks (Billingsley, Chung, Casella & Berger, Probability with Martingales (Williams)) to try some problems from the chapters you covered as well.

### Work from the back

My strategy is to map each exam (or 2) to a specific week. I worked on the older exams first and saved (i.e. did not look at) the ones from the previous 2 years until the 2 weeks before the test. I would also set out blocks of time (2-3 hours) to try an entire section of an exam, simulating the conditions for that portion of the test. I think these sessions are helpful for gauging how well your studying is going.

## Make a study group

How can you study or summarize all the material? Well, it's much easier if you have a team. You can also bounce ideas off each other. Moreover, the exams you have don't come with an answer key; they are just the problems. It helps to have others who can 1) check your work (swapping), 2) give you their solutions if you can't work out a problem, and 3) discuss different strategies for solving a problem.

We had a group separately for each section of the exam (probability, theory, methods). This separation helps because some students are retaking only parts of the exam and can help in some areas but don't want to be working on the sections they do not have to take. It also helps segment time studying so you don't focus only on one area while leaving another area (likely the one you don't like and are not the best at) neglected.

### Delegate Study Areas

We separated different topics (letting people choose first) for each of the sections for that week. Of those not chosen, the rest needs to be assigned. The people/small team that was assigned to a topic needed to make concise (2-3 page) documents outlining the most important areas. They would also do a 5 minute presentation to the group about why these are the most important areas. That is the time to ask questions (and be prepared to get some when you present).

At the end of the school year, you have an organized study document. If you think your notes from the year are organized, you are likely mistaken. Even if you're highly organized (many people are not), there are usually too many superfluous details relevant to the course/homework/etc. and not the material. Split it up and let others weed through different areas while you focus on those you were assigned.

### Drop the weight

If someone does not deliver on their delegated task, drop them. If there was an understanding that they would do double next time, fine. But if no such discussion was had, they are out of the group. That person is not holding up his/her end of the bargain and is getting help for free while contributing nothing back. All students are busy, and accommodating that is fine, but it must be done before the session, at the time of delegation. Otherwise, that non-delivery will likely become a pattern and hurt the entire group. These are your friends and classmates, and it must be clear that any non-delivery is a direct negative to the group. No excuse excuses that.

## Do as many problems as possible

Do problems. Do more. And then do some more. The exam is a set of problems. Knowing the material is essential, but the more comfortable you are with doing these difficult problems in a compressed time frame, the better you are. Many tests up until now may have been collaborative, take home, and shorter. Your comprehensive exam will be a bit different, so you have to prepare yourself. We're talking about practice; it's important (sorry AI).

## Conclusions

Overall, the best way to perform well on the comprehensive exams is to learn the material as thoroughly as possible. Ideally, that happens during the course, but topics are forgotten and some areas are not fully understood the first time around. Therefore, a methodical, long-term study plan should be made to tackle the year's worth of material. I think a team is the best format for discussion and delegation, but you MUST do work alone (doing the problems), as the team does not collaboratively take the test. If you follow your plan (while obviously learning the new concepts in class), then you should feel as prepared as you can be. Best of luck. I would like to leave you with a quote/clip from the recent Bridge of Spies movie:
> “Do you never worry?”
> “Would it help?”

# R CMD INSTALL with symlink to R

## Problems with R CMD INSTALL

I was trying to install a package (ANTsR) by running R CMD INSTALL as normal on the cloned GitHub repository on a particular cluster. I kept getting errors and could not for the life of me understand why. Note, I had not used this cluster much and wasn't sure how it was configured.

I was pretty sure that this was a configuration problem on the cluster because I had installed this repo on:

1. My system
2. A shiny server
3. Another cluster

## Finding the Error

The build was using cmake, so I figured it was some flag. Oddly enough, I was getting this error (I put . where there are hard paths not relevant to you):

-- Check for working C compiler: ./cc -- broken
CMake Error at . (message):
The C compiler "." is not able to compile a simple test program.


Whoa. Now either the C compiler is broken (unlikely) or the configuration has an error/bug (much more likely). I didn't write the config for this package (it's pretty complex), so I felt a bit at a loss.

### Let's look at the error

Well, the error did push me to its log, CMakeFiles/CMakeError.log, so let's go there. Looking at CMakeFiles/CMakeError.log, I found the following line that seemed to be the problem:

Build flags: ;WARNING:;ignoring;environment;value;of;R_HOME


Hmm. I see that these are words, not really build flags. They also look like R output. I don't know how they got there, so I did some Googling.

I got to a page of someone having a similar issue: '“ignoring environment value of R_HOME” error when installing packages'. That sounds like my problem. OK, their admin reset R_HOME and everything was great. Good for them; not much help for me.

I found a bug report for R which discusses this, but there didn't seem to be any resolution.

## Finding a Solution

I went back to the simple R warning “WARNING: ignoring environment value of R_HOME” and found an old StackOverflow Post about it.

Now, one thing they discussed was

unset R_HOME


I tried that, nothing worked.

OK, well what about RHOME, let's unset that too:

unset RHOME


Error, fail. Moreover, these variables were never set anyway. Oh! Then I realized: if the error meant that R_HOME was set incorrectly, I could just set it correctly before R CMD INSTALL, and then it shouldn't error:

R_HOME=$(R RHOME)
R CMD INSTALL ANTsR


That's not to say the package will install without a hitch, but this part of the build seems to be fixed. (Note: I had to clean out the errors to rerun).

## Why did this happen?

I believe most of this happened by the configuration of R on the cluster and the linking of the R folder (try which R) to the true home for R (try R RHOME). I don't know where (if anywhere) in the setup/.bashrc/.bash_profile scripts R_HOME is set, but it seems that this discrepancy caused a problem.
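To check whether your own setup has this kind of discrepancy, you can compare the two paths the post mentions; a quick sketch (the outputs are environment-specific, so yours will differ):

```shell
which R                # the R launcher on your PATH (may be a symlink)
R RHOME                # the directory R itself reports as its home
readlink "$(which R)"  # if the launcher is a symlink, where it points
```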

# ggplot2 is not ALWAYS the answer: it’s not supposed to be

Recently, this topic was proposed at tea time:

> Why should I switch over to ggplot2? I can do everything in base graphics.

I have heard this argument before and I understand it for the most part. Many people have learned R before ggplot2 came on the scene (as well as many other packages) and learned to do all the things they needed to. Some don’t understand the new syntax for plotting and argue the learning curve is not worth the effort. Also, many say it’s not straightforward to customize.

As the discussion progressed, we went over 10 commonly made base plots, which I would recreate in ggplot2. The goal was to help base users break into ggplot2 if they wanted to, and also to see if the plots were “as easy” to make as in base. I want to discuss my results and the fact that ggplot2 is not ALWAYS the answer, nor was it supposed to be.

## First Plot – Heatmap

The first example discussed was a heatmap that had 100,000 rows and 100 columns with a good default color scheme. As a comparison, I used heatmap in base R. Let’s do only 10,000 to start:

N = 1e4
mat = matrix(rnorm(100 * N), nrow = N, ncol = 100)
colnames(mat) = paste0("Col", seq(ncol(mat)))
rownames(mat) = paste0("Row", seq(nrow(mat)))
system.time({heatmap(mat)})


   user  system elapsed
29.996   1.032  31.212


For a heatmap in ggplot2, I used geom_tile and wanted to look at the results. Note, ggplot2 requires the data to be in “long” format, so I had to reshape the data.

library(reshape2)
library(ggplot2)
df = melt(mat, varnames = c(&amp;amp;quot;row&amp;amp;quot;, &amp;amp;quot;col&amp;amp;quot;))
system.time({
  print({
    g = ggplot(df, aes(x = col, y = row, fill = value)) +
      geom_tile()
  })
})


   user  system elapsed
9.519   0.675  10.394


One of the problems is that heatmap does clustering by default, so the comparison is not really fair (but still using “defaults” – as was specified in the discussion). Let’s do the heatmap again without the clustering.

system.time({heatmap(mat, Rowv = NA, Colv = NA)})


   user  system elapsed
0.563   0.035   0.605


Which one looks better? I'm not really sure, but I do like the red/orange coloring scheme, as I can see differences a bit better. Granted, there shouldn't be many differences, as the data is random. The ggplot2 graph does have a built-in legend for the values, which is often necessary. Note that the rows of the heatmap are shown as columns and the columns shown as rows, a fact usually well known to users of the image function. The ggplot2 graph plots as you would a scatterplot: values increase to the right on the x-axis and up on the y-axis. If we want to represent rows as rows and columns as columns, we can switch the x and y aesthetics, but I wanted to stay as close to heatmap as possible.
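For what it's worth, that swap is a one-line change; a minimal sketch (using the long-format df built above):

```r
library(ggplot2)
# A minimal sketch: swap the x and y aesthetics so that rows of the
# matrix display as rows of the plot ('df' is the long data from melt above)
g2 = ggplot(df, aes(x = row, y = col, fill = value)) +
  geom_tile()
```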

### Don’t factor the rows

Note, I named the matrix rows and columns above; if I don't do this, the plotting will be faster, albeit slightly.

N = 1e4
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
df = melt(mat, varnames = c("row", "col"))

system.time({heatmap(mat, Rowv = NA, Colv = NA)})


   user  system elapsed
0.642   0.017   0.675

system.time({
  print({
    g = ggplot(df, aes(x = col, y = row, fill = value)) +
      geom_tile()
  })
})


   user  system elapsed
9.814   1.260  11.141


### 20,000 Observations

I’m going to double it and do 20,000 observations and again not do any clustering:

N = 2e4
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
colnames(mat) = paste0("Col", seq(ncol(mat)))
rownames(mat) = paste0("Row", seq(nrow(mat)))
df = melt(mat, varnames = c("row", "col"))

system.time({heatmap(mat, Rowv = NA, Colv = NA)})


   user  system elapsed
1.076   0.063   1.144

system.time({
  print({
    g = ggplot(df, aes(x = col, y = row, fill = value)) +
      geom_tile()
  })
})


   user  system elapsed
17.799   1.336  19.204


### 100,000 Observations

Let’s scale up to the 100,000 observations requested.

N = 1e5
mat = matrix(rnorm(100*N), nrow=N, ncol=100)
colnames(mat) = paste0("Col", seq(ncol(mat)))
rownames(mat) = paste0("Row", seq(nrow(mat)))
df = melt(mat, varnames = c("row", "col"))

system.time({heatmap(mat, Rowv = NA, Colv = NA)})


   user  system elapsed
5.999   0.348   6.413

system.time({
  print({
    g = ggplot(df, aes(x = col, y = row, fill = value)) +
      geom_tile()
  })
})


   user  system elapsed
104.659   6.977 111.796


We see that heatmap and geom_tile() both scale with the number of observations in how long they take to plot, with heatmap being much quicker. There may be better ways to do this in ggplot2, but after looking around, it seems geom_tile() is the main recommendation. Overall, for heatmaps of this size, I would use heatmap, after transposing the matrix, and use something like seq_gradient_pal from the scales package or RColorBrewer for color mapping.
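As a rough sketch of that color suggestion (assuming the RColorBrewer package is installed), one could interpolate a Brewer palette and pass it to heatmap's col argument:

```r
library(RColorBrewer)
# Sketch: take the 9-color sequential "OrRd" Brewer palette and
# interpolate it up to 256 levels for heatmap()'s col argument
pal = colorRampPalette(brewer.pal(9, "OrRd"))(256)
heatmap(mat, Rowv = NA, Colv = NA, col = pal)
```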

For smaller dimensions, I’d definitely use geom_tile(), especially if I wanted to do something like map text to the blocks as well. The other benefit is that no transposition needs to be done, but the data does need to be reshaped explicitly.
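For instance, a minimal sketch of mapping text to the blocks on a small heatmap (the matrix here is illustrative):

```r
library(ggplot2)
library(reshape2)
# Sketch: a small 5x4 matrix with the value printed on each tile
small = matrix(round(rnorm(20), 1), nrow = 5, ncol = 4)
sdf = melt(small, varnames = c("row", "col"))
ggplot(sdf, aes(x = col, y = row, fill = value)) +
  geom_tile() +
  geom_text(aes(label = value))
```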

Like I said, ggplot2 is not ALWAYS the answer; nor was it supposed to be.

## ggplot2 does not make publication-ready figures

Another reason for the lengthy discussion is that many argued ggplot2 does not make publication-ready figures. I agree. Many new users need a large

> ggplot2 does not make publication-ready figures

message with their first install. But neither does base graphics; maybe everyone needs that message with their first R install. R would boot up, and you couldn't start until you answered "Does R (or any statistical software) give publication-ready figures by default?" Maybe, if you answer yes, R self-destructs.

Overall, a publication-ready figure takes time, customization, consideration of point size, color, line types, and other aspects of the plot, and usually stitching together multiple plots. ggplot2 has a lot of default features that reflect considerable thought about color and these other factors, but one size does not fit all. The default color scheme may not be consistent with your other plots or with the data. If it doesn't work, change it.

And on this point I agree: many people make a plot in ggplot2, it looks good, and they do not put in the work to make it publication-ready. How many times have you seen the muted green and red/pink colors that are the first two default colors in a ggplot2 plot?

## Why all the ggplot2 hate?

I'm not hating on ggplot2; I use it often and value it. My next post will be about plots I commonly make in ggplot2 and why I chose to use it. I think it makes graphs that by default look better than many base alternatives. It does many things well, and more easily than base. It has good defaults for some things. It's a different grammar that, once learned, makes plotting easier. However, I do not think that ggplot2 figures are ready-set-go for publication.

Although I think ggplot2 is great, I most definitely think base R graphics are useful.

ggplot2 is not ALWAYS the answer. Again, it was never supposed to be.

# BibLaTeX with elsarticle

## Problem

I like BibLaTeX, and StackOverflow presents some reasons to switch to it. If nothing else, I can have multiple bibliographies easily. I can also use natbib citation styles and limit which fields are displayed in my bibliography. One problem with BibLaTeX is that it does not work well with Elsevier articles (class elsarticle), since natbib is loaded by default. Some other options to induce compatibility are presented here and here.
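As an aside, the multiple-bibliographies feature I mentioned comes from biblatex's refsection environments; a minimal sketch (the citation keys here are placeholders):

```latex
\usepackage[backend=bibtex]{biblatex}
\addbibresource{mybibfile.bib}
% ... in the document body:
\begin{refsection}
Text for the first part, citing \cite{key1}.
\printbibliography[heading=subbibliography]
\end{refsection}
\begin{refsection}
Text for the second part, citing \cite{key2}.
\printbibliography[heading=subbibliography]
\end{refsection}
```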

## Solution

So I edited the elsarticle.cls file to make it work. See the diff between the 2 files here:

diff elsarticle.cls elsarticle_nonatbib.cls

27c27
<  \def\RCSfile{elsarticle}%
---
>  \def\RCSfile{elsarticle_nonatbib}%
33c33
<  \def\@shortjid{elsarticle}
---
>  \def\@shortjid{elsarticle_nonatbib}
192,193c192,193
< \newcounter{author}
< \def\author{\@ifnextchar[{\@@author}{\@author}}
---
> \newcounter{auth}
> \def\auth{\@ifnextchar[{\@@auth}{\@auth}}
196c196
---
211c211
---
642c642
< \RequirePackage[\@biboptions]{natbib}
---
> %\RequirePackage[\@biboptions]{natbib}


If you've never read diff output, < means what's written in the first file (elsarticle.cls) and > means the second file (elsarticle_nonatbib.cls). The numbers correspond to the lines in each file, e.g., 192,193c192,193 means lines 192-193 in file 1 were changed to lines 192-193 in file 2.
Overall, I changed the type of article, commented out the natbib requirement, and changed the author field to auth (see below for why). The edited cls is located here.

## Things you need to change

The \author field conflicts between biblatex and elsarticle, so you must change the \author definitions to \auth instead. This is a minor change, but an important one. You can change that field to anything you want in the elsarticle_nonatbib.cls file (such as elsauthor).
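For example, an author block using the edited class would look like the following (the names and label are illustrative):

```latex
% With elsarticle_nonatbib.cls, define authors with \auth instead of \author
\auth[inst1]{Jane Doe}
\auth[inst1]{John Smith}
\address[inst1]{Some University, Some City}
```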

### Minimal Working Example (MWE)

I tried it with Elsevier's sample manuscript, changing the author fields to auth, and adding a biblatex-type heading:

\usepackage[
natbib = true,
backend=bibtex,
isbn=false,
url=false,
doi=false,
eprint=false,
style=numeric,
sorting=nyt,
sortcites = true
]{biblatex}
\bibliography{mybibfile}


and

\printbibliography


at the end, and the manuscript came out as if using natbib. The MWE is located here and output PDF is located here.

## Conclusion

You can use elsarticle with biblatex, with some minor changes. You may have to include this cls with the rest of your LaTeX files for them to compile on the editor's machine. Maybe Elsevier will change the LaTeX for more flexibility, but the StackOverflow question was asked 3 years ago and not much seems to have changed, so I like my solution.

# Converting JPEG2000 DICOMs to Uncompressed DICOM

TL;DR: a neuroimaging-specific post. Some DICOM files can't be converted by certain programs because of how they are encoded; I present a Matlab script to fix that.

## JPEG2000 DICOM data

Recently, I had some DICOM data from OsiriX that was stored in the JPEG2000 format. I wanted to convert it to NIfTI format using dcm2nii, but dcm2nii was not a fan of the compression, specifically the transfer syntax (which tells you how the data is encoded).

It spit out the error:

Unsupported Transfer Syntax 1.2.840.10008.1.2.4.91 Solution: use MRIcro


## Using Matlab's Image Processing Toolbox

For most of my endeavors, I try to use R for everything, but that's not always possible. For example, the oro.dicom package is great for working with DICOM data, but the JPEG2000 compression format is not supported. Therefore, I went to Matlab, which has the Image Processing Toolbox. You will need an updated version; as described here, the Matlab 2009b version will not work.

## The Code for Conversion

Overall, the code takes a directory, lists all the files, and removes the directories from that list. The for loop goes through each DICOM file, reads in the header information with dicominfo, and reads in the data matrix with dicomread. The transfer syntax is changed to newsyntax, a new filename is created (within a specified output directory outdir), and the DICOM file is written using dicomwrite. The createmode can be either Create or Copy. Create is the default option and performs error and missing-data checking. The Copy mode bypasses many of these checks, but may be required for certain DICOM files to be written.

rundir = pwd;
updir = fileparts(rundir);
outdir = fullfile(updir, 'Converted');
if ~exist(outdir, 'dir')
    mkdir(outdir);
end
addon = ''; % optional suffix for the output filenames
newsyntax = '1.2.840.10008.1.2'; % implicit VR little endian (uncompressed)
createmode = 'Copy'; % could be 'Create'
x = dir(rundir);
nodir = ~[x.isdir]';
x = x(nodir);
x = {x.name}';
for ifile = 1:length(x)
    name = x{ifile};
    stub = regexprep(name, '(.*)[.].*', '$1'); % strip the extension
    stub = [stub addon '.dcm'];
    name = fullfile(rundir, name);
    stub = fullfile(outdir, stub);
    dinfo = dicominfo(name);   % read the header
    data = dicomread(dinfo);   % read (and decompress) the image data
    dinfo.TransferSyntaxUID = newsyntax;
    dicomwrite(data, stub, dinfo, 'CreateMode', createmode);
end

After this conversion, dcm2nii worked as usual.
Other options exist, such as the DICOM JPEG 2000 Module, but that must be licensed, and Matlab has implemented this functionality in imread. I just figured that if you ran into this problem, this may help.