The 3 ‘Times’ of a Project

During a conversation with Sean Kross about projects, particularly data science projects, I tried to explain how things can go right and wrong with a project. I was explaining things with respect to being the data scientist on academic projects, but I think these issues are cross-cutting so figured I’d post them here.

I thought back to when projects did not go well or someone was left frustrated or angry during or at the end of the interaction. To me, the issues usually come down to the 3 “time”s of a project: time, timeline, and timeliness.

Before talking about these “time”s, I think it’s important to note that most of the frustration really comes down to miscommunication. The miscommunication or differing expectations, in my opinion, usually fit into one of these time buckets.


Time represents how long you estimate it will take to do something. Particularly, this relates to how many hours a week you can work on a project, or percent effort, also called %FTE (percent full-time equivalent). “Time” also means there should be a discussion of whether you have the space in your schedule to commit to something. In many instances you may not have space but you’ve been “strongly urged” to do the work.

Helpful things to do:

  1. Do not say how many hours you have available. Tell them 80% of that or tell them how many you want to work on this. Time is a fluid – it fills the space provided.
  2. Sometimes work out 1-2 “hypotheticals”, such as what if the data is in terrible shape. Even better, wait to give a yes or a no for accepting a project until after you get some of the data, but most people assume you are a “yes” once you get the data.
  3. Estimate (or overestimate) how long the first set of tasks will take.
    • this sets the precedent for the project.

It’s fine to deliver this a bit quicker than projected. It excites people (“That was fast!!”), but you can still lag on sending it exactly when it’s done. This time slack allows you to think about whether the results are right, but more importantly makes it so that when things go wrong (WTH is that data point!?) the expectation of a quick turnaround is mitigated a bit.
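As a rough sketch of the two padding rules above (quote about 80% of the hours you really have, and overestimate the first tasks), the arithmetic might look like this. The function names and the 1.5 padding factor are my own illustrative assumptions; only the 80% figure comes from the list above.

```python
def hours_to_quote(available_hours, share=0.8):
    """Hours per week to tell a collaborator, given what you truly have free."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return available_hours * share


def padded_estimate(best_guess_hours, padding=1.5):
    """Overestimate a task so there is slack when the data misbehaves.

    The 1.5 multiplier is an assumed placeholder; pick your own.
    """
    if padding < 1:
        raise ValueError("padding below 1 would be an underestimate")
    return best_guess_hours * padding


print(hours_to_quote(10))   # 8.0
print(padded_estimate(6))   # 9.0
```

Delivering in 6 hours what you quoted as 9 is the “That was fast!!” effect; the extra 3 hours are the slack for checking results.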

One of the main issues is that novelty is a cruel mistress. New and shiny things are exciting. Most projects sound like they can change the world or practice or our understanding of an area. Some can, not all do. Think of a project you’re on right now and imagine it was dropped today, and someone came back in a week and asked how much time you could dedicate to that project again. Would it be the same? How much less? Think of your good and not-so-good projects; averaging them might give you an idea of how you’ll feel about this new project in 3 months.


I know you’re saying “3 months from now!? I get all my projects done quickly!”. That brings us to timeline. The full timeline of the project is how long the overall goal or set of goals for a project is going to take. This discussion is usually more overarching than the time discussion for a specific task. Is the project one paper? Developing an entire suite of work? Multiple clinical trials?

But let’s focus on one analysis, that (hopefully) results in a paper.

A few questions that could be helpful are:

  1. When do you plan on submitting the paper?
  2. Are all the patients/subjects enrolled followed up?
  3. Is someone (student/intern/visitor) leaving soon and this needs to be done by then?

Many times you’re not privy to the internal workings of a group, including the fact that the data they’re about to give you may have been stopped and started 3 different times with different analyses.

Many people think once the paper gets thrown down the ravine to the wolves of review, it’s out of sight and mind and never thought of again. But then, it crawls up, bloodied and beaten, back from the land of reviews into your line of sight: REVISION!

You need to ask: When will reviews likely get back from this journal, what’s the turnaround time on those usually (2 weeks to 1 month), who will take lead?

Other important timeline questions:

  1. What other projects will start once project 1 is done?
  2. What if I need to move my time around after this project is done, who can take over?
  3. What if things significantly change? Examples: your grant gets funded! Your main collaborator’s big grant gets funded! You are planning on changing jobs!


Although it may be a bit of an abuse of the term, the last time is timeliness. I consider timeliness similar to responsiveness. Many projects have long or short-term explicit goals, like a paper or book, but many have implicit deliverables along the way, like short presentations. The discussion here is something like “If you send me a question about this project, how fast do you expect me to respond? Same hour? Same day?”. This discussion sets up the ability to use keywords such as URGENT or NON-URGENT. These can be abused, but at least you know what one party believes is important so that they don’t come back later and indicate you shrugged off something for another day that was pertinent.
Also, effective email writing techniques help, such as putting in an estimate of how long you think a task will take (could be way off – again good to know what people think) or putting a TL;DR (too long; didn’t read) synopsis at the beginning of a long-worded email.

We’re all battling the evil dragon of email back daily, trying to rescue the prize of “free time”. These little things allow people to prioritize tasks for a project and not open a 2-page email, be overwhelmed and close it, putting it off until later. A little TL;DR can make things a tad easier. Remember that people use email in very different ways; that long email may be a stream of consciousness mess or a well-itemized TODO list that people should refer back to. Now many, and I mean many, different project management solutions exist for this type of work, but 1) I can’t find anyone who agrees on which one to use, 2) some are unwilling to pay for these solutions, and 3) if you’re a data scientist you’re usually not able to force the use of these. Even if you can force using this solution, the next project may say no.

Although most don’t use “project management” tools per se, there are services that most are amenable to that can help with these issues. For example, shared folders such as DropBox, OneDrive, and Box provide a one-stop shop where materials should be created. Writing a paper? Use Google Docs, or for the LaTeX crowd, Overleaf. As an aside, Overleaf is a great product that you can even use knitr in! Once they make a way to use this with Rmarkdown (I’m looking at you RStudio), I will throw down the gauntlet and try to only use this service, as it incorporates LaTeX/PDF, dynamic documents, can output DOCX, PPTX, slide decks. ANNND back to other tools, like GitHub for a shared space for code. At the end of the day, you’re trying to end the torment of an email with an attachment of Manuscript_FINAL_2020May15_JM3_REALFINAL_willThisEverEnd?.docx. Many of these tools are painless replacements for the email song and dance, and have version history and track changes. Push for them.

I have had horror stories of timeliness. I have had emails that said WE NEED THIS RIGHT NOW. Long into the night, breaking my back (but probably neck because ergonomics is hard) for this project, I’d send off my finished product. Then I’d wait. And wait. And forget. Then remember and get mad that I hadn’t heard anything. Then ping the email and get nothing. Then I’d look up, 6 months had gone by, and I had realized my beard looked like Tom Hanks in Castaway, and feel the serene closure of letting a dead project die. Then a week later I’d get an email saying Thanks for that! WE NEED THIS OTHER THING RIGHT NOW. Don’t do this for your mental health, the health of your facial hair (or lack thereof), and for the stress balls that may explode otherwise.


Time is a fickle thing that we think we have none of (today), a world of (I’LL NEVER DIE!), or some (let’s have a quick chat). For projects, time discussions and expectations are vital to a good collaboration. Like an awkward first date, sometimes you need to get some of the cards on the table otherwise you end up down the line as a depressed John Cusack as he has played in so many movies. Talk about your 3 times of a project, be happy, and collaborations will hopefully flourish!

Some Thoughts as a Junior Faculty (at JHSPH)

Being a Junior Faculty member, or considering it, leads to a lot of questions. I hope to answer a few of them here. Some of my statements will be specific to Johns Hopkins Bloomberg School of Public Health (JHSPH) and maybe specific to the Department of Biostatistics. Disclaimer: this is the only department I have been in (for PhD and faculty), so not all of these may generalize or apply. All of these opinions are my own and all of this is knowledge that was not taken in confidence.

Do you want to be a faculty?

First and foremost: do you want to do this? I'm not saying you need to be 100% sure about everything and this has been your lifelong dream and you've never thought about anything else. I'm saying, did you like writing papers and doing research, where for much of the work you needed to be independently motivated? I like going down the rabbit hole and finding out where it leads me. Maybe too much at times. That means finding a bug in my code, figuring out if my hypothesis is off the mark, or whether I can tackle this problem in front of me. The independence is a large draw for me.

Overall, I believe the flexibility/independence to work on what you're passionate about is the main draw of academia. That doesn't mean you'll never have to do things you don't like or aren't passionate about. It means that you'll have the opportunity to explore your own ideas if you want, or work on interesting research that just-so-happens someone else wrote the grant for. A lot of the other perks of academia you can find in other industries. Many jobs today are allowing for flexible time schedules, conference travel, up to 20% independent research time, remote work, and other things that were unheard of 25 years ago. That's not a bad thing for academia, but just that those perks are not only for academic faculty.

That independence/flexibility comes at some cost. For one thing, you may be paid below “market rate” in industry or consulting. The main cost I see, though, is that independence can be hard sometimes, at least for me. I don't like being told what to work on all the time (see rabbit hole above), but I do like some structured work that has deliverables. Trying to reorder your priorities fluidly can be a bit draining.

One of the best analogies I've heard about being a junior faculty is that you're your own startup. You're the CEO of your own career. You're finding funding, usually by grants rather than VCs. You're a recruiter, usually of students and other collaborators. You're your own assistant, scheduling meetings, staying on top of your email, booking your own travel (maybe), and running the meetings. And you're the team doing the research, writing the code, and delivering the product (papers/presentations/grants); you're the advertiser of the product (vlogs/blogs/presentations/papers/classes). Over time, these roles change in the percent of time you spend doing each task, but when you start out, you're it. And lastly, you're setting the agenda and vision for your career.

I'm an impostor: I don't have ideas

Many graduating students have the concern that they will not have enough ideas to generate new papers or grants. I'd say that's generally not something you should worry about. No area of research is completely explored, but it may be an issue if you are too narrow in your scope. Almost every paper I have finished has led to at least 3 more questions. Those questions may be about that data set or method or about new data we need collected. Even if your well of ideas dries up temporarily (highly doubtful), if you have energetic collaborators/mentors, they will have enough ideas to lend you. If you're working on something someone else suggested, I recommend that you 1) understand why it's important before starting, 2) make sure you have enough interest/passion in the topic (for those nights where the project has turned into your worst enemy, this passion keeps you from totally throwing it in the garbage), 3) discuss expectations before doing the work with respect to the level of help those suggesting it will provide, and 4) make sure authorship is at least discussed a bit before doing a whole bunch of work. If that doesn't work, go to one conference and see if you don't come back with a handful of ideas.

Soft money vs. Hard money

Soft-money generally refers to salary funding coming from grants or other awards rather than tuition or endowments. Hard money is the opposite and many times the majority of your salary will come from teaching. There are numbers such as “2-1”, “2-2”, “1-1” that refer to the number of classes you teach in a semester for hard money positions. JHSPH is generally a soft money environment. Moreover, we are in a quarter (not semester) system, so the numbers do not mean the same thing. Depending on how much you teach, however, you will be required to cover anywhere from 60-85% of your salary as a tenure-track faculty at a given time on grants or awards. If you're research track, make it 75-100%.
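To make those percentages concrete, here is a small sketch of the arithmetic. The salary is a made-up round number and the function name is hypothetical; only the 60-85% and 75-100% ranges come from the paragraph above.

```python
# Coverage ranges from the text, as whole percents.
TENURE_TRACK = (60, 85)
RESEARCH_TRACK = (75, 100)


def salary_to_cover(annual_salary, coverage_percent_range):
    """Return the (low, high) dollar amounts that grants or awards must cover."""
    low, high = coverage_percent_range
    return annual_salary * low / 100, annual_salary * high / 100


salary = 100_000  # hypothetical salary, chosen only to keep the math readable
print(salary_to_cover(salary, TENURE_TRACK))    # (60000.0, 85000.0)
print(salary_to_cover(salary, RESEARCH_TRACK))  # (75000.0, 100000.0)
```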

Research Track vs. Tenure-Track

First off, I'm an Assistant Scientist at JHSPH. This means I'm a research-track faculty member. Other institutions have different names for this track and also may have different tracks for research or clinical work, etc. In some departments, research-track faculty members are treated starkly differently than tenure-track members, not just implicitly: some have different voting rights and restrictions on their work and/or mentorship. In Biostatistics at JHSPH, research/scientist-track members have similar voting rights (not completely the same) and are treated very similarly to tenure-track faculty.
For example:
  • You can teach courses.
  • You can have discretionary accounts.
  • You can be the PI on a grant (or co-PI).
  • You usually get competitive offers and can use the AMStat news to guide your salary.
  • Skills related to research, teaching, service, and mentorship are extremely useful.

Some differences worth noting are:
  • You cannot be the primary research advisor to a PhD student. You can be an advisor, just not the primary. You can be a primary research advisor for a Master's student.
    • This has pros and cons. You can't be the primary mentor, but you can still work with students, and you tend not to have to find funding for them, as that is likely the duty of the primary advisor.
  • You don't have a built-in sabbatical, whereas it's more assumed for tenure track. You could potentially negotiate this.
  • You are usually hired under a project or a direct mentor.
    • This does not imply that you cannot work on your own work, but that initially you don't have to find all of your funding when starting.
  • The search, hiring process, and requirements from the dean are not exactly the same as for tenure-track.
  • Startup packages are not necessarily the same. Again, this could potentially be negotiated.
  • You start working on day 1, compared to some “protected time” with tenure-track faculty.
  • You don't have a “tenure clock”. This can be a double-edged sword.
    • On one hand, you don't have the same timeline pressure.
    • On the other hand, you may need to make a concerted effort to set up meetings with your chair and/or mentor to discuss progress with respect to promotions. Our chair has yearly progress meetings with all faculty, regardless of track.
    • This can also lead to more variable promotion timelines. This can be mitigated by clear communication from the chair and mentor about expectations and previous precedent.
  • You have different expectations for promotion. These can vary wildly from institution to institution. We have similar expectations in many respects at JHSPH, but do not have as many external letters required for the promotion committee.


How do you choose a mentor? Well, find someone you can talk with, that knows stuff about stuff you don't know well, and will agree to make time for you. We have one formal mentor. But most likely, you'll have many mentors. One is likely to be in the department, but you'll likely find mentors that are collaborators. There are some informal setups in our department, which work overall because most people are open to having you schedule a meeting or walk in and ask some questions. If you find a department where that's not the case, try to get something more formal. Generally, someone in a working group you are in may be a good place to start. We also have an informal lunch on the calendar each day where faculty/post-docs may join, which allows you to meet other faculty that you may not directly work with. I have found this immensely helpful to get to know my fellow faculty, or get some advice from senior faculty that have dense schedules I would not feel comfortable sequestering an hour from.


One of the most asked questions for new junior faculty is about funding. These questions and discussions can be stressful, especially if you have no experience with grants. I had some experience with grants when being a Master's-level statistician, but never from the viewpoint of a PI.

How do Grants work?

Honestly, I'm still not 100% sure. NIH R Grants have different requirements with respect to page limits, but they are generally between 6 and 12 pages. That seems like a lot but it isn't. Remember, all the aims of the grant, the introduction, the figures, and the novelty of the grant need to go in there. Don't go over the limit; period.

One thing we do in JHSPH Biostatistics is that the faculty share written grant proposals. Some of the grants have been funded, some have not been funded but discussed, and some were not discussed. This allows junior faculty who have never been on an NIH panel to see an array of grants. I learned writing papers by reading other papers and applying a similar logic structure. I imagine grants are a similar endeavour. Disclaimer: I've never applied for an NIH grant where I was the main PI and the one who did the lion's share of the writing. But when I do, having examples to draw from can help immensely. I have submitted to internal and other grant mechanisms, but not NIH as a PI.

Study sections and that stuff

I will tackle a few simple questions now. At JHSPH, as it is a school of public health, a lot of grants come from the National Institutes of Health (NIH), at many different institutes or centers in the NIH (called ICs). Many of our faculty (not myself though) have received grants from the National Science Foundation (NSF). These tended to be more theoretical, but not always. There are also a number of internal grants at an institution. For example, I have a DELTA grant, which is an internal JHU grant.

Grants have letters and numbers; the letters generally refer to the type of grant it is. Many grants you will apply for will be R grants, which stands for research. Particularly for junior faculty, some target Career Development Awards (K grants). Many faculty target R01 grants, as they are the most common. Junior faculty may be more likely to target R21 grants as well, as they are for research in earlier stages.

If you are a postdoctoral fellow, you can apply for a K99/R00 (sometimes called a “kangaroo” grant), which is a “Pathway to Independence Award”. These are similar to R01s in the funding amount usually. They are highly competitive, but the number of eligible applicants is smaller than the number of faculty.

For many sections, there are requests for applications (RFAs) that go out. These are proposals that call for grants that do a specific type of work, tackle a specific subject area, or require specific infrastructure resources. Make sure you're on the mission of the RFA before going forward. In order to do that, you'll want to talk to a program officer. In many respects, these people are similar to project managers in other settings. They have a portfolio of different divisions; this proposal is not their only one. Most program officers (POs) have extensive backgrounds in science, but not always specific to your field or the niche of the RFA. That can cause struggles when discussing some of the importance of your work, but that's a good thing. It's a good thing because the panel of the grant isn't going to be niche people. If the program officer doesn't see how your proposal fits with the RFA, it's highly unlikely the study section will see it either. Also, the program officers look at a number of RFAs other than this specific one, which allows them to maybe identify other sections or RFAs where your grant may be more appropriate. Don't harass them, but they are your contact to ask questions and you should use them.

Funding: Direct and Indirects

Grants have direct and indirect costs. The direct costs are the monies needed to do the work, such as salary, computing, data collection/analysis, etc. This is generally how you can fund your salary, your work, students, and/or post-docs. The indirects or indirect costs relate to money in the budget that is not directly related to the work (hence indirect), such as money for office space, staff, heating/cooling, electricity, other institutional requirements/support. The “indirect rate” is negotiated by the school and the funding body.
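As a minimal sketch of how the two pieces add up, assume the indirect rate is applied as a flat percentage of all direct costs. In practice the rate may apply to only part of the direct costs, so treat this as an approximation; the dollar amount and the 50% rate are made up for illustration.

```python
def grant_budget(direct_costs, indirect_rate_percent):
    """Return (indirect_costs, total_costs) under a flat indirect rate.

    Simplifying assumption: the rate applies to every direct dollar.
    """
    indirect = direct_costs * indirect_rate_percent / 100
    return indirect, direct_costs + indirect


# Hypothetical numbers, not a real negotiated rate.
indirect, total = grant_budget(200_000, 50)
print(indirect)  # 100000.0
print(total)     # 300000.0
```

The point for a new PI is that the amount requested from the funder is larger than the amount you get to spend on the work itself.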

Write a lot

I recommend the book “How to Write a Lot: A Practical Guide to Productive Academic Writing”. It's not expensive and it's a short book. Note, this will not teach you how to write well or publish. It's specifically on how to write a lot. As an academic, that's the majority of the job. Writing papers, writing grants, writing letters of recommendation (eventually), writing letters of support (“I'd work on this grant for sure”), writing presentations, etc. Writing a lot can help, even if the writing isn't that great to start. The book also recommends a writing accountability group (WAG). We have one with junior faculty in our department, and it has led to grants and papers that would not have existed otherwise. If you don't have one, start one. At JHU, our faculty development office helps create them and facilitate them if you don't have the ability or pull to start one on your own.

How do you recruit students?

First, students need to know who you are. That means attending departmental events and meetings where there are students. We have a tea time every week where students and faculty share tea. We discuss a number of things: life, pets, that week's seminars, other non-statistics human things. We also have a chili cookoff at the beginning of every academic year so that new students can meet the department. We also have a holiday and end of the year celebration. We additionally have joint faculty/student meetings to discuss departmental matters. We have had off-site retreats approximately every 1.5 years to discuss long-term matters of the department and adjustment to our vision and our mission. Our offices are all on the same floor, so they see us in the halls and know where we sit. If you are in a department where that's not the case, try to be somewhere visible some days a week (like a coffee shop in the building the students are in) if possible.

A large resource for recruiting students is teaching the first or second-year courses. These students get to know you and how you work, and you get to know them. They at least know who you are if you are their teacher (hopefully). In some hard money environments, you may have discussions at interviews about “buying out” of teaching. It can be beneficial to discuss this option, but not teaching may put off some departments and may limit your ability to recruit students quickly. That being said, I find teaching incredibly rewarding, but also extremely tiring. I have never taught a lot in one day and not felt like it took a lot out of me. But I've never seen those days as days where I “got nothing accomplished”, which has happened with strictly research days at times.


At JHSPH, I had the tremendous opportunity to attend a lot of conferences. I like to travel and see places, network and meet people, and don't mind public speaking. All of those traits are helpful for going to conferences, but are by no means necessary. Sadly, some programs allow students to go to one conference over the course of their degree. This sometimes conveys the idea that conferences aren't useful or aren't for students. Both are patently false. You don't need to go to 5 conferences a year to be a successful faculty member. Heck, you don't need to go to any. But conferences are great places to meet people in your field, get your name out there (advertiser), and make collaborations and connections for future projects. Oh, and students definitely do go to conferences (maybe a future post-doc?). To fund the travel, hopefully there are funds in the grant for travel and conferences. If not, you may have money in a discretionary account that you had from a startup package or other means. Our department also will pay for one conference a year for all faculty. If none of those options exist, try your hardest to get a travel award from the conference. These are highly competitive, may be only open to students or post-docs, and will likely not cover all the costs incurred at the conference.

Staff and Administration

Lastly, respect the hell out of your good staff and administration members. My mother was a secretary at a university. She had stories about professors who were not the nicest to staff and that stuck with me. If you think it's hard getting a meeting with a collaborator, imagine trying to organize 5 senior faculty from different departments to get on thesis defenses or filling a speaker schedule where no one answers emails. The administrative staff may be the gatekeepers to senior faculty calendars or room schedules. They are the glue that keeps things together at times and the oil that keeps the machine running at others.

The administrative team also usually knows the ins and outs of grant submissions and may be the ones submitting the grant. Respect their time. Do not expect them to reply on weekends or after hours unless absolutely necessary. Our admin at JHSPH Biostatistics have made policies about requiring notification that we are submitting a grant a period of time before submission and the faculty agreed. Moreover, most staff and admin have been in the department much longer than you; they know who to talk to, the answers to your questions, and they generally will meet with you and do a Q&A. Most importantly, if you have good people that do not feel respected at their job, they will leave.


Try to find someone who's done well in the environment you're in. That is likely a mentor, but maybe not. Try to have people know who you are. You'll have ideas for research; you'll probably write grants. Ask successful grant writers for copies of their work to use as a starting template. You weren't always an expert on how to analyze data or write papers; it takes practice, help, and usually a template. Like most things, a lot of anxiety and frustration can be mitigated or avoided by having open, frank discussions about expectations, requirements, and getting feedback. Remember, you will likely have to ask for help if you need it, but your department wants you to succeed.

The way people use AI is ruining Reproducible Science Again

The basic premise of this article is this: “Would you accept a paper that did a logistic regression, but did not publish the weights due to intellectual property?”. If you answer yes, then I do not think you will agree with some of the following statements. If so, I thank you for your reviewing service and will let the authors whose papers I review know to send them to you.

If you answered no, my question to you is, why do we accept this for artificial intelligence (AI) models? Here I'm using AI in the broad sense, including machine learning, deep learning, and neural networks. In many of these cases, the model itself is only useful as an object. For example, for a random forest, the combination of the individual trees is necessary to do prediction. It is extremely difficult (likely impossible) to reduce this to a compact representation that would be useful in a paper to do prediction. In a regression framework, even penalized regression, the model can be shown by a series of weights or beta coefficients. For deep learning models, the number of parameters can explode given the complexity, depth, and representation of the network. When using a convolutional neural network (CNN) to segment or classify images, there can be millions of weights for different areas of an image to get a final result. These weights are impractical to print out in a PDF, text file, or supplemental material, as it would take a researcher hours to reconstruct this into the network. Thus, the model weights should be released if the results are to be reproducible or useful on an external data set. I will yield that a CNN can be represented in a figure to some degree and be reproduced, but many times other processing, normalization, augmentation, or other non-shown steps are required for reproducibility.
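To give a sense of scale, here is a back-of-the-envelope count of parameters for a logistic regression versus a deliberately small CNN. The architecture is a toy I made up for illustration (three 3x3 convolutions and two dense layers, assuming a 64x64 single-channel input reduced to 8x8 feature maps before flattening); it is not any published network.

```python
def logistic_regression_params(n_features):
    """One weight per feature plus an intercept."""
    return n_features + 1


def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights of a square 2D convolution, with one bias per output channel."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels


def dense_params(n_in, n_out):
    """Weights of a fully connected layer, with one bias per output unit."""
    return (n_in + 1) * n_out


# Toy CNN (hypothetical architecture, for scale only).
toy_cnn = (
    conv2d_params(1, 32, 3)
    + conv2d_params(32, 64, 3)
    + conv2d_params(64, 128, 3)
    + dense_params(128 * 8 * 8, 256)  # flattened 8x8 feature maps
    + dense_params(256, 2)            # two-class output
)

print(logistic_regression_params(10))  # 11
print(toy_cnn)                         # 2190594
```

Eleven numbers fit in a table in the paper. Two million do not, and this toy network is small compared with many networks used in practice.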

Why is this Happening?

Frameworks such as TensorFlow, Keras, Theano, and PyTorch make deep learning more usable for all researchers. Fitting these models or predicting output (also called inference) can be done on a number of platforms, including mobile, which makes it highly attractive. Moreover, container solutions such as Docker and Singularity allow the entire system on which the model was used to be preserved. So what's the issue? The growing issue with the use of AI, especially in applications to medical data, is that people are not releasing 1) their data, 2) their code, or 3) the model weights.

Release the Data?

Let us tackle the easiest first: the data. Some data was collected without consent to be released, or has protected health information (PHI) that cannot be released under protections such as HIPAA (Health Insurance Portability and Accountability Act). It is completely reasonable for researchers to not be able to release the data. Thus, this is totally valid. I will say that if they can release the data, many times it is stated that it is “available upon request”, but adherence to this policy is not enforced by many journals as the paper is already published. If authors simply ignore these requests, there can be few ramifications. This may be understandable because of the downsides to the researcher of releasing data: 1) users could find issues (which may be a benefit), 2) it may require maintaining data usage agreements, or 3) many think of this as “intellectual property”, which I will address now.

Release the Code?

Many people, seeing how well AI is working in their application, think that their method could be turned into a commercial product. This may be valid, but must not be used as a shield against reproducible research. Let's turn to releasing the code. If there is no novelty in the framework they used, such as an off-the-shelf VNET, then the code should be released as nothing is “secret”. Even with slight adaptations, unless large and completely new, the code should be released. Many ask, if it is off-the-shelf, why would the code need to be released? The reason is that although off-the-shelf methods are used, the steps that get the data into the correct form before running them, including data processing and checks, need to be available. Thus, these “ancillary” scripts are actually crucial for research and reproduction. Even if the architecture is completely novel, it will likely be described in detail in the publication, and thus the code potentially could be released. Let's assume though that you cannot release the data or the code.

Release the Model?

Lastly, releasing the model. Again, the “model” in this setting can be a complex set of trees or weights, amongst other things. It's uncertain whether PHI can be recovered from these models, which is a valid concern given the data cannot be released. After many discussions, I assert that many don't release the model because it is “proprietary” or has potential “intellectual property” that can be commercialized, which I don't disagree with. Where I disagree is that many applications will not fit the requirements for a patent, as slight changes to an algorithm can classify it as a different algorithm. Using these models in a software-as-a-service (SaaS) framework could potentially be profitable, but it's doubtful this will ever happen. Moreover, there is no time limit on these commercializations. So if you claim this can be commercialized, but after 5 years no progress is made, is it really going to be commercialized, or is it simply an impediment to reproducible and progressive science? If a model fits in the cloud but never comes down, is it really a model at all?

Any Solution?

So what's the answer? I don't know. But here's some help in reviewing.
Personally, I have been putting in boilerplate concerns with a number of medical imaging AI projects, which hopefully you may be able to use:

  • Overall, the other main concerns are 1) the data is not available to determine quality, 2) no software is available to test or apply this methodology to another data set, and 3) the segmentation/detection results were not directly compared to any of the methodology for segmentation previously published.
  • Releasing the code for processing and modeling, including the final model weights, would greatly increase the impact of this paper and is highly encouraged.
  • Are the data released anywhere? Will it be made public? Will the segmentations/classifications?

I've had authors and editors push back on the concerns above, and I have yielded in some cases. I don't think these are 100% necessary for publication, but I would like to know the reasons that I cannot reproduce this analysis or use it to learn how to do better science. Until journals give clearer guidance about these policies (instead of omitting them in many cases), I guess I'll just be ice-skating uphill.

R projects may make large files


I have an “old” MacBook, it's a late 2013 MacBook Pro. I haven't upgraded because I wasn't a fan of the butterfly keyboards and the top row bar. I'm glad to hear you can now get new MacBooks with the “old” keyboards. Also I don't see large advances in the specs of the machine, but I'll stay with Mac because I love the OS and integration.

That being said, one of the downsides to having an old MacBook is that I'm struggling with space at times. I offload a lot of things to the cloud and my external drive, but I like having things locally. Also, I am a huge fan of the RStudio Projects framework. I would say the RStudio IDE is a must for using R nowadays, at least if you're a new user. RStudio Projects alleviate a lot of the problems of working outside of an IDE, such as switching directories (opening an .Rproj file opens to the root directory and here::here uses this) and having multiple unrelated scripts open (each project has its own session/window), and they provide additional build tools for package development.

How the RStudio IDE integrates with Package Development

Using RStudio Projects for package development is great. The tools integrate with devtools, which changed the game for making a package. RStudio additionally wrapped this functionality in keyboard shortcuts and GUI clicks, along with integration with Git. When you are compiling and building a package, the RStudio IDE knows that you should restart the R session because all the packages (and options) you previously loaded should be reset. Now it doesn't want you losing any saved work, so all the objects are cached, the session is restarted, and the cache is restored.

The issue

One of the downsides with this strategy is that I'm impatient. Sometimes, especially with large packages or objects, the RStudio IDE will freeze. I will wait and get annoyed and kill the process. The overall issue is that the cached data is then not cleared away. The data is stored in the .Rproj.user folder and can be quite big (100s of MB) depending on what you had in memory. A lot of other files are located in there that are related to your user state (think the 10 Untitled files you just haven't saved yet, what scripts were open, what was in the Viewer). Most of the time for projects that are packages, I don't need this information so I delete the folder. Don't worry, it'll get regenerated when you open that project again.

What's the point

If you're doing some house cleaning for hard drive space, take a look at the .Rproj.user hidden folders and see how large they are. They shouldn't be much more than 1Mb, and that's even pretty big depending on how much code you have. Either way, hope it gives you some “free” space. I guess I could buy another MacBook but this one works perfectly well still.

Here's a simple script allowing you to see the overall size of the directory. There are some things I couldn't find using file.size or after recursively listing the files, so I just used du.

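# Find every .Rproj.user folder below the current directory
x = list.files(
  pattern = "[.]Rproj[.]user",
  all.files = TRUE,
  include.dirs = TRUE,
  recursive = TRUE,
  no.. = TRUE)
# Size of a directory as reported by du (in du's block units)
dir.size = function(path) {
  res = system(paste0("du ", shQuote(path)), intern = TRUE)
  ss = strsplit(res, "\t")
  ss = sapply(ss, function(x) as.numeric(x[1]))
  # du prints one cumulative line per subdirectory; the largest value
  # is the total for `path` (assumed here to be the value wanted)
  max(ss)
}
sizes = sapply(x, dir.size)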
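
If you'd rather not shell out to du, here is a rough base-R sketch of the same idea. The helper names dir_size_mb and rproj_user_sizes are made up for illustration, and since file.size missed some things for me, treat the numbers as a lower bound rather than an exact match for du.

```r
# Hypothetical helper: total size (in MB) of all files under `path`,
# using only base R. This may undercount relative to `du`.
dir_size_mb = function(path) {
  files = list.files(path, all.files = TRUE, recursive = TRUE,
                     full.names = TRUE, no.. = TRUE)
  sum(file.size(files), na.rm = TRUE) / 1024^2
}

# Hypothetical helper: find every .Rproj.user folder below `root`
# and return their sizes in MB, largest first.
rproj_user_sizes = function(root = ".") {
  dirs = list.files(root, pattern = "^[.]Rproj[.]user$",
                    all.files = TRUE, include.dirs = TRUE,
                    recursive = TRUE, full.names = TRUE, no.. = TRUE)
  dirs = dirs[dir.exists(dirs)]
  sizes = vapply(dirs, dir_size_mb, numeric(1))
  sort(sizes, decreasing = TRUE)
}
```

Calling rproj_user_sizes("~/projects") (the path is just an example) gives a named vector you can scan for the worst offenders before deleting anything.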

Tips for a Job Search (Academic Edition)

After going to a few interviews last cycle for assistant professor positions, I figured I should write on some of the points that I found relevant and general. Some of these were tips given to me, some of them are my own. All these represent my opinions and mine alone.

This will be at least a 2-part series, so I will have an update in the coming week or so.

Full disclosure, I did not receive a tenure-track position offer, so take these with a grain of salt. Most of the materials I had sent out are located on my GitHub and website. I found that most people ask previous applicants/students of your advisor/fellow graduate students for copies of their statements, but I feel like these should be more open and editable, so I published them online for our (and other) students.

My Packets/Materials

My CV is located here and my research/teaching statements and my cover letter for academia are located here.

Step 0: Academic or Industry

One of the first things you should think about or know is whether you are looking for academic or industry jobs. I chose to apply for academic positions as well as biotech/tech jobs. Not all my peers chose this, but I will tell you why I did:

  1. Academic and Industry turnaround time is different.
    1. Academic applications are due around November, but you should get them done and submitted in early or mid October, as November is somewhat late.
    2. Although you apply in November for these jobs, you have time as institutions will not get back to you until January through March.
  2. Both academic and industry offer good jobs. They have pros and cons (which I won’t list here), but they both afford a solid lifestyle and usually some variety at work.
  3. There were a lot of positions in academia open (2016).

Let’s say you want to apply for academia. The rest of this post will be discussing an academic job search and future posts will be on industry searches and also aspects of an interview.

Now that you’ve kept the door open for academia, you should know if you are looking more for a teaching gig or a research gig. They have vastly different responsibilities and soft/hard money ratios. No one has defined FTE (full-time equivalent) explicitly for me, but I have heard it range from 50-60 hours/week for an assistant professor.

From what I have seen, a “soft” money department requires 70-80% of your salary (FTE) to be covered by grants and the rest from the department (20-30%). These tend to be 12-month appointments. Many biostatistics departments, especially those in a school of public health, fall into this category.

A “hard” money department can range from 50-100% FTE covered by the school, which generally comes from teaching more courses. These are generally 9-month appointments, where the remaining 25% of the salary (in the summer) comes from grants. Statistics departments and biostatistics departments in schools of medicine can fall into these categories.

Each type of department has their own pros and cons and each department is different.


You should probably start applying to academic institutions around mid-October to early November. Mid-November is a little late in the game (although applications will likely not be reviewed for quite some time). A lot of planning for visits goes on, you want to be invited, and it’s hard for a place to invite you if they’ve already filled a lot of their visit slots with “yes” invitations.

Step 1: Your Packet: Academia

Let’s assume you are applying to academia. First thing you need to do is make your packet.

Here’s what you’ll need:

Curriculum Vitae.

This is the most important part. This is your “abstract” or your first impression on a committee most times. It must be updated, formatted well, and have all the relevant information. I remember a professor noting “Someone is going to take 2-3 minutes on a CV. They will go through 20-30 in a session. You need yours to be top-notch.”

Let’s look at the items.

  1. Name, website, email, phone number (optional but recommended). Any blog/social media (Twitter)/GitHub pages. You may want to include preferred communication (phone vs. email).
  2. Education: What school and department you are attending. Include advisor(s), expected graduation date and if you have defended (people will ask), areas of research, dates attended. I have seen people include GPAs and others have not.
  3. Previous education: Master’s/Bachelor’s – same as above but I have seen GPAs more commonly here.
  4. Relevant (Research) Experience – not cutting grass in 11th grade. Also, usually limit to the last 10 years unless you have done a lot before that. Put pointed deliverables in the text about what you did/how you added value/why it matters to the reader.
  5. Teaching Experience. Classes you’ve taught, TA-ed, helped create. Include the professor you worked with (people love talking about mutual acquaintances) and your role. Include short courses/tutorials/workshops you led, created, or participated in as a teacher.
  6. Published Publications. Generally descending/ascending by year. Make sure you highlight your name in the author order – people are looking for it. You may want to number them; mine are grouped by year but not numbered. These include in-press publications without the full citation. Make sure you update the citation when the article does get published.
  7. Submitted Publications – these are submitted or under review but not accepted yet. These give people an idea of what you are currently working on and how many projects you work on at a time, though it may be a bad measure of that.
  8. Talks/Presentations/Posters. Include all the talks you have given, including those at your own institution. Gave a talk at a conference? Include it. Working group? Include it. Journal club? Computing club? If it’s a presentation in front of others and the projector is on, include it. Some separate posters and talks, but I included them together. Include the conference name or event.
  9. Working Groups. Maybe you work with a group or are on a training grant, but you don’t have publications from it – still add it. People many times know your group.
  10. Honors/Awards. Win an award for a comprehensive exam or paper in your department? Put it down! ENAR poster award – umm yeah, that’s awesome to put down. If someone shook your hand and you got anything from a shout out in a room, certificate, to money, put it down.
  11. Software. Do you release software – make that clear! Put links, say what it does. This includes web (Shiny) applications or any type of apps. Was it in undergrad? So what – put it down. I have had discussions with faculty the entire time allotted just about a web application I did one random weekend not related to my main research.
  12. Skills – everyone knows Microsoft Word. Programming languages or other spoken/written languages. Write 1-2 scripts in Python? You’re a beginner, not an intermediate. Can you read someone else’s code and know what it does? You’re an intermediate at least (there are other criteria, it’s not set). Do you feel like this is your language and you can speak in it as well as your native verbal language? You’re an expert.
  13. Academic Service – I volunteer and I put that stuff down. Any academic job requires “service” (usually of a different sort), but showing you do service outside of reviewing papers, it fits in line with many university missions. Moreover, if you start a club or run a club in your department that is full blown academic service.
  14. Additional Experience – things that don’t fit above. Do a hackathon? Put it down here. Say the cool project you did and link to it.

There may be other sections for a CV, but those are mine, save for one. I have a “Research Interests” section at the top that says what I’m interested in/want to do research in. This may be good or bad depending on your view. It may become a reason to put you in the no pile before reading, but I think it’s useful.

Remember, academic search committees are looking for someone who can 1) do research, 2) teach, and 3) perform academic service, e.g. mentoring students, serving on thesis committees, serving on other committees (seminar committee, student recruitment, job search committee).

Research Statement

Depending on the position you are interviewing for, the teaching statement or research statement is likely to be the first thing read after your CV. That means you should spend a bit of time on it. Like grants, I hear the best way to write one is to get someone else’s who has been successful. Ask previous post docs, your advisor (though it may be dated), and previous students who have graduated for their statements.
I do not think I have been overwhelmingly successful in getting job offers, but I put my research and teaching statement on GitHub.

I have a few guidelines for what I would include in your research statement:

  1. Your philosophy on research.
  2. What you want to do in the next 5 years.
  3. Why institution X and position Y is the place to do it.

“The Professor is in” has some good points in this and this. Check it out.

Teaching Statement

In an academic tenure-track job (and most research-track), you will teach. Teaching can afford you discretionary money in research track and is expected in tenure-track as a portion of your salary generally comes from you teaching. If you haven’t taught a course, were you a graduate assistant or design something for high school students? Put that down. You should highlight any teaching awards you have had in the past and how they have helped you or what led you to receiving them.

Overall, you should have a philosophy for teaching (at least loosely). As a biostatistician who works with a lot of colleagues who are not statisticians or biostatisticians, I (like to) think that I have the skills to bring the material “down a level” into more understandable terms. I believe there should be transparency in grading and an up-front level of expectation on each side of the classroom. Although I don’t find myself to be the most organized while teaching, I feel that organization is an important factor, because without it the goals of the class can become out of reach or unclear. Anecdotes may be OK, but use them only when directly relevant.

It seemed to me that many research institutions “assume” you are a good teacher and they focus more on discussing your research. There are zero places that will say that consistently good teaching is not essential to their program and your success.

If you don’t have any experience teaching, you should 1) consider getting some and 2) consider again that your job is going to require you to do this. Also, although conferences and presentations are not exactly teaching, you can maybe pepper something in there about you feeling comfortable in front of a room of your (1-year-junior) peers.

Cover Letter

Not all places require a cover letter, but some do and it’s a nice touchpoint to start your packet. Some professors have told me they don’t read them (if they don’t require them) and others do.

I think it’s good to include:

  1. Be clear which position you are applying for (most places have multiple)
  2. Where you are graduating from.
  3. Why are you applying there?
  4. How are you qualified.

Letters of Reference/Recommendation

For the people you choose to write you letters of recommendation/reference (I’ll refer to as “letters”), there are no hard and fast guidelines. Except that your advisor should be one of them. This person is the person you (presumably) worked the closest with in the past 4-5 years and they know your strengths and weaknesses best. Moreover, they have likely sat on a committee to hire a new person like you and know how to present your strengths (truthfully) in the best light.

Overall, the goal is to ask people early. If a number of students in your department are graduating, they may have more requests than they can handle in a reasonable timeframe. Some may ask you which places you plan on applying to so that they can maybe make some specific remarks (or a call or 2). Others may ask to see your CV to talk a little bit more about specifics (or to remember exactly what you did).

I worked closely with a non-biostatistician collaborator and I applied to many departments where I’d be working with non-biostatistician collaborators, so I thought he was crucial for a letter. I also chose a previous advisor and professors with whom I did an extensive project. You should know who you’ve done work with. If not, check your defense committee again.

Most places you will need 3-4 letters, but have about 5 people you have asked as some places will ask for 5 and some will “allow up to 5”. Make sure you have a file of their full name, email, address, phone number, position, and relation to you (aka advisor/collaborator/etc.).

Step 2: Figure out where the jobs are

For Biostatistics and Statistics, there are some great places to look for jobs online:

If you have a place in mind, check out the website for the department. They will have it advertised. Does your membership organization have a magazine? It sounds dated, but a lot of universities still advertise there.

You can also email any of your previous colleagues to ask if their department is hiring. This should be a person you would feel comfortable emailing for other reasons.

Check Twitter and social media. Some departments have these and use them to disseminate information. Check them out.

Step 3: Where do you want to live for the next 6 years?

The number one question you should be able to answer is “Why do you want to work here”

You should have a solid answer for that question. Period. Everything else is ancillary to that point.

In many tenure-track positions, it’s 6 years to tenure. If you’re doing well, that is. Leaving a position after 3 years is reasonable, but may not reflect well on you and you will inevitably get asked “Why?”. Moreover, it may seem as though you hadn’t thought the position through thoroughly. While most of these judgments may be ridiculous because people move jobs for a multitude of reasons (such as partners/family/area/weather/…life), the thoughts will exist.

So ask yourself: “Would I be comfortable/happy living in this town for the next 6 years?” Yes, great. Geographic location and a type of living (city vs. suburb vs. rural) are real things in making your decision. It’s also something that goes into offering someone a job. If the applicant seems great on paper and the interview, but seems to hate the surrounding area or “could never see themself living there”, that may be a thing that puts the decision over to a “no”. You’re not a robot and you have preferences, remember that.

After that question is answered, you more importantly need to answer: “Would I be comfortable/happy working in this place for the next 6 years?” – that’s a bit harder to know, but if there is a “No” creeping around there for some reason, that’s not a great sign. That’s not a dealbreaker for not applying, but remember one thing: interviews are draining. You don’t want to put all your eggs in one basket, but you don’t want a big basket of slightly-cracked eggs. Eggs in this metaphor are your “best self” and cracked eggs are OK, but not so great.

Step 4: Filling out an Unholy amount of forms/Sending Emails

Applications are about dotting i’s and crossing t’s. They have some automation, but a lot of the entry is still very manual. You will have to write and copy and paste many documents over and over. Some will have optical character recognition (OCR) to determine information from your CV. If you have a “standard” CV, this will work. Otherwise, you’ll likely get a bunch of misformatted text you need to delete.

You will need a separate account for each university, as they do not share information across systems, even though most of them use Taleo as a backend. More are using LinkedIn as a resource, which may be a good reason to update your LinkedIn to look like your CV. Many of these systems have places for you to put information about your references, so remember to have that text file open with each reference’s information.

If the university you are applying to doesn’t have an automated system set up, you may have to send your packet to a search committee chair or an administrator who is listed on the posting. So you’ll email them and you’ll likely forget something, format something wrong, or forget to say what position you’re applying for, so you’ll get to answer a lot of emails.

Regardless, after the packet is signed off and in, you should (in like 3 weeks) send an email just confirming that everything is there. This is especially important if you don’t get confirmation when your letters of reference are submitted. Applications do fall through the cracks and emails do get overlooked. Do not trust any system in place and always double check your confirmation.


This is one post in hopefully a few on some of my (hopefully useful) insights on the process of applying and interviewing for academic and industry positions for a quant/data scientist/data analyst/research professor. Overall, there is a lot of prep you need to do (now it’s October 5). Some of it will be out of your hands (like letters of reference), which is why it’s so important to be ahead of schedule. Much of it is writing and revising, writing and revising, which you should be good at now. The one takehome message is:

Don’t sell yourself short. You just finished a long, grueling process which at times you probably thought you’d fail at. But you didn’t. Maybe not all the things you’ve done are glamorous or earth-shattering, but you did interesting things. You did things that mattered. Remember that, and make others see it and believe it.

Tips for First Year Comprehensive Exams

During our program, like most others, you have to take written comprehensive exams (“comps”) at the end of your first year of coursework. For many students it's a time of stress, which can be mitigated with some long-term planning. I wanted to make some suggestions on how to go about this for our (and other) PhD students.

Start the week after spring break

Again, comps are stressful. You can be tested on anything from the material (ideally) from your first year. Professors can throw in problems that seem from left field that you did not study or prep on. How can you learn or study all the material?

The way to make comps more manageable is to have a long-term studying trajectory. We have 2 weeks after the last exam to study and prep, and that is crunch time. In my opinion, that time should be spent working on the topics you're struggling with, annotating books for crucial theorems (if you're allowed them in the exam), and doing a bunch of problems. Those 2 weeks are not the time to cover everything from day one. That time comes before those 2 weeks.

The week after spring break (the week before this was published) is a good time to start your timeline. That gives you about 10 weeks to study and prep. You can start from the beginning of the year to the current time, or work backward. If nothing else in the first week, make a timeline of what topics or terms you will cover over what time frame. This will reduce stress so that it breaks the test into discrete chunks of time and discrete courses.

Get Past Exams

What's the best preparation for the comprehensive exam? A comprehensive exam. This may be a bit self-evident, but I know I had the feeling of not knowing where to start. Our department sends us the previous exams from the past 5-7 years. Some may not be equitable with respect to the difficulty or concepts covered, but I believe more questions are always better.

Vanderbilt has some great exams, as does the University of New Mexico, and Villanova. You can go to the reference textbooks (Billingsley, Chung, Casella & Berger, Probability with Martingales (Williams)) to try some problems from the chapters you covered as well.

Work from the back

My strategy is to map each exam (or 2) to a specific week. I worked on the older exams first and saved (i.e. did not look at) the ones from the previous 2 years until the 2 weeks before the test. I also would set out blocks of time (2-3 hours) to try an entire section of an exam, simulating the conditions for that portion of the test. I think these are helpful for gauging how well your studying is going.

Make a study group

How can you study or summarize all the material? Well, it's much easier if you have a team. You can also bounce ideas off each other. Moreover, the exams you have don't have an answer key, they are just the problems. It helps having others that can 1) check your work (swapping), 2) give you their solutions if you can't work out the problem, and 3) discuss different strategies for solving the problem.

We had a group separately for each section of the exam (probability, theory, methods). This separation helps because some students are retaking only parts of the exam and can help in some areas but don't want to be working on the sections they do not have to take. It also helps segment time studying so you don't focus only on one area while leaving another area (likely the one you don't like and are not the best at) neglected.

Delegate Study Areas

We separated different topics (letting people choose first) for each of the sections for that week. Of those not chosen, the rest needs to be assigned. The people/small team that was assigned to a topic needed to make concise (2-3 page) documents outlining the most important areas. They would also do a 5 minute presentation to the group about why these are the most important areas. That is the time to ask questions (and be prepared to get some when you present).

At the end of the school year, you have an organized study document. If you think your notes from the year are organized, you are likely mistaken. Even if you're highly organized (many people are not), there are usually too many superfluous details relevant to the course/homework/etc. and not the material. Split it up and let others weed through different areas while you focus on those you were assigned.

Drop the weight

If someone does not deliver on their delegated task, drop them. If there was an understanding that they would get double next time, fine. But if no discussion was had, they are out of the group. That person is not holding up his/her end of the bargain and is getting help for free while contributing nothing back. All students are busy, and accommodating that is fine, but it must be done before the session and at the time of delegation. Otherwise, that non-delivery will likely become a pattern and hurt the entire group. These are your friends and classmates, and it must be clear that any non-delivery is a direct negative to the group. No excuses excuse that.

Do as many problems as possible

Do problems. Do more. And then do some more. The exam is a set of problems. Knowing the material is essential, but the more comfortable you are with doing these difficult problems in a compressed time frame, the better you are. Many tests up until now may have been collaborative, take home, and shorter. Your comprehensive exam will be a bit different, so you have to prepare yourself. We're talking about practice; it's important (sorry AI).


Overall, the best way to perform well on the comprehensive exams is to learn the material as thoroughly as possible. Ideally, that is done during the course. Topics are forgotten, and areas are not always fully understood the first time around. Therefore, a methodical, long-term study plan should be made to tackle the year's worth of material. I think a team is the best format for discussion and delegation, but you MUST do work alone (doing the problems), as the team does not collaboratively take the test. If you follow your plan (while obviously learning the new concepts in class), then you should feel as prepared as you can be. Best of luck. I would like to leave you a quote/clip from the recent Bridge of Spies movie:
“Do you never worry?”
“Would it help?”

A Faster Scale Function

Problem Setup

In a recent question on LinkedIn’s R user group, a user asked “How to normalize by the row sums of the variable?”. Now first, we must define what we mean by “normalize” a matrix/data.frame.

One way to standardize/normalize a row is to subtract the minimum and divide by the range (max minus min) to put the data into the [0, 1] domain. Many times, however, users want to perform z-score normalization by doing:
(x – μ)/σ
where μ and σ are the mean and standard deviation of the row or column (typically the column).
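
To make the two meanings of “normalize” concrete, here is a minimal sketch on a toy vector. The vector x is made up for illustration; the first line is the usual min-max rescaling to [0, 1] and the second is the z-score.

```r
x = c(2, 4, 6, 8)

# Min-max normalization: smallest value maps to 0, largest to 1
minmax = (x - min(x)) / (max(x) - min(x))

# z-score normalization: result has mean 0 and standard deviation 1
z = (x - mean(x)) / sd(x)

range(minmax)
c(mean = mean(z), sd = sd(z))
```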

The answer on that forum eventually came down to using the scale command. The scale function is great. It is simple and has 2 options other than passing in the matrix:

  1. Center – should the columns have their mean subtracted off?
  2. Scale – should the columns be divided by their standard deviation (whether or not they were centered)?
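
Here is a small sketch of how those two options behave on a toy matrix (m is made up for illustration). One detail worth knowing from ?scale: when center = FALSE, scale divides by the root mean square of each column, sqrt(sum(x^2)/(n-1)), rather than by the standard deviation. As far as I can tell, the colScale function defined later in this post divides by the standard deviation in both cases, so the two would differ in that corner.

```r
m = matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

# Center only: subtract column means, leave the spread alone
centered_only = scale(m, center = TRUE, scale = FALSE)

# Default: center, then divide by the column standard deviations
z = scale(m)

# No centering: columns are divided by their root mean square
rms_scaled = scale(m, center = FALSE, scale = TRUE)
rms = sqrt(colSums(m^2) / (nrow(m) - 1))

attr(z, "scaled:scale")           # column standard deviations
attr(rms_scaled, "scaled:scale")  # matches rms, not the sds
```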

Overall, the function is fast, but it can always be faster.

The matrixStats package has some very quick operations for row/column operations and computing statistics along these margins.

Creating Some Data

Here we will create a random normal matrix with each column having a mean of 14 and a standard deviation of 5.

mat = matrix(rnorm(1e7, mean = 14, sd = 5),
    nrow = 1e5)

How fast is scale?

Let’s see how fast the scale function is:

    system.time({
        sx = scale(mat)
    })
   user  system elapsed
  0.971   0.285   1.262

That’s pretty fast! Overall, there is not much room for improvement in this case, but it may be relevant if you have a lot of matrices or ones bigger than the one defined here in mat.

Defining a new function

First, we must load in the matrixStats package; the only function we really are using is colSds.


library(matrixStats)

colScale = function(x,
    center = TRUE,
    scale = TRUE,
    add_attr = TRUE,
    rows = NULL,
    cols = NULL) {

    # Subset the matrix if rows and/or cols were passed
    if (!is.null(rows) && !is.null(cols)) {
        x <- x[rows, cols, drop = FALSE]
    } else if (!is.null(rows)) {
        x <- x[rows, , drop = FALSE]
    } else if (!is.null(cols)) {
        x <- x[, cols, drop = FALSE]
    }

    # Get the column means
    cm = colMeans(x, na.rm = TRUE)
    # Get the column sd
    if (scale) {
        csd = colSds(x, center = cm)
    } else {
        # just divide by 1 if not
        csd = rep(1, length = length(cm))
    }
    if (!center) {
        # just subtract 0
        cm = rep(0, length = length(cm))
    }
    # Transpose so the length-p vectors recycle along the columns
    x = t( (t(x) - cm) / csd )
    if (add_attr) {
        if (center) {
            attr(x, "scaled:center") <- cm
        }
        if (scale) {
            attr(x, "scaled:scale") <- csd
        }
    }
    return(x)
}
Let’s break down what the function is doing:

  1. The function takes in a matrix x with options:
    1. subsetting rows or cols
    2. center each column (by the mean) or not
    3. scale each column (by the standard deviation) or not
    4. Add the attributes of center/scale, so they match the scale output.
  2. The function subsets the matrix if options are passed.
  3. Column means are calculated
  4. Column standard deviations are calculated (using the column means) if scale = TRUE or simply set to 1 if scale = FALSE.
  5. If the data is not to be centered, the centers are set to 0.
  6. The data is transposed and the mean is subtracted, then the result is divided by the standard deviation. The data is transposed back.
    • The reason this is done is because R operates column-wise. Let p be the number of columns. The column means/sds are of length p. If one simply subtracted the column means, R would try to do this to each individual column. For instance, it would recycle the p numbers to get to length n (number of rows), and do that subtraction, which is not what we want.
  7. The attributes are added to the matrix to match scale output.
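The recycling issue in step 6 is easy to see on a tiny example. The sketch below uses a made-up 3 x 2 matrix (an assumption for illustration); sweep is shown only as a base R cross-check, not as what colScale uses:

```r
# Made-up matrix: columns are (1, 2, 3) and (10, 20, 30)
m = matrix(c(1, 2, 3,
             10, 20, 30), ncol = 2)
cm = colMeans(m)  # 2 and 20

# Naive subtraction: cm is recycled down the columns (2, 20, 2, 20, ...),
# so the wrong mean is subtracted from most entries
wrong = m - cm

# Transposing first makes the recycling line up with the original columns
right = t(t(m) - cm)

colMeans(wrong)  # not zero
colMeans(right)  # zero for both columns

# Base R cross-check: sweep subtracts cm along the column margin
all.equal(right, sweep(m, 2, cm))
```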

colScale timing

Now we can see how long the colScale command takes:

system.time({ csx = colScale(mat) })
   user  system elapsed
  0.231   0.039   0.271

This is a lot faster than the scale function. First and foremost, let us make sure that these give the same results:

all.equal(sx, csx)
[1] TRUE

Better benchmarking

OK, we found that we can speed up this operation, but maybe this was a one-off event. Let’s use the microbenchmark package to time each function over repeated runs:

mb = microbenchmark(colScale(mat), scale(mat), times = 20, unit = "s")
Unit: seconds
          expr       min        lq      mean    median        uq      max
 colScale(mat) 0.2738255 0.3426157 0.4682762 0.3770815 0.4872505 1.844507
    scale(mat) 1.2945400 1.5511671 1.9378106 1.9226087 2.2731682 2.601223
 neval cld
    20  a
    20   b

We can visualize the results using ggplot2 and some violin plots.

g = ggplot(data = mb, aes(y = time / 1e9, x = expr)) + geom_violin() + theme_grey(base_size = 20) + xlab("Method") + ylab("Time (seconds)")

plot of chunk gg

What about scaling rows!

If you note above, we did not standardize the matrix with respect to the rows, but rather the columns. We can perform this simply by transposing the matrix, running scale and then transposing the matrix back:

system.time({ scaled_row = t( scale(t(mat)) ) })
   user  system elapsed
  2.165   0.624   3.398
all(abs(rowMeans(scaled_row)) < 1e-15)
[1] TRUE

Again, we can do the same thing with colScale:

system.time({ colscaled_row = t( colScale(t(mat)) ) })
   user  system elapsed
  0.426   0.097   0.542
all(abs(rowMeans(colscaled_row)) < 1e-15)
[1] TRUE
all.equal(colscaled_row, scaled_row)
[1] TRUE

And we see that the results agree.

Creating rowScale

The above results are good for what we would like to do. We may want to define the rowScale function (as below), where we do not have to do the transposing and transposing back, as this may take some extra time.

Again, if we’re about improving speed, this may help.

rowScale = function(x,
    center = TRUE,
    scale = TRUE,
    add_attr = TRUE,
    rows = NULL,
    cols = NULL) {

    # Subset the matrix if rows and/or cols were given
    if (!is.null(rows) && !is.null(cols)) {
        x <- x[rows, cols, drop = FALSE]
    } else if (!is.null(rows)) {
        x <- x[rows, , drop = FALSE]
    } else if (!is.null(cols)) {
        x <- x[, cols, drop = FALSE]
    }

    # Get the row means
    cm = rowMeans(x, na.rm = TRUE)
    # Get the row sd
    if (scale) {
        csd = rowSds(x, center = cm)
    } else {
        # just divide by 1 if not
        csd = rep(1, length = length(cm))
    }
    if (!center) {
        # just subtract 0
        cm = rep(0, length = length(cm))
    }
    # No transpose needed: recycling down columns lines up with rows
    x = (x - cm) / csd
    if (add_attr) {
        if (center) {
            attr(x, "scaled:center") <- cm
        }
        if (scale) {
            attr(x, "scaled:scale") <- csd
        }
    }
    return(x)
}

Now let’s see how we do with rowScale:

system.time({ rowscaled_row = rowScale(mat) })
   user  system elapsed
  0.174   0.016   0.206
all(abs(rowMeans(rowscaled_row)) < 1e-15)
[1] TRUE
all.equal(rowscaled_row, scaled_row)
[1] TRUE

Let’s look at the times for this breakdown using microbenchmark:

mb_row = microbenchmark(t( colScale(t(mat)) ),
                        t( scale(t(mat)) ),
                        rowScale(mat),
                        times = 20, unit = "s")
Unit: seconds
                expr       min        lq      mean    median        uq
 t(colScale(t(mat))) 0.4009850 0.4653892 0.6303221 0.6659232 0.7422429
    t(scale(t(mat))) 1.7889625 2.0211590 2.4763732 2.1928348 2.6543272
       rowScale(mat) 0.1665216 0.1789968 0.2688652 0.2228373 0.3413327
       max neval cld
 0.9008130    20  a
 5.0518348    20   b
 0.5138103    20  a

and visualize the results:

g %+% mb_row

plot of chunk gg_row


Overall, normalizing a matrix using a z-score transformation can be very fast and efficient. The scale function is well suited for this purpose, but the matrixStats package allows for faster computation done in C. Note that scale behaves differently when the data are not centered, as the code below from base::scale.default shows:

f <- function(v) {
  v <- v[!is.na(v)]
  sqrt(sum(v^2)/max(1, length(v) - 1L))
}
scale <- apply(x, 2L, f)

If center = FALSE, the data will be divided by the root mean square of each column: the square root of the sum of squares divided by n-1. This may be the desired behavior, but the user may want to divide by the standard deviation instead, and colScale/rowScale can do that if necessary. I will talk to Henrik Bengtsson (matrixStats author/maintainer) about incorporating these into matrixStats for general use. But for now, you can use the above code.
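A small sketch can confirm this behavior of scale with center = FALSE. The matrix here is made-up random data (an assumption for illustration), and the sweep call is just one base R way to divide by the standard deviation without centering:

```r
set.seed(1)
# Made-up data with a mean far from 0, so the two quantities clearly differ
m = matrix(rnorm(20, mean = 14, sd = 5), ncol = 2)

# What scale() divides by when center = FALSE
rms = apply(m, 2, function(v) sqrt(sum(v^2) / (length(v) - 1)))
s_nocenter = scale(m, center = FALSE, scale = TRUE)
all.equal(attr(s_nocenter, "scaled:scale"), rms)

# The column standard deviations are a different (smaller) quantity here
csd = apply(m, 2, sd)
rms
csd

# Dividing by the sd without centering, using base R only
m_sd = sweep(m, 2, csd, "/")
apply(m_sd, 2, sd)  # 1 for each column
```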

How I build up a ggplot2 figure

Recently, Jeff Leek at Simply Statistics discussed why he does not use ggplot2. He notes “The bottom line is for production graphics, any system requires work.” and describes a default plot that needs some work:

ggplot(data = quakes, aes(x = lat,y = long,colour = stations)) + geom_point()

plot of chunk plot

To break down what is going on, here is what R interprets (more or less):

  1. Make a container for data ggplot.
  2. Use the quakes data.frame: data = quakes.
  3. Map three “aesthetics” (x, y, colour) with aes to the variables lat, long, and stations from the dataset, respectively.
  4. Add a layer of geometric things, in this case points (geom_point).

Implicitly, ggplot2 notes that all 3 aesthetics are continuous, so maps them onto the plot using a “continuous” scale (color bar). If stations were a factor or character column, the plot would not have a color bar but a “discrete” scale.
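As a rough sketch of the discrete case, the code below bins stations into a factor and maps it to colour. The column name station_group and the cutoff of 50 are hypothetical choices made purely for illustration; they are not from Jeff's post:

```r
library(ggplot2)

# Work on a copy so the built-in quakes data is left alone
q2 = quakes

# Hypothetical binning of stations into two groups (cutoff is arbitrary)
q2$station_group = factor(q2$stations > 50,
                          labels = c("50 or fewer", "more than 50"))

# Because station_group is a factor, ggplot2 should draw a discrete legend
# with one key per level rather than a continuous color bar
p_discrete = ggplot(data = q2,
                    aes(x = lat, y = long, colour = station_group)) +
  geom_point()
p_discrete
```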

Now, Jeff goes on to describe elements he believes are required to make this plot “production ready”:

  1. make the axes bigger
  2. make the labels bigger
  3. make the labels be full names (latitude and longitude, ideally with units when variables need them)
  4. make the legend title be number of stations reporting

As such, I wanted to go through each step and show how you can do each of these operations.

Make the Axes/Labels Bigger

First off, let’s assign this plot to an object, called g:

g = ggplot(data = quakes,
           aes(x = lat,y = long,colour = stations)) +
  geom_point()

Now, you can simply call print(g) to show the plot, but the assignment will not do that by default. If you simply call g, it will print/show the object (as other R objects do), and plot the graph.

Theme – get to know it

One of the most useful ggplot2 functions is theme. Read the documentation (?theme). There is a slew of options, but we will use a few of them for this and expand on them in the next sections.

Setting a global text size

We can use the text argument to change ALL the text sizes to a value. Now this is where users who have never used ggplot2 may be a bit confused. The text argument (input) in the theme command requires that text be an object of class element_text. If you look at the theme help it says “all text elements (element_text)”. This means you can’t just say text = 5, you must specify text = element_text().

As text can have multiple properties (size, color, etc.), element_text can take multiple arguments for these properties. One of these arguments is size:

g + theme(text = element_text(size = 20))

plot of chunk bigger_axis

Again, note that the text argument/property of theme changes all the text sizes. Let’s say we want to set the axis tick text (axis.text), legend header/title (legend.title), legend key text (legend.text), and axis label text (axis.title) separately, each with its own size:

gbig = g + theme(axis.text = element_text(size = 18),
                 axis.title = element_text(size = 20),
                 legend.text = element_text(size = 15),
                 legend.title = element_text(size = 15))

plot of chunk bigger_axis2

Now, we still have the plot g stored, but we make a new version of the graph, called gbig.

Make the Labels to be full names

To change the x or y labels, you can just use the xlab/ylab functions:

gbig = gbig + xlab("Latitude") + ylab("Longitude")

plot of chunk lab_full

We want to keep these labels, so we overwrote gbig.

Maybe add a title

Now, one may assume there is a main() function from ggplot2 to give the title of the graph, but that function is ggtitle(). Note, there is a title command in base R, so this was not overwritten. It can be used by just adding this layer:

gbig + ggtitle("Spatial Distribution of Stations")

plot of chunk title

Note, the title is smaller than the specified axes label sizes by default. Again if we wanted to make that title bigger, we can change that using theme:

gbig +
  ggtitle("Spatial Distribution of Stations") +
  theme(title = element_text(size = 30))

plot of chunk big_title

I will not reassign this to a new graph as in some figures for publications, you make the title in the figure legend and not the graph itself.

Making a better legend

Now let’s change the header/title of the legend to be number of stations. We can do this using the guides function:

gbigleg_orig = gbig + guides(colour = guide_colorbar(title = "Number of Stations Reporting"))

plot of chunk leg

Here, guides takes arguments that are the same as the aesthetics from before in aes. Also note, that color and colour are aliased so that you can spell it either way you want.

I like the size of the title, but I don’t like how wide it is. We can put line breaks in there as well:

gbigleg = gbig + guides(colour = guide_colorbar(title = "Number\nof\nStations\nReporting"))

plot of chunk leg2

Ugh, let’s also adjust the horizontal justification, so the title is centered:

gbigleg = gbigleg +
  guides(colour = guide_colorbar(title = "Number\nof\nStations\nReporting",
                                 title.hjust = 0.5))

plot of chunk leg_adjust

That looks better for the legend, but we still have a lot of wasted space.

Legend IN the plot

One of the things I believe is that the legend should be inside the plot. In order to do this, we can use the legend.position from the themes:

gbigleg +
  theme(legend.position = c(0.3, 0.35))

plot of chunk leg_inside

Now, there can be a few problems here:

  1. There may not be enough space to put the legend
  2. The legend may mask out points/data

For problem 1, we can make the y-axis bigger, make the legend smaller, or do a combination of both. In this case, we do not have to change the axes, but you can use ylim to change the y-axis limits:

gbigleg +
  theme(legend.position = c(0.3, 0.35)) +
  ylim(c(160, max(quakes$long)))

plot of chunk change_ylim

I try not to do this, as it adds area with no data information. We have enough space, but let’s make the legend “transparent” so we can at least see if any points are masked out and make the legend look like a more integrated part of the plot.

Making a transparent legend

I have a helper “function” transparent_legend that will make the box around the legend (legend.background) transparent and the boxes around the keys (legend.key) transparent as well. Like text before, we have to specify boxes/backgrounds as an element type, but these are rectangles (element_rect) compared to text (element_text).

transparent_legend =  theme(
  legend.background = element_rect(fill = "transparent"),
  legend.key = element_rect(fill = "transparent",
                            color = "transparent")
)

One nice thing is that we can save this as an object and simply “add” it to any plot we want a transparent legend. Let’s add this to what we had and see the result:

gtrans_leg = gbigleg +
  theme(legend.position = c(0.3, 0.35)) +
  transparent_legend

plot of chunk leg_inside2

Moving the title of the legend

Now, everything in gtrans_leg looks acceptable (to me) except for the legend title. We can move the title of the legend to the left hand side:

gtrans_leg + guides(colour = guide_colorbar(title.position = "left"))

plot of chunk leg_left

Damnit! Note, that if you respecify the guides, you must make sure you do it all in one shot (easiest way):

gtrans_leg + guides(
  colour = guide_colorbar(title = "Number\nof\nStations\nReporting",
                          title.hjust = 0.5,
                          title.position = "left"))

plot of chunk leg_left_correct

A little more advanced

The last statement is not entirely true, as we could dig into the ggplot2 object and assign a different title.position property to the object after the fact.

gtrans_leg$guides$colour$title.position = "left"

plot of chunk respec

“I don’t like that theme”

Many times, I have heard people who like the grammar of ggplot2 but not the specified theme that is default. The ggthemes package has some good extensions of theme from ggplot2, but there are also a bunch of themes included in ggplot2, which should be specified before changing specific elements of theme as done above:

g + theme_bw()

plot of chunk themes

g + theme_dark()

plot of chunk themes

g + theme_minimal()

plot of chunk themes

g + theme_classic()

plot of chunk themes


I agree that ggplot2 can deceive new users by making graphs that look “good”-ish. This may be a detriment as they may believe they are good enough, when they truly need to be changed. The changes are available in base or ggplot2 and the overall goal was to show how the recommendations can be achieved using ggplot2 commands.

Below, I discuss some other aspects of the post, where you can use ggplot2 to make quick-ish exploratory plots. I believe, however, that ggplot2 is not the fastest for quick basic exploratory plots. Where it is better than base graphics is in making slightly more complex exploratory plots that are necessary for analysis, where base can take more code.

How to make quick exploratory plots

I agree with Jeff that exploratory plots should be made quickly, and that a broad range of plots should be possible with minimal code.

Now, I agree that plot is a great function. I do believe that you can create many quick plots using ggplot2, and that it can be faster than base in some instances. A specific case would be that you have a binary y variable and multiple continuous x variables. Let’s say I want to plot jittered points, a fit from a binomial glm (logistic regression), and one from a loess.

Here we will use mtcars and say if the car is automatic or manual (am variable) is our outcome.

g = ggplot(aes(y = am), data = mtcars) +
  geom_point(position = position_jitter(height = 0.2)) +
  geom_smooth(method = "glm",
              method.args = list(family = "binomial"), se = FALSE) +
  geom_smooth(method = "loess", se = FALSE, col = "red")

Then we can simply add the x variables as aesthetics to look at each of these:

g + aes(x = mpg)

plot of chunk unnamed-chunk-2

g + aes(x = drat)

plot of chunk unnamed-chunk-2

g + aes(x = qsec)

plot of chunk unnamed-chunk-2

Yes, you can create a function to do the operations above in base, but that’s 2 sides of the same coin: function versus ggplot2 object.
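For comparison, here is a rough sketch of what such a base function might look like. The name plot_binary and its arguments are hypothetical, and the details (jitter amount, colors, default loess settings) are assumptions meant to roughly mirror the ggplot2 layers above rather than reproduce them exactly:

```r
# Hypothetical helper: jittered points, a logistic fit (blue), and a
# loess fit (red) for a binary outcome y against one continuous x
plot_binary = function(x, y, xlab = "x") {
    # Sort by x so the fitted lines are drawn left to right
    ord = order(x)
    x = x[ord]
    y = y[ord]
    # Jitter only the plotted outcome, not the data used for fitting
    plot(x, jitter(y, amount = 0.2), xlab = xlab, ylab = "Outcome")
    glm_fit = glm(y ~ x, family = binomial)
    lines(x, fitted(glm_fit), col = "blue")
    lo_fit = loess(y ~ x)
    lines(x, fitted(lo_fit), col = "red")
    invisible(glm_fit)
}

# One plot per x variable, as with g + aes(x = ...) above
for (v in c("mpg", "drat", "qsec")) {
    plot_binary(mtcars[[v]], mtcars$am, xlab = v)
}
```

Either way, the setup is written once and reused; the ggplot2 version stores it in an object, the base version in a function.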


All code is located on my GitHub for my blog

R CMD INSTALL with symlink to R

Problems with R CMD INSTALL

I was trying to install a package (ANTsR) by running R CMD INSTALL as normal on the cloned github repository on a particular cluster. I kept getting errors and could not understand why for the life of me. Note, I have not used this cluster much and wasn't sure how it was configured.

I was pretty sure that this was a configuration problem on the cluster because I had installed this repo on:

  1. My system
  2. A shiny server
  3. Another cluster

Finding the Error

The build was using cmake, so I figured it was some flag. Oddly enough, I was getting the error (I put . where there are hard paths not relevant to you):

-- Check for working C compiler: ./cc -- broken
CMake Error at . (message):
  The C compiler "." is not able to compile a simple test program.

Whoa. Now either the C compiler is broken (unlikely) or the configuration has an error/bug (much more likely). I didn't write the config for this package (it's pretty complex), so I felt a bit at a loss.

Let's look at the error

Well, the error did point me to the log for the error, CMakeFiles/CMakeError.log, so let's go there. Looking through CMakeFiles/CMakeError.log, I found the following line, which seemed to be where things were going wrong:

Build flags: ;WARNING:;ignoring;environment;value;of;R_HOME

Hmm. I see that these are words, not really building flags. They also seem like R code. I don't know how they got there, so I did some Googling.

I got to a page of someone having a similar issue: '“ignoring environment value of R_HOME” error when installing packages'. That sounds like my problem. OK their admin reset R_HOME and everything is great. Good for him, not much help for me.

I found a bug report for R which discusses this, but there didn't seem to be any resolution.

Finding a Solution

I went back to the simple R warning “WARNING: ignoring environment value of R_HOME” and found an old StackOverflow Post about it.

Now, one thing they discussed was

unset R_HOME

I tried that, nothing worked.

OK, well what about RHOME, let's unset that too:

unset RHOME

Error, fail. Moreover, these variables were never set anyway. Oh! I realized if we reversed the error, that R_HOME was set incorrectly, then let's just set it before R CMD INSTALL and then it shouldn't error:


That's not to say the package will install without a hitch, but this part of the build seems to be fixed. (Note: I had to clean out the errors to rerun).

Why did this happen?

I believe most of this happened because of the configuration of R on the cluster and the linking of the R folder (try which R) to the true home for R (try R RHOME). I don't know where (if anywhere) in the setup/.bashrc/.bash_profile scripts R_HOME is set, but it seems that this discrepancy caused a problem.
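If you want to look at the same pieces from inside R, the sketch below collects them in one place. It only inspects the current session; it is not the fix described above, and the R_HOME value seen inside a running session is whatever R itself settled on at startup, which may differ from what your shell had set:

```r
# Where this R session believes its home directory is
r_home = R.home()

# The R_HOME environment variable as seen by this session
env_home = Sys.getenv("R_HOME")

# The R executable found on the PATH ("" if none is found)
r_on_path = unname(Sys.which("R"))

# Resolve symlinks, if an executable was found
r_resolved = if (nzchar(r_on_path)) normalizePath(r_on_path) else NA_character_

c(R.home = r_home, R_HOME = env_home,
  which_R = r_on_path, resolved = r_resolved)
```

If which_R and resolved differ, the R on your PATH is a symlink, which is the kind of setup suspected here.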

Dealing with Imposter Syndrome in Graduate School

In my post of recommendations for first-year students, I discussed some tips and viewpoints to help with the practical, pragmatic aspects of being a first-year student. In this post, I'd like to discuss the common misconceptions/viewpoints that are destructive to new students.

The Dunning-Kruger effect

I know something, so everyone else is dumb

You just learned about p-values and their problems. OMG someone over there uses them? They are so dumb and don't understand anything. Why can't everyone be as smart as you? Why can't people just “get it”? Have you ever felt this or known someone who sounds like this?

Let me introduce the Dunning-Kruger effect. In short, it describes that the unskilled are unable to:

recognize their own ineptitude and evaluate their own ability accurately.

Therefore, you learn something new about a field (e.g. statistics), and you feel pretty confident when talking to others. I'm not saying you're unskilled, but you may not know what's common in practice or the merits/pitfalls of a method or other methods. Many times new students will learn one thing, usually not that in-depth, and incorrectly think they've mastered that area. Moreover, they usually cite the same piece of information over and over, as they have few pieces of information to draw upon. This sometimes happens with newer students, but fades relatively quickly.

Everyone else is a genius

The equally important converse to the Dunning-Kruger effect is that:

highly skilled individuals may underestimate their relative competence, erroneously assuming that tasks that are easy for them also are easy for others.

After the first feeling fades, a student realizes how out of depth they were. This is the more damaging effect because then you start to feel like…

You are an Imposter; So am I

You are not good enough for your program. Everyone is better prepared and smarter than you. You don't deserve to be here. You're stupid. You are never going to get it. You should quit. Everyone is going to find you out. Then they'll make you leave your program.

If you read all of these statements and some rang true, let me introduce you to Imposter Syndrome. Most students feel this way their first year. I felt this way. Many of my classmates felt this way. Some of us may still feel this way.

Why? Many new students have done previous programs with relative ease. They have been, or at least felt, like the smartest person in the room before. Now, in this new and highly-selective program, you are not the smartest. You may not even be close to the smartest.

Aside: when I say “smart”, I mean whatever criteria you're using for self-worth with respect to intelligence. I think work gets done by banging your head against the wall until something comes out. Being talented is helpful, but hard work gets results. But talent before may have been enough.

Also, if you have tied your superior intelligence to your identity, you've now lost it. If people are smarter than you, then who are you compared to them? How can you combat this feeling? Spoiler: stop comparing to the wrong people.

Comparing Yourself to the Correct Distribution

Many times, people forget what got them there. They worked hard (even relatively) to do the prerequisites, fill out the forms, do the undergraduate research, and make the move to the new department. Much of that hard work is overlooked by new (and even some more senior) students. They are much like Ricky Bobby: “If you're not first, you're last”. But that's ridiculous: you can be “second, third, fourth, hell, even fifth”.

The wrong distribution

I'm not saying to not strive for being the best. Strive for that, but compare yourselves to the right distribution. Many students compare themselves like this:

plot of chunk unnamed-chunk-1

As seen from this, you are at the low end of the distribution. You are likely not the best first year, feel miles behind a 5th-year student, and a lifetime behind a faculty member. How can you ever become like these people?

There's this saying “if you see it you can be it”. Conversely, if you can't see it, then you don't think you will be it. This is important because if you want to be a faculty member, you must realize every year you are getting one step closer to that role. You have to see yourself in that role.

But right now, you feel like you don't know anything about your field. But this is the WRONG DISTRIBUTION for comparison.

The correct distribution

As said above in the plot title, this is a conditional distribution of knowledge in your field. This distribution compares you as a brand-new first-year student to those who have worked in the field for an entire undergrad degree (the top first-year students), at least 5 years of work and research (5th-year student), or for > 10 years/a lifetime of work (most faculty). Of course you're going to feel inadequate. By construction, you are near the lowest part of the distribution. You're setting yourself up to be the worst.

You should compare yourself to the full distribution:

plot of chunk correct_distribution

That's more like it! This is a more representative comparison of your skills. The average graduate student likely knows a bit more about your field than everyone else, but notice where YOU are in this distribution. Now yes, the faculty is still out there, but it's relatively closer. With each year, you get closer to that upper tail of the distribution. Also, most people keep concentrating on the right-hand tail and forget that the majority of the distribution is to their left. You know things about your field, more than the majority of people.

At the end of the day, you need to compare yourself to the full distribution, not the conditional one. You may not be the smartest/best when you start, and that's to be expected. You can't know everything overnight. The most important message is to not get discouraged when you first start. Things are confusing and hard, but they get better. Just keep going.

And remember one thing about this whole mental exercise of comparison:

It's not to make you feel better than others. It's to make you feel adequate about yourself and your skills.

Making others feel less worthy or making yourself feel as though you're “better” than another will inevitably cause this same crisis of self when that identity is challenged. Stop doing that. Just do you.

Adopt a different mindset

Many of the issues above are discussed in the book Mindset: The New Psychology of Success. I have some copies of the book if you'd like to read it. I highly recommend it.

The long and short of the book's message is that overcoming ideas such as imposter syndrome comes out of adopting a new mindset. Here are some examples of how the mindsets differ: link 1, link 2 from here, and link 3

In a fixed mindset, someone believes that talents are fixed and unchangeable. Either you are smart or you're not. If you aren't smart, then sucks for you because you can't change it. The other, and recommended viewpoint of the author, is the growth mindset, where one believes:

their qualities as things that can be developed through their dedication and effort.

You're not good at one area? Well try hard and change it. Stop worrying you're not good enough and get to work.

Andrew Gelman had an interesting post recently about the book's replicability, so I suggest reading over there for some details. If the book helps you cognitively break down some walls I think that's great, but I always would like to see the evidence. If you want to track your progress post reading, I'd love to hear it!

The battle is not only with new students

Let me be clear that this is not an issue for only new students. Although older students get better at determining their position in the distribution, they then fall to the same issue comparing themselves to faculty in the tails.

I hear students claim many times that they're not ready to become professors. I've spoken to many tenured professors and a lot of them say that a post-doc position is a great gig (at least in biostats). The arguments are that you get the freedom that an assistant faculty member has, but not as many of the responsibilities. I agree with this sentiment, and think those are great reasons to do a post-doc.

But I feel as though some students do not think they can be a faculty member because of their CV and number of publications. Not wanting to advise students, not ready to build a class, not ready to write grants, not sure about a new city for long-term – these are great reasons to not want to become a faculty member. Not having “enough” publications is not necessarily a good one. Yes, some places may not invite you to speak based on your CV. Some will invite you but not give you an offer. If there is a penalty for trying to get an interview, not getting it, doing a post-doc, and then re-applying in 2 years, then that system is broken. The only penalty should be that there are no positions and the timing was no longer right. (Assuming you didn't do something off the rails the first interview).

Compare yourself to the correct distribution

One of the reasons I feel that people say they do not have “enough” publications is that they are either 1) comparing themselves to people other than assistant professors, or 2) comparing themselves to assistant professors as they are NOW rather than as they were when they started.

Here's an exercise I like to go through. Go to a website of a place you want to apply to, find some assistant professors. Go to their CVs and look at the year they graduated with their PhD. Go to their publications and look at those that came out in that year + 1. That's them when they graduated with their PhD, including published, in-press, and current work. If you are greatly behind them, then yes, you may not have compared well against them. But remember, that's them compared to you on the same playing field. And remember, they were like you and they got the job. So give it a shot if that's what you want, but don't let your incorrect comparison cripple you with feeling inadequate. That's not what you do, that's some imposter.