Random Forests and Such

I have found that Leo Breiman’s page on Random Forests(tm) is highly useful.


I also noticed they trademarked RF(tm), RandomForests(tm), RandomForest(tm) and Random Forest(tm). I think it’s pretty impressive that you can trademark just RF(tm). Now, if people use these methods and don’t cite them, can the trademark holders sue for trademark infringement and force a citation?

Machine learners beware?

The Pitfalls of Electronic Voting…



One Bender Bending Rodríguez was elected to the 2010 school board in Washington DC. A team of hackers from the University of Michigan got Bender elected as a write-in candidate who stole every vote from the real candidates. Bender, of course, is a cartoon character from the TV series Futurama. -aj

Table 1 Sample Size

One of my pet peeves when looking at “Table 1” of any paper or analysis is when people report percentages without N’s, and similarly means and standard deviations without N’s. (By Table 1 I mean the common table of demographics by treatment group.) The basic reason to use a percent rather than a count is taught early on: counts are not comparable if the group sizes are very different. Still, reporting only a percent is problematic when the data have missing values.

Patterns of missingness are usually informative, if only because the number of non-missing observations tells you how variable the estimated percent is. If a table gives just percents or means without the sample size, you cannot know how many observations these estimates were based on. Even if these means are the most accurate you can get with your current data, the N helps the reader assess the missingness, and thus, to some extent, the potential bias in these estimates.
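To make the variability point concrete, here is a minimal sketch in Python (standard library only) using the usual binomial approximation for the standard error of a proportion, sqrt(p(1-p)/n). The function name `percent_se` and the sample sizes are made up for illustration; they do not come from any particular paper.

```python
import math

def percent_se(p, n):
    """Approximate standard error, in percentage points, of an observed
    proportion p based on n non-missing observations (binomial formula)."""
    return 100 * math.sqrt(p * (1 - p) / n)

# The same reported 40% is far less certain when few observations back it.
for n in (10, 100, 1000):
    print(f"40% with N={n}: SE is about {percent_se(0.40, n):.1f} percentage points")
```

Without the N, a reader has no way to tell which of these situations a reported 40% corresponds to.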

Here are 2 examples of what I’m saying: 

http://goo.gl/Sk56x – does not have N’s, so there is no indication of missingness. (Found via Google Images; it is from “Comorbid panic disorder and major depression: Implications for cognitive–behavioral therapy,” McLean et al. 1998.) Therefore, unless the paper says they excluded participants with missing values (honestly, I have not read it), we don’t know what sample size each variable’s estimate is based on. My apologies if they all have the same N and there is no missingness, but a note in the caption saying so would be useful. This was just used as an example.

So including an N column may be a little more cumbersome, but could be highly useful. Another method would be to put (N=xxx) next to each variable. Although that does not break it down by treatment/disorder, it at least gives the reader an idea of how much missingness there is in each variable.
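As a rough sketch of the (N=xxx) idea, the Python below prints each variable's mean alongside the number of non-missing observations behind it. The records, variable names, and helper functions (`summarize`, `table1_rows`) are all hypothetical, and `None` stands in for a missing value.

```python
# Hypothetical data; None marks a missing value.
records = [
    {"female": 1, "age": 34},
    {"female": 0, "age": None},
    {"female": None, "age": 51},
    {"female": 1, "age": 42},
    {"female": 1, "age": None},
    {"female": 0, "age": 29},
]

def summarize(records, variable):
    """Return (n_nonmissing, mean) for one variable, ignoring missing values."""
    values = [r[variable] for r in records if r[variable] is not None]
    n = len(values)
    mean = sum(values) / n if n else float("nan")
    return n, mean

def table1_rows(records, variables):
    """Build one text row per variable, with the non-missing N shown inline."""
    rows = []
    for var in variables:
        n, mean = summarize(records, var)
        rows.append(f"{var} (N={n} of {len(records)}): {mean:.2f}")
    return rows

for row in table1_rows(records, ["female", "age"]):
    print(row)
```

With these made-up records, the age row is based on only 4 of the 6 observations, which is exactly the information a bare mean would hide. The same loop could be run within each treatment group to get a per-group N column.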

I’m almost convinced now that I’ve done this multiple times before. — JM