So I was delighted to see a post from ProPublica (http://www.propublica.org/nerds/item/doc-dollars-guides-collecting-the-data) about scraping data, in which they use Google Refine: https://code.google.com/p/google-refine/
I had heard about Google Refine a while ago, but since I feel more comfortable scripting, and a script serves as a roadmap of what I've done in an analysis or data cleaning, I chose to do that instead. Still, I wonder: why don't researchers, who are so often stuck in Excel, use this as their main data-cleaning tool?
The features are at times more powerful than Excel's, and data can be re-exported from Google Refine so that you can make plots, etc. It can also be integrated with Google Docs. And your data does not go to the cloud unless you want it to: everything runs on your machine. In the spirit of reproducibility, when you create a project and export it, it saves your "history," essentially an ordered list of everything you did (any of which you can undo at any time). I'm not saying this is "reproducible" in the full sense people discuss nowadays, but it seems a much better solution than changing Excel sheets with no history, or keeping multiple sheets, or multiple (and usually poorly annotated) versions.
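To make the idea concrete, here is a minimal sketch in plain Python of what such a replayable "history" amounts to. This is not Google Refine's actual project format (it stores its own JSON history internally); the class and step names here are hypothetical, just to illustrate why a recorded list of operations beats an opaque, hand-edited spreadsheet:

```python
# Sketch of the "history of operations" idea: each cleaning step is
# recorded so the full sequence can be replayed from the raw data,
# or truncated to "undo" later steps. Hypothetical code, NOT Refine's API.

def trim_whitespace(rows, column):
    return [{**r, column: r[column].strip()} for r in rows]

def uppercase(rows, column):
    return [{**r, column: r[column].upper()} for r in rows]

class CleaningHistory:
    """Apply cleaning steps while keeping an ordered, replayable log."""

    def __init__(self, rows):
        self.original = rows       # raw data is never modified
        self.history = []          # list of (description, function) pairs

    def apply(self, description, step):
        self.history.append((description, step))

    def result(self, upto=None):
        """Replay the first `upto` steps (all by default) from the raw data."""
        rows = self.original
        for _, step in self.history[:upto]:
            rows = step(rows)
        return rows

raw = [{"name": "  alice "}, {"name": "bob  "}]
h = CleaningHistory(raw)
h.apply("trim whitespace in 'name'", lambda rows: trim_whitespace(rows, "name"))
h.apply("uppercase 'name'", lambda rows: uppercase(rows, "name"))

print(h.result())        # → [{'name': 'ALICE'}, {'name': 'BOB'}]
print(h.result(upto=1))  # "undo" the last step: [{'name': 'alice'}, {'name': 'bob'}]
```

Because the raw data is kept untouched and every step is logged with a description, any intermediate state can be regenerated, which is exactly the property a pile of renamed Excel files lacks.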
So why don't they use it? A new tool to learn (or fear)? Or do they simply not know about it?
Here again is a good description of how it is used: http://www.propublica.org/nerds/item/using-google-refine-for-data-cleaning