When a scientific article is published, ideally the data behind the reported results should be made available. Anyone should be able to scrutinize the basis of scientific claims.
While this has long been the ideal, it has rarely been practiced. But this has been changing, and momentum is building to actually require the posting of data in circumstances where it’s feasible. See the list below for links regarding the movement of various scientific fields, science funders, and repositories towards requiring and enabling data sharing.
The PLoS multidisciplinary journals, including PLoS ONE, are considering nudging authors towards sharing by requiring a “data availability statement”. Thanks to the thoughtful people on FriendFeed who have commented on the idea.
Objections have been raised to this proposal. I don’t think any of them are show-stoppers, but some make valid points. Here’s one: requiring the posting of raw data would not prevent fraud, because the evil-doers would simply start manipulating or fabricating their raw data. I have two thoughts about that. First, fabricating raw data is often a much bigger task for an author than fudging a few data points on a plot or nudging a p-value to look better than it was. To fabricate an entire dataset, one has to self-consciously think like a criminal for an extended period, realizing all along that what one is doing is wrong. Out-and-out cheaters will certainly do this, but many will stop short of fabricating the raw data. It is not a behavior that can be rationalized as some kind of shortcut, or as ‘making up for how the instrument wasn’t working well that day’, or with the other excuses people give for fudging. And I think most current problems with reported results fall into these latter categories.
Second, despite the seeming ease of fabricating raw data, humans who do it often leave tell-tale signs that the data were tampered with. See Benford’s Law for one example. I know, I know, perhaps only the stupid scientists would fail to randomize their fake data properly, but nevertheless many frauds are detected through these kinds of mistakes.
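To make the Benford’s Law point concrete, here is a minimal Python sketch of a first-digit check. Benford’s Law says that in many naturally occurring datasets spanning several orders of magnitude, the leading digit d turns up with probability log10(1 + 1/d), so 1 leads about 30% of the time and 9 less than 5%. The function names below are my own, and the chi-square comparison is just one simple way to score the fit, not the procedure any journal or investigator necessarily uses. A large score is a reason to look closer, not proof of fraud, and the law does not apply to every kind of data.

```python
import math
from collections import Counter
from decimal import Decimal

# Standard chi-square critical value for 8 degrees of freedom at the 5% level.
CHI2_CRITICAL_DF8_P05 = 15.507


def leading_digit(x):
    """Return the first significant digit (1-9) of a finite, nonzero number."""
    # Going through the decimal text form avoids floating-point rounding surprises.
    digits = Decimal(str(abs(x))).as_tuple().digits
    return next(d for d in digits if d != 0)


def benford_expected():
    """Expected leading-digit probabilities under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def benford_chi_square(values):
    """Chi-square statistic comparing observed leading digits to Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]  # zeros have no leading digit
    n = len(digits)
    counts = Counter(digits)
    return sum(
        (counts.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )


if __name__ == "__main__":
    # Illustrative inputs only (not real research data).
    powers_of_two = [2 ** k for k in range(1, 1001)]   # known to track Benford closely
    evenly_spread = list(range(100, 1000))             # every leading digit equally common
    for name, data in [("powers of two", powers_of_two), ("evenly spread", evenly_spread)]:
        score = benford_chi_square(data)
        verdict = "worth a closer look" if score > CHI2_CRITICAL_DF8_P05 else "consistent with Benford"
        print(f"{name}: chi-square = {score:.2f} ({verdict})")
```

The evenly spread numbers stand in for the kind of data a person might invent when trying to look “random”: each leading digit appears equally often, which is exactly what Benford’s Law says real data of this sort tends not to do.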
Below are some examples of the growing movement towards more sharing of raw data. This is an exciting moment for science!
- A Trends in Cognitive Sciences article by neuroscience bigwigs explains why we must share more of the data
- Proteomics journals decide to mandate data sharing
- New NSF requirement to explain how grantees will share their data
- BioMed Central moves towards requiring that authors make their data available
- The Panton Principles for open data
- Dryad is a repository of data, initially focusing on evolution, ecology, and related fields
- Climategate scientists were cleared but “had not shown sufficient openness” regarding the data