Data analysis with Python, SciPy and R April 5, 2009
Posted by alexholcombe in probability and statistics, programming, psychology, science. Tags: open source
I’ve transitioned to all open-source software for my science. The Python language and its libraries VisionEgg and Psychopy are more than sufficient to code my perception experiments. For data analysis, I’ve gotten pretty far with the SciPy library for Python, which has probability distributions, function minimization, Fourier transforms, etc. The Matplotlib library makes it easy to make plots in a way familiar to old MATLAB users like me. Unfortunately, however, it appears that nothing’s available for taking a load of data, formatted as many entries (e.g. rows) each of which has several values associated with it (one for each independent and dependent variable of the experiment), and
- summarizing (calculating mean etc.) of the dependent variable contingent on various independent variables (like an Excel pivot-table)
- performing the all-important (in experimental psychology and neuroscience) multiple linear regression and ANOVAs.
I wrote something for the first of these, but the second is too much for me. I have had to start using R.
R appears to be the best open-source data analysis and statistics program, and has an incredible variety of packages for all sorts of analyses, often programmed as soon as a statistics professor dreams one up. For example, there is a package for the directional statistics I need, which I don’t think you can find in SPSS or SAS. The R syntax is really clunky compared with the beauty that is Python; this is irritating, but doesn’t actually slow one down much.
Fortunately, RPy2 allows one to call R functions from Python. It’s a fairly basic interface, and it took me a while to understand how to pass data between Python and R, but it works well. I’m very grateful to the developers, who deserve more help.
The documentation of all these Python libraries leaves a lot to be desired. The example code snippets for SciPy are still too sparse, and more are sorely needed to help users quickly do specific things without having to spend an hour figuring out exactly what some poorly-documented function’s parameters do. The same goes for RPy. I hope to help out when I have time.
summarizing data by combinations of variables with python January 26, 2009
Posted by alexholcombe in science. Tags: Python, SciPy, programming, code
For data analysis, I switched from MATLAB to R, partly motivated by a desire to support open source. But my experiments nowadays are written in Python, so I decided to try analyzing the data with Python as well.
SciPy is an open-source library that helps with this, and duplicates a lot of MATLAB functionality to make it easier to switch from MATLAB. IPython provides an interactive command line with tab-completion, history, and some of the other conveniences that come with MATLAB. It’s been working well for my data plotting, except that my code was becoming cumbersome when it came to extracting the data I wanted to plot. The loadtxt function easily imports my data files into a structure called a recarray, similar to a data.frame in R, a lot like a flat spreadsheet with a name for each column. Then, I need to plot the dependent variable as a function of a subset of the independent variables in the experiment, like this:
Here I plotted the mean shift, and std dev of the shift, by observer (columns), eccentricity, and direction of motion (colors). This requires collapsing across the other variables that you can’t see here. I think this involves a “PivotTable” in Excel terminology. For Python, I wrote a function to which I pass a recarray and the names of the variables (datafile columns) that I want to collapse by, and it passes back multi-dimensional arrays providing the mean, standard deviation, and number of data points for every combination of the variables.
collapseBy(data,DV,*factors)
I hope someone finds this code as useful as I do; it seems something like this should be put into SciPy.
Update: Josef schooled me (in a helpful way!) by writing new code for this functionality in three different ways, with each way much cleaner than mine.
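For readers who want a feel for what this kind of collapsing involves, here is a minimal sketch. It is not the collapseBy code from this post, nor any of Josef's versions: the name collapse_by, the use of plain dictionaries for rows (rather than a recarray), and the choice to return a dictionary keyed by factor combination (rather than multi-dimensional arrays) are all assumptions made to keep the example short and free of dependencies. The data values are made up.

```python
from collections import defaultdict
from statistics import mean, pstdev


def collapse_by(rows, dv, *factors):
    """Summarize the dependent variable `dv` for every combination of `factors`.

    `rows` is assumed to be an iterable of dictionaries, one per trial.
    Returns {combination: {"mean": ..., "std": ..., "n": ...}}.
    """
    groups = defaultdict(list)
    for row in rows:
        # The key is the tuple of factor values for this trial.
        key = tuple(row[f] for f in factors)
        groups[key].append(row[dv])
    return {
        key: {
            "mean": mean(values),
            "std": pstdev(values),  # population SD; 0.0 for a single trial
            "n": len(values),
        }
        for key, values in groups.items()
    }


# Hypothetical data: three trials, two observers, one eccentricity.
trials = [
    {"observer": "A", "eccentricity": 2, "direction": 1, "shift": 1.0},
    {"observer": "A", "eccentricity": 2, "direction": -1, "shift": 3.0},
    {"observer": "B", "eccentricity": 2, "direction": 1, "shift": 5.0},
]

# Collapse across direction by leaving it out of the factor list.
summary = collapse_by(trials, "shift", "observer", "eccentricity")
for combination, stats in sorted(summary.items()):
    print(combination, stats)
```

Any variable left out of the factor list is collapsed across, which is the pivot-table behaviour described above.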
The binding problem: A new encyclopedia entry December 12, 2008
Posted by alexholcombe in neuroscience, open access, psychology.
The conventional encyclopedia: old and unimproved!
Here is a preprint of my entry for The Sage Encyclopedia of Perception with headword “The Binding Problem”. The hardcopy version of the encyclopedia will be a massive 1100-page tome with hundreds of contributors. Sadly, this is very much a conventional, 20th-century era encyclopedia: the style guide prohibited me from referencing the original papers I was referring to. The only way I could point the reader towards the original research was to say things like “Smith has shown …” or “In 1980, Treisman proposed …”.
Perhaps this encyclopedia style made sense back before the internet, when limited space might prevent actual referencing, and anyway the average reader had no ability to access original papers. So Britannica had an excuse to adopt their lofty tone which almost gives the impression they created all the knowledge themselves. But writing in 2008, to me it felt unconscionable to describe all these discoveries in an academic publication without giving credit where it’s due. So in the preprint I’ve posted, I’ve added all the references in as if it were a modern academic publication. And I’m posting this now, a good 10 months or more before the encyclopedia is actually published. I expect that ten months from now, the entry may be embarrassingly out of date.
Sage has dozens of encyclopedias like this in the works, all of which presumably have these major shortcomings of long publication lag and impoverished referencing, but apparently they still think they will make money. They are charging $450 for the Perception volume my entry will appear in!
To me, the open-access approach exemplified by Scholarpedia is the only way to go, because:
- It is free. So the readership is tens or hundreds of times larger.
- It is published nearly instantly, so it is not already out of date the day it appears.
- It does not kill trees.
- It is easily updated.
- Its authors have no incentive to undermine it, as I have done to the Sage Encyclopedia so that my work can be seen by those who are unwilling to pay $450 for it! By the way, posting a preprint as I have done is almost always legal.
Finally, online publishing projects like Scholarpedia do not have arbitrary word limits, which would have allowed me to avoid apologizing to those whose work on binding I left out because of the arbitrary limit in this encyclopedia. But there is evidence that overall, I omitted things fairly: I’m upset about me leaving so much of me out (e.g. Holcombe 2008; Holcombe & Cavanagh 2008; Holcombe & Judson 2007; Holcombe & Cavanagh 2001).
Alex O. Holcombe (2009). The binding problem. The Sage Encyclopedia of Perception.
don’t know much ’bout neural networks? An interactive tutorial December 10, 2008
Posted by alexholcombe in neuroscience, psychology.
I’m releasing an interactive tutorial suitable either for individual learning or for a class in which each student, or pair of students, has a computer. I used it for my third-year psychology university students. Before beginning the 100-minute class, most had little idea how connectionist networks could store memories or compute visually guided action. By the end, they were happily rewiring their networks to encode new memories or accomplish new actions. It’s all made possible by the free, beautiful, and easy-to-use Java-based neural network simulator SimBrain that I blogged about earlier.
SimBrain comes with a large number of tutorials, but these are designed for an entire course on neural networks. I needed one that could fit a single 90-minute class, so I created my own, which is basically just a modification of the great content they already released. I’ve posted the network files to be used with SimBrain, plus the instructions, here. The instructions are broken up into several separate webpages. Each one ends with an exercise for the students to try. The subsequent webpage discusses the answer to the exercise a bit. During the actual class we teach, we prevent the students from proceeding immediately to the subsequent webpage by password-protecting the pages, and giving them the password after they’ve made an effort. I might be able to provide access to that version upon request. Let me know of any problems, and whether you find the tutorial useful.
Nature targets financial weakness of PLoS journals July 3, 2008
Posted by alexholcombe in open access, science.
Nature has published a news item by Declan Butler on the finances of one of its competitors: the open-access PLoS journals, using language that puts the organisation and its journals, especially PLoS ONE, in a negative light.
The fact that PLoS does not meet its costs exclusively from the author publication fees, as Nature focuses on, is interesting, especially from the point of view of an organization like Nature Publishing Group, whose purpose is to make a profit. But,
- the purpose of PLoS is not to make a profit. The purpose is to create an outlet for original science that everyone in the world can read. This is worthwhile even if it can only happen at a “loss” subsidized by research foundations, charities, and governments.
- Even if one does insist on looking only at the “bottom line”, the analysis of Nature has a serious flaw. Open access, if it were corporate, would be called a “growth industry”, and the PLoS journals are young and have a high market share. We can be very confident in the upward growth trajectory, as very recently NIH, Harvard and other leading organisations have begun taking various steps to ensure that research they fund or produce is available open-access. But at this early stage, criticizing PLoS journals for not making ends meet is like asking the CEO of a new solar power company why he’s in business, when in 2008 oil is cheaper. There are reasons besides price to support solar power, and it may well be self-sustaining in future. Indeed, the open-access mandates will inevitably increase revenue, probably rapidly.
- PLoS ONE is characterized by the Nature report as a low-quality bulk publisher. It is true that PLoS ONE will publish practically any science as long as its methodology is proper and its conclusions reasonable based on the evidence. I think this is a good thing, because I believe the world needs outlets like this where science can get out rapidly, with commenting and rating tools that PLoS ONE has, so that post-publication scrutiny will continue, as part of a more transparent way to vet and evaluate science. But that’s a longer discussion.
I’d like to point readers directly to the article I’m criticizing, but unfortunately many can’t read it since Nature is not open access!
Butler, D. (2008). PLoS stays afloat with bulk publishing. Nature, 454(7200), 11-11. DOI: 10.1038/454011a
Conflict of interest alert: I am a member of both the editorial board and also the advisory board of PLoS ONE. For the views of more disinterested parties, see these other bloggers’ posts:
http://phylogenomics.blogspot.com/2008/07/only-nature-could-turn-success-of-plos.html
http://scienceblogs.com/drugmonkey/2008/07/nature_offers_a_completely_obj.php
http://scienceblogs.com/gregladen/2008/07/is_plos_coming_of_age.php
http://frontalblogotomy.blogspot.com/2008/07/nature-versus-nurturing-open-access.html
http://scienceblogs.com/gnxp/2008/07/nature_vs_everyone_else.php
mashing up Elsevier’s journal article database June 30, 2008
Posted by alexholcombe in open access. Tags: web2.0
Thanks to Efoundations, I see Elsevier has announced an Article 2.0 contest, in which programmers can create new value by harvesting data from 7500 XML-encoded scientific articles. This is an exciting opportunity for web 2.0 programmers interested in science. But I hope people keep in mind they’d be giving their software ideas to a publisher that charges exorbitant prices for publicly-funded science. Are there opportunities out there for expert web2.0 programmers to jump into open-access projects?
learning about neural networks: free software June 30, 2008
Posted by alexholcombe in neuroscience, psychology.
Free neural network simulation engines, good for understanding simple cognitive-style networks, abstracting away from the actual reality with all those pesky ion channels and membrane potentials and spikes.
- Emergent is a workhorse, used by serious neural networks researchers but also useful for learning, in conjunction with an associated neural nets textbook, which is probably good for advanced undergraduates.
- Brainwave, an Australian product from the University of Queensland, provides some nice little tutorials and you can get into them immediately thanks to them being embedded in the actual web pages. The disadvantage is that after you build a custom network, you can’t save its state and return to it later.
- SimBrain is a beautiful Java app for basic neural net simulation, with an extensive set of lessons to help you, step-by-step, construct and learn about various basic neural network architectures. It has an entirely graphical interface that is well-designed so you’re not overwhelmed by the serious functionality underneath. There’s enough going on that it took a little while to learn how to use, but it was a pleasure to do so. It was perfect for use with my one-hour undergraduate lab/tutorial session for psychology students. I’ve just taken and quickly adapted two of the many lessons, one on autoassociative networks to explain how the brain’s connections can allow it to retrieve an entire memory from a partial cue, and one on Braitenberg vehicles simulating how a mouse might follow an odor trail for cheese. The coolest thing about SimBrain is the virtual world with mice and cheese that lets you simulate actual behavior, and definitely adds to the “playability” fun factor.
Brains, Minds, Media is a new electronic journal with articles about tools like these.
supporting open access science with PLoS June 30, 2008
Posted by alexholcombe in open access, science.
McDawg says he has four PLoS ONE t-shirts! He must be embarrassed to not have a PLoS Pathogens t-shirt, PLoS Messenger Bag, or PLoS Travel Mug. Support open access by becoming a Public Library of Science member and get the goodies. Or do it for free by posting one of these free signs on your door or website.
PLoS ONE has been doing really well, with over 2000 papers published to date. But we do need more people making comments and adding ratings to papers, so visit some articles and try the ratings.

Parsimony: A newish principle? June 26, 2008
Posted by alexholcombe in history, science. Tags: philosophy
Everything should be made as simple as possible, but no simpler – Einstein, paraphrased
KISS- Keep It Simple, Stupid! - unknown
The principle of parsimony seems obvious, reflexive even. Simpler theories should be favored over more complicated ones. And the idea does seem to have been around for a long time, according to Wikipedia at least since the twelfth century when Maimonides apparently discussed it.
The principle is even embodied in stories of the beginning of experimental science. The heliocentric Copernican theory championed by our hero Galileo could explain the movement of the planets much more simply than the Church’s old geocentric theory, with its complicated structure of eccentrics, epicycles, deferents and equants. Parsimony being an obvious advantage of the heliocentric theory, the Church’s position was doomed, once everyone got over their religious piety. So the story goes, at least as I always understood it.
I was shocked to read in this online paper that the factor of simplicity was not even raised in Galileo’s time! In the same paper, we’re also told that “Copernicus actually introduced epicycles of his own, and even epicycles on top of these”.
If not then, when? When did parsimony become a principle of working science?
if it still hasn’t happened yet, it’s likely to take a long time longer! June 25, 2008
Posted by alexholcombe in probability and statistics, science. Tags: statistics probability
The Cauchy distribution is a unimodal distribution with fatter tails than a Gaussian. (Fig 1 at right)
Janssen & Shadlen (2005, Nature Neuroscience) found that monkey LIP neuron activity followed the subjective hazard function of an objective bimodal probability density function, which goes up, down, then up again. The hazard function gives the likelihood that the event will occur in the next moment given that it has not occurred already: the probability density at that time divided by the probability that the event has not yet happened. With a Gaussian distribution (bell-shaped curve), the hazard function increases monotonically with time (Fig 2); in other words, the longer the event has failed to occur, the more likely it is to occur in the next moment.

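But would the neurons successfully represent a Cauchy distribution, for which the hazard rate actually decreases with time once we are a little past the peak of the distribution? (Strictly, the Cauchy has no defined mean, so the peak, which is also the median, is the natural reference point.) (Fig 3)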
This hazard function is surprising to many, because it seems that for a unimodal distribution, as time elapses and the event still has not occurred, it should be increasingly likely that it will occur. But this won’t happen if the tails are fat enough, as pointed out by Nassim Taleb in his book The Black Swan. Hence the title of this post. This kind of hazard function applies to various real-world phenomena, like construction contractors: as time passes after the date they said the job would be done, every day they don’t finish indicates that the time they’ll finish is probably even further into the future. I think Taleb suggests that humans don’t usually represent this hazard function, but he’s probably referring to cognition. I don’t know if the same is true for a go/no-go learned response time task or the like, something more automatic than cognition. Probably no one has done this experiment. Maybe it is indeed very difficult to learn this.
Indeed, I think someone (maybe Taleb) has shown that it is hard to learn the fatness of tails: it takes a lot of data, even in principle. Maybe our default hazard function is increasing. It might be easier to see this effect in a two-button experiment, where the task is to press one or the other button, and one has a Gaussian (increasing hazard) and the other a Cauchy distribution (decreasing hazard).
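To make the contrast concrete, here is a small sketch that computes both hazard functions using hazard(t) = density(t) / (1 - cumulative probability(t)). The function names, the standard parameters (center 0, width 1), and the time points are my own choices for illustration; none of this comes from the Janssen & Shadlen paper. It uses only the Python standard library, though scipy.stats could be used for the distributions instead.

```python
import math


def gaussian_hazard(t, mu=0.0, sigma=1.0):
    """Hazard of a Gaussian: density divided by survival probability."""
    z = (t - mu) / sigma
    density = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    survival = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(event later than t)
    return density / survival


def cauchy_hazard(t, loc=0.0, scale=1.0):
    """Hazard of a Cauchy: density divided by survival probability."""
    z = (t - loc) / scale
    density = 1.0 / (math.pi * scale * (1.0 + z * z))
    survival = 0.5 - math.atan(z) / math.pi  # P(event later than t)
    return density / survival


# Times measured from the peak of each distribution, in units of its width.
for t in (0.0, 1.0, 2.0, 4.0, 8.0):
    print(f"t={t:4.1f}  Gaussian hazard={gaussian_hazard(t):7.3f}  "
          f"Cauchy hazard={cauchy_hazard(t):6.3f}")
```

With these parameters the Gaussian hazard keeps climbing at every time point, whereas the Cauchy hazard is falling by t = 2 and keeps falling after that: the longer you have waited, the less likely the event is to happen in the next moment.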