summarizing data by combinations of variables with python
For data analysis, I switched from using MATLAB, partially motivated by a desire to support open source, to using R. But my experiments nowadays are written in Python, so I decided to try analyzing the data with Python as well.
SciPy is an open-source library that helps with this, and duplicates a lot of MATLAB functionality to make it easier to switch from MATLAB. IPython provides an interactive command line with tab-completion, history, and some of the other conveniences that come with MATLAB. It’s been working well for my data plotting, except my code was becoming cumbersome when it came to extracting the data I wanted to plot. The
loadtxt function easily imports my data files in a structure called a recarray, similar to a data.frame in R, a lot like a flat spreadsheet with a name for each column. Then, I need to plot the dependent variable as a function of a subset of the independent variables in the experiment, like this:
Here I plotted the mean shift, and std dev of the shift, by observer (columns), eccentricity, and direction of motion (colors). This requires collapsing across the other variables that you can’t see here. I think this involves a “PivotTable” in Excel terminology. For python, I wrote a function where I pass a recarray and the names of the variables (datafile columns) that I want to collapse by, and it passes back multi-dimensional arrays providing the mean, standard deviation, and number of data points for every combination of the variables.
I hope someone finds this code as useful as I do; it seems something like this should be put into SciPy.
Update: Josef schooled me (in a helpful way!) by writing new code for this functionality in three different ways, with each way much cleaner than mine.