Wed, Nov 27, 2013 – pandas, SciPy, and matplotlib for data analysis, statistics, and plotting

Installing pandas on EC2

Since pandas does not come installed by default on EC2, we need to install it ourselves. Run the following commands to update your EC2 node:

# Update the apt-get package index:
apt-get update

# Prerequisites
apt-get install build-essential gfortran gcc g++ curl wget python-dev

# Make sure you have the latest setuptools
wget -O - | python2.7

# Get pip
curl --show-error --retry 5 | python2.7

The pandas module is built on top of NumPy, so next we need to update NumPy to the latest version. Unfortunately, the automatic upgrade tools on EC2 don’t seem to work correctly, so we need to update it manually:

# download the latest version of NumPy
wget --no-check-certificate

# unzip the NumPy file and move into the install directory
cd numpy-1.8.0/

# build NumPy
python setup.py build

# install NumPy
python setup.py install

# move back to the home directory and clean up
cd ~/
rm -r numpy-1.8.0/

Finally, we can install pandas:

pip install pandas
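
Before moving on, it can help to confirm that both installs actually took. The following is a minimal sketch of such a check, assuming only that the steps above finished without errors. It prints whatever versions are installed; the NumPy line should report 1.8.0 if the manual build above was picked up.

```python
# Sanity check: confirm that NumPy and pandas import cleanly
# and report which versions Python actually picked up.
import numpy
import pandas

print("NumPy version: " + numpy.__version__)
print("pandas version: " + pandas.__version__)
```

If either import fails with an ImportError, the corresponding install step most likely did not complete.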

You should be able to use pandas on your EC2 nodes now. Try it out:

# download some test data

# import pandas and read some data
import pandas as pd

test_data = pd.read_csv("parasite_data.csv")
print(test_data)
print(test_data["Virulence"])
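
If the test data file is not at hand, the same calls can be tried on a few rows of made-up data. This is only a sketch: the values below and the ShannonDiversity and Replicate column names are invented for illustration, and only the Virulence column name comes from the example above.

```python
import io

import pandas as pd

# Hypothetical sample data for illustration only. The last row is
# deliberately missing its ShannonDiversity value.
csv_text = u"""Replicate,Virulence,ShannonDiversity
1,0.5,0.98
2,0.5,1.02
3,0.8,0.41
4,0.8,
"""

# read_csv accepts any file-like object, so an in-memory buffer
# stands in for a file on disk here.
demo = pd.read_csv(io.StringIO(csv_text))

print(demo)                                     # the whole table
print(demo["Virulence"])                        # a single column
print(demo["Virulence"].mean())                 # a column summary
print(demo["ShannonDiversity"].isnull().sum())  # count of missing values
```

The empty field in the last row is read in as NaN, which is how pandas marks missing data.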

pandas, SciPy, and matplotlib

We will go over some tutorials on pandas, SciPy, and matplotlib.

pandas & SciPy tutorial:

pandas video tutorial:

matplotlib tutorial:
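
As a rough preview of how the three libraries fit together, here is a small sketch. It is not taken from the tutorials: the data values, the Diversity column name, the output file name, and the choice of Welch's t-test are all assumptions made for illustration, and it presumes SciPy and matplotlib are installed alongside pandas.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, since a remote node may have no display
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

# Hypothetical measurements, invented for this sketch.
data = pd.DataFrame({
    "Virulence": [0.5, 0.5, 0.5, 0.8, 0.8, 0.8],
    "Diversity": [1.00, 0.95, 1.05, 0.40, 0.45, 0.35],
})

# pandas: split the Diversity column by treatment level.
low = data[data["Virulence"] == 0.5]["Diversity"]
high = data[data["Virulence"] == 0.8]["Diversity"]

# SciPy: compare the two groups with Welch's t-test
# (equal_var=False drops the equal-variance assumption).
t_stat, p_value = stats.ttest_ind(low, high, equal_var=False)
print("t = %.2f, p = %.4f" % (t_stat, p_value))

# matplotlib: draw one box per group and write the figure to a file.
fig, ax = plt.subplots()
ax.boxplot([low.values, high.values])
ax.set_xticklabels(["0.5", "0.8"])
ax.set_xlabel("Virulence")
ax.set_ylabel("Diversity")
fig.savefig("diversity_by_virulence.png")  # hypothetical output file name
```

The division of labor is the point here: pandas holds and slices the table, SciPy runs the statistical test, and matplotlib draws the result.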