Homework 3 (due Tuesday, Oct 22nd, at 11:59pm PST)


Need help? Go to http://angus.askbot.com/ to ask questions and see other people’s answers!


See Using ‘screen’ for information on using screen.

1. Annotate E. coli; compare annotations

Hint: you definitely want to coordinate with some other people on this one. Each Prokka run will take about 45 minutes.

Start up an m1.xlarge. Install BLAST (just the part under “Next, install BLAST”, in BLASTing your assembled data) and download ngs-scripts:

git clone https://github.com/ngs-docs/ngs-scripts /usr/local/share/ngs-scripts

You’ll also need screed:

pip install screed

Next: install prokka, as per Installing Prokka.

Go to /mnt and download a bunch of velvet assemblies that I ran for you (what would be produced by running through Basic (single-genome) assembly for many different k):

cd /mnt
curl -O http://public.ged.msu.edu.s3.amazonaws.com/velvet-dn-ecoli.tar.gz
tar xzf velvet-dn-ecoli.tar.gz

Run the command ls; you should see about 11 assemblies, named ecoli-kXX.fa.
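
If you would rather check the assemblies programmatically than read the ls output, here is a minimal sketch. The function name list_k_values is made up for this example; it only assumes the ecoli-kXX.fa naming described above.

```shell
# Hypothetical helper (not part of ngs-scripts): print the k value of
# every ecoli-kXX.fa file in a directory, one per line, sorted numerically.
list_k_values() {
    dir=${1:-.}
    for f in "$dir"/ecoli-k*.fa; do
        [ -e "$f" ] || continue      # glob matched nothing; skip
        name=${f##*/}                # strip directory  -> ecoli-k19.fa
        k=${name#ecoli-k}            # strip the prefix -> 19.fa
        k=${k%.fa}                   # strip the suffix -> 19
        echo "$k"
    done | sort -n
}
```

Running list_k_values /mnt | wc -l then gives the number of assemblies you have to annotate.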

Now, download the “official” E. coli MG 1655 file:

curl -O http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.faa

and format it for BLAST:

formatdb -i NC_000913.faa -o T -p T

Then, build and compare annotations by doing the following: for each E. coli file, run Prokka and construct a set of pairwise BLAST files:

/mnt/prokka-1.7/bin/prokka ecoli-k19.fa --outdir k19 --prefix k19

formatdb -i k19/k19.faa -o T -p T
blastall -i k19/k19.faa -d NC_000913.faa -p blastp -e 1e-12 -o k19.x.mg1655
blastall -d k19/k19.faa -i NC_000913.faa -p blastp -e 1e-12 -o mg1655.x.k19

Finally, for each set of annotated E. coli files, calculate the orthologs and then count them:

python /usr/local/share/ngs-scripts/blast/blast-to-ortho-csv.py k19/k19.faa NC_000913.faa k19.x.mg1655 mg1655.x.k19 > k19-ortho.csv
wc -l k19-ortho.csv

You should see output like this: 3804 k19-ortho.csv. Record the number (3804) and the k value (19). Do this for each of the E. coli assemblies and annotations, then put the list in a spreadsheet and send it to me.
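
Once several kXX-ortho.csv files exist, the counts can be gathered in one pass instead of being copied by hand. This is a sketch only: summarize_orthologs is a hypothetical helper name, and it assumes the kXX-ortho.csv naming used above and that the number you want is the wc -l line count.

```shell
# Hypothetical helper (not part of ngs-scripts): print a "k,orthologs"
# table with one row per kXX-ortho.csv file found in a directory.
summarize_orthologs() {
    dir=${1:-.}
    echo "k,orthologs"
    for f in "$dir"/k*-ortho.csv; do
        [ -e "$f" ] || continue              # glob matched nothing; skip
        name=${f##*/}                        # strip directory -> k19-ortho.csv
        k=${name#k}                          # strip the "k"   -> 19-ortho.csv
        k=${k%-ortho.csv}                    # strip the suffix -> 19
        count=$(wc -l < "$f" | tr -d ' ')    # same number that wc -l reports
        echo "$k,$count"
    done | sort -t, -k1,1n                   # sort the rows numerically by k
}
```

For example, summarize_orthologs /mnt > ortho-counts.csv writes a file that opens directly in a spreadsheet.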


You can automate the above by putting it in a for loop in the shell:

for i in {19..51..2}
do
   /mnt/prokka-1.7/bin/prokka ecoli-k${i}.fa --outdir k${i} --prefix k${i}
   # other commands involving ${i}
done

Here, the for loop will run the commands between ‘do’ and ‘done’ once for every value of “i” between 19 and 51, skipping numbers forward by 2 each time (e.g. 19, 21, 23, 25, ...).

A good trick is to use ‘echo’ to see if your commands are correct before actually running them – e.g. run

for i in {19..51..2}
do
   echo /mnt/prokka-1.7/bin/prokka ... ${i}
done

and then check the printed commands; once they look right, take the ‘echo’ out and run the loop for real.
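
The loop and the ‘echo’ trick can be combined so that one switch turns the dry run on and off. The sketch below is one possible way to do it, not the required one: DRY_RUN, run and annotate_all are names invented for this example, and the Prokka and BLAST command lines are copied from the steps above. The ortholog step is left out because its output redirection needs separate handling.

```shell
# DRY_RUN is a hypothetical switch for this sketch: 1 = only print the
# commands, 0 = actually execute them. It defaults to the safe dry run.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode; otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Run (or print) the annotation and BLAST steps for each k value given.
annotate_all() {
    for i in "$@"; do
        run /mnt/prokka-1.7/bin/prokka "ecoli-k${i}.fa" --outdir "k${i}" --prefix "k${i}"
        run formatdb -i "k${i}/k${i}.faa" -o T -p T
        run blastall -i "k${i}/k${i}.faa" -d NC_000913.faa -p blastp -e 1e-12 -o "k${i}.x.mg1655"
        run blastall -d "k${i}/k${i}.faa" -i NC_000913.faa -p blastp -e 1e-12 -o "mg1655.x.k${i}"
    done
}
```

Calling annotate_all 19 21 23 prints twelve commands for inspection; setting DRY_RUN=0 before the call runs them instead. Since each Prokka run takes about 45 minutes, do the real run inside screen.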

2. Programming in Python: lists and dictionaries and functions, oh my! part 2

Log in to your EC2 instance via SSH and install ipythonblocks:

pip install ipythonblocks

and Ruby/Ruby gems:

apt-get install -y ruby1.9.1 rubygems

Then install ‘gist’:

gem install gist

and download the class 3 notebook:

cd /usr/local/notebooks
curl -O https://raw.github.com/beacon-center/2013-intro-computational-science/master/notebooks/class3-lists-dicts-functions.ipynb

Connect to IPython Notebook (‘https://’ + your EC2 machine name) and solve the problems. Post the notebook as a GitHub ‘gist’ (see last line for instructions) and send me the nbviewer link.