Homework 3 (due Tuesday, Oct 22nd, at 11:59pm PST)
Note
Need help? Go to http://angus.askbot.com/ to ask questions and see other people’s answers!
Note
See Using ‘screen’ for information on using screen.
1. Annotate E. coli; compare annotations
Hint: you definitely want to coordinate with some other people on this one. Each Prokka run will take about 45 minutes.
Start up an m1.xlarge. Install BLAST (just the part under “Next, install BLAST”, in BLASTing your assembled data) and download ngs-scripts:
git clone https://github.com/ngs-docs/ngs-scripts /usr/local/share/ngs-scripts
You’ll also need screed:
pip install screed
Next: install prokka, as per Installing Prokka.
Go to /mnt and download a bunch of velvet assemblies that I ran for you (these are what you would get by running through Basic (single-genome) assembly for many different values of k):
cd /mnt
curl -O http://public.ged.msu.edu.s3.amazonaws.com/velvet-dn-ecoli.tar.gz
tar xzf velvet-dn-ecoli.tar.gz
Run the command ls; you should see about 11 assemblies, named ecoli-kXX.fa.
Now, download the “official” E. coli MG1655 protein file:
curl -O http://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655_uid57779/NC_000913.faa
and format it for BLAST:
formatdb -i NC_000913.faa -o T -p T
Then, build and compare annotations by doing the following: for each E. coli file, run Prokka and construct a set of pairwise BLAST files:
/mnt/prokka-1.7/bin/prokka ecoli-k19.fa --outdir k19 --prefix k19
formatdb -i k19/k19.faa -o T -p T
blastall -i k19/k19.faa -d NC_000913.faa -p blastp -e 1e-12 -o k19.x.mg1655
blastall -d k19/k19.faa -i NC_000913.faa -p blastp -e 1e-12 -o mg1655.x.k19
Finally, for each set of annotated E. coli files, calculate the orthologs and then count them:
python /usr/local/share/ngs-scripts/blast/blast-to-ortho-csv.py k19/k19.faa NC_000913.faa k19.x.mg1655 mg1655.x.k19 > k19-ortho.csv
wc -l k19-ortho.csv
You should see output like this: 3804 k19-ortho.csv. Record the number (3804) and the k value (19); do this for each of the E. coli assemblies and annotations, then put the list in a spreadsheet and send it to me.
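If you would rather not copy the numbers by hand, something like the following could gather them into one file you can open in a spreadsheet. This is only a sketch: it assumes the ortholog files are named k<K>-ortho.csv as in the k19 example and sit in the current directory, and the function name summarize_orthologs and the output file ortho-summary.csv are made up for illustration.

```shell
# Sketch: collect the ortholog counts into one CSV for a spreadsheet.
# Assumes files named k<K>-ortho.csv (as in the k19 example) in the
# current directory; summarize_orthologs is a hypothetical helper name.
summarize_orthologs() {
    echo "k,orthologs"
    for f in k*-ortho.csv
    do
        [ -e "$f" ] || continue     # skip if the pattern matched no files
        k=${f#k}                    # drop the leading "k":  19-ortho.csv
        k=${k%-ortho.csv}           # drop the suffix:       19
        n=$(wc -l < "$f")           # the same number that wc -l reports
        echo "${k},$((n))"          # $((n)) trims any padding wc adds
    done
}

# ortho-summary.csv is a hypothetical output name.
summarize_orthologs > ortho-summary.csv
```

Each row of ortho-summary.csv is then a k value followed by the line count of the matching ortholog file.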
Note
You can automate the above by putting it in a for loop in the shell:
for i in {19..51..2}
do
/mnt/prokka-1.7/bin/prokka ecoli-k${i}.fa --outdir k${i} --prefix k${i}
# other commands involving ${i}
done
Here, the for loop will run the commands between ‘do’ and ‘done’ once for every value of “i” between 19 and 51, skipping numbers forward by 2 each time (e.g. 19, 21, 23, 25, ...).
A good trick is to use ‘echo’ to see if your commands are correct before actually running them – e.g. run
for i in {19..51..2}
do
echo /mnt/prokka-1.7/bin/prokka ... ${i}
done
and then look at the commands it prints; once they look like the right commands, take the ‘echo’ out.
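Putting the loop and the ‘echo’ trick together, one possible sketch of the Prokka and BLAST steps looks like this. The RUN variable and the annotate_and_blast function are made-up names, not part of Prokka or BLAST; the commands themselves are the k19 commands from above with ${i} substituted. By default the script only prints the commands; run it with RUN set to an empty value to execute them. The blast-to-ortho-csv.py step is left out because its ‘>’ redirect would capture the echo output instead of printing it.

```shell
# Sketch: Prokka + pairwise BLAST for every assembly, with a dry-run switch.
# RUN is a hypothetical variable: by default it is "echo", so commands are
# only printed; set it to an empty value (RUN= bash thisscript.sh) to run them.
RUN=${RUN-echo}

# annotate_and_blast is a hypothetical helper; $1 is the k value (e.g. 19).
annotate_and_blast() {
    i=$1
    $RUN /mnt/prokka-1.7/bin/prokka ecoli-k${i}.fa --outdir k${i} --prefix k${i}
    $RUN formatdb -i k${i}/k${i}.faa -o T -p T
    $RUN blastall -i k${i}/k${i}.faa -d NC_000913.faa -p blastp -e 1e-12 -o k${i}.x.mg1655
    $RUN blastall -d k${i}/k${i}.faa -i NC_000913.faa -p blastp -e 1e-12 -o mg1655.x.k${i}
}

for i in {19..51..2}
do
    annotate_and_blast ${i}
done
```

Since each Prokka run takes a while, it is worth reading the printed commands carefully (and running the real thing inside ‘screen’) before switching the dry run off.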
2. Programming in Python: lists and dictionaries and functions, oh my! part 2
Log in to your EC2 instance via SSH and install ipythonblocks:
pip install ipythonblocks
and Ruby/Ruby gems:
apt-get install -y ruby1.9.1 rubygems
Then install ‘gist’:
gem install gist
and download the class 3 notebook:
cd /usr/local/notebooks
curl -O https://raw.github.com/beacon-center/2013-intro-computational-science/master/notebooks/class3-lists-dicts-functions.ipynb
Connect to IPython Notebook (‘https://’ + your EC2 machine name) and solve the problems. Post the notebook as a GitHub ‘gist’ (see last line for instructions) and send me the nbviewer link.