# Homework 2 (due Tuesday, Oct 15th, at 11:59pm PST)¶

Note

Note

See Using ‘screen’ for information on using screen.

## 2. Retrieve CRP_ECOLI from an assembly.¶

Go to http://www.ncbi.nlm.nih.gov/ and search the protein database for CRP_ECOLI. Retrieve the FASTA version of this gene (just the amino acids).

Next, run an assembly with a k other than 31: pick any odd k between 19 and 51, and rerun the velvet etc. commands to produce an ‘ecoli-kXX.fa’ assembly.

Install this assembly in a BLAST server (see BLASTing your assembled data) and search it with the CRP_ECOLI protein sequence.

## 3. Compare the mismatch profiles for 4 different stages of reads.¶

We’re going to take four different collections of reads and map them to an assembled genome, then look at the read mismatch profile (where mismatches between the read and the assembly exist, plotted by position in read). This is a way of looking at the overall error rate.

First, install bowtie, which does read mapping:

cd /root
curl -O -L http://sourceforge.net/projects/bowtie-bio/files/bowtie/0.12.7/bowtie-0.12.7-linux-x86_64.zip
unzip bowtie-0.12.7-linux-x86_64.zip
cd bowtie-0.12.7
cp bowtie bowtie-build bowtie-inspect /usr/local/bin

Now, make the E. coli assembly into something that bowtie can map reads to by indexing it:

cd /mnt/assembly
bowtie-build ecoli-k31.fa ecoli-k31

Next, collect the four read data sets. First, get 200k of the raw reads:

Also collect the first 200k of the QC reads:

Take 200k reads from the diginorm processing step, and the abundfilt processing step, too:

Now, map all of them:

You should see output like this:

# reads with at least one reported alignment: 181936 (90.97%)
# reads that failed to align: 18064 (9.03%)
Reported 181936 alignments to 1 output stream(s)

Take a look at the mapping files – it shows which reads map where in the genome, and where the mismatches are for each read:

Now, let’s convert each of these mapping files into a set of counts:

git clone https://github.com/ngs-docs/ngs-scripts.git /root/ngs-scripts

do
python /root/ngs-scripts/bowtie/map-profile.py $i >$i.count
done

This will create a bunch of .count files, one for each .map file.

To graph these files, go to your IPython Notebook (the https URL) and create a new notebook called hw2-mismatches-USERID. In this notebook, put two cells. First,

cd /mnt/assembly

and second:

When you execute these, it will load the files into Python; you can plot them in a third cell like so:

plot(qc_counts[:,0], qc_counts[:,1], label='qc')
plot(raw_counts[:, 0], raw_counts[:, 1], label='raw')
axis(ymin=0, ymax=80000, xmin=0, xmax=160)
ylabel('number of reads with mismatches at that position')
legend()

This will plot the first column of the two given counts files (the position in the read) against the second column (the number of mismatches at that position).

Play with the data a bit by adding cells to graph different combinations against each other, changing axes, etc. Use the graphs to determine which data set has the most mismatches, save the notebook, and send it to me.