Module 1, QCB midterm test, November 12th 2021

Download the data

  1. Download sciprog-qcb-2021-11-12-FIRSTNAME-LASTNAME-ID.zip and extract it on your desktop. Folder content should be like this:

|- sciprog-qcb-2021-11-12-FIRSTNAME-LASTNAME-ID
    |- QCB_midTerm_2021_11_12.pdf
    |- exercise1.py
    |- exercise2.py
    |- shortSample.fasta
    |- test_reads_10k.fasta
  1. Rename sciprog-qcb-2021-11-12-FIRSTNAME-LASTNAME-ID folder: put your name, lastname an id number, like sciprog-qcb-2021-11-12-john-doe-432432

From now on, you will be editing the files in that folder. At the end of the exam, that is what will be evaluated.

  1. Edit the files following the instructions.

Problem 1

Given a text string S (possibly on several lines), write a python function that takes the string S and a second string S1 (assumed to be shorter than S) in input and returns a list with all positions at which S1 starts in S. If no such positions exist a warning message should be printed and an empty list returned.

Example:

S = """TGAATACATATTATTCCGC
GTCCTACCAAAAAATATACATAGATAT
GATACTAGGGGACCGGTTCCGGGATGA
TGATGATACTAGGGGACCGGTTGATAC"""

print(findPositions(S, "TGA"))
print(findPositions(S, "GAT"))
print(findPositions(S, "none"))

should return:

String 'TGA' found at the following positions:
[0, 45, 70, 73, 76, 94]

String 'GAT' found at the following positions:
[41, 46, 68, 71, 74, 77, 95]

Warning: string 'none' not found. Returning an empty list.
[]

Problem 2

Fasta is a format for storing sequence information and the corresponding sequencing quality. A sample entry is the following:

>SRR190875.100000 HWI-ST180_0167:2:1:6207:136494
AGGCCTAGGTCTTCCAGAGTCGCTTTTTCCAGCTCCAGACCGATCTCTTCAG

where the first line is the identifier of the read and starts with a >. The sequence follows the line with the identifier (in general it can be on several lines, but you can assume it is only on one line).

HINT: Biopython’s ``SeqIO.parse`` can read fasta formatted files.

Kmers are substrings of length K of a given sequence. For example, 2mers are all substrings having length 2 of a given sequence. Assuming a sequence seq = "AATAACTAGC", all possible 2mers of seq are: AA,AT,TA,AC,CT,AG,GC. Note that the 2mers "AA" and "TA" are present twice.

Start editing exercise2.py, implementing the required functions as in the next points.

2.1) computeKmers(seq, kmerLen, kmerDict)

Implement computeKmers(seq, kmerLen, kmerDict) : gets a sequence seq (stored as a string), an integer kmerLen specifying the size of the kmer and fills the dictionary kmerDict with all the possible kmers of length kmerLen present in the sequence as keys and their multiplicity (i.e., the number of times they appear in the sequence) as values.

Example: given the sequence seq="AATAACTAGC" the dictionary of 2mers is the following:

{'AA': 2, 'AG': 1, 'TA': 2, 'AT': 1, 'GC': 1, 'CT': 1, 'AC': 1}

Note: if kmerLen is higher than len(seq) the dictionary will be not updated. Calling

seq = "AATTAATTAACTAGCCTTAA"
twoMers = dict()
print("sequence:" , seq)
computeKmers(seq, 2, twoMers)
print("twoMers")
print(twoMers)
fourMers = dict()
computeKmers(seq, 4, fourMers)
print("4mers")
print(fourMers)
t2Mers = dict()
computeKmers(seq, 32, t2Mers)
print("32mers")
print(t2Mers)

should give:

sequence: AATTAATTAACTAGCCTTAA
twoMers
{'AA': 4, 'AT': 2, 'TT': 3, 'TA': 4, 'AC': 1, 'CT': 2, 'AG': 1, 'GC': 1, 'CC': 1}
4mers
{'AATT': 2, 'ATTA': 2, 'TTAA': 3, 'TAAT': 1, 'TAAC': 1, 'AACT': 1, 'ACTA': 1, 'CTAG': 1, 'TAGC': 1, 'AGCC': 1, 'GCCT': 1, 'CCTT': 1, 'CTTA': 1}
32mers
{}

2.2) parseFasta(fileName,kmerLen)

Implement parseFasta(fileName,kmerLen): reads the fasta formatted file fileName and returns a dictionary with all the kmers having length kmerLen and their multiplicities present in the file. The function should print the number of sequences read from the file and the number of distinct kmers.

Important: it could be very useful to take advantage of the function computeKmers developed at point 1 to implement this function.

Example: if the file fileName contains the two sequences "ATATCACATCTG" and "CTGACATATAT" (with the respective sequence headers), the output dictionary of 4mers would be:

{'ATAT': 3, 'TCTG': 1, 'TCAC': 1, 'TGAC': 1, 'CATA': 1, 'GACA': 1, 'TATA': 1, 'TATC': 1, 'CATC': 1, 'ATCT': 1, 'CACA': 1, 'ATCA': 1, 'CTGA': 1, 'ACAT': 2}

Given the input files shortSample.fasta and test_reads_10k.fasta (that are included in your working folder), calling:

myFile = "shortSample.fasta"
kmers = parseFasta(myFile,4)
print(kmers)

should give:

Read 2 sequences.
Total number of 4mers: 14
{'ATAT': 3, 'TATC': 1, 'ATCA': 1, 'TCAC': 1, 'CACA': 1, 'ACAT': 2, 'CATC': 1, 'ATCT': 1, 'TCTG': 1, 'CTGA': 1, 'TGAC': 1, 'GACA': 1, 'CATA': 1, 'TATA': 1}

while

myFile = "test_reads_10k.fasta"
kmers = parseFasta(myFile,17)

should print:

Read 10000 sequences.
Total number of 17mers: 765263

IMPORTANT: given the length of test_reads_10k.fasta, try to process this file only when the function parseFasta() seems to work with the file shortSample.fasta.

2.3) plotKmerSpectrum(kmerDict)

Implement plotKmerSpectrum(kmerDict): given a dictionary kmerDict created as above (i.e., keys are kmer sequences and values are the number of occurrences of the kmer sequence) plots the kmer spectrum of all the kmers present in the dictionary. The kmer spectrum is a histogram of the multiplicities of all kmers. For example, given the dictionary

{'AC': 1, 'AG': 1, 'AA': 4, 'AT': 2, 'CT': 2, 'CC': 1, 'TA': 4, 'TT': 3, 'GC': 1}

a dictionary representing the kmer spectrum of these kmers is:

{1: 4, 2: 2, 3: 1, 4: 2}

which basically means that there are 4 2mers with multiplicity 1, 2 with multiplicity 2, 1 with multiplicity 3 and 2 with multiplicity 4. The plot for such a kmer spectrum should be:

img1

Calling

myFile = "test_reads_10k.fasta"
kmers = parseFasta(myFile,17)
plotKmerSpectrum(kmers)

should produce:

Read 10000 sequences.
Total number of 17mers: 765263
img2

Show/Hide a possible solution