BioPython module and algorithms

Asst. Prof. Teerasak E-kobon

25 Jan 2021

The previous chapter taught you how to install and use certain Python modules. This chapter will focus on the BioPython Module and its applications in bioinformatics and algorithmic design. The BioPython module is imported by import Bio statement.  Several useful functions to work with the sequence object. Let's practice by the following example. This example is designed to find the enzyme cleavage position in the protein sequence. First, a single protein sequence is an input and the position of 'R' amino acid is counted before the loop will search all 'R' positions in the sequence. The peptides are cut at the 'R' position and the peptide sequences are stored in the list named peptideSeq.

Example 1 Finding protein cleavage site and returning the peptide sequences

from Bio.Seq import Seq
my_seq = Seq("MRKKTAIAIAVALAGFATVAQAAPKDNTWYTGAKLGWSQYHDTGFINNNGPTHENQLGAGAFGGYQVNPYVGFEMGYDWLGRMPYKGSVENGAYKAQGVQLTAKLGYPITDDLDIYTRLGGMVWRADTKSNVYGKNHDTGVSPVFAGGVEYAITPEIATRLEYQWTNNIGDAHTIGTRPDNGMLSLGVSYRFGQGEAAPVVAPAPAPAPEVQTKHFTLKSDVLFNFNKATLKPEGQAALDQLYSQLSNLDPKDGSVVVLGYTDRIGSDAYNQGLSERRAQSVVDYLISKGIPADKISARGMGESNPVTGNTCDNVKQRAALIDCLAPDRRVEIEVKGIKDVVTQPQA")

print("Number of R is", my_seq.count("R"))
print(my_seq[0:3])
print(my_seq[0])
print("The length of this sequence is ", len(my_seq), "amino acids.")

peptideSeq = []
peptide = ""
for i in range(0, len(my_seq)):
    if my_seq[i] != 'R':
       peptide = peptide + my_seq[i]
    else:
       peptide = peptide + 'R'
       peptideSeq.append(peptide)
       peptide = ""

print("The cleavage of this protein returns ", my_seq.count("R"), "positions.",
      "\n", peptideSeq)

 

Question 1 Please modify the codes to check at least five different cleavage sites and return user-friendly outputs.

 

The SeqIO module provides functions to parse information and contents from several file formats (.fasta and .gbk). The parsed information is then analyzed by other following codes. The second example shows how to read a multi-fasta file and parse the protein sequences for further analysis. The testSeqChapter3.fasta is available here.

Example 2 Read and parse the fasta sequence file

from Bio import SeqIO

all_species = []
for seq_record in SeqIO.parse("C:\\Users\\TeerasakArt\\Downloads\\testSeqChapter.fasta", "fasta"):
    all_species.append(seq_record.description)
    print(seq_record.format("fasta"))

print(all_species)

Question 2 Please modify codes in example 1 to continuously read each protein sequence from the testSeqChapter.fasta file and output the cleaved peptides from each protein.

Question 3 Please store these fasta sequences and the definition as a dictionary.

The modules for multiple sequence alignments will not be covered in this chapter as it will be trained by the next lecturer. If you want to compare your sequence with other sequences in the databases, there is the NCBIWWW module responsible for the BLAST functions ( qblast() ). The third example shows how to send a search query to blast against the nucleotide database and record the search output into a file.

 

Example 3 Use BioPython to operate the BLAST search

from Bio.Blast import NCBIWWW

result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")

with open("C:\\Users\\TeerasakArt\\Downloads\\my_blast.xml", "w") as out_handle:
    out_handle.write(result_handle.read())
    result_handle.close()

Question 4 Could you extend the codes in example 3 to display well-formatted output on the screen?

The BioPython also provides modules to communicate with other bioinformatics resources including SwissProt for searching the Swiss-Prot protein database, Bio.ExPASy for accessing the enzyme database, ScanProsite for accessing the Prosites protein domain and motif database, PDB protein structure database, Bio,PopGen for population genetics analysis, and Bio.Phylo for phylogenetics analysis. The fourth example demonstrates basic usage of some modules which can be useful for later program development. For the use of the ScanProsite module, this module does not seem to work well so we trace to the file origin of the module here and add this module manually to the current program environment which means that the new ScanProsite module is no longer under the Bio.ExPASy and can be called directly from the current location. From the error message obtained from running the first version of the codes, this error message could be searched for some solutions by Google and there is a solution about changing the server mirror of this ExPASy server found. Therefore, inside the ScanProsite.py file, the line "def scan(seq="", mirror="https://www.expasy.org", output="xml", **keywords):" is changed to "def scan(seq="", mirror="https://prosite.expasy.org/", output="xml", **keywords):". Here is an example to edit the existing codes when they are not running properly. The new version of the ScanProsite.py file is available here.

Example 4 Trial of SwissProt, ExPASy and ScanProsite modules

from Bio import ExPASy
from Bio import SwissProt
handle = ExPASy.get_sprot_raw("O23729")
record = SwissProt.read(handle)
handle.close()
print(record.entry_name)

handle2 = ExPASy.get_prosite_raw('PS00001')
text = handle2.read()
print(text)
handle2.close()

sequence = "MEHKEVVLLLLLFLKSGQGEPLDDYVNTQGASLFSVTKKQLGAGSIEECAAKCEEDEEFTCRAFQYHSKEQQCVIMAENRKSSIIIRMRDVVLFEKKVYLSECKTGNGKNYRGTMSKTKN"

# First version
from Bio.ExPASy import ScanProsite
handle = ScanProsite.scan(seq=sequence)

# Second version
import ScanProsite
handle3 = ScanProsite.scan(seq=sequence)
result = ScanProsite.read(handle3)
print(result)

Question 5 From the fourth example, can you manage to run the ScanProsite module? Please modify the given codes to create a program that repeatedly scans domains from the previously given protein sequences in the trySeqChapter3.fasta file.

 

At this stage, you are expected to be familiar with several BioPython modules and different ways of data parsing. Please read more information and try additional examples given in the BioPython documentation because you might need to use these modules later.

 

The above examples demonstrate an important concept of programming which is the designed steps to solve the biological problems or a step-by-step procedure to be executed to get the desired output, called the algorithm. Good algorithms should be unambiguous (clear), explicitly defined inputs and outputs, finiteness (termination after the finite steps of the codes), feasibility (feasible with available resources), and independent (free from any other codes). Simple examples of the computational algorithms are search, sort, insert, update and delete algorithms. The example below demonstrates how to measure the program running time using the timeit() function of the timeit module. 

Question 6 How will you compare the algorithm efficiency? Which parameters can be used?

Example 5 Using timeit module to measure the program running time

# testing timeit()
import timeit
import_module = "import random"
testcode = ''' 
def test(): 
    return random.randint(10, 100)
'''
print(timeit.timeit(stmt=testcode, setup=import_module))

 

********************************