Regular expression in Python

Asst. Prof. Teerasak E-kobon

1 Feb 2021

Regular expressions (RegEx) are an essential, basic technique for finding patterns in sequences or strings. Python has a built-in module for RegEx analysis named re. The re module contains useful functions, e.g. findall(), which returns a list of all matches; search(), which returns a Match object for the first match; split(), which returns a list of the strings obtained by splitting at each match; and sub(), which replaces each match with a given string. The following metacharacters are used in combination to design a RegEx pattern.

[ ] for a set of characters ex. “[a-m]”
\ to escape special characters or to signal a special sequence ex. “\d”
. for any character except the newline character ex. “c..t”
^ for starts with ex. “^hello”
$ for ends with ex. “lipase$”
* for zero or more occurrences ex. “transport*”
+ for one or more occurrences ex. “transport+”
{ } for exact specification of the occurrence number ex. “AT{2}”
| for either or ex. “TG|TA”
( ) for capture and group
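
The metacharacters listed above can be tried out on a short sequence. The following is a minimal sketch; the DNA string and the phrase "exon 12" are illustrative inputs chosen for this example only.

```python
import re

# Illustrative inputs (assumptions made for this sketch only)
dna = "ATTAGGCTGCATGAGGTAATTATGGA"

set_hits = re.findall("[AT]G", dna)        # [ ]  A or T, followed by G
digit_hits = re.findall(r"\d", "exon 12")  # \    escape that signals a digit
any_hits = re.findall("G.T", dna)          # .    any one character between G and T
starts = bool(re.search("^ATT", dna))      # ^    the string starts with ATT
ends = bool(re.search("GGA$", dna))        # $    the string ends with GGA
plus_hits = re.findall("AT+", dna)         # +    A followed by one or more T
exact_hits = re.findall("AT{2}", dna)      # { }  A followed by exactly two T
either_hits = re.findall("TG|TA", dna)     # |    either TG or TA
groups = re.search("(TAA)(TT)", dna).groups()  # ( )  two captured groups

print(set_hits, digit_hits, any_hits)
print(starts, ends)
print(plus_hits, exact_hits, either_hits, groups)
```

Note that findall() reports non-overlapping matches only, so the counts may be lower than the number of positions at which a pattern could start.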

In addition, a special sequence consists of a backslash (\) followed by a character. Examples are listed below.

\A for matching if the specified characters are at the beginning of the string ex. “\Agaca”
\b for matching where the specified characters are at the beginning or at the end of a word (the “r” in front of the pattern makes sure that the string is treated as a “raw string”) ex. r"\bbase" or r"base\b"
\B for matching where the specified characters are present, but NOT at the beginning (or at the end) of a word
\d for matching where the string contains digits (numbers from 0-9)
\D for matching where the string DOES NOT contain digits
\s for matching where the string contains a white space character
\S for matching where the string DOES NOT contain a white space character
\w for matching where the string contains any word characters (characters from a to z and A to Z, digits from 0-9, and the underscore _ character)
\W for matching where the string DOES NOT contain any word characters
\Z for matching if the specified characters are at the end of the string
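
The special sequences can be checked in the same way. This is a minimal sketch on a made-up annotation string (an assumption for illustration, not data from the course files). Raw strings (r"...") are used so that Python passes the backslash to the re module unchanged.

```python
import re

# Made-up annotation text, used only to illustrate the special sequences
text = "gene IL6 has 5 exons"

at_start = re.findall(r"\Agene", text)    # \A  "gene" at the beginning of the string
word_start = re.findall(r"\bex", text)    # \b  "ex" at the beginning of a word
inside_word = re.findall(r"\Bon", text)   # \B  "on" NOT at the beginning of a word
digits = re.findall(r"\d", text)          # \d  every digit
only_digits = re.sub(r"\D", "", text)     # \D  remove every non-digit character
spaces = re.findall(r"\s", text)          # \s  every white space character
chunks = re.findall(r"\S+", text)         # \S  runs of non-white-space characters
words = re.findall(r"\w+", text)          # \w  runs of word characters
non_words = re.findall(r"\W", text)       # \W  every non-word character
at_end = re.findall(r"exons\Z", text)     # \Z  "exons" at the end of the string

print(at_start, word_start, inside_word)
print(digits, only_digits, len(spaces))
print(chunks, words, non_words, at_end)
```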

Example 1 Using regular expression in Python

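import re

txt = "ATTAGGCTGCATGAGGTAATTATGGA"

# search() returns a Match object for the first match (or None if there is no match)
x = re.search("TAATT", txt)
print(x)
print("The first matched character is located in position:", x.start())

# findall() returns a list of all non-overlapping matches
y = re.findall("TG", txt)
print(y)

# split() cuts the string at every match; maxsplit limits the number of splits
z = re.split("TG", txt)
print(z)
q = re.split("TG", txt, maxsplit=1)
print(q)

# sub() replaces every match with the given string
p = re.sub("TGA", "AAA", txt)
print(p)

# span() returns a tuple with the start and end positions of the match
s = re.search("TGAG", txt)
print(s)
print(s.span())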
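
Because search() returns None when the pattern is not found, calling start() or span() directly on its result raises an error for an absent pattern. The sketch below shows one way to guard against this. The helper name report_first_match is hypothetical and introduced only for this illustration.

```python
import re

txt = "ATTAGGCTGCATGAGGTAATTATGGA"


def report_first_match(pattern, sequence):
    """Return (start, end) of the first match, or None when the pattern is absent.

    Hypothetical helper written for this illustration only.
    """
    match = re.search(pattern, sequence)
    if match is None:
        return None
    return match.span()


found = report_first_match("TAATT", txt)    # pattern present in txt
missing = report_first_match("CCCC", txt)   # pattern absent from txt
print(found)
print(missing)
```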

Question 1 Please give five examples of known patterns in molecular biology or bioinformatics, such as the START and STOP codons and the TATA box recognized by transcription factors.

Question 2 Given the text file named testtext.txt, which contains an abstract from a research article, please read the text data from this file and write code to (1) count how many sentences are in the abstract, (2) count how many times the term “COVID” appears, and (3) convert present tense verbs (is, are, …) to past tense verbs (was, were, …).

Example 2 Using the Biopython module to identify the required patterns

# Biopython's SeqIO module handles sequence input/output
from Bio import SeqIO


def get_cds_feature_with_qualifier_value(seq_record, name, value):
    """Function to look for CDS feature by annotation value in sequence record.

    e.g. You can use this for finding features by locus tag, gene ID, or protein ID.
    """
    # Loop over the features of the record passed in as seq_record
    for feature in seq_record.features:
        if feature.type == "CDS" and value in feature.qualifiers.get(name, []):
            return feature
    # Could not find it
    return None


# The raw string (r"...") keeps the backslashes in the Windows path from being read as escape sequences
genome_record = SeqIO.read(r"C:\Users\TeerasakArt\Downloads\exampleGenbank.gb", "genbank")
cds_feature = get_cds_feature_with_qualifier_value(genome_record, "gene", "IL6")
print(cds_feature)

Question 3 From the GenBank file given in the second example, please write code to read the file and create a summary table that shows the number of interleukin 6 (IL-6) isoforms and the differences between them, such as exons, amino acid sequences, and lengths of the translated proteins. The code in the second example can be modified or extended.

Question 4 The GenBank file in the second example contains multiple protein sequences of different IL6 isoforms. Please write code to identify the shared sequence regions (common amino acid sequences) between these isoforms.

Python also has other useful built-in functions for string handling that could be helpful for writing the code in this chapter. The string operators + and * join and repeat strings. The in and not in operators check membership. The functions chr(), ord(), len(), and str() convert an integer to a character, convert a character to an integer, return the length of a string, and convert an object to a string, respectively. Square brackets [ ] are used for indexing (ex. [2]) and slicing (ex. [2:7], [:5], [6:], [1:2:2]) of a string. Other useful string methods are listed below.
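
The operators and built-in functions described above can be sketched on a short sequence. The string "ATGGCC" is an illustrative input chosen for this example.

```python
# Illustrative sequence (an assumption for this sketch)
seq = "ATGGCC"

joined = seq + "TAA"          # + joins two strings
repeated = "AT" * 3           # * repeats a string
has_start = "ATG" in joined   # in checks membership
no_n = "N" not in joined      # not in checks absence

letter = chr(65)              # integer to character
code = ord("A")               # character to integer
length = len(joined)          # number of characters
as_text = str(length)         # convert a value to a string

third = joined[2]             # indexing starts at 0
middle = joined[2:7]          # slice from index 2 up to (not including) 7
head = joined[:5]             # the first five characters
tail = joined[6:]             # from index 6 to the end
stepped = joined[1:2:2]       # start 1, stop 2, step 2

print(joined, repeated, has_start, no_n)
print(letter, code, length, as_text)
print(third, middle, head, tail, stepped)
```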

capitalize() for converting the first character to uppercase and the rest to lowercase
lower() for lowercase conversion
swapcase() for case swapping
upper() for uppercase conversion
count() for counting the occurrences of the specified substring
endswith() for determining if the target string ends with the specified pattern
find() to find the specified substring and return the lowest index (-1 if not found)
rfind() to find the specified substring and return the highest index (-1 if not found)
startswith() for determining if the target string starts with the specified pattern
isalnum() to check if the string consists only of alphanumeric characters
isalpha() to check if the string consists only of alphabetic characters
isdigit() to check if the string consists only of digit characters
islower() to check if all cased characters in the string are lowercase
isspace() to check if the string consists only of whitespace characters
isupper() to check if all cased characters in the string are uppercase
center() for centering the string within a given width
ljust() for left justification within a given width
lstrip() for trimming the leading characters (whitespace by default)
replace() for substring replacement
rjust() for right justification within a given width
rstrip() for trimming the trailing characters (whitespace by default)
splitlines() for string breaking at the line boundaries (\n for newline, \r for carriage return, \r\n for carriage return and line feed, \v for line tabulation, and \f for form feed)
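
A few of these string methods are sketched below. The label, the sequence, and the FASTA-like text are made-up inputs used only for illustration.

```python
# Made-up inputs (assumptions for this sketch)
label = "  il6 Gene  "
seq = "ATGATTAGATGA"

trimmed = label.lstrip().rstrip()      # remove leading, then trailing whitespace
capitalized = trimmed.capitalize()     # first character upper, the rest lower
swapped = trimmed.swapcase()           # swap the case of every letter

n_atg = seq.count("ATG")               # number of non-overlapping occurrences
first_atg = seq.find("ATG")            # lowest index of the substring
last_atg = seq.rfind("ATG")            # highest index of the substring
starts_ok = seq.startswith("ATG")      # True if the string starts with ATG
ends_ok = seq.endswith("TGA")          # True if the string ends with TGA

checks = ("IL6".isalnum(), "IL".isalpha(), "6".isdigit(),
          "atg".islower(), "ATG".isupper(), " \t".isspace())

centered = "IL6".center(7, "-")        # pad both sides to width 7
left = "IL6".ljust(5, ".")             # pad on the right to width 5
right = "IL6".rjust(5, ".")            # pad on the left to width 5
rna = seq.replace("T", "U")            # replace every T with U
lines = ">id1\nATG\r\nTAA".splitlines()  # break at \n and \r\n boundaries

print(trimmed, capitalized, swapped)
print(n_atg, first_atg, last_atg, starts_ok, ends_ok, checks)
print(centered, left, right, rna, lines)
```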
