Regular expressions in Python

Asst. Prof. Teerasak E-kobon

1 Feb 2021

Regular expressions (RegEx) are an essential tool for finding patterns in sequences or strings. Python has a built-in module for regular expressions named re. The re module contains useful functions, e.g. findall(), which returns a list of all matches; search(), which returns a Match object for the first match (or None if nothing matches); split(), which returns a list of the strings obtained by splitting at each match; and sub(), which replaces each match with a given string. The metacharacters listed below are combined to design a RegEx pattern.

  • [ ] for a set of characters ex. "[a-m]"
  • \ to signal a special sequence or to escape a special character ex. "\d"
  • . for any character except the newline character ex. "c..t"
  • ^ for starts with ex. "^hello"
  • $ for ends with ex. "lipase$"
  • * for zero or more occurrences of the preceding character ex. "transport*"
  • + for one or more occurrences of the preceding character ex. "transport+"
  • { } for an exact number of occurrences of the preceding character ex. "AT{2}"
  • | for either or ex. "TG|TA"
  • ( ) for capture and group ex. "(TG)+"
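
The metacharacters above can be tried out on a short DNA string. The following is a minimal sketch, not part of the original notes; the sequence is the one used in Example 1, and the variable names and the strings "IL6" and "v1.2" are chosen only for illustration.

```python
import re

dna = "ATTAGGCTGCATGAGGTAATTATGGA"

# [ ] : a set of characters, here any single G or C
gc_bases = re.findall("[GC]", dna)

# .   : any one character, here T, any base, then G
t_any_g = re.findall("T.G", dna)

# ^ and $ : anchor the pattern to the start or the end of the string
starts_with_att = re.search("^ATT", dna) is not None
ends_with_gga = re.search("GGA$", dna) is not None

# * and + apply to the preceding character only (the T in "AT")
a_then_zero_or_more_t = re.findall("AT*", dna)
a_then_one_or_more_t = re.findall("AT+", dna)

# { } : exact count, here A followed by exactly two T
a_then_two_t = re.findall("AT{2}", dna)

# |   : either "TG" or "TA"
tg_or_ta = re.findall("TG|TA", dna)

# ( ) : capture groups, here ATG and the three bases that follow it
codon_pair = re.search("(ATG)([ACGT]{3})", dna).groups()

# \   : "\d" is a special sequence (a digit), "\." is an escaped literal dot
digits = re.findall(r"\d", "IL6")
dots = re.findall(r"\.", "v1.2")

print(len(gc_bases), t_any_g, a_then_one_or_more_t, tg_or_ta, codon_pair)
```

Note that "AT*" also matches a lone A, whereas "AT+" requires at least one T after the A.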

In addition, a special sequence consists of a \ followed by a character. Examples are listed below.

  • \A matches if the specified characters are at the beginning of the string ex. "\Agaca"
  • \b matches where the specified characters are at the beginning or at the end of a word
    (the "r" in front makes the pattern a "raw string", so that \b is not read as a backspace) ex. r"\base"
  • \B matches where the specified characters are present, but NOT at the beginning (or at the end) of a word
  • \d matches any digit (numbers from 0-9)
  • \D matches any character that is NOT a digit
  • \s matches any white space character
  • \S matches any character that is NOT a white space character
  • \w matches any word character (letters from a to z and A to Z, digits from 0-9, and the underscore _ character)
  • \W matches any character that is NOT a word character
  • \Z matches if the specified characters are at the end of the string
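
The special sequences above can be checked in the same way. This is a small illustrative sketch; the sample strings and variable names are made up for the demonstration and do not come from the original notes.

```python
import re

text = "IL6 lipase is 212 aa long"

# \A and \Z anchor to the very beginning and the very end of the string
starts_with_il6 = re.search(r"\AIL6", text) is not None
ends_with_long = re.search(r"long\Z", text) is not None

# \b : word boundary, here words that end with "ase"
words_ending_ase = re.findall(r"\w+ase\b", text)

# \B : NOT a word boundary, here "as" inside a word versus at a word start
as_inside_word = re.findall(r"\Bas", text)
as_at_word_start = re.findall(r"\bas", text)

# \d and \D : digits and non-digits
numbers = re.findall(r"\d+", text)
only_digits = re.sub(r"\D", "", text)

# \s and \S : white space and non-white space
split_on_space = re.split(r"\s+", text)
non_space_runs = re.findall(r"\S+", text)

# \w and \W : word characters (including the underscore) and everything else
word_runs = re.findall(r"\w+", "IL-6_alpha")
non_word_chars = re.findall(r"\W", "IL-6_alpha")

print(words_ending_ase, numbers, split_on_space, word_runs)
```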

 

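Example 1 Using regular expressions in Python

import re

txt = "ATTAGGCTGCATGAGGTAATTATGGA"

# search() returns a Match object for the first match
x = re.search("TAATT", txt)
print(x)
print("The first matched character is located in position:", x.start())

# findall() returns all non-overlapping matches as a list
y = re.findall("TG", txt)
print(y)

# split() cuts the string at every match
z = re.split("TG", txt)
print(z)

# maxsplit=1 limits the result to a single split
q = re.split("TG", txt, maxsplit=1)
print(q)

# sub() replaces every match of "TGA" with "AAA"
p = re.sub("TGA", "AAA", txt)
print(p)

# span() returns a tuple with the start and end positions of the match
s = re.search("TGAG", txt)
print(s)
print(s.span())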
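
Example 1 relies on every pattern being present in the string. As a hedged complement, the sketch below shows one way to guard against a missing match and to list the positions of all matches. It uses re.finditer(), a function of the re module that is not covered in the notes above; the pattern "CCCC" is chosen only because it does not occur in the sequence.

```python
import re

txt = "ATTAGGCTGCATGAGGTAATTATGGA"

# search() returns None when nothing matches, so test before calling start()
m = re.search("CCCC", txt)
if m is None:
    print("Pattern not found")
else:
    print("The first matched character is located in position:", m.start())

# finditer() yields one Match object per non-overlapping match,
# which gives access to the position of every match, not only the first
tg_positions = [match.start() for match in re.finditer("TG", txt)]
print(tg_positions)  # [7, 11, 22]
```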

Question 1 Please give five examples of known patterns in molecular biology or bioinformatics, such as the START and STOP codons and the TATA box recognized by transcription factors.

Question 2 Given the text file named testtext.txt, which contains an abstract from a research article, please read the text from this file and write code to (1) count how many sentences are in this abstract, (2) count how many times the term "COVID" appears, and (3) convert present tense verbs (is, are, ...) to past tense verbs (was, were, ...).

 

Example 2 Using the Biopython module to identify the required patterns

# Biopython's SeqIO module handles sequence input/output
from Bio import SeqIO
def get_cds_feature_with_qualifier_value(seq_record, name, value):
    """Function to look for CDS feature by annotation value in sequence record.
    e.g. You can use this for finding features by locus tag, gene ID, or protein ID.
    """
    # Loop over the features
    for feature in seq_record.features:
        if feature.type == "CDS" and value in feature.qualifiers.get(name, []):
            return feature
    # Could not find it
    return None

genome_record = SeqIO.read("C:\\Users\\TeerasakArt\\Downloads\\exampleGenbank.gb", "genbank")
cds_feature = get_cds_feature_with_qualifier_value(genome_record, "gene", "IL6")
print(cds_feature)

 

Question 3 Using the GenBank file from the second example, please write code to read this file and create a summary table that shows the number of interleukin 6 (IL-6) isoforms and the differences between them, such as exons, amino acid sequences, and lengths of the translated proteins. The code in the second example can be modified or extended.

 

Question 4 The GenBank file in the second example contains multiple protein sequences for different isoforms of IL6. Please write code to identify the shared sequence regions (common amino acid sequences) between these isoforms.

Python also provides other built-in operators and functions for string handling that can help with the exercises in this chapter. The string operators + and * join and repeat strings. The in and not in operators check membership. The functions chr(), ord(), len(), and str() convert an integer to a character, convert a character to an integer, return the length of a string, and convert an object to a string, respectively. Square brackets [ ] are used for indexing (ex. [2]) and slicing (ex. [2:7], [:5], [6:], [1:2:2]) of a string. Other useful string methods are listed below.

  • capitalize() for converting the first character to uppercase and the rest to lowercase
  • lower() for lowercase conversion
  • swapcase() for case swapping
  • upper() for uppercase conversion
  • count() for counting occurrences of the specified substring
  • endswith() to determine whether the target string ends with the specified pattern
  • find() to find the specified substring and return the lowest index (or -1 if it is not found)
  • rfind() to find the specified substring and return the highest index (or -1 if it is not found)
  • startswith() to determine whether the target string starts with the specified pattern
  • isalnum() to check if all characters in the string are alphanumeric
  • isalpha() to check if all characters in the string are alphabetic
  • isdigit() to check if all characters in the string are digits
  • islower() to check if all cased characters in the string are lowercase
  • isspace() to check if all characters in the string are whitespace
  • isupper() to check if all cased characters in the string are uppercase
  • center() for centering the string within a given width
  • ljust() for left justification within a given width
  • lstrip() for trimming leading characters (whitespace by default)
  • replace() for substring replacement
  • rjust() for right justification within a given width
  • rstrip() for trimming trailing characters (whitespace by default)
  • splitlines() for breaking the string at line boundaries (\n for newline, \r for carriage return, \r\n for carriage return and line feed, \v for line tabulation, and \f for form feed)
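
The operators, conversion functions, slicing forms, and a selection of the methods listed above can be tried on a short sequence string. This is a minimal sketch for illustration; the string "atgGCCtaa" and the variable names are invented for the example.

```python
seq = "atgGCCtaa"

# Operators: + joins, * repeats, in / not in test membership
joined = seq + "NNN"
repeated = "AT" * 3
has_gcc = "GCC" in seq
lacks_x = "X" not in seq

# Conversion and length functions
code_of_a = ord("A")      # character to integer
char_of_65 = chr(65)      # integer to character
length = len(seq)
length_as_text = str(length)

# Indexing and slicing (indices start at 0, the end index is excluded)
third_char = seq[2]
middle = seq[2:7]
first_five = seq[:5]
from_sixth = seq[6:]
stepped = seq[1:2:2]

# Case conversion
upper = seq.upper()
swapped = seq.swapcase()
capitalized = seq.capitalize()

# Searching and counting
count_a = upper.count("A")
is_orf_like = upper.startswith("ATG") and upper.endswith("TAA")
first_a = upper.find("A")
last_a = upper.rfind("A")
missing = upper.find("X")     # -1 means "not found"

# Character checks, justification, trimming, replacing, line splitting
all_letters = seq.isalpha()
all_digits = "212".isdigit()
centered = "ATG".center(7, "-")
trimmed = "  ATG \n".strip()
rna = upper.replace("T", "U")
lines = ">seq1\nATGGCC\r\nTAA".splitlines()

print(middle, upper, count_a, rna, lines)
```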