Module 2, Practical 4

In this practical we will learn how to deal with regular expressions.

Regular expressions

Regular expressions, or regex, are a powerful language for text pattern matching. They are extremely useful for finding complex patterns of characters in order to filter, replace, and validate user input. Regexes are employed in search engines, in the find-and-replace functions of word processors, and in text processing utilities such as awk or grep.

e.g. How do I check that a string contains only alphabetic characters? How do I check that an email address is properly formatted (i.e. user@domain.com)? How do I find all strings containing years from 2010 to 2017?

Python provides a module, re, to write and match regular expressions. But how do I write a regex?

Basic Patterns

(from https://developers.google.com/edu/python/regular-expressions)

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

a, X, 9, < – ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( )

. (a period) – matches any single character except newline ‘\n’

+ means at least one instance of the preceding character

* means zero or more instances of the preceding character

\w – (lowercase w) matches a “word” character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although “word” is the mnemonic for this, it only matches a single word char, not a whole word.

\W (upper case W) matches any non-word character.

\b – boundary between word and non-word

\s – (lowercase s) matches a single whitespace character – space, newline, return, tab, form feed [ \n\r\t\f].

\S (upper case S) matches any non-whitespace character.

\t, \n, \r – tab, newline, return

\d – decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)

^ = start, $ = end – match the start or end of the string

\ – inhibit the “specialness” of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a character has special meaning, such as ‘@’, you can put a backslash in front of it, \@, to make sure it is treated just as a character.

{m,n} – causes the resulting regex to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible.

[] – used to indicate a set of characters, e.g. the digits from 0 to 9 = [0-9]
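
As a quick illustration of a few of the patterns listed above, the sketch below combines a character set, word boundaries, a repetition count and an escaped period. The sample text and variable names are made up for this example; re.findall, used here to collect the matches, is covered in more detail further down.

```python
import re

# Hypothetical sample text, used only for this illustration
text = "Reports: 2009, 2010, 2014, 2017, 2018 and version 3.14"

# \b201[0-7]\b : the literal digits 201, then one digit from the set 0-7,
# with word boundaries on both sides so that longer numbers are excluded
years = re.findall(r"\b201[0-7]\b", text)
print(years)        # ['2010', '2014', '2017']

# \d{4} : exactly four decimal digits in a row
four_digits = re.findall(r"\d{4}", text)
print(four_digits)  # ['2009', '2010', '2014', '2017', '2018']

# \. matches a literal period, whereas an unescaped . matches any character
decimals = re.findall(r"\d\.\d+", text)
print(decimals)     # ['3.14']

# {m,n} is greedy: with 2 to 3 repetitions allowed, it takes 3 when it can
greedy = re.search(r"a{2,3}", "aaaa").group(0)
print(greedy)       # aaa
```

Writing the patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before the re module sees them.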

Let’s see an example. Write a regex that matches an a followed by zero or more b’s:

[6]:
import re

pattern = "ab*"

print(re.search(pattern, "ac"))
print(re.search(pattern, "abc"))
print(re.search(pattern, "abbc"))

print(re.search(pattern, "cdfegh")) # this will answer None...

# the match object also stores the start and end positions of the match
m = re.search(pattern, "abbc")
print(m.start())
print(m.end())
<re.Match object; span=(0, 1), match='a'>
<re.Match object; span=(0, 2), match='ab'>
<re.Match object; span=(0, 3), match='abb'>
None
0
3

Match objects always have a boolean value of True. If no match is found, search returns None, allowing you to test for the presence of matches like this:

[ ]:
import re

pattern = "ab*"

def doMatch(pattern, searchString):
    if re.search(pattern, searchString):
        print("Match found!")
    else:
        print("No match of pattern", pattern, "in", searchString)

doMatch(pattern, "defabbbc") # this will print "Match found!"
doMatch(pattern, "defkklic") # this will print "No match of pattern ab* in defkklic"

Can we then extract information about all matches?

[8]:
# what happens here ?

import re

pattern = "ab*"

myMatches = re.search(pattern, "defabbckipabpoaoccdabbbb")
print(myMatches)
for match in myMatches.group():
    print(match)
<re.Match object; span=(3, 6), match='abb'>
a
b
b
[9]:
# use findall instead of search

import re

pattern = "ab*"

myMatches = re.findall(pattern, "defabbckipabpoaoccdabbbb")
print(myMatches)
for match in myMatches:
    print(match)

['abb', 'ab', 'a', 'abbbb']
abb
ab
a
abbbb
[10]:
# or even better, use finditer

import re

pattern = "ab*"

myMatches = re.finditer(pattern, "defabbckipabpoaoccdabbbb")
for match in myMatches:
    print("Match <{}> at positions: {}-{}".format(match.group(0), match.start(), match.end()))

Match <abb> at positions: 3-6
Match <ab> at positions: 10-12
Match <a> at positions: 14-15
Match <abbbb> at positions: 19-24

Group extraction

The group feature allows you to extract the different parts of a matched substring.
Suppose we want to decompose an email address into username and domain. We can create groups by adding parentheses, ( and ), around the parts of the regex that match the username and the domain, and then extract the groups from the match.
[11]:
import re
pattern = r"([\w.-]+)@([\w.-]+)" # pattern matching (word characters, dots, hyphens)@(word characters, dots, hyphens)

m = re.match(pattern, "john.doe@nih.gov")
print(m.group(0))
print(m.group(1))
print(m.group(2))

john.doe@nih.gov
john.doe
nih.gov
[12]:
import re
# groups can also be named for ease of extraction
pattern = r"(?P<username>[\w.-]+)@(?P<domain>[\w.-]+)"

m = re.match(pattern, "john.doe@nih.gov")
print(m.group(0))
print(m.group("username"))
print(m.group("domain"))
john.doe@nih.gov
john.doe
nih.gov
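
The examples above use re.match, which only tries to match at the beginning of the string. The following sketch, with a made-up input string, shows how that differs from re.search, and how the groups() and groupdict() methods of a match object return all groups at once.

```python
import re

pattern = r"(?P<username>[\w.-]+)@(?P<domain>[\w.-]+)"
text = "Contact: john.doe@nih.gov"  # hypothetical input with a leading label

# re.match only tries at the start of the string, so it fails here
at_start = re.match(pattern, text)
print(at_start)       # None

# re.search scans the string and finds the address wherever it starts
m = re.search(pattern, text)
print(m.groups())     # ('john.doe', 'nih.gov')
print(m.groupdict())  # {'username': 'john.doe', 'domain': 'nih.gov'}
```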

Exercises

  1. Write a regex to check that a string is alphanumeric, i.e. it contains only characters from the sets a-z, A-Z and 0-9.

  2. Write a regex that matches a word containing the letter z.

re.compile

Searching can also be performed by compiling the regex with re.compile(pattern) and then calling search on the returned object, of type re.Pattern, without having to specify the pattern every time search is called.

[29]:
import re

myRegex = re.compile(r'do.+')
myRegex.search("this animal is a donkey")

[29]:
<re.Match object; span=(17, 23), match='donkey'>
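
A compiled pattern is mostly convenient when the same regex is applied to many strings. The sketch below is one possible way to do this; the pattern, the list of sentences and the variable names are chosen only for illustration.

```python
import re

# Compile once, then reuse the resulting re.Pattern object
word_with_do = re.compile(r"do\w+")

# Hypothetical inputs
sentences = ["this animal is a donkey", "dolphins and dogs", "no match here"]

# The compiled object offers the same methods as the module (search, findall, ...)
results = {s: word_with_do.findall(s) for s in sentences}
for sentence, words in results.items():
    print(sentence, "->", words)

# this animal is a donkey -> ['donkey']
# dolphins and dogs -> ['dolphins', 'dogs']
# no match here -> []
```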

Substitution

The re.sub function allows you to search for matches to a given regex and replace them with something else in your input string. The replacement string can include ‘\1’, ‘\2’, …, ‘\X’ which refer to the text from group(1), group(2), …, group(X) from the original matching text.

[34]:
import re

pattern = r"([\w.-]+)@([\w.-]+)"
print(re.sub(pattern, r"\1@unitn.it", "john.doe@nih.gov"))
print(re.sub(pattern, r"luke.skywalker@\2", "john.doe@nih.gov"))
john.doe@unitn.it
luke.skywalker@nih.gov
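
Two further variations of re.sub may be useful: named groups can be referenced in the replacement string with \g<name>, and the replacement can be a function that receives each match object and returns the new text. The following is a minimal sketch; the helper mask_username and the sample strings are made up for this example.

```python
import re

# Named groups are referenced in the replacement as \g<name>
name_pattern = r"(?P<last>\w+), (?P<first>\w+)"
swapped = re.sub(name_pattern, r"\g<first> \g<last>", "Skywalker, Luke")
print(swapped)  # Luke Skywalker

# The replacement can also be a function called once for every match
def mask_username(match):
    # hypothetical helper: hide the username, keep the domain
    return "*" * len(match.group(1)) + "@" + match.group(2)

masked = re.sub(r"([\w.-]+)@([\w.-]+)", mask_username, "john.doe@nih.gov")
print(masked)   # ********@nih.gov
```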

Exercise

Write a regex to remove all whitespace from the input string.

Exercises

  1. Write a regex to convert a date from yyyy-mm-dd format to dd-mm-yyyy format.

  2. Use a regex to find all words starting with ‘a’ or ‘e’ in a given string.

  3. Write a regex to insert spaces between words starting with capital letters (e.g. “CamelCase” should become “Camel Case”).
