The Hogwarts Regex challenge + explanation
A friendly coding challenge + explanation to get you started with Regular Expressions
Welcome challengee - it is time to perform magic with Python. In this challenge you will be working with Regex (short for Regular Expressions), a language with which you can match and retrieve strings of text or numbers. This allows you to turn messy real world data (e.g. from websites, emails, or other forms of communication) into a neat data structure (e.g. a Pandas Dataframe).
The solutions to the challenges below can be found here
The challange explained
To accomplish this challenge, you can use the instructions on this page as well as the following websites:
It is advised that you start with the instructions provide here, and then you can do some extra reading on these websites. Please do not use any other sources, unless it is specifically stated that you can.
7 Regex examples
Example 1. Matching digits and using quantifiers
So what is Regex used for? Imaging that you like to extract all ‘years’ from the string s
below. This can be achieved with the Regex pattern \d{4}
which will match on all four digit numbers. Whilst \d
matches on all digits, {4}
quantifies ‘4 instances of the foregoing character.’ In this particular example, you could also have used \d+
, meaning ‘one or more digits.’
Run the code cell below to see how it works. The Regex library comes pre-installed in Anaconda, so you just need to import it.
import re
s = 'The Philosopher\'s Stone is published in 1997, but the writing started in 1990.'
m = re.findall(r'\d{4}',s)
m
Some background: use re.findall()
when you want to retrieve multiple strings that match a pattern. This basic Regex function is used in most of the challenges below. It returns matches as a list, or a list of tuples, as we will see in example 3 in this Notebook, where the use of groups will be discussed. The Python Documentation on re.findall()
:
Return all non-overlapping matches of pattern in string, as a list of strings or tuples. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result.
The result depends on the number of capturing groups in the pattern. If there are no groups, return a list of strings matching the whole pattern. If there is exactly one group, return a list of strings matching that group. If multiple groups are present, return a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.
Example 2. Matching word characters and whitespace
Word characters are matched with \w
. Sometimes you also need to match whitespace, which can be done with the Regex character \s
. In the example below, you only match two word characters that are followed by a whitespace, so only ‘aa ‘ rather than ‘bb’.
s = 'aa bb'
m = re.search(r'\w{2}\s', s)
m
Some background: use re.search()
when you want to retrieve one string that matches the pattern. It returns a match object, which in the example above consists of only 1 group (i.e. group(0)). In the next example, you will see how you can work with matches consisting of multiple groups. The Python Documentation on re.search()
:
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
Example 3. Match anything in between two strings and divide matches into groups
As a magician practicioner, you may be interested in any text or numbers in between two specific strings. In the example below, we like to retrieve anything that comes in between the letters ‘b’ and ‘d.’ We also specify that these characters should be surrounded by whitespace (so \sb\s
and \sd\s
), otherwise the pattern could also match on any ‘b’ or ‘d,’ including those that are part of larger words.
The pattern .*?
matches on anything. It is placed between parentheses ‘()’ to form a delimited group, i.e. a subset of your full match from which you can seperately retrieve information, as will be explained in a moment.
First run the code.
s = 'a b anything, including a c d e f'
m = re.search(r'\sb\s(.*?)\sd\s', s).group(1)
m
It is important to note that the pattern \sb\s(.*?)\sd\s
will match anything that comes in between ‘b’ and ‘d’ (as long a the ‘b’ and ‘d’ are surrounded by whitespace).
At this point it is also useful to understand how to use groups.
.group(0)
always gives the full match associated with a Regex pattern. In the example above .group(0)
returns:
b anything, including a c d
However, in this example, we are interested in .group(1)
, everything in between two strings, that is, the match between the parentheses:
anything, including a c
If we would add more parentheses, as we will see in the next example, then we would create more groups, which are numbered in order of appearance.
Example 4. Positive lookbehind
In the following example, we will take a look at another useful Regex trick. Imaging a situation where you are just interested in the year in which the writing of a certain book started. In such cases you can use a ‘positive lookbehind’ to start matching after particular (Regex) characters.
We can divide the pattern below into three parts. Each part is put between brackets to compile separate groups from which we can retrieve information. (?<=writing)
looks behind the word ‘writing.’ (.*?)
matches on anything that comes after that. And (\d{4})
matches on 4 digit numbers that come after the word writing. In other words, this string searching algorithm starts becoming interested as soon as it sees ‘writing,’ then it processes anything, until it bumps into a 4 digit number.
Now see it in action.
s = 'The Philosopher\'s Stone is published in 1997, but the writing started in 1990.'
m = re.search(r'(?<=writing)(.*?)(\d{4})', s)
m.group(2).strip()
Some background: observe that there are 4 groups in total (group 0 to 3). Also note that that you can also match ‘ordinary’ characters, such as the letters that form the word ‘writing.’ You are allowed to use Regex and ordinary characters side-by-side.
Example 5. Positive lookahead
Similarly to the positive lookbehind, Regex also offer a ‘positive lookahead,’ which can yield matches before a particular character. Use (?=...)
, where ‘…’ should be replaced by a character of your choice. In the example below, the Regex pattern will match on any single word character which is followed by a space and the letter ‘c.’
s = 'a b c d'
m = re.search(r'\w{1}(?=\sc)', s) # note that you need to add a space character either before the c or after the \w{1}
m
Example 6. Optional patterns
Regex also makes it possible to use optional patterns, you can match those, but they shouldn’t necessarily occur. A question mark (?
) makes the preceding pattern/group optional. For instance, Jan(uary)?
allows you to match two different notations of ‘January’. In the example below, you will see that the Regex pattern Harry(\sPotter)?
will match both on ‘Harry’ and ‘Harry Potter.’
sentences = ['This story is about Harry', 'This story is about Harry Potter', 'This story is about Hagrid']
for s in sentences:
m = re.search(r'Harry(\sPotter)?', s)
if m:
print(m)
else:
print(f'{s}, which is not about Harry Potter.')
Note that you won’t need such a For Loop in the challenges below.
Example 7. Word boundaries
Finally, Regex is all about finding the right pattern for the right match(es). You don’t want to make the pattern too ‘greedy’ (so it gives your more than you want), and you also don’t want to make it too ‘strict’ (so it gives you less than you want).
For instance, imagine that you like to find all three letter words (The, but, the) in the string below. \w{3}
will not give you what you like. Just try it out. You need to demarcate the pattern with so-called ‘word boundaries,’ using \b\w{3}\b
. This will make the pattern less greedy.
s = 'The Philosopher\'s Stone is published in 1997, but the writing started in 1990.'
m = re.findall(r'\b\w{3}\b',s)
m
Here is a re-cap of the Regex characters and patterns that we discussed in the examples above:
\d
matches digits{...}
quantifies particular number+
quantifies one or more\w
matches word characters\s
matches whitespace.*?
matches on anything(?<=...)
positive lookbehind(?=...)
positive lookahead(...)?
optional pattern\b
word boundaries
The 6 Hogwarts Regex challenges
Challenge 1.
Is it likely? No. But, imaging a wave of modernization at Hogwarts, in which Professors added email to their stock pile of communication methods. As a data wizard, you will need to extract all email addresses from the existing documentation to make a clean email list. Find all email addresses in the string below.
To help you in the right direction: for this challenge you can use re.findall()
similarly to how it is used in Example 1.
# Answer to Challenge 1.
s = 'Please submit your assignments to the following email addresses. \nAstronomy: sinistra@hogwarts.edu \nDefence Against the Dark Arts: lupin@hogwarts.edu \nPotions: snape@hogwarts.edu'
Challenge 2.
Snape, being Snape, expressed a preference for Gmail. Write a pattern that is flexible enough to process such anomalies.
# Answer to Challenge 2.
s = 'Please submit your assignments to the following email addresses. \nAstronomy: sinistra@hogwarts.edu \nDefence Against the Dark Arts: lupin@hogwarts.edu \nPotions: snape@gmail.com'
Challenge 3.
It’s not just email which is the enemy of owl post, professors may turn to telephones, too! Identify all telephone numbers within the following string. Also note that you have to match the whitespaces in between the numbers.
# Answer to Challenge 3.
s = 'In case of emergency, please do call your professor. Reach out to Professor Sinistra at 010 4529 6017, Professor Lupin at 010 5529 9036, or Professor Snape at 010 8865 9046'
Challenge 4.
This challenge has two parts.
First, re.findall
returns a list, and, if you search for multiple groups, a list of tuples. Use re.findall
and a Regex pattern that returns 3 groups: (1) ‘Professor Some_Family_Name’, (2) ‘ at ‘, and (3) ‘a telephone number’. Hence your output should look like this:
[(‘Professor Sinistra’, ‘ at ‘, ‘010 4529 6017’),
(‘Professor Lupin’, ‘ at ‘, ‘010 5529 9036’),
(‘Professor Snape’, ‘ at ‘, ‘010 8865 9046’)]
Note that you also have to match the whitespace before and after ‘ at ‘
Second, having compiled your list of tuples, your next goal is to turn it into a Pandas Dataframe and to make the ‘at’ dissapear. The end result should looks as follows (you will need to rename the column names yourself):
name professor | telephone number | |
---|---|---|
0 | Professor Sinistra | 010 4529 6017 |
1 | Professor Lupin | 010 5529 9036 |
2 | Professor Snape | 010 8865 9046 |
There are various ways to do this. Teach yourself a method of choice. For this part you may use other websites than the ones listed above.
# Answer to Challenge 4 (part 1).
s = 'In case of emergency, please do call your professor. Reach out to Professor Sinistra at 010 4529 6017, Professor Lupin at 010 5529 9036, or Professor Snape at 010 8865 9046'
# Answer to Challenge 4 (part 2).
Challenge 5.
This challenge again has two parts.
First, you create a list of tuples, in which you capture both the subjects and grades in Harry’s grade overview. For this you will need a positive lookahead and some optional items.
Second, this list of tuples should then also be turned into a Dataframe (similarly as for Challenge 4). It is allowed to use other websites for this challenge.
# Answer to Challenge 5 (part 1).
s = 'History of Magic: A; Muggle Studies: A; Potions O; Transfiguration: E; Arithmancy: A; Divination: O;'
# Answer to Challenge 5 (part 2).
Challenge 6.
Let’s have a look at one more Regex example before going to the final challenge. re.sub() is an often used operation to clean and organize (textual) data. It works with a pattern that matches something, which is then replaced by something else. For instance, here we replace all underscores for spaces.
s = 'Hogwarts_School_of_Witchcraft_and_Wizardry'
m = re.sub(r'_', ' ', s)
m
Things can be made more interesting by adding a function that can deal with different scenarios. Built a function called grader which transforms the single letter grades into grades that are fully written out. If you do not know them by heart, then you can find the meaning of the different grades at Hogwarts here. For this part you may again use other websites than the ones listed above.
# Answer to Challenge 6.
s = 'History of Magic: A; Muggle Studies: A; Potions O; Transfiguration: E; Arithmancy: A; Divination: O'
Acknowledgements
Big thanks to Nanne van Noord for providing feedback to earlier draft versions of this coding challenge.
Sources
Photos by Finn and Adam Chang.