Beginner's Guide to Regular Expressions (Regex) in Python

Learning Objectives

The aim of this lesson is to introduce students to regular expressions. By the end of this article, students should be able to:

  • Understand what Regular Expressions (Regex) are.

  • Understand how Regex can be used in Python.

  • Write accurate Regex patterns using the syntax table in this guide.

  • Understand how to use the findall() and sub() methods.

  • Use findall() with Regex to return all instances of a target.

  • Use sub() with Regex to replace values in a string.

Introduction

Imagine you have a large dataset of customer reviews containing the good, the bad, and the amazing, and you need to extract the contact details of every customer who left a positive review so you can follow up with a thank-you email. Or you have a dataset of log files from a web server, and you need to identify all the requests that resulted in errors.

In both scenarios, you can either sort through the data manually, review by review or log file by log file, which would be terribly time-consuming, headache-inducing, and irritatingly prone to errors, or you can make ample use of regular expressions!

(In my opinion, the first option should most definitely not be on the table for a data scientist.)

The question now is: what are “regular expressions”?

Regular Expressions

Regular expressions, also known as Regex, are sequences of characters that define a search pattern. They can be used to search, edit, or manipulate text and data, making them a valuable tool for data cleaning, text processing, and more. Regex is capable of a wide variety of tasks, such as validating user input, extracting data from text, finding and replacing text, and cleaning and normalizing data.

Some advantages of using Regex include:

  • Efficiency in searching and matching text data. This makes it a good choice for tasks that need to be performed on large datasets, especially when they are unstructured.

  • Defining complex search patterns that would be difficult or impossible to express in other ways.

  • Supported by most programming languages and text-processing tools.

Now that we have a basic understanding of Regex, let's see how to use it in Python.

How to use Regular Expressions in Python

Python has a built-in module called re that provides the functions and methods needed to work with regular expressions. To use regular expressions in Python, you first need to import the re module:

import re

The re module provides a number of functions. Let’s start with the search function.

Searching for Text

re.search(): This function scans a string for the first location where the pattern matches and returns a regular expression "match" object that contains information about where the match was found. If nothing matches, it returns None. For example:

import re

text = "This course will introduce the basics of data science"
match = re.search(r"data science", text)
print(match.start())

In this example, the re.search(r"data science", text) call searches the text for the string “data science” and returns a “match” object describing where the match was found. Calling match.start() then prints 41, the index at which the match begins.
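One detail worth keeping in mind: when nothing matches, re.search() returns None rather than a match object, so calling match.start() directly would raise an error. Below is a minimal sketch of a safer version. The helper name find_position is made up for this illustration and is not part of the re module.

```python
import re

def find_position(pattern, text):
    """Return (start, end, matched_text) for the first match, or None."""
    match = re.search(pattern, text)
    if match is None:  # re.search() found nothing
        return None
    return match.start(), match.end(), match.group()

text = "This course will introduce the basics of data science"

print(find_position(r"data science", text))  # (41, 53, 'data science')
print(find_position(r"statistics", text))    # None
```

Checking for None before touching the match object is a common habit when the input text is not under your control.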

Writing Regular Expressions

To write accurate regular expressions, you need to understand the regular expression syntax. Here are some of the most commonly used syntax elements:

Operator   Description
.          Matches any single character except a newline
*          Matches zero or more occurrences of the preceding character
+          Matches one or more occurrences of the preceding character
?          Matches zero or one occurrence of the preceding character
^          Matches the beginning of a string
$          Matches the end of a string
[ ]        Matches any single character listed inside the brackets
\w         Matches any word character (alphanumeric characters plus underscore)
@          Matches the "@" character literally
\d         Matches a digit
\.         Matches the "." character literally
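To see these operators in action, here is a small sketch that runs several of them through re.findall() and re.search(). The sample strings are invented for illustration; only the operators themselves come from the table.

```python
import re

# Each entry pairs a pattern with a made-up sample string.
examples = [
    (r"c.t", "cat cot cut"),        # . matches any single character
    (r"lo*l", "ll lol loool"),      # * allows zero or more "o"
    (r"lo+l", "ll lol loool"),      # + requires at least one "o"
    (r"colou?r", "color colour"),   # ? makes the "u" optional
    (r"[aeiou]", "regex"),          # [ ] matches any one listed character
    (r"\d\d", "room 42, floor 7"),  # \d matches a digit
    (r"\w+@\w+\.\w+", "mail ada@example.com now"),  # \w, @ and \. together
]

for pattern, sample in examples:
    print(pattern, "->", re.findall(pattern, sample))

# ^ and $ anchor the pattern to the start and end of the string
print(bool(re.search(r"^regex", "regex is fun")))  # True
print(bool(re.search(r"fun$", "regex is fun")))    # True
print(bool(re.search(r"^fun", "regex is fun")))    # False
```

Comparing the lo*l and lo+l lines is a quick way to see the difference between "zero or more" and "one or more": only the first pattern matches "ll".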

Using the findall() and sub() Methods

The findall() method returns a list containing all matches of a regular expression in a string. Here's an example:

import re
# Define a regular expression pattern
pattern = r"\d+"

# Define a string to search
string = "I have 3 cats and 2 dogs"

# Use the findall() method to find all occurrences of the pattern in the string
matches = re.findall(pattern, string)

# Print the matches
print(matches)

In this example, we define a regular expression pattern that matches one or more digits. We then define a string to search, which contains the numbers 3 and 2. We use the findall() method to find all occurrences of the pattern in the string, and the method returns a list containing the matches.
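Note that findall() returns the matches as strings, even when they look like numbers. If you need to do arithmetic with them, a conversion step is required. The following is a small sketch of that idea, reusing the same sample sentence:

```python
import re

string = "I have 3 cats and 2 dogs"

matches = re.findall(r"\d+", string)  # ['3', '2'] (strings)
numbers = [int(m) for m in matches]   # [3, 2] (integers)

print(numbers)
print(sum(numbers))  # 5

# When nothing matches, findall() returns an empty list
print(re.findall(r"\d+", "no digits here"))  # []
```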

Another Example

Suppose you have a text containing multiple email addresses, and you want to extract all of them using re.findall():

import re

text = "Contact us at john@example.com or jane@email.org for assistance. For more information, email support@company.com."
email_pattern = r'[\w.]+@\w+\.\w+'  # A basic pattern for matching email addresses (it stops before trailing punctuation)

email_addresses = re.findall(email_pattern, text)
print(email_addresses)

The sub() Method

The sub() method replaces all occurrences of a regular expression in a string with a new string[3]. It can be used to automatically substitute one expression for another within the string. Here's an example:

import re

# Define a regular expression pattern
pattern = r"\d+"

# Define a string to search
string = "I have 3 cats and 2 dogs"

# Use the sub() method to replace all occurrences of the pattern with the string "number"
new_string = re.sub(pattern, "number", string)

# Print the new string
print(new_string)

In this example, we define a regular expression pattern that matches one or more digits. We then define a string to search, which contains the numbers 3 and 2. We use the sub() method to replace all occurrences of the pattern with the string "number", and the method returns a new string with the replacements made.

Another example: Suppose you have a text with a recurring misspelled word, and you want to correct it throughout the text:

import re

text = "I lovee programming, and I am a proffesional developer."
misspelled_word = 'lovee'  # The misspelled word to be corrected
correction = 'love'  # The correction to be applied

corrected_text = re.sub(misspelled_word, correction, text)
print(corrected_text)

In this example, the re.sub() method is used to correct the misspelled word "lovee" by replacing it with "love" throughout the text.

The re.sub() method is a versatile tool for performing text replacement operations in Python, and it is particularly useful for tasks like data cleansing, anonymization, or correction of specific patterns within text data.
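As a rough illustration of the anonymization use case, the sketch below masks digits and email addresses in a made-up sentence. The email pattern is a deliberately simple assumption, not a complete email validator, and the "<email>" placeholder is arbitrary.

```python
import re

text = "Call 555-1234 or write to jane@email.org and john@example.com."

# Mask every digit with "#"
masked_digits = re.sub(r"\d", "#", text)

# Replace anything that looks like an email address with a placeholder
masked_emails = re.sub(r"[\w.]+@\w+\.\w+", "<email>", text)

# count=1 limits the replacement to the first match only
first_only = re.sub(r"[\w.]+@\w+\.\w+", "<email>", text, count=1)

print(masked_digits)  # Call ###-#### or write to jane@email.org and john@example.com.
print(masked_emails)  # Call 555-1234 or write to <email> and <email>.
print(first_only)     # Call 555-1234 or write to <email> and john@example.com.
```

The optional count argument is handy when only the first occurrence should change; leaving it out replaces every match.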

Practice Tasks

Task 1. What is a regular expression?

Solution: A regular expression is a sequence of characters that defines a search pattern.

Task 2. Write a Python program that uses a regular expression to search for the word "hello" in a string.

Solution:

import re

# Define a regular expression pattern
pattern = r"hello"

# Define a string to search
string = "hello world"

# Use the search() method to find the pattern in the string
match = re.search(pattern, string)

# Print the match object
print(match)

Task 3. Write a Python program that uses the findall() method to find all occurrences of the word "cat" in a string.

Solution:

import re

string = "The cat sat on the mat. The cat was black and white."

# Define the regular expression pattern for matching the word "cat".
cat_regex = r'cat'

# Use the findall() method to search for all occurrences of the pattern in the string.
cat_occurrences = re.findall(cat_regex, string)

# Print all occurrences of the word "cat" in the string.
for cat_occurrence in cat_occurrences:
    print(cat_occurrence)

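Task 4. Write a regular expression that matches any phone number in the United States format.

Solution: The following regular expression matches common United States phone number formats, such as (555) 123-4567, 555-123-4567, and 555 123 4567. Here \d{3} means exactly three digits, and [\s.-]? allows an optional space, dot, or hyphen between the digit groups:

PHONE_NUMBER_REGEX = r'\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}'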
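A phone pattern like this is easiest to trust after trying it on sample text. The sketch below does that with re.findall(). The pattern shown is an assumption chosen for illustration (optional parentheses around the area code, and an optional space, dot, or hyphen between groups), and the phone numbers are invented.

```python
import re

# Assumed pattern: optional parentheses around the area code and an
# optional space, dot, or hyphen between the digit groups.
PHONE_NUMBER_REGEX = r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"

text = "Office: (555) 123-4567, mobile: 555-987-6543, fax: 555.222.3333, ext: 1234"

found = re.findall(PHONE_NUMBER_REGEX, text)
print(found)  # ['(555) 123-4567', '555-987-6543', '555.222.3333']
```

The four-digit extension is skipped because the pattern needs ten digits in total. Real-world phone data usually has more variations (country codes, extensions), so a pattern like this is a starting point rather than a full validator.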

Extras:

1. Write a regular expression that matches any string that starts with "cat" and ends with "dog".

Hint: use the ^ and $ operators.

Solution:

import re

text = "cat and dog"
match = re.search(r"^cat.*dog$", text)
print(match)

2. Write a regular expression that matches any string that contains at least one digit.

Hint: use the \d operator.

Solution:

import re

text = "hello123world"
match = re.search(r"\d+", text)
print(match)

3. Write a regular expression that matches any string that contains the word "apple" or "orange".

Hint: use the | operator.

Solution:

import re
text = "I like apples and oranges"
match = re.search(r"apple|orange", text)
print(match)

In summary, we've learned what regular expressions (regex) are and how to use them in Python. These skills will help you efficiently handle text data in various tasks.