🔍 Lesson 6: Regular Expressions
Master Python's re module — write patterns that search, match, extract, and transform text with surgical precision.
🎯 Learning Objectives
By the end of this lesson, you will be able to:
- Write regular expression patterns using character classes, quantifiers, and anchors
- Use re.search(), re.match(), re.findall(), and re.sub()
- Extract structured data from text using capturing groups
- Compile patterns for reuse and performance
- Apply regex to real-world tasks: validation, parsing, and text transformation
Estimated Time: 60 minutes
Project: Build a log file parser that extracts timestamps, levels, and messages using regex
🤔 What Are Regular Expressions?
A regular expression (regex) is a mini-language for describing text patterns. Instead of searching for an exact string like "hello", you can describe a shape — "a word starting with a capital letter followed by digits" or "an email address" or "a date in YYYY-MM-DD format".
Regex is used everywhere: form validation, log parsing, data extraction, search-and-replace, syntax highlighting, and URL routing. Nearly every mainstream programming language supports it, and most of the syntax you learn in Python transfers directly.
📖 Key Terms
Pattern: The regex string that describes what you're looking for (e.g., r"\d{3}-\d{4}").
Match: A successful result — the pattern was found in the text.
Raw string (r"..."): A Python string where backslashes are treated literally — essential for regex because patterns use lots of \ characters.
Capture group: Parentheses in a pattern that extract specific parts of a match.
Why Raw Strings?
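In normal Python strings, \n means newline and \t means tab. But in regex, \d means "any digit" and \b means "word boundary." In a normal string, \b is already a backspace character, and unrecognized escapes such as \d trigger a warning in recent Python versions, so without raw strings you'd have to double every backslash: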
# Without raw string — messy and error-prone
pattern = "\\d{3}-\\d{4}"
# With raw string — clean and readable
pattern = r"\d{3}-\d{4}"
# Both produce the same pattern, but r"..." is the convention
✅ Rule of Thumb
Always use raw strings (r"...") for regex patterns. It's a universal Python convention and prevents subtle backslash bugs.
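To see why the rule matters, here is a minimal sketch comparing a normal string and a raw string that look identical in source code. The sample text and variable names are illustrative, not part of the lesson's project. In the normal string, Python converts \b to a backspace character before the regex engine ever sees it:

```python
import re

text = "a cat sat"  # illustrative sample text

cooked = "\bcat\b"   # normal string: each \b becomes one backspace character
raw = r"\bcat\b"     # raw string: backslashes are kept, so regex sees word boundaries

print(len(cooked), len(raw))          # 5 7
print(re.search(cooked, text))        # None (there is no backspace in the text)
print(re.search(raw, text).group())   # cat
```

The normal-string version fails silently: no error is raised, the pattern simply never matches.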
🧩 Basic Pattern Syntax
Most characters in a pattern match themselves literally. The power comes from metacharacters — characters with special meaning:
| Metacharacter | Meaning | Example | Matches |
|---|---|---|---|
| . | Any character (except newline) | c.t | cat, cot, c9t, c!t |
| \d | Any digit (0–9) | \d\d | 42, 07, 99 |
| \D | Any non-digit | \D\D | AB, hi, !@ |
| \w | Word character (letter, digit, _) | \w+ | hello, var_1, Python3 |
| \W | Non-word character | \W | spaces, punctuation |
| \s | Whitespace (space, tab, newline) | \s+ | one or more spaces |
| \S | Non-whitespace | \S+ | any "word" without spaces |
| \ | Escape a metacharacter | \. | literal dot |
import re
# . matches any character
print(re.findall(r"c.t", "cat cot cut c t c9t"))
# ['cat', 'cot', 'cut', 'c t', 'c9t']
# \d matches digits
print(re.findall(r"\d+", "Order #1234 has 5 items worth $67.89"))
# ['1234', '5', '67', '89']
# \w+ matches word characters
print(re.findall(r"\w+", "Hello, World! Python_3 is great."))
# ['Hello', 'World', 'Python_3', 'is', 'great']
# Escaping: \. matches a literal dot
print(re.findall(r"\d+\.\d+", "Price: $12.99 and $3.50"))
# ['12.99', '3.50']
🛠️ The re Module Functions
Python's re module provides several functions for working with patterns. Here are the most important ones:
| Function | Returns | Use For |
|---|---|---|
| re.search(pattern, string) | First match (or None) | Find a pattern anywhere in text |
| re.match(pattern, string) | Match at start only (or None) | Check if text starts with a pattern |
| re.fullmatch(pattern, string) | Match entire string (or None) | Validate that a string matches exactly |
| re.findall(pattern, string) | List of all matches | Extract every occurrence |
| re.finditer(pattern, string) | Iterator of match objects | Extract matches with position info |
| re.sub(pattern, repl, string) | New string with replacements | Search and replace |
| re.split(pattern, string) | List of substrings | Split on a pattern instead of a fixed string |
re.search() — Find First Match
import re
text = "My phone is 555-1234 and office is 555-5678"
match = re.search(r"\d{3}-\d{4}", text)
if match:
    print(f"Found: {match.group()}")   # '555-1234'
    print(f"Start: {match.start()}")   # 12
    print(f"End: {match.end()}")       # 20
    print(f"Span: {match.span()}")     # (12, 20)
else:
    print("No match found")
Output:
Found: 555-1234
Start: 12
End: 20
Span: (12, 20)
re.match() vs re.search()
import re
text = "Error: file not found"
# re.match() — only checks the BEGINNING of the string
print(re.match(r"Error", text)) # <re.Match object; ...>
print(re.match(r"file", text)) # None (not at start)
# re.search() — checks ANYWHERE in the string
print(re.search(r"file", text)) # <re.Match object; ...>
⚠️ Common Gotcha
re.match() only matches at the beginning of the string — not the beginning of each line. Most of the time, you want re.search() instead. Use re.match() when you're specifically validating that a string starts with a pattern.
re.findall() — Get All Matches
import re
text = "Prices: $12.99, $3.50, $149.00, and $0.99"
prices = re.findall(r"\$\d+\.\d{2}", text)
print(prices)
# ['$12.99', '$3.50', '$149.00', '$0.99']
# Convert to floats
amounts = [float(p.replace("$", "")) for p in prices]
print(f"Total: ${sum(amounts):.2f}")
# Total: $166.48
re.split() — Split on a Pattern
import re
# Split on any combination of comma, semicolon, or whitespace
text = "apple, banana;cherry date;;elderberry"
items = re.split(r"[,;\s]+", text)
print(items)
# ['apple', 'banana', 'cherry', 'date', 'elderberry']
# Compare with str.split() — can only split on one fixed string
print(text.split(",")) # Less flexible
# ['apple', ' banana;cherry date;;elderberry']
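Two details of re.split() are easy to miss: delimiters at the edges of the string produce empty strings, and the optional maxsplit argument caps the number of splits. The following is a small sketch with made-up input strings:

```python
import re

# Delimiters at the start or end produce empty strings in the result
parts = re.split(r"[,;\s]+", " apple, banana; ")
print(parts)        # ['', 'apple', 'banana', '']

# Drop the empties with a simple filter
items = [p for p in parts if p]
print(items)        # ['apple', 'banana']

# maxsplit limits how many splits are made
head_tail = re.split(r"[,;\s]+", "a, b, c", maxsplit=1)
print(head_tail)    # ['a', 'b, c']
```

Calling .strip() on the input before splitting is another way to avoid the empty edge items when the delimiters are only whitespace.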
Choosing a function by task:
- Find first occurrence: re.search()
- Find ALL occurrences: re.findall()
- Check if string starts with pattern: re.match()
- Validate entire string: re.fullmatch()
- Replace matches: re.sub()
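The table above lists re.fullmatch() and re.finditer(), so here is a brief sketch of each. The five-digit ZIP pattern is an illustrative assumption; the phone text is reused from the re.search() example:

```python
import re

ZIP = r"\d{5}"  # illustrative pattern: exactly five digits

# re.match() accepts a matching prefix; re.fullmatch() requires the whole string
prefix_ok = re.match(ZIP, "12345-6789") is not None
whole_ok = re.fullmatch(ZIP, "12345-6789") is not None
exact_ok = re.fullmatch(ZIP, "12345") is not None
print(prefix_ok, whole_ok, exact_ok)   # True False True

# re.finditer() yields match objects, so each match's position is available
text = "My phone is 555-1234 and office is 555-5678"
found = [(m.group(), m.span()) for m in re.finditer(r"\d{3}-\d{4}", text)]
print(found)   # [('555-1234', (12, 20)), ('555-5678', (35, 43))]
```

For validation, re.fullmatch() is usually the safer choice, because re.match() and re.search() both accept strings with extra trailing characters.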
📦 Character Classes
Square brackets [...] define a character class — a set of characters that can match at a single position:
import re
# [aeiou] matches any single vowel
print(re.findall(r"[aeiou]", "Hello World"))
# ['e', 'o', 'o']
# [A-Z] matches any uppercase letter (range)
print(re.findall(r"[A-Z]", "Hello World"))
# ['H', 'W']
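# [0-9a-fA-F] matches any hex digit; the leading "#" keeps the match to color codes
# (without it, the C in "Color" and the a and d in "and" would match too)
print(re.findall(r"#[0-9a-fA-F]+", "Color: #FF5733 and #1a2b3c"))
# ['#FF5733', '#1a2b3c']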
# [^...] means NOT these characters (negation)
print(re.findall(r"[^0-9\s]+", "abc 123 def 456"))
# ['abc', 'def']
| Class | Meaning | Equivalent |
|---|---|---|
| [abc] | a, b, or c | — |
| [a-z] | Any lowercase letter | — |
| [A-Za-z] | Any letter | — |
| [0-9] | Any digit | \d (with re.ASCII) |
| [^abc] | Any character except a, b, c | — |
| [a-zA-Z0-9_] | Any word character | \w (with re.ASCII) |
🧠 Special Characters Inside [...]
Most metacharacters lose their special meaning inside brackets. A literal . inside [.] matches a dot — no escaping needed. To include a literal ], put it first: []]. To include a literal -, put it first or last: [-abc] or [abc-]. To include ^ as a literal, don't put it first: [a^b].
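The placement rules above can be checked directly. This is a short sketch with made-up input strings; the last line shows that escaping with a backslash also works if you'd rather not rely on position:

```python
import re

dot = re.findall(r"[.]", "a.b")            # a dot needs no escaping in brackets
bracket = re.findall(r"[]x]", "a]x")       # ] placed first is a literal
hyphen = re.findall(r"[-+]", "1-2+3")      # - placed first is a literal
caret = re.findall(r"[a^b]", "a^b")        # ^ not placed first is a literal
escaped = re.findall(r"[\]\-^]", "]-^")    # explicit escapes work too

print(dot)       # ['.']
print(bracket)   # [']', 'x']
print(hyphen)    # ['-', '+']
print(caret)     # ['a', '^', 'b']
print(escaped)   # [']', '-', '^']
```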
🔢 Quantifiers
Quantifiers control how many times the preceding element must appear:
| Quantifier | Meaning | Example | Matches |
|---|---|---|---|
| * | 0 or more | ab*c | ac, abc, abbc, abbbc |
| + | 1 or more | ab+c | abc, abbc (not ac) |
| ? | 0 or 1 (optional) | colou?r | color, colour |
| {n} | Exactly n | \d{4} | 1234, 0000, 9999 |
| {n,} | n or more | \d{2,} | 12, 123, 1234567 |
| {n,m} | Between n and m (inclusive) | \d{2,4} | 12, 123, 1234 |
import re
# ? makes the preceding element optional
print(re.findall(r"colou?r", "color and colour"))
# ['color', 'colour']
# + requires at least one
print(re.findall(r"\d+", "I have 3 cats and 12 fish"))
# ['3', '12']
# {n,m} specifies a range: here, whole words of 3 to 5 characters
print(re.findall(r"\b\w{3,5}\b", "I am a Python programmer"))
# [] (no word in this sentence is 3 to 5 characters long)
text = "I am a big Python fan today"
print(re.findall(r"\b\w{3,5}\b", text))
# ['big', 'fan', 'today']
Greedy vs. Lazy
By default, quantifiers are greedy — they match as much as possible. Add ? after a quantifier to make it lazy (match as little as possible):
import re
html = "<b>bold</b> and <i>italic</i>"
# Greedy: .* grabs as much as possible
print(re.findall(r"<.*>", html))
# ['<b>bold</b> and <i>italic</i>'] — grabbed everything!
# Lazy: .*? grabs as little as possible
print(re.findall(r"<.*?>", html))
# ['<b>', '</b>', '<i>', '</i>'] — each tag separately
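A negated character class is a common alternative to a lazy quantifier: instead of "as little as possible", it says "anything except the closing delimiter". The sketch below uses a made-up sentence with quoted words to compare all three approaches:

```python
import re

text = 'say "hi" and "bye"'  # illustrative input

greedy = re.findall(r'".*"', text)       # runs to the last quote
lazy = re.findall(r'".*?"', text)        # stops at the nearest quote
negated = re.findall(r'"[^"]*"', text)   # cannot cross a quote at all

print(greedy)    # ['"hi" and "bye"']
print(lazy)      # ['"hi"', '"bye"']
print(negated)   # ['"hi"', '"bye"']
```

The lazy and negated versions give the same result here. The negated class states the intent more explicitly and typically needs less backtracking.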
⚓ Anchors & Boundaries
Anchors don't match characters — they match positions in the string:
| Anchor | Meaning | Example |
|---|---|---|
| ^ | Start of string (or line with re.MULTILINE) | ^Hello |
| $ | End of string (or line with re.MULTILINE) | world$ |
| \b | Word boundary | \bcat\b: matches "cat" but not "cats" or "scatter" |
| \B | Not a word boundary | \Bcat\B: matches the "cat" inside "scatter" but not the word "cat" |
import re
# \b word boundaries — match whole words only
text = "the cat scattered categories across the catalog"
# Without boundary — finds "cat" inside other words too
print(re.findall(r"cat", text))
# ['cat', 'cat', 'cat', 'cat']
# With boundary — only the standalone word "cat"
print(re.findall(r"\bcat\b", text))
# ['cat']
# ^ and $ anchors
lines = "ERROR: disk full\nWARNING: low memory\nERROR: timeout"
# re.MULTILINE makes ^ and $ match each line, not just the whole string
errors = re.findall(r"^ERROR.*", lines, re.MULTILINE)
print(errors)
# ['ERROR: disk full', 'ERROR: timeout']
🧠 The re.MULTILINE Flag
By default, ^ and $ match the start and end of the entire string ($ also matches just before a single trailing newline). With re.MULTILINE (or re.M), they match the start and end of each line. This is essential when processing multi-line log files, config files, or any text with newlines.
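A side-by-side comparison makes the flag's effect concrete. This sketch reuses the three-line log string from the example above and runs the same patterns with and without re.MULTILINE:

```python
import re

lines = "ERROR: disk full\nWARNING: low memory\nERROR: timeout"

# ^ without the flag: only the very start of the string qualifies
first_default = re.findall(r"^\w+", lines)
first_multi = re.findall(r"^\w+", lines, re.MULTILINE)

# $ without the flag: only the very end of the string qualifies
last_default = re.findall(r"\w+$", lines)
last_multi = re.findall(r"\w+$", lines, re.MULTILINE)

print(first_default)   # ['ERROR']
print(first_multi)     # ['ERROR', 'WARNING', 'ERROR']
print(last_default)    # ['timeout']
print(last_multi)      # ['full', 'memory', 'timeout']
```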
🎯 Groups & Capturing
Parentheses (...) create capturing groups that extract specific parts of a match. This is one of regex's most powerful features.
Basic Groups
import re
# Extract area code and number separately
text = "Call 555-1234 or 800-5678"
# Parentheses create groups
matches = re.findall(r"(\d{3})-(\d{4})", text)
print(matches)
# [('555', '1234'), ('800', '5678')]
# With re.search(), access groups by index
match = re.search(r"(\d{3})-(\d{4})", text)
if match:
    print(f"Full match: {match.group(0)}")  # '555-1234'
    print(f"Area code: {match.group(1)}")   # '555'
    print(f"Number: {match.group(2)}")      # '1234'
Output:
[('555', '1234'), ('800', '5678')]
Full match: 555-1234
Area code: 555
Number: 1234
Named Groups
For readability, you can name your groups with (?P<name>...):
import re
log_line = "2024-01-15 10:30:45 ERROR Failed to connect"
pattern = r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.+)"
match = re.search(pattern, log_line)
if match:
    print(f"Date: {match.group('date')}")
    print(f"Time: {match.group('time')}")
    print(f"Level: {match.group('level')}")
    print(f"Message: {match.group('message')}")
    # .groupdict() returns all named groups as a dict
    print(match.groupdict())
Output:
Date: 2024-01-15
Time: 10:30:45
Level: ERROR
Message: Failed to connect
{'date': '2024-01-15', 'time': '10:30:45', 'level': 'ERROR', 'message': 'Failed to connect'}
Non-Capturing Groups
Sometimes you need grouping for structure but don't need to capture. Use (?:...):
import re
# We want to match "http" or "https", but only capture the domain
urls = "Visit http://example.com or https://docs.python.org"
# (?:...) groups without capturing
domains = re.findall(r"(?:https?://)(\S+)", urls)
print(domains)
# ['example.com', 'docs.python.org']
# Without (?:...) — would capture "http://" too
both = re.findall(r"(https?://)(\S+)", urls)
print(both)
# [('http://', 'example.com'), ('https://', 'docs.python.org')]
Alternation with |
The pipe | works like "or" — matches the pattern on either side:
import re
text = "I have a cat and a dog but not a fish"
# Match "cat" or "dog"
pets = re.findall(r"cat|dog", text)
print(pets) # ['cat', 'dog']
# Group alternation for shared context
text = "file.py file.js file.txt file.css"
code_files = re.findall(r"file\.(?:py|js|css)", text)
print(code_files) # ['file.py', 'file.js', 'file.css']
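One thing to watch with alternation is precedence: | has the lowest priority of any regex operator, so anchors and other context attach to a single alternative unless you group. This is a small sketch with an illustrative input word:

```python
import re

# Without a group, the alternatives are "^cat" and "dog$"
loose = re.search(r"^cat|dog$", "catalog")
print(loose.group())     # cat

# With a non-capturing group, both anchors apply to both alternatives
strict = re.search(r"^(?:cat|dog)$", "catalog")
print(strict)            # None

exact = re.search(r"^(?:cat|dog)$", "dog")
print(exact.group())     # dog
```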
🔄 Search & Replace with re.sub()
re.sub() finds all matches of a pattern and replaces them. The replacement can be a string (with backreferences) or a function.
Simple Replacement
import re
# Replace all digits with #
text = "My SSN is 123-45-6789"
redacted = re.sub(r"\d", "#", text)
print(redacted)
# 'My SSN is ###-##-####'
# Replace multiple spaces with one
messy = "Too    many   spaces     here"
clean = re.sub(r"\s+", " ", messy)
print(clean)
# 'Too many spaces here'
Backreferences in Replacement
Use \1, \2 (or \g<name>) in the replacement string to reference captured groups:
import re
# Reformat dates from MM/DD/YYYY to YYYY-MM-DD
text = "Dates: 01/15/2024 and 12/25/2023"
reformatted = re.sub(
    r"(\d{2})/(\d{2})/(\d{4})",
    r"\3-\1-\2",
    text
)
print(reformatted)
# 'Dates: 2024-01-15 and 2023-12-25'
# With named groups
reformatted = re.sub(
    r"(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})",
    r"\g<year>-\g<month>-\g<day>",
    text
)
print(reformatted)
# 'Dates: 2024-01-15 and 2023-12-25'
Function-Based Replacement
Pass a function to re.sub() for dynamic replacements:
import re
def censor_word(match):
    """Replace matched word with asterisks of same length."""
    word = match.group()
    return "*" * len(word)
text = "The password is secret123 and the code is abc"
censored = re.sub(r"\b(secret\w*|abc)\b", censor_word, text)
print(censored)
# 'The password is ********* and the code is ***'
def double_number(match):
    """Double any matched number."""
    num = int(match.group())
    return str(num * 2)
text = "I have 5 cats and 3 dogs"
doubled = re.sub(r"\d+", double_number, text)
print(doubled)
# 'I have 10 cats and 6 dogs'
⚡ Compiling Patterns
When you use a pattern multiple times, compile it into a reusable regex object with re.compile(). The re module already caches recently used patterns, so the speedup is usually small; the main benefits are a named, reusable object and more readable code:
import re
# Compile once, use many times
EMAIL_PATTERN = re.compile(
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
)
texts = [
    "Contact us at support@example.com for help",
    "Send to alice@company.org or bob@mail.co.uk",
    "No emails here!",
    "Try info@my-site.net or hello@world.com",
]
for text in texts:
    emails = EMAIL_PATTERN.findall(text)
    if emails:
        print(f"Found: {emails}")
    else:
        print("No emails found.")
Output:
Found: ['support@example.com']
Found: ['alice@company.org', 'bob@mail.co.uk']
No emails found.
Found: ['info@my-site.net', 'hello@world.com']
Flags
Compiled patterns (and all re functions) accept flags to modify behavior:
import re
# re.IGNORECASE (re.I) — case-insensitive matching
pattern = re.compile(r"python", re.IGNORECASE)
print(pattern.findall("Python PYTHON python PyThOn"))
# ['Python', 'PYTHON', 'python', 'PyThOn']
# re.DOTALL (re.S) — makes . match newlines too
text = "<div>\n Hello\n</div>"
print(re.findall(r"<div>.*</div>", text, re.DOTALL))
# ['<div>\n Hello\n</div>']
# re.VERBOSE (re.X) — allows comments and whitespace in patterns
phone_pattern = re.compile(r"""
    \(?      # Optional opening paren
    \d{3}    # Area code (3 digits)
    \)?      # Optional closing paren
    [-.\s]?  # Optional separator
    \d{3}    # First three digits
    [-.\s]?  # Optional separator
    \d{4}    # Last four digits
""", re.VERBOSE)
phones = ["555-1234", "(555) 123-4567", "555.123.4567", "5551234567"]
# "555-1234" is not printed: the pattern requires all 10 digits
for p in phones:
    if phone_pattern.search(p):
        print(f" Valid: {p}")
✅ Use re.VERBOSE for Complex Patterns
The re.VERBOSE flag lets you add comments and whitespace to patterns. This makes complex regex readable and maintainable — use it whenever your pattern gets longer than a single line.
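For validation, a verbose pattern pairs naturally with fullmatch(), so that extra text around a valid number is rejected. The sketch below restates the phone pattern from above; the names PHONE and is_phone are hypothetical, introduced only for illustration:

```python
import re

PHONE = re.compile(r"""
    \(?      # optional opening paren
    \d{3}    # area code
    \)?      # optional closing paren
    [-.\s]?  # optional separator
    \d{3}    # first three digits
    [-.\s]?  # optional separator
    \d{4}    # last four digits
""", re.VERBOSE)

def is_phone(candidate):
    """Return True only if the entire string is a phone number."""
    return PHONE.fullmatch(candidate) is not None

for s in ["(555) 123-4567", "555.123.4567", "555-1234", "call 5551234567 now"]:
    print(f"{s!r}: {is_phone(s)}")
```

The last input contains a valid number, so search() would accept it. fullmatch() rejects it because of the surrounding words.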
🌍 Real-World Examples
Extracting Data from Log Files
import re
log_pattern = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r"\s+(?P<level>INFO|WARNING|ERROR|DEBUG|CRITICAL)"
    r"\s+(?P<message>.+)"
)
log_lines = [
    "2024-01-15 10:00:01 INFO Server started on port 8080",
    "2024-01-15 10:01:12 WARNING High memory usage: 85%",
    "2024-01-15 10:02:30 ERROR Connection timeout after 30s",
    "This line doesn't match the format",
]
for line in log_lines:
    match = log_pattern.search(line)
    if match:
        data = match.groupdict()
        # :8s pads the level to 8 characters so the columns line up
        print(f"[{data['level']:8s}] {data['timestamp']} - {data['message']}")
    else:
        print(f" ⚠️ Unparseable: {line!r}")
Output:
[INFO    ] 2024-01-15 10:00:01 - Server started on port 8080
[WARNING ] 2024-01-15 10:01:12 - High memory usage: 85%
[ERROR   ] 2024-01-15 10:02:30 - Connection timeout after 30s
 ⚠️ Unparseable: "This line doesn't match the format"
Validating Input
import re
def validate_password(password):
    """Check password meets security requirements."""
    checks = {
        "At least 8 characters": r".{8,}",
        "Contains uppercase": r"[A-Z]",
        "Contains lowercase": r"[a-z]",
        "Contains digit": r"\d",
        "Contains special char": r"[!@#$%^&*(),.?\":{}|<>]",
    }
    results = {}
    for description, pattern in checks.items():
        results[description] = bool(re.search(pattern, password))
    return results
# Test
for pwd in ["hello", "Hello123", "Hello123!", "H3ll0_W0rld!"]:
    print(f"\n'{pwd}':")
    for check, passed in validate_password(pwd).items():
        status = "✅" if passed else "❌"
        print(f" {status} {check}")
Cleaning & Normalizing Text
import re
def clean_text(text):
    """Normalize messy text for processing."""
    # Remove HTML tags
    text = re.sub(r"<[^>]+>", "", text)
    # Replace multiple whitespace with single space
    text = re.sub(r"\s+", " ", text)
    # Remove leading/trailing whitespace
    text = text.strip()
    # Normalize curly ("smart") quotes to a straight double quote
    text = re.sub(r"[“”‘’]", '"', text)
    return text
messy = """
<p>Hello <b>World</b>!</p>
This has “smart quotes” and    extra
   spaces   everywhere.
"""
print(clean_text(messy))
# 'Hello World! This has "smart quotes" and extra spaces everywhere.'
🏋️ Hands-on Exercises
🏋️ Exercise 1: Data Extractor
Objective: Use regex to extract structured data from unstructured text.
Requirements:
- Write a function extract_emails(text) that finds all email addresses
- Write a function extract_dates(text) that finds dates in MM/DD/YYYY or YYYY-MM-DD format
- Write a function extract_urls(text) that finds HTTP/HTTPS URLs
- Test all three on a sample paragraph containing mixed data
Starter Code:
import re
def extract_emails(text):
    """Find all email addresses in text."""
    # TODO: Write pattern for emails
    pass
def extract_dates(text):
    """Find dates in MM/DD/YYYY or YYYY-MM-DD format."""
    # TODO: Write pattern (use | for alternation)
    pass
def extract_urls(text):
    """Find all http:// or https:// URLs."""
    # TODO: Write pattern for URLs
    pass
sample = """
Contact us at support@example.com or sales@company.org.
Our website is https://www.example.com/products.
The deadline is 01/15/2024 (or 2024-01-15 in ISO format).
Also check http://docs.python.org for references.
Send feedback to feedback@my-site.net by 12/31/2024.
"""
print("Emails:", extract_emails(sample))
print("Dates:", extract_dates(sample))
print("URLs:", extract_urls(sample))
💡 Hint
For emails: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}. For dates, use alternation: (\d{2}/\d{2}/\d{4})|(\d{4}-\d{2}-\d{2}) — but a cleaner approach is two separate patterns or one pattern with optional separators. For URLs: https?://\S+ is a simple starting point.
✅ Solution
import re
def extract_emails(text):
    pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    return re.findall(pattern, text)
def extract_dates(text):
    # Match MM/DD/YYYY or YYYY-MM-DD
    pattern = r"\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2}"
    return re.findall(pattern, text)
def extract_urls(text):
    pattern = r"https?://[^\s,)\"']+"
    return re.findall(pattern, text)
sample = """
Contact us at support@example.com or sales@company.org.
Our website is https://www.example.com/products.
The deadline is 01/15/2024 (or 2024-01-15 in ISO format).
Also check http://docs.python.org for references.
Send feedback to feedback@my-site.net by 12/31/2024.
"""
print("Emails:", extract_emails(sample))
# ['support@example.com', 'sales@company.org', 'feedback@my-site.net']
print("Dates:", extract_dates(sample))
# ['01/15/2024', '2024-01-15', '12/31/2024']
print("URLs:", extract_urls(sample))
# ['https://www.example.com/products.', 'http://docs.python.org']
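Notice that the first URL in the output keeps the sentence's trailing period. One possible refinement, sketched here under the assumption that URLs in your input never intentionally end in sentence punctuation, is to strip those characters after matching. The names URL_PATTERN and extract_urls_clean are hypothetical:

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s,)\"']+")

def extract_urls_clean(text):
    """Find URLs, then strip trailing sentence punctuation.

    Assumption: a trailing . , ; : ! or ? belongs to the sentence, not the URL.
    """
    return [url.rstrip(".,;:!?") for url in URL_PATTERN.findall(text)]

sample = (
    "Our website is https://www.example.com/products. "
    "Also check http://docs.python.org for references."
)
print(extract_urls_clean(sample))
# ['https://www.example.com/products', 'http://docs.python.org']
```

Post-processing like this is often simpler than trying to encode "no trailing punctuation" in the pattern itself.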
🏋️ Exercise 2: Log Parser
Objective: Build a structured log file parser using named groups and compiled patterns.
Requirements:
- Create a compiled regex pattern with named groups for: timestamp, level, module, and message
- Parse a list of log lines into a list of dictionaries
- Filter to show only ERROR entries
- Count entries by log level
- Handle lines that don't match the expected format gracefully
💡 Hint
Use re.compile() with (?P<name>...) named groups. The pattern should match a timestamp like 2024-01-15 10:30:45, then a level word, then a module name in brackets like [server], then the message. Use match.groupdict() to convert each match to a dictionary.
✅ Solution
import re
from collections import Counter
# Compile the log pattern with named groups
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r"\s+(?P<level>\w+)"
    r"\s+\[(?P<module>\w+)\]"
    r"\s+(?P<message>.+)"
)
log_lines = [
    "2024-01-15 10:00:01 INFO [server] Server started on port 8080",
    "2024-01-15 10:00:05 DEBUG [database] Connection pool initialized",
    "2024-01-15 10:01:12 WARNING [memory] High memory usage: 85%",
    "2024-01-15 10:02:30 ERROR [network] Connection timeout after 30s",
    "This is not a valid log line",
    "2024-01-15 10:02:31 INFO [server] Retrying request...",
    "2024-01-15 10:02:35 ERROR [database] Query failed: table not found",
    "2024-01-15 10:03:00 INFO [server] Request completed in 250ms",
]
# Parse all lines
parsed = []
unparseable = []
for line in log_lines:
    match = LOG_PATTERN.search(line)
    if match:
        parsed.append(match.groupdict())
    else:
        unparseable.append(line)
# Show all parsed entries
print(f"Parsed {len(parsed)} entries, {len(unparseable)} unparseable\n")
# Filter errors only
errors = [entry for entry in parsed if entry["level"] == "ERROR"]
print("=== ERRORS ===")
for e in errors:
    print(f" [{e['module']}] {e['timestamp']}: {e['message']}")
# Count by level
level_counts = Counter(entry["level"] for entry in parsed)
print("\n=== Level Counts ===")
for level, count in level_counts.most_common():
    print(f" {level}: {count}")
# Show unparseable
if unparseable:
    print("\n=== Unparseable Lines ===")
    for line in unparseable:
        print(f" ⚠️ {line!r}")
🎯 Quick Quiz
Question 1: What's the difference between re.match() and re.search()?
Question 2: What does re.findall(r"(\d+)-(\d+)", "12-34 56-78") return?
Question 3: What does adding ? after a quantifier (like *? or +?) do?
📏 Best Practices
✅ Do's
- Always use raw strings: r"..." for every pattern, no exceptions
- Compile patterns you reuse: re.compile() at module level for clarity and slight performance gains
- Use named groups: (?P<name>...) makes patterns self-documenting
- Use re.VERBOSE for complex patterns: add comments, break across lines
- Test incrementally: build your pattern piece by piece, testing each addition
- Prefer specific patterns: \d{3} is better than \d+ when you know the format
❌ Don'ts
- Don't use regex for everything: simple string methods (.startswith(), .split(), in) are faster and clearer for simple tasks
- Don't parse HTML or XML with regex: use BeautifulSoup or lxml instead. Regex can't handle nested tags correctly.
- Don't write "clever" one-liners: if a regex needs a comment to explain it, it's too complex. Break it up or use re.VERBOSE
- Don't forget to handle the None case: re.search() and re.match() return None on failure. Always check before calling .group()
- Don't use .* without thinking: greedy .* often matches too much. Consider .*? or a more specific pattern like [^)]*
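The None check from the list above can be wrapped in a small helper so callers never touch a failed match. This is a minimal sketch; the function name first_number and the sample strings are illustrative:

```python
import re

def first_number(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r"\d+", text)
    if match is None:
        # Calling match.group() here would raise AttributeError
        return None
    return int(match.group())

print(first_number("Order #1234 has 5 items"))   # 1234
print(first_number("no digits here"))            # None
```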
💡 Pro Tips
- Use an online regex tester (like regex101.com) to build and debug patterns interactively; it explains each part of your pattern
- The re.finditer() function returns match objects (with position info) instead of strings; use it when you need to know where matches are
- re.split() is often cleaner than chaining multiple .replace() calls for text normalization
- For very complex text parsing, consider the parse library (third-party), which inverts regex: you write the output format and it generates the pattern
📝 Summary
🎉 Key Takeaways
- Regex is a mini-language for describing text patterns, far more powerful than str.find() or in
- Always use raw strings: r"..." prevents backslash confusion
- re.search() finds the first match anywhere; re.findall() finds all matches; re.sub() replaces
- Character classes ([...]) match sets of characters; quantifiers (*, +, ?, {n}) control repetition
- Groups ((...)) capture parts of a match; named groups ((?P<name>...)) are self-documenting
- re.compile() with re.VERBOSE makes complex patterns readable and reusable
| Task | Code Pattern |
|---|---|
| Find first match | m = re.search(r"pattern", text) |
| Find all matches | re.findall(r"pattern", text) |
| Replace | re.sub(r"old", "new", text) |
| Split on pattern | re.split(r"[,;\s]+", text) |
| Named group | (?P<name>\d+) |
| Non-capturing group | (?:pattern) |
| Word boundary | \bword\b |
| Lazy quantifier | .*? instead of .* |
🚀 What's Next?
In the next lesson, we'll master Comprehensions & Generators — Pythonic one-liners for transforming data, plus lazy evaluation with yield for memory-efficient processing of huge datasets.
🎉 Module 2 Complete!
You've completed the Working with Data module! You can now read/write files, handle errors gracefully with custom exceptions, and extract patterns from text with regex. On to Pythonic code!