🔍 Lesson 6: Regular Expressions

Master Python's re module — write patterns that search, match, extract, and transform text with surgical precision.

🎯 Learning Objectives

By the end of this lesson, you will be able to:

  • Write regular expression patterns using character classes, quantifiers, and anchors
  • Use re.search(), re.match(), re.findall(), and re.sub()
  • Extract structured data from text using capturing groups
  • Compile patterns for reuse and performance
  • Apply regex to real-world tasks: validation, parsing, and text transformation

Estimated Time: 60 minutes

Project: Build a log file parser that extracts timestamps, levels, and messages using regex

🤔 What Are Regular Expressions?

A regular expression (regex) is a mini-language for describing text patterns. Instead of searching for an exact string like "hello", you can describe a shape — "a word starting with a capital letter followed by digits" or "an email address" or "a date in YYYY-MM-DD format".

Regex is used everywhere: form validation, log parsing, data extraction, search-and-replace, syntax highlighting, and URL routing. Nearly every mainstream programming language supports it, so the skills you build in Python transfer directly, apart from small syntax differences between regex flavors.

📖 Key Terms

Pattern: The regex string that describes what you're looking for (e.g., r"\d{3}-\d{4}").

Match: A successful result — the pattern was found in the text.

Raw string (r"..."): A Python string where backslashes are treated literally — essential for regex because patterns use lots of \ characters.

Capture group: Parentheses in a pattern that extract specific parts of a match.

Why Raw Strings?

In normal Python strings, \n means newline and \t means tab. But in regex, \d means "any digit" and \b means "word boundary." Without raw strings, you'd have to double every backslash:

# Without raw string — messy and error-prone
pattern = "\\d{3}-\\d{4}"

# With raw string — clean and readable
pattern = r"\d{3}-\d{4}"

# Both produce the same pattern, but r"..." is the convention

✅ Rule of Thumb

Always use raw strings (r"...") for regex patterns. It's a universal Python convention and prevents subtle backslash bugs.
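To see why the rule matters, here is a small sketch comparing the two spellings. The sample text "the cat sat" is made up for illustration; the point is that "\b" in a normal string is a backspace character, so the pattern silently stops working.

```python
import re

# A normal string needs doubled backslashes; a raw string does not.
escaped = "\\d{3}-\\d{4}"
raw = r"\d{3}-\d{4}"
same_pattern = escaped == raw   # both hold the same 11 characters

# "\b" in a normal string is a backspace character, not a word boundary.
found_without_raw = re.findall("\bcat\b", "the cat sat")
found_with_raw = re.findall(r"\bcat\b", "the cat sat")

print(same_pattern)        # True
print(found_without_raw)   # []
print(found_with_raw)      # ['cat']
```

No error is raised in the non-raw case, which is what makes this kind of bug easy to miss.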

🧩 Basic Pattern Syntax

Most characters in a pattern match themselves literally. The power comes from metacharacters — characters with special meaning:

| Metacharacter | Meaning | Example | Matches |
|---|---|---|---|
| . | Any character (except newline) | c.t | cat, cot, c9t, c!t |
| \d | Any digit (0–9) | \d\d | 42, 07, 99 |
| \D | Any non-digit | \D\D | AB, hi, !@ |
| \w | Word character (letter, digit, _) | \w+ | hello, var_1, Python3 |
| \W | Non-word character | \W | spaces, punctuation |
| \s | Whitespace (space, tab, newline) | \s+ | one or more spaces |
| \S | Non-whitespace | \S+ | any "word" without spaces |
| \ | Escape a metacharacter | \. | literal dot |

import re

# . matches any character
print(re.findall(r"c.t", "cat cot cut c t c9t"))
# ['cat', 'cot', 'cut', 'c t', 'c9t']

# \d matches digits
print(re.findall(r"\d+", "Order #1234 has 5 items worth $67.89"))
# ['1234', '5', '67', '89']

# \w+ matches word characters
print(re.findall(r"\w+", "Hello, World! Python_3 is great."))
# ['Hello', 'World', 'Python_3', 'is', 'great']

# Escaping: \. matches a literal dot
print(re.findall(r"\d+\.\d+", "Price: $12.99 and $3.50"))
# ['12.99', '3.50']

Pattern breakdown for r"\d{3}-\d{4}":

  • \d{3}: exactly 3 digits
  • -: a literal hyphen
  • \d{4}: exactly 4 digits

Matches: 555-1234, 800-5678

🛠️ The re Module Functions

Python's re module provides several functions for working with patterns. Here are the most important ones:

| Function | Returns | Use For |
|---|---|---|
| re.search(pattern, string) | First match (or None) | Find a pattern anywhere in text |
| re.match(pattern, string) | Match at start only (or None) | Check if text starts with a pattern |
| re.fullmatch(pattern, string) | Match entire string (or None) | Validate that a string matches exactly |
| re.findall(pattern, string) | List of all matches | Extract every occurrence |
| re.finditer(pattern, string) | Iterator of match objects | Extract matches with position info |
| re.sub(pattern, repl, string) | New string with replacements | Search and replace |
| re.split(pattern, string) | List of substrings | Split on a pattern instead of fixed string |

re.search() — Find First Match

import re

text = "My phone is 555-1234 and office is 555-5678"

match = re.search(r"\d{3}-\d{4}", text)

if match:
    print(f"Found: {match.group()}")   # '555-1234'
    print(f"Start: {match.start()}")    # 12
    print(f"End:   {match.end()}")      # 20
    print(f"Span:  {match.span()}")     # (12, 20)
else:
    print("No match found")

Output:

Found: 555-1234
Start: 12
End:   20
Span:  (12, 20)

re.match() vs re.search()

import re

text = "Error: file not found"

# re.match() — only checks the BEGINNING of the string
print(re.match(r"Error", text))    # <re.Match object; ...>
print(re.match(r"file", text))     # None (not at start)

# re.search() — checks ANYWHERE in the string
print(re.search(r"file", text))    # <re.Match object; ...>

⚠️ Common Gotcha

re.match() only matches at the beginning of the string — not the beginning of each line. Most of the time, you want re.search() instead. Use re.match() when you're specifically validating that a string starts with a pattern.
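A quick way to see the difference between the three "does it match?" functions is to run them side by side. This is a minimal sketch; the four-digit YEAR pattern and the candidate strings are hypothetical examples, not part of the lesson's project.

```python
import re

YEAR = r"\d{4}"   # hypothetical "four-digit year" pattern

candidates = ["2024", "2024abc", "abc2024"]
results = {}
for s in candidates:
    starts = re.match(YEAR, s) is not None       # pattern at the start?
    whole = re.fullmatch(YEAR, s) is not None    # pattern covers everything?
    anywhere = re.search(YEAR, s) is not None    # pattern somewhere inside?
    results[s] = (starts, whole, anywhere)
    print(f"{s!r:10} match={starts} fullmatch={whole} search={anywhere}")
```

Only "2024" passes all three. "2024abc" passes re.match() even though it has trailing junk, which is why re.fullmatch() is the safer choice for validation.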

re.findall() — Get All Matches

import re

text = "Prices: $12.99, $3.50, $149.00, and $0.99"

prices = re.findall(r"\$\d+\.\d{2}", text)
print(prices)
# ['$12.99', '$3.50', '$149.00', '$0.99']

# Convert to floats
amounts = [float(p.replace("$", "")) for p in prices]
print(f"Total: ${sum(amounts):.2f}")
# Total: $166.48

re.split() — Split on a Pattern

import re

# Split on any combination of comma, semicolon, or whitespace
text = "apple, banana;cherry  date;;elderberry"
items = re.split(r"[,;\s]+", text)
print(items)
# ['apple', 'banana', 'cherry', 'date', 'elderberry']

# Compare with str.split() — can only split on one fixed string
print(text.split(","))  # Less flexible
# ['apple', ' banana;cherry  date;;elderberry']

Which function do you need?

  • Find the first occurrence: re.search()
  • Find ALL occurrences: re.findall()
  • Check if the string starts with a pattern: re.match()
  • Validate the entire string: re.fullmatch()
  • Replace matches: re.sub()
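The function table also lists re.finditer(), which has not been demonstrated yet. Here is a minimal sketch that reuses the phone-number text from the re.search() example and collects every match together with its position:

```python
import re

text = "My phone is 555-1234 and office is 555-5678"

# finditer() yields one match object per occurrence, so positions are available.
hits = [(m.group(), m.span()) for m in re.finditer(r"\d{3}-\d{4}", text)]

for value, (start, end) in hits:
    print(f"{value} at {start}..{end}")
# 555-1234 at 12..20
# 555-5678 at 35..43
```

Use re.findall() when you only need the matched strings, and re.finditer() when you also need to know where they are.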

📦 Character Classes

Square brackets [...] define a character class — a set of characters that can match at a single position:

import re

# [aeiou] matches any single vowel
print(re.findall(r"[aeiou]", "Hello World"))
# ['e', 'o', 'o']

# [A-Z] matches any uppercase letter (range)
print(re.findall(r"[A-Z]", "Hello World"))
# ['H', 'W']

# [0-9a-fA-F] matches any hex digit; the leading # keeps letters like the "C" in "Color" from matching
print(re.findall(r"#[0-9a-fA-F]+", "Color: #FF5733 and #1a2b3c"))
# ['#FF5733', '#1a2b3c']

# [^...] means NOT these characters (negation)
print(re.findall(r"[^0-9\s]+", "abc 123 def 456"))
# ['abc', 'def']
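
| Class | Meaning | Equivalent |
|---|---|---|
| [abc] | a, b, or c | |
| [a-z] | Any lowercase letter | |
| [A-Za-z] | Any letter | |
| [0-9] | Any digit | \d |
| [^abc] | Any character except a, b, c | |
| [a-zA-Z0-9_] | Any word character | \w |

The equivalences are exact only for ASCII text. By default, \d and \w also match digits and letters from other scripts; pass re.ASCII to restrict them to the ranges shown.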

🧠 Special Characters Inside [...]

Most metacharacters lose their special meaning inside brackets. A literal . inside [.] matches a dot — no escaping needed. To include a literal ], put it first: []]. To include a literal -, put it first or last: [-abc] or [abc-]. To include ^ as a literal, don't put it first: [a^b].
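The bracket rules above are easy to check for yourself. This is a small sketch using a made-up sample string that contains each of the tricky characters once:

```python
import re

sample = "a.b]c-d^e"

dots = re.findall(r"[.]", sample)        # dot is literal inside brackets
brackets = re.findall(r"[]]", sample)    # ] placed first is literal
hyphens = re.findall(r"[-a]", sample)    # - placed first is literal
carets = re.findall(r"[a^]", sample)     # ^ not placed first is literal

print(dots)      # ['.']
print(brackets)  # [']']
print(hyphens)   # ['a', '-']
print(carets)    # ['a', '^']
```

If the positional rules feel hard to remember, escaping with a backslash ([\]], [\-], [\^]) also works and is often easier to read.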

🔢 Quantifiers

Quantifiers control how many times the preceding element must appear:

| Quantifier | Meaning | Example | Matches |
|---|---|---|---|
| * | 0 or more | ab*c | ac, abc, abbc, abbbc |
| + | 1 or more | ab+c | abc, abbc (not ac) |
| ? | 0 or 1 (optional) | colou?r | color, colour |
| {n} | Exactly n | \d{4} | 1234, 0000, 9999 |
| {n,} | n or more | \d{2,} | 12, 123, 1234567 |
| {n,m} | Between n and m (inclusive) | \d{2,4} | 12, 123, 1234 |

import re

# ? makes the preceding element optional
print(re.findall(r"colou?r", "color and colour"))
# ['color', 'colour']

# + requires at least one
print(re.findall(r"\d+", "I have 3 cats and 12 fish"))
# ['3', '12']

# {n,m} specifies a range: here, whole words of 3 to 5 characters
text = "I am a big Python fan today"
print(re.findall(r"\b\w{3,5}\b", text))
# ['big', 'fan', 'today']  ("Python" has 6 characters, so it is skipped)
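One detail of {n,m} is worth checking: without word boundaries, a long run of digits is not rejected, it is cut into pieces. The sketch below uses made-up digit strings to show this.

```python
import re

# {2,4} takes up to 4 digits at a time, then starts over on what is left.
chunks = re.findall(r"\d{2,4}", "1234567")
print(chunks)      # ['1234', '567']

# A single leftover digit is too short to match on its own.
leftover = re.findall(r"\d{2,4}", "12345")
print(leftover)    # ['1234']

# To accept only strings of 2 to 4 digits, match the whole string instead.
valid = [s for s in ["1", "12", "1234", "12345"] if re.fullmatch(r"\d{2,4}", s)]
print(valid)       # ['12', '1234']
```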

Greedy vs. Lazy

By default, quantifiers are greedy — they match as much as possible. Add ? after a quantifier to make it lazy (match as little as possible):

import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible
print(re.findall(r"<.*>", html))
# ['<b>bold</b> and <i>italic</i>']  — grabbed everything!

# Lazy: .*? grabs as little as possible
print(re.findall(r"<.*?>", html))
# ['<b>', '</b>', '<i>', '</i>']  — each tag separately

Add ? after any quantifier to make it lazy: *?, +?, ??, {n,m}?
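A lazy quantifier is not the only way to stop a match early. A negated character class such as [^>] says directly which characters are allowed, and gives the same result on the HTML example above. A minimal sketch:

```python
import re

html = "<b>bold</b> and <i>italic</i>"

lazy = re.findall(r"<.*?>", html)        # as little as possible
negated = re.findall(r"<[^>]+>", html)   # anything except the closing >

print(lazy == negated)   # True
print(negated)           # ['<b>', '</b>', '<i>', '</i>']

# Laziness applies to every quantifier, not just *.
greedy_digits = re.search(r"\d+", "12345").group()
lazy_digits = re.search(r"\d+?", "12345").group()
print(greedy_digits, lazy_digits)   # 12345 1
```

The negated class is often the clearer choice because the stopping condition is written into the pattern itself.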

⚓ Anchors & Boundaries

Anchors don't match characters — they match positions in the string:

| Anchor | Meaning | Example |
|---|---|---|
| ^ | Start of string (or line with re.MULTILINE) | ^Hello |
| $ | End of string (or line with re.MULTILINE) | world$ |
| \b | Word boundary | \bcat\b matches "cat" but not "cats" or "scatter" |
| \B | Not a word boundary | \Bcat\B matches the "cat" inside "scatter" but not the word "cat" |

import re

# \b word boundaries — match whole words only
text = "the cat scattered categories across the catalog"

# Without boundary — finds "cat" inside other words too
print(re.findall(r"cat", text))
# ['cat', 'cat', 'cat', 'cat']

# With boundary — only the standalone word "cat"
print(re.findall(r"\bcat\b", text))
# ['cat']

# ^ and $ anchors
lines = "ERROR: disk full\nWARNING: low memory\nERROR: timeout"

# re.MULTILINE makes ^ and $ match each line, not just the whole string
errors = re.findall(r"^ERROR.*", lines, re.MULTILINE)
print(errors)
# ['ERROR: disk full', 'ERROR: timeout']

🧠 The re.MULTILINE Flag

By default, ^ and $ match the start and end of the entire string. With re.MULTILINE (or re.M), they match the start and end of each line. This is essential when processing multi-line log files, config files, or any text with newlines.
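To make the effect of the flag concrete, here is a small sketch that runs the same patterns with and without re.MULTILINE on the log text from the example above. It also shows that re.match() stays tied to the start of the whole string even when the flag is set.

```python
import re

lines = "ERROR: disk full\nWARNING: low memory\nERROR: timeout"

without_flag = re.findall(r"^ERROR.*", lines)
with_flag = re.findall(r"^ERROR.*", lines, re.MULTILINE)
print(without_flag)   # ['ERROR: disk full']
print(with_flag)      # ['ERROR: disk full', 'ERROR: timeout']

# $ matches at the end of every line when the flag is set.
last_words = re.findall(r"\w+$", lines, re.MULTILINE)
print(last_words)     # ['full', 'memory', 'timeout']

# re.match() still only looks at the very start of the string.
second_line = re.match(r"WARNING", lines, re.MULTILINE)
print(second_line)    # None
```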

🎯 Groups & Capturing

Parentheses (...) create capturing groups that extract specific parts of a match. This is one of regex's most powerful features.

Basic Groups

import re

# Extract area code and number separately
text = "Call 555-1234 or 800-5678"

# Parentheses create groups
matches = re.findall(r"(\d{3})-(\d{4})", text)
print(matches)
# [('555', '1234'), ('800', '5678')]

# With re.search(), access groups by index
match = re.search(r"(\d{3})-(\d{4})", text)
if match:
    print(f"Full match: {match.group(0)}")  # '555-1234'
    print(f"Area code:  {match.group(1)}")  # '555'
    print(f"Number:     {match.group(2)}")  # '1234'

Output:

[('555', '1234'), ('800', '5678')]
Full match: 555-1234
Area code:  555
Number:     1234

Named Groups

For readability, you can name your groups with (?P<name>...):

import re

log_line = "2024-01-15 10:30:45 ERROR Failed to connect"

pattern = r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) (?P<level>\w+) (?P<message>.+)"

match = re.search(pattern, log_line)

if match:
    print(f"Date:    {match.group('date')}")
    print(f"Time:    {match.group('time')}")
    print(f"Level:   {match.group('level')}")
    print(f"Message: {match.group('message')}")

    # .groupdict() returns all named groups as a dict
    print(match.groupdict())

Output:

Date:    2024-01-15
Time:    10:30:45
Level:   ERROR
Message: Failed to connect
{'date': '2024-01-15', 'time': '10:30:45', 'level': 'ERROR', 'message': 'Failed to connect'}

Non-Capturing Groups

Sometimes you need grouping for structure but don't need to capture. Use (?:...):

import re

# We want to match "http" or "https", but only capture the domain
urls = "Visit http://example.com or https://docs.python.org"

# (?:...) groups without capturing
domains = re.findall(r"(?:https?://)(\S+)", urls)
print(domains)
# ['example.com', 'docs.python.org']

# Without (?:...) — would capture "http://" too
both = re.findall(r"(https?://)(\S+)", urls)
print(both)
# [('http://', 'example.com'), ('https://', 'docs.python.org')]

Alternation with |

The pipe | works like "or" — matches the pattern on either side:

import re

text = "I have a cat and a dog but not a fish"

# Match "cat" or "dog"
pets = re.findall(r"cat|dog", text)
print(pets)  # ['cat', 'dog']

# Group alternation for shared context
text = "file.py file.js file.txt file.css"
code_files = re.findall(r"file\.(?:py|js|css)", text)
print(code_files)  # ['file.py', 'file.js', 'file.css']
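Alternation has very low precedence, which is a common source of surprises when it is combined with anchors. The sketch below uses a made-up word list: ^cat|dog$ means "starts with cat" OR "ends with dog", not "is exactly cat or dog".

```python
import re

words = ["cat", "dog", "catalog", "hotdog"]

# Without a group, each anchor belongs to only one side of the |.
loose = [w for w in words if re.search(r"^cat|dog$", w)]
print(loose)    # ['cat', 'dog', 'catalog', 'hotdog']

# A non-capturing group makes both anchors apply to the whole alternation.
strict = [w for w in words if re.search(r"^(?:cat|dog)$", w)]
print(strict)   # ['cat', 'dog']
```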

🔄 Search & Replace with re.sub()

re.sub() finds all matches of a pattern and replaces them. The replacement can be a string (with backreferences) or a function.

Simple Replacement

import re

# Replace all digits with #
text = "My SSN is 123-45-6789"
redacted = re.sub(r"\d", "#", text)
print(redacted)
# 'My SSN is ###-##-####'

# Replace multiple spaces with one
messy = "Too    many     spaces    here"
clean = re.sub(r"\s+", " ", messy)
print(clean)
# 'Too many spaces here'

Backreferences in Replacement

Use \1, \2 (or \g<name>) in the replacement string to reference captured groups:

import re

# Reformat dates from MM/DD/YYYY to YYYY-MM-DD
text = "Dates: 01/15/2024 and 12/25/2023"
reformatted = re.sub(
    r"(\d{2})/(\d{2})/(\d{4})",
    r"\3-\1-\2",
    text
)
print(reformatted)
# 'Dates: 2024-01-15 and 2023-12-25'

# With named groups
reformatted = re.sub(
    r"(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{4})",
    r"\g<year>-\g<month>-\g<day>",
    text
)
print(reformatted)
# 'Dates: 2024-01-15 and 2023-12-25'
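The \g<...> form also accepts group numbers, which matters when a backreference is followed directly by a digit. In the sketch below (the sample text is made up), the goal is to append a literal 0 after each captured digit; writing \10 would be read as "group 10" instead.

```python
import re

text = "item 7, item 9"

# \g<1>0 means "group 1, then a literal 0".
appended = re.sub(r"(\d)", r"\g<1>0", text)
print(appended)   # 'item 70, item 90'

# \10 is read as a reference to group 10, which this pattern does not have.
try:
    re.sub(r"(\d)", r"\10", text)
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True
print(ambiguous_failed)   # True
```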

Function-Based Replacement

Pass a function to re.sub() for dynamic replacements:

import re

def censor_word(match):
    """Replace matched word with asterisks of same length."""
    word = match.group()
    return "*" * len(word)

text = "The password is secret123 and the code is abc"
censored = re.sub(r"\b(secret\w*|abc)\b", censor_word, text)
print(censored)
# 'The password is ********* and the code is ***'


def double_number(match):
    """Double any matched number."""
    num = int(match.group())
    return str(num * 2)

text = "I have 5 cats and 3 dogs"
doubled = re.sub(r"\d+", double_number, text)
print(doubled)
# 'I have 10 cats and 6 dogs'

⚡ Compiling Patterns

When you use a pattern multiple times, compile it into a reusable regex object with re.compile(). The re module already caches recently used patterns, so the speed gain is usually small, but a named, compiled pattern makes your code more readable:

import re

# Compile once, use many times
EMAIL_PATTERN = re.compile(
    r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
)

texts = [
    "Contact us at support@example.com for help",
    "Send to alice@company.org or bob@mail.co.uk",
    "No emails here!",
    "Try info@my-site.net or hello@world.com",
]

for text in texts:
    emails = EMAIL_PATTERN.findall(text)
    if emails:
        print(f"Found: {emails}")
    else:
        print("No emails found.")

Output:

Found: ['support@example.com']
Found: ['alice@company.org', 'bob@mail.co.uk']
No emails found.
Found: ['info@my-site.net', 'hello@world.com']
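A compiled pattern offers the same operations as the module-level functions, as methods. Here is a minimal sketch; the PHONE name and the masking replacement are illustrative choices, not part of the lesson's project.

```python
import re

PHONE = re.compile(r"(\d{3})-(\d{4})")   # hypothetical module-level constant

text = "Call 555-1234 or 800-5678"

first = PHONE.search(text).group()       # same as re.search(pattern, text)
everything = PHONE.findall(text)         # same as re.findall(pattern, text)
masked = PHONE.sub(r"\1-XXXX", text)     # same as re.sub(pattern, repl, text)

print(first)          # 555-1234
print(everything)     # [('555', '1234'), ('800', '5678')]
print(masked)         # Call 555-XXXX or 800-XXXX
print(PHONE.pattern)  # (\d{3})-(\d{4})
```

The .pattern attribute gives back the original pattern string, which is handy in log messages and error reports.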

Flags

Compiled patterns (and all re functions) accept flags to modify behavior:

import re

# re.IGNORECASE (re.I) — case-insensitive matching
pattern = re.compile(r"python", re.IGNORECASE)
print(pattern.findall("Python PYTHON python PyThOn"))
# ['Python', 'PYTHON', 'python', 'PyThOn']

# re.DOTALL (re.S) — makes . match newlines too
text = "<div>\n  Hello\n</div>"
print(re.findall(r"<div>.*</div>", text, re.DOTALL))
# ['<div>\n  Hello\n</div>']

# re.VERBOSE (re.X) — allows comments and whitespace in patterns
phone_pattern = re.compile(r"""
    \(?         # Optional opening paren
    \d{3}       # Area code (3 digits)
    \)?         # Optional closing paren
    [-.\s]?     # Optional separator
    \d{3}       # First three digits
    [-.\s]?     # Optional separator
    \d{4}       # Last four digits
""", re.VERBOSE)

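phones = ["555-1234", "(555) 123-4567", "555.123.4567", "5551234567"]
for p in phones:
    # fullmatch() requires the whole string to fit the pattern
    if phone_pattern.fullmatch(p):
        print(f"  Valid: {p}")
    else:
        print(f"  Invalid: {p}")   # only "555-1234" lands here: it has no area code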

✅ Use re.VERBOSE for Complex Patterns

The re.VERBOSE flag lets you add comments and whitespace to patterns. This makes complex regex readable and maintainable — use it whenever your pattern gets longer than a single line.
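Flags can be combined with the | operator when a pattern needs more than one. The sketch below uses made-up log text with inconsistent capitalization and combines re.IGNORECASE with re.MULTILINE:

```python
import re

log = "error: disk full\nWarning: low memory\nERROR: timeout"

# IGNORECASE handles "error" vs "ERROR"; MULTILINE lets ^ and $ work per line.
pattern = re.compile(r"^error:\s*(.+)$", re.IGNORECASE | re.MULTILINE)

messages = pattern.findall(log)
print(messages)   # ['disk full', 'timeout']
```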

🌍 Real-World Examples

Extracting Data from Log Files

import re

log_pattern = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r"\s+(?P<level>INFO|WARNING|ERROR|DEBUG|CRITICAL)"
    r"\s+(?P<message>.+)"
)

log_lines = [
    "2024-01-15 10:00:01 INFO  Server started on port 8080",
    "2024-01-15 10:01:12 WARNING  High memory usage: 85%",
    "2024-01-15 10:02:30 ERROR  Connection timeout after 30s",
    "This line doesn't match the format",
]

for line in log_lines:
    match = log_pattern.search(line)
    if match:
        data = match.groupdict()
        print(f"[{data['level']:8s}] {data['timestamp']} — {data['message']}")
    else:
        print(f"  ⚠️ Unparseable: {line!r}")

Output:

[INFO    ] 2024-01-15 10:00:01 — Server started on port 8080
[WARNING ] 2024-01-15 10:01:12 — High memory usage: 85%
[ERROR   ] 2024-01-15 10:02:30 — Connection timeout after 30s
  ⚠️ Unparseable: "This line doesn't match the format"

Validating Input

import re

def validate_password(password):
    """Check password meets security requirements."""
    checks = {
        "At least 8 characters": r".{8,}",
        "Contains uppercase":    r"[A-Z]",
        "Contains lowercase":    r"[a-z]",
        "Contains digit":        r"\d",
        "Contains special char": r"[!@#$%^&*(),.?\":{}|<>]",
    }

    results = {}
    for description, pattern in checks.items():
        results[description] = bool(re.search(pattern, password))

    return results

# Test
for pwd in ["hello", "Hello123", "Hello123!", "H3ll0_W0rld!"]:
    print(f"\n'{pwd}':")
    for check, passed in validate_password(pwd).items():
        status = "✅" if passed else "❌"
        print(f"  {status} {check}")

Cleaning & Normalizing Text

import re

def clean_text(text):
    """Normalize messy text for processing."""
    # Remove HTML tags
    text = re.sub(r"<[^>]+>", "", text)
    # Replace multiple whitespace with single space
    text = re.sub(r"\s+", " ", text)
    # Remove leading/trailing whitespace
    text = text.strip()
    # Normalize curly ("smart") quotes to straight quotes
    text = re.sub(r"[\u201c\u201d]", '"', text)   # “ and ” become "
    text = re.sub(r"[\u2018\u2019]", "'", text)   # ‘ and ’ become '
    return text

messy = """
  <p>Hello   <b>World</b>!</p>
  This   has    “smart quotes”  and  extra
    spaces    everywhere.
"""

print(clean_text(messy))
# 'Hello World! This has "smart quotes" and extra spaces everywhere.'

🏋️ Hands-on Exercises

🏋️ Exercise 1: Data Extractor

Objective: Use regex to extract structured data from unstructured text.

Requirements:

  1. Write a function extract_emails(text) that finds all email addresses
  2. Write a function extract_dates(text) that finds dates in MM/DD/YYYY or YYYY-MM-DD format
  3. Write a function extract_urls(text) that finds HTTP/HTTPS URLs
  4. Test all three on a sample paragraph containing mixed data

Starter Code:

import re

def extract_emails(text):
    """Find all email addresses in text."""
    # TODO: Write pattern for emails
    pass

def extract_dates(text):
    """Find dates in MM/DD/YYYY or YYYY-MM-DD format."""
    # TODO: Write pattern (use | for alternation)
    pass

def extract_urls(text):
    """Find all http:// or https:// URLs."""
    # TODO: Write pattern for URLs
    pass


sample = """
Contact us at support@example.com or sales@company.org.
Our website is https://www.example.com/products.
The deadline is 01/15/2024 (or 2024-01-15 in ISO format).
Also check http://docs.python.org for references.
Send feedback to feedback@my-site.net by 12/31/2024.
"""

print("Emails:", extract_emails(sample))
print("Dates:", extract_dates(sample))
print("URLs:", extract_urls(sample))
💡 Hint

For emails: [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}. For dates, use alternation: (\d{2}/\d{2}/\d{4})|(\d{4}-\d{2}-\d{2}) — but a cleaner approach is two separate patterns or one pattern with optional separators. For URLs: https?://\S+ is a simple starting point.

✅ Solution
import re

def extract_emails(text):
    pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    return re.findall(pattern, text)

def extract_dates(text):
    # Match MM/DD/YYYY or YYYY-MM-DD
    pattern = r"\d{2}/\d{2}/\d{4}|\d{4}-\d{2}-\d{2}"
    return re.findall(pattern, text)

def extract_urls(text):
    pattern = r"https?://[^\s,)\"']+"
    return re.findall(pattern, text)


sample = """
Contact us at support@example.com or sales@company.org.
Our website is https://www.example.com/products.
The deadline is 01/15/2024 (or 2024-01-15 in ISO format).
Also check http://docs.python.org for references.
Send feedback to feedback@my-site.net by 12/31/2024.
"""

print("Emails:", extract_emails(sample))
# ['support@example.com', 'sales@company.org', 'feedback@my-site.net']

print("Dates:", extract_dates(sample))
# ['01/15/2024', '2024-01-15', '12/31/2024']

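print("URLs:", extract_urls(sample))
# ['https://www.example.com/products.', 'http://docs.python.org']
# Note: the first URL keeps the sentence's trailing period, because "." is
# allowed by the pattern. Strip it with url.rstrip(".") if you need clean URLs.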

🏋️ Exercise 2: Log Parser

Objective: Build a structured log file parser using named groups and compiled patterns.

Requirements:

  1. Create a compiled regex pattern with named groups for: timestamp, level, module, and message
  2. Parse a list of log lines into a list of dictionaries
  3. Filter to show only ERROR entries
  4. Count entries by log level
  5. Handle lines that don't match the expected format gracefully
💡 Hint

Use re.compile() with (?P<name>...) named groups. The pattern should match a timestamp like 2024-01-15 10:30:45, then a level word, then a module name in brackets like [server], then the message. Use match.groupdict() to convert each match to a dictionary.

✅ Solution
import re
from collections import Counter

# Compile the log pattern with named groups
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
    r"\s+(?P<level>\w+)"
    r"\s+\[(?P<module>\w+)\]"
    r"\s+(?P<message>.+)"
)

log_lines = [
    "2024-01-15 10:00:01 INFO  [server] Server started on port 8080",
    "2024-01-15 10:00:05 DEBUG [database] Connection pool initialized",
    "2024-01-15 10:01:12 WARNING [memory] High memory usage: 85%",
    "2024-01-15 10:02:30 ERROR [network] Connection timeout after 30s",
    "This is not a valid log line",
    "2024-01-15 10:02:31 INFO  [server] Retrying request...",
    "2024-01-15 10:02:35 ERROR [database] Query failed: table not found",
    "2024-01-15 10:03:00 INFO  [server] Request completed in 250ms",
]

# Parse all lines
parsed = []
unparseable = []

for line in log_lines:
    match = LOG_PATTERN.search(line)
    if match:
        parsed.append(match.groupdict())
    else:
        unparseable.append(line)

# Show all parsed entries
print(f"Parsed {len(parsed)} entries, {len(unparseable)} unparseable\n")

# Filter errors only
errors = [entry for entry in parsed if entry["level"] == "ERROR"]
print("=== ERRORS ===")
for e in errors:
    print(f"  [{e['module']}] {e['timestamp']}: {e['message']}")

# Count by level
level_counts = Counter(entry["level"] for entry in parsed)
print("\n=== Level Counts ===")
for level, count in level_counts.most_common():
    print(f"  {level}: {count}")

# Show unparseable
if unparseable:
    print("\n=== Unparseable Lines ===")
    for line in unparseable:
        print(f"  ⚠️ {line!r}")

🎯 Quick Quiz

Question 1: What's the difference between re.match() and re.search()?

Question 2: What does re.findall(r"(\d+)-(\d+)", "12-34 56-78") return?

Question 3: What does adding ? after a quantifier (like *? or +?) do?

📏 Best Practices

✅ Do's

  • Always use raw strings: r"..." for every pattern, no exceptions
  • Compile patterns you reuse: re.compile() at module level for clarity and slight performance gains
  • Use named groups: (?P<name>...) makes patterns self-documenting
  • Use re.VERBOSE for complex patterns — add comments, break across lines
  • Test incrementally — build your pattern piece by piece, testing each addition
  • Prefer specific patterns: \d{3} is better than \d+ when you know the format

❌ Don'ts

  • Don't use regex for everything — simple string methods (.startswith(), .split(), in) are faster and clearer for simple tasks
  • Don't parse HTML or XML with regex — use BeautifulSoup or lxml instead. Regex can't handle nested tags correctly.
  • Don't write "clever" one-liners — if a regex needs a comment to explain it, it's too complex. Break it up or use re.VERBOSE
  • Don't forget to handle the None case: re.search() and re.match() return None on failure. Always check before calling .group()
  • Don't use .* without thinking — greedy .* often matches too much. Consider .*? or a more specific pattern like [^)]*
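The None case is the one most likely to crash a script, so here is a minimal sketch of the safe pattern. The first_number() helper is hypothetical and exists only for this illustration.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r"\d+", text)
    if match is None:          # no match: there is no .group() to call
        return None
    return match.group()

print(first_number("Order 1234 shipped"))   # 1234
print(first_number("No digits here"))       # None

# Skipping the check fails with AttributeError, not with a regex error.
try:
    re.search(r"\d+", "No digits here").group()
    crashed = False
except AttributeError:
    crashed = True
print(crashed)   # True
```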

💡 Pro Tips

  • Use an online regex tester (like regex101.com) to build and debug patterns interactively — it explains each part of your pattern
  • The re.finditer() function returns match objects (with position info) instead of strings — use it when you need to know where matches are
  • re.split() is often cleaner than chaining multiple .replace() calls for text normalization
  • For very complex text parsing, consider the parse library (third-party) which inverts regex — you write the output format and it generates the pattern

📝 Summary

🎉 Key Takeaways

  • Regex is a mini-language for describing text patterns — far more powerful than str.find() or in
  • Always use raw strings: r"..." prevents backslash confusion
  • re.search() finds the first match anywhere; re.findall() finds all matches; re.sub() replaces
  • Character classes ([...]) match sets of characters; quantifiers (*, +, ?, {n}) control repetition
  • Groups ((...)) capture parts of a match; named groups ((?P<name>...)) are self-documenting
  • re.compile() with re.VERBOSE makes complex patterns readable and reusable

| Task | Code Pattern |
|---|---|
| Find first match | m = re.search(r"pattern", text) |
| Find all matches | re.findall(r"pattern", text) |
| Replace | re.sub(r"old", "new", text) |
| Split on pattern | re.split(r"[,;\s]+", text) |
| Named group | (?P<name>\d+) |
| Non-capturing group | (?:pattern) |
| Word boundary | \bword\b |
| Lazy quantifier | .*? instead of .* |

🚀 What's Next?

In the next lesson, we'll master Comprehensions & Generators — Pythonic one-liners for transforming data, plus lazy evaluation with yield for memory-efficient processing of huge datasets.

🎉 Module 2 Complete!

You've completed the Working with Data module! You can now read/write files, handle errors gracefully with custom exceptions, and extract patterns from text with regex. On to Pythonic code!