⚡ Lesson 7: Comprehensions & Generators
Write elegant, Pythonic one-liners with comprehensions and process massive datasets without running out of memory using generators and yield.
🎯 Learning Objectives
By the end of this lesson, you will be able to:
- Transform data concisely with list, dict, and set comprehensions
- Apply filters and nested loops inside comprehensions
- Build generator functions with yield for lazy evaluation
- Use generator expressions for memory-efficient pipelines
- Choose between comprehensions, generators, and loops for a given task
Estimated Time: 60 minutes
Project: Build a data pipeline that processes a large CSV lazily, filtering and transforming rows without loading the entire file
🤔 Why Comprehensions?
One of the most common patterns in programming is: take a collection, transform each element, and collect the results. In many languages, this requires a multi-line loop. In Python, you can express it in a single, readable line.
# The loop way — 4 lines
squares = []
for x in range(10):
    squares.append(x ** 2)
# The comprehension way — 1 line, same result
squares = [x ** 2 for x in range(10)]
print(squares)
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Comprehensions aren't just shorter — they're a declarative way to express data transformations. Instead of describing how to build the list step by step, you describe what the result looks like. This makes code easier to read once you're familiar with the syntax.
📖 Key Terms
Comprehension: A concise syntax for building a list, dict, or set from an iterable with optional transformation and filtering.
Generator: An iterator that produces values one at a time and pauses between them (lazy evaluation). You create one with a function that uses yield or with a generator expression.
Lazy evaluation: Computing values only when they're needed, rather than all at once. Saves memory for large datasets.
Iterable: Any object you can loop over — lists, strings, files, ranges, generators, and more.
📋 List Comprehensions
The basic syntax is: [expression for item in iterable]
Transform Every Element
# Uppercase every name
names = ["alice", "bob", "carlos"]
upper_names = [name.upper() for name in names]
print(upper_names) # ['ALICE', 'BOB', 'CARLOS']
# Convert temperatures from Celsius to Fahrenheit
celsius = [0, 20, 37, 100]
fahrenheit = [(c * 9/5) + 32 for c in celsius]
print(fahrenheit) # [32.0, 68.0, 98.6, 212.0]
# Get lengths of words
words = ["Python", "is", "awesome"]
lengths = [len(w) for w in words]
print(lengths) # [6, 2, 7]
Filter with if
Add an if clause to keep only elements that satisfy a condition:
# Even numbers only
evens = [x for x in range(20) if x % 2 == 0]
print(evens) # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
# Words longer than 3 characters
words = ["the", "quick", "brown", "fox", "jumps", "over", "a", "lazy", "dog"]
long_words = [w for w in words if len(w) > 3]
print(long_words) # ['quick', 'brown', 'jumps', 'over', 'lazy']
# Filter and transform in one step
scores = [45, 92, 78, 55, 88, 31, 95, 67]
passing_grades = [f"{s} ✅" for s in scores if s >= 60]
print(passing_grades)
# ['92 ✅', '78 ✅', '88 ✅', '95 ✅', '67 ✅']
Conditional Expression (if/else)
To transform differently based on a condition, put the if/else in the expression part (before for):
# Label each score as pass or fail
scores = [45, 92, 78, 55, 88, 31]
labels = ["pass" if s >= 60 else "fail" for s in scores]
print(labels)
# ['fail', 'pass', 'pass', 'fail', 'pass', 'fail']
# Clamp values to a range
values = [-5, 3, 12, 8, -1, 15, 7]
clamped = [max(0, min(10, v)) for v in values]
print(clamped)
# [0, 3, 10, 8, 0, 10, 7]
⚠️ Two Different if Positions
Filter (if after for): [x for x in items if condition] — keeps or drops items.
Transform (if/else before for): [a if condition else b for x in items] — chooses between two expressions for every item.
A filter if cannot have an else. A transform if/else must have an else.
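The two positions can also appear together in one comprehension. A minimal sketch, using an illustrative scores list and a made-up "retake" label:

```python
scores = [45, 92, 78, 55, 88, 31]

# The trailing `if` filters first (scores below 40 are dropped),
# then the if/else expression picks a label for each remaining score.
labels = [
    f"{s}: pass" if s >= 60 else f"{s}: retake"
    for s in scores
    if s >= 40
]
print(labels)
# ['45: retake', '92: pass', '78: pass', '55: retake', '88: pass']
```

Once a comprehension needs both positions plus anything else, a regular loop is often easier to read.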
📖 Dictionary Comprehensions
Same idea as list comprehensions, but produce a dictionary. Use curly braces with a key: value expression:
# Basic: {key_expr: value_expr for item in iterable}
# Square numbers as key-value pairs
squares = {x: x ** 2 for x in range(6)}
print(squares)
# {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25}
# Word lengths
words = ["hello", "world", "python", "is", "great"]
word_lengths = {w: len(w) for w in words}
print(word_lengths)
# {'hello': 5, 'world': 5, 'python': 6, 'is': 2, 'great': 5}
Transforming Dictionaries
# Invert a dictionary (swap keys and values)
original = {"a": 1, "b": 2, "c": 3}
inverted = {v: k for k, v in original.items()}
print(inverted) # {1: 'a', 2: 'b', 3: 'c'}
# Filter a dictionary
scores = {"Alice": 95, "Bob": 67, "Carlos": 82, "Dana": 58, "Eve": 91}
honors = {name: score for name, score in scores.items() if score >= 80}
print(honors) # {'Alice': 95, 'Carlos': 82, 'Eve': 91}
# Transform values
prices_usd = {"widget": 9.99, "gadget": 24.99, "doohickey": 4.99}
prices_eur = {item: round(price * 0.92, 2) for item, price in prices_usd.items()}
print(prices_eur)
# {'widget': 9.19, 'gadget': 22.99, 'doohickey': 4.59}
Building from Parallel Lists
# zip() pairs up two lists
names = ["Alice", "Bob", "Carlos"]
ages = [30, 25, 35]
people = {name: age for name, age in zip(names, ages)}
print(people) # {'Alice': 30, 'Bob': 25, 'Carlos': 35}
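One behavior worth knowing when pairing lists this way: zip() stops at the shorter input. A small sketch with deliberately mismatched, illustrative data:

```python
names = ["Alice", "Bob", "Carlos"]
ages = [30, 25]  # one age missing (illustrative data)

# zip() stops at the shorter input, so "Carlos" is silently dropped
people = {name: age for name, age in zip(names, ages)}
print(people)  # {'Alice': 30, 'Bob': 25}

# With no transformation or filter, dict(zip(...)) builds the same mapping
same = dict(zip(names, ages)) == people
print(same)  # True
```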
🔵 Set Comprehensions
Curly braces without a colon produce a set — automatically deduplicating values:
# Basic set comprehension
squares = {x ** 2 for x in range(-5, 6)}
print(squares) # {0, 1, 4, 9, 16, 25} — no duplicates!
# Extract unique first letters
names = ["Alice", "Anna", "Bob", "Barbara", "Carlos", "Cathy"]
initials = {name[0] for name in names}
print(initials) # {'A', 'B', 'C'} (set order may vary)
# Unique word lengths in a sentence
sentence = "the quick brown fox jumps over the lazy dog"
unique_lengths = {len(w) for w in sentence.split()}
print(sorted(unique_lengths)) # [3, 4, 5]
🧠 Set vs. List Comprehension
Use a set comprehension when you need unique values and don't care about order. Use a list comprehension when order matters or duplicates are meaningful. Sets also give you fast membership testing with the in operator: O(1) on average instead of O(n) for a list.
Decision guide: what are you building?
- Ordered collection with possible duplicates: [expr for x in iter] (list comprehension)
- Key-value mapping: {k: v for x in iter} (dict comprehension)
- Unique values only: {expr for x in iter} (set comprehension)
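To make the membership-testing point concrete, here is a small sketch. The names allowed_ids and incoming are hypothetical and only serve the illustration:

```python
# Build the lookup collection once as a set (hypothetical even-numbered IDs)
allowed_ids = {n for n in range(0, 1000, 2)}
incoming = [3, 8, 8, 15, 42, 42, 999]

# Each `in` check against a set is a hash lookup, not a scan
accepted = [n for n in incoming if n in allowed_ids]
print(accepted)  # [8, 8, 42, 42]  (list keeps order and duplicates)

# Same filter as a set comprehension: duplicates collapse
unique_accepted = {n for n in incoming if n in allowed_ids}
print(sorted(unique_accepted))  # [8, 42]
```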
🔄 Nested Comprehensions
You can nest for clauses in a comprehension. The order matches the equivalent nested loop — outer loop first, inner loop second:
# Flatten a 2D list
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Nested loop version
flat = []
for row in matrix:
    for val in row:
        flat.append(val)
# Comprehension version — same order as the loop
flat = [val for row in matrix for val in row]
print(flat) # [1, 2, 3, 4, 5, 6, 7, 8, 9]
Cartesian Product
# All combinations of suit and rank
suits = ["♠", "♥", "♦", "♣"]
ranks = ["A", "K", "Q", "J", "10"]
cards = [f"{rank}{suit}" for suit in suits for rank in ranks]
print(cards[:8])
# ['A♠', 'K♠', 'Q♠', 'J♠', '10♠', 'A♥', 'K♥', 'Q♥']
print(f"Total: {len(cards)} cards")
# Total: 20 cards
Nested with Filter
# Find all pairs where the sum is 10
pairs = [(a, b) for a in range(1, 10) for b in range(a, 10) if a + b == 10]
print(pairs)
# [(1, 9), (2, 8), (3, 7), (4, 6), (5, 5)]
⚠️ Readability Limit
Two levels of nesting is usually the limit for comprehension readability. If you need three or more loops, or complex logic inside, use a regular for loop. Comprehensions are meant to be clearer than loops — if they're harder to read, switch back.
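As a rough illustration of that limit, the sketch below writes the same three-level traversal both ways. The grid data is made up for the example:

```python
# Illustrative 3D data: layers -> rows -> values
grid = [[[1, -2], [3, 4]], [[-5, 6], [7, -8]]]

# Three for clauses plus a filter: legal, but hard to scan
flat = [v for layer in grid for row in layer for v in row if v > 0]

# The equivalent loop is longer, but each step is easy to follow
positives = []
for layer in grid:
    for row in layer:
        for v in row:
            if v > 0:
                positives.append(v)

print(positives)          # [1, 3, 4, 6, 7]
print(flat == positives)  # True
```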
🔋 Generator Functions & yield
A list comprehension builds the entire list in memory at once. What if you're processing a million records? You'd need a million items in RAM. Generators solve this by producing values one at a time, on demand.
Your First Generator
def count_up_to(n):
    """Generate numbers from 1 to n."""
    i = 1
    while i <= n:
        yield i  # Pause here, produce i, resume when next() is called
        i += 1
# Create a generator object (nothing executes yet!)
gen = count_up_to(5)
print(type(gen)) # <class 'generator'>
# Pull values one at a time
print(next(gen)) # 1
print(next(gen)) # 2
print(next(gen)) # 3
# Or use in a for loop (most common)
for num in count_up_to(5):
    print(num, end=" ")
# 1 2 3 4 5
📖 How yield Works
When Python sees yield in a function, it becomes a generator function. Calling it doesn't execute the body — it returns a generator object. Each call to next() runs the function until the next yield, produces that value, and freezes the function's state (local variables, position). The next next() call resumes from exactly where it left off.
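The pause-and-resume behavior is easier to see when the generator records what it is doing. A minimal sketch, where two_steps and log are hypothetical names introduced only for this trace:

```python
log = []

def two_steps():
    """Hypothetical generator that records each stage of its execution."""
    log.append("started")
    yield "first"
    log.append("resumed")
    yield "second"
    log.append("finished")

gen = two_steps()
print(log)        # []  (calling the function has not run the body)

print(next(gen))  # first
print(log)        # ['started']

print(next(gen))  # second
print(log)        # ['started', 'resumed']

try:
    next(gen)     # no more yields: the body runs to its end
except StopIteration:
    print("exhausted")
print(log)        # ['started', 'resumed', 'finished']
```

A for loop handles the StopIteration for you, which is why generators are usually consumed that way.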
Practical Generator: Reading Large Files
def read_large_csv(filepath):
    """Yield one parsed row at a time; never loads the entire file."""
    with open(filepath, "r", encoding="utf-8") as f:
        header = f.readline().strip().split(",")
        for line in f:
            values = line.strip().split(",")
            yield dict(zip(header, values))
# Process a million-row file with constant memory usage
for row in read_large_csv("huge_data.csv"):
    if float(row["amount"]) > 1000:
        print(f"Large transaction: {row['id']} - ${row['amount']}")
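Splitting on "," breaks when a field itself contains a comma inside quotes. One possible variation, sketched here under the assumption that the data is standard quoted CSV, hands parsing to the standard library's csv.DictReader while staying lazy. The function name read_csv_rows and the sample data are illustrative; io.StringIO stands in for an open file:

```python
import csv
import io

def read_csv_rows(lines):
    """Yield one dict per row, letting the csv module handle quoting."""
    reader = csv.DictReader(lines)  # accepts any iterable of text lines
    for row in reader:
        yield row

# In-memory stand-in for a file; note the quoted comma in the first row
sample = io.StringIO(
    'id,description,amount\n'
    'T1,"Laptop, 15 inch",1500.00\n'
    'T2,Coffee,4.50\n'
)

large = [row["id"] for row in read_csv_rows(sample) if float(row["amount"]) > 1000]
print(large)  # ['T1']
```

With a real file, the csv documentation recommends opening it with newline="" before passing it to the reader.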
Generator vs. List: Memory Comparison
import sys
# List: stores ALL 1 million numbers in memory
numbers_list = [x ** 2 for x in range(1_000_000)]
print(f"List size: {sys.getsizeof(numbers_list):,} bytes")
# List size: 8,448,728 bytes (~8 MB)
# Generator: stores only the recipe, not the values
numbers_gen = (x ** 2 for x in range(1_000_000))
print(f"Generator size: {sys.getsizeof(numbers_gen):,} bytes")
# Generator size: 200 bytes (!!)
✅ Key Insight
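The list object alone takes ~8 MB, and that figure counts only the list's internal array of references: sys.getsizeof() does not include the integer objects the list points to, so the true footprint is larger. The equivalent generator takes roughly 200 bytes (the exact number varies by Python version) no matter how many values it will produce. It computes each value on the fly and discards it after use. This is lazy evaluation: values are computed only when needed.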
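For a rough sense of the full cost of a list, you can add up the sizes of its elements as well. This is only an approximation (it ignores sharing of cached small integers), and the numbers depend on the Python version, so no expected output is shown:

```python
import sys

n = 100_000
numbers_list = [x ** 2 for x in range(n)]
numbers_gen = (x ** 2 for x in range(n))

# getsizeof() is shallow: for a list it measures the array of references
list_container = sys.getsizeof(numbers_list)
# Approximate size of the int objects those references point to
list_elements = sum(sys.getsizeof(v) for v in numbers_list)
gen_size = sys.getsizeof(numbers_gen)

print(f"List container: {list_container:,} bytes")
print(f"List elements:  {list_elements:,} bytes (approximate)")
print(f"Generator:      {gen_size:,} bytes")
```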
Generators Are Single-Use
gen = (x ** 2 for x in range(5))
# First pass: works fine
print(list(gen)) # [0, 1, 4, 9, 16]
# Second pass: empty! Generator is exhausted
print(list(gen)) # []
# If you need multiple passes, use a list or re-create the generator
⚠️ Single-Use Pitfall
Generators can only be iterated once. After exhaustion, they produce no more values. If you need to iterate multiple times, either convert to a list first (list(gen)) or create a new generator each time.
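One way to "create a new generator each time" without repeating yourself is to wrap the recipe in an object whose __iter__ returns a fresh generator. A minimal sketch; the Squares class is a hypothetical example, not a standard library type:

```python
class Squares:
    """Hypothetical re-iterable: every loop gets a fresh generator."""

    def __init__(self, n):
        self.n = n

    def __iter__(self):
        # Called each time iteration starts, so nothing is ever "used up"
        return (x ** 2 for x in range(self.n))

squares = Squares(5)
first_pass = list(squares)
second_pass = list(squares)
print(first_pass)   # [0, 1, 4, 9, 16]
print(second_pass)  # [0, 1, 4, 9, 16]
```

This keeps the laziness of a generator while allowing repeated passes, at the cost of recomputing the values on each pass.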
🌀 Generator Expressions
A generator expression looks like a list comprehension but with parentheses instead of brackets. It creates a generator, not a list:
# List comprehension — builds the entire list
squares_list = [x ** 2 for x in range(10)]
# Generator expression — produces values lazily
squares_gen = (x ** 2 for x in range(10))
# Both work with for loops
for s in squares_gen:
    print(s, end=" ")
# 0 1 4 9 16 25 36 49 64 81
Passing Directly to Functions
Generator expressions are perfect for passing directly to functions that consume iterables. You can drop the extra parentheses when the generator is the only argument:
# Sum of squares — no intermediate list needed
total = sum(x ** 2 for x in range(1000))
print(total) # 332833500
# Largest word length
words = ["Python", "is", "remarkably", "elegant"]
longest = max(len(w) for w in words)
print(longest) # 10
# Check if any value passes a test
scores = [45, 67, 82, 55, 91]
has_honor = any(s >= 90 for s in scores)
print(has_honor) # True (stops at first True — efficient!)
# Check if ALL pass
all_passing = all(s >= 60 for s in scores)
print(all_passing) # False (stops at first False)
🧠 Short-Circuit Evaluation
any() and all() with generators are especially efficient because they short-circuit. any() stops as soon as it finds a True value, and all() stops as soon as it finds a False. With a generator, items after the short-circuit point are never computed at all.
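The difference is easy to observe by recording which items actually get tested. In this sketch, is_honor and checked are hypothetical helpers added for the demonstration:

```python
checked = []

def is_honor(score):
    """Record every score that is actually examined."""
    checked.append(score)
    return score >= 90

scores = [45, 67, 95, 55, 91]

# Generator expression: any() stops pulling values at the first True
result_lazy = any(is_honor(s) for s in scores)
checked_lazy = list(checked)
print(result_lazy, checked_lazy)    # True [45, 67, 95]

# List comprehension: every score is tested before any() even starts
checked.clear()
result_eager = any([is_honor(s) for s in scores])
checked_eager = list(checked)
print(result_eager, checked_eager)  # True [45, 67, 95, 55, 91]
```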
Joining Strings Efficiently
# Generate formatted lines and join them
names = ["Alice", "Bob", "Carlos"]
scores = [95, 87, 92]
report = "\n".join(
f" {name}: {score}/100"
for name, score in zip(names, scores)
)
print(report)
# Alice: 95/100
# Bob: 87/100
# Carlos: 92/100
🔗 Chaining Generators into Pipelines
The real power of generators shows when you chain them together. Each stage processes one item at a time and passes it to the next — like an assembly line. The entire pipeline runs with constant memory regardless of input size.
def read_lines(filepath):
    """Stage 1: Yield lines from a file."""
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            yield line.strip()

def filter_nonempty(lines):
    """Stage 2: Skip blank lines."""
    for line in lines:
        if line:
            yield line

def parse_csv_row(lines):
    """Stage 3: Split each line into fields."""
    for line in lines:
        yield line.split(",")

def filter_high_value(rows, threshold=1000):
    """Stage 4: Keep rows where amount > threshold."""
    for row in rows:
        if len(row) >= 3 and float(row[2]) > threshold:
            yield row

# Chain the pipeline: nothing executes until we iterate!
lines = read_lines("transactions.csv")
nonempty = filter_nonempty(lines)
rows = parse_csv_row(nonempty)
high_value = filter_high_value(rows, threshold=5000)
# NOW it runs: one item flows through the entire chain at a time
for row in high_value:
    print(f" {row[0]}: ${row[2]}")
Pipeline with Generator Expressions
For simpler transformations, you can chain generator expressions:
# Process log file: find error messages, extract timestamps
with open("server.log", "r") as f:
    lines = (line.strip() for line in f)
    errors = (line for line in lines if "ERROR" in line)
    timestamps = (line.split()[0] for line in errors)
    for ts in timestamps:
        print(f"Error at: {ts}")
📖 Why Pipelines Are Powerful
Memory: Only one item exists in memory at any point — process terabytes with megabytes of RAM.
Composability: Each stage is a simple function that does one thing. Mix, match, and reorder stages easily.
Testability: Each stage can be tested independently with a small list.
Laziness: Nothing executes until you consume the final generator — setup is instant.
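The testability point can be sketched like this: because a stage accepts any iterable, a short list can stand in for the file. The stage below repeats filter_high_value from the pipeline so the snippet runs on its own; the sample rows are invented for the test:

```python
def filter_high_value(rows, threshold=1000):
    """Keep rows whose third field, read as a number, exceeds threshold."""
    for row in rows:
        if len(row) >= 3 and float(row[2]) > threshold:
            yield row

# A small list replaces the upstream stages (made-up sample rows)
sample_rows = [
    ["T1", "tech", "7500.00"],
    ["T2", "food", "12.50"],
    ["T3", "travel"],            # too short: skipped by the len() check
    ["T4", "tech", "5000.00"],   # equal to the threshold: not strictly greater
]

result = list(filter_high_value(sample_rows, threshold=5000))
print(result)  # [['T1', 'tech', '7500.00']]
```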
🤔 When to Use What
| Situation | Use | Why |
|---|---|---|
| Simple transform/filter, need all results | List comprehension | Concise, readable, and you can iterate multiple times |
| Build a lookup table or mapping | Dict comprehension | Creates key-value pairs in one expression |
| Need unique values | Set comprehension | Auto-deduplication |
| Large dataset / stream processing | Generator (function or expression) | Constant memory, lazy evaluation |
| Feeding sum(), max(), any(), all() | Generator expression | No need to store the list; feed values directly |
| Complex logic, side effects, state | Regular for loop | Readability trumps brevity |
| Multi-stage data processing | Generator pipeline | Composable, testable, memory-efficient |
# ✅ Good: Simple transform → list comprehension
names = [user.name.upper() for user in users]
# ✅ Good: Feeding a function → generator expression
total = sum(order.total for order in orders if order.status == "complete")
# ✅ Good: Complex logic → regular loop
results = []
for item in inventory:
    if item.quantity == 0:
        send_restock_alert(item)
        continue
    price = calculate_discount(item)
    results.append({"item": item.name, "price": price})
# ❌ Bad: Too complex for a comprehension
# result = [{"item": i.name, "price": calculate_discount(i)}
# for i in inventory if i.quantity > 0
# if not send_restock_alert(i)] # Side effects in comprehension!
🏋️ Hands-on Exercises
🏋️ Exercise 1: Comprehension Workout
Objective: Practice all three types of comprehensions with filters and transforms.
Requirements:
- Given a list of dictionaries representing students, use a list comprehension to create a list of formatted strings for students with GPA ≥ 3.5
- Use a dict comprehension to create a name-to-GPA lookup for all students
- Use a set comprehension to find all unique majors
- Use a nested comprehension to flatten the course lists
Starter Code:
students = [
{"name": "Alice", "gpa": 3.9, "major": "CS", "courses": ["Python", "Algorithms"]},
{"name": "Bob", "gpa": 3.2, "major": "Math", "courses": ["Calculus", "Stats"]},
{"name": "Carlos", "gpa": 3.7, "major": "CS", "courses": ["Python", "Databases"]},
{"name": "Dana", "gpa": 2.8, "major": "Bio", "courses": ["Chemistry", "Genetics"]},
{"name": "Eve", "gpa": 4.0, "major": "CS", "courses": ["ML", "Python", "Ethics"]},
{"name": "Frank", "gpa": 3.5, "major": "Math", "courses": ["Stats", "Linear Algebra"]},
]
# 1. Honor roll: list of "Name (GPA)" strings for GPA >= 3.5
honor_roll = ...  # TODO
# 2. GPA lookup: {name: gpa} dict for all students
gpa_lookup = ...  # TODO
# 3. Unique majors: set of all majors
majors = ...  # TODO
# 4. All courses: flat list of every course (with duplicates)
all_courses = ...  # TODO
# 5. Unique courses: set of all unique courses
unique_courses = ...  # TODO
print("Honor Roll:", honor_roll)
print("GPA Lookup:", gpa_lookup)
print("Majors:", majors)
print("All Courses:", all_courses)
print("Unique Courses:", unique_courses)
💡 Hint
For the honor roll, filter with if s["gpa"] >= 3.5. For the flat course list, use a nested comprehension: [course for s in students for course in s["courses"]]. For unique courses, wrap it in a set comprehension instead.
✅ Solution
students = [
{"name": "Alice", "gpa": 3.9, "major": "CS", "courses": ["Python", "Algorithms"]},
{"name": "Bob", "gpa": 3.2, "major": "Math", "courses": ["Calculus", "Stats"]},
{"name": "Carlos", "gpa": 3.7, "major": "CS", "courses": ["Python", "Databases"]},
{"name": "Dana", "gpa": 2.8, "major": "Bio", "courses": ["Chemistry", "Genetics"]},
{"name": "Eve", "gpa": 4.0, "major": "CS", "courses": ["ML", "Python", "Ethics"]},
{"name": "Frank", "gpa": 3.5, "major": "Math", "courses": ["Stats", "Linear Algebra"]},
]
# 1. Honor roll
honor_roll = [f"{s['name']} ({s['gpa']})" for s in students if s["gpa"] >= 3.5]
# 2. GPA lookup
gpa_lookup = {s["name"]: s["gpa"] for s in students}
# 3. Unique majors
majors = {s["major"] for s in students}
# 4. All courses (flat, with duplicates)
all_courses = [course for s in students for course in s["courses"]]
# 5. Unique courses
unique_courses = {course for s in students for course in s["courses"]}
print("Honor Roll:", honor_roll)
# ['Alice (3.9)', 'Carlos (3.7)', 'Eve (4.0)', 'Frank (3.5)']
print("GPA Lookup:", gpa_lookup)
# {'Alice': 3.9, 'Bob': 3.2, 'Carlos': 3.7, 'Dana': 2.8, 'Eve': 4.0, 'Frank': 3.5}
print("Majors:", majors)
# {'CS', 'Math', 'Bio'}
print("All Courses:", all_courses)
# ['Python', 'Algorithms', 'Calculus', 'Stats', 'Python', 'Databases',
# 'Chemistry', 'Genetics', 'ML', 'Python', 'Ethics', 'Stats', 'Linear Algebra']
print("Unique Courses:", unique_courses)
# {'Python', 'Algorithms', 'Calculus', 'Stats', 'Databases',
# 'Chemistry', 'Genetics', 'ML', 'Ethics', 'Linear Algebra'}
🏋️ Exercise 2: Lazy Data Pipeline
Objective: Build a generator pipeline that processes data stage by stage with constant memory.
Requirements:
- Write a generator generate_transactions(n) that yields n random transactions (dicts with id, amount, category)
- Write a generator filter_category(transactions, category) that yields only matching transactions
- Write a generator apply_tax(transactions, rate) that adds a tax and total field to each transaction
- Chain all three into a pipeline and process 1 million transactions without running out of memory
- Use sum() with a generator expression to compute the total revenue
💡 Hint
Use random.choice() for categories and random.uniform() for amounts. Each generator takes the previous generator as its input and yields modified items. The key insight: at any moment, only one transaction exists in memory across the entire pipeline.
✅ Solution
import random

def generate_transactions(n):
    """Stage 1: Generate n random transactions."""
    categories = ["food", "tech", "clothing", "books", "travel"]
    for i in range(n):
        yield {
            "id": f"TXN-{i:07d}",
            "amount": round(random.uniform(1, 500), 2),
            "category": random.choice(categories),
        }

def filter_category(transactions, category):
    """Stage 2: Keep only transactions in the given category."""
    for txn in transactions:
        if txn["category"] == category:
            yield txn

def apply_tax(transactions, rate=0.08):
    """Stage 3: Add tax and total fields."""
    for txn in transactions:
        txn["tax"] = round(txn["amount"] * rate, 2)
        txn["total"] = round(txn["amount"] + txn["tax"], 2)
        yield txn

# Build the pipeline (instant: nothing executes yet)
NUM_TRANSACTIONS = 1_000_000
all_txns = generate_transactions(NUM_TRANSACTIONS)
tech_only = filter_category(all_txns, "tech")
with_tax = apply_tax(tech_only, rate=0.10)
# Process: count and sum totals in a single pass
count = 0
total_revenue = 0.0
for txn in with_tax:
    count += 1
    total_revenue += txn["total"]
print(f"Processed {NUM_TRANSACTIONS:,} transactions")
print(f"Tech transactions: {count:,}")
print(f"Total tech revenue: ${total_revenue:,.2f}")
print(f"Average tech sale: ${total_revenue / count:,.2f}")
# Memory usage is constant regardless of NUM_TRANSACTIONS!
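The last requirement asks for sum() with a generator expression. One possible variant is sketched below; it is self-contained, so it repeats a transaction generator, with a seed parameter added as an assumption to make runs repeatable and a smaller n to keep it quick. Because the random values depend on the seed, no expected output is shown:

```python
import random

def generate_transactions(n, seed=42):
    """Yield n random transactions (seed is an added, illustrative parameter)."""
    rng = random.Random(seed)
    categories = ["food", "tech", "clothing", "books", "travel"]
    for i in range(n):
        yield {
            "id": f"TXN-{i:07d}",
            "amount": round(rng.uniform(1, 500), 2),
            "category": rng.choice(categories),
        }

# Filter and apply a 10% tax inside one generator expression
tech_totals = (
    round(txn["amount"] * 1.10, 2)
    for txn in generate_transactions(100_000)
    if txn["category"] == "tech"
)

# sum() pulls one total at a time; no list of totals is ever built
total_revenue = sum(tech_totals)
print(f"Total tech revenue: ${total_revenue:,.2f}")
```

The trade-off is that sum() alone does not give you the count; when you need both in one pass, the explicit loop in the solution is the simpler choice.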
🎯 Quick Quiz
Question 1: What's the key difference between [x**2 for x in range(10)] and (x**2 for x in range(10))?
Question 2: Where does the if go to filter out items in a list comprehension?
Question 3: What happens when you iterate over a generator a second time?
📏 Best Practices
✅ Do's
- Use comprehensions for simple transforms: they're more readable than the equivalent loop for straightforward cases
- Use generators for large data: whenever you're processing files, API responses, or any data that might not fit in memory
- Prefer generator expressions with sum(), any(), all(), max(), min(): there is no reason to build a list just to aggregate it
- Name your generators descriptively: active_users is better than gen
- Keep comprehensions to one or two lines: if it wraps past two lines, use a loop
❌ Don'ts
- Don't put side effects in comprehensions: [print(x) for x in items] builds a list of None values. Use a loop for side effects
- Don't nest more than two levels: deeply nested comprehensions are harder to read than loops
- Don't use a list comprehension when you only need to iterate once: use a generator expression to save memory
- Don't forget that generators are single-use: if you list(gen) and then iterate gen again, you'll get nothing
- Don't sacrifice readability for cleverness: a clear 4-line loop beats an obscure 1-line comprehension
💡 Pro Tips
- yield from delegates to another generator: yield from other_gen is cleaner than looping over it
- Use itertools.chain() to concatenate multiple generators without nesting
- The collections.Counter constructor accepts any iterable, including generators: Counter(word for line in f for word in line.split())
- In Python 3.12+, comprehensions are inlined into the enclosing function (PEP 709), which makes them faster. The loop variable still stays local to the comprehension and does not leak into the enclosing scope
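The first three tips can be tried in a few lines. The generator names evens, odds, and evens_then_odds are hypothetical, and a small list of strings stands in for the file f:

```python
from collections import Counter
from itertools import chain

def evens(limit):
    for n in range(0, limit, 2):
        yield n

def odds(limit):
    for n in range(1, limit, 2):
        yield n

def evens_then_odds(limit):
    """yield from hands control to each sub-generator in turn."""
    yield from evens(limit)
    yield from odds(limit)

delegated = list(evens_then_odds(6))
chained = list(chain(evens(6), odds(6)))
print(delegated)  # [0, 2, 4, 1, 3, 5]
print(chained)    # [0, 2, 4, 1, 3, 5]

# Counter consumes the generator expression one word at a time
lines = ["the quick fox", "the lazy dog", "the end"]
counts = Counter(word for line in lines for word in line.split())
print(counts.most_common(1))  # [('the', 3)]
```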
📝 Summary
🎉 Key Takeaways
- List comprehensions, [expr for x in iter if cond]: concise, readable data transformation
- Dict comprehensions, {k: v for x in iter}: build mappings in one line
- Set comprehensions, {expr for x in iter}: auto-deduplicate values
- Generator functions: use yield to produce values lazily, one at a time
- Generator expressions, (expr for x in iter): lazy comprehensions that save memory
- Pipelines: chain generators for composable, memory-efficient data processing
- Generators are single-use: once exhausted, they produce nothing
| Syntax | Result | Memory |
|---|---|---|
| [x**2 for x in range(n)] | List | O(n), all values stored |
| {x: x**2 for x in range(n)} | Dict | O(n), all pairs stored |
| {x**2 for x in range(n)} | Set | O(n), unique values stored |
| (x**2 for x in range(n)) | Generator | O(1), one value at a time |
| def f(): yield ... | Generator function | O(1), lazy and stateful |
📚 Additional Resources
- Python Docs — List Comprehensions
- Python Functional Programming HOWTO — Generators
- Python Docs — Generator Expressions
- PEP 289 — Generator Expressions
🚀 What's Next?
In the next lesson, we'll explore Itertools & Functional Patterns — Python's powerful itertools module for advanced iteration, plus functools tools like lru_cache, reduce, and partial.
🎉 Level Up!
You can now write concise Pythonic data transformations and build memory-efficient processing pipelines. Combined with file I/O and error handling, you're ready to tackle real-world data at any scale.