I want to start by saying my Pandas code worked. None of this is about code that was broken — it's about code that was correct but slow, verbose, and would make any experienced data scientist wince.
After joining a team where code review was standard practice for data science notebooks, I got some educational feedback about how I was using the library. Here's what I was doing wrong and the patterns that replaced each mistake.
Mistake 1: Using iterrows() for Row Operations
My original approach to "do something for each row":
# My old way
results = []
for idx, row in df.iterrows():
    results.append(process_row(row['value_a'], row['value_b']))
df['result'] = results
iterrows() is the slowest way to operate on a DataFrame. It converts each row to a Series and iterates in Python, bypassing all of Pandas' vectorised optimisations. For a 1 million row DataFrame, this can take minutes where vectorised operations take seconds.
The replacement depends on what you're doing. For mathematical operations, use vectorised operators directly (df['result'] = df['value_a'] * df['value_b']). For more complex logic, apply() with a lambda is still slow but better than iterrows(), and np.where() is much faster for conditional logic (np.vectorize(), despite the name, is essentially a Python loop in a convenient wrapper, so don't expect it to match true column operations).
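As a rough sketch of those options (the product, flag, and fallback column names here are just illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "value_a": np.random.rand(1_000_000),
    "value_b": np.random.rand(1_000_000),
})

# Vectorised arithmetic: one whole-column operation, no Python loop
df["product"] = df["value_a"] * df["value_b"]

# Conditional logic: np.where evaluates the mask once over the arrays
df["flag"] = np.where(df["value_a"] > df["value_b"], "a_wins", "b_wins")

# apply() as a fallback when the logic genuinely needs Python per row;
# better than iterrows(), but still a row-at-a-time loop underneath
df["fallback"] = df.apply(lambda row: row["value_a"] - row["value_b"], axis=1)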
Mistake 2: Chained Indexing
# My old way (triggers SettingWithCopyWarning)
df[df['category'] == 'A']['value'] = 100
This looks like it should work and sometimes appears to, but the first selection may return a copy of the filtered rows rather than a view of the original. The assignment then lands on that copy, so whether the original DataFrame changes depends on Pandas' internal decisions, which is exactly as unreliable as it sounds.
The correct pattern: use .loc with a combined boolean mask and column selector: df.loc[df['category'] == 'A', 'value'] = 100. Single operation, always modifies the original.
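Here's a minimal sketch of that pattern on a toy DataFrame, just to show that the original is modified in place:

import pandas as pd

df = pd.DataFrame({"category": ["A", "B", "A"], "value": [1, 2, 3]})

# One indexing operation: boolean mask for the rows, label for the column
df.loc[df["category"] == "A", "value"] = 100

print(df)
#   category  value
# 0        A    100
# 1        B      2
# 2        A    100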
Mistake 3: Loading Entire Datasets When You Only Need Columns
For years I loaded entire CSV files and then immediately selected a few columns. With read_csv()'s usecols parameter, you can specify which columns to load upfront. For a 200-column dataset where you only need 8 columns, this can reduce memory usage by 90%+.
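A sketch of what that looks like (the file name and column names here are hypothetical):

import pandas as pd

# Only these columns are parsed and kept in memory; the rest of the
# file's columns are skipped at read time.
wanted = ["order_id", "customer_id", "order_date", "total"]
df = pd.read_csv("orders.csv", usecols=wanted, parse_dates=["order_date"])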
Mistake 4: Not Using Categorical Dtypes for Low-Cardinality String Columns
A column with a string dtype holding values like "North", "South", "East", "West" stores the full string for every row. Converting it to a Categorical dtype stores an integer code per row and a lookup table for the categories. For a 10 million row DataFrame, this kind of column goes from ~600MB of memory to ~10MB.
df['region'] = df['region'].astype('category')
It also makes groupby operations faster because Pandas can optimise on the integer codes rather than string comparisons.
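You can check the savings yourself with memory_usage(deep=True); a rough sketch, assuming a 10 million row region column like the one above:

import pandas as pd

regions = pd.Series(["North", "South", "East", "West"] * 2_500_000)

# deep=True counts the string objects themselves, not just the pointers
print(regions.memory_usage(deep=True))                     # hundreds of MB
print(regions.astype("category").memory_usage(deep=True))  # roughly 10 MB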
The Broader Lesson
Most Pandas performance problems come from working against the library's design rather than with it. Pandas is optimised for vectorised operations on entire columns — the moment you start looping through rows in Python, you're taking the slow path. The fastest code usually looks like column operations and method chaining, not loops and temporary variables.
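To give a feel for that shape, here's a sketch that pulls the earlier ideas into one chained pipeline (the file and column names are made up):

import pandas as pd

# Every step is a whole-column operation; no loops, no temporaries.
result = (
    pd.read_csv("sales.csv", usecols=["region", "units", "unit_price"])
      .assign(revenue=lambda d: d["units"] * d["unit_price"])
      .astype({"region": "category"})
      .groupby("region", observed=True)["revenue"]
      .sum()
      .sort_values(ascending=False)
)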