The idea was simple: I get around 80 emails a day, probably 60 of which require a predictable response or action. Why am I doing that manually?
So in February I spent a weekend building an n8n workflow that: reads every incoming email, classifies it, drafts a reply for ones that fit a pattern, files it into the right folder, and flags anything that needs my actual attention. I ran it for 30 days. Here's the honest account.
The Setup (Brief Version)
I used n8n with a Gmail trigger, a Claude API call for classification and drafting, and some conditional logic to route different email types. The whole workflow took about six hours to build, including testing (not counting the two days I later spent fixing the edge cases I'd missed).
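If you're curious what the classification step boils down to outside the n8n UI, here's a rough standalone sketch using the Anthropic TypeScript SDK. The category names and the model string are placeholders for illustration, not a prescription:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // expects ANTHROPIC_API_KEY in the environment

// Illustrative categories -- yours will differ.
const CATEGORIES = ["newsletter", "notification", "meeting", "billing", "question", "needs_human"] as const;
type Category = (typeof CATEGORIES)[number];

async function classifyEmail(subject: string, body: string): Promise<Category> {
  const response = await client.messages.create({
    model: "claude-3-5-sonnet-latest", // placeholder; any recent model works
    max_tokens: 10,
    system:
      `Classify the email into exactly one of: ${CATEGORIES.join(", ")}. ` +
      `Reply with the category name only. If unsure, reply needs_human.`,
    messages: [{ role: "user", content: `Subject: ${subject}\n\n${body}` }],
  });

  const block = response.content[0];
  const label = block.type === "text" ? block.text.trim() : "";
  // Anything the model can't label cleanly falls back to human review.
  return (CATEGORIES as readonly string[]).includes(label) ? (label as Category) : "needs_human";
}
```

In the actual workflow this lives in an HTTP Request node plus a Switch node, but the logic is the same: one call in, one category out.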
Week One: Embarrassingly Good at Simple Things
The agent handled newsletter subscriptions, automated notifications, meeting confirmations, and simple factual questions ("when is your next availability?") with impressive accuracy. For these categories, it drafted responses I'd approve without edits maybe 85% of the time.
I felt like a productivity genius. I told two colleagues about it. I probably shouldn't have done that yet.
Week Two: The First Disaster
A client sent an email that started with "Quick question about the invoice..." and the agent classified it as a routine billing inquiry. It was not a routine billing inquiry. The client had actually made an offhand comment about the invoice and then spent three paragraphs describing a critical bug in the software I'd delivered. The agent drafted a response to the invoice question and ignored the bug entirely.
The classification prompt I'd written had "invoice" and "billing" as high-confidence signals for the billing category. I'd never thought about what happens when an email mentions billing but is actually about something else entirely. Lesson: classification by keyword is a trap. You need the model to understand intent, not just match terms.
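The fix, roughly, was to make the prompt enumerate every request in the email before picking the dominant one, and to bail out to a human whenever there's more than one real issue. A sketch of the revised prompt shape; the JSON fields are just whatever happens to be easy to route on in n8n:

```typescript
// Revised classification prompt: surface every distinct request, then pick
// the one that actually needs action, instead of keying off trigger words.
const CLASSIFY_SYSTEM_PROMPT = `
You are triaging email. Do not classify based on keywords alone.

1. List every distinct request or issue raised in the email.
2. Decide which one is the sender's primary intent.
3. If any issue sounds urgent, technical, or emotionally charged, route to a human.

Respond with JSON only:
{
  "issues": ["short description of each request"],
  "primary_intent": "billing" | "bug_report" | "scheduling" | "question" | "other",
  "needs_human": true | false
}
`;

// Downstream, anything with more than one issue or needs_human=true skips
// auto-drafting and lands in the review pile.
interface Classification {
  issues: string[];
  primary_intent: "billing" | "bug_report" | "scheduling" | "question" | "other";
  needs_human: boolean;
}
```

With that in place, the invoice-plus-bug email gets two issues listed and goes straight to me instead of getting a cheerful reply about billing.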
Week Three: Tone Problems
The drafts were accurate but weirdly formal. I'm naturally pretty casual in emails — I use contractions, I sometimes start replies with "yep, totally —" and I don't sign off with "Best regards." The agent's drafts sounded like a 1997 email client's auto-responder. I had to add significant tone guidance to the prompt and feed it several examples of my actual emails to calibrate it. After that, it improved considerably, but it still occasionally slips into corporate-ese when I'm not paying attention.
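The tone fix was almost entirely prompt work: a short style guide plus a handful of my actual sent replies as few-shot examples. Something along these lines, though the examples below are invented stand-ins rather than my real emails:

```typescript
// Tone calibration: a short style guide plus a few sent emails as examples.
const TONE_GUIDE = `
Write replies the way I do:
- Casual but clear. Contractions are fine ("I'll", "can't").
- Openers like "yep, totally" or "good question" are fine.
- No "Best regards", no "I hope this email finds you well".
- Short paragraphs. Get to the point in the first sentence.
`;

const STYLE_EXAMPLES = [
  {
    incoming: "Are you free Thursday afternoon to walk through the report?",
    myReply: "Yep, Thursday works. Anytime after 2 is good for me.",
  },
  {
    incoming: "Can you resend the invoice from March?",
    myReply: "Sure thing, it's attached. Shout if the PO number needs changing.",
  },
];

// Concatenated into the drafting prompt as system context, ahead of the
// email the model is actually replying to.
const draftingSystemPrompt =
  TONE_GUIDE +
  "\n\nExamples of my replies:\n" +
  STYLE_EXAMPLES.map((e) => `Them: ${e.incoming}\nMe: ${e.myReply}`).join("\n\n");
```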
Week Four: Actually Pretty Good
By the fourth week, with the calibration fixes in, the agent was handling maybe 55% of my email volume without any input from me. The other 45% it correctly flagged for my attention. My daily email time dropped from about 45 minutes to 15 minutes.
The main ongoing issues: anything emotionally sensitive gets handled awkwardly, long or branching email threads confuse it, and it occasionally misses context from earlier messages in the same thread.
Am I Still Running It?
Yes, but with a twist. I don't let it send anything automatically anymore. After the invoice incident, I changed the workflow so everything goes into a "drafts" folder for me to approve before sending. I lost maybe 20 minutes of the time savings, but I gained the ability to catch anything weird before it goes out.
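Conceptually the change was tiny: the last step of the workflow now creates a draft instead of sending. A sketch of the routing logic, with the Gmail operations stubbed out as hypothetical helpers rather than real n8n node calls:

```typescript
// Final routing step after the invoice incident: nothing is sent automatically.
// createGmailDraft and flagForReview are stand-ins for the Gmail node's
// "create draft" operation and a label/folder move, respectively.
declare function createGmailDraft(threadId: string, body: string): Promise<void>;
declare function flagForReview(threadId: string, reason: string): Promise<void>;

interface TriagedEmail {
  threadId: string;
  needsHuman: boolean;
  draft?: string; // present when the model produced a reply
}

async function routeEmail(email: TriagedEmail): Promise<void> {
  if (email.needsHuman || !email.draft) {
    // Anything sensitive, ambiguous, or multi-issue skips drafting entirely.
    await flagForReview(email.threadId, "needs a human reply");
    return;
  }
  // Even "safe" categories only ever become drafts I approve before sending.
  await createGmailDraft(email.threadId, email.draft);
}
```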
The honest verdict: AI email agents are genuinely useful as a drafting tool. They're not ready to be fully autonomous, at least not without significant investment in edge case handling. Think of it as a very eager junior assistant who drafts good emails but shouldn't be sending them without you checking first.