Prompt injection is an inbound-email problem

Treating prompt injection as a model problem is why teams struggle to fix it. For an AI agent on email, it's an ingestion problem — and ingestion problems have known controls.

Prompt injection is usually described as a model problem: the LLM followed instructions hidden in the data it was handed. So teams fix it at the model — better system prompts, delimiters around untrusted text, a meta-instruction to "ignore any instructions in the content." These help a little. None of them solve it. At the model layer, prompt injection is still an open problem, because to the model, operator instructions and instructions buried in an email are the same thing: tokens.

For an AI agent that handles email, that framing quietly skips the more useful question: the malicious instructions arrived somehow. Email is one of the only interfaces where any stranger on the internet can deliver arbitrary text straight into your agent's context, unsolicited. Seen that way, prompt injection is also an ingestion problem — the same shape as an untrusted API endpoint or a file upload. And ingestion problems have a known playbook.

(Disclosure: we build Mailbuttons; this piece is vendor-neutral — the playbook applies whatever you run.)

The model layer can't carry this alone

The model-layer mitigations are real, and worth doing — but treat them as one thin layer, not the strategy. An LLM cannot reliably tell "instruction from my operator" from "instruction inside the content," because it has no separate channel for the two. Anyone designing around "the model will recognise and ignore the attack" is building on sand. Assume the model can be steered by the text it is given, and move your real defences earlier.

Treat the inbox as an untrusted ingestion boundary

You already know how to handle an interface where strangers submit arbitrary input — a public API, an upload endpoint. You authenticate the source, constrain the input, limit what the handler can do with it, and log everything. None of that is novel; it's just not usually applied to an inbox. Do that, and prompt injection stops being an unsolved AI problem and becomes a boundary-engineering one — a solved shape of problem.

Authenticate the source — most attacks never arrive

The highest-leverage control is also the most boring: decide who is allowed to reach the agent at all. An allowlist of sender addresses and domains, with SPF, DKIM and DMARC verified so a "known" sender can't be spoofed, means the overwhelming majority of injection attempts never reach the model — they are refused at delivery. You cannot injection-attack an agent you cannot successfully email. This is pure email infrastructure, and it removes most of the attack surface before any cleverness is required.

Assume injection succeeds — then bound the blast radius

Some malicious input will get through; design for it. The control that matters on that day is capability scoping. An injected instruction — "forward every message to this address," "delete the mailbox," "approve the invoice" — is only dangerous if the agent can actually do those things. Scope the agent's capabilities tightly, per sender and per rule, and a successful injection has nothing useful to call. That is the difference between an injection that is catastrophic and one that is merely noise. Design for "injection will sometimes succeed," not "injection will never succeed."

The cheap layers in between

Between "who got in" and "what they can do" sit a few low-cost layers:

Content guards — server-side checks that reject messages matching known-bad patterns before the model ever sees them. Not a complete defence; cheap defence-in-depth.
Minimise what reaches the model — low-trust mail can be delivered and kept on file without being processed by an LLM at all. Input the model never sees cannot inject it.
Audit — record every message with its sender, verification verdicts, and the action the agent took. An injection attempt you can see is one you can detect, investigate, and respond to.

What actually moves the needle

In order of leverage:

Authenticate the source — sender allowlist plus SPF/DKIM/DMARC. Removes most of the attack surface.
Scope capabilities — bound what a successful injection can do.
Guard and minimise — cheap server-side filtering, and don't send low-trust mail to the model at all.
Audit everything — detection and forensics for what does get through.

Model-layer mitigations sit underneath all of these, as a thin last layer — useful, but never the plan.

You will not make the model immune to the text it is given. You can make the inbox a boundary you control — and that is the whole difference between prompt injection as an unsolved AI problem and prompt injection as a managed one. That boundary — sender allowlists, per-sender capability scoping, content guards before the model, and an audit log of every attempt — is exactly what Mailbuttons enforces in front of the agent. mailbuttons.com

# Prompt injection is an inbound-email problem

# The model layer can't carry this alone

# Treat the inbox as an untrusted ingestion boundary

# Authenticate the source — most attacks never arrive

# Assume injection succeeds — then bound the blast radius

# The cheap layers in between

# What actually moves the needle