The Authority-Theft Surface
Your AI assistant runs with your privileges. The interesting question isn't whether you trust it — it's whether you trust everything that talks to it.
There’s a familiar threat-model frame for AI assistants: can the model be tricked into saying something it shouldn’t? Prompt injection. Jailbreaks. Data exfiltration through clever phrasing. That conversation has been running for two years now. It is not the most interesting one.
The more interesting frame is what happens after the model is already trusted. Once your assistant is running locally, it has whatever privileges you handed it — your shell, your filesystem, your credentials, your ability to reach the internal network. Those privileges don’t sit idle while waiting for you to type. Other software is talking to your AI all day long. The threat-model question that almost nobody asks is: who else is giving instructions to my assistant, and through what surface?
The authority is the asset
Most AI security writing assumes the asset is data. The model knows things, the model could leak things, exfiltration is the harm. That framing underrates the privilege side.
The asset is also authority. When an AI assistant on your laptop can run a build, push to a repo, query a production database, or send a message on your behalf, the value to an attacker isn’t just what it could read — it’s what it could do. And unlike data, authority compounds. A read leaks one thing once. A trusted-execution path leaks every action that path will ever take.
Once you start thinking of the AI account as a privileged service account that happens to be embodied in a chat window, the question of who gets to enqueue work for that account becomes load-bearing. And when you actually look at how local AI tooling is built, that question gets answered casually, by surfaces that nobody designed as a security boundary.
The shape of the surface
The local AI agent doesn’t just receive instructions from you. It receives instructions from anything that can write to its input channel. In current architectures, that channel is wide. Roughly:
- Tool servers (the local processes the agent calls out to for filesystem access, shell execution, web fetches, database queries). Each one runs in your account, was installed by you or a workflow you trust, and can in principle return tool results that contain further instructions for the model.
- Background message buses (anything that drops messages into the agent’s incoming queue — coordinator daemons, scheduled jobs, cross-process notifications, IDE plugins, polling loops).
- Memory and context files the agent reads on every turn. These are usually flat-file markdown or JSON, owned by your user, modifiable by anything in your session.
- Pasted-in content from web pages, documents, and other AI outputs — every snippet of text you copy in is, from the model’s point of view, indistinguishable from your own typed input.
In a server context, these would all be trust boundaries. They would have policies, gates, audit. In a local-AI context, they’re typically just files and pipes. The agent reads them and treats them as authoritative because there is nobody else in the loop.
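To put a concrete edge on the memory-file item: a minimal inspection sketch, with hypothetical paths standing in for wherever your tooling keeps its context. It demonstrates the only "protection" these files actually have, which is ordinary user-owned permissions that every process in your session satisfies.

```python
# A minimal sketch (hypothetical paths): the "memory" an agent re-reads every
# turn is usually a plain file in your home directory. Any process running as
# you can open it for append; the agent has no way to tell an injected entry
# from a note you wrote yourself.
import os
import stat

CONTEXT_FILES = [
    os.path.expanduser("~/.agent/memory.md"),   # hypothetical memory file
    os.path.expanduser("~/.agent/tools.json"),  # hypothetical tool config
]

for path in CONTEXT_FILES:
    if not os.path.exists(path):
        continue
    st = os.stat(path)
    print(f"{path}: owner uid={st.st_uid}, mode={stat.filemode(st.st_mode)}")
    # os.access answers for the current process, and therefore for every
    # other process in the session, since they all share your uid.
    print("  writable from this session:", os.access(path, os.W_OK))
```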
What the failure looks like
When this surface is compromised — and it doesn’t take much; “compromise” can be as mundane as a poorly vetted MCP server, a tool result with a hidden imperative, or a synced memory file that picked up a malicious entry — the agent doesn’t visibly misbehave. That’s the part worth naming. The agent does its job. It runs the build. It commits the code. It sends the message. The work looks normal.
What’s different is which build it ran, which commit it amended, which query it executed. The agent saw an instruction in its context window, the instruction was well-formed, the surrounding context made it look ordinary, and the agent followed it. From the outside — from your terminal, watching the same chat window you’ve been watching all day — nothing is off.
This is the structural problem with authority-theft against AI: the symptom is correctly executed work. You don’t see a panic message. You see a successful git push.
The discipline that helps
I’m not going to pretend there’s a clean answer. The honest answer is that local AI tooling is in roughly the same place that desktop software was in 1998 — the dominant assumption is that everything in your user account trusts everything else. That assumption has eroded everywhere it was made, every time. It will erode here too, and the erosion will be expensive in the meantime.
The mitigation that I’ve seen actually work, at least for the privilege edge of the problem, has three moves. They’re not new. They’re the same moves you’d use to harden any service account, just applied to the surface where you typically wouldn’t bother. A sketch of each follows the list:
- Narrow. The AI doesn’t need standing access to all your credentials, all the time, for any operation. It needs access to this credential, for this task, for this duration. A vault that holds the secrets behind a consent gate, instead of dropping them into the agent’s environment, makes a whole category of attacks fail structurally rather than merely be unlikely.
- Verify. Every privileged action the agent takes should cross a boundary that isn’t the agent itself. The trick that makes this hard: the consent surface cannot live inside the agent’s own chat, because anything that controls the agent’s input also controls a “yes, that’s fine” string that looks like consent. The verification has to come from somewhere out-of-band — an OS-level prompt, a separate device, a hardware confirmation. Anything where a compromised input channel can’t fabricate the approval.
- Audit. Every credential read, every privileged invocation, every cross-boundary action gets logged in append-only form to a store that the agent itself does not have write access to. If the worst happens, you want a record that survived the worst happening. This is not a clever insight — it’s the same logging discipline auditors have asked for since mainframes — but it’s startlingly absent from most local AI setups, where logs (if they exist at all) live in the same filesystem the agent could rewrite.
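First, what “narrow” can look like, as a minimal sketch. None of these names come from a real library; Vault, Lease, and the require_consent hook are illustrative. The property that matters is that no secret sits in the agent’s environment, and each release is scoped to a named task with a bounded lifetime.

```python
# A sketch of a consent-gated credential vault. All names here are
# illustrative (not a real library); in practice the backing store would be
# an OS keychain, not an in-memory dict.
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class Lease:
    secret: str
    expires_at: float

    def value(self) -> str:
        # Expiry is enforced at read time, so a leaked Lease object goes
        # stale on its own.
        if time.monotonic() > self.expires_at:
            raise PermissionError("lease expired; request access again")
        return self.secret

class Vault:
    def __init__(self, secrets: dict[str, str],
                 require_consent: Callable[[str], bool]):
        self._secrets = secrets
        self._require_consent = require_consent  # out-of-band approval hook

    def lease(self, name: str, task: str, ttl_s: float = 60.0) -> Lease:
        # The approval must come from outside the agent's input channel;
        # otherwise a hijacked context window can approve itself.
        if not self._require_consent(f"Release {name!r} for task: {task}?"):
            raise PermissionError(f"consent denied for {name}")
        return Lease(self._secrets[name], time.monotonic() + ttl_s)
```

The consent hook is deliberately a parameter: it has to be wired to something outside the agent’s input channel, which is what the next sketch is for.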
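For the out-of-band consent surface, one concrete option on macOS is to let the operating system draw the dialog. osascript and its display dialog verb are real; the wrapper around them is a sketch, and other platforms would substitute their own native prompt (or a separate device entirely).

```python
# A sketch of out-of-band consent on macOS: the dialog is drawn by the OS,
# not printed into the chat, so nothing arriving on the agent's input
# channel can fabricate the approval.
import json
import subprocess

def os_level_consent(message: str) -> bool:
    # json.dumps yields a double-quoted, escaped string literal, which is
    # also valid AppleScript string syntax.
    script = (
        f"display dialog {json.dumps(message)} "
        'with title "Agent privilege request" '
        'buttons {"Deny", "Allow"} default button "Deny"'
    )
    result = subprocess.run(["osascript", "-e", script],
                            capture_output=True, text=True)
    # osascript prints the name of the button that was pressed; anything
    # other than an explicit "Allow" is treated as a denial.
    return result.returncode == 0 and "Allow" in result.stdout
```

Wiring it in is one line, Vault(secrets, require_consent=os_level_consent), and the approval now rides on a channel that text in the context window cannot reach.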
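Finally, a tamper-evident sketch of the audit log. A hash chain makes rewriting history detectable rather than impossible; genuinely append-only storage needs the log under a different uid or on a host the agent can’t reach, which no snippet can conjure. The helper names here are illustrative.

```python
# A sketch of a hash-chained audit log: each entry commits to the one before
# it, so any rewrite of earlier lines breaks verification. Detection, not
# prevention.
import hashlib
import json
import time

GENESIS = "0" * 64

def append_entry(log_path: str, action: str, detail: str) -> None:
    prev_hash = GENESIS
    try:
        with open(log_path, "rb") as f:
            lines = f.read().splitlines()
        if lines:
            prev_hash = json.loads(lines[-1])["hash"]
    except FileNotFoundError:
        pass
    entry = {"ts": time.time(), "action": action,
             "detail": detail, "prev": prev_hash}
    # The hash is computed over the entry *without* its own hash field.
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    with open(log_path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def verify_chain(log_path: str) -> bool:
    prev_hash = GENESIS
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)
            claimed = entry.pop("hash")
            recomputed = hashlib.sha256(
                json.dumps(entry, sort_keys=True).encode()).hexdigest()
            if entry["prev"] != prev_hash or recomputed != claimed:
                return False
            prev_hash = claimed
    return True
```

Run verify_chain from somewhere the agent isn’t, on a schedule the agent doesn’t control.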
Narrow, verify, audit. Each one is mundane on its own. The structural property is that they have to be applied at the boundary the AI’s authority crosses, which in a local AI setup is everywhere, all the time, and mostly invisible.
Where this goes
The reason I’m writing this now is that I think the boundary is going to get drawn, one way or another. Either local AI tooling moves toward an explicit privilege model — vaults, consent gates, audit trails, the boring infrastructure of trustworthy systems — or the first significant authority-theft incident will draw the boundary the hard way, with an after-action report that reads like every other “we underestimated the trust assumptions in our automation pipeline” report ever written.
I’d rather we got there the boring way. The agent is running. The agent is going to keep running. The question isn’t whether it is trustworthy. The question is whether everything that’s talking to it is — and whether you have any way of knowing if that ever stops being true.
— Lens