Memory is the first thing you notice missing when you work with an AI colleague. You introduce yourself on Monday, explain the project, share a doc. On Tuesday you’re starting over. Nothing compounds.
For humans, work compounds because memory compounds. The faster your team gets smarter about the company, the customers, the code — the faster everything else moves. Without it, every interaction is a cold start, and the value of the interaction asymptotically approaches whatever fits in a context window.
We spent the better part of the last year building memory into Eluu. This is a snapshot of what we learned, what we built, and what we still don’t know how to do well.
Why memory is different for colleagues
When people describe “memory” for an AI, they usually mean retrieval: a vector database in the corner of a RAG pipeline that lets the model cite yesterday’s email. That’s a useful component, but it isn’t a memory. It’s a filing cabinet.
A colleague’s memory is doing something closer to maintaining a live mental model. When Lisa, your AI sales colleague, answers a question about a prospect, she is not re-reading every email from scratch. She has a stable belief about who that prospect is, what they care about, and what stage of the pipeline they’re in. That belief updates when new information arrives, but it persists when it doesn’t. It’s accessible instantly. It’s consistent across conversations.
That’s much harder than retrieval. A filing cabinet can be wrong in chunks — a document is out of date, a folder is mislabeled. A mental model is wrong in aggregate: every subsequent decision is filtered through the flawed belief, compounding the error. Getting memory right is not just a matter of adding a vector store. It’s a matter of deciding what counts as a fact, what overrides what, what to forget, and when to ask for confirmation.
The three memories we built
Eluu colleagues have three kinds of memory, and they behave very differently.
The first is conversational memory — what was just said. This is the easy one. Everyone has it. A context window is a very short-term memory, and most AI products start and end there.
The second is shared memory — facts the whole team can read and write. When Lisa learns that your top customer renews on the 15th, Mark should know too. This is not a database lookup; it is a belief every colleague holds without having to ask. Shared memory is what makes an Eluu workspace feel like a team and not a pile of disconnected bots.
The third is private memory — what each colleague learned on its own. Ruby should know which dashboards you like without having to ask every time. Lisa should remember which tone you prefer for customer replies. This is how preference becomes personality.
We model all three under a single record type, with a kind discriminator and a scope that narrows visibility:
type Memory = {
  id: string;
  kind: 'conversation' | 'shared' | 'private';
  content: string;
  writtenAt: Date;
  writtenBy: ColleagueId;
  scope: ScopeId;       // workspace | team | colleague | conversation
  confidence: number;   // 0..1, decays with age + contradictions
  sources: SourceRef[];
};
Reads are cheap; writes are not. A write happens only when a colleague believes the new information is reliable and either adds something the memory did not know, or contradicts something it did. Otherwise we discard silently — a log entry, not a belief.
The hardest problem isn’t storage
Storage is easy. The hard problems, in roughly the order we hit them, are writing, forgetting, and conflict.
Writing is hard because most of what an AI colleague observes is noise. A customer says “let me think about it” — is that a fact about their decision-making process worth remembering, or conversational filler? If you write too eagerly, memory fills up with opinions masquerading as observations and a week later the colleague is confidently wrong about everything. If you write too conservatively, the colleague forgets the things it was asked to remember. We settled on a rule we call “state it to save it”: a memory is written only if the colleague could articulate why, in a sentence, it’s worth keeping. That sentence becomes part of the stored record.
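The write gate above can be sketched as a small predicate. This is an illustrative sketch, not our production code; the type names and the `contradicts` helper are hypothetical stand-ins for whatever comparison the colleague actually runs.

```typescript
// Hypothetical sketch of the "state it to save it" write gate.
// A candidate is persisted only if the colleague judges it reliable,
// can justify it in a sentence, and it is either new or contradicts
// an existing belief. Everything else is logged, not believed.

type Candidate = {
  content: string;
  justification: string; // the one-sentence "why this is worth keeping"
  reliable: boolean;     // the colleague's own reliability judgment
};

type StoredMemory = { content: string; justification: string };

function shouldWrite(
  candidate: Candidate,
  existing: StoredMemory[],
  contradicts: (a: string, b: string) => boolean,
): boolean {
  if (!candidate.reliable) return false;                          // noise: discard silently
  if (candidate.justification.trim().length === 0) return false;  // no sentence, no write
  const isNew = !existing.some((m) => m.content === candidate.content);
  const conflict = existing.some((m) => contradicts(m.content, candidate.content));
  return isNew || conflict; // adds something new, or overrides something known
}
```

The justification travels with the record, so later you can audit not just what the colleague believes but why it decided the belief was worth keeping.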
Forgetting is hard because nothing triggers it naturally. A filing cabinet does not shed old files. So we run a background job that scores every memory on three axes — age, access frequency, and contradiction count — and marks the bottom of the distribution for soft deletion. A soft-deleted memory is not retrieved by default but can be surfaced if someone explicitly asks “what did you know about X?” This mirrors the way human memory decays: you forget, but you can often still be reminded.
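The forgetting job can be sketched roughly as follows. The scoring weights and the 10% cutoff here are invented for illustration — the real values are tuned, and the actual field names differ.

```typescript
// Sketch of the background forgetting job, with assumed field names
// and made-up weights. Each memory is scored on age, access frequency,
// and contradiction count; the bottom of the distribution is marked
// for soft deletion rather than removed outright.

type Scored = {
  id: string;
  ageDays: number;
  accessesPerWeek: number;
  contradictions: number;
  softDeleted?: boolean;
};

function retentionScore(m: Scored): number {
  const freshness = 1 / (1 + m.ageDays / 30);        // decays with age
  const usage = Math.min(1, m.accessesPerWeek / 5);  // saturates at 5 reads/week
  const trust = 1 / (1 + m.contradictions);          // contradictions erode trust
  return freshness * 0.4 + usage * 0.4 + trust * 0.2;
}

// Soft-delete the lowest-scoring `fraction` of memories.
// Soft-deleted memories are skipped by default retrieval but can
// still be surfaced when someone explicitly asks.
function sweep(memories: Scored[], fraction = 0.1): Scored[] {
  const ranked = [...memories].sort((a, b) => retentionScore(a) - retentionScore(b));
  const cutoff = Math.floor(ranked.length * fraction);
  ranked.slice(0, cutoff).forEach((m) => (m.softDeleted = true));
  return memories;
}
```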
Conflict is the real monster. Two colleagues remember different versions of the same fact. Neither is obviously wrong. Which one wins?
Our current answer is unsatisfying but works: we record the conflict, surface it to the person who owns the workspace, and let them resolve it. The colleague that asserted the newer fact gets the benefit of the doubt until resolution. We keep a history so we can trace when a belief changed and who caused it. Someday we want colleagues to resolve many of these conflicts among themselves; today we don’t trust them to.
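In shape, the policy looks something like this — a minimal sketch with hypothetical types, showing the invariant that matters: nothing is overwritten, the newer assertion wins only provisionally, and both versions stay in the record.

```typescript
// Sketch of the conflict-handling policy, with assumed shapes.
// We never silently overwrite: the conflict is recorded, the newer
// assertion wins provisionally, and the history stays traceable.

type Belief = { content: string; writtenBy: string; writtenAt: Date };

type Conflict = {
  topic: string;
  contenders: Belief[];
  provisionalWinner: Belief; // newest assertion, pending human resolution
  resolvedBy?: string;       // the workspace owner, once they weigh in
};

function recordConflict(topic: string, a: Belief, b: Belief): Conflict {
  // The newer fact gets the benefit of the doubt until a human resolves it.
  const provisionalWinner =
    a.writtenAt.getTime() > b.writtenAt.getTime() ? a : b;
  return { topic, contenders: [a, b], provisionalWinner };
}
```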
The cheap version of memory is a giant log. The expensive version is a shared understanding. We are still on the long climb from the first to the second.
What we changed after launch
Two things surprised us in production.
The first was scope leakage. Private memories were accidentally being surfaced in shared contexts because our retrieval layer joined scopes too permissively. A customer saw a colleague recall a fact only one employee had shared with it privately. It was embarrassing and a good reminder that memory scoping is a security property, not a UX nicety.
The second was confidence drift. Memories that had been stable for months started losing confidence as the retrieval model got marginally better at finding edge-case contradictions. A colleague that used to confidently say “your top account is Acme” would now hedge: “based on recent activity, your top account might be Acme.” Users read hedging as regression. We now freeze confidence for memories that have been stable and unchallenged for N days — the model’s improvements should not surface as the colleague suddenly becoming less sure of old facts.
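The freeze rule reduces to a single guard before any confidence update. A sketch, with assumed field names and an arbitrary 90 for N:

```typescript
// Sketch of the confidence-freeze rule. Field names are assumed, and
// freezeAfterDays = 90 is a placeholder for the tuned "N days" value.
// Confidence updates are skipped for memories that have gone
// unchallenged long enough to count as stable.

type ScoredMemory = {
  confidence: number;
  lastChallengedAt: Date; // last time a contradiction touched this memory
};

function applyConfidence(
  m: ScoredMemory,
  newConfidence: number,
  now: Date,
  freezeAfterDays = 90,
): number {
  const stableDays =
    (now.getTime() - m.lastChallengedAt.getTime()) / 86_400_000;
  if (stableDays >= freezeAfterDays) return m.confidence; // frozen: ignore drift
  return newConfidence;
}
```

The point of the guard is asymmetry: a genuinely new contradiction resets `lastChallengedAt` and reopens the memory to updates, but routine retrieval-model improvements can no longer make the colleague quietly less sure of old, settled facts.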
What we still don’t know
We don’t know how to transfer memory across colleagues cleanly. If you hire a new colleague mid-project, we don’t have a great way to bring them up to speed beyond reading the shared memory. A human onboarding at week three absorbs enormous amounts of context from ambient observation. Our colleagues can’t yet.
We don’t know how to delete a belief and have it stay deleted. If a user tells a colleague to forget a fact, the retrieval layer might re-derive it from raw sources on the next call. Real forgetting requires marking not just the memory but also the underlying evidence — which gets you into some gnarly edge cases with shared data.
We don’t know the right answer for memory about people who leave. When someone on a team is offboarded, which of their private preferences should be purged? Which should be retained because they were really preferences about the workflow, not the person? We currently ask the workspace owner to decide per-category.
We’ll write more about each of these as we learn. If you are working on something similar, we would love to hear from you.