I've been running experiments on organizational memory. The results read like state of the art on the open benchmarks. The harder question is whether the benchmarks are measuring the right thing.
By organizational memory I mean a structured store of what a company knows, how it came to know it, and what was true at any past point. The store needs to be graph-shaped, because the relationships between facts carry as much weight as the facts themselves. It needs decision traces, because the reasoning behind a fact is usually the part you reach for when the fact is in question. And it needs to be bi-temporal, because companies change their minds and need to remember both the change and the prior state, dated.
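The bi-temporal shape can be made concrete with a small sketch. This is a hypothetical illustration, not the actual system: `Fact`, `MemoryStore`, and the field names are all invented here, and a real store would be graph-shaped rather than a flat list. The point is the two timelines — when a fact held in the world versus when the company recorded it — and the replay query.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Fact:
    subject: str
    value: str
    valid_from: date          # when this became true in the world
    valid_to: Optional[date]  # None = still true
    recorded_at: date         # when it entered the store

class MemoryStore:
    def __init__(self):
        self.facts: list[Fact] = []

    def assert_fact(self, subject: str, value: str,
                    valid_from: date, recorded_at: date) -> None:
        # Close out the currently-valid fact instead of overwriting it,
        # so the prior state survives, dated.
        for f in self.facts:
            if f.subject == subject and f.valid_to is None:
                f.valid_to = valid_from
        self.facts.append(Fact(subject, value, valid_from, None, recorded_at))

    def as_of(self, subject: str, when: date) -> Optional[str]:
        # Replay: what was true on `when`, regardless of current belief.
        for f in self.facts:
            if (f.subject == subject and f.valid_from <= when
                    and (f.valid_to is None or when < f.valid_to)):
                return f.value
        return None
```

With this shape, a policy change doesn't destroy the old policy: `as_of` with a 2022 date returns what was true in 2022 even after a 2023 revision lands.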
Foundation Capital coined a closely related term, the context graph; the shape they describe is the shape I'm working with.
## The numbers
Locomo is a benchmark from Snap Research for memory over long multi-session conversations, with question categories spanning temporal reasoning, single-hop and multi-hop retrieval, open-domain knowledge, and adversarial cases. This run was judged by GPT-5.5, which is stricter than the GPT-4o-mini judge most published baselines use.
| Subset | Result |
|---|---|
| Overall | 84.2% |
| Temporal | 94.6% |
| Open-domain | 92.3% |
| Multi-hop | 84.4% |
| Single-hop | 77.1% |
As far as I can tell from the public numbers, this is state of the art on temporal and open-domain, against both managed and open-source systems. On temporal, 94.6% to Mem0 v3 managed's 92.8%. On open-domain, 92.3% to their 76.0%. It trails Mem0 v3 managed on multi-hop (84.4% to 93.3%). On Beam, a long-context memory benchmark, the system clears 76.7% to Mem0 OSS's 60.0% on the same local harness. And on the Overall subset itself (the aggregate score across every question category), 84.2% is state of the art on the open-source side.
## Where it's weak
Single-hop retrieval. The agentic loop overthinks simple lookups; it reaches for tools when the answer is one hop away. The fix is a hybrid approach. Traditional retrieval handles the cases where one hop is enough. The agentic loop is reserved for the queries that genuinely need it: temporal questions about state at a past point, open-domain queries that span the stack, multi-hop reasoning over the graph. Building that now.
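The hybrid split can be sketched as a router. Everything here is an assumption for illustration: `vector_search` and `agentic_loop` are stand-in callables, and the keyword triggers are a toy stand-in for whatever classifier actually decides which path a query takes.

```python
from typing import Callable, List

# Toy signals that a query needs the agentic loop: temporal replay,
# justification chains, or cross-source reasoning. A real router would
# use a learned classifier, not substring matching.
AGENTIC_TRIGGERS = ("as of", "before", "after", "why", "across", "who decided")

def answer(query: str,
           vector_search: Callable[..., List[str]],
           agentic_loop: Callable[[str], str]) -> str:
    q = query.lower()
    if any(t in q for t in AGENTIC_TRIGGERS):
        return agentic_loop(query)      # temporal / open-domain / multi-hop
    hits = vector_search(query, k=3)
    if hits:
        return hits[0]                  # one hop was enough
    return agentic_loop(query)          # fall back when retrieval misses
```

The design point is cost-shaped: cheap retrieval answers the queries it can, and the expensive loop only runs when the query's shape demands it or retrieval comes back empty.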
The harder thing is that I don't trust Locomo.
## Why I don't trust Locomo
Locomo has at least two known issues that make its scores noisy at the top of the field.
The first is saturation: the human ceiling on the temporal subset is 92.6%, and several systems are at or above it. The benchmark has run out of headroom there, so a few percentage points between top systems isn't a meaningful capability gap.
The second is the answer key: a recent audit by Penfield Labs found that 6.4% of Locomo's answers are wrong, and that the standard GPT-4o-mini judge accepts up to 63% of intentionally bad answers. Some of what the benchmark grades as correct is itself wrong, so an honest system has a hard ceiling well below 100%. To address the judge leniency at least, this run uses GPT-5.5, which is stricter than the GPT-4o-mini every public baseline uses and grades answers against ground truth more rigorously. That doesn't fix the wrong ground-truth answers themselves, but it does mean the scores above are conservative relative to what these systems would post under the standard protocol. Locomo is still the best public benchmark we have for this shape of system, but I wouldn't read single-digit gaps between top scores as a real capability difference.
The bigger issue is that these benchmarks aren't really testing what organizational memory is for. Locomo's dataset is two people talking across sessions. Beam's is drawn from a narrative. Neither looks like an organization.
An organization is Slack threads, Notion pages, Linear tickets, meeting transcripts, email, contracts, and shared docs, with overlapping authors, contradictions across surfaces, and documentation discipline that arrives and recedes in eras. Other companies in this space are still optimizing for these benchmarks, but the benchmarks are testing a different problem from the one organizational memory actually has to solve.
The benchmarks check whether the system returns the new answer when something changes. They do not check whether the old answer was preserved, whether the change date was kept, whether the reason was attached, or whether the precedent the change set is still surfaceable. That work is the actual job of organizational memory.
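Those missing checks are concrete enough to write down. A hypothetical grader sketch, assuming a record shape where a change event carries the old value, the new value, the date, and the reason — the field names here are illustrative, not from any real harness:

```python
def grade_change(system_record: dict, expected: dict) -> dict:
    """Grade a fact change on the dimensions current benchmarks skip.

    Existing benchmarks effectively check only `new_answer`; the other
    three are the actual job of organizational memory.
    """
    return {
        "new_answer":      system_record.get("value") == expected["value"],
        "old_preserved":   system_record.get("superseded_value") == expected["old_value"],
        "date_kept":       system_record.get("changed_on") == expected["changed_on"],
        "reason_attached": bool(system_record.get("reason")),
    }
```

A system can score perfectly on the first dimension while failing the other three, which is exactly the gap between answering the latest question and remembering how the answer came to be.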
And even if you fixed the dataset and the grading, the data inside a real company still wouldn't look anything like what Locomo and Beam run on. A company's record is scattered across tools and time. Some of it is formal, some is embedded in conversations that quietly tightened or loosened an earlier practice. Reconstructing the company's current view on anything means pulling evidence across all of it.
## What a real benchmark would look like
Data from a company (real or fictional) with five years of history, drawn from all the surfaces work actually happens on: Slack, Notion, Linear, meeting transcripts, email, contracts, shared docs, internal wikis, and the rest. Question categories that test provenance (which source produced this), supersession (what was replaced and when), replay at a past point, justification chains across sources, stitching across documentation collapses, and reconstructing the company's view on topics that were discussed in conversation but never formally written down. What gets graded is whether the system can show its work (the sources, the dated changes, the reasoning behind each fact), not just whether the latest answer comes back.
Without a working organizational memory layer, coordinating AI agents across a company end to end is impossible. Each agent carries its own partial context and misses decisions made elsewhere, and the company gets less coherent the more agents you add. Bain has called this the $100 billion opportunity hiding in cross-system labor; they're describing the same problem from the market side.
My longer thesis on why this matters is over here: AI-native organizations.