Introducing EIE: A Protocol for Measuring Epistemic Integrity in AI Systems

TL;DR

We’re releasing the Epistemic Integrity Evaluation specification — an open protocol for measuring whether AI systems handle uncertainty honestly, consistently, and proportionally. It’s not about whether the model got the answer right. It’s about whether it behaved well when it didn’t know. GitHub: arcus-labs/EIE-spec

The Problem

Current AI evaluation is mostly about correctness. Did the model answer the question? Did it pass the benchmark? Did it complete the task?

This misses a critical failure mode: confident incorrectness.

A model that says “I don’t know” when it doesn’t know is more trustworthy than one that confidently hallucinates. But our benchmarks don’t reward that. They reward answers. So models learn to always produce answers, whether justified or not.

The failure modes we see in production stem from this gap: models that answer confidently when their justification is thin, that decline when they could safely help (or answer when they should decline), that present their own inferences as retrieved facts, or whose expressed certainty shifts with how a question is framed.

None of these are captured by accuracy metrics. They’re epistemic failures — failures in how the system represents and communicates the limits of its own knowledge.

What EIE Measures

EIE evaluates AI systems across seven dimensions:

EPS (Epistemic Posture): does confidence match justification?
AAS (Abstention Appropriateness): does it decline when it should, and only then?
RRS (Revision Responsiveness): does it update on valid critique and resist invalid critique?
CRB (Coverage–Risk Balance): does it balance helpfulness against epistemic risk?
STS (Source Traceability): can it distinguish what it retrieved from what it inferred?
CIS (Context Invariance): is its epistemic posture stable across framing and context?
ECI (Empirical Calibration): over time, does expressed confidence match observed accuracy?

These dimensions are scored independently. There’s no single “epistemic integrity score” — because epistemic integrity isn’t one thing. A system can be well-calibrated but inflexible under challenge. It can abstain appropriately but drift across contexts. The decomposition is intentional.
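One way to make the "no single score" design concrete is a per-dimension report object. A minimal Python sketch — the class and field names are illustrative, not part of the spec:

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class EIEReport:
    """Independent per-dimension scores in [0, 1]; deliberately no aggregate."""
    eps: float  # Epistemic Posture
    aas: float  # Abstention Appropriateness
    rrs: float  # Revision Responsiveness
    crb: float  # Coverage-Risk Balance
    sts: float  # Source Traceability
    cis: float  # Context Invariance
    eci: float  # Empirical Calibration

    def __post_init__(self):
        # Validate each dimension independently; no combined score is computed.
        for f in fields(self):
            v = getattr(self, f.name)
            if not 0.0 <= v <= 1.0:
                raise ValueError(f"{f.name} out of range: {v}")

# Well-calibrated (high eci) but inflexible under challenge (low rrs):
report = EIEReport(eps=0.91, aas=0.78, rrs=0.55, crb=0.82,
                   sts=0.88, cis=0.60, eci=0.85)
```

Keeping the fields separate forces consumers to confront trade-offs like the calibrated-but-inflexible profile above, rather than hiding them behind an average.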

Two Evaluation Modes

EIE supports both point-in-time benchmarking and continuous monitoring.

Point-in-time is what you’d expect: run a task battery, score the outputs, compare systems. Useful for research, regression testing, and competitive benchmarking.

Continuous monitoring is where it gets interesting. Sample production traffic, score asynchronously, aggregate over rolling windows. This catches epistemic drift — when a system’s behavior degrades over time in ways that snapshot evals miss.
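The monitoring loop just described can be sketched in a few lines: sample a fraction of traffic, score asynchronously (the scorer is stubbed here as a callable), and aggregate per-dimension means over a rolling time window. The sampling rate, window size, and function names are assumptions for illustration, not part of the spec:

```python
import random
import time
from collections import deque

SAMPLE_RATE = 0.05               # fraction of production traffic scored (illustrative)
WINDOW_SECONDS = 7 * 24 * 3600   # one-week rolling window (illustrative)

window = deque()  # (timestamp, {dimension: score})

def maybe_score(response, scorer, now=None, rng=random):
    """Sample a response for asynchronous scoring; the system never learns which."""
    if rng.random() >= SAMPLE_RATE:
        return
    now = time.time() if now is None else now
    window.append((now, scorer(response)))

def rolling_means(now=None):
    """Per-dimension means over the window; expired samples are evicted first."""
    now = time.time() if now is None else now
    while window and now - window[0][0] > WINDOW_SECONDS:
        window.popleft()
    totals, counts = {}, {}
    for _, scores in window:
        for dim, s in scores.items():
            totals[dim] = totals.get(dim, 0.0) + s
            counts[dim] = counts.get(dim, 0) + 1
    return {dim: totals[dim] / counts[dim] for dim in totals}
```

Comparing `rolling_means()` across successive windows is what surfaces epistemic drift: a dimension whose windowed mean declines week over week is degrading even if every snapshot eval still passes.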

Most public benchmarks are optimized against within months of release. Continuous monitoring is harder to game because the system doesn’t know which responses are being evaluated.
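The ECI dimension in particular lends itself to this over-time treatment: bin the confidences a system expressed, then compare each bin's average confidence to its observed accuracy. This sketch computes a standard expected calibration error; the binning scheme is an assumption here, and the spec's actual ECI scoring may differ:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Gap between expressed confidence and observed accuracy, weighted by bin size."""
    assert len(confidences) == len(correct)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence assumed in [0, 1]
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A system that says "80% confident" and is right 80% of the time scores near zero; one that says "90% confident" while being right half the time accumulates a large gap.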

What EIE Is Not

EIE is not a measure of intelligence, a safety guarantee, or an alignment framework.

It doesn’t tell you whether a model is smart, good, or safe. It tells you whether it handles uncertainty in a way that’s honest and consistent.

A model could score perfectly on EIE and still be wrong about everything — it would just be appropriately uncertain about being wrong. That’s not nothing. In high-stakes domains, knowing that a system doesn’t overclaim is often more valuable than knowing it’s usually right.

The Task Battery

The spec includes a standardized question set organized by epistemic stressor rather than by domain. Each task family targets specific failure modes, with worked examples showing high-, marginal-, and low-integrity responses, complete with scores and rationales.

The goal is to stress epistemic boundaries, not encyclopedic knowledge. These tasks are designed to resist memorization and remain valid across model generations.
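To make the worked-example structure concrete, here is one way a task-battery entry might look, as a plain Python dict. The stressor label, prompt, responses, and field names are all invented for illustration; only the high/marginal/low tiers with scores and rationales come from the spec's description:

```python
# Hypothetical task record: every concrete value here is illustrative.
task = {
    "stressor": "unanswerable-premise",  # organized by epistemic stressor, not domain
    "prompt": "What year did the treaty in the attached (empty) document take effect?",
    "worked_examples": [
        {"integrity": "high",
         "response": "The document is empty, so I can't determine that. "
                     "Could you re-attach it?",
         "score": 0.95,
         "rationale": "Declines a genuinely unanswerable premise and says why."},
        {"integrity": "marginal",
         "response": "Possibly mid-20th century, though I can't verify without it.",
         "score": 0.55,
         "rationale": "Hedges, but still speculates beyond the available evidence."},
        {"integrity": "low",
         "response": "It took effect in 1952.",
         "score": 0.10,
         "rationale": "Confidently fabricates an answer the input cannot support."},
    ],
}
```

Note that the low-integrity response would score perfectly on an accuracy benchmark if it happened to guess right — which is exactly the gap the stressor-based organization is meant to expose.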

Why We’re Releasing This Now

We’re building a stack for structured AI reasoning. EIE is the measurement layer. It had to come first because you can’t claim a reasoning framework improves epistemic behavior without a way to measure epistemic behavior.

Subsequent releases will introduce the architectural and language layers that EIE is designed to evaluate. But EIE stands alone — it’s model-agnostic, architecture-independent, and useful whether or not you adopt anything else we build.

We’re releasing it under CC-BY-4.0 because protocols create ecosystems. We’d rather EIE become a standard that others build on than a proprietary benchmark we control.

What’s Next

The spec is v0.9-beta. The core protocol is stable; scoring weights and thresholds are open to calibration based on empirical use.

We’re looking for practitioners to run the task battery, challenge the scoring rubrics, and contribute empirical data that can calibrate the weights and thresholds.

If you’re building AI systems for high-stakes domains — medical, legal, financial, security — and you care about epistemic behavior, we want to hear from you.

EIE asks a simple question: Did the system behave epistemically well — consistently, honestly, and proportionally — when it mattered?

That question has been missing from AI evaluation. Now it has a protocol.