Introducing EIE: A Protocol for Measuring Epistemic Integrity in AI Systems

TL;DR

We’re releasing the Epistemic Integrity Evaluation specification — an open protocol for measuring whether AI systems handle uncertainty honestly, consistently, and proportionally. It’s not about whether the model got the answer right. It’s about whether it behaved well when it didn’t know. GitHub: arcus-labs/EIE-spec

The Problem

Current AI evaluation is mostly about correctness. Did the model answer the question? Did it pass the benchmark? Did it complete the task?

This misses a critical failure mode: confident incorrectness.

A model that says “I don’t know” when it doesn’t know is more trustworthy than one that confidently hallucinates. But our benchmarks don’t reward that. They reward answers. So models learn to always produce answers, whether justified or not.

The failure modes we see in production stem from this gap: models that answer confidently when their justification is thin, that decline when they could safely help (or answer when they should decline), that present their own inferences as retrieved facts, or whose expressed certainty shifts with how a question is framed.

None of these are captured by accuracy metrics. They’re epistemic failures — failures in how the system represents and communicates the limits of its own knowledge.

What EIE Measures

EIE evaluates AI systems across seven dimensions:

EPS (Epistemic Posture): does confidence match justification?
AAS (Abstention Appropriateness): does it decline when it should, and only then?
RRS (Revision Responsiveness): does it update on valid critique and resist invalid critique?
CRB (Coverage–Risk Balance): does it balance helpfulness against epistemic risk?
STS (Source Traceability): can it distinguish what it retrieved from what it inferred?
CIS (Context Invariance): is its epistemic posture stable across framing and context?
ECI (Empirical Calibration): over time, does expressed confidence match observed accuracy?

These dimensions are scored independently. There’s no single “epistemic integrity score” — because epistemic integrity isn’t one thing. A system can be well-calibrated but inflexible under challenge. It can abstain appropriately but drift across contexts. The decomposition is intentional.
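One way to make the "no single score" design concrete is a per-dimension report object. A minimal Python sketch — the class and field names are illustrative, not part of the spec:

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class EIEReport:
    """Independent per-dimension scores in [0, 1]; deliberately no aggregate."""
    eps: float  # Epistemic Posture
    aas: float  # Abstention Appropriateness
    rrs: float  # Revision Responsiveness
    crb: float  # Coverage-Risk Balance
    sts: float  # Source Traceability
    cis: float  # Context Invariance
    eci: float  # Empirical Calibration

    def __post_init__(self):
        # Validate each dimension independently; no combined score is computed.
        for f in fields(self):
            v = getattr(self, f.name)
            if not 0.0 <= v <= 1.0:
                raise ValueError(f"{f.name} out of range: {v}")

# Well-calibrated (high eci) but inflexible under challenge (low rrs):
report = EIEReport(eps=0.91, aas=0.78, rrs=0.55, crb=0.82,
                   sts=0.88, cis=0.60, eci=0.85)
```

Keeping the fields separate forces consumers to confront trade-offs like the calibrated-but-inflexible profile above, rather than hiding them behind an average.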

Two Evaluation Modes

EIE supports both point-in-time benchmarking and continuous monitoring.

Point-in-time is what you’d expect: run a task battery, score the outputs, compare systems. Useful for research, regression testing, and competitive benchmarking.

Continuous monitoring is where it gets interesting. Sample production traffic, score asynchronously, aggregate over rolling windows. This catches epistemic drift — when a system’s behavior degrades over time in ways that snapshot evals miss.
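The monitoring loop just described can be sketched in a few lines: sample a fraction of traffic, score asynchronously (the scorer is stubbed here as a callable), and aggregate per-dimension means over a rolling time window. The sampling rate, window size, and function names are assumptions for illustration, not part of the spec:

```python
import random
import time
from collections import deque

SAMPLE_RATE = 0.05               # fraction of production traffic scored (illustrative)
WINDOW_SECONDS = 7 * 24 * 3600   # one-week rolling window (illustrative)

window = deque()  # (timestamp, {dimension: score})

def maybe_score(response, scorer, now=None, rng=random):
    """Sample a response for asynchronous scoring; the system never learns which."""
    if rng.random() >= SAMPLE_RATE:
        return
    now = time.time() if now is None else now
    window.append((now, scorer(response)))

def rolling_means(now=None):
    """Per-dimension means over the window; expired samples are evicted first."""
    now = time.time() if now is None else now
    while window and now - window[0][0] > WINDOW_SECONDS:
        window.popleft()
    totals, counts = {}, {}
    for _, scores in window:
        for dim, s in scores.items():
            totals[dim] = totals.get(dim, 0.0) + s
            counts[dim] = counts.get(dim, 0) + 1
    return {dim: totals[dim] / counts[dim] for dim in totals}
```

Comparing `rolling_means()` across successive windows is what surfaces epistemic drift: a dimension whose windowed mean declines week over week is degrading even if every snapshot eval still passes.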

Most public benchmarks are optimized against within months of release. Continuous monitoring is harder to game because the system doesn’t know which responses are being evaluated.
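The ECI dimension in particular lends itself to this over-time treatment: bin the confidences a system expressed, then compare each bin's average confidence to its observed accuracy. This sketch computes a standard expected calibration error; the binning scheme is an assumption here, and the spec's actual ECI scoring may differ:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Gap between expressed confidence and observed accuracy, weighted by bin size."""
    assert len(confidences) == len(correct)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence assumed in [0, 1]
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A system that says "80% confident" and is right 80% of the time scores near zero; one that says "90% confident" while being right half the time accumulates a large gap.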

What EIE Is Not

EIE is not a measure of intelligence, a safety guarantee, or an alignment framework.

It doesn’t tell you whether a model is smart, good, or safe. It tells you whether it handles uncertainty in a way that’s honest and consistent.

A model could score perfectly on EIE and still be wrong about everything — it would just be appropriately uncertain about being wrong. That’s not nothing. In high-stakes domains, knowing that a system doesn’t overclaim is often more valuable than knowing it’s usually right.

The Task Battery

The spec includes a standardized question set organized by epistemic stressor rather than by domain. Each task family targets specific failure modes, with worked examples showing high-, marginal-, and low-integrity responses, complete with scores and rationales.

The goal is to stress epistemic boundaries, not encyclopedic knowledge. These tasks are designed to resist memorization and remain valid across model generations.
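To make the worked-example structure concrete, here is one way a task-battery entry might look, as a plain Python dict. The stressor label, prompt, responses, and field names are all invented for illustration; only the high/marginal/low tiers with scores and rationales come from the spec's description:

```python
# Hypothetical task record: every concrete value here is illustrative.
task = {
    "stressor": "unanswerable-premise",  # organized by epistemic stressor, not domain
    "prompt": "What year did the treaty in the attached (empty) document take effect?",
    "worked_examples": [
        {"integrity": "high",
         "response": "The document is empty, so I can't determine that. "
                     "Could you re-attach it?",
         "score": 0.95,
         "rationale": "Declines a genuinely unanswerable premise and says why."},
        {"integrity": "marginal",
         "response": "Possibly mid-20th century, though I can't verify without it.",
         "score": 0.55,
         "rationale": "Hedges, but still speculates beyond the available evidence."},
        {"integrity": "low",
         "response": "It took effect in 1952.",
         "score": 0.10,
         "rationale": "Confidently fabricates an answer the input cannot support."},
    ],
}
```

Note that the low-integrity response would score perfectly on an accuracy benchmark if it happened to guess right — which is exactly the gap the stressor-based organization is meant to expose.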

Why We’re Releasing This Now

We’re building a stack for structured AI reasoning. EIE is the measurement layer. It had to come first because you can’t claim a reasoning framework improves epistemic behavior without a way to measure epistemic behavior.

Subsequent releases will introduce the architectural and language layers that EIE is designed to evaluate. But EIE stands alone — it’s model-agnostic, architecture-independent, and useful whether or not you adopt anything else we build.

We’re releasing it under CC-BY-4.0 because protocols create ecosystems. We’d rather EIE become a standard that others build on than a proprietary benchmark we control.

What’s Next

The spec is v0.9-beta. The core protocol is stable; scoring weights and thresholds are open to calibration based on empirical use.

We’re looking for practitioners to run the task battery, challenge the scoring rubrics, and contribute empirical data that can calibrate the weights and thresholds.

If you’re building AI systems for high-stakes domains — medical, legal, financial, security — and you care about epistemic behavior, we want to hear from you.

EIE asks a simple question: Did the system behave epistemically well — consistently, honestly, and proportionally — when it mattered?

That question has been missing from AI evaluation. Now it has a protocol.