What SR 11-7 Means for AI-Driven Decision Making
SR 11-7, the Federal Reserve’s model risk management guidance, was written for statistical models with inspectable coefficients. LLMs break every assumption the framework rests on. When an examiner asks how the model arrived at a specific decision, the answer “we trust the output” is not an answer. Here’s what examination readiness actually requires.
The question examiners are already asking
SR 11-7, issued by the Federal Reserve in April 2011, is the foundational guidance on model risk management for banking institutions. It was written for a world where “model” meant regression models, credit scoring engines, and statistical classifiers. The guidance assumes you can open the model, inspect its mechanics, and validate that it behaves as intended.
Fifteen years later, banks are running large language models in credit assessment, fraud detection, document analysis, and customer-facing communications. The examiners are asking the same questions SR 11-7 always required. The models, as currently deployed, can’t answer them.
What SR 11-7 requires
The guidance organizes model risk management around three pillars: development and implementation, validation, and ongoing monitoring. Beneath each pillar sits a set of requirements that were tractable when the models in question were statistical.
Documentation. The model must be documented in enough detail that a knowledgeable third party can understand, replicate, and evaluate it. An examiner reviewing a logistic regression can read the coefficients, understand the feature construction, and form an independent view of whether the model is sound.
Validation. An independent team must test the model’s logic, identify its limitations, and verify that its outputs are reliable across the range of conditions in which it will be used. For a credit score model, validation involves holdout testing, champion-challenger comparisons, and sensitivity analysis on key inputs.
Explainability. When a model drives a consequential decision — a loan denial, a fraud flag, a risk tier assignment — the institution must be able to explain how it arrived there. The explanation is not philosophical. It is operational. It means: here are the inputs, here are the weights, here is why this application scored 0.73 and not 0.81.
Change management. Any modification to the model must go through a documented review process. A new variable, a retrained version, a threshold adjustment — each is a model risk event.
For the models SR 11-7 was designed to govern, these requirements are hard but tractable. The information the framework demands exists somewhere in the model’s structure. You can find it if you look in the right place.
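To make concrete what "the information exists in the model's structure" means, here is a minimal sketch of the operational explanation available for a statistical scorer. The feature names and weights are invented for illustration; the point is that every contribution to the score is mechanically readable off the coefficients.

```python
# For a logistic credit model, the explanation SR 11-7 expects is mechanical:
# score = sigmoid(bias + w . x), and each feature's contribution to the score
# is just weight * value. All numbers below are invented for illustration.
import math

weights = {"debt_to_income": -2.1, "years_employed": 0.4, "prior_defaults": -1.8}
bias = 1.2
applicant = {"debt_to_income": 0.35, "years_employed": 6, "prior_defaults": 0}

logit = bias + sum(weights[f] * applicant[f] for f in weights)
score = 1 / (1 + math.exp(-logit))

# The explanation is the per-feature contribution, not a post-hoc narrative.
contributions = {f: weights[f] * applicant[f] for f in weights}
print(f"score={score:.2f}", contributions)
```

An examiner can independently recompute the score and see exactly which feature moved it. This is the baseline LLMs are measured against.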
Why LLMs break the framework
Large language models are structurally different from statistical classifiers in the ways that matter most for SR 11-7. They don’t have inspectable coefficients. Their behavior is not determined by a fixed set of weighted features. A given input doesn’t produce a deterministic output that can be traced back through model parameters. The reasoning — such as it is — happens across billions of neural network activations and evaporates when the response is complete.
This creates problems at every level of compliance.
On documentation: you can document an LLM’s architecture, training regime, and prompt design. But an examiner reading that documentation cannot predict what the model will produce given a specific loan application. The documentation describes the system in general. It doesn’t explain the decision in front of the examiner now.
On validation: holdout testing can tell you whether an LLM’s outputs are accurate on average. It cannot tell you whether any specific output was produced reliably. More critically, it cannot tell you whether the reasoning was sound — only whether the final answer happened to be correct. A model can hallucinate its way to a right answer, which means it can pass validation for the wrong reasons. Average accuracy across a holdout set is a thin thread to hang examination readiness on.
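One way to see the gap: a sketch comparing the aggregate accuracy a validation team would report against per-input stability. The `classify` function is a hypothetical stand-in for a non-deterministic LLM call, stubbed with randomness to mimic sampling variability; the data is synthetic.

```python
# Aggregate holdout accuracy vs. per-input reliability for a hypothetical
# LLM-backed classifier. `classify` stands in for an LLM call at nonzero
# temperature: the same input can yield different outputs across invocations.
import random

def classify(application: dict) -> str:
    # Ambiguous applications flip between answers; clear ones are stable.
    return random.choice(["approve", "deny"]) if application["ambiguous"] else "approve"

holdout = [{"id": i, "ambiguous": i % 5 == 0, "label": "approve"} for i in range(100)]

# Aggregate accuracy: the number a traditional validation report leads with.
accuracy = sum(classify(app) == app["label"] for app in holdout) / len(holdout)

# Per-input stability: does the model give the same answer on repeated calls?
def stable(app: dict, trials: int = 10) -> bool:
    return len({classify(app) for _ in range(trials)}) == 1

unstable = [app["id"] for app in holdout if not stable(app)]
print(f"aggregate accuracy: {accuracy:.0%}, unstable inputs: {len(unstable)}")
```

The headline accuracy number stays high while a fifth of the inputs produce different answers on re-invocation — exactly the distinction between average performance and specific-decision reliability.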
On explainability: this is where the framework breaks most visibly. Under SR 11-7, if a bank uses a model to deny a loan, it must be able to explain why. “The model said so” has never been acceptable. For statistical models, the explanation is in the coefficients. For LLMs, there is no equivalent. The response was generated by next-token prediction over a context window. The specific reasoning path is not recorded and cannot be reconstructed after the fact.
On change management: every prompt modification is functionally a model change. A revised system prompt, a new few-shot example, a rephrased instruction — each alters how the model behaves across a range of inputs in ways that are not always predictable. In an LLM-based system, prompt engineering in production is a model risk event. Most institutions deploying LLMs today have no framework for treating it as one.
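What treating a prompt edit as a model risk event could look like in practice: a sketch that fingerprints the full behavior-shaping configuration and logs any difference as a reviewable change. The field names and review workflow are illustrative assumptions, not anything prescribed by SR 11-7.

```python
# Sketch: treat prompt configuration as a versioned model artifact. Any
# fingerprint difference is logged as a model change requiring review,
# the same way a retrained statistical model would be. Schema is illustrative.
import datetime
import hashlib
import json

def config_fingerprint(config: dict) -> str:
    """Deterministic hash over everything that shapes model behavior."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

change_log = []

def record_change(old: dict, new: dict, reviewer: str, rationale: str) -> None:
    if config_fingerprint(old) != config_fingerprint(new):
        change_log.append({
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "from": config_fingerprint(old),
            "to": config_fingerprint(new),
            "reviewer": reviewer,
            "rationale": rationale,
        })

v1 = {"system_prompt": "You are a credit analyst...", "temperature": 0.0, "few_shot": 3}
v2 = {**v1, "system_prompt": "You are a conservative credit analyst..."}
record_change(v1, v2, reviewer="model-risk-team",
              rationale="tighten tone per validation finding")
print(len(change_log))  # 1: the prompt edit is logged as a model change
```

The same fingerprint would need to cover retrieval configuration and tool availability, since those alter behavior just as surely as the prompt text does.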
What examiners are finding
Banking examiners are encountering this gap with increasing frequency. Institutions that deployed LLMs in document-processing and decision-support workflows in 2023 and 2024 are now being asked to demonstrate SR 11-7 compliance for those systems. In many cases, the systems were not built with compliance in mind. The institution has logs of what the model produced. It does not have records of how the model reasoned.
This creates a specific examination risk — not because the models are performing badly, but because the institution cannot demonstrate that they’re performing well. There is a difference between a model that happens to make good decisions and a model whose decision quality can be governed, validated, and defended under examination. SR 11-7 requires the latter. Having neither coefficients to inspect nor reasoning traces to audit, institutions are left arguing that their outputs are trustworthy without being able to show why.
The examination question is simple: show me how this system arrived at this specific decision. For LLMs deployed without structured reasoning infrastructure, it is a question without a clean answer.
The three validation assumptions that fail
SR 11-7 validation rests on three assumptions that hold for statistical models and break for LLMs.
The first assumption is that performance on a test set predicts performance on individual decisions. For a well-specified regression model operating in a stable environment, this is reasonable. For an LLM, behavior can vary sharply across inputs that look nearly identical, so confidence in average accuracy does not translate to confidence in specific decisions.
The second assumption is that the model can be isolated and tested. Statistical models have fixed parameters that can be evaluated in controlled conditions. LLMs behave differently depending on context, framing, and instruction phrasing. Validating “the model” in isolation is a category error when the system’s behavior is jointly determined by parameters and prompts that are modified in production.
The third assumption is that model changes are discrete and documentable. SR 11-7 requires that changes go through a review process. For LLMs, the boundary between operation and modification is blurry. Prompt changes, retrieval configuration changes, tool availability changes — any of these can meaningfully alter behavior. Traditional model change management frameworks weren’t designed to capture them.
What examination readiness actually looks like
An LLM-based system is SR 11-7 ready when it can answer two questions for any specific output.
First: What did the system consider, and in what order? Not in general terms, but for this invocation. What information was retrieved, what strategy was developed, what checks were applied, and in what sequence did the reasoning proceed?
Second: Where did the reasoning require correction, and how was that correction documented? If the system produced an initial output that required revision, that revision should be traceable. If the system identified insufficient evidence and declined to proceed, that abstention should be logged as a deliberate outcome, not a failure.
For traditional models, this structure is implicit in the mathematics. For LLMs, it has to be imposed externally. The model provides intelligence. The governance framework around it provides the accountability structure.
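A sketch of what externally imposing that structure might mean: a per-invocation record that can answer both examination questions — what was considered, in what order, and where corrections or abstentions occurred. The schema and step vocabulary are illustrative assumptions, not a standard.

```python
# Hypothetical per-invocation audit record. Step kinds and field names are
# invented for illustration; the point is that sequence, corrections, and
# abstentions are first-class, queryable entries rather than lost context.
from dataclasses import dataclass, field

@dataclass
class ReasoningStep:
    kind: str          # e.g. "retrieve" | "analyze" | "check" | "revise" | "abstain"
    summary: str
    inputs: list = field(default_factory=list)

@dataclass
class DecisionRecord:
    invocation_id: str
    steps: list = field(default_factory=list)

    def corrections(self) -> list:
        # Revisions and abstentions are deliberate, documented outcomes.
        return [s for s in self.steps if s.kind in ("revise", "abstain")]

record = DecisionRecord("loan-4471")
record.steps.append(ReasoningStep("retrieve", "pulled 24 months of payment history"))
record.steps.append(ReasoningStep("analyze", "initial risk tier: B",
                                  inputs=["payment history"]))
record.steps.append(ReasoningStep("check", "cross-checked income verification"))
record.steps.append(ReasoningStep("revise", "tier adjusted B to C after income discrepancy"))

# An examiner reads the steps in order; the revision is documented, not hidden.
print([s.kind for s in record.steps], len(record.corrections()))
```

Question one is answered by the ordered `steps`; question two by `corrections()`, which surfaces every revision and abstention as a logged event.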
The path forward
The institutions that will successfully extend SR 11-7 compliance to LLM-based systems are the ones that treat reasoning as an artifact, not a side effect. Rather than deploying an LLM and logging its outputs, they structure the inference process in advance: define what needs to be clarified, what needs to be verified, what checks need to run before a decision is produced. Each step becomes a documented node. The whole process becomes inspectable.
This is what structured reasoning protocols like IRG are designed to provide. An IRG-compliant system produces a reasoning trace by architecture: each decision is accompanied by a graph documenting what was considered, what was checked, where uncertainty was flagged, and how the system converged. The trace is exportable and auditable. An examiner can review it the way they would review a model’s coefficients — not to reconstruct neural activations, but to verify that the decision process was governed.
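To make "exportable and auditable" concrete, here is one possible shape for such a trace. IRG's actual schema is not specified here — the node types, edge semantics, and JSON export below are assumptions for illustration only.

```python
# Illustrative reasoning trace as a graph: nodes for evidence, checks,
# flagged uncertainty, and conclusions; edges for the path between them.
# This is a hypothetical shape, not the IRG specification.
import json

graph = {
    "decision_id": "fraud-flag-2209",
    "nodes": [
        {"id": "n1", "type": "evidence", "label": "transaction velocity anomaly"},
        {"id": "n2", "type": "check", "label": "verified against known travel pattern"},
        {"id": "n3", "type": "uncertainty", "label": "merchant category ambiguous"},
        {"id": "n4", "type": "conclusion", "label": "flag for manual review"},
    ],
    "edges": [
        {"from": "n1", "to": "n2"},
        {"from": "n2", "to": "n3"},
        {"from": "n3", "to": "n4"},
    ],
}

# Exported as JSON, the trace is the artifact an examiner reviews in place
# of coefficients. A basic structural check: every conclusion node must be
# the target of at least one edge, i.e. no unsupported conclusions.
exported = json.dumps(graph, indent=2)
targets = {e["to"] for e in graph["edges"]}
conclusions = [n["id"] for n in graph["nodes"] if n["type"] == "conclusion"]
assert all(c in targets for c in conclusions)
```

Checks like the assertion above are what make the trace governable: structural properties of the decision process can be validated automatically, invocation by invocation.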
SR 11-7 isn’t going away, and regulators have shown no interest in creating a carve-out for LLMs. The institutions that figure out how to apply model risk management discipline to unstructured inference are the ones that will be able to deploy AI in the workflows that matter. The regulatory framework exists. The missing piece is an architecture that produces the artifacts the framework requires.
The examination question — how did the system arrive at this decision? — is not new. What’s new is that the systems being examined can’t answer it without a different kind of infrastructure underneath them.