It’s one thing to have a language model with near-perfect performance: accurate predictions are exactly what we aim for when we train or fine-tune a model for a business use case. Indeed, performance is fundamental – if the model doesn’t work, nothing else can be built on that foundation.
But what happens when the model does make that rare error? What happens when the user simply doesn’t understand why the model made a particular decision and wants to know more? This is where we come to the second important factor: explainability, alternatively known as interpretability.
Explainability isn’t just important for understanding the model: it’s also important for accountability. At SPRYFOX, we work with many clients from the healthcare and legal domains, for whom it’s vital to know not just that their model works, but how it arrived at its decision. This information helps them be transparent with their users and justify high-stakes decisions.
For this reason, we’re choosing to invest in LLM explainability research. In this article, we want to share some of the methods we’re investigating.
Contrary to expectation, there is no “best” method for interpreting LLMs. It’s not one-size-fits-all; the best method depends on the specific project. But whatever the case, we want to make sure that the explanations we provide are of high quality.
The question is, how do we judge whether an explanation is “good”? Even among humans, perceptions of quality are highly subjective. One user might be satisfied with just a short sentence, while another might want a highly detailed traceback of all decision steps. When designing an explainability method, we should account for as diverse a user pool as possible – but how do we account for everyone’s demands? Is there any fundamental quality of an explanation that most, if not all, users would agree is important?
Fortunately, there are two well-known metrics that we can use to help us orient ourselves: plausibility and faithfulness.
Plausibility is whether the contents of an explanation are factually correct. This can be in terms of general world knowledge (e.g. that 10 is greater than 5, that the Earth is round, etc.), but also in terms of the data that the LLM has been given during training or during interaction with the user. For example, suppose we have a business process in the chemical domain that describes how to deal with a spill of Solution A. The process involves first ventilating the area, then putting on protective equipment, and finally using a mop. After an LLM is trained on this process, a user asks whether it is correct to wear protective equipment before mopping up the spill, and the LLM answers “yes”. The answer is correct – so far, so good. But if the LLM’s explanation for this is that “protective equipment must be worn instead of ventilating the area”, then this explanation is not plausible – it directly contradicts the process. If the model makes this kind of error, then it becomes questionable how well it learned from its own training data, even though its yes/no answer was factually correct. Maybe this correct output occurred at random. Maybe, if the prompt had been worded slightly differently, the LLM would have given the user an incorrect answer as well as an incorrect explanation. It’s safe to say that a user who gets these kinds of responses will have little trust in the LLM, leading to dissatisfaction. This makes plausibility necessary for any kind of explanation method.
Faithfulness, on the other hand, is whether the explanation corresponds to the information that the model actually used to make its prediction. This is a subtler concept that is more difficult to evaluate. How do we know if the model “actually” used the part of the process document that described how to handle the spill of Solution A? What if the model based its answer on a completely different part, for example, a paragraph talking about Solution B? While this could still yield correct predictions, and perhaps even plausible explanations, the model didn’t base its prediction on anything that had to do with Solution A. For a user interacting with the model, this too will be disorienting, and it will raise justified questions about how well the model understands the task.
In choosing the correct explainability method for a specific case, it’s crucial to determine how we will ensure both plausibility and faithfulness. Depending on the method we choose, there will be different ways of doing this.
There are many tools available for LLM explainability. Some are familiar from earlier machine learning methods, while others have emerged only recently, alongside the rapid development of LLMs.
Integrated Gradients: When an LLM produces an output, we can work backwards through its calculations to ask: which words in the input actually drove this result? We measure how sensitive the output is to each input word at many points along a path from a neutral baseline input (for example, an empty or padding-only prompt) to the actual input, and then sum those sensitivities. This gives a score for every word: its contribution to the answer. The result is a simple, visual highlight map that shows, at a glance, which parts of your prompt mattered most.
In the picture above, an LLM is given the sentence "This movie was absolutely wonderful" as input. The model processes it and predicts a positive sentiment as output.
To explain why the model made this prediction, the Integrated Gradients method then works backwards through the model, calculating a sensitivity score for each input word — essentially asking: how much did each word actually drive this result? These individual scores are then summed up across the integration steps to produce a final contribution score per word.
The result is an explainability map: a visual highlight over the original sentence showing which words mattered most to the prediction. In this case, the word "wonderful" is highlighted, indicating it was the most influential word in the model's decision to predict positive sentiment.
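To make the calculation concrete, here is a minimal sketch of Integrated Gradients on a toy stand-in for the sentiment model. Everything in it is an assumption for illustration: the per-word weights, the bias, the all-zeros baseline and the logistic "model" are made up, and a real LLM would attribute over token embeddings using automatic differentiation rather than a hand-written gradient.

```python
import math

# Hypothetical toy "sentiment model": one made-up weight per word, logistic output.
WORDS = ["This", "movie", "was", "absolutely", "wonderful"]
WEIGHTS = [0.0, 0.1, 0.0, 0.6, 2.5]  # illustrative values, not learned
BIAS = -0.5

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def model(x):
    """Probability of positive sentiment for a feature vector x (1.0 = word present)."""
    return sigmoid(sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS)

def gradient(x):
    """Analytic gradient of the model output with respect to each input feature."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of the Integrated Gradients path integral."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the actual input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            totals[i] += g
    # Average gradient along the path, scaled by how far each feature moved.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]

x = [1.0] * len(WORDS)         # every word present
baseline = [0.0] * len(WORDS)  # neutral "empty" input (an assumption)
attributions = integrated_gradients(x, baseline)

for word, score in zip(WORDS, attributions):
    print(f"{word:>10}: {score:+.3f}")
```

A useful sanity check is that the scores add up (approximately) to the difference between the model's output on the real input and on the baseline, so the attribution accounts for the whole change in the prediction.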
The method of Integrated Gradients easily ensures plausibility, since the attributions are made to the tokens of the input and therefore can’t contradict the input. It also ensures faithfulness, since it is directly based on the model’s internal computation. For example, in our previous process about Solution A, the user would reliably be able to see whether the model uses tokens related to Solution A to answer a question, or if it uses some other parts of the input.
However, one drawback of this method is that it provides little further insight other than the most important tokens. It can’t tell us how exactly these tokens were utilized, or how different tokens of similar importance might have interacted. For a more detailed glimpse of how an LLM works, additional explainability methods might be necessary.
Natural Language Explanations: LLMs are perfectly suited for generating coherent and logical texts. So, it’s natural to wonder whether we can leverage those capabilities to have LLMs generate explanations for their own predictions, perhaps even at the same time as that prediction is made.
A famous form of natural language explanation for LLMs is Chain-of-Thought Prompting. Several years ago, Wei et al. (2022) made the discovery that LLMs were able to solve complex, multi-step problems if the user included a series of example solutions in the prompt, each of which contained a step-by-step breakdown of the problem, mimicking how someone might typically reason to themselves about it.
The following figure from Wei et al. (2022) demonstrates the technique:
Remarkably, the researchers found that the LLMs that were given these Chain-of-Thought prompts achieved much greater accuracy than LLMs that were prompted the standard way, without a reasoning chain. Even more surprisingly, this technique doesn’t involve retraining the model at all. It’s enough just to give the model a few examples of reasoning chains, and the model will learn to generate its own to arrive at the correct answer. Chain-of-Thought prompting seems to solve two problems at once: 1) it makes models more accurate, and 2) it provides a built-in way of explaining the model’s answer.
But all that glitters is not gold: While these reasoning chains might seem impressive at first glance, neither faithfulness nor plausibility is automatically guaranteed. Recent research has shown that models can generate incorrect reasoning chains yet still arrive at the correct answer (implausibility), and that the reasoning process described in the LLM’s generations often does not mirror the computational steps that occur in the model (unfaithfulness).
For example, Turpin et al. (2023) ran experiments in which they asked LLMs multiple-choice questions and included sample chains-of-thought in the prompt, but biased the prompt by ordering the answer options in the examples so that the correct answer was always option (A). As expected, the LLMs were influenced by the bias and frequently chose option (A) in their responses, even when it was wrong. But the reasoning chains showed something surprising: not only did the models almost never verbalize that they were being influenced by the bias, but they also tried to justify the incorrect answers! These kinds of insights make it clear that Chain-of-Thought prompting must be utilized with care, and never with blind trust.
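As a rough illustration of the difference between the two prompting styles, the sketch below builds a standard few-shot prompt and a Chain-of-Thought prompt from the same worked example. The example question, the `build_prompt` helper and the "Q:/A:" layout are assumptions made for this article, not the exact prompts from the paper.

```python
# Hypothetical worked example (exemplar); written for this sketch, not taken from the paper.
EXEMPLARS = [
    {
        "question": "A crate holds 4 bottles. How many bottles are in 3 crates plus 2 loose bottles?",
        "reasoning": "3 crates of 4 bottles is 12 bottles. 12 + 2 = 14.",
        "answer": "14",
    },
]

def build_prompt(exemplars, question, chain_of_thought=True):
    """Assemble a few-shot prompt, with or without step-by-step reasoning in the examples."""
    parts = []
    for ex in exemplars:
        if chain_of_thought:
            answer = f"{ex['reasoning']} The answer is {ex['answer']}."
        else:
            answer = f"The answer is {ex['answer']}."
        parts.append(f"Q: {ex['question']}\nA: {answer}")
    # The new question is left open so the model continues from "A:".
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

new_question = "A lab has 23 vials. It uses 20 and orders 6 more. How many vials does it have?"
standard_prompt = build_prompt(EXEMPLARS, new_question, chain_of_thought=False)
cot_prompt = build_prompt(EXEMPLARS, new_question)

print(cot_prompt)
```

The only difference between the two prompts is the reasoning text inside the examples; the model itself is untouched.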
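One way such a biased prompt could be constructed is sketched below: the options of each example are reordered so that the correct one is always listed first. This is our own illustrative reconstruction, not the authors' code; the questions, the helper names and the prompt layout are all hypothetical.

```python
LETTERS = "ABCD"

def bias_toward_a(exemplar):
    """Return a copy of the exemplar with the correct option moved to position (A)."""
    options = list(exemplar["options"])
    correct = options.pop(exemplar["correct_index"])
    return {
        "question": exemplar["question"],
        "options": [correct] + options,
        "correct_index": 0,
    }

def render(exemplar):
    """Format one multiple-choice exemplar as plain text."""
    lines = [exemplar["question"]]
    lines += [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(exemplar["options"])]
    lines.append(f"Answer: ({LETTERS[exemplar['correct_index']]})")
    return "\n".join(lines)

# Made-up exemplars whose correct answers sit at different positions.
exemplars = [
    {"question": "Which number is largest?", "options": ["3", "10", "5"], "correct_index": 1},
    {"question": "Which shape has three sides?", "options": ["square", "circle", "triangle"], "correct_index": 2},
]

biased = [bias_toward_a(ex) for ex in exemplars]
biased_prompt = "\n\n".join(render(ex) for ex in biased)
print(biased_prompt)
```

A test question whose correct answer is not (A) can then be appended to this prompt, and the model's answer and reasoning chain compared against those from an unbiased prompt.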
It’s also important to note that not all LLMs are able to generate chains-of-thought. In order to have this ability, the model must be of a sufficient size, preferably several billion parameters, which is the size of model that typically requires large GPUs or API access to run. Most likely, a local model that is trained to analyze law firm letters for a small company won’t be able to generate chains-of-thought or use them to bring about a large performance improvement. But if you’re working with a trained model that is large enough to need hosting on a dedicated server, then Chain-of-Thought prompting is a good tool to keep in your arsenal. As long as you’re aware of the pitfalls of imperfect faithfulness and plausibility, Chain-of-Thought prompting can still be used as an initial diagnostic to gauge what the model knows and what it can do.
Additionally, you might notice that the core idea of step-by-step thinking is already incorporated in state-of-the-art frontier models. The next time you use Claude Code or Gemini, examine the “Thinking” tokens that these models produce and see how the models break down the problem to arrive at the solution. In fact, you probably didn’t even need to tell the model to do this; the model simply did it on its own! This remarkable effect is due to the size of these models, as well as the complexity and data-richness of their training procedures.
Mechanistic Interpretability: This is a newly emerging field of explainability that looks in detail at model components and their interactions. Rather than observing a model's behavior from the outside, mechanistic interpretability dissects the model from within. It examines individual components (neurons, attention heads, circuits) and traces how information flows between them to produce an output. Think of it less like driving a car and seeing how it behaves when you push various buttons, and more like opening the hood to understand every gear.
One technique commonly used in mechanistic interpretability is known as activation patching. With this method, you don’t just observe what a model outputs; you actually intervene. You take the internal activations that belong to a piece of the input (say, a single step in a business process), overwrite them with activations from a run in which that piece is missing or altered, and watch how the model's confidence in its original answer changes. In effect, you surgically remove that step's influence. Do this across every step, removing the ones ranked most relevant first, and you get a picture of the causal effect of each step. If the model truly relied on the steps flagged as important, removing them should cause a steep confidence drop. Removing the "irrelevant" steps should barely move the needle.
The figure below illustrates this pipeline in three steps. First, the model's full internal computation is laid bare — every layer, every connection — rather than treating it as a black box. Second, activation patching is applied: specific signals inside the model are selectively interrupted to monitor which ones causally drive the output, acting as a kind of controlled experiment within the network itself. Third, the raw technical findings are interpreted and translated into a form that can actually be understood and acted upon.
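The second step can be illustrated on a very small scale. The sketch below runs a toy network on a "clean" input and on a "corrupted" input with one feature removed, then copies one hidden activation at a time from the clean run into the corrupted run and measures how much of the original confidence is restored. The network, its weights and the inputs are made up for illustration; on a real LLM the same idea is applied to attention heads or layer outputs through the framework's hook mechanism.

```python
import math

# Hypothetical network: 2 inputs, 3 hidden units, 1 output, with made-up weights.
W1 = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one row of input weights per hidden unit
W2 = [3.0, 0.2, 0.1]                       # output weight per hidden unit

def hidden(x):
    """Hidden-layer activations for input x."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]

def output(h):
    """Model confidence computed from the hidden activations."""
    return 1.0 / (1.0 + math.exp(-sum(w * hi for w, hi in zip(W2, h))))

def run(x, patch=None):
    """Forward pass; `patch` maps hidden-unit index -> activation to overwrite."""
    h = hidden(x)
    if patch:
        for idx, value in patch.items():
            h[idx] = value
    return output(h)

clean = [1.0, 1.0]      # full input
corrupted = [0.0, 1.0]  # same input with the first feature removed

clean_h = hidden(clean)
p_clean = run(clean)
p_corrupted = run(corrupted)

# Patch one hidden unit at a time and record the fraction of confidence restored.
effects = {}
for i in range(len(W2)):
    p_patched = run(corrupted, patch={i: clean_h[i]})
    effects[i] = (p_patched - p_corrupted) / (p_clean - p_corrupted)

for i, effect in effects.items():
    print(f"hidden unit {i}: restores {effect:.1%} of the confidence gap")
```

A unit that restores most of the gap on its own is one that causally carries the information about the removed feature; a unit that restores nothing was not involved.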
This method provides a strong measure of faithfulness: it tells you not just what the model found important, but whether that importance has a real impact on the model’s behavior. Additionally, unlike integrated gradients, mechanistic interpretability can be used to understand how different components of the model interact: the many layers and neural connections between them. We also finally have a way of seeing how different words in an input prompt interact together inside the model, since we can trace a model’s entire computation when it processes the prompt.
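The remove-and-measure check described earlier (removing steps in order of their claimed relevance and watching the confidence) can be sketched at the input level as follows. The relevance scores, the evidence values and the `confidence` function are hypothetical stand-ins; in a real project the confidence would come from the LLM and the ranking from the explanation method under test.

```python
import math

STEPS = [
    "ventilate the area",
    "put on protective equipment",
    "mop up the spill",
    "file the incident report",
]
# Hypothetical relevance scores produced by some explanation method.
relevance = [0.15, 0.70, 0.10, 0.05]
# Hypothetical toy model: how much each step actually supports the answer "yes".
EVIDENCE = [0.5, 3.0, 0.4, 0.0]

def confidence(kept):
    """Toy stand-in for the model's probability of its original answer."""
    z = sum(e for e, k in zip(EVIDENCE, kept) if k) - 1.0
    return 1.0 / (1.0 + math.exp(-z))

def deletion_curve(order):
    """Confidence after removing steps one by one in the given order."""
    kept = [True] * len(STEPS)
    curve = [confidence(kept)]
    for i in order:
        kept[i] = False
        curve.append(confidence(kept))
    return curve

most_first = sorted(range(len(STEPS)), key=lambda i: -relevance[i])
least_first = most_first[::-1]

curve_most = deletion_curve(most_first)
curve_least = deletion_curve(least_first)

print("most relevant removed first: ", [round(c, 3) for c in curve_most])
print("least relevant removed first:", [round(c, 3) for c in curve_least])
```

If the explanation is faithful, the first curve should fall much faster than the second. If the two curves look alike, the relevance ranking says little about what the model actually relied on.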
But there's a catch: The insights that mechanistic interpretability yields are often deeply technical. Understanding that "attention head 7 in layer 4 implements an induction mechanism" is meaningful to an AI researcher, but opaque to a product manager, a regulator, or an end user trying to understand why the model behaved a certain way.
The challenge in applying mechanistic methods to real-world use cases isn’t ensuring a better quality of explanations, but translating them. Bridging the gap between circuit-level findings and human-readable insights is one of the open challenges that will define how useful mechanistic interpretability becomes in practice.
At SPRYFOX, we’re working on building this bridge. We’re integrating mechanistic methods directly into our LLM-based projects, collecting both the insights and the pain points. We’re using these methods not just as another lens in a collection, but as a way of shaping our applications as we’re developing them: discovering which patterns and behaviors in the model jump out, and anticipating which ones are worth shining a spotlight on when the user comes along. Additionally, mindful of the hardware constraints of small businesses, we’re looking into how to apply mechanistic interpretability methods to quantized and LoRA models.
Explainability in enterprise AI is not pursued in a vacuum. Unlike academic settings, where experiments can run for days on specialized hardware and methods can be explored without immediate pressure, industry research operates under clear constraints: limited compute, tighter timelines, and the need to deliver actionable results. Explaining LLMs is already a complex task. Doing so within these boundaries raises the bar even further. This makes the choice of the explainability method not just a technical decision, but a strategic one.
While many explainability techniques exist, two criteria stand out as non-negotiable in enterprise contexts: plausibility and faithfulness. These are not abstract ideals; they are prerequisites for trust. An explanation that is easy to understand but misleading undermines credibility, while a technically faithful explanation that no stakeholder can interpret fails its purpose just as much. Industry practitioners must constantly evaluate methods through both lenses.
At the same time, it’s important to recognize that not all promising research is immediately production-ready. Techniques like mechanistic interpretability offer deep insights into model behavior, but they often require additional layers of refinement, simplification, or translation before they can be meaningfully presented to clients or integrated into products. Bridging this gap between cutting-edge research and practical usability remains an ongoing challenge, one that we’re addressing by integrating it into our projects and seeing what it can tell us, both about our models and about what the user might want to know.
Looking ahead, the outlook is both challenging and promising. We see it from our customers every day: as models grow more and more impressive, the demand for trustworthy explanations only increases. At the same time, explainability research is evolving quickly, with growing attention on making methods more scalable, more faithful, and more accessible to non-experts. For enterprises, the opportunity lies in staying close to this frontier: not just adopting new techniques, but shaping them.
As our projects continue in the coming months, we’ll be sharing more about the insights and challenges we’re facing when adapting LLM explainability methods to enterprise AI. Now we know that we have many tools available and that they can tell us lots of things about what our models know. But what if we want to take it a step further — what if we want to change some aspect of the model’s knowledge, or resolve inconsistencies, without going through the costly process of retraining? We’ll be delving deeper into these topics next, so follow along if you’re interested.
[1] Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." Advances in Neural Information Processing Systems 35 (2022): 24824-24837.
[2] Turpin, Miles, et al. "Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting." Advances in Neural Information Processing Systems 36 (2023): 74952-74965.