A Definition of Good Explanations and the Challenges of Explaining LLM Outputs

Challenges in Defining and Generating Good Explanations for LLM Outputs


Defining what constitutes a "good explanation" is a long-standing philosophical question that has gained renewed urgency in the era of artificial intelligence (AI). As large language models (LLMs) become increasingly integrated into high-stakes applications—from clinical diagnostics to legal reasoning—the ability to produce clear, faithful explanations of their outputs is critical for adoption and trust. However, achieving this requires first converging on a rigorous definition of what a good explanation actually is.


A Definition Grounded in Counterfactuals and Prior Beliefs


In their 2026 paper, "A Definition of Good Explanations and the Challenges of Explaining LLM Outputs," Mahon, Ford, and Hackett propose a definition inspired by counterfactual reasoning. Counterfactual explanations, which describe how altering specific inputs would change an output, have become a popular approach in explainable AI (XAI) because they are intuitive and directly answer "what-if" questions.
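To make the idea of a counterfactual explanation concrete, here is a minimal sketch that searches for the smallest set of input changes that flips a model's output. Everything in it is an assumption made for illustration: `toy_model`, its scoring rule, the feature names, and the candidate values are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def toy_model(features):
    """Hypothetical stand-in for a model: approve when a simple score clears 50."""
    score = features["income"] / 1000 - 2 * features["missed_payments"]
    return "approve" if score >= 50 else "deny"


def find_counterfactual(features, candidate_changes, model):
    """Return the smallest set of candidate changes that flips the model's output.

    Tries single-feature changes first, then pairs, and so on, so the first
    hit uses as few altered features as possible. Returns None if no
    combination of the candidate changes alters the output.
    """
    original = model(features)
    names = list(candidate_changes)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            altered = dict(features)
            for name in subset:
                altered[name] = candidate_changes[name]
            if model(altered) != original:
                return {name: candidate_changes[name] for name in subset}
    return None


# Illustrative data only (assumed, not from the paper).
applicant = {"income": 48000, "missed_payments": 1}
changes = {"income": 55000, "missed_payments": 0}

print(toy_model(applicant))                                # deny
print(find_counterfactual(applicant, changes, toy_model))  # {'income': 55000}
```

The result reads as a "what-if" statement: had the income been 55000, the output would have been "approve". A brute-force search like this is only workable for a handful of discrete features; it is meant to show the shape of the question, not a practical method for LLMs.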


But the authors argue that counterfactuals alone are insufficient. A good explanation must also account for the prior beliefs of the person receiving it. Specifically, the value of each fact offered in an explanation depends on whether the interlocutor already holds that belief (or its negation). An explanation is only meaningful if it updates the receiver's understanding, either by presenting information they did not already know or by correcting an assumption they held wrongly.
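One way to picture this dependence on prior beliefs is a small scoring sketch. It is not the authors' formal definition: the function names, the example propositions, and the weights (0 for a fact the receiver already believes, 1 for a fact they hold no belief about, 2 for a fact that contradicts what they believe) are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional


def fact_value(asserted: bool, believed: Optional[bool]) -> float:
    """Score one fact for one receiver (weights are arbitrary assumptions).

    asserted: the truth value the explanation states for the proposition.
    believed: what the receiver currently believes, or None if no belief.
    """
    if believed is None:
        return 1.0  # new information
    if believed != asserted:
        return 2.0  # corrects a belief the receiver held wrongly
    return 0.0      # already known, so it adds nothing


def tailor_explanation(facts: Dict[str, bool], prior: Dict[str, bool]) -> List[str]:
    """Keep only the facts that would change this receiver's beliefs."""
    return [
        name
        for name, asserted in facts.items()
        if fact_value(asserted, prior.get(name)) > 0
    ]


# Hypothetical facts behind a model's decision.
facts = {
    "income_below_threshold": True,
    "missed_payment_counted": True,
    "age_was_used": False,
}

# Two hypothetical receivers with different priors.
novice_prior = {"age_was_used": True}  # wrongly believes age was used
expert_prior = dict(facts)             # already believes every fact

print(tailor_explanation(facts, novice_prior))
print(tailor_explanation(facts, expert_prior))
```

Under these assumptions the novice receives all three facts, one of which corrects a mistaken belief, while the expert receives none, because repeating what they already believe would not update their understanding.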


Why LLMs Are Particularly Hard to Explain


Applying this definition to LLMs reveals several fundamental challenges:


  1. Massive, opaque internal states: The relevant prior beliefs of an LLM are encoded across billions of parameters, making it nearly impossible to isolate which specific facts or latent patterns influenced a given output.

  2. Contextual dependence: LLM outputs depend heavily on prompt wording, few-shot examples, and token-level probability flows, factors that are not easily captured by static counterfactual frameworks.

  3. User priors mismatch: The static, training-set-derived knowledge of an LLM rarely aligns perfectly with the dynamic, individual priors of any single user. An explanation that is informative for one person may be redundant or opaque for another.

  4. Trade-offs between fidelity and intelligibility: As of 2026, state-of-the-art explanation methods (e.g., attention attribution, probing classifiers, or neuron-level interpretability) often sacrifice either faithfulness to the model's actual reasoning or comprehensibility to a human audience. The authors' framework makes this tension explicit.

Implications for the Field


By linking explainability to epistemic state (what the receiver knows or believes), this work shifts focus from merely illuminating model mechanics to designing explanations that are genuinely communicative. It also highlights the need for adaptive explanation systems that can model user priors and tailor counterfactual narratives accordingly, a direction that is likely to grow in importance as LLMs are deployed in personalized educational, medical, and legal contexts.


The authors conclude that without a clear, user-aware definition of what makes an explanation good, evaluating model transparency risks becoming a purely subjective exercise. Their 2026 contribution provides a formal basis for moving toward explanations that are not just transparent, but truly explanatory.

via ArXiv AI
