Mechanistic Interpretability Needs Philosophy

Citation

Williams, I., Oldenburg, N., Dhar, R., Hatherley, J., Fierro, C., Rajcic, N., Schiller, S.R., Stamatiou, F., & Søgaard, A. (2025). Mechanistic Interpretability Needs Philosophy. arXiv:2506.18852.

Abstract

Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying causal mechanisms. As the field grows in influence, it is increasingly important to examine not just models themselves, but the assumptions, concepts and explanatory strategies implicit in MI research. The authors argue that MI needs philosophy: not as an afterthought, but as an ongoing partner in clarifying its concepts, refining its methods, and assessing the epistemic and ethical stakes of interpreting AI systems.

Framing

Position paper arguing for sustained interdisciplinary collaboration between philosophy and MI. Frames MI as “pre-paradigmatic”: a field with fundamental open problems and unexamined assumptions, where philosophical partnership can accelerate progress. Parallels earlier arguments for philosophy’s role in physics, cognitive science, and economics.

Key Contributions

  1. Defines MI by two commitments: (a) explaining model behavior via causal mechanisms, not just correlations, and (b) producing scientific understanding for researchers rather than explanations for end-users (see the patching sketch after this list)

  2. Examines three open problems to demonstrate philosophy’s value:

    • Network decomposition (philosophy of mechanistic explanation)
    • Features/representations (philosophy of mind, content theories)
    • Deception detection (ethics, philosophy of language)

  3. Addresses four objections to philosophical engagement in MI
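
Commitment (a) is what separates MI from correlational probing: a claim about mechanism should survive intervention, not just observation. A minimal activation-patching sketch in PyTorch (the toy model, inputs, and hook placement are my illustrative assumptions, not the paper’s):

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Toy model; the ReLU output is the hidden "mechanism" under study.
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
    clean, corrupted = torch.randn(1, 4), torch.randn(1, 4)
    cache = {}

    def record(mod, inp, out):
        cache["h"] = out.detach()   # observe only; output unchanged

    def patch(mod, inp, out):
        return cache["h"]           # returning a value replaces the output

    # Observation (correlational): record the hidden state on a clean run.
    handle = model[1].register_forward_hook(record)
    clean_out = model(clean)
    handle.remove()

    # Intervention (causal): rerun the corrupted input with the clean
    # hidden state patched in and measure the downstream effect.
    handle = model[1].register_forward_hook(patch)
    patched_out = model(corrupted)
    handle.remove()

    print(f"clean {clean_out.item():+.3f} | corrupted {model(corrupted).item():+.3f} "
          f"| patched {patched_out.item():+.3f}")

In this toy case the patched output exactly matches the clean output, since the rest of the computation is determined by the hidden state; in real models, partial restoration is the typical signal.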

Core Arguments

On Decomposition:

  • Challenges the assumption of “one true decomposition”: there is no privileged level at which mechanistic truth resides (see the change-of-basis sketch below)
  • Behavior and mechanism are interrelated, not opposed: behavioral evidence constrains hypotheses about underlying mechanisms
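
A minimal sketch of why a neuron-level decomposition is not privileged, under the simplifying assumption of a purely linear network (my construction, not the paper’s): an orthogonal change of basis in the hidden layer, compensated downstream, leaves behavior untouched while redrawing the network’s “parts”.

    import torch

    torch.manual_seed(0)

    # A purely linear two-layer computation: y = W2 (W1 x).
    W1, W2 = torch.randn(8, 4), torch.randn(1, 8)
    x = torch.randn(5, 4)

    h = x @ W1.T                    # one candidate decomposition: the units of h
    y = h @ W2.T

    # Rotate the hidden basis by an orthogonal R and compensate downstream.
    R = torch.linalg.qr(torch.randn(8, 8)).Q
    h_rot = x @ (R @ W1).T          # a different set of "parts"
    y_rot = h_rot @ (W2 @ R.T).T

    print(torch.allclose(y, y_rot, atol=1e-5))   # True: identical behavior
    print(torch.allclose(h, h_rot))              # False: different decomposition

Nonlinearities and sparsity constraints narrow the space of admissible decompositions, but they do not obviously single out one true one.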

On Features:

  • Distinguishes representational vehicles (internal components) from content (what they represent)
  • This distinction clarifies research questions and connects MI to prior theories of representation (see the probe sketch below)
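
A synthetic sketch of the distinction (my own setup, not the paper’s method): a linear probe locates a candidate vehicle, a direction in activation space, for a stipulated content, here a binary property planted in fake activations.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)

    # Fake activations with a planted direction (the hypothetical vehicle)
    # that tracks a binary property (the stipulated content).
    d = 16
    direction = torch.randn(d)
    labels = torch.randint(0, 2, (200,)).float()
    acts = torch.randn(200, d) + labels[:, None] * direction

    # Train a linear probe to read the content back off the activations.
    probe = torch.zeros(d, requires_grad=True)
    opt = torch.optim.SGD([probe], lr=0.1)
    for _ in range(200):
        loss = F.binary_cross_entropy_with_logits(acts @ probe, labels)
        opt.zero_grad(); loss.backward(); opt.step()

    cos = F.cosine_similarity(probe, direction, dim=0).item()
    print(f"probe vs. planted direction: cosine = {cos:+.2f}")

The probe only establishes that a direction correlates with the property; whether that direction genuinely carries the content is exactly the question content theories are meant to answer.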

On Deception:

  • Lying and deception require cognitive complexity (beliefs, intentions, assertoric commitment)
  • It’s controversial whether LLMs possess these in the relevant sense
  • Not all information concealment is ethically equivalent

Extracted Content

  • Sharkey et al. 2025 - Open Problems in Mechanistic Interpretability
  • Bereska & Gavves 2024 - Mechanistic Interpretability for AI Safety
  • Kästner & Crook 2024 - Explaining AI through Mechanistic Interpretability
  • Chalmers 2025 - Propositional Interpretability in AI