Behavior and Mechanism Are Interrelated
The contrast between “inner” mechanistic approaches and “behavioral” approaches to AI interpretability is misleading. They’re not opposed — they’re interdependent.
Behavior informs mechanism hypotheses. Carefully designed behavioral studies that map edge cases, identify breakdown patterns, and test for “signatures” of specific algorithms constrain which internal mechanisms are plausible. Benchmarking a model’s success rate on a task is the narrow version of this; the more fruitful version systematically probes where and how behavior departs from what is expected.
Internal structure requires behavioral validation. A component’s functional significance is validated through observed effects on system-level outputs. Philosophy of cognitive science calls this “looking down, around, and up” — understanding a part’s role requires determining its contribution within larger behavioral contexts.
The most fruitful mechanistic interpretability (MI) research therefore (1) identifies internal components and (2) demonstrates that these components play well-defined causal roles in producing system-level behavior. Many interventional methods already do this: ablations, activation patching, and causal tracing each intervene on an internal component and measure the resulting change in output.
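To make the two-step pattern concrete, here is a minimal sketch of zero-ablation and activation patching on a toy network. Everything in it is an assumption made for illustration: the hand-set weights, the three-unit hidden layer, and the helper names (`forward`, `ablation_effect`, `patching_recovery`) are hypothetical and do not come from any particular MI library or model.

```python
# Toy illustration only: a hand-wired 2-input, 3-hidden-unit, 1-output ReLU net.
# All weights and names are hypothetical, chosen to keep the arithmetic simple.

W_IN = [
    [1.0, 0.0],  # hidden unit 0 reads input feature 0
    [0.0, 1.0],  # hidden unit 1 reads input feature 1
    [0.5, 0.5],  # hidden unit 2 mixes both features
]
W_OUT = [2.0, 0.0, 0.1]  # the output depends mostly on hidden unit 0


def relu(x):
    return x if x > 0.0 else 0.0


def hidden(x):
    """Return the hidden activations for input vector x."""
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_IN]


def forward(x, override=None):
    """Run the network; `override` maps a hidden index to a forced activation."""
    h = hidden(x)
    for idx, value in (override or {}).items():
        h[idx] = value
    return sum(w * a for w, a in zip(W_OUT, h))


def ablation_effect(x, unit):
    """Change in output when `unit` is zero-ablated on input x."""
    return forward(x) - forward(x, {unit: 0.0})


def patching_recovery(clean, corrupted, unit):
    """Fraction of the clean-vs-corrupted output gap that is restored by
    copying `unit`'s clean activation into the corrupted run."""
    clean_out = forward(clean)
    corrupted_out = forward(corrupted)
    gap = clean_out - corrupted_out
    if gap == 0.0:
        raise ValueError("clean and corrupted inputs give the same output")
    patched_out = forward(corrupted, {unit: hidden(clean)[unit]})
    return (patched_out - corrupted_out) / gap


clean, corrupted = [1.0, 1.0], [0.0, 1.0]
for unit in range(3):
    print(unit, ablation_effect(clean, unit),
          patching_recovery(clean, corrupted, unit))
```

In this toy setup, hidden unit 0 accounts for nearly all of both effects, unit 1 for none, and unit 2 for a small remainder. The point is the shape of the argument, not the numbers: a component is first singled out internally, and its role is then confirmed only by what happens to the output when it is intervened on.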
The philosophical insight: internal structure acquires explanatory force only when its functional significance is validated through behavior.