Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
William Lehn-Schiøler, Magnus Ruud Kjær, Rahul Thapa, Magnus Guldberg Pedersen, Anton Mosquera Storgaard, Nick Williams, Radu Gatej, Tue Lehn-Schiøler, Andreas Brink-Kjær, Sadasivan Puthusserypady, Sándor Beniczky, James Zou, Lars Kai Hansen
Key claim
Framework reveals critical representational failures in EEG models.
This paper presents a framework for interpreting EEG foundation models: it extracts sparse feature dictionaries and grounds them in clinical taxonomies. A key result is the identification of operational regimes that expose critical representational failures, which bear directly on how far clinicians can trust model predictions.
- Introduces a novel framework for interpreting EEG transformer models.
- Employs solid methodology with robust evaluations across multiple architectures.
Deep reliability assessment
The methodology supports extracting interpretable features from EEG foundation models using Sparse Autoencoders and concept steering, but the paper may overclaim how well these findings generalize across architectures without thorough validation. The framework's ability to translate latent manipulations into clinically interpretable frequency signatures is promising but needs further empirical backing.
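For readers unfamiliar with the two techniques named above, the following is a minimal sketch of a generic ReLU sparse autoencoder with an L1 sparsity penalty, plus concept steering along one decoder direction. It is not the paper's implementation: the dictionary size, the loss form, the random (untrained) weights, and every name (`encode`, `decode`, `sae_loss`, `steer`) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict = 16, 64  # hypothetical embedding and dictionary sizes

# Untrained, randomly initialised weights (illustration only).
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions
b_dec = np.zeros(d_model)

def encode(x):
    """Map embeddings to non-negative sparse feature activations."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def decode(f):
    """Reconstruct embeddings as a weighted sum of dictionary directions."""
    return f @ W_dec + b_dec

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    f = encode(x)
    recon = np.mean(np.sum((x - decode(f)) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity

def steer(x, feature_idx, strength):
    """Concept steering: push embeddings along one feature's decoder direction."""
    return x + strength * W_dec[feature_idx]

embeddings = rng.normal(size=(8, d_model))  # stand-in for model embeddings
features = encode(embeddings)
loss = sae_loss(embeddings)
steered = steer(embeddings, feature_idx=3, strength=2.0)
shift = steered - embeddings
```

In a pipeline like the one summarized here, the steered embeddings would then be passed to a decoder so that the effect of a single feature can be read off in signal space; that step is omitted because its details are not given in this summary.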
Reproducibility
Yes. The code is available at https://github.com/BrainCapture/mechanistic-interpretability-for-eeg-foundation-models/tree/preprint. The dataset is a clinical 27-lead EEG dataset collected from 3,036 subjects.
Discussion questions
- How might the assumption of monosemanticity impact the interpretation of clinical features in EEG data?
- What are the implications of this framework for the development of trustworthy AI systems in healthcare?
- If the spectral decoder fails to accurately reconstruct phase information, how would that affect the clinical applicability of the model?
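As background for the last question, the sketch below illustrates why phase matters: two signals can share an identical magnitude spectrum yet look very different in time. It is a generic signal-processing illustration, not the paper's spectral decoder; the 256 Hz sampling rate, the synthetic 10 Hz signal with a spike-like transient, and the helper `phase_scramble` are all assumptions.

```python
import numpy as np

def phase_scramble(signal, rng):
    """Return a real signal with the same magnitude spectrum but random phases."""
    spectrum = np.fft.rfft(signal)
    random_phase = rng.uniform(-np.pi, np.pi, size=spectrum.shape)
    scrambled = np.abs(spectrum) * np.exp(1j * random_phase)
    # DC and (for even lengths) Nyquist bins must stay real for a real signal,
    # so keep the original values there.
    scrambled[0] = spectrum[0]
    if signal.size % 2 == 0:
        scrambled[-1] = spectrum[-1]
    return np.fft.irfft(scrambled, n=signal.size)

rng = np.random.default_rng(0)
fs = 256                              # hypothetical sampling rate in Hz
t = np.arange(2 * fs) / fs            # two seconds of samples
x = np.sin(2 * np.pi * 10 * t)        # 10 Hz oscillation
x[200:205] += 3.0                     # brief spike-like transient
y = phase_scramble(x, rng)

mag_x = np.abs(np.fft.rfft(x))
mag_y = np.abs(np.fft.rfft(y))
same_magnitude = np.allclose(mag_x, mag_y)
same_waveform = np.allclose(x, y)
```

Here `same_magnitude` is true while `same_waveform` is false: the transient is smeared across the whole window once phase is lost. Any morphology that depends on timing would be affected in the same way if a decoder recovered only spectral magnitude.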
Key figure
Figure 1 illustrates the interpretability pipeline, which integrates a spectral decoder, Sparse Autoencoders, and concept attribution to map high-dimensional embeddings to clinically relevant features.