The medium is not the message: Deconfounding text embeddings via linear concept erasure
Summary
This paper argues that pretrained sentence embeddings encode “medium” signals (source, language, style) that act as observed confounders when researchers use embedding similarity to cluster or retrieve documents pooled from heterogeneous corpora. The authors recast linear concept erasure (LEACE) as a way to subtract these confounder loadings from a structural decomposition of dot-product similarity. Across ten embedding models and a new paired benchmark, they show that erasure improves clustering and retrieval, in some cross-language settings by more than 0.6 Recall@1, without degrading out-of-distribution performance. The practical claim is that linear erasure should be a default preprocessing step whenever applied analysts pool texts across known sources or languages.
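One way to read the decomposition, offered as an illustrative assumption rather than the paper's own notation: suppose each embedding splits additively into a content part $s_i$, a confounder part $c_i$ driven by the observed medium, and noise $\varepsilon_i$ (all three symbols are hypothetical). The dot product then separates as follows.

```latex
e_i = s_i + c_i + \varepsilon_i, \qquad
\langle e_i, e_j \rangle
  = \underbrace{\langle s_i, s_j \rangle}_{\text{content}}
  + \underbrace{\langle c_i, c_j \rangle}_{\text{medium}}
  + \underbrace{\langle s_i, c_j \rangle + \langle c_i, s_j \rangle}_{\text{content--medium cross terms}}
  + (\text{terms involving } \varepsilon)
```

Under this reading, erasure targets the terms involving $c$, so that the similarity of two documents from the same medium is no longer inflated by their shared confounder component.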
Key Contributions
- A formal framing of embedding debiasing as removing observed-confounder contributions from similarity estimands.
- A paired benchmark spanning category-level (Comparative Agendas Project) and event-level (Super-SCOTUS, SemEval 2022 Task 8, SwiLTra-Bench Swiss court summaries) data designed to isolate confounder effects.
- Broad empirical evaluation of LEACE across ten embedding models, with clustering, retrieval, and OOD-MTEB diagnostics.
- A variance-alignment diagnostic linking erasure gains to PC1: the more the confounder dominates the top variance direction, the larger the gain from erasing it.
- Open-source code framing linear erasure as a cheap, principled preprocessing step for applied computational social science.
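The summary does not spell out how the variance-alignment diagnostic is computed. One plausible reading, sketched here as an assumption, is the share of PC1 score variance explained by the confounder labels (a between-group over total variance ratio). The function name `pc1_confounder_share` is hypothetical.

```python
import numpy as np

def pc1_confounder_share(X, labels):
    """Fraction of variance along PC1 explained by group labels (0 to 1)."""
    Xc = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ vt[0]
    total = scores.var()
    grand_mean = scores.mean()
    between = 0.0
    for g in np.unique(labels):
        mask = labels == g
        # Weight each group's squared mean offset by its share of the data.
        between += mask.mean() * (scores[mask].mean() - grand_mean) ** 2
    return between / total

# Synthetic illustration: two "sources" separated along one axis.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
X = rng.normal(size=(400, 8))
X[:, 0] += 10.0 * labels
share = pc1_confounder_share(X, labels)
```

A value near 1 means the top variance direction is mostly the confounder, which is the regime where the summary reports the largest erasure gains.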
Methods
The authors apply the closed-form LEACE algorithm to precomputed embeddings to remove the subspace that is linearly predictive of each metadata confounder (source, language). They evaluate (i) clustering, using k-means purity and adjusted Rand index (ARI); (ii) paired retrieval against distractor pools, using Recall@1 and Recall@10; and (iii) out-of-distribution (OOD) effects, using MTEB legal retrieval, news retrieval, STS, and bitext mining tasks. A PCA-removal baseline (dropping PC1) is included for contrast, and a correlation analysis ties the PC1 variance share to the improvement in Recall@1.
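The erasure step can be sketched in NumPy as below. This follows the published closed form for LEACE (whiten, project out the whitened cross-covariance directions, unwhiten) and is not the authors' released code; `fit_leace` and `apply_leace` are hypothetical names, and the confounder is assumed to be supplied as a one-hot matrix.

```python
import numpy as np

def fit_leace(X, Z, tol=1e-8):
    """X: (n, d) embeddings. Z: (n, k) one-hot confounder labels."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Zc = Z - Z.mean(axis=0)
    n = X.shape[0]
    sigma_xx = Xc.T @ Xc / n
    sigma_xz = Xc.T @ Zc / n

    # Whitening matrix W = Sigma_xx^{-1/2} (pseudo-inverse) and its inverse.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > tol * evals.max()
    V, lam = evecs[:, keep], evals[keep]
    W = (V / np.sqrt(lam)) @ V.T
    W_inv = (V * np.sqrt(lam)) @ V.T

    # Orthonormal basis for the column space of the whitened cross-covariance.
    U, s, _ = np.linalg.svd(W @ sigma_xz, full_matrices=False)
    U = U[:, s > tol * s.max()]

    # Oblique projection that is removed from each centered embedding.
    P = W_inv @ (U @ U.T) @ W
    return mu, P

def apply_leace(X, mu, P):
    """Erase the fitted confounder directions from row-vector embeddings."""
    return X - (X - mu) @ P.T

# Synthetic illustration: three "sources", each adding its own offset.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=600)
Z = np.eye(3)[labels]
X = rng.normal(size=(600, 16)) + 2.0 * (Z @ rng.normal(size=(3, 16)))
mu, P = fit_leace(X, Z)
X_clean = apply_leace(X, mu, P)
```

After erasure, the sample cross-covariance between `X_clean` and `Z` is zero up to floating-point error, which is the sense in which no linear classifier can recover the confounder from the fitted data.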
Findings
- Erasing source improved clustering for every CAP source-pair across all ten models (e.g., GIST-small on Bills–Newspapers: +0.169 purity, +0.157 ARI).
- Language erasure produced very large cross-language retrieval gains on Swiss court summaries (E5-large +0.651 Recall@1 on DE–IT) and on SemEval multilingual news (E5-small +0.236 Recall@1).
- All model/dataset combinations on SCOTUS paired summaries improved with erasure.
- Erasers fit with LEACE transferred to MTEB legal/news/STS tasks with no meaningful degradation; on bitext mining, E5-large-instruct + LEACE set a new state of the art on three leaderboard tasks.
- PC1 variance share correlated strongly with Recall@1 improvement (r = 0.79).
- Naive PC1 removal gave inconsistent in-domain gains and catastrophically harmed MTEB, unlike LEACE.
- Erasure gains are smaller when confounder categories are numerous relative to the available data, and in some short-query retrieval settings.
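For readers who want to reproduce the kind of numbers listed above, here is a minimal sketch of paired Recall@k and of the naive drop-PC1 baseline. It assumes dot-product similarity and that the true match for query `i` sits at row `i` of the pool, with distractors appended after the matches; `recall_at_k` and `drop_pc1` are hypothetical names, not the paper's code.

```python
import numpy as np

def recall_at_k(queries, pool, k):
    """Share of queries whose true match (same row index) ranks in the top k."""
    sims = queries @ pool.T
    idx = np.arange(len(queries))
    true_sim = sims[idx, idx]
    # Rank = number of pool items scoring strictly higher than the true match.
    rank = (sims > true_sim[:, None]).sum(axis=1)
    return float((rank < k).mean())

def drop_pc1(X):
    """Naive baseline: remove the top principal direction from every row."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    v = vt[0]
    return X - np.outer(Xc @ v, v)

# Tiny illustration: three queries, three matches, two distractors.
queries = np.eye(3)
pool = np.vstack([np.eye(3), 0.5 * np.ones((2, 3))])
r1 = recall_at_k(queries, pool, k=1)
```

Comparing `recall_at_k` on raw, LEACE-erased, and `drop_pc1` embeddings is one way to set up the contrast between targeted erasure and blind variance removal.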
Connections
This work is methodological infrastructure for the growing strand of computational social science (CSS) research that uses embedding similarity to compare or cluster heterogeneous corpora. It is relevant to cross-platform or cross-source measurement applications such as Bouchaud2026-lr, Balluff2026-if, and Bastos2025-ya, and to embedding-based pipelines for political or legal text such as Peters2026-mo. It also complements critical methodological work questioning what unsupervised text representations actually capture (e.g., Bak-Coleman2026-mk, Munger2025-cz), reframing those concerns as a tractable observed-confounding problem rather than a fundamental limit.