Red-Teaming and Public Involvement in AI Evaluation
From Technical Procedure to Sociotechnical Practice
The papers gathered here share a foundational move: they reframe what look like neutral, technical methods for evaluating AI — red-teaming, benchmarking, auditing — as sociotechnical practices saturated with value judgments, labor arrangements, and institutional politics. Gillespie2026-aa makes this case most pointedly for red-teaming, drawing a direct historical parallel to commercial content moderation: a practice that was similarly rationalized as technical hygiene before its labor conditions, psychological tolls, and embedded value choices became visible. Unknown2025-qj, working empirically through interviews and observation at public red-teaming events, corroborates this reframing from the ground up, showing how what counts as a “harm” or “vulnerability” depends on institutional framing, participant composition, and the genealogy of adversarial methods imported from cybersecurity and military contexts. Matias2025-px generalizes the move beyond red-teaming to AI evaluation writ large, arguing that reliability itself — not merely legitimacy — is a sociotechnical achievement that purely technical methodologies cannot deliver.
Who Evaluates, and With What Expertise
A second thread concerns the epistemic question of who is competent to identify AI harms. Gillespie2026-aa observes that internal red-teamers, constrained by NDAs and homogeneous in background, typically lack the sociocultural and linguistic range to surface the harms that matter to differently situated publics. Matias2025-px turns this critique into a positive program: lived-experience expertise contributes situated knowledge that credentialed scientists cannot supply, and is essential at five distinct stages of evaluation (equipoise, measurement, explanation, inference, interpretation). The Allegheny Family Screening Tool reanalysis and the Chicago police-complaint relabeling cases show concretely how community knowledge surfaces failures that AUC-style metrics or single-category coding obscure. Unknown2025-qj occupies an empirical middle position: public red-teaming events do bring diverse perspectives into vulnerability identification, but the paper is careful not to romanticize participation, noting that institutional framing constrains what participants can even name as harm.
Labor, Participation, and the Risk of Extraction
The papers both converge and diverge most productively on the political economy of participation. Gillespie2026-aa is sharply skeptical: outsourced and crowdsourced red-teaming reproduces the labor arbitrage and precarity of content moderation, while volunteer-driven events like DEFCON risk relying extractively on the unpaid labor of the very marginalized communities they claim to represent. The distinctive psychological costs of red-teaming (secondary trauma, moral injury from inhabiting adversarial personas) compound these concerns. Matias2025-px acknowledges parallel issues around consent and worker treatment in contributory citizen-science models, but argues that participatory science has developed methods to address objections about subjectivity, scale, and cost without resorting to extraction. Unknown2025-qj sits at the empirical fulcrum of this debate, documenting how public-facing red-teaming events surface considerations invisible to industry red-teaming while also varying significantly with organizational context. The implication is that “public interest” is itself a contested designation that institutional sponsors can capture.
Toward a Research and Governance Agenda
Read together, the three papers sketch a coherent agenda. All reject the framing of AI evaluation as a closed technical loop owned by developers; all argue that legitimacy and rigor are intertwined; all warn against premature institutionalization of evaluation practices whose labor and value foundations remain opaque. Gillespie2026-aa calls for a coordinated, interdisciplinary research network to study red-teaming empirically before market-driven forms ossify. Matias2025-px offers a methodological scaffold for how that empirical work, and AI evaluation more broadly, should incorporate lived-experience expertise as a matter of scientific quality. Unknown2025-qj provides early empirical grounding for what public-interest red-teaming actually looks like in practice. The open question across the cluster is governance: whether participatory and public-interest forms of evaluation can be institutionalized without reproducing the extractive, opaque patterns, indifferent to well-being, that the content-moderation analogy warns against.