AI red-teaming is a sociotechnical problem

Summary

This essay reframes AI red-teaming — the adversarial probing of generative AI systems for harmful outputs, vulnerabilities, and biases — as a sociotechnical practice rather than a purely technical safety procedure. Gillespie, Shaw, Gray, and Suh argue that the rapid institutionalization of red-teaming across industry and policy (e.g., Biden’s EO 14110, DEFCON 2023) has obscured three critical dimensions: the embedded value judgments about what counts as harm, the labor arrangements that organize the work, and the psychological costs borne by red-teamers. Drawing an extended analogy to the history of commercial content moderation, the authors warn that the field is repeating familiar patterns of opacity, outsourcing, and worker neglect — and call for a coordinated interdisciplinary research agenda before these arrangements become entrenched.

Key Contributions

  • Establishes AI red-teaming as a sociotechnical problem domain rather than a technical evaluation method.
  • Develops a structured analogy between red-teaming and commercial content moderation, exposing shared dynamics of harm definition, labor arbitrage, and worker harm.
  • Identifies moral injury and secondary traumatic stress as psychological risks of red-teaming, with moral injury tied specifically to sustained adversarial roleplay.
  • Provides a critical vocabulary — values, labor, well-being — for empirical and policy research on AI safety work.
  • Issues a programmatic call for a cross-disciplinary research network spanning CS, social science, humanities, and law.

Methods

This is a conceptual and critical essay rather than an empirical study. The authors synthesize their prior research on Responsible AI labor and participatory AI governance, draw comparative lessons from the content moderation literature (Roberts, Gray & Suri), and engage STS, labor studies, and psychology. They review public-facing materials from major AI companies (OpenAI, Anthropic, Google, Microsoft), U.S. policy documents, and high-profile red-teaming events.

Findings

  • Red-teaming remains conceptually fuzzy, blurring with evaluation, bug bounties, penetration testing, and ethical hacking.
  • Internal red-teamers often lack the sociocultural and linguistic expertise to surface diverse harms, and are constrained by NDAs and corporate incentives.
  • Third-party and crowdsourced red-teaming replicates the precarity, weak protections, and labor arbitrage of content moderation pipelines.
  • Volunteer/event-based formats (e.g., DEFCON) broaden participation but risk extractive reliance on marginalized communities and don’t scale.
  • Workers face secondary trauma, PTSD-like symptoms, and moral injury — the last sharpened by the demand to inhabit adversarial, transgressive personas.
  • Existing well-being supports (EAPs, content warnings, opt-outs) are unevenly applied and undermined by surveillance and performance pressures; non-use is wrongly read as non-need.
  • Claims that red-teaming will be automated away obscure rather than eliminate the human labor involved.

Connections

This paper extends the content-moderation-as-labor tradition into generative AI safety and connects directly to participatory and community-based approaches to AI evaluation, such as Matias2025-px and Unknown2025-qj, which similarly interrogate who gets to identify harms and on what terms. Its critique of internal, proprietary harm definitions resonates with calls to broaden participation in red-teaming beyond corporate boundaries, while its attention to worker well-being adds a labor-and-health dimension often missing from participatory governance discussions.

Podcast

A research-radio episode discusses this paper.