Participatory and Public-Interest AI Evaluation (Structure)

From Technical Audit to Sociotechnical Practice

The three papers gathered here share a common premise: AI and digital systems cannot be adequately evaluated through internal technical metrics alone, because their harms, failures, and accountability gaps are produced at the intersection of design, organizational context, and lived experience. Each paper picks a different leverage point — regulatory design auditing, scientific evaluation methodology, and safety testing — and shows how publics, experts-by-experience, or legal-design intermediaries reshape what counts as a defect worth surfacing.

Reframing Evaluation Around Lived Experience and Equipoise

Matias2025-px makes the most explicit methodological argument: public involvement is not a political concession but a source of scientific rigor. Across five stages — equipoise, measurement, explanation, inference, interpretation — the authors document how lived-experience experts catch flaws professional evaluators miss, such as the bipartite-ranking bias in the Allegheny Family Screening Tool or the miscoded sexual-violation allegations in Chicago police complaints. The implicit claim is that “trustworthy AI” reduces to trustworthy science, and trustworthy science is constitutively participatory. This recasts older worries about subjectivity or scale as solvable methodological problems rather than reasons to exclude publics.

Adversarial and Critical Engagement: Red-Teaming as Public Practice

Unknown2025-qj picks up where Matias et al. leave off, examining one concrete venue, red-teaming, where publics are actively recruited into the discovery of harms. Where Matias et al. frame participation as improving measurement and inference, this paper highlights how the framing of an event determines what even registers as a vulnerability. Adversarial methods inherited from cybersecurity, the authors argue, must be reworked when the “system under test” is a generative model whose harms are diffuse, contextual, and often invisible to its developers. Together, the two papers suggest a spectrum of participatory evaluation, from co-created study design (Matias et al.) to time-bounded adversarial probing (red-teaming), each suited to different evaluation questions.

Design-Oriented Reasoning in Regulatory Enforcement

Ahuja2025-ku approaches the same sociotechnical terrain from the regulatory side. Rather than asking publics to identify harms, it asks how legal categories — the DSA’s prohibitions on deception, manipulation, and distortion/impairment — can be operationalized through HCI’s design vocabulary. The “law-to-design” framework, by mapping 59 dark patterns onto eight design factors across information and choice spaces, performs a translation function analogous to what Matias calls measurement and explanation: it gives enforcers a reasoned account of why a particular interface violates autonomy. The paper’s call for “regulatory design auditing” as a new HCI subfield parallels the red-teaming paper’s argument that safety practices need institutional homes attentive to public interest.

Points of Convergence and Tension

All three papers reject the view that evaluation is a neutral technical exercise downstream of system design. They converge on the claim that harms are co-produced by design choices, deployment contexts, and the framings used to look for them, and that addressing this requires new intermediary practices (participatory protocols, public red-teams, law-to-design frameworks) that sit between developers, regulators, and affected communities. They differ in their theory of change: Matias2025-px reforms science from within; Unknown2025-qj democratizes an existing safety ritual; Ahuja2025-ku equips regulators with conceptual tools to enforce existing law. A productive tension runs through the set: participatory inclusion (Matias2025-px, Unknown2025-qj) emphasizes situated knowledge from below, while the law-to-design move (Ahuja2025-ku) consolidates expert reasoning to make enforcement tractable. The open question the topic leaves for further notes is whether these are complementary stages of a single accountability pipeline (publics surface harms, experts codify them, regulators act) or whether codification risks displacing the participatory work that uncovered the harms in the first place.
