Computational Methods and LLMs in Communication Research

A methodological agenda in formation

The papers gathered under this topic mark out a coherent computational turn in communication research, one organized less around a single technique than around a set of recurring design questions: when to use an LLM versus a smaller specialized model, how to validate outputs that no longer come from transparent classifiers, and how to align measurement with the theoretical estimands communication scholars actually care about. Balluff2026-if frames this agenda critically, warning that the field has adopted LLMs faster than it has reckoned with their epistemic, environmental, and reproducibility costs, and arguing for a “trade-off mindset” that prefers the least resource-intensive tool sufficient for the task. Most of the empirical work in this folder can be read as implicit responses to that challenge — staking out where LLMs genuinely add value and where simpler pipelines suffice.

LLMs as annotators and classifiers

A first cluster takes LLMs seriously as substitutes or scaffolds for human coders. Tan2024-vl provides the broad survey, while Brown2025-jk supplies an unusually sober empirical check: across four contentious annotation datasets, demographic bias in LLM labels is small, dataset-specific rather than model-specific, and dwarfed by item difficulty as a predictor of LLM–human agreement. Domain-specific fine-tuning emerges as a second strategy: Meher2025-qb shows QLoRA-tuned Llama-3 reaching strong performance on terrorism event classification on consumer hardware, while Bailard2024-pj uses fine-tuned DeBERTa to classify collective action frames at scale and link them to offline Proud Boys violence — a reminder that smaller supervised transformers, in line with Balluff et al.’s critique, often remain the right tool. Larsson2026-ro and Iris2026-pg use GPT-4 zero- or few-shot for sentiment and entity extraction in non-English news and Facebook contexts, treating LLMs as flexible multilingual coders validated against human judgment.
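
As a point of reference for what "validated against human judgment" can mean in practice, the sketch below computes Cohen's kappa, a chance-corrected agreement statistic, between a human coder and an LLM. It is an illustration only: the papers above may report different metrics, and the function name `cohens_kappa` and the toy labels are assumptions made for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two coders over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item set")
    n = len(labels_a)
    # Share of items on which the two coders assign the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each coder's marginal label rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: four items coded by a human and by an LLM.
human = ["pos", "pos", "neg", "neg"]
llm = ["pos", "neg", "neg", "neg"]
kappa = cohens_kappa(human, llm)
```

A single pooled statistic of this kind hides the item-level variation that Brown2025-jk identifies as the main driver of disagreement, so it is better read as a starting point than as a full validation.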

Several papers extend annotation into more demanding inferential terrain. Lee2026-je shows that GPT-4o can infer users’ partisanship even from non-political Reddit and Debate.org text, exploiting culturally politicized vocabulary; Le-Mens2025-qz turns prompting into a scaling method by eliciting LLM placements of political texts on ideological dimensions and averaging them; and DiGiuseppe2025-es combines LLMs with paired-comparison designs to scale open-ended survey responses. Paci2025-ag supplies the counterpoint: on implicit content in Italian political speech, even GPT-4o-mini falls more than twenty points below the human ceiling, suggesting that pragmatic competence, and not only classification accuracy, remains a hard limit.
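
The two scaling ideas in this paragraph can be sketched in a few lines. The first function averages several LLM placements per text; the second fits a Bradley-Terry model to "which of these two is more X" judgments using the standard minorization-maximization update. This is a minimal sketch under stated assumptions: the source does not say which statistical model either paper fits, and the function names, the `(winner, loser)` input format, and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def average_placements(placements):
    """Map each text id to the mean of its LLM placements on one dimension."""
    return {text_id: fmean(scores) for text_id, scores in placements.items()}


def bradley_terry(comparisons, n_iter=200):
    """Estimate latent strengths from (winner, loser) paired comparisons."""
    if not comparisons:
        raise ValueError("need at least one comparison")
    items = sorted({item for pair in comparisons for item in pair})
    wins = defaultdict(int)      # number of comparisons each item won
    n_pairs = defaultdict(int)   # number of comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1
        n_pairs[frozenset((winner, loser))] += 1

    strength = {item: 1.0 for item in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = sum(
                n_pairs[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i and frozenset((i, j)) in n_pairs
            )
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Rescale so strengths average to 1; only ratios are identified.
        total = sum(updated.values())
        strength = {i: v * len(items) / total for i, v in updated.items()}
    return strength


# Hypothetical judgments: each pair compared four times.
judgments = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
scores = bradley_terry(judgments)
```

In an applied setting the `(winner, loser)` tuples would come from LLM or human judgments over pairs of open-ended responses, and an item that never wins would be pushed toward a strength of zero, which is one reason regularized variants are often preferred.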

Pipelines, validation, and the LLMs-in-the-loop turn

A second cluster treats LLMs as components in multi-stage pipelines, foregrounding validation. Marino2024-2fbc690f articulates this most explicitly, describing a three-phase expert-validation protocol for an LLMs-in-the-loop pipeline (classifier → embedding-based clustering → LLM cluster labeling) on Italian Facebook data, and arguing that crowdworkers are no longer adequate evaluators when LLMs match or outperform them. Giglietto2024-cbeb3f70 supplies the embedding-comparison companion piece, showing OpenAI’s text-embedding-3-large outperforming Italian BERT variants for clustering political news. Ober2026-vd generalizes the human-in-the-loop logic to qualitative interview analysis, using topic models plus LLM labeling to refine a human codebook without surrendering interpretive depth. Waight2025-al formalizes the same design philosophy for cross-lingual narrative similarity, combining SBERT candidate generation with fine-tuned GPT-4o annotation and explicitly validating the resulting estimand against text reuse, topic models, and Relatio.
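
The three-stage shape described for Marino2024-2fbc690f (classifier, then embedding-based clustering, then LLM cluster labeling) can be sketched as a function that takes each stage as a pluggable callable. Everything named here is hypothetical: `run_pipeline` and the `toy_*` stand-ins are illustrative placeholders, not the authors' code, and a real pipeline would substitute a trained classifier, an embedding model, a clustering algorithm, and an LLM call.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def run_pipeline(
    texts: List[str],
    is_relevant: Callable[[str], bool],
    embed: Callable[[str], Vector],
    cluster: Callable[[List[Vector]], List[int]],
    label_cluster: Callable[[List[str]], str],
) -> Dict[int, dict]:
    """Filter texts, embed and cluster the survivors, then label each cluster."""
    kept = [t for t in texts if is_relevant(t)]   # stage 1: classifier
    vectors = [embed(t) for t in kept]            # stage 2a: embeddings
    assignments = cluster(vectors)                # stage 2b: clustering
    groups: Dict[int, List[str]] = {}
    for text, cluster_id in zip(kept, assignments):
        groups.setdefault(cluster_id, []).append(text)
    # Stage 3: one label per cluster (an LLM call in a real pipeline).
    return {
        cid: {"label": label_cluster(members), "texts": members}
        for cid, members in groups.items()
    }


# Toy stand-ins so the sketch runs without external models (all hypothetical).
def toy_is_relevant(text: str) -> bool:
    return "election" in text.lower()


def toy_embed(text: str) -> Vector:
    return [float(len(text.split())), float("tax" in text.lower())]


def toy_cluster(vectors: List[Vector]) -> List[int]:
    return [int(v[1]) for v in vectors]


def toy_label(members: List[str]) -> str:
    return f"{len(members)} post(s), e.g. {members[0]!r}"


posts = [
    "Election tax plan",
    "Election rally tonight",
    "Cat pictures",
    "Tax cuts after the election",
]
result = run_pipeline(posts, toy_is_relevant, toy_embed, toy_cluster, toy_label)
```

Keeping the stages separate makes it possible to validate each one on its own, which is the point of a phased expert-validation protocol: a sample of classifier decisions, cluster memberships, and cluster labels can each be handed to expert coders independently.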

Two papers attack validation problems at the representation level. Fan2025-ut reframes corpus-driven structure as observed confounding in embedding space and uses linear concept erasure (LEACE) to remove source- and language-level signal from similarity measures — a low-cost preprocessing step that markedly improves clustering and retrieval. DeVerna2025-dl meanwhile shows that for political fact-checking, the bottleneck is not model capability but curated context: a RAG pipeline over PolitiFact summaries raises macro F1 by an average of 233% over reasoning and web-search variants. Both push back against the assumption that scale alone will solve measurement problems.
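
To make the idea of linear concept erasure concrete, the sketch below removes a binary nuisance concept (for example, which of two sources a document came from) by projecting out the direction that separates the two class means. This is a deliberately simplified stand-in and not LEACE itself: LEACE additionally whitens the embeddings so that the edit distorts them as little as possible, and it handles multi-class concepts. The function names and the toy vectors are assumptions made for illustration.

```python
def column_means(vectors):
    """Mean of each coordinate across a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def erase_binary_concept(vectors, labels):
    """Project out the class-mean-difference direction for a 0/1 concept.

    After the projection both classes share the same mean, so no linear
    readout can separate them by their means. Simplified relative to LEACE.
    """
    group0 = [v for v, z in zip(vectors, labels) if z == 0]
    group1 = [v for v, z in zip(vectors, labels) if z == 1]
    if not group0 or not group1:
        raise ValueError("both concept classes must be present")
    mu0, mu1 = column_means(group0), column_means(group1)
    grand_mean = column_means(vectors)
    delta = [a - b for a, b in zip(mu1, mu0)]
    norm_sq = sum(d * d for d in delta)
    if norm_sq == 0:
        return [list(v) for v in vectors]  # classes already indistinguishable

    erased = []
    for v in vectors:
        # Component of the centered vector along the separating direction.
        coef = sum((x - m) * d for x, m, d in zip(v, grand_mean, delta)) / norm_sq
        erased.append([x - coef * d for x, d in zip(v, delta)])
    return erased


# Hypothetical 2-d embeddings; the first coordinate encodes the source.
embeddings = [[1.0, 0.0], [2.0, 1.0], [4.0, 0.0], [5.0, 1.0]]
source = [0, 0, 1, 1]
cleaned = erase_binary_concept(embeddings, source)
```

In this toy case the source signal lives entirely in the first coordinate, so erasure flattens that coordinate while leaving the second untouched; similarity computed on `cleaned` then reflects only the remaining content dimension.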

Multimodality and visual communication

A growing thread extends these methods beyond text. Achmann-Denkler2026-lx benchmarks GPT-4o against specialized computer vision models for face recognition and person counting on Instagram campaign images, finding the multimodal LLM strongly superior and lowering technical barriers for visual political communication research. Arminio2025-tw makes the complementary semantic argument: VLLM-generated textual descriptions of images, embedded and clustered, capture connotative meaning (renewable-energy imagery, eco-fascist symbols) that CNN pipelines miss, while also yielding interpretable TF-IDF cluster summaries. Arora2025-tx generalizes this to multimodal framing analysis of gun-violence news, integrating textual, visual, and cross-modal cues to detect editorial framing differences across the political spectrum.
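
The "interpretable TF-IDF cluster summaries" mentioned for Arminio2025-tw can be illustrated with a small class-based TF-IDF: each cluster of image descriptions is treated as one document, and terms are ranked by how characteristic they are of that cluster. This is a sketch under assumptions, not the paper's implementation: the smoothed IDF formula, the whitespace tokenizer, the function name, and the example descriptions are all choices made here for illustration.

```python
import math
from collections import Counter


def top_terms_per_cluster(descriptions, cluster_ids, k=3):
    """Return the k highest-scoring TF-IDF terms for each cluster."""
    # Pool the word counts of all descriptions assigned to each cluster.
    cluster_counts = {}
    for text, cid in zip(descriptions, cluster_ids):
        cluster_counts.setdefault(cid, Counter()).update(text.lower().split())

    # Number of clusters in which each term occurs at least once.
    n_clusters = len(cluster_counts)
    doc_freq = Counter()
    for counts in cluster_counts.values():
        doc_freq.update(counts.keys())

    summaries = {}
    for cid, counts in cluster_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total)
            * (math.log((1 + n_clusters) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        summaries[cid] = ranked[:k]
    return summaries


# Hypothetical model-generated image descriptions and cluster assignments.
captions = [
    "wind turbines in field",
    "solar panels and wind turbines",
    "crowd holding flags",
    "flags at rally",
]
clusters = [0, 0, 1, 1]
summary = top_terms_per_cluster(captions, clusters, k=2)
```

Because the summaries are plain word lists, a researcher can inspect them directly to judge whether a cluster corresponds to a meaningful visual theme before committing to a label.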

Structure, dynamics, and large-scale discourse

A final cluster uses computational methods to characterize large-scale discursive and behavioral structures. Elfes2026-jb operationalizes Greimas’ actantial model via DeepSeek to introduce “narrative polarisation,” finding that YouTube videos on Israel–Palestine are sharply polarized while comments converge on the surface but diverge in deeper narrative motifs. Gerard2025-br proposes t-CANE, a discourse-centered network embedding that reconstructs cross-platform user networks via shared narrative clusters rather than interactions, revealing a tiny set of “bridge users” responsible for ~70% of narratives migrating between Truth Social and X. Bruns2025-fz and Sarmiento2025-as develop complementary practice-mapping and unsupervised-framing pipelines for tangled social network and polarized discussion data, while Minici2024-tf approaches coordinated information operations through a graph foundation model combining language and structural signals. Nenno2025-xa applies computational news-value detection across 24 countries to study perceived misinformation, and Bouchaud2026-lr reconstructs X’s recommender embedding space to show that political ideology is learned as a linear direction — an audit of representation rather than output.

The temporal dimension receives its own programmatic treatment in Fan2026-af, which reviews six families of methods (sequence analysis, event history, HMMs, networks, process mining, language-based embeddings) for analyzing user sequences in donated digital trace data and argues that communication theory’s processual ambitions remain underserved by cross-sectional measurement.

An emerging consensus

Read together, these papers sketch a maturing methodological program. LLMs are most defensible where they perform tasks (pragmatic inference, multimodal interpretation, narrative-level similarity, contextual labeling) that earlier pipelines could not do well; for many classification problems, fine-tuned smaller models remain competitive and more reproducible. Validation is being rebuilt around expert coders, task-specific protocols, and explicit estimand–estimator alignment rather than off-the-shelf agreement metrics. And across applications, the field is converging on hybrid LLMs-in-the-loop architectures in which embeddings, clustering, retrieval, and generative models each play delimited roles. The critical voice of Balluff2026-if is less a dissent than a constitutive part of this agenda: the methodological question is no longer whether to use LLMs but how to use them in ways communication research can still call its own.