Who Does It Best? Comparing Rules, LLMs, and Radiologists for CT Report Labeling
Central question: When curating real-world imaging cohorts or automating radiology workflows, do large language models, rule-based systems, or radiologists produce the most accurate labels, and which hybrid combinations make sense?
What this might mean
Structured, rules-first workflows with LLM “salvage” routines meet or exceed both human and legacy automated baselines, with F1 scores approaching 0.99 and little added time cost. Although LLMs alone outperform rules, a fully automated pipeline that applies rules first and hands unresolved reports to an LLM offers practical gains in auditability and throughput. These findings suggest that AI is clinically viable for scalable cohort curation, with radiologists still needed to arbitrate edge cases. Because the study domain is pulmonary embolism (PE) rather than aging-related conditions, the findings must be re-examined for dementia imaging and frail populations.
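
To make the rules-first handoff concrete, here is a minimal Python sketch of how such a pipeline might route a report. It is an illustration under stated assumptions, not the method evaluated in the comparison. The regular expressions, the function names (`rule_label`, `label_report`), and the `llm_salvage` callable are all hypothetical, and the patterns assume that PE refers to pulmonary embolism. A real rule set would be much larger and would need validation against annotated reports.

```python
import re
from typing import Callable, Optional, Tuple

# Hypothetical toy patterns; not taken from any published rule set.
NEGATED = re.compile(r"\bno (?:evidence of |definite )?pulmonary embol", re.I)
POSITIVE = re.compile(
    r"\b(?:acute |segmental )?pulmonary embol(?:us|ism|i)\b", re.I
)


def rule_label(report: str) -> Optional[str]:
    """Return 'positive' or 'negative', or None when the rules cannot decide."""
    has_neg = bool(NEGATED.search(report))
    # Blank out negated mentions before looking for affirmative ones.
    has_pos = bool(POSITIVE.search(NEGATED.sub(" ", report)))
    if has_neg == has_pos:
        return None  # either no mention at all or conflicting mentions
    return "negative" if has_neg else "positive"


def label_report(
    report: str, llm_salvage: Callable[[str], str]
) -> Tuple[Optional[str], str]:
    """Rules first, then LLM salvage, then escalation to a radiologist.

    Returns (label, source) so that every label stays auditable.
    """
    label = rule_label(report)
    if label is not None:
        return label, "rules"
    # llm_salvage is a stand-in for whatever model call a site would use.
    label = llm_salvage(report)
    if label in ("positive", "negative"):
        return label, "llm"
    return None, "radiologist"  # edge case: needs human arbitration


if __name__ == "__main__":
    stub_llm = lambda text: "uncertain"  # placeholder, not a real model
    print(label_report("No evidence of pulmonary embolism.", stub_llm))
    print(label_report("Filling defect cannot be excluded.", stub_llm))
```

Returning the source alongside the label is one way to preserve the auditability noted above: labels produced by rules can be traced to a pattern, while LLM-salvaged and escalated reports can be sampled separately for review.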