Proof-of-concept: AI-assisted, natural-language-guided survival analysis achieves concordance with human-conducted results in glioblastoma
Background: The use of large language models with natural-language prompting may simplify survival analysis but requires validation against standard analyses carried out by trained analysts. We compared a human-conducted SPSS survival analysis with a conversational statistical analysis carried out on the Replit cloud development platform, using a fully observed cohort of 1265 glioblastoma patients. Patients and methods: Two statistical procedures, Kaplan–Meier estimation and Cox proportional hazards regression, were implemented. The reference analysis was carried out by an experienced clinical researcher using SPSS. The Replit-based analysis was carried out by an oncologist with no graduate-level statistical or programming training, using the lifelines package through the platform's Claude language model integration via a conversational chat interface. Statistical results were reviewed by a data science professor. Artificial intelligence (AI)-generated code was reviewed by a machine learning engineer. All patients had died at last follow-up, so no censoring occurred. Concordance was defined as exact matches in median survival times and hazard ratios (HRs), reflecting the scientific principle that identical datasets processed with identical statistical methods must produce identical results. Results: Initial comparison of median survival across 12 molecular/age subgroups found exact concordance in 7 of 12 subgroups (58.3%). The remaining discrepancies represented preprocessing differences rather than acceptable analytical variation. Natural-language troubleshooting through Replit’s conversational interface identified three sources of discrepancy: patient inclusion differences, age-group boundary definitions, and inconsistent molecular encoding. After harmonizing these factors, median survival times and HRs were identical across both analyses, achieving 100% concordance.
The Replit-based approach required 1 h 40 min total time compared with 8.5 h for the traditional analysis, representing an 80% time reduction while maintaining statistical rigor. Conclusions: This proof-of-concept demonstrates that the Replit platform can achieve exact replication of standard Kaplan–Meier and Cox analyses carried out by an experienced analyst when subgroup definitions and data preprocessing are aligned. Although our findings are limited to a single dataset and workflow, they suggest conversational AI interfaces could reduce barriers to statistical analysis for clinical researchers. Broader validation across varied analytical scenarios is essential before widespread clinical implementation. Statistical expertise remains essential for data quality assessment and model validation.
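Illustrative sketch (not the study's code): the abstract names age-group boundary definitions as one source of discrepancy, and the effect is easy to reproduce on toy data. The Python example below assumes the common convention that the Kaplan–Meier median is the earliest time at which the estimated survival falls to 0.5 or below, and it relies on the abstract's setting of no censoring. The patient values, the cutoff of 65 years, and the function names `km_median` and `younger_group_median` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from fractions import Fraction


def km_median(durations):
    """Kaplan-Meier median when every duration ends in an observed death."""
    deaths_at = Counter(durations)
    at_risk = len(durations)
    survival = Fraction(1)  # exact arithmetic avoids float ties at 0.5
    for t in sorted(deaths_at):
        d = deaths_at[t]
        survival *= Fraction(at_risk - d, at_risk)
        at_risk -= d
        # Assumed convention: first time with S(t) <= 0.5 is the median.
        if survival <= Fraction(1, 2):
            return t
    raise ValueError("durations must not be empty")


def younger_group_median(patients, cutoff, inclusive):
    """Median survival of the younger subgroup under one boundary rule."""
    months = [
        m for age, m in patients
        if age < cutoff or (inclusive and age == cutoff)
    ]
    return km_median(months)


# Hypothetical (age in years, survival in months) pairs, not study data.
patients = [(50, 20), (58, 16), (61, 14), (65, 4), (65, 6), (70, 8), (72, 10)]
CUTOFF = 65  # hypothetical age boundary

strict = younger_group_median(patients, CUTOFF, inclusive=False)  # age < 65
loose = younger_group_median(patients, CUTOFF, inclusive=True)    # age <= 65
print(strict, loose)  # prints: 16 14
```

Moving the two 65-year-old patients across the boundary shifts the subgroup median from 16 to 14 months even though the underlying data are unchanged. This is the kind of preprocessing difference that has to be harmonized before an exact-match concordance criterion can be met.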