Comparing GPT-4 and Human Researchers in Health Care Data Analysis: Qualitative Description Study
Published in: | Journal of medical Internet research 2024-08, Vol.26 (2), p.e56500 |
---|---|
Main Authors: | , , , , , , , |
Format: | Article |
Language: | English |
Summary: | Large language models including GPT-4 (OpenAI) have opened new avenues in health care and qualitative research. Traditional qualitative methods are time-consuming and require expertise to capture nuance. Although large language models have demonstrated enhanced contextual understanding and inferencing compared with traditional natural language processing, their performance in qualitative analysis versus that of humans remains unexplored.
We evaluated the effectiveness of GPT-4 versus human researchers in qualitative analysis of interviews with patients with adult-acquired buried penis (AABP).
Qualitative data were obtained from semistructured interviews with 20 patients with AABP. Human analysis involved a structured 3-stage process: initial observations, line-by-line coding, and consensus discussions to refine themes. In contrast, artificial intelligence (AI) analysis with GPT-4 underwent two phases: (1) a naïve phase, where GPT-4 outputs were independently evaluated by a blinded reviewer to identify themes and subthemes and (2) a comparison phase, where AI-generated themes were compared with human-identified themes to assess agreement. We used a general qualitative description approach.
The study population (N=20) comprised predominantly White (17/20, 85%), married (12/20, 60%), heterosexual (19/20, 95%) men, with a mean age of 58.8 years and BMI of 41.1 kg/m². Human qualitative analysis identified "urinary issues" in 95% (19/20) and GPT-4 in 75% (15/20) of interviews, with the subtheme "spray or stream" noted in 60% (12/20) and 35% (7/20), respectively. "Sexual issues" were prominent (19/20, 95% humans vs 16/20, 80% GPT-4), although humans identified a wider range of subthemes, including "pain with sex or masturbation" (7/20, 35%) and "difficulty with sex or masturbation" (4/20, 20%). Both analyses similarly highlighted "mental health issues" (11/20, 55%, both), although humans coded "depression" more frequently (10/20, 50% humans vs 4/20, 20% GPT-4). Humans frequently cited "issues using public restrooms" (12/20, 60%) as impacting social life, whereas GPT-4 emphasized "struggles with romantic relationships" (9/20, 45%). "Hygiene issues" were consistently recognized (14/20, 70% humans vs 13/20, 65% GPT-4). Humans uniquely identified "contributing factors" as a theme in all interviews. There was moderate agreement between human and GPT-4 coding (κ=0.401). Reliability assessments of GPT-4's analyses showed consistent coding for themes including "body image |
---|---|
ISSN: | 1438-8871 1439-4456 |
DOI: | 10.2196/56500 |
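
The summary reports moderate agreement between human and GPT-4 coding (κ=0.401). As a rough illustration of how a Cohen's kappa statistic of this kind can be computed, the sketch below compares two coders' presence/absence decisions for a single theme. The function name `cohen_kappa` and the `human` and `gpt4` lists are hypothetical and are not the study's data or code; the record does not say how the authors computed their statistic.

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical data (NOT from the study): 1 = theme coded as present in an
# interview, 0 = theme not coded, for ten made-up interviews.
human = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
gpt4 = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

kappa = cohen_kappa(human, gpt4)
print(f"kappa = {kappa:.2f}")
```

With these made-up lists the coders agree on 7 of 10 interviews and chance agreement is 0.5, so kappa is about 0.40, a value in the range usually described as moderate agreement.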