TY - JOUR
T1 - Clinical Evaluation of Deep Learning for Tumor Delineation on 18F-FDG PET/CT of Head and Neck Cancer
AU - Kovacs, David G.
AU - Ladefoged, Claes N.
AU - Andersen, Kim F.
AU - Brittain, Jane M.
AU - Christensen, Charlotte B.
AU - Dejanovic, Danijela
AU - Hansen, Naja L.
AU - Loft, Annika
AU - Petersen, Jørgen H.
AU - Reichkendler, Michala
AU - Andersen, Flemming L.
AU - Fischer, Barbara M.
PY - 2024
Y1 - 2024
N2 - Artificial intelligence (AI) may decrease 18F-FDG PET/CT-based gross tumor volume (GTV) delineation variability and automate tumor-volume-derived image biomarker extraction. Hence, we aimed to identify and evaluate promising state-of-the-art deep learning methods for head and neck cancer (HNC) PET GTV delineation. Methods: We trained and evaluated deep learning methods using retrospectively included scans of HNC patients referred for radiotherapy between January 2014 and December 2019 (ISRCTN16907234). We used 3 test datasets: an internal set to compare methods, another internal set to compare AI-to-expert variability and expert interobserver variability (IOV), and an external set to compare internal and external AI-to-expert variability. Expert PET GTVs were used as the reference standard. Our benchmark IOV was measured using the PET GTVs of 6 experts. The primary outcome was the Dice similarity coefficient (DSC). ANOVA was used to compare methods, a paired t test was used to compare AI-to-expert variability and expert IOV, an unpaired t test was used to compare internal and external AI-to-expert variability, and post hoc Bland-Altman analysis was used to evaluate biomarker agreement. Results: In total, 1,220 18F-FDG PET/CT scans of 1,190 patients (mean age ± SD, 63 ± 10 y; 858 men) were included, and 5 deep learning methods were trained using 5-fold cross-validation (n = 805). The nnU-Net method achieved the highest similarity (DSC, 0.80 [95% CI, 0.77-0.86]; n = 196). We found no evidence of a difference between expert IOV and AI-to-expert variability (DSC, 0.78 for AI vs. 0.82 for experts; mean difference of 0.04 [95% CI, -0.01 to 0.09]; P = 0.12; n = 64). We found no evidence of a difference between the internal and external AI-to-expert variability (DSC, 0.80 internally vs. 0.81 externally; mean difference of 0.004 [95% CI, -0.05 to 0.04]; P = 0.87; n = 125). AI-derived PET GTV biomarkers were in good agreement with those of the experts.
Conclusion: Deep learning can be used to automate 18F-FDG PET/CT tumor-volume-derived imaging biomarkers, and the deep-learning-based volumes have the potential to assist clinical tumor volume delineation in radiation oncology.
AB - Artificial intelligence (AI) may decrease 18F-FDG PET/CT-based gross tumor volume (GTV) delineation variability and automate tumor-volume-derived image biomarker extraction. Hence, we aimed to identify and evaluate promising state-of-the-art deep learning methods for head and neck cancer (HNC) PET GTV delineation. Methods: We trained and evaluated deep learning methods using retrospectively included scans of HNC patients referred for radiotherapy between January 2014 and December 2019 (ISRCTN16907234). We used 3 test datasets: an internal set to compare methods, another internal set to compare AI-to-expert variability and expert interobserver variability (IOV), and an external set to compare internal and external AI-to-expert variability. Expert PET GTVs were used as the reference standard. Our benchmark IOV was measured using the PET GTVs of 6 experts. The primary outcome was the Dice similarity coefficient (DSC). ANOVA was used to compare methods, a paired t test was used to compare AI-to-expert variability and expert IOV, an unpaired t test was used to compare internal and external AI-to-expert variability, and post hoc Bland-Altman analysis was used to evaluate biomarker agreement. Results: In total, 1,220 18F-FDG PET/CT scans of 1,190 patients (mean age ± SD, 63 ± 10 y; 858 men) were included, and 5 deep learning methods were trained using 5-fold cross-validation (n = 805). The nnU-Net method achieved the highest similarity (DSC, 0.80 [95% CI, 0.77-0.86]; n = 196). We found no evidence of a difference between expert IOV and AI-to-expert variability (DSC, 0.78 for AI vs. 0.82 for experts; mean difference of 0.04 [95% CI, -0.01 to 0.09]; P = 0.12; n = 64). We found no evidence of a difference between the internal and external AI-to-expert variability (DSC, 0.80 internally vs. 0.81 externally; mean difference of 0.004 [95% CI, -0.05 to 0.04]; P = 0.87; n = 125). AI-derived PET GTV biomarkers were in good agreement with those of the experts.
Conclusion: Deep learning can be used to automate 18F-FDG PET/CT tumor-volume-derived imaging biomarkers, and the deep-learning-based volumes have the potential to assist clinical tumor volume delineation in radiation oncology.
KW - 18F-FDG PET/CT
KW - Deep learning
KW - Head and neck cancer
KW - Imaging biomarkers
KW - Tumor volume delineation
U2 - 10.2967/jnumed.123.266574
DO - 10.2967/jnumed.123.266574
M3 - Journal article
C2 - 38388516
SN - 0161-5505
VL - 65
JO - Journal of Nuclear Medicine
JF - Journal of Nuclear Medicine
IS - 3
ER -