Head and Neck Cancer Segmentation in FDG PET Images: Performance Comparison of Convolutional Neural Networks and Vision Transformers

Xiaofan Xiong; Brian J Smith; Stephen A Graves; Michael M Graham; John M Buatti; Reinhard R Beichel

doi:10.3390/tomography9050151

Tomography. 2023 Oct 18;9(5):1933-1948. doi: 10.3390/tomography9050151.

Authors

Xiaofan Xiong¹, Brian J Smith², Stephen A Graves³, Michael M Graham³, John M Buatti⁴, Reinhard R Beichel⁵

Affiliations

¹ Department of Biomedical Engineering, The University of Iowa, Iowa City, IA 52242, USA.
² Department of Biostatistics, The University of Iowa, Iowa City, IA 52242, USA.
³ Department of Radiology, The University of Iowa, Iowa City, IA 52242, USA.
⁴ Department of Radiation Oncology, University of Iowa Hospitals and Clinics, Iowa City, IA 52242, USA.
⁵ Department of Electrical and Computer Engineering, The University of Iowa, Iowa City, IA 52242, USA.

Abstract

Convolutional neural networks (CNNs) have a proven track record in medical image segmentation. Recently, Vision Transformers were introduced and are gaining popularity for many computer vision applications, including object detection, classification, and segmentation. Machine learning algorithms such as CNNs or Transformers are subject to an inductive bias, which can have a significant impact on the performance of machine learning models. This is especially relevant for medical image segmentation applications where limited training data are available, and a model's inductive bias should help it to generalize well. In this work, we quantitatively assess the performance of two CNN-based networks (U-Net and U-Net-CBAM) and three popular Transformer-based segmentation network architectures (UNETR, TransBTS, and VT-UNet) in the context of head and neck cancer (HNC) lesion segmentation in volumetric [F-18] fluorodeoxyglucose (FDG) PET scans. For performance assessment, 272 FDG PET-CT scans from a clinical trial (ACRIN 6685) were utilized, which include a total of 650 lesions (primary: 272 and secondary: 378). The image data used are highly diverse and representative of clinical use. For performance analysis, several error metrics were utilized. The achieved Dice coefficient ranged from 0.809 to 0.833, with the best performance being achieved by CNN-based approaches. U-Net-CBAM, which utilizes spatial and channel attention, showed several advantages for smaller lesions compared to the standard U-Net. Furthermore, our results provide some insight regarding the image features relevant for this specific segmentation application. In addition, the results highlight the need to utilize primary as well as secondary lesions to derive clinically relevant segmentation performance estimates that avoid biases.
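For readers unfamiliar with the Dice coefficient reported above, the following is a minimal sketch of its standard definition, 2|A ∩ B| / (|A| + |B|), applied to two binary masks. It is not the authors' evaluation code, which this record does not describe. The function name `dice_coefficient`, the flattened-list mask representation, the toy values, and the convention of scoring two empty masks as 1.0 are all assumptions made for illustration.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled as lesion in both the prediction and the reference.
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    # Total lesion voxels across both masks.
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * overlap / total


# Toy flattened masks (hypothetical values, not data from the study).
predicted = [0, 1, 1, 1, 0, 0, 1, 0]
reference = [0, 1, 1, 0, 0, 1, 1, 0]
score = dice_coefficient(predicted, reference)
print(f"Dice: {score:.3f}")
```

A value of 1 indicates identical masks and 0 indicates no overlap. For volumetric PET data the same formula would typically be evaluated on 3D arrays rather than flat lists.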

Keywords: CNN; FDG PET; Vision Transformer; head and neck cancer; segmentation.

Publication types

Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't

MeSH terms

Fluorodeoxyglucose F18*
Head and Neck Neoplasms* / diagnostic imaging
Humans
Neural Networks, Computer
Positron Emission Tomography Computed Tomography
Positron-Emission Tomography / methods

Substances

Fluorodeoxyglucose F18

Grants and funding

U01 CA140206/CA/NCI NIH HHS/United States