Machine learning in the differential diagnosis of ulcerative colitis and Crohn’s disease: a systematic review

Jin Huang; Xinyi Zhu; Yueying Ma; Zhenjie Zhang; Jinrong Zhang; Zhou Hao; Luyi Wu; Huirong Liu; Huangan Wu; Chunhui Bao

doi:10.21037/tgh-24-117

Review Article

Jin Huang¹#, Xinyi Zhu¹,²#, Yueying Ma¹#, Zhenjie Zhang³, Jinrong Zhang³, Zhou Hao¹,², Luyi Wu¹, Huirong Liu¹,², Huangan Wu¹,², Chunhui Bao¹,²

¹Yueyang Hospital of Integrated Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, Shanghai, China; ²Key Laboratory of Acupuncture and Immunological Effects, Shanghai University of Traditional Chinese Medicine, Shanghai, China; ³Shanghai University of Traditional Chinese Medicine, Shanghai, China

Contributions: (I) Conception and design: C Bao, H Wu; (II) Administrative support: All authors; (III) Provision of study materials or patients: C Bao, H Wu, L Wu, H Liu; (IV) Collection and assembly of data: Z Zhang, J Zhang; (V) Data analysis and interpretation: J Huang, Y Ma; (VI) Manuscript writing: All authors; (VII) Final approval of manuscript: All authors.

#These authors contributed equally to this work as co-first authors.

Correspondence to: Huangan Wu, MD, PhD; Chunhui Bao, MD, PhD. Yueyang Hospital of Integrated Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Shanghai 200437, China; Key Laboratory of Acupuncture and Immunological Effects, Shanghai University of Traditional Chinese Medicine, 650 South Wanping Road, Shanghai 200030, China. Email: wuhuangan@shutcm.edu.cn; baochunhui@shutcm.edu.cn.

Background: Inflammatory bowel disease (IBD) is a complex chronic disease of the gastrointestinal tract. This systematic review aimed to summarize the latest findings on the use of machine learning (ML) to differentiate the IBD subtypes ulcerative colitis (UC) and Crohn’s disease (CD), with a view to providing a basis for the clinical application of ML in distinguishing these subtypes.

Methods: We searched six major databases (PubMed, Web of Science, Embase, Cochrane Library, Scopus, and Ovid) for entries made between 1 January 2000 and 28 November 2024. The search focused on identifying studies that used ML to construct diagnostic models for ulcerative colitis and CD. The Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool was used to assess the risk of bias and concerns about the applicability of these studies. The protocol for this review was registered in PROSPERO (CRD42024543036).

Results: After screening and assessment, 31 papers were included in the review, with a total sample size of 15,140. Most of the included studies were retrospective (n=27, 87%), and the majority (n=20, 65%) were published between 2021 and 2023. Random forest (RF) was the most commonly used algorithm (n=10, 32%), followed by support vector machines (n=9, 29%), and most studies reported evaluation metrics for their ML models.

Conclusions: Our findings indicate that ML has the potential to enhance diagnostic accuracy in distinguishing between ulcerative colitis and CD, particularly through deep learning and RF models developed from endoscopic and fecal biomarker data.

Keywords: Inflammatory bowel disease (IBD); ulcerative colitis (UC); Crohn’s disease (CD); subtype diagnosis; machine learning (ML)

Received: 03 September 2024; Accepted: 04 March 2025; Published online: 07 July 2025.

Highlight box

Key findings

• Machine learning (ML), particularly random forest (RF) and support vector machine (SVM) models, shows significant promise in improving the diagnostic accuracy of differentiating between ulcerative colitis (UC) and Crohn’s disease (CD).

• Among models developed from endoscopic and fecal biomarker data, deep learning and RF stand out as the most effective methods.

What is known and what is new?

• Inflammatory bowel disease (IBD) is a chronic condition that includes UC and CD, which are difficult to differentiate clinically due to overlapping symptoms.

• This review synthesizes recent findings on the application of ML in IBD, highlighting its potential to accurately distinguish UC and CD using advanced algorithms like RF and SVM.

What is the implication, and what should change now?

• ML could revolutionize the diagnostic approach for IBD, leading to more personalized treatment strategies and better patient outcomes.

• Clinicians and researchers should consider integrating ML models into routine diagnostic procedures and focus on further validating these models in prospective studies to ensure their applicability in diverse clinical settings.

Introduction

Inflammatory bowel disease (IBD), comprising the subtypes ulcerative colitis (UC) and Crohn’s disease (CD), is a chronic autoimmune gastrointestinal condition whose major symptoms are diarrhea with or without blood, fatigue, and abdominal pain (1). IBD is characterized by uncertainty, unpredictability, and symptomatic invasiveness (2,3). The morbidity of IBD has been steadily increasing, affecting more than 6.8 million people globally, at a cost of approximately $23,000 per patient per year (4,5). The differential diagnosis of UC and CD is important for choosing the appropriate clinical treatment and surgical options for a given patient with IBD (6). Currently, the differential diagnosis between UC and CD remains challenging, and there is no single reference standard. The differentiation is based mainly on endoscopic, histological, and radiographic findings, combined with clinical manifestations, and is easily influenced by subjective factors. Because many pathogenic factors are involved in IBD and numerous changes occur during the course of the disease, patients tend to exhibit a variety of non-specific features (7). In about 10–15% of cases, UC and CD cannot be differentiated (8-10), or one condition may be misdiagnosed as the other (11). Therefore, improving the diagnosis and treatment of IBD through precision medicine strategies is desirable.

Machine learning (ML) is a subfield of artificial intelligence (AI) that draws on various disciplines; ML algorithms learn from extensive historical data to build models that provide predictions and assessments for new data (12,13). Commonly used ML models include support vector machines (SVM), Naive Bayes (NB), and random forests (RF) (14).

With the rapid advancement of modern science and technology, the healthcare industry is on the brink of highly intelligent, personalized treatment, and ML is becoming increasingly valuable in various diseases, including IBD (15). A systematic review by Stafford et al. (16) suggests that ML models can help facilitate the practice of personalized medicine in IBD. ML has been applied to the subtype classification of IBD, and studies have reported good model performance (17,18), indicating that applying ML to classify disease subtypes may help in selecting optimal treatment regimens. However, no comprehensive review has summarized the relevant studies, and it remains unclear whether ML-based IBD subtype classification surpasses traditional methods.

Therefore, in this study, we systematically review studies on the use of ML for the classification of the IBD subtypes UC and CD. Additionally, we summarize and compare the performance of the various models reported in each study, with a view to establishing a foundation for the application of ML in clinical practice and identifying potential directions for future work in this area. We present this article in accordance with the PRISMA reporting checklist (19) (available at https://tgh.amegroups.com/article/view/10.21037/tgh-24-117/rc).

Methods

The protocol for this review was registered in PROSPERO (CRD42024543036).

Literature search

We conducted a comprehensive search across six electronic databases (PubMed, Web of Science, Embase, Cochrane Library, Scopus, and Ovid) and identified studies published between 1 January 2000 and 28 November 2024, when the search was completed. The keywords used in the literature search included “inflammatory bowel disease or colitis, ulcerative or Crohn disease”, “machine learning”, and “classification”, along with other keyword combinations joined by the Boolean operators “AND” and “OR”. The specific search strategy, with PubMed as the example, is presented in Table S1.

After the search was completed, the records were imported into EndNote X9 (Clarivate, Philadelphia, Pennsylvania, United States) to exclude duplicates. Studies were initially screened by reading the titles and abstracts and then rescreened by reviewing the full text. Each step was completed independently by two researchers (J.Z. and Z.Z.), and discrepancies were resolved by consensus with a third researcher (C.B.). No automated tools were used; the entire screening process was performed manually.
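The duplicate-removal step can be illustrated with a short sketch. The source does not describe how the reference manager matches records, so the rule below (same DOI or same normalized title) is only an assumption, and the record fields `source`, `doi`, and `title` as well as the sample records are hypothetical.

```python
import re


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each DOI or normalized title.

    Assumed rule for illustration only; real reference managers may
    compare additional fields such as authors and year.
    """
    seen_dois: set[str] = set()
    seen_titles: set[str] = set()
    unique = []
    for record in records:
        doi = (record.get("doi") or "").strip().lower()
        title = normalize_title(record.get("title") or "")
        if (doi and doi in seen_dois) or (title and title in seen_titles):
            continue  # same study already exported from another database
        if doi:
            seen_dois.add(doi)
        if title:
            seen_titles.add(title)
        unique.append(record)
    return unique


# Hypothetical exports of the same study from several databases.
records = [
    {"source": "PubMed", "doi": "10.1000/example.1", "title": "ML for UC vs. CD"},
    {"source": "Embase", "doi": "10.1000/EXAMPLE.1", "title": "ML for UC vs CD."},
    {"source": "Scopus", "doi": "", "title": "ml for uc vs cd"},
    {"source": "Ovid", "doi": "10.1000/example.2", "title": "A different study"},
]
unique = deduplicate(records)
print([r["source"] for r in unique])
```

Automatic matching of this kind usually still leaves some near-duplicates, so a manual check of the remaining records is a sensible complement.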

Inclusion and exclusion criteria

Inclusion criteria

Studies meeting the following criteria were included in this review: (I) full-text articles from peer-reviewed journals or conference proceedings; (II) studies establishing diagnostic models for classifying UC and CD using ML or deep learning (DL); and (III) studies on patients with a definitive diagnosis of UC or CD, which could include otherwise healthy individuals or patients with other comorbidities, regardless of gender, age, race, nationality, and disease duration. No specific requirements were placed on the diagnostic methods. Additionally, in the event of duplicate publications, only the most recent work was considered.

Exclusion criteria

Studies were excluded if they were (I) conducted on non-human subjects (animal models); (II) published in non-English languages; (III) non-ML studies; (IV) not classifying UC and CD; and (V) reviews, case reports, expert opinions, conference abstracts, letters, or editorial comments.

Data extraction and model evaluation

After identifying the studies eligible for inclusion, two independent reviewers (J.H. and Y.M.) extracted the required data into Microsoft Excel spreadsheets using a pre-designed form. Disagreements between the reviewers were resolved by a third party (C.B.). Data on the following parameters were extracted from each included study: study, country, type of study, validation methods, study design, sample size, index test, reference standard, ML algorithm, area under the curve (AUC), sensitivity, and specificity.

Quality assessment

Quality assessment of the studies was conducted by two independent reviewers (J.H. and Y.M.) using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) (20). If the two reviewers could not reach a consensus, a third party (C.B.) was consulted to resolve the conflict. QUADAS-2 consists of four domains: patient selection, index test, reference standard, and flow and timing. Each domain was assessed for the risk of bias, and the first three domains were also assessed for concerns about clinical applicability. The risk of bias in each domain was evaluated with 2 to 4 signaling questions, while no signaling questions were designated for judging concerns about clinical applicability. The risk of bias was rated as “low”, “high”, or “unclear” based on responses of “yes”, “no”, or “unclear” to the relevant questions within each domain. The response “unclear” indicates that the article lacks sufficient detail for the evaluator to make a judgment and should be used only when the reported data are insufficient.
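How answers to the signaling questions translate into a domain-level rating can be sketched as follows. This is a minimal illustration under an assumed rule (any "no" gives high risk, otherwise any "unclear" gives unclear risk, and all "yes" gives low risk); the function name `judge_domain` is hypothetical, and the reviewers' actual judgments may have involved considerations beyond this rule.

```python
from typing import Iterable

VALID_ANSWERS = {"yes", "no", "unclear"}


def judge_domain(answers: Iterable[str]) -> str:
    """Map signaling-question answers for one QUADAS-2 domain to a risk label.

    Assumed rule (illustrative only):
      any "no"                 -> "high"
      otherwise any "unclear"  -> "unclear"
      all "yes"                -> "low"
    """
    normalized = [a.strip().lower() for a in answers]
    # Treat "uncertain" as a synonym of "unclear".
    normalized = ["unclear" if a == "uncertain" else a for a in normalized]
    if not normalized:
        raise ValueError("at least one signaling answer is required")
    unknown = set(normalized) - VALID_ANSWERS
    if unknown:
        raise ValueError(f"unexpected answers: {sorted(unknown)}")
    if "no" in normalized:
        return "high"
    if "unclear" in normalized:
        return "unclear"
    return "low"


# Example answer patterns for a three-question and a four-question domain.
print(judge_domain(["Unclear", "Yes", "Yes"]))
print(judge_domain(["Unclear", "Yes", "Yes", "No"]))
```

Under this rule a single "no" dominates, which is a conservative choice: one identified source of bias is enough to flag the whole domain.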

Results

Study screening

A total of 3,420 articles were retrieved, of which 2,682 duplicates were excluded. The titles and abstracts of the remaining 738 studies were screened, and 42 studies were identified as potentially meeting the inclusion criteria. The full texts of these 42 studies were read, and 11 studies were excluded; 31 studies were finally included (Figure 1).
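The screening counts can be cross-checked with simple arithmetic. The sketch below uses only the numbers reported in this section; the function name `prisma_flow` and the dictionary keys are hypothetical, and the number excluded at title and abstract screening is derived here rather than stated in the text.

```python
def prisma_flow(retrieved: int, duplicates: int,
                passed_title_abstract: int, excluded_full_text: int) -> dict:
    """Derive the intermediate counts of a screening flow and check consistency."""
    screened = retrieved - duplicates
    excluded_title_abstract = screened - passed_title_abstract
    included = passed_title_abstract - excluded_full_text
    counts = {
        "screened": screened,
        "excluded_title_abstract": excluded_title_abstract,  # derived, not reported
        "full_text_assessed": passed_title_abstract,
        "included": included,
    }
    if min(counts.values()) < 0:
        raise ValueError("inconsistent screening counts")
    return counts


flow = prisma_flow(retrieved=3420, duplicates=2682,
                   passed_title_abstract=42, excluded_full_text=11)
print(flow)
```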

Figure 1 PRISMA flowchart of study screening. Embase, Excerpta Medica Database; IBD, inflammatory bowel disease; ML, machine learning.

Characteristics of included studies and risk of bias

Characteristics of included studies

This review included a total of 31 research articles, involving 15,140 samples. The earliest publication on the use of ML for the differential diagnosis of UC and CD appeared in 2012. Since 2020, the number of publications has increased, peaking in 2021 and 2023 (Figure 2). Among these studies, 9 (29%) were conducted in China (21-29), 5 (16%) in the UK (30-34), 4 (13%) in the USA (35-38), 4 (13%) in South Korea (39-42), and 1 (3%) each in Germany (43), Italy (44), Canada (45), Serbia (46), Japan (47), Spain (48), Finland (49), Switzerland (50), and Portugal (51). Most of the included studies were retrospective (n=27, 87%), and the remainder were prospective (n=4, 13%). One study analyzed CD patients retrospectively and UC patients prospectively (39). Just over half of the studies reported a reference standard (n=16, 52%), primarily endoscopy and histopathology, while the remaining studies did not specify one (n=15, 48%). The detailed basic characteristics of the included studies are shown in Table 1.

Figure 2 Summary of article publication time. Numbers and the circle size represent the number of studies.

Table 1

Basic characteristics of included studies

Author, year	Country	Type of study	Validation methods	Research design	Sample size (UC/CD)	Index test	Reference standard
Bielecki 2012 (43)	Germany	Development and validation	Internal validation (cross-validation)	Retrospective	27 (13/14)	Raman spectroscopic histopathology	Histopathology
Chierici 2022 (44)	Italy	Development and validation	Internal validation (random split validation)	Retrospective	N/A	Endoscopic images	N/A
Crooke 2012 (35)	USA	Development and validation	Internal validation (cross-validation)	Retrospective	86 (40/46)	Gene	Colonoscopy or sigmoidoscopy and tissue biopsy
Dhaliwal 2021 (45)	Canada	Development	Internal validation (split validation)	Retrospective	58 (41/17)	Baseline clinical, endoscopic, radiologic, and histologic data	Colectomy specimen diagnosis or the label following consensus review
Han 2018 (36)	USA	Development and validation	External validation	Retrospective	A: 15 (11/4)	Gene	N/A
					C: N/A	N/A	N/A
Huang 2021 (21)	China	Development and validation	Internal validation (random split validation)	Retrospective	117 (41/76)	Fecal multi-omics data	N/A
Jiang 2021 (22)	China	Development and validation	Internal validation (cross-validation)	Retrospective	A: N/A	Fecal metagenome data	N/A
Kang 2023a (39)	South Korea	Development and validation	Internal validation (cross-validation)	CD: retrospective; UC: prospective	299 (175/124)	Oral microbial markers	N/A
Li 2021 (23)	China	Development and validation	Internal validation (random split validation)	Retrospective	A: 132 (79/53)	CT	Colonoscopy or enteroscopy and pathology, and “expert guidance on imaging examination and reporting of inflammatory bowel disease in China”
					B: 33 (20/13)
Manandhar 2021 (30)	United Kingdom	Development and validation	Internal validation (random split validation)	Retrospective	N/A	Gut microbiome	Professional doctor
Mihajlović 2021 (46)	Serbia	Development and validation	Internal validation (cross-validation)	Retrospective	103 (65/38)	Fecal microbiota	N/A
Mokhtari 2023 (31)	United Kingdom	Development	External validation	Prospective	N/A	Endoscopic marking	Pathologist assessment
Mossotto 2017 (32)	United Kingdom	Development and validation	Internal validation (random split validation)	Prospective	A: 210 (67/143)	Endoscopic clinical findings and histological data	Porto Standard
					B: 48 (13/35)
Nojima 2022 (47)	Japan	Development and validation	Internal validation (cross-validation)	Retrospective	N/A	PAFhy-3D images	Histopathology
Park 2021 (40)	South Korea	Development and validation	Internal validation (cross-validation)	Prospective	127 (94/33)	RNA-seq data from endoscopic biopsy tissue	N/A
Ruan 2022 (24)	China	Development and validation	Internal validation (random split validation)	Retrospective	A: 1,358 (440/444)	Endoscopic images	Clinical courses and endoscopic, histopathological, and radiological findings
			Internal validation (cross-validation)		B: 218 (72/64)
			External validation		C: 196 (67/67)
Sarrabayrouse 2021 (48)	Spain	Development	N/A	Retrospective	65 (31/34)	Fecal fungal and bacterial loads	Endoscopy and histological
Seeley 2013 (37)	USA	Development and validation	Internal validation (cross-validation)	Retrospective	62 (36/26)	Proteomic patterns of colonic mucosal tissues	Clinical and pathological features
Smolander 2019 (49)	Finland	Development	N/A	Retrospective	85 (26/59)	Genomics	N/A
Sokollik 2023 (50)	Switzerland	Development and validation	Internal validation (cross-validation)	Prospective	100 (50/50)	Antibody profile	Combination of clinical, biochemical, stool, endoscopic, and histological examinations
Stafford 2023 (33)	United Kingdom	Development and validation	Internal validation (random split validation)	Retrospective	A: 488 (244/244)	WES data	Porto criteria, and British Society of Gastroenterology guidelines
					B: 418 (62/356)
Tong 2020 (25)	China	Development and validation	Internal validation (cross-validation)	Retrospective	6,003 (5,128/875)	Descriptions of colonoscopic images of the patients’ index colonoscopy in the form of free text	Chinese consensus of IBD (2018)
Wang 2022 (26)	China	Development and validation	Internal validation (random split validation)	Retrospective	496 (279/217)	Endoscopic images	Combination of clinical, laboratory, endoscopic, and histological criteria according to the third ECCO consensus
Wei 2013 (38)	USA	Development and validation	Internal validation (cross-validation)	Retrospective	N/A	Gene	N/A
Wingfield 2016 (34)	United Kingdom	Development and validation	Internal validation (cross-validation)	Retrospective	A: 122 (N/A)	Fecal 16S rDNA	N/A
Xu 2021 (27)	China	Development and validation	Internal validation (cross-validation)	Retrospective	N/A	Gut microbiome data	N/A
Zhou 2023 (28)	China	Development and validation	Internal validation (random split validation)	Retrospective	A: 221 (99/122)	CT	Combining clinical, radiological, endoscopic, and histological findings by an experienced multidisciplinary team based on WGO global guidelines
					B: 95 (36/59)
Kang 2023b (41)	South Korea	Development and validation	Internal validation (cross-validation)	Retrospective	432 (259/173)	Fecal microbiota	N/A
			External validation		80 (30/50)
Kim 2023 (42)	South Korea	Development	Internal validation (cross-validation)	Retrospective	226 (113/113)	Fecal microbiome	N/A
Pei 2024 (29)	China	Development and validation	Internal validation (cross-validation)	Retrospective	A: 414 (131/283)	N/A	N/A
			External validation		C: 100 (24/76)
Maurício 2024 (51)	Portugal	Development and validation	Internal validation (cross-validation)	Retrospective	2,656 (1,296/1,360)	Endoscopic images	N/A
			External validation

A: training set; B: internal validation set; C: external validation set. 3D, three-dimensional; CD, Crohn’s disease; CT, computed tomography; ECCO, European Crohn’s and Colitis Organization; IBD, inflammatory bowel disease; N/A, not available; PAFhy, periodic acid-FAM hydrazide; RNA-seq, RNA-sequencing; UC, ulcerative colitis; WES, whole exome sequencing; WGO, World Gastroenterology Organization.

Basic characteristics of ML algorithms

All 31 included studies used ML algorithms for the differential diagnosis of UC and CD. The majority (n=29) employed supervised ML algorithms only, while two studies used both supervised and unsupervised algorithms (28,45). The most widely used ML algorithm was RF (n=10, 32%), followed by SVM (n=9, 29%). Most (n=26, 84%) of the studies covered both development and validation, while a few were limited to development (n=5, 16%). In terms of validation, internal validation was the predominant approach (n=27, 87%), using cross-validation (n=17, 55%) or random split validation (n=10, 32%). Some studies (n=6, 19%) (24,29,31,36,41,51) employed external validation, four of which used both internal and external validation (24,29,41,51). Additionally, two studies (6%) did not specify their validation method (48,49). Because some studies used multiple validation methods, the total across validation categories exceeds 31.
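The two internal validation schemes named above differ only in how samples are assigned to training and test sets. The following sketch generates the index partitions for each scheme in plain Python; the function names, the 20% test fraction, and the choice of five folds are illustrative assumptions rather than settings taken from any included study. External validation, by contrast, evaluates the fitted model on an independent cohort and needs no partitioning of the development data.

```python
import random


def random_split(n_samples: int, test_fraction: float = 0.2, seed: int = 0):
    """Random split validation: one shuffled partition into train and test indices."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    if n_test >= n_samples:
        raise ValueError("not enough samples to form a training set")
    return indices[n_test:], indices[:n_test]


def k_fold(n_samples: int, k: int = 5, seed: int = 0):
    """Cross-validation: yield k (train, test) pairs; each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


train, test = random_split(100, test_fraction=0.2)
print(len(train), len(test))
for fold_train, fold_test in k_fold(10, k=5):
    print(sorted(fold_test))
```

With a random split the performance estimate depends on a single partition, whereas cross-validation averages over k partitions, which generally gives a more stable estimate on small cohorts.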

The most common datasets used in ML modeling were fecal samples (n=9, 29%) (21,22,27,30,34,41,42,46,48), endoscopic images (n=8, 26%) (24-26,31,32,44,45,51), and histological samples (n=7, 23%) (32,36,37,40,43,45,47). Additionally, ML modeling was also done using data from blood samples (n=6, 19%) (29,33,35,38,49,50), imaging (n=3, 10%) (23,28,45), saliva samples (n=1, 3%) (39), and clinical characteristics (n=1, 3%) (45) (Table 1). Since some of the studies used multiple types of datasets during the modeling process, the total percentage exceeds 100%.
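The tallies in this paragraph can be reproduced from the reference numbers cited for each data type, which also shows why the percentages sum to more than 100%. The sketch below assumes that percentages are taken over the 31 included studies and rounded to the nearest integer; the variable and function names are hypothetical.

```python
N_STUDIES = 31

# Reference numbers cited in the text for each data type.
DATA_TYPE_REFS = {
    "fecal samples": [21, 22, 27, 30, 34, 41, 42, 46, 48],
    "endoscopic images": [24, 25, 26, 31, 32, 44, 45, 51],
    "histological samples": [32, 36, 37, 40, 43, 45, 47],
    "blood samples": [29, 33, 35, 38, 49, 50],
    "imaging": [23, 28, 45],
    "saliva samples": [39],
    "clinical characteristics": [45],
}


def tally(refs_by_type: dict, n_studies: int) -> dict:
    """Return (count, rounded percentage of all studies) for each data type."""
    return {
        name: (len(refs), round(100 * len(refs) / n_studies))
        for name, refs in refs_by_type.items()
    }


summary = tally(DATA_TYPE_REFS, N_STUDIES)
total_assignments = sum(count for count, _ in summary.values())
distinct_studies = {ref for refs in DATA_TYPE_REFS.values() for ref in refs}

for name, (count, pct) in summary.items():
    print(f"{name}: n={count}, {pct}%")
print(total_assignments, len(distinct_studies))
```

The number of study-to-data-type assignments is larger than the number of distinct studies because references 32 and 45 appear under more than one data type.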

Assessment of risk of bias and concerns about applicability of included studies

The included studies were assessed for quality using QUADAS-2; the results are shown in Figures 3 and 4. In 26 (84%) studies, the risk of bias in patient selection was unclear because the time frame and consecutiveness of case inclusion were not clearly described. For the index test, 11 (35%) studies had a low risk of bias, 3 (10%) had a high risk because the threshold was not prespecified, and 17 (55%) had an unclear risk. Fourteen (45%) studies did not specify the reference standard, resulting in an unclear risk of bias in the reference standard domain, while 17 (55%) studies had a low risk. Furthermore, 14 (45%) studies had a high risk of bias in flow and timing due to inconsistent reference standards and incomplete case inclusion, 9 (29%) had an unclear risk, and 8 (26%) had a low risk.

Figure 3 Summary chart of quality evaluation results using the QUADAS-2 tool, showing the proportion of included studies with low, high, and unclear risk of bias or applicability concerns. Low, unclear, and high risk of bias or concern are shown in green, yellow, and red, respectively. QUADAS-2, Quality Assessment of Diagnostic Accuracy Studies-2.

Figure 4 Results of quality evaluation using QUADAS-2. Risk of bias and applicability concerns summary: review authors’ judgments about each domain for each included study. Low, unclear, and high risk of bias or concern are shown in green, yellow, and red, respectively. Kang 2023a: reference (39); Kang 2023b: reference (41). QUADAS-2, Quality Assessment of Diagnostic Accuracy Studies-2.

Concern about applicability in patient selection was low for all studies (100%). For 6 (19%) studies, concern about applicability in the index test was high because the reported ML model was not designed purely for the differential diagnosis of UC and CD; concern about applicability was low for the other 25 (81%) studies. Concern about applicability in the reference standard was unclear for 15 (48%) studies because no specific reference standard was reported, and low for 16 (52%) studies. The responses to all signaling questions for each study are shown in Table 2.

Table 2

Risk of bias and applicability concerns of included studies

Author, year	Risk of bias															Applicability concerns
Author, year	1.1	1.2	1.3	Patient selection	2.1	2.2	Index test	3.1	3.2	Reference standard	4.1	4.2	4.3	4.4	Flow and timing	Patient selection	Index test	Reference standard
Bielecki 2012 (43)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Yes	Yes	LR	Yes	Yes	Yes	Yes	LR	LC	HC	LC
Chierici 2022 (44)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Unclear	Yes	UR	Unclear	Yes	Yes	Yes	UR	LC	LC	UC
Crooke 2012 (35)	Yes	Yes	Yes	LR	Yes	Yes	LR	Yes	Yes	LR	Unclear	Yes	Yes	No	HR	LC	LC	LC
Dhaliwal 2021 (45)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Yes	Yes	LR	Yes	No	No	No	HR	LC	LC	LC
Han 2018 (36)	Unclear	Yes	Yes	UR	Yes	Yes	LR	Unclear	Yes	UR	Unclear	No	No	Yes	HR	LC	LC	UC
Huang 2021 (21)	Unclear	Yes	Yes	UR	Yes	Yes	LR	Unclear	Yes	UR	Unclear	No	No	Yes	HR	LC	HC	UC
Jiang 2021 (22)	Unclear	Yes	Yes	UR	Yes	Yes	LR	Unclear	Yes	UR	Unclear	Unclear	Unclear	Yes	UR	LC	HC	UC
Kang 2023a (39)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Unclear	Yes	UR	Unclear	Unclear	Unclear	No	HR	LC	LC	UC
Li 2021 (23)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Yes	Yes	LR	Yes	No	No	Yes	HR	LC	LC	LC
Manandhar 2021 (30)	Unclear	Yes	Yes	UR	Yes	Yes	LR	Yes	Yes	LR	Unclear	Unclear	Unclear	No	HR	LC	LC	LC
Mihajlović 2021 (46)	Unclear	Yes	Yes	UR	Yes	No	HR	Unclear	Yes	UR	Unclear	Unclear	Unclear	Yes	UR	LC	LC	UC
Mokhtari 2023 (31)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Yes	Yes	LR	Unclear	Yes	Yes	No	HR	LC	LC	LC
Mossotto 2017 (32)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Yes	Yes	LR	Yes	Yes	Yes	Yes	LR	LC	LC	LC
Nojima 2022 (47)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Yes	Yes	LR	Yes	Yes	Yes	No	HR	LC	HC	LC
Park 2021 (40)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Unclear	Yes	UR	Yes	Unclear	Unclear	Yes	UR	LC	LC	UC
Ruan 2022 (24)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Yes	Yes	LR	Yes	Unclear	Unclear	Yes	UR	LC	LC	LC
Sarrabayrouse 2021 (48)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Yes	Yes	LR	Yes	No	No	Yes	HR	LC	LC	LC
Seeley 2013 (37)	Unclear	Yes	Yes	UR	Yes	No	HR	Yes	Yes	LR	Unclear	Yes	Yes	No	HR	LC	LC	LC
Smolander 2019 (49)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Unclear	Yes	UR	Unclear	Unclear	Unclear	Yes	UR	LC	LC	UC
Sokollik 2023 (50)	Yes	Yes	Yes	LR	Yes	No	HR	Yes	Yes	LR	Yes	Yes	Yes	Yes	LR	LC	LC	LC
Stafford 2023 (33)	Unclear	Yes	Yes	UR	Yes	Yes	LR	Yes	Yes	LR	Unclear	No	No	No	HR	LC	LC	LC
Tong 2020 (25)	Yes	Yes	Yes	LR	Yes	Unclear	UR	Yes	Yes	LR	Yes	Yes	Yes	Yes	LR	LC	HC	LC
Wang 2022 (26)	Yes	Yes	Yes	LR	Yes	Yes	LR	Yes	Yes	LR	Yes	Yes	Yes	Yes	LR	LC	HC	LC
Wei 2013 (38)	Unclear	Yes	Yes	UR	Yes	Yes	LR	Unclear	Yes	UR	Unclear	No	No	Yes	HR	LC	LC	UC
Wingfield 2016 (34)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Unclear	Yes	UR	Unclear	No	No	Yes	HR	LC	LC	UC
Xu 2021 (27)	Unclear	Yes	Yes	UR	Yes	Yes	LR	Unclear	Yes	UR	Unclear	Unclear	Unclear	Yes	UR	LC	LC	UC
Zhou 2023 (28)	Yes	Yes	Yes	LR	Yes	Unclear	UR	Yes	Yes	LR	Yes	Yes	Yes	Yes	LR	LC	LC	LC
Kang 2023b (41)	Unclear	Yes	Yes	UR	Yes	Yes	LR	Unclear	Yes	UR	Yes	Unclear	Unclear	Yes	UR	LC	LC	UC
Kim 2023 (42)	Unclear	Yes	Yes	UR	Yes	Yes	LR	Unclear	Yes	UR	Yes	Unclear	Unclear	Yes	UR	LC	LC	UC
Pei 2024 (29)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Unclear	Yes	UR	Yes	Yes	Yes	Yes	LR	LC	LC	UC
Maurício 2024 (51)	Unclear	Yes	Yes	UR	Yes	Unclear	UR	Yes	Yes	LR	Yes	Yes	Yes	Yes	LR	LC	LC	UC

1.1: Was a consecutive or random sample of patients enrolled? 1.2: Was a case-control design avoided? 1.3: Did the study avoid inappropriate exclusions? 2.1: Were the index test results interpreted without knowledge of the results of the reference standard? 2.2: If a threshold was used, was it prespecified? 3.1: Is the reference standard likely to correctly classify the target condition? 3.2: Were the reference standard results interpreted without knowledge of the results of the index test? 4.1: Was there an appropriate interval between index tests and reference standard? 4.2: Did all patients receive a reference standard? 4.3: Did all patients receive the same reference standard? 4.4: Were all patients included in the analysis? HC, high concern; HR, high risk; LC, low concern; LR, low risk; UC, unclear concern; UR, unclear risk.

ML algorithm model performance

Most studies reported evaluation metrics to quantify the performance of the ML models, including accuracy, precision, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), AUC, and F1 score. In this review, however, only AUC, sensitivity, and specificity were extracted, as shown in Table 3. AUC was the most commonly reported metric (n=23, 74%), and only 3 (10%) studies did not report AUC, sensitivity, or specificity (40,45,49). For each of these metrics, a higher value indicates better algorithm performance.
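The threshold-based metrics listed above all derive from the four cells of a two-class confusion matrix. The sketch below computes them from hypothetical counts (not data from any included study), first treating UC as the positive class and then CD; the function name `diagnostic_metrics` is an illustrative choice.

```python
import math


def _ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": _ratio(tp, tp + fn),  # also called recall
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),          # also called precision
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical cohort: 50 UC patients (40 classified as UC) and
# 50 CD patients (45 classified as CD).
uc_positive = diagnostic_metrics(tp=40, fp=5, tn=45, fn=10)
cd_positive = diagnostic_metrics(tp=45, fp=10, tn=40, fn=5)

for name, value in uc_positive.items():
    print(f"UC as positive class, {name}: {value:.3f}")
print(f"CD as positive class, sensitivity: {cd_positive['sensitivity']:.3f}")
```

In a strictly two-class setting, the sensitivity for one subtype equals the specificity for the other, which is worth keeping in mind when reading per-subtype values such as "UC =" and "CD =" in a results table.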

Table 3

Model performance of machine learning algorithms included in the study

Author, year	Machine learning	AUC	Sensitivity	Specificity
Bielecki 2012 (43)	SVM	N/A	UC =0.99, CD =0.99	UC =0.99, CD =0.99
Chierici 2022 (44)	Ensemble learning	N/A	0.73	0.89
Crooke 2012 (35)	Ratio score	N/A	0.98	1
	SVM1	N/A	0.94	0.85
	SVM2	N/A	0.89	0.92
Dhaliwal 2021 (45)	SNF	N/A	N/A	N/A
Dhaliwal 2021 (45)	RF	N/A	N/A	N/A
Han 2018 (36)	RF	C: 0.821	N/A	N/A
Huang 2021 (21)	LSVM	1: 0.830; 2: 0.850; 3: 0.840	1: UC =0.37, CD =0.81; 2: UC =0.63, CD =0.61; 3: UC =0.33, CD =0.54	1: UC =0.97, CD =0.61; 2: UC =0.86, CD =0.86; 3: UC =0.95, CD =0.91
	SVM
	AdaBoost
	RF
	MLP
Jiang 2021 (22)	RF	N/A	Control vs. UC vs. CD vs. colorectal cancer: UC =0.37, CD =0.88	Control vs. UC vs. CD vs. colorectal cancer: UC =0.81, CD =0.59
Jiang 2021 (22)	RF	N/A	UC vs. CD vs. colorectal cancer: UC =0.68, CD =0.93	UC vs. CD vs. colorectal cancer: UC =0.97, CD =0.89
Kang 2023a (39)	sPLS-DA	0.923	0.82	0.86
Li 2021 (23)	LR	A: 0.988; B: 0.808	A: 0.96; B: 0.80	A: 0.96; B: 0.54
	SVM	A: 1.000; B: 0.727	A: 0.97; B: 0.60	A: 1.00; B: 0.62
	RF	A: 1.000; B: 0.735	A: 1.00; B: 0.80	A: 1.00; B: 0.31
	SGD	A: 0.990; B: 0.800	A: 0.96; B: 0.75	A: 0.91; B: 0.54
	LDA	A: 0.922; B: 0.754	A: 0.88; B: 0.85	A: 0.72; B: 0.54
Manandhar 2021 (30)	RF	B: Taxa =0.910, OTUs =0.920	B: Taxa: 0.85, OTUs: 0.85	B: Taxa: 0.79, OTUs: 0.80
Mihajlović 2021 (46)	RF	0.900	UC: 0.87; CD: 0.94	UC: 0.94; CD: 0.87
Mokhtari 2023 (31)	DSMIL	0.692	N/A	N/A
Mokhtari 2023 (31)	HIPC	0.865	N/A	N/A
Mossotto 2017 (32)	SVM	A: combined model =0.870, histological model =0.820, endoscopic model =0.780	A: combined model =0.83, histological model =0.86, endoscopic model =0.68	N/A
Mossotto 2017 (32)	SVM	N/A	B: combined model: UC =0.85, CD =0.83	N/A
Nojima 2022 (47)	CNN	Sagittal optical slice images: UC =0.950, CD =0.920; horizontal optical slice images: UC =0.900, CD =0.910	Sagittal optical slice images: UC =0.94, CD =0.70, horizontal optical slice images: UC =0.83, CD =0.56	Sagittal optical slice images: UC =0.72, CD =0.86, horizontal optical slice images: UC =0.64, CD =0.79
Park 2021 (40)	PLS-DA	N/A	N/A	N/A
Park 2021 (40)	sPLS-DA	N/A	N/A	N/A
Ruan 2022 (24)	CNN	B: UC =0.997; CD =0.996	B: UC =1.00; CD =0.97	B: UC =0.99; CD =1.00
Ruan 2022 (24)	CNN	C: 1, UC =0.990; CD =0.971; 2, UC =0.979; CD =0.988; 3, UC =1.000; CD =0.974	C: 1, UC =0.96; CD =0.89; 2, UC =0.97; CD =0.93; 3, UC =1.00; CD =0.80	C: 1, UC =0.94; CD =0.98; 2, UC =0.93; CD =0.95; 3, UC =0.84; CD =0.95
Sarrabayrouse 2021 (48)	RF	A: CD vs. UC =0.759	A: 0.82	A: 0.70
Sarrabayrouse 2021 (48)	RF	A: UC vs. CD =0.859	A: 0.90	A: 0.82
Seeley 2013 (37)	SVM	N/A	N/A	N/A
Smolander 2019 (49)	DBN	N/A	N/A	N/A
Smolander 2019 (49)	SVM	N/A	N/A	N/A
Sokollik 2023 (50)	BMA	0.85	0.88	0.66
Stafford 2023 (33)	RF	B: autoimmune gene panel: 0.680	B: UC =0.68; CD =0.63	B: UC =0.63; CD =0.68
		B: IBD gene panel: 0.610	B: UC =0.68; CD =0.46	B: UC =0.46; CD =0.68
		B: all genes: 0.570	B: UC =0.50, CD =0.58	B: UC =0.50, CD =0.58
Tong 2020 (25)	RF	0.936	0.89	0.84
Wang 2022 (26)	CNN	N/A	B: UC =0.90; CD =0.88	B: UC =0.95; CD =0.95
Wei 2013 (38)	LR	CD =0.864, UC =0.826	CD =0.71	CD =0.83
	SVM	CD =0.862, UC =0.826	N/A	N/A
	GBT	CD =0.802, UC =0.782	N/A	N/A
Wingfield 2016 (34)	SVM	B: UC =0.740, CD =0.700	N/A	N/A
Xu 2021 (27)	LightGBM	B: WGS: 0.942, 16S rRNA: 0.966	N/A	N/A
Zhou 2023 (28)	CNN	A: 0.987	A: 0.99	A: 0.90
	CNN	B: 0.693	B: 0.94	B: 0.41
	PCA	A: 0.753	A: 0.78	A: 0.65
	PCA	B: 0.662	B: 0.78	B: 0.63
	LASSO	A: 0.855	A: 0.80	A: 0.75
	LASSO	B: 0.717	B: 0.78	B: 0.63
Kang 2023b (41)	Penalized LR	A: 0.873	0.77	0.79
Kang 2023b (41)	Penalized LR	C: 0.633	0.60	0.66
Kim 2023 (42)	sPLS-DA	0.988	0.94	0.94
Pei 2024 (29)	MLP-ANN	A: 0.923	0.92	0.82
	MLP-ANN	C: 1.000	1.00	1.00
	RBFNN	0.732	0.86	0.53
	DT	0.790	0.91	0.67
	PLS-DA	0.85	0.82	0.78
Maurício 2024 (51)	ViT-S/16	A: 1.000	A: 1.00	N/A
		C1: 0.993	C1: 0.99	N/A
		C2: 0.898	C2: 0.97	N/A

A: training set; B: internal validation set; C: external validation set; C1: ViT-S/16 student; C2: DeiT ViT-S/16; 1: a three-classification individual diagnosis model based on the optimal feature set of the total population; 2: a three-classification individual diagnosis model based on people who self-evaluate as “very well”; 3: a three-classification individual diagnosis model based on people who self-evaluate as “slightly below par”. AdaBoost, adaptive boosting; AUC, area under the curve; BMA, Bayesian model averaging; CD, Crohn’s disease; CNN, convolutional neural network; DT, decision tree; DSMIL, dual stream multi-instance learning; DBN, deep belief network; GBC, gradient boosting classifier; GBT, gradient boosting tree; HIPC, hierarchical image pyramid converter; IBD, inflammatory bowel disease; LDA, linear discriminant analysis; LASSO, least absolute shrinkage and selection operator; LightGBM, light gradient boosting machine; LR, logistic regression; LSVM, linear support vector machine; MLP, multilayer perceptron; N/A, not available; OTUs, operational taxonomic units; PCA, principal component analysis; PLS-DA, partial least squares discriminant analysis; RF, random forest; SGD, stochastic gradient descent; SNF, similarity network fusion; sPLS-DA, sparse partial least squares discriminant analysis; SVM, support vector machine; UC, ulcerative colitis; WGS, whole genome sequencing.

The reported AUC values ranged from 0.61 to 1, with 20 (65%) of the studies reporting values exceeding 0.8 and 13 (42%) reporting values above 0.9. Four studies reported a perfect AUC of 1.

The reported sensitivity of the models ranged from 0.34 to 1, with 12 (39%) studies reporting values higher than 0.9. Four studies reported a sensitivity of 1.

The reported specificity of the models ranged from 0.31 to 1, with 10 (32%) studies reporting specificity exceeding 0.9. The highest specificity value of 1 was reported for the ratio score based on genetic data modeling by Crooke et al. (35), the SVM and RF models by Li et al. (23), the CNN model (CD) by Ruan et al. (24), the MLP-ANN model by Pei et al. (29), and the ViT-S/16 model by Maurício and Domingues (51). The detailed characteristics of the ML algorithms included in the study are shown in Table 3.
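
As a minimal sketch of how the three metrics summarized above are defined, the following Python example computes sensitivity, specificity, and AUC for a binary UC-versus-CD classifier. The toy labels and scores, the 0.5 cut-off, and the function names are illustrative assumptions and are not taken from any of the included studies.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity), treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a randomly chosen positive case receives a
    higher score than a randomly chosen negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = CD, 0 = UC; scores are model-estimated probabilities of CD.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed cut-off of 0.5

sens, spec = sensitivity_specificity(y_true, y_pred)
model_auc = auc(y_true, scores)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, AUC={model_auc:.4f}")
```

Sensitivity and specificity depend on the chosen cut-off, whereas AUC summarizes ranking performance across all cut-offs, which is one reason why the three metrics are best reported together.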

Discussion

This review summarizes and consolidates the studies published since 2000 on the application of ML in differentiating between UC and CD, the subtypes of IBD, with the objective of obtaining evidence for the application of ML in clinical practice. The results show an increase in publications since 2021, reflecting the growing interest in the use of ML for differential diagnosis in UC and CD research in recent years. RF and SVM were identified as the most commonly used ML algorithms for distinguishing UC from CD, a finding consistent with the previous report by Stafford et al. (52). Both RF and SVM are supervised learning methods. RF excels in handling high-dimensional feature inputs and complex data structures (53), which enhances its performance; SVM, on the other hand, achieves higher accuracy and can extract linear combinations of features (54). Only 8 (26%) studies used DL methods.

The strength of DL lies in its novelty and the ability to build models without relying on known clinical features. However, DL is not widely used due to its limited interpretability, requirement of large sample sizes, and the long training duration (55).

In terms of modeling data, most studies used endoscopy, fecal samples, and intestinal histological samples to differentiate between UC and CD. Currently, the combined use of endoscopy with histopathological biopsy data serves as the reference standard for diagnosing IBD (56-58) and differentiating UC from CD (59). ML applications in this field facilitate real-time treatment decisions by clinicians, thereby reducing the reliance on biopsies to identify remission and minimizing the time required for image interpretation (60-62). Some studies have reported that the accuracy of ML is superior to that of clinicians (63-65). In light of these advantages, ML-driven modeling of endoscopy data is the most widely used approach. Imaging examination is another basic technique for diagnosing IBD and distinguishing UC from CD (66). ML models excel in the interpretation of imaging examinations, outperforming radiologists in some cases (67). While endoscopy and imaging examination are useful in the detection of intestinal inflammation, their frequent use is limited by their high cost and invasiveness. Therefore, researchers are increasingly showing interest in non-invasive or minimally invasive alternatives, such as fecal biomarkers, for the diagnosis and differential diagnosis of IBD (56,68,69). Fecal biomarkers, which are accessible, directly related to the inflammatory sites, and high in concentration (70), are non-invasive, simple, rapid, and economical; they are also suitable for home use (69) and relatively widely available. The most commonly chosen fecal biomarkers are genetic products. IBD is known to have a genetic predisposition, with about 15% of CD patients having family members with the same disease (71). Several studies have shown a relationship between gene mutations and genetic susceptibility to IBD (72,73).
Certain genes have been shown to exhibit differential expression in CD and UC (74), opening up the possibility of using strategies based on genetic markers to guide early clinical diagnosis and individualized treatment (75). Alfonso Perez et al. (76) have shown that ML models can be used to differentiate between UC and CD by identifying a small number of IBD-related genes. However, DL models are not widely used in clinical practice because of their complex operation, high cost, and the difficulty of interpreting their results.

The most common metrics reported for the performance of ML models included AUC, sensitivity, and specificity, with AUC (n=23, 74%) being the most frequently used metric. Among the 16 (52%) studies that reported all three metrics of AUC, sensitivity, and specificity, six (19%) (24,25,29,30,42,46) studies showed superior performance of the models. The models used were based on CNN in two studies, RF in three, endoscopy data in two, and fecal biomarker data in four studies. Ruan et al. (24) reported the best model performance, which was obtained for a DL model based on CNN to facilitate the clinical diagnosis of IBD. The results showed that the recognition accuracy and reading efficiency of the DL model were superior to those of experienced endoscopists. These findings suggest that RF and CNN are promising ML methods for distinguishing between UC and CD. RF, which is widely used and performs well, can save resources and yield results more quickly. On the other hand, CNN, as an emerging algorithm, performs best despite implementation challenges and its requirement of a large sample size and long working time. Future studies can identify appropriate ML models based on different needs and apply adaptive methods for different levels of hospitals. Modelling using endoscopic images and fecal microbiota data is also a reliable approach. Currently, endoscopic diagnosis is widely used because of its accuracy; however, it has the drawbacks of invasiveness, high cost, and subjectivity in the interpretation of results. In contrast, fecal microbiota data offer a non-invasive, convenient, and cost-effective alternative. The combined use of these two approaches can facilitate accurate diagnosis and effective treatment at different stages of the disease. Moreover, fecal microbiota can be used to assess patients in remission, thereby reducing patient trauma and economic burden.
In terms of validation of the models, internal validation is usually followed by external validation. Sole reliance on internal validation may lead to overestimation of the AUC value due to the lack of model generalization. External validation using additional datasets can enhance the accuracy and reliability of the model (77), which, in turn, is crucial for assessing the stability of the model in different clinical settings. The difficulty lies in the need for additional external datasets (78,79). Only 6 (19%) studies included in this research incorporated external validation (24,31,36). The lack of external validation limits the generalizability of the models. Therefore, the importance of external validation should be fully recognized, and future studies should incorporate measures to address this, such as including multi-center cohorts (80,81).
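
The distinction between internal and external validation in a multi-center cohort can be illustrated with a short Python sketch. It assumes a hypothetical list of patient records, each tagged with a recruiting center; the center names, record fields, and helper functions are illustrative only and do not reproduce the design of any included study. All patients from one center are held out as the external validation set, and the remaining patients are split into training and internal validation sets.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def split_by_center(records: List[Record], external_center: str) -> Tuple[List[Record], List[Record]]:
    """Hold out every patient from one center as the external validation set."""
    development = [r for r in records if r["center"] != external_center]
    external = [r for r in records if r["center"] == external_center]
    return development, external


def internal_split(development: List[Record], every_nth: int = 3) -> Tuple[List[Record], List[Record]]:
    """Deterministic internal split: every n-th development record is held out."""
    train = [r for i, r in enumerate(development) if i % every_nth != 0]
    internal = [r for i, r in enumerate(development) if i % every_nth == 0]
    return train, internal


# Hypothetical cohort recruited at three centers (names are placeholders).
centers = ["center_1"] * 4 + ["center_2"] * 3 + ["center_3"] * 2
labels = ["UC", "CD", "UC", "CD", "CD", "UC", "CD", "UC", "CD"]
records = [{"id": i, "center": c, "label": y} for i, (c, y) in enumerate(zip(centers, labels))]

development, external = split_by_center(records, external_center="center_3")
train, internal = internal_split(development)

print(len(train), len(internal), len(external))
```

Because no patient from the held-out center contributes to model fitting or tuning, performance on that set gives a less optimistic estimate than performance on the internal validation set, which shares centers with the training data.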

In this study, the quality and applicability of the models were evaluated using the QUADAS-2. The results revealed several issues. First, the overall quality of the studies regarding patient selection is unclear. Providing details regarding the targeted population and the specific selection processes is crucial to ensure high quality of studies. Lack of clarity in describing the time frame and continuity of case inclusion may affect the applicability and validity of the results. Twenty-six (84%) of the studies included in this review did not clearly describe the time frame and continuity of case inclusion. Furthermore, selection bias can easily occur if patient selection is not random or continuous, whereby the study results cannot be adequately extrapolated to the target population in clinical practice (77,82,83). Second, with regard to the index test, three (10%) studies had a high risk of bias, while eleven (35%) had a low risk; the risk of bias was unclear in the case of the remaining 17 (55%) studies. For most studies, the answers to the question “If a threshold was used, was it prespecified?” were unclear since there was no specific statement in this regard. Moreover, in nearly half of the studies, the reference standard was not clearly described, leading to unclear bias risks and concerns about applicability, as well as a high risk of bias in terms of flow and timing. This is mainly because most study data were obtained from publicly available sources, resulting in unclear and inconsistent descriptions of the reference standard. In addition, the use of non-recognized reference standards may result in confirmation bias, overestimation of model performance, and diminished credibility of the results obtained by the ML algorithms. Moreover, the lack of uniformity in the reference standards may result in erroneous estimates of true negatives, leading to falsely high sensitivity and specificity values (84). 
Furthermore, in some of the studies, certain cases were excluded from model construction, potentially removing data unfavorable to the model and retaining only those that would yield positive results. This may facilitate article publication but compromises the objectivity of the research. The results of quality assessment show that most studies have focused on the development and validation of models, with detailing of the ML algorithm; however, they have neglected the methodological quality of the articles. Issues such as patient selection and reference standard have often been overlooked, undermining the widespread clinical applicability of the developed ML models. Therefore, the methodological quality of the studies needs to be improved. In order to provide clinically meaningful and methodologically reliable ML diagnostic models, researchers should familiarize themselves with the appropriate research guidelines before planning the study and ensure adherence to recognized methodological standards, such as the Standards for Reporting Diagnostic Accuracy (STARD) statement (85). However, since STARD is currently not fully applicable to DL models (86), specific reporting standards such as SPIRIT-AI (87) and other standards under development for use in ML, such as STARD-AI (88), can also be utilized.
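
One way to satisfy the QUADAS-2 question on threshold prespecification is to derive the decision threshold on the training data only and then apply it unchanged to the validation data. The following Python sketch illustrates this under stated assumptions: the Youden index (sensitivity + specificity − 1) is used here purely as an example selection rule, and the toy data and function names are hypothetical rather than drawn from the included studies.

```python
from typing import Sequence, Tuple


def sens_spec(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> Tuple[float, float]:
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    sens = sum(1 for s in pos if s >= threshold) / len(pos)
    spec = sum(1 for s in neg if s < threshold) / len(neg)
    return sens, spec


def youden_threshold(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Pick the observed score that maximizes the Youden index on the given data."""
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        sens, spec = sens_spec(y_true, scores, t)
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t


# Hypothetical training data: 1 = CD, 0 = UC.
train_y = [1, 1, 1, 1, 1, 0, 0, 0]
train_scores = [0.9, 0.8, 0.7, 0.65, 0.4, 0.6, 0.3, 0.2]

# The threshold is fixed on the training set before the validation data are examined.
threshold = youden_threshold(train_y, train_scores)

# Hypothetical validation data, evaluated with the frozen threshold.
val_y = [1, 1, 0, 0]
val_scores = [0.8, 0.6, 0.7, 0.2]
val_sens, val_spec = sens_spec(val_y, val_scores, threshold)
print(threshold, val_sens, val_spec)
```

Choosing the threshold after inspecting validation results would instead tune the model to that dataset and tend to inflate the reported sensitivity and specificity.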

The current study has several limitations: (I) we were unable to conduct a meta-analysis because of the significant heterogeneity among the included studies, and this may lead to highly biased results. (II) Most of the included studies were retrospective, with only a small number of studies being prospective; this could introduce selection bias due to missing information and unavailable confounding factors (89). (III) Very few of the included studies involved external validation. The generalizability of the ML models of studies without external validation could not be adequately assessed. (IV) Only articles published in English were included, which introduces the possibility of selection bias. (V) Assessment using QUADAS-2 is relatively subjective and not fully applicable to ML models, and could compromise the objectivity and accuracy of the interpretation of the results to some extent. Researchers have recommended QUADAS-AI (90), a quality assessment tool for AI-centered diagnostic test accuracy studies, and we intend to use it in future studies.

Conclusions

In summary, several studies have shown that ML methods, especially DL and RF methods based on modeling of endoscopic and fecal biomarker data, are useful in facilitating the differentiation between UC and CD in clinical test settings. However, these studies are still in the preliminary stages and are limited in quality. The generalizability of the currently available ML models is weak. In light of these aspects, before ML can be successfully introduced into a clinical environment for the differentiation between UC and CD, further large-sample multicenter studies are warranted, with a focus on methodological norms, open sharing of scientific data, and comprehensive use of both internal and external validation to improve the accuracy, reliability, and applicability of the models.

Acknowledgments

The authors would like to acknowledge all the researchers and authors whose valuable contributions and work in the field of IBD have made this systematic review possible.

Footnote

Reporting Checklist: The authors have completed the PRISMA reporting checklist. Available at https://tgh.amegroups.com/article/view/10.21037/tgh-24-117/rc

Peer Review File: Available at https://tgh.amegroups.com/article/view/10.21037/tgh-24-117/prf

Funding: This work was supported by the Natural Science Foundation of Shanghai (No. 22ZR1458300), 2024 Shanghai Oriental Talent Plan Youth Project, the Special Clinical Research Project in the Health Industry of Shanghai Municipal Health Commission (No. 202340036), and National Natural Science Foundation of China (No. 82174501).

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tgh.amegroups.com/article/view/10.21037/tgh-24-117/coif). The authors have no conflicts of interest to declare.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Open Access Statement: This is an Open Access article distributed in accordance with the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0), which permits the non-commercial replication and distribution of the article with the strict proviso that no changes or edits are made and the original work is properly cited (including links to both the formal publication through the relevant DOI and the license). See: https://creativecommons.org/licenses/by-nc-nd/4.0/.

References

Cosnes J, Gower-Rousseau C, Seksik P, et al. Epidemiology and natural history of inflammatory bowel diseases. Gastroenterology 2011;140:1785-94. [Crossref] [PubMed]
Nicholas DB, Otley A, Smith C, et al. Challenges and strategies of children and adolescents with inflammatory bowel disease: a qualitative examination. Health Qual Life Outcomes 2007;5:28. [Crossref] [PubMed]
Roberts CM, Gamwell KL, Baudino MN, et al. Youth and Parent Illness Appraisals and Adjustment in Pediatric Inflammatory Bowel Disease. J Dev Phys Disabil 2019;31:777-90. [Crossref]
Mehta F. Report: economic implications of inflammatory bowel disease and its management. Am J Manag Care 2016;22:s51-60. [PubMed]
The global, regional, and national burden of inflammatory bowel disease in 195 countries and territories, 1990-2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet Gastroenterol Hepatol 2020;5:17-30. [Crossref] [PubMed]
Tontini GE, Vecchi M, Pastorelli L, et al. Differential diagnosis in inflammatory bowel disease colitis: state of the art and future perspectives. World J Gastroenterol 2015;21:21-46. [Crossref] [PubMed]
Gecse KB, Vermeire S. Differential diagnosis of inflammatory bowel disease: imitations and complications. Lancet Gastroenterol Hepatol 2018;3:644-53. [Crossref] [PubMed]
Nikolaus S, Schreiber S. Diagnostics of inflammatory bowel disease. Gastroenterology 2007;133:1670-89. [Crossref] [PubMed]
Everhov ÅH, Sachs MC, Malmborg P, et al. Changes in inflammatory bowel disease subtype during follow-up and over time in 44,302 patients. Scand J Gastroenterol 2019;54:55-63. [Crossref] [PubMed]
Burisch J, Kiudelis G, Kupcinskas L, et al. Natural disease course of Crohn's disease during the first 5 years after diagnosis in a European population-based inception cohort: an Epi-IBD study. Gut 2019;68:423-33. [Crossref] [PubMed]
Lerrigo R, Coffey JTR, Kravitz JL, et al. The Emotional Toll of Inflammatory Bowel Disease: Using Machine Learning to Analyze Online Community Forum Discourse. Crohn's & Colitis 360 2019;1:otz011.
Deo RC. Machine Learning in Medicine. Circulation 2015;132:1920-30. [Crossref] [PubMed]
Handelman GS, Kok HK, Chandra RV, et al. eDoctor: machine learning and the future of medicine. J Intern Med 2018;284:603-19. [Crossref] [PubMed]
Romagnoni A, Jégou S, Van Steen K, et al. Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data. Sci Rep 2019;9:10351. [Crossref] [PubMed]
Da Rio L, Spadaccini M, Parigi TL, et al. Artificial intelligence and inflammatory bowel disease: Where are we going? World J Gastroenterol 2023;29:508-20. [Crossref] [PubMed]
Stafford IS, Gosink MM, Mossotto E, et al. A Systematic Review of Artificial Intelligence and Machine Learning Applications to Inflammatory Bowel Disease, with Practical Guidelines for Interpretation. Inflamm Bowel Dis 2022;28:1573-83. [Crossref] [PubMed]
Chen G, Shen J. Artificial Intelligence Enhances Studies on Inflammatory Bowel Disease. Front Bioeng Biotechnol 2021;9:635764. [Crossref] [PubMed]
Gubatan J, Levitte S, Patel A, et al. Artificial intelligence applications in inflammatory bowel disease: Emerging technologies and future directions. World J Gastroenterol 2021;27:1920-35. [Crossref] [PubMed]
Zhang X, Tan R, Lam WC, et al. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) Extension for Chinese Herbal Medicines 2020 (PRISMA-CHM 2020). Am J Chin Med 2020;48:1279-313. [Crossref] [PubMed]
Whiting PF, Rutjes AW, Westwood ME, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med 2011;155:529-36. [Crossref] [PubMed]
Huang Q, Zhang X, Hu Z. Application of Artificial Intelligence Modeling Technology Based on Multi-Omics in Noninvasive Diagnosis of Inflammatory Bowel Disease. J Inflamm Res 2021;14:1933-43. [Crossref] [PubMed]
Jiang P, Wu S, Luo Q, et al. Metagenomic Analysis of Common Intestinal Diseases Reveals Relationships among Microbial Signatures and Powers Multidisease Diagnostic Models. mSystems 2021;6:e00112-21. [Crossref] [PubMed]
Li H, Mo Y, Huang C, et al. An MSCT-based radiomics nomogram combined with clinical factors can identify Crohn's disease and ulcerative colitis. Ann Transl Med 2021;9:572. [Crossref] [PubMed]
Ruan G, Qi J, Cheng Y, et al. Development and Validation of a Deep Neural Network for Accurate Identification of Endoscopic Images From Patients With Ulcerative Colitis and Crohn's Disease. Front Med (Lausanne) 2022;9:854677. [Crossref] [PubMed]
Tong Y, Lu K, Yang Y, et al. Can natural language processing help differentiate inflammatory intestinal diseases in China? Models applying random forest and convolutional neural network approaches. BMC Med Inform Decis Mak 2020;20:248. [Crossref] [PubMed]
Wang L, Chen L, Wang X, et al. Development of a Convolutional Neural Network-Based Colonoscopy Image Assessment Model for Differentiating Crohn's Disease and Ulcerative Colitis. Front Med (Lausanne) 2022;9:789862. [Crossref] [PubMed]
Xu C, Zhou M, Xie Z, et al. LightCUD: a program for diagnosing IBD based on human gut microbiome data. BioData Min 2021;14:2. [Crossref] [PubMed]
Zhou Z, Xiong Z, Cheng R, et al. Volumetric visceral fat machine learning phenotype on CT for differential diagnosis of inflammatory bowel disease. Eur Radiol 2023;33:1862-72. [Crossref] [PubMed]
Pei J, Wang G, Li Y, et al. Utility of four machine learning approaches for identifying ulcerative colitis and Crohn's disease. Heliyon 2023;10:e23439. [Crossref] [PubMed]
Manandhar I, Alimadadi A, Aryal S, et al. Gut microbiome-based supervised machine learning for clinical diagnosis of inflammatory bowel diseases. Am J Physiol Gastrointest Liver Physiol 2021;320:G328-37. [Crossref] [PubMed]
Mokhtari R, Hamidinekoo A, Sutton D, et al. Interpretable histopathology-based prediction of disease relevant features in Inflammatory Bowel Disease biopsies using weakly-supervised deep learning (2023). Available online: https://arxiv.org/abs/2303.12095
Mossotto E, Ashton JJ, Coelho T, et al. Classification of Paediatric Inflammatory Bowel Disease using Machine Learning. Sci Rep 2017;7:2427. [Crossref] [PubMed]
Stafford IS, Ashton JJ, Mossotto E, et al. Supervised Machine Learning Classifies Inflammatory Bowel Disease Patients by Subtype Using Whole Exome Sequencing Data. J Crohns Colitis 2023;17:1672-80. [Crossref] [PubMed]
Wingfield B, Coleman S, McGinnity TM, et al. A metagenomic hybrid classifier for paediatric inflammatory bowel disease. 2016 International Joint Conference on Neural Networks (IJCNN); 24-29 July 2016; Vancouver, BC, Canada. IEEE; 2016:1083-9.
Crooke PS, Tossberg JT, Horst SN, et al. Using gene expression data to identify certain gastro-intestinal diseases. J Clin Bioinforma 2012;2:20. [Crossref] [PubMed]
Han L, Maciejewski M, Brockel C, et al. A probabilistic pathway score (PROPS) for classification with applications to inflammatory bowel disease. Bioinformatics 2018;34:985-93. [Crossref] [PubMed]
Seeley EH, Washington MK, Caprioli RM, et al. Proteomic patterns of colonic mucosal tissues delineate Crohn's colitis and ulcerative colitis. Proteomics Clin Appl 2013;7:541-9. [Crossref] [PubMed]
Wei Z, Wang W, Bradfield J, et al. Large sample size, wide variant spectrum, and advanced machine-learning technique boost risk prediction for inflammatory bowel disease. Am J Hum Genet 2013;92:1008-12. [Crossref] [PubMed]
Kang SB, Kim H, Kim S, et al. Potential Oral Microbial Markers for Differential Diagnosis of Crohn's Disease and Ulcerative Colitis Using Machine Learning Models. Microorganisms 2023;11:1665. [Crossref] [PubMed]
Park SK, Kim S, Lee GY, et al. Development of a Machine Learning Model to Distinguish between Ulcerative Colitis and Crohn's Disease Using RNA Sequencing Data. Diagnostics (Basel) 2021;11:2365. [Crossref] [PubMed]
Kang DY, Park JL, Yeo MK, et al. Diagnosis of Crohn's disease and ulcerative colitis using the microbiome. BMC Microbiol 2023;23:336. [Crossref] [PubMed]
Kim H, Na JE, Kim S, et al. A Machine Learning-Based Diagnostic Model for Crohn's Disease and Ulcerative Colitis Utilizing Fecal Microbiome Analysis. Microorganisms 2023;12:36. [Crossref] [PubMed]
Bielecki C, Bocklitz TW, Schmitt M, et al. Classification of inflammatory bowel diseases by means of Raman spectroscopic imaging of epithelium cells. J Biomed Opt 2012;17:076030. [Crossref] [PubMed]
Chierici M, Puica N, Pozzi M, et al. Automatically detecting Crohn's disease and Ulcerative Colitis from endoscopic imaging. BMC Med Inform Decis Mak 2022;22:300. [Crossref] [PubMed]
Dhaliwal J, Erdman L, Drysdale E, et al. Accurate Classification of Pediatric Colonic Inflammatory Bowel Disease Subtype Using a Random Forest Machine Learning Classifier. J Pediatr Gastroenterol Nutr 2021;72:262-9. [Crossref] [PubMed]
Mihajlović A, Mladenović K, Lončar-Turukalo T, et al. Machine Learning Based Metagenomic Prediction of Inflammatory Bowel Disease. Stud Health Technol Inform 2021;285:165-70. [Crossref] [PubMed]
Nojima S, Ishida S, Terayama K, et al. A Novel Three-Dimensional Imaging System Based on Polysaccharide Staining for Accurate Histopathological Diagnosis of Inflammatory Bowel Diseases. Cell Mol Gastroenterol Hepatol 2022;14:905-24. [Crossref] [PubMed]
Sarrabayrouse G, Elias A, Yáñez F, et al. Fungal and Bacterial Loads: Noninvasive Inflammatory Bowel Disease Biomarkers for the Clinical Setting. mSystems 2021;6:e01277-20. [Crossref] [PubMed]
Smolander J, Dehmer M, Emmert-Streib F. Comparing deep belief networks with support vector machines for classifying gene expression data from complex disorders. FEBS Open Bio 2019;9:1232-48. [Crossref] [PubMed]
Sokollik C, Pahud de Mortanges A, Leichtle AB, et al. Machine Learning in Antibody Diagnostics for Inflammatory Bowel Disease Subtype Classification. Diagnostics (Basel) 2023;13:2491. [Crossref] [PubMed]
Maurício J, Domingues I. Distinguishing between Crohn’s disease and ulcerative colitis using deep learning models with interpretability. Pattern Anal Appl 2024;27:1. [Crossref]
Stafford IS, Kellermann M, Mossotto E, et al. A systematic review of the applications of artificial intelligence and machine learning in autoimmune diseases. NPJ Digit Med 2020;3:30. [Crossref] [PubMed]
Qi Y. Random forest for bioinformatics. New York, NY: Springer; 2012:307-23.
Cortes C, Vapnik V. Support-Vector Networks. Mach Learn 1995;20:273-97. [Crossref]
Choi RY, Coyner AS, Kalpathy-Cramer J, et al. Introduction to Machine Learning, Neural Networks, and Deep Learning. Transl Vis Sci Technol 2020;9:14. [PubMed]
Maaser C, Sturm A, Vavricka SR, et al. ECCO-ESGAR Guideline for Diagnostic Assessment in IBD Part 1: Initial diagnosis, monitoring of known IBD, detection of complications. J Crohns Colitis 2019;13:144-64. [Crossref] [PubMed]
Magro F, Langner C, Driessen A, et al. European consensus on the histopathology of inflammatory bowel disease. J Crohns Colitis 2013;7:827-51. [Crossref] [PubMed]
Annese V, Daperno M, Rutter MD, et al. European evidence based consensus for endoscopy in inflammatory bowel disease. J Crohns Colitis 2013;7:982-1018. [Crossref] [PubMed]
Spiceland CM, Lodhia N. Endoscopy in inflammatory bowel disease: Role in diagnosis, management, and treatment. World J Gastroenterol 2018;24:4014-20. [Crossref] [PubMed]
Maeda Y, Kudo SE, Ogata N, et al. Evaluation in real-time use of artificial intelligence during colonoscopy to predict relapse of ulcerative colitis: a prospective study. Gastrointest Endosc 2022;95:747-756.e2. [Crossref] [PubMed]
Takenaka K, Ohtsuka K, Fujii T, et al. Development and Validation of a Deep Neural Network for Accurate Evaluation of Endoscopic Images From Patients With Ulcerative Colitis. Gastroenterology 2020;158:2150-7. [Crossref] [PubMed]
Yao H, Najarian K, Gryak J, et al. Fully automated endoscopic disease activity assessment in ulcerative colitis. Gastrointest Endosc 2021;93:728-736.e1. [Crossref] [PubMed]
Maeda Y, Kudo SE, Mori Y, et al. Fully automated diagnostic system with artificial intelligence using endocytoscopy to identify the presence of histologic inflammation associated with ulcerative colitis (with video). Gastrointest Endosc 2019;89:408-15. [Crossref] [PubMed]
Aoki T, Yamada A, Aoyama K, et al. Clinical usefulness of a deep learning-based system as the first screening on small-bowel capsule endoscopy reading. Dig Endosc 2020;32:585-91. [Crossref] [PubMed]
Le Berre C, Sandborn WJ, Aridhi S, et al. Application of Artificial Intelligence to Gastroenterology and Hepatology. Gastroenterology 2020;158:76-94.e2. [Crossref] [PubMed]
Simpson P, Papadakis KA. Endoscopic evaluation of patients with inflammatory bowel disease. Inflamm Bowel Dis 2008;14:1287-97. [Crossref] [PubMed]
Li X, Liang D, Meng J, et al. Development and Validation of a Novel Computed-Tomography Enterography Radiomic Approach for Characterization of Intestinal Fibrosis in Crohn's Disease. Gastroenterology 2021;160:2303-2316.e11. [Crossref] [PubMed]
Liu D, Saikam V, Skrada KA, et al. Inflammatory bowel disease biomarkers. Med Res Rev 2022;42:1856-87. [Crossref] [PubMed]
Alghoul Z, Yang C, Merlin D. The Current Status of Molecular Biomarkers for Inflammatory Bowel Disease. Biomedicines 2022;10:1492. [Crossref] [PubMed]
Lopez RN, Leach ST, Lemberg DA, et al. Fecal biomarkers in inflammatory bowel disease. J Gastroenterol Hepatol 2017;32:577-82.
Loddo I, Romano C. Inflammatory Bowel Disease: Genetics, Epigenetics, and Pathogenesis. Front Immunol 2015;6:551.
Brant SR. Promises, delivery, and challenges of inflammatory bowel disease risk gene discovery. Clin Gastroenterol Hepatol 2013;11:22-6.
Xin R. Inflammatory Gene Panel Guiding the Study of Genetics in Inflammatory Bowel Disease. Mol Diagn Ther 2024;28:389-401.
Vennou KE, Piovani D, Kontou PI, et al. Multiple outcome meta-analysis of gene-expression data in inflammatory bowel disease. Genomics 2020;112:1761-7.
Haberman Y. Tissue-based Gene Expression as Potential Biomarkers for IBD Course. Inflamm Bowel Dis 2020;26:1485-9.
Alfonso Perez G, Castillo R. Gene Identification in Inflammatory Bowel Disease via a Machine Learning Approach. Medicina (Kaunas) 2023;59:1218.
Steyerberg EW. Overfitting and Optimism in Prediction Models. In: Steyerberg EW, editor. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Cham: Springer International Publishing; 2019:95-112.
Bleeker SE, Moll HA, Steyerberg EW, et al. External validation is necessary in prediction research: a clinical example. J Clin Epidemiol 2003;56:826-32.
Perone CS, Cohen-Adad J. Promises and limitations of deep learning for medical image segmentation. J Med Artif Intell 2019;2:1.
Kim DW, Jang HY, Kim KW, et al. Design Characteristics of Studies Reporting the Performance of Artificial Intelligence Algorithms for Diagnostic Analysis of Medical Images: Results from Recently Published Papers. Korean J Radiol 2019;20:405-10.
Ghassemi M, Oakden-Rayner L, Beam AL. The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit Health 2021;3:e745-50.
Park SH, Han K. Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction. Radiology 2018;286:800-9.
Sica GT. Bias in research studies. Radiology 2006;238:780-9.
Rutjes AW, Reitsma JB, Di Nisio M, et al. Evidence of bias and variation in diagnostic accuracy studies. CMAJ 2006;174:469-76.
Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 2015;351:h5527.
Sounderajah V, Ashrafian H, Aggarwal R, et al. Developing specific reporting guidelines for diagnostic accuracy studies assessing AI interventions: The STARD-AI Steering Group. Nat Med 2020;26:807-8.
Cruz Rivera S, Liu X, Chan AW, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Lancet Digit Health 2020;2:e549-60.
Sounderajah V, Ashrafian H, Golub RM, et al. Developing a reporting guideline for artificial intelligence-centred diagnostic test accuracy studies: the STARD-AI protocol. BMJ Open 2021;11:e047709.
Talari K, Goyal M. Retrospective studies - utility and caveats. J R Coll Physicians Edinb 2020;50:398-402.
Sounderajah V, Ashrafian H, Rose S, et al. A quality assessment tool for artificial intelligence-centered diagnostic test accuracy studies: QUADAS-AI. Nat Med 2021;27:1663-5.

doi: 10.21037/tgh-24-117
Cite this article as: Huang J, Zhu X, Ma Y, Zhang Z, Zhang J, Hao Z, Wu L, Liu H, Wu H, Bao C. Machine learning in the differential diagnosis of ulcerative colitis and Crohn’s disease: a systematic review. Transl Gastroenterol Hepatol 2025;10:56.

