Introduction
One of the recurring tasks in dentistry is the recording of dental status and the detection and diagnosis of pathological findings, including caries. This assessment eventually results in individual recommendations for preventive and operative management (Schwendicke, Splieth, et al. 2019). From a clinical point of view, visual examination (VE) is the preferred method, as it can be performed easily and achieve acceptable accuracy after tooth cleaning and drying (Ekstrand et al. 1997, 2018; Nyvad et al. 1999; Ekstrand 2004; Kühnisch et al. 2009, 2011; Pitts 2009; World Health Organization [WHO] 2013; Gimenez et al. 2015). Although diagnostic studies have shown that trained dentists are generally able to achieve good intra- and interexaminer reproducibility (e.g., Litzenburger et al. 2018), situations are repeatedly observed in daily clinical practice in which different dentists make contradictory diagnoses. Therefore, independent verification through artificial intelligence (AI) methods may be desirable (Schwendicke, Golla, et al. 2019; Schwendicke, Samek, et al. 2020).

In the case of the visual assessment of teeth, the analysis of intraoral photographs in machine-readable form can be considered equivalent to VE because such photographs provide the pictorial information that is the basic requirement for automated analysis. The first studies using deep learning with convolutional neural networks (CNNs) to detect caries on dental X-rays (Bejnordi et al. 2018; Lee et al. 2018a, 2018b; Park and Park 2018; Moutselos et al. 2019; Cantu et al. 2020; Geetha et al. 2020; Khan et al. 2020) or near-infrared light transillumination images (Casalegno et al. 2019; Schwendicke, Elhennawy, et al. 2020) were published recently. However, only a few attempts have been made to use intraoral images for automatic, AI-based caries detection (Askar et al. 2021). Therefore, this diagnostic study focused on caries detection and categorization with a CNN (the test method) and compared its diagnostic performance with expert evaluation (the reference standard) on intraoral photographs. Specifically, a diagnostic accuracy of at least 90% was expected to be reached.
Materials and Methods
This study was approved by the Ethics Committee of the Medical Faculty of the Ludwig-Maximilians-University of Munich (project number 020-798). The reporting of this investigation followed the recommendations of the Standards for Reporting of Diagnostic Accuracy Studies (STARD) steering committee (Bossuyt et al. 2015) and topic-related recommendations (Schwendicke et al. 2021).
Photographic Images
All the images were taken in the context of previous studies as well as for documentation or teaching purposes by an experienced dentist (JK). In detail, all the images were photographed with a professional single-lens reflex camera (Nikon D300, D7100, or D7200 with a Nikon Micro 105-mm lens) and a Macro Flash EM-140 DG (Sigma) after tooth cleaning and drying. Molar teeth were photographed indirectly using intraoral mirrors (Reflect-Rhod; Hager & Werken) that were heated before positioning in the oral cavity to prevent condensation on the mirror surface.
To ensure the best possible image quality, inadequate photographs (e.g., out-of-focus images or images with saliva contamination) were excluded. Furthermore, duplicate photos of identical teeth or surfaces were removed from the data set. This selection step ensured that identical clinical photographs were included only once. All JPEG images (RGB format, resolution 1,200 × 1,200 pixels, no compression) were cropped to an aspect ratio of 1:1 and/or rotated in a standard manner using professional image editing software (Affinity Photo; Serif) until, finally, the tooth surface filled most of the frame. With respect to the study aim, only images of healthy teeth or carious surfaces were included. Photographs with (additional) noncarious hard tissue defects (e.g., enamel hypomineralization, hypoplasia, erosion or tooth wear, fissure sealants, and direct and indirect restorations) were excluded to rule out potential evaluation bias. Finally, 2,417 anonymized, high-quality clinical photographs of 1,317 permanent occlusal surfaces and 1,100 permanent smooth surfaces (anterior teeth and canines = 734; posterior teeth = 366) were included.
Caries Evaluation on All the Images (Reference Standard)
Each image was examined on a PC aimed at detecting and categorizing caries lesions in agreement with widely accepted classification systems: the WHO standard (WHO 2013), the International Caries Detection and Assessment System (Pitts 2009, http://www.icdas.org), and the Universal Visual Scoring System (UniViSS; Kühnisch et al. 2009, 2011). All the images were labeled by an experienced examiner (JK, >20 y of clinical practice and scientific experience), who was also aware of the patients’ history and overall dental status, into the following categories: 0, surfaces with no caries; 1, surfaces with signs of a noncavitated caries lesion (first signs, established lesion, localized enamel breakdown); and 2, surfaces with caries-related cavitation (dentin exposure, large cavity). Both caries thresholds are of clinical relevance and are commonly used in caries diagnostic studies (Schwendicke, Splieth, et al. 2019; Kapor et al. 2021). Each diagnostic decision—1 per image—served as the reference standard for cyclic training and repeated evaluation of the deep learning–based CNN. The annotator’s (JK) intra- and interexaminer reproducibility was published earlier as a result of different training and calibration sessions. The κ values showed at least a substantial level of agreement for caries detection and diagnostics: 0.646/0.735 and 0.585/0.591 (UniViSS; Kühnisch et al. 2011) and 0.93 to 1.00 (DMF index and UniViSS; Heitmüller et al. 2013).
Programming and Configuration of the Deep Learning–Based CNN for Caries Detection and Categorization (Test Method)
The CNN was trained using a pipeline of several established methods, mainly image augmentation and transfer learning. Before training, the entire image set (2,417 images/853 healthy tooth surfaces/1,086 noncavitated carious lesions/431 cavitations/47 automatically excluded images during preprocessing) was divided into a training set (N = 1,891/673/870/348) and a test set (N = 479/180/216/83). The latter was never made available to the CNN as training material and served as an independent test set.
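As a rough illustration of such a class-balanced split, the following sketch uses scikit-learn's train_test_split with stratification; scikit-learn, the file paths, and the use of stratification are assumptions made for illustration, as the text does not specify how the split was drawn.

```python
from sklearn.model_selection import train_test_split

# Hypothetical inputs: one file path and one label per photograph
# (0 = healthy surface, 1 = noncavitated caries lesion, 2 = cavitation).
image_paths = [f"images/tooth_{i:04d}.jpg" for i in range(2370)]
labels = [0] * 853 + [1] * 1086 + [2] * 431

train_paths, test_paths, train_labels, test_labels = train_test_split(
    image_paths, labels,
    test_size=479,      # hold out an independent test set, never used for training
    stratify=labels,    # keep the class proportions comparable in both sets
    random_state=0,     # reproducible split
)
```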
Image augmentation was used to provide a large number of variable images to the CNN on a recurring basis. For this purpose, the randomly selected images (batch size = 16) were multiplied by a factor of ~3, altered by image augmentation (random center and margin cropping by up to 20% each, random rotation by up to 30°), and resized (224 × 224 pixels) using torchvision (version 0.6.1, https://pytorch.org) in connection with the PyTorch library (version 1.5.1, https://pytorch.org). In addition, all the images were normalized to compensate for under- and overexposure.
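A minimal sketch of how the described augmentation and normalization could be expressed with torchvision transforms is shown below; the concrete transform classes and the ImageNet normalization statistics are assumptions, as the exact implementation is not reported.

```python
from torchvision import transforms

# Training-time augmentation: random cropping by up to ~20% and random
# rotation by up to 30°, followed by resizing to 224 × 224 pixels and
# normalization (ImageNet statistics assumed here).
train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=30),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Test images are only resized and normalized, without augmentation.
test_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```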
MobileNetV2 (Sandler et al. 2018) was used as the basis for the continuous adaptation of the CNN for caries detection and categorization. This architecture uses inverted residual blocks, whose skip connections allow access to previous activations, and enables the CNN to achieve high performance with low complexity (Bianco et al. 2018). The model architecture was mainly chosen for its favorable inference time and improved usability in clinical settings. When training the CNN, backpropagation was used to determine the gradient for learning. Backpropagation was repeated iteratively over the images and labels using the abovementioned batch size and parameters. Overfitting was prevented, first, by selecting a low learning rate (0.001). Second, dropout (rate 0.2) on the final linear layers was used as a regularization technique (Srivastava et al. 2014). To train the CNN, this step was repeated for 50 epochs. Moreover, cross-entropy loss as the error function and the Adam optimizer (betas 0.9 and 0.999, epsilon 1e−8) were applied. A learning rate scheduler was included to monitor the training progress: in the case of no progress over 5 epochs, the learning rate was reduced (factor 0.1).
To accelerate the training process of the CNN, an open-source neural network with pretrained weights was used (MobileNetV2 pretrained on ImageNet, Stanford Vision and Learning Lab, Stanford University). This step transferred existing learned representations and enabled the CNN to recognize basic structures in the image set more efficiently. The training of the CNN was performed on a server at the university-based data center with the following specifications: Tesla GPU V100 SXM2 16 GB (Nvidia), Xeon CPU E5-2630 (Intel Corp.), and 24 GB RAM.
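The following sketch illustrates how such a transfer learning setup could look in PyTorch, using the torchvision MobileNetV2 implementation with ImageNet weights and the training parameters reported above; the replacement of the classifier head, the data loading via ImageFolder, and the plateau criterion are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models

NUM_CLASSES = 3  # 0 = no caries, 1 = noncavitated lesion, 2 = cavitation

# Hypothetical training data folder; train_transforms as in the sketch above.
train_dataset = datasets.ImageFolder("data/train", transform=train_transforms)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# MobileNetV2 pretrained on ImageNet; only the classifier head is replaced.
model = models.mobilenet_v2(pretrained=True)
model.classifier = nn.Sequential(
    nn.Dropout(p=0.2),                          # dropout on the final linear layer
    nn.Linear(model.last_channel, NUM_CLASSES),
)

criterion = nn.CrossEntropyLoss()               # cross-entropy as the error function
optimizer = torch.optim.Adam(model.parameters(), lr=0.001,
                             betas=(0.9, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=5)  # reduce LR after 5 epochs without progress

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(50):                         # 50 training epochs
    model.train()
    running_loss = 0.0
    for images, targets in train_loader:
        images, targets = images.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()                         # backpropagation of the gradient
        optimizer.step()
        running_loss += loss.item()
    scheduler.step(running_loss / len(train_loader))  # monitor the mean training loss
```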
Determination of the Diagnostic Performance
The training of the CNN was repeated 4 times. In these runs, 25%, 50%, 75%, and 100% of the training data, respectively, were used (random selection), and each time, the resulting model was evaluated on the test set. This allowed an evaluation of the model performance in relation to the amount of training data. It is noteworthy that the same independent test set of images was always used to evaluate the diagnostic performance.
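A minimal sketch of how such fractional training subsets could be drawn is given below; the use of torch.utils.data.Subset and the seeding are assumptions, with train_dataset referring to the full training set from the sketches above.

```python
import random
from torch.utils.data import DataLoader, Subset

# Repeat the training with 25%, 50%, 75%, and 100% of the training data,
# always evaluating on the same independent test set afterward.
for fraction in (0.25, 0.50, 0.75, 1.00):
    n_images = int(len(train_dataset) * fraction)
    indices = random.sample(range(len(train_dataset)), n_images)  # random selection
    subset_loader = DataLoader(Subset(train_dataset, indices),
                               batch_size=16, shuffle=True)
    # ... train a fresh MobileNetV2 model on subset_loader (as sketched above)
    #     and evaluate it on the fixed, independent test set ...
```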
Statistical Analysis
The data were analyzed using R (http://www.r-project.org) and Python (version 3.7, http://www.python.org). The overall diagnostic accuracy (ACC = (TN + TP) / (TN + TP + FN + FP)) was determined by calculating the number of true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs) after using 25%, 50%, 75%, and 100% of the images of the training data set. Furthermore, the sensitivity (SE), specificity (SP), positive and negative predictive values (PPV and NPV, respectively), and the area under the receiver operating characteristic (ROC) curve (AUC) were computed for the selected types of teeth and surfaces (Matthews and Farewell 2015). In addition, saliency maps were plotted to identify image areas that were of importance for the CNN to make an individual decision. We calculated the saliency maps (Simonyan et al. 2014) by backpropagating the prediction of the CNN and visualized the gradient of the input on the resized images (224 × 224 pixels).
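A minimal sketch of the gradient-based saliency computation described above (Simonyan et al. 2014) is given below; the function name and the choice of the predicted class as the backpropagation target are assumptions.

```python
import torch

def saliency_map(model, image):
    """Return a (224, 224) map of how strongly each pixel influenced the prediction.

    `image` is a normalized input tensor of shape (3, 224, 224).
    """
    model.eval()
    x = image.unsqueeze(0).requires_grad_(True)   # add batch dimension, track gradients
    scores = model(x)                             # class scores for this image
    top_class = scores.argmax(dim=1).item()
    scores[0, top_class].backward()               # backpropagate the prediction
    # Take the maximum absolute gradient over the three color channels.
    return x.grad.abs().max(dim=1)[0].squeeze(0)
```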
Results
In the present work, it was shown that the CNN was able to correctly classify caries in 92.5% of the images when all the included images were considered (Table 1). For caries-related cavitation detection, 93.3% of all tooth surfaces could be correctly classified (Table 2). In addition, the diagnostic performance was calculated for each of the caries classes (Table 3); here, the accuracy was highest for caries-free surfaces (90.6%), followed by noncavitated caries lesions (85.2%) and cavitated caries lesions (79.5%).
The following results can be seen when comparing the model metrics. First, the CNN was able to achieve a high model performance in the detection of caries and cavities; this is particularly evident in the high AUC values (Tables 1–3 and Fig. 1), which were found to be more favorable for overall caries detection. Second, in the case of caries detection, the SP values were mostly higher than the SE values (Table 1), whereas this tendency could not be confirmed in the case of cavitation detection (Table 2). Third, the diagnostic parameters varied slightly according to the considered tooth surfaces or groups of teeth. Fourth, healthy surfaces were classified more accurately than diseased ones (Table 3).
When viewing the results of the interim evaluations for 25%, 50%, 75%, or 100% of all the available images (Tables 1–3), it became obvious that an overall agreement of approximately 80% could be achieved with 25% of the available training data. By using half of the available images, the parameters of the diagnostic performance could generally be increased to approximately 90%. The inclusion of the remaining images was accompanied by smaller improvements (Tables 1–3). The saliency maps (Fig. 2) illustrate which image areas the CNN used for decision making. Interestingly, in most of the randomly selected cases, the CNN predominantly used the caries-affected sites.
Discussion
In the present diagnostic study, it was demonstrated that AI algorithms are able to detect caries and caries-related cavities on machine-readable intraoral photographs with an accuracy of at least 90%. Thus, the intended study goal was achieved. In addition, a web tool for independent evaluation of the AI algorithm by dental researchers was developed. Our approach also offers interesting potential for future clinical applications: carious lesions could be captured with intraoral cameras and evaluated almost simultaneously and independently of dentists to provide additional diagnostic information.
The present work is part of the latest efforts to evaluate diagnostic images automatically using AI methods. The most advanced application of AI methods seems to be caries detection on dental X-rays. Lee et al. (2018b) evaluated 3,000 apical radiographs using a deep learning–based CNN and achieved accuracies of well over 80%, with AUC values varying between 0.845 and 0.917. Cantu et al. (2020) assessed 3,293 bitewing radiographs and reached a diagnostic accuracy of 80%. When these data are compared with the methodology and results of the present study (Tables 1–3), both the number of images used and the documented diagnostic performance are essentially identical.
Nevertheless, the results achieved (Tables 1–3, Figs. 1 and 2) must be critically evaluated. It should be highlighted that our study provided data at both the caries and the cavitation detection level (Tables 1 and 2). Both categories are of clinical relevance in the daily routine and are linked with divergent management strategies (Schwendicke, Splieth, et al. 2019). Another unique feature was the determination of the diagnostic accuracy for each of the included categories (Table 3). Here, it became obvious that cavities in particular were detected with a lower accuracy by the CNN in comparison to healthy tooth surfaces or noncavitated caries lesions. This detail could not be derived from the overall analysis (Tables 1 and 2), which justifies its separate consideration. Another methodological issue needs to be discussed: the use of standardized, high-quality, single-tooth photographs that are not typically captured under daily routine conditions. It can be hypothesized that the use of different image types with divergent resolutions, compression rates, or formats may affect the diagnostic outcome (Dodge and Karam 2016; Dziugaite et al. 2016; Koziarski and Cyganek 2018). In addition, it must be mentioned that the image material used included only healthy tooth surfaces and caries of various lesion stages. Cases with developmental defects, fissure sealants, fillings, or indirect restorations were excluded in this project phase to allow unbiased learning by the CNN. Consequently, these currently excluded dental conditions need to be trained separately. Furthermore, the high quality of the usable image material certainly had a positive influence on the results achieved. All the included photographs were free of plaque, calculus, and saliva and were not over- or underexposed. Therefore, these methodological requirements, which are also in line with fundamental demands on an errorless clinical examination (Pitts 2009), led to a valid evaluation of the diagnostic accuracy efficacy (Fryback and Thornbury 1991) and probably contributed to the encouraging results (Figs. 1 and 2). Conversely, it needs to be mentioned that the AI algorithm requires further development that includes differential diagnoses, as well as regular evaluation aimed at optimizing and documenting its quality. In addition, its potential application under clinical conditions needs to be critically discussed in relation to the simple facts that perfect examination conditions cannot be consistently safeguarded in daily dental practice and that AI-based diagnoses have to be critically judged by professionals. The remaining and important tasks—risk and activity assessment, consideration of possible treatment options, and agreeing on an individual health care strategy with the patient—still require clinical evaluation and can hardly be replaced by AI algorithms so far. Nevertheless, a future real-time diagnostic assistance system may be beneficial for daily dental practice but, from today’s point of view, requires consistent further development of this initial work.
From a methodological point of view, the choice of single-tooth pictures may have benefited the optimal learning of the AI algorithm, since disturbing factors such as image margins or adjacent teeth were largely excluded. It is expected that the transfer of the algorithms to other image formats (e.g., images of quadrants, complete upper/lower jaws, or images captured with intraoral cameras) will be associated with a lower diagnostic accuracy. Conversely, it can be hypothetically assumed that an initially more precise CNN will later be more reliable for more complex images. Furthermore, the model performance depends on the annotator’s reference decisions and cannot provide better results than the expert. This highlights the importance of the annotator’s diagnostic ability to classify dental pathologies correctly. In the present study, only 1 experienced clinical scientist made all diagnostic decisions, which must be considered a potential source of bias. This aspect can be controversially debated, especially whether the inclusion of additional, less experienced, or independent annotators could increase the trustworthiness of the dental decisions and the resulting model performance. Nevertheless, reliability issues are of high relevance, and we are considering forming an expert panel to control and ultimately reach consensus on diagnostic decisions in future projects.
Regarding the previously mentioned aspects, it becomes clear that the automated detection of caries or other pathological findings requires a methodologically well-structured approach (Schwendicke et al. 2021). In this context, it should be noted that the documented model performance increased steadily with each additional training cycle. However, beyond a certain point, reached with approximately 50% of the images used in the present analysis, the performance could no longer be substantially improved with the available pool of images and the chosen classification approach (Tables 1–3). This indicates that a data saturation effect does exist and that further improvements can be expected only by including an exponentially larger number of images. Here, the overall number of included images must be considered crucial. Ideally, several thousand photographs from different teeth or surfaces as well as lesion types should be available. This supports the assumption that the pool of images used probably represents the minimum amount required for training AI algorithms. Furthermore, the class imbalance in the image sample, with an underrepresentation of cavitations, needs to be mentioned; it affects both the training and the test set. As a result, the model metrics might be biased, which was reflected in the limited training of the CNN and a lower diagnostic performance for this category in comparison to healthy surfaces or noncavitated caries lesions (Table 3). In general, this aspect supports the need to continuously increase the data set and to provide a wide range of caries lesions from all teeth, surfaces, and caries lesion types. Otherwise, the developed AI algorithms cannot be generalized. The long-term goal should be to achieve close to perfect accuracy in caries classification on the basis of several thousand intraoral photographs using an AI method.
Future strategies to improve AI-based caries detection on intraoral images should consider image segmentation as an alternative method, which has to be conducted by well-trained and calibrated dental professionals under the supervision of senior experts. For this purpose, it is necessary to mark caries lesions pixel by pixel on each available image and to reassess the diagnostic accuracy. This more precise but also more time- and resource-intensive approach offers detailed caries localization in comparison to the presently used classification model.