Enhancing Weighted Averaging for CNN Model Ensemble in Plant Diseases Image Classification

Deep learning, especially convolutional neural networks (CNN), has gained traction in the field of image classification. In the specific case of plant disease classification, improving the accuracy and reliability of image classification is paramount. This paper delves into the ensemble prediction technique using a weighted soft-voting method. Instead of assigning a generalized weight to each CNN model, our approach emphasizes giving weights to each label's prediction within every individual model. We employed three esteemed CNN architectures for our experiments: DenseNet201, InceptionV3


Introduction
Plant diseases have long been a major threat to global agriculture, impacting food security and economic stability. Over time, the methods used for diagnosing these diseases have undergone significant transformations. Initially, basic tools such as magnifying glasses and simple observational techniques were prevalent [1]. However, with the increasing demand for more efficient and scalable diagnostic methods, a shift toward advanced computational techniques has become necessary. This change is further driven by the growing global population and the corresponding rise in food demand, emphasizing the need for rapid and precise diagnosis of plant diseases [2].
Deep learning, an advanced branch of machine learning, is swiftly emerging as a cornerstone in the field of modern plant pathology. This technique, inspired by the human brain's structure and powered by artificial neural networks, offers extraordinary computational capabilities. Deep learning algorithms are adept at processing extensive numbers of plant images, detecting intricate patterns that are often overlooked by the human eye. They excel not only in identifying plant diseases but also in comprehending the progression and potential impacts of these diseases [3]. With the evolving complexity of plant pathogens, the focus is increasingly shifting from mere disease identification to understanding their entire trajectory and devising effective intervention strategies. Traditional diagnostic methods, largely reliant on human expertise, struggle with the unpredictable nature of disease outbreaks. In stark contrast, deep learning leverages extensive datasets to continuously refine its analysis, emerging as a crucial tool capable of anticipating disease outbreaks. This predictive advantage offers agriculturists and farmers the opportunity for timely interventions, thereby potentially reducing the adverse effects on crop yields. With the growing recognition of its capabilities, deep learning is steadily being acknowledged by researchers and industry professionals alike as a key to proactive management in the realm of plant disease control [4].
The application of deep learning in plant disease classification is not a nascent concept, and several pioneering works have underscored its efficacy. Zhang et al. [5] used convolutional neural networks (CNNs) to classify maize leaf diseases and reported an accuracy of 98.9%. In another noteworthy study, Singh et al. [6] proposed a multilayer convolutional neural network (MCNN) to classify anthracnose fungal disease in mango leaves, with results outperforming other state-of-the-art approaches on a dataset of 1,070 images. Ghosal et al. [7] studied transfer learning, repurposing pre-trained models for rice plant disease classification, and achieved a reduction in training time without compromising accuracy. More recently, Ho et al. [8] and Yuvalatha et al. [9] used ensemble CNN techniques. Both studies leveraged various transfer learning models, including ResNet and DenseNet architectures, to identify plant diseases early with high accuracy, showing promise for enhancing agricultural practices and reducing crop losses.
The application of deep learning in plant disease classification is well-established, with several studies demonstrating its efficacy. Pioneering works in this area have included the use of convolutional neural networks (CNNs) for classifying various plant diseases, showcasing remarkable accuracies and outperforming other state-of-the-art approaches and traditional methods. Ensemble learning, particularly methods like soft-voting and its improved variant, weighted soft-voting, has further enhanced model performance [10]-[20]. This research contributes to the field of plant disease classification by addressing a critical gap in the application of ensemble learning methodologies. Existing literature predominantly focuses on the effectiveness of ensemble methods like soft-voting and its more sophisticated variant, weighted soft-voting. However, these methods typically rely on the assumption that a model's performance is consistent across all classes, an assumption that this study challenges. We propose that a model may exhibit suboptimal performance in most classes but excel in specific ones. To capitalize on this insight, our research refines the weighted soft-voting approach by assigning unique weights to each label within a model, rather than applying a uniform weight across the entire model. This strategy aims to significantly improve the precision and accuracy of plant disease classification by harnessing the distinct strengths of individual models in their areas of competence. By adopting this more granular approach to weight distribution, we anticipate not only enhanced predictive performance in complex classification scenarios but also the establishment of a new benchmark in the utilization of ensemble learning techniques.
In the field of plant disease classification, ensemble learning has become a pivotal approach, harnessing the collective strengths of multiple models to enhance prediction accuracy and robustness [21]. Among these methods, soft-voting and its advanced form, weighted soft-voting, stand out for their effectiveness in complex classification tasks. These techniques integrate diverse model predictions, adjusting the influence of each based on its demonstrated accuracy. However, this approach often overlooks the varying performance of models across different classes. This research aims to refine this aspect by introducing a novel adaptation to weighted soft-voting, assigning individual weights to each label of a model, thereby enhancing precision in plant disease classification.
Ensemble learning is increasingly favoured in machine learning due to its ability to leverage the strengths of multiple models, thereby enhancing overall performance. This approach is particularly effective in complex tasks where a single model might not capture all aspects of the data or may be prone to overfitting. By combining predictions from multiple models, ensemble methods often achieve higher accuracy and robustness compared to individual models. This makes them especially suitable for applications in fields like plant disease classification, where the accuracy and reliability of predictions can have significant practical implications. In the broader scope of ensemble methods, soft-voting stands out as a popular technique. Soft-voting is an ensemble method that combines the predictions of multiple models by averaging their prediction scores for each label and taking the label with the highest averaged score as the output [22]. Studies have shown the simplicity and effectiveness of the soft-voting method in improving the performance of machine-learning models [10], [18].
As advancements in ensemble techniques continued, soft-voting underwent refinements to further enhance its classification results. Recognizing that different models exhibit varying levels of performance, some being strong and others weak, it became logical to assign more influence to stronger models in the voting process. This led to the development of weighted soft-voting, where each prediction from a model is given a weight based on its prior prediction accuracy [23].
In prior research, the application of weighted soft-voting for plant disease classification has been investigated, with weights being assigned at the model level [24], [25]. These studies, however, are based on the critical assumption that a model's performance is uniformly effective across all classifications, a presumption that may not always hold. This conventional approach tends to overlook instances where a model, despite its overall average performance, could exhibit exceptional accuracy in specific classes. Our study addresses this oversight by introducing a refined version of the weighted soft-voting method, wherein weights are allocated to each label within a model independently. This modification aims to enhance the granularity and accuracy of classification outcomes, particularly in complex scenarios where the strengths and weaknesses of models vary markedly across different tasks. By adopting this tailored approach, our research endeavours to bring a more sophisticated and precise dimension to the field of plant disease classification using ensemble learning techniques.
In the initial phase of our research, we ran a preliminary experiment to assess the individual performance of various models in classifying specific labels within a grape disease dataset; the labels are shown in Figure 3, in section 3. This was crucial to understanding the unique strengths and weaknesses of each model. Ensuring data variation and balance is crucial in this phase. Balance is achieved by equalizing the amount of data across all labels to prevent f1-score bias, while varying the data characteristics enhances the robustness of the f1-score. The results from this Preliminary Experiment can be seen in Table 1 and Figure 1. Models such as DenseNet201 and Xception displayed a propensity for accurately classifying labels 2 and 3, but their performance was less effective for labels 0 and 1. Interestingly, despite their similarities, DenseNet201 generally outperformed Xception. On the other hand, Inception-V3 showed a contrasting strength, being more adept at classifying labels 0 and 1 than labels 2 and 3.
These findings underscored the importance of a nuanced approach in the subsequent phase of the research. By assigning weights based on the demonstrated strengths of each model in classifying specific labels of grape diseases, we anticipate an enhancement in the overall accuracy of the ensemble. The next step involves the application of these insights to compute weight values for each model label, as outlined in Formula 5. This step is a key part of enhancing our ensemble method for the main experiment discussed in section 3.

Soft-voting and Weighted Soft-voting
Soft-voting distinguishes itself as a key technique in ensemble methods. This approach averages the prediction scores from multiple models, using the highest score for the final classification. Further refinement led to the development of weighted soft-voting, where predictions are weighted based on a model's accuracy, thus improving the method's precision in various applications.
Let $h_i$ be the ensemble score for label $i$, $T$ be the number of models, $p_{m,i}$ be the prediction score from model $m$ for label $i$, and $w_m$ be the weight for the predictions of model $m$. The ensemble prediction of the CNN models for image classification can then be written as Formula 1 for normal soft-voting and as Formula 2 for weighted soft-voting.
$$h_i = \frac{1}{T} \sum_{m=1}^{T} p_{m,i} \qquad (1)$$
$$h_i = \sum_{m=1}^{T} w_m \, p_{m,i} \qquad (2)$$
Based on [20], let $a_n$ be the accuracy of model $n$ and $w_m$ be the weight for model $m$; the weight $w_m$ that enters Formula 2 is calculated from these model accuracies.
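The two voting rules described above can be illustrated with a minimal Python sketch. The weighting rule in `accuracy_weights` (each model's accuracy divided by the sum of all accuracies) is an assumption made for illustration, since the exact weight formula is not reproduced in this text. The function names, the scores, and the accuracies are all hypothetical.

```python
def soft_vote(scores):
    """Normal soft-voting: average the scores of all models for each label.

    scores[m][i] is the prediction score of model m for label i.
    """
    n_models = len(scores)
    n_labels = len(scores[0])
    return [sum(row[i] for row in scores) / n_models for i in range(n_labels)]


def accuracy_weights(accuracies):
    """Assumed rule: weight of a model = its accuracy / sum of accuracies."""
    total = sum(accuracies)
    return [a / total for a in accuracies]


def weighted_soft_vote(scores, weights):
    """Weighted soft-voting: one weight per model, shared by all labels."""
    n_labels = len(scores[0])
    return [sum(w * row[i] for w, row in zip(weights, scores))
            for i in range(n_labels)]


def predict(ensemble_scores):
    """Return the index of the label with the highest ensemble score."""
    return max(range(len(ensemble_scores)), key=lambda i: ensemble_scores[i])


# Hypothetical scores of two models over four labels.
scores = [
    [0.6, 0.3, 0.1, 0.0],  # model 0
    [0.2, 0.6, 0.1, 0.1],  # model 1
]
weights = accuracy_weights([0.9, 0.6])  # hypothetical accuracies

plain = soft_vote(scores)
weighted = weighted_soft_vote(scores, weights)
print("soft-voting label:", predict(plain))
print("weighted soft-voting label:", predict(weighted))
```

In this toy case the more accurate model 0 dominates the weighted vote, so the two rules pick different labels. That is the behaviour weighted soft-voting is designed to produce.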

Proposed Method
Rather than assigning one weight to each model as in Formula 2, the method proposed in this paper assigns a weight to each label within each model, which gives the ensemble a chance to be more robust and accurate.
Figure 2 shows the architecture of the proposed method, and Formula 3 gives the ensemble prediction score it computes.
$$h_i = \sum_{m=1}^{T} w_{m,i} \, p_{m,i} \qquad (3)$$
where $h_i$ is the ensemble score for label $i$, $T$ is the number of models, $p_{m,i}$ is the prediction score from model $m$ for label $i$, and $w_{m,i}$ is the weight for the prediction of model $m$ for label $i$.
Figure 2. The proposed method of modifying the weighted soft-voting
Additionally, instead of using the accuracy score to calculate the weight of each label, the f1-score is a better choice, since it is a metric that describes the performance of the model in predicting each individual label or class. The weight is given by Formula 4, where $w_{m,i}$ is the weight of the prediction of model $m$ for label $i$ and $f_{m,i}$ is the f1-score of model $m$ for label $i$.
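The label-level weighting can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the rule in `label_weights` (each model's f1-score for a label divided by the sum of all models' f1-scores for that label) is assumed, because the exact form of Formula 4 is not reproduced in this text. The function names, f1-scores, and prediction scores are hypothetical.

```python
def label_weights(f1_scores):
    """Assumed rule: normalise the f1-scores of all models label by label.

    f1_scores[m][i] is the f1-score of model m for label i, measured on a
    held-out set such as the one used in the Preliminary Experiment.
    """
    n_models = len(f1_scores)
    n_labels = len(f1_scores[0])
    weights = [[0.0] * n_labels for _ in range(n_models)]
    for i in range(n_labels):
        total = sum(f1_scores[m][i] for m in range(n_models))
        for m in range(n_models):
            weights[m][i] = f1_scores[m][i] / total
    return weights


def label_weighted_soft_vote(scores, weights):
    """Ensemble score per label, using a separate weight per model and label."""
    n_models = len(scores)
    n_labels = len(scores[0])
    return [sum(weights[m][i] * scores[m][i] for m in range(n_models))
            for i in range(n_labels)]


# Hypothetical f1-scores: model 0 is stronger on labels 0 and 1,
# model 1 is stronger on labels 2 and 3.
f1_scores = [
    [0.9, 0.9, 0.6, 0.6],
    [0.6, 0.6, 0.9, 0.9],
]
# Hypothetical prediction scores for one image.
scores = [
    [0.5, 0.1, 0.4, 0.0],
    [0.3, 0.1, 0.6, 0.0],
]

weights = label_weights(f1_scores)
ensemble = label_weighted_soft_vote(scores, weights)
predicted = max(range(len(ensemble)), key=lambda i: ensemble[i])
print("ensemble scores:", ensemble)
print("predicted label:", predicted)
```

With per-label weights the ensemble scores no longer need to sum to one. Only their ranking matters for the final label.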

CNN Architectures
A convolutional neural network (CNN) is a special type of multilayer neural network, or deep learning architecture, inspired by the visual system of living beings [26]. Many advanced CNN architectures have been proposed in recent years. In this paper, the experiment ensembles three CNN architectures: DenseNet201, InceptionV3, and Xception.
DenseNet201 is one of the architectures in the DenseNet family and has 201 layers. The key idea behind DenseNet is to connect all layers in a feedforward fashion so that each layer receives the feature maps from all preceding layers as input. This creates a dense block, where the input and output of each layer are concatenated together. By doing this, DenseNet can reuse features learned at earlier layers, leading to higher accuracy while also reducing the chance of vanishing gradients and the number of parameters [26]. DenseNet has been used in various image classification tasks, such as in articles [27], [28] and [29].
InceptionV3 is an improved version of the basic model of the Inception family. The InceptionV3 model is made up of 42 layers, which is somewhat more than the previous versions (the Inception V1 and V2 models). InceptionV3 uses a combination of 1x1, 3x3, and 5x5 convolutions, as well as max pooling and average pooling, to extract features from images. It is built around the inception module, which consists of multiple parallel convolutional branches with different filter sizes. By doing this, InceptionV3 can capture both local and global features in an image. In addition, InceptionV3 uses factorized convolutions to reduce the number of parameters in the network. This involves decomposing a standard convolution into two separate convolutions with smaller filter sizes, which reduces the number of parameters without sacrificing accuracy. Together, these design choices make InceptionV3 notably efficient [30]. Examples of research using this model can be found in articles [31] and [32].
The last CNN architecture is Xception. It is based on the Inception architecture, but it replaces the standard Inception modules with depthwise separable convolutions. Depthwise separable convolutions consist of two separate operations: a depthwise convolution and a pointwise convolution. The depthwise convolution applies a single filter to each input channel, while the pointwise convolution applies a 1x1 filter to combine the output of the depthwise convolution across all channels. By doing this, Xception can capture complex patterns in the input data using fewer parameters. In addition to depthwise separable convolutions, Xception also uses a series of residual connections, which help alleviate the problem of vanishing gradients during training. These connections allow gradients to flow more easily through the network, which can lead to faster convergence and better performance. The advantages of Xception are that the architecture is easy to implement in code and that it reduces the computational cost [33]. Xception has been used for several classification cases, such as weather image classification [34] and plant disease classification [35].
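The parameter saving of a depthwise separable convolution can be checked with a short calculation. The sketch below counts weights only (biases are ignored). The layer sizes are hypothetical and are not taken from the Xception architecture.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights of a standard convolution: one kernel x kernel x c_in filter
    for each of the c_out output channels."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights of a depthwise separable convolution."""
    depthwise = kernel * kernel * c_in  # one kernel x kernel filter per input channel
    pointwise = c_in * c_out            # 1x1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 128 input channels, 256 output channels.
standard = standard_conv_params(3, 128, 256)
separable = depthwise_separable_params(3, 128, 256)
print("standard:", standard)
print("depthwise separable:", separable)
print("reduction factor:", standard / separable)
```

For this layer size the separable version needs 33,920 weights instead of 294,912, which is roughly 8.7 times fewer.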

Evaluation Models
There are several metrics to measure how well the models or methods perform. In this experiment, the metrics used are recall, precision, f1-score, accuracy, and the confusion matrix. The formulas for these metrics are given in Formulas 5, 6, 7, and 8, where TP is the number of true positives (the model predicts a positive result and the ground truth is also positive), TN is the number of true negatives (the model predicts a negative result and the ground truth is also negative), FP is the number of false positives (the model predicts a positive result, but the ground truth is negative), and FN is the number of false negatives (the model predicts a negative result, but the ground truth is positive).
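These metrics can be computed from a confusion matrix, as in the Python sketch below. It uses the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), f1 = 2 * precision * recall / (precision + recall), and overall accuracy = correct predictions / all predictions. Each class is treated as the positive class in turn, which is an assumption about how the per-label scores are obtained. The function name and the confusion matrix values are hypothetical.

```python
def per_class_metrics(confusion):
    """Compute per-class precision, recall, and f1, plus overall accuracy.

    confusion[t][p] is the number of samples whose true label is t and whose
    predicted label is p.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    report = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # true c, predicted other
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, true other
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report.append({"precision": precision, "recall": recall, "f1": f1})
    accuracy = sum(confusion[c][c] for c in range(n)) / total
    return report, accuracy


# Hypothetical two-class confusion matrix.
confusion = [
    [8, 2],
    [1, 9],
]
report, accuracy = per_class_metrics(confusion)
for label, row in enumerate(report):
    print(label, row)
print("accuracy:", accuracy)
```

The per-class f1-scores produced this way are the kind of values that can feed the label-level weights of the proposed method.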

Dataset
In this study, we utilized a dataset of grape leaf disease images provided by PlantVillage [36], [37]. The focus is on four primary classes: Black Rot (1,180 images), Esca (1,382 images), Leaf Blight (1,076 images), and Healthy (423 images). To create a robust dataset for machine learning analysis, each class has been augmented to a total of 2,000 images. This standardization ensures uniformity across all classes. The augmented dataset is systematically divided into three key segments. Firstly, the training phase consists of 700 images for model training and 300 images for validation. Secondly, the Preliminary Experiment testing phase utilizes a subset of 500 images. Lastly, the Experiment testing phase also employs 500 images. These counts sum to 2,000, matching the per-class total. For ease of analysis and clarity, each class within the dataset is distinctly labelled, as illustrated in Figure 3.

Training Process
Our analysis of the training process, as illustrated in Figure 4 and Figure 5, reveals a stark contrast between models trained with and without pre-trained weights. The use of pre-trained models significantly accelerated convergence, leading to faster stabilization in both accuracy and validation accuracy. This suggests that the pre-trained models were able to leverage previously learned patterns, thereby reducing the time and computational resources needed to reach optimal performance.
Additionally, the figures show a more stable and consistent decline in both loss and validation loss for the pre-trained models compared to their counterparts. This stability indicates that pre-trained models are not only faster in reaching convergence but also more reliable in maintaining consistent performance throughout the training process. The combined insights from these figures underscore the efficacy of using pre-trained models in enhancing the efficiency and robustness of the training phase.

Result
The outcomes of applying the proposed methods in the experiment testing are detailed in Tables 2-5, which present the classification report, and Figures 6-7, illustrating the confusion matrices. Table 4 and Figure 5 provide an initial assessment of each model's performance in the testing phase, revealing that all models yield results consistent with those observed in the Preliminary Experiment. This consistency validates the reliability of the Preliminary Experiment data for calculating the weight labels in our proposed method. A comprehensive summary of all ensemble testing experiments is documented in Tables 5-8 and Figure 6.
These tests demonstrate that our proposed method, which involves assigning weights to each model's label, outperforms traditional soft-voting and weighted soft-voting methods. Notably, the highest accuracy score achieved is 0.96650, obtained by the combination of Inception-V3 and DenseNet201, while the lowest is 0.90100, resulting from the DenseNet201 and Xception pairing.
One of the most intriguing aspects of our study is the differing outcomes produced by various ensemble combinations. Insights drawn from the Preliminary Experiment and the individual performance metrics of each model, as detailed in Table 1, Table 2, Figure 1, and Figure 6, reveal a notable distinction. DenseNet201 and Xception, while sharing similar characteristics, differ fundamentally from Inception-V3. This similarity between DenseNet201 and Xception implies a limitation in their ability to augment each other's performance, as they are prone to similar weaknesses. In contrast, when either is combined with Inception-V3, which exhibits a different set of characteristics, there is a complementary interaction. This synergy allows for the mutual compensation of weaknesses, where the strengths of one model effectively counterbalance the shortcomings of the other. Such findings highlight the critical importance of model diversity in constructing effective ensemble approaches, emphasizing how distinct model characteristics can lead to a more robust overall system.
Closer analysis reveals the reason for the significant outperformance of the DenseNet201 and Inception-V3 ensemble, which achieves an accuracy of 0.96650, compared to the Xception and Inception-V3 combination, which attains an accuracy of 0.93700. A detailed analysis of Tables 1 and 4 offers a compelling explanation. It becomes clear that DenseNet201 consistently surpasses Xception across a range of evaluation metrics. This difference in individual model performance becomes even more pronounced when all three models, DenseNet201, Inception-V3, and Xception, are integrated into a single ensemble, achieving a collective accuracy of 0.95150. From this, we can deduce that Xception, while sharing similarities with DenseNet201, unfortunately acts as a limiting factor in the ensemble configuration, thereby reducing the overall efficacy of the method. These findings highlight the crucial role of strategic model selection in ensemble techniques, underscoring the need for each component to enhance, rather than impede, the collective performance.

Additional Testing
The validation of our proposed method is extended by applying it to an alternative dataset [38], thereby examining its generalizability and consistency across different scenarios. This testing is crucial to ascertain the robustness of our ensemble approach, particularly when exposed to diverse data characteristics. The results, presented in Table 7, demonstrate a remarkable consistency in performance, mirroring the high accuracy and precision observed in the primary dataset. Specifically, the model combination of InceptionV3 and DenseNet201, which yielded the highest accuracy in the primary testing, continues to exhibit superior performance with an accuracy score of 0.96562 in the alternate dataset. Similarly, the combination of DenseNet201 and Xception maintains its lower, yet stable, accuracy score of 0.88687. These findings suggest that the predictive capabilities of our model are not dataset-specific, but rather indicative of the inherent strength of the proposed ensemble approach. The consistent performance across varied datasets reinforces the validity of our initial hypothesis and underscores the potential of our refined weighted soft-voting method in diverse plant disease classification scenarios.

Conclusions
This research marks a significant advancement in the application of convolutional neural networks (CNNs) for the classification of plant disease images, particularly in grape leaves. By adapting the weighted soft-voting method to assign weights to each label within each model, rather than one uniform weight across all labels of a model, the study introduces a refined approach that considerably enhances classification accuracy. This method was tested on well-known CNN architectures, namely Xception, DenseNet201, and Inception-V3. The results consistently demonstrated its superiority, outperforming traditional ensemble strategies like normal soft-voting and weighted soft-voting. Among the combinations tested, the ensemble of DenseNet201 and Inception-V3 was particularly effective, delivering an accuracy of 96.65% in identifying grape plant diseases.
The research underscores the significant influence of individual model characteristics on the ensemble's overall performance. A critical insight from this study is the synergistic effect observed when models with contrasting strengths are paired. For instance, the pairing of DenseNet201 and Inception-V3, which exhibit complementary characteristics, resulted in a significant boost in accuracy. This contrasts with combinations like DenseNet201 and Xception, where similarity in model characteristics did not produce the same level of efficacy. Moreover, the study also highlights the importance of model selection in ensemble methods. The analysis revealed that certain models, though similar, could act as performance bottlenecks when paired inappropriately. This observation emphasizes the need for careful consideration of individual model performances and their interactions within an ensemble.

Figure 3. Sample images of grape diseases dataset and their labels

Figure 4. Comparison of accuracy and validation accuracy of the training process with and without a pre-trained model

Figure 7. Confusion matrix of the proposed method from the crossing ensemble model

Table 1. Accuracy and f1-score of each model for preliminary experiment testing

Table 2. Accuracy and f1-score of models in experiment testing

Table 3. Accuracy and f1-score of ensemble models using Xception and InceptionV3

Table 4. Accuracy and f1-score of ensemble models using Xception and DenseNet201

Table 5. Accuracy and f1-score of ensemble models using

Table 7. Accuracy score of the proposed method in the alternative dataset