Implementation of Self-Organizing Map (SOM) Algorithm for Image Classification of Medicinal Weeds

Wild plants, or weeds, are often regarded as pests because they compete with cultivated crops. However, many weeds contain ingredients that are beneficial to the body and can be used as medicine. Many people still lack knowledge about which weeds have medicinal properties, particularly which leaves. The purpose of this research is to classify images of medicinal weed leaves based on color and texture characteristics with an artificial neural network, the Self-Organizing Map (SOM). To enrich the information obtained in feature extraction, RGB and HSV color features are used together with texture features from the Gray Level Co-occurrence Matrix (GLCM). The extracted features are then grouped into classes by the SOM algorithm, which divides the input patterns into several groups so that the network output is the group most similar to the given input. Testing produced a precision of 91.11%, a recall of 88.17%, and an accuracy of 89.44%. This accuracy places the SOM model for image classification of medicinal weed leaves in the good category.


Introduction
Its tropical climate makes Indonesia rich in biological natural resources. This diversity is very beneficial, especially given the many species of herbs and plants that can be used medicinally. Indonesia has around 30,000 plant species, of which approximately 9,600 contain ingredients that can be used as medicine [1]. The use of plants as medicine has long been practiced throughout the world, in both developing and developed countries. The World Health Organization (WHO) notes that herbal plants have been used as medicine by up to 65% of the population in developed countries and 80% of the population in developing countries [2]. Plants are used as medicine because they contain ingredients that are useful to and needed by the human body [3]. Such ingredients are also found in wild plants. Wild plants, or weeds, are often regarded as pests because they compete with the main cultivated plants, and they usually grow in unwanted parts of a plantation. Nevertheless, these weeds contain ingredients that can be used for treatment, and the leaves are one of the plant parts that can be used as medicine. However, many people do not know which weed leaves have medicinal properties. Thus, a system capable of classifying images of weed species that have medicinal properties is needed. Digital image processing can be a solution to this problem.
Digital image processing can be understood as the digital manipulation and interpretation of images to obtain usable information [4]. Leaf images can provide information for identifying plants [5]. Image classification is one of several applications of image processing; it is the process of assigning image elements to groups so that they can be interpreted as a specific property [6], [7]. Previous research on the classification and identification of medicinal plant images includes a study that combined the K-Nearest Neighbor (KNN) method with Principal Component Analysis (PCA) to classify medicinal plants [8]. The model proposed in that study classified herbal plants with an accuracy of 88.67%. The PCA algorithm reduces the data, and the KNN algorithm classifies a sample based on its proximity to other data. However, KNN has weaknesses in handling outliers and is vulnerable to non-informative variables [9]. Another study classified medicinal plant images with the Support Vector Machine (SVM) method [10]. The model developed in that study achieved an accuracy of up to 76%. However, the SVM method is less effective in complex multi-class cases, because SVM works by finding the best hyperplane that separates the data into two classes. A subsequent study classified images of herbal leaves with a backpropagation neural network [11]. In testing, that model obtained an accuracy of up to 88.75%. Artificial neural networks make it possible to build systems that learn in a way adopted from the workings of human nerves [12], which is why they are often used to solve classification problems. However, the backpropagation neural network has a weakness: it cannot provide information about weights, which affects the input pattern and results in inconsistent training results [13]. For this reason, the classification problem requires an algorithm that can divide classes based on dimensional space with the right characteristics.
In this study, the Self-Organizing Map (SOM) artificial neural network algorithm was applied. The SOM algorithm groups data based on similarity or dissimilarity without any previously known clusters, which is called unsupervised learning [14]. Structurally, clustering is divided into two types: hierarchical clustering, in which a single data item can be considered a cluster, and partition-based clustering, which divides the data set into clusters that do not overlap [15]. The SOM algorithm can describe data through dimension reduction, which makes it easier to represent high-dimensional data by mapping it onto a low-dimensional space [16]. Because the SOM algorithm learns without supervision, the structured topology is divided into units or clusters. Several studies applying the SOM algorithm to image classification have obtained good results. One study applied the SOM method to facial expression recognition [17]; in an evaluation on the Cohn-Kanade and AT&T datasets, the performance of the developed model reached 96.81% and 96.55%. Another study classified hyperspectral images using the SOM algorithm [18] and obtained an optimal accuracy of 96.30%. A further study applied the SOM to medical image retrieval with texture-based feature extraction [19]; the developed model achieved an average accuracy of 93.33%.
The purpose of this research is to classify images of medicinal weed leaves based on color and texture characteristics with an artificial neural network, the Self-Organizing Map (SOM). To enrich the information obtained in feature extraction, RGB and HSV color features are used together with texture features from the Gray Level Co-occurrence Matrix (GLCM). The values obtained in feature extraction are then the input for grouping into classes by the SOM algorithm, which divides the input patterns into several groups so that the network output is the group most similar to the given input.

Research Methods
To conduct the research in a structured manner that is consistent with the research objectives, the research stages need to be arranged. The research stages consist of the steps taken to solve the research problem [20]. The steps taken by the researchers are presented in Figure 1.

Collecting Dataset
A dataset is a crucial factor in pattern recognition and classification, because dataset availability is a determining factor for model performance [21]. The medicinal weeds used in this study were herbaceous vegetation. In botany, herbaceous vegetation refers to herbs, but in the field of herbal medicine it means fresh plant parts or parts with high water content that are used as a tonic or medicine [2]. Types of wild herbaceous vegetation used based on the book

Image Segmentation with Thresholding
Image segmentation is a method of breaking digital images into subgroups known as segments [22]. Usually, the splitting or grouping is based on the needs of image processing. Image segmentation can take the form of separating the foreground from the background or grouping pixel regions of the same color or shape [23], [24]. The thresholding technique requires a limit value known as the threshold value. An image intensity value that is greater than or equal to the threshold value is changed to 1 (white), while an image intensity value that is less than the threshold value is changed to 0 (black).
The image pixel values are converted to binary in the segmentation process using equation (1).
$$g(x, y) = \begin{cases} 1, & f(x, y) \geq T \\ 0, & f(x, y) < T \end{cases} \tag{1}$$
where $f(x, y)$ is the grayscale image, $g(x, y)$ is the binary image, and $T$ denotes the threshold value.
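The thresholding rule described above can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch, not the authors' MATLAB implementation; the function name `threshold_image` and the sample values are assumptions made for the example.

```python
import numpy as np

def threshold_image(gray, t):
    """Binarize a grayscale image: 1 where intensity >= t, otherwise 0."""
    gray = np.asarray(gray)
    return (gray >= t).astype(np.uint8)

# Hypothetical 2x2 grayscale patch with intensities in 0..255.
gray = np.array([[10, 200],
                 [128, 90]])
binary = threshold_image(gray, t=128)
print(binary.tolist())  # [[0, 1], [1, 0]]
```

Multiplying each RGB channel by such a binary mask is one simple way to restore the segmented object to a color image before feature extraction.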

Feature Extraction Using Color and Texture
Feature extraction is a process for obtaining the characteristics that distinguish an object from other objects [25]. For color, RGB and HSV features are used. This feature extraction captures the color information contained in the object. RGB and HSV values are obtained from the average color values in the image. The mean feature is computed with equation (2).
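As an illustration of the mean color features, the following sketch averages the R, G, B channels and the H, S, V channels over the leaf pixels. It is a minimal sketch under stated assumptions: the function name `mean_color_features` and the optional `mask` argument are hypothetical, channels are scaled to the range 0 to 1, and hue is averaged arithmetically even though hue is a circular quantity.

```python
import colorsys
import numpy as np

def mean_color_features(rgb, mask=None):
    """Return [mean R, mean G, mean B, mean H, mean S, mean V].

    rgb  : (height, width, 3) array with values in 0..255
    mask : optional (height, width) boolean array, True for object pixels
    """
    rgb = np.asarray(rgb, dtype=np.float64) / 255.0
    if mask is None:
        mask = np.ones(rgb.shape[:2], dtype=bool)
    pixels = rgb[mask]  # shape (number of selected pixels, 3)
    # Convert every selected pixel to HSV with the standard library.
    hsv = np.array([colorsys.rgb_to_hsv(*p) for p in pixels])
    # Simplification: hue is circular, but is averaged like a plain number.
    return np.concatenate([pixels.mean(axis=0), hsv.mean(axis=0)])

# Hypothetical 2x2 image that is pure green everywhere.
image = np.zeros((2, 2, 3), dtype=np.uint8)
image[..., 1] = 255
features = mean_color_features(image)
```

Passing the binary segmentation mask as `mask` restricts the averages to the leaf and keeps the background from diluting the color features.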
Meanwhile, the texture features use the Gray-Level Co-Occurrence Matrix (GLCM) approach. GLCM obtains features from an image by calculating the probability of the adjacency relationship between two pixels at a certain distance and angular orientation [26]. The GLCM features used are energy, contrast, correlation, and homogeneity. Each GLCM feature is explained below.

1) Energy
This feature measures uniformity and is often called the angular second moment. Energy is high when the pixel values are similar to each other. The energy value can be calculated using equation (3).

2) Contrast
This feature represents the difference in intensity between the highest (brightest) and lowest (darkest) values of a pair of adjacent pixels. The contrast value can be calculated using equation (4).

3) Correlation
This feature measures the linearity of a number of pixel pairs. The correlation value can be calculated using equation (5).

4) Homogeneity
This feature measures the homogeneity of image intensity variations. The homogeneity value can be calculated using equation (6).
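The four texture features can be illustrated with a small NumPy sketch. The formulas below are the commonly used GLCM definitions and are an assumption here, since they may differ in detail from the paper's equations (3) to (6); for example, some libraries define energy as the square root of the angular second moment. The function names `glcm` and `glcm_features` are hypothetical.

```python
import numpy as np

def glcm(gray, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for the pixel offset (dy, dx)."""
    gray = np.asarray(gray)
    h, w = gray.shape
    counts = np.zeros((levels, levels), dtype=np.float64)
    for r in range(max(0, -dy), min(h, h - dy)):
        for c in range(max(0, -dx), min(w, w - dx)):
            counts[gray[r, c], gray[r + dy, c + dx]] += 1
    return counts / counts.sum()

def glcm_features(p):
    """Energy, contrast, correlation and homogeneity of a normalized GLCM."""
    i, j = np.indices(p.shape)
    energy = np.sum(p ** 2)                      # angular second moment
    contrast = np.sum((i - j) ** 2 * p)
    homogeneity = np.sum(p / (1.0 + np.abs(i - j)))
    mu_i, mu_j = np.sum(i * p), np.sum(j * p)
    sd_i = np.sqrt(np.sum((i - mu_i) ** 2 * p))
    sd_j = np.sqrt(np.sum((j - mu_j) ** 2 * p))
    cov = np.sum((i - mu_i) * (j - mu_j) * p)
    # Correlation is undefined for a perfectly uniform image.
    correlation = cov / (sd_i * sd_j) if sd_i * sd_j > 0 else float("nan")
    return {"energy": energy, "contrast": contrast,
            "correlation": correlation, "homogeneity": homogeneity}

# Hypothetical 4x4 image quantized to 4 gray levels, horizontal neighbor at distance 1.
sample = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [0, 2, 2, 2],
                   [2, 2, 3, 3]])
texture = glcm_features(glcm(sample, levels=4, dx=1, dy=0))
```

In practice the gray levels are usually quantized to a small number (such as 8 or 16) before counting, and features from several angles may be averaged.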
Image Classification with Self-Organizing Maps (SOM)
The Self-Organizing Map (SOM) is an artificial neural network method first introduced by Teuvo Kohonen in 1981. The SOM algorithm is an unsupervised artificial neural network algorithm: its training process does not involve supervision [27]. It is called self-organizing because it learns without supervision, and it is called a map because it tries to map its weights to match the given input data. The neurons in this network arrange themselves into groups, commonly called clusters, based on the input values [28]. Figure 2 shows the architecture of the SOM algorithm. In Figure 2, the winning neuron is the cluster unit whose weight vector is closest to the input. The winning neuron and its neighboring neurons then update their respective weights.
The steps of the SOM algorithm are:
1) Initialize the input neurons $x_1, x_2, \ldots, x_n$.
2) Initialize the output neurons (output layer) $y_1, y_2, \ldots, y_m$.
3) Fill in the weights $w_{ij}$ between the input and output neurons with random numbers from 0 to 1.
4) Select one of the inputs from the existing input vectors.
5) Calculate the distance $d_j$ between the input vector and the weights of each output neuron using equation (8).
6) Of all the distances $d_j$, find the smallest. The index of the most similar weights is called the winning neuron.
7) For each weight $w_{ij}$, update the connection weights using the formula in equation (9).
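The training steps above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: it assumes squared Euclidean distance, the standard Kohonen update w_new = w_old + alpha * (x - w_old), a one-dimensional row of output neurons, and a linearly decaying learning rate. The names `train_som`, `find_winner`, `lr`, and `radius` are hypothetical.

```python
import numpy as np

def train_som(data, n_outputs, epochs=50, lr=0.5, radius=0, seed=0):
    """Train a one-dimensional SOM and return its weight matrix."""
    data = np.asarray(data, dtype=np.float64)
    rng = np.random.default_rng(seed)
    # Random initial weights between 0 and 1.
    weights = rng.random((n_outputs, data.shape[1]))
    for epoch in range(epochs):
        alpha = lr * (1.0 - epoch / epochs)  # assumed linear decay
        for idx in rng.permutation(len(data)):
            x = data[idx]
            # Squared Euclidean distance to every output neuron.
            distances = np.sum((weights - x) ** 2, axis=1)
            winner = int(np.argmin(distances))
            # Update the winner and its neighbors within `radius`.
            lo, hi = max(0, winner - radius), min(n_outputs, winner + radius + 1)
            weights[lo:hi] += alpha * (x - weights[lo:hi])
    return weights

def find_winner(weights, x):
    """Index of the output neuron whose weights are closest to x."""
    x = np.asarray(x, dtype=np.float64)
    return int(np.argmin(np.sum((weights - x) ** 2, axis=1)))

# Hypothetical feature vectors already scaled to the range 0..1.
samples = np.array([[0.10, 0.10], [0.15, 0.10], [0.90, 0.90], [0.85, 0.90]])
som_weights = train_som(samples, n_outputs=2)
cluster = find_winner(som_weights, samples[0])
```

Because the initial weights are random, different seeds can yield different cluster assignments. To use the clusters as classes, each output neuron still has to be associated with a class label, for example by majority vote over the training samples it wins.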

Evaluation
The next stage is evaluation, which measures the performance of the model [29]. At this stage, the precision, recall, and accuracy values are computed from the confusion matrix, which can be used to measure performance in classification problems. The confusion matrix has four different combinations obtained by comparing predicted and actual values: true positive, false positive, true negative, and false negative. These are used to find the precision, recall, and accuracy values [30]. To get the values of precision, recall, and accuracy, equations (9), (10), and (11) are used.
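The following sketch shows one common way to derive the three metrics from a multi-class confusion matrix. It assumes the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN) per class, averaged over classes (macro averaging), and accuracy as the share of correctly classified samples. The averaging scheme is an assumption and may differ from the one used in the paper; the matrix values are invented for illustration only.

```python
import numpy as np

def classification_metrics(cm):
    """Macro precision, macro recall and accuracy from a confusion matrix.

    cm[i, j] is the number of samples of actual class i predicted as class j.
    Assumes every class is predicted at least once and has at least one sample.
    """
    cm = np.asarray(cm, dtype=np.float64)
    tp = np.diag(cm)             # correctly classified samples per class
    fp = cm.sum(axis=0) - tp     # predicted as the class, but actually another
    fn = cm.sum(axis=1) - tp     # belonging to the class, but predicted as another
    precision = float(np.mean(tp / (tp + fp)))
    recall = float(np.mean(tp / (tp + fn)))
    accuracy = float(tp.sum() / cm.sum())
    return precision, recall, accuracy

# Hypothetical 3-class confusion matrix (not the paper's results).
toy_cm = [[5, 1, 0],
          [0, 4, 2],
          [1, 0, 5]]
precision, recall, accuracy = classification_metrics(toy_cm)
```

In the multi-class case, the true negatives of a class are all samples that neither belong to it nor are predicted as it, which is why they do not appear explicitly in the per-class precision and recall.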

Results and Discussions
In this study, we classified images of medicinal weeds of the herbaceous vegetation type, taking five types of wild plants in Indonesia: Susuruhan (Peperomia pellucida (L.) Kunth), Jawer Kotok (Coleus scutellarioides (L.) Benth), Anting-anting (Acalypha indica L.), Daun Kahitutan (Paederia scandens (Lour.) Merr.), and Belimbing Tanah (Oxalis barrelieri L.). Image classification begins with collecting the dataset that is later used for training and testing. The dataset consists of 300 sample images of medicinal weeds, of which 210 images are used for training and 90 images for testing. Samples for each class are presented in Table 1. As seen in Figure 3, RGB-to-binary conversion is useful for distinguishing the required object from its background. With the thresholding technique, the most appropriate threshold value is determined so that the object can be distinguished from the background. The object segmented in the binary image is then restored to RGB to make the classification stage easier. The segmented image results are shown in Figure 4.
Furthermore, the segmented image is subjected to feature extraction to obtain information on the object in the image. The color features used are RGB and HSV. These features provide information about the colors contained in the image to make the classification stage easier. RGB and HSV values are obtained by calculating the average value for each color channel. Texture features are extracted using the Gray-Level Co-Occurrence Matrix (GLCM). GLCM yields feature values by calculating probabilities from the adjacency relationships between image elements at a certain distance and angular orientation, from which the energy, contrast, correlation, and homogeneity values are computed. Figure 5 shows a sample graph of the gain value for each feature. Furthermore, the trained model is used for testing. For testing, a GUI was built in MATLAB to make the system easier to use. Figure 8 is the GUI of the classification system for medicinal weed leaves. Based on Figure 9, the true positive, false positive, true negative, and false negative values obtained are used to calculate the precision, recall, and accuracy values using equations (9), (10), and (11). The results of these calculations are presented in Table 2. A further limitation is that the dataset is still small, so the model cannot learn optimally.

Conclusion
For further research, several improvements are needed, including improving the SOM algorithm, especially in determining the initial neuron weights, which are currently set randomly, in order to improve pattern recognition. In addition, the approach can be developed further using deep learning so that a wider variety of features can be handled. Experiments with larger datasets are also needed so that learning outcomes can be optimal.

Figure 2. Self-Organizing Map (SOM) Architecture

TP or True Positive indicates data that is correctly identified as positive; TN or True Negative indicates data that is correctly identified as negative; FP or False Positive indicates data that is identified as positive but is actually negative; and FN or False Negative indicates data that is identified as negative but is actually positive.

Figure 3. (a) Original Image and (b) Binary Image

Figure 4. (a) Binary Image and (b) Image Segmentation Results

Figure 5. Gain Value of Each Feature

In the feature extraction process, the extracted values are used as input for the classification of medicinal weed leaf species. These values represent the characteristics of the object to be classified. The results of feature extraction using color and texture features on the sample images of medicinal weed leaves are shown in Figure 6.

Figure 8. Medicinal Weed Classification System Interface

After the model has been applied in the MATLAB application, it is evaluated to determine its performance. The test data were 90 images of medicinal weed leaves, evenly distributed with 18 images per class. The test compared the classification results of the developed model with the classification results of an expert; the comparison was entered into the confusion matrix to find the true positive, false positive, true negative, and false negative values. The resulting confusion matrix is presented in Figure 9.

Figure 9. Confusion Matrix Results

Sample images of the medicinal plant species were taken using a 12 MP camera. Each image was taken from a single, perpendicular perspective under the same light level. The collected dataset was then divided into 70% for training and 30% for testing. The dataset consists of 300 sample images of medicinal weeds, so 210 images are used for training and 90 images for testing.
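The 70%/30% division can be illustrated with a small per-class (stratified) split. This sketch assumes 60 images for each of the five classes, which is consistent with the 300 images in total and the 18 test images per class reported for testing, but the per-class totals are an assumption. The function name `stratified_split` is hypothetical.

```python
import random

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices into train and test sets, class by class."""
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random order within the class
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test

# Assumed layout: 5 classes with 60 images each (300 images in total).
labels = [class_id for class_id in range(5) for _ in range(60)]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 210 90
```

Splitting within each class keeps the test set evenly distributed across the classes.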

Table 1. Sample Dataset for Each Class

Table 2. Precision, Recall, and Accuracy Results

Table 2 shows that the precision value is 91.11%, the recall value is 88.17%, and the accuracy value is 89.44%. The accuracy result is then assessed against the following criteria: Good, for a value between 76% and 100%; Enough, for a value between 56% and 75%; Less Good, for a value between 40% and 55%; and Not Good, for a value below 40% [31]. The accuracy obtained was 89.44%, meaning that the SOM model developed for the classification of medicinal weed leaf species is in the good category. These results are obtained because the SOM artificial neural network divides the input patterns into several groups so that the network output is the group most similar to the given input. The SOM algorithm receives feature information from color feature extraction, through the average RGB and HSV values, and from texture features, through the GLCM parameters.
However, the accuracy obtained corresponds to an error rate of 10.56%. This error rate is influenced by several factors, including: 1) the SOM algorithm sets the initial neuron weights randomly, so the resulting clustering values differ between runs; 2) most of the medicinal leaves closely resemble one another, so distinguishing them requires extraction of plant features; and 3) the developed model requires a single object per image, so data with diverse backgrounds and varied viewpoints are difficult to classify.
Hendra Mayatopani, Nurdiana Handayani, Ri Sabti Septarini, Rini Nuraini, Nofitri Heriyani. Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol. 7 No. 3 (2023). DOI: https://doi.org/10.29207/resti.v7i3.4755. Creative Commons Attribution 4.0 International License (CC BY 4.0).
This study classifies images of medicinal leaf types using the Self-Organizing Map (SOM) artificial neural network algorithm with color and texture feature extraction. Color features are extracted from RGB and HSV values, and texture features are extracted using the Gray Level Co-occurrence Matrix (GLCM). The extracted features provide information about the image that facilitates classification. The SOM algorithm can map high-dimensional data onto low-dimensional maps, thus forming data groups that can be used for classification. Testing with the confusion matrix produced a precision of 91.11%, a recall of 88.17%, and an accuracy of 89.44%. This accuracy places the SOM model for image classification of medicinal weed leaves in the good category.