Pedestrian Detection System using YOLOv5 for Advanced Driver Assistance System (ADAS)

The technology in transportation is continuously developing due to reaching the self-driving vehicle. The need of detecting the situation around vehicles is a must to prevent accidents. It is not only limited to the conventional vehicle in which accident commonly happens, but also to the autonomous vehicle. In this paper, we proposed a detection system for recognizing pedestrians using a camera and minicomputer. The approach of pedestrian detection is applied using object detection method (YOLOv5) which is based on the Convolutional Neural Network. The model that we proposed in this paper is trained using numerous epochs to find the optimum training configuration for detecting pedestrians. The lowest value of object and bounding box loss is found when it is trained using 2000 epochs, but it needs at least 3 hours to build the model. Meanwhile, the optimum model’s configuration is trained using 1000 epochs which has the biggest object (1.49 points) and moderate bounding box (1.5 points) loss reduction compared to the other number of epochs. This proposed system is implemented using Raspberry Pi4 and a monocular camera and it is only able to detect objects for 0.9 frames for each second. As further development, an advanced computing device is needed due to reach real-time pedestrian detection.


Introduction
In the study of intelligent transportation, the existence of pedestrians is the most important thing that must be considered. The Indonesian Highway Capacity Manual [1] stated that, whenever pedestrians are crossing the roads, every motorized vehicle must be stopped and let them pass the roads. In a conventional vehicle, which is fully controlled by humans, there are lots of accidents caused by the foolishness of the drivers when facing the pedestrians who cross the roads [2].
Nowadays, the development in transportation technology allowed a computer to control a vehicle, commonly called an autonomous vehicle [3]. Based on this situation, the requirement of a pedestrian detection system is needed. It can work as an early warning system for pedestrians who crossed the roads. This system is needed, especially in Indonesia which had lots of pedestrians crossing the roads not at the right time and place [4].
The accidents occur not only in human-controlled vehicles, but also occurs in autonomous vehicles [5]- [7]. This condition happened because it is not reached the full stage of its autonomous system, furthermore, it still needs control from the drivers to prevent some unpredicted events (such as accidents) [8].
This paper proposed a simple solution for preventing the accident between vehicles and pedestrians, by using object detection and a minicomputer as the main processing unit. Our proposed framework contributes to prevents accident for pedestrians. For further purpose, we tried to reduce the number of accidents with vehicles and pedestrians. The proposed framework is the first compact dashcam that built to detect pedestrians to prevent the accidents. In road segments, the probability of pedestrians crossing the road is almost predictable. Its intention depends on the distance and speed of vehicles near pedestrians [9].
The pedestrians can be detected by using Convolutional Neural Network (CNN). This method applies a deep learning method that uses convolution by moving a convolutional multiplier [10]. Redmon, et al., improved the CNN due to fastening the object's detection result [11]. Our proposed system recognizes the pedestrians by using the fifth version of You Only Look Once (YOLOv5) which runs based on the CNN method [12].
Adopting YOLOv5 in a pedestrian detection system for an advanced driver assistance system (ADAS) has The contents of this paper are structured as follows. The First section discussed the background of the pedestrian detection system using object detection. Section 2 discussed the proposed research methods for detecting pedestrians using YOLOv5 and Raspberry Pi4. Section 3 contains the result of the training model using various epochs and the detection result. At last, Section 4 shows the conclusions of this research.

Research Methods
In the autonomous vehicle, there are lots of sensors that are placed in order to be aware of the situation around it [21]. One of the ways to understand its nearby situation can be done by placing the cameras. It can capture images and later can be processed by detecting any object using computer vision techniques. This method allowed the computer to work like the eyes of humans by seeing every object that appears in front of the cameras [22].
The requirement analysis is built in order to create a pedestrian detection system. At least, it needs a standard resolution camera and an edge computer (Raspberry Pi4) which will be used as the system's main computing unit and object detection method for estimating the pedestrians in front of a vehicle. In general, this research consists of two stages. The first stage collects data on the pedestrian using the monocular camera, and the second stage detects pedestrians using YOLOv5. Figure 1 shows the proposed method for detecting pedestrians crossing the streets. In the early stage, a monocular camera will capture an image or video which appeared from its point of view. Every image that is collected, will be preprocessed in order to simplify the images, easy to process, and adjust to the need of the convolutional stage.
Whenever the preprocessing step is done, the object detection process is begun. It tried to understand every object that appeared in an image. When it detects and recognizes any objects, a bounding box is drawn around the detected object. A bounding box has information such as its position and its size.

Data Collection with Monocular Camera
The pedestrian detection system proposed in this paper uses a minicomputer that is connected to a monocular camera. The definition of a monocular camera in this paper refers to a single camera which directed in line with the vehicle's direction.
The dataset used in this study is manually customized using the image search feature on search engines. The dataset is searched by tracing the shape of the pedestrian's form in accordance with the camera settings that have been carried out on the proposed system according to the needs of this study. The dataset distribution technique uses a ratio of 6:4. These are performed since that produces a somewhat more significant fraction of the training data, hence optimizing the training. The dataset of pedestrian image samples contains 21610 photos. Therefore, the quantity of training data utilized is 12966 photos, whereas the amount of test data utilized is 8644 images. The computer that is connected to the camera will process the images collected. It tried to collect all information or event that occurred in front of the vehicle. The illustration of the placement of the camera and minicomputer is shown in Figure 2.
In order to build the pedestrian detection system, there will be several requirements, namely the hardware and the development environment. Table 1 shows the requirement system used to build the pedestrian detection system. As shown in Table 1, the Raspberry Pi4 works as the main computer placed in the vehicle. It runs with Raspbian Buster OS and receives images from a monocular camera. Meanwhile, at first, the pedestrian detection software is developed on a different computer, as specified in the table. Whenever it's ready to run, it will be customized and moved to the Raspberry Pi4 to detect any pedestrians.

Pedestrian Detection System
The pedestrian detection system is built based on the object detection method. One of the top methods for detecting objects is You Only Look Once (YOLO). In its development, the various version of YOLO is proposed by several researchers. Redmon et al. proposed the first version of YOLO in 2015 and It works based on the Convolutional Neural Network [11] to define objects. Figure 3 shows the architecture of YOLO which was developed by Redmon. According to Deshpande et al. [23], YOLO is the fastest object detection method compared with Fast R-CNN and Faster R-CNN. Nowadays, YOLO is the most common method applied in intelligent transportation systems [24]- [26] or specifically in the development of autonomous vehicles [27].
Redmon et al., develop their methods to the next version which has an improvement in the accuracy and the speed of the algorithm [28]. In YOLOv3, the capability of detecting objects is improved [29]. It works for detecting common objects and it applies in every general case. YOLO's latest version is not developed directly by Redmon, but it is developed by other researchers in specified cases such as the intelligent transportation system [30]- [32].
To detect any pedestrians, the proposed system needs an image or video of the road's situation. It needs another preprocessing step when the system received videos as input. The system will read and store all frames in the video input. Whenever the frames are stored, the objects in the computer will be detected immediately.  Figure 4 shows the illustration of the road's situation based on a simple dashcam (using a smartphone) proposed by Nasution, et al. [33]. It detects other vehicles which appeared on the roads. The pedestrian detection system that we proposed almost has a similar concept, but we focused on detecting pedestrians using a minicomputer and monocular camera.

The Scenario of Experiment
We simulate the pedestrian detection system using a minicomputer which is already attached by a monocular camera. In the simulation, we only tried to detect pedestrians who standing in front of the camera. As shown in Figure 5, the pedestrians are detected in the proposed system. In order to get a better detection system, there will be several model training with different numbers of epochs. According to Afaq and Rao [34], the definition of an epoch in machine learning is a process for onetime model training using the whole dataset. It tried to validate the training model by finding the minimum error among the various number of epochs. The loss function tried to find in this test is bounding box loss (Mean Squared Error) and object loss. There is no classification loss since the class that tried to detect is Surya Michrandi Nasution, Fussy Mentari Dirgantara Jurnal RESTI (Rekayasa Sistem dan Teknologi Informasi) Vol. Bounding box loss calculates by measuring differences between the predicted bounding box and the actual object's bounding box. It represents the capability of the method to find the center of an object and its bounding box to cover an object. Meanwhile, the object loss tends to show the confidence level of every detected object.

Results and Discussions
As mentioned in the previous section, in the beginning, we tried to find the minimum loss by training the model with various epochs. The loss function calculated in the experiment covers the bounding box and object loss. Later, we tried to detect the pedestrian by simulating our system outside the vehicle.

Training Model
The model is trained by using various epochs, namely 200, 500, 1000, and 2000 epochs. As aforementioned before, it is conducted in order to find the minimum loss with the most optimal number of epochs. The loss functions measured in the training model are bounding box and object loss. The classification loss is not measured in this paper since the category of object is limited to pedestrians (person).
Training Model with 200 Epochs: The 200 epochs training model takes time around 30 minutes with the average confidence level range of the detected object around 79%. In several epochs, there are some values of object loss that are less than 15%, which means the confidence level reached 85% for understanding the objects. The result of training for measuring the object loss with 200 epochs is shown in Figure 6. At first, the loss reached 11% and it decreased to 9% at the 11th epoch. In the end, the loss reduction is less significant, and it reached 8.75%. Training Model with 500 Epochs: The result of loss function testing in the training model is slightly better compared to the previous training. The object loss in detecting objects is around 21.69%, it has the same meaning as the confidence level in detecting objects is 78.31%. In this testing, the training model takes 75 minutes to finish its epochs. Figure 8 shows the training result of object loss with 500 epochs. The testing result of bounding box loss in the model training with 500 epochs is better than the previous testing. Almost similar to the training model using 200 epochs, its loss reduced significantly at the beginning of epochs. In the end, the bounding box loss reached less than 0.08, as shown in Figure 9. epochs tends to be greater than the previous model training. As shown in Figure 10, the average object loss is reaching 20.2%, which means the object is detected with a level of confidence is almost 80%.   As well as the object loss in this model training, the bounding box loss reached its lowest point. It reached 4% for its loss, and that means the system has 96% accuracy in the object's (pedestrian) detection area. Training with this number of epochs takes 180 minutes to build the model. Figure 13 shows the bounding box loss for model training with 2000 epochs.   The most optimized number epoch, however, is around 1000. It not only has the maximum loss reduction (1.49 points) between epochs and a moderate reduction in measuring the loss of the bounding box (1.5 points) but also needs moderate training time to build a pedestrian detection model. As seen in Table 2, twice the number of epochs does not mean the loss is also reduced twice.

Discussions Pedestrian Detection Result
The testing of the pedestrian detection system is done by using Raspberry Pi which is integrated with a monocular camera. The system tried to detect the pedestrian using several samples of the recorded image as a testing method. Figure 14 shows the result of the pedestrian detection system that has been built in this research. As seen in the figure, there is a pedestrian who stands in front of the view of the module (Raspberry Pi and Camera). Based on the testing result, it successfully detects a pedestrian at various distances. The pedestrian detection system also can recognize using live video streams, but it is only able to detect objects for 0.9 frames per second. The pedestrian system needs an improvement in the computing module to deliver realtime pedestrian detection.
According to the experiment that has been done, the pedestrian detection system using object detection (YOLOv5) works well for images. In order to detect objects using the camera's stream, the computer that is used must be improved due to speeding up the frame rate (near real-time detection). Whenever it is already able to detect objects in real-time, it can be relied on as a pedestrian detection system in autonomous vehicles.

Conclusion
Based on the testing results in this paper, we conclude the number of epochs in training a model can reduce the loss. But, expanding the number of epochs twice, does not mean the loss is reduced twice.
The maximum number of epochs that we tried is 2,000 epochs, and while it has an object loss of 20.11% and a bounding box loss of 4.48%, it takes at least three hours to develop a model. In the meanwhile, the optimal number of epochs is 1000 epochs; this has the highest possible object loss reduction (1.49 points) and only a moderate reduction in bounding box loss (1.5 points).
In comparison to the many other epochs, the training time for this one looks to be quite short.
On the other hand, the implementation of the pedestrian detection system in Raspberry Pi as a main processing unit is only reliable for recorded images since it is only capable to deliver 0.9 frames/seconds. Due to the necessity of real-time detection which comes from the camera's stream, the main processing unit must be upgraded to better hardware such as jetson or any other minicomputer that is specified for object detection.