Iranian (Iranica) Journal of Energy and Environment

Person re-identification (re-id) is one of the most critical and challenging topics in image processing and artificial intelligence. In general, person re-identification means that a person seen in the field of view of one camera can be found and tracked by other non-overlapping cameras. Low-resolution frames, heavy occlusion in crowded scenes, and few samples for training supervised models make re-id challenging. This paper proposes a new model for person re-identification that copes with noisy frames and extracts robust features from each frame. To this end, a noise-aware system is implemented by training an auto-encoder on artificially damaged frames to overcome noise and occlusion, and a person re-identification model is built on deep convolutional neural networks. Experimental results on two real-world datasets, CUHK01 and CUHK03, demonstrate that the proposed method performs better than state-of-the-art methods. doi: 10.5829/ijee.2023.14.04.01


INTRODUCTION
Person re-identification in video sequences is one of the most critical and challenging issues in video and image processing. Due to the growing use of surveillance and control cameras in public places, recognizing people is more important than ever. Person re-identification (re-id), in general, means that a person seen in the field of view of one camera can be found and tracked in other cameras that do not overlap with the primary camera. In crowded public places, the number of people in a camera's field of view is large, and the target person (query) must be identified among all of them. In other words, person re-id means comparing the person seen in the search camera with a gallery of candidates seen in non-overlapping cameras [1]. Figure 1 illustrates the concept of person re-id systems.
Although image processing techniques and applications have made significant progress in recent years, several challenges in person re-id continue to motivate research. For example, in a surveillance system, a person who leaves the frame of one camera will be seen in at least one other camera, which may have a different distance and angle than the first camera [2,3]. Also, the distances between the person and the cameras are large, so biometric features of individuals cannot be extracted well. Meanwhile, the frames recorded by these cameras have low quality. Therefore, the importance of person re-id systems is felt more than ever.
*Corresponding Author Email: hfarsi@birjand.ac.ir (H. Farsi)
Person re-id has many applications in industry. The most important are security and surveillance in cities, airports, stores, and other places monitored by surveillance cameras. Health monitoring systems, long-term tracking systems, and crime prevention are other critical applications of person re-identification. Every re-id system consists of two main parts: feature extraction and similarity measurement. Researchers have developed different methods for feature extraction and metric learning, which are reviewed in the next section.
The core of the work proposed in this paper is to extract compact and efficient features based on a deep auto-encoder and to reduce re-id noise, such as occlusion, using another type of auto-encoder. Combining these two parts defines the proposed pipeline. The main reason that a CNN can extract robust features is that, unlike traditional machine learning methods, it learns feature extraction and classification simultaneously in an end-to-end optimization process. Optimum weights and coefficients are therefore obtained by minimizing the loss function. In our method, by combining a CNN with an auto-encoder and creating random arbitrary noise and distortions, the model is forced to extract features even if the frame is noisy and damaged. The main advantage of the proposed model is that it can repair distorted and noisy frames to overcome noise and occlusion, which are the main challenges of person re-id.
The rest of the paper is organized as follows. Related works and some state-of-the-art methods are reviewed in section 2. Section 3 describes the proposed method. Numerical results, databases, and metrics are presented in section 4, and section 5 concludes the paper.
Research on person re-id falls into two categories: extracting appropriate features and creating a similarity measure. Recently, some researchers have also addressed feature extraction and similarity learning simultaneously. Both feature extraction and similarity metric learning have been carried out with supervised and unsupervised learning, and extensive research has been conducted in each of these categories. Munaro et al. [4] proposed a method for person re-id that finds the key points of the body skeleton in order to extract features from these points. RGBD sensors were used to extract color and depth features. Texture features were extracted using the SIFT and SURF methods, and the model combined color and texture features for person re-id. Another method for person re-id was presented by Matsukawa et al. [5], who defined a descriptor based on the hierarchical distribution of pixel features. This descriptor improves on the hierarchical covariance descriptor, which has been widely used in image classification applications. Since two orders of Gaussian distribution are used, the descriptor is called Gaussian of Gaussian (GOG), and the color and texture features of the images are extracted simultaneously. Researchers in another work introduced a feature extraction method called Local Maximal Occurrence (LOMO) and a new metric learning method named Cross-view Quadratic Discriminant Analysis (XQDA). In the feature extraction stage, the method tries to extract features that are robust to changes in brightness and in the direction of the person. To create stability against direction changes, the occurrence of horizontal local features is checked, and the probability of occurrence of these local features is maximized in the selected feature vector. Therefore, the maximum horizontal local information in the image is retained in the feature vector, and this feature changes less when the person's direction changes [6]. Other research has addressed handcrafted features based on color, such as Color Based Ranking Aggregation (CBRA) [7] and color-based reranking [8].
Figure 1. Concept of person re-identification
Among deep learning methods for re-id, one model extracts features and trains similarity measures simultaneously using convolutional neural networks. According to its authors, it is the first person re-identification method to use deep learning [9]. Convolutional neural networks are used to learn appropriate features from images automatically, and the similarity learning stage is trained and tuned by the same networks. In another recently proposed method, researchers presented a model that uses the characteristics of a person's position in the ranking list of the gallery [10]. Although earlier methods also used these features, they are based on data post-processing, and in the training phase the network does not have access to the features of the ranking list. The authors state that theirs is the first method to show that the appearance feature and the ranking context information can be optimized simultaneously during training to extract more distinctive features and achieve better recognition performance. Finally, by computing similarity from different layers of deep convolutional neural networks, Sezavar et al. [11] proposed a model that finds the filter in each layer that best describes the similarities.

MATERIALS AND METHODS
The proposed method consists of two main parts: quality improvement based on an auto-encoder and feature extraction for person re-id based on deep CNNs. Before going into the details, it is necessary to review auto-encoders and CNNs.
An auto-encoder is a multilayer neural network that aims to reconstruct its input through encoder and decoder layers [12]. In the encoder, the input image (or signal) is fed to the network, and a small vector is produced by extracting features and down-sampling. Then, the decoder tries to reconstruct the input from the encoded feature vector during a supervised learning process. Auto-encoders have been used in many tasks, such as image processing for feature extraction and denoising [13][14][15].
A Convolutional Neural Network (CNN) is a powerful type of deep network in which, instead of using neurons that calculate a linear combination of the inputs, the result of convolving the inputs with the filters of each layer is passed to the subsequent layers. When the inputs are two-dimensional (such as an image), two-dimensional convolution is used, and the network is made of filters whose coefficients can be trained. CNNs are among the most important deep learning methods in which multiple layers are trained effectively, and they are commonly used in machine vision applications. Different parts of the image are given as input to hierarchical layers, and in each layer, significant features are extracted from the image by applying digital filters [16][17][18].
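To make the two-dimensional convolution described above concrete, the following is a minimal NumPy sketch of a single filter applied to a single-channel image. It is an illustration only and not the implementation used in this paper: the function name `conv2d_valid`, the toy image, and the kernel values are assumptions, and padding, stride, bias, and activation are omitted.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2-D convolution of one channel with one filter.

    Illustrative only: no padding, stride 1, no bias, no activation.
    """
    image = np.asarray(image, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    kh, kw = kernel.shape
    flipped = kernel[::-1, ::-1]  # true convolution flips the kernel
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            # Weighted sum of the window under the (flipped) kernel
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * flipped)
    return out

# Toy 4x4 "image" and a 2x2 diagonal-difference kernel (hypothetical values)
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])
feature_map = conv2d_valid(image, kernel)
```

Note that most deep learning frameworks compute cross-correlation (no kernel flip) in their convolution layers; since the kernel coefficients are learned, the distinction does not affect what the network can represent.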
The proposed method has three main tasks: frame repair for denoising and overcoming occlusion, feature extraction, and person re-identification. Details of each part are explained in the following subsections.
Detected frames in person re-id usually have low resolution and are noisy, and occlusion occurs in crowded scenes. This part is implemented to overcome occlusion and repair the frame by restoring occluded pixels. To this end, an auto-encoder is implemented and trained on randomly noised and damaged frames. Assume that the input frame of the person is $I$, which is an RGB frame. We define the Artificial Damaged Frame (ADF) as follows:
$$I' = \Omega(I + N)$$
where $I'$ is the ADF, $N$ is noise drawn from the noise distribution (which is supposed to be of different kinds with different variances), and $\Omega$ is the operator that applies random patch pixel damage. After the ADF is created, it is fed to the auto-encoder as the input frame, while the desired output is the original frame, $I$. During the training process, filter weights are updated until the mean squared error between the desired output and the auto-encoder's reconstruction of the ADF is minimized.
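The construction of an artificially damaged frame can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure of this paper: the noise is taken to be additive Gaussian, the patch damage is modeled as a single square patch overwritten with zeros, and the names `make_adf` and `mse` as well as the default parameter values are hypothetical.

```python
import numpy as np

def make_adf(frame, noise_std=0.05, patch_size=16, rng=None):
    """Return a damaged copy of an RGB frame with values in [0, 1].

    Assumptions (not taken from the paper): additive Gaussian noise and
    one randomly placed square patch set to zero to imitate occlusion.
    """
    rng = np.random.default_rng() if rng is None else rng
    frame = np.asarray(frame, dtype=float)
    # Step 1: add random noise to every pixel
    noisy = frame + rng.normal(0.0, noise_std, size=frame.shape)
    damaged = np.clip(noisy, 0.0, 1.0)
    # Step 2: damage a random patch of pixels
    h, w = frame.shape[:2]
    ph, pw = min(patch_size, h), min(patch_size, w)
    top = rng.integers(0, h - ph + 1)
    left = rng.integers(0, w - pw + 1)
    damaged[top:top + ph, left:left + pw, :] = 0.0
    return damaged

def mse(a, b):
    """Mean squared error between two frames of equal shape."""
    return float(np.mean((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))

rng = np.random.default_rng(0)
clean = rng.random((64, 32, 3))   # stand-in for a detected person frame
adf = make_adf(clean, rng=rng)    # network input; training target is `clean`
```

During training, the auto-encoder would receive `adf` as input and `clean` as the target, and `mse` would be evaluated between the network's reconstruction and `clean`.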
For the next step, a CNN is implemented for feature extraction. To this end, VGGNet [19] (demonstrated as an influential CNN in the literature) is used as the backbone, and some fully-connected layers are added to tune the network for our task. The network is trained as a classification task in which each person is considered a separate category. During the training process, the weights of the convolution layers are tuned until the classification error is minimized. The output of each convolution and fully-connected layer is calculated by Equations (3) and (4):
$$y = x * h \qquad (3)$$
$$y = \sum_{i} w_i x_i \qquad (4)$$
where $x$ refers to the input of each layer and $h$ indicates the filter's kernel. In the training phase, for each layer, the neuron weights ($w_i$), which are applied to the input values ($x_i$), are tuned to classify the input frame as a specific person. Once training converges, the network has learned to extract the features of each frame well. We take the output of the last fully-connected layer for each frame as its feature vector for person re-id. Then, the Euclidean distance between the feature vector of each gallery frame and that of the probe frame is calculated, and the distances are sorted to find the nearest person.
The structural diagram of the proposed method is shown in Figure 3. As shown in Figure 3, the auto-encoder trained on the ADF is applied before frames are fed to the CNN. This improves frame quality and repairs damaged and occluded pixels. After that, features extracted by the trained CNN are used to compare and match similar persons.
The main novelty of the proposed method consists of two parts. First, our model has an encoder-decoder in its backbone that makes it noise-aware and able to repair noisy frames, which is a critical challenge in person re-id. To the best of our knowledge, there have been no end-to-end models in the literature that do this for person re-id. In addition, our algorithm for creating arbitrarily distorted frames, based on what is needed for person re-id, helps make the model noise-aware during training. Second, the CNN is trained to classify frames (by considering each person as a different class), and its features are used to compare frames and determine whether they show the same person or different persons.
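The matching step, in which gallery frames are ordered by their Euclidean distance to the probe frame, can be sketched as below. This is a minimal illustration: the function name `rank_gallery` and the two-dimensional toy feature vectors are assumptions, whereas in practice the vectors would be the fully-connected features extracted by the trained CNN.

```python
import numpy as np

def rank_gallery(probe_feature, gallery_features):
    """Sort gallery entries by Euclidean distance to the probe (nearest first)."""
    probe = np.asarray(probe_feature, dtype=float)
    gallery = np.asarray(gallery_features, dtype=float)
    # One distance per gallery row
    distances = np.linalg.norm(gallery - probe, axis=1)
    order = np.argsort(distances, kind="stable")
    return order, distances[order]

# Toy feature vectors (hypothetical values)
gallery = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
probe = np.array([0.9, 0.8])
order, sorted_distances = rank_gallery(probe, gallery)
# order[0] is the index of the gallery frame closest to the probe
```

The first entry of `order` identifies the nearest gallery frame, which is the system's top candidate for the probe identity.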

RESULTS AND DISCUSSION
To evaluate the proposed method, two widely used datasets, CUHK01 [20] and CUHK03 [21], are used. CUHK01 comprises 971 people, each with four frames from two non-overlapping cameras. CUHK03 contains 1467 people; for each person, more than five frames are available, captured by five different non-overlapping cameras. For each dataset, half of the people are used for training, and the remaining people are used for testing, divided into gallery and probe sets. Details of the datasets and of each set are shown in Table 1.
For evaluation, rank metrics are used. After the similarity between the query frame and the gallery frames is computed, a ranked list is created by sorting the gallery frames in decreasing order of similarity. Rank-1 is the most demanding metric: it is the probability that the frame retrieved first is the correct match for the probe frame. If the first frame in the ranked list belongs to the same person as the query frame, Rank-1 is equal to 1 for this query. The overall Rank-1 for the database is reported by averaging Rank-1 over all query frames. Similarly, Rank-k is the probability that the correct match appears among the first k frames of the ranked list. These metrics have been widely used in the literature for evaluating re-id systems. The training states for the CNN and feature extraction are shown in the corresponding figure. The numerical results for the proposed method on CUHK01 can be found in Table 2, in which Rank-k for k = 1, 5, 10, and 20 is computed and compared with some state-of-the-art models. The same results are computed for CUHK03, and the comparison is shown in Table 3.
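The Rank-k metric described above can be sketched as follows. This is a minimal illustration under assumptions, not the evaluation code used for the reported tables: the function name `rank_k_accuracy`, the toy distance matrix, and the identity labels are hypothetical, and similarity is represented by Euclidean distance, so the gallery is sorted in increasing order of distance.

```python
import numpy as np

def rank_k_accuracy(distance_matrix, probe_ids, gallery_ids, ks=(1, 5, 10, 20)):
    """Fraction of probes whose correct identity appears in the top k.

    distance_matrix[i, j] is the distance between probe i and gallery j;
    smaller distance means higher similarity.
    """
    dist = np.asarray(distance_matrix, dtype=float)
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)
    # Ranked list per probe: gallery indices from nearest to farthest
    order = np.argsort(dist, axis=1, kind="stable")
    ranked_ids = gallery_ids[order]
    matches = ranked_ids == probe_ids[:, None]
    results = {}
    for k in ks:
        k_eff = min(k, matches.shape[1])
        # A probe counts as a hit if any of its first k candidates is correct
        results[k] = float(np.mean(matches[:, :k_eff].any(axis=1)))
    return results

# Toy example: 3 probes, 4 gallery frames (hypothetical values)
gallery_ids = [10, 11, 12, 13]
probe_ids = [10, 11, 12]
dist = [
    [0.1, 0.5, 0.9, 0.7],  # correct match retrieved first
    [0.2, 0.4, 0.3, 0.9],  # correct match retrieved third
    [0.8, 0.6, 0.5, 0.1],  # correct match retrieved second
]
scores = rank_k_accuracy(dist, probe_ids, gallery_ids, ks=(1, 2, 3))
```

In this toy case one of the three probes is matched at the first position, two within the first two positions, and all three within the first three, so Rank-1, Rank-2, and Rank-3 are 1/3, 2/3, and 1 respectively.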
Table 2 shows that the proposed method achieved a Rank-1 of 92.1% on CUHK01 and performed better than the other methods. At the other ranks, the proposed method also outperforms the other methods, which is attributed to using an auto-encoder to improve frames and a deep CNN to extract features. Table 3 shows that Rank-1 equals 94.4% for the proposed method on CUHK03, which is 1.9% higher than the best of the state-of-the-art methods. At the other ranks, the proposed model also performs well compared to the other models on CUHK03.
For further evaluation, Rank-1 accuracies under different artificial noises are shown for both databases in Figure 5, with and without the ADF-trained auto-encoder. The results show that performance improves when the model is forced to learn from noisy and randomly damaged inputs. This improvement in accuracy arises because the proposed model is noise-aware and can repair frames and extract robust features from each frame. Visual results are shown in Figure 6, where query frames and some ranked frames are illustrated.

Figure 5. Comparison of Rank-1 for different noises with and without the ADF-trained auto-encoder
Figure 6. Visual outputs of the proposed model: ranked frames for three different query frames. A green marker in each row marks the correctly re-identified frame.

CONCLUSION
Person re-identification is still a challenging and vital task in image processing. This paper proposed a new method for person re-identification that combines auto-encoders and deep CNNs. The auto-encoder was trained on artificially damaged frames to improve frame quality and overcome noise and occlusion. A model based on convolutional neural networks was designed to extract robust features for person re-identification. Experimental results on two widely used datasets, CUHK01 and CUHK03, demonstrated that the proposed method performs better than state-of-the-art methods. In future work, the model could be improved by using an unsupervised method and combining it with generative models to compensate for insufficient training data.