A review on action recognition for accident detection in smart city transportation systems

Accident detection and public traffic safety are crucial aspects of a safe and better community. Monitoring traffic flow in smart cities using different surveillance cameras plays a crucial role in recognizing accidents and alerting first responders. In computer vision tasks, action recognition (AR) has contributed to high-precision video surveillance, medical imaging, and digital signal processing applications. This paper presents an intensive review of action recognition for accident detection and autonomous transportation systems in smart cities. We focus on AR systems that use diverse sources of traffic video, such as static surveillance cameras at traffic intersections, highway monitoring cameras, drone cameras, and dashcams. Through this review, we identify the primary techniques, taxonomies, and algorithms used in AR for autonomous transportation and accident detection. We also examine the datasets utilized in AR tasks, identifying their primary sources and features. Finally, we outline potential research directions for developing and integrating accident detection systems into autonomous cars and public traffic safety systems that alert emergency personnel and law enforcement in the event of a road traffic accident, minimizing human error in accident reporting and providing a prompt response to victims.


Introduction
In the field of computer vision, action recognition is a domain that has gained much attention in the research community over the past few years, particularly since the advancement of convolutional neural networks (CNNs) as a tool for solving complex computer vision problems Chattopadhyay, Sarkar, Howlader and Balasubramanian (2017). Action recognition has been used in several real-life applications such as safety and security Al-Faris et al.; however, pose estimation has not shown significance in vehicle accident detection due to the specificity of the problem and the differences between human and vehicle actions and physical construction.
Transfer learning is a commonly used technique in which features of deep neural networks trained on a specific domain with a robust dataset are reused in a new domain of application with reduced computational resources. Previous research has leveraged the transfer learning approach to improve action localization in video streams. Iqbal, Richard and Gall (2019) experimented with action localization on pre-selected frames by leveraging transfer learning from an existing model. The overarching goal was to simplify the complex architectures, expensive computational cost, and inefficient inference of existing methodologies.
The current research trend in action recognition is focused on classical deep neural networks with two-stream architectures (RGB and optical flow) Sevilla-Lara, Liao, Güney, Jampani, Geiger and Black (2018). Transferring features from a model pre-trained on small action classes significantly improves AR models' performance, while other areas of focus have been the temporal localization and segmentation of actions in untrimmed video. Hidden Markov models have been used to capture long-range dependencies in frame-wise action recognition Kuehne, Arslan and Serre (2014). In contrast, spatio-temporal convolution and a semi-Markov model were used to capture multiple action transitions in untrimmed video Lea, Reiter, Vidal and Hager (2016). Iqbal et al. (2019) utilized the transfer learning technique with the I3D network on temporally untrimmed video to localize all action class instances in a video stream. Their experiments using a vanilla deep temporal convolutional network on features extracted from the I3D yielded state-of-the-art results with a lightweight model and a simple convolutional network, extracting features from the existing model without multiple layers and gated convolutions Iqbal et al. (2019).
This paper provides a comprehensive review of action recognition focusing on accident detection and autonomous transportation in smart city transportation systems. This review includes the state-of-the-art techniques that researchers have proposed, taxonomies of AR tasks, AR application domains, and transfer learning algorithms from complex architectures. In addition, we provide potential future research questions in new application domains leveraging existing model architectures. The main contributions of this paper can be summarized as follows: • Providing a comprehensive comparison of the different action recognition techniques and taxonomies used in smart city transportation systems and synthesizing the state-of-the-art research findings on autonomous transportation within the past ten years.
• Interpreting and analyzing the datasets, algorithms, and metrics currently used by relevant research in the traffic control and accident detection domain.
• Exploring literature gaps in existing methodology that can be addressed by current technological advancements.
• Identifying potential future research questions that leverage existing methodology with reduced model complexity and computational resources.
The structure of this paper is organized as follows: Section (2) presents background and an existing literature review on the domains mentioned above. The literature search, methodology, and inclusion and exclusion criteria are discussed in Section (3). The results of our research and a detailed analysis are discussed in Section (4). Finally, Sections (5 and 6) elaborate on the limitations and conclusion of the study.

Action Recognition in Smart City
A futuristic direction in computer vision is the application of intelligent systems to autonomously perform human activities that are repetitive in nature and capital intensive.
In a smart city surveillance system, violence can easily be spotted to alert appropriate enforcement agencies through automated analysis of the video content of surveillance cameras Fortun, Bouthemy and Kervrann (2015). The community-based monitoring paradigm focuses on tracking users, monitoring emergencies, and responding to them. The SenSquare system was implemented using crowd-sensed heterogeneous data sources for gathering data and developing classification algorithms to detect potentially hazardous behavior in the environment Elsayed, Zaghloul, Azumah and Li (2021); Montori, Bedogni and Bononi (2018); Azumah, Elsayed, Adewopo, Zaghloul and Li (2021). Law enforcement agencies continuously face an uphill battle in controlling increasing crime rates and gun violence; the deployment of intelligent surveillance cameras can assist in the automatic detection of firearms and alert security agencies in real time when a firearm is detected. Romero and Salamea (2019) developed an object detection model that can detect firearms and crime scenes in dangerous situations based on the YOLO object detection framework using surveillance cameras.
Human behavior and specific human actions can be analyzed and classified using imaging and AI technologies. The application of AR models to understanding human behavior offers possibilities for smart city safety, especially in tracking drivers' behavior. The National Highway Traffic Safety Administration (NHTSA) reported an increase in the number of fatalities caused by distracted drivers between 2019 and 2020; distracted-driving fatalities accounted for more than 8.5% of total fatalities in 2017 Stewart (2022). Celaya-Padilla, Galván-Tejada, Lozano-Aguilar, Zanella-Calzada, Luna-García, Galván-Tejada, Gamboa-Rosales, Velez Rodriguez and Gamboa-Rosales (2019) proposed a deep convolutional neural network for detecting texting-and-driving behavior using a car-mounted wide-angle camera with a pre-trained Inception v3 model. Emerging technologies such as AR models can be integrated with CCTV cameras to reduce fire accidents in smart cities. As described in Avazov, Mukhiddinov, Makhmudov and Cho (2021) on fire detection in smart city environments using the YOLOv4 algorithm, a robust model based on augmented data (different weather environments) and a reduced network structure demonstrated excellent performance and is highly effective for detecting fire disasters. In this paper, we focus on accident detection from data obtained from the different types of surveillance cameras used to monitor a smart city's transportation system.

Action Recognition in Autonomous Transportation and Accident Detection
Robotics and auto-navigation have also benefited from the use of AR systems for automatic guidance, specifically in obstacle detection, accident prevention, and lane departure assistance Fortun et al. (2015). Accident detection in autonomous transportation systems is essential for tracking vehicles and identifying anomalies in traffic patterns. Cai, Wang, Chen and Jiang (2015) discussed the detection of abnormal traffic flow using clustering techniques on main flow direction vectors and a k-means clustering algorithm to identify outliers that deviate from the normal trajectory pattern or motion flow on highways. Previous research explored intelligent visual descriptions of scenes with connected image points using spatio-temporal dynamics in a Hidden Markov Model Morris and Trivedi (2011). More recent work approaches this challenge using machine learning and deep learning algorithms Huang, He, Rangarajan and Ranka (2019); Saunier and Sayed (2007); Robles-Serrano, Sanchez-Torres and Branch-Bedoya (2021). Robles-Serrano et al. (2021) combined convolutional layers and long short-term memory (LSTM) architectures to capture spatio-temporal features from sequences of images in video streams, a combination proven to achieve better performance Lim, Jang and Lee (2016); Elsayed, Maida and Bayoumi (2019) due to the capability of the convolutional layers to extract features from each image in the video stream Kattenborn, Leitloff, Schiefer and Hinz (2021) and the capability of the LSTM to learn the temporal information between images in the video sequence Greff, Srivastava, Koutník, Steunebrink and Schmidhuber (2016); Elsayed, Maida and Bayoumi (2020).
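The clustering idea described by Cai et al. (2015), flagging trajectories whose main flow direction deviates from the dominant clusters, can be sketched roughly as follows. The flow vectors, cluster count, deterministic initialization, and 3-sigma threshold are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans_distances(X, k, iters=20):
    """Naive k-means (deterministic first-k init for reproducibility);
    returns each point's distance to its nearest centroid."""
    centroids = X[:k].copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).min(axis=1)

# Synthetic main-flow direction vectors (dx, dy) per tracked vehicle:
# two dominant flows (eastbound, westbound) plus one erratic trajectory.
flows = np.vstack([
    rng.normal([ 1.0, 0.0], 0.05, size=(50, 2)),   # eastbound traffic
    rng.normal([-1.0, 0.0], 0.05, size=(50, 2)),   # westbound traffic
    [[0.1, 0.9]],                                  # anomalous cross-traffic motion
])

dist = kmeans_distances(flows, k=2)
threshold = dist.mean() + 3 * dist.std()           # simple 3-sigma outlier rule
anomalies = np.where(dist > threshold)[0]
print("anomalous trajectory indices:", anomalies)
```

Points far from every dominant-flow centroid are flagged as deviating from the normal motion pattern.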
The accident detection task includes detecting spatio-temporal dependencies across multiple frames of video surveillance; hence, correctly classifying video input as an accident is a more challenging task in developing an accident detection model. Carreira and Zisserman (2017) introduced a new two-stream inflated 3D ConvNet (I3D) based on 2D ConvNet inflation. The authors sought to unravel the correlation between training on a more extensive network and performance gains by inflating the kernels of pre-trained image classification architectures into two-stream inflated 3D ConvNets. The results of their proposed framework suggest that there is always a boost in performance from pre-training a model; however, the extent of the boost varies significantly with the type of architecture.
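The 2D-to-3D kernel inflation behind I3D can be illustrated numerically: each 2D filter is repeated along a new temporal axis and rescaled so that a temporally constant ("boring") clip produces the same response as the original image filter. This is a minimal sketch of the inflation rule only, not the full I3D architecture.

```python
import numpy as np

def inflate_2d_kernel(w2d, t):
    """Inflate a 2D conv kernel (C_out, C_in, kH, kW) to 3D (C_out, C_in, t, kH, kW).

    Each 2D filter is repeated t times along the new temporal axis and divided
    by t, so filtering a temporally constant clip reproduces the 2D response.
    """
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t

rng = np.random.default_rng(0)
w2d = rng.normal(size=(4, 3, 3, 3))              # (C_out, C_in, kH, kW)
w3d = inflate_2d_kernel(w2d, t=5)

# Check the "boring video" property at a single spatial location:
patch = rng.normal(size=(3, 3, 3))               # one image patch (C_in, kH, kW)
clip = np.broadcast_to(patch, (5, 3, 3, 3))      # same frame repeated 5 times
resp_2d = np.einsum('ochw,chw->o', w2d, patch)
resp_3d = np.einsum('octhw,tchw->o', w3d, clip)
print("inflation preserves image response:", np.allclose(resp_2d, resp_3d))
```

This equivalence is what lets the 3D network inherit ImageNet-pretrained 2D weights before video fine-tuning.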

Accident Detection Methods
In action recognition tasks, many researchers propose their own datasets and evaluation criteria, making it challenging to identify the most appropriate datasets and results. Performance metrics also vary across research works; developing a standardized evaluation technique would lead to more robust research on the application of action recognition tasks. Current methods allow some data samples to be repeated/duplicated across train/test splits, which directly biases the measured performance when evaluating new research work Jordao, Nazare, Sena and Robson Schwartz. Stisen, Blunck, Bhattacharya, Prentow, Kjaergaard, Dey, Sonne and Jensen (2015) examined the effects of heterogeneous devices on the final performance of classifiers on different activities using handcrafted features, employing popular classifiers such as nearest neighbor, support vector machines, and random forest. They noticed that sampling instabilities occurred across various devices.
The video source of a dataset also plays a significant role in designing accident detection models. Videos captured by a dashcam hold different data trajectories and street views than those from highway or traffic light surveillance cameras. Dashcams capture traffic video from a horizontal view; in such videos, both the camera and the surrounding objects are moving. This increases the problem complexity, especially when determining which objects are approaching the dashcam and which objects are being approached by the car carrying the dashcam itself. Traffic light and highway cameras record the scene from a vertical (overhead) view, with the camera in a fixed position, so moving objects are recorded from a fixed viewpoint. Therefore, addressing each type of video content plays a significant role in calculating trajectories, object acceleration, and moving directions.
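For fixed-position cameras, trajectories and accelerations can be derived from tracked object centroids by simple finite differences. The sketch below assumes positions already converted to metres and a 10 fps frame rate; both are illustrative assumptions.

```python
import numpy as np

def motion_features(centroids, fps=10.0):
    """Per-frame velocity and acceleration from tracked object centroids.

    centroids: (T, 2) array of (x, y) positions in metres, one row per frame.
    Returns velocity (T-1, 2) in m/s, acceleration (T-2, 2) in m/s^2, and speed.
    """
    dt = 1.0 / fps
    velocity = np.diff(centroids, axis=0) / dt
    acceleration = np.diff(velocity, axis=0) / dt
    speed = np.linalg.norm(velocity, axis=1)
    return velocity, acceleration, speed

# A car moving at a constant 15 m/s that abruptly slows to 5 m/s near the end,
# the kind of sudden deceleration an accident detector would look for.
t = np.arange(10) / 10.0
x = np.where(t < 0.7, 15 * t, 15 * 0.7 + 5 * (t - 0.7))
track = np.stack([x, np.zeros_like(x)], axis=1)

vel, acc, speed = motion_features(track, fps=10.0)
print("max |longitudinal acceleration| (m/s^2):", np.abs(acc[:, 0]).max())
```

A spike in the acceleration signal (here caused by the abrupt speed change) is a typical low-level cue for anomaly and collision detection.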

Machine Learning and Statistical Models
Most machine learning algorithms focus on vehicle trajectory, motion, acceleration, and car position to detect car accidents. Singh and Mohan (2018) combined object detection and anomaly detection algorithms to identify accidents; they proposed a framework that extracts deep representations using autoencoders and an unsupervised model (SVM) to detect the possibility of an accident. The vehicles' trajectories at intersection points were used to increase the proposed architecture's precision and reliability. Joshua and Garber (1990) proposed mathematical relationships obtained through multiple linear and Poisson regression analyses to identify factors contributing to major truck accidents on the highway, using an accident dataset from Virginia highway traffic in combination with other geometric variables to model the percentage of trucks involved in road accidents. Arvin, Kamrani and Khattak (2019) leveraged the availability of extensive data from interconnected devices to correlate erratic driving volatility with historical crash datasets from intersections in Michigan. Statistical variables such as fixed-parameter, random-parameter, and geographically weighted Poisson regressions, as well as longitudinal and lateral acceleration, were used to identify road accident crash hotspots.
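To make the Poisson-regression approach concrete, the sketch below fits a Poisson model with a log link by iteratively reweighted least squares on synthetic accident counts. The covariate and coefficients are invented for illustration and are not drawn from the Virginia or Michigan datasets.

```python
import numpy as np

def fit_poisson(X, y, iters=25):
    """Poisson regression (log link) via iteratively reweighted least squares:
    mu = exp(X @ beta), Newton step (X^T W X)^-1 X^T (y - mu), W = diag(mu)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ beta)
        beta += np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (y - mu))
    return beta

rng = np.random.default_rng(0)
n = 500
traffic_volume = rng.uniform(0, 2, n)              # standardized exposure covariate
X = np.column_stack([np.ones(n), traffic_volume])  # intercept + one covariate
true_beta = np.array([0.2, 0.8])
y = rng.poisson(np.exp(X @ true_beta))             # synthetic accident counts

beta_hat = fit_poisson(X, y)
print("estimated coefficients:", beta_hat)
```

A positive fitted slope would indicate that the expected accident count grows multiplicatively with the covariate, which is the kind of relationship these crash-frequency models quantify.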

Deep Machine Learning Models
Most deep learning algorithms focus on vehicle trajectory, motion/acceleration, and car position for detecting car accidents. Chan, Chen, Xiang and Sun (2016) proposed a Dynamic-Spatial-Attention (DSA) Recurrent Neural Network (RNN) for anticipating accidents in dashcam videos based on vehicle trajectory and motion. The algorithm contains an object detector to dynamically gather subtle cues and models the temporal dependencies of all cues to predict accidents two seconds before they occur, with a recall of 80% and a low precision of 56.14%. The model's generalizability in detecting accidents under varying weather conditions was not measured, given the limited number of videos with rain, snow, and day/night variation, among other conditions. Robles-Serrano et al. (2021) explored deep neural networks for accident detection using a three-stage approach: first segmenting the visual characteristics of objects in the dataset, then building on the Inception V4 architecture to extract the temporal components of the dataset used in detecting accidents, followed by temporal video segmentation. A structural similarity index was applied to the dataset at preprocessing time to accurately select image frames representing an accident or no accident, as part of the temporal video segmentation, and to eliminate frames that do not contain the event occurrence or that repeat the selected event. During preprocessing, pixel-to-pixel comparisons were made to select a certain number of consecutive frames containing features to train the model, based on a specified threshold. Finally, the framework was designed to detect accidents automatically using Convolutional LSTM (ConvLSTM) layers to capture the spatial and temporal dependencies in the input data Shi, Chen, Wang, Yeung, Wong and Woo (2015); Elsayed, Maida and Bayoumi (2018). This type of neural network has been proven to perform better than LSTM and CNN architectures when dealing with datasets that have both spatial and temporal structure. One potential limitation is model bias based on vehicle types and other environmental conditions, such as vehicle variety and the absence of pedestrians and cyclists.
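The SSIM-based frame selection step can be sketched as follows. This uses a simplified single-window SSIM over whole frames and an assumed threshold, rather than the exact windowed formulation and parameters used by Robles-Serrano et al.

```python
import numpy as np

def global_ssim(a, b, L=255.0):
    """Simplified single-window SSIM computed over whole frames
    (no sliding window), using the standard SSIM constants."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a**2 + mu_b**2 + c1) * (va + vb + c2))

def select_frames(frames, threshold=0.95):
    """Keep a frame only if it differs enough from the last kept frame,
    dropping near-duplicate frames before training."""
    kept = [0]
    for i in range(1, len(frames)):
        if global_ssim(frames[kept[-1]], frames[i]) < threshold:
            kept.append(i)
    return kept

rng = np.random.default_rng(0)
base = rng.uniform(0, 255, size=(32, 32))
frames = [base,
          base + rng.normal(0, 1, base.shape),   # near-duplicate frame
          rng.uniform(0, 255, size=(32, 32))]    # new scene content
print("kept frame indices:", select_frames(frames))
```

Near-duplicate consecutive frames score close to 1 and are discarded, so the model trains only on frames carrying new event information.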

Social Network and Geosocial Media Data
The enormous amount of information constantly shared across various social media platforms contains artifacts that can be analyzed to generate meaningful insights about traffic events Rashidi, Abbasi, Maghrebi, Hasan and Waller (2017). However, manually monitoring and analyzing this exploding information is practically impossible given its high volume and unstructured format Adewopo, Gonen, Elsayed, Ozer and Elsayed (2022). Monitoring traffic-related information on social media has proven beneficial in detecting traffic events. Xu, Li and Wen (2018) provided a synthesis of research that explored the usage of geosocial media data for detecting traffic events. Events such as road accidents, road closures, and traffic conditions are typically shared among networks of people through social media platforms. Such events can be tracked with the aid of GPS to get first responders to the event location, and the posts often contain information about what triggered the events. Xu, Li, Wen and Huang (2019) utilized Twitter data, mining and filtering noisy data through association rules among words related to traffic events. The proposed framework achieved 81% accuracy in classifying data into non-traffic events, traffic accidents, roadwork, and severe weather conditions. Similarly, Salas, Georgakis, Nwagboso, Ammari and Petalas (2017) developed a framework leveraging social media data to crawl, process, and filter posts implying traffic incidents, enabling real-time detection of traffic events with a text classification algorithm Gu, Qian and Chen (2016).
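A minimal rule-based filter in the spirit of these pipelines might look like the following. The keyword lists and category names are illustrative assumptions, not the association rules learned in Xu et al. (2019).

```python
# Hypothetical keyword rules mapping post text to traffic-event categories.
RULES = {
    "traffic accident": {"crash", "collision", "accident", "pileup"},
    "roadwork":         {"roadwork", "construction", "lane closed"},
    "severe weather":   {"snow", "ice", "flooded", "storm"},
}

def classify_post(text):
    """Return the first category whose keywords appear in the post,
    or 'non-traffic event' if none match."""
    words = text.lower()
    for label, keywords in RULES.items():
        if any(k in words for k in keywords):
            return label
    return "non-traffic event"

posts = [
    "Huge pileup on I-75 northbound, avoid the area",
    "Lane closed on Main St for construction all week",
    "Beautiful sunset over the river tonight",
]
print([classify_post(p) for p in posts])
```

In practice such keyword rules would only be a first-pass filter; the cited systems learn the word associations from data and combine them with GPS metadata for localization.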

Literature Search
The literature search process consists of four steps: i) selecting eligibility criteria (inclusion and exclusion criteria), ii) formulating research objectives, iii) identifying the search strategy, and iv) data extraction Harris, Quatman, Manring, Siston and Flanigan (2014); Wright, Brand, Dunn and Spindler (2007). This study employed the systematic review methodology to address the research questions posited through a systematic and replicable process Gough, Oliver and Thomas (2017). Specifically, the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) Statement was used as a model for this review Page, McKenzie, Bossuyt, Boutron, Hoffmann, Mulrow et al. (2021). Based on the established eligibility criteria, the selected papers were analyzed and synthesized to address the research questions postulated in the following subsection.

Research Questions and Objectives
Developing AR models for specific tasks will enhance the use of AI systems in automating human actions and autonomously detecting actions in live feeds. Once the inclusion selection process has been carried out based on pre-established criteria, the main results of the selected works are codified and extracted in order to synthesize and guide this research. The following research questions are addressed in this study: • RQ1: What are the main Action Recognition techniques/applications in accident detection and autonomous transportation?
• RQ2: What are the main taxonomies and algorithms used in Action Recognition for accident detection and autonomous transportation?
• RQ3: What are the main datasets, features, and metrics used in Action Recognition for the accident detection task?

Selecting Eligibility Criteria
This review includes research articles related to Action Recognition, covering topics in autonomous transportation, traffic control, and accident detection using computer vision, published in peer-reviewed journals between 2012 and 2022. Given the continuously evolving advancement of the technical field, the selected articles were published within the ten years before this review. Only research articles published in English were used. The inclusion and exclusion criteria are detailed in Subsection 3.3 and Subsection 3.4, respectively. For further clarity, this systematic review is based primarily on computer vision tasks using AR models in autonomous transportation and smart city accident detection.

Inclusion Criteria
The publications needed to meet the following criteria in order to be included: 1. Articles should be in the Action Recognition and Computer Vision domain.

Exclusion Criteria
The following exclusion criteria were applied:
1. Does not contain video/motion analysis.
2. Published before 2012.
3. Not peer-reviewed, or does not provide clear findings and analysis of results.
4. Written in a language other than English.
5. Duplicated studies.

Information Sources
The papers included in this review were identified by searching electronic databases published in English.The databases in Table 1 were used as the primary source of articles for this review.
These databases provide impactful articles from full-text journals and conferences relevant to Action Recognition tasks in smart city automation, autonomous transportation, and accident detection. The first phase included searching the databases in Table 1 with advanced search and filtering techniques to limit the results to relevant studies. In the second phase, two teaching assistants manually reviewed the search results to ensure their validity. The number of articles retrieved from each database and the final number of papers selected are showcased in Figure 1. Only accessible articles are included in the search results. More details on the search terms and the strategy for validating and selecting relevant materials are discussed in Subsection 3.6.

Search Strategy
Combining the following keywords with the conjunction "AND" and the disjunction "OR" resulted in a total of 2,030 papers in an automated search, as shown in Table 1. The most common terms used in our search were:
1. Action Recognition.
2. Transportation.
3. Traffic Control.
4. Accident Detection.
The results of our search and the corresponding queries were as follows:
• IEEE Xplore: We received 299 papers from IEEE using the search string [(("All Metadata":Action Recognition) AND ("All Metadata":Transportation) OR ("All Metadata":Action Recognition) AND ("All Metadata":Traffic) OR ("All Metadata":Action Recognition) AND ("All Metadata":Accident Detection))] between 2013 and 2022.
• ACM: We received 181 papers from ACM using the search string [AllField:("Action Recognition") AND AllField:("Transportation") OR AllField:("Action Recognition") AND AllField:("Traffic") OR AllField:("Action Recognition") AND AllField:("Accident Detection")].
• Web of Science: We received 445 papers from Web of Science using the search string [((ALL=(Action Recognition) AND ALL=(Transportation OR Traffic OR Accident Detection))) AND (PY==("2022" OR "2021" OR "2020" OR "2019" OR "2018" OR "2017" OR "2016" OR "2015" OR "2014" OR "2013"))].
• Springer Link: We received 572 papers from Springer Link using the search string [("Action Recognition") AND (("Transportation") OR ("Traffic") OR ("Accident Detection"))] between 2013 and 2022.
The articles were evaluated and selected according to the criteria mentioned in Section (3). After the preliminary database search conducted by the student researchers using the approved search strategy, and after eliminating duplicates, a total of 1,830 articles were screened independently by two faculty researchers and one student researcher, all of whom are domain experts. The abstracts, titles, and keywords of the selected articles were reviewed for relevance based on the inclusion and exclusion criteria. Articles that did not meet the eligibility criteria or were not relevant to the research questions were removed. The independent researchers rated each article against the inclusion and eligibility criteria; this painstaking protocol ensures that all included articles are relevant to this study. A total of 1,650 papers were excluded because they did not contain video analysis or did not employ AR techniques in detecting accidents. Thirty-three papers were excluded because they lacked validation techniques for the proposed methodology, 108 papers identified as review papers were excluded, and 17 papers contained only abstracts. Finally, only 22 papers were selected for analysis, as shown in Figure 2.

Coding, Data Extraction and Analysis
For the data extraction phase, the full text of each chosen paper was shared among the authors for review and for tagging the key contributions. Microsoft Excel spreadsheets Niglas (2007), Airtable Dirk and Maddox (2018), and the Mendeley citation manager Zaugg, West, Tateishi and Randall were used to coordinate the workflow and analyze the papers. This research aims to retrieve action recognition research articles relevant to accident detection and autonomous transportation. In addition, duplicate studies covering the same issues were excluded from the study. Figure 1 shows the proportion of initial articles and final articles selected from each of the five online data sources listed in Table 1.

Results
Following PRISMA guidelines, 2,030 publications were identified across the five included databases, and the results for the 22 papers selected for review are presented in this section. Figure 3 showcases the publication years of the selected papers. It is noteworthy that the majority were published between 2019 and 2021. Taking advantage of advances in technology and smart city automation, more research is now being conducted in which deep learning algorithms are developed to model traffic-related activities in a smart city using computers equipped with high-performance GPUs.

RQ1: Main Action Recognition Techniques in Smart City Transportation
The first research question of our study examines the main AR techniques and applications within smart cities and autonomous transportation, as shown in Table 3.1. Many researchers have proposed other methods to model traffic management and traffic prediction, including Vector Auto-Regression, Support Vector Regression, Auto-Regressive Integrated Moving Average (ARIMA), the Kalman filter, and, most recently, LSTMs and RNNs Smola and Schölkopf (2004); Wang, Ma, Wang, Jin, Wang, Tang, Jia and Yu (2020b). For time series data, such as traffic control data, these approaches have not been able to capture both spatial and temporal information concisely. Recent efforts such as GNN and GaAN have improved accuracy Zhang, Shi, Xie, Ma, King and Yeung (2018); Zhao, Song, Zhang, Liu, Wang, Lin, Deng and Li (2019). Ijjina, Chand, Gupta and Goutham (2019) proposed a supervised deep learning framework to detect and identify roadside vehicular accidents by extracting feature points based on local features such as trajectory intersection and velocity, detecting anomalies under real-time accident conditions such as daylight variations. Fernández-Llorca, Biparva, Izquierdo-Gonzalo and Tsotsos (2020) utilized visual cues derived from a camera to detect lane changes or vehicle maneuvers using a disjoint two-stream convolutional network and a spatiotemporal multiplier network. You and Han (2020) discovered that time segmentation methods such as SS-TCN and MS-TCN were more successful at higher IoU thresholds; their experiments also suggest that the R-C3D algorithm yields results comparable to segmentation-based approaches. Although newer methods such as R(2+1)D and SlowFast have improved accuracy, most techniques fail to capture traffic anomalies accurately on the DoTA dataset, suggesting that traffic anomaly classification remains challenging. Yao, Wang, Xu, Pu, Wang, Atkins and Crandall (2022) suggest that distant anomalies and occluded objects are difficult to classify because of their low visibility. Collisions with moving vehicles present a similar problem since, at times, the vehicle ahead is substantially obscured by the vehicle it impacts. There may be instances when a vehicle hits obstacles that are not detected, such as bumpers or traffic cones; most often, anomalous vehicles are responsible for occluding these obstacles. Horizontal vehicle collisions are hard to detect due to their slow vertical trajectory, which makes the anomaly subtle. The JSM-based method extracts motion trajectories to evaluate traffic scenes but ignores events that occur in an unusual manner Xia, Hu and Wang (2018); Yao et al. (2022). Srinivasan, Srikanth, Indrajit and Narasimhan (2020) developed a scalable algorithm for high-speed object detection (DETR) with a less complex architecture and higher accuracy than other object detection algorithms, based on correlations between all objects in the video data. Table 2 addresses the research question on the main Action Recognition techniques and applications in autonomous transportation; the notation "-" indicates that the corresponding research paper did not address our research question.
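Temporal IoU, the measure behind the thresholds discussed above, compares a predicted action segment against ground truth; the segment boundaries below are made-up numbers for illustration.

```python
def temporal_iou(seg_a, seg_b):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

# A predicted accident segment vs. ground truth, scored at several thresholds.
pred, gt = (12.0, 20.0), (14.0, 22.0)
iou = temporal_iou(pred, gt)
print(f"IoU = {iou:.2f}")
for thr in (0.3, 0.5, 0.7):
    print(f"counts as a detection at IoU >= {thr}: {iou >= thr}")
```

This illustrates why methods can look strong at loose thresholds yet drop off at stricter ones: the same prediction passes at 0.3 and 0.5 but fails at 0.7.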

RQ2:Algorithms and Taxonomies in Autonomous Transportation
To answer our second research question, we identified the most critical taxonomies and algorithms used in AR systems for autonomous transportation and accident detection. Table 3 shows the models, architectures, and features used by other researchers, along with the metrics for evaluating the performance of the proposed models. It is noteworthy that most researchers employed different metrics to evaluate their algorithms' performance, especially work that develops a novel algorithm or benchmark. More than 60% of the reviewed papers evaluated their algorithms using Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), Mean Average Precision (MAP), Intersection over Union (IoU) Nowozin (2014), or Detection Rate (DR). Yao et al. (2022) proposed a novel FOL-based method for unsupervised video anomaly detection (VAD) and introduced a metric for computing anomaly scores using the spatial-temporal area under the curve (STAUC). Reddy et al. (2021) developed a spatio-temporal graph neural network for managing and predicting traffic flow, which RNNs, LSTMs, and other architectures were unable to fully capture; their study combined a GNN, an RNN, and a transformer layer to model complex topological and temporal relationships among traffic data, including adjacent traffic flows. Yu et al. proposed a new graph-based spatio-temporal model to predict future traffic accidents; the integration of spatial, temporal, and external features in predicting accidents achieved a performance improvement of around 5% over the SAE. Ali et al. (2022) developed a Graph Convolutional Network coupled with DHST-Net, called GCN-DHSTNet, an enhanced GCN model that learns the spatial dependence of dynamic traffic flow and applies an LSTM to capture dynamic temporal correlations with other external features; in terms of RMSE and MAPE, the proposed model is 27.2% and 11.2% better, respectively, than AAtt-DHSTNet, the previous state of the art. Wang et al. (2020b) focused on accident prediction that takes into account spatio-temporal dependence and other external factors in anticipating accident occurrence. Reddy et al. (2021) proposed a hybrid method for detecting and recognizing stationary and moving vehicles, traffic lights, and road signs using Deep Q-Learning and YOLOv3. Bortnikov et al. (2020) developed an HRNN for detecting accidents in CCTV surveillance by exploiting temporal and spatial features in the classification of video footage. Yang et al. (2021) proposed a feature-fused SSD and a new tracking-based object detection technique, TDO, with greatly improved detection results over the state of the art, and also established a vehicle dataset for highway scene analysis. Huang et al. (2019) developed a supervised learning algorithm to detect crash patterns from historical traffic data, examining different prediction methods to estimate crash risk or occurrence. You and Han (2020) also created a benchmark of traffic accident data based on cause and effect events with temporal intervals for each accident event; the dataset provides atomic cues for reasoning in a complex environment and planning future actions, including mitigating legal ambiguity among agents. The framework developed by Tang, Huang, Sun, Dong, Zhang, Gao and Liu (2017) can classify traffic data into different categories, such as detecting vehicle turning directions, bicycle lanes, and pedestrians within the two

✓ ✓ ✓
Traffic accidents can be caused by many factors, including driver behavior, weather conditions, traffic flow, and road structures.
The authors investigated spatial-temporal relationships on heterogeneous data to develop a road-level accident prediction system.

✓ ✓
The goal of this project is to develop a framework for analyzing stationary time series traffic data.In addition, it is able to predict traffic information with a 14.1% improvement in MAPE compared to other baselines.

✓ ✓ ✓
In this study, the authors discovered that time segmentation methods such as SS-TCN and MS-TCN were more successful on the dataset at higher IoU thresholds.In addition, the R-C3D algorithm has a comparable result when compared to segmentation-based approaches.

Srinivasan et al. (2020) Vehicle Detection Deep Learning Accident Detection
Accident Classification

✓ ✓ ✓
The authors developed a scalable algorithm for high-speed object detection (DETR), with a less complex architecture and a higher level of accuracy compared to other object detection algorithms that are based on correlations between all objects in the video data.

✓ ✓ ✓
The authors applied the SIFT flow method to improve dense trajectories and generate visual words that can be utilized in detecting traffic flow.
The data from the experiments demonstrate that the SIFT method is effective.Vatti, Vatti, Vatti and Garde (2018) Car Collision

Lane Maneuver
Statistical Model Traffic Flow Pattern -

✓ -
The authors developed an electronic notification system that can alert relatives when a vehicle accident is detected based on the vehicle's trajectory, position, and acceleration.
seconds of traffic footage.In order to correctly predict accidents and classify external factors leading to accident occurrence.Wang et al. (2020b) take into account Spatio-temporal dependence in their proposed methodology.
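Several of the evaluation metrics recurring in Table 3 (MAE, MAPE, IoU) have simple closed forms. As a brief illustrative sketch, not taken from any single reviewed paper, they can be computed as follows:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of prediction errors."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))

def mape(y_true, y_pred):
    """Mean Absolute Percentage Error (requires nonzero ground truth)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

The same IoU formula applies in the temporal dimension (1-D intervals instead of 2-D boxes) when evaluating action segmentation methods such as SS-TCN and MS-TCN at different IoU thresholds.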

RQ3: Main Datasets, Features and Metrics Used in Action Recognition for the Accident Detection Task
Our third research question focused on exploring the datasets used for accident detection. Table 4 showcases the dataset features, the type of sensors/video data, and links to publicly available datasets for accident detection in a smart city. Yao et al. (2022) developed a benchmark dataset to assess the quality of traffic accident detection and anomaly detection for nine action classes. Given the scarcity of annotated real-life accident datasets, Bortnikov et al. (2020) utilized simulated game video data with varied weather and scene conditions, which yielded results comparable to real-life traffic videos on YouTube, as shown in Table 4. The majority of the datasets used in accident detection and autonomous vehicles are collected from dashcams, traffic surveillance cameras, drones (such as the HighD, InD, or Interaction datasets Krajewski, Bock, Kloeker and Eckstein (2018); Zhan, Sun, Wang, Shi, Clausse, Naumann, Kummerle, Konigshof, Stiller, de La Fortelle et al. (2019)), and cameras installed on buildings. For example, the NGSIM HW101 and NGSIM I-80 datasets Colyar and Halkias (2007); Halkias and Colyar (2006) contain 45 minutes of images recorded from a building by eight synchronized cameras at 10 Hz. Fernández-Llorca et al. (2020) suggest that the NGSIM HW101 dataset is not fully applicable to onboard detection applications, even though it is beneficial for understanding and assessing the motion and behavior of vehicles and drivers under different traffic conditions. The PKU dataset includes more than 5700 environmental trajectories collected using multiple horizontal 2-D LiDARs covering 360°, including vehicles' trajectory data over 64 km and 19 hours of footage Zhao, Wang, Lin, Guillemard, Geronimi and Aioun (2017). The Prevention dataset includes data from three radars, two cameras, and one light detection and ranging (LiDAR) sensor, covering a range of 80 meters around an ego-vehicle, to support the development of intelligent systems for vehicle detection and tracking Izquierdo, Quintanar, Parra, Fernández-Llorca and Sotelo (2019). In a similar fashion, the ApolloScape dataset was developed to support automatic driving and navigation in smart cities; it contains about 100K image frames and 1000 km of trajectories collected using four cameras and two laser scanners, supporting 3D-perception LiDAR object detection and tracking Wang, Huang, Cheng, Zhou, Geng and Yang (2019). Ijjina et al. (2019) compiled surveillance videos at 30 frames per second (FPS), trimmed down to 20-second video chunks, collected from CCTV cameras recording road intersections in different parts of the world under diversified ambient conditions such as harsh sunlight, daylight hours, snow, and night hours.
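The 30 FPS, 20-second chunking used by Ijjina et al. (2019) amounts to simple frame-index arithmetic. The helper below is our own illustrative sketch of such preprocessing (the function name and the drop-remainder policy are our assumptions, not details from the paper):

```python
def chunk_boundaries(total_frames, fps=30, chunk_seconds=20):
    """Split a video of total_frames into consecutive (start, end) frame
    ranges, each covering chunk_seconds of footage; a final partial
    chunk shorter than chunk_seconds is dropped."""
    frames_per_chunk = fps * chunk_seconds  # e.g. 600 frames at 30 FPS
    return [
        (start, start + frames_per_chunk)
        for start in range(0, total_frames - frames_per_chunk + 1, frames_per_chunk)
    ]
```

For a 50-second clip at 30 FPS (1500 frames), this yields two full 20-second chunks and discards the trailing 10 seconds.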

Limitations
Our review focused on research papers relevant to action recognition, accident detection, and autonomous transportation, published within the last ten years, that developed a novel framework or benchmark dataset. According to our inclusion and exclusion criteria, we excluded research papers in languages other than English and those that did not include video/motion analysis. Consequently, only a fraction of the articles surveyed in the study were considered. The vast majority of AR techniques developed in other domains can also be readily applied to a new domain (such as accident detection). In light of this, we recommend a research project that combines Action Recognition techniques for object and human action classification, since both have been developed using similar model architectures.

Conclusion
This systematic literature review aims to determine the state of the art in Action Recognition for accident detection and autonomous transportation in smart cities. To achieve this, we used the PRISMA guideline for selecting seminal articles related to our topic domain, based on the inclusion and exclusion criteria discussed in Section 3. We selected 22 papers from an initial list of 2030 publications, and we categorized and analyzed the relevant literature based on the three pillars of our research questions. This paper discussed the leading techniques and applications of action recognition in autonomous transportation. The study also explored the main taxonomies and algorithms used in AR for autonomous transportation. Finally, we presented an overview of the datasets used in AR for autonomous transportation, the features of those datasets, and download links to them.
In the quest for a smart city, automating city traffic by capturing spatial and temporal information with DNNs is a significant step in smart city automation. Bao et al. (2020) developed a model that handles the challenges of relational feature learning and uncertainty anticipation from traffic video, predicting accident occurrence within 3.53 seconds with an average precision of 72.22% using a Graph Convolutional Network (GCN) and Bayesian Neural Networks (BNN). Many factors are involved in traffic accidents, including driver behavior, weather conditions, traffic flow, and road structure. Yu et al. examined spatial-temporal relationships on heterogeneous data to develop a road-level accident prediction system. Besides sequential patterns in the temporal dimension, traffic flows on a road are strongly affected by other road networks in the spatial dimension. Many studies have been conducted on traffic flow prediction; however, many of them lack the ability to account for spatial and temporal dependencies Wang et al. (2020b). Reddy et al. (2021) aimed to extract roadway characteristics relevant to the trajectory of an autonomous vehicle from real-world road conditions using Deep Q-Learning. Analyzing and forecasting dynamic traffic patterns within smart cities is necessary for planning and managing transportation. Forecasting traffic flow has become more difficult because of the volatility of vehicle flow in the temporal dimension and the uncertainty related to accident occurrence and traffic movements. Ali et al. (2022) proposed a hybrid model composed of GCN and DHSTNet, which can forecast short-term traffic patterns in urban areas for improved traffic management. Similarly, Alkandari and Aljandal (2015) developed a methodology for determining how long a vehicle stays in traffic based on traffic flow and congestion.
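As context for the GCN-based models discussed above, a single graph-convolution propagation step in the widely used Kipf and Welling form can be written as H' = ReLU(D^(-1/2)(A+I)D^(-1/2) H W). The NumPy sketch below is our own minimal illustration of that rule, not the exact layer used by any reviewed model:

```python
import numpy as np

def gcn_layer(adj, features, weights):
    """One graph-convolution step: add self-loops to the adjacency
    matrix, symmetrically normalize it by node degree, aggregate
    neighbor features, apply a linear map, then a ReLU."""
    n = adj.shape[0]
    a_hat = adj + np.eye(n)                          # A + I (self-loops)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt         # D^-1/2 (A+I) D^-1/2
    return np.maximum(0.0, a_norm @ features @ weights)
```

In traffic applications, nodes typically represent road segments or sensors, the adjacency matrix encodes the road network topology, and the features carry per-node traffic measurements; stacking such layers lets information propagate across neighboring road segments.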
Automating accident detection with AI systems based on security cameras will be a step towards securing more lives. It will also support the transformation of traffic cameras for smart city automation and provide first responders and law enforcement agencies with information about road accidents. We recommend that future research focus on scaling up accident detection systems that can be integrated into smart city automation, alerting first responders about road accidents and providing a quick response to victims, thereby reducing human error and response time by adopting a spontaneous model for reporting accidents.

Figure 1: Proportion of selected studies

Figure 2: Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) flow chart of the systematic review.

Figure 3: Number of papers published per year surveyed.

Table 2 (entry descriptions): Using GCN and BNN, the model developed by Bao et al. can handle the challenges of relational feature learning and uncertainty anticipation from video data, anticipating an accident in 3.53 seconds with an average precision of 72.22%. Reddy, Chella, Ravi Teja, Rose Baby and Kodali (2021) investigate how to extract road features relevant to the trajectory of an autonomous vehicle from real-world road conditions using Deep Q-Learning in a real-world environment setting. One study utilized visual cues derived from a camera to detect lane changes and vehicle maneuvers using a disjoint two-stream convolutional network and a spatiotemporal multiplier network. Another proposes a hybrid model combining GCN and DHSTNet that is effective in forecasting short-term traffic patterns in urban areas to improve traffic management. Wang, Chen and Gong (2020a) developed a new dataset and proposed a method for safety prediction. One study develops a methodology for controlling the length of time that a vehicle stays in traffic based on the flow of traffic and congestion. Riaz, Chenqiang, Azeem, Saifullah, Bux and Ullah (2022) implemented the FWPredNet framework for accident and anomaly prediction, which outperformed the previous state-of-the-art framework. Wang et al. (2020b) developed a framework for conceptually describing components of surveillance video, separating them into smaller components, and detecting activities from short clips of two seconds. Another study presents a comparative analysis of different statistical and deep learning models for solving traffic safety problems by detecting collisions and estimating crash risk on urban interstate highways. Using video games under different weather and scene conditions, one study generated traffic data that was processed and trained with a 3D CNN, yielding results comparable to real-life traffic videos from YouTube; using time-dependent frames in a video, the developed model was evaluated on trimmed unlabelled video. Yang, Song, Sun, Zhang, Chen, Rakal and Fang propose a feature-fused SSD to improve the detection accuracy of vehicles from the ImageNet video database. A supervised deep learning framework detects and identifies roadside vehicular accidents by extracting feature points based on local features, such as trajectory intersection and velocity, and by detecting anomalies under real-time accident conditions such as daylight variations. Hui et al. use a Gaussian Mixture Model to extract foreground and background information from video streams to create a vision-based accident detection model.

Table 1
Article Data source

Table 2
Studies used to address the research question on the main Action Recognition techniques and applications in autonomous transportation. The notation "-" means the research paper did not address our research question.

Table 3
Identifying the main taxonomies and algorithms used in AR for autonomous transportation, based on studies relevant to our second research question. The notation "-" means the metric is not applicable.

Table 4
Overview of datasets used in AR for autonomous transportation, features of the datasets and download links to the datasets.