 Review
 Open access
Artificial intelligence-based traffic flow prediction: a comprehensive review
Journal of Electrical Systems and Information Technology volume 10, Article number: 13 (2023)
Abstract
The expansion of the Internet of Things has resulted in new creative solutions, such as smart cities, that have made our lives more productive, convenient, and intelligent. The core of smart cities is the Intelligent Transportation System (ITS), which has been integrated into several smart city applications that improve transportation and mobility. ITS aims to resolve many traffic issues, such as traffic congestion. Recently, new traffic flow prediction models and frameworks have been rapidly developed in tandem with the introduction of artificial intelligence approaches to improve the accuracy of traffic flow prediction. Traffic forecasting is a crucial task in the transportation industry. It can significantly affect the design of road construction projects, in addition to its importance for route planning and traffic regulation. Furthermore, traffic congestion is a critical issue in urban areas and overcrowded cities, so it must be accurately evaluated and forecasted. Hence, a reliable and efficient method for predicting traffic is essential. The main objectives of this study are, first, to present a comprehensive review of the most popular machine learning and deep learning techniques applied in traffic prediction and, second, to identify the inherent obstacles to applying machine learning and deep learning in the domain of traffic prediction.
Introduction
In recent decades, the demand for ITS-based solutions for precise traffic prediction and mobility management has increased as cities have become increasingly crowded and congested [1]. ITS is an advanced approach to delivering transportation services that integrates communications, computing, information, and other technologies and applies them to the transportation industry. This process aims to create an integrated system of people, roads, and vehicles [2]. ITS can construct a comprehensive, real-time, accurate, and effective transportation management system [3]. Furthermore, it has the potential to significantly reduce hazards, high accident rates, traffic congestion, carbon emissions, and air pollution, while also improving safety and dependability, travel speeds, traffic flow, and passenger satisfaction [4].
Precise traffic flow prediction is essential to the ITS, as it can help traffic stakeholders (individual passengers, traffic administrators, policymakers, and road users), shown in Fig. 1, utilize transport networks more safely and intelligently [5, 6]. The efficacy of these systems depends on the quality of traffic data; an ITS can succeed only when those data are reliable. According to the World Health Organization's (WHO) 2018 global status report on road safety, road traffic deaths continue to rise, with 1.35 million deaths recorded in 2016, making the study of traffic forecasting a valuable method for reducing congestion and ensuring safer, more cost-effective travel [7, 8]. The benefits of traffic flow forecasting are illustrated in Fig. 2.
Historically, traffic flow forecasting depended on parametric models such as time series analysis derived from historical data. In a time series, each observed reading x is recorded at a specific time t. The objective is to recognize temporal patterns in past traffic data and use them for forecasting. Another approach was the Kalman filtering method for time-series analysis, a model for dynamic stochastic problems that can address regression and minimize variance to achieve optimal results [9]. Also, the AutoRegressive Integrated Moving Average (ARIMA) model is a well-known and standard framework for predicting short-term traffic flow [10]. Numerous modifications to the ARIMA model have been implemented, with results showing enhanced performance [11,12,13,14].
Because traffic flow is stochastic and nonlinear, nonparametric models such as the Random Forest (RF) algorithm, the Bayesian Algorithm (BA) approach, K-Nearest Neighbor (KNN), Principal Component Analysis (PCA), and Support Vector algorithms [9] have recently been employed in traffic flow prediction. In addition, neural networks became widely used for predicting traffic flow [15]. In the era of big data, a shallow Back-Propagation Neural Network (BPNN) [16] showed promising results. Deep learning then emerged, employing several layers to extract more complex properties from raw input. Convolutional Neural Networks (CNN) [17], Recurrent Neural Networks (RNN) [18], Long Short-Term Memory (LSTM) [19], Restricted Boltzmann Machines (RBM) [20], Deep Belief Networks (DBN) [21], and Stacked Auto-Encoders (SAE) [22] are some examples of deep learning architectures.
The primary goals of this research are to conduct a comprehensive survey of the key machine learning and deep learning techniques used in forecasting traffic flow in addition to identifying the obstacles and future directions for machine learning and deep learning in this field.
The rest of the paper is organized as follows: Section "Background" gives a theoretical background about traffic prediction problems, machine learning, and deep learning. Section "Survey methodology" outlines the survey methodology and presents a literature review of machine learning and deep learning approaches employed in traffic flow prediction. Section "Challenges" covers the existing challenges in the topic of this survey. Finally, Section "Conclusion" concludes the paper.
Background
ITS provides a large volume of high-resolution traffic data for use in data-driven traffic flow prediction techniques [23]. From this perspective, traffic flow prediction can be considered a time series problem in which the flow count at a future time is estimated based on data received from one or more observation points during prior periods. Traffic flow forecasting is a major component of traffic modeling, operation, and management. Accurately predicting traffic flows in real time can give road travelers information and recommendations that improve their travel choices and decrease expenses, and can supply authorities with better traffic control tactics to alleviate congestion. Machine learning and deep learning, as depicted in Fig. 3, are subsets of artificial intelligence (AI) that have expanded rapidly over the years [24]. These approaches have been deemed successful in predicting traffic flow.
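To make the time series framing concrete, the following minimal sketch turns a sequence of flow counts into training pairs, where each window of past readings is paired with a future reading. The function name `make_windows`, the parameters `lag` and `horizon`, and the sample counts are illustrative assumptions and do not come from the surveyed works.

```python
def make_windows(series, lag, horizon=1):
    """Pair each window of `lag` past readings with the reading `horizon` steps ahead."""
    features, targets = [], []
    # The last usable start index leaves room for the window and the horizon.
    for start in range(len(series) - lag - horizon + 1):
        features.append(series[start:start + lag])
        targets.append(series[start + lag + horizon - 1])
    return features, targets


# Made-up vehicle counts per interval at a single observation point.
flows = [120, 135, 150, 160, 155, 140]
X, y = make_windows(flows, lag=3)
```

Any of the supervised models discussed in this section could then, in principle, be trained on the pairs in `X` and `y`.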
Machine learning
Machine Learning (ML) techniques are statistical models used to make classifications and predictions based on the data provided [24]. ML is an area of AI that focuses on developing prediction algorithms that discover patterns within huge datasets without being explicitly programmed for a particular task [25]. ML models are classified into three categories according to the learning techniques they employ: supervised learning, unsupervised learning, and reinforcement learning (RL). In addition, ML algorithms can be further subdivided into several subgroups depending on distinct learning approaches, as shown in Fig. 4 [26].
Supervised learning
In tasks that depend on supervised learning, a labeled dataset consisting of feature vectors and their corresponding output labels is supplied to the model. The objective of these models is to create an inference function that maps feature vectors to output labels. When training is complete, the ML model can make predictions on new data. Supervised learning algorithms can generate continuous or discrete predictions [24]. Support Vector Machine (SVM), KNN, Logistic Regression, Linear Regression, Decision Trees (DT), Random Forests (RF), and Naive Bayes are examples of supervised learning approaches [25].
A. Support vector machine
SVM is a supervised learning methodology based on the classification approach. It can be considered a non-probabilistic linear classifier. SVM is regarded as a state-of-the-art machine learning algorithm. Margin calculation is the core concept underlying SVM. In this approach, each data item is represented as a point in n-dimensional space, where n is the number of features and the value of each feature is the value of the corresponding coordinate. As depicted in Fig. 5, the objective of this strategy is to examine the vectorized data and create a hyperplane that distinguishes between the two classes [27]. Margins are then drawn between the classes, and a hyperplane is built that minimizes the classification error and maximizes the margin, that is, its distance to the nearest points of each class [28].
Once an optimal separating hyperplane is identified for linearly separable data, the data points that lie on its margin are called support vectors, and the solution is expressed as a linear combination of these points alone, as depicted in Fig. 6. The other data points are disregarded [29]. Therefore, the SVM model's complexity is independent of the number of features in the training data, so SVMs are well suited to learning tasks involving many features relative to the number of training cases.
Although the maximum-margin criterion lets the SVM choose among numerous candidate hyperplanes, SVM may be unable to find any hyperplane that separates the classes at all when the data contain misclassified instances. One proposed solution to this problem is a soft margin that tolerates the misclassification of certain training cases [30]. SVMs are binary classifiers, so a multiclass problem needs to be reduced to a series of binary classification problems. Categorical data represent another challenge; however, with adequate rescaling, decent results can be obtained [29].
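The soft-margin idea can be illustrated with a small linear SVM trained by subgradient descent on the hinge loss. This is a minimal sketch under stated assumptions, not the formulation used in the cited works: the function names, the penalty `c`, the learning rate, and the toy points are all hypothetical, and practical SVM solvers typically use dedicated optimizers instead.

```python
def train_linear_svm(points, labels, c=1.0, lr=0.01, epochs=200):
    """Minimize 0.5*||w||^2 + c * sum(hinge losses) by subgradient descent."""
    dim = len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of the 0.5*||w||^2 term
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for j in range(dim):
                    grad_w[j] -= c * y * x[j]
                grad_b -= c * y
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy, made-up two-class data with labels +1 and -1.
points = [(2.0, 2.0), (3.0, 2.5), (2.5, 3.0), (-2.0, -2.0), (-3.0, -2.5), (-2.5, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
```

A smaller `c` tolerates more margin violations in exchange for a wider margin, which is the tradeoff the soft margin introduces.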
B. K-nearest neighbors
KNN is a nonparametric classification technique that makes no assumptions about the underlying dataset and is known for its efficiency and simplicity. In KNN, a labeled training dataset is used to predict the class of unlabeled data [31]. KNN is typically employed as a classifier that assigns a data point to a class based on the closest training samples in the feature space. KNN is used on datasets where the data can be divided into distinct clusters to determine the class of a new input. KNN is particularly useful when there is no prior knowledge of the data used in the study [31].
KNN uses a parameter K, a positive integer, that sets how many of the closest training data points are consulted. KNN employs numerous distance functions, including Manhattan distance, Euclidean distance, Minkowski distance, and Hamming distance. The Euclidean distance is used to find the nearest neighbors in the case of continuous data, whereas the Hamming distance is used for categorical data [32].
The most challenging aspect of the KNN algorithm is choosing the K value, as it affects the algorithm's performance and precision. Small K values make the class label prediction sensitive to noise, while large K values blur the class boundaries and can degrade the fit. In addition, large K values increase the computation time and slow execution. The K value is calculated according to (1):
$$K = \sqrt{n} \tag{1}$$
where n is the size of the dataset.
Cross-validation is applied to the training data with varied K values, and the K value that yields the best precision is selected [32].
The KNN technique has the following benefits: it is a straightforward technique that is simple to apply. It is a very adaptable classification technique that is ideal for multimodal classes.
On the other hand, using the KNN algorithm to classify unknown data is quite costly, because it needs to calculate the distance from each new point to the training points in order to find its k nearest neighbors. As the size of the training set increases, the computations become increasingly intensive. Noisy or irrelevant features decrease accuracy, and higher-dimensional data also reduce precision. Moreover, KNN does not generalize from the training data; it retains all of them and defers the distance computations to prediction time, which is why KNN is called a lazy learner [33].
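The procedure described above can be sketched in a few lines. The function names, the toy coordinates, and the labels "free" and "jam" are assumptions made for illustration only; the Euclidean and Hamming distances follow their standard definitions.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Distance for continuous features."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Distance for categorical features: number of positions that differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_points, train_labels, query, k, distance=euclidean):
    """Classify `query` by majority vote among its k nearest training points."""
    # Lazy learning: all distance computations happen at prediction time.
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up two-feature samples with hypothetical traffic-state labels.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["free", "free", "free", "jam", "jam", "jam"]
prediction = knn_predict(train_points, train_labels, query=(2, 2), k=3)
```

Because every prediction sorts the whole training set, the cost grows with the number of stored samples, which matches the drawback noted above.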
C. Logistic regression
Logistic regression is a supervised learning approach used to differentiate between two or more groups [27]. It gives the likelihood, as a value between 0 and 1, that an event will occur based on the values of the input variables; thresholding this value yields a binomial outcome. For instance, predicting whether or not an email is spam is a binomial result of Logistic Regression. Logistic Regression can also produce multinomial outcomes, such as predicting the preferred cuisine (Chinese, Italian, Mexican, etc.), and ordinal results, such as rating a product from 1 to 5. Therefore, Logistic Regression is concerned with predicting a categorical target variable [33]. Logistic Regression provides several benefits, including ease of implementation, computational efficiency, training efficiency, and ease of regularization. In Logistic Regression, input features do not require scaling. In addition, Logistic Regression is described as robust to data noise and multicollinearity. On the other hand, Logistic Regression is unsuitable for nonlinear problems since its decision surface is linear, it is prone to overfitting, and all relevant independent variables must be identified for it to work well [33].
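As a minimal sketch of the binomial case, the code below fits a logistic model by plain gradient descent on the log loss and returns a probability between 0 and 1. The function names, learning rate, epoch count, and the one-feature toy data are illustrative assumptions; production libraries generally use more robust solvers.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for row, label in zip(X, y):
            error = sigmoid(sum(w * v for w, v in zip(weights, row)) + bias) - label
            for j, v in enumerate(row):
                grad_w[j] += error * v / n
            grad_b += error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict_proba(weights, bias, row):
    """Probability that `row` belongs to class 1."""
    return sigmoid(sum(w * v for w, v in zip(weights, row)) + bias)


# Made-up single feature (already centred); label 1 marks the event of interest.
X = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = fit_logistic(X, y)
```

Thresholding `predict_proba` at 0.5 gives the binomial outcome; the linear score inside the sigmoid is what makes the decision surface linear.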
D. Linear regression
Regression is a supervised learning technique in which the value of the output variable is determined by the values of the input variables, using labeled datasets. Regression can be used to model and predict continuous variables. In linear regression, an attempt is made to fit a straight line (more generally, a hyperplane) to the dataset when the relationship between its variables is linear [33]. Linear Regression is calculated according to (2) [32]:
$$F(x) = mx + b + e \tag{2}$$
where x is the independent variable, F(x) is the dependent variable, m is the slope of the line, b is the y-intercept, and e is the error term.
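For a single independent variable, the slope m and intercept b of the fitted line can be estimated by ordinary least squares. The sketch below uses the standard closed-form estimates; the function name `fit_line` and the sample values are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares estimates of slope m and intercept b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Made-up points that lie exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
m, b = fit_line(xs, ys)
residuals = [y - (m * x + b) for x, y in zip(xs, ys)]  # the error term e per point
```

With noisy data the residuals would be nonzero, and a single extreme outlier can pull both estimates noticeably, which is the sensitivity discussed in this subsection.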
The best prediction accuracy may be achieved using the Linear Regression algorithm if the following steps are followed to prepare the training data [32]:

Ensure that the relationship between the dependent and independent variables is linear, i.e., apply any of the available data transformation techniques to make the data linear.

Remove noisy data and outliers using a technique for cleaning data.

To minimize overfitting, compute pairwise correlations and exclude the most highly correlated variables.

Transform the training data toward a Gaussian distribution to generate more accurate predictions.

Rescale inputs to improve the reliability of the prediction.
From the above discussion, it is clear that the Linear Regression algorithm is straightforward to comprehend, and it performs best when the relationship between the dependent and independent variables is linear. In contrast, Linear Regression can only predict numeric outputs. It is inappropriate for nonlinear data and highly sensitive to outliers. Also, the observations must be independent [32].
E. Decision trees
Classifier-generating systems are one of the most popular strategies in data mining [34]. In data mining, classification algorithms are capable of processing vast quantities of data. They can be used to make assumptions about categorical class names, categorize information based on training sets and class labels, and classify newly available data [35].
DTs are one of the powerful approaches used in numerous domains, including ML, image processing, and pattern recognition [36]. A DT is a model that sequentially and cohesively combines a set of basic tests, in each of which a numerical feature is compared with a threshold value [37]. In addition, DT is a common classification model in data mining [38]. Every tree is composed of nodes and branches. Each node represents an attribute of the group to be classified, and each branch represents a possible value for the node [39]. Figure 7 illustrates the structure of DT.
The DT algorithm is a supervised learning algorithm. It tries to build a model that can predict the class or value of target variables by applying decision rules learned from the training data [41].
The advantages and disadvantages of using the DT algorithm to solve regression and classification problems [42,43,44] are outlined in Table 1.
F. Random forest
RF is an ensemble classifier, since it employs many DTs to compensate for the shortcomings of a single DT [45,46,47,48,49]. The 'vote' of all trees is used to determine the final class for each unknown sample. This reduces the risk that a single tree may not be ideal; adding numerous trees should therefore lead toward a global optimum [50]. For the formation of each tree in the "forest", the bootstrap approach is used for resampling. In addition, on each node split, a subset of features is randomly selected, and the split variable is chosen from this subset. The predicted value is the majority vote for classification and the average for regression [51,52,53,54]. RF models have two tuning parameters: mtry, the number of features randomly picked for consideration in each split, and ntree, the number of trees in the model. The mtry parameter involves a tradeoff: large values increase the correlation among trees but improve the accuracy of each tree [51]. The samples left out of a tree's bootstrap sample are called the Out of Bag (OOB) samples and can be used for validation: each tree predicts over its OOB samples, and the final result is an average over the outcomes of the trees [55].
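The three ingredients described above (bootstrap resampling, a random feature subset per split, and voting) can be sketched without building full trees. The helper names and the seeded generator are assumptions for illustration; tree induction itself is omitted.

```python
import random
from collections import Counter


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement; rows never drawn are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


def candidate_features(n_features, mtry, rng):
    """Pick the mtry features considered at a single node split."""
    return rng.sample(range(n_features), mtry)


def majority_vote(tree_predictions):
    """Combine per-tree class predictions into the forest's prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_indices(10, rng)
split_features = candidate_features(n_features=5, mtry=2, rng=rng)
forest_class = majority_vote(["jam", "free", "jam"])
```

Each tree would be grown on its `in_bag` rows and evaluated on its `out_of_bag` rows, so validation needs no separate hold-out set.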
There are two options for estimating the relevance of each variable and ranking them accordingly. The first is to use the OOB samples. In this option, the accuracy is calculated for each tree over its corresponding OOB samples, a variable is randomly permuted among the samples, and the accuracy is recalculated on the new set. Applying this to all trees and averaging for each variable yields a metric for comparing relevance, known as the Permutation Importance Index (PIM) or Variable Importance Measure (VIM). The alternative is to calculate the split improvement for each tree and node using a measure (e.g., the Gini Index) and use these values to compare the significance of the variables [55].
RFs offer high flexibility and prediction rates. They also tend not to overfit the data as the number of trees increases. On the other hand, a graphical representation is not feasible as it is for DTs [55].
G. Naïve Bayes
The Naïve Bayes technique, also known as Idiot's Bayes, Independence Bayes, or Simple Bayes, is a fundamental probability-based classifier. Given the class variable, it assumes that the presence or absence of a particular feature is unrelated to the presence or absence of any other feature [56].
The Naïve Bayes technique is straightforward to implement since it does not require complex iterative parameter estimation schemes. Consequently, a naive Bayes classifier can be useful for enormous datasets. Also, it requires minimal training data to estimate its parameters. Because the variables are assumed to be independent, only the variances of the variables within each class must be estimated, rather than the whole covariance matrix [56].
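The point about estimating only per-class means and variances can be seen in a small Gaussian variant of the classifier. The Gaussian likelihood is an assumption made here for continuous features; the function names, the tiny variance floor, and the toy readings are likewise illustrative.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Small floor keeps a zero variance from breaking the logarithm below.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def nb_predict(model, row):
    """Return the class with the highest log posterior under independence."""
    def log_posterior(label):
        prior, means, variances = model[label]
        total = math.log(prior)
        for v, m, var in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


# Made-up two-feature readings with hypothetical traffic-state labels.
X = [(30, 80), (35, 75), (32, 78), (5, 20), (8, 25), (6, 22)]
y = ["free", "free", "free", "jam", "jam", "jam"]
model = fit_gaussian_nb(X, y)
```

No covariance between features is ever computed: the per-feature log likelihoods are simply summed, which is the independence assumption in action.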
Unsupervised learning
In unsupervised learning, the dataset contains no output label information. The purpose of these models is to infer the relationships among data and/or to uncover hidden variables [25]. These strategies are mostly used to reduce the size of a dataset by extracting key features. Reducing the number of features helps prevent problems such as high computational cost and multicollinearity [57]. Figure 8 depicts unsupervised learning, in which the machine guesses the result according to past experiences and learns from information previously provided to anticipate the real-valued outcome. Examples of unsupervised learning-based methods are K-Means Clustering, Principal Component Analysis (PCA), and Linear Discriminant Analysis (LDA) [25].
A. K-means clustering
K-means clustering is one of the unsupervised learning methods that automatically produces groups or clusters. Data with comparable properties are put into the same cluster. The method is called K-means because it forms K different groups [28]. The purpose of K-means clustering is twofold: (1) to provide K centroids, one for each cluster, and (2) to minimize the squared error function. Each centroid is the mean of the points in its cluster and therefore sits at the cluster's center [27].
The K-means clustering technique has many advantages. First, it is computationally more efficient than hierarchical clustering for a large number of variables. Second, with globular clusters and small K, it yields tighter clusters than hierarchical clustering. Finally, it is easy to implement, and its clustering results are easy to interpret. The complexity of the algorithm is O(K*n*d), so it is computationally efficient [33].
On the other hand, the K value is not known in advance and is difficult to predict. Performance degrades when clusters are not globular, and different initial partitions can result in distinct final clusters. Performance also decreases when the clusters in the input data differ in size and density. In addition, K-means assumes that the joint distribution of features inside each cluster is spherical; this assumption fails when features are correlated, because correlation places extra weight on the correlated features. K-means clustering can be susceptible to outliers. It is also sensitive to local optima and initial points, and a unique solution for a specific K value does not exist, so K-means needs to be run many times (20 to 100 times) for a given K value, and the result with the lowest value of the squared-error objective J is then picked [33].
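The restart strategy described above can be sketched with the standard Lloyd iteration: assign each point to its nearest centroid, move each centroid to the mean of its cluster, and keep the run with the lowest J. The function names, the number of iterations and restarts, and the toy points are assumptions for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, iterations=50):
    """One K-means run from a random initialization; returns (centroids, J)."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to its cluster mean; keep it if the cluster is empty.
        centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    cost = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, cost


def kmeans(points, k, restarts=20, seed=0):
    """Run K-means several times and keep the result with the lowest J."""
    rng = random.Random(seed)
    return min((kmeans_once(points, k, rng) for _ in range(restarts)),
               key=lambda result: result[1])


# Made-up 2D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, cost = kmeans(points, k=2)
```

On well-separated data most restarts agree; the restarts matter when different initial partitions lead to different local optima.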
B. Principal component analysis
PCA is an unsupervised ML approach that reduces the dimension of the data, which makes computations more efficient and quicker [27]. PCA transforms the original set of variables into new, mutually orthogonal variables called principal components (PC); for example, two-dimensional data can be turned into one-dimensional data. The dataset must be scaled before applying PCA because the results are sensitive to the relative scaling of the variables [28].
To explain the PCA mechanism, let us use an example of 2D data. When the 2D data are plotted on a graph, they take up two axes. Applying PCA to these data will turn them into 1D [27], as illustrated in Fig. 9.
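The 2D-to-1D case can be worked through directly, because the leading eigenvector of a 2x2 covariance matrix has a closed-form angle. The sketch below is an illustration under stated assumptions: the function name and sample points are hypothetical, and the data are assumed to be already scaled.

```python
import math


def pca_2d_to_1d(points):
    """Project 2D points onto their first principal component."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Angle of the leading eigenvector of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, scores


# Made-up points lying along the line y = x.
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
direction, scores = pca_2d_to_1d(points)
```

Each 2D point is replaced by a single score along `direction`; for higher dimensions a general eigendecomposition would be needed instead of the closed-form angle.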
C. Linear discriminant analysis
LDA is a statistics-based data mining technique that differentiates between classes of objects in N-dimensional feature space by computing a sequence of k ≤ N − 1 linear discriminants whose values can be used to describe the classes [59]. LDA and PCA are similar [60] in that they describe the "most important" variations in the data and select directions that maximize feature variance. LDA differs from PCA in that LDA makes use of the class labels: it selects directions that best separate the class means relative to the sum of the class variances along that direction. It maximizes the ratio of between-class scatter to within-class scatter. Intuitively, it finds lower-dimensional descriptions of the data that pull the members of a class together and push members of different classes apart [61]. The k linear discriminants correspond to eigenvectors and are ordered by eigenvalue. The discriminants can be used to classify new objects or for dimension reduction [61].
To ensure the discriminant's optimality, the LDA's design makes the following two assumptions: (1) any linear combination of the features is normally distributed, and (2) the classes have equal covariance matrices. Despite the danger of inferior outcomes, LDA has been used routinely for dimension reduction and classification even when these assumptions are violated [61].
Reinforcement learning
Unlike supervised and unsupervised learning, RL is a goal-oriented learning approach. Learning occurs by interacting with the surrounding environment and detecting state changes. RL is strongly tied to an agent (controller) responsible for the learning process to attain a goal. In particular, the agent takes actions (control signals); consequently, the state of the environment changes and rewards, which are numerical values that may be positive or negative, are returned. The agent aims to maximize the rewards obtained over time. A task is a full specification of an environment, which determines how the reward is generated [62]. Examples of RL-based techniques are the Q-Learning algorithm and Monte Carlo Tree Search (MCTS).
A. Q-learning
Q-learning [63] is a straightforward method that enables agents to learn how to act optimally in controlled Markovian domains. It is an incremental approach to dynamic programming that imposes low processing demands. It works by successively improving its estimates of the quality of specific actions at specific states. It can also be considered an asynchronous Dynamic Programming (DP) approach. Agents learn to act optimally by experiencing the consequences of actions, without needing to build maps of the domains [64].
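The incremental improvement of action-quality estimates can be illustrated with tabular Q-learning on a toy problem. Everything specific here is an assumption for illustration: a five-state corridor with a reward of 1 at the right end, an epsilon-greedy behavior policy, and the parameter values `alpha`, `gamma`, and `epsilon`.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Learn Q-values for a corridor where the rightmost state is the goal."""
    rng = random.Random(seed)
    actions = (-1, +1)  # step left or step right
    q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    goal = n_states - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            if rng.random() < epsilon:
                action = rng.choice(actions)  # explore
            else:
                best = max(q[(state, a)] for a in actions)
                # Exploit, breaking ties at random so early episodes still move.
                action = rng.choice([a for a in actions if q[(state, a)] == best])
            next_state = min(max(state + action, 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            best_next = max(q[(next_state, a)] for a in actions)
            # Move the estimate toward reward plus discounted best future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q


q = q_learning()
greedy_policy = [max((-1, +1), key=lambda a: q[(s, a)]) for s in range(4)]
```

The agent never builds a model of the corridor; it only updates the table from the transitions it experiences.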
Q-learning is applied in information theory, and related investigations are underway. Recently, Q-learning and information theory have been applied to various disciplines such as natural language processing, anomaly detection, pattern recognition, and image classification [65,66,67,68]. In addition, a framework has been established to provide a satisfying response based on the user's speech using RL in a voice interaction system [69], and a high-resolution prediction system for local rainfall based on DL has been developed [70].
The advantage of the ant Q-learning approach is that it can successfully identify the value of the reward for a specific action in a multi-agent environment thanks to the cooperation between agents. The drawback of ant Q-learning is that its result can get stuck at a local minimum when agents take only the shortest path [71].
B. Monte Carlo tree search
MCTS is a powerful technique for handling sequential decision problems. The approach relies on a smart tree search that balances exploration and exploitation. MCTS employs random sampling in the form of simulations to store statistics of actions and make better-informed selections in each subsequent iteration [72]. MCTS is a decision-making technique used to search huge combinatorial spaces represented by trees. In such trees, nodes represent states, also referred to as configurations of the problem, whereas edges denote transitions (actions) between states [72].
Formally, MCTS is directly applicable to problems that can be described by a Markov Decision Process (MDP). Certain modifications of MCTS make it applicable to Partially Observable Markov Decision Processes (POMDP) [73]. More recently, MCTS paired with deep RL formed the backbone of AlphaGo, developed by Google DeepMind and documented in [74].
The basic MCTS procedure is conceptually simple [75], as depicted in Fig. 10. A tree is built incrementally and asymmetrically. In each iteration, a tree policy is used to find the most urgent node of the current tree.
The tree policy aims to balance the considerations of exploration and exploitation. A simulation is then run from the selected node, and the search tree is updated according to the result. This involves adding a child node that corresponds to the action taken from the selected node and updating the statistics of its ancestors. Moves are made during this simulation according to some default policy, which in the simplest case makes uniform random moves. A notable advantage of MCTS is that the values of intermediate states do not need to be evaluated, which greatly reduces the amount of domain knowledge required [75].
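The text does not fix a particular tree policy. One common choice is the UCB1 rule (as used in UCT), which is assumed in the sketch below together with a simple statistics update for the nodes on the selected path. The function names, the exploration constant, and the example statistics are hypothetical.

```python
import math


def ucb1(value_sum, visits, parent_visits, c=math.sqrt(2)):
    """Score a child: average reward (exploitation) plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """children maps an action to its (value_sum, visits) statistics."""
    return max(children, key=lambda a: ucb1(*children[a], parent_visits))


def backpropagate(stats, path, reward):
    """Add the simulation result to every node on the path from the root."""
    for node in path:
        value_sum, visits = stats.get(node, (0.0, 0))
        stats[node] = (value_sum + reward, visits + 1)


# Made-up statistics: "left" has the better average, "right" is less explored.
children = {"left": (6.0, 10), "right": (2.0, 3)}
chosen = select_child(children, parent_visits=13)

stats = {}
backpropagate(stats, ["root", "right"], reward=1.0)
```

Here the less-visited child wins the selection because its exploration bonus outweighs its lower average reward; as visits accumulate, the bonus shrinks and the averages dominate.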
Deep learning
About a decade ago, Deep Learning (DL) emerged as an effective ML technique and achieved good performance in several application fields. The core idea of DL approaches is to learn complicated features from data with little external contribution using Deep Neural Networks (DNN) [77]. These algorithms do not require hand-crafted features; they automatically learn increasingly complicated features [78].
DL is an AI paradigm that has gained major interest from the academic community and demonstrated higher potential than conventional methods [79]. DL is a more efficient, monitored, time-consuming, and cost-effective technique than the ML technique. It is not only a specific approach to knowledge; it also adapts to various methodologies and topographies that could be beneficial to a wide range of complicated problems. The approach learns the illustrative and differential properties in a relatively varied manner [80, 81]. Figure 11 demonstrates the procedure of ML and DL.
DL is based on a collection of ML techniques that model data and generate high-level abstractions through many nonlinear transformations. DL is built on artificial neural network (ANN) systems [82, 83]. These networks include many layers for collecting high-level features and for eliminating problematic data, so DL algorithms often perform better than classic ML algorithms [84].
ML approaches have had a huge impact on our daily life, in areas such as efficient web search, self-driving vehicles, computer vision, and optical character recognition. By implementing ML approaches, human-level AI has been improved as well [85,86,87]. Nevertheless, the performance of classic ML algorithms is far from ideal when it comes to human information processing mechanisms (e.g., voice and vision). The concept of DL algorithms was formed in the late twentieth century, inspired by the deep hierarchical structures of human speech perception and production systems. Figure 12 displays a timeline showing the evolution of deep models along with the classic model [26]. DL has many architectures, such as CNN, RNN, LSTM, and Recurrent CNN (RCNN).
A. Convolutional neural network
CNNs are a subtype of ANNs and are frequently used in face recognition, text analysis, human organ localization, and biological image recognition [88]. The CNN structure was first introduced in 1988 by Fukushima [89]. It was not widely employed, however, due to the limits of the computing hardware available for training the network. In the 1990s, LeCun et al. [90] applied a gradient-based learning algorithm to CNNs and obtained successful results for the handwritten digit classification problem. After that, researchers progressively enhanced CNNs and reported state-of-the-art results in different recognition tasks.
A CNN architecture includes three components: the input layer, hidden layer, and output layer. The intermediate levels of any feedforward network are known as hidden layers, and their number varies based on the network architecture type. Convolutions are executed in the hidden layers, which include dot products of the convolution kernel with the input matrix. Each convolutional layer generates feature maps to be used as input by the subsequent layers [91], as shown in Fig. 13.
In general, CNNs consist of two major components: feature extractors and a classifier, as shown in Fig. 14. In the feature extraction layers, each layer of the network takes as its input the output of the immediately preceding layer and passes its output as input to the next layer. The CNN design involves a combination of three types of layers: convolution, max-pooling, and classification. In the low and middle levels of the network, there are two types of layers: convolutional layers and max-pooling layers. The even-numbered layers perform convolutions, whereas the odd-numbered layers perform max-pooling operations. The output nodes of the convolution and max-pooling layers are arranged into 2D planes called feature maps. Usually, each plane of a layer is produced by combining one or more planes of the previous layers. The nodes of a plane are connected to a small region of each connected plane of the previous layer. Each node of the convolution layer extracts features from the inputs by convolution operations on the input nodes. As the features propagate to the highest level, their dimensions are reduced according to the kernel sizes of the convolution and max-pooling operations, respectively.
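The way convolution and max-pooling shrink a feature map can be seen in a small example. The sketch assumes a single channel, no padding, a stride of 1 for the convolution, and non-overlapping pooling windows; as in most DL libraries, the kernel is applied without flipping. The function names, image, and kernel are made up for illustration.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take dot products (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]


# Made-up 4x4 image with a vertical edge between the second and third columns.
image = [[0, 0, 1, 1] for _ in range(4)]
kernel = [[-1, 1], [-1, 1]]  # responds to a left-to-right intensity increase
feature_map = conv2d_valid(image, kernel)  # 3x3 feature map
pooled = max_pool(feature_map)  # 1x1 after 2x2 pooling
```

The 4x4 input becomes a 3x3 feature map after the 2x2 kernel and a single value after pooling, while the edge response survives both steps.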
To preserve classification accuracy, the number of feature maps is usually increased so that the features of the input are represented better. The output of the last CNN layer is used as the input to a fully connected network called the classification layer. In the classification layer, the extracted features are taken as inputs with respect to the dimension of the weight matrix of the final neural network. At the topmost classification layer, a softmax layer calculates the score of each class. The classifier then outputs the class with the highest score [92].
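The softmax scoring step can be sketched as follows. The raw scores are invented for illustration; the only assumption is the standard softmax definition, with the maximum subtracted for numerical stability.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical outputs of the last fully connected layer for three classes.
scores = [2.0, 0.5, -1.0]
probs = softmax(scores)

# The classifier outputs the class with the highest score.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```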
CNNs have various advantages, including a closer resemblance to the human visual processing system, a structure that is highly optimized for processing 2D and 3D images, and effectiveness in learning and extracting abstractions of 2D information. The max-pooling layer of CNNs is particularly effective at absorbing shape variations. Furthermore, a CNN contains far fewer parameters than a fully connected network of the same size, because it is built from sparse connections with tied weights. In addition, CNNs are trained with a gradient-based learning technique and suffer less from the vanishing gradient problem. Given that the gradient-based technique trains the full network to reduce an error criterion directly, CNNs can produce highly optimized weights [92].
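The parameter savings from sparse, shared weights can be made concrete with a simple count. The layer sizes below (a 32x32x3 input mapped to 16 feature maps of the same spatial size with 3x3 kernels) are hypothetical and chosen only for illustration.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights of a convolutional layer: one shared kernel per (input, output) channel pair, plus biases."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


def dense_params(n_inputs, n_outputs):
    """Weights of a fully connected layer: one weight per (input, output) pair, plus biases."""
    return n_inputs * n_outputs + n_outputs


# Hypothetical layer: 32x32 RGB input -> 16 feature maps of size 32x32.
height, width, in_ch, out_ch = 32, 32, 3, 16

conv_count = conv_params(3, in_ch, out_ch)
dense_count = dense_params(height * width * in_ch, height * width * out_ch)
```

Under these assumptions the convolutional layer needs 448 parameters, whereas a fully connected layer producing the same number of outputs would need 50,348,032.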
B. Recurrent neural network
Developed in the 1980s, the RNN is one of the most widely used DL models [93]. Networks of this kind come in various types and have a memory that stores the information they have seen so far. Moreover, RNNs are powerful models for time series analysis, and they use the prior output to predict the next output. The networks contain repeating loops in the hidden layers, which allow information about previous inputs to be stored for a while so that the system can predict future outputs. The output of the hidden layer is fed back into the hidden layer t times. The output of a recurrent layer is sent to the next layer only when the iterations are completed. As a result, the output is more global, and earlier information is retained for longer. Finally, the errors are propagated backward to update the weights [94]. RNNs are employed mostly in speech processing and Natural Language Processing (NLP) [95, 96].
Unlike CNN, RNN operates on sequential data. Because the structure embedded in a data sequence carries useful information, this property is fundamental to a range of applications such as NLP. Thus, an RNN can be considered a unit of short-term memory, where x is the input layer, y is the output layer, and s represents the state (hidden) layer [97]. A typical unfolded RNN diagram for a specific input sequence is presented in Fig. 15. In addition, Pascanu et al. [98] introduced deep RNNs to reduce the learning difficulty in deep networks and to bring the benefits of depth to RNNs through three different techniques, namely "Hidden-to-Hidden", "Hidden-to-Output", and "Input-to-Hidden".
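A minimal sketch of the recurrence, using the x, s, and y notation above, is given below. It assumes a scalar Elman-style cell with a tanh activation; the weight names and values are hypothetical.

```python
import math


def rnn_forward(xs, w_xs, w_ss, w_sy, s0=0.0):
    """Scalar RNN: s_t = tanh(w_xs * x_t + w_ss * s_{t-1}), y_t = w_sy * s_t."""
    s = s0
    ys = []
    for x in xs:
        s = math.tanh(w_xs * x + w_ss * s)  # state mixes the new input with the old state
        ys.append(w_sy * s)
    return ys, s


# A single impulse followed by zeros: later outputs are non-zero only
# because the state s carries information forward in time.
ys, final_state = rnn_forward([1.0, 0.0, 0.0], w_xs=0.5, w_ss=0.8, w_sy=1.0)
```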
One of the main challenges with RNNs is their sensitivity to the exploding and vanishing gradient problems [99]. More specifically, the repeated multiplication of many large or small derivatives during training may cause the gradients to explode or decay exponentially. As new inputs arrive, the network gradually forgets the original ones; hence, its sensitivity to them decays over time [97].
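The effect of repeated multiplication can be seen with a deliberately simplified calculation. The sketch assumes that the same per-step derivative factor applies at every time step, which real networks do not satisfy exactly; the factors 0.5 and 1.5 are illustrative.

```python
def gradient_scale(per_step_factor, steps):
    """Product of identical per-step derivative factors over a number of time steps."""
    scale = 1.0
    for _ in range(steps):
        scale *= per_step_factor
    return scale


vanishing = gradient_scale(0.5, 50)  # factor below 1: the gradient decays toward zero
exploding = gradient_scale(1.5, 50)  # factor above 1: the gradient grows without bound
```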
C. Long short-term memory
LSTM is a special case of RNN that has internal memory and multiplicative gates. A variety of LSTM cell layouts have been described since 1997, when the first LSTM was introduced [100]. LSTM contributed to the development of well-known services such as Siri, Cortana, Alexa, Google Translate, and Google voice assistant [101]. LSTM is a module in an RNN that addresses the vanishing gradient problem. Generally, an RNN employs LSTM units to avoid error propagation problems, which allows the RNN to learn across multiple time steps. LSTM includes cells that keep information outside the normal flow of the recurrent network. Like the memory in a computer, the cell uses gates to decide when data are stored, written, read, or erased [102]. The simple RNN cell depicted in Fig. 16(a) was enhanced by adding a memory block controlled by input and output multiplicative gates. Figure 16(b) shows the LSTM architecture of the jth cell c_{j}. The main component of a memory block is the self-connected linear unit s_{c}, termed the constant error carousel (CEC), which protects LSTM from the drawbacks of a regular RNN. The input and output gates consist of corresponding weight matrices and activation functions [101].
Generally, it can be concluded that the LSTM cell comprises one input layer, one output layer, and one self-connected hidden layer. The hidden layer may contain 'conventional' units that can be fed into the next LSTM cells. However, the conventional LSTM cell also has limitations due to the linear form of s_{c}. Its steady growth may saturate the function h and turn the cell into an ordinary unit. Therefore, an additional forget gate layer was inserted [103], as illustrated in Fig. 16(b), which allows undesirable information to be erased and forgotten.
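The role of the forget gate can be sketched with a scalar LSTM step. This uses the gate equations commonly found in current textbooks and libraries, which may differ in detail from the variant drawn in Fig. 16; the parameter names (w_*, u_*, b_*) and the helper make_params are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell with input, forget, and output gates."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate value
    c = f * c_prev + i * g   # cell state: keep part of the old memory, add new content
    h = o * math.tanh(c)     # exposed hidden state
    return h, c


def make_params(b_i=0.0, b_f=0.0):
    """All weights zero except the two gate biases under study (illustrative only)."""
    keys = [prefix + gate for gate in "ifog" for prefix in ("w_", "u_", "b_")]
    p = dict.fromkeys(keys, 0.0)
    p["b_i"], p["b_f"] = b_i, b_f
    return p


# Forget gate open (bias +20), input gate closed: the old cell state 3.0 is kept.
_, c_kept = lstm_step(1.0, 0.0, 3.0, make_params(b_i=-20.0, b_f=20.0))
# Forget gate closed (bias -20): the old cell state is erased.
_, c_erased = lstm_step(1.0, 0.0, 3.0, make_params(b_i=-20.0, b_f=-20.0))
```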
Bidirectional LSTM, Hierarchical LSTM, Convolutional LSTM, Grid LSTM, LSTM Autoencoder, and Cross-modal LSTM are the most advanced network topologies that use the LSTM gating mechanism [104].
Bidirectional LSTM networks propagate the state vector in both the forward and the reverse direction, so bidirectional time dependencies are taken into account. Because of the reverse state propagation, expected future correlations can be included in the network's outputs. Hence, bidirectional LSTM networks can detect, extract, and resolve more time dependencies, and do so more precisely, than unidirectional LSTM networks. LSTM networks can encapsulate spatially and temporally distributed information and handle incomplete data by using a flexible connection mechanism for the propagation of the cell state vector [105]. Based on the data gaps discovered, this filter mechanism redefines the connections between cells. Figure 17 depicts the architecture of Bidirectional LSTM.
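The bidirectional idea can be sketched independently of the cell type. For brevity, the example below replaces the LSTM cell with a plain tanh recurrent cell, which is a simplification; the weights and function names are hypothetical.

```python
import math


def run_recurrent(xs, w_x, w_h):
    """Plain tanh recurrent pass (stand-in for an LSTM pass), returning one state per step."""
    h = 0.0
    states = []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def bidirectional(xs, fwd_weights, bwd_weights):
    """Pair each time step's forward state with the state of a pass run over the reversed sequence."""
    forward = run_recurrent(xs, *fwd_weights)
    backward = run_recurrent(xs[::-1], *bwd_weights)[::-1]  # re-align to original time order
    return list(zip(forward, backward))


# The only non-zero input arrives at the last step.
states = bidirectional([0.0, 0.0, 1.0], (1.0, 0.5), (1.0, 0.5))
first_forward, first_backward = states[0]
```

At the first time step the forward state is still zero, while the backward state already reflects the input that arrives two steps later.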
Hierarchical LSTM networks resolve multidimensional problems by splitting the overall problem into subproblems and structuring them hierarchically. This is achieved by adjusting weights inside the network, which gives it the ability to apply a specific degree of attention.
Using a weighting-based attention mechanism that handles and filters input sequences, hierarchical LSTM networks can be used to predict long-term dependencies [106]. Convolutional LSTM can be used to filter and reduce input information obtained over a longer time period by using convolution operations built into LSTM networks or directly into the LSTM cell structure. Convolution operations incorporated directly into the cell can also be used to extend the usual LSTM cell. Correlations are extracted by convolving current input sequences, recurrent output sequences, and weight matrices. The newly created features are received as new inputs by the network gates [107]. Figure 18 depicts this strategy.
Moreover, convolutional LSTM networks are considered well suited to representing a wide range of quantities, including spatially and temporally distributed relations. However, multiple quantities can be forecast jointly only as a reduced feature representation; deconvolution layers are needed to predict the different output quantities in their original units rather than as features [104]. An autoencoder structure is commonly used to realize the convolution and deconvolution of information. In [108], a layered LSTM autoencoder handles the challenge of high-dimensional input data and the forecasting of high-dimensional parameter spaces. In [109], a method for directly integrating an autoencoder into the LSTM cell structure was proposed as an extension of LSTM for multimodal prediction. To compress input data as well as cell states, encoders and decoders were integrated directly into the LSTM cell structure. This optimization maximizes information flow in the cell and leads to an enhanced cell state update mechanism for both short-term and long-term dependencies.
Grid LSTM is an LSTM cell with a matrix structure [110]. The Grid LSTM has connections for both the spatial and the temporal dimensions of the input sequences, so connections along several dimensions within cells extend the normal information flow. The Grid LSTM is therefore appropriate for the parallel prediction of a wide range of output quantities that can be either linearly independent or nonlinearly dependent. Figure 19 compares a two-dimensional Grid LSTM network to a standard stacked LSTM network [110].
Cross-modal LSTM is a modern method for predicting various quantities jointly. It combines a number of regular LSTMs that were previously used to model the individual quantities separately. The LSTM streams interact via recurrent connections to handle the dependencies between quantities. The outputs of defined layers are used in other streams as extra inputs for previous and subsequent layers. As a result, a cross-modal prediction can be made. Figure 20 depicts cross-modal LSTM [111].
D. Recurrent convolution neural network
In recent years, a new class of CNNs, RCNN, inspired by rich recurrent connections in the visual systems of animals, was introduced. The main component of RCNN is the recurrent convolutional layer (RCL), which integrates recurrent connections across neurons in the normal convolutional layer. With the increasing number of recurrent computations, the receptive fields (RFs) of neurons in RCL expand unboundedly, which is incongruous with biological realities [112]. The traditional RCNN model was proposed in [113, 114]. The RCNN architecture is presented in Fig. 21, in which both feedforward and recurrent connections have local connectivity and shared weights across distinct locations. This architecture is quite close to the recurrent multilayer perceptron (RMLP) which is generally used for dynamic control [115, 116] (Fig. 21, middle). The main difference is that the full connections in RMLP are replaced by shared local connections, similar to the difference between MLP [117] and CNN.
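The growth of the receptive field with the number of recurrent iterations can be sketched with a simple formula. The sketch assumes square feedforward and recurrent kernels applied with stride 1 and an RCL unfolded for a finite number of iterations; the function name and kernel sizes are illustrative.

```python
def rcl_receptive_field(feedforward_size, recurrent_size, iterations):
    """Side length of the effective receptive field of an RCL unit after unfolding.

    Each recurrent iteration widens the field by (recurrent_size - 1) pixels.
    """
    return feedforward_size + (recurrent_size - 1) * iterations


# Hypothetical 3x3 feedforward and 3x3 recurrent kernels, 0 to 3 iterations.
side_lengths = [rcl_receptive_field(3, 3, t) for t in range(4)]
```

With these kernel sizes the side length grows linearly, from 3 with no recurrence to 9 after three iterations, which is why the field is bounded only when the number of iterations is fixed.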
RCNN integrates a stack of RCLs, optionally interleaved with max-pooling layers, as seen in Fig. 22. Here, layer 1 is a traditional feedforward convolutional layer without recurrent connections, followed by max pooling. Four RCLs are then employed, with a max-pooling layer in the middle. There are only feedforward connections between neighboring RCLs. Both pooling operations have stride 2 and size 3. The output of the fourth RCL is followed by a global max-pooling layer, which yields the maximum over every feature map and provides a feature vector describing the image. Finally, a softmax layer is used to classify the feature vectors into C categories [113].
RCNN has various advantages from the computational perspective. First, the recurrent connections in RCNN allow every unit to incorporate context information from an arbitrarily large region in the current layer. Second, the recurrent connections increase the depth of the network while keeping the number of adjustable parameters constant through weight sharing. This is consistent with the trend in modern CNN architectures. Third, the unfolded RCNN is a CNN with numerous paths from the input layer to the output layer, which facilitates learning. On the one hand, the existence of longer paths makes the model capable of learning more complicated features. On the other hand, the existence of shorter paths may improve gradient backpropagation during training [113].
Survey methodology
The articles reviewed in this paper have been published in high-quality conferences and journals of IEEE, Elsevier, Springer, and IOP Publishing. Machine learning, deep learning, traffic flow prediction, traffic flow forecasting, traffic speed prediction, short-term traffic prediction, short-term traffic forecasting, and ITS are some of the search terms used to find these articles. The articles examined in this survey are directly relevant to the application of ML and DL approaches in traffic flow prediction. Both empirical and literature reviews on the above-mentioned subjects were considered for this work.
Survey organization
This survey compares various forecasting techniques for traffic flow. It follows a dual structure with ML techniques used for traffic flow prediction and DL techniques utilized for traffic flow prediction. This study provides a detailed discussion of the approaches and algorithms which are utilized for predictions, performance measurements, and tools used for these procedures.
The prediction of traffic flow has become one of the primary tasks in the ITS field [118]. Statistical methods, AI, and data mining techniques have been widely employed recently to evaluate road traffic data and anticipate future traffic indicators [119]. Previous findings demonstrated that no single technology could evaluate enormous datasets only by itself. Therefore, according to the data structure and its volume, the proper technology must be applied to extract the best insight from the collected data [120].
ML techniques for traffic flow prediction
In [121], the authors developed an ML-based traffic flow prediction approach using a regression model implemented with several libraries, including Pandas, NumPy, OS, Matplotlib, Keras, scikit-learn, and TensorFlow. Traffic prediction in this study means predicting the next year's traffic data from previous years' traffic data, with accuracy and mean square error reported as the outcome. Traffic was predicted at a 1-hour interval. The data were acquired from Kaggle. Two datasets were obtained: the 2015 traffic data, which contain the date, time, number of cars, and number of junctions, and the 2017 traffic data, which have identical fields so that the two can be compared directly. This study needs to investigate more factors that affect traffic flow prediction and to employ other prediction approaches such as deep learning and big data.
In [122], the authors aimed to address the traffic control problem with the help of an ML algorithm. They employed the Q-learning RL technique to manage traffic lights and built an artificial environment in Simulation of Urban Mobility (SUMO) for simulation purposes. In SUMO, the cars in motion can be observed, and each vehicle's delay time can be controlled and adjusted.
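For readers unfamiliar with Q-learning, the sketch below shows the standard tabular update rule on a toy traffic-light decision. It is not the implementation used in [122]; the state labels, actions, reward, learning rate, and discount factor are all hypothetical.

```python
def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


# Hypothetical setting: the reward is the negative total delay observed after the action.
actions = ["keep_phase", "switch_phase"]
q_table = {}
new_value = q_update(
    q_table,
    state="long_queue_north_south",
    action="switch_phase",
    reward=-5.0,
    next_state="short_queue_north_south",
    actions=actions,
)
```

In a simulator-driven setup, the reward and next state would be read from the simulation after each signal decision rather than supplied by hand as here.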
In [123], the aim was to lay the foundation for adaptive traffic control, either by controlling traffic lights remotely or by applying an algorithm that adjusts the timing according to the predicted flow, based on the integration of ML (RF, Linear Regression, and Stochastic Gradient Regression) and DL (MLP-NN, RNN) algorithms. The results showed that the ML algorithms had the worst performance among the evaluated models.
In [124], the authors concentrated on a critical capability of ITSs: predicting lane changes in vehicular traffic flow. The accuracy of lane-change detection was measured for four ML models, namely SVM, NB, RF, and DT, using high-fidelity vehicular traffic flow data gathered by the US Federal Highway Administration (FHWA) for Peachtree Street, Atlanta, GA. The accuracy and performance measurements revealed that SVM outperforms the other three ML models in predicting vehicle lane changes.
In [125], a prediction approach based on type-2 fuzzy logic was introduced using the conceptual framework of fuzzy logic and an urban traffic flow time series. An interval type-2 fuzzy system prediction approach was developed, and the Back Propagation (BP) technique was used to update the coefficients of the fuzzy rules' antecedents and consequents. The effectiveness of the proposed technique was validated using measured data from road networks and compared with other fuzzy approaches. According to the test results, the type-2 fuzzy logic system combined with the BP technique and SVM has a higher prediction accuracy.
In [126], the authors investigated the problem of predicting the traffic flow of a road based on historical data. The methodology relied on the canonical polyadic (CP) tensor decomposition of the traffic data. This step extracts the typical daily and weekly features of the traffic, in addition to its typical spatial distribution, while greatly reducing the amount of data required to represent it. The key components are then extended into the future, and the traffic data are regenerated from the decomposition. The data used are from the M62 motorway in northern England, from October 1, 2019, to October 28, 2019, at 15-min intervals, and are reported as the number of passing cars per hour. Using 4 parameters, the prediction captures 90 percent of the signal's power, which exceeds the performance of current rolling-average prediction algorithms. The authors indicated that they evaluated 4 variables in traffic flow forecasts but did not name them.
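To illustrate why a CP decomposition reduces the amount of data needed, the sketch below rebuilds a small three-way tensor from its factor matrices and compares storage. The tensor modes (location, time of day, day), the sizes, the rank, and the factor values are hypothetical and are not those used in [126].

```python
def cp_reconstruct(A, B, C):
    """Rebuild X[i][j][k] = sum over r of A[i][r] * B[j][r] * C[k][r]."""
    rank = len(A[0])
    return [
        [
            [sum(A[i][r] * B[j][r] * C[k][r] for r in range(rank)) for k in range(len(C))]
            for j in range(len(B))
        ]
        for i in range(len(A))
    ]


# Hypothetical rank-1 factors: 2 locations, 2 times of day, 3 days.
A = [[1.0], [2.0]]         # spatial profile
B = [[1.0], [3.0]]         # daily profile
C = [[2.0], [1.0], [0.0]]  # day-to-day profile

X = cp_reconstruct(A, B, C)
full_values = len(A) * len(B) * len(C)                  # entries in the full tensor
factor_values = (len(A) + len(B) + len(C)) * len(A[0])  # entries in the factors
```

Even in this tiny case the factors need 7 numbers instead of 12; the gap widens quickly as the tensor grows, because factor storage scales with the sum of the mode sizes rather than their product.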
In [127], the authors developed an intelligent traffic monitoring system based on ML (ML-ITMS) to estimate traffic jams at roadside units and thereby improve ITS performance. A short-term ML-based traffic flow model was developed, and SVM parameters were optimized to enhance traffic flow prediction. In the proposed ML-ITMS, SVM and RF were specifically designed for long-range wide area networks (LoRa) in a single query. The proposed ML-ITMS improved the accuracy estimate for traffic flow and nonparametric processes by using mathematical models. A data processing method was used as feedback for the proposed ML-ITMS. The platform was then passed through ML-ITMS services, including public safety and security for cities, medical facility provision, traffic prediction by light detection and ranging (LiDAR), and parking control. The experimental results revealed that the proposed ML-ITMS can improve traffic monitoring to 98.6% and can enhance traffic flow prediction more than other existing methods.
In [128], the authors proposed an Extreme Learning Machine optimized by a Gravitational Search Algorithm, called GSA-ELM, to improve the performance of short-term traffic flow forecasts. ELM avoids the cumbersome BP process by determining the best solution analytically, and the proposed search technique looks for the optimal settings for ELM. The prediction performance of the method was measured on four standard datasets and compared with several recent models. The four datasets were real-world traffic flow data from the A1, A2, A4, and A8 motorways along the Amsterdam Ring Road. The Mean Absolute Percentage Errors (MAPEs) for the GSA-ELM model on these datasets are 11.69%, 10.25%, 11.72%, and 12.05%, respectively, while the Root Mean Square Errors (RMSEs) are 287.89, 203.04, 221.39, and 163.24, respectively.
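The two error measures quoted above follow their standard definitions, sketched below. The observed and predicted flows are invented for illustration and are unrelated to the Amsterdam datasets.

```python
import math


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error, in the same unit as the data."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical hourly flows (vehicles per hour).
observed = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

mape_value = mape(observed, forecast)
rmse_value = rmse(observed, forecast)
```

MAPE is scale-free, whereas RMSE is expressed in vehicles per hour and therefore depends on how busy the road is, which is one reason the two measures can rank the same set of roads differently.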
In [120], supervised ML, as a method of Big Data analytics, was examined for forecasting various indicators of traffic volume through two case studies. In both experiments, the prediction models were trained and tested on traffic data from selected automatic traffic counters on roads in the Republic of Serbia, covering the period from 2011 to 2018.
In [129], the authors proposed reconstructing traffic flows from the expected travel time using an ML method. They examined the ability of the Gaussian Process Regressor (GPR) to handle this task. After the expected travel time on a specific route is obtained, a clustering method shows that the travel time profiles of each day can be associated with "different types of the day". Various regressors were then trained to estimate traffic flows from the travel time. Two variants were studied. In the 'multi-model' variant, a regressor was trained for each day profile. In the 'single-model' variant, only one regressor was trained, and the day profile was not considered. The proposed method is a unique way to predict and reconstruct traffic flow in route networks with ML from aggregated floating car data (FCD). Two main problems can be identified in this work. The first relates to using nondispersed algorithms on the input data, which can be problematic with longer evaluation sequences and produces a more complex trained model. The other is a traditional issue of every ML solution: the dependence on the quality of the input data.
In [130], a hybrid model incorporating ELM and ensemble-based technologies was developed to predict the future hourly traffic on a road section in Tangiers, a city in northern Morocco. The suggested model was built based on a high-speed ML technology that uses a kind of Single-Layer Feedforward Neural Network (SLFN). The data set in this study was a set of traffic flow recorded over 5 years from 2013 to 2017 from the Moroccan Center for Road Studies and Research. This study needs to consider additional relevant information related to traffic, such as special events, weather conditions, and traffic characteristics on adjacent roads that may affect a particular road.
In [131], the ability of various ML techniques to predict traffic conditions was investigated. Preliminary data were collected over two weeks of monitoring in Bandung, Indonesia, in order to determine future traffic conditions. The features collected in the dataset are days, hours, origins, destinations, route view, traffic conditions, weather, and weather locations. The study investigated neural networks, NB, DT, SVM, DNN, and DL. There are two main issues in this work. First, the training dataset was very small. Second, whenever the training data change, the training process must be repeated to reflect the newer data set, which takes additional time.
In [132], the prediction accuracy of four ML models was examined using probe data gathered from the road network of Thessaloniki, Greece. The ML models used were RF, Support Vector Regression (SVR), Multilayer Perceptron (MLP), and Multiple Linear Regression (MLR). There are two key concerns in this work. First, it has low accuracy in real-time speed prediction. Second, it needs to be tested on different datasets.
In [7], the authors suggested a preliminary method for assessing a realistic data set of road traffic accidents using graphical representations and dimension reduction methods. The data set was subjected to PCA and linear discriminant analysis, and the resulting performance measures provided some comprehensive insights into the patterns of road traffic accidents. The authors developed the preliminary framework by applying dimensionality reduction techniques to realistic road traffic accident data from Gauteng Province, South Africa (SA). Furthermore, classification was carried out using the NB, Logistic Regression, and KNN methods. The processed data were post-processed, and model performance measures, precision, and RMSE were used to evaluate each classifier.
In [133], the authors introduced a novel framework for stepwise regression in a concept-drift environment, with ensemble learning as the primary means of updating the distribution representation. First, the regression problem of predicting traffic volume was converted into a binary classification problem. Second, the Regression to Classification (R2C) method was used to create a more precise classification-type loss function for ensemble learning. Finally, the incremental learning of the regression function was modeled as an incremental update to the hyperresolution level. The proposed R2C architecture for traffic volume prediction has the disadvantage of not accounting for the spatial dependencies of traffic volume.
To summarize the related works above, Table 2 compares them in terms of methodology, data set, approaches, and main findings.
DL techniques for traffic flow prediction
In [134], a traffic prediction system was proposed using four approaches, namely Deep Autoencoder (DAN), DBN, RF, and LSTM. This technique is mostly used to estimate the traffic flow in more populated locations. The essential parameters used in this study were zone type, weather condition, day, road capacity, and vehicle types. The dataset used is not mentioned in this work.
In [135], the major objective was to predict trip duration from point A to point B on a route using neural networks. Several DL and neural network algorithms were used, such as a color clustering algorithm (the K-Means algorithm), combined with several parameters to compute and estimate travel duration. The dataset used in this study was obtained using Waze Live Map APIs. The authors need to examine other factors, such as weather conditions, to improve the efficiency and reliability of their work.
In [136], a short-term strategy for traffic flow forecasting was proposed based on a recurrent mixture density network, which combines an RNN with a mixture density network (MDN). The data set consisted of traffic flow data generated by sensors placed on road networks in Shenzhen, China. It was divided into two periods: from January 1, 2019, to March 31, 2019, and from October 1, 2019, to December 21, 2019. The modest size of the data set is a critical issue in this study.
In [137], the authors aimed to enhance the performance of the DBN, a DL approach, for accurate traffic forecasting under bad weather conditions. First, bad weather and traffic data were gathered from the IoV rather than from the inductance coils used in the usual methods. Subsequently, the SVR technique was used to improve the traditional DBN. The optimized DBN consists of two layers: the primary structure is the traditional DBN, which learns the basic features of traffic data in an unsupervised manner, and the topmost layer is an SVR that performs supervised traffic forecasting. Two types of data sets were used in this study: traffic data from a highway control center and weather data from local monitoring stations. The main issue in this study was that the computing time of the upgraded DBN requires optimization.
In [138], the authors proposed an urban traffic light control system that combines optimized traffic light scheduling techniques with traffic flow forecasting techniques. The goal was to reduce the number of vehicles stopped at all signalized intersections on the road network. First, a framework was proposed for an urban traffic control system, which included traffic flow predictions and signal control optimization. Second, to alleviate traffic congestion, an interactive traffic light approach was used. Experiments were carried out on real-world traffic data provided by the Aliyun Tianchi platform to validate the proposed system. The comparison results showed that both the proposed system and the signal control optimization technique work well.
In [139], the authors developed a technique for constructing a traffic congestion index by extracting free-stream speed and flow. They proposed the Traffic Congestion Index (TCI), which synthesizes changes in traffic flow and speed data to assess traffic congestion, and discussed how it is generated. Considering the correlation between road links in the road network, the authors introduced a technique for grouping road links based on subgraphs in order to pre-train the DL model and share information across road links. A traffic congestion prediction model called SG-CNN was proposed by integrating the characteristics of the traffic data with the CNN model, and the training process was improved by the road segment aggregation method. To make the TCI more accurate, the authors must consider more information that affects traffic congestion, such as weather, pedestrians, and road conditions. Furthermore, designing a more efficient algorithm while accounting for the time complexity of the segment aggregation algorithm is an intriguing topic.
In [140], the authors proposed a real-time, data-driven queue length prediction technique based on DL. They considered a connected corridor along which information is transmitted from vehicle detectors (placed at the intersection) to successive intersections. The queue length at an intersection in the next cycle was assumed to be determined by the queue lengths at the target intersection and at two upstream intersections in the current cycle. Data from the adaptive traffic control system InSync were used to train an LSTM neural network model that extracts time-dependent patterns of a signal queue. To reduce overfitting and to select the optimal hyperparameter combinations, the authors used a Sequential Model-Based Optimization (SMBO) technique to determine the appropriate dropout in the different stacked layers. For this investigation, they obtained adaptive traffic light data from InSync between December 18, 2017, and February 14, 2018. The data were collected on the Alafaya Trail (SR 434) in East Orlando, FL, from Lake Waterford to the McCulloch Road intersection, a corridor that includes 11 intersections. The InSync database provides two types of data: (1) Turning Movement Counts (TMC), that is, the number of vehicles per stage and lane per 15 min; and (2) historical data with details of each movement, including time, duration, queue, and waiting time for each stage. Because of the lack of data sources, it was not possible to obtain information about vehicle movements in different directions at high temporal resolution (30–60 s). If this information becomes available, the performance of the model may improve further.
In [141], the authors presented an Attention-Based Multi-Task Learning (AST-MTL) model for predicting multi-horizon traffic flow and velocity at the road network scale. To learn related tasks while improving generalization performance, this approach integrates a fully connected neural network (FNN) with a multi-headed attention mechanism. To extract the spatiotemporal aspects of traffic states, the model incorporates graph convolutional networks (GCNs) and GRUs. The FNN begins by collecting and analyzing several related functions to derive a common representation. To extract relevant information and strengthen the model's predictive performance, the attention mechanism also considers task-specific and shared representations. The experiments used new sets of GPS data, called On Board Unit (OBU) data, to forecast traffic in highway and urban contexts. A limitation of this study is the difficulty of finding the right strategy for explicitly optimizing task learning.
In [142], the authors proposed feature-injected RNNs (FI-RNNs), which combine temporal-sequential data with contextual elements to extract the potential correlation between traffic context and state. In this model, a stacked RNN was used to learn features of the traffic data sequence. Meanwhile, a sparse autoencoder was trained to expand the contextual features, which are high-level abstract representations and encodings of contextual elements. Subsequently, a fusion technique was developed that injects contextual information into the sequence features to produce fusion features. Finally, the new fused features are sent to the forecaster to learn traffic patterns and estimate future speed. The accuracy and performance of the proposed model should be improved by investigating more feature extraction and merging techniques. The examination of other influencing factors is also needed.
In [143], a traffic situational awareness array technology was developed, which takes advantage of various base models. In that approach, a graph convolution was applied to a network of traffic detectors to extract the spatial patterns encoded in the traffic flow. The extracted features were then used to build a weight matrix that aggregates the predictions of the base models according to their performance under a given condition. Traffic flow data obtained from Caltrans PeMS were used as the data set for this study. The main observation in this study was the need to improve the network structure and parameter options.
In [144], a traffic congestion model was proposed to predict the traffic of neighborhoods within an area using a DL model. The model is based on the LSTM and GraphCNN architectures. It predicts the degree of crowding, defined as the ratio of vehicle accumulation within a neighborhood to the trip completion rate. An abbreviated version of the San Francisco Bay Area Highway Network was used as the data set for this study.
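The crowding measure defined above is a simple ratio, which can be written down directly. The function name, units, and numbers below are illustrative assumptions; with accumulation in vehicles and completion rate in vehicles per hour, the ratio has the dimension of time.

```python
def crowding_degree(accumulation, completion_rate):
    """Ratio of vehicle accumulation in a neighborhood to its trip completion rate."""
    if completion_rate <= 0:
        raise ValueError("trip completion rate must be positive")
    return accumulation / completion_rate

# Hypothetical neighborhood: 120 vehicles present, 60 trips completed per hour.
degree = crowding_degree(accumulation=120, completion_rate=60)
```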
In [145], a strengthened Bayesian Combination Model (BCM) with DL (IBCMDL) for traffic flow prediction was presented to tackle the error amplification phenomenon of classical summation methods and to improve prediction performance. The revised model was built on the BCM framework proposed by Wang [146]. Real-world traffic data obtained by microwave sensors placed on highways in Beijing, China, provided the data set for this study. Additional information, such as weather conditions, traffic accidents, speed, and occupancy, should be included to enhance the model's reliability.
In [147], the authors addressed the complexity of predicting urban traffic when floating car data (FCD) are available. Four DL methods were compared to highlight the ability of a neural network approach (recurrent and/or convolutional) to handle traffic prediction in an urban context. In particular, the authors investigated two RNN approaches (LSTM and GRU), as well as the spatiotemporal RCN (SRCN) model and the High-Order Graph Convolutional LSTM Neural Network (HGCLSTM). To generate the basic FCD inputs, the proposed solutions use a traffic simulation approach. The original FCD was created with Aimsun (2018), a microscopic traffic simulator that simulates each vehicle's interactions and collects data from each vehicle individually. At each preset period, a record (vehicle ID, speed, section, and lane) is collected from the simulation for each associated vehicle. The collection period was 10 s. The authors evaluated the prediction models on two distinct urban traffic networks in Spain: Camp Nou, a small area of Barcelona with 4 nodes and 22 sections, and Amara, a district of San Sebastian with 105 nodes and 192 sections. The experiments showed that these methods can estimate traffic speeds with good performance. Specifically, the recurrent algorithms (LSTM and GRU) produce fewer errors than the convolutional ones (SRCN and HGCLSTM). On the other hand, FCD can sometimes be insufficient to cover all sections of the network, and ML prediction of a variable without any historical data is meaningless.
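To make the FCD pre-processing concrete, the following minimal sketch aggregates per-vehicle records into a mean speed per section and 10 s interval. The record fields follow the description above; the added timestamp field, the function name, and the sample values are assumptions made for illustration.

```python
from collections import defaultdict

def mean_speed_per_section(records, period_s=10):
    """Average speed per (interval index, section).

    records: iterable of (timestamp_s, vehicle_id, speed, section, lane).
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [speed sum, record count]
    for timestamp, _vehicle, speed, section, _lane in records:
        key = (int(timestamp // period_s), section)
        sums[key][0] += speed
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Hypothetical records for one section.
records = [
    (0, "a", 10.0, "s1", 1),
    (5, "b", 20.0, "s1", 1),
    (12, "a", 30.0, "s1", 1),
]
speeds = mean_speed_per_section(records)
```

Sections with no record in an interval simply do not appear in the result, which mirrors the coverage problem noted above.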
In [148], the authors proposed deep artificial neural network (Deep ANN) and CNN traffic speed prediction models for upstream highway segments, including those on connected highways, under work zone conditions. The proposed models can recognize congestion on the associated links as well as the upstream mainline segments. They predict traffic velocity under work zone conditions based on the volume of traffic approaching the work zone, speed during normal conditions, work zone capacity, distance from the work zone, the vertical gradient of the road, downstream traffic volume, and type of highway section. The proposed models used dropout regularization to address ANN overfitting. The CNN model for predicting traffic velocity under work zone conditions should be improved in several respects. Discovering additional sources to update the traffic volume so that it reflects the real traffic volume would enhance the accuracy of the CNN model. Furthermore, using a simulation model to predict the work zone capacity could improve the CNN model. Automating databases via warehouses would facilitate the analysis of data for new goods and developments. Additionally, given the availability of high-resolution data, the model could be modified to anticipate traffic congestion in the opposite direction of traffic.
In [149], the authors proposed (1) an efficient, citywide data acquisition scheme that takes snapshots of the Seoul Transport Operation and Information Service (TOPIS), an open-source, web-based traffic congestion map service, and (2) a hybrid neural network architecture that integrates CNN, LSTM, and Transpose-CNN components to extract spatiotemporal information from the input images and predict network congestion. In the proposed design, an LSTM network was inserted between the convolutional encoder and the convolutional decoder. The convolutional encoder first converts the input image sequence into low-resolution latent state sequences. The LSTM network then learns the temporal representation of these sequences, and the convolutional decoder finally converts the latent state back to the original resolution. To further enhance forecast accuracy, external factors such as weather information (rain, snow, and fog) must be addressed. Moreover, the performance of the proposed model should be enhanced, and information from more data sources must be added to obtain more accurate forecasts.
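The encoder-decoder symmetry described above can be checked with the standard output-size formulas for a convolution and a transposed convolution (dilation 1). The sketch below is only shape arithmetic; the 64-pixel input, kernel size 4, stride 2, and padding 1 are assumed values and are not reported in [149].

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def transposed_conv_out(size, kernel, stride, padding=0, output_padding=0):
    """Spatial output size of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# Encoder: two stride-2 convolutions shrink an assumed 64-pixel map to a 16-pixel latent state.
latent = 64
for _ in range(2):
    latent = conv_out(latent, kernel=4, stride=2, padding=1)

# Decoder: two matching transposed convolutions restore the original resolution.
restored = latent
for _ in range(2):
    restored = transposed_conv_out(restored, kernel=4, stride=2, padding=1)
```

The LSTM sits between the two stages and operates on the sequence of low-resolution latent states, so its cost does not grow with the full image size.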
In [150], the authors suggested an LSTM-based traffic congestion prediction technique that corrects missing temporal and spatial information. Before making predictions, the technique performs preprocessing that consists of outlier removal using the average absolute deviation of the traffic data and correction of spatiotemporal values using temporal and geographic trends and pattern data. Because data with time-series characteristics are otherwise not learned effectively, the technique uses an LSTM model to learn the time-series data. The precision of congestion forecasts in low-speed areas and urban areas should be enhanced. Moreover, the authors need to build a model with improved user performance.
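The two preprocessing steps can be illustrated as follows. This is a minimal sketch under assumptions: the threshold factor `k`, the use of linear interpolation as the temporal correction, and the sample speeds are all hypothetical, and [150] additionally uses spatial (geographic) trends that are not modelled here.

```python
def remove_outliers(values, k=2.0):
    """Replace values further than k average absolute deviations from the mean with None."""
    mean = sum(values) / len(values)
    aad = sum(abs(v - mean) for v in values) / len(values)
    return [v if abs(v - mean) <= k * aad else None for v in values]

def fill_gaps(values):
    """Fill None entries by linear interpolation between the nearest valid neighbours."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no valid observations to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:            # gap at the start: copy the next valid value
            filled[i] = values[right]
        elif right is None:         # gap at the end: copy the previous valid value
            filled[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled

speeds = [50.0, 52.0, 51.0, 250.0, 49.0, 50.0]  # assumed speeds with one sensor glitch
cleaned = remove_outliers(speeds)
repaired = fill_gaps(cleaned)
```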
In [151], the authors suggested a deep and embedding learning (DELA) technique that can explicitly learn from detailed traffic information, road structure, and weather conditions. The original highway traffic data set contained traffic flow information for approximately 3 months (from July 19, 2016, to October 17, 2016) and was formally provided by the Knowledge Discovery and Data Mining Tools Competition (KDD CUP 2017). The selected DL models have poor explanatory power, and the embedding component has limited learning ability.
In [119], an innovative and comprehensive technique for large-scale, faster, and real-time traffic forecasting was suggested. It integrates four complementary advanced technologies: big data, DL, in-memory computing, and graphics processing units (GPUs). Deep networks were trained on more than 11 years of data provided by the California Department of Transportation (Caltrans) [152]. The suggested approach has poor prediction accuracy, in addition to the use of a small data set.
In [153], the authors developed a DL-based traffic prediction approach built on the LSTM model that aims to minimize prediction error. Real-world big traffic data from the Performance Measurement System (PeMS) were used as the data set for this research. The number of parameters optimized in this study needs to be expanded, and the model training time needs to be controlled.
In [154], a path-based DL framework was presented. It can provide superior traffic velocity forecasts on a citywide scale. Furthermore, the model was reasonable and interpretable in the urban transportation context. The study area was a road network consisting of 112 road sections. The data set was obtained from Automated Vehicle Identification (AVI) detectors in the core area of Xuancheng, China. Further essential path selection criteria remain to be investigated. Raising the interpretability of a DL model for a transport application is also an open topic.
In [155], the level of traffic congestion was forecast using refined GPS trajectory data. A hidden Markov model was used to match the GPS trajectory data to the road network. The actual speed of road segments can then be calculated from GPS trajectory data at nearby locations. To predict congestion levels, four DL approaches (CNN, RNN, LSTM, and GRU) and three classical ML models (ARIMA, SVR, and ridge regression) were used. The study highlighted some limitations. First, the GPS trajectory data collected were insufficient, so more GPS data must be considered. Second, the structure of the CNN could be altered to improve model performance.
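Once GPS points have been matched to a road segment, the segment speed follows from distance over time. The sketch below assumes the map matching has already been done and uses the haversine great-circle distance; the point format, function names, and coordinates are illustrative and are not taken from [155].

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def segment_speed_kmh(points):
    """Mean speed along time-ordered (timestamp_s, lat, lon) points matched to one segment."""
    distance = sum(
        haversine_m(p[1], p[2], q[1], q[2]) for p, q in zip(points, points[1:])
    )
    elapsed = points[-1][0] - points[0][0]
    if elapsed <= 0:
        raise ValueError("points must span a positive time interval")
    return distance / elapsed * 3.6

# Hypothetical trajectory: 0.001 degrees of latitude (about 111 m) covered in 10 s.
speed = segment_speed_kmh([(0, 30.000, 118.75), (10, 30.001, 118.75)])
```

Speeds computed this way could then be binned into congestion levels that serve as labels for the DL and classical ML models compared in the study.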
In [156], the authors proposed a spatiotemporal model for the short-term prediction of the congestion level on each road segment (CPMConvLSTM). The model was developed on a spatial matrix that includes both the congestion propagation pattern and the spatial correlation between road segments. The traffic data set was obtained from Helsinki, Finland. The authors applied the recently popularized ConvLSTM DL model, using the time series of historical spatiotemporal matrices as input and predicting the future short-range spatiotemporal matrix. To enhance forecasting performance, the authors need to incorporate external parameters, such as points of interest, weather, and the surrounding environment.
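Feeding a time series of matrices to a ConvLSTM-style model requires slicing the history into input-target pairs. The following minimal sketch shows one way to do this; the function name, the window length, and the placeholder frames are assumptions and do not reflect the settings used in [156].

```python
def make_windows(frames, history, horizon=1):
    """Pair each run of `history` consecutive frames with the frame `horizon` steps later."""
    samples = []
    for start in range(len(frames) - history - horizon + 1):
        inputs = frames[start:start + history]
        target = frames[start + history + horizon - 1]
        samples.append((inputs, target))
    return samples

# Placeholder frames: each integer stands in for one spatiotemporal matrix.
frames = [0, 1, 2, 3, 4, 5]
samples = make_windows(frames, history=3)
```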
In [157], the authors created a DL-based methodology for directly forecasting traffic status from a time–space diagram using a CNN. The time–space diagram is fed directly into the CNN-based forecasting model. This technique has three significant benefits: (1) it allows the time–space diagram to be used as the input with no need for abstraction or aggregation; (2) its learning mechanism focuses on the key features of the time–space diagram required for effective forecasting, features that strongly affect the dynamic behavior of traffic flow and vehicle interactions and may therefore influence future traffic conditions; and (3) it addresses the limited transferability of nonparametric models, which yield location-specific solutions that must be recalibrated for another location. Compared with existing nonparametric models, namely SVR, MLP, and ARIMA, the suggested CNN model generalized better in traffic state prediction across different regions of the main diagram. The CNN model was trained using simulated data and a real-world data set (NGSIM US101). However, this study did not examine the effects of lane changes on the dynamic behavior of traffic flow and on prediction accuracy.
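A time–space diagram can be represented as a grid whose cells hold the mean speed observed in a time bin and a road-position bin. The sketch below builds such a grid from trajectory samples; the sample format, bin sizes, and values are assumptions for illustration rather than the configuration used in [157].

```python
def time_space_diagram(samples, t_bins, x_bins, dt, dx):
    """Mean speed per (time bin, position bin); empty cells are None.

    samples: iterable of (time_s, position_m, speed).
    """
    sums = [[0.0] * x_bins for _ in range(t_bins)]
    counts = [[0] * x_bins for _ in range(t_bins)]
    for time_s, position_m, speed in samples:
        i, j = int(time_s // dt), int(position_m // dx)
        if 0 <= i < t_bins and 0 <= j < x_bins:
            sums[i][j] += speed
            counts[i][j] += 1
    return [
        [sums[i][j] / counts[i][j] if counts[i][j] else None for j in range(x_bins)]
        for i in range(t_bins)
    ]

# Hypothetical vehicle samples on a 200 m stretch observed for 10 s.
samples = [(0, 10, 20.0), (1, 15, 30.0), (6, 120, 10.0)]
grid = time_space_diagram(samples, t_bins=2, x_bins=2, dt=5, dx=100)
```

Before such a grid is passed to a CNN, the empty cells would need a fill value or a mask; that choice is left open here.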
In [158], a new method based on a fuzzy CNN (FCNN) was proposed to predict traffic flow more accurately. This method, which feeds uncertain information about traffic accidents into a CNN for the first time, uses a fuzzy approach to represent traffic accident features. First, to extract the spatiotemporal features of the traffic flow data, the study divided the whole region into 32 × 32 small blocks and created three direction sequences with inward and outward flow types. Second, the uncertain traffic accident information was derived from the real traffic flow data by applying a fuzzy inference mechanism. The FCNN model was then trained on the trend sequence information, the unconfirmed traffic accident information, and the external information. Moreover, pre-training and tuning procedures were designed to learn the FCNN parameters efficiently. Finally, real Beijing taxi route and meteorological data sets were used to verify that the proposed method outperforms the latest methods. The authors need to explore additional influential aspects in traffic flow forecasting and use more efficient DL models.
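The grid-based inflow and outflow counts can be sketched as follows. This is a minimal illustration under assumptions: trajectories are given as planar (x, y) points at consecutive time steps, a vehicle contributes one outflow and one inflow whenever it changes cell, and the region size and sample trajectory are hypothetical. The fuzzy accident features of [158] are not modelled.

```python
from collections import Counter

GRID = 32  # the region is divided into GRID x GRID blocks

def cell_of(x, y, width, height, grid=GRID):
    """Map a point inside a width x height region to its (row, col) block."""
    col = min(int(x / width * grid), grid - 1)
    row = min(int(y / height * grid), grid - 1)
    return row, col

def flows(trajectories, width, height, grid=GRID):
    """Count inward and outward movements per block over consecutive trajectory points."""
    inflow, outflow = Counter(), Counter()
    for trajectory in trajectories:
        cells = [cell_of(x, y, width, height, grid) for x, y in trajectory]
        for previous, current in zip(cells, cells[1:]):
            if previous != current:
                outflow[previous] += 1
                inflow[current] += 1
    return inflow, outflow

# Hypothetical taxi trajectory in a 32 x 32 unit region (one unit per block).
inflow, outflow = flows([[(0.5, 0.5), (1.5, 0.5), (1.6, 0.5)]], width=32, height=32)
```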
In [159], a model for short-term traffic forecasting was proposed that combines spatiotemporal analysis with the GRU. The model first applies temporal and spatial correlation analyses to the aggregated traffic flow data and then uses a spatiotemporal feature selection algorithm to determine the ideal input time window and spatial data size. The required traffic flow information is then extracted from the actual traffic flow data and converted into a two-dimensional matrix containing spatiotemporal traffic flow information. Finally, the GRU was used to analyze the spatiotemporal features of this traffic flow matrix to produce the prediction. This work has some limitations: other factors (for example, weather conditions) are not included, and traffic flow is predicted only for a specific section of the road.
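One simple way to realize the correlation-based window selection is to keep adding lags while the lagged autocorrelation stays above a threshold, and then to stack the selected window for each detector into a two-dimensional input. The sketch below is an assumption-laden illustration: the threshold of 0.5, the stopping rule, and the helper names are hypothetical and are not the selection algorithm of [159].

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def select_window(series, max_lag, threshold=0.5):
    """Largest lag L such that every lag 1..L has autocorrelation >= threshold."""
    window = 0
    for lag in range(1, max_lag + 1):
        if pearson(series[:-lag], series[lag:]) < threshold:
            break
        window = lag
    return window

def build_input(flows, t, window):
    """Rows are detectors, columns are the `window` observations before time index t."""
    return [series[t - window:t] for series in flows.values()]

trend_window = select_window(list(range(1, 11)), max_lag=3)  # steadily rising flow
noisy_window = select_window([1, -1] * 5, max_lag=3)         # alternating signal
matrix = build_input({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}, t=3, window=2)
```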
To summarize all previous related works, Table 3 compares them in terms of methodology, data set, approaches, and their main findings.
Challenges
Traffic flows must be carefully anticipated and predicted because of the risks posed by traffic congestion, particularly in populated areas. As a result, realistic and efficient road traffic prediction techniques are required.
The research gap in traffic flow forecasting addressed in this survey includes a lack of computationally efficient methodologies and algorithms. Furthermore, high-quality training data are limited. Because matched city traffic flow statistics were used, network models were trained on non-exhaustive data. These characteristics were found to constrain the development of traffic flow prediction using ML and DL approaches.
Because of the complicated link features between road sections and traffic congestion patterns or congested areas, a further gap arises from the underutilization of dynamically acquired spatiotemporal correlations in DL. Furthermore, a lack of computing power and distributed storage constrains traffic forecasts. A future study should investigate this issue.
The current study has several limitations, including being limited to the approaches and algorithms covered by the articles investigated. Other strategies that were not addressed in this study could exist. Future research should focus on widely used DL techniques (CNN and LSTM), which are thoroughly covered in the literature review. This is possible by using traffic data collected in various local urban areas to provide broader data patterns for model training. As a result, traffic forecasting in small cities will improve, as will the accuracy of the ML and DL algorithms used to predict traffic flow. The researchers' biggest challenge will be collaborating with local urban authorities to contribute the required volume of big data. The rules and regulations for sharing traffic data with local municipal governments will be another impediment.
The installation of sensors to collect traffic data for training ML and DL may result in connected IoT settings that increase cybersecurity risks. A framework should be developed to address the cybersecurity issues of ITS in smart cities. This leaves plenty of room for future investigation.
Conclusion
This study aimed to present a comprehensive review of the most significant ML and DL techniques used in traffic forecasting, as well as the problems associated with using ML and DL in traffic forecasting. A total of 40 articles were chosen and thoroughly reviewed after a rigorous selection process. As the preceding discussion shows, traffic forecasting is an important task in the transportation industry because of its significant influence on road construction, route planning, and traffic rules. This work advances research in the field of traffic flow forecasting using ML and DL approaches and contributes to the literature and future studies by serving as a resource for other academics and researchers.
Availability of data and materials
Not applicable.
References
Nellore K, Hancke G (2016) A survey on urban traffic management system using wireless sensor networks. Sensors 16:157
Patel P, Narmawala Z, Thakkar A (2019) A survey on intelligent transportation system using internet of things. In: Emerging research in computing, information, communication and applications, pp 231–240
An S, Lee BH, Shin DR (2011) A survey of intelligent transportation systems. In: 2011 third international conference on computational intelligence, communication systems and networks
Qureshi KN, Abdullah AH (2013) A survey on intelligent transportation systems. Middle-East J Sci Res 15:629–642
Chen C, Li K, Teo SG, Zou X, Li K, Zeng Z (2020) Citywide traffic flow prediction based on multiple gated spatio-temporal convolutional neural networks. ACM Trans Knowl Discov Data (TKDD) 14(4):1–23
Sun P, Boukerche A, Tao Y (2020) SSGRU: a novel hybrid stacked GRU-based traffic volume prediction approach in a road network. Comput Commun 160:502–511
Makaba T, Doorsamy W, Paul BS (2020) Exploratory framework for analyzing road traffic accident data with validation on Gauteng province data. Cogent Engineering 7(1):1834659
World Health Organization (2018) Global status report on road safety 2018 summary. https://apps.who.int/iris/bitstream/handle/10665/277370/WHONMHNVI18.20eng.pdf?ua=1
Bengio Y (2009) Learning deep architectures for AI. Now Publishers Inc.
Van Der Voort M, Dougherty M, Watson S (1996) Combining Kohonen maps with ARIMA time series models to forecast traffic flow. Transp Res Part C Emerg Technol 4(5):307–318
Lee S, Fambro DB (1999) Application of subset autoregressive integrated moving average model for short-term freeway traffic volume forecasting. Transp Res Rec J Transp Res Board 1678(1):179–188
Williams BM (2001) Multivariate vehicular traffic flow prediction: evaluation of ARIMAX modeling. Transp Res Rec J Transp Res Board 1776(1):194–200
Williams BM, Hoel LA (2003) Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: theoretical basis and empirical results. J Transp Eng 129(6):664–672
Chen K, Chen F, Lai B, Jin Z, Liu Y, Li K, Wei L, Wang P, Tang Y, Huang J, Hua X (2020) Dynamic spatio-temporal graph-based CNNs for traffic flow prediction. IEEE Access 8:185136–185145
Kashyap AA, Raviraj S, Devarakonda A, Nayak KSR, Santhosh KV, Bhat SJ (2022) Traffic flow prediction models—a review of deep learning techniques. Cogent Eng 9(1):2010510
Smith BL, Demetsky MJ (1994) Short-term traffic flow prediction: neural network approach. Transp Res Rec 98–104
Simonyan K, Zisserman A (2015) Very deep convolutional networks for large-scale image recognition. In: International conference on learning representations, May 7–9, 2015, San Diego, USA. https://arxiv.org/abs/1409.1556
Graves A, Mohamed AR, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: International conference on acoustics, speech and signal processing, 26–31 May 2013, Vancouver, Canada
Sainath TN, Vinyals O, Senior A, Sak H (2015) Convolutional, long short-term memory, fully connected deep neural networks. In: International conference on acoustics, speech and signal processing, 19–24 April 2015, South Brisbane, Australia
Goodfellow IJ, Mirza M, Courville A, Bengio Y (2013) Multi-prediction deep Boltzmann machines. In: Proceedings of the 26th international conference on neural information processing systems, Lake Tahoe, USA
Sarikaya R, Hinton GE, Deoras A (2014) Application of deep belief networks for natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22(4):778–784
Gehring J, Miao Y, Metze F, Waibel A (2013) Extracting deep bottleneck features using stacked autoencoders. In: International conference on acoustics, speech and signal processing, 26–31 May 2013. IEEE, Vancouver, Canada
Zhang J, Wang FY, Wang K, Lin WH, Xu X, Chen C (2011) Data-driven intelligent transportation systems: a survey. IEEE Trans Intell Transp Syst 12(4):1624–1639
Chowdary GJ (2021) Machine learning and deep learning methods for building intelligent systems in medicine and drug discovery: a comprehensive survey. arXiv preprint arXiv:2107.14037.
Singh G, Al’Aref SJ, Van Assen M, Kim TS, van Rosendael A, Kolli KK, Dwivedi A, Maliakal G, Pandey M, Wang J, Do V, Gummalla M, De Cecco CN, Min JK (2018) Machine learning in cardiac CT: basic concepts and contemporary data. J Cardiovasc Comput Tomogr 12(3):192–201
Ahsan MM, Luna SA, Siddique Z (2022) Machine-learning-based disease diagnosis: a comprehensive review. Healthcare 10(3):541
Dey A (2016) Machine learning algorithms: a review. Int J Comput Sci Inf Technol 7(3):1174–1179
Dhall D, Kaur R, Juneja M (2020) Machine learning: a review of the algorithms and its applications. Proceedings of ICRIC 2019:47–63
Kotsiantis SB, Zaharakis I, Pintelas P (2007) Supervised machine learning: a review of classification techniques. Emerg Artif Intell Appl Comput Eng 160(1):3–24
Veropoulos K, Campbell C, Cristianini N (1999) Controlling the sensitivity of support vector machines. In: Proceedings of the international joint conference on artificial intelligence (IJCAI99)
Taunk K, De S, Verma S, Swetapadma A (2019) A brief review of nearest neighbor algorithm for learning and classification. In: 2019 international conference on intelligent computing and control systems (ICCS)
Obulesu O, Mahendra M, ThrilokReddy M (2018) Machine learning techniques and tools: a survey. In: 2018 international conference on inventive research in computing applications (ICIRCA). IEEE, pp 605–611
Ray S (2019) A quick review of machine learning algorithms. In: 2019 International conference on machine learning, big data, cloud and parallel computing (COMITCon). IEEE, pp 35–39
Kumar R, Verma R (2012) Classification algorithms for data mining: a survey. Int J Innov Eng Technol 1(2):7–14
Nikam SS (2015) A comparative study of classification techniques in data mining algorithms. Orient J Comput Sci Technol 8(1):13–19
Stein G, Chen B, Wu AS, Hua KA (2005) Decision tree classifier for network intrusion detection with GA-based feature selection. In: Proceedings of the 43rd annual Southeast regional conference, vol 2, pp 136–141
Damanik IS, Windarto AP, Wanto A, Andani SR, Saputra W (2019) Decision tree optimization in C4.5 algorithm using genetic algorithm. J Phys Conf Ser 1255(1):012012
Gavankar SS, Sawarkar SD (2017) Eager decision tree. In: 2017 2nd international conference for convergence in technology (I2CT), Mumbai, April 2017, pp 837–840
Mahesh B (2020) Machine learning algorithms—a review. Int J Sci Res 9:381–386
Janikow CZ (1998) Fuzzy decision trees: issues and methods. IEEE Trans Syst Man Cybern Part B 28(1):1–14
Charbuty B, Abdulazeez A (2021) Classification based on decision tree algorithm for machine learning. J Appl Sci Technol Trends 2(01):20–28
Zhao Y, Zhang Y (2008) Comparison of decision tree methods for finding active objects. Adv Space Res 41(12):1955–1959
Mittal K, Khanduja D, Tewari PC (2017) An insight into ‘decision tree analysis.’ World Wide J Multidiscip Res Dev 3(12):111–115
Priyanka, Kumar D (2020) Decision tree classifier: a detailed survey. Int J Inf Decis Sci 12(3):246–269
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Pal M (2005) Random forest classifier for remote sensing classification. Int J Remote Sens 26(1):217–222
Cutler DR, Edwards TC Jr, Beard KH, Cutler A, Hess KT, Gibson J, Lawler JJ (2007) Random forests for classification in ecology. Ecology 88(11):2783–2792
Belgiu M, Drăguţ L (2016) Random forest in remote sensing: a review of applications and future directions. ISPRS J Photogramm Remote Sens 114:24–31
He Y, Lee E, Warner TA (2017) A time series of annual land use and land cover maps of China from 1982 to 2013 generated using AVHRR GIMMS NDVI3g data. Remote Sens Environ 199:201–217
Maxwell AE, Warner TA, Fang F (2018) Implementation of machine-learning classification in remote sensing: an applied review. Int J Remote Sens 39(9):2784–2817
Breiman L (2001) Random forests. Mach Learn 45:5–32
Amit Y, Geman D (1997) Shape quantization and recognition with randomized trees. Neural Comput 9(7):1545–1588
Ho TK (1995) Random decision forests. In: 3rd international conference on document analysis and recognition—volume 1 (ICDAR’95). IEEE Computer Society, pp 278–282
Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844
Resende PAA, Drummond AC (2018) A survey of random forest-based methods for intrusion detection systems. ACM Comput Surv 51(3):1–36
Oladipo ID, AbdulRaheem M, Awotunde JB, Bhoi AK, Adeniyi EA, Abiodun MK (2022) Machine learning and deep learning algorithms for smart cities: a start-of-the-art review. In: IoT and IoE driven smart cities, pp 143–162
Dormann CF, Elith J, Bacher S, Buchmann C, Carl G, Carré G, Marquéz JRG, Gruber B, Lafourcade B, Leitão PJ, Münkemüller T, McClean C, Osborne PE, Reineking B, Schröder B, Skidmore AK, Zurell D, Lautenbach S (2013) Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography 36(1):27–46
Harrington P (2012) Machine Learning in action. Manning Publications Co., Shelter Island
McLachlan GJ (2005) Discriminant analysis and statistical pattern recognition. Wiley
Jolliffe IT (1986) Principal component analysis. Springer-Verlag, New York
Gow J, Baumgarten R, Cairns P, Colton S, Miller P (2012) Unsupervised modeling of player style with LDA. IEEE Trans Comput Intell AI Games 4(3):152–166
Coronato A, Naeem M, De Pietro G, Paragliola G (2020) Reinforcement learning for intelligent healthcare applications: a survey. Artif Intell Med 109:101964
Watkins CJCH, Dayan P (1992) Technical note: Q-learning. Mach Learn 8(3):279–292
Watkins CJCH (1989) Learning from delayed rewards. Ph.D. Thesis, University of Cambridge, England
Achille A, Soatto S (2018) Information dropout: Learning optimal representations through noisy computation. IEEE Trans Pattern Anal Mach Intell 40:2897–2905
Williams G, Wagener N, Goldfain B, Drews P, Rehg JM, Boots B, Theodorou EA (2017) Information-theoretic MPC for model-based reinforcement learning. In: IEEE international conference on robotics and automation (ICRA), pp 1714–1721
Wilkes JT, Gallistel CR (2017) Information theory, memory, prediction, and timing in associative learning. In: Computational models of brain and behavior, pp 481–492
Jang B, Kim M, Harerimana G, Kim JW (2019) Q-learning algorithms: a comprehensive classification and applications. IEEE Access 7:133653–133667
An Y, Wang Y, Meng H (2017) Multi-task deep learning for user intention understanding in speech interaction systems
Shi X, Gao Z, Lausen L, Wang H, Yeung DY, Wong WK, Woo WC (2017) Deep learning for precipitation nowcasting: a benchmark and a new model. In: Advances in neural information processing systems, pp 5622–5632
Juang CF, Lu CM (2009) Ant colony optimization incorporated with fuzzy Q-learning for reinforcement fuzzy control. IEEE Trans Syst Man Cybern Part A Syst Hum 39(3):597–608
Świechowski M, Godlewski K, Sawicki B, Mańdziuk J (2021) Monte Carlo tree search: a review of recent modifications and applications. arXiv preprint arXiv:2103.04931
Lizotte DJ, Laber EB (2016) Multi-objective Markov decision processes for data-driven decision support. J Mach Learn Res 17(211):1–28
Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M et al (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489
Browne CB, Powley E, Whitehouse D, Lucas SM, Cowling PI, Rohlfshagen P, Tavener S, Perez D, Samothrakis S, Colton S (2012) A survey of Monte Carlo tree search methods. IEEE Trans Comput Intell AI Games 4(1):1–43
Baier H, Drake PD (2010) The power of forgetting: improving the last-good-reply policy in Monte Carlo Go. IEEE Trans Comput Intell AI Games 2(4):303–309
Alpaydin E (2020) Introduction to machine learning. MIT Press
Mikolov T et al (2013) Efficient estimation of word representations in vector space
Nguyen G, Dlugolinsky S, Bobák M, Tran V, García ÁL, Heredia I et al (2019) Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artif Intell Rev 52(1):77–124
Aggour KS, Gupta VK, Ruscitto D, Ajdelsztajn L, Bian X, Brosnan KH et al (2019) Artificial intelligence/machine learning in manufacturing and inspection: a GE perspective. MRS Bull 44(7):545–558
Khan FN, Fan Q, Lu C, Lau APT (2020) Machine learning methods for optical communication systems and networks. Optical fiber telecommunications VII. Academic Press, New York, pp 921–978
Pouyanfar S, Sadiq S, Yan Y, Tian H, Tao Y, Reyes MP et al (2018) A survey on deep learning: algorithms, techniques, and applications. ACM Comput Surv (CSUR) 51(5):1–36
Dargan S, Kumar M, Ayyagari MR, Kumar G (2019) A survey of deep learning and its applications: a new paradigm to machine learning. Arch Comput Methods Eng 27(4):1–22
Lauzon FQ (2012) An introduction to deep learning. In: 2012 11th international conference on information science, signal processing and their applications (ISSPA). IEEE, pp 1438–1439
Ling ZH, Kang SY, Zen H, Senior A, Schuster M, Qian XJ, Meng HM, Deng L (2015) Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends. IEEE Signal Process Mag 32(3):35–52
Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
Yu D, Deng L (2010) Deep learning and its applications to signal and information processing [exploratory dsp]. IEEE Signal Process Mag 28(1):145–154
Yap MH, Pons G, Marti J, Ganau S, Sentis M, Zwiggelaar R, Davison AK, Marti R (2017) Automated breast ultrasound lesions detection using convolutional neural networks. IEEE J Biomed Health Inform 22(4):1218–1226
Fukushima K (1988) Neocognitron: a hierarchical neural network capable of visual pattern recognition. Neural Netw 1:119–130
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86:2278–2324
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge
Alom MZ, Taha TM, Yakopcic C, Westberg S, Sidike P, Nasrin MS, Hasan M, Van Essen BC, Awwal AA, Asari VK (2019) A state-of-the-art survey on deep learning theory and architectures. Electronics 8(3):292
Apaydin H, Feizi H, Sattari MT, Colak MS, Shamshirband S, Chau KW (2020) Comparative analysis of recurrent neural network architectures for reservoir inflow forecasting. Water 12(5):1500
Graves A, Mohamed AR, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: 2013 IEEE international conference on acoustics, speech and signal processing, pp 6645–6649
Baturdinler Ö, Aydin N (2020) An optimal feature parameter set based on gated recurrent unit recurrent neural networks for speech segment detection. Appl Sci 10(4):1273
Jagannatha AN, Yu H (2016) Structured prediction models for RNN-based sequence labeling in clinical text. In: Proceedings of the conference on empirical methods in natural language processing, vol 2016. NIH Public Access, p 856
Alzubaidi L, Zhang J, Humaidi AJ, Al-Dujaili A, Duan Y, Al-Shamma O, Santamaría J, Fadhel MA, Al-Amidie M, Farhan L (2021) Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data 8(1):1–74
Pascanu R, Gulcehre C, Cho K, Bengio Y (2013) How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp 249–256
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Smagulova K, James AP (2019) A survey on LSTM memristive neural network architectures and applications. Eur Phys J Spec Top 228(10):2313–2324
Setyanto A, Laksito A, Alarfaj F, Alreshoodi M, Oyong I, Hayaty M, Alomair A, Almusallam N, Kurniasari L (2022) Arabic language opinion mining based on long short-term memory (LSTM). Appl Sci 12(9):4140
Gers FA, Schmidhuber J, Cummins F (2000) Learning to forget: Continual prediction with LSTM. Neural Comput 12(10):2451–2471
Lindemann B, Müller T, Vietz H, Jazdi N, Weyrich M (2021) A survey on long short-term memory networks for time series prediction. Procedia CIRP 99:650–655
Cui Z, Ke R, Pu Z, Wang Y (2018) Deep bidirectional and unidirectional LSTM recurrent neural network for network-wide traffic speed prediction. arXiv preprint arXiv:1801.02143
Villegas R, Yang J, Zou Y, Sohn S, Lin X, Lee H (2017) Learning to generate long-term future via hierarchical prediction. In: International conference on machine learning. PMLR, pp 3560–3569
Chu KF, Lam AY, Li VO (2019) Deep multi-scale convolutional LSTM network for travel demand and origin-destination predictions. IEEE Trans Intell Transp Syst 21(8):3219–3232
Gensler A, Henze J, Sick B, Raabe N (2016) Deep Learning for solar power forecasting—an approach using AutoEncoder and LSTM Neural Networks. In: 2016 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, pp 002858–002865
Hsu D (2017) Multi-period time series modeling with sparsity via Bayesian variational inference. arXiv preprint arXiv:1707.00666
Kalchbrenner N, Danihelka I, Graves A (2015) Grid long short-term memory. arXiv preprint arXiv:1507.01526
Veličković P, Karazija L, Lane ND, Bhattacharya S, Liberis E, Liò P, Chieh A, Bellahsen O, Vegreville M (2018) Cross-modal recurrent models for weight objective prediction from multimodal time-series data. In: Proceedings of the 12th EAI international conference on pervasive computing technologies for healthcare, pp 178–186
Wang J, Hu X (2021) Convolutional neural networks with gated recurrent connections. IEEE Trans Pattern Anal Mach Intell 44:3421–3435
Liang M, Hu X (2015) Recurrent convolutional neural network for object recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3367–3375
Liang M, Hu X, Zhang B (2015) Convolutional neural networks with intra-layer recurrent connections for scene labeling. In: Advances in neural information processing systems, 28
Fernandez B, Parlos AG, Tsai W (1990) Nonlinear dynamic system identification using artificial neural networks (ANNs). In: International joint conference on neural networks (IJCNN), pp 133–141
Puskorius GV, Feldkamp LA (1994) Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Trans Neural Netw 5(2):279–297
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representations by error propagation. In: Parallel distributed processing: explorations in the microstructure of cognition, vol 1. MIT Press, pp 318–362
Lippi M, Bertini M, Frasconi P (2013) Short-term traffic flow forecasting: An experimental comparison of time-series analysis and supervised learning. IEEE Trans Intell Transp Syst 14(2):871–882
Aqib M, Mehmood R, Alzahrani A, Katib I, Albeshri A, Altowaijri SM (2019) Smarter traffic prediction using big data, in-memory computing, deep learning and GPUs. Sensors 19:2206
Janković S, Uzelac A, Zdravković S, Mladenović D, Mladenović S, Andrijanić I (2021) Traffic volumes prediction using big data analytics methods. Int J Traffic Transp Eng 11(2):184–198
Deekshetha HR, Shreyas Madhav AV, Tyagi AK (2022) Traffic prediction using machine learning. In: Evolutionary computing and mobile sustainable networks. Springer, Singapore, pp 969–983
Kuamr S (2022) Traffic flow prediction using machine learning algorithms. Int Res J Eng Technol 9(4):2995–3004
Navarro-Espinoza A, López-Bonilla OR, García-Guerrero EE, Tlelo-Cuautle E, López-Mancilla D, Hernández-Mejía C, Inzunza-González E (2022) Traffic flow prediction for smart traffic lights using machine learning algorithms. Technologies 10(1):5
Upadhyaya S, Mehrotra D (2022) The facets of machine learning in lane change prediction of vehicular traffic flow. In: Proceedings of international conference on intelligent cyber-physical systems. Springer, Singapore, pp 353–365
Qu Z, Li J (2022) Short-term traffic flow forecast on basis of PCA-interval type-2 fuzzy system. J Phys Conf Ser 2171(1):012051
Steffen T, Lichtenberg G (2022) A machine learning approach to traffic flow prediction using CP data tensor decompositions. In: IFAC world congress 2020. Loughborough Research Repository
Wang J, Pradhan MR, Gunasekaran N (2022) Machine learning-based human-robot interaction in ITS. Inf Process Manag 59(1):102750
Cui Z, Huang B, Dou H, Tan G, Zheng S, Zhou T (2022) GSA-ELM: a hybrid learning model for short-term traffic flow forecasting. IET Intel Transport Syst 16(1):41–52
Li J, Boonaert J, Doniec A, Lozenguez G (2021) Multi-models machine learning methods for traffic flow estimation from Floating Car Data. Transp Res Part C Emerg Technol 132:103389
Jiber M, Mbarek A, Yahyaouy A, Sabri MA, Boumhidi J (2020) Road traffic prediction model using extreme learning machine: the case study of Tangier, Morocco. Information 11(12):542
Husni E, Nasution SM, Yusuf R (2020) Predicting traffic conditions using knowledge-growing Bayes classifier. IEEE Access 8:191510–191518
Bratsas C, Koupidis K, Salanova JM, Giannakopoulos K, Kaloudis A, Aifadopoulou G (2020) A comparison of machine learning methods for the prediction of traffic speed in urban places. Sustainability 12(1):142
Xiao J, Xiao Z, Wang D, Bai J, Havyarimana V, Zeng F (2019) Short-term traffic volume prediction by ensemble learning in concept drifting environments. Knowl-Based Syst 164:213–225
Ramchandra NR, Rajabhushanam C (2022) Machine learning algorithms performance evaluation in traffic flow prediction. Mater Today Proc 51:1046–1050
Pangesta J, Dharmadinata OJ, Bagaskoro ASC, Hendrikson N, Budiharto W (2021) Travel duration prediction based on traffic speed and driving pattern using deep learning. ICIC Express Lett Part B Appl 12(1):83–90
Chen M, Chen R, Cai F, Li W, Guo N, Li G (2021) Short-term traffic flow prediction with recurrent mixture density network. Math Problems Eng 2021:6393951
Bao X, Jiang D, Yang X, Wang H (2021) An improved deep belief network for traffic prediction considering weather factors. Alex Eng J 60(1):413–420
Jiang CY, Hu XM, Chen WN (2021) An urban traffic signal control system based on traffic flow prediction. In: 2021 13th international conference on advanced computational intelligence (ICACI). IEEE, pp 259–265
Tu Y, Lin S, Qiao J, Liu B (2021) Deep traffic congestion prediction model based on road segment grouping. Appl Intell 51(11):8519–8541
Rahman R, Hasan S (2021) Real-time signal queue length prediction using long short-term memory neural network. Neural Comput Appl 33(8):3311–3324
Buroni G, Lebichot B, Bontempi G (2021) AST-MTL: an attention-based multi-task learning strategy for traffic forecasting. IEEE Access 9:77359–77370
Qu L, Lyu J, Li W, Ma D, Fan H (2021) Features injected recurrent neural networks for short-term traffic speed prediction. Neurocomputing 451:290–304
Chen Y, Lv Y, Ye P, Zhu F (2020) Traffic-condition-awareness ensemble learning for traffic flow prediction. IFAC-PapersOnLine 53(5):582–587
Mohanty S, Pozdnukhov A, Cassidy M (2020) Region-wide congestion prediction and control using deep learning. Transp Res Part C Emerg Technol 116:102624
Gu Y, Lu W, Xu X, Qin L, Shao Z, Zhang H (2020) An improved Bayesian combination model for short-term traffic prediction with deep learning. IEEE Trans Intell Transp Syst 21(3):1332–1342
Wang J, Deng W, Guo Y (2014) New Bayesian combination method for short-term traffic flow forecasting. Transp Res C Emerg Technol 43:79–94
Vázquez JJ, Arjona J, Linares M, Casanovas-Garcia J (2020) A comparison of deep learning methods for urban traffic forecasting using floating car data. Transportation Research Procedia 47:195–202
Shabarek A (2020) A deep machine learning approach for predicting freeway work zone delay using big data. Doctoral dissertation, New Jersey Institute of Technology
Ranjan N, Bhandari S, Zhao HP, Kim H, Khan P (2020) Citywide traffic congestion prediction based on CNN, LSTM, and transpose CNN. IEEE Access 8:81606–81620
Shin DH, Chung K, Park RC (2020) Prediction of traffic congestion based on LSTM through correction of missing temporal and spatial data. IEEE Access 8:150784–150796
Zheng Z, Yang Y, Liu J, Dai HN, Zhang Y (2019) Deep and embedded learning approach for traffic flow prediction in urban informatics. IEEE Trans Intell Transp Syst 20(10):3927–3939
California Department of Transportation (Caltrans). Caltrans Performance Measurement System (PeMS) Available online: http://pems.dot.ca.gov/. Accessed 13 May 2019
Kong F, Li J, Jiang B, Zhang T, Song H (2019) Big data-driven machine learning-enabled traffic flow prediction. Trans Emerg Telecommun Technol 30(9):e3482
Wang J, Chen R, He Z (2019) Traffic speed prediction for urban transportation network: a path based deep learning approach. Transp Res Part C Emerg Technol 100:372–385
Sun S, Chen J, Sun J (2019) Traffic congestion prediction based on GPS trajectory data. Int J Distrib Sens Netw 15(5):1550147719847440
Di X, Xiao Y, Zhu C, Deng Y, Zhao Q, Rao W (2019) Traffic congestion prediction by spatiotemporal propagation patterns. In: 2019 20th IEEE international conference on mobile data management (MDM). IEEE, pp 298–303
Khajeh Hosseini M, Talebpour A (2019) Traffic prediction using time-space diagram: a convolutional neural network approach. Transp Res Rec 2673(7):425–435
An J, Fu L, Hu M, Chen W, Zhan J (2019) A novel fuzzy-based convolutional neural network method to traffic flow prediction with uncertain traffic accident information. IEEE Access 7:20708–20722
Dai G, Ma C, Xu X (2019) Short-term traffic flow prediction method for urban road sections based on space-time analysis and GRU. IEEE Access 7:143025–143035
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Contributions
SAS wrote the main text of the manuscript; YAH and HAH revised the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sayed, S.A., Abdel-Hamid, Y. & Hefny, H.A. Artificial intelligence-based traffic flow prediction: a comprehensive review. Journal of Electrical Systems and Inf Technol 10, 13 (2023). https://doi.org/10.1186/s43067-023-00081-6
DOI: https://doi.org/10.1186/s43067-023-00081-6