Improved‑RSSI‑based indoor localization by using pseudo‑linear solution with machine learning algorithms



Introduction
ML/AI-based IoT application development is a hot topic among developers and in academia. Among these IoT applications, location-based applications are critical. Examples of location-based IoT services include locating people in a shopping complex, locating mobile robots on factory floors, and attendance management in smart campuses. In indoor environments, finding the location of a moving object is challenging because of Non-Line of Sight (NLOS) conditions and multipath fading [1][2][3]. In indoor wireless localization, additional hardware is not required to obtain location information: by employing its broadcast signals, the sensor node can assess its position. Further, a Wireless Sensor Network (WSN) already deployed for sensing purposes can be upgraded to provide location without additional cost. Radio signals from mobile sensor nodes are used as input to an algorithm that estimates the location. Generally, indoor positioning systems are based on wireless technologies such as Bluetooth Low Energy (BLE), Wi-Fi, LoRaWAN, UWB, and Zigbee. Each wireless technology has its pros and cons. For instance, BLE has low power consumption and a very short communication range, whereas LoRaWAN has high power consumption and a long sensing range [4,5].
Many of the prominent indoor localization algorithms in the literature are statistical, deterministic, or filter-based [6][7][8]. Such algorithms are highly complex and impractical to deploy on real hardware setups. Further, Indoor Positioning Systems (IPS) based on classical algorithms use various hardware devices, which increases cost and significantly limits location accuracy.
ML algorithms are mostly employed in localization to extract the essential properties of the signals. Based on these derived features, clustering is carried out using the fingerprint method. Feature extraction is also crucial for NLOS identification and mitigation. Current research focuses on advancing machine learning-based indoor localization techniques tailored for IoT systems, enabling their application in innovative scenarios [9][10][11][12]. Some works are based on regression algorithms, others on classification or deep learning algorithms. Yet the proposed ML models have limitations. Often, a proposed method is limited to a single ML algorithm, and no performance comparison with other algorithms is available. Also, a few works are based only on simulated datasets, with no experimental testbed implemented and evaluated. Further, there is little or no consideration of hyper-parameter tuning.
The main contributions of this study are as follows:
• The RSSI measurements are gathered using a Wi-Fi-based testbed featuring anchor nodes and target nodes built from Espressif (ESP) 12 devices, operating on the IEEE 802.11 b/g/n protocol within an indoor environment.
• We introduce a pseudo-linear solution (PLS), a closed-form solution that approximates the original system of nonlinear RSSI measurement equations with a set of linear equations.
• To manage measurement errors, the PLS method employs a weighted least-squares approach, with the weights determined from the statistical properties of the errors in both the RSSI measurements and the reference node locations.
• Finally, the collected RSSI data is used to train a selection of ML models (linear regression, polynomial regression, support vector regression, random forest regression, and decision tree regression), followed by a comparative evaluation of their performance.
This paper is organized as follows. Section "Related works" reviews recent work; Section "Experimental testbed design" presents the design of the experimental testbed; Section "System model" details the system model and the ML models used and how they were trained; finally, the results and conclusions are presented.

Related works
Several studies have estimated the location of a sensor node in indoor environments with various localization techniques and machine learning algorithms. This section briefly describes recent studies and highlights the fundamental methodology used for machine learning-based indoor localization. In [13], the authors investigated an ML regressor for indoor localization. They used neural networks to carry out localization based on the RSSI parameter and compared the location estimates of two approaches (an ANN and a decision tree) on the RSSI dataset. To evaluate the location for each triplet of RSSI values, they first used an artificial neural network with three inputs and calculated the mean error over the locations obtained with this architecture. The same was done for an ANN architecture with four inputs, where they estimated the location for each set of four inputs and determined the mean error of those estimates.
In [10], Ultra-Wide Band (UWB) is used as the wireless technology for the Indoor Positioning System (IPS). For the UWB IPS, an ML-based algorithm built on Naive Bayes (NB) principles was developed. The suggested technique exhibits a considerable improvement in localization precision. The results show that the error between the measured and actual distance grows as the distance between the anchors and tags grows. The area under the curve for the NB method is 87%, indicating good classification performance. The suggested algorithm is also reported to retain good positioning accuracy in both Line of Sight (LOS) and NLOS environments. In [14], the authors analyzed contemporary technologies for accurately locating objects inside buildings. They then showed how positioning errors increase when fingerprinting techniques are trained and tested on different platforms and devices: Received Signal Strength (RSS) measurements vary when different platform types and devices are used at the same location and time. The model was trained using a Support Vector Machine (SVM) combined with Error-Correcting Output Codes (ECOC) One-Versus-One and Long Short-Term Memory (LSTM) models. To assess accuracy, the Root Mean Square Error (RMSE) was computed, giving the error in meters between the true and predicted positions.
The work in [15] presents a detailed comparison of LR, PR, DTR, SVR, and RFR performance for a Wi-Fi-based IPS. According to its findings, the DTR algorithm performed best among the algorithms examined. Increasing the number of trees in the tree-based models significantly reduces the error and improves location estimation accuracy. Accuracy and error also improved markedly once the number of nodes in the reference testbed was increased. That work suggests that supervised machine learning algorithms produce better outcomes than deterministic localization.
Overall, the ML-based methods proposed in related works can provide better estimation accuracy than classical localization algorithms. However, RSSI fluctuates strongly, so strong filtering techniques and linearization methods need to be applied to the RSSI dataset before it is used to train ML models.

Experimental testbed design
We designed and implemented the testbed using two types of sensor nodes: a target node and reference nodes. The target node is the node whose position is to be estimated, and the reference nodes are placed at fixed positions in the indoor location. The experimental setup is established in an electronics engineering laboratory, as shown in Fig. 1. The location is about 8.02 square meters, an open area surrounded by walls that also contains some furniture. The IoT architecture used in the RSSI data collection system is shown in Fig. 2. The target node is implemented using an ESP-12E module, and the anchor nodes are implemented using ESP-01 modules. ESP modules implement the IEEE 802.11 standard and are used here entirely indoors (Fig. 3). The system supports IPv4, TCP, UDP, HTTP, and MQTT for communication between nodes. A regulated 3.3 V DC supply based on an ADP7158 linear regulator was used to power the nodes, as depicted in Fig. 4a, b. The ESP-12E also uses a rechargeable lithium polymer battery for energy storage.
In the testbed arrangement, 34 known locations were identified by their x and y coordinates. Before taking RSSI readings, all other Wi-Fi-enabled devices in the environment, such as Wi-Fi access points, were turned off. During data collection, the reference nodes were fixed on the wall at a height of 2 feet above ground level, and the mobile node was placed at the marked locations. The mobile node was kept at each of the 34 locations for one minute, and the RSSI values were recorded via an IoT cloud architecture. A photograph of the testbed is shown in Fig. 1.
RSSI data collection and publication to a cloud storage server are done using the IoT cloud architecture shown in Fig. 3. The RSSI data collected on the mobile node's private Wi-Fi network is published over the internet, which is a public network. The IoT cloud links the hardware platform to the online RSSI data gathering. The IoT cloud is a widely distributed Mosquitto MQTT broker that publishes the collected information to a remote server. Wi-Fi carries the acquired data from the hardware platform, and the internet carries it on to the remote server. Figure 5 shows the process of location estimation with reference nodes.

System model
The location of the target node is estimated from RSSI measurements to multiple reference nodes. Let the target node be located at $(x_b, y_b)$ and the fixed reference nodes at $(x_i, y_i)$, $i = 1, 2, \ldots, M$, with $M \ge 3$. The RSSI measurements at the target node are noisy because of signal fluctuation. The noisy reference location available at the target node is denoted $(\tilde{x}_i, \tilde{y}_i)$ and the corresponding RSSI measurement $\tilde{p}_i$. The anchor node location information is affected by additive, independent, zero-mean Gaussian noise with standard deviation $\sigma_{a_i}$ [16]. The value of $\sigma_{a_i}$ may differ between reference nodes, but it is taken to be identical for the x and y coordinates of a given node:

$$\tilde{x}_i = x_i + n_{\sigma_{a_i}}, \tag{1}$$

$$\tilde{y}_i = y_i + n_{\sigma_{a_i}}, \tag{2}$$

where the noise terms in (1) and (2) are independent draws. Similarly, the RSSI measurements follow the log-normal shadowing model of radio signal path loss [17]. The power received at the target node from the $i$th reference node is denoted $p_i$ (dBm). The perturbation $n_{\sigma_{p_i}}$ in $\tilde{p}_i$ is additive, independent, zero-mean Gaussian noise with standard deviation $\sigma_{p_i}$ (dB), such that

$$\tilde{p}_i = p_i + n_{\sigma_{p_i}}. \tag{3}$$
Moreover, the shadowing path-loss model relates the mean power received from the $i$th reference node to the distance between the target and that reference node:

$$p_i = p_0 - 10\,\eta \log_{10}\!\left(\frac{d_i}{d_0}\right), \tag{4}$$

where $d_0$ is the reference distance, $p_0$ is the received power at the reference distance, and $\eta$ is the path-loss exponent. Given the perturbed value $\tilde{p}_i$, the RSSI-induced estimate of the distance between the target and the $i$th reference node is denoted $\tilde{d}_i$ and computed as

$$\tilde{d}_i = d_0\, 10^{(p_0 - \tilde{p}_i)/(10\eta)}. \tag{5}$$

This study considers the challenges of computational efficiency and energy constraints in estimating the target node location from the reference nodes. The RSSI and location information of every reference node is assumed to be available to the target node at any time for localization. To cope with these challenges, this study proposes a PLS for the self-localization problem, described below.

The basic idea of the proposed algorithm is to find the near-optimal position of the target node that minimizes the sum of squared errors. Given the reference node positions $(x_i, y_i)$ and the corresponding distances $d_i$, $i = 1, 2, \ldots, M$, the target node location is found by intersecting the circles

$$(x_i - x_b)^2 + (y_i - y_b)^2 = d_i^2, \quad i = 1, 2, \ldots, M. \tag{6}$$

To cope with the nonlinearity of Eq. (6), the equation for $i = 1$ is subtracted from the others, which yields the system of linear equations

$$A s = b, \tag{7}$$

where $s = [x_b, y_b]^T$, $k_i = x_i^2 + y_i^2$, and

$$A = 2\begin{bmatrix} x_2 - x_1 & y_2 - y_1 \\ \vdots & \vdots \\ x_M - x_1 & y_M - y_1 \end{bmatrix}, \qquad b = \begin{bmatrix} k_2 - k_1 - d_2^2 + d_1^2 \\ \vdots \\ k_M - k_1 - d_M^2 + d_1^2 \end{bmatrix}.$$

Equation (7) is an over-determined set of linear equations, so the objective is to find the solution $s$ that minimizes the sum of squared errors

$$J(s) = (b - As)^T (b - As). \tag{8}$$

The solution of (8) is

$$\hat{s} = (A^T A)^{-1} A^T b. \tag{9}$$

Note that only the noisy values $\tilde{x}_i$, $\tilde{y}_i$, and $\tilde{d}_i$ are available rather than the actual $x_i$, $y_i$, and $d_i$. To account for the differences in scale and numerical properties among the location and distance estimates of the multiple reference nodes in Eq. (8), the weighted sum of squared errors is minimized instead:

$$J_w(s) = (\tilde{b} - \tilde{A}s)^T W^{-1} (\tilde{b} - \tilde{A}s), \tag{10}$$

where $\tilde{A}$ and $\tilde{b}$ are $A$ and $b$ evaluated with the noisy values, and $W$ is the $(M-1) \times (M-1)$ weighting matrix. The solution of (10) is

$$\hat{s} = (\tilde{A}^T W^{-1} \tilde{A})^{-1} \tilde{A}^T W^{-1} \tilde{b}. \tag{11}$$

To evaluate the weight matrix $W$, note that the error vector $\tilde{b} - \tilde{A}s$ in (10) contains two noise components, one from the reference node locations and one from the distance measurements. The vector $\tilde{b}$ comprises the squares of the noisy quantities, which dominate the contribution of the noise in $\tilde{A}$ to the covariance of the error vector. $W$ is therefore taken to be the covariance matrix of $\tilde{b}$. Under the assumption stated above that the noise in the reference node locations is independent of the noise in the RSSI-induced distances, this covariance is expressed through the variances of $\tilde{k}_i = \tilde{x}_i^2 + \tilde{y}_i^2$ and $\tilde{d}_i^2$. Note that $\tilde{k}_i$ is a sum of squares of the independent, normally distributed random variables $\tilde{x}_i$ and $\tilde{y}_i$, which have non-zero means, so its variance follows from $k_i$ and $\sigma_{a_i}^2$; $\operatorname{Var}(\tilde{d}_i^2)$ is computed as in [11]. The noisy values $\tilde{x}_i$, $\tilde{y}_i$, and $\tilde{d}_i$ are used to compute Eqs. (13) and (14) because the actual values $x_i$, $y_i$, and $d_i$ are not accessible.
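The unweighted core of the estimator, that is, converting RSSI to distance with the path-loss model and then solving the linear system obtained by subtracting the first circle equation from the others, can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the study: the function names, the anchor layout, and the values of `P0`, `D0`, and `ETA` are assumptions chosen for the example, and the weighting matrix and bias compensation are omitted.

```python
import math

# Hypothetical path-loss parameters, chosen only for this illustration.
P0, D0, ETA = -40.0, 1.0, 2.5  # dBm at reference distance, metres, exponent


def rssi_to_distance(p, p0=P0, d0=D0, eta=ETA):
    """Invert the log-distance path-loss model to get a distance in metres."""
    return d0 * 10 ** ((p0 - p) / (10 * eta))


def pseudo_linear_ls(anchors, distances):
    """Unweighted least-squares solution of the linearised system A s = b."""
    (x1, y1), d1 = anchors[0], distances[0]
    k1 = x1 * x1 + y1 * y1
    # Accumulate the 2x2 normal equations (A^T A) s = A^T b row by row.
    s11 = s12 = s22 = t1 = t2 = 0.0
    for (xi, yi), di in zip(anchors[1:], distances[1:]):
        a1, a2 = 2 * (xi - x1), 2 * (yi - y1)
        bi = (xi * xi + yi * yi) - k1 - di * di + d1 * d1
        s11 += a1 * a1
        s12 += a1 * a2
        s22 += a2 * a2
        t1 += a1 * bi
        t2 += a2 * bi
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("anchors are collinear; position is not identifiable")
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)


# Noise-free check with four assumed anchors and a target at (2, 1).
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (4.0, 3.0)]
target = (2.0, 1.0)
rssi = [P0 - 10 * ETA * math.log10(math.dist(a, target) / D0) for a in anchors]
estimate = pseudo_linear_ls(anchors, [rssi_to_distance(p) for p in rssi])
print(estimate)
```

With noisy inputs the same routine still returns an estimate, but it is then biased, which is what the weighting and the bias compensation in the text are meant to address.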
Moreover, Eq. (11) has multiple sources of bias: the matrix $\tilde{A}$ contains noise, the errors in $\tilde{b}$ are neither additive nor zero-mean, and the errors in $\tilde{A}$ and $\tilde{b}$ are correlated. To evaluate the bias of the algorithm under an additive error model, Eq. (9) is rewritten in terms of the noise matrix $N$ in $\tilde{A}$ and the error vector $e$ in $\tilde{b}$, and $E[\hat{s}]$ is then obtained using Eqs. (15) and (11). In Eq. (16), a full expansion is avoided to keep the equation simple. Part I in Eq. (16) corresponds to the target node location $s$, and the remaining parts II, III, and IV are bias terms due to estimation errors.

Part II is the bias due to the noise in $\tilde{A}$. Part III arises from the statistical dependence between $\tilde{A}$ and $\tilde{b}$, i.e., $E[N^T e] \neq 0$. Part IV arises from the non-additive nature of the perturbation in $\tilde{d}_i$, i.e., $E[e] \neq 0$. To compensate for bias parts II, III, and IV, the corresponding expectations with respect to the noise covariance are subtracted in Eq. (11), which gives the bias-compensated solution of Eq. (17). This requires $E[N^T W^{-1} N]$ and $E[N^T W^{-1} \tilde{b}]$, which are computed from the entries of $N$, with the $(i, j)$th element of $W^{-1}$ denoted $w'_{ij}$.

The bias due to the dependence of the noise in $\tilde{A}$ and $\tilde{b}$ is given in Eq. (20). To compensate for the bias caused by the non-additive perturbation in $\tilde{d}_i$ [part IV in Eq. (16)], $E[\tilde{b}]$ is computed entry by entry. Because the noise in the reference node locations is independent of the noise in the RSSI-induced distances, Eq. (21) separates into terms in $E[\tilde{k}_i]$ and $E[\tilde{d}_i^2]$, as in Eq. (22). $E[\tilde{d}_i^2]$ is obtained from the expression for $\tilde{d}_i$ in Eq. (5). Using Eqs. (24) and (25), $E[\tilde{b}]$ is expressed in terms of a vector $t$. Because the true $d_i$ is not available, the corresponding noisy measurements are used to estimate $t$.
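For orientation, the first and second moments of the two noisy quantities follow directly from the Gaussian noise assumptions stated earlier. The block below is a worked sketch under those assumptions, not a reproduction of the source's numbered equations; the shorthand $c = \ln 10 / (10\eta)$ is introduced here only for compactness.

```latex
\begin{align*}
\tilde{k}_i &= \tilde{x}_i^2 + \tilde{y}_i^2, \qquad k_i = x_i^2 + y_i^2, \\
E[\tilde{k}_i] &= k_i + 2\sigma_{a_i}^2, \qquad
\operatorname{Var}(\tilde{k}_i) = 4\sigma_{a_i}^2 k_i + 4\sigma_{a_i}^4, \\
\tilde{d}_i^2 &= d_i^2 \exp\!\left(-2c\, n_{\sigma_{p_i}}\right), \qquad c = \frac{\ln 10}{10\eta}, \\
E[\tilde{d}_i^2] &= d_i^2\, e^{2c^2\sigma_{p_i}^2}, \qquad
\operatorname{Var}(\tilde{d}_i^2) = d_i^4\left(e^{8c^2\sigma_{p_i}^2} - e^{4c^2\sigma_{p_i}^2}\right).
\end{align*}
```

The moments of $\tilde{k}_i$ are those of a scaled noncentral chi-squared variable with two degrees of freedom, and the moments of $\tilde{d}_i^2$ are those of a log-normal variable. Both expectations exceed the true values, which is why $E[\tilde{b}] \neq b$ even though every underlying noise term has zero mean.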
Numerical evaluation shows that the bias estimate of (20) for the noise in $\tilde{A}$ and $\tilde{b}$ approximates the actual value only when the noise in the reference node locations is low. It thus depends on the bias in $(\tilde{x}_i, \tilde{y}_i)$ and performs poorly for higher values of $\sigma_{a_i}$. The bias-compensated estimate of the target node location in the presented PLS algorithm, $\hat{s}_{bc}$ in (17), is computed in closed form.

The ML models were implemented using the scikit-learn machine learning library on an Intel(R) Core(TM) i5-10210U CPU @ 1.60 GHz 2.11 GHz. For visualizations, MATLAB 2020R is used.

Linear regression (LR)
Linear regression (LR) can be considered the simplest ML algorithm available. LR finds the best-fit straight line between the independent and dependent variables. The main aim of an LR model is to determine the intercept and coefficient values of this line that minimize the error. The first variable is regarded as the independent variable and the second as the dependent variable. Moreover, this algorithm is easy to implement and requires little computational power to train [19,20].
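As a minimal sketch of the best-fit line described above, the closed-form least-squares slope and intercept for a single feature can be computed directly. The RSSI readings and coordinates below are made-up values for illustration, and `fit_line` is a hypothetical helper, not part of the study's code.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: RSSI (dBm) from one anchor vs. x coordinate (m).
rssi = [-45.0, -50.0, -55.0, -60.0]
x_coord = [1.0, 2.0, 3.0, 4.0]
intercept, slope = fit_line(rssi, x_coord)
print(intercept, slope)
```

In practice each training sample would carry one RSSI value per anchor node, so the model becomes a multiple linear regression; the single-feature case is shown only because it has a compact closed form.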

Polynomial regression (PR)
Polynomial regression (PR) is an extension of LR. As a special case of multiple linear regression, PR models the relationship as an nth-degree polynomial. PR is suitable when the dataset contains nonlinear data, a scenario in which LR fails to produce a good best-fit line: for data with a nonlinear correlation, LR performs poorly and gives unrealistic results. To cope with this challenge, PR is used, which identifies the curvilinear correlation between the independent and dependent variables. Moreover, this model is also of low complexity and easy to implement even on low-power hardware devices [21,22].

Support vector regressor (SVR)
SVR is a powerful ML algorithm used in indoor localization. It is effective because SVM models both linear and nonlinear relations with superior generalization performance and adopts the kernel technique to distinguish between points of two distinct classes. However, as the number of support vectors (SVs) increases, SVM-based approaches become time-consuming and memory-intensive [23,24].

Decision tree regression (DTR)
A decision tree is a supervised machine learning method that can be used for both classification and regression, although it is most frequently used for classification. It is a tree-structured model in which internal nodes represent features of a dataset, branches represent decision rules, and each leaf node holds the result. There are two kinds of nodes: decision nodes and leaf nodes. For indoor localization, decision tree-based methods have been reported to achieve better localization accuracy than other techniques such as K-NN and neural networks. When a decision tree handles continuous numerical data, some information may be lost [25,26].
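The decision-node and leaf-node structure described above can be illustrated with the smallest possible regression tree, a single split ("stump") chosen to minimise the summed squared error. This is a sketch under simplifying assumptions (one feature, one split, invented data); the helper names are hypothetical and a practical tree would split recursively.

```python
def fit_stump(x, y):
    """Return (threshold, left_mean, right_mean) minimising squared error."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between equal feature values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, ml, mr)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def predict_stump(stump, value):
    """Decision node: compare with the threshold; leaves hold the means."""
    threshold, left_mean, right_mean = stump
    return left_mean if value <= threshold else right_mean


# Hypothetical data: RSSI (dBm) from one anchor vs. x coordinate (m).
rssi = [-70.0, -68.0, -66.0, -50.0, -48.0, -46.0]
x_coord = [1.0, 1.2, 0.8, 4.0, 4.2, 3.8]
stump = fit_stump(rssi, x_coord)
print(stump, predict_stump(stump, -69.0), predict_stump(stump, -47.0))
```

Because each leaf returns a constant, the prediction is piecewise constant in the input, which is one way to read the remark that some information may be lost on continuous data.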

Random forest regression (RFR)
Random forest regression (RFR) is a machine learning ensemble technique that uses many decision trees. RFR combines the outputs of numerous weak learners (in this case, decision trees) through a voting scheme, which for regression amounts to averaging, to raise performance. The primary properties of random forests include random feature selection, bootstrap sampling, out-of-bag error estimation, and full-depth decision tree growth. Random forest improves on the performance of a single regression tree by combining several regression trees. Using a random forest can remove the need for separate cross-validation because the forest provides native out-of-bag error estimates, which in some tests are considered unbiased [27].
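The bootstrap-and-average idea behind RFR can be sketched with very small trees. Note the simplifications, which are assumptions of this example rather than properties of the study's models: each tree is a depth-one stump, there is a single feature so no random feature selection takes place, and the data are invented. It is therefore plain bagging, not a full random forest.

```python
import random


def fit_stump(x, y):
    """Depth-one regression tree: (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(x, y))
    mean = sum(y) / len(y)
    # Fallback when the sample has no valid split: predict the mean everywhere.
    best = (float("inf"), float("inf"), mean, mean)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, ml, mr)
    return best[1:]


def fit_bagged_stumps(x, y, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(x)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([x[i] for i in idx], [y[i] for i in idx]))
    return forest


def predict_bagged(forest, value):
    """Regression 'vote': average the outputs of all trees."""
    outputs = [left if value <= thr else right for thr, left, right in forest]
    return sum(outputs) / len(outputs)


# Hypothetical data: RSSI (dBm) from one anchor vs. x coordinate (m).
rssi = [-70.0, -68.0, -66.0, -50.0, -48.0, -46.0]
x_coord = [1.0, 1.0, 1.0, 4.0, 4.0, 4.0]
forest = fit_bagged_stumps(rssi, x_coord)
pred_weak = predict_bagged(forest, -69.0)
pred_strong = predict_bagged(forest, -47.0)
print(pred_weak, pred_strong)
```

The samples left out of each bootstrap draw are what an out-of-bag error estimate would be computed on; that step is omitted here to keep the sketch short.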

Result and discussion
The DTR, LR, PR, SVR, and RFR algorithms are trained as supervised machine learning models to estimate the x and y coordinates of the target node. For all the models, the coefficient of determination ($R^2$) and the Root Mean Squared Error (RMSE) were calculated. The experiment first took place with three reference nodes; the number of anchor nodes was then raised step by step to four and five, and new datasets were generated. Finally, RMSE and $R^2$ were calculated under different hyper-parameter conditions.
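For reference, the two evaluation metrics can be computed as follows. This is a minimal sketch using the standard definitions; the function names and the sample values are illustrative and do not come from the experimental dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the coordinates."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative true and predicted x coordinates (m).
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(rmse(actual, predicted), r2_score(actual, predicted))
```

Under this definition $R^2$ equals 1 for a perfect fit, 0 for a model that always predicts the mean, and becomes negative for a model that does worse than the mean.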

Root mean squared error
Figure 7a, b shows how the RMSE changes with the number of anchor nodes for the x coordinate and y coordinate, respectively. In the experimental setup, the number of anchor nodes was set to 3, 4, and 5 in turn.
In each case, RSSI values were collected and used to train the ML models. As the number of anchor nodes increases, there is a significant reduction in RMSE for all the models. LR and PR show higher RMSE values, whereas SVR, DTR, and RFR show relatively lower RMSE values, with DTR performing best. This trend arises because the models train better as the number of input features (one RSSI value per anchor node) increases.
Figure 8a, b shows the RMSE against the sample size for the x coordinate and y coordinate, respectively. RMSE decreases as the number of samples increases for all the models. LR and PR showed relatively high RMSE, while SVR, DTR, and RFR showed the lowest RMSE values; DTR performed best for both coordinates, giving the lowest RMSE. In ML models, the standard deviation decreases as the number of samples increases.
Figure 9a, b shows the coefficient of determination against the number of samples for the x coordinate and y coordinate, respectively. For machine learning models, the coefficient of determination, or R-squared value, typically ranges from 0.0 to 1.0 and reflects the proportion of the variance in the real node position that is explained by the estimated position. The closer the R-squared value is to 1.0, the closer the data points lie to the estimated line of best fit, indicating higher accuracy. For all the models, $R^2$ increases rapidly up to 1000 samples and more gradually thereafter. DTR and RFR show better $R^2$ scores, closer to 1. LR and PR show values below 0.5, meaning that these models do not fit the data well.

Hyper-parameter of the ML models
Figure 10a shows the impact of a hyper-parameter, the number of forests in RFR, on the accuracy of the estimation. RMSE decreases significantly as the number of forests increases: the model is better trained on the data and gives better accuracy. However, a high number of forests requires more computational power on hardware devices. The number-of-trees hyper-parameter used in tree-based ensemble methods must be tuned because it directly affects the computational cost. Enough trees must be chosen to balance prediction accuracy against computational time. According to the foundations of tree-based algorithms, a model with more trees tends toward a lower prediction error. Model performance also depends on the maximum tree depth, with deeper trees performing better. Figure 10b illustrates RMSE versus the number of trees in the DTR algorithm. RMSE decreases significantly as the number of trees increases.

RMSE value with the epsilon for different kernel functions in SVR
Figure 11 illustrates the change in RMSE against epsilon for different kernel functions in SVR. The input dataset is first passed to the kernel, which transforms it into the required form. Different SVM algorithms use different kernel functions, for instance linear, nonlinear, polynomial, sigmoid, and radial basis function (RBF) kernels. Kernel functions can be defined for vectors, text, images, graphs, and sequence data. RBF kernels are the most prevalent type because their response is localized and finite along the entire x-axis. A kernel function returns the inner product between two points in an appropriate feature space. Thus, a notion of similarity is defined even in very high-dimensional spaces at low computational expense. The experimental results show that RMSE decreases for all kernel functions as ε goes from 0.1 to 0.2 and increases rapidly for ε > 0.2. Based on these observations, the RBF kernel performs best.
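Two ingredients of this experiment are easy to state in code: the RBF kernel as a similarity measure, and the ε-insensitive loss that gives epsilon its meaning in SVR. The sketch below uses the standard textbook definitions; the function names are hypothetical, and the default `gamma=0.1` simply mirrors the value reported for the RBF kernel in the C-parameter experiment.

```python
import math


def rbf_kernel(u, v, gamma=0.1):
    """RBF similarity exp(-gamma * ||u - v||^2); equals 1 for identical points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def epsilon_insensitive(y_true, y_pred, epsilon=0.2):
    """Errors inside the epsilon tube cost nothing; outside they grow linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Two hypothetical RSSI vectors (dBm), one value per anchor node.
a = (-50.0, -60.0, -55.0)
b = (-52.0, -59.0, -55.0)
print(rbf_kernel(a, a), rbf_kernel(a, b))
print(epsilon_insensitive(3.0, 3.1), epsilon_insensitive(3.0, 3.5))
```

A wider tube ignores more of the residuals, which is consistent with the observation that RMSE grows once ε becomes too large for the scale of the coordinates.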

RMSE value with the C parameter in SVR
Figure 12 illustrates the change in RMSE against the C parameter in SVR, with gamma set to 0.1 for the RBF kernel. RMSE decreases significantly as C increases. The C parameter sets the penalty for each erroneously classified data point. When C is low, the penalty for incorrectly classified points is low, so a decision boundary with a large margin is selected at the expense of more misclassifications. When C is large, the high penalty makes the SVM try to reduce the number of erroneously classified instances, which leads to a decision boundary with a narrower margin. Not all misclassified instances receive the same penalty; the penalty grows with the distance from the decision boundary.
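In the regression setting used in this study, the role of C is easiest to read from the standard ε-insensitive SVR objective. The formulation below is given as general background and is not taken from the source; $w$, $b$, $\varphi$, $\xi_i$, and $\xi_i^*$ are the usual weight vector, offset, feature map, and slack variables.

```latex
\begin{align*}
\min_{w,\, b,\, \xi,\, \xi^*} \quad & \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \left(\xi_i + \xi_i^*\right) \\
\text{subject to} \quad & y_i - w^T \varphi(x_i) - b \le \varepsilon + \xi_i, \\
& w^T \varphi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \\
& \xi_i \ge 0, \quad \xi_i^* \ge 0, \quad i = 1, \ldots, n.
\end{align*}
```

A larger C puts more weight on the slack terms, so the fitted function follows the training targets more closely at the cost of a less smooth model. Each sample's penalty is proportional to how far its residual lies outside the ε tube.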

Conclusions
This study presents an ML-based approach that can be applied to robust indoor localization scenarios. An experimental testbed was designed, including five reference nodes and one target node. The target node was placed at known coordinates, and RSSI data were gathered using an IoT cloud architecture. The collected dataset was pre-processed using a PLS, a closed-form solution that approximates the original system of nonlinear RSSI measurement equations with a system of linear equations. The dataset was then used to train several ML algorithms. The experiments with multiple supervised algorithms under various conditions show that DTR outperformed the other algorithms. The hyper-parameters, namely the number of trees in DTR, the number of forests in RFR, and the penalty parameter and epsilon in SVR, significantly affect localization accuracy. Moreover, accuracy and error improved greatly as the number of reference nodes in the network increased. Future research can explore creating and refining ensemble-type machine learning models designed to enhance indoor localization accuracy. These models can leverage the strengths of various algorithms and techniques, combining them to improve localization performance. Investigating novel ensemble strategies and assessing their effectiveness in real-world scenarios will be crucial. Research efforts should also focus on accommodating dynamic indoor environments, diverse IoT device types, and varying network conditions, which will help ascertain the adaptability of the models to a wide range of real-world settings.

Fig. 12 RMSE value with C parameter in SVR

Fig. 2 Arrangement of reference nodes and mobile nodes

Fig. 7 RMSE value with the number of anchor nodes: a x coordinate, b y coordinate

Fig. 9 Coefficient of determination with number of samples: a x coordinate, b y coordinate

Fig. 10 RMSE value for x and y coordinates: a number of forests, b number of trees