Skip to main content

Table 1 Related work summary

From: GMA: Gap Imputing Algorithm for time series missing values

 

References

Datasets

Description

Univariate methods

[10]

4 time series used: CO2 concentrations, Phu Lien air temperature, NNGC1 F1 V1 003 (NNGC), and Ba Tri temperature

This approach involves transforming data into a multivariate time series, using machine learning for forward and backward forecasting to estimate missing values, and imputing gaps with the average of both forecast sets. It adheres to academic standards of syntax and grammar

[11]

Scientific research data on factors causing crime for males and females

The "most frequent value" imputation method replaces missing data with the mode of the variable. It is typically used for categorical variables or numerical variables that have a limited range of values

[12]

Water-level data from telemetry stations across Thailand

This study compared three methods for imputing missing data: mean imputation, regression imputation, and multiple imputation. The results showed that multiple imputation was the most effective method and produced less biased estimates compared to the other two methods

Multivariate methods

Pattern matching

[13]

Air quality and meteorological data in Beijing, China

The proposed ST-MVL method fills missing readings in geo-sensory time series data by considering temporal and spatial correlations. It uses empirical statistic models and data-driven algorithms to handle different types of missing data cases

[14]

2 datasets: SBR meteorological time series in South Tyrol and Flights dataset

Clustering algorithm was employed to group similar time series, and the resulting groups were used to impute missing values. This approach is specifically designed for continuous streams of time series data

[15]

Data collected from in-situ monitoring station in Mulgrave-Russell catchment, Australia

The proposed method involves using a Seq2Seq model to impute missing values in time series data. This model utilizes a dual-head architecture that includes an encoder and two decoders, each corresponding to one direction of the time series data. Seq2Seq models are a type of recurrent neural network (RNN) that can be applied for sequence prediction and generation

Matrix completion principles

[16]

Face images under varying illuminations: 168 × 192 resolution, 55 frames

The proposed technique for robust low-rank matrix recovery is capable of handling data corruption and utilizes orthonormal subspace learning to estimate a low-rank matrix from incomplete or corrupted data. This method has shown promising results in experiments and outperformed existing methods, and can be applied in various applications such as image processing, signal processing, and recommendation systems

[17]

Netflix data: 17,770 movies rated by 480,189 customers

The proposed method utilizes spectral regularization to promote low-rank solutions and impose structural constraints on the estimated matrix. This approach has proved to be effective in dealing with ill-posed problems and improving the performance of matrix completion. The method also allows for incorporating additional side information, such as similarity between items or users, to further enhance the estimation

[18]

Hydrological time series with tuples of timestamp and observation value

SVD and CD are two widely used methods for imputing missing values in time series data. SVD decomposes the dataset into a subset of singular values, while CD calculates the distance between the missing value and its neighboring points based on correlation. CD has been found to be more accurate and computationally for time series datasets with low correlation

[19]

National water quality reference index data monitored by Haimen Bay station

A proposed approach for imputing missing values in data involves combining low-rank matrix completion and sparse representation. The approach first uses low-rank matrix completion to impute missing values based on a low-rank structure assumption. Then, sparse representation is employed to refine the imputed data by assuming it can be represented as a linear combination of a few basis elements

[20]

Rainfall data from 4 stations in Malaysia

The method combines PCA and Bayesian modeling to estimate missing values in a dataset

Machine learning imputing

[21]

Measured water levels in 7 monitoring wells in the USA

RF algorithm used for imputing missing values in a dataset with continuous variables

[22]

Three field-based time series were used, including traffic speed data, water flow rate data, and the Nottem dataset

A hybrid approach was used to impute missing data, where regression imputation predicted missing input variables, and data augmentation created synthetic data points for missing output variables. The approach was applied to a dataset with missing values in both input and output variables

[23]

Seven datasets were used from the UCI and KEEL repositories

A genetic algorithm is proposed to impute missing values in datasets with multiple missing observations and different data types. The algorithm minimizes a multi-objective fitness function based on Minkowski distance of statistical measures between available and completed data

[24]

Two untargeted metabolomics datasets from the COPDGene cohort were used

A two-step approach was used for imputing missing values, involving a random forest classifier to classify the missing mechanism and mechanism-specific algorithms for imputation. The approach improved imputations by reducing bias and producing values closer to the original data

[25]

Letter and SPAM datasets https://archive.ics.uci.edu/

A method that estimates missing values in datasets using a generative adversarial network (GAN) model