MISSING DATA: BRIEF REVIEW AND CASE STUDIES
Keywords:
Missing Data, Missing Completely at Random (MCAR), Missing at Random (MAR)
Abstract
A major difficulty in developing real applications that use data streams to solve prediction
and/or classification problems is missing data. Although there are techniques to reduce
the impacts caused by this problem, most systems are not preventively modeled to allow the adequate
treatment of this type of occurrence. Unreliable or damaged sensors, partial occlusion, communication
interference, and restrictions on data transmission bandwidth are some of the causes of this
problem. Some authors classify missing data according to its randomness and show that in some cases
this absence may also be related to the absence of other attributes. There are basically two ways of dealing
with missing data: i) exclusion, where all or part of the sample is removed or ignored; and ii) imputation,
where the missing value is replaced by zero, by the average of the variable up to the affected sample,
or by an estimate produced by some model that, in some cases, can take other variables and/or
previous values into account. In this context, this article presents a
review of the literature covering the main methodologies for handling missing data.
In addition, case studies are presented to compare results and to identify the situations in which each type of
approach is best employed.
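
The exclusion and imputation strategies named above can be illustrated with a minimal sketch for a single-variable stream. It is not the implementation evaluated in the article: the function names are hypothetical, missing values are assumed to be encoded as None, the running mean is assumed to use only the values actually observed so far, and carrying the last observation forward stands in for the simplest possible "model" based on previous values.

```python
from typing import List, Optional

Stream = List[Optional[float]]  # assumption: None marks a missing sample


def exclude_missing(stream: Stream) -> List[float]:
    """Exclusion: drop every sample whose value is missing."""
    return [x for x in stream if x is not None]


def impute_zero(stream: Stream) -> List[float]:
    """Imputation by zero."""
    return [0.0 if x is None else x for x in stream]


def impute_running_mean(stream: Stream, default: float = 0.0) -> List[float]:
    """Imputation by the mean of the values observed before the gap.

    Assumption: imputed values do not feed back into the mean.
    `default` is used when a gap occurs before any observation.
    """
    out: List[float] = []
    total, count = 0.0, 0
    for x in stream:
        if x is None:
            out.append(total / count if count else default)
        else:
            total += x
            count += 1
            out.append(x)
    return out


def impute_last_value(stream: Stream, default: float = 0.0) -> List[float]:
    """Imputation by a trivial model of previous values:
    carry the last observed value forward."""
    out: List[float] = []
    last = default
    for x in stream:
        if x is not None:
            last = x
        out.append(last)
    return out


if __name__ == "__main__":
    readings: Stream = [2.0, 4.0, None, 6.0, None]
    print(exclude_missing(readings))
    print(impute_zero(readings))
    print(impute_running_mean(readings))
    print(impute_last_value(readings))
```

Exclusion shortens the stream, while the three imputation variants keep its length and differ only in the value written into each gap, which is the kind of difference the case studies are meant to compare.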