Missing Data (DRAFT)

In data-driven fields of research, missing data are a ubiquitous and challenging problem. Since the occurrence of missing values is rarely random, not accounting for them may lead to biased estimates and large standard errors in statistical analyses. Over the last decades, this problem has given rise to a broad field of research that has produced various approaches for dealing with missing data. One class of these approaches comprises imputation methods, which replace missing data with imputed values to obtain complete data sets; they are collected in the article about data imputation. To apply a specific imputation method appropriately, one needs to account for the underlying cause of the missingness. The literature therefore distinguishes between different missing data mechanisms, which are presented in this article. More detailed information can be found in [Enders, 2010, Little and Rubin, 2002, Molenberghs et al., 2015].

Missing Data Mechanisms#

Missing data problems are generally classified into three categories. These so-called mechanisms describe how the missingness of values is related to the underlying data. In practice, it is hardly possible to verify such a classification completely, because doing so would require knowledge of the unobserved values. For this reason, a missing data mechanism needs to be understood as an assumption about the data at hand. Since the performance of methods dealing with missing data depends heavily on the assumed mechanism, choosing an adequate mechanism is crucial.

To characterize the three mechanisms mathematically, we set up some notation. Suppose that we consider data from \(n\in\mathbb{N}\) subjects and observe \(p\in\mathbb{N}\) variables for each subject. Then (in the complete data case), the data matrix is given by \(Y\in\mathbb{R}^{n\times p}\). In case of missing values, the missing-data indicator matrix \(M\in\{0,1\}^{n\times p}\) indicates whether a certain value is recorded or not. More precisely, \(m_{ij}=0\) means that variable \(j\) of subject \(i\) is missing, whereas \(m_{ij}=1\) means that the value is recorded. We denote by \(Y_i\) and \(M_i\) the \(i\)-th row of \(Y\) and \(M\), respectively, i.e. the data belonging to subject \(i\). In addition, we decompose \(Y_i\) into \(Y_i^{o}\) and \(Y_i^{m}\), its observed and unobserved components, respectively. The (set of) unknown parameter(s) is denoted by \(\phi\).
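The notation can be made concrete with a small numerical sketch. It assumes that missing entries are coded as NaN in a NumPy array; the toy values and the variable names (`Y_i_obs`, `missing_columns`) are purely illustrative and not part of the original text.

```python
import numpy as np

# Toy data matrix with n = 4 subjects and p = 3 variables (made-up values).
# NaN marks an entry that was not recorded.
Y = np.array([
    [1.2, np.nan, 3.0],
    [0.7, 2.1, np.nan],
    [np.nan, 1.9, 2.5],
    [1.0, 2.0, 3.1],
])

# Indicator matrix in this article's convention: 1 = recorded, 0 = missing.
M = (~np.isnan(Y)).astype(int)

# Observed component Y_i^o of subject i, and the positions of Y_i^m.
i = 0
Y_i_obs = Y[i, M[i] == 1]
missing_columns = np.flatnonzero(M[i] == 0)

print(M)
print(Y_i_obs, missing_columns)
```

Note that some textbooks use the opposite coding (1 = missing), so it is worth checking the convention before comparing formulas across sources.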

Each missing data mechanism can now be characterized by specifying the conditional distribution of \(M\) given \(Y^o\) and \(Y^m\).

Missing Completely At Random (MCAR) Data#

The MCAR mechanism requires that the occurrence of missing values is completely unrelated to the data generating process, i.e. to the components of \(Y\). In this case, the conditional distribution of \(M\) simplifies to

\[ \begin{align*} \mathbb{P}(M_i|Y_i,\phi)=\mathbb{P}(M_i|\phi),\quad i=1,\dots,n. \end{align*} \]

This mechanism is thus the most restrictive one and is often regarded as unrealistic. Nevertheless, it is the only mechanism for which it is rather simple to check whether the data at hand are compatible with the assumption. One method, for example, is to split the subjects into two groups according to whether a certain variable is missing or recorded, and then to apply a t-test for group mean differences on the other variables. A significant difference is evidence against MCAR. The absence of significant differences supports the assumption but does not prove it.
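The group comparison described above can be sketched on simulated data. Everything in this example is an assumption made for illustration: the data generating process, the missingness probabilities, and the helper `welch_t`, which computes Welch's t statistic by hand and approximates the p-value with a normal distribution (reasonable only for large groups). In practice one would more likely call an existing routine such as `scipy.stats.ttest_ind` with `equal_var=False`.

```python
import math
import numpy as np

def welch_t(a, b):
    """Welch's t statistic with a two-sided p-value from a normal approximation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    se = math.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    t = (a.mean() - b.mean()) / se
    p = math.erfc(abs(t) / math.sqrt(2.0))  # large-sample approximation
    return t, p

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)        # fully observed variable
y = x + rng.normal(size=n)    # variable that will receive missing values

# Scenario A: y is deleted with a fixed probability (MCAR by construction).
miss_mcar = rng.random(n) < 0.3
# Scenario B: y is deleted more often when x is large (violates MCAR).
miss_dep = rng.random(n) < 1.0 / (1.0 + np.exp(-2.0 * x))

# Compare the mean of x between subjects with y missing and y recorded.
t_mcar, p_mcar = welch_t(x[miss_mcar], x[~miss_mcar])
t_dep, p_dep = welch_t(x[miss_dep], x[~miss_dep])

print(f"MCAR scenario:      t = {t_mcar:.2f}, p = {p_mcar:.3f}")
print(f"dependent scenario: t = {t_dep:.2f}, p = {p_dep:.3g}")
```

In the second scenario the two groups differ clearly in their mean of `x`, which is the kind of signal this check is designed to detect. The check only looks at observed variables, so it cannot distinguish MAR from MNAR.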

Missing At Random (MAR) Data#

At first glance, the term “missing at random” might be misleading. This mechanism does not mean that missing values occur purely at random in a data set. Instead, the missingness may depend on other measured variables, but not on the unobserved values themselves. We therefore have for MAR data

\[ \begin{align*} \mathbb{P}(M_i|Y_i,\phi)=\mathbb{P}(M_i|Y_i^o,\phi),\quad i=1,\dots,n. \end{align*} \]

Missing Not At Random (MNAR) Data#

Missing data are said to be MNAR if the missingness is related to the underlying, but missing, value itself, and possibly to other variables as well. Hence, the conditional distribution of \(M_i\) does not simplify in this setting, and we obtain

\[ \begin{align*} \mathbb{P}(M_i|Y_i,\phi)=\mathbb{P}(M_i|Y_i^o,Y_i^m,\phi),\quad i=1,\dots,n. \end{align*} \]
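To illustrate how the three mechanisms differ in practice, the following sketch simulates each of them for a single incomplete variable and compares the mean of the recorded values with the true mean of zero. The data generating model, the logistic missingness probabilities, and all names (`x`, `y`, `m_mcar`, `m_mar`, `m_mnar`, `cc_means`) are assumptions chosen for illustration, not taken from the references.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)              # always observed
y = 0.8 * x + rng.normal(size=n)    # may be missing; true mean is 0

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

u = rng.random(n)
# Indicators follow the article's convention: 1 = recorded, 0 = missing.
m_mcar = (u >= 0.4).astype(int)                # P(missing) is constant
m_mar = (u >= logistic(1.5 * x)).astype(int)   # P(missing) depends on observed x only
m_mnar = (u >= logistic(1.5 * y)).astype(int)  # P(missing) depends on y itself

# Mean of y over the subjects for which y is recorded.
cc_means = {
    name: y[m == 1].mean()
    for name, m in [("MCAR", m_mcar), ("MAR", m_mar), ("MNAR", m_mnar)]
}
for name, value in cc_means.items():
    print(f"{name}: mean of recorded y = {value:.3f}")
```

In this simulation the mean of the recorded values stays close to zero only under MCAR. Under MAR it is shifted because `y` is correlated with `x`, which drives the missingness; under MNAR it is shifted because large values of `y` are themselves more likely to be missing.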

Literature#

End10: Craig K. Enders. Applied Missing Data Analysis. Guilford, 2010.
LR02: Roderick J. A. Little and Donald B. Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, 2002.
MFK+15: Geert Molenberghs, Garrett Fitzmaurice, Michael G. Kenward, Anastasios Tsiatis, and Geert Verbeke. Handbook of Missing Data Methodology. CRC Press, 2015.

Authors#

Julian Wäsche

Contributors#

Jonas Bauer