Special Missing Values
Nicholas Tierney
2024-03-16
Source: vignettes/special-missing-values.Rmd
Data sometimes have special missing values to indicate specific reasons for missingness. For example, “9999” is sometimes used in weather data, such as the Global Historical Climate Network (GHCN) data, to indicate specific types of missingness, such as instrument failure.
You might be interested in creating your own special missing values so that you can mark specific, known reasons for missingness. For example, an individual dropping out of a study, known instrument failure in weather instruments, or values being censored in analysis. In these cases, the data is missing, but we have information about why it is missing. Coding these cases as NA would cause us to lose this valuable information. Other statistical programming languages like Stata, SAS, and SPSS have this capacity, but currently R does not. So, we need a way to create these special missing values.
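As a minimal base R sketch of the information loss described above (the vector `wind` and the sentinel `-99` are made up for illustration), overwriting a sentinel value with NA makes it indistinguishable from an ordinary missing value:

```r
# Hypothetical readings: -99 flags a broken machine, NA is an ordinary missing value
wind <- c(-99, 68, NA, 72)

# Overwrite the sentinel with NA; which() skips the existing NA safely
wind_na <- replace(wind, which(wind == -99), NA)

# Both kinds of missingness now look the same
is.na(wind_na)
sum(is.na(wind_na))
```

After this step there is no way to tell from `wind_na` alone which value was a known instrument failure.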
We can use recode_shadow() to recode a special missing value as something like NA_reason. naniar records these values in the shadow part of nabular data, which is a special data frame that contains missingness information.
This vignette describes how to add special missing values using the recode_shadow() function. First, in case you are not familiar with the workflows in naniar, we consider some terminology to explain these ideas.
Terminology
Missing data can be represented as a binary matrix of “missing” or “not missing”, which in naniar we call a “shadow matrix”, a term borrowed from Swayne and Buja, 1998.
library(naniar)
as_shadow(oceanbuoys)
#> # A tibble: 736 × 8
#> year_NA latitude_NA longitude_NA sea_temp_c_NA air_temp_c_NA humidity_NA
#> <fct> <fct> <fct> <fct> <fct> <fct>
#> 1 !NA !NA !NA !NA !NA !NA
#> 2 !NA !NA !NA !NA !NA !NA
#> 3 !NA !NA !NA !NA !NA !NA
#> 4 !NA !NA !NA !NA !NA !NA
#> 5 !NA !NA !NA !NA !NA !NA
#> 6 !NA !NA !NA !NA !NA !NA
#> 7 !NA !NA !NA !NA !NA !NA
#> 8 !NA !NA !NA !NA !NA !NA
#> 9 !NA !NA !NA !NA !NA !NA
#> 10 !NA !NA !NA !NA !NA !NA
#> # ℹ 726 more rows
#> # ℹ 2 more variables: wind_ew_NA <fct>, wind_ns_NA <fct>
The shadow matrix has three key features to facilitate analysis:
- Coordinated names: Variables in the shadow matrix gain the same name as in the data, with the suffix “_NA”.
- Special missing values: Values in the shadow matrix can be “special” missing values, indicated as NA_suffix, where “suffix” is a very short message describing the type of missingness.
- Cohesiveness: Binding the shadow matrix column-wise to the original data creates a cohesive “nabular” data form, useful for visualization and summaries.
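The coordinated naming can be checked on a small example. This is a sketch only; the tibble `toy` is a hypothetical dataset introduced here for illustration:

```r
library(naniar)

# Hypothetical two-column dataset with one ordinary missing value
toy <- tibble::tibble(
  wind = c(-99, 68, 72),
  temp = c(45, NA, 25)
)

shadow <- as_shadow(toy)

# Each shadow column takes the original name plus the "_NA" suffix
names(shadow)

# Before any recoding, each shadow column has the two basic levels
levels(shadow$temp_NA)
```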
We create nabular data by binding the shadow to the data:
bind_shadow(oceanbuoys)
#> # A tibble: 736 × 16
#> year latitude longitude sea_temp_c air_temp_c humidity wind_ew wind_ns
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1997 0 -110 27.6 27.1 79.6 -6.40 5.40
#> 2 1997 0 -110 27.5 27.0 75.8 -5.30 5.30
#> 3 1997 0 -110 27.6 27 76.5 -5.10 4.5
#> 4 1997 0 -110 27.6 26.9 76.2 -4.90 2.5
#> 5 1997 0 -110 27.6 26.8 76.4 -3.5 4.10
#> 6 1997 0 -110 27.8 26.9 76.7 -4.40 1.60
#> 7 1997 0 -110 28.0 27.0 76.5 -2 3.5
#> 8 1997 0 -110 28.0 27.1 78.3 -3.70 4.5
#> 9 1997 0 -110 28.0 27.2 78.6 -4.20 5
#> 10 1997 0 -110 28.0 27.2 76.9 -3.60 3.5
#> # ℹ 726 more rows
#> # ℹ 8 more variables: year_NA <fct>, latitude_NA <fct>, longitude_NA <fct>,
#> # sea_temp_c_NA <fct>, air_temp_c_NA <fct>, humidity_NA <fct>,
#> # wind_ew_NA <fct>, wind_ns_NA <fct>
This keeps the data values tied to their missingness, and has great benefits for exploring missing and imputed values in data. See the vignettes Getting Started with naniar and Exploring Imputations with naniar for more details.
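One way to see the benefit, sketched here under the assumption that dplyr is installed (the names `buoy_summary`, `n_obs`, and `mean_sea_temp` are chosen only for this example), is to summarise one variable according to whether another is missing:

```r
library(naniar)
library(dplyr)

# Compare sea temperature between rows where air temperature
# is missing and rows where it is not
buoy_summary <- oceanbuoys %>%
  bind_shadow() %>%
  group_by(air_temp_c_NA) %>%
  summarise(
    n_obs = n(),
    mean_sea_temp = mean(sea_temp_c, na.rm = TRUE)
  )

buoy_summary
```

Because the shadow columns sit alongside the data, the grouping variable needs no extra bookkeeping.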
Recoding missing values
To demonstrate recoding of missing values, we use a toy dataset, df:
df <- tibble::tribble(
~wind, ~temp,
-99, 45,
68, NA,
72, 25
)
df
#> # A tibble: 3 × 2
#> wind temp
#> <dbl> <dbl>
#> 1 -99 45
#> 2 68 NA
#> 3 72 25
To recode the value -99 as a missing value “broken_machine”, we first create nabular data with bind_shadow():
dfs <- bind_shadow(df)
dfs
#> # A tibble: 3 × 4
#> wind temp wind_NA temp_NA
#> <dbl> <dbl> <fct> <fct>
#> 1 -99 45 !NA !NA
#> 2 68 NA !NA NA
#> 3 72 25 !NA !NA
Special types of missingness are encoded in the shadow part of nabular data. Using the recode_shadow() function, we can recode the missing values like so:
dfs_recode <- dfs %>%
recode_shadow(wind = .where(wind == -99 ~ "broken_machine"))
This reads as “recode shadow for wind where wind is equal to -99, and give it the label ‘broken_machine’”. The .where() function is used to help make our intent clearer, and reads very much like the dplyr::case_when() function, but takes care of encoding extra factor levels into the missing data.
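To see what the recoding changes, the following self-contained sketch rebuilds the toy data and inspects both parts of the result. It calls recode_shadow() directly rather than through a pipe, which should be equivalent:

```r
library(naniar)

df <- tibble::tribble(
  ~wind, ~temp,
  -99,   45,
  68,    NA,
  72,    25
)

dfs_recode <- recode_shadow(
  bind_shadow(df),
  wind = .where(wind == -99 ~ "broken_machine")
)

# The data values are untouched: -99 is still present in wind
dfs_recode$wind

# Only the shadow column carries the special missing label
as.character(dfs_recode$wind_NA)
```

Note that recode_shadow() labels the value in the shadow but does not replace -99 in the data itself.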
The extra types of missingness are recoded in the shadow part of the nabular data as additional factor levels:
levels(dfs_recode$wind_NA)
#> [1] "!NA" "NA" "NA_broken_machine"
levels(dfs_recode$temp_NA)
#> [1] "!NA" "NA" "NA_broken_machine"
All additional types of missingness are recorded across all shadow variables, even if those variables don’t contain that special missing value. This ensures all flavours of missingness are known.
To summarise, to use recode_shadow(), the user provides the following information:
- A variable that they want to affect (recode_shadow(var = ...))
- A condition that they want to implement (.where(condition ~ ...))
- A suffix for the new type of missing value (.where(condition ~ suffix))
Under the hood, this special missing value is recoded as a new factor level in the shadow matrix, so that every column is aware of all possible new values of missingness.
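As a hedged sketch of this behaviour, two recodings can be applied one after the other, and each shadow column should then list both new levels. The label "sensor_test" and the condition `temp == 45` are hypothetical and used only for illustration:

```r
library(naniar)

df <- tibble::tribble(
  ~wind, ~temp,
  -99,   45,
  68,    NA,
  72,    25
)

dfs <- bind_shadow(df)

# First recoding: the label used earlier in this vignette
step_one <- recode_shadow(dfs, wind = .where(wind == -99 ~ "broken_machine"))

# Second recoding: "sensor_test" is a made-up label for this example
step_two <- recode_shadow(step_one, temp = .where(temp == 45 ~ "sensor_test"))

# Both shadow columns are expected to know about both special levels
levels(step_two$wind_NA)
levels(step_two$temp_NA)
```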
Examples of using recode_shadow() in a workflow will be discussed in more detail in the near future. For the moment, here is a recommended workflow:
- Use recode_shadow() with actual data
- Replace the previous actual values using replace_with_na() (see the vignette on replacing values with NA)
- Explore missings where special cases are considered
- Explore imputed values, looking at these special cases
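The first three steps of this workflow might look like the following on the toy data. This is a minimal sketch, assuming dplyr is available for count(); the name `df_clean` is chosen only for this example:

```r
library(naniar)
library(dplyr)

df <- tibble::tribble(
  ~wind, ~temp,
  -99,   45,
  68,    NA,
  72,    25
)

df_clean <- df %>%
  bind_shadow() %>%
  # Step 1: label the special missing value in the shadow
  recode_shadow(wind = .where(wind == -99 ~ "broken_machine")) %>%
  # Step 2: replace the sentinel value in the data with NA
  replace_with_na(replace = list(wind = -99))

# Step 3: explore missings, with the special case still identifiable
df_clean %>% count(wind_NA)

# Summaries no longer include the sentinel value
mean(df_clean$wind, na.rm = TRUE)
```

The order matters in this sketch: the recoding condition refers to -99, so it has to run before that value is replaced with NA.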