Replacing values with NA
Nicholas Tierney
2024-03-16
Source:vignettes/replace-with-na.Rmd
replace-with-na.Rmd
When dealing with missing values, you might want to replace values
with a missing values (NA
). This is useful in cases when
you know the origin of the data and can be certain which values should
be missing. For example, you might know that all values of “N/A”, “N A”,
and “Not Available”, or -99, or -1 are supposed to be missing.
naniar
provides functions to specifically work on this
type of problem using the function replace_with_na()
. This
function is the compliment to tidyr::replace_na()
, which
replaces an NA value with a specified value, whereas
naniar::replace_with_na()
replaces a value with an NA:
-
tidyr::replace_na()
: Missing values turns into a value (NA –> -99) -
naniar::replace_with_na()
: Value becomes a missing value (-99 –> NA)
In this vignette, we describe some simple use cases for these functions and describe how they work.
Example data
First, we introduce a small fictional dataset, df
, which
contains some common features of a dataset with the sorts of missing
values we might encounter. This includes multiple specifications of
missing values, such as “N/A”, “N A”, and “Not Available”. And also some
common numeric codes, like -98, -99, and -1.
df <- tibble::tribble(
~name, ~x, ~y, ~z,
"N/A", 1, "N/A", -100,
"N A", 3, "NOt available", -99,
"N / A", NA, "29", -98,
"Not Available", -99, "25", -101,
"John Smith", -98, "28", -1)
Using replace_with_na
What if we want to replace the value -99 in the x
column
with a missing value?
First, let’s load naniar
:
Now, we specify the fact that we want to replace -99 with a missing
value. To do so we use the replace
argument, and specify a
named list, which contains the names of the variable and the value it
would take to replace with NA
.
df %>% replace_with_na(replace = list(x = -99))
#> # A tibble: 5 × 4
#> name x y z
#> <chr> <dbl> <chr> <dbl>
#> 1 N/A 1 N/A -100
#> 2 N A 3 NOt available -99
#> 3 N / A NA 29 -98
#> 4 Not Available NA 25 -101
#> 5 John Smith -98 28 -1
And say we want to replace -98 as well?
df %>%
replace_with_na(replace = list(x = c(-99, -98)))
#> # A tibble: 5 × 4
#> name x y z
#> <chr> <dbl> <chr> <dbl>
#> 1 N/A 1 N/A -100
#> 2 N A 3 NOt available -99
#> 3 N / A NA 29 -98
#> 4 Not Available NA 25 -101
#> 5 John Smith NA 28 -1
And then what if we want to replace -99 and -98 in all the numeric columns, x and z?
df %>%
replace_with_na(replace = list(x = c(-99,-98),
z = c(-99, -98)))
#> # A tibble: 5 × 4
#> name x y z
#> <chr> <dbl> <chr> <dbl>
#> 1 N/A 1 N/A -100
#> 2 N A 3 NOt available NA
#> 3 N / A NA 29 NA
#> 4 Not Available NA 25 -101
#> 5 John Smith NA 28 -1
Using replace_with_na()
works well when we know the
exact value to be replaced, and for which variables we want to replace,
providing there are not many variables. But what do you do when you’ve
got many variables you want to observe?
Extending replace_with_na
Sometimes you have many of the same value that you want to replace. For example, -99 and -98 above, and also the variants of “NA”, such as “N/A”, and “N / A”, and “Not Available”. You might also have certain variables that you want to be affected by these rules, or you might have more complex rules, like, “only affect variables that are numeric, or character, with this rule”.
To account for these cases we have borrowed from dplyr
’s
scoped variants and created the functions:
-
replace_with_na_all()
Replaces NA for all variables. -
replace_with_na_at()
Replaces NA on a subset of variables specified with character quotes (e.g., c(“var1”, “var2”)). -
replace_with_na_if()
Replaces NA based on applying an operation on the subset of variables for which a predicate function (is.numeric, is.character, etc) returns TRUE.
Below we will now consider some very simple examples of the use of these functions, so that you can better understand how to use them.
Using replace_with_na_all
Use replace_with_na_all()
when you want to replace ALL
values that meet a condition across an entire dataset. The syntax here
is a little different, and follows the rules for rlang’s expression of
simple functions. This means that the function starts with
~
, and when referencing a variable, you use
.x
.
For example, if we want to replace all cases of -99 in our dataset, we write:
df %>% replace_with_na_all(condition = ~.x == -99)
#> # A tibble: 5 × 4
#> name x y z
#> <chr> <dbl> <chr> <dbl>
#> 1 N/A 1 N/A -100
#> 2 N A 3 NOt available NA
#> 3 N / A NA 29 -98
#> 4 Not Available NA 25 -101
#> 5 John Smith -98 28 -1
Likewise, if you have a set of (annoying) repeating strings like various spellings of “NA”, then I suggest you first lay out all the offending cases:
# write out all the offending strings
na_strings <- c("NA", "N A", "N / A", "N/A", "N/ A", "Not Available", "NOt available")
Then you write ~.x %in% na_strings
- which reads as
“does this value occur in the list of NA strings”.
df %>%
replace_with_na_all(condition = ~.x %in% na_strings)
#> # A tibble: 5 × 4
#> name x y z
#> <chr> <dbl> <chr> <dbl>
#> 1 NA 1 NA -100
#> 2 NA 3 NA -99
#> 3 NA NA 29 -98
#> 4 NA -99 25 -101
#> 5 John Smith -98 28 -1
You can also use the built-in strings and numbers in naniar:
common_na_numbers
common_na_strings
common_na_numbers
#> [1] -9 -99 -999 -9999 9999 66 77 88
common_na_strings
#> [1] "missing" "NA" "N A" "N/A" "#N/A" "NA " " NA"
#> [8] "N /A" "N / A" " N / A" "N / A " "na" "n a" "n/a"
#> [15] "na " " na" "n /a" "n / a" " a / a" "n / a " "NULL"
#> [22] "null" "" "\\?" "\\*" "\\."
And you can replace values matching those strings or numbers like so:
df %>%
replace_with_na_all(condition = ~.x %in% common_na_strings)
#> # A tibble: 5 × 4
#> name x y z
#> <chr> <dbl> <chr> <dbl>
#> 1 NA 1 NA -100
#> 2 NA 3 NOt available -99
#> 3 NA NA 29 -98
#> 4 Not Available -99 25 -101
#> 5 John Smith -98 28 -1
replace_with_na_at
This is similar to _all
, but instead in this case you
can specify the variables that you want affected by the rule that you
state. This is useful in cases where you want to specify a rule that
only affects a selected number of variables.
df %>%
replace_with_na_at(.vars = c("x","z"),
condition = ~.x == -99)
#> # A tibble: 5 × 4
#> name x y z
#> <chr> <dbl> <chr> <dbl>
#> 1 N/A 1 N/A -100
#> 2 N A 3 NOt available NA
#> 3 N / A NA 29 -98
#> 4 Not Available NA 25 -101
#> 5 John Smith -98 28 -1
Although you can achieve this with regular
replace_with_na()
, it is more concise to use,
replace_with_na_at()
. Additionally, you can specify rules
as function, for example, make a value NA if the exponent of that number
is less than 1:
df %>%
replace_with_na_at(.vars = c("x","z"),
condition = ~ exp(.x) < 1)
#> # A tibble: 5 × 4
#> name x y z
#> <chr> <dbl> <chr> <dbl>
#> 1 N/A 1 N/A NA
#> 2 N A 3 NOt available NA
#> 3 N / A NA 29 NA
#> 4 Not Available NA 25 NA
#> 5 John Smith NA 28 NA
replace_with_na_if
There may be some cases where you can identify variables based on
some test - is.character()
- are they character variables?
is.numeric()
- Are they numeric or double? and a given
value inside that type of data. For example,
df %>%
replace_with_na_if(.predicate = is.character,
condition = ~.x %in% ("N/A"))
#> # A tibble: 5 × 4
#> name x y z
#> <chr> <dbl> <chr> <dbl>
#> 1 NA 1 NA -100
#> 2 N A 3 NOt available -99
#> 3 N / A NA 29 -98
#> 4 Not Available -99 25 -101
#> 5 John Smith -98 28 -1
# or
df %>%
replace_with_na_if(.predicate = is.character,
condition = ~.x %in% (na_strings))
#> # A tibble: 5 × 4
#> name x y z
#> <chr> <dbl> <chr> <dbl>
#> 1 NA 1 NA -100
#> 2 NA 3 NA -99
#> 3 NA NA 29 -98
#> 4 NA -99 25 -101
#> 5 John Smith -98 28 -1
This means that you are able to apply a rule to many variables that meet a pre-specified condition. This can be of particular use if you have many variables and don’t want to list them all, and also if you know that there is a particular problem for variables of a particular class.
Notes on alternative ways to handle replacing with NAs
There are some alternative ways to handle replacing values with NA in
the tidyverse, na_if
and using readr
. These
are ultimately not as expressive as the replace_with_na()
functions, but they are very useful if you only have one kind of value
to replace with a missing, and if you know what the missing values are
upon reading in the data.
This function allows you to replace exact values - similar to
replace_with_na()
, but only for one single column in a data
frame. Here is how you would use it in our examples.
# instead of:
df_1 <- df %>% replace_with_na_all(condition = ~.x == -99)
df_1
#> # A tibble: 5 × 4
#> name x y z
#> <chr> <dbl> <chr> <dbl>
#> 1 N/A 1 N/A -100
#> 2 N A 3 NOt available NA
#> 3 N / A NA 29 -98
#> 4 Not Available NA 25 -101
#> 5 John Smith -98 28 -1
df_2 <- df %>% dplyr::mutate(
x = dplyr::na_if(x, -99),
y = dplyr::na_if(z, -99)
)
df_2
#> # A tibble: 5 × 4
#> name x y z
#> <chr> <dbl> <dbl> <dbl>
#> 1 N/A 1 -100 -100
#> 2 N A 3 NA -99
#> 3 N / A NA -98 -98
#> 4 Not Available NA -101 -101
#> 5 John Smith -98 -1 -1
# are they the same?
all.equal(df_1, df_2)
#> [1] "Component \"y\": Modes: character, numeric"
#> [2] "Component \"y\": target is character, current is numeric"
#> [3] "Component \"z\": 'is.NA' value mismatch: 0 in current 1 in target"
Note, however, that na_if()
can only take arguments of
length one. This means that it cannot capture other statements like
na_strings <- c("NA", "N A", "N / A", "N/A", "N/ A", "Not Available", "NOt available")
df_3 <- df %>% replace_with_na_all(condition = ~.x %in% na_strings)
# Not run:
df_4 <- df %>% dplyr::na_if(x = ., y = na_strings)
# Error in check_length(y, x, fmt_args("y"), glue("same as {fmt_args(~x)}")) :
# argument "y" is missing, with no default
It also cannot handle more complex equations, where you want to refer to values in other columns, or values less than or greater than another value.
catch NAs with readr
When reading in your data, you can use the na
argument
inside readr
to replace certain values with NA. For
example:
# not run
dat_raw <- readr::read_csv("original.csv", na = na_strings)
This would convert all of the values in na_strings
into
missing values.
This is useful to use if you happen to know the NA types upon reading in the data. However, this is not always practical in a data analysis pipeline.