R: filtering with across in dplyr
By Bas Machielsen
May 17, 2020
Introduction
The newest versions of the dplyr
package introduced a new function, across()
, to be used within summarise, mutate and filter functions, and I am trying to learn it. I am, however, having a hard time understanding the use and the mechanics of the function in combination with filter. While
this vignette has ample examples on mutate and summarise, the examples on filter are few and not very insightful. The same is true for several non-official guides that I have read.
As mentioned, the dplyr
documentation provides examples with the summarise
and mutate
functions, but it does not deal extensively with the filter
function, which is not (in my opinion) straightforward to use at all. The objective of this pamphlet is to introduce the reader to the (proper and transparent) usage of across()
in conjunction with filter
, but also to remind the writer of the functionality of across()
. Finally, it aims to help new and experienced programmers wrap their head around this new function.
After reading this pamphlet, you should be familiar enough with across()
to be able to replace the _all
, _at
, and _if
function family intuitively. I will illustrate how this works with the help of some examples.
Filter with all_vars
First, let us look at the most basic usage of across
in filter
: filtering a dataset based on all rows meeting a certain condition or requirement.
Suppose we want to select rows in a dataset such that for every character variable, the length of the string should be larger than a particular number, in this example, 5. This would be the equivalent of filter_all
in a data.frame consisting purely of character vectors, or otherwise, of filter_at
.
First, load the relevant packages..
library(tidyverse)
library(stringr)
As mentioned in the vignette, the across function without any additions works automatically as all_vars
, as evidenced by the following behavior:
numbers <- data.frame(a = c(10000, 90000, 60000, 3000),
b = c(10000, 20000, 4000, 30000))
numbers
## a b
## 1 10000 10000
## 2 90000 20000
## 3 60000 4000
## 4 3000 30000
numbers %>%
filter(across(everything(),~.x > 9999))
## a b
## 1 10000 10000
## 2 90000 20000
We can also specify which variables we want to meet the condition. Let’s take the dataset from before and apply the filter only on the first column:
numbers %>%
filter(across(a, ~ . > 9999))
## a b
## 1 10000 10000
## 2 90000 20000
## 3 60000 4000
As as second example, let’s implement a condition requiring that the string length of every character variable be greater than 2 (or 3) in the dplyr::starwars dataset:
starwars %>%
filter(across(is.character, ~ str_length(.) > 2))
## # A tibble: 69 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 Darth V… 202 136 none white yellow 41.9 male mascu…
## 3 Leia Or… 150 49 brown light brown 19 fema… femin…
## 4 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 5 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 6 Biggs D… 183 84 black light brown 24 male mascu…
## 7 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## 8 Anakin … 188 84 blond fair blue 41.9 male mascu…
## 9 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
## 10 Chewbac… 228 112 brown unknown blue 200 male mascu…
## # … with 59 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
starwars %>%
filter(across(is.character, ~ str_length(.) > 3))
## # A tibble: 63 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 Darth V… 202 136 none white yellow 41.9 male mascu…
## 3 Leia Or… 150 49 brown light brown 19 fema… femin…
## 4 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 5 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 6 Biggs D… 183 84 black light brown 24 male mascu…
## 7 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## 8 Anakin … 188 84 blond fair blue 41.9 male mascu…
## 9 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
## 10 Chewbac… 228 112 brown unknown blue 200 male mascu…
## # … with 53 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
We select all character variables with as.character
, and condition the filter on the string length of all variables being larger than 5.
Similarly, consider the following modified dataframe from the colwise vignette I mentioned above:
df <- tibble(x = c("a", "b"),
y = c(1, 1),
z = c(-1, 1),
w = c("Hanz", "Genghis Khan"),
u = c("Werner", "Monsieur Eugene Duchene"))
df
## # A tibble: 2 × 5
## x y z w u
## <chr> <dbl> <dbl> <chr> <chr>
## 1 a 1 -1 Hanz Werner
## 2 b 1 1 Genghis Khan Monsieur Eugene Duchene
Suppose we want to select, in parallel to the example given in the vignette, the rows for which all numeric variables are greater than -1:
df %>%
filter(across(is.numeric, ~ . > -1))
## # A tibble: 1 × 5
## x y z w u
## <chr> <dbl> <dbl> <chr> <chr>
## 1 b 1 1 Genghis Khan Monsieur Eugene Duchene
So it is very straightforward to use the filter()
function if you want to filter such that all variables meet a criterion. It is also very straightforward to filter using multiple criteria, as long as you want all variables, and NOT some variables, to which you apply the criterion have to meet it:
df %>%
filter(across(is.numeric, ~ . > -1), across(c(w,u), ~ str_length(.) > 5))
## # A tibble: 1 × 5
## x y z w u
## <chr> <dbl> <dbl> <chr> <chr>
## 1 b 1 1 Genghis Khan Monsieur Eugene Duchene
.. or perhaps better illustrated by:
# All numeric variables must be larger than 10
starwars %>%
filter(across(is.numeric, ~ . > 10))
## # A tibble: 35 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth V… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Or… 150 49 brown light brown 19 fema… femin…
## 6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Wh… 165 75 brown light blue 47 fema… femin…
## 8 Biggs D… 183 84 black light brown 24 male mascu…
## 9 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
## 10 Anakin … 188 84 blond fair blue 41.9 male mascu…
## # … with 25 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## # films <list>, vehicles <list>, starships <list>
# All numeric variables must be larger than 10
# and both name AND hair_color should contain an a.
starwars %>%
filter(across(is.numeric, ~ . > 10),
across(c(name, hair_color), ~ grepl("a", .)))
## # A tibble: 7 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Biggs Da… 183 84 black light brown 24 male mascu…
## 2 Obi-Wan … 182 77 auburn, w… fair blue-gray 57 male mascu…
## 3 Boba Fett 183 78.2 black fair brown 31.5 male mascu…
## 4 Lando Ca… 177 79 black dark brown 31 male mascu…
## 5 Luminara… 170 56.2 black yellow blue 58 fema… femin…
## 6 Barriss … 166 50 black yellow blue 40 fema… femin…
## 7 Jango Fe… 183 79 black tan brown 66 male mascu…
## # … with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Filter with any_vars
Problems arise when we attempt to do the following: Suppose we want to laxen the condition, and instead, we want to require that at least one condition be met.
For example, in the data.frame df
below, we want at least one character string to be larger than 7. How do we go about this? Instead of using the any_vars
and all_vars
helpers, we must use an auxiliary function, as documented
here. In my opinion, this way of filtering is far from intuitive and transparent, and it took me a while to figure it out what I should do in each specific case: it is the reason why I wrote this pamphlet.
df <- tibble(x = c("a", "b"),
y = c(1, 1),
z = c(-1, 1),
w = c("Harry", "Potterisverycool")
)
df
## # A tibble: 2 × 4
## x y z w
## <chr> <dbl> <dbl> <chr>
## 1 a 1 -1 Harry
## 2 b 1 1 Potterisverycool
Now, suppose we want to filter the data.frame df
to include only the rows in which at least one character vector has a string length greater than 6. The vignette advises us to use a helper function, which is defined to be:
rowAny <- function(x) rowSums(x) > 0
In my view, this seems to be pretty much the only ‘helper’ function I can think of and that works, so I don’t see why it is not included as a function in the package. In any case, the job consists of apply the across()
function as an argument to rowAny
:
library(stringr)
df %>%
filter(rowAny(across(is.character, ~ str_length(.) > 6)))
## # A tibble: 1 × 4
## x y z w
## <chr> <dbl> <dbl> <chr>
## 1 b 1 1 Potterisverycool
df %>%
filter(rowAny(across(is.character, ~ str_length(.) > 4)))
## # A tibble: 2 × 4
## x y z w
## <chr> <dbl> <dbl> <chr>
## 1 a 1 -1 Harry
## 2 b 1 1 Potterisverycool
In other words, we are asking R to filter to any rows to which the following condition aplies:
-
For any row, check all character vectors
-
Compute the string length (
str_length
) of the cells -
Ask whether it is larger than 6 (or 4, respectively, in the examples)
-
We get back a logical in every cell
-
If the rowwise-sum of all these conditions is larger than 0, then keep the row
Let’s go over each of these individual steps quickly and try to emulate R’s behavior over this function:
#First, this is what we get when we evaluate the condition in each cell
df %>%
summarise(across(everything(), ~ str_length(.x) > 5))
## # A tibble: 2 × 4
## x y z w
## <lgl> <lgl> <lgl> <lgl>
## 1 FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE TRUE
#Then, we sum the value of all the logicals per row:
df %>%
summarise(across(everything(), ~ str_length(.x) > 5)) %>%
rowwise() %>%
rowSums() > 0
## [1] FALSE TRUE
#which gives us back the rows eligible for inclusion
df[df %>%
summarise(across(everything(), ~ str_length(.x) > 5)) %>%
rowwise() %>%
rowSums() > 0,
]
## # A tibble: 1 × 4
## x y z w
## <chr> <dbl> <dbl> <chr>
## 1 b 1 1 Potterisverycool
Similarly, going back to the df example, we could have also changed the zero to go from “at least one” to “at least two”, for example. So this would mean “select all Rows in which at least 2 variables meet a certain condition”. More concretely:
rowAny <- function(x) rowSums(x) > 1
df <- df %>%
mutate(u = c("Hermione", "Granger"))
df
## # A tibble: 2 × 5
## x y z w u
## <chr> <dbl> <dbl> <chr> <chr>
## 1 a 1 -1 Harry Hermione
## 2 b 1 1 Potterisverycool Granger
Let’s see if filter now works as expected, that is to say, it should only include the second row, as I’ve specified that the amount of occurrences (the count of the logical vectors) should be larger than one. (In case of the first row, it is 1.)
df %>%
filter(
rowAny(
across(
is.character, ~ str_length(.) > 5
)
)
)
## # A tibble: 1 × 5
## x y z w u
## <chr> <dbl> <dbl> <chr> <chr>
## 1 b 1 1 Potterisverycool Granger
df %>%
filter(
rowAny(
across(
is.character, ~ str_length(.) > 7
)
)
)
## # A tibble: 0 × 5
## # … with 5 variables: x <chr>, y <dbl>, z <dbl>, w <chr>, u <chr>
Conclusion
Whereas across()
has many other benefits clearly demonstrated in various vignettes, and has enormous potential, in the specific case of filtering, it might have become more awkward to use: the functions of filter_all
or filter_at
in combination with any_vars
and all_vars
were (i) at least as, or more intuitive, and (ii) did not require the user to specify a helper function. As for me, I hope this rowAny
will get a place inside dplyr
, preferably under an intuitive monniker. In any case, I hope you’ve found the small demonstration useful. Feel free to contact me at any time via
e-mail or
Github.
- Posted on:
- May 17, 2020
- Length:
- 11 minute read, 2250 words
- See Also: