library(tidyverse)
library(nycflights13)
# Your code here
3 Exercises (Chapter 3)
In a single pipeline for each condition, find all flights that meet the condition:
Had an arrival delay of two or more hours
Flew to Houston (IAH or HOU)
Were operated by United, American, or Delta
Departed in summer (July, August, and September)
Arrived more than two hours late, but didn’t leave late.
Were delayed by at least an hour, but made up over 30 minutes in flight
Sort flights to find the flights with longest departure delays. Find the flights that left earliest in the morning.
Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)
Was there a flight on every day of 2013?
Which flights traveled the farthest distance? Which traveled the least distance?
Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
What happens if you specify the name of the same variable multiple times in a select() call?
What does the any_of() function do? Why might it be helpful in conjunction with this vector?
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
# Try below first
flights |>
  select(variables)
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(variables)

  # Now:
  data %>% select(all_of(variables))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.

# A tibble: 336,776 × 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# ℹ 336,766 more rows
# Or
flights |>
  select(any_of(variables))

# A tibble: 336,776 × 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# ℹ 336,766 more rows
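Both calls above return the same columns because every name in the vector exists in flights. A minimal sketch of where any_of() and all_of() start to differ, assuming a hypothetical one-row tibble named toy (not part of nycflights13) that lacks an arr_delay column:

```r
library(dplyr)

# Hypothetical toy data for illustration only; note there is no arr_delay column.
toy <- tibble(year = 2013L, month = 1L, day = 1L, dep_delay = 2)

variables <- c("year", "month", "day", "dep_delay", "arr_delay")

# any_of() keeps the names that exist and silently skips the rest.
kept <- toy |>
  select(any_of(variables))
names(kept)

# all_of() is strict: a name missing from the data raises an error,
# which is caught here so the script keeps running.
strict <- tryCatch(
  toy |> select(all_of(variables)),
  error = function(e) "error"
)
strict
```

This suggests any_of() is a reasonable choice when a vector of names may be only partly present in the data, while all_of() is safer when a missing column should stop the pipeline.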
Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?
flights |>
  select(contains("TIME"))
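As a hedged sketch of the case-handling question, the snippet below uses a hypothetical three-column tibble named toy rather than flights. It relies on the ignore.case argument that contains() and the other string-matching select helpers accept:

```r
library(dplyr)

# Hypothetical toy data for illustration only.
toy <- tibble(dep_time = 517L, air_time = 227, carrier = "UA")

# Default behaviour: the match ignores case.
default_match <- toy |>
  select(contains("TIME"))
names(default_match)

# Case-sensitive matching: no lower-case column name contains "TIME".
strict_match <- toy |>
  select(contains("TIME", ignore.case = FALSE))
ncol(strict_match)
```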
Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.
Why doesn’t the following work, and what does the error mean?
flights |>
  select(tailnum) |>
  arrange(arr_delay)

Error in `arrange()`:
ℹ In argument: `..1 = arr_delay`.
Caused by error:
! object 'arr_delay' not found

flights |>
  select(tailnum)

# A tibble: 336,776 × 1
   tailnum
   <chr>
 1 N14228
 2 N24211
 3 N619AA
 4 N804JB
 5 N668DN
 6 N39463
 7 N516JB
 8 N829AS
 9 N593JB
10 N3ALAA
# ℹ 336,766 more rows
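The output above shows that only tailnum survives select(), so a later arrange() has no arr_delay column to sort by. One possible reordering, sketched on a hypothetical three-row tibble named toy rather than on flights, sorts first and drops columns afterwards:

```r
library(dplyr)

# Hypothetical toy data for illustration only.
toy <- tibble(
  tailnum   = c("N1", "N2", "N3"),
  arr_delay = c(20, -5, 11)
)

# Sort while arr_delay is still available, then keep only tailnum.
sorted <- toy |>
  arrange(arr_delay) |>
  select(tailnum)
sorted$tailnum
```

Each verb in a pipeline only sees the data frame produced by the step before it, which is why the order of select() and arrange() matters here.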
Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))
Find the flights that are most delayed upon departure from each destination.
How do delays vary over the course of the day? Illustrate your answer with a plot.
What happens if you supply a negative n to slice_min() and friends?
Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?
Suppose we have the following tiny data frame:
df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)
Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.
df |>
  group_by(y)
Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also comment on how it’s different from the group_by() in part (a).
df |>
  arrange(y)
Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.
df |>
  group_by(y) |>
  summarize(mean_x = mean(x))
Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.
df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))
Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d)?
df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")
Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?
df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x))

df |>
  group_by(y, z) |>
  mutate(mean_x = mean(x))
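One way to check a prediction for the last comparison is to store both results and compare their shapes. The sketch below rebuilds the tiny data frame so it runs on its own. The names summarized and mutated are introduced here for illustration, and the .groups = "drop" and ungroup() calls are additions that only remove the grouping from the results:

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# summarize() collapses each (y, z) group into a single row.
summarized <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop")

# mutate() keeps every original row and adds the group mean as a new column.
mutated <- df |>
  group_by(y, z) |>
  mutate(mean_x = mean(x)) |>
  ungroup()

nrow(summarized)  # one row per (y, z) combination
nrow(mutated)     # same number of rows as df
```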