library(tidyverse)
library(nycflights13)
# Your code here
3 Exercises (Chapter 3)
In a single pipeline for each condition, find all flights that meet the condition:
Had an arrival delay of two or more hours
Flew to Houston (IAH or HOU)
Were operated by United, American, or Delta
Departed in summer (July, August, and September)
Arrived more than two hours late, but didn’t leave late.
Were delayed by at least an hour, but made up over 30 minutes in flight
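As a starting point for the conditions above, here is a minimal sketch of the general shape of a single-pipeline filter. It assumes dplyr and nycflights13 are installed, and that "two or more hours" means at least 120 minutes, since arr_delay is recorded in minutes. The names late_arrivals and houston are only illustrative.

```r
library(dplyr)
library(nycflights13)

# Keep flights whose arrival delay is at least 120 minutes (two hours).
# filter() also drops rows where the condition evaluates to NA.
late_arrivals <- flights |>
  filter(arr_delay >= 120)

# For a "one of several values" condition, %in% is usually tidier than
# chaining == comparisons with |, e.g. for the two Houston airports.
houston <- flights |>
  filter(dest %in% c("IAH", "HOU"))
```

The remaining conditions follow the same pattern, combining comparisons with & or a comma inside filter().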
Sort flights to find the flights with the longest departure delays. Find the flights that left earliest in the morning.
Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)
Was there a flight on every day of 2013?
Which flights traveled the farthest distance? Which traveled the least distance?
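The hint about putting a calculation inside the function can be sketched as follows. This is one possible reading, not the only one: it assumes "fastest" means the highest average speed in miles per hour, computed from distance (miles) and air_time (minutes). The names most_delayed and fastest are illustrative.

```r
library(dplyr)
library(nycflights13)

# Longest departure delays first: desc() reverses the sort order.
most_delayed <- flights |>
  arrange(desc(dep_delay))

# arrange() accepts expressions, not only column names.
# Assumption: "fastest" is taken to mean highest average speed (mph).
fastest <- flights |>
  arrange(desc(distance / (air_time / 60)))
```

arrange() places rows with missing values at the end, so flights without a recorded dep_delay or air_time do not appear at the top of either result.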
Does it matter what order you used filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.
Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?
Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.
What happens if you specify the name of the same variable multiple times in a select() call?
What does the any_of() function do? Why might it be helpful in conjunction with this vector?
variables <- c("year", "month", "day", "dep_delay", "arr_delay")
# Try below first
flights |> select(variables)
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(variables)

  # Now:
  data %>% select(all_of(variables))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
# A tibble: 336,776 × 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# ℹ 336,766 more rows
# Or
flights |> select(any_of(variables))
# A tibble: 336,776 × 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# ℹ 336,766 more rows
Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?
flights |> select(contains("TIME"))
Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.
Why doesn’t the following work, and what does the error mean?
flights |> select(tailnum) |> arrange(arr_delay)
Error in `arrange()`:
ℹ In argument: `..1 = arr_delay`.
Caused by error:
! object 'arr_delay' not found
flights |> select(tailnum)
# A tibble: 336,776 × 1
   tailnum
   <chr>
 1 N14228
 2 N24211
 3 N619AA
 4 N804JB
 5 N668DN
 6 N39463
 7 N516JB
 8 N829AS
 9 N593JB
10 N3ALAA
# ℹ 336,766 more rows
Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))
Find the flights that are most delayed upon departure from each destination.
How do delays vary over the course of the day? Illustrate your answer with a plot.
What happens if you supply a negative n to slice_min() and friends?
Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?
Suppose we have the following tiny data frame:
df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)
(a) Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.
df |> group_by(y)
(b) Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also comment on how it’s different from the group_by() in part (a).
df |> arrange(y)
(c) Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.
df |> group_by(y) |> summarize(mean_x = mean(x))
(d) Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.
df |> group_by(y, z) |> summarize(mean_x = mean(x))
(e) Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d)?
df |> group_by(y, z) |> summarize(mean_x = mean(x), .groups = "drop")
(f) Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?
df |> group_by(y, z) |> summarize(mean_x = mean(x))
df |> group_by(y, z) |> mutate(mean_x = mean(x))
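To check your reasoning about summarize() versus mutate() on grouped data, it can help to compare row counts on a second, unrelated example. The sketch below uses a hypothetical tibble named scores (its columns and values are made up for illustration and are not part of the exercise), so it does not give away the output for df.

```r
library(dplyr)
library(tibble)

# Hypothetical data, deliberately different from df in the exercise.
scores <- tibble(
  team  = c("red", "red", "blue", "blue", "blue"),
  round = c(1, 2, 1, 1, 2),
  pts   = c(10, 20, 30, 50, 70)
)

# summarize() collapses each group to a single row.
per_group <- scores |>
  group_by(team, round) |>
  summarize(mean_pts = mean(pts), .groups = "drop")

# mutate() keeps every original row and adds the group mean alongside it.
per_row <- scores |>
  group_by(team, round) |>
  mutate(mean_pts = mean(pts)) |>
  ungroup()

nrow(per_group)  # one row per distinct (team, round) pair
nrow(per_row)    # same number of rows as scores
```

Comparing nrow() on the two results is a quick way to see which verb changes the shape of the data and which only adds a column.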