3  Exercises (Chapter 3)

  1. In a single pipeline for each condition, find all flights that meet the condition:

    • Had an arrival delay of two or more hours

      Answer
      library(tidyverse)
      library(nycflights13)
      # Your code here
    • Flew to Houston (IAH or HOU)

      Answer
      # Your code here
    • Were operated by United, American, or Delta

      Answer
      # Your code here
    • Departed in summer (July, August, and September)

      Answer
      # Your code here
    • Arrived more than two hours late, but didn’t leave late

      Answer
      # Your code here
    • Were delayed by at least an hour, but made up over 30 minutes in flight

      Answer
      # Your code here
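
The conditions in exercise 1 rely on a few filter() idioms: set membership with %in%, several conditions joined by commas, and arithmetic inside a condition. The following is a minimal sketch on a hypothetical toy tibble, not a worked answer on the full data. The `toy` data and its values are made up for illustration; only the column names mirror flights.

```r
library(dplyr)

# Hypothetical toy data; column names mirror a few columns of `flights`.
toy <- tibble(
  dest      = c("IAH", "HOU", "ORD", "LAX"),
  carrier   = c("UA", "AA", "B6", "DL"),
  dep_delay = c(70, 0, -3, 15),
  arr_delay = c(35, 130, 5, 125)
)

# %in% tests set membership; same as dest == "IAH" | dest == "HOU".
to_houston <- toy |>
  filter(dest %in% c("IAH", "HOU"))

# Conditions separated by commas are combined with AND.
late_arrival_ontime_departure <- toy |>
  filter(arr_delay > 120, dep_delay <= 0)

# Arithmetic can appear inside a condition: the time made up in the air
# is dep_delay - arr_delay.
made_up_time <- toy |>
  filter(dep_delay >= 60, dep_delay - arr_delay > 30)
```

Each pipeline can be pointed at flights by replacing `toy`, provided the thresholds are adjusted to the wording of the corresponding bullet.
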
  2. Sort flights to find the flights with the longest departure delays. Find the flights that left earliest in the morning.

    Answer
    # Your code here
  3. Sort flights to find the fastest flights. (Hint: Try including a math calculation inside of your function.)

    Answer
    # Your code here
  4. Was there a flight on every day of 2013?

    Answer
    # Your code here

    Your text answer here.

  5. Which flights traveled the farthest distance? Which traveled the least distance?

    Answer
    # Your code here
  6. Does it matter in what order you use filter() and arrange() if you’re using both? Why/why not? Think about the results and how much work the functions would have to do.

    Answer

    Your text answer here.
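
One way to explore exercise 6 is to run both orderings on a small table and compare them. This sketch uses a hypothetical six-row tibble (the `toy` data is an assumption, not part of flights). It only checks that the rows come out in the same order; it does not measure how much work each ordering does.

```r
library(dplyr)

# Hypothetical toy data with an id so row order can be compared.
toy <- tibble(
  id        = 1:6,
  dep_delay = c(12, -4, 95, 30, -1, 61)
)

# Filter first: arrange() only has to sort the rows that survive.
filter_first <- toy |>
  filter(dep_delay > 0) |>
  arrange(desc(dep_delay))

# Arrange first: every row is sorted, then some are thrown away.
arrange_first <- toy |>
  arrange(desc(dep_delay)) |>
  filter(dep_delay > 0)

# Compare the resulting row order.
same_rows <- identical(filter_first$id, arrange_first$id)
```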

  7. Compare dep_time, sched_dep_time, and dep_delay. How would you expect those three numbers to be related?

    Answer
    # Your code here

    Your text answer here.

  8. Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

    Answer
    # Your code here
  9. What happens if you specify the name of the same variable multiple times in a select() call?

    Answer
    # Your code here

    Your text answer here.
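
A quick experiment for exercise 9, sketched on a hypothetical one-row tibble rather than on flights: name the same column twice and inspect the column names of the result.

```r
library(dplyr)

# Hypothetical toy data with three columns.
toy <- tibble(year = 2013L, month = 1L, day = 1L)

# `year` is named twice in the same select() call.
repeated <- toy |>
  select(year, month, year)

# Inspect which columns came back, and in what order.
names(repeated)
```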

  10. What does the any_of() function do? Why might it be helpful in conjunction with this vector?

    variables <- c("year", "month", "day", "dep_delay", "arr_delay")
    # Try below first
    flights |> 
      select(variables)
    Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
    ℹ Please use `all_of()` or `any_of()` instead.
      # Was:
      data %>% select(variables)
    
      # Now:
      data %>% select(all_of(variables))
    
    See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
    # A tibble: 336,776 × 5
        year month   day dep_delay arr_delay
       <int> <int> <int>     <dbl>     <dbl>
     1  2013     1     1         2        11
     2  2013     1     1         4        20
     3  2013     1     1         2        33
     4  2013     1     1        -1       -18
     5  2013     1     1        -6       -25
     6  2013     1     1        -4        12
     7  2013     1     1        -5        19
     8  2013     1     1        -3       -14
     9  2013     1     1        -3        -8
    10  2013     1     1        -2         8
    # ℹ 336,766 more rows
    # Or
    flights |> 
      select(any_of(variables))
    # A tibble: 336,776 × 5
        year month   day dep_delay arr_delay
       <int> <int> <int>     <dbl>     <dbl>
     1  2013     1     1         2        11
     2  2013     1     1         4        20
     3  2013     1     1         2        33
     4  2013     1     1        -1       -18
     5  2013     1     1        -6       -25
     6  2013     1     1        -4        12
     7  2013     1     1        -5        19
     8  2013     1     1        -3       -14
     9  2013     1     1        -3        -8
    10  2013     1     1        -2         8
    # ℹ 336,766 more rows
    Answer

    Your text answer here.
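
The difference between any_of() and all_of() only shows up when the vector names a column that is missing from the data, which is not the case with flights. The sketch below assumes a hypothetical tibble that lacks `day` and `arr_delay`. It wraps the all_of() call in tryCatch() so the error message can be captured instead of stopping the script.

```r
library(dplyr)

# Hypothetical toy data: note that `day` and `arr_delay` are absent.
toy <- tibble(year = 2013L, month = 1L, dep_delay = 2)

wanted <- c("year", "month", "day", "dep_delay", "arr_delay")

# any_of() keeps the names that exist and silently skips the rest.
lenient <- toy |>
  select(any_of(wanted))

# all_of() insists that every name exists, so this call errors;
# tryCatch() stores the message as a string.
strict_error <- tryCatch(
  toy |> select(all_of(wanted)),
  error = function(e) conditionMessage(e)
)
```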

  11. Does the result of running the following code surprise you? How do the select helpers deal with upper and lower case by default? How can you change that default?

    flights |> 
      select(contains("TIME"))
    Answer
    flights |> 
      select(contains("TIME"))
    # A tibble: 336,776 × 6
       dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
          <int>          <int>    <int>          <int>    <dbl> <dttm>             
     1      517            515      830            819      227 2013-01-01 05:00:00
     2      533            529      850            830      227 2013-01-01 05:00:00
     3      542            540      923            850      160 2013-01-01 05:00:00
     4      544            545     1004           1022      183 2013-01-01 05:00:00
     5      554            600      812            837      116 2013-01-01 06:00:00
     6      554            558      740            728      150 2013-01-01 05:00:00
     7      555            600      913            854      158 2013-01-01 06:00:00
     8      557            600      709            723       53 2013-01-01 06:00:00
     9      557            600      838            846      140 2013-01-01 06:00:00
    10      558            600      753            745      138 2013-01-01 06:00:00
    # ℹ 336,766 more rows

    Your text answer here.
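
The selection helpers such as contains() take an `ignore.case` argument. A minimal sketch on a hypothetical three-column tibble compares the default with `ignore.case = FALSE`:

```r
library(dplyr)

# Hypothetical toy data; two column names contain "time" in lower case.
toy <- tibble(dep_time = 517L, air_time = 227, carrier = "UA")

# Default behaviour: the match ignores case.
default_match <- toy |>
  select(contains("TIME"))

# With ignore.case = FALSE, "TIME" no longer matches "time".
case_sensitive <- toy |>
  select(contains("TIME", ignore.case = FALSE))
```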

  12. Rename air_time to air_time_min to indicate units of measurement and move it to the beginning of the data frame.

    Answer
    # Your code here
  13. Why doesn’t the following work, and what does the error mean?

    flights |> 
      select(tailnum) |> 
      arrange(arr_delay)
    Error in `arrange()`:
    ℹ In argument: `..1 = arr_delay`.
    Caused by error:
    ! object 'arr_delay' not found
    flights |> 
      select(tailnum)
    # A tibble: 336,776 × 1
       tailnum
       <chr>  
     1 N14228 
     2 N24211 
     3 N619AA 
     4 N804JB 
     5 N668DN 
     6 N39463 
     7 N516JB 
     8 N829AS 
     9 N593JB 
    10 N3ALAA 
    # ℹ 336,766 more rows
    Answer

    Your text answer here.
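
Each verb in a pipeline only sees the columns that the previous step returned. One possible rearrangement, sketched on a hypothetical three-row tibble, sorts while `arr_delay` is still present and narrows the columns afterwards:

```r
library(dplyr)

# Hypothetical toy data with the two columns used in the exercise.
toy <- tibble(
  tailnum   = c("N14228", "N24211", "N619AA"),
  arr_delay = c(11, 20, -18)
)

# Sort first, while arr_delay still exists, then keep only tailnum.
sorted_tailnums <- toy |>
  arrange(arr_delay) |>
  select(tailnum)
```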

  14. Which carrier has the worst average delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about flights |> group_by(carrier, dest) |> summarize(n()))

    Answer
    # Your code here

    Your text answer here.

  15. Find the flights that are most delayed upon departure from each destination.

    Answer
    # Your code here

    Your text answer here.

  16. How do delays vary over the course of the day? Illustrate your answer with a plot.

    Answer
    # Your code here

    Your text answer here.

  17. What happens if you supply a negative n to slice_min() and friends?

    Answer
    flights |> 
      slice_min(dep_delay, n = -5) |>
      relocate(dep_delay)
    # A tibble: 336,776 × 19
       dep_delay  year month   day dep_time sched_dep_time arr_time sched_arr_time
           <dbl> <int> <int> <int>    <int>          <int>    <int>          <int>
     1       -43  2013    12     7     2040           2123       40           2352
     2       -33  2013     2     3     2022           2055     2240           2338
     3       -32  2013    11    10     1408           1440     1549           1559
     4       -30  2013     1    11     1900           1930     2233           2243
     5       -27  2013     1    29     1703           1730     1947           1957
     6       -26  2013     8     9      729            755     1002            955
     7       -25  2013    10    23     1907           1932     2143           2143
     8       -25  2013     3    30     2030           2055     2213           2250
     9       -24  2013     3     2     1431           1455     1601           1631
    10       -24  2013     5     5      934            958     1225           1309
    # ℹ 336,766 more rows
    # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    #   hour <dbl>, minute <dbl>, time_hour <dttm>
    flights |> 
      slice_min(dep_delay, n = 5) |>
      relocate(dep_delay)
    # A tibble: 5 × 19
      dep_delay  year month   day dep_time sched_dep_time arr_time sched_arr_time
          <dbl> <int> <int> <int>    <int>          <int>    <int>          <int>
    1       -43  2013    12     7     2040           2123       40           2352
    2       -33  2013     2     3     2022           2055     2240           2338
    3       -32  2013    11    10     1408           1440     1549           1559
    4       -30  2013     1    11     1900           1930     2233           2243
    5       -27  2013     1    29     1703           1730     1947           1957
    # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    #   hour <dbl>, minute <dbl>, time_hour <dttm>
    flights |> 
      slice_max(dep_delay, n = -5) |>
      relocate(dep_delay)
    # A tibble: 336,776 × 19
       dep_delay  year month   day dep_time sched_dep_time arr_time sched_arr_time
           <dbl> <int> <int> <int>    <int>          <int>    <int>          <int>
     1      1301  2013     1     9      641            900     1242           1530
     2      1137  2013     6    15     1432           1935     1607           2120
     3      1126  2013     1    10     1121           1635     1239           1810
     4      1014  2013     9    20     1139           1845     1457           2210
     5      1005  2013     7    22      845           1600     1044           1815
     6       960  2013     4    10     1100           1900     1342           2211
     7       911  2013     3    17     2321            810      135           1020
     8       899  2013     6    27      959           1900     1236           2226
     9       898  2013     7    22     2257            759      121           1026
    10       896  2013    12     5      756           1700     1058           2020
    # ℹ 336,766 more rows
    # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    #   hour <dbl>, minute <dbl>, time_hour <dttm>
    flights |> 
      slice_max(dep_delay, n = 5) |>
      relocate(dep_delay)
    # A tibble: 5 × 19
      dep_delay  year month   day dep_time sched_dep_time arr_time sched_arr_time
          <dbl> <int> <int> <int>    <int>          <int>    <int>          <int>
    1      1301  2013     1     9      641            900     1242           1530
    2      1137  2013     6    15     1432           1935     1607           2120
    3      1126  2013     1    10     1121           1635     1239           1810
    4      1014  2013     9    20     1139           1845     1457           2210
    5      1005  2013     7    22      845           1600     1044           1815
    # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
    #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    #   hour <dbl>, minute <dbl>, time_hour <dttm>

    Your text answer here.

  18. Explain what count() does in terms of the dplyr verbs you just learned. What does the sort argument to count() do?

    Answer
    flights |> 
      count(origin, dest, sort = FALSE) # sort = FALSE by default
    # A tibble: 224 × 3
       origin dest      n
       <chr>  <chr> <int>
     1 EWR    ALB     439
     2 EWR    ANC       8
     3 EWR    ATL    5022
     4 EWR    AUS     968
     5 EWR    AVL     265
     6 EWR    BDL     443
     7 EWR    BNA    2336
     8 EWR    BOS    5327
     9 EWR    BQN     297
    10 EWR    BTV     931
    # ℹ 214 more rows
    flights |> 
      count(origin, dest, sort = TRUE)
    # A tibble: 224 × 3
       origin dest      n
       <chr>  <chr> <int>
     1 JFK    LAX   11262
     2 LGA    ATL   10263
     3 LGA    ORD    8857
     4 JFK    SFO    8204
     5 LGA    CLT    6168
     6 EWR    ORD    6100
     7 JFK    BOS    5898
     8 LGA    MIA    5781
     9 JFK    MCO    5464
    10 EWR    BOS    5327
    # ℹ 214 more rows

    Your text answer here.
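
One way to test an explanation of count() is to build the same table by hand from other verbs and compare. The sketch below does this on a hypothetical six-row tibble. The decomposition into group_by(), summarize(), and arrange() is offered as a plausible reading, not as a description of how count() is implemented internally.

```r
library(dplyr)

# Hypothetical toy data: origin/destination pairs with repeats.
toy <- tibble(
  origin = c("JFK", "JFK", "LGA", "JFK", "EWR", "LGA"),
  dest   = c("LAX", "LAX", "ATL", "SFO", "ORD", "ATL")
)

# Counting with count(), largest groups first.
with_count <- toy |>
  count(origin, dest, sort = TRUE)

# The same table assembled from verbs covered earlier in the chapter.
by_hand <- toy |>
  group_by(origin, dest) |>
  summarize(n = n(), .groups = "drop") |>
  arrange(desc(n))
```

Comparing `with_count` and `by_hand` column by column shows whether the two approaches agree on this data.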

  19. Suppose we have the following tiny data frame:

    df <- tibble(
      x = 1:5,
      y = c("a", "b", "a", "a", "b"),
      z = c("K", "K", "L", "L", "K")
    )
  a. Write down what you think the output will look like, then check if you were correct, and describe what group_by() does.

    df |>
      group_by(y)
    Answer
    df |>
      group_by(y)
    # A tibble: 5 × 3
    # Groups:   y [2]
          x y     z    
      <int> <chr> <chr>
    1     1 a     K    
    2     2 b     K    
    3     3 a     L    
    4     4 a     L    
    5     5 b     K    

    Your text answer here.

  b. Write down what you think the output will look like, then check if you were correct, and describe what arrange() does. Also comment on how it’s different from the group_by() in part (a).

    df |>
      arrange(y)
    Answer
    df |>
      arrange(y)
    # A tibble: 5 × 3
          x y     z    
      <int> <chr> <chr>
    1     1 a     K    
    2     3 a     L    
    3     4 a     L    
    4     2 b     K    
    5     5 b     K    

    Your text answer here.

  c. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does.

    df |>
      group_by(y) |>
      summarize(mean_x = mean(x))
    Answer
    df |>
      group_by(y) |>
      summarize(mean_x = mean(x))
    # A tibble: 2 × 2
      y     mean_x
      <chr>  <dbl>
    1 a       2.67
    2 b       3.5 

    Your text answer here.

  d. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. Then, comment on what the message says.

    df |>
      group_by(y, z) |>
      summarize(mean_x = mean(x))
    Answer
    df |>
      group_by(y, z) |>
      summarize(mean_x = mean(x))
    `summarise()` has grouped output by 'y'. You can override using the `.groups`
    argument.
    # A tibble: 3 × 3
    # Groups:   y [2]
      y     z     mean_x
      <chr> <chr>  <dbl>
    1 a     K        1  
    2 a     L        3.5
    3 b     K        3.5

    Your text answer here.

  e. Write down what you think the output will look like, then check if you were correct, and describe what the pipeline does. How is the output different from the one in part (d)?

    df |>
      group_by(y, z) |>
      summarize(mean_x = mean(x), .groups = "drop")
    Answer
    df |>
      group_by(y, z) |>
      summarize(mean_x = mean(x), .groups = "drop")
    # A tibble: 3 × 3
      y     z     mean_x
      <chr> <chr>  <dbl>
    1 a     K        1  
    2 a     L        3.5
    3 b     K        3.5

    Your text answer here.

  f. Write down what you think the outputs will look like, then check if you were correct, and describe what each pipeline does. How are the outputs of the two pipelines different?

    df |>
      group_by(y, z) |>
      summarize(mean_x = mean(x))
    
    df |>
      group_by(y, z) |>
      mutate(mean_x = mean(x))
    Answer
    df |>
      group_by(y, z) |>
      summarize(mean_x = mean(x))
    `summarise()` has grouped output by 'y'. You can override using the `.groups`
    argument.
    # A tibble: 3 × 3
    # Groups:   y [2]
      y     z     mean_x
      <chr> <chr>  <dbl>
    1 a     K        1  
    2 a     L        3.5
    3 b     K        3.5
    df |>
      group_by(y, z) |>
      mutate(mean_x = mean(x))
    # A tibble: 5 × 4
    # Groups:   y, z [3]
          x y     z     mean_x
      <int> <chr> <chr>  <dbl>
    1     1 a     K        1  
    2     2 b     K        3.5
    3     3 a     L        3.5
    4     4 a     L        3.5
    5     5 b     K        3.5

    Your text answer here.
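
The printed tibbles in parts (d) and (e) differ only in their `# Groups:` header. To check the grouping programmatically rather than by eye, one option is dplyr's group_vars(), which returns the names of the grouping columns. A minimal sketch using the same `df` as above:

```r
library(dplyr)

df <- tibble(
  x = 1:5,
  y = c("a", "b", "a", "a", "b"),
  z = c("K", "K", "L", "L", "K")
)

# Default behaviour: summarize() peels off the last grouping variable.
default_groups <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x)) |>
  group_vars()

# With .groups = "drop" the result has no grouping left.
dropped_groups <- df |>
  group_by(y, z) |>
  summarize(mean_x = mean(x), .groups = "drop") |>
  group_vars()
```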