R_for_Data_Science_Workflow&transform

These notes cover chapters 4 and 5 of the English edition.

4 Workflow: basics

  • Parentheses are handy: wrapping an assignment in them prints the value as it is assigned, which is also helpful in R Markdown.

    y <- seq(1, 10, length.out = 5)
    y
    #> [1] 1.00 3.25 5.50 7.75 10.00
    
    (y <- seq(1, 10, length.out = 5))
    #> [1] 1.00 3.25 5.50 7.75 10.00
    
  • The shortcut Alt + Shift + K brings up the full list of RStudio keyboard shortcuts.

    One thing to note: some RStudio shortcuts do not work while an input method (IME) is active.

5.1 Introduction

  • This chapter covers five functions:

    • Pick observations by their values (filter()).

    • Reorder the rows (arrange()).

    • Pick variables by their names (select()).

    • Create new variables with functions of existing variables (mutate()).

    • Collapse many values down to a single summary (summarise()).

  • All verbs work similarly:

    1. The first argument is a data frame.

    2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

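      # Tab can autocomplete variable names (such as year here), which is convenient,
      # but it only seems to work in this piped form
      
      flights %>% 
        select(., year)
      
      # It does not seem to work in this form
      select(flights, year)
      
      # Also remember that variable names are written without quotes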
      filter(flights, month == 1, day == 1)
      
    3. The result is a new data frame.

  • The five functions can be combined with group_by(), which changes the scope of each function from operating on the entire dataset to operating on it group-by-group

Sometimes, though, you have to remember to ungroup().

Is ungroup() recommended after every group_by()?
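Why ungrouping matters can be seen on a toy table. This is only a sketch: the data frame df and the column names g, x and share are made up for illustration. The same mutate() call gives different answers depending on whether the grouping is still attached.

```r
library(dplyr)

# Hypothetical toy data: two groups with different totals
df <- tibble(g = c("a", "a", "b"), x = c(1, 2, 10))

# Still grouped: sum(x) is computed within each group
share_grouped <- df %>%
  group_by(g) %>%
  mutate(share = x / sum(x))

# Ungrouped first: sum(x) is computed over the whole table
share_overall <- df %>%
  group_by(g) %>%
  ungroup() %>%
  mutate(share = x / sum(x))

share_grouped$share
share_overall$share
```

If a later step is meant to run on the whole table, calling ungroup() first avoids silently getting per-group results.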

5.2 Filter rows with filter()

  • Floating-point pitfalls

    sqrt(2) ^ 2 == 2
    #> [1] FALSE
    1 / 49 * 49 == 1
    #> [1] FALSE
    

    This happens because computers work with finite precision. When you run into this, consider using near() instead of ==:

    near(sqrt(2) ^ 2, 2)
    #> [1] TRUE
    near(1 / 49 * 49, 1)
    #> [1] TRUE
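The idea behind a tolerance-based comparison can be sketched in base R. The helper is_near() below is hypothetical, written here only to illustrate the idea; it is not dplyr's own code, and the default tolerance is an assumption.

```r
# Hypothetical helper: treat two numbers as equal if they differ by less than tol
is_near <- function(x, y, tol = sqrt(.Machine$double.eps)) {
  abs(x - y) < tol
}

exact  <- sqrt(2)^2 == 2          # exact comparison fails
approx <- is_near(sqrt(2)^2, 2)   # tolerance-based comparison succeeds
scaled <- is_near(1 / 49 * 49, 1)

# The rounding error is tiny but not zero
err <- sqrt(2)^2 - 2
err
```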
  • Comparison operators and logical operators: remember these two English terms, they make it easier to search Google when you have a question.

  • The book's figure here is a good reference:

    [Figure: the complete set of Boolean operations on x and y]
  • Something interesting happens when you use | carelessly

    # The output below shows the rows with month == 1, not the month 11 or 12 rows we expected
    > head(filter(flights, month == (11 | 12)), n = 3)
    # A tibble: 3 x 19
       year month   day dep_time sched_dep_time dep_delay arr_time
      <int> <int> <int>    <int>          <int>     <dbl>    <int>
    1  2013     1     1      517            515         2      830
    2  2013     1     1      533            529         4      850
    3  2013     1     1      542            540         2      923
    # ... with 12 more variables: sched_arr_time <int>,
    #   arr_delay <dbl>, carrier <chr>, flight <int>,
    #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
    #   distance <dbl>, hour <dbl>, minute <dbl>,
    #   time_hour <dttm>
    
    # This is because, first,
    > 11 | 12
    [1] TRUE
    
    # and then TRUE becomes 1 when it is used in a numeric context,
    # so the condition turns into month == 1
    > TRUE == 1
    [1] TRUE
    
    
    > head(filter(flights, month == 1), n = 3)
    # A tibble: 3 x 19
       year month   day dep_time sched_dep_time dep_delay arr_time
      <int> <int> <int>    <int>          <int>     <dbl>    <int>
    1  2013     1     1      517            515         2      830
    2  2013     1     1      533            529         4      850
    3  2013     1     1      542            540         2      923
    # ... with 12 more variables: sched_arr_time <int>,
    #   arr_delay <dbl>, carrier <chr>, flight <int>,
    #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
    #   distance <dbl>, hour <dbl>, minute <dbl>,
    #   time_hour <dttm>
    
  • x %in% y is a very handy operator. This will select every row where x is one of the values in y.

    nov_dec <- filter(flights, month %in% c(11, 12))
    

    This makes the code more compact. With months 4, 11 and 12, for example, it is easier to write than a chain of ==:

    filter(flights, month %in% c(4, 11, 12))
    filter(flights, month == 4 |  month == 11 | month == 12)
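The three ways of writing the condition can be compared on a plain vector. This is a small base R sketch; the vector month is made up for illustration.

```r
month <- c(1, 4, 11, 12)

wrong   <- month == (11 | 12)          # (11 | 12) is TRUE, so this tests month == 1
spelled <- month == 11 | month == 12   # each comparison written out in full
compact <- month %in% c(11, 12)        # same result as spelled, but shorter

month[wrong]
month[compact]
```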
    
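  • Remember De Morgan’s law: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y.

    # The book describes these as flights that weren’t delayed (on arrival or departure) by more than two hours
    # The wording says "or" while the second form reads as "and": by De Morgan’s law the two filters are equivalent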
    filter(flights, !(arr_delay > 120 | dep_delay > 120))
    filter(flights, arr_delay <= 120, dep_delay <= 120)
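The equivalence can be checked mechanically, including for missing values. The sketch below uses a made-up grid of delays rather than the flights data, and relies on the fact that comma-separated conditions in filter() are combined with &.

```r
# Every combination of a small, a large, and a missing delay (in minutes)
delays <- expand.grid(arr_delay = c(30, 200, NA), dep_delay = c(30, 200, NA))

not_either <- with(delays, !(arr_delay > 120 | dep_delay > 120))
both_small <- with(delays, arr_delay <= 120 & dep_delay <= 120)

# TRUE: the two conditions agree row by row, NA included
identical(not_either, both_small)
```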
    
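  • We will meet && and || later; remember not to use them here. Inside filter() we should use | and &.

  • Missing values, i.e. NA (“not availables”). The NA scenario I can think of is the p-value results from analyses such as RNA-Seq.

    # NA is contagious
    # NA means the value exists but you do not know what it is, so the result of operating on NA is also unknown, i.e. NA
    # See the example below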
    NA > 5
    #> [1] NA
    10 == NA
    #> [1] NA
    NA + 10
    #> [1] NA
    NA / 2
    #> [1] NA
    
    ------------------------------------------------
    NA == NA
    #> [1] NA
    
    # It’s easiest to understand why this is true with a bit more context:
    # Let x be Mary's age. We don't know how old she is.
    x <- NA
    
    # Let y be John's age. We don't know how old he is.
    y <- NA
    
    # Are John and Mary the same age?
    x == y
    #> [1] NA
    # We don't know!
    --------------------------------------------------
    
    
    is.na(x)
    #> [1] TRUE
    
    # filter() only keeps rows where the condition is TRUE; rows where it is NA are dropped
    df <- tibble(x = c(1, NA, 3))
    filter(df, x > 1)
    #> # A tibble: 1 x 1
    #> x
    #> <dbl>
    #> 1 3
    
    # To keep the NA rows as well, ask for them explicitly
    filter(df, is.na(x) | x > 1)
    #> # A tibble: 2 x 1
    #> x
    #> <dbl>
    #> 1 NA
    #> 2 3
    

  • Exercise 5.2.2

    Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous challenges?

    About between():

    Description
    This is a shortcut for x >= left & x <= right, implemented efficiently in C++ for local values, and translated to the appropriate SQL for remote tables.

    Usage
    between(x, left, right)

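    # Previously we wrote
    filter(flights, month %in% 7:9)
    
    # Now we can also use between()
    filter(flights, between(month, 7, 9))
    
    # For this condition, filter() feels like a wrapper around the base R subsetting below
    flights[between(flights$month, 7, 9),]
    
    # With between() we can also check that -1 to 1 covers about 68.3% of the area under the standard normal distribution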
    > x <- rnorm(1e7)
    > length(x[between(x, -1, 1)]) / 1e7
    [1] 0.6827031
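The simulated proportion can be compared with the exact value from pnorm(). In this sketch in_range() is a hypothetical base R stand-in that follows the "x >= left & x <= right" description quoted above; it is not the dplyr implementation.

```r
# Hypothetical base R stand-in for between(): inclusive on both ends
in_range <- function(x, left, right) {
  x >= left & x <= right
}

in_range(c(6, 7, 9, 10), 7, 9)

# Exact area under the standard normal curve between -1 and 1
area <- pnorm(1) - pnorm(-1)
area
```

The simulation with 1e7 draws should land close to this value, with small differences from run to run.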
    
  • Exercise 5.2.3

    How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

    filter(flights, is.na(dep_time))
    #> # A tibble: 8,255 x 19
    #>    year month   day dep_time sched_dep_time dep_delay arr_time
    #>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
    #> 1  2013     1     1       NA           1630        NA       NA
    #> 2  2013     1     1       NA           1935        NA       NA
    #> 3  2013     1     1       NA           1500        NA       NA
    #> 4  2013     1     1       NA            600        NA       NA
    #> 5  2013     1     2       NA           1540        NA       NA
    #> 6  2013     1     2       NA           1620        NA       NA
    #> # … with 8,249 more rows, and 12 more variables: sched_arr_time <int>,
    #> #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
    #> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
    #> #   minute <dbl>, time_hour <dttm>
    
  • Exercise 5.2.4

    Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counter example!)

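# Any number raised to the power 0 is 1
# This can be handy when scaling: if all values are equal, the standard deviation (the denominator) is 0 and scaling cannot return a usable value; (x - mean(x)) / (sd(x) ^ 0) sidesteps this by dividing by 1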
> NA ^ 0
  [1] 1

# Because
# anything and FALSE is always FALSE.
# anything or TRUE is always TRUE
> NA | TRUE
[1] TRUE
> FALSE & NA
[1] FALSE

# These cases are different: we do not know what NA is, and the result changes with its value, so NA is returned
> NA | FALSE
[1] NA
> NA & TRUE
[1] NA


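# On this question I prefer the answer on Quora
# https://www.quora.com/In-R-why-is-NA*0-not-equal-to-0

# NA can stand for any value: it might be 0, or it might be NaN.
# x * 0 == 0 only holds when x is finite;
# when x is infinite the result is NaN, i.e. not a number,
# so here NA stands in for both possible results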
> NA * 0
[1] NA

> Inf * 0
[1] NaN
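The general rule (the result is NA only when it would depend on the unknown value) can be laid out as truth tables. This is a small base R sketch; the names vals, or_table and and_table are made up for illustration.

```r
vals <- c(TRUE, FALSE, NA)
names(vals) <- c("TRUE", "FALSE", "NA")

# Rows are the left operand, columns the right operand
or_table  <- outer(vals, vals, "|")
and_table <- outer(vals, vals, "&")

or_table
and_table
```

In or_table the NA row is only NA where the other operand is not TRUE; in and_table it is only NA where the other operand is not FALSE.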

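NaN and NULL need to be kept apart. NaN means "not a number", an undefined numeric result, whereas NULL is a special object indicating that nothing has been assigned.

"What NA, NaN and Inf mean in R"

"R beginner tutorial (12): the special values NA, Inf, NaN and NULL"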


5.3 Arrange rows with arrange()

  • When arrange() sorts, missing values go last

    df <- tibble(x = c(5, 2, NA))
    arrange(df, x)
    #> # A tibble: 3 x 1
    #> x
    #> <dbl>
    #> 1 2
    #> 2 5
    #> 3 NA
    arrange(df, desc(x))
    #> # A tibble: 3 x 1
    #> x
    #> <dbl>
    #> 1 5
    #> 2 2
    #> 3 NA
    
    

  • Exercise 5.3.1

    How could you use arrange() to sort all missing values to the start? (Hint: use is.na()).

    > (df <- tibble(x = c(5, 2, 8, NA, NA, NA),
    +               y = c(4, 3, 1, 4, 3, 8)))
    # A tibble: 6 x 2
          x     y
      <dbl> <dbl>
    1     5     4
    2     2     3
    3     8     1
    4    NA     4
    5    NA     3
    6    NA     8
    
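    # The default ordering always puts NA last
    # Also note that the y values 4, 3, 8 in the NA rows are not sorted:
    # y is not a sort key here, so tied rows simply keep their original order
    # (ties are only broken by further columns passed to arrange())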
    > arrange(df, x)
    # A tibble: 6 x 2
          x     y
      <dbl> <dbl>
    1     2     3
    2     5     4
    3     8     1
    4    NA     4
    5    NA     3
    6    NA     8
    
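    # After reading the solution it made sense
    # is.na() returns TRUE or FALSE, and TRUE > FALSE when the two are compared
    # Sorting with desc() (descending, largest first) therefore puts the NA rows first
    # Nothing else is sorted after is.na(), so the remaining rows keep their original order 5 -> 2 -> 8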
    > arrange(df, desc(is.na(x)))
    # A tibble: 6 x 2
          x     y
      <dbl> <dbl>
    1    NA     4
    2    NA     3
    3    NA     8
    4     5     4
    5     2     3
    6     8     1
    
    # To sort again after the TRUE/FALSE ordering, add x as a second key:
    # rows are ordered by is.na(x) first, and then by x
    > arrange(df, desc(is.na(x)),x)
    # A tibble: 6 x 2
          x     y
      <dbl> <dbl>
    1    NA     4
    2    NA     3
    3    NA     8
    4     2     3
    5     5     4
    6     8     1
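For comparison, roughly the same ordering can be sketched in base R with order(). This is an alternative to the arrange() call above, not what arrange() does internally, and the name na_first is made up for illustration.

```r
df <- data.frame(x = c(5, 2, 8, NA, NA, NA),
                 y = c(4, 3, 1, 4, 3, 8))

# !is.na(x) is FALSE for missing rows, and FALSE sorts before TRUE;
# the second key then orders the remaining rows by x
na_first <- df[order(!is.na(df$x), df$x), ]
na_first
```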
    
  • Exercise 5.3.3

    Sort flights to find the fastest flights.

    # You can also sort by the result of an arithmetic expression
    flights %>% 
      arrange(., desc(distance / air_time))
    

5.4 Select columns with select()

  • Selecting columns with select()

    # You can pick certain columns
    # or drop certain columns
    select(flights, -(year:day))
    #> # A tibble: 336,776 x 16
    #>   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
    #>      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
    #> 1      517            515         2      830            819        11 UA     
    #> 2      533            529         4      850            830        20 UA     
    #> 3      542            540         2      923            850        33 AA     
    #> 4      544            545        -1     1004           1022       -18 B6     
    #> 5      554            600        -6      812            837       -25 DL     
    #> 6      554            558        -4      740            728        12 UA     
    #> # … with 3.368e+05 more rows, and 9 more variables: flight <int>,
    #> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
    #> #   hour <dbl>, minute <dbl>, time_hour <dttm>
    
  • There are a number of helper functions you can use within select():

    • starts_with("abc"): matches names that begin with “abc”.

      > flights %>% 
      +   select(., - starts_with("ye"))
      # A tibble: 336,776 x 18
         month   day dep_time sched_dep_time dep_delay arr_time
         <int> <int>    <int>          <int>     <dbl>    <int>
       1     1     1      517            515         2      830
       2     1     1      533            529         4      850
       3     1     1      542            540         2      923
       4     1     1      544            545        -1     1004
       5     1     1      554            600        -6      812
       6     1     1      554            558        -4      740
       7     1     1      555            600        -5      913
       8     1     1      557            600        -3      709
       9     1     1      557            600        -3      838
      10     1     1      558            600        -2      753
      # ... with 336,766 more rows, and 12 more variables:
      #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
      #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
      #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
      #   time_hour <dttm>
      
    • ends_with("xyz"): matches names that end with “xyz”.

    • contains("ijk"): matches names that contain “ijk”.

    • matches("(.)\\1"): selects variables that match a regular expression. This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.

    • num_range("x", 1:3): matches x1, x2 and x3.

    See ?select for more details.

  • The difference between rename() and select()

    # rename() is much more convenient for renaming: every other column is kept
    > rename(as_tibble(iris), petal_length = Petal.Length)
    # A tibble: 150 x 5
       Sepal.Length Sepal.Width petal_length Petal.Width Species
              <dbl>       <dbl>        <dbl>       <dbl> <fct>  
     1          5.1         3.5          1.4         0.2 setosa 
     2          4.9         3            1.4         0.2 setosa 
     3          4.7         3.2          1.3         0.2 setosa 
     4          4.6         3.1          1.5         0.2 setosa 
     5          5           3.6          1.4         0.2 setosa 
     6          5.4         3.9          1.7         0.4 setosa 
     7          4.6         3.4          1.4         0.3 setosa 
     8          5           3.4          1.5         0.2 setosa 
     9          4.4         2.9          1.4         0.2 setosa 
    10          4.9         3.1          1.5         0.1 setosa 
    # ... with 140 more rows
    
    # select() keeps only the columns you name
    > select(as_tibble(iris), petal_length = Petal.Length)
    # A tibble: 150 x 1
       petal_length
              <dbl>
     1          1.4
     2          1.4
     3          1.3
     4          1.5
     5          1.4
     6          1.7
     7          1.4
     8          1.5
     9          1.4
    10          1.5
    # ... with 140 more rows
    
  • everything() is also very handy

    # It lets you move a few columns to the front
    select(flights, time_hour, air_time, everything())
    #> # A tibble: 336,776 x 19
    #>   time_hour           air_time  year month   day dep_time sched_dep_time
    #>   <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
    #> 1 2013-01-01 05:00:00      227  2013     1     1      517            515
    #> 2 2013-01-01 05:00:00      227  2013     1     1      533            529
    #> 3 2013-01-01 05:00:00      160  2013     1     1      542            540
    #> 4 2013-01-01 05:00:00      183  2013     1     1      544            545
    #> 5 2013-01-01 06:00:00      116  2013     1     1      554            600
    #> 6 2013-01-01 05:00:00      150  2013     1     1      554            558
    #> # … with 3.368e+05 more rows, and 12 more variables: dep_delay <dbl>,
    #> #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>,
    #> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, distance <dbl>,
    #> #   hour <dbl>, minute <dbl>
    

  • Exercise 5.4.1

    Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

    From the solution:

    # I was never quite clear on whether variable names need quotes,
    # but later I worked out a little of it
    select(flights, dep_time, dep_delay, arr_time, arr_delay)
    select(flights, "dep_time", "dep_delay", "arr_time", "arr_delay")
    
    select(flights, 4, 6, 7, 9)
    
    variables <- c("dep_time", "dep_delay", "arr_time", "arr_delay")
    select(flights, one_of(c("dep_time", "dep_delay", "arr_time", "arr_delay")))
    select(flights, one_of(variables))
    
    select(flights, starts_with("dep_"), starts_with("arr_"))
    
  • Exercise 5.4.2

    What happens if you include the name of a variable multiple times in a select() call?

    # This is especially useful together with everything():
    # a variable named more than once in select() is kept only once
    select(flights, year, month, day, year, year)
    #> # A tibble: 336,776 x 3
    #>    year month   day
    #>   <int> <int> <int>
    #> 1  2013     1     1
    #> 2  2013     1     1
    #> 3  2013     1     1
    #> 4  2013     1     1
    #> 5  2013     1     1
    #> 6  2013     1     1
    #> # … with 3.368e+05 more rows
    
    # arr_delay is redundant here: it is named explicitly and also matched by everything()
    select(flights, arr_delay, everything())
    #> # A tibble: 336,776 x 19
    #>   arr_delay  year month   day dep_time sched_dep_time dep_delay arr_time
    #>       <dbl> <int> <int> <int>    <int>          <int>     <dbl>    <int>
    #> 1        11  2013     1     1      517            515         2      830
    #> 2        20  2013     1     1      533            529         4      850
    #> 3        33  2013     1     1      542            540         2      923
    #> 4       -18  2013     1     1      544            545        -1     1004
    #> 5       -25  2013     1     1      554            600        -6      812
    #> 6        12  2013     1     1      554            558        -4      740
    #> # … with 3.368e+05 more rows, and 11 more variables: sched_arr_time <int>,
    #> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
    #> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
    #> #   time_hour <dttm>
    
  • Exercise 5.4.3

    What does the one_of() function do? Why might it be helpful in conjunction with this vector?

    The answer in the solution is well worth reading carefully.

    vars <- c("year", "month", "day", "dep_delay", "arr_delay")
    select(flights, one_of(vars))
    #> # A tibble: 336,776 x 5
    #>    year month   day dep_delay arr_delay
    #>   <int> <int> <int>     <dbl>     <dbl>
    #> 1  2013     1     1         2        11
    #> 2  2013     1     1         4        20
    #> 3  2013     1     1         2        33
    #> 4  2013     1     1        -1       -18
    #> 5  2013     1     1        -6       -25
    #> 6  2013     1     1        -4        12
    #> # … with 3.368e+05 more rows
    
    select(flights, vars)
    #> # A tibble: 336,776 x 5
    #>    year month   day dep_delay arr_delay
    #>   <int> <int> <int>     <dbl>     <dbl>
    #> 1  2013     1     1         2        11
    #> 2  2013     1     1         4        20
    #> 3  2013     1     1         2        33
    #> 4  2013     1     1        -1       -18
    #> 5  2013     1     1        -6       -25
    #> 6  2013     1     1        -4        12
    #> # … with 3.368e+05 more rows
    

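    If vars is the name of a column in flights, select() returns the column with that name; if it is not, select() looks up the value of vars and returns the columns it names.

    # For example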
    > year <- "month"
    > flights %>% 
    +   select(., year) %>% 
    +   head(., n = 2)
    # A tibble: 2 x 1
       year
      <int>
    1  2013
    2  2013
    
    > year_another <- "month"
    > flights %>% 
    +   select(., year_another) %>% 
    +   head(., n = 2)
    # A tibble: 2 x 1
      month
      <int>
    1     1
    2     1
    
    > year_second <- "month_another"
    > flights %>% 
    +   select(., year_second) %>% 
    +   head(., n = 2)
    Error: Unknown column `month_another` 
    Run `rlang::last_error()` to see where the error occurred.
    

    To remove this ambiguity, use !!!:

    > flights %>% 
    +   select(., !!!year) %>% 
    +   head(., n = 2)
    # A tibble: 2 x 1
      month
      <int>
    1     1
    2     1
    

    This behavior, which is used by many tidyverse functions, is an example of what is called non-standard evaluation (NSE) in R. See the dplyr vignette, Programming with dplyr, for more information on this topic.
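Another way to say "use the value of this variable" is the all_of() helper. The sketch below assumes a version of dplyr recent enough to export it, and uses a made-up one-row tibble instead of flights.

```r
library(dplyr)

df <- tibble(year = 2013L, month = 1L, day = 1L)

# A variable in the calling environment that shares its name with a column
year <- "month"

by_name  <- select(df, year)          # the column called year
by_value <- select(df, all_of(year))  # the column named by the string "month"

names(by_name)
names(by_value)
```

Wrapping the variable in all_of() makes the intent explicit, so the result no longer depends on whether a column happens to share the variable's name.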

  • Exercise 5.4.4

    Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?

    select(flights, contains("TIME"))
    #> # A tibble: 336,776 x 6
    #>   dep_time sched_dep_time arr_time sched_arr_time air_time
    #>      <int>          <int>    <int>          <int>    <dbl>
    #> 1      517            515      830            819      227
    #> 2      533            529      850            830      227
    #> 3      542            540      923            850      160
    #> 4      544            545     1004           1022      183
    #> 5      554            600      812            837      116
    #> 6      554            558      740            728      150
    #> # … with 3.368e+05 more rows, and 1 more variable: time_hour <dttm>
    

    This is because the select helpers ignore case by default; set ignore.case = FALSE to change that:

    select(flights, contains("TIME", ignore.case = FALSE))
    #> # A tibble: 336,776 x 0
    

5.5 Add new variables with mutate()

  • I like this feature of mutate():

    Note that you can refer to columns that you’ve just created:

    flights_sml <- select(flights, 
      year:day, 
      ends_with("delay"), 
      distance, 
      air_time
    )
    mutate(flights_sml,
      gain = dep_delay - arr_delay,
      speed = distance / air_time * 60
    )
    #> # A tibble: 336,776 x 9
    #>    year month   day dep_delay arr_delay distance air_time  gain speed
    #>   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
    #> 1  2013     1     1         2        11     1400      227    -9  370.
    #> 2  2013     1     1         4        20     1416      227   -16  374.
    #> 3  2013     1     1         2        33     1089      160   -31  408.
    #> 4  2013     1     1        -1       -18     1576      183    17  517.
    #> 5  2013     1     1        -6       -25      762      116    19  394.
    #> 6  2013     1     1        -4        12      719      150   -16  288.
    #> # … with 3.368e+05 more rows
    
    mutate(flights_sml,
      gain = dep_delay - arr_delay,
      hours = air_time / 60,
      gain_per_hour = gain / hours
    )
    #> # A tibble: 336,776 x 10
    #>    year month   day dep_delay arr_delay distance air_time  gain hours
    #>   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
    #> 1  2013     1     1         2        11     1400      227    -9  3.78
    #> 2  2013     1     1         4        20     1416      227   -16  3.78
    #> 3  2013     1     1         2        33     1089      160   -31  2.67
    #> 4  2013     1     1        -1       -18     1576      183    17  3.05
    #> 5  2013     1     1        -6       -25      762      116    19  1.93
    #> 6  2013     1     1        -4        12      719      150   -16  2.5 
    #> # … with 3.368e+05 more rows, and 1 more variable: gain_per_hour <dbl>
    
  • If you only want to keep the newly created variables, use transmute():

    transmute(flights,
      gain = dep_delay - arr_delay,
      hours = air_time / 60,
      gain_per_hour = gain / hours
    )
    #> # A tibble: 336,776 x 3
    #>    gain hours gain_per_hour
    #>   <dbl> <dbl>         <dbl>
    #> 1    -9  3.78         -2.38
    #> 2   -16  3.78         -4.23
    #> 3   -31  2.67        -11.6 
    #> 4    17  3.05          5.57
    #> 5    19  1.93          9.83
    #> 6   -16  2.5          -6.4 
    #> # … with 3.368e+05 more rows
    
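  • There are many functions you can combine with mutate() to create new variables. Their key property is that they are vectorised: you pass in a vector and get a vector back, and each output value corresponds to the input value at the same position.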

    • Arithmetic operators: +, -, *, /, ^. These are all vectorised, using the so called “recycling rules”. If one parameter is shorter than the other, it will be automatically extended to be the same length. This is most useful when one of the arguments is a single number: air_time / 60, hours * 60 + minute, etc.

      Arithmetic operators are also useful in conjunction with the aggregate functions you’ll learn about later. For example, x / sum(x) calculates the proportion of a total, and y - mean(y) computes the difference from the mean.

    • Modular arithmetic: %/% (integer division) and %% (remainder), where x == y * (x %/% y) + (x %% y). Modular arithmetic is a handy tool because it allows you to break integers up into pieces. For example, in the flights dataset, you can compute hour and minute from dep_time with:

      transmute(flights,
        dep_time,
        hour = dep_time %/% 100,
        minute = dep_time %% 100
      )
      #> # A tibble: 336,776 x 3
      #>   dep_time  hour minute
      #>      <int> <dbl>  <dbl>
      #> 1      517     5     17
      #> 2      533     5     33
      #> 3      542     5     42
      #> 4      544     5     44
      #> 5      554     5     54
      #> 6      554     5     54
      #> # … with 3.368e+05 more rows
      
    • Logs: log(), log2(), log10(). Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert multiplicative relationships to additive (products become sums, which maximum likelihood estimation probably makes use of), a feature we’ll come back to in modelling.

      All else being equal, I recommend using log2() because it’s easy to interpret: a difference of 1 on the log scale corresponds to doubling on the original scale and a difference of -1 corresponds to halving. (A common example is log2FoldChange.)
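A few lines of base R make the log2 interpretation concrete. The vector expr and the values a and b are made-up numbers for illustration.

```r
expr <- c(1, 2, 4, 8)       # hypothetical measurements, doubling at each step

steps <- diff(log2(expr))   # each doubling is a step of 1 on the log2 scale
steps

# log turns a product into a sum (compare with a tolerance, not ==)
a <- 3
b <- 7
sum_matches <- isTRUE(all.equal(log(a * b), log(a) + log(b)))

# log2 fold change: 8 versus 2 is a 4-fold increase, i.e. 2 on the log2 scale
fold <- log2(8 / 2)
fold
```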

    • Offsets: lead() and lag() allow you to refer to leading or lagging values (it feels like shifting the whole vector one position forward or backward). This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)). They are most useful in conjunction with group_by(), which you’ll learn about shortly.

      (x <- 1:10)
      #>  [1]  1  2  3  4  5  6  7  8  9 10
      lag(x)
      #>  [1] NA  1  2  3  4  5  6  7  8  9
      lead(x)
      #>  [1]  2  3  4  5  6  7  8  9 10 NA
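The two uses mentioned above, running differences and change detection, look like this on a short made-up vector. The calls are written as dplyr::lag() to avoid picking up stats::lag() by accident.

```r
library(dplyr)

x <- c(1, 3, 3, 6, 10)

running_diff <- x - dplyr::lag(x)    # change from the previous element
changed      <- x != dplyr::lag(x)   # TRUE where the value differs from the previous one

running_diff
changed
```

The first element of each result is NA because there is no previous value to compare with.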
      
    • Cumulative and rolling aggregates: R provides functions for running sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means. If you need rolling aggregates (i.e. a sum computed over a rolling window), try the RcppRoll package.

      x
      #>  [1]  1  2  3  4  5  6  7  8  9 10
      
      # cumulative sum
      cumsum(x)
      #>  [1]  1  3  6 10 15 21 28 36 45 55
      
      # cumulative mean
      cummean(x)
      #>  [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
      

      Using the RcppRoll package (I think RcppRoll should be very useful for some genomic data)

      install.packages("RcppRoll")
      library(RcppRoll)
      
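      # From the parameter documentation:
      # n  the size of the rolling-sum window
      # by how far the window moves each time
      
      # In other words, n = by is presumably the familiar bigWig-style compression or per-bin summarising of a genome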
      
      (x <- 1:10)
      [1]  1  2  3  4  5  6  7  8  9 10
      
      > roll_sum(x, n = 3, by = 3)
      [1]  6 15 24
      
      > roll_sum(x, n = 3, by = 2)
      [1]  6 12 18 24
      
      > roll_sum(x, n = 3, by = 1)
      [1]  6  9 12 15 18 21 24 27
      
      

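      This also reminds me of sliding windows versus fixed windows.
      My guess is that sliding windows take two parameters: bin (often used interchangeably with window; this is n in RcppRoll) and step (by in RcppRoll). Fixed windows take only one parameter, bin: there is no step, or equivalently step equals the window size.

      Put differently, sliding windows can overlap or be disjoint: they overlap when step < bin and are disjoint when step >= bin.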
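The bin and step idea can be sketched without any package. The function window_sum() below is a hypothetical base R helper written for illustration; it is not part of RcppRoll, and its arguments n and by merely mirror the names used above.

```r
# Hypothetical helper: sum over windows of size n, moving `by` positions each time
window_sum <- function(x, n, by = 1) {
  starts <- seq(1, length(x) - n + 1, by = by)
  vapply(starts, function(i) as.numeric(sum(x[i:(i + n - 1)])), numeric(1))
}

x <- 1:10
fixed   <- window_sum(x, n = 3, by = 3)  # disjoint bins (step == bin)
sliding <- window_sum(x, n = 3, by = 1)  # overlapping windows (step < bin)

fixed
sliding
```

With by equal to n every element is used at most once, which is the fixed-window case; with by smaller than n neighbouring windows share elements.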


  • Logical comparisons, <, <=, >, >=, !=, and ==, which you learned about earlier. If you’re doing a complex sequence of logical operations it’s often a good idea to store the interim values in new variables so you can check that each step is working as expected.

  • Ranking: there are a number of ranking functions, but you should start with min_rank(). It does the most usual type of ranking (e.g. 1st, 2nd, 2nd, 4th). The default gives smallest values the small ranks; use desc(x) to give the largest values the smallest ranks.

    y <- c(1, 2, 2, NA, 3, 4)
    min_rank(y)
    #> [1]  1  2  2 NA  4  5
    min_rank(desc(y))
    #> [1]  5  3  3 NA  2  1
    

    If min_rank() doesn’t do what you need, look at the variants row_number(), dense_rank(), percent_rank(), cume_dist(), ntile(). See their help pages for more details.

    row_number(y) 
    # row_number(): equivalent to rank(ties.method = "first")
    #> [1]  1  2  3 NA  4  5
    
    dense_rank(y) # dense_rank(): like min_rank(), but with no gaps between ranks
    #> [1]  1  2  2 NA  3  4
     
    percent_rank(y) # ranks rescaled to [0, 1] (vectorised)
    #> [1] 0.00 0.25 0.25   NA 0.75 1.00
    cume_dist(y) # cumulative distribution function (vectorised)
    #> [1] 0.2 0.6 0.6  NA 0.8 1.0
    

    The solution to Exercise 5.5.4 has an explanation I found quite good:

    rankme <- tibble(
      x = c(10, 5, 1, 5, 5)
    )
    
    rankme <- mutate(rankme,
      x_row_number = row_number(x),
      x_min_rank = min_rank(x),
      x_dense_rank = dense_rank(x)
    )
    arrange(rankme, x)
    #> # A tibble: 5 x 4
    #>       x x_row_number x_min_rank x_dense_rank
    #>   <dbl>        <int>      <int>        <int>
    #> 1     1            1          1            1
    #> 2     5            2          2            2
    #> 3     5            3          2            2
    #> 4     5            4          2            2
    #> 5    10            5          5            3
    

  • Exercise 5.5.1

    Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

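    I only understood this after reading the solution. The point is that the times in dep_time are convenient to read (517 means 5:17) but awkward to compute with (517 - 417 is 100, yet the gap is 60 minutes, not 100). So we should convert them to the number of minutes since midnight.

    # Take 1504 (15:04) as an example
    # The result, 904, is the number of minutes since midnight
    1504 %/% 100 * 60 + 1504 %% 100
    #> [1] 904
    
    # One small wrinkle: 1440 minutes after midnight is midnight again, one full cycle,
    # so it should count as 0 minutes since midnight rather than 1440
    # The final expression therefore becomes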
    (1504 %/% 100 * 60 + 1504 %% 100) %% 1440
    
    
    # Applying the conversion
    flights %>% 
      select(., dep_time, sched_dep_time) %>% 
      mutate(., dep_time_minus = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440, 
             sched_dep_time_minus = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440)
    # A tibble: 336,776 x 4
       dep_time sched_dep_time dep_time_minus sched_dep_time_minus
          <int>          <int>          <dbl>                <dbl>
     1      517            515            317                  315
     2      533            529            333                  329
     3      542            540            342                  340
     4      544            545            344                  345
     5      554            600            354                  360
     6      554            558            354                  358
     7      555            600            355                  360
     8      557            600            357                  360
     9      557            600            357                  360
    10      558            600            358                  360
    # ... with 336,766 more rows
    
    # It can also be written as a function
    time2mins <- function(x) {
      (x %/% 100 * 60 + x %% 100) %% 1440
    }
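A quick check of the helper on a few hand-picked values (the inputs are made up for illustration; the function body is repeated so the snippet runs on its own):

```r
time2mins <- function(x) {
  (x %/% 100 * 60 + x %% 100) %% 1440
}

mins <- time2mins(c(1504, 2400, 0))       # 15:04, midnight written as 2400, midnight written as 0
mins

gap_mins <- time2mins(517) - time2mins(417)   # a true one-hour gap
gap_raw  <- 517 - 417                         # misleading on the raw HHMM values
c(gap_mins, gap_raw)
```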
    
  • Exercise 5.5.6

    What trigonometric functions does R provide?

    The solution covers this in great detail; look it up there if you need it.


5.6 Grouped summaries with summarise()

  • It collapses a data frame to a single row

    summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups. Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”. For example, if we applied exactly the same code to a data frame grouped by date, we get the average delay per date:

    summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
    #> # A tibble: 1 x 1
    #>   delay
    #>   <dbl>
    #> 1  12.6
    
    
    # Remember na.rm: NA is contagious
    flights %>% 
      group_by(year, month, day) %>% 
      summarise(mean = mean(dep_delay))
    #> # A tibble: 365 x 4
    #> # Groups:   year, month [12]
    #>    year month   day  mean
    #>   <int> <int> <int> <dbl>
    #> 1  2013     1     1    NA
    #> 2  2013     1     2    NA
    #> 3  2013     1     3    NA
    #> 4  2013     1     4    NA
    #> 5  2013     1     5    NA
    #> 6  2013     1     6    NA
    #> # … with 359 more rows
    
    

    I sometimes think of grouping as giving the observations a hierarchy: in the example above they are split first by year, then by month, and finally by day.

  • Whenever you do any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))). That way you can check that you’re not drawing conclusions based on very small amounts of data.
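A minimal sketch of reporting both counts next to a mean. The tibble df and the column names are made up for illustration.

```r
library(dplyr)

df <- tibble(g = c("a", "a", "a", "b"), x = c(1, NA, 3, 10))

summary_tbl <- df %>%
  group_by(g) %>%
  summarise(
    n       = n(),                    # rows in the group
    n_valid = sum(!is.na(x)),         # rows with a non-missing x
    mean_x  = mean(x, na.rm = TRUE)   # mean over the non-missing values
  )
summary_tbl
```

Here the mean for group "b" rests on a single value, which the n and n_valid columns make visible.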

  • Just using means, counts, and sum can get you a long way, but R provides many other useful summary functions:

    • Measures of location: we’ve used mean(x), but median(x) is also useful. The mean is the sum divided by the length; the median is a value where 50% of x is above it, and 50% is below it.

      It’s sometimes useful to combine aggregation with logical subsetting. We haven’t talked about this sort of subsetting yet, but you’ll learn more about it in subsetting.

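      # not_cancelled, as defined in the book: flights with non-missing delays
      not_cancelled <- flights %>% 
        filter(!is.na(dep_delay), !is.na(arr_delay))

      not_cancelled %>% 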
        group_by(year, month, day) %>% 
        summarise(
          avg_delay1 = mean(arr_delay),
          avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
        )
      #> # A tibble: 365 x 5
      #> # Groups:   year, month [12]
      #>    year month   day avg_delay1 avg_delay2
      #>   <int> <int> <int>      <dbl>      <dbl>
      #> 1  2013     1     1      12.7        32.5
      #> 2  2013     1     2      12.7        32.0
      #> 3  2013     1     3       5.73       27.7
      #> 4  2013     1     4      -1.93       28.3
      #> 5  2013     1     5      -1.53       22.6
      #> 6  2013     1     6       4.24       24.4
      #> # … with 359 more rows
      
    • Measures of spread: sd(x), IQR(x) (the IQR is the upper quartile minus the lower quartile, i.e. the height of the box in a boxplot), mad(x). The root mean squared deviation, or standard deviation sd(x), is the standard measure of spread. The interquartile range IQR(x) and median absolute deviation mad(x) are robust equivalents that may be more useful if you have outliers.

      
      # Why is distance to some destinations more variable than to others?
      not_cancelled %>% 
        group_by(dest) %>% 
        summarise(distance_sd = sd(distance)) %>% 
        arrange(desc(distance_sd))
      #> # A tibble: 104 x 2
      #>   dest  distance_sd
      #>   <chr>       <dbl>
      #> 1 EGE         10.5 
      #> 2 SAN         10.4 
      #> 3 SFO         10.2 
      #> 4 HNL         10.0 
      #> 5 SEA          9.98
      #> 6 LAS          9.91
      #> # … with 98 more rows
      

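      What does mad() do? It returns the median absolute deviation: the median of |x - median(x)|, multiplied by a scaling constant (1.4826 by default) so that it is comparable to sd(x) for normally distributed data.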
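
To see how the three measures of spread react to an outlier, here is a small base-R sketch. The vector `x` is made up for illustration, and `mad_manual` is a hypothetical name for the hand-computed value.

```r
# Hypothetical sample with one extreme outlier
x <- c(1, 2, 3, 4, 100)

sd(x)    # inflated by the outlier
IQR(x)   # 75th percentile minus 25th percentile
mad(x)   # median absolute deviation, scaled by the default constant 1.4826

# The same value computed by hand
mad_manual <- 1.4826 * median(abs(x - median(x)))
mad_manual
```

The outlier dominates `sd(x)`, while `IQR(x)` and `mad(x)` stay close to the spread of the other four values.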

    • Measures of rank: min(x), quantile(x, 0.25), max(x). Quantiles are a generalisation of the median. For example, quantile(x, 0.25) will find a value of x that is greater than 25% of the values, and less than the remaining 75%.

      
      # When do the first and last flights leave each day?
      not_cancelled %>% 
        group_by(year, month, day) %>% 
        summarise(
          first = min(dep_time),
          last = max(dep_time)
        )
      #> # A tibble: 365 x 5
      #> # Groups:   year, month [12]
      #>    year month   day first  last
      #>   <int> <int> <int> <int> <int>
      #> 1  2013     1     1   517  2356
      #> 2  2013     1     2    42  2354
      #> 3  2013     1     3    32  2349
      #> 4  2013     1     4    25  2358
      #> 5  2013     1     5    14  2357
      #> 6  2013     1     6    16  2355
      #> # … with 359 more rows
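
A quick base-R sketch of how quantiles generalise the median, using a made-up vector `1:9` so the results are easy to check by eye. It relies on R's default interpolation method for `quantile()`.

```r
x <- 1:9

median(x)          # the middle value
quantile(x, 0.5)   # the 50% quantile is the median
quantile(x, 0.25)  # a value above roughly 25% of x and below the rest

# min() and max() are the 0% and 100% quantiles
q <- quantile(x, c(0, 0.25, 0.5, 0.75, 1))
q
```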
      
    • Measures of position: first(x), nth(x, 2), last(x). These work similarly to x[1], x[2], and x[length(x)] but let you set a default value if that position does not exist (i.e. you’re trying to get the 3rd element from a group that only has two elements). For example, we can find the first and last departure for each day:

      not_cancelled %>% 
        group_by(year, month, day) %>% 
        summarise(
          first_dep = first(dep_time), 
          last_dep = last(dep_time)
        )
      #> # A tibble: 365 x 5
      #> # Groups:   year, month [12]
      #>    year month   day first_dep last_dep
      #>   <int> <int> <int>     <int>    <int>
      #> 1  2013     1     1       517     2356
      #> 2  2013     1     2        42     2354
      #> 3  2013     1     3        32     2349
      #> 4  2013     1     4        25     2358
      #> 5  2013     1     5        14     2357
      #> 6  2013     1     6        16     2355
      #> # … with 359 more rows
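
The default-value behaviour mentioned above can be sketched on a plain vector. The vector `x` is a hypothetical group with only two departures, and the fallback `-1` is an arbitrary choice for illustration.

```r
library(dplyr)

x <- c(517, 2356)   # a hypothetical group with only two departures

first(x)                  # first element
last(x)                   # last element
nth(x, 3)                 # NA: position 3 does not exist
nth(x, 3, default = -1)   # explicit fallback instead of NA

# x[3] also gives NA, but offers no way to choose a different fallback
x[3]
```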
      

      These functions are complementary to filtering on ranks. Filtering gives you all variables, with each observation in a separate row:

      not_cancelled %>% 
        group_by(year, month, day) %>% 
        mutate(r = min_rank(desc(dep_time))) %>% 
        filter(r %in% range(r))
      #> # A tibble: 770 x 20
      #> # Groups:   year, month, day [365]
      #>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
      #>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
      #> 1  2013     1     1      517            515         2      830            819
      #> 2  2013     1     1     2356           2359        -3      425            437
      #> 3  2013     1     2       42           2359        43      518            442
      #> 4  2013     1     2     2354           2359        -5      413            437
      #> 5  2013     1     3       32           2359        33      504            442
      #> 6  2013     1     3     2349           2359       -10      434            445
      #> # … with 764 more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
      #> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
      #> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, r <int>
      
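      # The output above is hard to read because of the many columns
      # In effect it returns, for each day, the rows with the largest and the smallest dep_time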
      > not_cancelled %>%
      +   group_by(year, month, day) %>% 
      +   select(., year:day, dep_time) %>% 
      +   mutate(r = min_rank(desc(dep_time))) %>%
      +   filter(r %in% range(r))
      # A tibble: 770 x 5
      # Groups:   year, month, day [365]
          year month   day dep_time     r
         <int> <int> <int>    <int> <int>
       1  2013     1     1      517   831
       2  2013     1     1     2356     1
       3  2013     1     2       42   928
       4  2013     1     2     2354     1
       5  2013     1     3       32   900
       6  2013     1     3     2349     1
       7  2013     1     4       25   908
       8  2013     1     4     2358     1
       9  2013     1     4     2358     1
      10  2013     1     5       14   717
      # ... with 760 more rows
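
A toy version of the rank filter makes it easier to see what `r %in% range(r)` keeps, and why a day can contribute more than two rows. The `dep` tibble is made up, with a deliberate tie for the latest departure.

```r
library(dplyr)

# Hypothetical departure times for one day; the latest time is tied
dep <- tibble(dep_time = c(517, 2356, 900, 2356))

extremes <- dep %>%
  mutate(r = min_rank(desc(dep_time))) %>%  # rank 1 = latest departure
  filter(r %in% range(r))                   # keep the smallest and the largest rank

extremes
```

Both tied rows receive rank 1 and both are kept, which matches the two 2358 rows for 4 January in the output above.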
      
    • Counts: You’ve seen n(), which takes no arguments, and returns the size of the current group. To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x). (A handy function: roughly what sort -u | wc -l does on Linux, whereas sort | uniq -c is closer to count().)

      # Which destinations have the most carriers?
      not_cancelled %>% 
        group_by(dest) %>% 
        summarise(carriers = n_distinct(carrier)) %>% 
        arrange(desc(carriers))
      #> # A tibble: 104 x 2
      #>   dest  carriers
      #>   <chr>    <int>
      #> 1 ATL          7
      #> 2 BOS          7
      #> 3 CLT          7
      #> 4 ORD          7
      #> 5 TPA          7
      #> 6 AUS          6
      #> # … with 98 more rows
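
To contrast `n()` with `n_distinct()`, here is a small sketch on a made-up `flights_toy` tibble. The destinations and carriers are illustrative values only.

```r
library(dplyr)

flights_toy <- tibble(
  dest    = c("ATL", "ATL", "ATL", "BOS", "BOS"),
  carrier = c("DL", "DL", "FL", "B6", "B6")
)

per_dest <- flights_toy %>%
  group_by(dest) %>%
  summarise(
    flights  = n(),                 # rows per destination
    carriers = n_distinct(carrier)  # unique carriers per destination
  )

per_dest
```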
      

      Counts are so useful that dplyr provides a simple helper if all you want is a count:

      # Note that not_cancelled itself is not grouped
      > not_cancelled
      # A tibble: 327,346 x 19
      
      # count() does the group_by() + n() for you
      not_cancelled %>% 
        count(dest)
      #> # A tibble: 104 x 2
      #>   dest      n
      #>   <chr> <int>
      #> 1 ABQ     254
      #> 2 ACK     264
      #> 3 ALB     418
      #> 4 ANC       8
      #> 5 ATL   16837
      #> 6 AUS    2411
      #> # … with 98 more rows
      
      # I think of n() as being much like length()
      # From the solution to Exercise 5.6.2
      > not_cancelled %>%
      +   group_by(dest) %>%
      +   summarise(n = length(dest)) %>% 
      +   head(n = 3)
      # A tibble: 3 x 2
        dest      n
        <chr> <int>
      1 ABQ     254
      2 ACK     264
      3 ALB     418
      
      > not_cancelled %>%
      +   group_by(dest) %>%
      +   summarise(n = n()) %>% 
      +   head(n = 3)
      # A tibble: 3 x 2
        dest      n
        <chr> <int>
      1 ABQ     254
      2 ACK     264
      3 ALB     418
      
      

      You can optionally provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew:

      The earlier calls counted the rows in each group; here count() instead sums the variable you pass as wt within each group.

      
      # Count rows
      > not_cancelled %>% 
      +   count(tailnum) %>% 
      +   head(., n = 3)
      # A tibble: 3 x 2
        tailnum     n
        <chr>   <int>
      1 D942DN      4
      2 N0EGMQ    352
      3 N10156    145
      
      > not_cancelled %>% 
      +   group_by(tailnum) %>% 
      +   summarise(n = n()) %>% 
      +   head(n = 3)
      # A tibble: 3 x 2
        tailnum     n
        <chr>   <int>
      1 D942DN      4
      2 N0EGMQ    352
      3 N10156    145
      
      
      # Sum a variable
      > not_cancelled %>%
      +   count(tailnum, wt = distance) %>% 
      +   head(., n = 3)
      # A tibble: 3 x 2
        tailnum      n
        <chr>    <dbl>
      1 D942DN    3418
      2 N0EGMQ  239143
      3 N10156  109664
      
      > not_cancelled %>% 
      +   select(tailnum, distance) %>% 
      +   group_by(tailnum) %>%
      +   summarise(n = sum(distance)) %>% 
      +   head(n = 3)
      # A tibble: 3 x 2
        tailnum      n
        <chr>    <dbl>
      1 D942DN    3418
      2 N0EGMQ  239143
      3 N10156  109664
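
The equivalence between a weighted count and a grouped sum can be checked on a tiny made-up table. `planes_toy` and its values are hypothetical; the three pipelines are the ones discussed in these notes.

```r
library(dplyr)

planes_toy <- tibble(
  tailnum  = c("A1", "A1", "B2"),
  distance = c(100, 250, 400)
)

# Three ways to total the distance per plane
via_count     <- planes_toy %>% count(tailnum, wt = distance)
via_summarise <- planes_toy %>% group_by(tailnum) %>% summarise(n = sum(distance))
via_tally     <- planes_toy %>% group_by(tailnum) %>% tally(wt = distance)

via_count
```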
      
      
    • Counts and proportions of logical values: sum(x > 10), mean(y == 0). When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.

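    This trick is very handy. The simplest example:

    # The example below illustrates that the median of the standard
    # normal distribution is 0: about half of the draws are positive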
    > set.seed(19960203)
    > mean(rnorm(1E6) > 0)
    [1] 0.500266
    
    > set.seed(19960203)
    > sum(rnorm(1E6) > 0) / 1E6
    [1] 0.500266
    
    
    # How many flights left before 5am? (these usually indicate delayed
    # flights from the previous day)
    not_cancelled %>% 
      group_by(year, month, day) %>% 
      summarise(n_early = sum(dep_time < 500))
    #> # A tibble: 365 x 4
    #> # Groups:   year, month [12]
    #>    year month   day n_early
    #>   <int> <int> <int>   <int>
    #> 1  2013     1     1       0
    #> 2  2013     1     2       3
    #> 3  2013     1     3       4
    #> 4  2013     1     4       3
    #> 5  2013     1     5       3
    #> 6  2013     1     6       2
    #> # … with 359 more rows
    
    # What proportion of flights are delayed by more than an hour?
    not_cancelled %>% 
      group_by(year, month, day) %>% 
      summarise(hour_prop = mean(arr_delay > 60))
    #> # A tibble: 365 x 4
    #> # Groups:   year, month [12]
    #>    year month   day hour_prop
    #>   <int> <int> <int>     <dbl>
    #> 1  2013     1     1    0.0722
    #> 2  2013     1     2    0.0851
    #> 3  2013     1     3    0.0567
    #> 4  2013     1     4    0.0396
    #> 5  2013     1     5    0.0349
    #> 6  2013     1     6    0.0470
    #> # … with 359 more rows
    
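  • A useful feature: when you group by multiple variables, each summary peels off one level of the grouping. That makes it easy to progressively roll up a dataset: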

    daily <- group_by(flights, year, month, day)
    (per_day   <- summarise(daily, flights = n()))
    #> # A tibble: 365 x 4
    #> # Groups:   year, month [12]
    #>    year month   day flights
    #>   <int> <int> <int>   <int>
    #> 1  2013     1     1     842
    #> 2  2013     1     2     943
    #> 3  2013     1     3     914
    #> 4  2013     1     4     915
    #> 5  2013     1     5     720
    #> 6  2013     1     6     832
    #> # … with 359 more rows
    (per_month <- summarise(per_day, flights = sum(flights)))
    #> # A tibble: 12 x 3
    #> # Groups:   year [1]
    #>    year month flights
    #>   <int> <int>   <int>
    #> 1  2013     1   27004
    #> 2  2013     2   24951
    #> 3  2013     3   28834
    #> 4  2013     4   28330
    #> 5  2013     5   28796
    #> 6  2013     6   28243
    #> # … with 6 more rows
    (per_year  <- summarise(per_month, flights = sum(flights)))
    #> # A tibble: 1 x 2
    #>    year flights
    #>   <int>   <int>
    #> 1  2013  336776
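
The peeling of grouping levels can be inspected directly with `group_vars()`. This is a sketch on a made-up four-row tibble; it assumes dplyr 1.0.0 or later, where `summarise()` accepts `.groups`, and `"drop_last"` is written out only to make the default behaviour explicit.

```r
library(dplyr)

toy <- tibble(
  year  = 2013,
  month = c(1, 1, 1, 2),
  day   = c(1, 1, 2, 1)
)

daily     <- group_by(toy, year, month, day)
per_day   <- summarise(daily, flights = n(), .groups = "drop_last")
per_month <- summarise(per_day, flights = sum(flights), .groups = "drop_last")

group_vars(daily)      # year, month, day
group_vars(per_day)    # year, month: the day level has been peeled off
group_vars(per_month)  # year
```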
    

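Be careful when progressively rolling up summaries: sums and counts are fine, but weighted means and variances need more thought, and rank-based statistics such as the median cannot be rolled up this way at all. In other words, the sum of the groupwise sums is the overall sum, but the median of the groupwise medians is not the overall median. (This passage is from the Chinese edition.)

I take this to mean: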

# Summing everything gives the same result as splitting into groups and summing the group sums
# You can verify this by writing out the formula
> sum(1:8)
[1] 36
> sum(sum(1:4),sum(5:6),sum(7:8))
[1] 36

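# But the overall median differs from the median of the groupwise medians
# Note: the group results must be wrapped in c(); median(a, b, c) only uses a
> median(1:8)
[1] 4.5
> median(c(median(1:4), median(5:6), median(7:8)))
[1] 5.5

# The same holds for the mean when the groups have unequal sizes
> mean(1:8)
[1] 4.5
> mean(c(mean(1:4), mean(5:6), mean(7:8)))
[1] 5.166667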
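
For means, the roll-up works once each group mean is weighted by its group size. A base-R sketch, using the same split of `1:8` into groups of 4, 2 and 2; the variable names are hypothetical.

```r
x      <- 1:8
groups <- list(1:4, 5:6, 7:8)          # unequal group sizes

group_means <- sapply(groups, mean)    # 2.5 5.5 7.5
group_sizes <- lengths(groups)         # 4 2 2

mean(x)                                  # overall mean: 4.5
mean(group_means)                        # unweighted mean of group means: about 5.17
weighted.mean(group_means, group_sizes)  # weighting by size recovers 4.5
```

No comparable weighting exists for the median, which is why rank-based statistics cannot be rolled up exactly.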

  • Exercise 5.6.2

    Come up with another approach that will give you the same output as not_cancelled %>% count(dest) and not_cancelled %>% count(tailnum, wt = distance) (without using count()).

    not_cancelled <- flights %>%
      filter(!is.na(dep_delay), !is.na(arr_delay))
    
    > not_cancelled %>%
    +   count(dest) %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      dest      n
      <chr> <int>
    1 ABQ     254
    2 ACK     264
    3 ALB     418
    
    
    > not_cancelled %>%
    +   group_by(dest) %>%
    +   summarise(n = length(dest)) %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      dest      n
      <chr> <int>
    1 ABQ     254
    2 ACK     264
    3 ALB     418
    
    --------------------------------------------------------------------------------------------------
    
    # I think of group_by() plus summarise() as repeating the operation below once per group
    # Taking ABQ as an example
    > not_cancelled %>% 
    +   filter(., dest == "ABQ") %>% 
    +   nrow()
    [1] 254
    
    
    ------------------------------------------------------------------------------------------------------
    
    > not_cancelled %>%
    +   group_by(dest) %>%
    +   summarise(n = n()) %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      dest      n
      <chr> <int>
    1 ABQ     254
    2 ACK     264
    3 ALB     418
    
    # Another alternative to count() is to use the combination of the group_by() and tally() verbs. 
    # In fact, count() is effectively a short-cut for group_by() followed by tally().
    > not_cancelled %>%
    +   group_by(dest) %>%
    +   tally() %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      dest      n
      <chr> <int>
    1 ABQ     254
    2 ACK     264
    3 ALB     418
    
    > not_cancelled %>%
    +   count(tailnum, wt = distance) %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      tailnum      n
      <chr>    <dbl>
    1 D942DN    3418
    2 N0EGMQ  239143
    3 N10156  109664
    
    > not_cancelled %>%
    +   group_by(tailnum) %>% 
    +   summarise(n = sum(distance)) %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      tailnum      n
      <chr>    <dbl>
    1 D942DN    3418
    2 N0EGMQ  239143
    3 N10156  109664
    
    # Again, one group worked by hand
    > not_cancelled %>%
    +   filter(., tailnum == "D942DN") %>% 
    +   pull(distance) %>% 
    +   sum()
    [1] 3418
    
    # Like the previous example, we can also use the combination group_by() and tally(). 
    # Any arguments to tally() are summed.
    > not_cancelled %>%
    +   group_by(tailnum) %>%
    +   tally(distance) %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      tailnum      n
      <chr>    <dbl>
    1 D942DN    3418
    2 N0EGMQ  239143
    3 N10156  109664
    
  • Exercise 5.6.6

    What does the sort argument to count() do? When might you use it?

    > not_cancelled %>%
    +   count(dest, sort = T) %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      dest      n
      <chr> <int>
    1 ATL   16837
    2 ORD   16566
    3 LAX   16026
    
    # which is equivalent to
    
    > not_cancelled %>%
    +   count(dest) %>% 
    +   arrange(., desc(n)) %>% 
    +   head(n = 3)
    # A tibble: 3 x 2
      dest      n
      <chr> <int>
    1 ATL   16837
    2 ORD   16566
    3 LAX   16026
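
The same equivalence can be verified on a small made-up table. The `toy` tibble is hypothetical and has no tied counts, so both pipelines give one unambiguous ordering.

```r
library(dplyr)

toy <- tibble(dest = c("ATL", "LAX", "ATL", "ORD", "ATL", "ORD"))

sorted_arg     <- toy %>% count(dest, sort = TRUE)        # sort inside count()
sorted_arrange <- toy %>% count(dest) %>% arrange(desc(n)) # sort as a separate step

sorted_arg
```

`sort = TRUE` is mainly a convenience when you only care about the most frequent values.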
    

5.7 Grouped mutates (and filters)

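  • Grouping is most useful in conjunction with summarise(), but it can also be combined with mutate() and filter() for some very convenient operations. (This passage is from the Chinese edition.)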

    flights_sml <- select(flights, 
      year:day, 
      ends_with("delay"), 
      distance, 
      air_time
    )
    
    # Find the worst members of each group:
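    # desc(arr_delay) reverses the order, so rank 1 is the largest arr_delay
    # rank(...) < 10 keeps ranks below 10, i.e. roughly the nine most-delayed flights per day (use <= 10 for ten)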
    flights_sml %>% 
      group_by(year, month, day) %>%
      filter(rank(desc(arr_delay)) < 10) %>% 
      head(n = 3)
    # A tibble: 3 x 7
    # Groups:   year, month, day [1]
       year month   day dep_delay arr_delay distance air_time
      <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
    1  2013     1     1       853       851      184       41
    2  2013     1     1       290       338     1134      213
    3  2013     1     1       260       263      266       46
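
A grouped rank filter is easier to follow on a handful of rows. The sketch below uses a made-up `toy` tibble and swaps in `min_rank()` for `rank()`, so that ranks are always integers, with a threshold of 2 instead of 10.

```r
library(dplyr)

toy <- tibble(
  day       = c(1, 1, 1, 2, 2, 2),
  arr_delay = c(5, 50, 20, 7, 3, 90)
)

worst_two <- toy %>%
  group_by(day) %>%
  filter(min_rank(desc(arr_delay)) <= 2) %>%  # ranks are computed within each day
  ungroup()

worst_two
```

Rows keep their original order; the filter only decides which rows survive in each group.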
    
    
    # Find all groups bigger than a threshold:
    popular_dests <- flights %>% 
      group_by(dest) %>% 
      filter(n() > 365)
    
    # Standardise to compute per group metrics:
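    # This computes each flight's share of the total positive delay for its destination
    # Always remember: these operations are vectorised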
    popular_dests %>% 
      filter(arr_delay > 0) %>% 
      mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
      select(year:day, dest, arr_delay, prop_delay)
    #> # A tibble: 131,106 x 6
    #> # Groups:   dest [77]
    #>    year month   day dest  arr_delay prop_delay
    #>   <int> <int> <int> <chr>     <dbl>      <dbl>
    #> 1  2013     1     1 IAH          11  0.000111 
    #> 2  2013     1     1 IAH          20  0.000201 
    #> 3  2013     1     1 MIA          33  0.000235 
    #> 4  2013     1     1 ORD          12  0.0000424
    #> 5  2013     1     1 FLL          19  0.0000938
    #> 6  2013     1     1 ORD           8  0.0000283
    #> # … with 1.311e+05 more rows
    
    # Sorting by dest may make this easier to see
    > popular_dests %>% 
    +   filter(arr_delay > 0) %>% 
    +   mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
    +   select(year:day, dest, arr_delay, prop_delay) %>% 
    +   arrange(dest)
    # A tibble: 131,106 x 6
    # Groups:   dest [77]
        year month   day dest  arr_delay prop_delay
       <int> <int> <int> <chr>     <dbl>      <dbl>
     1  2013     1     1 ALB          40   0.00418 
     2  2013     1     1 ALB          44   0.00459 
     3  2013     1     2 ALB          71   0.00741 
     4  2013     1     2 ALB          82   0.00856 
     5  2013     1     3 ALB          40   0.00418 
     6  2013     1     4 ALB          30   0.00313 
     7  2013     1     6 ALB          95   0.00992 
     8  2013     1     6 ALB           4   0.000418
     9  2013     1     7 ALB          41   0.00428 
    10  2013     1    10 ALB         120   0.0125  
    # ... with 131,096 more rows
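
Both grouped operations, the size threshold and the per-group proportion, can be sketched together on a made-up table. The `toy` tibble and the threshold of 1 are illustrative stand-ins for `flights` and 365.

```r
library(dplyr)

toy <- tibble(
  dest      = c("ALB", "ALB", "ALB", "IAH", "IAH", "XYZ"),
  arr_delay = c(10, 30, 60, 20, 20, 99)
)

shares <- toy %>%
  group_by(dest) %>%
  filter(n() > 1) %>%                                  # drop destinations with a single flight
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%  # sum() is evaluated per group
  ungroup()

shares
```

Within each destination the `prop_delay` values add up to 1, which is a quick way to confirm that `sum()` ran per group rather than over the whole table.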
    