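These notes cover chapters 9, 10, and 11 of the English edition.
Chapter 9 is the Introduction; there is not much to say about it.
I am not at all familiar with the import material in chapter 11, so I have nothing to say about that either.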
10.2 Creating tibbles
- Convert a data frame with as_tibble
The more I use tibbles, the more I like them.
as_tibble(iris)
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> # … with 144 more rows
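One small catch: as_tibble drops the row names during conversion.
This seems to be a consistent tidyverse design choice: no row names. The same goes for reading files with readr, merging data sets with left_join, and so on.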
> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
> mtcars %>%
+ as_tibble()
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
In that case, you can turn the row names into a separate column:
> mtcars %>%
+ as_tibble(rownames = "myrowname")
# A tibble: 32 x 12
myrowname mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 Wag 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Sportab… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
You can also do it this way:
> mtcars %>%
+ as_tibble(rownames = NA) %>%
+ rownames_to_column(var = "myrowname")
# A tibble: 32 x 12
myrowname mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 Wag 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 Drive 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Sportab… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# … with 22 more rows
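As a rough sketch of the row-name handling above, the snippet below moves the row names of mtcars into a column and then back again. The column name "model" is only an illustrative choice, and converting with as.data.frame() before restoring the row names is my own cautious assumption rather than something the book prescribes.

```r
library(tibble)

# Keep the row names by moving them into a column ("model" is an arbitrary name)
tb <- as_tibble(mtcars, rownames = "model")

# Go back to a classic data frame and restore the row names from that column
df_back <- column_to_rownames(as.data.frame(tb), var = "model")

# A plain conversion drops the row names entirely
has_rownames(as_tibble(mtcars))

# The round trip recovers the original row names
identical(rownames(df_back), rownames(mtcars))
```

Keeping the names in an ordinary column means they survive joins and filters like any other variable.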
- Build a tibble from scratch with the tibble function
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
#> # A tibble: 5 x 3
#> x y z
#> <int> <dbl> <dbl>
#> 1 1 1 2
#> 2 2 1 5
#> 3 3 1 10
#> 4 4 1 17
#> 5 5 1 26
If you’re already familiar with data.frame(), note that tibble() does much less:
- it never changes the type of the inputs (e.g. it never converts strings to factors!), (as of R 4.0, data.frame() finally stopped converting strings to factors by default)
- it never changes the names of variables, (this presumably refers to cases such as a column named 1, which data.frame() renames to X1)
> data.frame(`1` = 1:5, `2` = 1:5)
  X1 X2
1  1  1
2  2  2
3  3  3
4  4  4
5  5  5
> tibble(`1` = 1:5, `2` = 1:5)
# A tibble: 5 x 2
    `1`   `2`
  <int> <int>
1     1     1
2     2     2
3     3     3
4     4     4
5     5     5
- and it never creates row names.
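To make the renaming point in the list above concrete, here is a small base R sketch. It relies on the check.names argument of data.frame(), which controls whether column names are run through make.names(); the data values are made up for illustration.

```r
# Default: names are checked and repaired, so `1` becomes X1
df_checked <- data.frame(`1` = 1:2, `2` = 3:4)

# With check.names = FALSE the names are left alone, as tibble() does
df_raw <- data.frame(`1` = 1:2, `2` = 3:4, check.names = FALSE)

names(df_checked)
names(df_raw)
```

So the renaming is a default of data.frame() rather than a hard limitation of base R.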
It’s possible for a tibble to have column names that are not valid R variable names, aka non-syntactic names. For example, they might not start with a letter, or they might contain unusual characters like a space. To refer to these variables, you need to surround them with backticks, `:
tb <- tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
tb
#> # A tibble: 1 x 3
#> `:)` ` ` `2000`
#> <chr> <chr> <chr>
#> 1 smile space number
- Create a tibble with tribble()
Another way to create a tibble is with tribble(), short for transposed tibble. tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in easy to read form.
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
#> # A tibble: 2 x 3
#> x y z
#> <chr> <dbl> <dbl>
#> 1 a 2 3.6
#> 2 b 1 8.5
I often add a comment (the line starting with #), to make it really clear where the header is.
tribble can also pull off the following neat trick:
# tribble will create a list column if the value in any cell is
# not a scalar
tribble(
  ~x, ~y,
  "a", 1:3,
  "b", 4:6
)
#> # A tibble: 2 x 2
#>   x     y
#>   <chr> <list>
#> 1 a     <int [3]>
#> 2 b     <int [3]>
The same trick, done with data.frame and with tibble:
> data.frame(x = c("a","b"),
+            y = I(list(1:3,4:6))) %>%
+   as_tibble()
# A tibble: 2 x 2
  x     y
  <fct> <I<list>>
1 a     <int [3]>
2 b     <int [3]>
> tibble(x = c("a","b"),
+        y = list(1:3,4:6))
# A tibble: 2 x 2
  x     y
  <chr> <list>
1 a     <int [3]>
2 b     <int [3]>
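Once a list column exists, the natural next question is how to get things back out of it. A minimal sketch, reusing the same toy values as the list-column examples here (the variable name tb is mine):

```r
library(tibble)

tb <- tribble(
  ~x,  ~y,
  "a", 1:3,
  "b", 4:6
)

# The column itself is an ordinary list, so [[ pulls out one cell
second_cell <- tb$y[[2]]
second_cell

# lengths() reports how many values each cell holds
cell_sizes <- lengths(tb$y)
cell_sizes
```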
This reminds me of an interesting package, usethis:
library(usethis)
# Want to print friendly output to a user in a package (or to yourself in your own code?)
# The usethis ui_*() functions are perfect!
# Use ui_done() when something is done, like a file saved
ui_done("File saved at...")
## ✔ File saved at...
# ui_todo() is useful when you need your user to pay attention and do something!
ui_todo("Changes have been made, please review them!")
## ● Changes have been made, please review them!
# ui_oops() when something went wrong
ui_oops("That should not have happened")
## x That should not have happened
10.3 Tibbles vs. data.frame
There are two main differences in the usage of a tibble vs. a classic data.frame: printing and subsetting.
- Printing
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from str():
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
#> # A tibble: 1,000 x 5
#> a b c d e
#> <dttm> <date> <int> <dbl> <chr>
#> 1 2020-01-15 20:43:23 2020-01-22 1 0.368 n
#> 2 2020-01-16 14:48:32 2020-01-27 2 0.612 l
#> 3 2020-01-16 09:12:12 2020-02-06 3 0.415 p
#> 4 2020-01-15 22:33:29 2020-02-05 4 0.212 m
#> 5 2020-01-15 18:57:45 2020-02-02 5 0.733 i
#> 6 2020-01-16 05:58:42 2020-01-29 6 0.460 n
#> # … with 994 more rows
First, you can explicitly print() the data frame and control the number of rows (n) and the width of the display. width = Inf will display all columns:
nycflights13::flights %>%
print(n = 10, width = Inf)
You can also control the default print behaviour by setting options:
- options(tibble.print_max = n, tibble.print_min = m): if more than n rows, print only m rows. Use options(tibble.print_min = Inf) to always show all rows.
- Use options(tibble.width = Inf) to always print all columns, regardless of the width of the screen.
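One practical wrinkle with the options listed above is that they are global. A hedged sketch of changing them only temporarily: options() returns the previous values, so they can be handed back afterwards. The specific numbers 3 and 5 are arbitrary.

```r
library(tibble)

# options() returns the old settings, which we keep for later
old <- options(tibble.print_min = 3, tibble.print_max = 5)

# iris has 150 rows, more than print_max, so only print_min rows are shown
print(as_tibble(iris))

# Restore whatever was set before
options(old)
```

For a one-off, print(n = ...) is simpler; the options route is more useful at the top of a script or in a startup file.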
- Subsetting
So far all the tools you’ve learned have worked with complete data frames. If you want to pull out a single variable, you need some new tools, $ and [[. [[ can extract by name or position; $ only extracts by name but is a little less typing.
df <- tibble(
x = runif(5),
y = rnorm(5)
)
# Extract by name
df$x
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
df[["x"]]
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
# Extract by position
df[[1]]
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
To use these in a pipe, you’ll need to use the special placeholder .:
df %>% .$x
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
df %>% .[["x"]]
#> [1] 0.7330 0.2344 0.6604 0.0329 0.4605
This works with a data.frame too:
df <- data.frame(
  x = runif(5),
  y = rnorm(5)
)
> df %>%
+   .$x
[1] 0.03347872 0.27371447 0.96202331 0.78821730 0.32745451
> df %>%
+   .[["x"]]
[1] 0.03347872 0.27371447 0.96202331 0.78821730 0.32745451
> df %>%
+   .[[1]]
[1] 0.03347872 0.27371447 0.96202331 0.78821730 0.32745451
# You can also call [[ directly as a function
df %>% "[["("x")
Compared to a data.frame, tibbles are more strict: they never do partial matching, and they will generate a warning if the column you are trying to access does not exist.
An example of partial matching:
df1 <- data.frame(xyz = "a")
df2 <- tibble(xyz = "a")
str(df1$x)
#> Factor w/ 1 level "a": 1
str(df2$x)
#> Warning: Unknown or uninitialised column: 'x'.
#> NULL
Another example of partial matching:
$ is a shorthand operator: x$y is roughly equivalent to x[["y"]]. It’s often used to access variables in a data frame, as in mtcars$cyl or diamonds$carat. One common mistake with $ is to use it when you have the name of a column stored in a variable:
var <- "cyl"
# Doesn't work - mtcars$var translated to mtcars[["var"]]
mtcars$var
#> NULL
# Instead use [[
mtcars[[var]]
#> [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
The one important difference between $ and [[ is that $ does (left-to-right) partial matching:
x <- list(abc = 1)
x$a
#> [1] 1
x[["a"]]
#> NULL
To help avoid this behaviour I highly recommend setting the global option warnPartialMatchDollar to TRUE:
options(warnPartialMatchDollar = TRUE)
x$a
#> Warning in x$a: partial match of 'a' to 'abc'
#> [1] 1
(For data frames, you can also avoid this problem by using tibbles, which never do partial matching.)
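As a small aside on the partial-matching behaviour above: [[ matches exactly by default, but base R lets you opt in to partial matching through its exact argument. The sketch below reuses the same toy list; the variable names strict and loose are mine.

```r
x <- list(abc = 1)

# Default: exact matching, so the short name finds nothing
strict <- x[["a"]]

# exact = FALSE asks [[ to partially match, like $ does
loose <- x[["a", exact = FALSE]]

is.null(strict)
loose
```

Being explicit like this at least makes the partial match visible in the code, unlike the silent behaviour of $.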
10.4 Interacting with older code
Some older functions don’t work with tibbles. If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data.frame:
class(as.data.frame(tb))
#> [1] "data.frame"
The main reason that some older functions don’t work with tibble is the [ function. We don’t use [ much in this book because dplyr::filter() and dplyr::select() allow you to solve the same problems with clearer code (but you will learn a little about it in vector subsetting). With base R data frames, [ sometimes returns a data frame, and sometimes returns a vector. With tibbles, [ always returns another tibble.
With a data.frame, selecting a single column with [] automatically converts the result to a vector:
df <- data.frame(
  x = runif(5),
  y = rnorm(5)
)
> df[, "x"]
[1] 0.02206585 0.98926964 0.95333742 0.79946273 0.19327569
> df[, c("x","y")]
           x           y
1 0.02206585 -1.32245311
2 0.98926964  0.59576966
3 0.95333742  0.03922984
4 0.79946273  1.09332833
5 0.19327569  0.88358188
# To prevent this behavior,
# add drop = FALSE
> df[, "x", drop = FALSE]
           x
1 0.55692342
2 0.06739173
3 0.08648150
4 0.84341912
5 0.93941534
With a tibble, however:
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)
> df[, "x"]
# A tibble: 5 x 1
      x
  <dbl>
1 0.422
2 0.519
3 0.881
4 0.114
5 0.956
> df[, c("x","y")]
# A tibble: 5 x 2
      x      y
  <dbl>  <dbl>
1 0.422 -1.03
2 0.519  0.605
3 0.881  0.414
4 0.114  0.820
5 0.956 -0.391
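If code has to accept both kinds of table, one defensive pattern is to be explicit about which shape you want. This is only a sketch: pick_column() is a hypothetical helper of my own, not something from the book or from tibble, and the data values are made up.

```r
library(tibble)

# Hypothetical helper: always returns a one-column table, never a vector
pick_column <- function(d, name) {
  d[, name, drop = FALSE]
}

df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
tb <- as_tibble(df)

# Both inputs now give back a data frame
is.data.frame(pick_column(df, "x"))
is.data.frame(pick_column(tb, "x"))

# When a plain vector is wanted, [[ behaves the same for both
identical(df[["x"]], tb[["x"]])
```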
- Exercise 10.1
How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame).
You can use the function is_tibble() to check whether a data frame is a tibble or not. The mtcars data frame is not a tibble.
is_tibble(mtcars)
#> [1] FALSE
But the diamonds and flights data are tibbles.
is_tibble(ggplot2::diamonds)
#> [1] TRUE
is_tibble(nycflights13::flights)
#> [1] TRUE
is_tibble(as_tibble(mtcars))
#> [1] TRUE
More generally, you can use the class() function to find out the class of an object. Tibbles have the classes c("tbl_df", "tbl", "data.frame"), while old data frames will only have the class "data.frame".
class(mtcars)
#> [1] "data.frame"
class(ggplot2::diamonds)
#> [1] "tbl_df" "tbl" "data.frame"
class(nycflights13::flights)
#> [1] "tbl_df" "tbl" "data.frame"
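Building on the class vectors shown above, base R's inherits() gives a quick yes/no check without needing is_tibble(). A short sketch; the variable names are mine.

```r
# "tbl_df" is the class that marks a tibble in the output of class()
plain_is_tbl <- inherits(mtcars, "tbl_df")
converted_is_tbl <- inherits(tibble::as_tibble(mtcars), "tbl_df")

plain_is_tbl
converted_is_tbl

# A tibble is still a data frame as far as base R is concerned
is.data.frame(tibble::as_tibble(mtcars))
```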
If you are interested in reading more on R’s classes, read the chapters on object oriented programming in Advanced R.
I gave up on Advanced R when I reached the object oriented programming chapters, but it really is extremely well written.
- Exercise 10.2
Compare and contrast the following operations on a data.frame and equivalent tibble. What is different? Why might the default data frame behaviors cause you frustration?
df <- data.frame(abc = 1, xyz = "a")
df$x
#> [1] a
#> Levels: a
df[, "xyz"]
#> [1] a
#> Levels: a
df[, c("abc", "xyz")]
#> abc xyz
#> 1 1 a
tbl <- as_tibble(df)
tbl$x
#> Warning: Unknown or uninitialised column: 'x'.
#> NULL
tbl[, "xyz"]
#> # A tibble: 1 x 1
#> xyz
#> <fct>
#> 1 a
tbl[, c("abc", "xyz")]
#> # A tibble: 1 x 2
#> abc xyz
#> <dbl> <fct>
#> 1 1 a
The $ operator will match any column name that starts with the name following it. Since there is a column named xyz, the expression df$x will be expanded to df$xyz. This behavior of the $ operator saves a few keystrokes, but it can result in accidentally using a different column than you thought you were using.
With data.frames, the type of object that [ returns depends on the number of columns. If it is one column, it won’t return a data.frame, but instead will return a vector. With more than one column, it will return a data.frame. This is fine if you know what you are passing in, but suppose you did df[ , vars] where vars was a variable. Then what that code does depends on length(vars) and you’d have to write code to account for those situations or risk bugs.
The above is the answer given in the solutions.
To sum up, the differences between the data.frame and tibble results above all come down to two data.frame behaviors: the $ operator does partial matching, and [ automatically drops a single selected column down to a one-dimensional vector.
- Exercise 10.3
If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble?
You can use the double bracket, like df[[var]]. You cannot use the dollar sign, because df$var would look for a column named var.
This may not be of much use in one-off code, but it should come in very handy when writing loops.
Let's try it:
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)
a <- "x"
> df[a]
# A tibble: 5 x 1
       x
   <dbl>
1 0.617
2 0.971
3 0.866
4 0.105
5 0.0429
> df[, a]
# A tibble: 5 x 1
       x
   <dbl>
1 0.617
2 0.971
3 0.866
4 0.105
5 0.0429
> df[[a]]
[1] 0.61743318 0.97105570 0.86600921 0.10470175 0.04291076
It seems to work with data.frame as well:
df <- data.frame(
  x = runif(5),
  y = rnorm(5)
)
a <- "x"
> df[a]
          x
1 0.5506573
2 0.2944493
3 0.7896432
4 0.6288798
5 0.6678818
> df[, a]
[1] 0.5506573 0.2944493 0.7896432 0.6288798 0.6678818
> df[[a]]
[1] 0.5506573 0.2944493 0.7896432 0.6288798 0.6678818
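Here is the loop use case mentioned above as a rough sketch: the column name lives in a variable, so [[ is the only one of the two operators that works. The data values and the name col_means are made up for illustration.

```r
library(tibble)

df <- tibble(
  x = c(1, 2, 3),
  y = c(10, 20, 30)
)

# Collect one mean per column, looking each column up by its name
col_means <- numeric(0)
for (nm in names(df)) {
  col_means[nm] <- mean(df[[nm]])
}

col_means
```

In practice sapply(df, mean) would do the same job in one line; the explicit loop is only there to show df[[nm]] at work.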
- Exercise 10.4
Practice referring to non-syntactic names in the following data frame by:
- Extracting the variable called 1.
- Plotting a scatterplot of 1 vs 2.
- Creating a new column called 3 which is 2 divided by 1.
- Renaming the columns to one, two and three.
annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)
To extract the variable named 1:
annoying[["1"]]
#> [1] 1 2 3 4 5 6 7 8 9 10
or
annoying$`1`
#> [1] 1 2 3 4 5 6 7 8 9 10
Plotting a scatterplot of 1 vs 2.
ggplot(annoying, aes(x = `1`, y = `2`)) +
geom_point()
To add a new column 3 which is 2 divided by 1:
mutate(annoying, `3` = `2` / `1`)
#> # A tibble: 10 x 3
#> `1` `2` `3`
#> <int> <dbl> <dbl>
#> 1 1 0.600 0.600
#> 2 2 4.26 2.13
#> 3 3 3.56 1.19
#> 4 4 7.99 2.00
#> 5 5 10.6 2.12
#> 6 6 13.1 2.19
#> # … with 4 more rows
or
annoying[["3"]] <- annoying$`2` / annoying$`1`
or
annoying[["3"]] <- annoying[["2"]] / annoying[["1"]]
To rename the columns to one, two, and three, run:
annoying <- rename(annoying, one = `1`, two = `2`, three = `3`)
glimpse(annoying)
#> Observations: 10
#> Variables: 3
#> $ one <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
#> $ two <dbl> 0.60, 4.26, 3.56, 7.99, 10.62, 13.15, 12.18, 15.75, 17.76,…
#> $ three <dbl> 0.60, 2.13, 1.19, 2.00, 2.12, 2.19, 1.74, 1.97, 1.97, 1.97
I picked up a new function here: glimpse (from the tibble package).
This is like a transposed version of print(): columns run down the page, and data runs across. This makes it possible to see every column in a data frame. It's a little like str() applied to a data frame but it tries to show you as much data as possible. (And it always shows the underlying data, even when applied to a remote data source.)
> glimpse(mtcars)
Observations: 32
Variables: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.…
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, …
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, …
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 1…
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.9…
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, …
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, …
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, …
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, …
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, …
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, …
- Exercise 10.5
What does tibble::enframe() do? When might you use it?
The function tibble::enframe() converts named vectors to a data frame with names and values.
enframe(c(a = 1, b = 2, c = 3))
#> # A tibble: 3 x 2
#> name value
#> <chr> <dbl>
#> 1 a 1
#> 2 b 2
#> 3 c 3
enframe also has a counterpart function, deframe.
From "Converting vectors to data frames, and vice versa":
enframe(1:3)
#> # A tibble: 3 x 2
#>    name value
#>   <int> <int>
#> 1     1     1
#> 2     2     2
#> 3     3     3
enframe(c(a = 5, b = 7))
#> # A tibble: 2 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 a         5
#> 2 b         7
# This should give much the same result as the tribble example above
enframe(list(one = 1, two = 2:3, three = 4:6))
#> # A tibble: 3 x 2
#>   name  value
#>   <chr> <list>
#> 1 one   <dbl [1]>
#> 2 two   <int [2]>
#> 3 three <int [3]>
tribble(
  ~name, ~value,
  "one", 1,
  "two", 2:3,
  "three", 4:6
)
# A tibble: 3 x 2
  name  value
  <chr> <list>
1 one   <dbl [1]>
2 two   <int [2]>
3 three <int [3]>
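Since deframe is mentioned above without an example, here is a minimal round-trip sketch. It assumes the default behaviour of deframe() on a two-column tibble, where the first column supplies the names and the second the values; the vector itself is a toy one.

```r
library(tibble)

v <- c(a = 5, b = 7)

# Named vector -> two-column tibble (name, value)
tb <- enframe(v)

# Two-column tibble -> named vector again
v_back <- deframe(tb)

identical(v, v_back)
```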