
Working with strings using stringr

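Previous posts

R for Data Science Study Notes | Note 1: Introduction

R for Data Science Study Notes | Note 2: Data visualization with ggplot2 (Part 1)

R for Data Science Study Notes | Note 3: Data visualization with ggplot2 (Part 2)

R for Data Science Study Notes | Note 4: Data transformation with dplyr (Part 1)

R for Data Science Study Notes | Note 5: Data transformation with dplyr (Part 2)

R for Data Science Study Notes | Note 6: Simple data frames with tibble

R for Data Science Study Notes | Note 7: Data import with readr

R for Data Science Study Notes | Note 8: Relational data with dplyr

R for Data Science Study Notes | Note 9: Working with strings using stringr (Part 1)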

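This post is the second half of working with strings using stringr. Because it again relies on regular expressions, it may be a little difficult. The next post will cover regular expressions in more detail, which should help. If you find this useful, please give it a like!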

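9.4 Tools

Now that we have the basics of regular expressions, it's time to apply them to real problems. In this section we'll learn a range of stringr functions that let you:

Determine which strings match a pattern;
Find the positions of matches;
Extract the content of matches;
Replace matches with new values;
Split a string based on a match.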

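9.4.1 Detecting matches

To determine whether a character vector matches a pattern, use str_detect(). It returns a logical vector the same length as the input:

library(tidyverse)  # load the tidyverse first (it includes stringr)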
x <- c("apple", "banana", "pear")
str_detect(x, "e")
> [1] TRUE FALSE TRUE

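When a logical vector is used in a numeric context, FALSE becomes 0 and TRUE becomes 1. That makes sum() and mean() useful for summarizing matches across a large vector:

# How many common words start with t?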
sum(str_detect(words, "^t"))
> [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
> [1] 0.277

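When the logical conditions are complex (for example, match a or b but not c unless d), it's often easier to combine several str_detect() calls with logical operators than to build a single regular expression. For example, here are two ways to find all words that contain no vowels:

# Find all words that contain at least one vowel, then negate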
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words that consist only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
> [1] TRUE

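The first approach is clearly easier to understand. If a regular expression becomes too complicated, break it into smaller subexpressions, assign the result of each to a variable, and combine them with logical operations.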
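As a rough sketch of that advice, the snippet below splits one compound condition into three simple str_detect() calls and combines them with logical operators. The task itself (words that start with a vowel, end with a consonant, and contain no "x") and the variable names are made up for illustration; only words and str_detect() come from stringr.

```r
library(stringr)

# Hypothetical task: words that start with a vowel, end with a consonant,
# and do not contain the letter "x".
starts_with_vowel   <- str_detect(words, "^[aeiou]")
ends_with_consonant <- str_detect(words, "[^aeiou]$")
contains_x          <- str_detect(words, "x")

# Combine the three logical vectors instead of writing one large regex.
selected <- words[starts_with_vowel & ends_with_consonant & !contains_x]
head(selected)
```

Each intermediate vector can be inspected on its own, which makes a wrong subexpression much easier to spot than in a single long pattern.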

A common use of str_detect() is to select the elements that match a pattern. You can do this with logical subsetting, or with the convenient str_subset() wrapper:

words[str_detect(words, "x$")]
> [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
> [1] "box" "sex" "six" "tax"

Typically, however, the strings will be a column of a data frame, in which case we can use filter():

df <- tibble(
 word = words,
 i = seq_along(word) # seq_along(along.with): the integer sequence 1, 2, ... with the same length as the vector
)
df %>%
 filter(str_detect(word, "x$"))
> df %>% 
+   filter(str_detect(word, "x$"))
# A tibble: 4 x 2
  word      i
  <chr> <int>
1 box     108
2 sex     747
3 six     772
4 tax     841

A variation on str_detect() is str_count(): rather than a simple yes or no, it tells you how many matches there are in each string:

x <- c("apple", "banana", "pear")
str_count(x, "a")
> [1] 1 3 1

# On average, how many vowels are there per word?
mean(str_count(words, "[aeiou]"))
> [1] 1.99

str_count() also works naturally with mutate():

df %>%
 mutate(
 vowels = str_count(word, "[aeiou]"),
 consonants = str_count(word, "[^aeiou]")
 )
> df %>%
+   mutate(
+     vowels = str_count(word, "[aeiou]"),
+     consonants = str_count(word, "[^aeiou]")
+   )
# A tibble: 980 x 4
   word         i vowels consonants
   <chr>    <int>  <int>      <int>
 1 a            1      1          0
 2 able         2      2          2
 3 about        3      3          2
 4 absolute     4      4          4
 5 accept       5      2          4
 6 account      6      3          4
 7 achieve      7      4          3
 8 across       8      2          4
 9 act          9      1          2
10 active      10      3          3
# ... with 970 more rows

Note that matches never overlap. For example, how many times does the pattern "aba" match in "abababa"? Regular expressions say 2, not 3:

str_count("abababa", "aba")
> [1] 2
str_view_all("abababa", "aba")
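If overlapping occurrences are what you actually need, one possible workaround (not part of the original notes) is to count zero-width lookahead matches, which mark every position where the pattern could start without consuming any characters. This is a sketch that assumes the lookahead syntax (?=...) of the ICU engine used by stringr.

```r
library(stringr)

# Ordinary matching consumes characters, so matches cannot overlap.
non_overlapping <- str_count("abababa", "aba")

# A lookahead matches the empty string at each position where "aba" begins,
# so overlapping occurrences are counted as well.
overlapping <- str_count("abababa", "(?=aba)")

c(non_overlapping = non_overlapping, overlapping = overlapping)
```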

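9.4.2 Extracting matches

To extract the actual text of a match, use str_extract(). We'll use the Harvard sentences (listed on Wikipedia), a dataset designed to test VOIP systems that is also handy for practicing regular expressions. The dataset is available as stringr::sentences: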

length(sentences)
> [1] 720
head(sentences)
> head(sentences)
[1] "The birch canoe slid on the smooth planks." 
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."     
[4] "These days a chicken leg is a rare dish."   
[5] "Rice is often served in round bowls."       
[6] "The juice of lemons makes fine punch."

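Suppose we want to find all the sentences that contain a color. We first create a vector of color names and then turn it into a single regular expression: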

colors <- c(
 "red", "orange", "yellow", "green", "blue", "purple"
)
color_match <- str_c(colors, collapse = "|") # collapse: the separator used to join the elements
color_match
> [1] "red|orange|yellow|green|blue|purple"
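With the combined pattern in hand, a typical next step is to keep only the strings that match and then pull out the matched color. The sketch below does this on three made-up sentences (they are not taken from stringr::sentences) so the result is easy to check by eye.

```r
library(stringr)

colors <- c("red", "orange", "yellow", "green", "blue", "purple")
color_match <- str_c(colors, collapse = "|")

# Made-up sentences for illustration only.
demo <- c("The sky is blue.", "Nothing to see.", "A green and red flag.")

has_color <- str_subset(demo, color_match)      # keep sentences with a match
first_color <- str_extract(has_color, color_match)  # first match per sentence
first_color
```

Because the pattern is a plain alternation, it also matches inside longer words (for example "red" inside "flickered"); wrapping it in word boundaries (\\b) would be one way to tighten it.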

Note that str_extract() extracts only the first match. To see this more easily, we can first select all the sentences that have more than one match:

more <- sentences[str_count(sentences, color_match) > 1]
str_view_all(more, color_match)

[Figure 9.1: rendered str_view_all() output; image not included]

str_extract(more, color_match) # extracts only the first match
> [1] "blue" "green" "orange"

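This is a common pattern for stringr functions, because working with a single match allows a much simpler data structure. To get all the matches, use str_extract_all(), which returns a list: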

str_extract_all(more, color_match)
> str_extract_all(more, color_match)
[[1]]
[1] "blue" "red" 

[[2]]
[1] "green" "red"  

[[3]]
[1] "orange" "red"

If you set simplify = TRUE, str_extract_all() returns a matrix in which shorter sets of matches are padded with empty strings to the same length as the longest:

str_extract_all(more, color_match, simplify = TRUE)
> str_extract_all(more, color_match, simplify = TRUE)
     [,1]     [,2] 
[1,] "blue"   "red"
[2,] "green"  "red"
[3,] "orange" "red"

x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
> str_extract_all(x, "[a-z]", simplify = TRUE)
     [,1] [,2] [,3]
[1,] "a"  ""   ""  
[2,] "a"  "b"  ""  
[3,] "a"  "b"  "c"

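9.4.3 Grouped matches

Earlier in this chapter we discussed using parentheses to clarify precedence and to create groups that can be backreferenced while matching. You can also use parentheses to extract parts of a complex match. For example, suppose we want to extract nouns from the sentences. As a heuristic, we'll look for any word that comes after a or the. Defining a "word" in a regular expression is a little tricky, so here we use a simple approximation: a sequence of at least one character that isn't a space: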

noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
 str_subset(noun) %>%
 head(10)
> has_noun
 [1] "The birch canoe slid on the smooth planks."       
 [2] "Glue the sheet to the dark blue background."      
 [3] "It's easy to tell the depth of a well."           
 [4] "These days a chicken leg is a rare dish."         
 [5] "The box was thrown beside the parked truck."      
 [6] "The boy was there when the sun rose."             
 [7] "The source of the huge river is the clear spring."
 [8] "Kick the ball straight and follow through."       
 [9] "Help the woman get back to her feet."             
[10] "A pot of tea helps to pass the evening." 
has_noun %>%
 str_extract(noun)
> has_noun %>%
+   str_extract(noun)
 [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked"
 [6] "the sun"    "the huge"   "the ball"   "the woman"  "a helps"

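str_extract() gives us the complete match; str_match() gives each individual group. Instead of a character vector, str_match() returns a matrix with one column for the complete match followed by one column for each group: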

has_noun %>%
 str_match(noun)
> has_noun %>%
+   str_match(noun)
      [,1]         [,2]  [,3]     
 [1,] "the smooth" "the" "smooth" 
 [2,] "the sheet"  "the" "sheet"  
 [3,] "the depth"  "the" "depth"  
 [4,] "a chicken"  "a"   "chicken"
 [5,] "the parked" "the" "parked" 
 [6,] "the sun"    "the" "sun"    
 [7,] "the huge"   "the" "huge"   
 [8,] "the ball"   "the" "ball"   
 [9,] "the woman"  "the" "woman"  
[10,] "a helps"    "a"   "helps"

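(Unsurprisingly, this heuristic for detecting nouns performs poorly, and it also picks up adjectives such as smooth and parked.)

If your data is stored in a tibble, it's often easier to use tidyr::extract(). It works like str_match() but requires you to name each group, and the matches are then placed in new columns of the tibble: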

tibble(sentence = sentences) %>%
 tidyr::extract(
 sentence, c("article", "noun"), "(a|the) ([^ ]+)",
 remove = FALSE
 )
> tibble(sentence = sentences) %>%
+   tidyr::extract(
+     sentence, c("article", "noun"), "(a|the) ([^ ]+)",
+     remove = FALSE
+   )
# A tibble: 720 x 3
   sentence                                    article noun   
   <chr>                                       <chr>   <chr>  
 1 The birch canoe slid on the smooth planks.  the     smooth 
 2 Glue the sheet to the dark blue background. the     sheet  
 3 It's easy to tell the depth of a well.      the     depth  
 4 These days a chicken leg is a rare dish.    a       chicken
 5 Rice is often served in round bowls.        NA      NA     
 6 The juice of lemons makes fine punch.       NA      NA     
 7 The box was thrown beside the parked truck. the     parked 
 8 The hogs were fed chopped corn and garbage. NA      NA     
 9 Four hours of steady work faced us.         NA      NA     
10 Large size in stockings is hard to sell.    NA      NA

As with str_extract(), if you want all the matches for each string, you need str_match_all().
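A minimal sketch of str_match_all() on a single sentence, reusing the article-plus-word pattern from above. It returns a list with one matrix per input string, and each row of a matrix is one match.

```r
library(stringr)

noun <- "(a|the) ([^ ]+)"
x <- "Glue the sheet to the dark blue background."

# One list element per input string; take the matrix for the only string.
all_matches <- str_match_all(x, noun)[[1]]
all_matches
# Column 1 holds the complete match, columns 2 and 3 hold the groups;
# here there is one row for "the sheet" and one for "the dark".
```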

9.4.4 Replacing matches

str_replace() and str_replace_all() let you replace matches with new strings. The simplest use is to replace a match with a fixed string:

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-") # replace a vowel with "-", first match only
> [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-") # replace every match
> [1] "-ppl-" "p--r" "b-n-n-"

By supplying a named vector, str_replace_all() can perform multiple replacements at once:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
> [1] "one house" "two cars" "three people"

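Instead of replacing with a fixed string, you can use backreferences to insert components of the match. In the following code, we swap the order of the second and third words: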

> head(sentences,5)
[1] "The birch canoe slid on the smooth planks." 
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."     
[4] "These days a chicken leg is a rare dish."   
[5] "Rice is often served in round bowls." 

sentences %>%
 str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
 head(5)
> sentences %>%
+     str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
+     head(5)
[1] "The canoe birch slid on the smooth planks." 
[2] "Glue sheet the to the dark blue background."
[3] "It's to easy tell the depth of a well."     
[4] "These a days chicken leg is a rare dish."   
[5] "Rice often is served in round bowls."

9.4.5 Splitting

Use str_split() to split a string into pieces. For example, we can split sentences into words:

sentences %>%
 head(5) %>%
 str_split(" ")
> sentences %>%
+   head(5) %>%
+   str_split(" ")
[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
[8] "planks."

[[2]]
[1] "Glue"        "the"         "sheet"       "to"          "the"        
[6] "dark"        "blue"        "background."

[[3]]
[1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."

[[4]]
[1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
[8] "rare"    "dish."  

[[5]]
[1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

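Because each element of the character vector may contain a different number of pieces, str_split() returns a list. If you're splitting a length-1 vector, the easiest thing is to extract the first element of the list: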

"a|b|c|d" %>%
 str_split("\\|") %>%
 .[[1]]
> "a|b|c|d" %>%
+   str_split("\\|") %>%
+   .[[1]]
[1] "a" "b" "c" "d"

Otherwise, as with the other stringr functions that return a list, you can set simplify = TRUE to return a matrix:

sentences %>%
 head(5) %>%
 str_split(" ", simplify = TRUE)
> sentences %>%
+   head(5) %>%
+   str_split(" ", simplify = TRUE)
     [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]    
[1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth"
[2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"  
[3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"    
[4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"     
[5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls."
     [,8]          [,9]   
[1,] "planks."     ""     
[2,] "background." ""     
[3,] "a"           "well."
[4,] "rare"        "dish."
[5,] ""            ""

You can also set a maximum number of pieces:

fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
> fields %>% str_split(": ", n = 2, simplify = TRUE)
     [,1]      [,2]    
[1,] "Name"    "Hadley"
[2,] "Country" "NZ"    
[3,] "Age"     "35"
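The effect of n is easiest to see when the separator also occurs inside the value. In this small sketch (the input string is made up), n = 2 stops splitting after the first separator, so the remainder stays in one piece.

```r
library(stringr)

field <- "Time: 10: 30"

# Without n, every ": " is a split point.
split_all <- str_split(field, ": ")[[1]]

# With n = 2, only the first ": " is used; the rest is kept intact.
split_two <- str_split(field, ": ", n = 2)[[1]]

split_all   # "Time" "10" "30"
split_two   # "Time" "10: 30"
```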

Instead of splitting by a pattern, you can also split by character, line, sentence, and word boundaries using boundary():

x <- "This is a sentence. This is another sentence."
str_view_all(x, boundary("word"))

[Figure 9.2: rendered str_view_all() output; image not included]

str_split(x, " ")[[1]]
> str_split(x, " ")[[1]]
[1] "This"      "is"        "a"         "sentence." "This"      "is"       
[7] "another"   "sentence."
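For comparison, here is the same string split on word boundaries rather than on a literal space. Unlike the space-based split above, boundary("word") leaves the punctuation out of the resulting pieces.

```r
library(stringr)

x <- "This is a sentence. This is another sentence."

by_space <- str_split(x, " ")[[1]]
by_word  <- str_split(x, boundary("word"))[[1]]

by_word
# "This" "is" "a" "sentence" "This" "is" "another" "sentence"
```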

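9.4.6 Locating matches

str_locate() and str_locate_all() give you the start and end position of each match. They are particularly useful when no other function does exactly what you need. You can use str_locate() to find where the pattern matches and then str_sub() to extract or modify the matched text.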
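A minimal sketch of that locate-then-subset workflow. The input vectors are made up for illustration; str_locate() returns an integer matrix with start and end columns, with NA where there is no match.

```r
library(stringr)

x <- c("apple", "banana", "pear")

# Locate the first "an" in each string.
loc <- str_locate(x, "an")
loc

# Extract the matched text by position (NA where nothing matched).
extracted <- str_sub(x, loc[, "start"], loc[, "end"])
extracted   # NA "an" NA

# Modify the matched region in place using the assignment form of str_sub().
y <- "banana"
pos <- str_locate(y, "nan")
str_sub(y, pos[1, "start"], pos[1, "end"]) <- "NAN"
y   # "baNANa"
```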

9.5 Other types of pattern

When you use a string as a pattern, stringr automatically wraps it in a call to regex():

# The regular call:
str_view(fruit, "nana")
# is shorthand for
str_view(fruit, regex("nana"))

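You can use the other arguments of regex() to control the details of the match.

ignore_case = TRUE allows characters to match either their uppercase or lowercase forms. This always uses the current locale: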

bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")

[Figure 9.3: rendered str_view() output; image not included]

str_view(bananas, regex("banana", ignore_case = TRUE))

[Figure 9.4: rendered str_view() output; image not included]

multiline = TRUE allows ^ and $ to match the start and end of each line rather than the start and end of the complete string:

x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
> [1] "Line"
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
> [1] "Line" "Line" "Line"

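comments = TRUE lets you add comments and whitespace to make complex regular expressions easier to understand. Spaces are ignored, as is everything after #. To match a literal space, you need to escape it: "\\ ":

phone <- regex("
 \\(? # optional opening parenthesis
 (\\d{3}) # area code
 [) -]? # optional closing parenthesis, space, or dash
 (\\d{3}) # another three digits
 [ -]? # optional space or dash
 (\\d{3}) # three more digits
 ", comments = TRUE)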
str_match("514-791-8141", phone)
> str_match("514-791-8141", phone)
     [,1]          [,2]  [,3]  [,4] 
[1,] "514-791-814" "514" "791" "814"

dotall = TRUE allows . to match everything, including \n.
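A short sketch of the difference, using a made-up two-line string: by default . does not match a newline, so a pattern that spans the line break only succeeds with dotall = TRUE.

```r
library(stringr)

x <- "first\nsecond"

# By default, . stops at the newline, so there is no match.
default_match <- str_extract(x, "first.second")

# With dotall = TRUE, . also matches \n and the whole string is matched.
dotall_match <- str_extract(x, regex("first.second", dotall = TRUE))

is.na(default_match)   # TRUE
dotall_match == x      # TRUE
```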

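There are three other functions you can use instead of regex().

fixed() matches exactly the specified sequence of bytes. It ignores all special regular-expression characters and operates at a very low level. This lets you avoid complex escaping and can be much faster than regular expressions.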

install.packages("microbenchmark")
microbenchmark::microbenchmark(
 fixed = str_detect(sentences, fixed("the")),
 regex = str_detect(sentences, "the"),
 times = 20
)
> microbenchmark::microbenchmark(
+   fixed = str_detect(sentences, fixed("the")),
+   regex = str_detect(sentences, "the"),
+   times = 20
+ )
Unit: microseconds
  expr   min     lq    mean median     uq   max neval
 fixed  93.3  95.75 129.090  98.25 105.45 616.4    20
 regex 274.5 277.35 302.605 280.10 302.35 475.1    20

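Be careful when using fixed() with non-English data. It can cause problems because there are often several ways of representing the same character. For example, there are two ways to define á: as a single precomposed character, or as an a followed by a combining accent.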
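The two representations can be written with Unicode escapes to show the problem concretely. In this sketch, a1 and a2 are illustrative names; both strings render identically, but fixed() compares them byte by byte, while coll() applies collation rules that treat them as the same character.

```r
library(stringr)

a1 <- "\u00e1"    # á as a single precomposed character
a2 <- "a\u0301"   # "a" followed by a combining acute accent

a1 == a2                     # FALSE: the underlying code points differ

str_detect(a1, fixed(a2))    # FALSE: byte-wise comparison
str_detect(a1, coll(a2))     # TRUE: collation-aware comparison
```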

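coll() compares strings using standard collation rules. This is useful for case-insensitive matching. Note that coll() takes a locale parameter that controls which rules are used for comparing characters.

As we saw with str_split(), you can use boundary() to match boundaries. You can also use it with the other functions: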

x <- "This is a sentence."
str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
> str_extract_all(x, boundary("word"))
[[1]]
[1] "This"     "is"       "a"        "sentence"

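9.6 Other uses of regular expressions

Two commonly used base R functions also accept regular expressions.

apropos() searches all objects available from the global environment. This is useful when you can't quite remember the name of a function: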

> apropos("replace")
 [1] "%+replace%"                     ".rs.registerReplaceHook"       
 [3] ".rs.replaceBinding"             ".rs.rpc.replace_comment_header"
 [5] "replace"                        "replace_na"                    
 [7] "setReplaceMethod"               "str_replace"                   
 [9] "str_replace_all"                "str_replace_na"                
[11] "theme_replace"

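dir() lists all the files in a directory. Its pattern argument takes a regular expression and returns only the file names that match it. For example, the following code returns all the R Markdown files in the current directory: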

head(dir(pattern = "\\.Rmd$"))
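To try this without depending on what happens to be in the working directory, the sketch below creates a temporary directory with a few empty files (the file names are made up) and filters them with the same pattern. Note that the dot is escaped and the pattern is anchored with $, so only names that really end in .Rmd are returned.

```r
# Create a scratch directory with some empty example files.
tmp <- tempfile("rmd-demo")
dir.create(tmp)
file.create(file.path(tmp, c("report.Rmd", "notes.txt", "analysis.Rmd")))

# Only the file names matching the regular expression are returned.
rmd_files <- dir(tmp, pattern = "\\.Rmd$")
rmd_files

# Clean up the scratch directory.
unlink(tmp, recursive = TRUE)
```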

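9.7 stringi

stringr is built on top of stringi. stringr is easy to learn because it exposes a small, carefully chosen set of functions that handle the most common string manipulation tasks. stringi, by contrast, is designed to be comprehensive and contains almost every function you might need: stringi has 234 functions, while stringr has only 42. The main difference is the prefix: str_ versus stri_.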
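To make the prefix difference concrete, here is a small side-by-side sketch of two stringr calls and what are assumed to be their closest stringi counterparts. The stringi functions name the pattern engine explicitly (the _regex suffix), whereas stringr defaults to regular expressions.

```r
library(stringr)

x <- c("apple", "banana", "pear")

# Detecting a pattern: str_ prefix versus stri_ prefix.
detect_stringr <- str_detect(x, "e")
detect_stringi <- stringi::stri_detect_regex(x, "e")

# Replacing every match.
replace_stringr <- str_replace_all(x, "[aeiou]", "-")
replace_stringi <- stringi::stri_replace_all_regex(x, "[aeiou]", "-")

identical(detect_stringr, detect_stringi)
identical(replace_stringr, replace_stringi)
```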
