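1 Introduction

This section covers string manipulation in R. You will learn the basics of how strings work and how to create them by hand, but the main focus is regular expressions. Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regular expressions are a concise language for describing patterns in strings.

1.1 Loading packages

The string functions in this section come from the stringr package, which is part of the core tidyverse.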

library(tidyverse)

2 String basics

You can create strings with either single or double quotes. To include a double quote inside a string, wrap the string in single quotes:

string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'

If you forget to close a quote, the console shows +, the continuation prompt:

> "This is a string without a closing quote
+ 
+ 
+ HELP I'M STUCK

If this happens, press Escape (or Ctrl + C in a terminal session) to cancel, or type the closing quote to finish the string.

To include a literal single or double quote in a string, use \ to "escape" it:

double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"

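This means that to include a literal backslash you need to double it: "\\".

Note that the printed representation of a string is not the same as the string itself, because printing shows the escapes. To see the raw contents of the string, use writeLines():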

x <- c("\"", "\\")
x
#> [1] "\"" "\\"
writeLines(x)
#> "
#> \
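
To make the difference between the escaped form and the stored contents concrete, here is a minimal sketch (the variable names are illustrative, not from the original) that counts characters with str_length(). Each escape sequence occupies a single character in the string itself:

```r
library(stringr)

backslash <- "\\"      # typed with two backslashes, stored as one
quote_chr <- "\""      # an escaped double quote
two_lines <- "a\nb"    # "a", a newline, and "b"

str_length(backslash)
#> [1] 1
str_length(quote_chr)
#> [1] 1
str_length(two_lines)
#> [1] 3
```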

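There are a handful of other special characters. The most common are "\n", newline, and "\t", tab; the complete list is in the help page for quotes, which you can open with ?'"' or ?"'". You will also sometimes see strings like "\u00b5", which is a way of writing non-English characters that works on all platforms: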

x <- "\u00b5"
x
#> [1] "µ"

Multiple strings are often stored in a character vector, which you can create with c():

c("one", "two", "three")
#> [1] "one"   "two"   "three"

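2.1 String length

Base R contains many functions for working with strings, but we will use the functions from stringr. They have more intuitive names and all start with str_. For example, str_length() tells you the number of characters in a string: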

str_length(c("a", "R for data science", NA))
#> [1]  1 18 NA

The common str_ prefix is particularly useful in RStudio, because typing str_ triggers autocomplete and lists all the stringr functions.

2.2 Combining strings

To combine two or more strings, use str_c():

str_c("x", "y")
#> [1] "xy"
str_c("x", "y", "z")
#> [1] "xyz"

Use the sep argument to control how they are separated:

str_c("x", "y", sep = ", ")
#> [1] "x, y"

As with most other functions in R, missing values are contagious. If you want them to print as "NA", use str_replace_na():

x <- c("abc", NA)
str_c("|-", x, "-|")
#> [1] "|-abc-|" NA
str_c("|-", str_replace_na(x), "-|")
#> [1] "|-abc-|" "|-NA-|"

As shown above, str_c() is vectorised, and it automatically recycles shorter vectors to the same length as the longest:

str_c("prefix-", c("a", "b", "c"), "-suffix")
#> [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"

Objects of length 0 are silently dropped. This is particularly useful in conjunction with if:

name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE

str_c(
  "Good ", time_of_day, " ", name,
  if (birthday) " and HAPPY BIRTHDAY",
  "."
)
#> [1] "Good morning Hadley."

To collapse a vector of strings into a single string, use collapse:

str_c(c("x", "y", "z"), collapse = ", ")
#> [1] "x, y, z"
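
The sep and collapse arguments can also be combined. The sketch below, using made-up key and value vectors, first joins the arguments element by element with sep and then joins the resulting vector into one string with collapse:

```r
library(stringr)

keys <- c("a", "b")
values <- c("1", "2")

# sep joins the arguments element by element
pairs <- str_c(keys, values, sep = "=")
pairs
#> [1] "a=1" "b=2"

# collapse then joins the resulting vector into a single string
query <- str_c(keys, values, sep = "=", collapse = "&")
query
#> [1] "a=1&b=2"
```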

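2.3 Subsetting strings

You can extract parts of a string with str_sub(). As well as the string, str_sub() takes start and end arguments, which give the (inclusive) position of the substring: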

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
#> [1] "App" "Ban" "Pea"
# negative numbers count backwards from end
str_sub(x, -3, -1)
#> [1] "ple" "ana" "ear"

Note that str_sub() will not fail if the string is too short; it just returns as much as possible:

str_sub("a", 1, 5)
#> [1] "a"

You can also use the assignment form of str_sub() to modify strings:

str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
#> [1] "apple"  "banana" "pear"
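
As a further illustration, the sketch below (with an illustrative fruit_names vector) relies on the defaults of str_sub(), where end is the last character, to pull out the final letter, and then uses the assignment form to capitalise the first letter:

```r
library(stringr)

fruit_names <- c("apple", "banana", "pear")

# A negative start counts from the end; end defaults to the last character
last_letter <- str_sub(fruit_names, -1)
last_letter
#> [1] "e" "a" "r"

# Assignment form: replace the first character with its upper-case version
str_sub(fruit_names, 1, 1) <- str_to_upper(str_sub(fruit_names, 1, 1))
fruit_names
#> [1] "Apple"  "Banana" "Pear"
```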

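2.4 Locales

Above, str_to_lower() changed the text to lower case. You can also use str_to_upper() or str_to_title(). However, changing case is more complicated than it first appears, because different languages have different rules for changing case. You can pick which set of rules to use by specifying a locale: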

# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
#> [1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
#> [1] "İ" "I"

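The locale is specified as an ISO 639 language code, which is a two- or three-letter abbreviation. If you don't already know the code for your language, Wikipedia has a good list. Older versions of stringr fall back to the current locale provided by the operating system when the locale is left blank; recent versions default to "en".

Sorting is also affected by the locale. The base R order() and sort() functions sort strings using the current locale. If you want behaviour that is consistent across different computers, use str_sort() and str_order(), which take an additional locale argument: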

x <- c("apple", "eggplant", "banana")

str_sort(x, locale = "en")  # English
#> [1] "apple"    "banana"   "eggplant"

str_sort(x, locale = "haw") # Hawaiian
#> [1] "apple"    "eggplant" "banana"

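Other commonly used basic functions include str_length(), str_wrap(), and str_trim().

3 Matching patterns with regular expressions

Regular expressions are a very terse language for describing patterns in strings. They take a little while to get your head around, but once you understand them you will find them extremely useful.

To learn regular expressions, we will use str_view() and str_view_all(). These functions take a character vector and a regular expression, and show how they match. We will start with very simple regular expressions and gradually get more complicated. Once you have mastered pattern matching, you will learn how to apply it with various stringr functions.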

3.1 Basic matches

The simplest patterns match exact strings:

x <- c("apple", "banana", "pear")
str_view(x, "an")

The next step up in complexity is ., which matches any character (except a newline):

str_view(x, ".a.")

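To match a literal ., you need the regular expression \.. Unfortunately this creates a problem. We use strings to represent regular expressions, and \ is also the escape character in strings. So to create the regular expression \. we need the string "\\.".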

# To create the regular expression, we need \\
dot <- "\\."

# But the expression itself only contains one:
writeLines(dot)
#> \.

# And this tells R to look for an explicit .
str_view(c("abc", "a.c", "bef"), "a\\.c")

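If \ is used as an escape character in regular expressions, how do you match a literal \? First you need to escape it, creating the regular expression \\. To create that regular expression you need to use a string, which also needs to escape \. That means to match a literal \ you need to write "\\\\": four backslashes to match one!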

x <- "a\\b"
writeLines(x)
#> a\b

str_view(x, "\\\\")

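Because " is not a regular expression metacharacter, matching a literal double quote only needs the string-level escape:

x <- c("a\'b", 'c\"d', "e\\f")
writeLines(x)
#> a'b
#> c"d
#> e\f
str_view(x, "\"")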

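3.2 Anchors

By default, regular expressions match any part of a string. It is often useful to anchor the regular expression so that it matches from the start or end of the string. You can use:

^ to match the start of the string.
$ to match the end of the string.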

x <- c("apple", "banana", "pear")
str_view(x, "^a")

str_view(x, "a$")

To force a regular expression to match only a complete string, anchor it with both ^ and $:

x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")

str_view(x, "^apple$")

You can also match the boundary between words with \b. For example, search for \bsum\b to avoid matching summarise, summary, rowsum, and so on.
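
The word-boundary anchor can be checked with str_detect(). This is a small sketch with an illustrative candidates vector; note that \b has to be written as "\\b" inside an R string:

```r
library(stringr)

candidates <- c("sum", "summarise", "rowsum", "the sum of squares")

# Only elements where "sum" stands alone as a word match
whole_word <- str_detect(candidates, "\\bsum\\b")
whole_word
#> [1]  TRUE FALSE FALSE  TRUE
```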

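How would you match the literal string "$^$"? Each metacharacter has to be escaped, giving the pattern "\\$\\^\\$". The example below uses the same escapes together with anchors to match a whole string that starts and ends with $: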

x <- "$100^2999$"
str_view(x,"^\\$.*\\$$")

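3.3 Character classes and alternatives

There are a number of special patterns that match more than one character. You have already seen ., which matches any character apart from a newline. There are four other useful tools:

\d: matches any digit.
\s: matches any whitespace (e.g. space, tab, newline).
[abc]: matches a, b, or c.
[^abc]: matches anything except a, b, or c.

Remember, to create a regular expression containing \d or \s, you need to escape the \ for the string, so you type "\\d" or "\\s".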
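
The four tools above can be tried out on a small vector. The following is a minimal sketch with made-up input strings; the third element contains a tab character:

```r
library(stringr)

x <- c("room 101", "no digits", "tab\there")

has_digit <- str_detect(x, "\\d")   # does the string contain a digit?
has_digit
#> [1]  TRUE FALSE FALSE

has_space <- str_detect(x, "\\s")   # does it contain any whitespace?
has_space
#> [1] TRUE TRUE TRUE

# [^a-z] matches anything that is not a lower-case letter
letters_only <- str_replace_all(x, "[^a-z]", "")
letters_only
#> [1] "room"     "nodigits" "tabhere"
```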

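A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex.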

# Look for a literal character that normally has special meaning in a regex
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")

str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c")

str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]")

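This works for most (but not all) regex metacharacters: $ . | ? * + ( ) [ {. Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: ] \ ^ and -.

You can use alternation to pick between one or more alternative patterns. For example, abc|d..f will match either "abc" or "deaf". Note that the precedence of | is low, so abc|xyz matches abc or xyz, not abcyz or abxyz. If the precedence ever gets confusing, use parentheses to make it clear what you want: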

str_view(c("grey", "gray"), "gr(e|a)y")

3.4 Repetition

The next step up in power is controlling how many times a pattern matches:

?: 0 or 1 times
+: 1 or more times
*: 0 or more times

x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")

str_view(x, "CC+")

str_view(x, 'C[LX]+')

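You can also specify the number of matches precisely:

{n}: exactly n times
{n,}: n or more times
{0,m}: at most m times (written with an explicit lower bound, since the ICU engine used by stringr does not document the {,m} shorthand)
{n,m}: between n and m times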

str_view(x, "C{2}")

str_view(x, "C{2,}")

str_view(x, "C{2,3}")

By default these matches are greedy: they will match the longest string possible. You can make them lazy, matching the shortest string possible, by putting a ? after the quantifier:

str_view(x, 'C{2,3}?')

str_view(x, 'C[LX]+?')
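
The difference between greedy and lazy matching is easiest to see when the pattern could stop in more than one place. Here is a minimal sketch using str_extract() on a made-up tag string:

```r
library(stringr)

tags <- "<b>bold</b>"

greedy <- str_extract(tags, "<.+>")    # runs on to the last ">"
lazy <- str_extract(tags, "<.+?>")     # stops at the first ">"

greedy
#> [1] "<b>bold</b>"
lazy
#> [1] "<b>"
```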

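3.5 Grouping and backreferences

Earlier, you learned about parentheses as a way to disambiguate complex expressions. Parentheses also create a numbered capturing group (number 1, 2, and so on). A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses. You can refer to the same text as previously matched by a capturing group with backreferences, like \1, \2, and so on. For example, the following regular expression finds all fruits that have a repeated pair of letters: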

str_view(fruit, "(..)\\1", match = TRUE)
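
The same idea can be checked without the viewer by using str_subset(), which is covered later in this document. The sketch below uses a short, hand-picked some_fruit vector rather than the full fruit data set:

```r
library(stringr)

some_fruit <- c("banana", "coconut", "apple", "papaya")

# (..)\1 : any two characters followed by the same two characters
repeated_pair <- str_subset(some_fruit, "(..)\\1")
repeated_pair
#> [1] "banana"  "coconut" "papaya"

# (.)\1 : any single character immediately repeated
doubled_letter <- str_subset(some_fruit, "(.)\\1")
doubled_letter
#> [1] "apple"
```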

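4 Tools

Now that you have learned the basics of regular expressions, it is time to apply them to real problems. The stringr functions covered here let you:

Determine which strings match a pattern.
Find the positions of matches.
Extract the content of matches.
Replace matches with new values.
Split a string based on a match.

A word of caution before we continue: because regular expressions are so powerful, it is tempting to try to solve every problem with a single regular expression.

As a cautionary tale, here is a regular expression that checks whether an email address is valid: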

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)

In practice, a check like this is far more elaborate than most code needs. For a discussion of the problem, see the Stack Overflow answer at http://stackoverflow.com/a/201378.

4.1 Detect matches

To determine whether a character vector matches a pattern, use str_detect(). It returns a logical vector the same length as the input:

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1]  TRUE FALSE  TRUE

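When you use a logical vector in a numeric context, such as sum() or mean(), FALSE becomes 0 and TRUE becomes 1. That makes sum() and mean() useful for answering questions about matches across a larger vector: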

# How many common words start with t?
sum(str_detect(words, "^t"))
#> [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
#> [1] 0.2765306

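When you have complex logical conditions (e.g. match a or b but not c unless d), it is often easier to combine multiple str_detect() calls with logical operators than to try to create a single regular expression. For example, here are two ways to find all words that don't contain any vowels: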

# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
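identical(no_vowels_1, no_vowels_2)    # test whether the two results are exactly equal
#> [1] TRUE

The results are identical, but the first approach is significantly easier to understand.

A common use of str_detect() is to select the elements that match a pattern. You can do this with logical subsetting, or with the convenient str_subset() wrapper: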

words[str_detect(words, "x$")]
#> [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
#> [1] "box" "sex" "six" "tax"

Typically, however, your strings will be one column of a data frame, and you will want to use filter() instead:

df <- tibble(
  word = words, 
  i = seq_along(word)
)
df %>% 
  filter(str_detect(word, "x$"))
#> # A tibble: 4 x 2
#>   word      i
#>   <chr> <int>
#> 1 box     108
#> 2 sex     747
#> 3 six     772
#> 4 tax     841

A variation on str_detect() is str_count(): rather than a simple yes or no, it tells you how many matches there are in a string:

x <- c("apple", "banana", "pear")
str_count(x, "a")
#> [1] 1 3 1

# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
#> [1] 1.991837

It is natural to use str_count() with mutate():

df %>% 
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )
#> # A tibble: 980 x 4
#>   word         i vowels consonants
#>   <chr>    <int>  <int>      <int>
#> 1 a            1      1          0
#> 2 able         2      2          2
#> 3 about        3      3          2
#> 4 absolute     4      4          4
#> 5 accept       5      2          4
#> 6 account      6      3          4
#> # … with 974 more rows

Note that matches never overlap. For example, in "abababa", how many times will the pattern "aba" match? Regular expressions say two, not three:

str_count("abababa", "aba")
#> [1] 2
str_view_all("abababa", "aba")

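4.2 Extract matches

To extract the actual text of a match, use str_extract(). To show that off, we need a more complicated example: the Harvard sentences, which are provided in stringr::sentences: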

length(sentences)
#> [1] 720
head(sentences)
#> [1] "The birch canoe slid on the smooth planks." 
#> [2] "Glue the sheet to the dark blue background."
#> [3] "It's easy to tell the depth of a well."     
#> [4] "These days a chicken leg is a rare dish."   
#> [5] "Rice is often served in round bowls."       
#> [6] "The juice of lemons makes fine punch."

Imagine we want to find all sentences that contain a colour. We first create a vector of colour names, and then turn it into a single regular expression:

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")
colour_match
#> [1] "red|orange|yellow|green|blue|purple"

Now we can select the sentences that contain a colour, and then extract the colour to figure out which one it is:

has_colour <- str_subset(sentences, colour_match)
matches <- str_extract(has_colour, colour_match)
head(matches)
#> [1] "blue" "blue" "red"  "red"  "red"  "blue"

Note that str_extract() only extracts the first match. We can see that most easily by first selecting all the sentences that have more than one match:

more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)


str_extract(more, colour_match)
#> [1] "blue"   "green"  "orange"

This is a common pattern for stringr functions, because working with a single match allows you to use much simpler data structures. To get all matches, use str_extract_all(). It returns a list:

str_extract_all(more, colour_match)
#> [[1]]
#> [1] "blue" "red" 
#> 
#> [[2]]
#> [1] "green" "red"  
#> 
#> [[3]]
#> [1] "orange" "red"

If you use simplify = TRUE, str_extract_all() will return a matrix with short matches expanded to the same length as the longest:

str_extract_all(more, colour_match, simplify = TRUE)
#>      [,1]     [,2] 
#> [1,] "blue"   "red"
#> [2,] "green"  "red"
#> [3,] "orange" "red"

x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
#>      [,1] [,2] [,3]
#> [1,] "a"  ""   ""  
#> [2,] "a"  "b"  ""  
#> [3,] "a"  "b"  "c"

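4.3 Grouped matches

Earlier we talked about using parentheses to clarify precedence and to create backreferences when matching. You can also use parentheses to extract parts of a complex match. For example, imagine we want to extract nouns from the sentences. As a heuristic, we will look for any word that comes after "a" or "the". Defining a "word" in a regular expression is a little tricky, so here I use a simple approximation: a sequence of at least one character that isn't a space.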

noun <- "(a|the) ([^ ]+)"

has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>% 
  str_extract(noun)
#>  [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked"
#>  [6] "the sun"    "the huge"   "the ball"   "the woman"  "a helps"

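str_extract() gives us the complete match; str_match() gives each individual component. Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group: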

has_noun %>% 
  str_match(noun)
#>       [,1]         [,2]  [,3]     
#>  [1,] "the smooth" "the" "smooth" 
#>  [2,] "the sheet"  "the" "sheet"  
#>  [3,] "the depth"  "the" "depth"  
#>  [4,] "a chicken"  "a"   "chicken"
#>  [5,] "the parked" "the" "parked" 
#>  [6,] "the sun"    "the" "sun"    
#>  [7,] "the huge"   "the" "huge"   
#>  [8,] "the ball"   "the" "ball"   
#>  [9,] "the woman"  "the" "woman"  
#> [10,] "a helps"    "a"   "helps"
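
Because str_match() returns a matrix, individual groups can be picked out by column. The sketch below uses a made-up settings vector and an illustrative key=value pattern; an element that does not match produces a row of NA:

```r
library(stringr)

settings <- c("width=80", "height=24", "no equals sign")

# Column 1 is the full match, column 2 the key, column 3 the value
m <- str_match(settings, "(\\w+)=(\\d+)")

keys <- m[, 2]
keys
#> [1] "width"  "height" NA

values <- as.integer(m[, 3])
values
#> [1] 80 24 NA
```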

If your data is in a tibble, it is often easier to use [tidyr::extract()](https://tidyr.tidyverse.org/reference/extract.html). It works like str_match() but requires you to name the matches, which are then placed in new columns:

tibble(sentence = sentences) %>% 
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)", 
    remove = FALSE
  )
#> # A tibble: 720 x 3
#>   sentence                                    article noun   
#>   <chr>                                       <chr>   <chr>  
#> 1 The birch canoe slid on the smooth planks.  the     smooth 
#> 2 Glue the sheet to the dark blue background. the     sheet  
#> 3 It's easy to tell the depth of a well.      the     depth  
#> 4 These days a chicken leg is a rare dish.    a       chicken
#> 5 Rice is often served in round bowls.        <NA>    <NA>   
#> 6 The juice of lemons makes fine punch.       <NA>    <NA>   
#> # … with 714 more rows

Like str_extract(), if you want all matches for each string, you will need str_match_all().

4.4 Replacing matches

str_replace() and str_replace_all() allow you to replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-"  "p--r"   "b-n-n-"

With str_replace_all() you can perform multiple replacements by supplying a named vector:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house"    "two cars"     "three people"

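Instead of replacing with a fixed string, you can use backreferences to insert components of the match. In the following code, I flip the order of the second and third words: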

sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(5)
#> [1] "The canoe birch slid on the smooth planks." 
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."     
#> [4] "These a days chicken leg is a rare dish."   
#> [5] "Rice often is served in round bowls."
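
Backreferences in the replacement are also a convenient way to reorder structured text. As a small sketch on made-up ISO-style dates, the three capture groups below hold the year, month, and day, and the replacement writes them back in a different order:

```r
library(stringr)

iso_dates <- c("2020-01-31", "1999-12-01")

# \\1 = year, \\2 = month, \\3 = day
day_first <- str_replace(iso_dates, "(\\d{4})-(\\d{2})-(\\d{2})", "\\3/\\2/\\1")
day_first
#> [1] "31/01/2020" "01/12/1999"
```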

4.5 Splitting

Use str_split() to split a string up into pieces. For example, we could split sentences into words:

sentences %>%
  head(5) %>% 
  str_split(" ")
#> [[1]]
#> [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
#> [8] "planks."
#> 
#> [[2]]
#> [1] "Glue"        "the"         "sheet"       "to"          "the"        
#> [6] "dark"        "blue"        "background."
#> 
#> [[3]]
#> [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."
#> 
#> [[4]]
#> [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
#> [8] "rare"    "dish."  
#> 
#> [[5]]
#> [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

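Because each component might contain a different number of pieces, this returns a list. If you are working with a length-1 vector, the easiest thing is to just extract the first element of the list: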

"a|b|c|d" %>% 
  str_split("\\|") %>% 
  .[[1]]
#> [1] "a" "b" "c" "d"

Otherwise, like the other stringr functions that return a list, you can use simplify = TRUE to return a matrix:

sentences %>%
  head(5) %>% 
  str_split(" ", simplify = TRUE)
#>      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]         
#> [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."    
#> [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background."
#> [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"          
#> [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"       
#> [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""           
#>      [,9]   
#> [1,] ""     
#> [2,] ""     
#> [3,] "well."
#> [4,] "dish."
#> [5,] ""

You can also request a maximum number of pieces:

fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields %>% str_split(": ", n = 2, simplify = TRUE)
#>      [,1]      [,2]    
#> [1,] "Name"    "Hadley"
#> [2,] "Country" "NZ"    
#> [3,] "Age"     "35"
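
The n argument matters most when the separator can occur more than once. In this sketch (the note string is made up), only the first separator is used when n = 2, and the remainder of the string is kept intact in the last piece:

```r
library(stringr)

note <- "note: see section 2: strings"

all_pieces <- str_split(note, ": ")[[1]]
all_pieces
#> [1] "note"          "see section 2" "strings"

two_pieces <- str_split(note, ": ", n = 2)[[1]]
two_pieces
#> [1] "note"                   "see section 2: strings"
```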

Instead of splitting up strings by patterns, you can also split up by character, line, sentence, and word with boundary():

x <- "This is a sentence.  This is another sentence."
str_view_all(x, boundary("word"))


str_split(x, " ")[[1]]
#> [1] "This"      "is"        "a"         "sentence." ""          "This"     
#> [7] "is"        "another"   "sentence."
str_split(x, boundary("word"))[[1]]
#> [1] "This"     "is"       "a"        "sentence" "This"     "is"       "another" 
#> [8] "sentence"

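4.6 Find matches

str_locate() and str_locate_all() give you the starting and ending positions of each match. These are particularly useful when none of the other functions does exactly what you want. You can use str_locate() to find the matching pattern and str_sub() to extract and/or modify it.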
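
A minimal sketch of that combination, using a single made-up string: str_locate() returns an integer matrix with start and end columns, which can be passed straight to str_sub() to extract or overwrite the match.

```r
library(stringr)

x <- "banana"

loc <- str_locate(x, "an")    # first match only
loc
#>      start end
#> [1,]     2   3

# Extract the located match
found <- str_sub(x, loc[, "start"], loc[, "end"])
found
#> [1] "an"

# Overwrite the located match in a copy of the string
modified <- x
str_sub(modified, loc[, "start"], loc[, "end"]) <- "AN"
modified
#> [1] "bANana"

# str_locate_all() returns a list with one matrix per input string
all_loc <- str_locate_all(x, "an")[[1]]
nrow(all_loc)
#> [1] 2
```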

5 Other types of pattern

When you use a pattern that is a string, it is automatically wrapped into a call to regex():

# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))

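You can use the other arguments of regex() to control details of the match:

ignore_case = TRUE allows characters to match either their uppercase or lowercase forms. This always uses the current locale.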
```
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
```

str_view(bananas, regex("banana", ignore_case = TRUE))

multiline = TRUE allows ^ and $ to match the start and end of each line rather than the start and end of the complete string.

x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
#> [1] "Line"
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
#> [1] "Line" "Line" "Line"

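comments = TRUE allows you to use comments and whitespace to make complex regular expressions more understandable. Spaces are ignored, as is everything after #. To match a literal space, you need to escape it: "\\ ".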

phone <- regex("
  \\(?     # optional opening parens
  (\\d{3}) # area code
  [) -]?   # optional closing parens, space, or dash
  (\\d{3}) # another three numbers
  [ -]?    # optional space or dash
  (\\d{3}) # three more numbers
  ", comments = TRUE)

str_match("514-791-8141", phone)
#>      [,1]          [,2]  [,3]  [,4] 
#> [1,] "514-791-814" "514" "791" "814"

dotall = TRUE allows . to match everything, including \n.
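
The effect of dotall can be seen on a string that contains a newline. This is a small sketch with a made-up two-line string:

```r
library(stringr)

two_lines <- "first\nsecond"

# By default . does not match the newline, so the pattern fails
default_dot <- str_detect(two_lines, "first.second")
default_dot
#> [1] FALSE

# With dotall = TRUE the . also matches \n
dotall_dot <- str_detect(two_lines, regex("first.second", dotall = TRUE))
dotall_dot
#> [1] TRUE
```
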
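There are three other functions you can use instead of regex():

fixed(): matches exactly the specified sequence of bytes. It ignores all special regular expressions and operates at a very low level. This allows you to avoid complex escaping and can be much faster than regular expressions. The following microbenchmark shows that it is about 3x faster for a simple example.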
```
microbenchmark::microbenchmark(
  fixed = str_detect(sentences, fixed("the")),
  regex = str_detect(sentences, "the"),
  times = 20
)
#> Unit: microseconds
#>   expr     min       lq     mean   median       uq     max neval
#>  fixed 100.392 101.3465 118.7986 105.9055 108.8545 367.118    20
#>  regex 346.595 349.1145 353.7308 350.2785 351.4135 403.057    20
```
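Beware using fixed() with non-English data. It is problematic because there are often multiple ways of representing the same character. For example, there are two ways to define "á": either as a single character or as an "a" plus an accent: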
```
a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
#> [1] "á" "á"
a1 == a2
#> [1] FALSE
```
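They render identically, but because they are defined differently, fixed() does not find a match. Instead, you can use coll(), defined next, to respect human character comparison rules: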
```
str_detect(a1, fixed(a2))
#> [1] FALSE
str_detect(a1, coll(a2))
#> [1] TRUE
```
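coll(): compares strings using standard collation rules. This is useful for doing case-insensitive matching. Note that coll() takes a locale parameter that controls which rules are used for comparing characters. Unfortunately, different parts of the world use different rules!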
```
# That means you also need to be aware of the difference
# when doing case insensitive matches:
i <- c("I", "İ", "i", "ı")
i
#> [1] "I" "İ" "i" "ı"

str_subset(i, coll("i", ignore_case = TRUE))
#> [1] "I" "i"
str_subset(i, coll("i", ignore_case = TRUE, locale = "tr"))
#> [1] "İ" "i"
```
Both fixed() and regex() have ignore_case arguments, but they do not allow you to pick the locale: they always use the default locale. You can see what that is with the following code:
```
stringi::stri_locale_info()
#> $Language
#> [1] "zh"

#> $Country
#> [1] "CN"

#> $Variant
#> [1] ""

#> $Name
#> [1] "zh_CN"
```
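The downside of coll() is speed; because the rules for recognising which characters are the same are complicated, coll() is relatively slow compared to regex() and fixed().

As you saw with str_split(), you can use boundary() to match boundaries. You can also use it with the other functions: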
```
x <- "This is a sentence."
str_view_all(x, boundary("word"))
```

```
str_extract_all(x, boundary("word"))
#> [[1]]
#> [1] "This"     "is"       "a"        "sentence"
```

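6 Other uses of regular expressions

There are two useful functions in base R that also use regular expressions:

apropos() searches all objects available from the global environment. This is useful if you can't quite remember the name of a function.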

apropos("replace")
#> [1] "%+replace%"       "replace"          "replace_na"       "setReplaceMethod"
#> [5] "str_replace"      "str_replace_all"  "str_replace_na"   "theme_replace"

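dir() lists all the files in a directory. The pattern argument takes a regular expression and only returns file names that match the pattern. For example, you can find all the R Markdown files in the current directory with: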
```
head(dir(pattern = "\\.Rmd$"))
#> [1] "communicate-plots.Rmd" "communicate.Rmd"       "datetimes.Rmd"        
#> [4] "EDA.Rmd"               "explore.Rmd"           "factors.Rmd"
```

7 stringi

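stringr is built on top of the stringi package. stringr is useful when you are learning because it exposes a minimal set of functions, which have been carefully picked to handle the most common string manipulation tasks. stringi, on the other hand, is designed to be comprehensive. It contains almost every function you might ever need: stringi has 250 functions to stringr's 49.

If you find yourself struggling to do something in stringr, it is worth taking a look at stringi. The packages work very similarly. The main difference is the prefix: str_ vs. stri_.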
