R regular expressions (grep, grepl, regexpr, sub, gsub)

[1] "Hello" "Adam!\nHello" "Ava!"

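In R, the split argument of strsplit is treated as a regular expression by default (ordinary strings are not themselves regular expressions). The \n in the text above is the escape sequence for a newline and is rendered as a line break in graphical output, so a pattern that matches any whitespace also splits on it: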

strsplit(text, "\\s")

[1] "Hello" "Adam!" "Hello" "Ava!"

strsplit returns a list, so how to process it further depends on the situation:

class(strsplit(text, "\\s"))

[1] "list"
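What "depends on the situation" can look like in practice is sketched below. This is only an illustration: the vector `greetings` and the other variable names are hypothetical and do not come from the text above. It shows per-string access, flattening, and summarising the list.

```r
# Hypothetical input, used only for illustration
greetings <- c("Hello Adam!", "Hi Ava, hello!")

# strsplit returns a list with one element per input string
parts <- strsplit(greetings, "\\s")

# Per-string access: the words of the first string
first_words <- parts[[1]]

# Flatten everything into a single character vector
all_words <- unlist(parts)

# Keep the structure but summarise it: number of words per string
word_counts <- lengths(parts)

# Apply a function to each element, e.g. take the first word of each string
leading <- vapply(parts, function(p) p[1], character(1))
```

With a single input string, `[[1]]` or `unlist` is usually enough; with several strings, `lengths` or `vapply` keeps the results aligned with the input.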

One case is special: if the split argument is a zero-length string, the result is the individual characters:

strsplit(text, "")

[1] "H" "e" "l" "l" "o" " " "A" "d" "a" "m" "!" "\n" "H" "e"
[15] "l" "l" "o" " " "A" "v" "a" "!"

This also shows that R treats \n as a single character.
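The difference between an escape sequence in a string literal and a regular expression escape can be checked with nchar. The following is a small sketch; `sample_text` is a hypothetical stand-in that mirrors the string used above.

```r
# "\n" in R source is an escape sequence for ONE newline character
newline_len <- nchar("\n")

# "\\s" in R source is TWO characters: a backslash followed by s,
# which the regex engine then reads as "any whitespace character"
pattern_len <- nchar("\\s")

# Hypothetical sample string mirroring the one used above
sample_text <- "Hello Adam!\nHello Ava!"

# Splitting on the empty string yields one element per character
chars <- strsplit(sample_text, "")[[1]]

# Count how many of those single characters are newlines
newline_count <- sum(chars == "\n")
```

This is why regex patterns written as R string literals need a doubled backslash, as in "\\s" or "\\.exe$".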

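5 String searching:

5.1 The grep and grepl functions:

These two functions report matches at the level of vector elements; they give no information about where inside each string the match occurs.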

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE,
     useBytes = FALSE, invert = FALSE)

grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

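Although the arguments look much the same, the return values differ. The following example lists all files in the C:\windows directory and then uses grep and grepl to find the exe files:

files <- list.files("c:/windows")

grep("\\.exe$", files)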

[1] 8 28 30 35 36 58 69 99 100 102 111 112 115 117

grepl("\\.exe$", files)

[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE

[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

[23] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE

[34] FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

[45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

[56] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

[67] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

[78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

[89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE

[100] TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

[111] TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE

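grep returns only the indices of the matching elements, whereas grepl returns a result for every element, as a logical vector indicating whether a match was found. When used to extract a subset of the data, both give the same result: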

files[grep("\\.exe$", files)]

[1] "bfsvc.exe" "explorer.exe" "fveupdate.exe" "HelpPane.exe"
[5] "hh.exe" "notepad.exe" "regedit.exe" "twunk_16.exe"
[9] "twunk_32.exe" "uninst.exe" "winhelp.exe" "winhlp32.exe"
[13] "write.exe" "xinstaller.exe"

files[grepl("\\.exe$", files)]

[1] "bfsvc.exe" "explorer.exe" "fveupdate.exe" "HelpPane.exe"
[5] "hh.exe" "notepad.exe" "regedit.exe" "twunk_16.exe"
[9] "twunk_32.exe" "uninst.exe" "winhelp.exe" "winhlp32.exe"
[13] "write.exe" "xinstaller.exe"
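The relationship between the two functions, and a few of the other arguments in the signatures above, can be sketched as follows. The vector `files_demo` is hypothetical; it is used so that the example does not depend on a Windows directory listing.

```r
# Hypothetical file names, so the example does not depend on a Windows machine
files_demo <- c("notepad.exe", "win.ini", "REGEDIT.EXE", "setup.log", "hh.exe")

idx <- grep("\\.exe$", files_demo)    # integer indices of the matches
hit <- grepl("\\.exe$", files_demo)   # one TRUE/FALSE per element

# which() turns the logical vector into the same indices grep() returns
same_indices <- identical(idx, which(hit))

# value = TRUE returns the matching strings directly
exe_names <- grep("\\.exe$", files_demo, value = TRUE)

# ignore.case = TRUE also picks up "REGEDIT.EXE"
exe_any_case <- grep("\\.exe$", files_demo, value = TRUE, ignore.case = TRUE)

# invert = TRUE returns the elements that do NOT match
non_exe <- grep("\\.exe$", files_demo, value = TRUE, invert = TRUE)
```

One practical difference is exclusion: when nothing matches, grep returns an empty integer vector, and negative indexing with an empty vector selects nothing rather than everything. Negating the logical vector from grepl, as in `files_demo[!hit]`, avoids that pitfall.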

5.2 regexpr, gregexpr and regexec

The results of these three functions include the exact position and length of each match, so they can be used to extract substrings.

text <- c("Hellow, Adam!", "Hi, Adam!", "How are you, Adam.")

regexpr("Adam", text)

[1] 9 5 14
attr(,"match.length")
[1] 4 4 4
attr(,"useBytes")
[1] TRUE
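The position and length information can be turned into the matched substrings either by hand or with regmatches. The following is a minimal sketch; `text_demo` repeats the strings used above so that the block runs on its own, and the other variable names are illustrative.

```r
# Same strings as in the example above, repeated so the block is self-contained
text_demo <- c("Hellow, Adam!", "Hi, Adam!", "How are you, Adam.")

m <- regexpr("Adam", text_demo)

start <- as.integer(m)             # first character of each match, -1 if none
len <- attr(m, "match.length")     # length of each match

# Manual extraction from position and length
by_hand <- substring(text_demo, start, start + len - 1)

# regmatches() does the same bookkeeping and drops elements without a match
by_regmatches <- regmatches(text_demo, m)
```

Both approaches yield "Adam" for each of the three strings; regmatches is usually the more convenient choice because it handles non-matching elements itself.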

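gregexpr("Adam", text)

[[1]]
[1] 9
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE

[[2]]
[1] 5
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE

[[3]]
[1] 14
attr(,"match.length")
[1] 4
attr(,"useBytes")
[1] TRUE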
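The difference between regexpr and gregexpr only shows up when a string contains the pattern more than once. A small sketch, using a hypothetical string `line` that is not part of the example above:

```r
# Hypothetical string containing the pattern more than once
line <- "Adam met Adam and another Adam."

first_only <- regexpr("Adam", line)    # position of the first match only
all_hits <- gregexpr("Adam", line)     # a list: one element per input string

# Start position of every match in the (single) input string
positions <- as.integer(all_hits[[1]])
n_matches <- length(positions)

# regmatches() on a gregexpr result returns a list of character vectors
found <- regmatches(line, all_hits)[[1]]
```

Because gregexpr returns a list, the result for each input string has to be pulled out with `[[ ]]` before it is used.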

regexec("Adam", text)

[[1]]
[1] 9
attr(,"match.length")
[1] 4

[[2]]
[1] 5
attr(,"match.length")
[1] 4

[[3]]
[1] 14
attr(,"match.length")
[1] 4
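With a plain pattern such as "Adam", regexec reports the same positions as regexpr. It becomes more useful once the pattern contains parenthesised groups, because it also reports the position of each group. The sketch below assumes an illustrative pattern that captures the word before ", Adam"; the pattern and variable names are not from the text above.

```r
# Same strings as in the example above, repeated so the block is self-contained
text_demo <- c("Hellow, Adam!", "Hi, Adam!", "How are you, Adam.")

# One parenthesised group: the word that directly precedes ", Adam"
m <- regexec("([[:alpha:]]+), Adam", text_demo)

# Each list element holds the full match first, then one entry per group
pieces <- regmatches(text_demo, m)

full_match <- vapply(pieces, function(p) p[1], character(1))
greeting <- vapply(pieces, function(p) p[2], character(1))
```

Here `full_match` holds the whole matched text for each string and `greeting` holds only the captured word.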