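String handling is closely tied to regular expressions; see the post "Regular Expressions in R" from the same series.
1. Basic string handling
Create a character vector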
x <- c('huake','wuda')
1.1 nchar(): count the characters in a string
nchar(x)
# [1] 5 4
Note the difference between nchar() and length(): length(x) returns 2, the number of strings in the vector. str_length() from the stringr package gives the same per-string counts as nchar().
length(x)
# [1] 2
stringr::str_length(x) # needs the stringr package to be installed
# [1] 5 4
1.2 Case conversion
toupper(): lower case to upper case
tolower(): upper case to lower case
toupper('huake')
# [1] "HUAKE"
tolower('WUDA')
# [1] "wuda"
1.3 paste() and paste0(): concatenating strings
paste()
stringa <- LETTERS[1:5]
STRINGB <- 1:5
paste(stringa,STRINGB)
# [1] "A 1" "B 2" "C 3" "D 4" "E 5"
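# sep sets the separator placed between the pasted arguments
paste(stringa,STRINGB,sep='-')
# [1] "A-1" "B-2" "C-3" "D-4" "E-5"
# collapse joins all the pasted elements into one string, with the given separator between them
paste(stringa,STRINGB,collapse ='-')
# [1] "A 1-B 2-C 3-D 4-E 5"
paste0() (the 0 means the arguments are pasted together with no separator)
paste0(stringa,STRINGB)
# [1] "A1" "B2" "C3" "D4" "E5"
# paste0() has no sep argument: sep='-' is treated as one more string to paste, so '-' ends up at the end of each element; collapse still works as in paste()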
paste0(stringa,STRINGB,sep='-')
# [1] "A1-" "B2-" "C3-" "D4-" "E5-"
paste0(stringa,STRINGB,collapse ='-')
# [1] "A1-B2-C3-D4-E5"
Calling paste() with sep = "" gives the same result as paste0().
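A quick check of that equivalence, as a minimal sketch reusing the two vectors from above (identical() is used here only to compare the results):

```r
stringa <- LETTERS[1:5]
STRINGB <- 1:5

# paste() with an empty separator and paste0() build the same vector
with_sep <- paste(stringa, STRINGB, sep = "")
with_p0  <- paste0(stringa, STRINGB)
identical(with_sep, with_p0)

# collapse behaves the same way in both functions
collapsed_sep <- paste(stringa, STRINGB, sep = "", collapse = "-")
collapsed_p0  <- paste0(stringa, STRINGB, collapse = "-")
identical(collapsed_sep, collapsed_p0)
```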
1.4 strsplit(): splitting strings
The result of the split is a list.
stringC <- paste(stringa,STRINGB,sep='/')
stringC
# [1] "A/1" "B/2" "C/3" "D/4" "E/5"
M <- strsplit(stringC,split = '/')
M
# [[1]]
# [1] "A" "1"
# [[2]]
# [1] "B" "2"
# [[3]]
# [1] "C" "3"
# [[4]]
# [1] "D" "4"
# [[5]]
# [1] "E" "5"
class(M)
# [1] "list"
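Because the result is a list, the pieces usually have to be pulled out again before further use. A minimal sketch of two common ways to do that (the names letters_part, numbers_part and parts are made up for illustration):

```r
stringC <- c("A/1", "B/2", "C/3", "D/4", "E/5")
M <- strsplit(stringC, split = "/", fixed = TRUE)

# Pull the first and the second piece out of every list element
letters_part <- vapply(M, function(p) p[1], character(1))
numbers_part <- as.integer(vapply(M, function(p) p[2], character(1)))

# Or bind the pieces row-wise into a character matrix
parts <- do.call(rbind, M)
parts
```

vapply() is used instead of sapply() so that the return type is fixed; this only works cleanly when every string splits into the same number of pieces.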
1.5 substr(): extracting substrings
# Extract characters 2 through 4
stringd <- c('python','java','ruby','php','huazhongda')
sub_str <- substr(stringd,start = 2,stop = 4)
sub_str
# [1] "yth" "ava" "uby" "hp" "uaz"
# substr() can also be assigned to: replace characters 2 through 4 with 'aaa'
substr(stringd,start = 2,stop = 4) <- 'aaa'
stringd
# [1] "paaaon" "jaaa" "raaa" "paa" "haaahongda"
1.6 grep() and grepl()
These functions handle more complex string matching.
# Usage
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
Create a sample vector
seq_names <- c('EU_FRA02_C1_S2008','AF_COM12_B0_2004','AF_COM17_F0_S2008',
'AS_CHN11_C3_2004','EU-FRA-C3-S2007','NAUSA02E02005',
'AS_CHN12_N0_05','NA_USA03_C2_S2007','NA USA04 A3 2004',
'EU_UK01_A0_2009','eu_fra_a2_s98','SA/BRA08/B0/1996')
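# Mixed upper and lower case, slashes and underscores, confirmed and unconfirmed years, and so on.
# Take the first element: EU is Europe, FRA is France, 02 is the second sequence from France, C1 is the sequence subtype, 2008 is the sample collection year, and the S marks 2008 as an estimated rather than confirmed year.
Use grep() to extract the French sequences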
fra_seq <- grep(pattern = 'FRA|fra',x=seq_names)
fra_seq
# [1] 1 5 11
seq_names[fra_seq]
# [1] "EU_FRA02_C1_S2008" "EU-FRA-C3-S2007" "eu_fra_a2_s98"
# Setting value = TRUE returns the matching elements instead of their indices
fra_seq <- grep(pattern = 'FRA|fra',x=seq_names,value = TRUE)
fra_seq
# [1] "EU_FRA02_C1_S2008" "EU-FRA-C3-S2007" "eu_fra_a2_s98"
# With ignore.case = TRUE a single pattern covers both upper and lower case
grep(pattern = 'fra',x=seq_names,value = TRUE,ignore.case = TRUE)
# [1] "EU_FRA02_C1_S2008" "EU-FRA-C3-S2007" "eu_fra_a2_s98"
# The pattern is a regular expression ('|' means "or")
grepl() returns a logical vector (TRUE or FALSE)
grepl(pattern = 'FRA|fra',x=seq_names)
# [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
# Subset with []
seq_names[grepl(pattern = 'FRA|fra',x=seq_names)]
# [1] "EU_FRA02_C1_S2008" "EU-FRA-C3-S2007" "eu_fra_a2_s98"
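Exercise: extract the sequences in the vector above that have a confirmed collection year.
(Approach: find the sequences whose year is uncertain (an s or S directly before the year), then negate the result.)
spe_seq <- seq_names[!grepl(pattern = '[sS][0-9]{2,4}\\b',seq_names)]
spe_seq
# [1] "AF_COM12_B0_2004" "AS_CHN11_C3_2004" "NAUSA02E02005" "AS_CHN12_N0_05"
# [5] "NA USA04 A3 2004" "EU_UK01_A0_2009" "SA/BRA08/B0/1996"
# The doubled backslash is how a single regex backslash is written in an R string; \\b matches a word boundary, and placed on the right it requires the digits to end there.
# [sS] matches either s or S (inside brackets no | is needed; [s|S] would also match a literal |), [0-9] matches one digit, and {2,4} right after [0-9] repeats it 2 to 4 times.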
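Inside a bracket expression the bar is an ordinary character, not an alternation operator. A small sketch of the difference (the test string 'a|b' is made up for illustration):

```r
# Inside [...] the bar is a literal, so [s|S] also matches "|"
bar_in_class <- grepl("[s|S]", "a|b")   # the literal | is matched
no_bar       <- grepl("[sS]",  "a|b")   # only s or S would match

# Outside brackets, | means alternation
alternation <- grepl("s|S", c("s98", "S2008", "2004"))

bar_in_class
no_bar
alternation
```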
1.7 gsub() and sub()
# Usage
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
money <- c('$1888','$2888','$3888')
# Because of the dollar sign, as.numeric() cannot be applied directly
as.numeric(money)
# [1] NA NA NA
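gsub()
# $ is a regex metacharacter (it anchors the end of the string), so it must be escaped as \\$ to match a literal dollar sign; the result can then be passed to as.numeric().
money1 <- gsub('\\$',replacement = '',money)
money1
# [1] "1888" "2888" "3888"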
as.numeric(money1)
# [1] 1888 2888 3888
gsub() replaces every match it finds.
sub() replaces only the first match in each string.
sub('\\$',replacement = '',money)
# [1] "1888" "2888" "3888"
money <- c('$1888 $2888 $3888')
sub('\\$',replacement = '',money)
# [1] "1888 $2888 $3888"
gsub('\\$',replacement = '',money)
# [1] "1888 2888 3888"
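Escaping is not the only way to match a literal dollar sign. The sketch below shows two alternatives, fixed = TRUE and a bracket expression; the values with thousands separators are made up for illustration:

```r
money <- c("$1,888", "$2,888", "$3,888")   # hypothetical values

# fixed = TRUE treats the pattern as a literal string, so no escaping is needed
no_dollar <- gsub("$", "", money, fixed = TRUE)

# Inside a bracket expression $ is literal too; strip both symbols at once
amounts <- as.numeric(gsub("[$,]", "", money))

no_dollar
amounts
```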
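1.8 regexpr(), gregexpr() and regexec()
The three functions are closely related: regexpr() reports the first match in each string, gregexpr() reports every match, and regexec() reports the first match together with any capture groups.
# Usage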
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
regexec(pattern, text, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE)
Taking regexpr() as an example:
# Find where pp occurs in each element of test_string
test_string <- c('happy','apple','application','apolotoc')
regexpr('pp',test_string)
# [1] 3 2 2 -1
# attr(,"match.length")
# [1] 2 2 2 -1
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
# The result 3 2 2 -1 means that pp starts at position 3 in the first string and at position 2 in the second and third; the last string has no match, so -1 is returned.
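The positions returned by these functions are usually fed to regmatches() to get the matched text itself. A minimal sketch on the same vector (first_pp and all_p are illustrative names):

```r
test_string <- c("happy", "apple", "application", "apolotoc")

# regexpr() finds only the first match per string;
# regmatches() returns the matched text and drops strings with no match
first_pp <- regmatches(test_string, regexpr("pp", test_string))

# gregexpr() finds every match; the result is a list with one element per string
all_p <- regmatches(test_string, gregexpr("p", test_string))
n_p <- lengths(all_p)

first_pp
n_p
```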
1.9 agrep() and agrepl()
Taking agrep() as an example:
string1 <- c('I need a favour','my favorite sport','you made an error')
agrep('favor',string1)
# [1] 1 2
The British and American spellings are both found, because agrep() does approximate (fuzzy) matching.
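The reason both spellings match is that they are only one edit apart. A small sketch using adist() from base R to show that distance, alongside the logical counterpart agrepl():

```r
string1 <- c("I need a favour", "my favorite sport", "you made an error")

# Edit distance between the two spellings: one inserted letter
spelling_distance <- adist("favor", "favour")

# agrepl() returns TRUE/FALSE per element instead of indices
fuzzy_hit <- agrepl("favor", string1)

spelling_distance
fuzzy_hit
```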
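2. The stringr and stringi packages
stringr and stringi cover similar ground. stringr is built on top of stringi; stringi is the more powerful of the two, but relies more heavily on regular expressions.
# List the functions exported by the two packages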
library(stringr)
library(stringi)
ls('package:stringr')
ls('package:stringi')
# When this was written, stringr had 52 functions and stringi had 252; the counts vary with the installed versions.
2.1 The stringr package
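Common functions | Purpose |
---|---|
str_split() / str_c() | Split and join strings |
str_length() | Count the characters in a string |
str_sub() | Extract characters by position |
str_dup() | Repeat a string a given number of times |
str_trim() | Remove leading and trailing whitespace |
str_to_upper() / str_to_lower() / str_to_title() | Convert case |
str_locate() | Locate a pattern in a string |
str_detect(x, "h") | Detect a pattern; returns a logical vector |
str_extract() / str_extract_all() | Extract matches |
str_remove() / str_remove_all() | Remove matches |
str_replace() / str_replace_all() | Replace matches |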
- 2.1.1 str_c() and str_split()
str_c() works like paste0(): by default the pieces are joined with no separator.
library(stringr)
str_c('a','b')
# [1] "ab"
str_c('a','b',sep='-')
# [1] "a-b"
str_split()
x <- "The birch canoe slid on the smooth planks."
x
# [1] "The birch canoe slid on the smooth planks."
str_split(x," ") # returns a list
# [[1]]
# [1] "The" "birch" "canoe" "slid" "on"
# [6] "the" "smooth" "planks."
str_split(x," ")[[1]] # [[1]] extracts the character vector from the list
# [1] "The" "birch" "canoe" "slid" "on"
# [6] "the" "smooth" "planks."
y = c("john 150","mike 140","lucy 152")
str_split(y," ")
# [[1]]
# [1] "john" "150"
# [[2]]
# [1] "mike" "140"
# [[3]]
# [1] "lucy" "152"
str_split(y," ",simplify = TRUE) # simplify = TRUE returns a character matrix
# [,1] [,2]
# [1,] "john" "150"
# [2,] "mike" "140"
# [3,] "lucy" "152"
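The matrix form is convenient when the pieces should become columns. A minimal sketch, assuming stringr is installed; the column names name and score are made up for illustration:

```r
library(stringr)

y <- c("john 150", "mike 140", "lucy 152")

# simplify = TRUE gives a character matrix with one row per input string
parts <- str_split(y, " ", simplify = TRUE)

# Hypothetical column names; the second column is converted to numeric
scores <- data.frame(name  = parts[, 1],
                     score = as.numeric(parts[, 2]),
                     stringsAsFactors = FALSE)
scores
```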
- 2.1.2 str_length()
Counts the characters in each string, like nchar().
- 2.1.3 str_sub(): extract characters by position
aaa <- 'huake tongji cardio'
str_sub(aaa,c(1,4,8),c(2,7,11))
# [1] "hu" "ke t" "ongj"
# The first piece is characters 1-2, the second is characters 4-7, the third is characters 8-11
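str_sub() also accepts negative positions and can be assigned to, much like substr() in section 1.5. A short sketch on the same string, assuming stringr is installed (last_word is an illustrative name):

```r
library(stringr)

aaa <- "huake tongji cardio"

# Negative positions count from the end of the string
last_word <- str_sub(aaa, -6, -1)

# str_sub() can also be used on the left of an assignment
str_sub(aaa, 1, 5) <- "HUAKE"

last_word
aaa
```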
- 2.1.4 str_dup(): repeat strings
fruit <- c('apple','pear','banana')
str_dup(fruit,2) # 2 repeats each string twice
# [1] "appleapple" "pearpear" "bananabanana"
str_dup(fruit,2:4)
# [1] "appleapple" "pearpearpear" "bananabananabananabanana"
- 2.1.5 str_trim(): remove leading and trailing whitespace
string <- c(' Huake is good ')
string
# [1] " Huake is good "
str_trim(string,side = 'both')
# [1] "Huake is good"
- 2.1.6 str_locate(): locate a pattern in a string
fruit <- c("apple", "banana", "pear", "pineapple")
str_locate(fruit, "$")
# start end
# [1,] 6 5
# [2,] 7 6
# [3,] 5 4
# [4,] 10 9
str_locate(fruit, "a")
# start end
# [1,] 1 1
# [2,] 2 2
# [3,] 3 3
# [4,] 5 5
str_locate(fruit, c("a", "b", "p", "p"))
# start end
# [1,] 1 1
# [2,] 1 1
# [3,] 1 1
# [4,] 1 1
- 2.1.7 str_detect(): detect a pattern
fruit <- c("apple", "banana", "pear", "pinapple")
str_detect(fruit, "a")
# [1] TRUE TRUE TRUE TRUE
str_detect(fruit, "^a")
# [1] TRUE FALSE FALSE FALSE
str_detect(fruit, "a$")
# [1] FALSE TRUE FALSE FALSE
str_detect(fruit, "b")
# [1] FALSE TRUE FALSE FALSE
str_detect(fruit, "[aeiou]")
# [1] TRUE TRUE TRUE TRUE
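Tip: combining str_detect() with ifelse() sorts strings into two classes according to whether they contain a given pattern. This is common in analyses of GEO data and the like, where the sample name tells you whether a sample is normal or a case (for example, tumor).
Usage:
ifelse(str_detect(colnames(a), 'tumor'), 'tumor', 'normal')
# For each column name of data frame a, returns 'tumor' if the name contains tumor and 'normal' otherwise.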
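A self-contained version of that idea, as a sketch only: the sample names below are invented and stand in for the column names of a real expression matrix, and the case-insensitive match is an added assumption rather than part of the original usage.

```r
library(stringr)

# Hypothetical sample names, standing in for colnames() of an expression matrix
sample_names <- c("GSM01_tumor", "GSM02_normal", "GSM03_Tumor", "GSM04_normal")

# regex(..., ignore_case = TRUE) makes the match case-insensitive
group <- ifelse(str_detect(sample_names, regex("tumor", ignore_case = TRUE)),
                "tumor", "normal")

# A factor with an explicit level order is handy for downstream comparisons
group <- factor(group, levels = c("normal", "tumor"))
table(group)
```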
- 2.1.8 str_extract() and str_extract_all()
shopping_list <- c("apples x4", "bag of flour", "bag of sugar", "milk x2")
str_extract(shopping_list, "\\d")
# [1] "4" NA NA "2"
str_extract(shopping_list, "[a-z]+")
# [1] "apples" "bag" "bag" "milk"
str_extract(shopping_list, "[a-z]{1,4}")
# [1] "appl" "bag" "bag" "milk"
str_extract(shopping_list, "\\b[a-z]{1,4}\\b")
# [1] NA "bag" "bag" "milk"
str_extract_all(shopping_list, "[a-z]+")
# [[1]]
# [1] "apples" "x"
# [[2]]
# [1] "bag" "of" "flour"
# [[3]]
# [1] "bag" "of" "sugar"
# [[4]]
# [1] "milk" "x"
- 2.1.9 str_remove() and str_remove_all()
fruits <- c("one apple", "two pears", "three bananas")
str_remove(fruits, "[aeiou]")
# [1] "ne apple" "tw pears" "thre bananas"
str_remove_all(fruits, "[aeiou]")
# [1] "n ppl" "tw prs" "thr bnns"
- 2.1.10 str_replace() and str_replace_all()
fruits <- c("one apple", "two pears", "three bananas")
str_replace(fruits, "[aeiou]", "-")
# [1] "-ne apple" "tw- pears" "thr-e bananas"
str_replace_all(fruits, "[aeiou]", "-")
# [1] "-n- -ppl-" "tw- p--rs" "thr-- b-n-n-s"
2.2 The stringi package
- 2.2.1 stri_join(): concatenating strings
stri_join(1:13, letters)
# [1] "1a" "2b" "3c" "4d" "5e" "6f" "7g" "8h" "9i" "10j" "11k"
# [12] "12l" "13m" "1n" "2o" "3p" "4q" "5r" "6s" "7t" "8u" "9v"
# [23] "10w" "11x" "12y" "13z"
stri_join(1:13, letters, sep=',')
# [1] "1,a" "2,b" "3,c" "4,d" "5,e" "6,f" "7,g" "8,h" "9,i" "10,j"
# [11] "11,k" "12,l" "13,m" "1,n" "2,o" "3,p" "4,q" "5,r" "6,s" "7,t"
# [21] "8,u" "9,v" "10,w" "11,x" "12,y" "13,z"
stri_join(1:13, letters, collapse='; ')
# [1] "1a; 2b; 3c; 4d; 5e; 6f; 7g; 8h; 9i; 10j; 11k; 12l; 13m; 1n; 2o; 3p; 4q; 5r; 6s; 7t; 8u; 9v; 10w; 11x; 12y; 13z"
- 2.2.2 stri_cmp_eq() and stri_cmp_neq()
stri_cmp_eq() tests whether two strings are exactly equal.
stri_cmp_neq() tests whether two strings differ.
stri_cmp_eq('AB','AB')
# [1] TRUE
stri_cmp_eq('AB','aB')
# [1] FALSE
stri_cmp_neq('AB','aB')
# [1] TRUE
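- 2.2.3 stri_cmp_lt() and stri_cmp_gt()
stri_cmp_lt(): less than
stri_cmp_gt(): greater than
Strings are compared in collation order, character by character: digits are ordered 0 to 9 and letters alphabetically, with later characters counting as greater. By default the comparison is not numeric, so '9' is not less than '10'.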
stri_cmp_lt('121','221')
# [1] TRUE
stri_cmp_lt('a121','b221')
# [1] TRUE
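Digit strings are compared character by character unless numeric collation is requested. A minimal sketch of the difference, assuming stringi is installed and using its stri_opts_collator() helper (lexicographic and numeric are illustrative names):

```r
library(stringi)

# Default collation compares character by character: "9" sorts after "1"
lexicographic <- stri_cmp_lt("9", "10")

# With numeric = TRUE, runs of digits are compared by their numeric value
numeric_aware <- stri_cmp_lt("9", "10",
                             opts_collator = stri_opts_collator(numeric = TRUE))

lexicographic
numeric_aware
```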
- 2.2.4 stri_count
s <- 'Lorem ipsum dolor sit amet, consectetur adipisicing elit.'
stri_count(s, fixed='dolor')
# [1] 1
stri_count(s, regex='\\p{L}+')
# [1] 8
- 2.2.5 stri_dup
stri_dup('a', 1:5)
# [1] "a" "aa" "aaa" "aaaa" "aaaaa"
stri_dup(c('a', NA, 'ba'), 4)
# [1] "aaaa" NA "babababa"
stri_dup(c('abc', 'pqrst'), c(4, 2))
# [1] "abcabcabcabc" "pqrstpqrst"
- 2.2.6 stri_detect_fixed
stri_detect_fixed(c('stringi R', 'R STRINGI', '123'), c('i', 'R', '0'))
# [1] TRUE TRUE FALSE
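The function is vectorized: each string in the first argument is searched for the corresponding pattern in the second, returning TRUE if it is found and FALSE otherwise.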
- 2.2.7 stri_detect_regex
# Find the strings that start with ab, and the strings that end with t
stri_detect_regex(c('above','abort','about','abnormal','abandon'),'^ab')
# [1] TRUE TRUE TRUE TRUE TRUE
stri_detect_regex(c('above','abort','about','abnormal','abandon'),'t\\b')
# [1] FALSE TRUE TRUE FALSE FALSE
# case_insensitive=TRUE ignores case
stri_detect_regex(c('ABove','abort','About','aBnormal','abandon'),'^ab',case_insensitive=TRUE)
# [1] TRUE TRUE TRUE TRUE TRUE
- 2.2.8 stri_startswith_fixed(): test whether a string starts with a given pattern
stri_startswith_fixed(c('a1','a2','b3','a4','c5'),'a1')
# [1] TRUE FALSE FALSE FALSE FALSE
stri_startswith_fixed(c('abaDc','asdfh','abiude'),'ba',from=2)
# [1] TRUE FALSE FALSE
# from sets the character position at which matching starts
- 2.2.9 stri_endswith_fixed(): test whether a string ends with a given pattern
stri_endswith_fixed(c('abaDc','asdfh','abiudba'),'ba')
# [1] FALSE FALSE TRUE
stri_endswith_fixed(c('abaDc','asdfh','abiudba'),'ba',to=3)
# [1] TRUE FALSE FALSE
# to sets the character position at which the match must end
- 2.2.10 stri_extract_all
stri_extract_all('XaaaaX', regex=c('\\p{Ll}', '\\p{Ll}+', '\\p{Ll}{2,3}', '\\p{Ll}{2,3}?'))
# [[1]]
# [1] "a" "a" "a" "a"
# [[2]]
# [1] "aaaa"
# [[3]]
# [1] "aaa"
# [[4]]
# [1] "aa" "aa"
stri_extract_all('Bartolini', coll='i')
# [[1]]
# [1] "i" "i"
stri_extract_all('stringi is so good!', charclass='\\p{Zs}') # all white-spaces
# [[1]]
# [1] " " " " " "
- 2.2.11 stri_extract_all_fixed
With overlap=TRUE, matches are allowed to overlap one another.
stri_extract_all_fixed('abaBAba', 'Aba', case_insensitive=TRUE)
# [[1]]
# [1] "aba" "Aba"
stri_extract_all_fixed('abaBAba', 'Aba', case_insensitive=TRUE, overlap=TRUE)
# [[1]]
# [1] "aba" "aBA" "Aba"
- 2.2.12 stri_extract_all_boundaries(): split text at text boundaries
Here the text is split after each space; note that each extracted piece keeps its trailing space.
stri_extract_all_boundaries('stringi: THE string processing package 123.48...')
# [[1]]
# [1] "stringi: " "THE " "string "
# [4] "processing " "package " "123.48..."
- 2.2.13 stri_extract_all_words(): extract words
stri_extract_all_words('stringi: THE string processing package 123.48...')
# [[1]]
# [1] "stringi" "THE" "string" "processing"
# [5] "package" "123.48"
- 2.2.14 stri_isempty(): test whether a string is empty
Note: a string containing only a space is not empty.
stri_isempty(letters[1:3])
# [1] FALSE FALSE FALSE
stri_isempty(c(',', '', 'abc', '123', '\u0105\u0104'))
# [1] FALSE TRUE FALSE FALSE FALSE
stri_isempty(character(1))
# [1] TRUE
- 2.2.15 stri_locate_all(): find every position at which a pattern occurs in a string
stri_locate_all('Bartolini', fixed='i')
# [[1]]
# start end
# [1,] 7 7
# [2,] 9 9