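Quantifiers in a regular expression specify how many times a pattern must occur for a match. This article describes three kinds of quantifier: greedy, reluctant, and possessive. At first glance the quantifiers X? (greedy), X?? (reluctant), and X?+ (possessive) seem to do much the same thing, since each of them matches "X" once or not at all. In fact there are subtle differences between them, which the last part of this article explains.
Let's build three different regular expressions with greedy quantifiers: a?, a*, and a+. We will first see what happens when they are tested against an empty input string.
Here is the test code (compile and run it directly from a terminal):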
import java.io.Console;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestHarness {
    public static void main(String[] args) {
        // System.console() returns null when no interactive terminal is attached.
        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        while (true) {
            Pattern pattern =
                Pattern.compile(console.readLine("%nEnter your regex: "));
            Matcher matcher =
                pattern.matcher(console.readLine("Enter input string to search: "));
            boolean found = false;
            // Report every match until find() can no longer find one.
            while (matcher.find()) {
                console.format("I found the text" +
                    " \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(),
                    matcher.start(),
                    matcher.end());
                found = true;
            }
            if (!found) {
                console.format("No match found.%n");
            }
        }
    }
}
Enter your regex: a?
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.
Enter your regex: a*
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.
Enter your regex: a+
Enter input string to search:
No match found.
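The harness above needs an interactive console, which is not always available (for example inside some IDEs). As a rough, non-interactive sketch of the same loop, the class below collects the matches into a list instead of printing them. The class name QuantifierDemo, the method findAll, and the output format are illustrative assumptions, not part of the original tutorial.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical non-interactive variant of the test harness.
final class QuantifierDemo {

    // Collects every match as "text" [start,end) in the order find() reports them.
    static List<String> findAll(String regex, String input) {
        List<String> results = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            results.add("\"" + matcher.group() + "\" ["
                + matcher.start() + "," + matcher.end() + ")");
        }
        return results;
    }

    public static void main(String[] args) {
        // Same three greedy quantifiers, tested against the empty string.
        for (String regex : new String[] {"a?", "a*", "a+"}) {
            System.out.println(regex + " -> " + findAll(regex, ""));
        }
    }
}
```

An empty list plays the role of the "No match found." message in the interactive version.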
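Zero-Length Matches
In the examples above, the first two searches succeed because a? and a* both allow zero occurrences of the letter "a". Notice that the start and end indices are both 0. The empty input string "" has no length, so the pattern simply matches nothing at the very start, at index 0. Matches of this kind are called zero-length matches. A zero-length match can occur in four situations:
1. In an empty input string.
2. At the beginning of the input string, that is, at index 0. (The beginning is itself an empty string.)
3. After the last character of the input string. (The end is itself an empty string.)
4. Between any two characters. In "bc", for example, there is an empty string "" between b and c.
Take the string "foo" as an example. Its characters sit between the index positions as follows: "f" between index 0 and 1, the first "o" between 1 and 2, and the second "o" between 2 and 3.
The beginning of the string is therefore index 0 and the end is index 3, and a zero-length match can occur at both.
Zero-length matches are easy to recognize because they always start and end at the same index.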
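To make the four situations concrete, the following sketch lists the indices at which a pattern produces a zero-length match. It relies only on the rule stated above (start index equals end index); the class name ZeroLengthDemo and the method zeroLengthPositions are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for spotting zero-length matches.
final class ZeroLengthDemo {

    // Returns every index where the regex matches without consuming any input.
    static List<Integer> zeroLengthPositions(String regex, String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            if (matcher.start() == matcher.end()) {
                positions.add(matcher.start());
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // "foo" contains no 'a', so a? matches the empty string at every boundary.
        System.out.println(zeroLengthPositions("a?", "foo")); // [0, 1, 2, 3]
    }
}
```

With a? and the input "foo", the start (0), the end (3), and both positions between characters (1 and 2) all show up, because no "a" is available anywhere.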
Let's look at a few more examples, this time using the single letter "a" as the input string.
Enter your regex: a?
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
Enter your regex: a*
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
Enter your regex: a+
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
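All three quantifiers find the letter "a", but the first two also find a zero-length match at index 1, that is, after the last character of the input. Remember that the matcher sees the "a" as sitting between index 0 and index 1, and that the test program keeps looping until no further match can be found.
Next, enter "ababaaaab" as the input and see what comes out: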
Enter your regex: a?
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "a" starting at index 5 and ending at index 6.
I found the text "a" starting at index 6 and ending at index 7.
I found the text "a" starting at index 7 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.
Enter your regex: a*
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.
Enter your regex: a+
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.
Readers may want to work out for themselves why these results come out the way they do.
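As a starting point for that exercise, the sketch below lists only the matched text for each pattern, which makes the difference easier to see: a? takes at most one "a" per match, a* takes a whole run of them, and a+ never produces an empty match. The class name MatchedTextDemo and the method matchedTexts are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that keeps only the text of each match.
final class MatchedTextDemo {

    // Returns the text of every match, including empty (zero-length) ones.
    static List<String> matchedTexts(String regex, String input) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            texts.add(matcher.group());
        }
        return texts;
    }

    public static void main(String[] args) {
        String input = "ababaaaab";
        for (String regex : new String[] {"a?", "a*", "a+"}) {
            List<String> texts = matchedTexts(regex, input);
            System.out.println(regex + " -> " + texts.size() + " matches: " + texts);
        }
    }
}
```

For this input the counts are 10, 7, and 3 matches respectively, in line with the transcripts above.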
To require a character to occur an exact number of times, use braces "{}". For example, to match "aaa":
Enter your regex: a{3}
Enter input string to search: aa
No match found.
Enter your regex: a{3}
Enter input string to search: aaa
I found the text "aaa" starting at index 0 and ending at index 3.
Enter your regex: a{3}
Enter input string to search: aaaa
I found the text "aaa" starting at index 0 and ending at index 3.
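In the third example, note that once the first three a's have been matched, whatever follows has no bearing on that match. The matcher simply carries on and tries to match the pattern again against the input after "aaa".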
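A short sketch of that behavior: with a longer run of a's, a{3} matches again each time three more a's are available, and a leftover shorter than three is ignored. The class name ExactCountDemo and the method matchRanges are illustrative names, and the "start-end" string format is just a convenience for printing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo of the exact-count quantifier {n}.
final class ExactCountDemo {

    // Returns each match as "start-end" (end index exclusive).
    static List<String> matchRanges(String regex, String input) {
        List<String> ranges = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            ranges.add(matcher.start() + "-" + matcher.end());
        }
        return ranges;
    }

    public static void main(String[] args) {
        System.out.println(matchRanges("a{3}", "aaaa"));      // [0-3]
        System.out.println(matchRanges("a{3}", "aaaaaaaaa")); // [0-3, 3-6, 6-9]
    }
}
```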
A quantifier can also be applied to a subexpression (a group), for example:
Enter your regex: (dog){3}
Enter input string to search: dogdogdogdogdogdog
I found the text "dogdogdog" starting at index 0 and ending at index 9.
I found the text "dogdogdog" starting at index 9 and ending at index 18.
Enter your regex: dog{3}
Enter input string to search: dogdogdogdogdogdog
No match found.
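In the second example the quantifier {3} applies only to the letter "g", so the regular expression matches "do" followed by three "g" characters. The input contains no such text, which is why the second example finds no match.
Here is one more example: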
Enter your regex: [abc]{3}
Enter input string to search: abccabaaaccbbbc
I found the text "abc" starting at index 0 and ending at index 3.
I found the text "cab" starting at index 3 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.
I found the text "ccb" starting at index 9 and ending at index 12.
I found the text "bbc" starting at index 12 and ending at index 15.
Enter your regex: abc{3}
Enter input string to search: abccabaaaccbbbc
No match found.
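The scope of a quantifier can be checked quickly with Pattern.matches, which succeeds only if the whole input matches the pattern. The sketch below shows what the ungrouped patterns dog{3} and abc{3} actually accept; the class name QuantifierScopeDemo and the method fullyMatches are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo of what a quantifier binds to.
final class QuantifierScopeDemo {

    // True only when the entire input matches the regex.
    static boolean fullyMatches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // {3} binds to the single preceding character...
        System.out.println(fullyMatches("dog{3}", "doggg"));       // true
        System.out.println(fullyMatches("dog{3}", "dogdogdog"));   // false
        // ...to the whole group when parentheses are used...
        System.out.println(fullyMatches("(dog){3}", "dogdogdog")); // true
        // ...and to "any one of these characters" for a character class.
        System.out.println(fullyMatches("[abc]{3}", "cab"));       // true
        System.out.println(fullyMatches("abc{3}", "abccc"));       // true
    }
}
```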
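Differences Between Greedy, Reluctant, and Possessive Quantifiers
Greedy quantifiers are called greedy because they try to match as much of the input as possible. If the rest of the pattern then fails to match, the matcher backtracks, giving up one character at a time, until the overall match either succeeds or turns out to be impossible.
Consider the following examples: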
Enter your regex: .*foo // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.
Enter your regex: .*?foo // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.
Enter your regex: .*+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.
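The first example uses a greedy quantifier. The .* part first consumes the entire input string "xfooxxxxxxfoo". The foo part of the pattern is then tried against what is left, the empty string "", and fails. The matcher backtracks: .* now matches "xfooxxxxxxfo", and foo is tried against the remaining "o", which fails as well, so the matcher backtracks again. This repeats until .* has given back the final "foo" and the whole pattern matches. Because this match runs to the end of the input, nothing is left to search and matching stops.
The second example uses a reluctant (non-greedy) quantifier, which works the opposite way. .*? starts by consuming nothing, that is, it matches the empty string "" at the start of the input. The foo part is then tried against the first three characters, "xfo", and fails. .*? then consumes one character, "x", and foo is tried against the "foo" that follows, which succeeds. That completes the first match, "xfoo". The search then continues on the remaining input, "xxxxxxfoo", by the same rules, and the second match is "xxxxxxfoo". Matching stops only when the whole input has been consumed.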
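The two search orders just described can be imitated by hand. The sketch below is not how java.util.regex is implemented; it is a simplified model, assuming a pattern of the form ".*" or ".*?" followed by a literal, an attempt that starts at index 0, and an input without line terminators. The class and method names are hypothetical.

```java
// Hand-written model of greedy vs. reluctant search order (not the real engine).
final class SearchOrderSketch {

    // Greedy: let ".*" take everything first, then give back one character at a time.
    static String greedyMatch(String input, String literal) {
        for (int taken = input.length(); taken >= 0; taken--) {
            if (input.startsWith(literal, taken)) {
                return input.substring(0, taken + literal.length());
            }
        }
        return null; // no way to place the literal
    }

    // Reluctant: let ".*?" take nothing first, then one more character each time.
    static String reluctantMatch(String input, String literal) {
        for (int taken = 0; taken <= input.length(); taken++) {
            if (input.startsWith(literal, taken)) {
                return input.substring(0, taken + literal.length());
            }
        }
        return null; // no way to place the literal
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(greedyMatch(input, "foo"));    // xfooxxxxxxfoo
        System.out.println(reluctantMatch(input, "foo")); // xfoo
    }
}
```

The only difference between the two methods is the direction of the loop, which is the essence of greedy versus reluctant matching for this kind of pattern.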
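The third example uses a possessive quantifier. A possessive quantifier consumes as much input as it can in a single attempt and never backtracks. Here .*+ consumes all of "xfooxxxxxxfoo", the foo part is tried against the remaining empty string "" and fails, and because nothing is given back the search ends without a match.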
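The three transcripts can also be reproduced without the interactive harness. The following is a minimal sketch; the class name QuantifierModesDemo and the method findAllTexts are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical side-by-side comparison of the three quantifier modes.
final class QuantifierModesDemo {

    // Returns the text of every match found in the input.
    static List<String> findAllTexts(String regex, String input) {
        List<String> texts = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            texts.add(matcher.group());
        }
        return texts;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(findAllTexts(".*foo", input));  // greedy: [xfooxxxxxxfoo]
        System.out.println(findAllTexts(".*?foo", input)); // reluctant: [xfoo, xxxxxxfoo]
        System.out.println(findAllTexts(".*+foo", input)); // possessive: []
    }
}
```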
Most of the content above is translated from the regular expressions lesson in The Java™ Tutorials.