Statistical Computing, 36-350
Wednesday September 14, 2016
To get the specific substring that matches a regex pattern (rather than a whole line of matched text), use regexpr() and regmatches()
regexpr() returns the location of the first match in the target string (plus some attributes); -1 means no match was found in the target string
regmatches() takes the output of regexpr(), and returns the matching substring
str.vec.1 = c("O my gosh!", "Oh wow!", "Ohhhhh no!")
grep("Oh+", str.vec.1, value=TRUE)
## [1] "Oh wow!" "Ohhhhh no!"
regexpr.1 = regexpr("Oh+", c("O my gosh!", "Oh wow!", "Ohhhhh no!"))
class(regexpr.1) # Integer vector
## [1] "integer"
regexpr.1 # Position of match
## [1] -1 1 1
## attr(,"match.length")
## [1] -1 2 6
## attr(,"useBytes")
## [1] TRUE
attributes(regexpr.1)$match.length # Length of match
## [1] -1 2 6
regmatches(str.vec.1, regexpr.1)
## [1] "Oh" "Ohhhhh"
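Note that regmatches() drops strings with no match, so its result can be shorter than the input. A minimal sketch of one way to keep one slot per input string is below; the names x, m and first.match are hypothetical, chosen only for illustration.

```r
# Hypothetical names: x, m, first.match
x = c("O my gosh!", "Oh wow!", "Ohhhhh no!")
m = regexpr("Oh+", x)

# regmatches() silently skips elements where regexpr() returned -1
length(regmatches(x, m))   # 2, although length(x) is 3

# Start from all-NA, then fill in only the positions that matched
first.match = rep(NA_character_, length(x))
first.match[m != -1] = regmatches(x, m)
first.match                # NA, "Oh", "Ohhhhh"
```

This keeps the result aligned with the input, which matters if the matches are later stored next to the original strings.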
str.vec.2 = c("10 dollars", "100 dollars", "1000 dollars")
grep("10{1,2}", str.vec.2, value=TRUE)
## [1] "10 dollars" "100 dollars" "1000 dollars"
grep("10{1,}", str.vec.2, value=TRUE)
## [1] "10 dollars" "100 dollars" "1000 dollars"
regmatches(str.vec.2, regexpr("10{1,2}", str.vec.2))
## [1] "10" "100" "100"
regmatches(str.vec.2, regexpr("10{1,}", str.vec.2))
## [1] "10" "100" "1000"
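To see what regmatches() is doing with the output of regexpr(), the same substrings can be pulled out by hand using the match position and the match.length attribute. This is only a sketch for intuition, not how regmatches() is necessarily implemented; the names x, m, len and by.hand are hypothetical.

```r
# Hypothetical names: x, m, len, by.hand
x = c("10 dollars", "100 dollars", "1000 dollars")
m = regexpr("10{1,2}", x)

start = as.integer(m)             # position of first match in each string
len = attr(m, "match.length")     # length of each match

# A match starting at `start` with length `len` ends at start + len - 1
by.hand = substring(x, start, start + len - 1)
by.hand                           # "10" "100" "100"

all(by.hand == regmatches(x, m))  # TRUE
```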
To replace matching substrings, use regmatches() on the left-hand side of an assignment (similar to usage of substring())
str.vec.1
## [1] "O my gosh!" "Oh wow!" "Ohhhhh no!"
regmatches(str.vec.1, regexpr("Oh*", str.vec.1))
## [1] "O" "Oh" "Ohhhhh"
regmatches(str.vec.1, regexpr("Oh*", str.vec.1)) = "Oh"
str.vec.1
## [1] "Oh my gosh!" "Oh wow!" "Oh no!"
str.vec.3 = c("I hate broccoli", "I HATE BROCCOLI",
"I hate hate HATE HATE broccoli, it disgusts me, I hate it")
pattern.3 = "(hate|HATE)( (hate|HATE))*"
regmatches(str.vec.3, regexpr(pattern.3, str.vec.3)) = "do not like"
str.vec.3
## [1] "I do not like broccoli"
## [2] "I do not like BROCCOLI"
## [3] "I do not like broccoli, it disgusts me, I hate it"
Note: in the 3rd string, the last “hate” was not replaced. Why?
To extract all occurrences of matching substrings (not just the first occurrence per line), use gregexpr() and then regmatches()
str.vec.3 = c("I hate broccoli", "I HATE BROCCOLI",
"I hate hate HATE HATE broccoli, it disgusts me, I hate it")
regexpr("hate|HATE", str.vec.3) # Integer vector
## [1] 3 3 3
## attr(,"match.length")
## [1] 4 4 4
## attr(,"useBytes")
## [1] TRUE
gregexpr("hate|HATE", str.vec.3) # List of integer vectors
## [[1]]
## [1] 3
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] 3
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
##
## [[3]]
## [1] 3 8 13 18 51
## attr(,"match.length")
## [1] 4 4 4 4 4
## attr(,"useBytes")
## [1] TRUE
regmatches(str.vec.3, regexpr("hate|HATE", str.vec.3))
## [1] "hate" "HATE" "hate"
regmatches(str.vec.3, gregexpr("hate|HATE", str.vec.3))
## [[1]]
## [1] "hate"
##
## [[2]]
## [1] "HATE"
##
## [[3]]
## [1] "hate" "hate" "HATE" "HATE" "hate"
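Because regmatches() with gregexpr() returns a list with one character vector per input string, the number of matches per string can be read off from the lengths of the list elements. A small sketch, with a fourth no-match string added purely as an illustration (x and all.matches are hypothetical names):

```r
# Hypothetical names: x, all.matches; the 4th string is added for illustration
x = c("I hate broccoli", "I HATE BROCCOLI",
      "I hate hate HATE HATE broccoli, it disgusts me, I hate it",
      "I like peas")
all.matches = regmatches(x, gregexpr("hate|HATE", x))

# One count per input string; a string with no match gives character(0), so 0
sapply(all.matches, length)   # 1 1 5 0
```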
str.vec.3
## [1] "I hate broccoli"
## [2] "I HATE BROCCOLI"
## [3] "I hate hate HATE HATE broccoli, it disgusts me, I hate it"
pattern.3
## [1] "(hate|HATE)( (hate|HATE))*"
regmatches(str.vec.3, gregexpr(pattern.3, str.vec.3)) = "do not like"
str.vec.3
## [1] "I do not like broccoli"
## [2] "I do not like BROCCOLI"
## [3] "I do not like broccoli, it disgusts me, I do not like it"
For an alternative (easier) way to replace matching substrings, use sub() or gsub()
sub() replaces the first occurrence of a matching substring
gsub() replaces all occurrences of matching substrings
str.vec.3 = c("I hate broccoli", "I HATE BROCCOLI",
"I hate hate HATE HATE broccoli, it disgusts me, I hate it")
pattern.3
## [1] "(hate|HATE)( (hate|HATE))*"
gsub(pattern.3, "do not like", str.vec.3)
## [1] "I do not like broccoli"
## [2] "I do not like BROCCOLI"
## [3] "I do not like broccoli, it disgusts me, I do not like it"
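For comparison, here is a short sketch of sub() and gsub() side by side on the third string, using the same pattern as above (x and p are hypothetical names):

```r
# Hypothetical names: x, p
x = "I hate hate HATE HATE broccoli, it disgusts me, I hate it"
p = "(hate|HATE)( (hate|HATE))*"

# sub() stops after the first match, so the final "hate" survives
first.only = sub(p, "do not like", x)
first.only
# "I do not like broccoli, it disgusts me, I hate it"

# gsub() keeps going and replaces every match
every.one = gsub(p, "do not like", x)
every.one
# "I do not like broccoli, it disgusts me, I do not like it"
```

In other words, sub() behaves like the regexpr() replacement shown earlier, and gsub() like the gregexpr() one.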