Extracting matching substrings

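To extract just the substring that matches a regex pattern (rather than the whole string containing the match, which is what grep() with value=TRUE returns), use regexpr() to locate the first match in each string, and then regmatches() to pull it out.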

Examples

str.vec.1 = c("O my gosh!", "Oh wow!", "Ohhhhh no!")
grep("Oh+", str.vec.1, value=TRUE)
## [1] "Oh wow!"    "Ohhhhh no!"
regexpr.1 = regexpr("Oh+", str.vec.1)
class(regexpr.1) # Integer vector
## [1] "integer"
regexpr.1 # Position of match
## [1] -1  1  1
## attr(,"match.length")
## [1] -1  2  6
## attr(,"useBytes")
## [1] TRUE
attr(regexpr.1, "match.length") # Length of match
## [1] -1  2  6
regmatches(str.vec.1, regexpr.1) # Strings with no match are dropped
## [1] "Oh"     "Ohhhhh"
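
Because regmatches() drops strings that have no match, its result can be shorter than the input vector. One possible way to keep the result aligned with the input is sketched below. The name first.match is only illustrative, and the sketch assumes that NA is an acceptable placeholder for "no match".

```r
str.vec.1 = c("O my gosh!", "Oh wow!", "Ohhhhh no!")
regexpr.1 = regexpr("Oh+", str.vec.1)
# Start with all NA, then fill in only the positions where a match was found
first.match = rep(NA_character_, length(str.vec.1))
first.match[regexpr.1 != -1] = regmatches(str.vec.1, regexpr.1)
first.match
## [1] NA       "Oh"     "Ohhhhh"
```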

More examples

str.vec.2 = c("10 dollars", "100 dollars", "1000 dollars")
grep("10{1,2}", str.vec.2, value=TRUE)
## [1] "10 dollars"   "100 dollars"  "1000 dollars"
grep("10{1,}", str.vec.2, value=TRUE)
## [1] "10 dollars"   "100 dollars"  "1000 dollars"
regmatches(str.vec.2, regexpr("10{1,2}", str.vec.2))
## [1] "10"  "100" "100"
regmatches(str.vec.2, regexpr("10{1,}", str.vec.2))
## [1] "10"   "100"  "1000"
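
In the examples above, "10{1,2}" still matches inside "1000 dollars", because a regex only needs to match part of a string. If the goal is to match whole numbers only, one option (a sketch, not the only approach) is to wrap the pattern in the word-boundary symbol \b, written "\\b" inside an R string. The names pattern.whole and whole.numbers are illustrative.

```r
str.vec.2 = c("10 dollars", "100 dollars", "1000 dollars")
# \\b matches the empty string at the edge of a word, so the match
# can no longer stop in the middle of "1000"
pattern.whole = "\\b10{1,2}\\b"
grep(pattern.whole, str.vec.2, value=TRUE)
## [1] "10 dollars"  "100 dollars"
whole.numbers = regmatches(str.vec.2, regexpr(pattern.whole, str.vec.2))
whole.numbers
## [1] "10"  "100"
```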

Replacing matching substrings

To replace matching substrings, use regmatches() on the left-hand side of an assignment (similar to how substring() can be assigned to).

str.vec.1
## [1] "O my gosh!" "Oh wow!"    "Ohhhhh no!"
regmatches(str.vec.1, regexpr("Oh*", str.vec.1))
## [1] "O"      "Oh"     "Ohhhhh"
regmatches(str.vec.1, regexpr("Oh*", str.vec.1)) = "Oh"
str.vec.1
## [1] "Oh my gosh!" "Oh wow!"     "Oh no!"
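
The replacement value does not have to be a single string: it can also be a vector with one entry per matched string. As a small sketch, the matches can be extracted, transformed, and assigned back. Here toupper() stands in for any transformation, and the name m is illustrative.

```r
str.vec.1 = c("O my gosh!", "Oh wow!", "Ohhhhh no!")
m = regexpr("Oh*", str.vec.1)
# Extract each first match, convert it to upper case, and put it back
regmatches(str.vec.1, m) = toupper(regmatches(str.vec.1, m))
str.vec.1
## [1] "O my gosh!" "OH wow!"    "OHHHHH no!"
```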

More examples

str.vec.3 = c("I hate broccoli", "I HATE BROCCOLI",
              "I hate hate HATE HATE broccoli, it disgusts me, I hate it")
pattern.3 = "(hate|HATE)( (hate|HATE))*"
regmatches(str.vec.3, regexpr(pattern.3, str.vec.3)) = "do not like"
str.vec.3
## [1] "I do not like broccoli"                           
## [2] "I do not like BROCCOLI"                           
## [3] "I do not like broccoli, it disgusts me, I hate it"

Note: in the 3rd string, the last “hate” was not replaced. Why?

Extracting all occurrences

To extract all occurrences of matching substrings (not just the first occurrence in each string), use gregexpr() and then regmatches(). The result is a list, with one character vector of matches per input string.

str.vec.3 = c("I hate broccoli", "I HATE BROCCOLI",
              "I hate hate HATE HATE broccoli, it disgusts me, I hate it")
regexpr("hate|HATE", str.vec.3) # Integer vector
## [1] 3 3 3
## attr(,"match.length")
## [1] 4 4 4
## attr(,"useBytes")
## [1] TRUE
gregexpr("hate|HATE", str.vec.3) # List of integer vectors
## [[1]]
## [1] 3
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] 3
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1]  3  8 13 18 51
## attr(,"match.length")
## [1] 4 4 4 4 4
## attr(,"useBytes")
## [1] TRUE
regmatches(str.vec.3, regexpr("hate|HATE", str.vec.3))
## [1] "hate" "HATE" "hate"
regmatches(str.vec.3, gregexpr("hate|HATE", str.vec.3))
## [[1]]
## [1] "hate"
## 
## [[2]]
## [1] "HATE"
## 
## [[3]]
## [1] "hate" "hate" "HATE" "HATE" "hate"
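
Since the result is a list, it can be summarized with the usual list tools. As a brief sketch, the following counts the matches in each string and pools all matches into a single vector. The name all.matches is illustrative.

```r
str.vec.3 = c("I hate broccoli", "I HATE BROCCOLI",
              "I hate hate HATE HATE broccoli, it disgusts me, I hate it")
all.matches = regmatches(str.vec.3, gregexpr("hate|HATE", str.vec.3))
lengths(all.matches) # Number of matches in each string
## [1] 1 1 5
unlist(all.matches) # All matches pooled into one character vector
## [1] "hate" "HATE" "hate" "hate" "HATE" "HATE" "hate"
```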

More examples

str.vec.3
## [1] "I hate broccoli"                                          
## [2] "I HATE BROCCOLI"                                          
## [3] "I hate hate HATE HATE broccoli, it disgusts me, I hate it"
pattern.3
## [1] "(hate|HATE)( (hate|HATE))*"
regmatches(str.vec.3, gregexpr(pattern.3, str.vec.3)) = "do not like"
str.vec.3
## [1] "I do not like broccoli"                                  
## [2] "I do not like BROCCOLI"                                  
## [3] "I do not like broccoli, it disgusts me, I do not like it"

Replacements, alternative way

For an alternative (easier) way to replace matching substrings, use sub(), which replaces only the first match in each string, or gsub(), which replaces all matches.

str.vec.3 = c("I hate broccoli", "I HATE BROCCOLI",
              "I hate hate HATE HATE broccoli, it disgusts me, I hate it")
pattern.3
## [1] "(hate|HATE)( (hate|HATE))*"
gsub(pattern.3, "do not like", str.vec.3)
## [1] "I do not like broccoli"                                  
## [2] "I do not like BROCCOLI"                                  
## [3] "I do not like broccoli, it disgusts me, I do not like it"
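
To make the difference between the two functions concrete, the sketch below applies both to the 3rd string. It also shows one possible simplification of the pattern using the ignore.case argument. Note that this is not strictly equivalent to pattern.3, since it would also match mixed-case spellings such as "Hate", which do not occur in this data.

```r
str.vec.3 = c("I hate broccoli", "I HATE BROCCOLI",
              "I hate hate HATE HATE broccoli, it disgusts me, I hate it")
pattern.3 = "(hate|HATE)( (hate|HATE))*"
sub(pattern.3, "do not like", str.vec.3)[3] # First match only
## [1] "I do not like broccoli, it disgusts me, I hate it"
gsub(pattern.3, "do not like", str.vec.3)[3] # All matches
## [1] "I do not like broccoli, it disgusts me, I do not like it"
# Same result on this data, with a shorter case-insensitive pattern
gsub("hate( hate)*", "do not like", str.vec.3, ignore.case=TRUE)[3]
## [1] "I do not like broccoli, it disgusts me, I do not like it"
```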