Extracting matching substrings

To get the specific substring that matches a regex pattern (rather than the whole string that contains the match), use regexpr() and regmatches()

regexpr() returns the location of the first match in each target string (plus some attributes, such as the match length)
A location of -1 means no match was found in that target string
regmatches() takes the target strings and the output of regexpr(), and returns the matching substrings

Examples

str.vec.1 = c("O my gosh!", "Oh wow!", "Ohhhhh no!")
grep("Oh+", str.vec.1, value=TRUE)

## [1] "Oh wow!"    "Ohhhhh no!"

regexpr.1 = regexpr("Oh+", c("O my gosh!", "Oh wow!", "Ohhhhh no!"))
class(regexpr.1) # Integer vector

## [1] "integer"

regexpr.1 # Position of match

## [1] -1  1  1
## attr(,"match.length")
## [1] -1  2  6
## attr(,"useBytes")
## [1] TRUE

attributes(regexpr.1)$match.length # Length of match

## [1] -1  2  6

regmatches(str.vec.1, regexpr.1)

## [1] "Oh"     "Ohhhhh"
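
Note that the result above has two elements for three input strings: regmatches() drops the strings with no match. The following is a minimal sketch, not part of the original examples, of one way to keep the result aligned with the input by filling in NA. It also redoes the extraction by hand with substring(), using the match positions and lengths. The names m, aligned, found, and manual are illustrative.

```r
str.vec.1 = c("O my gosh!", "Oh wow!", "Ohhhhh no!")
m = regexpr("Oh+", str.vec.1)

# Start with NA everywhere, then fill in only the strings that matched
aligned = rep(NA_character_, length(str.vec.1))
aligned[m != -1] = regmatches(str.vec.1, m)
aligned   # NA "Oh" "Ohhhhh"

# Same extraction by hand: start position + match length - 1 gives the end
found = which(m != -1)
starts = m[found]
ends = starts + attr(m, "match.length")[found] - 1
manual = substring(str.vec.1[found], starts, ends)
manual    # "Oh" "Ohhhhh"
```

The aligned version is handy when the matches need to go back into a data frame column next to the original strings.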

More examples

str.vec.2 = c("10 dollars", "100 dollars", "1000 dollars")
grep("10{1,2}", str.vec.2, value=TRUE)

## [1] "10 dollars"   "100 dollars"  "1000 dollars"

grep("10{1,}", str.vec.2, value=TRUE)

## [1] "10 dollars"   "100 dollars"  "1000 dollars"

regmatches(str.vec.2, regexpr("10{1,2}", str.vec.2))

## [1] "10"  "100" "100"

regmatches(str.vec.2, regexpr("10{1,}", str.vec.2))

## [1] "10"   "100"  "1000"
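
As a small aside to the examples above, the quantifier {1,} means "one or more", which is the same as +, and quantifiers are greedy: they take as many repetitions as the bounds allow. A short sketch to check both points (the variable names are illustrative):

```r
str.vec.2 = c("10 dollars", "100 dollars", "1000 dollars")

# "10+" and "10{1,}" both mean: a 1 followed by one or more 0s
plus.matches  = regmatches(str.vec.2, regexpr("10+", str.vec.2))
brace.matches = regmatches(str.vec.2, regexpr("10{1,}", str.vec.2))
identical(plus.matches, brace.matches)   # TRUE

# Greedy but bounded: "10{1,2}" takes two 0s when available, never three
capped = regmatches("1000 dollars", regexpr("10{1,2}", "1000 dollars"))
capped   # "100"
```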

Replacing matching substrings

To replace matching substrings, use regmatches() on the left-hand side of an assignment (similar to the usage of substring())

str.vec.1

## [1] "O my gosh!" "Oh wow!"    "Ohhhhh no!"

regmatches(str.vec.1, regexpr("Oh*", str.vec.1))

## [1] "O"      "Oh"     "Ohhhhh"

regmatches(str.vec.1, regexpr("Oh*", str.vec.1)) = "Oh"
str.vec.1

## [1] "Oh my gosh!" "Oh wow!"     "Oh no!"
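
To make the comparison with substring() concrete, here is a minimal sketch (the variables x, y, and z are illustrative). Both functions can appear on the left-hand side of an assignment, but substring() overwrites characters in place and cannot change the length of the string, whereas regmatches() swaps out the whole match.

```r
x = "Oh wow!"
substring(x, 1, 2) = "Ah"   # overwrite characters 1 to 2
x                           # "Ah wow!"

# substring() replacement keeps the string length fixed:
# only as many characters as the replacement has are overwritten
y = "Ohhhhh no!"
substring(y, 1, 6) = "Ah"
y                           # "Ahhhhh no!"

# regmatches() replacement swaps the entire match, so the string can shrink
z = "Ohhhhh no!"
regmatches(z, regexpr("Oh+", z)) = "Ah"
z                           # "Ah no!"
```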

More examples

str.vec.3 = c("I hate broccoli", "I HATE BROCCOLI",
              "I hate hate HATE HATE broccoli, it disgusts me, I hate it")
pattern.3 = "(hate|HATE)( (hate|HATE))*"
regmatches(str.vec.3, regexpr(pattern.3, str.vec.3)) = "do not like"
str.vec.3

## [1] "I do not like broccoli"                           
## [2] "I do not like BROCCOLI"                           
## [3] "I do not like broccoli, it disgusts me, I hate it"

Note: in the 3rd string, we didn’t replace the last “hate”. Why?

Extracting all occurrences

To extract all occurrences of matching substrings (not just the first occurrence per line), use gregexpr() and then regmatches()

str.vec.3 = c("I hate broccoli", "I HATE BROCCOLI",
              "I hate hate HATE HATE broccoli, it disgusts me, I hate it")
regexpr("hate|HATE", str.vec.3) # Integer vector

## [1] 3 3 3
## attr(,"match.length")
## [1] 4 4 4
## attr(,"useBytes")
## [1] TRUE

gregexpr("hate|HATE", str.vec.3) # List of integer vectors

## [[1]]
## [1] 3
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## 
## [[2]]
## [1] 3
## attr(,"match.length")
## [1] 4
## attr(,"useBytes")
## [1] TRUE
## 
## [[3]]
## [1]  3  8 13 18 51
## attr(,"match.length")
## [1] 4 4 4 4 4
## attr(,"useBytes")
## [1] TRUE

regmatches(str.vec.3, regexpr("hate|HATE", str.vec.3))

## [1] "hate" "HATE" "hate"

regmatches(str.vec.3, gregexpr("hate|HATE", str.vec.3))

## [[1]]
## [1] "hate"
## 
## [[2]]
## [1] "HATE"
## 
## [[3]]
## [1] "hate" "hate" "HATE" "HATE" "hate"
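
Because gregexpr() followed by regmatches() returns a list with one character vector per input string, the usual list tools apply. The sketch below, which is an illustration rather than part of the original examples, counts the matches per string and flattens them into a single vector. The names all.matches, counts, and none are illustrative.

```r
str.vec.3 = c("I hate broccoli", "I HATE BROCCOLI",
              "I hate hate HATE HATE broccoli, it disgusts me, I hate it")
all.matches = regmatches(str.vec.3, gregexpr("hate|HATE", str.vec.3))

# Number of matches in each string
counts = sapply(all.matches, length)
counts                # 1 1 5

# All matches pooled into one character vector
unlist(all.matches)

# A string with no match gives an empty character vector, so its count is 0
none = regmatches("I like kale", gregexpr("hate|HATE", "I like kale"))
length(none[[1]])     # 0
```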

More examples

str.vec.3

## [1] "I hate broccoli"                                          
## [2] "I HATE BROCCOLI"                                          
## [3] "I hate hate HATE HATE broccoli, it disgusts me, I hate it"

pattern.3

## [1] "(hate|HATE)( (hate|HATE))*"

regmatches(str.vec.3, gregexpr(pattern.3, str.vec.3)) = "do not like"
str.vec.3

## [1] "I do not like broccoli"                                  
## [2] "I do not like BROCCOLI"                                  
## [3] "I do not like broccoli, it disgusts me, I do not like it"

Replacements, alternative way

For an alternative (easier) way to replace matching substrings, use sub() or gsub()

sub() replaces the first occurrence of a matching substring
gsub() replaces all occurrences of matching substrings

str.vec.3 = c("I hate broccoli", "I HATE BROCCOLI",
              "I hate hate HATE HATE broccoli, it disgusts me, I hate it")
pattern.3

## [1] "(hate|HATE)( (hate|HATE))*"

gsub(pattern.3, "do not like", str.vec.3)

## [1] "I do not like broccoli"                                  
## [2] "I do not like BROCCOLI"                                  
## [3] "I do not like broccoli, it disgusts me, I do not like it"
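
For contrast with the gsub() call above, here is a short sketch of sub() on the same strings: it stops after the first match, so the final "hate" in the 3rd string survives. The second half uses the ignore.case argument as an alternative to spelling out both cases in the pattern. Note that this is slightly more permissive than pattern.3, since it would also match mixed-case spellings such as "Hate". The names sub.result and gsub.ic are illustrative.

```r
str.vec.3 = c("I hate broccoli", "I HATE BROCCOLI",
              "I hate hate HATE HATE broccoli, it disgusts me, I hate it")
pattern.3 = "(hate|HATE)( (hate|HATE))*"

# sub() replaces only the first match in each string
sub.result = sub(pattern.3, "do not like", str.vec.3)
sub.result[3]   # "I do not like broccoli, it disgusts me, I hate it"

# Shorter pattern, case handled by ignore.case instead of alternation
gsub.ic = gsub("hate( hate)*", "do not like", str.vec.3, ignore.case = TRUE)
identical(gsub.ic, gsub(pattern.3, "do not like", str.vec.3))   # TRUE
```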

Extractions and Replacements

Statistical Computing, 36-350

Wednesday September 14, 2016
