Last week: dplyr, pipes, and more

The tidyverse is a collection of packages for common data science tasks
Tidyverse functionality is greatly enhanced using pipes (%>% operator)
Pipes allow you to string together commands to get a flow of results
dplyr is a package for data wrangling, with several key verbs (functions)
filter(): subset rows based on a condition
group_by(): define groups of rows according to the values of one or more columns
summarize(): apply computations across groups of rows
arrange(): order rows by value of a column
select(): pick out given columns
mutate(): create new columns
mutate_at(): apply a function to given columns
tidyr is a package for manipulating the structure of data frames
pivot_longer(): make “wide” data longer
pivot_wider(): make “long” data wider

Part I

String basics

What are strings?

The simplest distinction:

Character: a symbol in a written language, like letters, numerals, punctuation, space, etc.
String: a sequence of characters bound together

class("r")

## [1] "character"

class("Ryan")

## [1] "character"

Why do we care about strings?

A lot of interesting data out there is in text format!
Webpages, emails, surveys, logs, search queries, etc.
Even if you just care about numbers eventually, you’ll need to understand how to get numbers from text

Whitespaces

Whitespaces count as characters and can be included in strings:

" " for space
"\n" for newline
"\t" for tab

str = "Dear Mr. Carnegie,\n\nThanks for the great school!\n\nSincerely, Ryan"
str

## [1] "Dear Mr. Carnegie,\n\nThanks for the great school!\n\nSincerely, Ryan"

Use cat() to print strings to the console, displaying whitespaces properly

cat(str)

## Dear Mr. Carnegie,
## 
## Thanks for the great school!
## 
## Sincerely, Ryan

Vectors/matrices of strings

The character is a basic data type in R (like numeric, or logical), so we can make vectors or matrices out of them, just like we would with numbers

str.vec = c("Statistical", "Computing", "isn't that bad") # Collect 3 strings
str.vec # All elements of the vector

## [1] "Statistical"    "Computing"      "isn't that bad"

str.vec[3] # The 3rd element

## [1] "isn't that bad"

str.vec[-(1:2)] # All but the 1st and 2nd

## [1] "isn't that bad"

str.mat = matrix("", 2, 3) # Build an empty 2 x 3 matrix
str.mat[1,] = str.vec # Fill the 1st row with str.vec
str.mat[2,1:2] = str.vec[1:2] # Fill the 2nd row, only entries 1 and 2, with
                              # those of str.vec
str.mat[2,3] = "isn't a fad" # Fill the 2nd row, 3rd entry, with a new string
str.mat # All elements of the matrix

##      [,1]          [,2]        [,3]            
## [1,] "Statistical" "Computing" "isn't that bad"
## [2,] "Statistical" "Computing" "isn't a fad"

t(str.mat) # Transpose of the matrix

##      [,1]             [,2]         
## [1,] "Statistical"    "Statistical"
## [2,] "Computing"      "Computing"  
## [3,] "isn't that bad" "isn't a fad"

Converting other data types to strings

Easy! Make things into strings with as.character()

as.character(0.8)

## [1] "0.8"

as.character(0.8e+10)

## [1] "8e+09"

as.character(1:5)

## [1] "1" "2" "3" "4" "5"

as.character(TRUE)

## [1] "TRUE"

Converting strings to other data types

Not as easy! Depends on the given string, of course

as.numeric("0.5")

## [1] 0.5

as.numeric("0.5 ")

## [1] 0.5

as.numeric("0.5e-10")

## [1] 5e-11

as.numeric("Hi!")

## Warning: NAs introduced by coercion

## [1] NA

as.logical("True")

## [1] TRUE

as.logical("TRU")

## [1] NA
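
Since a failed conversion gives NA rather than an error, one way to handle messy input is to convert first and then check for NA. A minimal sketch; the vector raw and the names parsed and bad are made up for illustration:

```r
# Hypothetical input: a mix of parseable and unparseable strings
raw = c("0.5", " 12 ", "1e3", "Hi!", "")
# suppressWarnings() hides the "NAs introduced by coercion" warning
parsed = suppressWarnings(as.numeric(raw))
bad = is.na(parsed) # TRUE wherever the conversion failed
raw[bad]            # The strings that could not be converted
```

Checking is.na() after the fact keeps the good values and tells you exactly which strings need a closer look.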

Number of characters

Use nchar() to count the number of characters in a string

nchar("coffee")

## [1] 6

nchar("code monkey")

## [1] 11

length("code monkey")

## [1] 1

length(c("coffee", "code monkey"))

## [1] 2

nchar(c("coffee", "code monkey")) # Vectorization!

## [1]  6 11
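
Whitespace counts toward nchar() too, and an escape sequence like "\n" is a single character even though we type two symbols. A small illustration:

```r
nchar(" ")    # A space is one character
nchar("\n")   # A newline is one character, despite being typed as \ and n
nchar("a\tb") # Letter, tab, letter: 3 characters
nchar("")     # The empty string has 0 characters
```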

Part II

Substrings, splitting and combining strings

Getting a substring

Use substr() to grab a subsequence of characters from a string, called a substring

phrase = "Give me a break"
substr(phrase, 1, 4)

## [1] "Give"

substr(phrase, nchar(phrase)-4, nchar(phrase))

## [1] "break"

substr(phrase, nchar(phrase)+1, nchar(phrase)+10)

## [1] ""

`substr()` vectorizes

Just like nchar(), and many other string functions

presidents = c("Clinton", "Bush", "Reagan", "Carter", "Ford")
substr(presidents, 1, 2) # Grab the first 2 letters from each

## [1] "Cl" "Bu" "Re" "Ca" "Fo"

substr(presidents, 1:5, 1:5) # Grab the first, 2nd, 3rd, etc.

## [1] "C" "u" "a" "t" ""

substr(presidents, 1, 1:5) # Grab the first, first 2, first 3, etc.

## [1] "C"    "Bu"   "Rea"  "Cart" "Ford"

substr(presidents, nchar(presidents)-1, nchar(presidents)) # Grab the last 2 letters from each

## [1] "on" "sh" "an" "er" "rd"
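
The "last few letters" pattern comes up often enough that it can be wrapped in a small function. A sketch; last.n is a hypothetical helper name, not a built-in:

```r
# Hypothetical helper: the last n characters of each string in x
last.n = function(x, n) {
  substr(x, nchar(x)-n+1, nchar(x))
}
last.n(c("Clinton", "Bush", "Ford"), 3) # Vectorizes, because substr() does
```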

Replace a substring

Can also use substr() to replace a character, or a substring

phrase

## [1] "Give me a break"

substr(phrase, 1, 1) = "L"
phrase # "G" changed to "L"

## [1] "Live me a break"

substr(phrase, 1000, 1001) = "R"
phrase # Nothing happened

## [1] "Live me a break"

substr(phrase, 1, 4) = "Show"
phrase # "Live" changed to "Show"

## [1] "Show me a break"
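
Replacement with substr() never changes the length of the string: only as many characters are overwritten as both the range and the replacement allow. A short sketch of the two mismatched cases (x and y are throwaway names):

```r
x = "Give me a break"
substr(x, 1, 4) = "Showing" # Replacement longer than the range: extra letters are dropped
x

y = "Give me a break"
substr(y, 1, 4) = "Go"      # Replacement shorter than the range: only 2 characters change
y
```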

Splitting a string

Use the strsplit() function to split a string at every occurrence of a separator (the split argument)

ingredients = "chickpeas, tahini, olive oil, garlic, salt"
split.obj = strsplit(ingredients, split=",")
split.obj

## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"

class(split.obj)

## [1] "list"

length(split.obj)

## [1] 1

Note that the output is actually a list! (With just one element, which is a vector of strings)
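
Notice the leading spaces on " tahini", " olive oil", etc. Two possible ways to avoid them, sketched here (opt1 and opt2 are illustrative names):

```r
ingredients = "chickpeas, tahini, olive oil, garlic, salt"
# Option 1: split on the comma together with the space that follows it
opt1 = strsplit(ingredients, split=", ")[[1]]
# Option 2: split on the comma alone, then trim whitespace from each piece
opt2 = trimws(strsplit(ingredients, split=",")[[1]])
opt1
```

Option 2 is a bit more forgiving if the spacing after the commas is inconsistent.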

`strsplit()` vectorizes

Just like nchar(), substr(), and the many others

great.profs = "Nugent, Genovese, Greenhouse, Seltman, Shalizi, Ventura"
favorite.cats = "tiger, leopard, jaguar, lion"
split.list = strsplit(c(ingredients, great.profs, favorite.cats), split=",")
split.list

## [[1]]
## [1] "chickpeas"  " tahini"    " olive oil" " garlic"    " salt"     
## 
## [[2]]
## [1] "Nugent"      " Genovese"   " Greenhouse" " Seltman"    " Shalizi"    " Ventura"   
## 
## [[3]]
## [1] "tiger"    " leopard" " jaguar"  " lion"

Returned object is a list with 3 elements
Each one a vector of strings, having lengths 5, 6, and 4
Do you see why strsplit() needs to return a list now?
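
The pieces have different lengths, so they cannot sit in a matrix; a list can hold them. A small sketch of working with such a list, using made-up strings:

```r
# Three made-up strings with 3, 2, and 1 comma-separated pieces
demo.list = strsplit(c("a,b,c", "d,e", "f"), split=",")
sapply(demo.list, length) # Number of pieces in each element
demo.list[[2]]            # Double brackets pull out one character vector
unlist(demo.list)         # Flatten everything into a single character vector
```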

Splitting character-by-character

Finest splitting you can do is character-by-character: use strsplit() with split=""

split.chars = strsplit(ingredients, split="")[[1]]
split.chars

##  [1] "c" "h" "i" "c" "k" "p" "e" "a" "s" "," " " "t" "a" "h" "i" "n" "i" "," " " "o" "l" "i"
## [23] "v" "e" " " "o" "i" "l" "," " " "g" "a" "r" "l" "i" "c" "," " " "s" "a" "l" "t"

length(split.chars)

## [1] 42

nchar(ingredients) # Matches the previous count

## [1] 42

Combining strings

Use the paste() function to join two (or more) strings into one, separated by a separator string (the sep argument)

paste("Spider", "Man") # Default is to separate by " "

## [1] "Spider Man"

paste("Spider", "Man", sep="-")

## [1] "Spider-Man"

paste("Spider", "Man", "does whatever", sep=", ")

## [1] "Spider, Man, does whatever"

`paste()` vectorizes

Just like nchar(), substr(), strsplit(), etc. Seeing a theme yet?

presidents

## [1] "Clinton" "Bush"    "Reagan"  "Carter"  "Ford"

paste(presidents, c("D", "R", "R", "D", "R"))

## [1] "Clinton D" "Bush R"    "Reagan R"  "Carter D"  "Ford R"

paste(presidents, c("D", "R")) # Notice the recycling (not historically accurate!)

## [1] "Clinton D" "Bush R"    "Reagan D"  "Carter R"  "Ford D"

paste(presidents, " (", 42:38, ")", sep="")

## [1] "Clinton (42)" "Bush (41)"    "Reagan (40)"  "Carter (39)"  "Ford (38)"

Condensing a vector of strings

Can condense a vector of strings into one big string by using paste() with the collapse argument

presidents

## [1] "Clinton" "Bush"    "Reagan"  "Carter"  "Ford"

paste(presidents, collapse="; ")

## [1] "Clinton; Bush; Reagan; Carter; Ford"

paste(presidents, " (", 42:38, ")", sep="", collapse="; ")

## [1] "Clinton (42); Bush (41); Reagan (40); Carter (39); Ford (38)"

paste(presidents, " (", c("D", "R", "R", "D", "R"), 42:38, ")", sep="", collapse="; ")

## [1] "Clinton (D42); Bush (R41); Reagan (R40); Carter (D39); Ford (R38)"

paste(presidents, collapse=NULL) # No condensing, the default

## [1] "Clinton" "Bush"    "Reagan"  "Carter"  "Ford"
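
Splitting and collapsing are inverse operations, which makes them handy together. As one illustration, here is a sketch that reverses a string by splitting it into characters, reversing the vector, and collapsing with no separator:

```r
phrase = "Give me a break"
chars = strsplit(phrase, split="")[[1]]  # One element per character
reversed = paste(rev(chars), collapse="") # Glue back together, back to front
reversed
```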

Part III

Reading in text, summarizing text

Text from the outside

How to get text, from an external source, into R? Use the readLines() function

king.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp/data/king.txt")
class(king.lines) # We have a character vector

## [1] "character"

length(king.lines) # Many lines (elements)!

## [1] 59

king.lines[1:3] # First 3 lines

## [1] "Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation. This momentous decree came as a great beacon light of hope to millions of Negro slaves who had been seared in the flames of withering injustice. It came as a joyous daybreak to end the long night of their captivity."  
## [2] ""                                                                                                                                                                                                                                                                                                                                                
## [3] "But 100 years later, the Negro still is not free. One hundred years later, the life of the Negro is still sadly crippled by the manacles of segregation and the chains of discrimination. One hundred years later, the Negro lives on a lonely island of poverty in the midst of a vast ocean of material prosperity. One hundred years later..."

(This was Martin Luther King Jr.’s famous “I Have a Dream” speech at the March on Washington for Jobs and Freedom on August 28, 1963)

Reading from a local file

We don’t need to use the web; readLines() can be used on a local file. The following code would read in a text file from Professor Tibs’ computer:

king.lines.2 = readLines("~/Dropbox/teaching/f21-350/lectures/text/king.txt")

This will cause an error for you, unless your folder is set up exactly like Professor Tibs’ laptop! So using web links is more robust
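
To try local reading without depending on anyone's folder layout, one option is to write a small file to a temporary location first. A sketch; the file contents and the names tmp and demo.lines are made up:

```r
# Write a small text file to a temporary path chosen by R
tmp = tempfile(fileext=".txt")
writeLines(c("First line", "", "Third line"), tmp)

# file.exists() lets us check a path before trying to read it
if (file.exists(tmp)) {
  demo.lines = readLines(tmp)
}
length(demo.lines) # One element per line, including the blank one
```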

Reconstitution

Fancy word, but all it means: make one long string, then split it into words

king.text = paste(king.lines, collapse=" ")
king.words = strsplit(king.text, split=" ")[[1]]

# Sanity check
substr(king.text, 1, 150)

## [1] "Five score years ago, a great American, in whose symbolic shadow we stand today, signed the Emancipation Proclamation. This momentous decree came as a"

king.words[1:20]

##  [1] "Five"          "score"         "years"         "ago,"          "a"            
##  [6] "great"         "American,"     "in"            "whose"         "symbolic"     
## [11] "shadow"        "we"            "stand"         "today,"        "signed"       
## [16] "the"           "Emancipation"  "Proclamation." "This"          "momentous"
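
One thing to watch for: blank lines in the source become empty strings "" after this procedure, because collapsing leaves two spaces in a row. A small sketch with made-up lines, showing one way to drop them:

```r
demo.lines = c("I have a dream", "", "Free at last") # Note the blank line
demo.text = paste(demo.lines, collapse=" ")
demo.words = strsplit(demo.text, split=" ")[[1]]
demo.words # Contains "" where the blank line was

demo.words.clean = demo.words[demo.words != ""] # Keep only the nonempty words
demo.words.clean
```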

Counting words

Our most basic tool for summarizing text: word counts, retrieved using table()

king.wordtab = table(king.words)
class(king.wordtab)

## [1] "table"

length(king.wordtab)

## [1] 622

king.wordtab[1:10]

## king.words
##             - ...the  ...to   'tis    100   1963      a   able  Again 
##     29      2      1      1      1      1      1     37      8      1

What did we get? Alphabetically sorted unique words, and their counts (number of appearances)

The names are words, the entries are counts

Note: this is actually a vector of numbers, and the words are the names of the vector

king.wordtab[1:5]

## king.words
##             - ...the  ...to   'tis 
##     29      2      1      1      1

king.wordtab[2] == 2

##    - 
## TRUE

names(king.wordtab)[2] == "-"

## [1] TRUE

So with named indexing, we can now use this to look up whatever words we want

king.wordtab["dream"]

## dream 
##     9

king.wordtab["Negro"]

## Negro 
##    13

king.wordtab["freedom"]

## freedom 
##      18

king.wordtab["equality"] # NA means no word is exactly "equality" (e.g., "equality." would count separately)

## <NA> 
##   NA
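
When looking up several words at once, it is often more convenient to report a count of 0 than NA for words that are absent. A sketch on made-up data; demo.words, lookup, and counts are illustrative names:

```r
demo.words = c("dream", "free", "dream", "at", "last", "free", "dream")
demo.tab = table(demo.words)

lookup = c("dream", "free", "equality")
idx = match(lookup, names(demo.tab)) # Position of each word, NA if absent
counts = ifelse(is.na(idx), 0, as.numeric(demo.tab)[idx])
names(counts) = lookup
counts
```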

Most frequent words

Let’s sort in decreasing order, to get the most frequent words

king.wordtab.sorted = sort(king.wordtab, decreasing=TRUE)
length(king.wordtab.sorted)

## [1] 622

head(king.wordtab.sorted, 20) # First 20

## king.words
##      of     the      to     and       a      be            will      is    that      as 
##      98      97      57      40      37      32      29      25      23      23      19 
## freedom      in      we    from    have     our       I   Negro     not 
##      18      18      18      17      17      16      14      13      13

tail(king.wordtab.sorted, 20) # Last 20

## king.words
##      walk,     wallow       warm    waters,       well       were       When whirlwinds 
##          1          1          1          1          1          1          1          1 
##     whites      whose      winds      with.  withering   wrongful      wrote       yes, 
##          1          1          1          1          1          1          1          1 
##       York      York.        You       your 
##          1          1          1          1

Notice that punctuation matters, e.g., “York” and “York.” are treated as separate words. This is not ideal; we’ll learn a little bit about how to fix it in lab, using regular expressions
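
As a preview, here is one possible fix, sketched on a few made-up words: strip punctuation with gsub() and a regular expression, then lowercase everything. This is just one approach, and it is blunt (it would also turn "isn't" into "isnt"):

```r
demo.words = c("York", "York.", "walk,", "When", "when")
# "[[:punct:]]" matches any single punctuation character; replace it with nothing
clean.words = tolower(gsub("[[:punct:]]", "", demo.words))
clean.words
table(clean.words) # "york" and "when" are now each counted twice
```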

Visualizing frequencies

Let’s use a plot to visualize frequencies

nw = length(king.wordtab.sorted)
plot(1:nw, as.numeric(king.wordtab.sorted), type="l",
     xlab="Rank", ylab="Frequency")

A pretty drastic looking trend! It looks as if \(\mathrm{Frequency} \propto (1/\mathrm{Rank})^a\) for some \(a>0\)

Zipf’s law

This phenomenon, that frequency tends to be inversely proportional to a power of rank, is called Zipf’s law

For our data, Zipf’s law approximately holds, with \(\mathrm{Frequency} \approx C(1/\mathrm{Rank})^a\) for \(C=100\) and \(a=0.65\)

C = 100; a = 0.65
king.wordtab.zipf = C*(1/(1:nw))^a
cbind(king.wordtab.sorted[1:8], king.wordtab.zipf[1:8])

##      [,1]      [,2]
## of     98 100.00000
## the    97  63.72803
## to     57  48.96336
## and    40  40.61262
## a      37  35.12930
## be     32  31.20338
##        29  28.22840
## will   25  25.88162

Not perfect, but not bad. We can also plot the original sorted word counts, and overlay those estimated by our formula

plot(1:nw, as.numeric(king.wordtab.sorted), type="l",
     xlab="Rank", ylab="Frequency")
curve(C*(1/x)^a, from=1, to=nw, col="red", add=TRUE)

We’ll learn about plotting tools in detail a bit later
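
Where might values like C and a come from? One common approach (not necessarily how the values above were chosen) is to take logs: if Frequency = C(1/Rank)^a, then log(Frequency) = log(C) - a log(Rank), which is a straight line. A sketch on synthetic counts that follow the power law exactly; rk, freq, a.hat, and C.hat are illustrative names:

```r
rk = 1:50
freq = 100*(1/rk)^0.65 # Synthetic frequencies, exactly following the power law

fit = lm(log(freq) ~ log(rk))           # Straight-line fit on the log-log scale
a.hat = -coef(fit)[["log(rk)"]]         # Slope is -a
C.hat = exp(coef(fit)[["(Intercept)"]]) # Intercept is log(C)
c(C.hat, a.hat)
```

On real word counts the fit would only be approximate, and the long tail of words appearing once can pull the estimates around.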

Summary

Strings are, simply put, sequences of characters bound together
Text data occurs frequently “in the wild”, so you should learn how to deal with it!
nchar(), substr(): functions for counting characters, and for extracting and replacing substrings
strsplit(), paste(): functions for splitting and combining strings
Reconstitution: take lines of text, combine into one long string, then split to get the words
table(): function to get word counts, useful way of summarizing text data
Zipf’s law: word frequency tends to be inversely proportional to (a power of) rank

Text Manipulation

Statistical Computing, 36-350

Tuesday October 5, 2021
