Text from the outside

How do we get text from an external source into R? Use the readLines() function

trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
class(trump.lines) # We have a character vector

## [1] "character"

length(trump.lines) # Many lines (elements)!

## [1] 113

trump.lines[1:3] # First 3 lines

## [1] "Friends, delegates and fellow Americans: I humbly and gratefully accept your nomination for the presidency of the United States."
## [2] "Story Continued Below"                                                                                                           
## [3] ""
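
As a minimal sketch of what readLines() returns, the example below reads from a text connection instead of a URL, so it runs offline. The name demo.lines and the three made-up lines are assumptions for illustration, not part of the course data.

```r
# Hypothetical stand-in for a remote text file: one element per line
demo.con = textConnection(c("Friends, delegates and fellow Americans.",
                            "",
                            "Together, we will lead our party."))
demo.lines = readLines(demo.con) # Same call as with a URL or a file path
close(demo.con)

class(demo.lines)       # A character vector
length(demo.lines)      # One element per line, blank lines included
which(demo.lines == "") # Positions of the blank lines
```

Blank lines in the source come back as empty strings, just like element 3 of trump.lines above.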

Another example

That seemed to work really well! Let’s try again

wiki.lines = readLines("https://en.wikipedia.org/wiki/Donald_Trump")
length(wiki.lines) # Many, many lines

## [1] 4198

wiki.lines[1:6] # First 6 lines

## [1] "<!DOCTYPE html>"                                                                                                                                    
## [2] "<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">"                                                                                               
## [3] "<head>"                                                                                                                                             
## [4] "<meta charset=\"UTF-8\"/>"                                                                                                                          
## [5] "<title>Donald Trump - Wikipedia, the free encyclopedia</title>"                                                                                     
## [6] "<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, \"$1client-js$2\" );</script>"

Scraping web text is not easy!

The difference between these two examples: the first was a text file put there by your professor, the second was a bona fide website (in HTML)

To get “usual” text from the website, we have to go to line 216, and even then it’s not that readable

wiki.lines[216]

## [1] "</div>"
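
To get a feel for why raw HTML is hard to read, here is a rough sketch that strips tags from a few made-up HTML lines with gsub(). The vector html.demo and the simple pattern are assumptions for illustration. Real pages generally need a proper HTML parser, and this is not necessarily the approach taken later in the course.

```r
# Hypothetical HTML lines, loosely modeled on the output above
html.demo = c("<title>Donald Trump - Wikipedia</title>",
              "<p>Some <b>usual</b> text</p>",
              "</div>")

# Delete anything that looks like a tag: "<", then non-">" characters, then ">"
text.demo = gsub("<[^>]+>", "", html.demo)

# Lines that were only markup are now empty, so drop them
text.demo = text.demo[text.demo != ""]
text.demo
```

Even on this toy input, one of the three lines was pure markup.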

We’ll learn more next week; for now, we’ll just look at nice text files

Reading text from a local file

We don’t need to use the web; readLines() can be used on a local file. The following code would read in a text file from your professor’s computer:

trump.lines.2 = readLines("~/Dropbox/teaching/f16-350/lectures/text/trump.txt")

This will cause an error for you, unless your folders are set up exactly as they are on the professor’s laptop! So using web links is more robust
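
A minimal sketch of a more defensive local read is below. It writes a small file into R's temporary folder so that the path exists on any machine, then checks file.exists() before calling readLines(). The file name demo-speech.txt and its contents are made up for illustration.

```r
# Hypothetical local file, created in a temporary folder
demo.path = file.path(tempdir(), "demo-speech.txt")
writeLines(c("Line one of the file", "Line two of the file"), demo.path)

# Only read the file if the path exists on this machine
if (file.exists(demo.path)) {
  local.lines = readLines(demo.path)
} else {
  local.lines = character(0) # Nothing to read
}
length(local.lines)
```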

Reconstitution

Fancy word, but all it means: make one long string, then split it into words

trump.text = paste(trump.lines, collapse=" ")
trump.list = strsplit(trump.text, split=" ")
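
The same two steps on a tiny made-up input can be sketched as follows (toy.lines and its contents are assumptions for illustration). Note what the blank line does: pasting leaves two spaces in a row, and splitting on a single space then produces an empty-string "word".

```r
# Hypothetical three-line input with a blank line in the middle
toy.lines = c("we will lead", "", "our party back")

toy.text = paste(toy.lines, collapse=" ") # One long string
toy.list = strsplit(toy.text, split=" ")  # A list holding one vector of words

toy.text      # Two spaces between "lead" and "our"
toy.list[[1]] # Seven elements, one of them ""
```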

The big long string

Let’s investigate, starting with the big long string

class(trump.text)

## [1] "character"

length(trump.text)

## [1] 1

substr(trump.text, 1, 300)

## [1] "Friends, delegates and fellow Americans: I humbly and gratefully accept your nomination for the presidency of the United States. Story Continued Below  Together, we will lead our party back to the White House, and we will lead our country back to safety, prosperity, and peace. We will be a country o"

nchar(trump.text)

## [1] 25481
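
The difference between length() and nchar() is easy to mix up, so here is a small sketch on a made-up string (one.string is an assumption for illustration).

```r
one.string = "lead our party"

length(one.string)       # 1: a vector with a single element
nchar(one.string)        # 14: characters in that element, spaces included
substr(one.string, 1, 4) # "lead": characters 1 through 4
```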

The split up words

Let’s investigate the result of splitting up the long string

class(trump.list) # Remember why it's a list?

## [1] "list"

length(trump.list)

## [1] 1

trump.words = trump.list[[1]]
length(trump.words)

## [1] 4437

trump.words[1:10]

##  [1] "Friends,"   "delegates"  "and"        "fellow"     "Americans:"
##  [6] "I"          "humbly"     "and"        "gratefully" "accept"
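
As a reminder of why strsplit() returns a list: it splits every element of its input, and each element can yield a different number of pieces. A minimal sketch, with two.strings made up for illustration:

```r
# Hypothetical input with two strings instead of one
two.strings = c("Friends, delegates", "and fellow Americans:")
two.list = strsplit(two.strings, split=" ")

length(two.list) # 2: one list element per input string
two.list[[2]]    # The words from the second string
unlist(two.list) # Flatten everything into one character vector
```

With a single input string, as in trump.text, the list has length 1, which is why we extract trump.list[[1]].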

Sorting the words

We can sort strings (just like numbers), with sort()

trump.words.sorted = sort(trump.words) # Default is increasing order
length(trump.words.sorted)

## [1] 4437

head(trump.words.sorted) # Peek at the start

## [1] ""  "–" "–" "–" "–" "–"

tail(trump.words.sorted) # Peek at the end

## [1] "your"  "your"  "your"  "your"  "YOUR"  "youth"

Notice that the empty string and stray dashes are treated as words, and punctuation stays attached to the words it follows (not ideal)
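
One possible cleanup, sketched on a made-up vector (words.demo is an assumption for illustration): delete punctuation characters with gsub(), then drop any entries that end up empty. Whether non-ASCII marks such as "–" match [[:punct:]] can depend on your locale, so treat this as a starting point.

```r
# Hypothetical words, similar in style to trump.words
words.demo = c("Friends,", "", "Americans:", "and", "YOU.")

# Remove punctuation characters from every word
words.clean = gsub("[[:punct:]]", "", words.demo)

# Drop the empty strings that remain
words.clean = words.clean[words.clean != ""]
words.clean
```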

Getting the unique words

We can get just the unique (sorted) words, using unique()

trump.words.sorted.unique = unique(trump.words.sorted)
length(trump.words.sorted.unique)

## [1] 1604

length(trump.words.sorted.unique) / length(trump.words.sorted) # About 36% unique words

## [1] 0.3615055

head(trump.words.sorted.unique)

## [1] ""           "–"          "'I"         "“extremely" "“I’m"      
## [6] "“I’M"

tail(trump.words.sorted.unique)

## [1] "YOU."     "young"    "youngest" "your"     "YOUR"     "youth"
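
Notice that "your" and "YOUR" count as different words, because unique() compares strings exactly. If that is not what we want, one option is to convert to lower case first. A minimal sketch on a made-up vector (case.demo is an assumption for illustration):

```r
# Hypothetical words that differ only in case
case.demo = c("your", "YOUR", "youth", "your", "You")

unique(case.demo)                         # 4 values: case matters
lower.unique = unique(tolower(case.demo)) # Lower-case first, then deduplicate
lower.unique                              # 3 values

length(lower.unique) / length(case.demo)  # Fraction of unique words, ignoring case
```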

Reading in Text

Statistical Computing, 36-350

Wednesday September 7, 2016