How do we get text from an external source into R? Use the readLines() function
trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
class(trump.lines) # We have a character vector
## [1] "character"
length(trump.lines) # Many lines (elements)!
## [1] 113
trump.lines[1:3] # First 3 lines
## [1] "Friends, delegates and fellow Americans: I humbly and gratefully accept your nomination for the presidency of the United States."
## [2] "Story Continued Below"
## [3] ""
That seemed to work really well! Let’s try again
wiki.lines = readLines("https://en.wikipedia.org/wiki/Donald_Trump")
length(wiki.lines) # Many, many lines
## [1] 4198
wiki.lines[1:6] # First 6 lines
## [1] "<!DOCTYPE html>"
## [2] "<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">"
## [3] "<head>"
## [4] "<meta charset=\"UTF-8\"/>"
## [5] "<title>Donald Trump - Wikipedia, the free encyclopedia</title>"
## [6] "<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, \"$1client-js$2\" );</script>"
The difference between these two examples: the first was a plain text file put there by your professor, the second was a bona fide website (in HTML)
To get “usual” text from the website, we have to go to line 216, and even then it’s not that readable
wiki.lines[216]
## [1] "</div>"
We’ll learn more next week; for now, we’ll just look at nice text files
We don’t need to use the web; readLines() can be used on a local file. The following code would read in a text file from your professor’s computer:
trump.lines.2 = readLines("~/Dropbox/teaching/f16-350/lectures/text/trump.txt")
This will cause an error for you, unless your folders are set up exactly like those on the professor’s laptop! So using web links is more robust
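To try readLines() on a local file without depending on anyone's folder layout, here is a minimal sketch that writes its own small file first. The file contents and the names tmp.file and toy.lines are made up for illustration; file.exists() is just one way to check the path before reading.

```r
# Write a small, made-up text file to a temporary location
tmp.file = tempfile(fileext=".txt")
writeLines(c("First line", "", "Third line"), tmp.file)

# Check that the file is really there before reading it
if (file.exists(tmp.file)) {
  toy.lines = readLines(tmp.file)
}
class(toy.lines)   # A character vector, as before
length(toy.lines)  # One element per line, including the empty one
```

The same check works for any path: if file.exists() returns FALSE, the path is wrong for your computer.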
Fancy word, but all it means: make one long string, then split the words
trump.text = paste(trump.lines, collapse=" ")
trump.list = strsplit(trump.text, split=" ")
Let’s investigate, starting with the big long string
class(trump.text)
## [1] "character"
length(trump.text)
## [1] 1
substr(trump.text, 1, 300)
## [1] "Friends, delegates and fellow Americans: I humbly and gratefully accept your nomination for the presidency of the United States. Story Continued Below Together, we will lead our party back to the White House, and we will lead our country back to safety, prosperity, and peace. We will be a country o"
nchar(trump.text)
## [1] 25481
Let’s investigate the result of splitting up the long string
class(trump.list) # Remember why it's a list?
## [1] "list"
length(trump.list)
## [1] 1
trump.words = trump.list[[1]]
length(trump.words)
## [1] 4437
trump.words[1:10]
## [1] "Friends," "delegates" "and" "fellow" "Americans:"
## [6] "I" "humbly" "and" "gratefully" "accept"
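The same two steps on a tiny made-up example (the toy.* names and strings are invented for illustration). It shows two things that are easy to miss: an empty line leaves an empty string among the words, and strsplit() returns a list because it produces one vector of pieces per input string.

```r
toy.lines = c("the quick brown", "", "fox jumps")

# Step 1: glue the lines into one long string
toy.text = paste(toy.lines, collapse=" ")
length(toy.text)  # 1: a single string

# Step 2: split on spaces; the result is a list with one element
toy.list = strsplit(toy.text, split=" ")
toy.words = toy.list[[1]]
toy.words  # "the" "quick" "brown" "" "fox" "jumps"

# With two input strings, we get a list with two elements
two.list = strsplit(c("a b", "c d e"), split=" ")
length(two.list)  # 2
```

The empty string appears because the empty line leaves two spaces in a row in toy.text.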
We can sort strings (just like numbers), with sort()
trump.words.sorted = sort(trump.words) # Default is increasing order
length(trump.words.sorted)
## [1] 4437
head(trump.words.sorted) # Peek at the start
## [1] "" "–" "–" "–" "–" "–"
tail(trump.words.sorted) # Peek at the end
## [1] "your" "your" "your" "your" "YOUR" "youth"
Notice that the empty string and punctuation marks (such as "–") are treated as words (not ideal)
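One possible cleanup, shown only as a rough sketch on made-up words and not as the approach this course will take: delete punctuation characters with gsub(), then drop any strings that end up empty. The toy.* names are hypothetical, and how non-ASCII marks (like the dash above) are handled can depend on your locale.

```r
toy.words = c("Friends,", "delegates", "", "-", "Americans:", "YOU.")

# Delete every punctuation character from each string
toy.clean = gsub("[[:punct:]]", "", toy.words)

# Drop the strings that are now empty
toy.clean = toy.clean[toy.clean != ""]
toy.clean  # "Friends" "delegates" "Americans" "YOU"
```

This is blunt: it also deletes apostrophes, so a word like "I'm" would become "Im".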
We can get just the unique (sorted) words, using unique()
trump.words.sorted.unique = unique(trump.words.sorted)
length(trump.words.sorted.unique)
## [1] 1604
length(trump.words.sorted.unique) / length(trump.words.sorted) # About 36% unique words
## [1] 0.3615055
head(trump.words.sorted.unique)
## [1] "" "–" "'I" "“extremely" "“I’m"
## [6] "“I’M"
tail(trump.words.sorted.unique)
## [1] "YOU." "young" "youngest" "your" "YOUR" "youth"
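A small check on made-up words (the toy.* names are invented for illustration): unique() keeps the first copy of each value in the order it appears, so it should not matter whether we sort first and then take unique values, or the other way around.

```r
toy.words = c("we", "will", "lead", "and", "we", "will", "win")

unique(toy.words)  # Order of first appearance: "we" "will" "lead" "and" "win"

toy.unique.1 = unique(sort(toy.words))  # Sort, then remove repeats
toy.unique.2 = sort(unique(toy.words))  # Remove repeats, then sort
identical(toy.unique.1, toy.unique.2)   # TRUE

length(toy.unique.1) / length(toy.words)  # Fraction of unique words: 5 out of 7
```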