Statistical Computing, 36-350
Monday September 26, 2016
From our lectures on text manipulation and regexes:
# Get Trump's word counts
trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
trump.text = paste(trump.lines, collapse=" ")
trump.words = strsplit(trump.text, split="[[:space:]]|[[:punct:]]")[[1]]
trump.words = trump.words[trump.words != ""]
trump.wordtab = table(trump.words)
# Now do the same for Clinton, Pence, Kaine, etc...
Call function() to create your own function. Optional (but highly recommended): document your function with comments.
# get.wordtab: get a word table from text on the web
# Inputs:
# - str.url: string, specifying URL of a web page
# Output: word table, i.e., vector with counts as entries and associated
# words as names
get.wordtab = function(str.url) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split="[[:space:]]|[[:punct:]]")[[1]]
  words = words[words != ""]
  wordtab = table(words)
  return(wordtab)
}
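The structure of a function has three basic parts: the inputs (or arguments), here str.url; the body (the code that is executed), here everything between the braces; and the output (or return value), here wordtab.
R doesn’t let your function have multiple outputs, but you can return a list that bundles several values.
Functions can call other functions: the body of get.wordtab calls readLines(), paste(), strsplit(), and table().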
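As a minimal sketch of returning several results at once, the function below bundles them into a list. The name summarize.words and its example input are hypothetical, introduced here only for illustration; its body also calls other functions (table(), length(), names(), which.max()).

```r
# summarize.words: hypothetical helper, not part of the lecture code
# Inputs:
# - words: character vector of words
# Output: list with the word table, the number of distinct words, and the
#   most frequent word
summarize.words = function(words) {
  wordtab = table(words)
  list(wordtab=wordtab,
       n.distinct=length(wordtab),
       top.word=names(wordtab)[which.max(wordtab)])
}

res = summarize.words(c("the", "cat", "and", "the", "hat"))
res$n.distinct  # number of distinct words
res$top.word    # most frequent word
```

The caller then picks out individual results by name with the $ operator.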
Our created functions can be used just like the built-in ones
# Using our function
trump.wordtab.new = get.wordtab("http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
all(trump.wordtab.new == trump.wordtab)
## [1] TRUE
# Revealing our function's definition
get.wordtab
## function(str.url) {
## lines = readLines(str.url)
## text = paste(lines, collapse=" ")
## words = strsplit(text, split="[[:space:]]|[[:punct:]]")[[1]]
## words = words[words != ""]
## wordtab = table(words)
## return(wordtab)
## }
## <environment: 0x10d75f708>
With no explicit return() statement, a function returns the value of the last expression it evaluates. So the following is equivalent to the earlier definition
get.wordtab = function(str.url) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split="[[:space:]]|[[:punct:]]")[[1]]
  words = words[words != ""]
  table(words)
}
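An explicit return() is still useful for leaving a function early. The sketch below uses a hypothetical function, sign.label, to show both styles side by side.

```r
# sign.label: hypothetical example function, not part of the lecture code
# Inputs:
# - x: a single number
# Output: string describing the sign of x
sign.label = function(x) {
  if (x < 0) return("negative")  # explicit return() exits early
  if (x == 0) return("zero")
  "positive"                     # last evaluated expression is returned
}

sign.label(-3)
sign.label(5)
```

One caveat: if the last line of a function is an assignment, such as wordtab = table(words), the value is still returned, but it is not printed automatically when the function is called at the console.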
Our function can take more than one input
# get.wordtab: get a word table from text on the web
# Inputs:
# - str.url: string, specifying URL of a web page
# - split: string, specifying what to split on
# Output: word table, i.e., vector with counts as entries and associated
# words as names
get.wordtab = function(str.url, split) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
  table(words)
}
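To see what the second input changes without needing a web connection, here is a small sketch that works on a string directly. The function count.words and the sample text are assumptions made for illustration; the body mirrors the last three lines of get.wordtab.

```r
# count.words: hypothetical offline variant of get.wordtab
# Inputs:
# - text: string, the text to split into words
# - split: string, specifying what to split on
# Output: word table
count.words = function(text, split) {
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
  table(words)
}

text = "one fish, two fish"
tab.space = count.words(text, split=" ")
tab.punct = count.words(text, split="[[:space:]]|[[:punct:]]")
names(tab.space)  # "fish" and "fish," are treated as different words
names(tab.punct)  # punctuation is split away, so "fish" is counted twice
```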
Our function can also specify default values for the inputs (so if the user doesn’t specify the input in the function call, then the default value is used)
# get.wordtab: get a word table from text on the web
# Inputs:
# - str.url: string, specifying URL of a web page
# - split: string, specifying what to split on. Default is the regex pattern
# "[[:space:]]|[[:punct:]]"
# - tolower: boolean, TRUE if words should be converted to lower case before
# the word table is computed. Default is TRUE
# Output: word table, i.e., vector with counts as entries and associated
# words as names
get.wordtab = function(str.url, split="[[:space:]]|[[:punct:]]",
                       tolower=TRUE) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
  table(words)
}
# Inputs can be called by name, or without names
trump.word.tab.1 = get.wordtab(
str.url="http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt",
split="[[:space:]]|[[:punct:]]", tolower=TRUE)
trump.word.tab.2 = get.wordtab(
"http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt",
"[[:space:]]|[[:punct:]]", TRUE)
all(trump.word.tab.2 == trump.word.tab.1)
## [1] TRUE
# Inputs can be called by partial names (if uniquely identifying)
trump.word.tab.3 = get.wordtab(
str="http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt",
spl="[[:space:]]|[[:punct:]]", tolower=TRUE)
all(trump.word.tab.3 == trump.word.tab.1)
## [1] TRUE
# When inputs aren't specified, default values are used
trump.word.tab.4 = get.wordtab(
str.url="http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt",
split="[[:space:]]|[[:punct:]]")
all(trump.word.tab.4 == trump.word.tab.1)
## [1] TRUE
# Named inputs can go in any order
trump.word.tab.5 = get.wordtab(tolower=TRUE, split="[[:space:]]|[[:punct:]]",
str.url="http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt")
all(trump.word.tab.5 == trump.word.tab.1)
## [1] TRUE
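All of the calls above produce the same table. To see a default actually being overridden, the sketch below repeats the definition of get.wordtab so that it runs on its own, and, as an assumption made for illustration, points it at a small temporary local file rather than a web page (readLines() accepts a file path as well as a URL).

```r
# Same definition as in the notes, repeated so this block is self-contained
get.wordtab = function(str.url, split="[[:space:]]|[[:punct:]]",
                       tolower=TRUE) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
  if (tolower) words = tolower(words)
  table(words)
}

# Assumption: a temporary local file stands in for the web page
tmp = tempfile(fileext=".txt")
writeLines(c("Jobs jobs JOBS", "Great, great jobs!"), tmp)

tab.lower = get.wordtab(tmp)                 # both defaults are used
tab.cased = get.wordtab(tmp, tolower=FALSE)  # override one default by name
tab.lower
tab.cased
```

With tolower=FALSE, differently capitalized versions of the same word are kept as separate entries in the table.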
While named inputs can go in any order, unnamed inputs must go in the proper order (as they are specified in the function’s definition)
So the following code would throw an error:
trump.word.tab.6 = get.wordtab("[[:space:]]|[[:punct:]]",
"http://www.stat.cmu.edu/~ryantibs/statcomp-F16/data/trump.txt",
tolower=FALSE)
## Warning in file(con, "r"): cannot open file '[[:space:]]|[[:punct:]]': No
## such file or directory
## Error in file(con, "r"): cannot open the connection
because the first unnamed input is matched to str.url, so our function would try to open up “[[:space:]]|[[:punct:]]” as the URL of a web page
When calling a function with multiple arguments, use input names for safety, unless you’re absolutely certain of the right order for (some) inputs
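Swapped unnamed inputs do not always fail as loudly as the call above. In the sketch below, power is a hypothetical function used only for illustration: mixing up the order raises no error and simply gives a different answer, which is why naming inputs is the safer habit.

```r
# power: hypothetical example function, not part of the lecture code
# Inputs:
# - base: number to be raised to a power
# - exponent: the power. Default is 2
# Output: base raised to exponent
power = function(base, exponent=2) {
  base^exponent
}

power(2, 3)                # unnamed: base=2, exponent=3
power(3, 2)                # unnamed, swapped: runs, but computes 3^2 instead
power(exponent=3, base=2)  # named: order does not matter
```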