Statistical Computing, 36-350
Monday September 30, 2019
plot(): generic plotting function
points(): add points to an existing plot
lines(), abline(): add lines to an existing plot
text(), legend(): add text to an existing plot
rect(), polygon(): add shapes to an existing plot
hist(), image(): histogram and heatmap
heat.colors(), topo.colors(), etc: create a color vector
density(): estimate density, which can be plotted
contour(): draw contours, or add to existing plot
curve(): draw a curve, or add to existing plot
Function basics
From our lectures on text manipulation and regexes:
# Get Trump's word counts
trump.lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/trump.txt")
trump.text = paste(trump.lines, collapse=" ")
trump.words = strsplit(trump.text, split="[[:space:]]|[[:punct:]]")[[1]]
trump.words = trump.words[trump.words != ""]
trump.wordtab = table(trump.words)
# Now retype for other politicians/speeches, etc.
Call function() to create your own function. Document your function with comments
# get.wordtab.trump: get a word table from Trump's 2016 RNC speech on the web
# Input: none
# Output: word table, i.e., vector with counts as entries and associated
# words as names
get.wordtab.trump = function() {
  lines = readLines("http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/trump.txt")
  text = paste(lines, collapse=" ")
  words = strsplit(text, split="[[:space:]]|[[:punct:]]")[[1]]
  words = words[words != ""]
  wordtab = table(words)
  return(wordtab)
}
Much better: create a word table function that takes the URL of a web page as its input
# get.wordtab.from.url: get a word table from text on the web
# Input:
# - str.url: string, specifying URL of a web page
# Output: word table, i.e., vector with counts as entries and associated
# words as names
get.wordtab.from.url = function(str.url) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split="[[:space:]]|[[:punct:]]")[[1]]
  words = words[words != ""]
  wordtab = table(words)
  return(wordtab)
}
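The structure of a function has three basic parts: the inputs (or arguments), listed within the parentheses of function(); the body (the code that is executed), within the braces {}; and the output (or return value), obtained with return()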
R doesn’t let your function have multiple outputs, but you can return a list
Our created functions can be used just like the built-in ones
# Using our function
trump.wordtab.new = get.wordtab.from.url(
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/trump.txt")
all(trump.wordtab.new == trump.wordtab)
## [1] TRUE
# Typing the function's name (without parentheses) prints its definition
get.wordtab.from.url
## function(str.url) {
##   lines = readLines(str.url)
##   text = paste(lines, collapse=" ")
##   words = strsplit(text, split="[[:space:]]|[[:punct:]]")[[1]]
##   words = words[words != ""]
##   wordtab = table(words)
##   return(wordtab)
## }
## <environment: 0x7fa74f470558>
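With no explicit return() statement, a function returns the value of the last expression evaluated in its body. So ending the body with table(words), instead of assigning wordtab and calling return(wordtab), gives a function equivalent to what we had before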
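As a small check of the implicit-return rule, here is a sketch that runs without web access. The two helper functions and the sample string are hypothetical stand-ins (not from the lecture): they take a string already in memory rather than a URL, and differ only in how the value is returned.

```r
# Hypothetical helper: word table from a string, with an explicit return()
wordtab.explicit = function(text) {
  words = strsplit(text, split="[[:space:]]|[[:punct:]]")[[1]]
  words = words[words != ""]
  wordtab = table(words)
  return(wordtab)
}

# Same body, but the value of the last expression is returned implicitly
wordtab.implicit = function(text) {
  words = strsplit(text, split="[[:space:]]|[[:punct:]]")[[1]]
  words = words[words != ""]
  table(words)
}

sample.text = "the cat and the hat, and the bat"
# Compare the two results
identical(wordtab.explicit(sample.text), wordtab.implicit(sample.text))
```

Both versions should produce the same table, so which one to use is mostly a matter of style; an explicit return() can make the output easier to spot in a long body.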
Our function can take more than one input
# get.wordtab.from.url: get a word table from text on the web
# Inputs:
# - str.url: string, specifying URL of a web page
# - split: string, specifying what to split on
# Output: word table, i.e., vector with counts as entries and associated
# words as names
get.wordtab.from.url = function(str.url, split) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
  table(words)
}
Our function can also specify default values for the inputs (if the user doesn’t specify an input in the function call, then the default value is used)
# get.wordtab.from.url: get a word table from text on the web
# Inputs:
# - str.url: string, specifying URL of a web page
# - split: string, specifying what to split on. Default is the regex pattern
# "[[:space:]]|[[:punct:]]"
# - tolower: Boolean, TRUE if words should be converted to lower case before
# the word table is computed. Default is TRUE
# Output: word table, i.e., vector with counts as entries and associated
# words as names
get.wordtab.from.url = function(str.url, split="[[:space:]]|[[:punct:]]",
    tolower=TRUE) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
  table(words)
}
# Inputs can be called by name, or without names
trump.wordtab1 = get.wordtab.from.url(
  str.url="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/trump.txt",
  split="[[:space:]]|[[:punct:]]", tolower=TRUE)
trump.wordtab2 = get.wordtab.from.url(
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/trump.txt",
  "[[:space:]]|[[:punct:]]", TRUE)
all(trump.wordtab2 == trump.wordtab1)
## [1] TRUE
# Inputs can be called by partial names (if uniquely identifying)
trump.wordtab3 = get.wordtab.from.url(
  str="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/trump.txt",
  spl="[[:space:]]|[[:punct:]]", tolower=TRUE)
all(trump.wordtab3 == trump.wordtab1)
## [1] TRUE
# When inputs aren't specified, default values are used
trump.wordtab4 = get.wordtab.from.url(
  str.url="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/trump.txt",
  split="[[:space:]]|[[:punct:]]")
all(trump.wordtab4 == trump.wordtab1)
## [1] TRUE
# Named inputs can go in any order
trump.wordtab5 = get.wordtab.from.url(
  tolower=TRUE, split="[[:space:]]|[[:punct:]]",
  str.url="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/trump.txt")
all(trump.wordtab5 == trump.wordtab1)
## [1] TRUE
While named inputs can go in any order, unnamed inputs must go in the proper order (as they are specified in the function’s definition). E.g., the following code would throw an error:
trump.wordtab6 = get.wordtab.from.url("[[:space:]]|[[:punct:]]",
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/trump.txt",
  tolower=FALSE)
## Warning in file(con, "r"): cannot open file '[[:space:]]|[[:punct:]]': No
## such file or directory
## Error in file(con, "r"): cannot open the connection
because our function would try to open up “[[:space:]]|[[:punct:]]” as the URL of a web page
When calling a function with multiple arguments, use input names for safety, unless you’re absolutely certain of the right order for (some) inputs
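The matching rules above (by name, by position, by partial name, with defaults) can be tried without web access. The sketch below uses a hypothetical function count.words(), which is not part of the lecture code and takes a string instead of a URL; its logical input is named to.lower here purely for illustration.

```r
# Hypothetical function for illustration: word table from a string
count.words = function(text, split="[[:space:]]|[[:punct:]]", to.lower=TRUE) {
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
  if (to.lower) words = tolower(words)
  table(words)
}

s = "The cat and the hat"
tab.named = count.words(text=s, split=" ", to.lower=TRUE)  # by name
tab.posit = count.words(s, " ", TRUE)                      # by position
tab.partl = count.words(te=s, sp=" ", to.lower=TRUE)       # by partial name
tab.reord = count.words(to.lower=TRUE, split=" ", text=s)  # names, any order
tab.deflt = count.words(s, split=" ")                      # to.lower defaults to TRUE

# All five calls should agree
all(sapply(list(tab.posit, tab.partl, tab.reord, tab.deflt),
           identical, tab.named))
```

Partial names work here only because te and sp each match a single argument; an ambiguous prefix would be an error, which is one more reason to prefer full names.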
Return values and side effects
When creating a function in R, though you cannot return more than one output, you can return a list. This (by definition) can contain an arbitrary number of arbitrary objects
# get.wordtab.from.url: get a word table from text on the web
# Inputs:
# - str.url: string, specifying URL of a web page
# - split: string, specifying what to split on. Default is the regex pattern
# "[[:space:]]|[[:punct:]]"
# - tolower: Boolean, TRUE if words should be converted to lower case before
# the word table is computed. Default is TRUE
# - keep.nums: Boolean, TRUE if words containing numbers should be kept in the
# word table. Default is FALSE
# Output: list, containing word table, and then some basic numeric summaries
get.wordtab.from.url = function(str.url, split="[[:space:]]|[[:punct:]]",
    tolower=TRUE, keep.nums=FALSE) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
  # Get rid of words with numbers, if we're asked to
  if (!keep.nums)
    words = grep("[0-9]", words, invert=TRUE, value=TRUE)
  # Compute the word table
  wordtab = table(words)
  return(list(wordtab=wordtab,
              number.unique.words=length(wordtab),
              number.total.words=sum(wordtab),
              longest.word=words[which.max(nchar(words))]))
}
# Trump's Republican National Convention 2016 speech
trump.wordtab = get.wordtab.from.url(
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/trump.txt")
lapply(trump.wordtab, head)
## $wordtab
## words
##         a   abandon abandoned      able   abolish     about
##        54         1         1         2         1         4
##
## $number.unique.words
## [1] 1236
##
## $number.total.words
## [1] 4431
##
## $longest.word
## [1] "straightforward"
# Clinton's Democratic National Convention 2016 speech
clinton.wordtab = get.wordtab.from.url(
  "http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/clinton.txt")
lapply(clinton.wordtab, head)
## $wordtab
## words
##         a abandoned      able     about    abroad    accept
##       106         1         1        11         1         1
##
## $number.unique.words
## [1] 1306
##
## $number.total.words
## [1] 5555
##
## $longest.word
## [1] "representatives"
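Once a function returns a list, its components can be pulled out by name. The sketch below uses a hypothetical offline stand-in, summarize.words(), that takes a string rather than a URL and returns a list with the same component names as above; the input sentence is made up for illustration.

```r
# Hypothetical, offline stand-in for the function above: takes a string
summarize.words = function(text, split="[[:space:]]|[[:punct:]]") {
  words = strsplit(text, split=split)[[1]]
  words = tolower(words[words != ""])
  wordtab = table(words)
  list(wordtab=wordtab,
       number.unique.words=length(wordtab),
       number.total.words=sum(wordtab),
       longest.word=words[which.max(nchar(words))])
}

res = summarize.words("A short sentence, and a shorter one")
names(res)                 # what the list contains
res$number.unique.words    # extract a component by name with $
res[["longest.word"]]      # or with [[ ]]
res$wordtab[["a"]]         # components can be indexed further
```

Naming the list components in the return value is what makes the $ access possible, so descriptive names are worth the extra typing.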
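A side effect of a function is something that happens as a result of the function’s body, but is not returned. Examples: printing something to the console, plotting something on the display, and saving an R data file or a PDF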
# get.wordtab.from.url: get a word table from text on the web
# Inputs:
# - str.url: string, specifying URL of a web page
# - split: string, specifying what to split on. Default is the regex pattern
# "[[:space:]]|[[:punct:]]"
# - tolower: Boolean, TRUE if words should be converted to lower case before
# the word table is computed. Default is TRUE
# - keep.nums: Boolean, TRUE if words containing numbers should be kept in the
# word table. Default is FALSE
# - hist: Boolean, TRUE if a histogram of word lengths should be plotted as a
# side effect. Default is FALSE
# Output: list, containing word table, and then some basic numeric summaries
get.wordtab.from.url = function(str.url, split="[[:space:]]|[[:punct:]]",
    tolower=TRUE, keep.nums=FALSE, hist=FALSE) {
  lines = readLines(str.url)
  text = paste(lines, collapse=" ")
  words = strsplit(text, split=split)[[1]]
  words = words[words != ""]
  # Convert to lower case, if we're asked to
  if (tolower) words = tolower(words)
  # Get rid of words with numbers, if we're asked to
  if (!keep.nums)
    words = grep("[0-9]", words, invert=TRUE, value=TRUE)
  # Plot the histogram of the word lengths, if we're asked to
  if (hist)
    hist(nchar(words), col="lightblue", breaks=0:max(nchar(words)),
         xlab="Word length")
  # Compute the word table
  wordtab = table(words)
  return(list(wordtab=wordtab,
              number.unique.words=length(wordtab),
              number.total.words=sum(wordtab),
              longest.word=words[which.max(nchar(words))]))
}
# Trump's speech
trump.wordtab = get.wordtab.from.url(
  str.url="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/trump.txt",
  hist=TRUE)
lapply(trump.wordtab, head)
## $wordtab
## words
##         a   abandon abandoned      able   abolish     about
##        54         1         1         2         1         4
##
## $number.unique.words
## [1] 1236
##
## $number.total.words
## [1] 4431
##
## $longest.word
## [1] "straightforward"
# Clinton's speech
clinton.wordtab = get.wordtab.from.url(
  str.url="http://www.stat.cmu.edu/~ryantibs/statcomp-F19/data/clinton.txt",
  hist=TRUE)
lapply(clinton.wordtab, head)
## $wordtab
## words
##         a abandoned      able     about    abroad    accept
##       106         1         1        11         1         1
##
## $number.unique.words
## [1] 1306
##
## $number.total.words
## [1] 5555
##
## $longest.word
## [1] "representatives"
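A plot is hard to inspect in code, so here is a minimal sketch of a different side effect, printing to the console, kept separate from the return value. The function count.and.report() is hypothetical and not part of the lecture code.

```r
# Hypothetical function: prints a message (side effect), returns a count
count.and.report = function(text) {
  words = strsplit(text, split="[[:space:]]+")[[1]]
  words = words[words != ""]
  cat("Found", length(words), "words\n")  # side effect: console output
  length(words)                           # return value
}

# The message is printed even though the result is assigned
n = count.and.report("one two three")
n

# capture.output() collects what was printed, separately from the value
printed = capture.output(m <- count.and.report("one two"))
printed
m
```

The printed message and the returned count travel by different routes: only the count can be stored and reused by later code.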
Environments and design
## [1] 8
## [1] 7
## [1] "A" "C" "G" "T" "U"
## [1] 3.141593 12.566371 28.274334
## [1] 3 12 27
## [1] 3.141593 12.566371 28.274334
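The code that produced the six outputs above is not shown here. The following is a sketch that is consistent with them, not necessarily the original code: the names adder(), circle.area(), and true.pi are assumptions. It illustrates that assignments inside a function only change the function's own environment, while names not defined inside the function are looked up outside it.

```r
x = 7
y = c("A", "C", "G", "T", "U")

# Inside adder(), y is the argument and x becomes a local variable
adder = function(y) { x = x + y; x }
adder(1)   # uses the global x (7) to compute the local x
x          # the global x is unchanged
y          # the global y is unchanged

# pi is not defined inside circle.area(), so it is looked up outside
circle.area = function(r) { pi * r^2 }
circle.area(1:3)

true.pi = pi
pi = 3             # masks the built-in constant
circle.area(1:3)   # now computed with pi = 3
pi = true.pi       # restore the usual value
circle.area(1:3)
```

The middle call shows why relying on names outside the function is risky: the function's result changed without its inputs or body changing.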
Built-in constants: pi, letters, month.name, etc.
Not all side effects are desirable. One particularly bad side effect is if the function’s body changes the value of some variable outside of the function’s environment
You can write top-level code, right away, for your function’s design:
# Not actual code
big.job = function(lots.of.arguments) {
  first.result = first.step(some.of.the.args)
  second.result = second.step(first.result, more.of.the.args)
  final.result = third.step(second.result, rest.of.the.args)
  return(final.result)
}
After you write down your design, go ahead and write the sub-functions (here first.step(), second.step(), third.step()). The process may be iterative, in that you may write these sub-functions, then go back and change the design a bit, etc.
With practice, this design strategy should become natural
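To make the design strategy concrete, here is one possible runnable sketch of it applied to the word table task. All function names (summarize.text(), split.into.words(), clean.words(), make.summary()) are hypothetical, and the input is a string rather than a URL so that the code runs offline.

```r
# Step 1: write the top-level function first, in terms of sub-functions
summarize.text = function(text, split="[[:space:]]|[[:punct:]]",
                          to.lower=TRUE) {
  words = split.into.words(text, split)
  words = clean.words(words, to.lower)
  make.summary(words)
}

# Step 2: fill in the sub-functions afterwards
split.into.words = function(text, split) {
  strsplit(paste(text, collapse=" "), split=split)[[1]]
}

clean.words = function(words, to.lower) {
  words = words[words != ""]
  if (to.lower) words = tolower(words)
  words
}

make.summary = function(words) {
  wordtab = table(words)
  list(wordtab=wordtab,
       number.unique.words=length(wordtab),
       number.total.words=sum(wordtab))
}

res = summarize.text("To be, or not to be")
res$number.unique.words
res$number.total.words
```

R only looks up the sub-functions when summarize.text() is called, so they can be defined after the top-level function, as long as they exist before the first call.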