Statistical Computing, 36-350
Monday August 26, 2019
All 3 are free, and all 3 will be used extensively in this course
It’s on the course website, please read it (actually read it)
Don’t do it, refer to syllabus if you’re unclear about anything
(and if you’re still unclear, come see me)
Please don’t call me “Professor”. Call me:
(demo)
Data types, operators, variables
Two basic types of things/objects: data and functions
log, + (takes two arguments), < (two), %% (two), and mean (one)A function is a machine which turns input objects, or arguments, into an output object, or a return value (possibly with side effects), according to a definite rule
The trick to good programming is to take a big transformation and break it down into smaller ones, and then break those down, until you come to tasks which are easy (using built-in functions)
At base level, all data can represented in binary format, by bits (i.e., TRUE/FALSE, YES/NO, 1/0). Basic data types:
TRUE or FALSE in RNA, NaN, etc.- for arithmetic negation, ! for Boolean negation+, -, *, and / (though this is only a partial operator). Also, %% (for mod), and ^ (again partial)-7## [1] -77 + 5## [1] 127 - 5## [1] 27 * 5## [1] 357 ^ 5## [1] 168077 / 5## [1] 1.47 %% 5## [1] 2These are also binary operators; they take two objects, and give back a Boolean
7 > 5## [1] TRUE7 < 5## [1] FALSE7 >= 7## [1] TRUE7 <= 5## [1] FALSE7 == 5## [1] FALSE7 != 5## [1] TRUEWarning: == is a comparison operator, = is not!
These basic ones are & (and) and | (or)
(5 > 7) & (6 * 7 == 42)## [1] FALSE(5 > 7) | (6 * 7 == 42)## [1] TRUE(5 > 7) | (6 * 7 == 42) & (0 != 0)## [1] FALSE(5 > 7) | (6 * 7 == 42) & (0 != 0) | (9 - 8 >= 0)## [1] TRUENote: The double forms && and || are different! We’ll see them later
typeof() function returns the data typeis.foo() functions return Booleans for whether the argument is of type fooas.foo() (tries to) “cast” its argument to type foo, to translate it sensibly into such a valuetypeof(7)## [1] "double"is.numeric(7)## [1] TRUEis.na(7)## [1] FALSEis.na(7/0)## [1] FALSEis.na(0/0)## [1] TRUEis.character(7)## [1] FALSEis.character("7")## [1] TRUEis.character("seven")## [1] TRUEis.na("seven")## [1] FALSEas.character(5/6)## [1] "0.833333333333333"as.numeric(as.character(5/6))## [1] 0.83333336 * as.numeric(as.character(5/6))## [1] 55/6 == as.numeric(as.character(5/6))## [1] FALSEWe can give names to data objects; these give us variables. Some variables are built-in:
pi## [1] 3.141593Variables can be arguments to functions or operators, just like constants:
pi * 10## [1] 31.41593cos(pi)## [1] -1We create variables with the assignment operator, <- or =
approx.pi = 22/7
approx.pi## [1] 3.142857diameter = 10
approx.pi * diameter## [1] 31.42857The assignment operator also changes values:
circumference = approx.pi * diameter
circumference## [1] 31.42857circumference = 30
circumference## [1] 30What variables have you defined?
ls()## [1] "approx.pi"     "circumference" "diameter"Getting rid of variables:
rm("circumference")
ls()## [1] "approx.pi" "diameter"rm(list=ls()) # Be warned! This erases everything
ls()## character(0)Data structures
x = c(7, 8, 10, 45)
x## [1]  7  8 10 45is.vector(x)## [1] TRUEc() function returns a vector containing all its arguments in specified order1:5 is shorthand for c(1,2,3,4,5), and so onx[1] would be the first element, x[4] the fourth element, and x[-4] is a vector containing all but the fourth elementvector(length=n) returns an empty vector of length n; helpful for filling things up later
weekly.hours = vector(length=5)
weekly.hours## [1] FALSE FALSE FALSE FALSE FALSEweekly.hours[5] = 8
weekly.hours## [1] 0 0 0 0 8Arithmetic operator apply to vectors in a “componentwise” fashion
y = c(-7, -8, -10, -45)
x + y## [1] 0 0 0 0x * y## [1]   -49   -64  -100 -2025Recycling repeat elements in shorter vector when combined with a longer one
x + c(-7,-8)## [1]  0  0  3 37x^c(1,0,-1,0.5)## [1] 7.000000 1.000000 0.100000 6.708204Single numbers are vectors of length 1 for purposes of recycling:
2 * x## [1] 14 16 20 90Can do componentwise comparisons with vectors:
x > 9## [1] FALSE FALSE  TRUE  TRUELogical operators also work elementwise:
(x > 9) & (x < 20)## [1] FALSE FALSE  TRUE FALSETo compare whole vectors, best to use identical() or all.equal():
x == -y## [1] TRUE TRUE TRUE TRUEidentical(x, -y)## [1] TRUEidentical(c(0.5-0.3,0.3-0.1), c(0.3-0.1,0.5-0.3))## [1] FALSEall.equal(c(0.5-0.3,0.3-0.1), c(0.3-0.1,0.5-0.3))## [1] TRUENote: these functions are slightly different; we’ll see more later
Many functions can take vectors as arguments:
mean(), median(), sd(), var(), max(), min(), length(), and sum() return single numberssort() returns a new vectorhist() takes a vector of numbers and produces a histogram, a highly structured object, with the side effect of making a plotecdf() similarly produces a cumulative-density-function objectsummary() gives a five-number summary of numerical vectorsany() and all() are useful on Boolean vectorsVector of indices:
x[c(2,4)]## [1]  8 45Vector of negative indices:
x[c(-1,-3)]## [1]  8 45Boolean vector:
x[x > 9]## [1] 10 45y[x > 9]## [1] -10 -45which() gives the elements of a Boolean vector that are TRUE:
places = which(x > 9)
places## [1] 3 4y[places]## [1] -10 -45We can give names to elements/components of vectors, and index vectors accordingly
names(x) = c("v1","v2","v3","fred")
names(x)## [1] "v1"   "v2"   "v3"   "fred"x[c("fred","v1")]## fred   v1 
##   45    7Note: here R is printing the labels, these are not additional components of x
names() returns another vector (of characters):
names(y) = names(x)
sort(names(x))## [1] "fred" "v1"   "v2"   "v3"which(names(x) == "fred")## [1] 4An array is a multi-dimensional generalization of vectors
x = c(7, 8, 10, 45)
x.arr = array(x, dim=c(2,2))
x.arr##      [,1] [,2]
## [1,]    7   10
## [2,]    8   45dim says how many rows and columns; filled by columnsdim is vector of arbitrary lengthSome properties of our array:
dim(x.arr)## [1] 2 2is.vector(x.arr)## [1] FALSEis.array(x.arr)## [1] TRUEtypeof(x.arr)## [1] "double"Can access a 2d array either by pairs of indices or by the underlying vector (column-major order):
x.arr[1,2]## [1] 10x.arr[3]## [1] 10Omitting an index means “all of it”:
x.arr[c(1,2),2]## [1] 10 45x.arr[,2]## [1] 10 45x.arr[,2,drop=FALSE]##      [,1]
## [1,]   10
## [2,]   45Note: the optional third argument drop=FALSE ensures that the result is still an array, not a vector
Many functions applied to an array will just boil things down to the underlying vector:
which(x.arr > 9)## [1] 3 4This happens unless the function is set up to handle arrays specifically
And there are several functions/operators that do preserve array structure:
y = -x
y.arr = array(y, dim=c(2,2))
y.arr + x.arr##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0A matrix is a specialization of a 2d array
z.mat = matrix(c(40,1,60,3), nrow=2)
z.mat##      [,1] [,2]
## [1,]   40   60
## [2,]    1    3is.array(z.mat)## [1] TRUEis.matrix(z.mat)## [1] TRUEncol for the number of columnsbyrow=TRUEz.mat/3)Matrices have its own special multiplication operator, written %*%:
six.sevens = matrix(rep(7,6), ncol=3)
six.sevens##      [,1] [,2] [,3]
## [1,]    7    7    7
## [2,]    7    7    7z.mat %*% six.sevens # [2x2] * [2x3]##      [,1] [,2] [,3]
## [1,]  700  700  700
## [2,]   28   28   28Can also multiply a matrix and a vector
Row/column sums, or row/column means:
rowSums(z.mat)## [1] 100   4colSums(z.mat)## [1] 41 63rowMeans(z.mat)## [1] 50  2colMeans(z.mat)## [1] 20.5 31.5The diag() function can be used to extract the diagonal entries of a matrix:
diag(z.mat)## [1] 40  3It can also be used to change the diagonal:
diag(z.mat) = c(35,4)
z.mat##      [,1] [,2]
## [1,]   35   60
## [2,]    1    4Finally, diag() can be used to create a diagonal matrix:
diag(c(3,4))##      [,1] [,2]
## [1,]    3    0
## [2,]    0    4diag(2)##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1Transpose:
t(z.mat)##      [,1] [,2]
## [1,]   35    1
## [2,]   60    4Determinant:
det(z.mat)## [1] 80Inverse:
solve(z.mat)##         [,1]    [,2]
## [1,]  0.0500 -0.7500
## [2,] -0.0125  0.4375z.mat %*% solve(z.mat)##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1rownames() and colnames()names() for vectorsA list is sequence of values, but not necessarily all of the same type
my.dist = list("exponential", 7, FALSE)
my.dist## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSEMost of what you can do with vectors you can also do with lists
[ ] as with vectors[[ ]], but only with a single index [[ ]] drops names and structures, [ ] does notmy.dist[2]## [[1]]
## [1] 7my.dist[[2]]## [1] 7my.dist[[2]]^2## [1] 49Add to lists with c() (also works with vectors):
my.dist = c(my.dist, 9)
my.dist## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSE
## 
## [[4]]
## [1] 9Chop off the end of a list by setting the length to something smaller (also works with vectors):
length(my.dist)## [1] 4length(my.dist) = 3
my.dist## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] 7
## 
## [[3]]
## [1] FALSEPluck out all but one piece of a list (also works with vectors):
my.dist[-2]## [[1]]
## [1] "exponential"
## 
## [[2]]
## [1] FALSEWe can name some or all of the elements of a list:
names(my.dist) = c("family","mean","is.symmetric")
my.dist## $family
## [1] "exponential"
## 
## $mean
## [1] 7
## 
## $is.symmetric
## [1] FALSEmy.dist[["family"]]## [1] "exponential"my.dist["family"]## $family
## [1] "exponential"Lists have a special shortcut way of using names, with $:
my.dist[["family"]]## [1] "exponential"my.dist$family## [1] "exponential"Creating a list with names:
another.dist = list(family="gaussian",
                            mean=7, sd=1, is.symmetric=TRUE)Adding named elements:
my.dist$was.estimated = FALSE
my.dist[["last.updated"]] = "2015-09-01"Removing a named list element, by assigning it the value NULL:
my.dist$was.estimated = NULLfamily, we can look that up by name, without caring where it is (in what position it lies) in the listrowSums(), summary(), apply())a.mat = matrix(c(35,8,10,4), nrow=2)
colnames(a.mat) = c("v1","v2")
a.mat##      v1 v2
## [1,] 35 10
## [2,]  8  4a.mat[,"v1"]  # Try a.mat$v1 and see what happens## [1] 35  8a.df = data.frame(a.mat,logicals=c(TRUE,FALSE))
a.df##   v1 v2 logicals
## 1 35 10     TRUE
## 2  8  4    FALSEa.df$v1## [1] 35  8a.df[,"v1"]## [1] 35  8a.df[1,]##   v1 v2 logicals
## 1 35 10     TRUEcolMeans(a.df)##       v1       v2 logicals 
##     21.5      7.0      0.5We can add rows or columns to an array or data frame with rbind() and cbind(), but be careful about forced type conversions
rbind(a.df,list(v1=-3,v2=-5,logicals=TRUE))##   v1 v2 logicals
## 1 35 10     TRUE
## 2  8  4    FALSE
## 3 -3 -5     TRUErbind(a.df,c(3,4,6))##   v1 v2 logicals
## 1 35 10        1
## 2  8  4        0
## 3  3  4        6Much more on data frames a bit later in the course …
So far, every list element has been a single data value. List elements can be other data structures, e.g., vectors and matrices, even other lists:
my.list = list(z.mat=z.mat, my.lucky.num=13, my.dist=my.dist)
my.list## $z.mat
##      [,1] [,2]
## [1,]   35   60
## [2,]    1    4
## 
## $my.lucky.num
## [1] 13
## 
## $my.dist
## $my.dist$family
## [1] "exponential"
## 
## $my.dist$mean
## [1] 7
## 
## $my.dist$is.symmetric
## [1] FALSE
## 
## $my.dist$last.updated
## [1] "2015-09-01"