Statistical Computing, 36-350
Monday August 31, 2015
Class website: http://www.stat.cmu.edu/~ryantibs/statcomp-F15/.
This will host class info, schedule, lecture notes, homework assignments, etc.
For announcements and discussions, sign up for the Piazza group. Blackboard will be used to collect submissions, and keep track of grades
Statistical tools to analyze your darts game!
Two basic types of things, or objects: data and functions
Functions: things like log, + (takes two arguments), < (two), %% (two), and mean (one)
A function is a machine which turns input objects, or arguments, into an output object, or a return value (possibly with side effects), according to a definite rule
Programming is writing functions to transform inputs into outputs
Good programming ensures the transformation is done easily and correctly
Machines are made out of machines; functions are made out of functions, like \(f(a,b) = a^2 + b^2\)
The route to good programming is to take a big transformation and break it down into smaller ones, and then break those down, until you come to tasks which are easy (using built-in functions)
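As a rough illustration of functions being made out of functions, the formula above could be written in R as follows. The names square and sum.of.squares are made up for this sketch.

```r
# Hypothetical small machine: squares its input
square = function(a) {
  a^2
}

# The bigger machine f(a,b) = a^2 + b^2, built out of the smaller one
sum.of.squares = function(a, b) {
  square(a) + square(b)
}

sum.of.squares(3, 4)
## [1] 25
```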
At base level, all data can be represented in binary format, by bits (i.e., TRUE/FALSE, YES/NO, 1/0). Some basic data types:
Booleans: TRUE or FALSE in R
Missing or ill-defined values: NA, NaN, etc.
Operators:
Unary: - for arithmetic negation, ! for Boolean negation
Binary: +, -, *, and / (though this is only a partial operator). Also, %% (for mod), %/% (for integer division), and ^ (again partial)
Examples:
-7
## [1] -7
7+5
## [1] 12
7-5
## [1] 2
7*5
## [1] 35
7^5
## [1] 16807
7/5
## [1] 1.4
7 %% 5
## [1] 2
7 %/% 5
## [1] 1
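The two operators %/% and %% fit together: the integer quotient times the divisor, plus the remainder, gives back the original number. A small sketch of this, with a and b as illustrative variable names:

```r
a = 7
b = 5
a %/% b                          # integer quotient
## [1] 1
a %% b                           # remainder
## [1] 2
(a %/% b) * b + (a %% b) == a    # the two pieces reassemble a
## [1] TRUE

# With a negative left operand, R rounds the quotient down (toward -Inf),
# so the remainder takes the sign of the divisor
(-7) %/% 5
## [1] -2
(-7) %% 5
## [1] 3
```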
Basic interaction with R is by typing in the console, i.e., terminal or command line
You type in commands, R gives back answers (or errors)
Menus and other graphical interfaces are extras built on top of the console
Comparison operators are also binary operators; they take two objects, like numbers, and give back a Boolean. Examples:
7 > 5
## [1] TRUE
7 < 5
## [1] FALSE
7 >= 7
## [1] TRUE
7 <= 5
## [1] FALSE
7 == 5
## [1] FALSE
7 != 5
## [1] TRUE
(Note: == is a comparison operator, = is not!)
The basic Boolean operators are & (and) and | (or). Examples:
(5 > 7) & (6*7 == 42)
## [1] FALSE
(5 > 7) | (6*7 == 42)
## [1] TRUE
(5 > 7) | (6*7 == 42) & (0 != 0)
## [1] FALSE
(5 > 7) | (6*7 == 42) & (0 != 0) | (9-8 >= 0)
## [1] TRUE
(The double forms && and || are different! We will see them later)
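The mixed examples above rely on the fact that R evaluates & before |, much as it evaluates * before +. A short sketch with values chosen so that the grouping changes the answer:

```r
# Read by R as TRUE | (FALSE & FALSE)
(7 > 5) | (5 > 7) & (0 != 0)
## [1] TRUE

# Explicit parentheses force the | to be evaluated first
((7 > 5) | (5 > 7)) & (0 != 0)
## [1] FALSE
```

When in doubt, adding parentheses costs nothing and makes the intent clear to readers.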
The typeof() function returns the data type
is.foo() functions return Booleans for whether the argument is of type foo
as.foo() (tries to) “cast” its argument to type foo, to translate it sensibly into such a value
typeof(7)
## [1] "double"
is.numeric(7)
## [1] TRUE
is.na(7)
## [1] FALSE
is.na(7/0)
## [1] FALSE
is.na(0/0)
## [1] TRUE
(Why is 7/0 not NA, but 0/0 is?)
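One way to explore this question is to look at what the two divisions actually return. The sketch below uses only base R functions:

```r
7/0
## [1] Inf
0/0
## [1] NaN
is.infinite(7/0)   # a well-defined (infinite) value, so not missing
## [1] TRUE
is.nan(0/0)        # "not a number": the result is ill-defined
## [1] TRUE
is.na(NaN)         # is.na() treats NaN as missing
## [1] TRUE
is.nan(NA)         # but NA itself is not NaN
## [1] FALSE
```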
is.character(7)
## [1] FALSE
is.character("7")
## [1] TRUE
is.character("seven")
## [1] TRUE
is.na("seven")
## [1] FALSE
as.character(5/6)
## [1] "0.833333333333333"
as.numeric(as.character(5/6))
## [1] 0.8333333
6*as.numeric(as.character(5/6))
## [1] 5
5/6 == as.numeric(as.character(5/6))
## [1] FALSE
(Why is the last evaluation FALSE?)
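A possible way to investigate: the character string keeps only 15 significant digits, so converting it back does not recover exactly the same double. In the sketch below the string is typed in by hand, copied from the output above; the variable names are illustrative.

```r
five.sixths = 5/6
round.trip = as.numeric("0.833333333333333")   # the string printed above

five.sixths - round.trip                       # tiny, but not exactly zero
five.sixths == round.trip
## [1] FALSE
isTRUE(all.equal(five.sixths, round.trip))     # equal up to a small tolerance
## [1] TRUE

print(five.sixths, digits = 17)                # show more digits than the default 7
```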
We can give names to data objects; these give us variables
A few variables are built in:
pi
## [1] 3.141593
Variables can be arguments to functions or operators, just like constants:
pi*10
## [1] 31.41593
cos(pi)
## [1] -1
Create variables with the assignment operator, <- or =, as in:
approx.pi = 22/7
approx.pi
## [1] 3.142857
diameter = 10
approx.pi * diameter
## [1] 31.42857
The assignment operator also changes values:
circumference = approx.pi * diameter
circumference
## [1] 31.42857
circumference = 30
circumference
## [1] 30
The code you write will be made of variables, with descriptive names: easier to design, easier to debug, easier to improve, and easier for others to read
Avoid “magic constants”; use named variables (you will be graded on this!)
Named variables are a first step towards abstraction
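To make the point about magic constants concrete, here is a sketch contrasting the two styles, reusing the circle example from above:

```r
# Magic constants: a reader has to guess what 3.142857 and 10 mean
3.142857 * 10

# Named variables: the same calculation explains itself,
# and changing the diameter means editing one line only
approx.pi = 22/7
diameter = 10
circumference = approx.pi * diameter
circumference
## [1] 31.42857
```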
What variables have you defined?
ls()
## [1] "approx.pi" "circumference" "diameter" "name"
## [5] "places" "weekly.hours" "x" "y"
Getting rid of variables:
rm("circumference")
ls()
## [1] "approx.pi" "diameter" "name" "places"
## [5] "weekly.hours" "x" "y"
rm(list=ls()) # Be warned! This erases everything
ls()
## character(0)
A data structure is a grouping of related data values into an object
A vector is a sequence of values, all of the same type, as in:
x = c(7, 8, 10, 45)
x
## [1] 7 8 10 45
is.vector(x)
## [1] TRUE
The c() function returns a vector containing all its arguments in the specified order
Square brackets index a vector: x[1] is the first element, x[4] the 4th element, and x[-4] a vector containing all but the fourth element
vector(length=6) returns an empty vector of length 6; helpful for filling things up later
weekly.hours = vector(length=5)
weekly.hours
## [1] FALSE FALSE FALSE FALSE FALSE
weekly.hours[5] = 8
weekly.hours
## [1] 0 0 0 0 8
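Notice that the vector above started out as FALSE values and turned into numbers after the assignment. A small sketch of what seems to be going on, using typeof(); the names hours and hours.num are made up for the illustration:

```r
hours = vector(length = 3)
typeof(hours)                  # by default, vector() makes a Boolean vector
## [1] "logical"

hours[2] = 7.5                 # assigning a number converts the whole vector
typeof(hours)
## [1] "double"
hours
## [1] 0.0 7.5 0.0

# Asking for a numeric vector up front avoids the silent type change
hours.num = vector(mode = "numeric", length = 3)
hours.num
## [1] 0 0 0
```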
Arithmetic operators apply to vectors in a “componentwise” fashion:
y = c(-7, -8, -10, -45)
x+y
## [1] 0 0 0 0
x*y
## [1] -49 -64 -100 -2025
Recycling repeats the elements of the shorter vector when it is combined with a longer one. Example:
x + c(-7,-8)
## [1] 0 0 3 37
x^c(1,0,-1,0.5)
## [1] 7.000000 1.000000 0.100000 6.708204
Single numbers are vectors of length 1 for purposes of recycling:
2*x
## [1] 14 16 20 90
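Recycling can be thought of as stretching the shorter vector with rep() before doing the arithmetic. The sketch below spells this out; the names short and stretched are illustrative.

```r
x = c(7, 8, 10, 45)
short = c(-7, -8)

# Repeat the short vector until it is as long as x
stretched = rep(short, length.out = length(x))
stretched
## [1] -7 -8 -7 -8

identical(x + short, x + stretched)
## [1] TRUE

# If the longer length is not a multiple of the shorter one,
# R still recycles, but issues a warning
x + c(1, 2, 3)
## [1]  8 10 13 46
```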
Can also do componentwise comparisons with vectors:
x > 9
## [1] FALSE FALSE TRUE TRUE
(Note: this returns a Boolean vector)
Logical operators work elementwise:
(x > 9) & (x < 20)
## [1] FALSE FALSE TRUE FALSE
To compare whole vectors, best to use identical() or all.equal():
x == -y
## [1] TRUE TRUE TRUE TRUE
identical(x,-y)
## [1] TRUE
identical(c(0.5-0.3,0.3-0.1),c(0.3-0.1,0.5-0.3))
## [1] FALSE
all.equal(c(0.5-0.3,0.3-0.1),c(0.3-0.1,0.5-0.3))
## [1] TRUE
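One caveat worth knowing: when its arguments differ, all.equal() does not return FALSE but a character string describing the difference. A common idiom, sketched below with made-up vectors a and b, is to wrap it in isTRUE() whenever a single Boolean is needed:

```r
a = c(0.5 - 0.3, 0.3 - 0.1)
b = c(0.2, 0.25)

all.equal(a, b)              # a description of the difference, not FALSE
isTRUE(all.equal(a, b))      # a plain Boolean
## [1] FALSE

if (isTRUE(all.equal(a, c(0.2, 0.2)))) {
  message("equal up to rounding error")
}
```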
Lots of functions can take vectors as arguments:
mean(), median(), sd(), var(), max(), min(), length(), and sum() return single numbers
sort() returns a new vector
hist() takes a vector of numbers and produces a histogram, a highly structured object, with the side effect of making a plot
ecdf() similarly produces an empirical cumulative-distribution-function object
summary() gives a five-number summary (plus the mean) of numerical vectors
any() and all() are useful on Boolean vectors
Vector of indices:
x[c(2,4)]
## [1] 8 45
Vector of negative indices:
x[c(-1,-3)]
## [1] 8 45
(Why that, and not 8 10?)
Boolean vector:
x[x>9]
## [1] 10 45
y[x>9]
## [1] -10 -45
which() gives the indices of the elements of a Boolean vector that are TRUE:
places = which(x > 9)
places
## [1] 3 4
y[places]
## [1] -10 -45
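Boolean vectors combine naturally with the vector functions listed earlier. A brief sketch, redefining x with the same values as above:

```r
x = c(7, 8, 10, 45)

any(x > 9)            # is at least one element bigger than 9?
## [1] TRUE
all(x > 9)            # are all of them?
## [1] FALSE
sum(x > 9)            # how many? (TRUE counts as 1, FALSE as 0)
## [1] 2
mean(x[x > 9])        # average of just the elements bigger than 9
## [1] 27.5
which(x == max(x))    # position of the largest element
## [1] 4
```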
You can give names to elements or components of vectors, and index vectors accordingly
names(x) = c("v1","v2","v3","fred")
names(x)
## [1] "v1" "v2" "v3" "fred"
x[c("fred","v1")]
## fred v1
## 45 7
(Here R prints the labels; these are not actual components of x)
names(x) is just another vector (of characters):
names(y) = names(x)
sort(names(x))
## [1] "fred" "v1" "v2" "v3"
which(names(x)=="fred")
## [1] 4
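Names let a vector act as a simple lookup table. The sketch below uses made-up data (the vector darts.scores and its values are purely illustrative):

```r
darts.scores = c(alice = 90, bob = 72, carol = 85)   # made-up data

darts.scores["bob"]                       # look up by name
## bob 
##  72 
names(darts.scores)[darts.scores > 80]    # whose score is above 80?
## [1] "alice" "carol"
unname(darts.scores["bob"])               # drop the label, keep the value
## [1] 72
```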
The more bits in the fraction part, the more precision
The R floating-point data type is a double (what R calls numeric). Back when memory was expensive, the now-standard number of bits was twice the default
Finite precision \(\Rightarrow\) arithmetic on doubles \(\not=\) arithmetic on \(\mathbb{R}\).
0.45 == 3*0.15
## [1] FALSE
0.45 - 3*0.15
## [1] 5.551115e-17
Often ignorable, but not always
Usually better to use all.equal() than exact comparison
(0.5 - 0.3) == (0.3 - 0.1)
## [1] FALSE
all.equal(0.5-0.3, 0.3-0.1)
## [1] TRUE
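The idea behind a tolerant comparison can be sketched by hand: two doubles are treated as equal when they differ by less than some small tolerance. The helper near.equal and its default tolerance are assumptions made for this illustration, not how all.equal() is actually implemented.

```r
# Hypothetical helper: equality up to an absolute tolerance
near.equal = function(a, b, tol = 1e-8) {
  abs(a - b) < tol
}

near.equal(0.45, 3*0.15)     # differ only by rounding error
## [1] TRUE
near.equal(0.45, 0.46)       # genuinely different
## [1] FALSE

# The spacing between 1 and the next representable double
.Machine$double.eps
## [1] 2.220446e-16
```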
Typing a whole number in the terminal doesn’t make it an integer; it makes it a double, whose fractional part is 0
is.integer(7)
## [1] FALSE
as.integer(7)
## [1] 7
To test for being an integer in the mathematical sense, use round():
round(7) == 7
## [1] TRUE
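For completeness, a sketch of how actual integers arise in R: the L suffix and the : operator both produce them, while ordinary division goes back to doubles. The helper is.whole is a made-up name wrapping the round() test above.

```r
typeof(7L)            # the L suffix asks for an integer
## [1] "integer"
is.integer(7L)
## [1] TRUE
typeof(1:3)           # sequences made with : are integers
## [1] "integer"
typeof(7L + 1L)       # integer arithmetic stays integer
## [1] "integer"
typeof(7L / 1L)       # but division always gives a double
## [1] "double"

# Hypothetical helper: whole number in the mathematical sense
is.whole = function(x) {
  round(x) == x
}
is.whole(c(7, 7.5, 7L))
## [1]  TRUE FALSE  TRUE
```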