Lecture 1: Introduction and Basics

Statistical Computing, 36-350

Monday August 31, 2015

Outline

Why good statisticians learn to program

How this class will work

Class website: http://www.stat.cmu.edu/~ryantibs/statcomp-F15/.
This will host class info, schedule, lecture notes, homework assignments, etc.

For announcements and discussions, sign up for the Piazza group. Blackboard will be used to collect submissions and keep track of grades.

Cool example: statistics + darts

A screenshot from the darts R package.

Statistical tools to analyze your darts game!

This class in a nutshell: functional programming

Two basic types of things, or objects: data and functions

A function is a machine which turns input objects, or arguments, into an output object, or a return value (possibly with side effects), according to a definite rule

Programming is writing functions to transform inputs into outputs

Good programming ensures the transformation is done easily and correctly

Machines are made out of machines; functions are made out of functions, like \(f(a,b) = a^2 + b^2\)

The route to good programming is to take a big transformation and break it down into smaller ones, and then break those down, until you come to tasks which are easy (using built-in functions)
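As a minimal sketch of this idea, the function \(f(a,b) = a^2 + b^2\) above can be built from a smaller function, and then reused inside a bigger one. The names square, sum.of.squares and hypotenuse are hypothetical, chosen only for illustration, and the function() syntax has not yet been introduced in these notes:

```r
# Smallest building block: square a number (hypothetical helper)
square = function(v) {
  v^2
}

# f(a,b) = a^2 + b^2, made out of the smaller function
sum.of.squares = function(a, b) {
  square(a) + square(b)
}

# A bigger transformation, made out of the ones above and the built-in sqrt()
hypotenuse = function(a, b) {
  sqrt(sum.of.squares(a, b))
}

sum.of.squares(3, 4)   # 25
hypotenuse(3, 4)       # 5
```

Each piece is easy to check on its own, which is the point of breaking the big transformation down.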

Before functions, data

At base level, all data can be represented in binary format, by bits (i.e., TRUE/FALSE, YES/NO, 1/0). Some basic data types:
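One way to see a few of R's basic types is to ask R directly with typeof(), a function discussed later in these notes. This is only a quick sketch; the values are arbitrary examples:

```r
typeof(TRUE)      # "logical": a Boolean
typeof(7L)        # "integer": the L suffix asks for an integer
typeof(7)         # "double": a floating-point number
typeof("seven")   # "character": a string
typeof(NA)        # "logical": the default kind of missing value
typeof(NaN)       # "double": an ill-defined number
```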

Operators

Examples:

-7
## [1] -7
7+5
## [1] 12
7-5
## [1] 2
7*5
## [1] 35
7^5
## [1] 16807
7/5
## [1] 1.4
7 %% 5
## [1] 2
7 %/% 5
## [1] 1
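The last two operators, %% (remainder) and %/% (integer division), fit together: multiplying the integer quotient by the divisor and adding the remainder gives back the original number. A short sketch, including a negative example where the results may be less obvious:

```r
5 * (7 %/% 5) + (7 %% 5)   # 7: quotient times divisor, plus remainder

# With a negative number, R rounds the quotient down,
# so the remainder takes the sign of the divisor
(-7) %/% 5                 # -2
(-7) %% 5                  # 3
5 * ((-7) %/% 5) + ((-7) %% 5)   # -7
```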

The R console

Basic interaction with R is by typing in the console, i.e., terminal or command line

You type in commands, R gives back answers (or errors)

Menus and other graphical interfaces are extras built on top of the console

Comparison operators

These are also binary operators; they take two objects, like numbers, and give back a Boolean. Examples:

7 > 5
## [1] TRUE
7 < 5
## [1] FALSE
7 >= 7
## [1] TRUE
7 <= 5
## [1] FALSE
7 == 5
## [1] FALSE
7 != 5
## [1] TRUE

(Note: == is a comparison operator, = is not!)

Logical operators

The basic ones are & (and) and | (or). Examples:

(5 > 7) & (6*7 == 42)
## [1] FALSE
(5 > 7) | (6*7 == 42)
## [1] TRUE
(5 > 7) | (6*7 == 42) & (0 != 0)
## [1] FALSE
(5 > 7) | (6*7 == 42) & (0 != 0) | (9-8 >= 0)
## [1] TRUE

(The double forms && and || are different! We will see them later)
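The last two examples above depend on the order in which & and | are evaluated: in R, & binds more tightly than |, much as multiplication binds more tightly than addition. A small sketch of the difference that parentheses make:

```r
TRUE | TRUE & FALSE      # TRUE: evaluated as TRUE | (TRUE & FALSE)
(TRUE | TRUE) & FALSE    # FALSE: the parentheses force | to go first
```

When in doubt, adding parentheses makes the intended order explicit.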

More types

The typeof() function returns the data type

is.foo() functions return Booleans for whether the argument is of type foo

as.foo() (tries to) “cast” its argument to type foo, to translate it sensibly into such a value

typeof(7)
## [1] "double"
is.numeric(7)
## [1] TRUE
is.na(7)
## [1] FALSE
is.na(7/0)
## [1] FALSE
is.na(0/0)
## [1] TRUE

(Why is 7/0 not NA, but 0/0 is?)
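One way to explore this question is to look at the values themselves. The sketch below uses is.nan() and is.finite(), base R functions not otherwise covered in these notes:

```r
7/0              # Inf: infinite, but still a well-defined value
-7/0             # -Inf
0/0              # NaN: "not a number", an ill-defined result

is.na(7/0)       # FALSE
is.finite(7/0)   # FALSE
is.na(0/0)       # TRUE: NaN counts as missing
is.nan(0/0)      # TRUE
is.nan(NA)       # FALSE: NA is missing, but it is not NaN
```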

is.character(7)
## [1] FALSE
is.character("7")
## [1] TRUE
is.character("seven")
## [1] TRUE
is.na("seven")
## [1] FALSE
as.character(5/6)
## [1] "0.833333333333333"
as.numeric(as.character(5/6))
## [1] 0.8333333
6*as.numeric(as.character(5/6))
## [1] 5
5/6 == as.numeric(as.character(5/6))
## [1] FALSE

(Why is the last evaluation FALSE?)
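A hint for this question: printing more digits than the default shows what the round trip through a character string keeps and what it drops. This is a sketch; the variable names five.sixths and round.trip are made up for illustration:

```r
five.sixths = 5/6
round.trip = as.numeric(as.character(five.sixths))

print(five.sixths, digits=17)   # the stored double, shown to 17 digits
print(round.trip, digits=17)    # the value recovered from the string

five.sixths == round.trip            # FALSE
five.sixths - round.trip             # tiny, but not zero
all.equal(five.sixths, round.trip)   # TRUE: equal up to a small tolerance
```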

Data can have names

We can give names to data objects; these give us variables

A few variables are built in:

pi
## [1] 3.141593

Variables can be arguments to functions or operators, just like constants:

pi*10
## [1] 31.41593
cos(pi)
## [1] -1

Create variables with the assignment operator, <- or =, as in:

approx.pi = 22/7
approx.pi
## [1] 3.142857
diameter = 10
approx.pi * diameter
## [1] 31.42857

The assignment operator also changes values:

circumference = approx.pi * diameter
circumference
## [1] 31.42857
circumference = 30
circumference
## [1] 30

The code you write will be built out of variables. Give them descriptive names: this makes the code easier to design, easier to debug, easier to improve, and easier for others to read

Avoid “magic constants”; use named variables (you will be graded on this!)

Named variables are a first step towards abstraction
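To illustrate the difference, here is a sketch of a unit conversion written first with magic constants (commented out) and then with named variables. The quantities and variable names are hypothetical, invented for this example:

```r
# With magic constants, the reader has to guess what 12 and 2.54 mean:
# (5*12 + 9) * 2.54

# With named variables, each number explains itself
inches.per.foot = 12
cm.per.inch = 2.54
height.feet = 5
height.extra.inches = 9

height.inches = height.feet * inches.per.foot + height.extra.inches
height.cm = height.inches * cm.per.inch
height.cm   # 175.26
```

If a conversion factor or a measurement changes, it only needs to be changed in one place.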

The R workspace

What variables have you defined?

ls()
## [1] "approx.pi"     "circumference" "diameter"      "name"         
## [5] "places"        "weekly.hours"  "x"             "y"

Getting rid of variables:

rm("circumference")
ls()
## [1] "approx.pi"    "diameter"     "name"         "places"      
## [5] "weekly.hours" "x"            "y"
rm(list=ls()) # Be warned! This erases everything
ls()
## character(0)

First data structure: vectors

A data structure is a grouping of related data values into an object

A vector is a sequence of values, all of the same type, as in:

x = c(7, 8, 10, 45)
x
## [1]  7  8 10 45
is.vector(x)
## [1] TRUE

The c() function returns a vector containing all its arguments in the specified order

x[1] is the first element, x[4] is the fourth element, and x[-4] is a vector containing all but the fourth element

vector(length=n) returns an empty vector of length n (by default a Boolean vector, which prints as FALSE values); helpful for filling things up later

weekly.hours = vector(length=5)
weekly.hours
## [1] FALSE FALSE FALSE FALSE FALSE
weekly.hours[5] = 8
weekly.hours
## [1] 0 0 0 0 8
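Notice that the empty vector above started out as Booleans and was converted to numbers once a number was stored in it. If the type is known in advance, the mode argument of vector() can ask for it directly. A minimal sketch:

```r
typeof(vector(length=5))    # "logical": the default mode

# Ask for a numeric vector from the start
weekly.hours = vector(mode="numeric", length=5)
weekly.hours                # 0 0 0 0 0
typeof(weekly.hours)        # "double"

weekly.hours[5] = 8
weekly.hours                # 0 0 0 0 8
```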

Vector arithmetic

Arithmetic operators apply to vectors in a “componentwise” fashion:

y = c(-7, -8, -10, -45)
x+y
## [1] 0 0 0 0
x*y
## [1]   -49   -64  -100 -2025

Recycling

Recycling repeats the elements of the shorter vector when it is combined with a longer one. Example:

x + c(-7,-8)
## [1]  0  0  3 37
x^c(1,0,-1,0.5)
## [1] 7.000000 1.000000 0.100000 6.708204

Single numbers are vectors of length 1 for purposes of recycling:

2*x
## [1] 14 16 20 90
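Recycling also happens when the longer length is not a multiple of the shorter one, but then R issues a warning, since this is often a sign of a mistake. A short sketch, using the same x as above (redefined here so the example stands alone):

```r
x = c(7, 8, 10, 45)

# Length 2 divides length 4: c(1, 2) is silently recycled to c(1, 2, 1, 2)
even = x + c(1, 2)
even      # 8 10 11 47

# Length 3 does not divide length 4: c(1, 2, 3) is recycled to c(1, 2, 3, 1),
# and R warns that the lengths do not match up
uneven = x + c(1, 2, 3)
uneven    # 8 10 13 46
```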

Can also do componentwise comparisons with vectors:

x > 9
## [1] FALSE FALSE  TRUE  TRUE

(Note: this returns a Boolean vector)

Logical operators work elementwise:

(x > 9) & (x < 20)
## [1] FALSE FALSE  TRUE FALSE

To compare whole vectors, best to use identical() or all.equal():

x == -y
## [1] TRUE TRUE TRUE TRUE
identical(x,-y)
## [1] TRUE
identical(c(0.5-0.3,0.3-0.1),c(0.3-0.1,0.5-0.3))
## [1] FALSE
all.equal(c(0.5-0.3,0.3-0.1),c(0.3-0.1,0.5-0.3))
## [1] TRUE
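One caveat when using all.equal() as a test: when its arguments differ, it returns a character string describing the difference rather than FALSE. Wrapping it in isTRUE() gives a plain Boolean. A sketch, with made-up variable names a and b:

```r
a = c(0.5-0.3, 0.3-0.1)
b = c(0.3-0.1, 0.5-0.3)

all(a == b)                  # FALSE: exact comparison fails
isTRUE(all.equal(a, b))      # TRUE: equal up to a small tolerance

all.equal(1, 1.1)            # a message describing the difference, not FALSE
isTRUE(all.equal(1, 1.1))    # FALSE
```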

Functions on vectors

Lots of functions can take vectors as arguments:
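A few examples of built-in functions applied to the vector x from above (redefined here so the example stands alone). This is just a sample, not a complete list:

```r
x = c(7, 8, 10, 45)

length(x)                  # 4
sum(x)                     # 70
mean(x)                    # 17.5
median(x)                  # 9
max(x)                     # 45
min(x)                     # 7
sort(x, decreasing=TRUE)   # 45 10 8 7
any(x > 9)                 # TRUE: at least one element exceeds 9
all(x > 9)                 # FALSE: not every element exceeds 9
```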

Indexing vectors

Vector of indices:

x[c(2,4)]
## [1]  8 45

Vector of negative indices:

x[c(-1,-3)]
## [1]  8 45

(Why that, and not 8 10?)
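A sketch that may help with this question: a negative index removes the element at that position, rather than selecting it. The vector x is redefined here so the example stands alone:

```r
x = c(7, 8, 10, 45)

x[-1]          # 8 10 45: everything except element 1
x[c(-1, -3)]   # 8 45: everything except elements 1 and 3
x[-c(1, 3)]    # 8 45: the same thing, written more compactly
x[c(2, 4)]     # 8 45: the positions that are left over
```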

Boolean vector:

x[x>9]
## [1] 10 45
y[x>9]
## [1] -10 -45

which() gives the indices of the elements of a Boolean vector that are TRUE:

places = which(x > 9)
places
## [1] 3 4
y[places]
## [1] -10 -45

Named components

You can give names to elements or components of vectors, and index vectors accordingly

names(x) = c("v1","v2","v3","fred")
names(x)
## [1] "v1"   "v2"   "v3"   "fred"
x[c("fred","v1")]
## fred   v1 
##   45    7

(Here R prints the labels; they are not actual components of x)

names(x) is just another vector (of characters):

names(y) = names(x)
sort(names(x))
## [1] "fred" "v1"   "v2"   "v3"
which(names(x)=="fred")
## [1] 4
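Names can also be given at the moment a vector is created, and stripped off again when only the values are wanted. A small sketch; the vector name scores is hypothetical, and unname() and %in% are base R tools not otherwise used in these notes:

```r
# Name the components directly inside c()
scores = c(v1=7, v2=8, v3=10, fred=45)
names(scores)               # "v1" "v2" "v3" "fred"

scores["fred"]              # the component, printed with its label
scores[["fred"]]            # 45: just the value, without the label
unname(scores)              # 7 8 10 45: all values, labels removed
"fred" %in% names(scores)   # TRUE: is there a component with this name?
```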

Summary

Peculiarities of floating-point numbers

The more bits in the fraction part, the more precision

The R floating-point data type is a double (what R calls numeric). The name dates from when memory was expensive: the now-standard number of bits was twice the default

Finite precision \(\Rightarrow\) arithmetic on doubles \(\not=\) arithmetic on \(\mathbb{R}\).

0.45 == 3*0.15
## [1] FALSE
0.45 - 3*0.15
## [1] 5.551115e-17

Often ignorable, but not always

(0.5 - 0.3) == (0.3 - 0.1)
## [1] FALSE
all.equal(0.5-0.3, 0.3-0.1)
## [1] TRUE
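To see where these discrepancies come from, it can help to print more digits than R shows by default and to compare the error with the machine precision. The sketch below uses sprintf() and .Machine$double.eps, both part of base R but not otherwise covered in these notes:

```r
# More decimal places than the default printing shows
sprintf("%.20f", 0.45)
sprintf("%.20f", 3*0.15)

# Roughly the smallest relative difference a double can resolve near 1
.Machine$double.eps                         # about 2.2e-16

# The discrepancy is below that scale, so it is pure rounding error
abs(0.45 - 3*0.15) < .Machine$double.eps    # TRUE
```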

Peculiarities of integers

Typing a whole number in the terminal doesn’t make it an integer; it makes it a double, whose fractional part is 0

is.integer(7)
## [1] FALSE
as.integer(7)
## [1] 7

To test for being an integer in the mathematical sense, use round():

round(7) == 7
## [1] TRUE
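To get a true integer, add the L suffix or convert with as.integer(). The round() test can also be wrapped in a small function; the name is.whole below is a hypothetical helper, not a built-in:

```r
is.integer(7L)           # TRUE: the L suffix makes an integer
typeof(7L)               # "integer"
typeof(as.integer(7))    # "integer"
typeof(7)                # "double"

# Hypothetical helper: is the value a whole number in the mathematical sense?
is.whole = function(v) {
  round(v) == v
}
is.whole(7)              # TRUE
is.whole(7.5)            # FALSE

.Machine$integer.max     # 2147483647: the largest value an integer can hold
```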