MATH11176 Statistical Programming
Table of Contents
- 1. General
- 2. R Basics
- 3. Tidyverse
- 4. Optimization
- 5. Computer intensive methods
1 General
1.1 Course Information
| Lecturer | Ioannis Papastathopoulos | |
|---|---|---|
| Office hour | Mon 15.00 - 16.00 | JCMB 4602 (TBC) |
| Lectures | Mon 16.10 - 18.00 | Joseph Black Building LTh 100 |
| | Mon 16.10 - 18.00 | Ashworth Lab LTh 1 (Weeks 5, 6) |
| Labs | Thu 16.10 - 18.00 | JCMB 3210 & 5205 |
1.2 Recommended reading
| References (learning R) | Source | Level |
|---|---|---|
| Wickham, H. (2014). Advanced R. CRC Press | book, companion | intermediate-advanced |
| Grolemund, G. and Wickham, H. (2017). R for Data Science. O'Reilly | book | beginner-intermediate |
| Wood, S. N. (2015). Core Statistics. Cambridge Univ. Press | book, library | beginner-intermediate |
| Crawley, M. (2012). The R Book. Wiley | book, library | beginner |
| Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S-plus. Springer | library | intermediate-advanced |

| References (Optimization) | Source |
|---|---|
| Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer | library |
| Givens, G. H. and Hoeting, J. A. (2012). Computational Statistics. Wiley | library |
2 R Basics
2.1 Getting started
Statistical analysis of data sets is conducted using computers, and various specialist computer programs are available to facilitate statistical work. For applying general statistical theory directly to custom-built models, R is probably the most useful of these because of its flexibility. R is a programming language and environment designed for statistical analysis, written and maintained by a community of statisticians.
A major design feature is extendibility: R makes it straightforward to code up statistical methods in a way that is easy to distribute and for others to use. The first place to look for information on getting started with R is the CRAN website.
2.1.1 Installing R and using an editor to write code
Windows
- Download R from the CRAN website
- Run the `.exe` file that was just downloaded
- Go to the RStudio download page
- Under Installers, select RStudio x.yy.zzz - Windows XP/Vista/7/8 (where x, y, and z represent version numbers)
- Double click the file to install it
- Once it's installed, open RStudio to make sure it works and you don't get any error messages.
macOS
- Download R from the CRAN website
- Select the `.pkg` file for the latest R version
- Double click on the downloaded file to install R
- It is also a good idea to install XQuartz (needed by some packages)
- Go to the RStudio download page
- Under Installers, select RStudio x.yy.zzz - Mac OS X 10.6+ (64-bit) (where x, y, and z represent version numbers)
- Double click the file to install RStudio
- Once it's installed, open RStudio to make sure it works and you don't get any error messages.
GNU/Linux
- Follow the instructions for your distribution from CRAN; they provide information to get the most recent version of R for common distributions. For most distributions you could use your package manager (e.g., for Debian/Ubuntu run `sudo apt-get install r-base`, and for Fedora `sudo yum install R`), but we don't recommend this approach, as the versions provided this way are usually out of date. See Darren Wilkinson's research blog post for installing the latest version of R. In any case, make sure you have at least R 3.3.1.
- Go to the RStudio download page
- Under Installers, select the version that matches your distribution, and install it with your preferred method (e.g., with Debian/Ubuntu, `sudo dpkg -i rstudio-x.yy.zzz-amd64.deb` at the terminal).
- Once it's installed, open RStudio to make sure it works and you don't get any error messages.
2.1.2 Adding Comments to Code
You should always write a brief commentary on what your code means, so that when you come to look at it after a period of time it still makes sense. You do this by preceding your comments with a `#`.
For example, you could send all of the below code to R
##-------------------------------------------
## It is generally useful to have general
## headings outlining the code that follows
##-------------------------------------------

## This line will not be run by R
3+2
## but the above line will!

## comments can be used to explain what your code does
## the line below adds together 5 and 6
5+6
but:
## the line below will give an error message
because there is no '#' at the beginning
## so *R* tries to run it, but "doesn't understand"
You should save your script files often! They may be useful in the future, as you may wish to re-use your code. You can open as many separate scripts as you like.
2.1.3 Simple R Arithmetic
At its most basic level, R can be used as a calculator. For example:
1+2+3  # Type this
[1] 6  # *R* gives you the answer, labelled with a [1]
Besides the basic arithmetic operators, R contains a number of arithmetic functions. Some of these are
| R Function | Description |
|---|---|
| `sqrt(x)` | Compute the square root of \(x\) |
| `abs(x)` | Compute the absolute value of \(x\) |
| `sin(x), cos(x), tan(x)` | Trigonometric functions |
| `asin(x), acos(x), atan(x)` | Inverse trigonometric functions |
| `exp(x), log(x)` | Exponential and natural log of \(x\) |
| `gamma(x), lgamma(x)` | The gamma function and its log |
| `round(x)` | Round/truncate a number |
Exercise: Evaluate the following expressions:
\begin{eqnarray*}
1)&&11+22+33+44+55+66+77\\
2)&&11\times22\times33\times44\times55\times66\\
3)&&1-\exp(-1/2)\\
4)&&\frac{1}{\sqrt{2\pi\times 3}}\exp\left\{-\frac{1}{2}\frac{(1-2)^2}{3}\right\}\\
\end{eqnarray*}

2.2 An Introduction to R Objects
When you start R two important things are created: a command prompt at which to type commands telling R what to do, and an environment, known interchangeably as the global environment or user workspace, to hold the objects created by your commands. Unlike the command prompt, you don't see the global environment directly, but it's there as an extendable chunk of computer memory for holding your data, commands and other objects. Generically, in R, an "environment" consists of two things. The first, known in R jargon as a frame, is a set of symbols used to refer to objects, along with the data defining those objects. The second is a pointer to an enclosing environment.
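The two-part structure (a frame of symbols plus an enclosing environment) can be poked at directly. A minimal sketch; the object names `e` and `a` are just illustrative:

```r
e <- new.env()              # a fresh environment; its frame starts empty
assign("a", 42, envir = e)  # bind the symbol 'a' to the value 42 in e's frame
get("a", envir = e)         # look the symbol up in e: 42
ls(e)                       # the symbols currently in e's frame: "a"
parent.env(e)               # e's pointer to its enclosing environment
```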
Everything in R is an object living in an environment, including R commands themselves. In R, objects can store data, and perform functions on data. In this section, we will introduce some different types of data objects, and show you how to store data in them. Besides the data it contains, one of the most important aspects of an object is its name. Always try to give your objects meaningful names, that way you won't forget what they contain!
Try typing the following to store the value 10 in object `x`:

x <- 10
You can retrieve the value by typing the object name
x
The `<-` symbol can be read as the word assigns (assignment symbol). So in English we can read the above command as "assign the value 10 to the object `x`", and when we type `x` we are essentially saying "what is `x`?".
Objects can also store the results of operations on other objects
y <- x <- 3 * 4 + 20
Objects in R carry around information about the basic type of thing that they are made up of. Somewhat confusingly, they carry around three different classifications of the sort of basic thing they are made up of: their type, mode and storage mode. The following illustrates this by creating the vector \((1, 2, 3, 4)\) and examining its type, mode and storage mode:
z <- 1:4  # create vector (1, 2, 3, 4)
typeof(z)
[1] "integer"
mode(z)
[1] "numeric"
storage.mode(z)
[1] "integer"
For the most part it is not necessary to worry much about the modes and type of an object: for example the conversion between real and integer numbers is automatic and need seldom concern the programmer. The exception is when calling code written in other languages from R, when it is essential to know the storage mode of data being passed to the external code.
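As a small illustration of the automatic conversion (and of forcing a storage mode by hand, which is what matters when passing data to external code):

```r
z <- 1:4
storage.mode(z)               # starts life as "integer"
z[1] <- 2.5                   # assigning a real number silently promotes the vector
storage.mode(z)               # now "double"
storage.mode(z) <- "integer"  # force it back: 2.5 is truncated to 2
z                             # 2 2 3 4
```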
Objects can also carry a variety of extra information as attributes. Attributes have a name (they can be thought of as a named list with unique names), and can be an object of any type. They behave rather like a whole bunch of notes stuck onto the object and carried around with it. The `attributes` function lets you access all the attributes of an object (as a list), while the `attr` function allows individual attributes to be set and extracted.
As an example, let's give the vector `z`, above, an attribute consisting of a 2 by 2 matrix (and then print it):

attr(z,"mat") <- matrix(1:4,2,2)
attr(z,"mat")
     [,1] [,2]
[1,]    1    3
[2,]    2    4
You can add as many attributes as you like, and they are used by R itself in many ways, including in the implementation of matrices and higher-dimensional arrays. They are also used in R's basic object orientation mechanism to implement the notion of the class of an object.
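For instance, a matrix really is just a vector carrying a `dim` attribute; setting that attribute by hand is enough for R to treat the object as a matrix:

```r
v <- 1:6
attr(v, "dim") <- c(2, 3)  # give v a 2 x 3 'dim' attribute
is.matrix(v)               # TRUE: R now treats v as a matrix
attributes(v)              # the attribute list contains $dim: 2 3
v[2, 3]                    # matrix subscripting works: element 6
```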
Object names in R can only contain letters, numbers, underscores `_` or full stops `.`, with the condition that names can only begin with an alphabetic character. E.g., `Temp` and `high.5` are valid, whereas `7up`, `sample-mean`, and `results!` are not.
Note that R is case sensitive, i.e., `myObject` and `myobject` are different objects to R. Try storing different values in scalar objects of different names.
Objects can be overwritten, R will remember the last thing you stored. Take care with this when defining objects that you might want to use again later!
Make sure you have written any examples you have tried out in your script file — they will be useful to refer to later.
2.2.1 Vectors and vector arithmetic
Vectors in R are the same as vectors in mathematics and constitute the basic data structure in R. They come in two flavours: atomic vectors and lists. They have three common properties
- Type, `typeof()`: what it is.
- Length, `length()`: how many elements it contains.
- Attributes, `attributes()`: additional arbitrary metadata.
They differ in the types of their elements: all elements of an atomic vector must be of the same type, whereas the elements of a list can have different types. Try creating a short vector:
myVector <- c(1,2,3,4)
We use the concatenate `c()` command to tell R that we want our values to be represented as a vector. What happens if you now type `myVector`?
Vectors may be combined with scalars and/or with other vectors in arithmetic operations and may also be supplied as arguments to functions, e.g.,
c(1,2,4)+2
[1] 3 4 6
c(1,2,4)*c(3,3,2)
[1] 3 6 8
sqrt(c(1,2,3,4,5))
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
The recycling rule: When working with vectors we often want to perform operations involving a scalar and a vector (such as multiplying all the elements of a vector by 2). In R, scalars are simply vectors of length 1; there is nothing special about them. Instead, R has a recycling rule, a vector-oriented generalization of what happens when a scalar is multiplied by a vector. The recycling rule states that when two vectors of different length appear in an operation, the shorter is replicated as many times as is necessary to produce a vector of the length of the longer vector, and this recycled vector is what is used in the computation.
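Two quick illustrations of recycling (a recycled scalar, and a recycled shorter vector):

```r
c(1, 2, 3, 4, 5, 6) * 2          # the scalar 2 is recycled six times:
                                 # 2 4 6 8 10 12
c(1, 2, 3, 4, 5, 6) + c(10, 20)  # c(10, 20) is recycled three times:
                                 # 11 22 13 24 15 26
```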
Exercise: What will be printed after running
c(1, 2, 3) + c(1, 2)
in the R console?
Exercise: The following table gives the frequencies of the number of matches per box produced by Burnam's match factory on 1st February 2017.
Number of matches | Frequency |
---|---|
45 | 5 |
46 | 4 |
47 | 8 |
48 | 5 |
49 | 12 |
50 | 16 |
51 | 19 |
52 | 18 |
53 | 14 |
54 | 10 |
55 | 7 |
Define a vector `matchNum` of length 11 by typing in the number of matches per box. Also define a vector `matchFreq` which holds the frequencies of matches per box. Make sure the entries of `matchFreq` correspond to the relevant entries of `matchNum`. Try typing `length(matchNum)` and `length(matchFreq)` to ensure you didn't miss any!
Setting up these vectors is all very well, but what can we do with them? Suppose we want to know how many matchsticks were produced by Burnam's on 1st February?
R provides an easy way to multiply vectors pointwise:
matchFreq * matchNum
Does this return a scalar or a vector? Store the result of this command in another object.
You can also perform mathematical operations on vectors. Here we divide `matchFreq` by its smallest value (try `?min`):

matchFreq/min(matchFreq)

Data manipulation in R is vector based. That is, wherever possible we work with whole vectors, rather than individual elements of vectors, since the former is much more efficient in an interpreted language. So standard operators and mathematical functions in R are defined in a vector-oriented manner.
Exercise: Using the R function sum()
, we
can sum the elements of a vector. Use the online help system to help
you find out how to use sum()
, and calculate and store the
number of matchsticks made on 1st February 2017 at Burnam's in an
object.
R allows us to create vectors of sequences easily. Experiment with the following:
mySeq <- 1:10
myReps <- rep(4,5)
mySeq <- seq(0,10,by=0.66)
mySeq <- seq(0,10,length=13)
Look up the online help for rep()
and seq()
to see their syntax and understand what role each value plays in the above examples.
Exercise:
- Construct the `matchNum` vector you created earlier by using the method for `mySeq` above.
- Create a vector that repeats 10 times a sequence of integers between 1 and 5.
- Evaluate the square root of all the odd numbers between 1 and 25.
- The function
sum()
adds the numbers in the vector supplied. For example
sum(c(1,2,4,9))
[1] 16
Find the sum of the numbers from 1 to 50. Check your answer (again using R) against the formula \[ S = n\,(n + 1)/2. \] Now find the sum of the reciprocals of the first fifty integers. There is no simple formula to check this but your answer should be about \(\log(50.5) - \log (0.5)\) (why?).
2.2.2 Matrices arithmetic and efficiency
Matrices are 2-dimensional arrays and are of class `matrix`. They are treated as a separate class to facilitate numerical linear algebra with matrices. Matrices in R are the same as matrices in mathematics. Try creating a small matrix:
myMatrix <- matrix(c(1,2,3,4,5,6), nrow=3, ncol=2, byrow=FALSE)
myMatrix
class(myMatrix)
Notice that we use a vector to tell R what the elements are in a matrix.
The `byrow` argument tells R whether to fill the matrix row-wise or column-wise.
Exercise: Experiment with creating matrices. What happens if you change `byrow` to `TRUE`? What happens if you omit the `byrow` argument? Try the following, and explain the output you see:
myMatrix1 <- matrix(c(1,2,3), nrow=3, ncol=2, byrow=FALSE)
myMatrix2 <- matrix(c(1,2,3,4,5,6,7), nrow=3, ncol=2, byrow=FALSE)
myMatrix3 <- matrix(c(1,2,3,4,5,6,7,8), ncol=2, byrow=FALSE)
myMatrix4 <- matrix(c(1,2,3,4,5,6),2,3, FALSE)
Experiment with other settings, and look at ?matrix
. It is important to understand how R deals with ambiguous situations.
This can save a lot of frustration!
Vectorized elementwise operations mean that `A*B` does not perform matrix multiplication and `A/B` does not produce \(AB^{-1}\). Instead, these and other matrix operations have special functions and operators. The operator for matrix multiplication is `%*%`.
Example:
a <- matrix(c(2,3,4,1,5,6), nrow=2, ncol=3)
b <- matrix(1:12, nrow=3, ncol=4)
c <- matrix(6:1, nrow=2, ncol=3, byrow=TRUE)
a%*%b
Exercise: What happens if you try `b%*%a`? Why? Also try `2*a`, `a*b` and `a*c`, and explain the output in each case.
What happens if you try addition instead of multiplication in these examples?
The amazing combination of speed and accuracy apparent in many computer algorithms and numerical methods can lead to a false sense of security and the wrong impression that in translating mathematical and statistical ideas into computer code we can simply treat the numerical methods and algorithms we use as "infallible boxes". This is not the case, consider the following example.
The `system.time` function in R measures how long a particular operation takes to execute. First generate two \(2000 \times 2000\) matrices, \(A\) and \(B\), and a \(2000\times 1\) vector \(y\).
n <- 2000
A <- matrix(runif(n*n),nrow=n,ncol=n)
B <- matrix(runif(n*n),nrow=n,ncol=n)
y <- runif(n)
Now form the product \(A B y\) in two slightly different ways
system.time(f0 <- (A%*%B)%*%y)
   user  system elapsed
  6.150   0.018   6.184
system.time(f1 <- A%*%(B%*%y))
   user  system elapsed
  0.015   0.000   0.015
`f0` and `f1` are identical, to machine precision, so why is one calculation so much quicker than the other? The answer is all to do with how the multiplications were ordered in the two cases, and the number of floating point operations (flops) required by the two alternative orderings:
- In the first case \(AB\) was formed first, and the resulting matrix was used to pre-multiply the vector \(y\).
- In the second case, the vector \(By\) was formed first, and was then pre-multiplied by \(A\).
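The "identical to machine precision" claim can be checked directly. A self-contained sketch (a smaller `n` than above is used here purely to keep the run quick):

```r
n <- 500
A <- matrix(runif(n * n), n, n)
B <- matrix(runif(n * n), n, n)
y <- runif(n)
f0 <- (A %*% B) %*% y  # matrix-matrix product first
f1 <- A %*% (B %*% y)  # matrix-vector product first
max(abs(f0 - f1)) / max(abs(f0))  # relative difference: tiny, near machine epsilon
```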
Now we need to count the floating point operations.
Exercise:
- How many flops \((+,\, -,\, *,\, /)\) are involved in the multiplication
\[ \left(\begin{matrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \\ \end{matrix}\right) \left(\begin{matrix} y_{1} \\ y_{2} \\ y_{3} \\ \end{matrix}\right) \]
- How many flops are involved in the multiplication
\[ \left(\begin{matrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ \end{matrix}\right) \left(\begin{matrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \\ \end{matrix}\right) \]
- Now compute the number of operations required for the two cases above.
2.2.3 Characters and factors
R can store character data in the same way as we have been storing numeric data. Character data is enclosed in single quote marks ' ' or double quote marks " ". Use either, as long as they match. Examples:
c1 <- "Hello"                    # A scalar character string
c1
[1] "Hello"
c2 <- c("Yes","Maybe","No")      # A vector of characters
c2
[1] "Yes"   "Maybe" "No"
c3 <- c("Is","Could Be","Isn't") # Note single quote inside doubles
You can also use some of the functions we used to construct numeric vectors with character data:
c3 <- rep("Monkey",4)  # Using rep()
c3
[1] "Monkey" "Monkey" "Monkey" "Monkey"
Of course, don't expect to be able to use numeric functions such as sum()
on characters!
Factors are, conceptually, vectors of labels that serve to group other
data. They have a special place in statistical modelling, and as such
require special handling. In R, factors have class factor
and another attribute levels
, which is a vector of the
unique labels occurring in the factor object. If you print a factor
variable, then what is printed is the label given in each element of
the vector. However, what is actually stored is a set of integers,
indexing the "levels"
attribute, and it is the print
function that is actually doing the conversion from stored integer to
corresponding label.
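You can see the stored integer codes for yourself:

```r
f <- factor(c("M", "F", "F", "M"))
levels(f)      # "F" "M" -- the unique labels, sorted alphabetically by default
as.integer(f)  # 2 1 1 2 -- the integer codes actually stored
unclass(f)     # the codes together with their "levels" attribute
```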
Suppose you have data on some people and you want to store their
sex. You could use a numeric code — 0 for male, 1 for female, or you
could use a character code — 'M'
for male and 'F'
for female. But
you should really use a factor.
Factors are easy to construct from character vectors. Use the function as.factor()
with a character vector:
c5 <- c("M","F","F","F","M","M")
f5 <- as.factor(c5)
f5
[1] M F F F M M
Levels: F M
Notice that a factor gets printed out like a character vector, but without the quote marks. The levels are also printed out.
The categories in a factor can also be seen by using the levels()
function:
levels(f5)
[1] "F" "M"   # The result is a character vector
And you can change these levels easily by assigning a new character vector to the levels:
levels(f5) <- c("Female","Male")
f5
[1] Male   Female Female Female Male   Male
Levels: Female Male
One thing you can do with a factor is to tabulate the counts of each category with the table()
function:
table(f5)
Female   Male
     3      3   # 3 of each
Exercise: Here is a data set of some people's response to the questions
what is your favourite hot drink? | what is your favourite biscuit? |
---|---|
coffee | bourbon |
tea | rich tea |
chocolate | bourbon |
tea | pink wafer |
tea | pink wafer |
coffee | rich tea |
coffee | bourbon |
chocolate | jammie dodger |
chocolate | bourbon |
tea | bourbon |
tea | digestive |
tea | jammie dodger |
chocolate | jammie dodger |
coffee | pink wafer |
tea | rich tea |
tea | rich tea |
tea | digestive |
coffee | bourbon |
tea | digestive |
coffee | pink wafer |
Do not type this data in full. Instead, give each drink and each biscuit a number and make vectors of the numbers corresponding to the lists above (just type the numbers without quotes).
Now turn your vectors into factors with the right names and tabulate the levels.
Now use table(drinks,biscuits)
to cross-tabulate the drink and
biscuit preferences.
2.2.4 Subscripts
Suppose we want to extract a part of a vector. R allows us to do this using square brackets.
x <- 10:1
y <- x[2]
y
[1] 9
Here y
gets the value of the second element of x
.
Slicing more…
You can extract parts of a vector by subscripting with another vector.
y <- x[c(1,3,5)]
y
[1] 10 8 6
x[4:7]
[1] 7 6 5 4
Here the returned value is a vector.
Don't bite off more than you can chew…
If you try and subscript a vector with a number out of range…
x[12]
[1] NA
you get NA
. This is the symbol R uses for missing data.
Negative Subscripts
Negative subscripts? What's the -2nd element of a vector? R defines this as being all the vector except the 2nd element. This gives us an easy way of removing some elements from a vector.
x <- 10:1
x[-3]          # omit the third element
[1] 10  9  7  6  5  4  3  2  1
x[c(3,4,7)]
[1] 8 7 4
x[-c(3,4,7)]   # you can also exclude several elements simultaneously
[1] 10  9  6  5  3  2  1
x[c(-3,-4,-7)] # this gives the same answer,
               # think of -c(3,4,7) = (-1) * c(3,4,7)
You can't mix positive and negative subscripts. Why would you want to?
Some more examples
x <- 1:10
x[-length(x)]  # all but the last
[1] 1 2 3 4 5 6 7 8 9
x[length(x):1] # reverse x
[1] 10  9  8  7  6  5  4  3  2  1
There is a function to do this last one: rev()
- most of the simple things you may want to do tend to have their own functions!
Try using help.search()
to find more functions.
Exercise: R has an operator that performs the modulo operation: `%%`. For example, 8 modulo 2 is given by:

8 %% 2
[1] 0
Use this function, together with appropriate commands for sequences, subscripts and lengths of vectors to calculate the number of values between 1 and 1000 that have 13 as a factor.
Remember: for \(n\) to have \(x\) as a factor \(n\) modulo \(x\) must equal \(0\).
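As a warm-up with a different divisor (so the exercise itself stays open), the same idea of combining a sequence, `%%` and `sum()` looks like this:

```r
n <- 1:100
sum(n %% 7 == 0)  # how many of 1..100 have 7 as a factor: 14
```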
Exercise: Create a vector of all the years in the 21st century, then take out the leap years. (Let's assume the year 2000 was actually a leap year!)
Subscripting Matrices
Subscripting a matrix works in a very similar way to subscripting vectors, except there are now two dimensions to specify rather than one, so two subscripts are needed!
R always uses the convention myMat[row,column]
for subscripting.
myMat <- matrix(1:12,nrow=3,ncol=4) # create the matrix
myMat[2,3]   # return the entry in position [2,3] of the matrix
             # (2nd row, 3rd column)
myMat[3,]    # return the third row of the matrix
myMat[,1:3]  # return the first to third columns of the matrix
myMat[-2,-1] # return the matrix with the 2nd row and 1st column removed
VERY IMPORTANT: when subscripting matrices, if you end up returning a submatrix with either one row or one column, R will (annoyingly) convert it into a vector! A vector is NOT a \(1 \times n\) matrix, R treats them differently!
You can avoid this by subscripting with `drop=FALSE` (e.g. `myMat[3, , drop=FALSE]`), or by converting afterwards with the `as.matrix()` command (note that `as.matrix()` applied to a vector returns a column matrix, not a row).
For example:
a.vect <- myMat[3,]
a.mat <- as.matrix(myMat[3,])
These two need to be subscripted in different ways — have a go! In general, everything you learnt about subscripting vectors also works for matrices, but you do need to be careful with the dimensions!
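You can check the difference in dimensions directly with `dim()`:

```r
myMat <- matrix(1:12, nrow = 3, ncol = 4)
dim(myMat[3, ])                # NULL -- the single row collapsed to a plain vector
dim(myMat[3, , drop = FALSE])  # 1 4  -- drop=FALSE keeps the 1 x 4 matrix shape
dim(as.matrix(myMat[3, ]))     # 4 1  -- as.matrix() on a vector gives a column matrix
```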
2.2.5 Logical values
As we saw in the previous section, R enables you to compute with
Boolean, or Logical, variables. These take on either the value
TRUE
or FALSE
. You can store them in objects just as
you can numeric or character data. You can use conditional tests to
generate these values:
x <- 1:10
gt5 <- x > 5
gt5
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
x == 4
 [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
gt3 <- x >= 3
gt3
 [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
Note: `<=` means "less than or equal to", `>=` means "greater than or equal to".
Logical operators
You can do "and" and "or" operations on vectors with &
and |
:
x > 2 & x != 7
 [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
x < 2 | x > 7
 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
Note: !=
means "not equal to"
True and False as Numbers…
If you try and do anything numeric with logical values, R converts true to 1 and false to 0. This can be usefully employed to count the number of true values in a vector.
Examples:
x <- 1:10
y <- x > 3
y
 [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
y*2
 [1] 0 0 0 2 2 2 2 2 2 2
sum(y)
[1] 7
Logical Subscripts You can use a vector of logical TRUE/FALSE values as a subscript. This is extremely useful.
Example:
x <- c(6,5,6,4,4,3,4,2,3,4)
y <- c(5,3,4,2,6,5,4,5,4,3)  # couple of random vectors
xeq4 <- x == 4               # which of x equals 4?
xeq4
 [1] FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE
y[xeq4]                      # which of y correspond to x==4?
[1] 2 6 4 3
y[x == 4]                    # or simply...
[1] 2 6 4 3
Or you may just wish to return the elements of a vector that satisfy some rule, for example:
z <- y[y>3]  # put into z only those elements of y that are larger than 3
length(y)
length(z)
You can even apply this idea to a matrix:
xmat <- matrix(c(0,1,3,0,2,8,1,3,4,0,3,8,1,2,9),ncol=3,byrow=TRUE)
ymat <- xmat[xmat[,1]==0,]  # ymat contains only the rows of xmat
                            # whose entry in the first column is zero
Exercise: You can generate random \(\mbox{Uniform}(0,1)\)
numbers using the function runif()
(r
stands for random), e.g.,
RandomUniforms <- runif(10)
Create a vector containing 200 random numbers generated from the Uniform distribution. Create logical vectors indicating:
- which are larger than \(0.5\);
- which are between 0.2 and 0.7.
Now use sum()
to find how many of your random numbers were in
each of these categories.
Are you keeping your script files up to date??
2.2.6 Reading data in from files
First download the file class96.dat
from Learn and save it in your
working directory. The class96 dataset contains the heights and
weights of the students enrolled in GSSE401 in 1996. Let's get this
data into R:
myTable <- read.table("class96.dat")
myTable
Try the function class(myTable)
. What do you get?
Here, we have a data frame — a list of variables of the same length (each with a unique name).
They are arranged in a 2d structure like a matrix (and, indeed, they can be subscripted in exactly the same way).
2.2.7 Data frames and lists
A data frame is the most common way of storing data in R, and if used systematically
makes data analysis easier. Under the hood, a data frame is a list of equal-length vectors.
This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list.
This means that a data frame has names()
, colnames()
, and rownames()
, although names()
and colnames()
are the same thing. The length()
of a data frame is the length of the underlying
list and so is the same as ncol()
; nrow()
gives the number of rows.
Creation
You create a data frame using data.frame()
, which takes named vectors as input
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)
## 'data.frame':	3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3
Beware data.frame()
default behaviour which turns strings into factors.
Use stringsAsFactors = FALSE
to suppress this behaviour:
df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE)
str(df)
## 'data.frame':	3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"
Because a data.frame
is an S3 class, its type reflects the underlying vector used to build it:
the list. To check if an object is a data frame, use class()
or test explicitly with is.data.frame()
:
typeof(df)
## [1] "list"
class(df)
## [1] "data.frame"
is.data.frame(df)
## [1] TRUE
You can coerce an object to a data frame with as.data.frame()
:
- A vector will create a one-column data frame.
- A list will create one column for each element; it’s an error if they’re not all the same length.
- A matrix will create a data frame with the same number of columns and rows as the matrix.
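The three coercion rules above can be tried out directly:

```r
as.data.frame(1:3)                    # a vector: one-column data frame, 3 rows
as.data.frame(list(a = 1:2,
                   b = c("x", "y")))  # a list: one column per element
as.data.frame(matrix(1:6, nrow = 2))  # a matrix: 2 rows, 3 columns (V1, V2, V3)
```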
Combining data frames
You can combine data frames using cbind()
and rbind()
:
cbind(df, data.frame(z = 3:1))
##   x y z
## 1 1 a 3
## 2 2 b 2
## 3 3 c 1
rbind(df, data.frame(x = 10, y = "z"))
##    x y
## 1  1 a
## 2  2 b
## 3  3 c
## 4 10 z
When combining column-wise, the number of rows must match, but row names are ignored.
When combining row-wise, both the number and names of columns must match.
Use plyr::rbind.fill()
to combine data frames that don’t have the same columns.
It’s a common mistake to try and create a data frame by cbind()
-ing vectors together.
This doesn’t work because cbind()
will create a matrix unless one of the arguments is already a data frame.
Instead use data.frame()
directly:
bad <- data.frame(cbind(a = 1:2, b = c("a", "b")))
str(bad)
## 'data.frame':	2 obs. of  2 variables:
##  $ a: Factor w/ 2 levels "1","2": 1 2
##  $ b: Factor w/ 2 levels "a","b": 1 2
good <- data.frame(a = 1:2, b = c("a", "b"), stringsAsFactors = FALSE)
str(good)
## 'data.frame':	2 obs. of  2 variables:
##  $ a: int  1 2
##  $ b: chr  "a" "b"
The conversion rules for cbind()
are complicated and best avoided by ensuring all inputs are of the same type.
As we have seen, data frames can be thought of as matrices of data, where each column has a name, but not necessarily the same type, e.g., having a mixture of logical, numeric, factor and character columns is no problem. This format is a natural way of storing statistical datasets. Here is a small example
dat <- data.frame(y=5:7, lab=c("um","er","er"))
dat
  y lab
1 5  um
2 6  er
3 7  er
class(dat)
[1] "data.frame"
By default the character vector lab was converted to a factor variable. Data frames can be accessed like any other matrix, but also by variable name and as a list (see below).
Let's stick with the example we created above. We've already seen
that the class()
function tells us that R treats our data as
a data frame. This structure is used whenever we want to
group vectors of data which have corresponding rows. The
names()
function gives you the names of the vectors in the
data frame:
names(myTable)
[1] "V1" "V2" "V3" "V4"
Not very informative, huh? Why is this? Let's give the columns some useful names:
names(myTable) <- c("index","height","weight","sex")
myTable
You can refer to a column in the data frame in one of two ways. The first uses the column name and a dollar sign:
myTable$index
The second uses subscripting in the same way as a matrix. Recall:
myTable[,1]  # extract first column
myTable[1,]  # extract first row
Exercise:
- Sex is a factor with two levels, but R doesn't know this yet. Using what you learnt in Section Characters and factors and the above, turn the Sex column into a factor.
- Using the R functions
mean()
andsd()
, find the mean and standard deviation of both the height and weight columns.
- Using your knowledge of subscripts in R (Section Subscripts), make two new data frames for males and females separately. In the data, 1 represents females and 2 represents males. Hint: use logical subscripts!
Lists

Lists are the basic building blocks for all sorts of complicated objects in R. Lists consist of any number of numbered, and optionally named, items, each of which can be an object of any type, e.g.,
mylist <- list(a="Cupcake", 1:13, M=matrix(1,2,2))
mylist
$a
[1] "Cupcake"

[[2]]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13

$M
     [,1] [,2]
[1,]    1    1
[2,]    1    1
class(mylist)
[1] "list"
The components of a list are always numbered and may be referred to as
such. Elements can be accessed by number, starting from 1, using
double square brackets, e.g. mylist[[1]] accesses "Cupcake". If the item has a name then this provides an alternative access method, using $: since the first element is named a, mylist$a also accesses "Cupcake".
In addition to extracting elements from a list, sub-lists can be extracted too, using single square brackets. For example
mylist[1:2]
mylist[c(1,3)]
mylist[[1:2]] # this will give an error message
class(mylist)
An excellent visual explanation of list subsetting can be found in
http://r4ds.had.co.nz/lists.html
(the "Lists of condiments" section!)
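To see the difference between single- and double-bracket subsetting concretely, here is a minimal sketch using the list created above:

```r
mylist <- list(a = "Cupcake", 1:13, M = matrix(1, 2, 2))

sub  <- mylist[1]    # single brackets: a sub-list of length 1
elem <- mylist[[1]]  # double brackets: the element itself

class(sub)   # "list"
class(elem)  # "character"
```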
2.3 Further matrix operations
We're now going to perform a linear regression on the height, weight, sex data. For each sex, we will fit the following model:
\begin{eqnarray} y & = & a + bx \nonumber \end{eqnarray}
where the response data, \(y\), is height, and the covariate data, \(x\), is weight. If you've studied linear models (if not, believe me!), you will know that if \(\beta = (a,b)\), then the parameter estimates are given by
\begin{equation} \hat{\beta} = (X^TX)^{-1}X^Ty \end{equation}where \(X\) is the model/design matrix. This gives us a straight-forward way to calculate our parameter estimate. However, first we need to rearrange our data so that we can perform the calculations. For each sex, do the following:
- Begin by creating a vector of response data by copying the
height
column of your data frame into a separate vector object. Call this \(y\). - Now we need to create a design matrix of covariate data.
This will be a two column matrix, with the first being a column of ones,
and the second containing the covariate data for weight.
Start by defining a 2 column by \(n\) row matrix, where \(n\) is the length of your covariate data
(you can use dim
on your data frames to find this). e.g. for a 2 column, 10 row matrix, I would type:
myMatrix <- matrix(ncol=2,nrow=10)
If you look at the matrix you have created, you will see that it is full of NA
.
This is fine, since we've not supplied any data yet.
- Now populate the first column of the matrix with 1's. Fill columns by referring to them with subscript notation, e.g.:
myMatrix[,1] <- 1 #Fills the 1st column with 1's
This example demonstrates the use of a feature we observed about how R handles vectors:
since the vector myMatrix[,1]
(recall: one column of a matrix is a vector!) is of length \(n\)
and 1
is of length 1, R repeats the \(1\) (recycling rule) as many times as is necessary to fill
myMatrix[,1]
.
- Now populate the second column with weight data, using a subscript on your data frame.
Exercise: Use the regression formula. To do this, you will need the following to multiply, transpose, and invert matrices:
Function/operator | Use |
---|---|
x %*% y | Matrix-multiplies x and y |
solve(x) | Returns the inverse of matrix x |
t(x) | Returns the transpose of matrix x |
What are your parameter estimates for each sex? Store them in an appropriate form.
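As a hedged sketch of the same calculation, here is the estimator \(\hat{\beta} = (X^TX)^{-1}X^Ty\) coded on simulated data (the data and true parameter values below are invented, so this does not give away the answer for the class data):

```r
set.seed(1)                            # reproducible simulated data
n <- 50
x <- runif(n, 40, 100)                 # a hypothetical "weight" covariate
y <- 100 + 0.9 * x + rnorm(n, sd = 5)  # response with known a = 100, b = 0.9

X <- matrix(ncol = 2, nrow = n)        # design matrix
X[, 1] <- 1                            # column of ones (intercept)
X[, 2] <- x                            # covariate column

beta.hat <- solve(t(X) %*% X) %*% t(X) %*% y
beta.hat                               # estimates close to (a, b) = (100, 0.9)
```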
2.4 Graphics
In this section, we will use the height, weight, sex dataset you used in Section Further matrix operations to illustrate how to use some of the graphics functions in R. The functions we will be using are:
Function | Use |
---|---|
plot() | Creates a new plot |
hist() | Plots a histogram |
lines() | Adds a line to an existing plot |
points() | Adds points to an existing plot |
text() | Adds text to a plot |
legend() | Adds a legend to a plot |
Creating a plot
In this section we are going to create a plot of height versus weight for each sex in the class96 dataset.
The basic plot()
command can be used as follows:
plot(x, y, main="My Title", xlab="The x-axis", ylab="The y-axis", xlim=c(0,10), ylim=c(0,15))
where x and y are vectors of \(x\) and \(y\) coordinates respectively,
and the remaining arguments set the plot labelling and axis limits
(\(x\)-axis between 0 and 10, and \(y\)-axis between 0 and 15).
Exercise: Using the data in your myTable
data frame, plot a scattergram of height versus weight,
with a suitable title and axis labels. Look at your data range, and set suitable limits for the axes.
Use the following steps to create your graph:
- First plot one of the sexes with
plot()
to establish the plot. By default this will create black points. - Now use the
points()
function to plot the other sex onto the existing graph.
In order to get a different colour, you will have to supply an argument to the col
parameter inside points()
's brackets.
In this example, we used the graphical parameter col
.
There are lots of these parameters that control the
look of a lot of different types of plot (scatter, line, histogram etc)
so the online help system lumps them all together under the par()
function.
Have a look at the online help, then try changing your plot()
expressions to use a '+' character for the data points.
Also, have a look at the online help for plot()
.
You will see that it gives you information on the type
parameter for controlling the plot type.
We will be using this later.
Now we want to add a legend to our scatter plot.
The online help for legend()
tells us the syntax:
legend(x, y = NULL, legend, fill = NULL, col = par("col"), lty, lwd, pch, angle = 45, density = NULL, bty = "o", bg = par("bg"), box.lwd = par("lwd"), box.lty = par("lty"), pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd, xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1, adj = c(0, 0.5), text.width = NULL, text.col = par("col"), merge = do.lines && has.pch, trace = FALSE, plot = TRUE, ncol = 1, horiz = FALSE, title = NULL, inset = 0)
"Wow", you say! Notice that any parameter that has an =
sign has a default setting (shown) so you
don't have to be explicit about what you want, unless you want to change the default.
Legends are plotted in boxes, so as a bare minimum you need to specify:
- x and y — the coordinates on the plot of the top left of the box;
- legend — a vector of names of your data series on the plot;
- col — a vector of colours that correspond to the data series, in the same order as you supplied for legend;
- pch — a vector of point characters you used to plot your data series.
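Putting the pieces together, here is a minimal sketch of a two-group scatter plot with a legend (the data are made up, and the plot is written to a temporary PDF file so the code also runs non-interactively):

```r
f <- tempfile(fileext = ".pdf")
pdf(f)   # send the plot to a temporary file; omit this to plot on screen

# made-up data standing in for the height/weight columns
wF <- c(55, 60, 62); hF <- c(160, 165, 168)   # females
wM <- c(70, 78, 85); hM <- c(172, 180, 185)   # males

plot(wF, hF, col = "red", pch = "+",
     main = "Height versus weight", xlab = "Weight (kg)", ylab = "Height (cm)",
     xlim = c(50, 90), ylim = c(155, 190))
points(wM, hM, col = "blue", pch = "+")       # add the second group
legend(50, 190, legend = c("Female", "Male"),
       col = c("red", "blue"), pch = c("+", "+"))

dev.off()
```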
Any graph you have produced can be saved to file directly from the graph window by selecting File => Save As.
Plotting lines
In this section, we will add trendlines to the scatter plot of height
versus weight that we created previously, using the lines()
function. Look up the online help, and remember the graphical
parameters that you can set (see the help page for par()).
lines() can be used like this:
lines(c(0,4),c(0,2),col='blue',lty=2,lwd=3)
This command would draw a blue dashed line of width 3 between \((0,0)\) and \((4,2)\). What happens if you use this command without an existing plot being open? Why?
Exercise: Using the values for the regression parameters you estimated in Section Further matrix operations ,
use the lines()
function to add suitable trendlines to the scatter plot of height
versus weight, making sure the line colours match the point colours.
Histograms
R has a really nifty function for creating histograms quickly
— hist()
. Let's try it out. First let's generate some
data to work on. R provides various functions for generating random
numbers from the common probability distributions. All functions
return a vector of \(n\) realisations from whichever distribution you
choose, parametrised by values of your choice. Here are some of the
distributions you can use (see also ?distributions
):
R function | Probability distribution |
---|---|
rnorm(n,mean,sd) | Normal distribution |
rgamma(n,shape,rate) | Gamma distribution |
runif(n,min,max) | Uniform distribution |
rpois(n,lambda) | Poisson distribution |
rexp(n,rate) | Exponential distribution |
rweibull(n,shape,scale) | Weibull distribution |
WARNING: R does not necessarily use the same parametrisation for these probability distributions as you do in your course. Until you get used to using these functions on a regular basis, always check the help pages - they give you the form of the distribution (in LaTeX format).
Note also, that for each distribution, there are functions to return density, distribution function, and quantiles
(look at ?rnorm
, for example).
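For instance, for the normal distribution the four related functions are rnorm(), dnorm(), pnorm() and qnorm(); a quick sketch of how they fit together:

```r
# The r/d/p/q prefixes give random draws, density, distribution
# function and quantiles respectively; here for the standard normal.
dnorm(0)          # density at 0: 1/sqrt(2*pi), roughly 0.399
pnorm(1.96)       # P(Z <= 1.96), roughly 0.975
qnorm(0.975)      # the inverse of pnorm: roughly 1.96
pnorm(qnorm(0.3)) # p and q undo each other, giving back 0.3
```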
Exercise:
- Choose your favourite distribution, parametrise it accordingly,
and use the appropriate function to draw 10000 random samples from it into a vector.
- Use the
hist()
function to plot a histogram. Add a title and axis labels in the same way you did forplot()
.
- Using the quantile function for your distribution, find the upper and lower quartiles.
Use the abline()
function to draw vertical lines on the histogram representing the quartiles.
Perhaps you should also make the lines a different style so that they stand out?
Challenge: In the exercise above, we used a vector of random numbers to approximate a probability distribution. However, we can use the density function to draw a curve. Try this out for your distribution. Hint: You will first need to create a vector of \(x\) values against which to plot the density. This vector will have to hold a sequence of values running between suitable limits, and have small enough intervals so that the curve appears smooth.
2.5 Loop structures
Loop structures are used when a task needs to be performed repetitively.
Loops enable routine manual tasks to be performed with great speed and,
together with conditional statements, are of fundamental importance in
creating your own functions or programmes. A scenario that seems to occur
very frequently when programming is when data are stored in a matrix (or vector)
and it is necessary to compute some function of the row or column elements
(or entries of the vector). The most basic way that this can be achieved is by the
use of for
or while
loops, the functionality of which is best illustrated via example.
The following code prints the square of the numbers 1 – 10.
Method 1:
for (i in 1:10){  # start the loop with i = 1
  print(i^2)      # print the square of i
}                 # go to start of the loop, increasing i by 1
Method 2:
i <- 1            # set i = 1
while (i <= 10){  # if i <= 10 then execute commands in the loop
  print(i^2)      # print the square of i
  i <- i + 1      # increase i by 1
}                 # go to the start of the loop
Some comments are in order. Firstly, you don't need to use the letter i
; any valid object name will do.
With the first method, i
takes values in the vector object 1:10
(if you have not already done so,
try typing 1:10
into the R console). Any vector can be used here, so for example if
X <- runif(10)
is a random vector of uniformly distributed values then replacing the first line in Method 1
above with for (i in X){
will print out the squares of the 10 random numbers in the vector X
.
It is also possible to loop in reverse order by using 10:1
. With the second method, on the other hand,
the user dictates what values i
takes and how these values evolve within the loop structure itself.
The former for
method is probably more commonly used in practice, but it is best to be aware of the while
structure
as it is occasionally useful.
Matters become more complicated when the function in question
depends recursively on previous values stored in a vector. Suppose
it is required to set up a vector x
containing the first 10
factorial numbers ie \((1!,2!,3!,\ldots,10!)\); this is achieved as
follows.
x <- 1               # 1! = 1, so initialise by setting x = 1
for (i in 2:10){     # note the loop now starts from 2
  x[i] <- i * x[i-1] # let the next entry of x be i times the previous entry
}
If only the last number was required (i.e., \(10!\)) then the line within the loop could have
been replaced by x <- x * i
.
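For this particular task there is also a loop-free equivalent; a small sketch comparing the loop above with the built-in cumprod():

```r
# loop version, as above
x <- 1
for (i in 2:10){
  x[i] <- i * x[i-1]
}

# loop-free equivalent: cumulative product of 1:10
x2 <- cumprod(1:10)

all.equal(x, x2)   # TRUE
x[10]              # 3628800, i.e. 10!
```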
Exercise: Generate a \(5\times5\) matrix of random numbers using the command
X <- matrix(runif(25),5,5)
.
Using a for loop structure, compute a matrix whose entries on the
\(i\)th row are \(i\) times the entries of the \(i\)th row of X. Your
knowledge of matrix subscripting may be useful here. How can this
operation be performed without using a loop? Your code may be slow to
run if it contains many loops, though there are not always shortcuts
available: sometimes it is necessary to use a loop.
Exercise: Using a while loop structure, plot the graph of \(y = kx^2\) for \(k=1,2,4,8,16,32\) for \(x\in[-10,10]\).
Hint: first use the plot
command for the case that \(k=1\), then the
lines
command for subsequent plots.
Below is a video from Computerphile explaining loop structures with lambda calculus!
2.6 Conditional statements
Conditional statements are often used within loops and enable
the programmer to specify when or if certain tasks are to be performed.
The following Example illustrates nested loops and conditional statements.
The loop structure starts with i,j
\(=1\). Run the code, then try to understand it.
Example 1:
mat <- matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=TRUE)
print(mat)             # print the matrix
# loop through the elements of the matrix in turn.
# Question: in what order does this proceed?
for(i in 1:3){         # start of i loop
  for (j in 1:3){      # start of j loop (the 'nested' loop)
    if (mat[i,j] > 5){ # if the i,j th element of mat > 5 then ...
      mat[i,j] <- 0    # set the i,j th element to zero
    }                  # ends the 'if' statement
  }                    # end the j loop
}                      # end the i loop
print(mat)             # print the altered matrix
Of course, the same effect could be achieved by simply typing mat[mat>5] <- 0
.
This leads onto an important point: R performs looping calculations relatively slowly,
so consider whether a loop is required.
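A small sketch confirming that the one-line vectorised version gives the same result as the nested loops:

```r
mat1 <- matrix(c(1,2,3,4,5,6,7,8,9), 3, 3, byrow = TRUE)
mat2 <- mat1

# nested-loop version
for (i in 1:3){
  for (j in 1:3){
    if (mat1[i,j] > 5){
      mat1[i,j] <- 0
    }
  }
}

# vectorised version: one line, no loop
mat2[mat2 > 5] <- 0

identical(mat1, mat2)   # TRUE
```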
Sometimes one may wish to perform an alternative task if the condition,
for example (mat[i,j] > 5)
, is not satisfied. In the following code,
which would replace the conditional statement in the loop in the example above,
the other elements of mat
are set to 1 (ie those elements of
mat
which are less than or equal to 5).
if (mat[i,j] > 5){
  mat[i,j] <- 0
}else{
  mat[i,j] <- 1
}
Note that indenting code as in the above examples makes it much easier to read:
the braces are "tabbed out" so it is clear what actions are performed in each loop.
This is good practice and makes it much easier to spot errors in your code.
Take care when using &
or |
in your conditional statements; if you are unsure what is happening,
then set up a simple example you can test your statements on
(this can occasionally save hours of frustration!).
Note: Conditional statements can be used to exit from a loop;
this is achieved by using the command break
; alternatively,
one can move onto the next iteration of the loop with the command next
.
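A minimal sketch of next and break in action (the numbers here are arbitrary):

```r
# 'next' skips to the following iteration; 'break' exits the loop
kept <- c()
for (i in 1:10){
  if (i %% 2 == 0) next   # skip even numbers
  if (i > 7) break        # stop entirely once an odd i exceeds 7
  kept <- c(kept, i)
}
kept   # 1 3 5 7
```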
Exercise: Using a loop and conditional statement and for the numbers 1 to 100,
set up a character vector containing the string div 4
if the number is divisible by 4 and not div 4
otherwise.
2.7 Programming
The situation is as follows: you have been trawling through the (somewhat unhelpful!)
online help files and searching on the internet for a command for which you think "surely this is built into R!".
You know mathematically how to compute the quantity you require, but you want to perform the same operation many times
over the course of your current project and don't want to have to repeat lines of code over and over again
(cutting and pasting lines of code can be a risky business when the integrity of your analysis is at stake).
If your repetitive task is definite integration of polynomials for example, you might be thinking: if only I could just type integrate("x^2+1",1,2)
to get
\[
\int_1^2(x^2+1)\mathrm{d}x
\]
Help is at hand! R can be used to define your own programs. Master the art of programming and you can open up a whole new world of R functionality; moreover, your code will both look neater and be easier to debug.
Suppose you want to compute the mean of a vector \(x\) (obviously this task may be achieved with inbuilt R functions).
You want to be able to type m(x)
into the console and have R return the mean of the vector x
or mean.y <- m(y)
to store the mean of y
in a new object mean.y
. This can be achieved as follows.
m <- function(vector){           # 1.
  n <- length(vector)            # 2.
  sum.vector <- sum(vector)      # 3.
  answer <- (1/n) * sum.vector   # 4.
  return(answer)                 # 5.
}
The function may now be called in the usual manner: suppose vec
is a vector object in R,
then typing either m(vec)
or m(vector=vec)
will return the mean. Try it.
Let's go through the code line by line.
- The hardest line to understand. This line defines an object called
m
and tells R that it is a function object. The wordvector
in parenthesis is known as the argument of the function. The argument(s) are local names of objects the function is to operate on. To illustrate this, think of \(x\) as in \(f(x) = \sin{x}\) for example. If \(u\) is a similar mathematical object to \(x\) (e.g., if they are both real numbers) then it is natural to define \(f(u) = \sin{u}\); the same idea applies with R functions. In the call to the function,m(vector=vec)
, a local copy of the objectvec
is created and stored as a local objectvector
. If outside the function, I already had an object calledvector
, then this "external" object will not be changed by any operations made from within the program. Similarly, it is not possible to access the values of a local object from outside the program. The copy ofvector
is therefore called local to the programm()
. - Computes the length of
vector
and stores the answer in a local object calledn
. - Creates a local object,
sum.vector
and stores in it the sum of the entries ofvector
. - Creates a local object
answer
containing the required mean. - Returns the object
answer
. If an allocation has been made, for examplemean.vec <- m(vec)
then the required mean will be stored in an external object,mean.vec
, accessible outside the program. Otherwise if the call was made via the commandm(vec)
, then R will just display the answer.
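A short sketch illustrating the point about local objects: calling m() does not touch an external object that happens to share the name vector:

```r
m <- function(vector){
  n <- length(vector)
  sum(vector) / n      # the local objects n and vector vanish on return
}

vector <- c(100, 200)  # an external object that happens to share the name
m(c(1, 2, 3))          # 2: the function uses its own local copy
vector                 # still 100 200, untouched by the call
```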
Note: Use of the browser()
function in your
programs allows to interrupt the flow of a program and access local
objects. When R comes across the command browser()
, it enters a
special browsing mode. Type the name of any local or external object
to view it. To exit browsing mode, either hit return to continue
running the script or Q
to quit (note the capital). You can
look at your function code on the R console by typing the name of the
function – e.g., try typing sd
, this should display R's in-built
command for the standard deviation.
Exercise: Write a function called fact
to compute the factorial of any number.
Compute the factorial of a few (small!) numbers and compare your answer to those obtained via R's own
factorial function, factorial()
.
Exercise Set up a \(60\times30\) matrix, X
,
using the command, X <- matrix(runif(1800),60,30)
. Using an
optional argument, cols
, which should take the default value
FALSE
, write a function to compute the variance of either the
columns, or rows of X
. Note that in the argument line of the
function, a default value for a local variable z
can be set by
using z=FALSE
for example; the function dim
may also be
of use in the program. Use your function to compute the variance of
the columns and rows of X
and compare these with answers you
obtain using the apply
function
(see the help file or ask a tutor about this).
Here is another Computerphile video, this time explaining lambda calculus.
2.8 Have the right style
Good coding style is like using correct punctuation. You can manage without it, but it sure makes things easier to read. As with styles of punctuation, there are many possible variations. The following lines describe a commonly used R style which makes R code easier to read, share, and verify.
3 Tidyverse
3.1 Introduction
3.1.1 Base R and CRAN
So far we have been talking about base R, i.e., the R language itself with the set of functions that come with the installation. These include a lot of statistical packages and practically all simple statistical functions. However, there is an enormous ecosystem of add-on packages: if you have a statistical or data wrangling problem and it has been done before, there is almost certainly an R package to do it. Luckily, almost all of them are in one place: the Comprehensive R Archive Network (CRAN).
Within CRAN, there is a subset of packages which are becoming invaluable. This used to be called the Hadleyverse after the person who has written most of them, Hadley Wickham. Hadley has now said he would like them to be called the tidyverse, tidy being the way he thinks about data being in the right format for doing appropriate statistical analysis.
To load tidyverse in R use
install.packages("tidyverse")
library(tidyverse)
3.1.2 A look back on data frames
Tibbles are a modern reimagining of the data.frame. They keep the features
that have stood the test of time, and drop the features
that used to be convenient but are now frustrating
(e.g., automatically converting character vectors to factors).
Creating tibbles
tibble()
is a nice way of creating data frames and encapsulates
best practices for data frames. You can create
a new tibble from individual vectors
tibble( x = 1:5, y = 1, z = x ^ 2 + y )
# A tibble: 5 x 3
      x     y     z
  <int> <dbl> <dbl>
1     1     1     2
2     2     1     5
3     3     1    10
4     4     1    17
5     5     1    26
Tibbles will recycle vectors of length 1 (e.g., y = 1 above is recycled
to produce a vector of length 5) but will return an error otherwise:
tibble( x = 1:5, y = 1:2, z = x ^ 2 + y )
Error: Column `y` must be length 1 or 5, not 2
In addition: Warning message:
In x^2 + y : longer object length is not a multiple of shorter object length
This is because recycling vectors of greater lengths is a frequent source of errors.
Additionally, tibbles evaluate their arguments lazily and sequentially.
For example, the object y
in
tibble( x = 1:5, y = x ^ 2 )
# A tibble: 5 x 2
      x     y
  <int> <dbl>
1     1     1
2     2     4
3     3     9
4     4    16
5     5    25
is constructed from the newly created object x = 1:5. Comparing this with the base R data.frame
function
data.frame( x = 1:5, y = x ^ 2 )
Error in data.frame(x = 1:5, y = x^2) : object 'x' not found
we get an error telling us that object x
was not found!
Coercion
The mtcars
dataset from base R was extracted from the 1974 Motor Trend US magazine.
It comprises fuel consumption and 10 aspects of automobile design
and performance for 32 automobiles (1973-74 models).
head(mtcars) # head() returns the first parts of a data frame
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
The variables in mtcars
are
[, 1] mpg:  Miles/(US) gallon
[, 2] cyl:  Number of cylinders
[, 3] disp: Displacement (cu.in.)
[, 4] hp:   Gross horsepower
[, 5] drat: Rear axle ratio
[, 6] wt:   Weight (1000 lbs)
[, 7] qsec: 1/4 mile time
[, 8] vs:   V/S
[, 9] am:   Transmission (0 = automatic, 1 = manual)
[,10] gear: Number of forward gears
[,11] carb: Number of carburetors
Check mtcars
is a data.frame
is.data.frame(mtcars)
[1] TRUE
Now, to coerce any data frame to a tibble object use as_tibble()
tbl_mtcars <- as_tibble(mtcars)
tbl_mtcars
# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
 * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
 6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
 7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
 8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
 9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows
Printing tibbles
Tibbles have a refined print method that shows only the first 10 rows,
and all the columns that fit on screen. This makes it much easier to work with large data.
In addition to its name, each column reports its type, a nice feature borrowed from str()
(for mtcars
all objects are numeric
vectors of double precision <dbl>
).
3.2 Data manipulation using dplyr
3.2.1 Introduction
Undoubtedly, one of the most useful and important packages from tidyverse is dplyr. It provides a grammar of data manipulation and a set of consistent verbs that help solve the most common data wrangling problems:
select()
picks variables based on their names;filter()
picks cases based on their values;mutate()
adds new variables that are functions of existing variables;summarise()
reduces multiple values down to a single summary;arrange()
changes the ordering of the rows.
These all combine naturally with group_by()
which allows you to perform any
operation "by group".
3.2.2 The select()
verb
The select
function allows you to
pick a set of columns by name.
Example
Pick miles per gallon, horsepower and number of carburators
df1 <- mtcars[,c("mpg","hp","carb")]  ## base R
df2 <- select(mtcars, mpg, hp, carb)  ## tidyverse::dplyr
head(df2)
                   mpg  hp carb
Mazda RX4         21.0 110    4
Mazda RX4 Wag     21.0 110    4
Datsun 710        22.8  93    1
Hornet 4 Drive    21.4 110    1
Hornet Sportabout 18.7 175    2
Valiant           18.1 105    1
Of course, all dplyr
verbs work on tibbles
, for example
select(tbl_mtcars, mpg, hp, carb)
will return the same data frame as before, now stored in tibble
format.
A colon can be used to select multiple contiguous columns, for example
df <- select(tbl_mtcars, mpg:hp)
head(df)
                   mpg cyl disp  hp
Mazda RX4         21.0   6  160 110
Mazda RX4 Wag     21.0   6  160 110
Datsun 710        22.8   4  108  93
Hornet 4 Drive    21.4   6  258 110
Hornet Sportabout 18.7   8  360 175
Valiant           18.1   6  225 105
Also, see ?select_helpers
for variable selection based on
names, regular expressions, string match, etc.
3.2.3 The filter()
verb
The filter
function allows you to
pick a set of rows that meet specified conditions.
For example, suppose we want to keep all cars for which
miles per gallon is less than 20. This is easily done
by
filter(tbl_mtcars, mpg<20)      ## tidyverse::dplyr
tbl_mtcars[tbl_mtcars$mpg<20,]  ## base R
More complicated expressions can be fed into filter
, e.g.,
suppose now that we want to keep all cars for which
miles per gallon is less than 20 AND horsepower is greater than 230:
## You can use a comma or ampersand (&) for an AND condition
## 1. comma syntax
filter(tbl_mtcars, mpg<20, hp>230)
## 2. ampersand syntax
filter(tbl_mtcars, mpg<20 & hp>230)
For an OR condition use the |
, i.e.,
filter(tbl_mtcars, mpg<20 | hp>230)
will return a data frame that consists of all the cars for which either miles per gallon is less than 20 OR horsepower is greater than 230.
3.2.4 Pipelining: How you should write your dplyr code
Although not part of the dplyr package, the pipe operator %>%
(which is imported from package magrittr
) allows you
to pipe an object forward into a function or call expression.
This is extremely useful as it allows you to combine multiple
operations using clean and readable code.
The magrittr
package offers a set of operators which make your code more readable by:
- structuring sequences of data operations left-to-right (as opposed to from the inside out),
- avoiding nested function calls
- minimizing the need for local variables and function definitions, and
- making it easy to add steps anywhere in the sequence of operations.
The operators pipe their left-hand side values forward into expressions that appear on the right-hand side,
i.e., one can replace f(x)
with x %>% f()
, where %>%
is the (main) pipe-operator.
When coupling several function calls with the pipe-operator, the benefit will become more apparent.
Consider this example: suppose we would like to obtain information
about mpg
, hp
and carb
for the cars that have mpg
less than
\(20\). One way of doing this is to first select
the variables mpg
, hp
and carb
from the tbl_mtcars
data frame and then
filter the rows that meet the condition mpg
less than 20,
i.e.,
filter( select(tbl_mtcars, mpg, hp, carb), mpg < 20 )
Here we see a nested structure in the code: the function call
select
is nested in the function call filter
.
With the pipe operator, this piece of code can be rewritten as
tbl_mtcars %>% select(mpg, hp, carb) %>% filter(mpg < 20)
which is cleaner and can be naturally read as:
tbl_mtcars
then select
variables mpg, hp, carb
then filter
rows that meet the condition mpg < 20
x %>% f
is equivalent tof(x)
x %>% f(y)
is equivalent tof(x,y)
x %>% f %>% g %>% h
is equivalent toh(g(f(x)))
3.2.5 The mutate()
verb
The mutate()
function is used to create new variables from existing ones.
Suppose we would like to convert the weight of all cars to the metric system, i.e., from
pounds to kilograms. In order to do so in base R, we would need to multiply mtcars$wt
by
the appropriate conversion factor (\(0.4535924\)), store the new vector in an object
and append it to the data frame mtcars
. The function mutate
does all of this in one go!
conv_factor <- 0.4535924
mutate(tbl_mtcars, wt_metric = wt * conv_factor)     ## without pipe
tbl_mtcars %>% mutate(wt_metric = wt * conv_factor)  ## pipe syntax
# A tibble: 32 x 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb wt_metric
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4  1.188412
 2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4  1.304078
 3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1  1.052334
 4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1  1.458300
 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2  1.560358
 6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1  1.569430
 7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4  1.619325
 8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2  1.446960
 9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2  1.428816
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4  1.560358
3.2.6 The group_by()
and summarise()
verbs
grouped_by_am <- group_by(tbl_mtcars, am)
grouped_by_am
# A tibble: 32 x 11
# Groups:   am [2]
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
 * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
 6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
 7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
 8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
 9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows
We notice that nothing has changed; the data look exactly the same as if we
had typed tbl_mtcars in the console. However, the second line of the R output
shows additional information, i.e., # Groups: am [2]
. This indicates
that the new object grouped_by_am
has identified and stored the two different
groups of cars (automatic and manual).
This allows us to obtain useful summaries in each group with the summarise()
function.
For example, suppose we were interested in the average weight of cars in the two distinct
groups. This can be obtained by:
summarise(grouped_by_am, avg_wt = mean(wt))
# A tibble: 2 x 2
     am   avg_wt
  <dbl>    <dbl>
1     0 3.768895
2     1 2.411000
Of course, this code can be combined in a single line as
summarise(group_by(tbl_mtcars, am), avg_wt = mean(wt))
but this becomes very hard to read. Comparing this with
tbl_mtcars %>% group_by(am) %>% summarise(avg_wt = mean(wt))
we can appreciate the power of writing code using the
pipe operator %>%
!!
4 Optimization
REFRESHER: If you feel your multivariable calculus is rusty, I strongly recommend Denis Auroux's Multivariable Calculus lectures, recorded during Fall 2007 at MIT. In particular, Lecture 8:
- 15.40: Contour plot
- 34.48: Partial derivatives
- 41.57: Visualization
and Lecture 9:
4.1 Introduction
Maximum likelihood estimation is central to statistical inference. Long hours can be invested in learning about the theoretical performance of MLEs and their analytic derivation. Faced with a complex likelihood lacking an analytic solution, however, many people are unsure how to proceed.
Most functions cannot be optimized analytically. For example, maximizing \[ g(x)= (\log x)/(1+x) \qquad x>0 \] with respect to \(x\) by setting the derivative equal to zero and solving for \(x\) leads to an algebraic impasse because \[ 1 - \log x + 1/x = 0 \] has no analytic solution. Many realistic statistical models induce likelihoods that cannot be optimized analytically; indeed, we would argue that greater realism is strongly associated with reduced ability to find optima analytically.
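In R, such problems are handed to a numerical optimizer; for example, the one-dimensional function above can be maximized with the built-in optimize() (the bracketing interval below is our choice):

```r
g <- function(x) log(x) / (1 + x)

# maximize g numerically over a bracketing interval
res <- optimize(g, interval = c(1, 10), maximum = TRUE)
res$maximum     # roughly 3.59: the root of 1 - log(x) + 1/x = 0
res$objective   # the maximised value of g
```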
Data scientists face other optimization tasks, too, aside from maximum likelihood. Minimizing risk in Bayesian decision problems, solving nonlinear least squares problems, finding highest posterior density intervals, and a wide variety of other tasks all involve optimization.
More generally, many problems in statistics can be characterized as solving
\[\min_{\boldsymbol{x} \in \mathbb{R}^m} \, f(\boldsymbol{x})\]
where \(f: \mathbb{R}^m \rightarrow \mathbb{R}\) is a real scalar-valued function of the vector \(\boldsymbol{x} \in \mathbb{R}^m\). The function \(f\) can represent
- negative log-likelihood;
- negative log-posterior density;
- decision rule, cost, distance, energy, etc.
4.1.1 Local vs global optimization
Minimization methods generally operate by seeking a local minimum of \(f\). That is, they seek a point \(\boldsymbol{x}^*\), such that, \[ f(\boldsymbol{x}^* + \boldsymbol{\Delta}) \geq f(\boldsymbol{x}^*), \] for all \(\boldsymbol{\Delta}\) of sufficiently small magnitude.
Unless \(f\) is a strictly convex function, it is not possible to guarantee that such a \(\boldsymbol{x}^*\) is a global minimum. That is, it is not possible to guarantee that \[f(\boldsymbol{x}^* + \boldsymbol{\Delta})\geq f(\boldsymbol{x}^*)\] for any arbitrary \(\boldsymbol{\Delta}\).
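This distinction is easy to demonstrate in R: a double-well function has two local minima, and a local method started in different places finds different answers (a minimal sketch; the function \(f(x)=(x^2-1)^2+0.2x\) is an arbitrary illustration):

```r
# A double-well function: two local minima near -1 and +1, with the
# 0.2 * x tilt making the left one the global minimum
f <- function(x) (x^2 - 1)^2 + 0.2 * x

# optim() with different starting points converges to different local minima
left  <- optim(-2, f, method = "BFGS")
right <- optim( 2, f, method = "BFGS")

left$par   # near x = -1 (the global minimum here)
right$par  # near x = +1 (a local, non-global minimum)
```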
4.1.2 Synthetic and Analytic point of view for convex functions (\(m=1\))
Definition: A function \(f: G\mapsto \mathbb{R}\), where \(G\) is an open subinterval of \(\mathbb{R}\), is called convex on \(G\) if its graph lies below any of its chords: for \(x_1,x_2 \in G\) and \(t\in (0,1)\), \[ f(t\, x_1 + (1-t)\, x_2)\leq t \,f(x_1) + (1-t)\,f(x_2) \] Illustration:
From the definition, it follows that \(f\) is continuous on \(G\), since convexity may be rewritten as the chord-slope inequality \[ \Delta_{u,v} \leq \Delta_{v,w} \quad \mbox{for } u < v < w \mbox{ in } G, \quad \mbox{where }\Delta_{u,v} := \frac{f(v) - f(u)}{ v-u}. \]
If \(f\) is twice differentiable on \(G\), then \(f\) is convex if and only if \[ f''(x)\geq 0 \qquad \mbox{for all }x \in G. \]
4.1.3 Optimization algorithms
- Beginning at \(\boldsymbol{x}^{(0)}\), optimization algorithms are designed to generate a sequence
of iterates \(\{\boldsymbol{x}^{(k)}\}_{k=0}^{\infty}\) that terminate when
- either no more progress can be made;
- or when it seems that a solution point has been approximated with sufficient accuracy.
- In deciding how to move from iterate \(\boldsymbol{x}^{(k)}\), the algorithms use information about the function \(f\) at \(\boldsymbol{x}^{(k)}\), and possibly also information from earlier iterates \(\boldsymbol{x}^{(0)}, \boldsymbol{x}^{(1)},\ldots, \boldsymbol{x}^{(k-1)}\).
- Optimization methods are short-sighted. At any stage of operation all that optimization methods "know" about a function are a few properties of the function at the current best \(\boldsymbol{x}\). It is on the basis of such information that a method tries to find an improved best \(\boldsymbol{x}\).
4.1.4 Search direction and Trust Region algorithms
Optimization algorithms start from the initial guesstimate \(\boldsymbol{x}^{(0)}\) of the optimum. Starting from \(k=0\), most methods then iterate the following steps
1 Evaluate \(f(\boldsymbol{x}^{(k)})\), and possibly the first and second derivatives of \(f\) with respect to the elements of \(\boldsymbol{x}\), at \(\boldsymbol{x}^{(k)}\)
2 Either
- use the information from step 1 (and possibly previous executions of step 1) to find a search direction, \(\boldsymbol{\Delta}\), such that for some \(\alpha>0\), \(f(\boldsymbol{x}^{(k)} + \alpha \boldsymbol{\Delta})\) will provide sufficient decrease relative to \(f(\boldsymbol{x}^{(k)})\). Search for such an \(\alpha\) and set \(\boldsymbol{x}^{(k+1)} = \boldsymbol{x}^{(k)} + \alpha \boldsymbol{\Delta}\).
- use the information from step 1 (and possibly previous executions of step 1) to construct a local model of \(f\), and find \(\boldsymbol{x}\) minimizing this model within a defined region around \(\boldsymbol{x}^{(k)}\). If this value of \(\boldsymbol{x}\) leads to a sufficient decrease in \(f\), accept it as \(\boldsymbol{x}^{(k+1)}\), otherwise shrink the search region until it does.
4.2 Multivariate Taylor's theorem
Suppose that \(f:\mathbb{R}^m\rightarrow \mathbb{R}\) is a twice continuously differentiable function of \(\boldsymbol{x}\) and that \(\boldsymbol{\Delta}\) is of the same dimension as \(\boldsymbol{x}\). Then
\begin{equation} f(\boldsymbol{x} + \boldsymbol{\Delta}) = f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^T\boldsymbol{\Delta} + \frac{1}{2} \boldsymbol{\Delta}^T \nabla^2f(\boldsymbol{x} + t \boldsymbol{\Delta}) \,\boldsymbol{\Delta} \end{equation}for some \(t\in(0,1)\), where
\begin{equation*} \nabla f = \left[\begin{matrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \end{matrix}\right] \qquad \text{and} \qquad \nabla^2 f = \left[ \begin{matrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \ldots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1}& \frac{\partial^2 f}{\partial x_2^2} & \ldots\\ \vdots & \vdots & \ddots \end{matrix} \right], \end{equation*}are the gradient vector and Hessian matrix, respectively.
Conditions for minimum: From Taylor's theorem, it can be established that the conditions for \(\boldsymbol{x}^*\) to be a (local) minimum are that
\begin{equation*} \nabla f(\boldsymbol{x}^*) = 0 \qquad \text{and}\qquad \nabla^2 f(\boldsymbol{x}^*)\text{ is positive semi-definite}. \end{equation*}Remember: A symmetric \(m\times m\) real matrix \(M\) is called positive semi-definite if \[ \boldsymbol{x}^T\, M \,\boldsymbol{x} \geq 0 \] for every \(\boldsymbol{x} \in \mathbb{R}^m\); it is called positive definite if the inequality is strict for every \(\boldsymbol{x} \in \mathbb{R}^m \setminus \{\boldsymbol{0}\}\).
4.3 Steepest descent
Suppose we are located at an arbitrary point \(\boldsymbol{x}\) in \(\mathbb{R}^m\). We ask, which direction leads to the most rapid decrease in \(f\) for sufficiently small step, and use this as the search direction. If \(\boldsymbol{\Delta}\) is small enough, then \[ f(\boldsymbol{x}+\boldsymbol{\Delta}) \approx f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^T \boldsymbol{\Delta}. \] This is the tangent plane approximation of \(f\) at \(\boldsymbol{x}\).
- For \(\boldsymbol{\Delta}\) of fixed length, the greatest decrease will be achieved by minimising the inner product
\[ \nabla f(\boldsymbol{x})^T \boldsymbol{\Delta} = \left\| \nabla f( \boldsymbol{x}) \right\| \left\| \boldsymbol{\Delta}\right\| \cos(\theta), \] that is, by choosing \(\boldsymbol{\Delta}\) parallel to \(-\nabla f(\boldsymbol{x})\), where \(\theta\) is the angle between \(\nabla f(\boldsymbol{x})\) and \(\boldsymbol{\Delta}\). Here \(\left\| \cdot \right\|\) denotes the Euclidean norm.
- Set
\[ \boldsymbol{\Delta}= -\nabla f(\boldsymbol{x})/\left\|\nabla f(\boldsymbol{x})\right\|. \]
Given some direction, we need a method for choosing how far to move from \(\boldsymbol{x}\) along this direction.
Trade-off: Need to make reasonable progress, but do not want to spend too much time choosing the exactly optimal distance to move, since the point will only be abandoned at the next iteration.
Best to try and choose the step length \(\alpha > 0\) so that \(f(\boldsymbol{x}+\alpha \boldsymbol{\Delta})\) is sufficiently much lower than \(f(\boldsymbol{x})\), and also that the magnitude of the gradient of \(f\) in the \(\boldsymbol{\Delta}\) direction is sufficiently reduced by the step. At the moment we assume we have such a method (e.g., Wolfe conditions).
Given that we initially started at the point \(\boldsymbol{x}^{(0)}\), the gradient descent algorithm implements the following iterative procedure for finding the minimum of a function \(f\). \[ \boldsymbol{x}^{(k+1)} = \boldsymbol{x}^{(k)} - \alpha \,\frac{\nabla f(\boldsymbol{x}^{(k)})}{\left\| \nabla f(\boldsymbol{x}^{(k)}) \right\|} \]
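The iteration above can be sketched in a few lines of R. This is a minimal illustration, not a reference implementation: the step length is chosen by simple backtracking (halving \(\alpha\) until an Armijo sufficient-decrease condition holds), and the function name `steepest_descent` and the quadratic test function are my own choices.

```r
# Minimal steepest descent with backtracking line search (a sketch;
# a production method would use e.g. the Wolfe conditions instead)
steepest_descent <- function(f, grad, x0, tol = 1e-4, maxit = 5000) {
  x <- x0
  for (k in 1:maxit) {
    g <- grad(x)
    gnorm <- sqrt(sum(g^2))
    if (gnorm < tol) break                 # gradient (almost) zero: stop
    d <- -g / gnorm                        # unit steepest-descent direction
    alpha <- 1
    # backtracking: halve alpha until sufficient decrease (Armijo condition)
    while (f(x + alpha * d) > f(x) + 1e-4 * alpha * sum(g * d)) {
      alpha <- alpha / 2
      if (alpha < 1e-12) break
    }
    x <- x + alpha * d
  }
  x
}

# Quadratic test function with minimum at (1, 2)
f <- function(x) (x[1] - 1)^2 + 4 * (x[2] - 2)^2
grad <- function(x) c(2 * (x[1] - 1), 8 * (x[2] - 2))
xmin <- steepest_descent(f, grad, c(0, 0))
xmin  # close to c(1, 2)
```

Even on this mildly anisotropic quadratic the iterates zigzag, which is the slow linear convergence that motivates Newton's method below.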
4.4 Newton's method
REM: tangent plane approximation used to motivate steepest descent \[ f(\boldsymbol{x}+\boldsymbol{\Delta}) \approx f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^T \boldsymbol{\Delta} \] The quantity \(\nabla f(\boldsymbol{x})^T \boldsymbol{\Delta}\) is called the directional derivative of \(f\) along the vector \(\boldsymbol{\Delta}\) and is equal to
\begin{align*} \nabla f(\boldsymbol{x})^T \boldsymbol{\Delta} = \lim_{h\rightarrow 0} \frac{f(\boldsymbol{x}+h \boldsymbol{\Delta}) - f(\boldsymbol{x})}{h} \end{align*}Interpretation: \(\nabla f(\boldsymbol{x})^T \boldsymbol{\Delta}\) is the instantaneous rate of change of the function \(f\), when moving past \(\boldsymbol{x}\) at velocity \(\boldsymbol{\Delta}\).
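The limit can be checked by a finite-difference quotient (a minimal sketch; the function f and the point x are arbitrary choices for illustration):

```r
# Numerically checking the directional-derivative identity
# grad f(x)^T Delta = lim_{h -> 0} (f(x + h Delta) - f(x)) / h
# for f(x) = x1^2 + 3 * x1 * x2
f <- function(x) x[1]^2 + 3 * x[1] * x[2]
grad_f <- function(x) c(2 * x[1] + 3 * x[2], 3 * x[1])

x <- c(1, 2); Delta <- c(0.5, -1); h <- 1e-6
exact  <- sum(grad_f(x) * Delta)           # grad f(x)^T Delta
approx <- (f(x + h * Delta) - f(x)) / h    # finite-difference quotient
c(exact, approx)                           # the two agree closely
```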
Explanation of steepest descent's poor convergence
Taylor expansion: \(t\in (0,1)\)
\begin{align*} f(\boldsymbol{x} + \boldsymbol{\Delta}) &= f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^T\boldsymbol{\Delta} + \frac{1}{2} \boldsymbol{\Delta}^T \nabla^2f(\boldsymbol{x} + t \boldsymbol{\Delta}) \,\boldsymbol{\Delta} \end{align*}- As steepest descent approaches the minimum, \(\boldsymbol{x}^*\), the directional derivative term retained in the Taylor expansion of \(f\) becomes negligible relative to the neglected second derivative term since \(\nabla f(\boldsymbol{x}^*) = 0\)
- This largely explains the method's poor performance.
Second order approximation Consider instead the quadratic form
\begin{align} f(\boldsymbol{x} + \boldsymbol{\Delta}) &= f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^T\boldsymbol{\Delta} + \frac{1}{2} \boldsymbol{\Delta}^T \nabla^2f(\boldsymbol{x} + t \boldsymbol{\Delta}) \,\boldsymbol{\Delta} \nonumber\\ &= f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^T\boldsymbol{\Delta} + \frac{1}{2}\boldsymbol{\Delta}^T \nabla^2 f(\boldsymbol{x}) \,\boldsymbol{\Delta} + O(\lVert\boldsymbol{\Delta}\lVert^3) \label{eq:quad_approx} \end{align}This is of the form \[ c + b^T \boldsymbol{\Delta} + \frac{1}{2} \boldsymbol{\Delta}^T H_{\boldsymbol{x}}\,\boldsymbol{\Delta}, \] for vector \(b\), scalar \(c\), \(H_{\boldsymbol{x}}\) a square symmetric matrix with spectral decomposition \[ H_{\boldsymbol{x}} = \Gamma_{\boldsymbol{x}}\, \Lambda_{\boldsymbol{x}} \, \Gamma_{\boldsymbol{x}}^T \] where \(\Lambda_{\boldsymbol{x}}\) is a diagonal matrix of eigenvalues, ordered in decreasing value and \(\Gamma_{\boldsymbol{x}}\) an \(m\times m\) orthogonal matrix (\(\Gamma_{\boldsymbol{x}}^T \Gamma_{\boldsymbol{x}} = I\)) of the corresponding eigenvectors.
This is the unique quadratic form that passes through \((\boldsymbol{x}, f(\boldsymbol{x}))\) and has the same first and second partial derivatives as \(f\) at \(\boldsymbol{x}\).
Differentiating this quadratic form with respect to \(\boldsymbol{\Delta}\) and setting the result to \(\boldsymbol{0}\) we get
\begin{align*} \nabla f(\boldsymbol{x}) + \nabla^2 f(\boldsymbol{x})\,\boldsymbol{\Delta} = 0 \end{align*}as the equation to solve for the step \(\boldsymbol{\Delta}\). Assuming \(\nabla^2 f(\boldsymbol{x})\) is positive definite, we obtain Newton's direction
\begin{align*} \boldsymbol{\Delta} &= -\nabla^2f(\boldsymbol{x})^{-1} \nabla f(\boldsymbol{x}) \\\\ & = -\Gamma_{\boldsymbol{x}} \Lambda_{\boldsymbol{x}}^{-1} \left[\Gamma_{\boldsymbol{x}}^T \nabla f(\boldsymbol{x})\right] \end{align*}Newton's direction The instantaneous rate of change of \(f\) when moving past \(\boldsymbol{x}\) at velocity \({\boldsymbol{\Delta}=-\nabla^2f(\boldsymbol{x})^{-1}}\nabla f(\boldsymbol{x})\) is
\begin{equation} \nabla f(\boldsymbol{x})^T \boldsymbol{\Delta} = -\nabla f(\boldsymbol{x})^T \nabla^2f(\boldsymbol{x})^{-1} \nabla f(\boldsymbol{x}) < 0, \label{eq:desc_newton} \end{equation}hence \(\boldsymbol{\Delta}\) is a descent direction.
- Newton's direction is reliable when the difference between the true function \(f(\boldsymbol{x}+\boldsymbol{\Delta})\) and its quadratic model approximation is not too large.
- Unlike the steepest descent, there is a "natural" step length of 1 associated with Newton's direction. Usually we accept the step, unless it fails to lead to sufficient decrease in \(f\). In the latter case, the step length is repeatedly halved until sufficient decrease is achieved.
- When \(\nabla^2 f(\boldsymbol{x})\) is not positive definite, Newton's direction may not even be defined, since \(\nabla^2 f(\boldsymbol{x})^{-1}\) may not exist; and even when it does exist, the resulting direction need not be a descent direction.
VISUALIZATION OF NEWTON'S METHOD ON ROSENBROCK'S FUNCTION
Practical issues From the discussion above, we see that any step of the form \[ \boldsymbol{\Delta}=-B_{\boldsymbol{x}}^{-1} \nabla f(\boldsymbol{x}) \qquad \mbox{for }B_{\boldsymbol{x}} \mbox{ positive definite} \] implies \(\boldsymbol{\Delta}\) is a descent direction.
- This observation gives us the freedom to modify \(\nabla^2 f(\boldsymbol{x})\) to make it positive definite and still be guaranteed a descent direction.
- One approach is to take the symmetric eigen-decomposition \(\Gamma_{\boldsymbol{x}}\,\Lambda_{\boldsymbol{x}} \, \Gamma_{\boldsymbol{x}}^T\), reverse the sign of any negative eigenvalues and replace any zero eigenvalues with a small positive number, e.g., for \(\lambda_1,\ldots,\lambda_m \neq 0\) tweak \begin{equation} \Lambda_{\boldsymbol{x}} = \left[ \begin{matrix} \lambda_1 & 0 & \ldots & 0\\ 0 &\lambda_2 & \ldots &0\\ \vdots &&\ddots& \vdots\\ 0 & 0 & \ldots & \lambda_m \end{matrix}\right] \quad \text{to} \quad \Lambda'_{\boldsymbol{x}} = \left[ \begin{matrix} |\lambda_1| & 0 & \ldots & 0\\ 0 &|\lambda_2| & \ldots &0\\ \vdots &&\ddots& \vdots\\ 0 & 0 & \ldots & |\lambda_m| \end{matrix}\right] \end{equation}
- Computationally cheaper alternatives add a multiple of the identity matrix to \(\nabla^2 f(\boldsymbol{x})\), sufficiently large to make the result positive definite, i.e., \[ B_{\boldsymbol{x}} = \nabla^2 f(\boldsymbol{x}) + c \boldsymbol{I}, \qquad c>0 \]
where \[ \boldsymbol{I} = \left[\begin{matrix} 1& 0 & \ldots &0\\ 0 &1 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \ldots & 1 \end{matrix}\right] \]
Practical Newton algorithm
1 Set \(k=0\) and choose a guesstimate \(\boldsymbol{x}^{(0)}\)
2 Evaluate \(f(\boldsymbol{x}^{(k)}), \nabla f(\boldsymbol{x}^{(k)}), \nabla^2f(\boldsymbol{x}^{(k)})\), implying a quadratic model approximation to \(f\).
3 Test whether \(\boldsymbol{x}^{(k)}\) is a minimum and terminate if it is.
4 If \(\nabla^2f(\boldsymbol{x}^{(k)})\) is not positive definite, perturb it so that it is (modifying the quadratic model)
5 The search direction is defined by the solution \begin{equation} \boldsymbol{\Delta}= -\nabla^2 f(\boldsymbol{x}^{(k)})^{-1} \nabla f(\boldsymbol{x}^{(k)}) \end{equation}
6 If \(f(\boldsymbol{x}^{(k)}+\boldsymbol{\Delta})\) is not sufficiently lower than \(f(\boldsymbol{x}^{(k)})\), repeatedly halve \(\boldsymbol{\Delta}\) until it is.
7 Set \(\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)} + \boldsymbol{\Delta}\), increment \(k\) by one, and return to step 2.
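The steps above can be sketched in R as follows. This is a minimal illustration of my own (the function name `newton` is an assumption): the eigenvalue perturbation simply takes absolute values, as in the earlier discussion, and Rosenbrock's function serves as the test problem.

```r
# Practical Newton sketch: perturb the Hessian to positive definiteness
# via its eigen-decomposition, then halve the step until f decreases
newton <- function(f, grad, hess, x0, tol = 1e-8, maxit = 200) {
  x <- x0
  for (k in 1:maxit) {
    g <- grad(x)
    if (sqrt(sum(g^2)) < tol) break                 # minimum test
    eh <- eigen(hess(x), symmetric = TRUE)          # make the Hessian
    lambda <- pmax(abs(eh$values), 1e-8)            #   positive definite
    Delta <- -eh$vectors %*% ((t(eh$vectors) %*% g) / lambda)  # Newton step
    nh <- 0                                         # halve until f decreases
    while (f(x + Delta) > f(x) && nh < 50) {
      Delta <- Delta / 2; nh <- nh + 1
    }
    x <- x + Delta
  }
  drop(x)
}

# Rosenbrock's function, minimised at (1, 1)
rosen <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
rosen_grad <- function(x) c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]),
                            200 * (x[2] - x[1]^2))
rosen_hess <- function(x) matrix(c(1200 * x[1]^2 - 400 * x[2] + 2, -400 * x[1],
                                   -400 * x[1], 200), 2, 2)
xmin <- newton(rosen, rosen_grad, rosen_hess, c(-1.2, 1))
xmin  # close to c(1, 1)
```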
Comments
- Methods that use the Newton direction have a fast rate of local convergence, typically quadratic.
- After a neighbourhood of the solution is reached, convergence to high accuracy often occurs in just a few iterations.
- The main drawback of the Newton direction is the need for the Hessian \(\nabla^2 f(\boldsymbol{x})\). This can be tedious to obtain, especially when the dimension of \(\boldsymbol{x}\) is large.
- Finite-difference techniques may be useful in avoiding the need to calculate second derivatives by hand.
Quasi-Newton methods Is it possible to produce methods that only require first derivative information, but with efficiency rivalling Newton's method, rather than steepest descent?
- One approach is to approximate the Hessian by finite differencing the gradient, but this is usually more costly than exact evaluation of the Hessian.
- Quasi-Newton methods provide an attractive alternative to Newton's method in that they do not require computation of the Hessian and yet still attain a superlinear rate of convergence
- In place of \(\nabla^2 f(\boldsymbol{x})\), they use an approximation \(B_{\boldsymbol{x}}\), which is updated after each step to take account of the additional knowledge gained at that step. The updates exploit the fact that changes in the gradient provide information about the second derivative of \(f\) along the search direction
REM fundamental theorem of calculus: if \(f:\mathbb{R}\rightarrow \mathbb{R}\) is continuously differentiable then \[ f(x) = f(a) + \int_{a}^x f'(t) ~\mathrm{d}t \] The multivariate generalisation, recast for the gradient of \(f\), is
\begin{align*} \nabla f(\boldsymbol{x} + \boldsymbol{\Delta}) &= \nabla f(\boldsymbol{x}) + \int_{0}^{1}\nabla^2 f(\boldsymbol{x}+ t \boldsymbol{\Delta})\, \boldsymbol{\Delta}~\mathrm{d}t \end{align*}Adding and subtracting \(\nabla^2 f(\boldsymbol{x})\,\boldsymbol{\Delta}\) to the r.h.s, we get
\begin{align*} \nabla f(\boldsymbol{x} + \boldsymbol{\Delta}) &= \nabla f(\boldsymbol{x}) + \nabla^2f(\boldsymbol{x})\,\boldsymbol{\Delta}\\ \\ & \qquad + \int_{0}^{1}\left[\nabla^2 f(\boldsymbol{x}+ t \boldsymbol{\Delta} ) - \nabla^2 f(\boldsymbol{x}) \right] \, \boldsymbol{\Delta}~\mathrm{d}t \end{align*}Setting \(\boldsymbol{x}=\boldsymbol{x}^{(k)}\) and \(\boldsymbol{\Delta} = \boldsymbol{x}^{(k+1)} -\boldsymbol{x}^{(k)}\) we obtain that for small \(\lVert \boldsymbol{\Delta}\lVert\)
\begin{equation} \nabla f(\boldsymbol{x}^{(k+1)}) - \nabla f(\boldsymbol{x}^{(k)}) \approx \nabla^2 f(\boldsymbol{x}^{(k)}) (\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^{(k)}) \label{eq:secant} \end{equation}Let \(H^{(k+1)}\) be the approximate positive definite Hessian at the \((k+1)\)-th step
- The basic requirement of a quasi-Newton method is that the updated quadratic model should get the gradient vector at the previous point, \(\boldsymbol{x}^{(k)}\), exactly right; mathematically,
\(\qquad\) the new Hessian approximation \(H^{(k+1)}\) should mimic the property \eqref{eq:secant}
- Let \(\boldsymbol{s}_k = \boldsymbol{x}^{(k+1)}-\boldsymbol{x}^{(k)}\) and \(\boldsymbol{y}_k = \nabla f(\boldsymbol{x}^{(k+1)}) - \nabla f(\boldsymbol{x}^{(k)})\)
- Then the expression in the first bullet point is equivalent to \begin{equation} H^{(k+1)} \boldsymbol{s}_k = \boldsymbol{y}_k \qquad \mbox{secant equation} \end{equation}
- The secant equation will only be feasible for positive definite \(H^{(k+1)}\) under certain conditions on \(\boldsymbol{s}_k\) and \(\boldsymbol{y}_k\), but we can always arrange for these to be met by appropriate step length selection (Wolfe conditions; see Nocedal & Wright, 2006).
- The secant equation alone does not define a unique \(B^{(k+1)} = (H^{(k+1)})^{-1}\), and some extra conditions are needed.
Conditions and solution
- \(B^{(k+1)}\) satisfies the secant equation
- The difference between \(B^{(k)}\) and \(B^{(k+1)}\) has low rank, i.e., \(B^{(k+1)}\) is as close as possible to \(B^{(k)}\) (measured by some appropriate matrix norm)
- \(B^{(k+1)}\) is positive definite
The unique solution to this problem is the BFGS update \[ B^{(k+1)} = (\boldsymbol{I} - \rho_k \boldsymbol{s}_k \boldsymbol{y}_k^T) B^{(k)} (\boldsymbol{I} - \rho_k \boldsymbol{y}_k \boldsymbol{s}_k^T) + \rho_k \boldsymbol{s}_k \boldsymbol{s}_k^T \] where \(\rho_k^{-1} = \boldsymbol{s}_k^T \boldsymbol{y}_k\). BFGS is named after Broyden, Fletcher, Goldfarb and Shanno, all of whom independently discovered and published it around 1970.
- The BFGS method then works exactly like Newton's method, but with \(B^{(k)}\) in place of \(\nabla^2 f(\boldsymbol{x}^{(k)})^{-1}\), and without the need for second derivative evaluation or for perturbing the Hessian to achieve positive definiteness
- A finite difference approximation to the Hessian is often used to start the method
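In R, a BFGS implementation is available through optim() with method = "BFGS" (a minimal sketch on Rosenbrock's function; supplying the analytic gradient via gr avoids finite differencing):

```r
# Rosenbrock's function and its gradient; minimum at (1, 1)
rosen <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
rosen_grad <- function(x) c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]),
                            200 * (x[2] - x[1]^2))

fit <- optim(c(-1.2, 1), fn = rosen, gr = rosen_grad, method = "BFGS")
fit$par          # close to the minimiser c(1, 1)
fit$convergence  # 0 indicates successful convergence
```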
4.5 Nelder-Mead simplex
Nelder-Mead polytope What if even gradient evaluation is too taxing, or if our objective is not smooth enough for Taylor approximation to be valid? The Nelder-Mead polytope provides a successful answer.
- Let \(m\) be the dimension of \(\boldsymbol{x}\). At each stage of the method we maintain \(m+1\) distinct \(\boldsymbol{x}\) vectors, defining a polytope in \(\mathbb{R}^m\) (e.g., for a two-dimensional \(\boldsymbol{x}\), the polytope is a triangle)
- The search direction is defined as the vector from the worst point (vertex of the polytope with highest value) through the average of the remaining \(m\) points
- The initial step length is set to twice the distance from the worst point to the centroid of the others. If it succeeds (meaning that the new point is no longer the worst point), then a step length of 1.5 times is tried, and the better of the two is accepted
- If the previous step did not find a successful new point then step lengths of half and one and a half times the distance from the worst point to the centroid are tried
- If the last two steps failed to locate a successful point, then the polytope is reduced in size by linear rescaling towards the current best point (which remains fixed)
VISUALIZATION OF NELDER MEAD SIMPLEX ON ROSENBROCK'S FUNCTION
Nelder-Mead polytope
- Variations are possible, in particular with regard to step length and shrinkage factors.
- Slower rate of convergence than Newton type algorithms
- Nelder-Mead is good if the answer does not need to be accurate, and derivatives are difficult to compute.
- Not recommended for general purpose modelling software.
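For reference, optim()'s default method is a Nelder-Mead variant, so the polytope method can be tried without any derivatives (a minimal sketch, again on Rosenbrock's function):

```r
# Nelder-Mead needs only function evaluations, no gradients
rosen <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2

fit <- optim(c(-1.2, 1), rosen, method = "Nelder-Mead")
fit$par     # roughly c(1, 1), to lower accuracy than Newton-type methods
fit$counts  # function evaluations used (gradient count is NA)
```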
Other methods
- In likelihood maximisation contexts, the method of scoring is used. This replaces the Hessian in Newton's method with the expected Hessian.
- The Gauss-Newton method is another way of getting a Newton-type method when only first derivatives are available. It is applicable when the objective is a sum of squares of differences between data and fitted values, and is closely related to the iteratively reweighted least squares algorithm for fitting generalised linear models (GLMs).
Other methods
- Simulated annealing is a method for optimising difficult objectives, such as those arising in discrete optimisation problems, or with very spiky objective functions. The basic idea is to propose random changes in \(\boldsymbol{x}\), always accepting a change which reduces \(f(\boldsymbol{x})\), but also accepting moves which increase the objective, with a probability which decreases with the size of the increase, and also decreases as the optimisation progresses.
- The Expectation-Maximisation algorithm is a useful approach to likelihood maximisation when there are missing data or random effects that are difficult to integrate out.
5 Computer intensive methods
5.1 Markov chain Monte Carlo
Brief overview of Bayesian inference
- \(\boldsymbol{x}=(x_1,\ldots,x_n)\): vector of observations
- \(f(\boldsymbol{x}\mid \boldsymbol{\theta})\): sampling density (likelihood)
- \(\pi(\boldsymbol{\theta})\): prior distribution of unknown \(\boldsymbol{\theta}=(\theta_1,\ldots,\theta_d)\).
- Inference on the parameter based on the posterior distribution \[ \pi(\boldsymbol{\theta}\mid \boldsymbol{x}) = \frac{f(\boldsymbol{x}\mid \boldsymbol{\theta})\,\pi(\boldsymbol{\theta})}{f(\boldsymbol{x})} \] where \[ f(\boldsymbol{x})= \int f(\boldsymbol{x}\mid \boldsymbol{\theta})\,\pi(\boldsymbol{\theta})~\mathrm{d}\boldsymbol{\theta} \] is called the marginal likelihood or evidence.
- Forecasting a future value of an observation \(x^*\) based on the posterior predictive distribution, with density \[ f(x^*\mid\boldsymbol{x}) = \int f(x^*\mid \boldsymbol{x},\boldsymbol{\theta}) \,\pi(\boldsymbol{\theta}\mid \boldsymbol{x})~\mathrm{d}\boldsymbol{\theta} \]
- Often interested in some feature of the posterior, e.g., \[ \mathbb{E}\left[h(\boldsymbol{\theta})\mid \boldsymbol{x}\right] = \int h(\boldsymbol{\theta}) \, \pi (\boldsymbol{\theta}\mid \boldsymbol{x})~\mathrm{d}\boldsymbol{\theta} \]
- Also interested in marginal posterior distributions \[ \pi(\theta_j\mid \boldsymbol{x})\qquad j=1,\ldots,d \]
Conditional conjugacy
- Recall notion of conjugacy in Bayesian problems where there is a single unknown parameter \(\theta\).
- Example: Let \(x_1,\ldots,x_n\) be i.i.d. from exponential\((\theta)\), so that \[ f(\boldsymbol{x}\mid \theta) = \theta^n e^{-\theta\sum_{i=1}^n x_i}, \] and suppose our prior is \(\theta\sim \text{Gamma}(2,1)\).
- Then posterior for \(\theta\) given data \(\boldsymbol{x}\) is \[ \theta\mid \boldsymbol{x}\sim \text{Gamma}\left(2+n, 1 + \sum_{i=1}^n x_i\right) \]
- Thus posterior distribution remains in the Gamma family for this problem whatever data are observed. The Gamma family is called conjugate.
- In many problems with a single unknown parameter it is usually possible to find useful conjugate priors. See for example Wikipedia's page for discrete distributions and continuous distributions.
- However, even though in theory this is also possible in higher dimensional situations, in practice the conjugate families end up being highly complex and difficult to summarise.
- On the other hand, many multi-parameter Bayesian problems exhibit conditional conjugacy.
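The single-parameter exponential-Gamma update above is easy to check by simulation (a minimal sketch; the "true" rate \(\theta = 2\) and the sample size 500 are arbitrary choices):

```r
# Checking the exponential-Gamma conjugate update by simulation
set.seed(1)
theta <- 2
x <- rexp(500, rate = theta)

# Prior Gamma(2, 1); posterior Gamma(2 + n, 1 + sum(x))
a_post <- 2 + length(x)
b_post <- 1 + sum(x)

a_post / b_post  # posterior mean, close to the true rate theta = 2
```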
Conditional conjugacy example
- Let \(X_i \mid \mu, \omega \sim N(\mu,1/\omega)\) independently for \(i=1,\ldots,n\). The likelihood is \[ f(\boldsymbol{x}\mid \mu,\omega) \propto \omega^{n/2} \exp\left\{-\frac{\omega}{2}\sum_{i=1}^n (x_i-\mu)^2\right\} \]
- Suppose also we have prior distributions for \(\mu\) and \(\omega\) \[ \mu \sim N (\mu_0,1/\kappa_0) \qquad \text{and} \qquad \omega\sim \text{Gamma}(\alpha_0,\lambda_0) \] where \(\mu\) and \(\omega\) are considered a priori independent.
- Posterior density of \(\mu,\omega\) is \begin{align*} \pi(\mu,\omega\mid \boldsymbol{x})\propto & \exp\left\{-\frac{\omega}{2}\sum (x_i-\mu)^2\right\} \exp\left\{-\frac{\kappa_0}{2}(\mu-\mu_0)^2\right\}\\\\ & \times \omega^{\frac{n}{2}+\alpha_0-1}\exp(-\lambda_0 \omega). \end{align*}
- Conditional posterior distributions belong to the same families as the priors \[ \pi(\mu \mid \omega,\boldsymbol{x}) \propto \exp\left\{-\frac{\omega}{2}\sum (x_i-\mu)^2\right\}\exp\left\{-\frac{\kappa_0}{2}(\mu-\mu_0)^2\right\} \] which is \[N\left(\frac{\kappa_0 \mu_0 + n\, \omega \bar{x}}{\kappa_0+n\, \omega},\frac{1}{\kappa_0 + n\, \omega}\right)\] while \[ \omega\mid \mu,\boldsymbol{x} \sim \text{Gamma}\left(\alpha_0 + \frac{n}{2},\lambda_0 + \frac{\sum_{i=1}^n(x_i-\mu)^2}{2}\right) \]
5.1.1 Gibbs sampling
- In multi-dimensional Bayesian problems it is rarely possible to compute analytically summary statistics such as posterior means and variances, or even posterior probabilities.
- Therefore, it is necessary to estimate the quantities of interest using a Monte Carlo approach. However, simulating from an arbitrary high dimensional distribution is usually difficult and often impossible to do directly.
- Instead, Markov chain Monte Carlo methods are used to simulate a Markov chain, whose stationary or limiting distribution is the posterior distribution of interest.
- The concept of conditional conjugacy is crucial in the construction of one of the basic forms of MCMC:
The Gibbs sampler algorithm
- Initialise \(\boldsymbol{\theta}=(\theta_1^{(0)},\theta_2^{(0)},\ldots,\theta_d^{(0)})\)
- Simulate \(\theta_1^{(1)}\) from \(\pi(\theta_1\mid \theta_2^{(0)},\theta_3^{(0)},\ldots,\theta_d^{(0)},\boldsymbol{x})\)
- Simulate \(\theta_2^{(1)}\) from \(\pi(\theta_2\mid \theta_1^{(1)},\theta_3^{(0)},\ldots,\theta_d^{(0)},\boldsymbol{x})\)
- \(\ldots\)
- Simulate \(\theta_d^{(1)}\) from \(\pi(\theta_d\mid \theta_1^{(1)},\theta_2^{(1)},\ldots,\theta_{d-1}^{(1)},\boldsymbol{x})\)
- Iterate.
R code (conditional conjugacy example)
gibbs <- function(X, mu0, kappa0, alpha0, lambda0, nburn = 0, ndraw = 5000) {
  n <- length(X)
  sumX <- sum(X)
  alpha1 <- alpha0 + (n/2)
  ## initialise mu and omega by drawing from the priors
  mu <- rnorm(1, mu0, sqrt(1/kappa0))
  omega <- rgamma(1, alpha0, 1)/lambda0
  draws <- matrix(nrow = ndraw, ncol = 2)
  iter <- -nburn
  while (iter < ndraw) {
    iter <- iter + 1
    stdev <- sqrt(1/(kappa0 + omega*n))
    mu1 <- (kappa0*mu0 + omega*sumX)*stdev^2
    mu <- rnorm(1, mu1, stdev)
    omega <- rgamma(1, alpha1, 1)/(lambda0 + (sum((X - mu)^2))/2)
    if (iter > 0) {
      draws[iter, 1] <- mu
      draws[iter, 2] <- omega
    }
  }
  return(draws)
}
X <- rnorm(100, 3, 1)
draws <- gibbs(X = X, mu0 = 0, kappa0 = 1, alpha0 = 0.1, lambda0 = 0.1,
               nburn = 1000, ndraw = 1000)
Prediction
Suppose we want to forecast an observable \(x^*\). The predictive distribution has density
\begin{align*} f(x^*\mid \boldsymbol{x})&=\int f(x^*\mid \mu,\omega) \pi(\mu,\omega \mid \boldsymbol{x})~\mathrm{d}\mu\,\mathrm{d}\omega\\\\ &=\mathbb{E}_{\mu,\omega\mid\boldsymbol{X}=\boldsymbol{x}}\left\{ f(x^*\mid \mu,\omega)\right\}\\\\ &\approx \frac{1}{J}\sum_{j=1}^J \left(\frac{\omega^{(j)}}{2\pi}\right)^{1/2} \exp\left\{-\frac{\omega^{(j)}}{2}\sum (x^*-\mu^{(j)})^2\right\} \end{align*}where \((\mu^{(j)},\omega^{(j)})\) are the draws obtained through the Gibbs sampler.
R code
## extract the draws
mu <- draws[, 1]
omega <- draws[, 2]
## trace plots, summaries, etc.
par(mfrow = c(1, 2))
plot(mu, type = "l")
plot(omega, type = "l")
## Autocorrelation
acf(mu)
acf(omega)
## Partial autocorrelation
pacf(mu)
pacf(omega)
## running means
plot(1:length(mu), cumsum(mu)/(1:length(mu)), type = "l")
plot(1:length(omega), cumsum(omega)/(1:length(omega)), type = "l")
## Predictive density
x <- seq(-3, 10, len = 100)
pred <- NULL
for (j in 1:length(x)) {
  pred[j] <- mean(dnorm(x[j], mu, 1/sqrt(omega)))
}
plot(x, pred, type = "l")
title("predictive density: p(x^*|x)")
5.1.2 The Metropolis-Hastings algorithm
General McMC algorithm
- It is not possible to use Gibbs sampling in all problems since sometimes the conditional posterior distributions cannot be calculated.
- Consider for example replacing in the previous model the assumption of normality by that of \[ X_i\mid \mu,\omega \sim \text{Cauchy}(\mu,1/\omega), \] independently for \(i=1,\ldots,n\).
- Then
\begin{align*}
\pi(\mu\mid \omega, \boldsymbol{x}) &\propto
\left\{\prod_{i=1}^n\frac{1}{1 + \omega (x_i-\mu)^2}\right\} \exp\left\{-\frac{\kappa_0}{2}\left(\mu-\mu_0\right)^2\right\}, \qquad \mu \in \mathbb{R}\\\\
\pi(\omega\mid \mu, \boldsymbol{x}) &\propto
\left\{\prod_{i=1}^n\frac{1}{1 + \omega (x_i-\mu)^2}\right\} \omega^{\frac{n}{2}+\alpha_0-1} \exp(-\lambda_0 \omega), \qquad \omega>0
\end{align*}
None of these distributions has a well-known form, so Gibbs sampling is precluded.
Genetic linkage Example
- Genetic linkage of 197 animals distributed into 4 categories \(\boldsymbol{x}=(x_1,x_2,x_3,x_4)=(125,18,20,34)\) with cell probabilities \[ \left(\frac{2 + \theta}{4},\frac{1-\theta}{4}, \frac{1-\theta}{4},\frac{\theta}{4}\right) \]
- Positive cell probabilities imply \(\theta \in [0,1]\)
- Suppose \(\theta \sim \text{Uniform}(0,1)\).
- The posterior density of \(\theta\) is \begin{align*} \pi(\theta\mid \boldsymbol{x}) &\propto f(\boldsymbol{x}\mid \theta)\\\\ & \propto (2 + \theta)^{x_1}\,(1-\theta)^{x_2+x_3} \theta^{x_4},\qquad \theta \in (0,1). \end{align*}
Metropolis-Hastings algorithm
- Set \(t=0\) and start the chain at some value \({\theta^{(t)}}\)
- Propose a candidate value \({\theta}^{(prop)}\sim q(\cdot\mid {\theta}^{(t)})\), where \(q\) is arbitrary
- Calculate
\[
\alpha({\theta}^{(t)},{\theta}^{(prop)})=
\min\left\{1,\frac{\pi({\theta}^{(prop) } \mid \boldsymbol{x} )\,q({\theta}^{(t)}\mid
{\theta}^{(prop)})}{\,\,\,\,\,\pi({\theta}^{(t)}\mid \boldsymbol{x})\,q({\theta}^{(prop)}\mid
{\theta}^{(t)})}\right\}
\]
- Generate \(U\sim \text{Uniform}(0,1)\)
- If \(U \leq \alpha({\theta}^{(t)},{\theta}^{(prop)})\) set \({\theta}^{(t+1)} = {\theta}^{(prop)}\)
- Otherwise set \({\theta}^{(t+1)} = {\theta}^{(t)}\)
- Set \(t = t+1\) and go back to step 2
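Before specialising to the genetic linkage example, the algorithm can be sketched generically. This is a minimal random-walk Metropolis sampler of my own construction (the function name `mh` and the standard normal target are assumptions for illustration); with a symmetric proposal the \(q\) terms in \(\alpha\) cancel:

```r
# Minimal random-walk Metropolis sampler for an unnormalised log target
mh <- function(logtarget, theta0, sd_prop = 1, ndraw = 5000) {
  draws <- numeric(ndraw)
  theta <- theta0
  for (t in 1:ndraw) {
    prop <- rnorm(1, theta, sd_prop)               # symmetric candidate
    logalpha <- logtarget(prop) - logtarget(theta) # log acceptance ratio
    if (log(runif(1)) <= logalpha) theta <- prop   # accept...
    draws[t] <- theta                              # ...otherwise stay put
  }
  draws
}

# Target: standard normal (log density up to an additive constant)
set.seed(42)
draws <- mh(function(th) -th^2 / 2, theta0 = 0)
mean(draws); sd(draws)  # roughly 0 and 1
```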
Genetic linkage
- Let the proposal be \(\theta^{(prop)}\sim \text{Uniform}(0,1)\), independently of \(\theta^{(t)}\) (an independence sampler), so the \(q\) terms cancel. The acceptance probability is \begin{align*} \alpha({\theta}^{(t)},{\theta}^{(prop)}) &= \min \left\{1,\frac{\pi(\theta^{(prop)} \mid \boldsymbol{x})}{\pi(\theta^{(t)}\mid \boldsymbol{x})}\right\}\\\\ &= \min \left\{1,\left(\frac{2 + \theta^{(prop)}}{2 + \theta^{(t)}}\right)^{x_1}\left(\frac{1-\theta^{(prop)}}{1-\theta^{(t)}} \right)^{x_2+x_3} \left(\frac{\theta^{(prop)}}{\theta^{(t)}}\right)^{x_4}\right\} \end{align*}
- Let \(\rho \in (0,1)\) and define \[ q(\theta^{(prop)}\mid \theta^{(t)}) = \phi\left(\frac{\Phi^{-1}(\theta^{(prop)}) - \rho\, \Phi^{-1}(\theta^{(t)})}{\sqrt{1-\rho^2}}\right)\bigg/\sqrt{1-\rho^2} \] where \(\phi\) denotes the density of the standard normal. What is \(\alpha({\theta}^{(t)},{\theta}^{(prop)})\) in this case?
R code: useful functions
# Second candidate generator: density and random number generation
q2 <- function(xprop, xt, rho) {
  a <- dnorm(qnorm(xprop))
  st.ratio <- (qnorm(xprop) - rho*qnorm(xt))/sqrt(1 - rho^2)
  dnorm(st.ratio)/(a*sqrt(1 - rho^2))
}
rq2 <- function(thetat, rho) {
  epsilon <- rnorm(1, 0, 1)
  pnorm(rho*qnorm(thetat) + sqrt(1 - rho^2)*epsilon)
}
# log acceptance ratio term for independence sampler, i.e., q ~ Uniform(0,1)
aratio <- function(X, thetaprop, thetat) {
  A <- X[1]*log((2 + thetaprop)/(2 + thetat))
  B <- (X[2] + X[3])*log((1 - thetaprop)/(1 - thetat))
  C <- X[4]*log(thetaprop/thetat)
  return(A + B + C)
}
genetic_is <- function(X, nburn = 0, ndraw = 1000){
  thetat <- runif(1, 0, 1); accept <- 0
  draws <- numeric(ndraw)
  iter <- -nburn
  while(iter < ndraw){
    iter <- iter + 1
    thetaprop <- runif(1, 0, 1)
    logalpha <- aratio(X, thetaprop, thetat)
    U <- runif(1, 0, 1)
    if(log(U) <= logalpha){
      accept <- accept + 1
      thetat <- thetaprop
    }
    if(iter > 0){
      draws[iter] <- thetat
    }
  }
  return(list(draws, accept/(nburn + ndraw)))
}
genetic_mh <- function(X, rho = 0.5, nburn = 0, ndraw = 1000){
  thetat <- runif(1, 0, 1); accept <- 0
  draws <- numeric(ndraw)
  iter <- -nburn
  while(iter < ndraw){
    iter <- iter + 1
    thetaprop <- rq2(thetat, rho)
    ## correct acceptance ratio: the ratio of candidate generator
    ## density terms does not vanish here
    logalpha <- aratio(X, thetaprop, thetat) +
      log(q2(thetat, thetaprop, rho)) - log(q2(thetaprop, thetat, rho))
    U <- runif(1, 0, 1)
    if(log(U) <= logalpha){
      accept <- accept + 1
      thetat <- thetaprop
    }
    if(iter > 0){
      draws[iter] <- thetat
    }
  }
  return(list(draws, accept/(nburn + ndraw)))
}
R code
# Run McMC algorithms
X <- c(125, 18, 20, 34)                  # observed cell counts
theta_is <- genetic_is(X, ndraw = 5000, nburn = 1000)
theta_mh <- genetic_mh(X, rho = 0.85, ndraw = 5000, nburn = 1000)
theta_is[[2]]; theta_mh[[2]]             # acceptance probabilities
# Traceplots
par(mfrow = c(1, 2))
plot(theta_is[[1]], type = "l", ylim = c(0, 1))
plot(theta_mh[[1]], type = "l", ylim = c(0, 1))
# Autocorrelation plots
par(mfrow = c(1, 2))
acf(theta_is[[1]], lag.max = 100)
acf(theta_mh[[1]], lag.max = 100)
# Posterior density estimates
plot(density(theta_is[[1]]), type = "l")
lines(density(theta_mh[[1]]), lty = 2)
General McMC algorithm
- Break the components of \(\boldsymbol{\theta}\) into \(d\) groups \(\boldsymbol{\theta}_1,\ldots,\boldsymbol{\theta}_d\) where each group \(\boldsymbol{\theta}_j\) has dimension \(\geq 1\).
- Initialise with \(\boldsymbol{\theta}_1^{(0)},\ldots,\boldsymbol{\theta}_d^{(0)}\).
- Update \(\boldsymbol{\theta}_1^{(0)}\) to \(\boldsymbol{\theta}_1^{(1)}\) "according to" the conditional distribution \(\pi\left(\boldsymbol{\theta}_1\mid\boldsymbol{\theta}_2^{(0)}, \boldsymbol{\theta}_3^{(0)},\ldots,\boldsymbol{\theta}_d^{(0)}, \boldsymbol{x}\right)\)
- Update \(\boldsymbol{\theta}_2^{(0)}\) to \(\boldsymbol{\theta}_2^{(1)}\) "according to" the conditional distribution \(\pi\left(\boldsymbol{\theta}_2\mid\boldsymbol{\theta}_1^{(1)}, \boldsymbol{\theta}_3^{(0)},\ldots,\boldsymbol{\theta}_d^{(0)}, \boldsymbol{x}\right)\)
- \(\ldots\)
- Update \(\boldsymbol{\theta}_d^{(0)}\) to \(\boldsymbol{\theta}_d^{(1)}\) "according to" the conditional distribution \(\pi\left(\boldsymbol{\theta}_d\mid\boldsymbol{\theta}_1^{(1)}, \boldsymbol{\theta}_2^{(1)},\ldots,\boldsymbol{\theta}_{d-1}^{(1)}, \boldsymbol{x}\right)\)
- Iterate
General McMC algorithm
- Suppose we are running a general McMC algorithm and that the current value of the chain is \(\boldsymbol{\theta}_1^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)}\).
- We now want to simulate \(\boldsymbol{\theta}_1^{(t+1)}\), the next value of \(\boldsymbol{\theta}_1\)
- In the updating steps of the previous algorithm we "update according to" the appropriate conditional distribution. For Gibbs sampling, we have seen this means simulating from the conditional distribution.
- For other McMC algorithms "update according to" means something else.
- The most general algorithm in this context is the Metropolis-Hastings algorithm.
Metropolis-Hastings algorithm
- Propose a candidate value \(\boldsymbol{\theta}_1^{(prop)}\) which is a draw from an arbitrary distribution with density \[ q(\boldsymbol{\theta}_1^{(prop)}\mid \boldsymbol{\theta}_1^{(t)}, \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)}) \]
- Take as the next value of \(\boldsymbol{\theta}_1\) in the chain
\begin{equation*}
\boldsymbol{\theta}_1^{(t+1)} =
\begin{cases}
\boldsymbol{\theta}_1^{(prop)}&\text{with probability $\alpha$}\\
\boldsymbol{\theta}_1^{(t)}&\text{with probability $1-\alpha$}
\end{cases}
\end{equation*}
where
\[ \alpha = \min\left\{1,\frac{\pi\left(\boldsymbol{\theta}_1^{(prop)}\mid \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)},\boldsymbol{x}\right)} {\pi\left(\boldsymbol{\theta}_1^{(t)}\mid \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)},\boldsymbol{x}\right)} \frac{q\left(\boldsymbol{\theta}_1^{(t)}\mid \boldsymbol{\theta}_1^{(prop)}, \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)}\right)} {q\left(\boldsymbol{\theta}_1^{(prop)}\mid \boldsymbol{\theta}_1^{(t)}, \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)}\right)}\right\} \] with \(\pi\left(\boldsymbol{\theta}_1^{(prop)}\mid \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)}\right)\) denoting the density corresponding to the conditional posterior distribution of \(\boldsymbol{\theta}_1\) evaluated at \(\boldsymbol{\theta}_1=\boldsymbol{\theta}_1^{(prop)}\) and similarly for \(\pi\left(\boldsymbol{\theta}_1^{(t)}\mid \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)}\right)\)
Comments
- In practice, the way to implement the second part of the Metropolis-Hastings updating mechanism is by drawing a value \(u\) from Uniform\((0,1)\) and taking \[ \boldsymbol{\theta}_1^{(t+1)}=\begin{cases} \boldsymbol{\theta}_1^{(prop)} &u\leq\alpha\\\\ \boldsymbol{\theta}_1^{(t)}& u>\alpha. \end{cases} \]
- The candidate generator \(q(\boldsymbol{\theta}_1^{(prop)}\mid \boldsymbol{\theta}_1^{(t)},\boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)})\) is arbitrary so, in principle, any choice should work. However, in terms of convergence properties (or mixing properties) of the algorithm, not all candidate generators are equally good.
- Note that the candidate generator can depend on the current value of the chain \(\boldsymbol{\theta}_1^{(t)}\), although this is not a requirement.
- The Metropolis-Hastings algorithm has the major advantage over the Gibbs sampler that it is not necessary to be able to simulate from all the conditional posterior distributions.
- Moreover, and this can be of crucial importance, we only need to know the conditional posterior densities up to proportionality, since any constants of proportionality cancel in the numerator and denominator of the calculation of \(\alpha\).
- The price for the simplicity is that if \(q\) is poorly chosen, then the number of rejections can be high, so that the efficiency of the procedure can be very low.
- Gibbs sampling is a special case of the Metropolis-Hastings algorithm where the candidate generator is chosen as \[ q(\boldsymbol{\theta}_{1}^{(prop)}\mid \boldsymbol{\theta}_{1}^{(t)}, \boldsymbol{\theta}_{2}^{(t)},\ldots,\boldsymbol{\theta}_{d}^{(t)})= \pi(\boldsymbol{\theta}_{1}^{(prop)}\mid \boldsymbol{\theta}_{2}^{(t)},\ldots,\boldsymbol{\theta}_{d}^{(t)}) \] in other words we draw directly from the conditional posterior distribution. In this case the acceptance probability is 1 so that \(\boldsymbol{\theta}_{1}^{(t+1)}=\boldsymbol{\theta}_{1}^{(prop)}\) always.
- A common choice of the candidate generator is to take \[ q(\boldsymbol{\theta}_{1}^{(prop)}\mid \boldsymbol{\theta}_{1}^{(t)}, \boldsymbol{\theta}_{2}^{(t)},\ldots,\boldsymbol{\theta}_{d}^{(t)})\sim N(\boldsymbol{\theta}_1^{(t)},V)\] where \(V\) is suitably chosen. This is the so-called random walk Metropolis algorithm with normal increments and it is very popular due to its simplicity.
- Note that the symmetry of the candidate generator distribution in random walk Metropolis means that the terms involving \(q\) in the acceptance probability \(\alpha\) cancel, so that \[ \alpha= \min\left\{1,\frac{\pi\left(\boldsymbol{\theta}_1^{(prop)}\mid \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)},\boldsymbol{x}\right)} {\pi\left(\boldsymbol{\theta}_1^{(t)}\mid \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)},\boldsymbol{x}\right)} \right\}. \]
- The variance \(V\) of this candidate generator plays an important role in the mixing properties of the algorithm: if \(V\) is chosen to be too large, the moves we propose can be too bold, leading to very low acceptance probabilities (slow mixing); if, on the other hand, \(V\) is too small, the acceptance probability will be very high but the chain moves in very small steps (slow exploration of the posterior).
- \(V\) is chosen by trial and error, aiming at an acceptance probability of (very) roughly 30\(\%\).
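The random walk Metropolis recipe above can be sketched in a few lines of R. This is a minimal illustration, not from the notes: the Gamma(3, 2) target (known only up to proportionality), the starting value, and the trial variance `V` are all illustrative assumptions.

```r
# Random walk Metropolis with normal increments, targeting a density
# known only up to proportionality (here: Gamma(3, 2), mean 3/2)
set.seed(1)
logtarget <- function(theta) {
  if (theta <= 0) return(-Inf)   # target has support (0, Inf)
  2*log(theta) - 2*theta         # log of theta^(3-1) * exp(-2*theta)
}
rwm <- function(ndraw, V, theta0 = 1) {
  draws <- numeric(ndraw); thetat <- theta0; accept <- 0
  for (t in 1:ndraw) {
    thetaprop <- rnorm(1, thetat, sqrt(V))          # bold for large V, timid for small V
    if (log(runif(1)) <= logtarget(thetaprop) - logtarget(thetat)) {
      thetat <- thetaprop; accept <- accept + 1     # q terms cancel by symmetry
    }
    draws[t] <- thetat
  }
  list(draws = draws, rate = accept/ndraw)
}
out <- rwm(10000, V = 1.5)
out$rate          # acceptance rate, tuned via V
mean(out$draws)   # should be close to the Gamma(3, 2) mean 3/2
```

Rerunning with much larger or much smaller `V` and watching `out$rate` and the traceplot is a quick way to see the trade-off described above.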
Cauchy example revisited
Consider \[ X_i\mid \mu,\omega \sim \text{Cauchy}(\mu,1/\omega), \] independently for \(i=1,\ldots,n\). The required conditional posterior densities are
\begin{align*} \pi(\mu\mid \omega, \boldsymbol{x}) &\propto \left\{\prod_{i=1}^n\frac{1}{1 + \omega (x_i-\mu)^2}\right\} \exp\left\{-\frac{\kappa_0}{2}\left(\mu-\mu_0\right)^2\right\}\\\\ \pi(\omega\mid \mu, \boldsymbol{x}) &\propto \left\{\prod_{i=1}^n\frac{1}{1 + \omega (x_i-\mu)^2}\right\} \omega^{\frac{n}{2}+\alpha_0-1} \exp(-\lambda_0 \omega). \end{align*}
Here we need to resort to a more general Metropolis-Hastings algorithm.
- Using the general McMC algorithm (pg 21), the parameter is divided into two groups, \(\mu\) and \(\omega\), and we need to update \(\mu\) "according to" \(\pi(\mu\mid \omega,\boldsymbol{x})\) and \(\omega\) "according to" \(\pi(\omega\mid \mu,\boldsymbol{x})\)
- We use random walk Metropolis with normal increments for both \(\mu\) and \(\omega\). This leads to the following algorithm
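Before coding the full sampler, the two conditional log-posterior densities can be written down directly. This is a minimal sketch, not from the notes: the function names `logpost_mu` and `logpost_omega` and the hyperparameter defaults are illustrative assumptions.

```r
# Log conditional posterior densities (up to additive constants);
# mu0, kappa0, alpha0, lambda0 are prior hyperparameters (illustrative defaults)
logpost_mu <- function(mu, omega, x, mu0 = 0, kappa0 = 1) {
  -sum(log(1 + omega*(x - mu)^2)) - (kappa0/2)*(mu - mu0)^2
}
logpost_omega <- function(omega, mu, x, alpha0 = 1, lambda0 = 1) {
  if (omega <= 0) return(-Inf)   # omega must be positive
  -sum(log(1 + omega*(x - mu)^2)) +
    (length(x)/2 + alpha0 - 1)*log(omega) - lambda0*omega
}
# Example evaluation at mu = 0, omega = 1 for a small data vector
logpost_mu(0, 1, c(-1, 0, 2))
```

Random walk Metropolis only ever needs differences of these log-densities, so the unknown normalising constants never have to be computed.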
Algorithm
- Set \(t=0\) and initial values \((\mu^{(t)},\omega^{(t)})\).
- Draw \(\mu^{(prop)}\sim N(\mu^{(t)},v_\mu)\) for some suitably
chosen variance \(v_\mu\), and take
\begin{equation*}
\mu^{(t+1)}=\begin{cases}
{\mu}^{(prop)} &\text{w.p. $\alpha$} \\\\
{\mu}^{(t)}& \text{w.p. $1-\alpha$}.
\end{cases}
\end{equation*}
where
\begin{align*} \alpha = \min\left\{1,\exp\left[\frac{\kappa_0}{2}\left\{\left(\mu^{(t)}-\mu_0\right)^2 - \left(\mu^{(prop)}-\mu_0\right)^2\right\}\right] \times\prod_{i=1}^n\left(\frac{1 + \omega^{(t)} (x_i-\mu^{(t)})^2} {1 + \omega^{(t)} (x_i-\mu^{(prop)})^2}\right)\right\} \end{align*}
- Draw \(\omega^{(prop)}\sim N(\omega^{(t)},v_\omega)\) for some suitably chosen \(v_\omega\), and take \[ \omega^{(t+1)}=\begin{cases} {\omega}^{(prop)} &\text{w.p. $\alpha$} \\\\ {\omega}^{(t)}& \text{w.p. $1-\alpha$}. \end{cases} \] where \begin{align*} \alpha = \min\left\{1,\left(\frac{\omega^{(prop)}}{\omega^{(t)}}\right)^{\frac{n}{2}+\alpha_0-1} e^{\lambda_0\left(\omega^{(t)} - \omega^{(prop)}\right)} \times\prod_{i=1}^n\left(\frac{1 + \omega^{(t)} (x_i-\mu^{(t+1)})^2} {1 + \omega^{(prop)} (x_i-\mu^{(t+1)})^2}\right) I(\omega^{(prop)}>0)\right\} \end{align*}
- Iterate
R code
cauchy_rwm <- function(X, mu0 = 0, kappa0 = 1, alpha0 = 1, lambda0 = 1,
                       nburn = 0, ndraw = 1000, vmu = 1, vomega = 1){
  n <- length(X)
  alpha1 <- (n/2) + alpha0 - 1
  stdvmu <- sqrt(vmu)
  stdvomega <- sqrt(vomega)
  ## initialise at draws from the priors
  mut <- rnorm(1, mu0, sqrt(1/kappa0))
  omegat <- rgamma(1, alpha0, 1)/lambda0
  draws <- matrix(nrow = ndraw, ncol = 2)
  acceptmu <- 0
  acceptomega <- 0
  iter <- -nburn
  while(iter < ndraw){
    iter <- iter + 1
    ## random walk update for mu
    muprop <- rnorm(1, mut, stdvmu)
    logalpha <- (kappa0/2)*((mut - mu0)^2 - (muprop - mu0)^2) +
      sum(log(1 + omegat*((X - mut)^2)) - log(1 + omegat*((X - muprop)^2)))
    u <- runif(1, 0, 1)
    if(log(u) < logalpha){
      acceptmu <- acceptmu + 1; mut <- muprop
    }
    ## random walk update for omega; non-positive proposals are rejected
    omegaprop <- rnorm(1, omegat, stdvomega)
    if(omegaprop > 0){
      logalpha <- alpha1*(log(omegaprop) - log(omegat)) +
        lambda0*(omegat - omegaprop) +
        sum(log(1 + omegat*((X - mut)^2)) - log(1 + omegaprop*((X - mut)^2)))
      u <- runif(1, 0, 1)
      if(log(u) < logalpha){
        acceptomega <- acceptomega + 1; omegat <- omegaprop
      }
    }
    if(iter > 0){
      draws[iter, 1] <- mut
      draws[iter, 2] <- omegat
    }
  }
  return(draws)
}