.header_author {font-size: 4em}

MATH11176 Statistical Programming

Table of Contents

1 General

1.1 Course Information

Lecturer Ioannis Papastathopoulos  
Office hour Mon 15.00 - 16.00 JCMB 4602 (TBC)
Lectures Mon 16.10 - 18.00 Joseph Black Building LTh 100
  Mon 16.10 - 18.00 Ashworth Lab LTh 1 (Weeks 5, 6)
Labs Thu 16.10 - 18.00 JCMB 3210 & 5205

1.2 Recommended reading

References (learning R) source Level
Wickham, H. (2014). Advanced R. CRC Press book, companion intermediate-advanced
Grolemund, G. and Wickham, H. (2017). R for Data Science. O'Reilly. book beginner-intermediate
Wood, N. S. (2015). Core Statistics. Cambridge Univ. Press book, library beginner-intermediate
Crawley, M. (2012). The R book. Wiley book, library beginner
Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S-plus. Springer library intermediate-advanced
References (Optimization) source
Nocedal, J. and Wright, S. J. (2006). Numerical Optimization library
Givens, H. and Hoeting, J. A. (2012). Computational Statistics. Wiley library

2 R Basics

2.1 Getting started

Statistical analysis of data sets is conducted using computers. Various specialist computer programs are available to facilitate statistical work. For using general statistical theory directly with custom built models, R is probably the most useful of these, because of its flexibility. R is a programming language and environment designed for statistical analysis and is written and maintained by a community of statisticians.

A major design feature is extendibility. R makes it straightforward to code up statistical methods in a way that is easy to distribute and for others to use. The first place to look for information on getting started with R is


2.1.1 Installing R and using an editor to write code


  • Download R from the CRAN website
  • Run the .exe file that was just downloaded
  • Go to the RStudio download page
  • Under Installers select RStudio x.yy.zzz - Windows XP/Vista/7/8 (where x, y, and z represent version numbers)
  • Double click the file to install it
  • Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.


  • Download R from the CRAN website
  • Select the .pkg file for the latest R version
  • Double click on the downloaded file to install R
  • It is also a good idea to install XQuartz (needed by some packages)
  • Go to the RStudio download page
  • Under Installers select RStudio x.yy.zzz - Mac OS X 10.6+ (64-bit) (where x, y, and z represent version numbers)
  • Double click the file to install RStudio
  • Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.


  • Follow the instructions for your distribution from CRAN, they provide information to get the most recent version of R for common distributions. For most distributions, you could use your package manager (e.g., for Debian/Ubuntu run sudo apt-get install r-base, and for Fedora sudo yum install R), but we don’t recommend this approach as the versions provided by this are usually out of date. See Darren Wilkinson's research blog post for installing the latest version of R. In any case, make sure you have at least R 3.3.1.
  • Go to the RStudio download page
  • Under Installers select the version that matches your distribution, and install it with your preferred method (e.g., with Debian/Ubuntu sudo dpkg -i rstudio-x.yy.zzz-amd64.deb at the terminal).
  • Once it’s installed, open RStudio to make sure it works and you don’t get any error messages.

2.1.2 Adding Comments to Code

You should always write a brief commentary on what your code means, so that when you come to look at it after a period of time it still makes sense. You do this by preceding your comments with a '#'.

For example, you could send all of the below code to R

## It is generally useful to have general
## headings outlining the code that follows
## This line will not be run by R
## but the above line will!

## comments can be used to explain what your code does
## the line below adds together 5 and 6


## the line below will give an error message
because there is no '#' at the beginning
## so *R* tries to run it, but "doesn't understand"

You should save your script files often! They may be useful in the future, as you may wish to re-use your code. You can open as many separate scripts as you like.

2.1.3 Simple R Arithmetic

At its most basic level, R can be used as a calculator. For example:

1+2+3   # Type this
[1] 6   # *R* gives you the answer, labelled with a [1]

Besides the basic arithmetic operators, R contains a number of arithmetic functions. Some of these are

R Function Description
sqrt(x) Compute the square root of \(x\)
abs(x) Compute the absolute value of \(x\)
sin(x), cos(x), tan(x) Trigonometric functions
asin(x), acos(x), atan(x) Inverse Trigonometric functions
exp(x),log(x) Exponential and natural log of \(x\)
gamma(x), lgamma(x) The gamma function and its log
round(x) Round/truncate a number

Exercise: Evaluate the following expressions:

\begin{eqnarray*} 1)&&11+22+33+44+55+66+77\\ 2)&&11\times22\times33\times44\times55\times66\\ 3)&&1-\exp(-1/2)\\ 4)&&\frac{1}{\sqrt{2\pi3}}\exp\left\{-\frac{1}{2}\frac{(1-2)^2}{3}\right\}\\ \end{eqnarray*}

2.2 An Introduction to R Objects

When you start R two important things are created: a command prompt at which to type commands telling R what to do, and an environment, known interchangeably as the global environment or user workspace to hold the objects created by your commands. Unlike the command prompt, you don't see the global environment directly, but its there as an extendable chunk of computer memory for holding your data, commands and other objects. Generically, in R, an "environment" consists of two things. The first, known in R jargon as a frame, is a set of symbols used to refer to objects, along with the data defining those objects. The second is a pointer to an enclosing environment.

Everything in R is an object living in an environment, including R commands themselves. In R, objects can store data, and perform functions on data. In this section, we will introduce some different types of data objects, and show you how to store data in them. Besides the data it contains, one of the most important aspects of an object is its name. Always try to give your objects meaningful names, that way you won't forget what they contain!

Try typing the following to store the value 10 in object x

x <- 10

You can retrieve the value by typing the object name


The <- symbol can be read as the word assigns (assignment symbol). So in English we can read the above command as "assign the value 10 to the object x" and when we type x we are essentially saying "what is \(x\)?".

Objects can also store the results of operations on other objects

y <- x <-  3 * 4 + 20

Objects in R carry around information about the basic type of thing that they are made up of. Actually somewhat confusingly, they carry around 3 different classifications of the sort of basic thing they are made up of: their type, mode and storage mode. The following illustrates this by creating the vector \((1, 2, 3, 4)\) and examining its type, mode and storage mode,

z <- 1:4    # create vector (1, 2, 3, 4)
> typeof(b)
[1] "integer"
> mode(z)
[1] "numeric"
> storage.mode(z)
[1] "integer"

For the most part it is not necessary to worry much about the modes and type of an object: for example the conversion between real and integer numbers is automatic and need seldom concern the programmer. The exception is when calling code written in other languages from R, when it is essential to know the storage mode of data being passed to the external code.

Objects can also carry a variety of extra information as attributes. Attributes have a name (they can be thought of as a named list with unique names), and can be an object of any type. They behave rather like a whole bunch of notes stuck onto the object and carried around with it. The attributes function lets you access all the attributes of an object (as a list), while the attr function allows individual attributes to be set and extracted.

As an example, let’s give the vector z, above, an attribute consisting of a 2 by 2 matrix (and then print it)

attr(z,"mat") <- matrix(1:4,2,2)
> attr(z,"mat")
[,1] [,2]
[1,] 1 3
[2,] 2 4

You can add as many attributes as you like, and they are used by R itself in many ways, including in the implementation of matrices and higher dimensional arrays. They are also used in R 's basic object orientation mechanism to implement the notion of the class of an object.

Object names in R can only contain letters, numbers, underbars _ or fullstops ., with the condition that names can only begin with an alphabetic character, e.g. Temp, and high.5 are all valid, whereas 7up, sample-mean, and results! are not. Note that R is case sensitive, i.e., myObject and myobject are different objects to R. Try storing different values in scalar objects of different names.

Objects can be overwritten, R will remember the last thing you stored. Take care with this when defining objects that you might want to use again later!

Make sure you have written any examples you have tried out in your script file — they will be useful to refer to later.

2.2.1 Vectors and vector arithmetic

Vectors in R are the same as vectors in mathematics and constitute the basic data structure in R. They come in two flavours: atomic vectors and lists. They have three common properties

  • Type, typeof(), what it is.
  • Length, length(), how many elements it contains
  • Attributes, attributes(), additional arbitrary metadata

They differ in the types of their elements: all elements of an atomic vector must be of the same type, whereas the elements of a list can have different types. Try creating a short vector:

myVector <- c(1,2,3,4)

We use the concatenate c() command to tell R that we want our values to be represented as a vector. What happens if you now type myVector?

Vectors may be combined with scalars and/or with other vectors in arithmetic operations and may also be supplied as arguments to functions, e.g.,

[1] 3 4 6
[1] 3 6 8
[1] 1.000000 1.414214 1.732051 2.000000 2.236068

The recycling rule: When working with vectors we often want to perform operations involving a scalar and a vector (such as multiplying all the elements of a vector by 2). In R, scalars are simply vectors of length 1 – there is nothing special about them, but R has a recycling rule, which is a vector oriented generalization of what happens when a scalar is multiplied by a vector. The recycling rule states that when two vectors of different length appear in an operation then the shorter is replicated as many times as is necessary to produce a vector of the length of the longer vector, and this recycled vector is what is used in computation.

Exercise: What will be printed after running

c(1, 2, 3) + c(1, 2)

in the R console?

Exercise: The following table gives the frequencies of the number of matches per box produced by Burnam's match factory on 1st February 2017.

Number of matches Frequency
45 5
46 4
47 8
48 5
49 12
50 16
51 19
52 18
53 14
54 10
55 7

Define a vector matchNum of length 11 by typing in the number of matches per box.

Also define a vector matchFreq which identifies the frequencies of matches per box. Make sure the entries of matchFreq correspond to the relevant entries of matchNum. Try typing length(matchNum) and length(matchFreq) to ensure you didn't miss any!

Setting up these vectors is all very well, but what can we do with them? Suppose we want to know how many matchsticks were produced by Burnam's on 1st February?

R provides an easy way to multiply vectors pointwise:

matchFreq * matchNum

Does this return a scalar or a vector? Store the result of this command in another object. You can also perform mathematical operations on vectors. Here we divide the matchFreq by its smallest value (try ?min): Data manipulation in R is vector based. That is wherever possible we work with whole vectors, rather than individual elements of vectors, since the former is much more efficient in an interpreted language. So standard operators and mathematical functions in R are defined in a vector oriented manner.


Exercise: Using the R function sum(), we can sum the elements of a vector. Use the online help system to help you find out how to use sum(), and calculate and store the number of matchsticks made on 1st February 2017 at Burnam's in an object.

R allows us to create vectors of sequences easily. Experiment with the following:

mySeq  <- 1:10
myReps <- rep(4,5)
mySeq  <- seq(0,10,by=0.66)
mySeq  <- seq(0,10,length=13)

Look up the online help for rep() and seq() to see their syntax and understand what role each value plays in the above examples.


  • Construct the matchNum vector you created by using the method for mySeq above.
  • Create a vector that repeats 10 times a sequence of integers between 1 and 5.
  • Evaluate the square root of all the odd numbers between 1 and 25.
  • The function sum() adds the numbers in the vector supplied. For example
[1] 16

Find the sum of the numbers from 1 to 50. Check your answer (again using R) against the formula \[ S = n\,(n + 1)/2. \] Now find the sum of the reciprocals of the first fifty integers. There is no simple formula to check this but your answer should be about \(\log(50.5) - \log (0.5)\) (why?).


2.2.2 Matrices arithmetic and efficiency

Matrices are 2 dimensional arrays and are of class matrix. They are treated as a separate class to facilitate numerical linear algebra with matrices. Matrices in R are the same as matrices in mathematics. Try creating a small matrix:

myMatrix <- matrix(c(1,2,3,4,5,6), nrow=3, ncol=2, byrow=FALSE)

Notice that we use a vector to tell R what the elements are in a matrix. The byrow command tells R whether to fill the matrix row-wise or column-wise.

Exercise: Experiment with creating matrices. What happens if you change byrow to TRUE? What happens if you omit the byrow argument? Try the following, and explain the output you see:

myMatrix1 <- matrix(c(1,2,3), nrow=3, ncol=2, byrow=FALSE)
myMatrix2 <- matrix(c(1,2,3,4,5,6,7), nrow=3, ncol=2, byrow=FALSE)
myMatrix3 <- matrix(c(1,2,3,4,5,6,7,8), ncol=2, byrow=FALSE)
myMatrix4 <- matrix(c(1,2,3,4,5,6),2,3, FALSE)

Experiment with other settings, and look at ?matrix. It is important to understand how R deals with ambiguous situations. This can save a lot of frustration!

Vectorized elementwise operations mean that A*B does not perform matrix multiplication and A/B does not produce \(AB^{-1}\). Instead these and other matrix operations have special functions and operators. The operator for matrix multiplication is %*%.


a <- matrix(c(2,3,4,1,5,6), nrow=2, ncol=3)
b <- matrix(1:12, nrow=3, ncol=4)
c <- matrix(6:1, nrow=2, ncol=3, byrow=TRUE)

Exercise: What happens if you try b%*%a? Why? Also try 2*a, a*b and a*c, and explain the output in each case. What happens if you try addition instead of multiplication in these examples?

The amazing combination of speed and accuracy apparent in many computer algorithms and numerical methods can lead to a false sense of security and the wrong impression that in translating mathematical and statistical ideas into computer code we can simply treat the numerical methods and algorithms we use as "infallible boxes". This is not the case, consider the following example.

The system.time function in R measures how long a particular operation takes to execute. First generate two \(2000 \times 2000\) matrices, \(A\) and \(B\) and a \(2000\times 1\) vector \(y\).

n <- 2000
A <- matrix(runif(n*n),nrow=n,ncol=n)
B <- matrix(runif(n*n),nrow=n,ncol=n)
y <- runif(n)

Now form the product \(A B y\) in two slightly different ways

system.time(f0 <- (A%*%B)%*%y)
 user  system elapsed
6.150   0.018   6.184
system.time(f1 <- A%*%(B%*%y))
 user  system elapsed
0.015   0.000   0.015

f0 and f1 are identical, to machine precision, so why is one calculation so much quicker than the other? The answer is all to do with how the multiplications were ordered in the two cases, and the number of floating point operations (flop) required by the two alternative orderings

  • In the first case \(AB\) was formed first, and the resulting matrix was used to pre-multiply the vector \(y\).
  • In the second case, the vector \(By\) was formed first, and was then pre-multiplied by \(A\).

Now we need to count the floating point operations.


  • How many flops \((+,\, -,\, *,\, /)\) are involved in the multiplication

\[ \left(\begin{matrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{23} & b_{33} \\ \end{matrix}\right) \left(\begin{matrix} y_{1} \\ y_{2} \\ y_{3} \\ \end{matrix}\right) \]

  • How many flops are involved in the multiplication

\[ \left(\begin{matrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{23} & a_{33} \\ \end{matrix}\right) \left(\begin{matrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{23} & b_{33} \\ \end{matrix}\right) \]

  • Now compute the number of operations required for the two cases above.

2.2.3 Characters and factors

R can store character data in the same way as we have been storing numeric data. Character data is enclosed in single quote marks ' ' or double quote marks " ". Use either, as long as they match. Examples:

c1 <- "Hello"                    # A scalar character string
[1] "Hello"

c2 <- c("Yes","Maybe","No")      # A vector of characters
[1] "Yes"   "Maybe" "No"

c3 <- c("Is","Could Be","Isn't") # Note single quote inside doubles

You can also use some of the functions we used to construct numeric vectors with character data:

c3 <- rep("Monkey",4)             # Using rep()
[1] "Monkey" "Monkey" "Monkey" "Monkey"

Of course, don't expect to be able to use numeric functions such as sum() on characters!

Factors are, conceptually, vectors of labels that serve to group other data. They have a special place in statistical modelling, and as such require special handling. In R, factors have class factor and another attribute levels, which is a vector of the unique labels occurring in the factor object. If you print a factor variable, then what is printed is the label given in each element of the vector. However, what is actually stored is a set of integers, indexing the "levels" attribute, and it is the print function that is actually doing the conversion from stored integer to corresponding label.

Suppose you have data on some people and you want to store their sex. You could use a numeric code — 0 for male, 1 for female, or you could use a character code — 'M' for male and 'F' for female. But you should really use a factor.

Factors are easy to construct from character vectors. Use the function as.factor() with a character vector:

c5 <- c("M","F","F","F","M","M")
f5 <- as.factor(c5)
[1] M F F F M M
Levels:  F M

Notice that a factor gets printed out like a character vector, but without the quote marks. The levels are also printed out.

The categories in a factor can also be seen by using the levels() function:

[1] "F" "M"   # The result is a character vector

And you can change these levels easily by assigning a new character vector to the levels:

levels(f5) <- c("Female","Male")}
[1] Male   Female Female Female Male   Male
Levels:  Female Male

One thing you can do with a factor is to tabulate the counts of each category with the table() function:

Female   Male
3        3     # 3 of each

Exercise: Here is a data set of some people's response to the questions

what is your favourite hot drink? what is your favourite biscuit?
coffee bourbon
tea rich tea
chocolate bourbon
tea pink wafer
tea pink wafer
coffee rich tea
coffee bourbon
chocolate jammie dodger
chocolate bourbon
tea bourbon
tea digestive
tea jammie dodger
chocolate jammie dodger
coffee pink wafer
tea rich tea
tea rich tea
tea digestive
coffee bourbon
tea digestive
coffee pink wafer

Do not type this data in in full. Instead, give each drink and each biscuit a number and make vectors of the numbers corresponding to the lists above (just type the numbers without quotes).

Now turn your vectors into factors with the right names and tabulate the levels.

Now use table(drinks,biscuits) to cross-tabulate the drink and biscuit preferences.

2.2.4 Subscripts

Suppose we want to extract a part of a vector. R allows us to do this using square brackets.

x <- 10:1
y <- x[2]
[1] 9

Here y gets the value of the second element of x.

Slicing more…

You can extract parts of a vector by subscripting with another vector.

y <- x[c(1,3,5)]
[1] 10  8  6

[1] 7 6 5 4

Here the returned value is a vector.

Don't bite off more than you can chew…

If you try and subscript a vector with a number out of range…

[1] NA

you get NA. This is the symbol R uses for missing data.

Negative Subscripts

Negative subscripts? What's the -2nd element of a vector? R defines this as being all the vector except the 2nd element. This gives us an easy way of removing some elements from a vector.

x <- 10:1
x[-3]  # omit the third element
 [1] 10  9  8  7  6  5  4  3  2  1

[1] 8 7 4

x[-c(3,4,7)] # you can also exclude several elements simultaneously
[1] 10  9  6  5  3  2  1
x[c(-3,-4,-7)] # this gives the same answer,
               # think of -c(3,4,7) = (-1) * c(3,4,7)

You can't mix positive and negative subscripts. Why would you want to?

Some more examples

x <- 1:10
x[-length(x)]                    # all but the last
[1] 1 2 3 4 5 6 7 8 9

x[length(x):1]                   # reverse x
[1] 10  9  8  7  6  5  4  3  2  1

There is a function to do this last one: rev() - most of the simple things you may want to do tend to have their own functions! Try using help.search() to find more functions.

Exercise: There is a function in R that acts as the modulo operator - it is used by typing %% - for example 8 Modulo 2 is given by:

8 %% 2
[1] 0

Use this function, together with appropriate commands for sequences, subscripts and lengths of vectors to calculate the number of values between 1 and 1000 that have 13 as a factor.

Remember: for \(n\) to have \(x\) as a factor \(n\) modulo \(x\) must equal \(0\).

Exercise: Create a vector of all the years in the 21st century, then take out the leap years. (Let's assume the year 2000 was actually a leap year!)

Subscripting Matrices Subscripting a matrix works in a very similar way to subscripting vectors, except there are now two dimensions to specify rather than one, so two subscripts are needed! R always uses the convention myMat[row,column] for subscripting.

myMat <- matrix(1:12,nrow=3,ncol=4)     #create the matrix
myMat[2,3]    #return the entry in position [2,3] of the matrix
              #(2nd row, 3rd column)
myMat[3,]     #return the third row of the matrix
myMat[,1:3]   #return the first to third columns of the matrix
myMat[-2,-1]  #return the matrix with the 2nd row and 1st column removed

VERY IMPORTANT: when subscripting matrices, if you end up returning a submatrix with either one row or one column, R will (annoyingly) convert it into a vector! A vector is NOT a \(1 \times n\) matrix, R treats them differently!

You can avoid this by using the as.matrix() command. For example:

a.vect <- myMat[3,]
a.mat  <- as.matrix(myMat[3,])

These both needed to be subscripted in different ways — have a go! In general, everything you learnt about subscripting vectors also works for matrices, but you do need to be careful with the dimensions!

2.2.5 Logical values

As we saw in the previous section, R enables you to compute with Boolean, or Logical, variables. These take on either the value TRUE or FALSE. You can store them in objects just as you can numeric or character data. You can use conditional tests to generate these values:

x    <- 1:10
gt5  <- x > 5
x == 4
gt3  <- x >= 3

Note: <= means "less than or equal to', >= means "greater than or equal to",

Logical operators

You can do "and" and "or" operations on vectors with & and |:

 x > 2 & x != 7
 x < 2 | x > 7

Note: != means "not equal to"

True and False as Numbers…

If you try and do anything numeric with logical values, R converts true to 1 and false to 0. This can be usefully employed to count the number of true values in a vector.


x <- 1:10 
y <- x > 3 
[1] 0 0 0 2 2 2 2 2 2 2 
[1] 7

Logical Subscripts You can use a vector of logical TRUE/FALSE values as a subscript. This is extremely useful.


 x <- c(6,5,6,4,4,3,4,2,3,4) 
 y <- c(5,3,4,2,6,5,4,5,4,3)  # couple of random vectors 

 xeq4 <- x == 4 # which of x equals 4? 

 y[xeq4] # which of y correspond to x==4?
[1] 2 6 4 3   
 y[x == 4] # or simply... 
[1] 2 6 4 3

Or you may just wish to return the elements of a vector that satisfy some rule, for example:

z <- y[y>3]   #put into z only those elements of y that are larger than 3.

You can even apply this idea to a matrix:

ymat<-xmat[xmat[,1]==0,]   #ymat contains only the rows of xmat
                           #whose entry in the first column is zero

Exercise: You can generate random \(\mbox{Uniform}(0,1)\) numbers using the function runif() (r stands for random), e.g.,

RandomUniforms <- runif(10)

Create a vector containing 200 random numbers generated from the Uniform distribution. Create logical vectors indicating:

  • which are larger than \(0.5\);
  • which are between 0.2 and 0.7.

Now use sum() to find how many of your random numbers were in each of these categories.

Are you keeping your script files up to date??

2.2.6 Reading data in from files

First download the file class96.dat from Learn and save it in your working directory. The class96 dataset contains the heights and weights of the students enrolled in GSSE401 in 1996. Let's get this data into R:

myTable <- read.table("class96.dat")

Try the function class(myTable). What do you get? Here, we have a data frame — a list of variables of the same length (each with a unique name). They are arranged in a 2d structure like a matrix (and, indeed, they can be subscripted in exactly the same way).

2.2.7 Data frames and lists

A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier. Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list. This means that a data frame has names(), colnames(), and rownames(), although names() and colnames() are the same thing. The length() of a data frame is the length of the underlying list and so is the same as ncol(); nrow() gives the number of rows.

Creation You create a data frame using data.frame(), which takes named vectors as input

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: Factor w/ 3 levels "a","b","c": 1 2 3

Beware data.frame() default behaviour which turns strings into factors. Use stringsAsFactors = FALSE to suppress this behaviour:

df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE)
## 'data.frame':    3 obs. of  2 variables:
##  $ x: int  1 2 3
##  $ y: chr  "a" "b" "c"

Because a data.frame is an S3 class, its type reflects the underlying vector used to build it: the list. To check if an object is a data frame, use class() or test explicitly with is.data.frame():

## [1] "list"
## [1] "data.frame"
## [1] TRUE

You can coerce an object to a data frame with as.data.frame():

  • A vector will create a one-column data frame.
  • A list will create one column for each element; it’s an error if they’re not all the same length.
  • A matrix will create a data frame with the same number of columns and rows as the matrix.

Combining data frames

You can combine data frames using cbind() and rbind():

cbind(df, data.frame(z = 3:1))
##   x y z
## 1 1 a 3
## 2 2 b 2
## 3 3 c 1
rbind(df, data.frame(x = 10, y = "z"))
##    x y
## 1  1 a
## 2  2 b
## 3  3 c
## 4 10 z

When combining column-wise, the number of rows must match, but row names are ignored. When combining row-wise, both the number and names of columns must match. Use plyr::rbind.fill() to combine data frames that don’t have the same columns.

It’s a common mistake to try and create a data frame by cbind()-ing vectors together. This doesn’t work because cbind() will create a matrix unless one of the arguments is already a data frame. Instead use data.frame() directly:

bad <- data.frame(cbind(a = 1:2, b = c("a", "b")))
## 'data.frame':    2 obs. of  2 variables:
##  $ a: Factor w/ 2 levels "1","2": 1 2
##  $ b: Factor w/ 2 levels "a","b": 1 2
good <- data.frame(a = 1:2, b = c("a", "b"),
  stringsAsFactors = FALSE)
## 'data.frame':    2 obs. of  2 variables:
##  $ a: int  1 2
##  $ b: chr  "a" "b"

The conversion rules for cbind() are complicated and best avoided by ensuring all inputs are of the same type.

As we have seen, data frames can be thought of as matrices of data, where each column has a name, but not necessarily the same type, e.g., having a mixture of logical, numeric, factor and character columns is no problem. This format is a natural way of storing statistical datasets. Here is a small example

dat <- data.frame(y=5:7,lab=c("um","er","er"))
  y lab
1 5 um
2 6 er
3 7 er
[1] "data.frame"

By default the character vector lab was converted to a factor variable. Data frames can be accessed like any other matrix, but also by variable name and as a list (see below).

Let's stick with the example we created above. We've already seen that the class() function tells us that R treats our data as a data frame. This structure is used whenever we want to group vectors of data which have corresponding rows. The names() function gives you the names of the vectors in the data frame:

[1] "V1" "V2" "V3" "V4"

Not very informative, huh? Why is this? Let's give the columns some useful names:

names(myTable) <- c("index","height","weight","sex")}

You can refer to a column in the data frame in one of two ways. The first uses the column name and a dollar sign:


The second uses subscripting in the same was as a matrix. Recall:

myTable[,1]     #extract first column
myTable[1,]     #extract first row


  • Sex is a factor with two levels, but R doesn't know this yet. Using what you learnt in Section Characters and factors and the above, turn the Sex column into a factor.
  • Using the R functions mean() and sd(), find the mean and standard deviation of both the height and weight columns.
  • Using your knowledge of subscripts in R (Section Subscripts, make two new data frames for males and females separately. In the data 1 represents females and 2 represents males. Hint: Use logical subscripts!

Lists Lists are the basic building blocks for all sorts of complicated objects in R. Lists consist of any number of numbered, and optionally named, items, each of which can be an object of any type, e.g.,

mylist <- list(a="Cupcake", 1:13, M=matrix(1,2,2))
[1] "Cupcake"

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13

     [,1] [,2]
[1,]    1    1
[2,]    1    1
[1] "list"

The components of a list are always numbered and may be referred to as such. Elements can be accessed by number, starting from 1, using double square brackets. e.g. mylist[[1]] accesses "Cupcake". If the item has a name then this provides an alternative access method, using $, e.g. mylist$Cupcake also accesses "Cupcake".

Additionally to extracting elements from a list, sub-lists can be extracted too, using single square brackets. For example

mylist[[1:2]]   # this will give an error message

An excellent visual explanation of list subsetting can be found in


(the "Lists of condiments" section!)

2.3 Further matrix operations

We're now going to perform a linear regression on the height, weight, sex data. For each sex, we will fit the following model:

\begin{eqnarray} y & = &a + bx \nonumber \end{eqnarray}

Where the response data, \(y\), is height, and the covariate data, \(x\), is weight. If you've studied linear models (if not believe me!), you will know that if \(\beta = (a,b)\), then the parameter estimates are given by

\begin{equation} \hat{\beta} = (X^TX)^{-1}X^Ty \end{equation}

where \(X\) is the model/design matrix. This gives us a straight-forward way to calculate our parameter estimate. However, first we need to rearrange our data so that we can perform the calculations. For each sex, do the following:

  • Begin by creating a vector of response data by copying the height column of your data frame into a separate vector object. Call this \(y\).
  • Now we need to create a design matrix of covariate data.

This will be a two column matrix, with the first being a column of ones, and the second containing the covariate data for weight. Start by defining a 2 column by \(n\) row matrix, where \(n\) is the length of your covariate data (you can use dim on your data frames to find this). e.g. for a 2 column, 10 row matrix, I would type:

myMatrix <- matrix(ncol=2,nrow=10)

If you look at the matrix you have created, you will see that it is full of NA. This is fine, since we've not supplied any data yet.

  • Now populate the first column of the matrix with 1's. Fill columns by referring to them with subscript notation, e.g.:
myMatrix[,1] <- 1    #Fills the 1st column with 1's

This example demonstrates the use of a feature we observed about how R handles vectors: since the vector myMatrix[,1] (recall: one column of a matrix is a vector!) is of length \(n\) and 1 is of length 1, R repeats the \(1\) (recycling rule) as many times as is necessary to fill myMatrix[,1].

  • Now populate the second column with weight data, using a subscript on your data frame.

Exercise: Use the regression formula. To do this, you will need the following to multiply, transpose, and invert matrices:

Function/operator Use
x %*% y Matrix-multiplies x and y
solve(x) Returns the inverse of matrix x
t(x) Returns the transpose of matrix x

What are your parameter estimates for each sex? Store them in an appropriate form.

2.4 Graphics

In this section, we will use the height, weight, sex dataset you used in Section Further matrix operations to illustrate how to use some of the graphics functions in R. The functions we will be using are:

Function Use
plot() Creates a new plot
hist() Plots a histogram
lines() Adds a line to an existing plot
points() Adds points to an existing plot
text() Adds text to a plot
legend() Adds a legend to a plot

Creating a plot

In this section we are going to create a plot of height versus weight for each sex in the class96 dataset. The basic plot() command can be used as follows:

plot(x, y,   
     main="My Title",
     xlab="The x-axis",
     ylab="The y-axis",

where x and y are vectors of \(x\) and \(y\) coordinates respectively, the other parameters setting the plot labelling, and axis limits (\(x\)-axis between 0 and 10, and \(y\)-axis between 0 and 15).

Exercise: Using the data in your myTable data frame, plot a scattergram of height versus weight, with a suitable title and axis labels. Look at your data range, and set suitable limits for the axes.

Use the following steps to create your graph:

  • First plot one of the sexes with plot() to establish the plot. By default this will create black points.
  • Now use the points() function to plot the other sex onto the existing graph.

In order to get a different colour, you will have to supply an argument to the col parameter inside points()'s brackets.

In this example, we used the graphical parameter col. There are lots of these parameters that control the look of a lot of different types of plot (scatter, line, histogram etc) so the online help system lumps them all together under the par() function. Have a look at the online help, then try changing your plot() expressions to use a +' character for the data points.

Also, have a look at the online help for plot(). You will see that it gives you information on the type parameter for controlling the plot type. We will be using this later.

Now we want to add a legend to our scatter plot. The online help for legend() tells us the syntax:

legend(x, y = NULL, legend, fill = NULL, col = par("col"),
       lty, lwd, pch,
       angle = 45, density = NULL, bty = "o", bg = par("bg"),
       box.lwd = par("lwd"), box.lty = par("lty"),
       pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,
       xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,
       adj = c(0, 0.5), text.width = NULL, text.col = par("col"),
       merge = do.lines && has.pch, trace = FALSE,
       plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,
       inset = 0)

"Wow", you say! Notice that any parameter that has an = sign has a default setting (shown) so you don't have to be explicit about what you want, unless you want to change the default. Legends are plotted in boxes, so as a bare minimum you need to specify:

  • x and y — the coordinates on the plot of the top left of the box;
  • legend — a vector of names of your data series on the plot;
  • col — a vector of colours that correspond to the data series, in the same order as you supplied for legend;
  • pch — a vector of point characters you used to plot your data series.

Any graph you have produced can be saved to file directly from the graph window by selecting File => Save As.

Plotting lines

In this section, we will add trendlines to the scatter plot of height versus weight that we created previously, using the lines() function. Look up the online help, and remember the graphical parameters that you can set (see the help page for par()). =lines() can be used like this:


This command would draw a blue dashed line of width 3 between \((0,0)\) and \((4,2)\). What happens if you use this command without an existing plot being open? Why?

Exercise: Using the values for the regression parameters you estimated in Section Further matrix operations , use the lines() function to add suitable trendlines to the scatter plot of height versus weight, making sure the line colours match the point colours.


R has a really nifty function for creating histograms really quickly — hist(). Let's try it out. First let's generate some data to work on. R provides various functions for generating random numbers from the common probability distributions. All functions return a vector of \(n\) realisations from whichever distribution you choose, parametrised by values of your choice. Here are some of the distributions you can use (see also ?distributions):

R function Probability distribution
rnorm(n,mean,sd) Normal distribution
rgamma(n,shape,rate) Gamma distribution
runif(n,min,max) Uniform distribution
rpois(n,lambda) Poisson distribution
rexp(n,rate) Exponential distribution
rweibull(n,shape,scale) Weibull distribution

WARNING: R does not necessarily use the same parametrisation for these probability distributions as you do in your course. Until you get used to using these functions on a regular basis, always check the help pages - they give you the form of the distribution (in LaTeX format).

Note also, that for each distribution, there are functions to return density, distribution function, and quantiles (look at ?rnorm, for example).


  • Choose your favourite distribution, parametrise it accordingly,

and use the appropriate function to draw 10000 random samples from it into a vector.

  • Use the hist() function to plot a histogram. Add a title and axis labels in the same way you did for plot().
  • Using the quantile function for your distribution, find the upper and lower quartiles.

Use the abline() function to draw vertical lines on the histogram representing the quartiles. Perhaps you should also make the lines a different style so that they stand out?

Challenge: In the exercise above, we used a vector of random numbers to approximate a probability distribution. However, we can use the density function to draw a curve. Try this out for your distribution. Hint: You will first need to create a vector of \(x\) values against which to plot the density. This vector will have to hold a sequence of values running between suitable limits, and have small enough intervals so that the curve appears smooth.

2.5 Loop structures

Loop structures are used when a task needs to be performed repetitively. Loops enable routine manual tasks to be performed with great speed and, together with conditional statements, are of fundamental importance in creating your own functions or programmes. A scenario that seems to occur very frequently when programming is when data are stored in a matrix (or vector) and it is necessary to compute some function of the row or column elements (or entries of the vector). The most basic way that this can be achieved is by the use of for or while loops, the functionality of which is best illustrated via example.

The following code prints the square of the numbers 1 – 10.

Method 1:

for (i in 1:10){    # Start the loop with i = 1.
    print(i^2)        # print the square of i
}                   # goto start of the loop, increasing i by 1

Method 2:

i <- 1              # set i = 1
while (i <= 10){    # if i <= 10 then execute commands in the loop
    print(i^2)        # print the square of i
    i <- i + 1        # increase i by 1
}                   # go to the start of the loop

Some comments are in order. Firstly, you don't need to use the letter i, any valid object name will do. With the first method, i takes values in the vector object 1:10 (if you have not already done so, try typing 1:10 into the R console). Any vector can be used here, so for example if X <- runif(10) is a random vector of uniformly distributed values then replacing the first line in Method 1 above with for (i in X){ will print out the squares of the 10 random numbers in the vector X. It is also possible to loop in reverse order by using 10:1. With the second method, on the other hand, the user dictates what values i takes and how these values evolve within the loop structure itself. The former for method is probably more commonly used in practice, but it is best to be aware of the while structure as it is occasionally useful.

Matters become more complicated when the function in question depends recursively on previous values stored in a vector. Suppose it is required to set up a vector x containing the first 10 factorial numbers ie \((1!,2!,3!,\ldots,10!)\); this is achieved as follows.

x <- 1                  # 1! = 1, so initialise by setting x = 1
for (i in 2:10){        # note the loop now starts from 2
    x[i] <- i * x[i-1]  # let the next entry of x be i  times the previous entry

If only the last number was required (i.e., \(10!\)) then the line within the loop could have been replaced by x <- x * i.

Exercise: Generate a \(5\times5\) matrix of random numbers using the command

X <- matrix(runif(25),5,5).

Using a for loop structure, compute a matrix whose entries on the $i$th row are \(i\) times the entries of the row of X. Your knowledge of matrix subscripting may be useful here. How can this operation be performed without using a loop? Your code may be slow to run if it contains many loops, though there are not always short–cuts available: sometimes it is necessary to use a loop.

Exercise: Using a while loop structure, plot the graph of \(y = kx^2\) for \(k=1,2,4,8,16,32\) for \(x\in[-10,10]\). Hint: first use the plot command for the case that \(k=1\), then the lines command for subsequent plots.

Below is a video from Computerphile explaining loop structures with lambda calculus!

2.6 Conditional statements

Conditional statements are often used within loops and enable the programmer to specify when or if certain tasks are to be performed. The following Example illustrates nested loops and conditional statements. The loop structure starts with i,j \(=1\). Run the code, then try to understand it.

Example 1:

mat <- matrix(c(1,2,3,4,5,6,7,8,9),3,3,byrow=TRUE)

print(mat)                      # print the matrix

# loop through the elements of the matrix in turn.
# Question: in what order does this proceed?

for(i in 1:3){                   # start of i loop
  for (j in 1:3){                # start of j loop (the 'nested' loop)
    if (mat[i,j] > 5){           # if the i,j th element of mat > 5 then ...
      mat[i,j] <- 0              # set the i,j th element to zero
    }                            # ends the 'if' statement
  }                              # end the j loop
}                                # end the i loop
print(mat)                       # print the altered matrix

Of course, the same effect could be achieved by simply typing mat[mat>5] <- 0. This leads onto an important point: R performs looping calculations relatively slowly, so consider whether a loop is required.

Sometimes one may wish to perform an alternative task if the condition, for example (mat[i,j] > 5), is not satisfied. In the following code, which would replace the conditional statement in the loop in the example above, the other elements of mat are set to 1 (ie those elements of mat which are less than or equal to 5).

if (mat[i,j] > 5){
    mat[i,j] <- 0
    mat[i,j] <- 1

Note that indenting code as in the above examples makes it much easier to read - the braces are "tabbed out" so it is clear what actions are performed in each loop. This is good practice and makes it much easier to spot errors in your code. Take care when using & or | in your conditional statements, if you are unsure what is happening, then set up a simple example you can test your statements on (this can occasionally save hours of frustration!).

Note: Conditional statements can be used to exit from a loop, this is achieved by using the command break; alternatively, one can move onto the next iteration of the loop with the command next.

Exercise: Using a loop and conditional statement and for the numbers 1 to 100, set up a character vector containing the string div 4 if the number is divisible by 4 and not div 4 otherwise.

2.7 Programming

The situation is as follows: you have been trawling through the (somewhat unhelpful!) online help files and searching on the internet for a command for which you think "surely this is built into R!". You know mathematically how to compute the quantity you require, but you want to perform the same operation many times over the course of your current project and don't want to have to repeat lines of code over and over again (cutting and pasting lines of code can be a risky business when the integrity of your analysis is at stake). If your repetitive task is definite integration of polynomials for example, you might be thinking: if only I could just type integrate("x^2+1",1,2) to get \[ \int_1^2(x^2+1)\mathrm{d}x \]

Help is at hand! R can be used to define your own programs. Master the art of programming and you can open up a whole new world of R functionality and moreover, your code will both look neater and will be easier to de-bug.

Suppose you want to compute the mean of a vector \(x\) (obviously this task may be achieved with inbuilt R functions). You want to be able to type m(x) into the console and have R return the mean of the vector x or mean.y <- m(y) to store the mean of y in a new object mean.y. This can be achieved as follows.

m <- function(vector){                 # 1.
    n <- length(vector)                # 2.
    sum.vector <- sum(vector)          # 3.
    answer <- (1/n) * sum.vector       # 4.
    return(answer)                     # 5.

The function may now be called in the usual manner: suppose vec is a vector object in R, then typing either m(vec) or m(vector=vec) will return the mean. Try it.

Let's go through the code line by line.

  1. The hardest line to understand. This line defines an object called m and tells R that it is a function object. The word vector in parenthesis is known as the argument of the function. The argument(s) are local names of objects the function is to operate on. To illustrate this, think of \(x\) as in \(f(x) = \sin{x}\) for example. If \(u\) is a similar mathematical object to \(x\) (e.g., if they are both real numbers) then it is natural to define \(f(u) = \sin{u}\); the same idea applies with R functions. In the call to the function, m(vector=vec), a local copy of the object vec is created and stored as a local object vector. If outside the function, I already had an object called vector, then this "external" object will not be changed by any operations made from within the program. Similarly, it is not possible to access the values of a local object from outside the program. The copy of vector is therefore called local to the program m().
  2. Computes the length of vector and stores the answer in a local object called n.
  3. Creates a local object, sum.vector and stores in it the sum of the entries of vector.
  4. Creates a local object answer containing the required mean.
  5. Returns the object answer. If an allocation has been made, for example mean.vec <- m(vec) then the required mean will be stored in an external object, mean.vec, accessible outside the program. Otherwise if the call was made via the command m(vec), then R will just display the answer.

Note: Use of the browser() function in your programs allows to interrupt the flow of a program and access local objects. When R comes across the command browser(), it enters a special browsing mode. Type the name of any local or external object to view it. To exit browsing mode, either hit return to continue running the script or Q to quit (note the capital). You can look at your function code on the R console by typing the name of the function – e.g., try typing sd, this should display R's in-built command for the standard deviation.

Exercise: Write a function called fact to compute the factorial of any number. Compute the factorial of a few (small!) numbers and compare your answer to those obtained via R's own factorial function, factorial().

Exercise Set up a \(60\times30\) matrix, X, using the command, X <- matrix(runif(1800),60,30). Using an optinal argument, cols, which should take the default value FALSE, write a function to compute the variance of either the columns, or rows of X. Note that in the argument line of the function, a default value for a local variable z can be set by using z=FALSE for example; the function dim may also be of use in the program. Use your function to compute the variance of the columns and rows of X and compare these with answers you obtain using the apply function (see the help file or ask a tutor about this).

Here is another Computerphile video, this time explaining lambda calculus.

2.8 Have the right style

Good coding style is like using correct punctuation. You can manage without it, but it sure makes things easier to read. As with styles of punctuation, there are many possible variations. The following lines describe commonly used R style which make R code easier to read, share, and verify.

3 Tidyverse

3.1 Introduction

3.1.1 Base R and CRAN

So far we have been talking about base R, i.e., the R language itself with the set of functions that come with installation. Those include a lot of statistical packages and practically all simple statistical functions. However, there is an enormous ecosystem of add-on packages: if you have a statistical data wrangling problem and it has been done before, there is almost certainly an R package to do it. Luckily, almost all of them are in one place! The Comprehensive Archive R Network.

Within CRAN, there is a subset of packages which are becoming invaluable. This used to be called the Hadleyverse after the person who has written the most of most of them, i.e., Hadley Wickham. Hadley has now said he would like them to be called the tidyverse – tidy being the way he thinks about data being in the right format for doing appropriate statistical analysis.

To load tidyverse in R use


3.1.2 A look back on data frames

Tibbles are a modern reimaging of the data.frames. They keep the features that have stood the test of time, and drop the features that used to be convenient but are now frustrating (i.e., converting character vectors to factors).

Creating tibbles

tibble() is a nice way of creating data frames and encapsulates best practices for data frames. You can create a new tibble from individual vectors

  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
# A tibble: 5 x 3
      x     y     z
  <int> <dbl> <dbl>
1     1     1     2
2     2     1     5
3     3     1    10
4     4     1    17
5     5     1    26

Tibbles will recycle vectors of length 1 (e.g., y=1 above is recycled to produce a vector of length 1) but will return an error otherwise:

  x = 1:5, 
  y = 1:2, 
  z = x ^ 2 + y
Error: Column `y` must be length 1 or 5, not 2
In addition: Warning message:
In x^2 + y :
  longer object length is not a multiple of shorter object length

This is because recycling vectors of greater lengths is a frequent source of errors.

Additionally, tibbles evaluate their arguments lazily and sequentially. For example, the object y in

  x = 1:5, 
  y = x ^ 2
# A tibble: 5 x 2
      x     y
  <int> <dbl>
1     1     1
2     2     4
3     3     9
4     4    16
5     5    25

is constructed from the newly created object x = 1:5 whereas. Comparing this with base R data.frame function

  x = 1:5, 
  y = x ^ 2
Error in data.frame(x = 1:5, y = x^2) : object 'x' not found

we get an error telling us that object x was not found!


The mtcars dataset from base R was extracted from the 1974 Motor Trend US magazine. It comprises fuel consumption, 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

head(mtcars)  # head() returns the first parts of a data frame
                  mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The variables in mtcars are

[, 1]  mpg:   Miles/(US) gallon                        
[, 2]  cyl:   Number of cylinders                      
[, 3]  disp:  Displacement (cu.in.)                    
[, 4]  hp:    Gross horsepower                         
[, 5]  drat:  Rear axle ratio                          
[, 6]  wt:    Weight (1000 lbs)                        
[, 7]  qsec:  1/4 mile time                            
[, 8]  vs:    V/S                                      
[, 9]  am:    Transmission (0 = automatic, 1 = manual) 
[,10]  gear:  Number of forward gears                  
[,11]  carb:  Number of carburators

Check mtcars is a data.frame

[1] TRUE

Now, to coerce any data frame to a tibble object use as_tibble()

tbl_mtcars <- as_tibble(mtcars)
# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
 * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
 6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
 7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
 8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
 9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows

Printing tibbles

Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data. In addition to its name, each column reports its type, a nice feature borrowed from str() (for mtcars all objects are numeric vectors of double precision <dbl>).

3.2 Data manipulation using dplyr

3.2.1 Introduction

Undoubtedly, one of the most useful and important packages from tidyverse is dplyr. It provides a grammar of data manipulation and a set of consistent verbs that help solve the most common data wrangling problems:

  • select() picks variables based on their names;
  • filter() picks cases based on their values;
  • mutate() adds new variables that are functions of existing variables;
  • summarise() reduces multiple values down to a single summary;
  • arrange() changes the ordering of the rows.

These all combine naturally with group_by() which allows you to perform any operation "by group".

3.2.2 The select() verb

The select function allows to pick a set of columns by name.


Pick miles per gallon, horsepower and number of carburators

df1 <- mtcars[,c("mpg","hp","carb")]   ## base R 
df2 <- select(mtcars, mpg, hp, mpg)    ## tidyverse::dplyr
head (df2)
                   mpg  hp carb
Mazda RX4         21.0 110    4
Mazda RX4 Wag     21.0 110    4
Datsun 710        22.8  93    1
Hornet 4 Drive    21.4 110    1
Hornet Sportabout 18.7 175    2
Valiant           18.1 105    1

Of course, all dplyr verbs work on tibbles, for example

select(tbl_mtcars, mpg, hp, mpg)

will return the same data frame as before, now stored in tibble format.

A colon can be used to select multiple contiguous columns, for example

df <- select(tbl_mtcars, mpg:hp)  
                   mpg cyl disp  hp
Mazda RX4         21.0   6  160 110
Mazda RX4 Wag     21.0   6  160 110
Datsun 710        22.8   4  108  93
Hornet 4 Drive    21.4   6  258 110
Hornet Sportabout 18.7   8  360 175
Valiant           18.1   6  225 105

Also, see ?select_helpers for variable selection based on names, regular expressions, string match, etc.

3.2.3 The filter() verb

The filter function allows to pick a set of rows that meet specified conditions. For example, suppose we want to keep all cars for which miles per gallon is less than 20. This is easily done by

filter(tbl_mtcars, mpg<20)   ## tidyverse::dplyr 
tbl_mtcars[tbl_mtcars$mpg<20,]   ## base R

More complicated expressions can be fed in filter, e.g., suppose now that we want to keep all cars for which miles per gallon is less than 20 AND horsepower is greater than 230:

## You can use comma or ambersand (&) for AND condition
## 1. comma syntax
filter(tbl_mtcars, mpg<20, hp>230)  
## 2. ambersand syntax 
filter(tbl_mtcars, mpg<20 & hp>230)

For OR condition use the |, i.e.,

filter(tbl_mtcars, mpg<20 | hp>230)

will return a data frame that consists of all the cars for which either miles per gallon is less than 20 OR horsepower is greater than 230.

3.2.4 Pipelining: How you should write your dplyr code

Although not part of the dplyr package, the pipe operator %>% (which is imported from package magrittr) allows to pipe an object forward into a function or call expression. This is extremely useful as it allows to combine multiple operations using clean and readable code.

The magrittr package offers a set of operators which make your code more readable by:

  • structuring sequences of data operations left-to-right (as opposed to from the inside and out),
  • avoiding nested function calls
  • minimizing the need for local variables and function definitions, and
  • making it easy to add steps anywhere in the sequence of operations.

The operators pipe their left-hand side values forward into expressions that appear on the right-hand side, i.e., one can replace f(x) with x %>% f(), where %>% is the (main) pipe-operator. When coupling several function calls with the pipe-operator, the benefit will become more apparent. Consider this example: suppose we would like to obtain information about mpg, hp and carb for the cars that have mpg less than \(20\). One way of doing this is to first select the variables mpg, hp and carb from the tbl_mtcars data frame and then filter the rows that meet the condition mpg less than 20, i.e.,

filter( select(tbl_mtcars, mpg, hp, carb), mpg < 20 )

Here we see a nested structure in the code: the function call select is nested in the function call filter.

With the pipe operator, this piece of code can be rewritten as

tbl_mtcars %>% 
    select(mpg, hp, carb) %>% 
    filter(mpg < 20)

which is cleaner and can be naturally read as: tbl_mtcars then select variables mpg, hp, carb then filter rows that meet the condition mpg < 20

  • x %>% f is equivalent to f(x)
  • x %>% f(y) is equivalent to f(x,y)
  • x %>% f %>% g %>% h is equivalent to h(g(f(x)))

3.2.5 The mutate() verb

The mutate() function is used to create new variables from existing ones. Suppose we would like to convert the weight of all cars in metric system, i.e., from pounds to kilograms. In order to do so in base R, we would need to multiply mtcars$wt by the appropriate conversion factor (\(0.4535924\)), store the new vector in an object and append it to the data frame mtcars. The function mutate does all of this in one go!

conv_factor <- 0.4535924
mutate(tbl_mtcars, wt_metric = wt * conv_factor)      ## without pipe
tbl_mtcars %>% mutate(wt_metric = wt * conv_factor)   ## pipe syntax
# A tibble: 32 x 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb wt_metric
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4  1.188412
 2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4  1.304078
 3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1  1.052334
 4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1  1.458300
 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2  1.560358
 6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1  1.569430
 7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4  1.619325
 8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2  1.446960
 9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2  1.428816
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4  1.560358

3.2.6 The group_by() and summarise() verbs

grouped_by_am <- group_by(tbl_mtcars, am)
# A tibble: 32 x 11
# Groups:   am [2]
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
 * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
 6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
 7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
 8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
 9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows

We notice that nothing has changed, the data look exactly the same as if we sent tbl_mtcars in the console. However, the second line of the R output shows additional information, i.e., # Groups: am [2]. This indicates that the new object grouped_by_am has identified and stored the two different groups of cars (automatic and manual).

This allows us to obtain useful summaries in each group with the summarise() function. For example, suppose we were interested in the average weight of cars in the two distinct groups. This can be obtained by:

summarise(grouped_by_am, avg_wt = mean(wt))
# A tibble: 2 x 2
     am   avg_wt
  <dbl>    <dbl>
1     0 3.768895
2     1 2.411000

Of course, this code can be combined in a single line as

summarise(group_by(tbl_mtcars, am), avg_wt = mean(wt))

but this becomes very hard to read. Comparing this with

tbl_mtcars %>%
    group_by(am) %>%
    summarise(avg_wt = mean(wt))

we can appreciate the power of writing code using the pipe operator %>%!!

4 Optimization

REFRESHER: If you feel your multivariable calculus is rusty, I strongly recommend Dennis Auroux Multivariable Calculus lectures that were recorded during Fall 2007 at MIT. In particular, Lecture 8:

  • 15.40: Contour plot
  • 34.48: Partial derivatives
  • 41.57: Visualization

and Lecture 9:

4.1 Introduction

Maximum likelihood estimation is central to statistical inference. Long hours can be be invested in learning about the theoretical performance of MLEs and their analytic derivation. Faced with a complex likelihood lacking analytic solution, however, many people are unsure how to proceed.

Most functions cannot be optimized analytically. For example, maximizing \[ g(x)= (\log x)/(1+x) \qquad x>0 \] with respect to x by setting the derivative equal to zero and solving for \(x\) leads to an algebraic impasse because \[ 1 - \log x + 1/x = 0 \] has no analytic solution. Many realistic statistical models induce likelihoods that cannot be optimized analytically–indeed, we would argue that greater realism is strongly associated with reduced ability to find optima analytically.

Data scientists face other optimization tasks, too, aside from maximum likelihood. Minimizing risk in Bayesian decision problems, solving nonlinear least squares problems, finding highest posterior density intervals distributions, and a wide variety of other tasks all involve optimization.

More generally, many problems in statistics can be characterized as solving

\[\min_{\boldsymbol{x} \in \mathbb{R}^m} \, f(\boldsymbol{x})\]

where \(f: \mathbb{R}^m \mapsto \mathbb{R}\) is a real scalar valued function of vector \(x \in \mathbb{R}^m\). The function \(f\) can represent

  • negative log-likelihood;
  • negative posterior distribution;
  • decision rule, cost, distance, energy, etc.

4.1.1 Local vs global optimization

Minimization methods generally operate by seeking a local minimum of \(f\). That is, they seek a point \(\boldsymbol{x}^*\), such that, \[ f(\boldsymbol{x}^* + \boldsymbol{\Delta}) \geq f(\boldsymbol{x}^*), \] for any arbitrary \(\boldsymbol{\Delta}\).

Unless \(f\) is a strictly convex function, it is not possible to guarantee that such a \(\boldsymbol{x}^*\) is a global minimum. That is, it is not possible to guarantee that \[f(\boldsymbol{x}^* + \boldsymbol{\Delta})\geq f(\boldsymbol{x}^*)\] for any arbitrary \(\boldsymbol{\Delta}\).

4.1.2 Synthetic and Analytic point of view for convex functions (\(m=1\))

Definition: A function \(f: G\mapsto \mathbb{R}\), where \(G\) is an open subinterval of \(\mathbb{R}\), is called convex on \(G\) if its graph lies below any of its chords: for \(x_1,x_2 \in G\) and \(t\in (0,1)\), \[ f(t\, x_1 + (1-t)\, x_2)\leq t \,f(x_1) + (1-t)\,f(x_2) \] Illustration:


From the definition, it automatically follows that \(f\) is continuous on \(G\) since the fact that \(f\) is convex may be rewritten as \[ \Delta_{u,v} \leq \Delta_{v,w} \quad \mbox{where }\Delta_{u,v} := \frac{f(v) - f(u)}{ v-u}\cdot \]

If \(f\) is twice differentiable on \(G\), then \(f\) is convex if and only if \[ f''(x)\geq 0 \qquad \mbox{for all }x \in G. \]

4.1.3 Optimization algorithms

  • Beginning at \(\boldsymbol{x}^{(0)}\), optimization algorithms are designed to generate a sequence of iterates \(\{\boldsymbol{x}^{(k)}\}_{k=0}^{\infty}\) that terminate when
    • either no more progress can be made;
    • or when it seems that a solution point has been approximated with sufficient accuracy.
  • In deciding how to move from iterate \(\boldsymbol{x}^{(k)}\), the algorithms use information about the function \(f\) at \(\boldsymbol{x}^{(k)}\), and possibly also information from earlier iterates \(\boldsymbol{x}^{(0)}, \boldsymbol{x}^{(1)},\ldots, \boldsymbol{x}^{(k-1)}\).
  • Optimization methods are short-sighted. At any stage of operation all that optimization methods "know" about a function are a few properties of the function at the current best \(\boldsymbol{x}\). It is on the basis of such information that a method tries to find an improved best \(\boldsymbol{x}\).

4.1.4 Search direction and Trust Region algorithms

Optimization algorithms start from the initial guesstimate \(\boldsymbol{x}^{(0)}\) of the optimum. Starting from \(k=0\), most methods then iterate the following steps

1 Evaluate \(f(\boldsymbol{x}^{(k)})\), and possibly the first and second derivatives of \(f\) with respect to the elements of \(\boldsymbol{x}\), at \(\boldsymbol{x}^{(k)}\)

2 Either

  • use the information from step 1 (and possibly previous executions of step 1) to find a search direction, \(\boldsymbol{\Delta}\), such that for some \(\alpha>0\), \(f(\boldsymbol{x}^{(k)} + \alpha \boldsymbol{\Delta})\) will provide sufficient decrease relative to \(f(\boldsymbol{x}^{(k)})\). Search for such an \(\alpha\) and set \(\boldsymbol{x}^{(k+1)} = \boldsymbol{x}^{(k)} + \alpha \boldsymbol{\Delta}\).
  • use the information from step 1 (and possibly previous executions of step 1) to construct a local model of \(f\), and find \(\boldsymbol{x}\) minimizing this model within a defined region around \(\boldsymbol{x}^{(k)}\). If this value of \(\boldsymbol{x}\) leads to a sufficient decrease in \(f\), accept it as \(\boldsymbol{x}^{(k+1)}\), otherwise shrink the search region until it does.

4.2 Multivariate Taylor's theorem

Suppose that \(f:\mathbb{R}^m\rightarrow \mathbb{R}\) is twice continuously differentiable function of \(\boldsymbol{x}\) and that \(\boldsymbol{\Delta}\) is of the same dimension as \(\boldsymbol{x}\). Then

\begin{equation} f(\boldsymbol{x} + \boldsymbol{\Delta}) = f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^T\boldsymbol{\Delta} + \frac{1}{2} \boldsymbol{\Delta}^T \nabla^2f(\boldsymbol{x} + t \boldsymbol{\Delta}) \,\boldsymbol{\Delta} \end{equation}

for some \(t\in(0,1)\), where

\begin{equation*} \nabla f = \left[\begin{matrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \end{matrix}\right] \qquad \text{and} \qquad \nabla^2 f = \left[ \begin{matrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \ldots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1}& \frac{\partial^2 f}{\partial x_2^2} & \ldots\\ \vdots & \vdots & \ddots \end{matrix} \right], \end{equation*}

are the gradient vector and Hessian matrix, respectively.

Conditions for minimum: From Taylor's theorem, it can be established that the conditions for \(f(\boldsymbol{x}^*)\) to be a minimum are that

\begin{equation*} \nabla f(\boldsymbol{x}^*) = 0 \qquad \text{and}\qquad \nabla^2 f(\boldsymbol{x}^*)\text{ is positive semi-definite}. \end{equation*}

Remember: A symmetric \(m\times m\) real matrix \(M\) is called positive semi-definite if \[ \boldsymbol{x}^T\, M \,\boldsymbol{x} > 0 \] for every \(\boldsymbol{x} \in \mathbb{R}^m \setminus \{\boldsymbol{0}\}\).

4.3 Steepest descent

Suppose we are located at an arbitrary point \(\boldsymbol{x}\) in \(\mathbb{R}^m\). We ask, which direction leads to the most rapid decrease in \(f\) for sufficiently small step, and use this as the search direction. If \(\boldsymbol{\Delta}\) is small enough, then \[ f(\boldsymbol{x}+\boldsymbol{\Delta}) \approx f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^T \boldsymbol{\Delta}. \] This is the tangent plane approximation of \(f\) at \(\boldsymbol{x}\).

  • For \(\boldsymbol{\Delta}\) of fixed length, the greatest decrease will be achieved by minimising the inner product

\[ \nabla f(\boldsymbol{x})^T \boldsymbol{\Delta} = \left\| \nabla f( \boldsymbol{x}) \right\| \left\| \boldsymbol{\Delta}\right\| \cos(\theta), \] that is, by choosing \(\boldsymbol{\Delta}\) parallel to \(-\nabla f(\boldsymbol{x})\) (recall \(\theta = \mbox{Angle}\left(f(\boldsymbol{x}),\boldsymbol{\Delta}\right)\)). Here \(\left\| \cdot \right\|\) denotes Euclidean distance.

  • Set

\[ \boldsymbol{\Delta}= -\nabla f(\boldsymbol{x})/\left\|\nabla f(\boldsymbol{x})\right\|. \]

Given some direction, we need a method for choosing how far to move from \(\boldsymbol{x}\) along this direction.

Trade-off: Need to make reasonable progress, but do not want to spend too much time choosing the exactly optimal distance to move, since the point will only be abandoned at the next iteration.

Best to try and choose the step length \(\alpha > 0\) so that \(f(\boldsymbol{x}+\alpha \boldsymbol{\Delta})\) is sufficiently much lower than \(f(\boldsymbol{x})\), and also that the magnitude of the gradient of \(f\) in the \(\boldsymbol{\Delta}\) direction is sufficiently reduced by the step. At the moment we assume we have such a method (e.g., Wolfe conditions).

Given that we initially started at the point \(\boldsymbol{x}^{(0)}\), the gradient descent algorithm implements the following iterative procedure for finding the minimum of a function \(f\). \[ \boldsymbol{x}^{(k+1)} = \boldsymbol{x}^{(k)} - \alpha \,\frac{\nabla f(\boldsymbol{x}^{(k)})}{\left\| \nabla f(\boldsymbol{x}^{(k)}) \right\|} \]

4.4 Newton's method

REM: tangent plane approximation used to motivate steepest descent \[ f(\boldsymbol{x}+\boldsymbol{\Delta}) \approx f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^T \boldsymbol{\Delta} \] The quantity \(\nabla f(\boldsymbol{x})^T \boldsymbol{\Delta}\) is called the directional derivative of \(f\) along the vector \(\boldsymbol{\Delta}\) and is equal to

\begin{align*} \nabla f(\boldsymbol{x})^T \boldsymbol{\Delta} = \lim_{h\rightarrow 0} \frac{f(\boldsymbol{x}+h \boldsymbol{\Delta}) - f(\boldsymbol{x})}{h} \end{align*}

Interpretation: \(f(\boldsymbol{x})^T \boldsymbol{\Delta}\) is the instantaneous rate of change of the function \(f\), when moving past \(\boldsymbol{x}\) at velocity \(\boldsymbol{\Delta}\).

Explanation of steepest descent's poor convergence

Taylor expansion: \(t\in (0,1)\)

\begin{align*} f(\boldsymbol{x} + \boldsymbol{\Delta}) &= f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^T\boldsymbol{\Delta} + \frac{1}{2} \boldsymbol{\Delta}^T \nabla^2f(\boldsymbol{x} + t \boldsymbol{\Delta}) \,\boldsymbol{\Delta}% \\ % &\approx f(\boldsymbol{x}) + \nabla % f(\boldsymbol{x})^T\boldsymbol{\Delta} + O(\lVert\boldsymbol{\Delta}\lVert^2) \end{align*}
  • As steepest descent approaches the minimum, \(\boldsymbol{x}^*\), the directional derivative term retained in the Taylor expansion of \(f\) becomes negligible relative to the neglected second derivative term since \(\nabla f(\boldsymbol{x}^*) = 0\)
  • This largely explains the method's poor performance.

Second order approximation Consider instead the quadratic form

\begin{align} f(\boldsymbol{x} + \boldsymbol{\Delta}) &= f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^T\boldsymbol{\Delta} + \frac{1}{2} \boldsymbol{\Delta}^T \nabla^2f(\boldsymbol{x} + t \boldsymbol{\Delta}) \,\boldsymbol{\Delta} \nonumber\\ &= f(\boldsymbol{x}) + \nabla f(\boldsymbol{x})^T\boldsymbol{\Delta} + \frac{1}{2}\boldsymbol{\Delta}^T \nabla^2 f(\boldsymbol{x}) \,\boldsymbol{\Delta} + O(\lVert\boldsymbol{\Delta}\lVert^3) \label{eq:quad_approx} \end{align}

This is of the form \[ c + b^T \boldsymbol{\Delta} + \frac{1}{2} \boldsymbol{\Delta}^T H_{\boldsymbol{x}}\,\boldsymbol{\Delta}, \] for vector \(b\), scalar \(c\), \(H_{\boldsymbol{x}}\) a square symmetric matrix with spectral decomposition \[ H_{\boldsymbol{x}} = \Gamma_{\boldsymbol{x}}\, \Lambda_{\boldsymbol{x}} \, \Gamma_{\boldsymbol{x}}^T \] where \(\Lambda_{\boldsymbol{x}}\) is a diagonal matrix of eigenvalues, ordered in decreasing value and \(\Gamma_{\boldsymbol{x}}\) an \(m\times m\) orthogonal matrix (\(\Gamma_{\boldsymbol{x}}^T \Gamma_{\boldsymbol{x}} = I\)) of the corresponding eigenvectors.

The unique quadratic form that passes through \(\boldsymbol{x}\) and has the same first and second partial derivatives with \(f\) at \(\boldsymbol{x}\).

Differentiating this quadratic form with respect to \(\boldsymbol{\Delta}\) and setting the result to \(\boldsymbol{0}\) we get

\begin{align*} \nabla f(\boldsymbol{x}) + \nabla^2 f(\boldsymbol{x})\,\boldsymbol{\Delta} = 0 \end{align*}

as the equation to solve for the step \(\boldsymbol{\Delta}\). Assuming \(\nabla^2 f(\boldsymbol{x})\) is positive definite, we obtain Newton's direction

\begin{align*} \boldsymbol{\Delta} &= -\nabla^2f(\boldsymbol{x})^{-1} \nabla f(\boldsymbol{x}) \\\\ & = -\Gamma_{\boldsymbol{x}} \Lambda_{\boldsymbol{x}}^{-1} \left[\Gamma_{\boldsymbol{x}}^T \nabla f(\boldsymbol{x})\right] \end{align*}

Newton's direction The instantaneous rate of change of \(f\) when moving past \(\boldsymbol{x}\) at velocity \({\boldsymbol{\Delta}=-\nabla^2f(\boldsymbol{x})^{-1}}\nabla f(\boldsymbol{x})\) is

\begin{equation} \nabla f(\boldsymbol{x})^T \boldsymbol{\Delta} = -\nabla f(\boldsymbol{x})^T \nabla^2f(\boldsymbol{x})^{-1} \nabla f(\boldsymbol{x}) < 0, \label{eq:desc_newton} \end{equation}

hence \(\boldsymbol{\Delta}\) is a descent direction.

  • Newton's direction is reliable when the difference between the true function \(f(\boldsymbol{x}+\boldsymbol{\Delta})\) and its quadratic model approximation is not too large.
  • Unlike the steepest descent, there is a "natural" step length of 1 associated with Newton's direction. Usually we accept the step, unless it fails to lead to sufficient decrease in \(f\). In the latter case, the step length is repeatedly halved until sufficient decrease is achieved.
  • When \(\nabla^2 f(\boldsymbol{x})\) is not positive definite, Newton's direction may not even be defined, since \(\nabla^2 f(\boldsymbol{x})^{-1}\) may not even exist.


Practical issues From the discussion above, we see that any step of the form \[ \boldsymbol{\Delta}=-B_{\boldsymbol{x}}^{-1} \nabla f(\boldsymbol{x}) \qquad \mbox{for }B_{\boldsymbol{x}} \mbox{ positive definite} \] implies \(\boldsymbol{\Delta}\) is a descent direction.

  • This observation gives us the freedom to modify \(\nabla^2 f(\boldsymbol{x})\) to make it positive definite and still be guaranteed a descent direction.
  • One approach is to take the symmetric eigen-decomposition \(\Gamma_{\boldsymbol{x}}\,\Lambda_{\boldsymbol{x}} \, \Gamma_{\boldsymbol{x}}^T\), reverse the sign of any negative eigenvalues and replace any zero eigenvalues with a small positive number, e.g., for \(\lambda_1,\ldots,\lambda_m \neq 0\) tweak \begin{equation} \Lambda_{\boldsymbol{x}} = \left[ \begin{matrix} \lambda_1 & 0 & \ldots & 0\\ 0 &\lambda_2 & \ldots &0\\ \vdots &&\ddots& \vdots\\ 0 & 0 & \ldots & \lambda_m \end{matrix}\right] \quad \text{to} \quad \Lambda'_{\boldsymbol{x}} = \left[ \begin{matrix} \lambda_1 | & 0 & \ldots & 0\\ 0 &|\lambda_2| & \ldots &0\\ \vdots &&\ddots& \vdots\\ 0 & 0 & \ldots & |\lambda_m| \end{matrix}\right] \end{equation}
  • Computational cheaper alternatives add a multiple of the identity matrix to \(\nabla^2 f(\boldsymbol{x})\), sufficiently large to make the result positive definite, i.e., \[ B_{\boldsymbol{x}} = \nabla^2 f(\boldsymbol{x}) + c \boldsymbol{I}, \qquad c>0 \]

where \[ \boldsymbol{I} = \left[\begin{matrix} 1& 0 & \ldots &0\\ 0 &1 & \ldots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \ldots & 1 \end{matrix}\right] \] Practical Newton algorithm

  • Set \(k=0\) and a guesstimate \(\boldsymbol{x}^{0}\)
  • Evaluate \(f(\boldsymbol{x}^{(k)}), \nabla f(\boldsymbol{x}^{(k)}), \nabla^2f(\boldsymbol{x}^{(k)})\), implying a quadratic model approximation to \(f\).
  • Test whether \(\boldsymbol{x}^{(k)}\) is a minimum and terminate if it is.
  • If \(\nabla^2f(\boldsymbol{x}^{(k)})\) is not positive definite, perturb it so that it is (modifying the quadratic model)
  • The search direction is defined by the solution \begin{equation} \boldsymbol{\Delta}= -\nabla^2 f(\boldsymbol{x}^{(k)})^{-1} \nabla f(\boldsymbol{x}^{(k)}) \end{equation}
  • If \(f(\boldsymbol{x}^{(k)}+\boldsymbol{\Delta})\) is not sufficiently lower than \(f(\boldsymbol{x}^{(k)})\), repeatedly halve \(\boldsymbol{\Delta}\) until it is.
    • Set \(\boldsymbol{x}^{(k+1)}=\boldsymbol{x}^{(k)} + \boldsymbol{\Delta}\), increment \(k\) by one, return to step 2.


  • Methods that use the Newton direction have a fast rate of local convergence, typically quadratic.
  • After a neighbourhood of the solution is reached, convergence to high accuracy often occurs in just a few iterations.
  • The main drawback of the Newton direction is the need for the Hessian \(\nabla^2 f(\boldsymbol{x})\). This can be tedious to obtain, especially when the dimension of \(\boldsymbol{x}\) is large.
  • Finite-difference techniques may be useful in avoiding the need to calculate second derivatives by hand.

Quasi-Newton methods Is it possible produce methods that only require first derivative information, but with efficiency rivalling Newton's method, rather than steepest descent?

  • One approach is to use finite-difference but this is usually more costly than exact evaluation of Hessian.
  • Quasi-Newton methods provide an attractive alternative to Newton's method in that they do not require computation of the Hessian and yet still attain a superlinear rate of convergence
  • In place of \(\nabla^2 f(\boldsymbol{x})\), they use an approximation \(B_{\boldsymbol{x}}\), which is updated after each step to take into account of the additional knowledge gained in every step. The updates make use of the fact that changes in the gradient provide information about the second derivative of \(f\) along the search direction

REM fundamental theorem of calculus, if \(f:\mathbb{R}\rightarrow \mathbb{R}\) is a continuous function then \[ f(x) = f(a) + \int_{a}^x f'(t) ~\mathrm{d}t \] The multivariate generalisation recasted to the gradient of \(f\) is

\begin{align*} \nabla f(\boldsymbol{x} + \boldsymbol{\Delta}) &= \nabla f(\boldsymbol{x}) + \int_{0}^{1}\nabla^2 f(\boldsymbol{x}+ t \boldsymbol{\Delta})\, \boldsymbol{\Delta}~\mathrm{d}t \end{align*}

Adding and subtracting \(\nabla^2 f(\boldsymbol{x})\,\boldsymbol{\Delta}\) to the r.h.s, we get

\begin{align*} \nabla f(\boldsymbol{x} + \boldsymbol{\Delta}) &= \nabla f(\boldsymbol{x}) + \nabla^2f(\boldsymbol{x})\,\boldsymbol{\Delta}\\ \\ & \qquad + \int_{0}^{1}\left[\nabla^2 f(\boldsymbol{x}+ t \boldsymbol{\Delta} ) - \nabla^2 f(\boldsymbol{x}) \right] \, \boldsymbol{\Delta}~\mathrm{d}t \end{align*}

Setting \(\boldsymbol{x}=\boldsymbol{x}^{(k)}\) and \(\boldsymbol{\Delta} = \boldsymbol{x}^{(k+1)} -\boldsymbol{x}^{(k)}\) we obtain that for small \(\lVert \boldsymbol{\Delta}\lVert\)

\begin{equation} \nabla f(\boldsymbol{x}^{(k+1)}) - \nabla f(\boldsymbol{x}^{(k)}) \approx \nabla^2 f(\boldsymbol{x}^{(k)}) (\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^{(k)}) \label{eq:secant} \end{equation}

Let \(H^{(k+1)}\) be the approximate positive definite Hessian at the \((k+1)\)-th step

  • The basic requirement of a quasi Newton method is that the approximation should exactly match \(\nabla f(\boldsymbol{x}^{(k)})\), i.e., it should get the gradient vector at the previous point, \(\boldsymbol{x}^{(k)}\), exactly right, mathematically
\begin{equation} \nabla f(\boldsymbol{x}^{(k+1)}) - \nabla f(\boldsymbol{x}^{(k)}) = H^{(k+1)}(\boldsymbol{x}^{(k+1)}-\boldsymbol{x}^{(k)}) \label{eq:quasi_newton} \end{equation}

\(\qquad\) so that the new Hessian approximation \(H^{(k+1)}\) mimics the property \eqref{eq:secant}

  • Let \(\boldsymbol{s}_k = \boldsymbol{x}^{(k+1)}-\boldsymbol{x}^{(k)}\) and \(\boldsymbol{y}_k = \nabla f(\boldsymbol{x}^{(k+1)}) - \nabla f(\boldsymbol{x}^{(k)})\)
  • Then the expression in the first bullet point is equivalent to \begin{equation} H^{(k+1)} \boldsymbol{s}_k = \boldsymbol{y}_k \qquad \mbox{secant equation} \end{equation}
  • The secant equation will only be feasible for positive definite \(H^{(k+1)}\) under certain conditions on \(\boldsymbol{s}_k\) and \(\boldsymbol{y}_k\), but we can always arrange for these to be met by appropriate step length selection. (Wolfe conditions, see Nocedal & Wright, 2006.
  • The secant equation alone does not define a unique \(B^{(k+1)} = (H^{(k+1)})^{-1}\), and some extra conditions are needed.

Conditions and solution

  • \(B^{(k+1)}\) satisfies the secant equation
  • The difference between \(B^{(k)}\) and \(B^{(k+1)}\) has low rank, i.e., \(B^{(k+1)}\) is as close to \(B^{(k)}\) (measured by some appropriate matrix norm)
  • \(B^{(k+1)}\) is positive definite

The unique solution to this problem is the BFGS update \[ B^{(k+1)} = (\boldsymbol{I} - \rho_k \boldsymbol{s}_k \boldsymbol{y}_k^T) B^{(k)} (\boldsymbol{I} - \rho_k \boldsymbol{s}_k \boldsymbol{y}_k^T) + \rho_k \boldsymbol{s}_k \boldsymbol{s}_k^T \] where \(\rho_k^{-1} = \boldsymbol{s}_k^T \boldsymbol{y}_k\) – BFGS is named after Broyden, Fletcher, Goldfarb and Shanno all of whom independently discovered and published it, around 1970.

  • The BFGS method then works exactly like Newton's method, but with \(B^{(k)}\) in place of the inverse \(\nabla^2 f(\boldsymbol{x}^{(k)})\), and without the need for second derivative evaluation or for perturbing the Hessian to achieve positive definiteness
  • A finite difference approximation to the Hessian is often used to start the method

4.5 Nelder-Mead simplex

Nelder-Mead polytope What if even gradient evaluation is too taxing, or if our objective is not smooth enough for Taylor approximation to be valid? The Nelder-Mead polytope provides a successful answer.

  • Let \(m\) be the dimension of \(\boldsymbol{x}\). At each stage of the method we maintain \(m+1\) distinct \(\boldsymbol{x}\) vector, defining a polytope in \(\mathbb{R}^m\) (e.g., for a two dimensional \(\boldsymbol{x}\), the polytope is a triangle)
  • The search direction is defined as the vector from the worst point (vertex of polytope with highest value) through the average of of the remaining \(m\) points
  • The initial step length is set to twice the distance from the worst point to the centroid of the others. If it succeeds (meaning that the new point is no longer the worst point), then a step length of 1.5 times is tried, and the better of the two is accepted
  • If the previous step did not find a successful new point then step lengths of half and one and a half times the distance from the worst point to the centroid are tried
  • If the last two steps failed to locate a successful then the polytope is reduced in size, by linear rescaling towards the current best point (which remains fixed)


Nelder-Mead polytope

  • Variations are possible, in particular with regard to step length and shrinkage factors.
  • Slower rate of convergence than Newton type algorithms
  • Nelder-Mead is good if the answer does not need to be accurate, and derivatives are difficult to compute.
  • Not recommended for general purpose modelling software.

Other methods

  • In likelihood maximisation contexts, the method of scoring is used. This replaces the Hessian in Newton's method with the expected Hessian.
  • Conjugate gradient method is another way of getting a Newton type method when only first derivatives are available. It is applicable when the objective is somehow related to a sum of squares of differences between data and fitted values. This is closely related to the algorithm for fitting generalised linear models (GLMs).

Other methods

  • Simulated annealing is a method for optimising difficult objectives, such as those arising in discrete optimisation problems, or with very spiky objective functions. The basic idea is to propose random changes in \(\boldsymbol{x}\), always accepting a change which reduces \(f(\boldsymbol{x})\), but also accepting moves which increase the objective, with a probability which decreases with the size of increase, andalso decreases as the optimisation progresses.
  • The Expectation-Maximisation algorithm is a useful approach to likelihood maximisation when there are missing data or random effects that are difficult to integrate out.

5 Computer intensive methods

5.1 Markov chain Monte Carlo

Brief overview of Bayesian inference

  • \(\boldsymbol{x}=(x_1,\ldots,x_n)\): vector of observations
  • \(f(\boldsymbol{x}\mid \boldsymbol{\theta})\): sampling density (likelihood)
  • \(\pi(\boldsymbol{\theta})\): prior distribution of unknown \(\boldsymbol{\theta}=(\theta_1,\ldots,\theta_d)\).
  • Inference on the parameter based on the posterior distribution \[ \pi(\boldsymbol{\theta}\mid \boldsymbol{x}) = \frac{f(\boldsymbol{x}\mid \boldsymbol{\theta})\,\pi(\boldsymbol{\theta})}{f(\boldsymbol{x})} \] where \[ f(\boldsymbol{x})= \int f(\boldsymbol{x}\mid \boldsymbol{\theta})\,\pi(\boldsymbol{\theta})~\mathrm{d}\boldsymbol{\theta} \] is called the marginal likelihood or evidence.
  • Forecasting a future value of an observation \(x^*\) based on the posterior predictive distribution, with density \[ f(x^*\mid\boldsymbol{x}) = \int f(x^*\mid \boldsymbol{x},\boldsymbol{\theta}) \,\pi(\boldsymbol{\theta}\mid \boldsymbol{x}) \]
  • Often interested in some feature of the posterior, e.g., \[ \mathbb{E}\left[h(\boldsymbol{\theta})\mid \boldsymbol{x}\right] = \int h(\boldsymbol{\theta}) \, \pi (\boldsymbol{\theta}\mid \boldsymbol{x})~\mathrm{d}\boldsymbol{\theta} \]
  • Also interested in marginal posterior distributions \[ \pi(\theta_j\mid \boldsymbol{x})\qquad j=1,\ldots,d \]

Conditional conjugacy

  • Recall notion of conjugacy in Bayesian problems where there is a single unknown parameter \(\theta\).
  • Example: Let \(x_1,\ldots,x_n\) i.i.d. from exponential\((\theta)\) \[ f(\boldsymbol{x}\mid \theta) = \theta^n e^{-\sum_{i=1}^n x_i} \] and suppose our prior is \(\theta\sim \text{Gamma}(2,1)\).
  • Then posterior for \(\theta\) given data \(\boldsymbol{x}\) is \[ \theta\mid \boldsymbol{x}\sim \text{Gamma}\left(2+n, 1 + \sum_{i=1}^n x_i\right) \]
  • Thus posterior distribution remains in the Gamma family for this problem whatever data are observed. The Gamma family is called conjugate.
  • In many problems with a single unknown parameter it is usually possible to find useful conjugate priors. See for example Wikipedia's page for discrete distributions and continuous distributions.
  • However, even though in theory this is also possible in higher dimensional situations, in practice the conjugate families end up being highly complex and difficult to summarise.
  • On the other hand, many multi-parameter Bayesian problems exhibit conditional conjugacy.

Conditional conjugacy example

  • Let \(X_i \mid \mu, \omega \sim N(\mu,1/\omega)\) independently for \(i=1,\ldots,n\). The likelihood is \[ f(\boldsymbol{x}\mid \mu,\omega) \propto \omega^{n/2} \exp\left\{-\frac{\omega}{2}\sum_{i=1}^n (x_i-\mu)^2\right\} \]
  • Suppose also we have prior distributions for \(\mu\) and \(\omega\) \[ \mu \sim N (\mu_0,1/\kappa_0) \qquad \text{and} \qquad \omega\sim \text{Gamma}(\alpha_0,\lambda_0) \] where \(\mu\) and \(\omega\) are considered a priori independent.
  • Posterior density of \(\mu,\omega\) is \begin{align*} \pi(\mu,\omega\mid \boldsymbol{x})\propto & \exp\left\{-\frac{\omega}{2}\sum (x_i-\mu)^2\right\} \exp\left\{-\frac{\kappa_0}{2}(\mu-\mu_0)^2\right\}\\\\ & \times \omega^{\frac{n}{2}+\alpha_0-1}\exp(-\lambda_0 \omega). \end{align*}
  • Conditional posterior distributions belong to the same families as the priors \[ \pi(\mu \mid \omega,\boldsymbol{x}) \propto \exp\left\{-\frac{\omega}{2}\sum (x_i-\mu)^2\right\}\exp\left\{-\frac{\kappa_0}{2}(\mu-\mu_0)^2\right\} \] which is \[N\left(\frac{\kappa_0 \mu_0 + n\, \omega \bar{x}}{\kappa_0+n\, \omega},\frac{1}{\kappa_0 + n\, \omega}\right)\] while \[ \pi(\omega\mid \mu,\boldsymbol{x})\sim \text{Gamma}\left(\alpha_0 + \frac{n}{2},\lambda_0 + \frac{\sum_{i=1}^n(x_i-\mu)^2}{2}\right) \]


5.1.1 Gibbs sampling

  • In multi-dimensional Bayesian problems it is rarely possible to compute analytically summary statistics such as posterior means and variances, or even posterior probabilities.
  • Therefore, it is necessary to estimate the quantities of interest using a Monte Carlo approach. However, simulating from an arbitrary high dimensional distribution is usually difficult and often impossible to do directly.
  • Instead, Markov chain Monte Carlo methods are used to simulate a Markov chain, whose stationary or limiting distribution is the posterior distribution of interest.
  • The concept of conditional conjugacy is crucial in the constrution of one of the basic forms of MCMC:

The Gibbs sampler algorithm

  • Initialise \(\boldsymbol{\theta}=(\theta_1^{(0)},\theta_2^{(0)},\ldots,\theta_d^{(0)})\)
  • Simulate \(\theta_1^{(1)}\) from \(\pi(\theta_1\mid \theta_2^{(0)},\theta_3^{(0)},\ldots,\theta_d^{(0)},\boldsymbol{x})\)
  • Simulate \(\theta_2^{(1)}\) from \(\pi(\theta_2\mid \theta_1^{(1)},\theta_3^{(0)},\ldots,\theta_d^{(0)},\boldsymbol{x})\)
  • \(\ldots\)
  • Simulate \(\theta_d^{(1)}\) from \(\pi(\theta_d\mid \theta_1^{(1)},\theta_2^{(1)},\ldots,\theta_{d-1}^{(1)},\boldsymbol{x})\)
  • Iterate.

R code (conditional conjugacy example)

  gibbs <- function(X, mu0, kappa0, alpha0, lambda0, nburn=0, ndraw= 5000)
    n      <- length(X)
    sumX   <- sum(X)
    alpha1 <- alpha0 + (n/2)
    mu     <- rnorm(1, mu0, sqrt(1/kappa0))
    omega  <- rgamma(1, alpha0, 1)/lambda0
    draws  <- matrix(nrow=ndraw, ncol=2)
    iter   <- -nburn
    while(iter < ndraw)
        iter   <- iter + 1
        stdev  <- sqrt(1/(kappa0 + omega*n))
        mu1    <- (kappa0*mu0 + omega*sumX)*stdev^2
        mu     <- rnorm(1, mu1, stdev)
        omega  <- rgamma(1, alpha1, 1)/(lambda0 + (sum((X-mu)^2))/2)
        if(iter > 0)
            draws[iter,1] <- mu
            draws[iter,2] <- omega

X     <- rnorm(100,3,1)
draws <- gibbs(X=X, mu0=0, kappa0=1, alpha0=0.1, lambda0=0.1, nburn=1000, ndraw=1000)


Suppose we want to forecast an observable \(x^*\). The predictive distribution has density

\begin{align*} f(x^*\mid \boldsymbol{x})&=\int f(x^*\mid \mu,\omega) \pi(\mu,\omega \mid \boldsymbol{x})~\mathrm{d}\mu\,\mathrm{d}\omega\\\\ &=\mathbb{E}_{\mu,\omega\mid\boldsymbol{X}=\boldsymbol{x}}\left\{ f(x^*\mid \mu,\omega)\right\}\\\\ &\approx \frac{1}{J}\sum_{j=1}^J \left(\frac{\omega^{(j)}}{2\pi}\right)^{1/2} \exp\left\{-\frac{\omega^{(j)}}{2}\sum (x^*-\mu^{(j)})^2\right\} \end{align*}

where \((\mu^{(j)},\omega^{(j)})\) are the draws obtained through the Gibbs sampler.

R code

## trace plots, summaries, etc.
## Autocorrelation
## Partial autocorrelation
## running means

## Predictive density
x     <- seq(-3,10,len=100)
pred  <- NULL
mu    <- draws[,1]
omega <- draws[,2]
for(j in 1:length(x))
  pred[j] <- mean(dnorm(x[j], mu, 1/sqrt(omega)))
plot(x, pred, type="l")
title("predictive density: p(x^*|x)")

5.1.2 The Metropolis-Hastings algorithm

General McMC algorithm

  • It is not possible to use Gibbs sampling in all problems since sometimes the conditional posterior distributions cannot be calculated.
  • Consider for example replacing in the previous model the assumption of normality by that of \[ X_i\mid \mu,\omega \sim \text{Cauchy}(\mu,1/\omega), \] independently for \(i=1,\ldots,n\).
  • Then \begin{align*} \pi(\mu\mid \omega, \boldsymbol{x}) &\propto \left\{\prod_{i=1}^n\frac{1}{1 + \omega (x_i-\mu)^2}\right\} \exp\left\{-\frac{\kappa_0}{2}\left(\mu-\mu_0\right)^2\right\}, \qquad \mu \in \mathbb{R}\\\\ \pi(\omega\mid \mu, \boldsymbol{x}) &\propto \left\{\prod_{i=1}^n\frac{1}{1 + \omega (x_i-\mu)^2}\right\} \omega^{\frac{n}{2}+\alpha_0-1} \exp(-\lambda_0 \omega), \qquad \omega>0 \end{align*}

    None of these distributions has a well-known form, so Gibbs sampling is precluded.

Genetic linkage Example

  • Genetic linkage of 197 animals distributed into 4 categories \(\boldsymbol{x}=(x_1,x_2,x_3,x_4)=(125,18,20,34)\) with cell probabilities \[ \left(\frac{2 + \theta}{4},\frac{1}{4\,(1-\theta)}, \frac{1}{4\,(1-\theta)},\frac{\theta}{4}\right) \]
  • Positive cell probabilities imply \(\theta \in [0,1]\)
  • Suppose \(\theta \sim \text{Uniform}(0,1)\).
  • The posterior density of \(\theta\) is \begin{align*} \pi(\theta\mid \boldsymbol{x}) &\propto f(\boldsymbol{x}\mid \theta)\\\\ & \propto (2 + \theta)^{x_1}\,(1-\theta)^{x_2+x_3} \theta^{x_4},\qquad \theta \in (0,1). \end{align*}

Metropolis-Hastings algorithm

  1. Set \(t=0\) and start the chain at some value \({\theta^{(t)}}\)
  2. Propose a candidate value \({\theta}^{(prop)}\sim q(\cdot\mid {\theta}^{(t)})\), where \(q\) is arbitrary
  3. Calculate \[ \alpha({\theta}^{(t)},{\theta}^{(prop)})= \min\left\{1,\frac{\pi({\theta}^{(prop) } \mid \boldsymbol{x} )\,q({\theta}^{(t)}\mid {\theta}^{(prop)})}{\,\,\,\,\,\pi({\theta}^{(t)}\mid \boldsymbol{x})\,q({\theta}^{(prop)}\mid {\theta}^{(t)})}\right\} \]
    1. Generate \(U\sim \text{Uniform}(0,1)\)
    2. If \(U \leq \alpha({\theta}^{(t)},{\theta}^{(prop)})\) set \({\theta}^{(t+1)} = {\theta}^{(prop)}\)
    3. Otherwise set \({\theta}^{(t+1)} = {\theta}^{(t)}\)
    4. Set \(t = t+1\) and go back to step 2

Genetic linkage

  • Let \(q(\theta^{(prop)}\mid \theta^{(t)})\sim \text{Uniform}(0,1)\). The acceptance probability is \begin{align*} \alpha({\theta}^{(t)},{\theta}^{(prop)}) &= \min \left\{1,\frac{\pi(\theta^{(prop)} \mid \boldsymbol{x})}{\pi(\theta^{(t)}\mid \boldsymbol{x})}\right\}\\\\ &= \min \left\{1,\left(\frac{2 + \theta^{(prop)}}{2 + \theta^{(t)}}\right)^{x_1}\left(\frac{1-\theta^{(prop)}}{1-\theta^{(t)}} \right)^{x_2+x_3} \left(\frac{\theta^{(prop)}}{\theta^{(t)}}^{x_4}\right)\right\} \end{align*}
  • Let \(\rho \in (0,1)\) and define \[ q(\theta^{(prop)}\mid \theta^{(t)}) = \phi\left(\frac{\Phi^{-1}(\theta^{(prop)}) - \rho\, \Phi^{-1}(\theta^{(t)})}{\sqrt{1-\rho^2}}\right)\bigg/\sqrt{1-\rho^2} \] where \(\phi\) denotes the density of the standard normal. What is \(\alpha({\theta}^{(t)},{\theta}^{(prop)})\) in this case?

    How can we come up with such a candidate generator?

R code: useful functions

# Second candidate generator density and random number generation
q2 <- function(xprop,xt,rho)
  a        <- dnorm(qnorm(xprop))
  st.ratio <- ((qnorm(xprop)-rho*qnorm(xt))/sqrt(1-rho^2)

rq2 <- function(thetat, rho)
  epsilon <- rnorm(1, 0 ,1)
  pnorm(rho*qnorm(thetat) + sqrt(1-rho^2)*epsilon)

# acceptance ratio term for independence sampler, i.e., q~Uniform(0,1)
aratio <- function(X, thetaprop, thetat)
  A <- X[1]*log((2 + thetaprop)/(2+thetat))
  B <- (X[2]+X[3])*log((1-thetaprop)/(1-thetat))
  C <- X[4]*log(thetaprop/thetat)

genetic_is <- function(X, nburn=0, ndraw=1000){
    thetat  <- runif(1, 0, 1); accept <- 0
    draws   <- NULL
    iter    <- -nburn
    while(iter < ndraw){
        iter      <- iter + 1
        thetaprop <- runif(1, 0 ,1)
        logalpha  <- aratio(X, thetaprop, thetat)
        U         <- runif(1, 0 ,1)
        if(log(U) <= logalpha){
            accept  <- accept+1
            thetat  <- thetaprop
        if(iter > 0){
           draws[iter] <- thetat 


genetic_mh <- function(X, rho=0.5, nburn=0, ndraw=1000){
    thetat  <- runif(1, 0, 1); accept <- 0; draws  <- NULL
    iter    <- -nburn
    while(iter < ndraw){
        iter      <- iter + 1
        thetaprop <- rq2(thetat,rho)
        logalpha  <- aratio(X, thetaprop, thetat) +
            (log(q2(thetat, thetaprop, rho)) -
             log(q2(thetaprop, thetat, rho)))  ## correct acceptance ratio:
        U         <- runif(1, 0 ,1)            ## ratio of candidate generator 
        if(log(U) <= logalpha){                ## density terms does not vanish
            accept  <- accept+1
            thetat  <- thetaprop
        if(iter > 0){
           draws[iter] <- thetat

R code

# Run McMC algorithms
theta_is   <- genetic_is(X, ndraw=5000, nburn=1000)
theta_mh   <- genetic_mh(X, rho=0.85, ndraw=5000, nburn=1000)
theta_is[[2]];theta_mh[[2]]     # acceptance probability

# Traceplots

# Autocorrelation plots

# Posterior density

General McMC algorithm

  1. Break the components of \(\boldsymbol{\theta}\) into \(d\) groups \(\boldsymbol{\theta}_1,\ldots,\boldsymbol{\theta}_d\) where each group \(\boldsymbol{\theta}_j\) has dimension \(\geq 1\).
  2. Initialise with \(\boldsymbol{\theta}_1^{(0)},\ldots,\boldsymbol{\theta}_d^{(0)}\).
  3. Update \(\boldsymbol{\theta}_1^{(0)}\) to \(\boldsymbol{\theta}_1^{(1)}\) "according to" the conditional distribution \(\pi\left(\boldsymbol{\theta}_1\mid\boldsymbol{\theta}_2^{(0)}, \boldsymbol{\theta}_3^{(0)},\ldots,\boldsymbol{\theta}_d^{(0)}, \boldsymbol{x}\right)\)
  4. Update \(\boldsymbol{\theta}_2^{(0)}\) to \(\boldsymbol{\theta}_2^{(1)}\) "according to" the conditional distribution \(\pi\left(\boldsymbol{\theta}_2\mid\boldsymbol{\theta}_1^{(1)}, \boldsymbol{\theta}_3^{(0)},\ldots,\boldsymbol{\theta}_d^{(0)}, \boldsymbol{x}\right)\)
  5. \(\ldots\)
  6. Update \(\boldsymbol{\theta}_d^{(0)}\) to \(\boldsymbol{\theta}_d^{(1)}\) "according" to the conditional distribution \(\pi\left(\boldsymbol{\theta}_d\mid\boldsymbol{\theta}_1^{(1)}, \boldsymbol{\theta}_2^{(1)},\ldots,\boldsymbol{\theta}_{d-1}^{(1)}, \boldsymbol{x}\right)\)
  7. Iterate

General McMC algorithm

  • Suppose we are running a general McMC algorithm and that the current value of the chain is \(\boldsymbol{\theta}_1^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)}\).
  • We now want to simulate \(\boldsymbol{\theta}_1^{(t+1)}\), the next value of \(\boldsymbol{\theta}_1\)
  • In the updating steps of the previous algorithm we "update according to" the appropriate conditional distribution. For Gibbs sampling, we have seen this means simulating from the conditional distribution.
  • For other McMC algorithms "update according to" means something else.
  • The most general algorithm in this context is the Metropolis-Hastings algorithm.

Metropolis-Hastings algorithm

  • Propose a candidate value \(\boldsymbol{\theta}_1^{(prop)}\) which is a draw from an arbitrary distribution with density \[ q(\boldsymbol{\theta}_1^{(prop)}\mid \boldsymbol{\theta}_1^{(t)}, \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)}) \]
  • Take as the next value of \(\boldsymbol{\theta}_1\) in the chain \begin{equation*} \boldsymbol{\theta}_1^{(t+1)} = \begin{cases} \boldsymbol{\theta}_1^{(prop)}&\text{with probability $\alpha$}\\ \boldsymbol{\theta}_1^{(t)}&\text{with probability $1-\alpha$} \end{cases} \end{equation*}


    \[ \alpha = \min\left\{1,\frac{\pi\left(\boldsymbol{\theta}_1^{(prop)}\mid \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)},\boldsymbol{x}\right)} {\pi\left(\boldsymbol{\theta}_1^{(t)}\mid \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)},\boldsymbol{x}\right)} \frac{q\left(\boldsymbol{\theta}_1^{(t)}\mid \boldsymbol{\theta}_1^{(prop)}, \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)}\right)} {q\left(\boldsymbol{\theta}_1^{(prop)}\mid \boldsymbol{\theta}_1^{(t)}, \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)}\right)}\right\} \] with \(\pi\left(\boldsymbol{\theta}_1^{(prop)}\mid \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)}\right)\) denoting the density corresponding to the conditional posterior distribution of \(\boldsymbol{\theta}_1\) evaluated at \(\boldsymbol{\theta}_1=\boldsymbol{\theta}_1^{(prop)}\) and similarly for \(\pi\left(\boldsymbol{\theta}_1^{(t)}\mid \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)}\right)\)


  • In practice, the way to implement the second part of the Metropolis-Hastings updating mechanism is by drawing a value \(u\) from Uniform\((0,1)\) and taking \[ \boldsymbol{\theta}_1^{(t+1)}=\begin{cases} \boldsymbol{\theta}_1^{(prop)} &u\leq\alpha\\\\ \boldsymbol{\theta}_1^{(t)}& u>\alpha. \end{cases} \]
  • The candidate generator \(q(\boldsymbol{\theta}_1^{(prop)}\mid \boldsymbol{\theta}_1^{(t)},\boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)})\) is arbitrary so, in principle, any choice should work. However, in terms of convergence properties (or mixing properties) of the algorithm, not all candidate generators are equally good.
  • Note that the candidate generator can depend on the current value of the chain \(\boldsymbol{\theta}_1^{(t)}\), although this is not a requirement.
  • The Metropolis-Hastings algorithm has the major advantage over Gibbs sampler that it is not necessary to be able to simulate from all the conditional posterior distributions
  • Moreover, and this can be of crucial importance, we only need to know the conditional posterior densities up to proportionality, since any constants of proportionality cancel in the numerator and denominator of the calculation of \(\alpha\).
  • The price for the simplicity is that if \(q\) is poorly chosen, then the number of rejections can be high, so that the efficiency of the procedure can be very low.
  • Gibbs sampling is a special case of the Metropolis-Hastings algorithm where the candidate generator is chosen as \[ q(\boldsymbol{\theta}_{1}^{(prop)}\mid \boldsymbol{\theta}_{1}^{(t)}, \boldsymbol{\theta}_{2}^{(t)},\ldots,\boldsymbol{\theta}_{d}^{(t)})= \pi(\boldsymbol{\theta}_{1}^{(prop)}\mid \boldsymbol{\theta}_{2}^{(t)},\ldots,\boldsymbol{\theta}_{d}^{(t)}) \] in other words we draw directly from the conditional posterior distribution. In this case the acceptance probability is 1 so that \(\boldsymbol{\theta}_{1}^{(t+1)}=\boldsymbol{\theta}_{1}^{(prop)}\) always.
  • A common choice of the candidate generator is to take \[ q(\boldsymbol{\theta}_{1}^{(prop)}\mid \boldsymbol{\theta}_{1}^{(t)}, \boldsymbol{\theta}_{2}^{(t)},\ldots,\boldsymbol{\theta}_{d}^{(t)})\sim N(\boldsymbol{\theta}_1^{(t)},V)\] where \(V\) is suitably chosen. This is the so-called random walk Metropolis algorithm with normal increments and it is very popular due to its simplicity.
  • Note that the symmetry of the candidate generator distribution in random walk Metropolis means that the terms involving \(q\) in the acceptance probability \(\alpha\) cancel, so that \[ \alpha= \min\left\{1,\frac{\pi\left(\boldsymbol{\theta}_1^{(prop)}\mid \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)},\boldsymbol{x}\right)} {\pi\left(\boldsymbol{\theta}_1^{(t)}\mid \boldsymbol{\theta}_2^{(t)},\ldots,\boldsymbol{\theta}_d^{(t)},\boldsymbol{x}\right)} \right\}. \]
  • The variance \(V\) of this candidate generator plays an important role in the mixing properties of the algorithm: if \(V\) is chosen to be too large, the moves we propose can be too bold leading to very low acceptance probabilities (slow mixing); if, on the other hand, \(V\) is too small, the acceptance probability will be very high at the expense of moving in very little steps (slow convergence).
  • \(V\) is chosen by trial and error aiming at an acceptance probability (very) roughly around 30\(\%\).

Cauchy example revisited

Consider \[ X_i\mid \mu,\omega \sim \text{Cauchy}(\mu,1/\omega), \] independently for \(i=1,\ldots,n\). The required conditional posterior densities are

\begin{align*} \pi(\mu\mid \omega, \boldsymbol{x}) &\propto \left\{\prod_{i=1}^n\frac{1}{1 + \omega (x_i-\mu)^2}\right\} \exp\left\{-\frac{\kappa_0}{2}\left(\mu-\mu_0\right)^2\right\}\\\\ \pi(\omega\mid \mu, \boldsymbol{x}) &\propto \left\{\prod_{i=1}^n\frac{1}{1 + \omega (x_i-\mu)^2}\right\} \omega^{\frac{n}{2}+\alpha_0-1} \exp(-\lambda_0 \omega). \end{align*}

Here we need to resort to a more general Metropolis-Hastings algorithm.

  • Using the general McMC algorithm (pg 21), the parameter is divided into 2 groups, namely \(\mu\) and \(\omega\) and we need to update \(\mu\) "according to" \(\pi(\mu\mid \omega,\boldsymbol{x})\) and \(\omega\) "according to" \(\pi(\omega\mid \mu,\boldsymbol{x})\)
  • We use random walk Metropolis with normal increments for both \(\mu\) and \(\omega\). This leads to the following algorithm


  1. Set \(t=0\) and initial values \((\mu^{(t)},\omega^{(t)})\).
  2. Draw \(\mu^{(prop)}\sim N(\mu^{(t)},v_\mu)\) for some suitably chosen variance \(v_\mu\), and take \begin{equation*} \mu^{(t+1)}=\begin{cases} {\mu}^{(prop)} &\text{w.p. $\alpha$} \\\\ {\mu}^{(t)}& \text{w.p. $1-\alpha$}. \end{cases} \end{equation*}


    \begin{align*} \alpha = \min\left\{1,\exp\left[\frac{\kappa_0}{2}\left\{\left(\mu^{(t)}-\mu_0\right)^2 - \left(\mu^{(prop)}-\mu_0\right)^2\right\}\right] \times\prod_{i=1}^n\left(\frac{1 + \omega^{(t)} (x_i-\mu^{(t)})^2} {1 + \omega^{(t)} (x_i-\mu^{(prop)})^2}\right)\right\} \end{align*}
  3. Draw \(\omega^{(prop)}\sim N(\omega^{(t)},v_\omega)\) for some suitably chosen \(v_\omega\), and take \[ \omega^{(t+1)}=\begin{cases} {\omega}^{(prop)} &\text{w.p. $\alpha$} \\\\ {\omega}^{(t)}& \text{w.p. $1-\alpha$}. \end{cases} \] where \begin{align*} \alpha = \min\left\{1,\left(\frac{\omega^{(prop)}}{\omega^{(t)}}\right)^{\frac{n}{2}+\alpha_0-1} e^{\lambda_0\left(\omega^{(t)} - \omega^{(prop)}\right)} \times\prod_{i=1}^n\left(\frac{1 + \omega^{(t)} (x_i-\mu^{(t+1)})^2} {1 + \omega^{(prop)} (x_i-\mu^{(t+1)})^2}\right) I(\omega^{(prop)}>0)\right\} \end{align*}
  4. Iterate

R code

cauchy_rwm <- function(X, mu0=0, kappa0=1, alpha0=1, lambda0=1,
                       nburn=0, ndraw=1000, vmu=1, vomega=1){
    n           <- length(X)
    alpha1      <- (n/2) + alpha0 - 1
    stdvmu      <- sqrt(vmu) 
    stdvomega   <- sqrt(vomega)
    mut         <- rnorm(1, mu0, sqrt(1/kappa0))
    omegat      <- rgamma(1, alpha0, 1)/lambda0
    draws       <- matrix(nrow=ndraw, ncol=2)
    acceptmu    <- 0
    acceptomega <- 0
    iter        <- -nburn
    while(iter < ndraw){
        iter     <- iter + 1
        muprop   <- rnorm(1, mut, stdvmu)
        logalpha <- (kappa0/2)*((mut-mu0)^2 - (muprop-mu0)^2) +
            sum( log(1 + omegat*((X-mut)^2)) -
                 log(1 + omegat*((X-muprop)^2)))
        u <- runif(1, 0, 1)
          acceptmu <- acceptmu + 1; mut <- muprop
        omegaprop <- rnorm(1, omegat, stdvomega)
        if(omegaprop > 0){
            logalpha <- alpha1*(log(omegaprop)-log(omegat))+
                lambda0*(omegat-omegaprop) +
            u <- runif(1, 0, 1)
                acceptomega <- acceptomega + 1; omegat <- omegaprop
        if(iter > 0){
            draws[iter,1] <- mut
            draws[iter,2] <- omegat

Author: Ioannis Papastathopoulos

Created: 2017-11-21 Tue 07:14

Emacs 24.5.1 (Org mode 8.2.10)