Likelihood

From Sean_Carver
Revision as of 17:26, 29 January 2009 by Carver (talk | contribs)
Jump to: navigation, search

The likelihood tells you how probable it is that your model produces your data. This measure tells you how good your model is at matching your experimental data (from a wet-lab or a computer). First we need to understand the concept of likelihood, and this brings us into probability theory. We start the likelihood lab by studying random numbers on a computer. But before we get into likelihood, let's discuss the homework from last class. Click here for slides of completed homework.

New Tools for New Things in MATLAB

This section concerns the discussion of the homework. Discussion of the "one point creative bonus problem" will always be followed by the question what next? or what else? So far I haven't given you many tools for doing things in MATLAB. Here are some general tools that apply everywhere:

If you want to know more about a command (perhaps how to use it differently), use the help command

help randn
help hist

The help usually lists related commands at the bottom.

For many but not all commands in MATLAB you can see the MATLAB code that they execute. The others are not written in MATLAB. If you can see the code, you can copy it, thereby making a new command, then modify the new command. Reading MATLAB commands' code is an excellent way to learn advanced MATLAB skills.

type randn
type hist

Finally you can search for commands relevant to a key word.

lookfor random
lookfor histogram

I did this for next Tuesday's lecture and discovered this gem

A = randn(100000,2);
hist3(A,[30,30])

Coming Tuesday ...

Random Numbers in MATLAB

Start MATLAB (see the page Starting Matlab) then type the following into the MATLAB prompt

rand

This produces a random number. Repeat several times. Continue typing the material in boxes into the MATLAB prompt.

help rand

Returns help for the rand command. We'll discuss this help file. Try

rand(3)
rand(3,2)
rand(3,2,2)

These commands allow you to create arrays of random numbers without loops. Compare how long it takes to run

A = rand(10000,1);

With

for i = 1:10000
B(i,1) = rand;
end

Note however there is really two issues with the second method: first, we could have avoided the loop, and second, the we successively built a larger and larger array without first initializing it. Now that B is defined, repeating the second set of command will go much faster, but still not as fast as avoiding the loop.

Note semicolons suppress a command's insistence on displaying the result on the screen. Try it without the semicolon:

A = rand(10000,1)

Finally, if you are running a model that uses random numbers it is useful to be able to get the same random numbers again to reproduce your results. This can be done: computer random numbers are not really random, they are pseudo-random. The number returned by the generator is completely determined by the "seed" used to initialize the generator and the number of times that the generator has been used since the last initialization. Try:

rand('state',0);
rand
rand
rand
rand('state',0);
rand
rand
rand
rand('state',1);
rand
rand
rand

The command "clock" can be used to set the seed in "random" way. Use "up arrow" to repeat a command.

clock
clock
clock
sum(100*clock)
sum(100*clock)
sum(100*clock)
rand('state',sum(100*clock));
rand
rand
rand('state',sum(100*clock));
rand
rand


If you want a "random" seed and still be able to reproduce your simulations use:

s = sum(100*clock);
rand('state',s);
rand
rand
rand('state',s)
rand
rand

Probability

We want to use the command rand to roll a die. The command rand returns a number between 0 and 1. So 6*rand returns a number between 0 and 6. The command ceil rounds numbers up.

help ceil
ceil(6*rand)

Repeat this command several times. But now let's create a new command. Open the Editor and type

function d = die
% DIE Roll a die
% d = die returns a random integer between 1 and 6, inclusive

d = ceil(6*rand);

Save the file as die.m then type

die
die
die

For the next step it is useful to get more than one throw of the die. Save as dice.m then change

function d = dice(n)
% DICE Rolls dice
% d = dice(n) returns n random integers between 1 and 6, inclusive

d = ceil(6*rand(n,1));

Now some statistics:

mean(dice(100))
mean(dice(1000))
mean(dice(10000))
mean(dice(100000))
  • Homework Problem B.1: Roll the dice many times. Make a guess as to whether the mean converges as the number of dice increases. Assuming you do not know the answer, what numerical experiments would increase your confidence in this regard? Create a PowerPoint presentation to write down your answer.

Now try the same thing for

median
mode
std
min
max

The command std is the standard deviation.

  • Homework Problem B.2: Roll the dice many times. Make guesses as to whether the statistics median, mode, std, min and max converge as the number of dice increases. For the ones that converge, which limits are equal. Add to the PowerPoint presentation from problem 1.

Probability Mass Function

The PMF gives the theoretical fraction of times each value is returned on repeated experiments. For a fair die

f(i) = 1/6, for i = 1,2,3,4,5,6

f(i) = 0 otherwise.

For a loaded die, some the six non-zero values of f would differ, but they must add to 1.

The statistics considered above can be calculated theoretically from the PMF. Then there are theorems that show convergence of these statistics to their theoretical values ... or counterexamples showing that convergence does not happen. Warning: Convergence is a subtle concept in Probability Theory.

Of note: The theoretical "mode" for loaded die is the i which maximizes f. For a fair die, all values are equal and the "mode" does not exist, theoretically. The idea of "most probable" is important to us.

Plotting the PMF: Roll the dice and create a histogram with 6 bins:

A = dice(100);
hist(A,6)

Repeat several times. Then try:

A = dice(1000);
hist(A,6)
A = hist(10000);
hist(A,6)
hist(A,60)
hist(A,600)

Notice that as the width of the bin gets smaller, the number in each bin (containing the integers 1,2,3,4,5,6) does not change. Also notice that the height of each bin, divided by the total number of dice, equals

1/6

Probability theory defines probability with axioms (we will not study the axioms). The most common interpretation of probability of an outcome is that it is the fraction of times it occurs when if the experiment (roll the die) could be repeated many times. The PMF gives the probability of occurrence of each discrete outcome: 1,2,3,4,5, or 6.

Probability Density Function

Probabilists make an analogy between probability and mass. Mathematically, you can think of probability (or mass) as concentrated at discrete mathematical points (1,2,3,4,5,6) or spread out across a continuum. Think of a continuous solid: a 10 kilo barbell. How much mass is there in 1 cubic centimeter of the volume of the barbell? Much less than 10 kilos! How much mass in 1 cubic millimeter? The mass in 1 cubic millimeter is less than the mass in one cubic centimeter by a factor of 1000! As you take smaller and smaller volumes you get less and less mass. Ignoring atoms and elementary particle, suppose the barbell is continuous solid. The smaller the volume the less mass in that volume. And there is no mass at a mathematical point! But the ratio of mass to volume typically converges: it is called the density. If mass were concentrated at individual points (as with the dice) the density at those points would be infinite!

Now lets compare dice and rand. Dice returned a discrete number: 1,2,3,4,5,6. Rand returns a "continuous" number, or approximately continuous as nothing on a digital computer is really continuous, if you zoom way in. But we won't zoom in. For dice, the number of occurrences in each bin (containing exactly one of the numbers 1,2,3,4,5, or 6) did not depend on the size of the bin. Now try:

A = rand(1000,1);
hist(A,10)
hist(A,100)
hist(A,1000)
hist(A,10000)

Here is some terminology: Dice and rand return what is called a random variable, which is some function of the state of the random number generator. (The mathematical definition of random variable is similar.) Dice returns a discrete random variable rand returns a "continuous" random variable. Discrete random variables have probability mass functions (PMFs) the probability density at points with nonzero mass is infinite. Continuous random variables have probability density functions (PDFs) with zero probability mass everywhere. There is such a thing as a mixed random variable which is part continuous and part discrete, but we will not get into it.

You can visualize PMFs and PDFs both with the hist command, although the vertical scale will be wrong.

A = rand(100000,1);
hist(A,100)

A different random number generator with a different PDF: randn; "n" for normal. The random variable rand is said to have a uniform distribution whereas randn has a normal or Gaussian distribution.

A = randn(100000,1);
hist(A,100)

Note the famous bell curve.

Some facts about PDFs, densities, and the bins from the command hist:

  • The density at a point is approximately the fraction of points that fall into its bin divided by the width of the bin.
  • Don't like approximations? Think limit.
  • The probability of landing in a bin is approximately the density at one point in the bin times the width of the bin.
  • Don't like approximations? Think integral.
  • For discrete random variables, the probability masses (nonzero values of PMF) must add to 1.
  • For continuous random variables, the area (integral) under the PDF must be 1.
  • Probability mass must be less than or equal to 1; probability densities can approach infinity.

Let's revisit the statistics. Theoretically the statistics converge to

  • Mean: average of possible values of random variable weighted by the density
  • Median: point in the domain of the PDF such that half the area (=0.5) under the PDF is lower and half the area is higher
  • Mode: point in domain of the PDF which maximizes the PDF.
  • Std: mean of squared distances from mean
  • Min: lowest value with positive density
  • Max: highest value with positive density
  • Homework B.3: Does the mean, median, mode, std, min, max of rand and randn converge as the number of samples increases? To what? Answer this problem by plotting an approximate PDF with hist, taking a screenshot of the Figure, dragging and dropping into PowerPoint, and approximately labeling the statistics on the Figure with text boxes. For STD use a line centered around the mean. It doesn't have to be precise.

Functions of Random Variables

We have two random number generators (rand and randn). In MATLAB that's it for random number generators. You can get more interesting random variables by taking function of random variables. Remember the die

ceil(6*rand)

Let's define some more functions. Open affine.m in the editor and type in:

function D = affine(A,m,b)
% AFFINE affine transformation of data D

D = m*A + b;

Save. Now open sigmoid.m and type in

function D = sigmoid(A,k,h)
% SIGMOID sigmoid transformation of data D

D = 1./(1+exp(-(A-h)/k));

Now lets plot these functions. But first lets learn about the colon operator. Try the following to see what it does:

A = 0:10
A = 0:1
A = 0:0.1:1
A = 1:0
A = 1:-0.1:0

Get the idea? Now lets generate a sequence of points where we will evaluate the function, then evaluate the affine function for different values of m and b:

A = 0:0.001:1;
F1 = affine(A,10,10);
F2 = affine(A,5,10);
F3 = affine(A,10,5);

Now plot

plot(A,F1)
plot(A,F2)

Doesn't look any different, does it? Reason: the axes are chosen automatically. Try it again and look closely at the axes. Now plot all three on the same axes:

hold on
plot(A,F1,'b')
plot(A,F2,'g')
plot(A,F3,'r')
hold off

Hold means the next plot comes on top of the old one. Have you figured out what the letters 'b', 'g', and 'r' are for? Colors. For more options, type

help plot

Now generate random data (with rand and randn) and use affine to create a new random variables:

A = rand(10000,1);
F = affine(A,10,10);
hist(F,100)
A = randn(10000,1);
F = affine(A,10,10);
hist(F,100)

In the above example m=b=10. Change m and b and repeat. The histograms will look similar; pay close attention to the axes -- both the scale and origin.

  • Homework B.4: Use numerical experiments to figure out the following question: Given a random variable A, what is the relationship betweenn the PDF of A and the PDF of F = affine(A,m,b)?

Now let's try it with the sigmoid function. First plot it:

A = 0:0.01:1;
F = sigmoid(A,.1,0.5);
plot(A,F);

Now

A = rand(10000,1);
F = sigmoid(A,.1,0.5);
hist(F,100)
  • Homework B.5: Consider the uniform random variable A = rand and the random variable F = sigmoid(A,k,h). Where does the PDF of F have peaks; where does it have troughs. Hint: For different k,h, be sure to plot both the function over the range of rand (0 to 1) and the histogram.

Now try running data from randn through sigmoid.

A = randn(10000,1);
F = sigmoid(A,0.8,0.5);
hist(F,100);

Whereas every time you applied the affine function you still got something that looked like a bell curve, applying the sigmoid function destroys the shape of the bell curve. This property of nonlinear functions make system ID with nonlinear models challenging.

The Concept of Likelihood Introduced

We have now seen random variables with PDFs that depend upon parameters

m = 5;
b = 10;
A = randn;
F = affine(A,m,b)

If you change m and b, you change the PDF of F. But we have done that already. Now suppose that you observe the random variable F (randn filtered through affine) but you don't know what m and b are!

s = sum(100*clock);
rand('state',s)
m = rand;
b = rand;
A = randn;
F = affine(A,m,b)

Look at the value of F and guess m and b. How would you guess m and b?

Here's one method: for all m and b, you have a PDF which measures how probable the observation of F is. Pick the parameters to maximize this probability (density).

The probability density is really a function of both the value of the random variable (the data point, as we have seen before) and the parameters (m and b). The data point has already been collected: it is fixed. The probability density as a function of the parameters, with the data point fixed, is called the likelihood function. Our method picks m and b to maximize the likelihood.

How do you do this? Here's a hint: with one data point our method is pretty lame. Consider:

m0 = 0;
b0 = F;
A0 = randn(10000,1);
F0 = affine(A0,m0,b0);
hist(F0,100)
min(F0)
max(F0)

By choosing m0 = 0, you concentrate all the probability mass at one point b0 (set to F), making the likelihood of the parameters (probability density of the data point at the parameters) infinite.

The next thing we did is 10000 observations of the "fitted model" which returned 10000 identical data points (min(F0) = max(F0). Infinite likelihood is obviously the highest possible likelihood, but was this a good guess of m and b? Let's see:

m
b

You can get a better guess if you have more data

A = randn(10000,1);
F = affine(A,m,b);
hist(F,100)

Can you guess from this histogram what m and b are? We will talk about the calculation next class.

  • Bonus Homework Problem (1 point) B.*: Do something creative with the tools presented here. For example, can you come up with a random variable with other interesting properties? Make sure you communicate what you have done when you turn in your work.