

Simple concept learning

Our first example is the `concept learning' problem illustrated in Fig 1. In concept learning the assumption is that there is an unknown target (or `true') concept, which is simply a subset of some data space. Data consists of positive examples (examples in the target concept) and negative examples (examples outside the target concept). There is also a hypothesis space: the collection of sets which the learning system can hypothesise to be the true concept.

Consider a very simple concept learning scenario where the hypothesis space contains a mere five hypotheses, each of which is a rectangle (Fig 1). The data consists of two negative examples (black dots in Fig 1) and a single positive example (the white dot). Clearly, only the $b$ and $c$ rectangles are consistent with the data.

Figure 1: Concept learning with five hypotheses and three examples
\includegraphics[width=0.9\textwidth]{exs.eps}

In MCMCMS the user is required to define a prior over the hypothesis space using an SLP. Fig 2 shows an SLP defining one possible prior over our tiny hypothesis space. This SLP can be found in tutorial/slps/tabular.slp. (This filename, and all others in this section, are given relative to the top level of MCMCMS once you have unpacked it.)

Figure 2: SLP defining a prior over a hypothesis space of five rectangles (in the file tutorial/slps/tabular.slp).
\begin{figure}\centering
\begin{verbatim}0.1 :: hyp(rectangle(a,1,2,3,4)).
0.1 :: hyp(rectangle(b,...)).
0.3 :: hyp(rectangle(c,5,6,6,7)).
0.3 :: hyp(rectangle(d,2,1,8,3)).
0.2 :: hyp(rectangle(e,4,2,7,9)).\end{verbatim}\end{figure}

Each rectangle is represented by a term in first-order logic which gives the name of the rectangle and the co-ordinates of the bottom-left and top-right corners respectively. Hypotheses can be represented as the user sees fit: the important thing, as we shall see below, is to choose a representation that makes it easy to compute likelihoods.

A full explanation of how SLPs define probability distributions can be found, for example, in [Cussens 2001]. In this simple example, each probability label in Fig 2 gives the prior probability of the corresponding rectangle. The choice of predicate symbol hyp/1 is arbitrary.
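To make the semantics concrete, the following is a minimal sketch (not MCMCMS code) of how such labelled clauses can be read as a sampling distribution. The predicate names labelled_hyp/2, sample_hyp/1 and pick/3 are invented for illustration; the labels shown for c and d are values inferred from the posterior estimates reported at the end of this section, and rectangle b is omitted because its coordinates are truncated in the source listing.

\begin{verbatim}:- use_module(library(random)).   % random/1: uniform float in [0.0,1.0)

labelled_hyp(0.1, rectangle(a,1,2,3,4)).
labelled_hyp(0.3, rectangle(c,5,6,6,7)).
labelled_hyp(0.3, rectangle(d,2,1,8,3)).
labelled_hyp(0.2, rectangle(e,4,2,7,9)).

sample_hyp(Hyp) :-                 % draw one hypothesis from the prior
	findall(P-H, labelled_hyp(P,H), Dist),
	random(U),
	pick(Dist, U, Hyp).

pick([P-H|T], U, Hyp) :-           % walk the cumulative distribution
	( U < P -> Hyp = H
	; U1 is U - P, pick(T, U1, Hyp)
	).\end{verbatim}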

Similarly, the choice of data representation is a matter of convenience, although using anything other than Prolog might involve a lot of extra work. Fig 3 shows how we have chosen to represent the three data points of Fig 1: we just give the co-ordinates and the class of each datapoint. This data can be found in tutorial/data/exs.pl.

Figure 3: Data for concept learning (in the file tutorial/data/exs.pl).
\begin{figure}\centering
\begin{verbatim}datum((5.5,6.5),pos).
datum((5.5,2.5),neg).
datum((6.5,8.5),neg).\end{verbatim}\end{figure}
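Once this file is loaded, the data can be queried like any other Prolog predicate. For example, under SICStus the negative examples are enumerated by:

\begin{verbatim}| ?- datum(Position, neg).
Position = (5.5,2.5) ? ;
Position = (6.5,8.5) ? ;
no\end{verbatim}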

Now that prior and data have been determined, we need to consider the likelihood function. To see what sort of likelihood function needs to be defined, consider the form of Bayes' theorem relevant to concept learning. Each datapoint $E_i$ is of the form $(X_{i},C_{i})$, where $X_{i}$ is the datapoint's position and $C_i$ is its class. A hypothesis $H$ is only required to predict an example's class given its position; it need not make any predictions concerning positions. (In statistical parlance, the co-ordinates giving the position of the example are called explanatory variables, predictor variables or covariates.) Let $X$ and $C$ represent the vectors of positions and classes of examples, respectively. In the current case this means that $X=((5.5,6.5),(5.5,2.5),(6.5,8.5))$ and $C=(pos,neg,neg)$. The posterior $P(H\vert X,C)$ for any hypothesis $H$ can then be calculated as follows:

\begin{displaymath}P(H\vert X,C) = \frac{P(H\vert X)P(C\vert H,X)}{P(C\vert X)}
\end{displaymath} (1)

We will assume that, given a hypothesis, each example's class label is attached independently of the other examples:
\begin{displaymath}P(C\vert H,X) = \prod_{i} P(C_{i}\vert H,X_{i})
\end{displaymath} (2)

The LHS of (2) is the likelihood function for this scenario. Since each hypothesis $H$ classifies examples categorically, it is easy to see that each term on the RHS of (2) is either 1, if $H$ classifies $(X_{i},C_{i})$ correctly, or 0 otherwise. It follows that the overall likelihood $P(C\vert H,X)$ is 1 if $H$ is consistent with all datapoints and 0 otherwise. Note that the prior $P(H\vert X)$ is prior only to the classes of the examples; their positions can influence the prior. In the current scenario, as Fig 2 shows, this potential influence is not used. See the work on C&RT described in Section 6.3 for a case where the values of predictor variables in the data do affect the prior.
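To make this concrete, only rectangles $b$ and $c$ have likelihood 1 in our example, so the posterior of equation (1) amounts to renormalising the prior over the consistent hypotheses. Using the labels of Fig 2 (those for $c$ and $d$ are inferred from the posterior estimates reported at the end of this section):

\begin{displaymath}P(b\vert X,C) = \frac{0.1\times 1}{0.1\times 1+0.3\times 1} = 0.25
\qquad
P(c\vert X,C) = \frac{0.3\times 1}{0.1\times 1+0.3\times 1} = 0.75
\end{displaymath}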

Code for computing likelihoods (in fact, log-likelihoods) is shown in Fig 4. The first clause of predicate model_llhood/2 declares that a Model (i.e. a hypothesis) has log-likelihood $-\infty$ (i.e. likelihood 0) if there exists an example (X,C) which is inconsistent with the model. If there are no inconsistent examples, then a log-likelihood of 0 (i.e. a likelihood of 1) is returned by the second clause. Note the use of a cut (!) to ensure that model_llhood/2 defines a function. This code can be found in the file tutorial/simple_concept_learning.pl.

Figure 4: Computing the log-likelihood for rectangular hypotheses (in the file tutorial/simple_concept_learning.pl).
\begin{figure}\centering
\begin{verbatim}model_llhood(Model,-inf) :- % Model h...
...<X2 , XY>=Y1 , XY=<Y2. % neg inside the rectangle
\end{verbatim}\end{figure}
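The listing in Fig 4 is heavily truncated. The sketch below reconstructs it from the visible fragments and the description above; the helper predicate inconsistent/3, both its name and its exact clause structure, is an assumption, so treat this as an illustration rather than the file's actual contents.

\begin{verbatim}% Reconstruction sketch of Fig 4; inconsistent/3 is an assumed helper.
model_llhood(Model,-inf) :-      % Model has log-likelihood -inf if
	datum(X,C),              % there is an example (X,C)
	inconsistent(Model,X,C), % which the model misclassifies
	!.                       % cut: model_llhood/2 defines a function
model_llhood(_Model,0).          % otherwise log-likelihood 0 (likelihood 1)

inconsistent(rectangle(_,X1,Y1,X2,Y2),(XX,XY),pos) :-
	( XX<X1 ; XX>X2 ; XY<Y1 ; XY>Y2 ).   % pos outside the rectangle
inconsistent(rectangle(_,X1,Y1,X2,Y2),(XX,XY),neg) :-
	XX>=X1 , XX=<X2 , XY>=Y1 , XY=<Y2.   % neg inside the rectangle\end{verbatim}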

The likelihood is the place where the representations of the hypothesis space and the data meet. What is essential is that the definition of model_llhood/2 is consistent with whatever hypothesis and data representations are chosen.

Since MCMCMS uses the Metropolis-Hastings algorithm it does not require the log-likelihoods directly. All that is needed is the likelihood ratio for any two hypotheses $M^*$ and $M_i$; in our current example this ratio is simply $P(C\vert M^{*},X)/P(C\vert M_{i},X)$. In some cases it is possible to compute this ratio without going to the trouble of computing the individual likelihoods (or log-likelihoods). However, in our current case there is no such cleverness: we compute the log-likelihoods and then exponentiate their difference to get the likelihood ratio. The code for doing this is in Fig 5 and can be found in tutorial/simple_concept_learning.pl. Note that LLi is instantiated at the time lhood_canonical/5 is called, so lhood_canonical/5 only needs to compute LLst, the log-likelihood of Mst, to get the likelihood ratio.

Figure 5: Code for computing likelihood ratios (in the file tutorial/simple_concept_learning.pl)
\begin{figure}\centering
\begin{verbatim}lhood_canonical(Mst,_Mi,LLst,LLi,Ratio) :-
	model_llhood(Mst,LLst),   % compute the log-likelihood of Mst
	Ratio is exp(LLst - LLi).\end{verbatim}\end{figure}

To run MCMCMS, both model_llhood/2 and lhood_canonical/5 must be defined by the user. However, if lhood_canonical/5 does not use model_llhood/2 then the latter can be given a `dummy' definition. See Section 7.1 for further details.
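Such a dummy can be a single stub clause, for example (a sketch; since the ratio code never calls it, the value returned is irrelevant):

\begin{verbatim}model_llhood(_Model, 0).   % dummy: lhood_canonical/5 never consults it\end{verbatim}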

Now that prior, data and likelihood have been defined we can run MCMCMS to produce a sample from the posterior. To do this we start up Prolog (either SICStus or Yap) and load the file tutorial/simple_concept_learning.pl which is displayed in Fig 6.

Figure 6: Main file (tutorial/simple_concept_learning.pl) for simple concept learning.
\begin{figure}\centering
\begin{verbatim}:- ensure_loaded( '../run.pl' ).run...
...<X2 , XY>=Y1 , XY=<Y2. % neg inside the rectangle
\end{verbatim}\end{figure}

Most of the code in tutorial/simple_concept_learning.pl concerns the likelihood and has already been discussed. As for the rest, the first line loads the main MCMCMS file ../run.pl. The predicate run_test/0 is there just to abbreviate the given call to the run/1 predicate. run/1 is an MCMCMS predicate (i.e. the user does not define it). The user calls run/1 with a list of options to perform a given MCMC run. These options are described in detail in Appendix A. In this case we just supply a runs_file and specify a directory for output.
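For reference, here is a sketch of the non-likelihood part of the file, assembled from the visible fragments of Fig 6 and the options named in this section; the exact option syntax passed to run/1 is an assumption, so consult Appendix A for the authoritative form.

\begin{verbatim}:- ensure_loaded( '../run.pl' ).

% Sketch: the option syntax is assumed, see Appendix A.
run_test :-
	run([ runs_file('runs/concept_learning_run.pl'),
	      results_dir('out/simple_concept_learning_out')
	    ]).

exs.   % pre-processing hook called by MCMCMS; nothing to do here\end{verbatim}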

A runs_file should be a Prolog file which defines a predicate, run_data/11, that sets parameters for a number of runs of the system. Fig 7 shows the run_data/11 fact relevant to our current example. This code can be found in tutorial/runs/concept_learning_run.pl, and detailed documentation for run_data/11 can be found in Appendix B. The code given in Fig 7 says that the prior is defined in the file tutorial/slps/tabular.slp (where the slps directory is relative to the `main' file for running MCMCMS), that the predicate hyp/1 defines the prior, that the data is in data/exs.pl (again, relative to the main file) and that the run should be for 100000 iterations.

Figure 7: run_data/11 fact for concept learning (found in the file tutorial/runs/concept_learning_run.pl).
\begin{figure}\centering
\begin{verbatim}run_data(
1, % id
uc, % backtrack s...
...ll % do all post-processing operations, if any
).\end{verbatim}\end{figure}
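Most of this fact is elided in Fig 7. The sketch below is a guess at its full shape, assembled from the parameters discussed in this section and from the components of the trace filename (tr_uc_rm_exs_tab_i100K__s1.gz); the order of the middle arguments, and which arguments remain, are assumptions, so Appendix B is the authoritative reference.

\begin{verbatim}% Guessed reconstruction; argument order and elided arguments are assumptions.
run_data(
	1,             % id
	uc,            % backtrack strategy
	rm,            % consider real models only (cf. fm, see below)
	exs,           % data file (data/exs.pl) and pre-processing hook
	'tabular.slp', % SLP defining the prior (assumed position)
	hyp/1,         % predicate defining the prior (assumed position)
	100000,        % number of iterations
	1,             % random seed (assumed position)
	...,           % remaining arguments, elided in Fig 7
	all            % do all post-processing operations, if any
).\end{verbatim}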

The rm parameter instructs MCMCMS to consider only `real' models: whenever a failure is encountered, Prolog's backtracking is used to find the nearest model. Alternatively, when fm is used, failures are reported at the top level of the MCMC and the algorithm registers a non-jump iteration. For an example illustrating the differences between rm and fm see [Angelopoulos & Cussens 2004, p.5].

The exs parameter plays a dual role. Firstly, this parameter determines the name of a file to load. Unless this file name is null, it is an error if the file does not exist or is not readable. This file will generally contain the data (it does in our current example), although this is not mandatory. As long as lhood_canonical/5 is defined to give the correct likelihood ratios, MCMCMS does not mind how the data is found. Secondly, the goal exs is called. This is a `hook' which can be used to do e.g. data pre-processing. As can be seen from Fig 6, in this case exs is defined to do nothing. If exs/0 were not defined the program would still run, but a warning would be generated.

Once tutorial/simple_concept_learning.pl has been compiled, we kick off the run by simply entering run_test. at the Prolog prompt. Once the call to run_test/0 has completed, the data from the run can be found in tutorial/out/simple_concept_learning_out. Recall that out/simple_concept_learning_out was the subdirectory specified in the results_dir option given to run/1. Fig 8 shows what the MCMCMS run looks like. As Fig 8 indicates, the run_test/0 call actually performs two MCMC runs since there are two run_data/11 facts in tutorial/runs/concept_learning_run.pl. The second experiment will be discussed in Section 2.2.

Figure 8: Running MCMCMS. sicstus, compile(simple_concept_learning). and run_test. are user input.
\begin{figure}\centering
\begin{verbatim}pc275 /research/slps/mcmcms-0_1_1/tu...
...ial/out/simple_concept_learning_out')
yes
| ?-\end{verbatim}\end{figure}

The models (in this case rectangles) visited during the MCMC run are saved in the file tr_uc_rm_exs_tab_i100K__s1.gz. Note that the filename is constructed from the various options defined using run_data/11. This file is simply a (gzipped) text-file trace of the models visited: two lines per iteration plus one extra initial line recording the initial model, so 200001 lines in all. In the trace file, lines starting with a `b' record proposed models and lines starting with a `c' record the current model.

Fig 9 shows the start of the trace file which we produced for this MCMCMS run. Rectangle d is the initial model (it is chosen using the prior). Rectangle c is then proposed and as the third line shows it is accepted and thus becomes the current model. Rectangle a is then proposed but the proposal is not accepted, so rectangle c remains the current model. In fact, inspection of the trace shows that the chain remains stuck at rectangle c until the 22nd iteration.

Figure 9: The start of a trace produced by an MCMCMS run (found in the file tutorial/out/simple_concept_learning_out/tr_uc_rm_exs_tab_i100K__s1.gz).
\begin{figure}\centering
\begin{verbatim}c(rectangle(d,2,1,8,3)).
b(rectangle(c,5,6,6,7)).
c(rectangle(c,5,6,6,7)).
b(rectangle(a,1,2,3,4)).
c(rectangle(c,5,6,6,7)).\end{verbatim}\end{figure}

In this current tiny example it is trivial to pull out a simple posterior from the trace. Fig 10 shows how to pull out all the counts in the trace file. The last three lines of Fig 10 show the models actually visited (as opposed to merely proposed). The posterior probabilities estimated from the trace for rectangles b and c are 0.248 and 0.752, respectively, close to the correct values of 0.25 and 0.75. Note that d gets an estimated posterior of $1/200001 = 5 \times 10^{-6}$; by chance it happened to be the initial model. A small burn-in (throwing away an early part of the trace) would be enough to delete it entirely. The first five `b' lines of Fig 10 show that models (rectangles) are being proposed according to the prior distribution.

Figure 10: Extracting counts from the MCMC trace file
\begin{figure}\centering
\begin{verbatim}zcat tr_uc_rm_exs_tab_i100K__s1.gz | sort | uniq -c
...
... c(rectangle(c,5,6,6,7)).
1 c(rectangle(d,2,1,8,3)).\end{verbatim}\end{figure}
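The same counts can be collected inside Prolog, which is convenient for further post-processing. The following is a minimal sketch (not part of MCMCMS; all predicate names are invented) that tallies the current-model lines of a trace, assuming the file has first been gunzipped:

\begin{verbatim}:- use_module(library(lists)).                 % for select/3

% count_current(+File, -Counts): Counts pairs each Model with the
% number of times it occurs as the current model in the trace File.
count_current(File, Counts) :-
	open(File, read, Stream),
	read(Stream, Term),
	count_loop(Term, Stream, [], Counts),
	close(Stream).

count_loop(end_of_file, _Stream, Counts, Counts) :- !.
count_loop(c(Model), Stream, Counts0, Counts) :- !,   % current-model line
	( select(Model-N0, Counts0, Rest) ->
	    N is N0 + 1, Counts1 = [Model-N|Rest]
	;   Counts1 = [Model-1|Counts0]
	),
	read(Stream, Next),
	count_loop(Next, Stream, Counts1, Counts).
count_loop(_Proposal, Stream, Counts0, Counts) :-     % skip `b' lines
	read(Stream, Next),
	count_loop(Next, Stream, Counts0, Counts).\end{verbatim}

Dividing each count by the total number of current-model lines gives the posterior estimates quoted above.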

