GRaphical Models in Mallet

General CRFs in GRMM

GRMM contains a command-line interface for training CRFs with arbitrary graphical structure. This document walks you through training a two-level factorial CRF. The main class for CRFs with general structure is called ACRF, which stands for arbitrary CRF.

Your Data Files

We assume that you have a fully observed training set in which every example carries several labels and a set of features. We have to index the features and labels in your data somehow, so we assume your data is split up into a sequence. We call each sequence position a time step. Each time step has a set of labels and a feature vector.

You should have training and testing files that look like this:

  LABEL11 LABEL12 ... LABEL1k ---- feature11 feature12 ...
  LABEL21 LABEL22 ... LABEL2k ---- feature21 feature22 ...
  LABEL31 LABEL32 ... LABEL3k ---- feature31 feature32 ...
  ....

That is, each line is one time step. It contains a list of space-separated labels, followed by the special token ----, followed by the names of all the binary features that are on at that time step. Using the ---- separator allows different time steps to have different numbers of labels, although I've never actually tried this.
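To make the line format concrete, here is a minimal sketch of a parser for one time step. It is not part of GRMM: the class name TimeStepLine and the sample line in main (labels and feature names alike) are made up for illustration, and the only thing taken from the format above is the ---- separator between labels and features.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not a GRMM class: splits one data-file line
// into its labels and the names of its active binary features.
final class TimeStepLine {
    static final String SEPARATOR = "----";

    final List<String> labels;
    final List<String> features;

    TimeStepLine(List<String> labels, List<String> features) {
        this.labels = labels;
        this.features = features;
    }

    static TimeStepLine parse(String line) {
        String trimmed = line.trim();
        // Guard the empty case: "".split(...) would yield one empty token.
        String[] tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        List<String> all = Arrays.asList(tokens);
        int sep = all.indexOf(SEPARATOR);
        if (sep < 0) {
            throw new IllegalArgumentException("missing ---- separator: " + line);
        }
        // Everything before the separator is a label, everything after a feature.
        List<String> labels = new ArrayList<>(all.subList(0, sep));
        List<String> features = new ArrayList<>(all.subList(sep + 1, all.size()));
        return new TimeStepLine(labels, features);
    }

    public static void main(String[] args) {
        // Made-up example line with two labels and two features.
        TimeStepLine step = parse("B-NP NNP ---- W=Confidence CAPITALIZED");
        System.out.println(step.labels + " | " + step.features);
    }
}
```

Because the number of labels is simply whatever precedes the separator, a parser like this accepts time steps with differing label counts without any extra work.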

The command-line interface supports only binary features, but the underlying inference and learning code fully supports continuous features. If you wanted to use continuous features, you'd have to make some minor changes to the command-line code, which I won't go into.

Now, GRMM doesn't assume that there's a chain structure among the time steps unless you tell it to. If you're doing document classification, for example, each time step could be an entire document. Or if you were doing 2D classification in an image, then each "time step" would be a node in the grid, and the feature vector would be the features local to that node. The time-step framework is just there to make it easier to read in your training data and output results.

Two examples of such data files are conll2000.train1k.txt and conll2000.test1k.txt. We will be using them in the rest of this tutorial.

Templates

Parameter tying is accomplished through the notion of templates. For each training instance, the ACRF trainer creates an unrolled graph, which is the flat factor graph for p(y|x) that is defined by the CRF. (Think of unrolling a DBN.)

A template object represents a bunch of factors that have tied parameters. Given a training instance, the template object tells the ACRF trainer which sets of variables should have factors, by adding them to an object called an UnrolledGraph. Then the ACRF training code knows to give all those factors the same parameters, because they come from the same template. For example, a linear-chain CRF has one template object that adds all of the first-order edges. (This class is ACRF.BigramTemplate.)
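As a toy illustration of what tied parameters mean, the sketch below scores a label sequence with a single weight table that is reused at every first-order edge. It is not GRMM code: the class TiedChainScore and its weights are invented for this example, and it ignores the input features that real CRF factors also depend on.

```java
// Toy illustration of parameter tying; none of this is GRMM API.
final class TiedChainScore {

    // Sums one edge score per adjacent label pair. Every edge looks up
    // the SAME pairWeights table, which is what "tied" means here.
    static double score(int[] labels, double[][] pairWeights) {
        double total = 0.0;
        for (int t = 0; t + 1 < labels.length; t++) {
            total += pairWeights[labels[t]][labels[t + 1]];
        }
        return total;
    }

    public static void main(String[] args) {
        // Invented weights for a two-label problem.
        double[][] pairWeights = {
            {0.5, -1.0},
            {0.25, 2.0},
        };
        int[] labels = {0, 1, 1, 0};
        // Edges (0,1), (1,1), (1,0) contribute -1.0 + 2.0 + 0.25.
        System.out.println(score(labels, pairWeights));
    }
}
```

However long the sequence gets, the number of parameters stays fixed at the size of the one shared table, which is the point of a template.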

The major method implemented by Template is addInstantiatedCliques. This method is called by the ACRF training harness. It is expected to take an UnrolledGraph, and add all the factors that belong in the graph based on the input. These factors should be instances of ACRF.UnrolledVarSet.

To create your own models, the main thing you need to do is write your own subclass of ACRF.Template. A convenient class to subclass is ACRF.SequenceTemplate. See ACRF.BigramTemplate for an example.

The Command-Line Interface

The command-line interface is handled through GenericAcrfTui. Its most important parameters are:

  --training    Name of the training file
  --testing     Name of the testing file
  --model-file  A text file specifying which template objects to use.

As an example, we'll train a factorial DCRF on the CoNLL 2000 data, as in Sutton, Rohanimanesh, and McCallum (ICML 2004). (Before doing any of this, you need to build GRMM using make.)

First, set up your data files. Toy data subsets are provided in data/grmm/conll2000.train1k.txt and data/grmm/conll2000.test1k.txt.

Second, create a text file that describes the templates we want to use. We want three templates: one for the linear-chain edges across NP, one for the linear-chain edges across POS, and one for the in-between edges. So create a text file called tmpls.txt, with three lines:

  new ACRF.BigramTemplate (0)
  new ACRF.BigramTemplate (1)
  new ACRF.PairwiseFactorTemplate (0,1)
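To picture the graph these three lines describe, the sketch below lists the variable pairs that would receive a factor for a short sequence. It rests on an assumption drawn from the description above rather than from the GRMM source: a bigram template over level k links level k at adjacent time steps, and the pairwise template over (0,1) links the two levels within the same time step. The class FactorialEdges and the y[t][k] naming are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: none of these names are GRMM classes.
final class FactorialEdges {

    // Names the label variable at time step t, label level k.
    static String var(int t, int level) {
        return "y[" + t + "][" + level + "]";
    }

    // Assumed effect of a bigram template on one level:
    // an edge between consecutive time steps within that level.
    static List<String> chainEdges(int numSteps, int level) {
        List<String> edges = new ArrayList<>();
        for (int t = 0; t + 1 < numSteps; t++) {
            edges.add(var(t, level) + " -- " + var(t + 1, level));
        }
        return edges;
    }

    // Assumed effect of the pairwise template on two levels:
    // an edge between the levels at each single time step.
    static List<String> crossEdges(int numSteps, int levelA, int levelB) {
        List<String> edges = new ArrayList<>();
        for (int t = 0; t < numSteps; t++) {
            edges.add(var(t, levelA) + " -- " + var(t, levelB));
        }
        return edges;
    }

    public static void main(String[] args) {
        int numSteps = 3;
        System.out.println("level 0 chain: " + chainEdges(numSteps, 0));
        System.out.println("level 1 chain: " + chainEdges(numSteps, 1));
        System.out.println("between levels: " + crossEdges(numSteps, 0, 1));
    }
}
```

Under that reading, a sequence of T time steps gets T-1 edges from each chain template and T edges from the pairwise template, with all edges from one template sharing parameters.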

Third, train and test the ACRF. The following command line will do it:

  java -cp $GRMM/class:$GRMM/lib/mallet-deps.jar:$GRMM/lib/grmm-deps.jar \
    edu.umass.cs.mallet.grmm.learning.GenericAcrfTui \
    --training $GRMM/data/grmm/conll2000.train1k.txt \
    --testing  $GRMM/data/grmm/conll2000.test1k.txt \
    --model-file tmpls.txt > stdout.txt 2> stderr.txt

This TUI will print lots of diagnostic information, and save a serialized copy of the trained model.