cc.mallet.classify
Class FeatureConstraintUtil

java.lang.Object
  extended by cc.mallet.classify.FeatureConstraintUtil

public class FeatureConstraintUtil
extends java.lang.Object

Utility functions for creating feature constraints that can be used with generalized expectation (GE) training.

Author:
Gregory Druck gdruck@cs.umass.edu

Constructor Summary
FeatureConstraintUtil()
           
 
Method Summary
static java.util.HashMap<java.lang.Integer,java.util.ArrayList<java.lang.Integer>> labelFeatures(InstanceList list, java.util.ArrayList<java.lang.Integer> features)
          Label features using the heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" by Gregory Druck, Gideon Mann, and Andrew McCallum.
static java.util.HashMap<java.lang.Integer,double[]> readConstraintsFromFile(java.lang.String filename, InstanceList data)
          Reads feature constraints from a file, whether they are stored using Strings or indices.
static java.util.HashMap<java.lang.Integer,double[]> readConstraintsFromFileIndex(java.lang.String filename, InstanceList data)
          Reads feature constraints stored using indices from a file.
static java.util.HashMap<java.lang.Integer,double[]> readConstraintsFromFileString(java.lang.String filename, InstanceList data)
          Reads feature constraints stored using strings from a file.
static java.util.ArrayList<java.lang.Integer> selectFeaturesByInfoGain(InstanceList list, int numFeatures)
          Select features with the highest information gain.
static java.util.ArrayList<java.lang.Integer> selectTopLDAFeatures(int numSelFeatures, ParallelTopicModel lda, Alphabet alphabet)
          Select top features in LDA topics.
static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingData(InstanceList list, java.util.ArrayList<java.lang.Integer> features)
          Set target distributions using estimates from data.
static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingFeatureVoting(java.util.HashMap<java.lang.Integer,java.util.ArrayList<java.lang.Integer>> labeledFeatures, InstanceList trainingData)
          Set target distributions using the feature voting heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" by Gregory Druck, Gideon Mann, and Andrew McCallum.
static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingHeuristic(java.util.HashMap<java.lang.Integer,java.util.ArrayList<java.lang.Integer>> labeledFeatures, int numLabels, double majorityProb)
          Set target distributions using the "Schapire" heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" by Gregory Druck, Gideon Mann, and Andrew McCallum.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

FeatureConstraintUtil

public FeatureConstraintUtil()
Method Detail

readConstraintsFromFile

public static java.util.HashMap<java.lang.Integer,double[]> readConstraintsFromFile(java.lang.String filename,
                                                                                    InstanceList data)
Reads feature constraints from a file, whether they are stored using Strings or indices.

Parameters:
filename - File with feature constraints.
data - InstanceList used for alphabets.
Returns:
Constraints.

readConstraintsFromFileString

public static java.util.HashMap<java.lang.Integer,double[]> readConstraintsFromFileString(java.lang.String filename,
                                                                                          InstanceList data)
Reads feature constraints stored using strings from a file. Each line has the format:
feature_name (label_name:probability)+
Labels that do not appear get probability 0.

Parameters:
filename - File with feature constraints.
data - InstanceList used for alphabets.
Returns:
Constraints.
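
As an illustration of the string format, the following minimal sketch parses a single line of the form feature_name (label_name:probability)+. It is standalone and does not call MALLET: the class StringConstraintLineSketch, the method parseLine, and the label-to-index map are hypothetical stand-ins for the lookups that the InstanceList alphabets would provide, and whitespace-separated tokens are an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of MALLET: parses one constraint line in the
// string format "feature_name (label_name:probability)+".
class StringConstraintLineSketch {

    // labelIndex is an assumed stand-in for the label alphabet lookup.
    static double[] parseLine(String line, Map<String, Integer> labelIndex) {
        String[] tokens = line.trim().split("\\s+");
        // tokens[0] is the feature name; the rest are label:probability pairs.
        double[] target = new double[labelIndex.size()]; // unlisted labels stay 0
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].lastIndexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Missing ':' in " + tokens[i]);
            }
            String label = tokens[i].substring(0, colon);
            Integer li = labelIndex.get(label);
            if (li == null) {
                throw new IllegalArgumentException("Unknown label: " + label);
            }
            target[li] = Double.parseDouble(tokens[i].substring(colon + 1));
        }
        return target;
    }

    public static void main(String[] args) {
        Map<String, Integer> labels = new LinkedHashMap<>();
        labels.put("sports", 0);
        labels.put("politics", 1);
        labels.put("tech", 2);
        // "tech" is not listed on the line, so its probability remains 0.
        double[] target = parseLine("goalie sports:0.9 politics:0.1", labels);
        System.out.println(Arrays.toString(target));
    }
}
```

In this sketch the feature name is ignored; a real reader would also map it to a feature index through the data alphabet.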

readConstraintsFromFileIndex

public static java.util.HashMap<java.lang.Integer,double[]> readConstraintsFromFileIndex(java.lang.String filename,
                                                                                         InstanceList data)
Reads feature constraints stored using indices from a file. Each line has the format:
feature_index label_0_prob label_1_prob ... label_n_prob
Here each label must appear.

Parameters:
filename - File with feature constraints.
data - InstanceList used for alphabets.
Returns:
Constraints.
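
The index format can be illustrated in the same way. The sketch below parses one line of the form feature_index label_0_prob label_1_prob ... label_n_prob and rejects lines that do not supply a probability for every label. The class IndexConstraintLineSketch and its parseLine method are hypothetical, and passing the number of labels as an int (rather than reading it from an InstanceList) is an assumption made to keep the example self-contained.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.Map;

// Hypothetical helper, not part of MALLET: parses one constraint line in the
// index format "feature_index label_0_prob label_1_prob ... label_n_prob".
class IndexConstraintLineSketch {

    static Map.Entry<Integer, double[]> parseLine(String line, int numLabels) {
        String[] tokens = line.trim().split("\\s+");
        // Every label needs a probability, so the token count is fixed.
        if (tokens.length != numLabels + 1) {
            throw new IllegalArgumentException(
                "Expected " + (numLabels + 1) + " tokens but found " + tokens.length);
        }
        Integer featureIndex = Integer.valueOf(tokens[0]);
        double[] target = new double[numLabels];
        for (int li = 0; li < numLabels; li++) {
            target[li] = Double.parseDouble(tokens[li + 1]);
        }
        return new AbstractMap.SimpleEntry<Integer, double[]>(featureIndex, target);
    }

    public static void main(String[] args) {
        Map.Entry<Integer, double[]> c = parseLine("17 0.8 0.1 0.1", 3);
        System.out.println(c.getKey() + " -> " + Arrays.toString(c.getValue()));
    }
}
```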

selectFeaturesByInfoGain

public static java.util.ArrayList<java.lang.Integer> selectFeaturesByInfoGain(InstanceList list,
                                                                              int numFeatures)
Select features with the highest information gain.

Parameters:
list - InstanceList for computing information gain.
numFeatures - Number of features to select.
Returns:
List of features with the highest information gains.
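
The selection criterion can be illustrated with a small standalone sketch. It assumes binary feature presence and uses IG = H(Y) - P(f) H(Y | f) - P(not f) H(Y | not f), then keeps the features with the largest values. The class InfoGainSelectionSketch, its methods, and the boolean document-by-feature matrix are hypothetical; this does not necessarily match how MALLET computes information gain internally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration, not MALLET code: rank binary features by
// information gain with respect to the instance labels.
class InfoGainSelectionSketch {

    // Entropy (in nats) of the distribution obtained by normalizing counts.
    static double entropy(double[] counts) {
        double total = 0;
        for (double c : counts) total += c;
        if (total == 0) return 0;
        double h = 0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * Math.log(p);
            }
        }
        return h;
    }

    // present[d][f] is true when feature f occurs in document d.
    static double infoGain(boolean[][] present, int[] labels, int numLabels, int feature) {
        double[] all = new double[numLabels];
        double[] with = new double[numLabels];
        double[] without = new double[numLabels];
        int nWith = 0;
        for (int d = 0; d < labels.length; d++) {
            all[labels[d]]++;
            if (present[d][feature]) {
                with[labels[d]]++;
                nWith++;
            } else {
                without[labels[d]]++;
            }
        }
        double n = labels.length;
        return entropy(all)
            - (nWith / n) * entropy(with)
            - ((n - nWith) / n) * entropy(without);
    }

    // Returns the indices of the numFeatures features with the highest gain.
    // Assumes at least one document; ties keep their original index order.
    static List<Integer> selectTop(boolean[][] present, int[] labels,
                                   int numLabels, int numFeatures) {
        int numAll = present[0].length;
        final double[] gain = new double[numAll];
        List<Integer> order = new ArrayList<>();
        for (int f = 0; f < numAll; f++) {
            gain[f] = infoGain(present, labels, numLabels, f);
            order.add(f);
        }
        order.sort(Comparator.comparingDouble((Integer f) -> -gain[f]));
        return new ArrayList<>(order.subList(0, Math.min(numFeatures, numAll)));
    }

    public static void main(String[] args) {
        // Feature 0 separates the two labels perfectly; feature 1 does not.
        boolean[][] present = {
            {true, true}, {true, false}, {false, true}, {false, false}
        };
        int[] labels = {0, 0, 1, 1};
        System.out.println(selectTop(present, labels, 2, 1));
    }
}
```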

selectTopLDAFeatures

public static java.util.ArrayList<java.lang.Integer> selectTopLDAFeatures(int numSelFeatures,
                                                                          ParallelTopicModel lda,
                                                                          Alphabet alphabet)
Select top features in LDA topics.

Parameters:
numSelFeatures - Number of features to select.
lda - ParallelTopicModel that provides the LDA topics.
alphabet - The vector dataset alphabet.
Returns:
ArrayList with the int indices of the selected features.

setTargetsUsingData

public static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingData(InstanceList list,
                                                                                java.util.ArrayList<java.lang.Integer> features)
Set target distributions using estimates from data.

Parameters:
list - InstanceList used to estimate target distributions.
features - List of features for constraints.
Returns:
Constraints (map of feature index to target distribution), with target distributions set using estimates from supplied data.
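
One plausible way to estimate such targets is to count, for each selected feature, the labels of the instances that contain it and then normalize the counts. The sketch below does this on a toy boolean document-by-feature matrix. The class DataTargetsSketch, the method estimateTargets, and the smoothing parameter (a small positive constant that keeps every probability nonzero) are assumptions for illustration and are not taken from the MALLET source.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not MALLET code: estimate a target label
// distribution for each feature from labeled data.
class DataTargetsSketch {

    // present[d][f] is true when feature f occurs in document d.
    static Map<Integer, double[]> estimateTargets(boolean[][] present, int[] labels,
                                                  int numLabels, List<Integer> features,
                                                  double smoothing) {
        if (smoothing <= 0) {
            // Without smoothing, a feature that never occurs would give 0/0.
            throw new IllegalArgumentException("smoothing must be positive");
        }
        Map<Integer, double[]> constraints = new HashMap<>();
        for (int f : features) {
            double[] dist = new double[numLabels];
            // Count the labels of the documents that contain feature f.
            for (int d = 0; d < labels.length; d++) {
                if (present[d][f]) {
                    dist[labels[d]] += 1.0;
                }
            }
            // Smooth, then normalize so the entries sum to 1.
            double sum = 0;
            for (int li = 0; li < numLabels; li++) {
                dist[li] += smoothing;
                sum += dist[li];
            }
            for (int li = 0; li < numLabels; li++) {
                dist[li] /= sum;
            }
            constraints.put(f, dist);
        }
        return constraints;
    }

    public static void main(String[] args) {
        boolean[][] present = {{true}, {true}, {true}, {false}};
        int[] labels = {0, 0, 1, 1};
        Map<Integer, double[]> c =
            estimateTargets(present, labels, 2, Arrays.asList(0), 1e-8);
        System.out.println(Arrays.toString(c.get(0)));
    }
}
```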

setTargetsUsingHeuristic

public static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingHeuristic(java.util.HashMap<java.lang.Integer,java.util.ArrayList<java.lang.Integer>> labeledFeatures,
                                                                                     int numLabels,
                                                                                     double majorityProb)
Set target distributions using the "Schapire" heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" by Gregory Druck, Gideon Mann, and Andrew McCallum.

Parameters:
labeledFeatures - HashMap of feature indices to lists of label indices for that feature.
numLabels - Total number of labels.
majorityProb - Probability mass divided among majority labels.
Returns:
Constraints (map of feature index to target distribution), with target distributions set using heuristic.
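
Based on the parameter descriptions above, the heuristic can be sketched as follows: majorityProb is split evenly among the labels associated with a feature, and the remaining 1 - majorityProb is split evenly among the other labels. The class HeuristicTargetsSketch and its methods are hypothetical, and the uniform fallback used when a feature has no labels or every label is an assumption of this sketch rather than documented behavior.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not MALLET code: build target distributions
// from labeled features with a majority-mass heuristic.
class HeuristicTargetsSketch {

    // featureLabels holds the distinct label indices associated with a feature.
    static double[] heuristicTarget(List<Integer> featureLabels, int numLabels,
                                    double majorityProb) {
        int k = featureLabels.size();
        double[] dist = new double[numLabels];
        if (k == 0 || k == numLabels) {
            // Assumed fallback: no preference can be expressed, so use uniform.
            Arrays.fill(dist, 1.0 / numLabels);
            return dist;
        }
        // Non-associated labels share the leftover mass equally.
        Arrays.fill(dist, (1.0 - majorityProb) / (numLabels - k));
        // Associated labels share majorityProb equally.
        for (int li : featureLabels) {
            dist[li] = majorityProb / k;
        }
        return dist;
    }

    static Map<Integer, double[]> setTargets(Map<Integer, List<Integer>> labeledFeatures,
                                             int numLabels, double majorityProb) {
        Map<Integer, double[]> constraints = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : labeledFeatures.entrySet()) {
            constraints.put(e.getKey(),
                heuristicTarget(e.getValue(), numLabels, majorityProb));
        }
        return constraints;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> labeled = new HashMap<>();
        labeled.put(42, Arrays.asList(1)); // feature 42 is associated with label 1
        Map<Integer, double[]> c = setTargets(labeled, 3, 0.9);
        System.out.println(Arrays.toString(c.get(42)));
    }
}
```

With three labels, one associated label, and majorityProb = 0.9, the associated label receives 0.9 and each of the other two receives about 0.05.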

setTargetsUsingFeatureVoting

public static java.util.HashMap<java.lang.Integer,double[]> setTargetsUsingFeatureVoting(java.util.HashMap<java.lang.Integer,java.util.ArrayList<java.lang.Integer>> labeledFeatures,
                                                                                         InstanceList trainingData)
Set target distributions using the feature voting heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" by Gregory Druck, Gideon Mann, and Andrew McCallum.

Parameters:
labeledFeatures - HashMap of feature indices to lists of label indices for that feature.
trainingData - InstanceList to use for computing expectations with feature voting.
Returns:
Constraints (map of feature index to target distribution), with target distributions set using feature voting.

labelFeatures

public static java.util.HashMap<java.lang.Integer,java.util.ArrayList<java.lang.Integer>> labelFeatures(InstanceList list,
                                                                                                        java.util.ArrayList<java.lang.Integer> features)
Label features using the heuristic described in "Learning from Labeled Features using Generalized Expectation Criteria" by Gregory Druck, Gideon Mann, and Andrew McCallum.

Parameters:
list - InstanceList used to compute statistics for labeling features.
features - List of features to label.
Returns:
Labeled features, HashMap mapping feature indices to list of labels.