Mallet

Machine learning for language toolkit


Data Import - Stopwords

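The Mallet import commands import-file and import-dir can filter a list of “stopwords” from documents before processing. Removing high-frequency words can have a significant effect on model outputs. Three options control which words are removed: --remove-stopwords applies the default English stoplist, --stoplist-file applies the words in a file you supply instead of the default list, and --extra-stopwords adds the words in a file to whichever of those lists is active. Used on its own, --extra-stopwords removes nothing.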

Examples

% cat input.txt
X	X	This is my awesome stopword test!

% bin/mallet import-file --input input.txt --print-output 
name: X
target: X
input: this(0)=1.0
is(1)=1.0
my(2)=1.0
awesome(3)=1.0
stopword(4)=1.0
test(5)=1.0

% bin/mallet import-file --input input.txt --print-output --remove-stopwords 
name: X
target: X
input: awesome(0)=1.0
stopword(1)=1.0
test(2)=1.0

% cat extra.txt
awesome
test

% bin/mallet import-file --input input.txt --print-output --remove-stopwords --extra-stopwords extra.txt 
name: X
target: X
input: stopword(0)=1.0

% bin/mallet import-file --input input.txt --print-output --extra-stopwords extra.txt
name: X
target: X
input: this(0)=1.0
is(1)=1.0
my(2)=1.0
awesome(3)=1.0
stopword(4)=1.0
test(5)=1.0

% cat stoplist.txt 
my
test

% bin/mallet import-file --input input.txt --print-output --stoplist-file stoplist.txt 
name: X
target: X
input: this(0)=1.0
is(1)=1.0
awesome(2)=1.0
stopword(3)=1.0

% bin/mallet import-file --input input.txt --print-output --stoplist-file stoplist.txt --extra-stopwords extra.txt 
name: X
target: X
input: this(0)=1.0
is(1)=1.0
stopword(2)=1.0
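
The way the three options combine in the examples above can be summarized with a small shell sketch. This is an illustration built from standard tools, not Mallet's implementation: the function names tokenize and filter_text are hypothetical, and default.txt is a three-word stand-in for Mallet's much longer built-in stoplist.

```shell
#!/bin/sh
# Illustrative sketch only: it mimics the option behavior shown in the
# examples and is not Mallet's code. All names below are hypothetical.

demo_dir=$(mktemp -d)

# Stand-in for the built-in stoplist (the real list is far longer).
printf 'this\nis\nmy\n' > "$demo_dir/default.txt"
printf 'awesome\ntest\n' > "$demo_dir/extra.txt"
printf 'my\ntest\n'      > "$demo_dir/stoplist.txt"

# Lowercase the text and emit one alphabetic token per line.
tokenize() {
    tr '[:upper:]' '[:lower:]' | tr -cs '[:alpha:]' '\n' | sed '/^$/d'
}

# Print the tokens of $1 that survive filtering.
#   $2: base stoplist file ("" means no base list, so nothing is removed)
#   $3: extra stopword file ("" means none); ignored without a base list
filter_text() {
    text_in=$1 base=$2 extra=$3
    if [ -z "$base" ]; then
        printf '%s\n' "$text_in" | tokenize
        return
    fi
    # The active list is the base list plus any extra words.
    cat "$base" ${extra:+"$extra"} > "$demo_dir/active.txt"
    # -v drops matches, -x matches whole lines, -F treats words literally.
    printf '%s\n' "$text_in" | tokenize | grep -vxFf "$demo_dir/active.txt"
}

text='This is my awesome stopword test!'

# Like --remove-stopwords --extra-stopwords extra.txt: prints "stopword"
filter_text "$text" "$demo_dir/default.txt" "$demo_dir/extra.txt"
```

Passing an empty second argument models the case where neither --remove-stopwords nor --stoplist-file is given, which is why the extra words are ignored there, as in the fourth example.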