Huntag – a sequential tagger for NLP using Maximum Entropy Learning and Hidden Markov Models
Introduction
Huntag can perform any kind of supervised sequential sentence tagging task. It has been used for NP chunking, Named Entity Recognition, and clause chunking.
The flexibility of Huntag comes from the fact that it will generate any kind of feature from the input data, given the appropriate Python functions. Several dozen features used regularly in NLP tasks are already implemented in the file features.py, but the user is encouraged to add any number of her own.
Once the desired features are implemented, a data set and a configuration file containing the list of feature functions to be used are all Huntag needs to perform training and tagging.
Download
The latest version of Huntag can be downloaded from our public CVS server with the following command:
cvs -d :pserver:anonymous:anonymous@cvs.mokk.bme.hu:/local/cvs co huntaggers
Note that several training corpora come with the source code, so the whole download is more than 100 megabytes.
Installation
Huntag is written in Python, so no compilation step is needed, but the Maximum Entropy toolkit must be installed with its Python bindings. It can be obtained from:
http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit.html
Usage
Tagging:
cat data | ./huntag.py --model=MODEL --features=FEATUREFILE --traincorpus=TRAINCORPUS [ --keptfeats=KEPTFEATS ] [ --lmweight LMWEIGHT=0.2 ] [ --daemon ]
MODEL specifies the maximum entropy emission model. It uses the standard serialization format of the maxent module, and is typically written by train.py.
FEATUREFILE is a human-edited description of the features to be used. At the most basic level, it assigns Python functions to feature names. See below for more details.
TRAINCORPUS is a gold standard data file from which the language model for tags is built.
KEPTFEATS is a file that contains a list of features. This list can be shorter than the one in FEATUREFILE, in which case it acts as a post-filter.
LMWEIGHT is the relative importance of the tag language model when compared to the maximum entropy emission model.
--daemon is a toggle that forces a sentence-by-sentence processing mode, suitable for, say, socket communication.
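To illustrate what LMWEIGHT controls, here is a minimal sketch of a Viterbi search that adds the log-probabilities of an emission model and a tag bigram model, with the latter scaled by lmweight. This is one plausible reading of "relative importance" and not necessarily Huntag's actual scoring. The function name viterbi, the "<s>" start symbol, the dictionary layout and the 1e-12 floor for unseen tag pairs are all assumptions made for the example.

```python
import math

FLOOR = 1e-12  # assumed floor so that unseen tag pairs get a finite log-score


def viterbi(emissions, transitions, lmweight=0.2):
    """Return the best-scoring tag sequence for one sentence.

    emissions:   one dict per token, mapping tag -> P(tag | token features)
    transitions: dict mapping (previous_tag, tag) -> P(tag | previous_tag);
                 the previous tag of the first token is "<s>"
    """
    def log(p):
        return math.log(max(p, FLOOR))

    # best[tag] = (score of the best path ending in tag, that path)
    best = {"<s>": (0.0, [])}
    for emission in emissions:
        step = {}
        for tag, p_emit in emission.items():
            candidates = [
                (score + log(p_emit)
                 + lmweight * log(transitions.get((prev, tag), 0.0)),
                 path + [tag])
                for prev, (score, path) in best.items()
            ]
            step[tag] = max(candidates, key=lambda c: c[0])
        best = step
    return max(best.values(), key=lambda c: c[0])[1]


# Toy numbers, invented for the example.
emissions = [{"O": 0.6, "B-NP": 0.4}, {"I-NP": 0.7, "O": 0.3}]
transitions = {
    ("<s>", "B-NP"): 0.5, ("<s>", "O"): 0.5,
    ("B-NP", "I-NP"): 0.9, ("B-NP", "O"): 0.1,
    ("O", "O"): 1.0,
}
without_lm = viterbi(emissions, transitions, lmweight=0.0)
with_lm = viterbi(emissions, transitions, lmweight=0.2)
```

In this toy setup the emission model alone prefers an O followed by I-NP, a sequence never seen in the tag bigram table. Giving the language model some weight steers the search to a sequence that the tag model considers possible.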
Training:
cat traindata | ./featurize.py -f FEATUREFILE | ./train.py MODEL ITERATION_NUMBER GAUSSIAN_PRIOR
MODEL specifies the name of the model file to be written, ITERATION_NUMBER gives the number of iterations for the maxent trainer, and GAUSSIAN_PRIOR is the Gaussian prior passed to the maxent trainer. (1 is a reasonable default for it.)
Specialized versions:
huntaggers comes pre-packaged with training corpora that can be used to train Hungarian and English named entity recognizers, NP-chunkers and clause chunkers. The ./*.sh shell scripts (hunner.sh, hunchunk.sh etc.) give an idea about the pairing between feature description files, training corpora, and parameters. We do not distribute the trained emission models, because they are very large. They can be easily built using the commands above, from these files:
Hungarian NER: data/ner/szuz/szeged.train hunner/features.txt
Hungarian maximal NP-chunking: data/chunk/szeged.max hunchunk/features_krPatt.txt
English NER: data/ner/eng/eng.bie1 hunner/features.txt
English base NP-chunking: data/chunk/eng/penn.base hunchunk/features_eng.txt
Hungarian clause chunking: data/cp/szeged.cp.bie1 hunchunk/features_old.txt
Data formats
Input file
Input data must be a tab-separated file with one token per line and an empty line marking each sentence boundary. Every line must contain the same number of fields. By convention, the first column (column 0, as numbered in the feature description file) holds the word form, but this is not mandatory.
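The format described above can be read with a few lines of Python. The following is a minimal sketch, not Huntag's own reader. The function names read_sentences and check_field_counts and the sample data (a word form plus a part-of-speech column) are assumptions made for the example.

```python
import io


def read_sentences(stream):
    """Yield sentences as lists of token rows; each row is a list of fields."""
    sentence = []
    for line in stream:
        line = line.rstrip("\n")
        if not line.strip():
            # An empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split("\t"))
    if sentence:
        # The last sentence may not be followed by an empty line.
        yield sentence


def check_field_counts(sentences):
    """Raise ValueError unless every token row has the same number of fields."""
    widths = {len(row) for sent in sentences for row in sent}
    if len(widths) > 1:
        raise ValueError("inconsistent field counts: %s" % sorted(widths))


# Two sentences; column 0 is the word form, column 1 an invented POS tag.
sample = "The\tDT\ndog\tNN\nbarks\tVBZ\n\nIt\tPRP\nsleeps\tVBZ\n"
sentences = list(read_sentences(io.StringIO(sample)))
check_field_counts(sentences)
```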
Train corpus file
The format of the train corpus file is the same as that of the input file, except for an extra column that gives the gold standard output for each token. This field must contain the correct tag for the word, which may be in the BI format used at CoNLL shared tasks (e.g. B-NP to mark the first word of a noun phrase, I-NP to mark the rest and O to mark words outside an NP) or in the so-called BIE1 format, which has a separate symbol for words that constitute a chunk by themselves (1-NP) and one for the last word of a multi-word phrase (E-NP). The first two characters of answer tags should always conform to one of these two conventions; the rest may be any string describing the category.
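The relationship between the two tagging conventions can be shown with a short conversion routine. This is an illustrative sketch rather than code from Huntag, and the function name bi_to_bie1 is made up for the example. It assumes that every chunk starts with a B- tag, as described above.

```python
def bi_to_bie1(tags):
    """Convert the BI tags of one sentence (B-X, I-X, O) to BIE1 tags."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, category = tag[:2], tag[2:]
        # The chunk continues only if the next tag is I- with the same category.
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I-" + category
        if prefix == "B-":
            # A chunk-initial word that is also chunk-final gets the 1- prefix.
            converted.append(("B-" if continues else "1-") + category)
        elif prefix == "I-":
            # A non-initial word that ends the chunk gets the E- prefix.
            converted.append(("I-" if continues else "E-") + category)
        else:
            raise ValueError("unexpected tag: %r" % tag)
    return converted


bi_tags = ["B-NP", "I-NP", "I-NP", "O", "B-NP", "B-NP", "I-NP"]
bie1_tags = bi_to_bie1(bi_tags)
```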
Feature description file
The feature file may start with a command specifying the default radius for features. This command is optional. Example:
!defaultRadius 5
After this, it can give values to variables that shall be used by the featurizing methods. Example:
let krpatt minLength 2
let krpatt maxLength 99
let krpatt lang hu
After the keyword let, the first argument specifies the name of the feature, the second a key, and the third a value, which is typically numeric but may also be a string, as in the lang line. The dictionary of key-value pairs will be passed to the feature.
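As a sketch of how such let lines could be turned into per-feature dictionaries, consider the following. It is not Huntag's parser. The function name parse_let_lines is hypothetical, and the choice to convert digit-only values to integers while leaving everything else as a string is an assumption.

```python
def parse_let_lines(lines):
    """Collect 'let FEATURE KEY VALUE' lines into {feature: {key: value}}."""
    options = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or fields[0] != "let":
            continue  # not a variable assignment
        _, feature, key, value = fields
        # Assumption: digit-only values become integers, others stay strings.
        if value.isdigit():
            value = int(value)
        options.setdefault(feature, {})[key] = value
    return options


config_lines = [
    "!defaultRadius 5",
    "let krpatt minLength 2",
    "let krpatt maxLength 99",
    "let krpatt lang hu",
]
options = parse_let_lines(config_lines)
```

Here options["krpatt"] is the dictionary that would be handed to the krpatt feature.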
Next come the actual assignments of feature names to features. Examples:
token ngr ngrams 0
sentence bwsamecases isBetweenSameCases 1
lex street hunner/lex/streetname.lex 0
token lemmalowered lemmaLowered 0,2
The first keyword can have one of three values: token, lex or sentence. For instance, in the first example line above, the feature name ngr is assigned to the Python method ngrams(), which returns a feature string for the given token. The third argument is a column or a comma-separated list of columns. It specifies which fields of the input should be passed to the feature function. Counting starts from zero.
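A token feature and the column selection can be sketched as follows. The exact signature Huntag expects of feature functions is not given here, so the code assumes that a feature function receives the selected fields as a list and returns one string. The names select_columns, suffix3 and differs_from_lemma are hypothetical and do not come from features.py.

```python
def select_columns(row, column_spec):
    """Pick the fields named by a spec such as '0' or '0,2' (zero-based)."""
    return [row[int(c)] for c in column_spec.split(",")]


def suffix3(fields):
    """Hypothetical token feature: the last three characters of the field."""
    return fields[0][-3:]


def differs_from_lemma(fields):
    """Hypothetical two-column token feature: does the form differ from the lemma?"""
    form, lemma = fields
    return str(form.lower() != lemma.lower())


# One token row: word form, an invented POS tag, lemma.
row = ["dogs", "NNS", "dog"]
suffix = suffix3(select_columns(row, "0"))
differs = differs_from_lemma(select_columns(row, "0,2"))
```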
For sentence features, the input is aggregated sentence-wise into a list, and this list is then passed to the feature function. This function should return a list consisting of one feature string for each of the tokens of the sentence.
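A sentence feature could look like the sketch below. It assumes that the function receives the values of the selected column for every token of the sentence as one list. The feature itself, position_in_sentence, is invented for illustration and is not part of features.py.

```python
def position_in_sentence(column):
    """Hypothetical sentence feature: one position label per token."""
    last = len(column) - 1
    labels = []
    for i in range(len(column)):
        if last == 0:
            labels.append("only")
        elif i == 0:
            labels.append("first")
        elif i == last:
            labels.append("last")
        else:
            labels.append("middle")
    return labels


labels = position_in_sentence(["The", "dog", "barks"])
```

The returned list has exactly one entry per token, which is the contract described above.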
For lex features, the second argument specifies a lexicon file rather than a Python function name. The specified token field is matched against this lexicon file.
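The simplest form of lexicon matching is a membership test, sketched below. The lexicon file format is not described here, so the code assumes one entry per line. It also matches single tokens only, whereas a real lexicon of street names may well need multi-word matching. The names load_lexicon and lex_feature and the sample entries are hypothetical.

```python
import io


def load_lexicon(stream):
    """Read one entry per line (assumed format), ignoring blank lines."""
    return {line.strip() for line in stream if line.strip()}


def lex_feature(values, lexicon):
    """Return '1' for each value found in the lexicon and '0' otherwise."""
    return ["1" if value in lexicon else "0" for value in values]


# A stand-in for a lexicon file, with invented entries.
lexicon = load_lexicon(io.StringIO("utca\nút\n\ntér\n"))
flags = lex_feature(["Kossuth", "tér", "5"], lexicon)
```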
Authors
Huntag was created by Gábor Recski and Dániel Varga. It is a reimplementation and generalization of a Named Entity Recognizer built by Dániel Varga and Eszter Simon.
License
Huntag is made available under the GNU Lesser General Public License v3.0. If you received Huntag in a package that also contains the Hungarian training corpora for Named Entity Recognition and chunking tasks, then please note that these corpora are derivative works based on the Szeged Treebank, and they are made available under the same restrictions that apply to the original Szeged Treebank.
Reference
If you use the tool, please cite the following paper:
Gábor Recski, Dániel Varga: A Hungarian NP-chunker. The Odd Yearbook (2009)
If you use some specialized version for Hungarian, please also cite the following paper:
Dóra Csendes, János Csirik, Tibor Gyimóthy and András Kocsor: The Szeged Treebank. Text, Speech and Dialogue, Lecture Notes in Computer Science, Volume 3658/2005 (2005)