hunmorph – morphological analyzer

Hunmorph is an open source tool and programming library for spell-checking, stemming and morphological analysis of agglutinative languages, German, and other languages.

Our research group has been working on a Hungarian morphological analyzer since 2003. We first extended the codebase of MySpell, a reimplementation of the well-known Ispell spellchecker, into a generic word analysis library. Development then forked: the extended MySpell, called HunSpell, is now part of the OpenOffice.org multilingual office suite, while hunmorph is the program tuned to morphological analysis.

The hunmorph framework is built from three components:

  • the ocamorph runtime analyzer is a language-independent affix-stripping implementation
  • morphdb.hu is a lexical database and morphological grammar, which can be used by ocamorph (details can be found at http://mokk.bme.hu/resources/morphdb.hu)
  • hunlex is an off-line resource management component, which complements the efficiency of our runtime layer with a high-level description language and a configurable precompiler.

The ocamorph analyzer uses language resources that are not human-readable, the so-called aff/dic files (the same format used by OO.org’s MySpell). The aff/dic files are produced by the hunlex lexicon compiler from the morphdb.hu resources. They are platform-independent, so they are published with this distribution: unless you want to modify the lexicon or the grammar, you don’t need hunlex to create them.

Download

1. Download the ocamorph source files from the public CVS

cvs -d :pserver:anonymous:anonymous@cvs.mokk.bme.hu:/local/cvs co ocamorph

2. Download the precompiled Hungarian language resources from

ftp://ftp.mokk.bme.hu/Tool/Hunmorph/Resources/Morphdb.hu/morphdb-hu-20060525.tgz

If you want to modify the resources, you also need

3. the morphdb.hu source

cvs -d :pserver:anonymous:anonymous@cvs.mokk.bme.hu:/local/cvs co lexicons/morphdb.hu

4. and the lexicon compiler

cvs -d :pserver:anonymous:anonymous@cvs.mokk.bme.hu:/local/cvs co hunlex

Compile ocamorph

To compile ocamorph on Linux, OS X or Cygwin, you need the OCaml compiler, version 3.08.02 or newer.
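
Before building, it may help to check the installed compiler version. The following is a minimal sketch and not part of the distribution: version_ge is a hypothetical helper introduced here, and the check assumes that ocamlc is on your PATH and that "ocamlc -version" prints a dotted version number.

```shell
# version_ge A B: succeed if dotted version A is greater than or equal to B.
# Hypothetical helper; fields are compared numerically, left to right.
version_ge() {
    a=$1
    b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        # Take the first dot-separated field of each version.
        x=${a%%.*}
        y=${b%%.*}
        # Drop that field (and its dot) from the remainder.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
        # Keep only leading digits, so a field like "1+trunk" is read as 1.
        x=${x%%[!0-9]*}
        y=${y%%[!0-9]*}
        [ "${x:-0}" -gt "${y:-0}" ] && return 0
        [ "${x:-0}" -lt "${y:-0}" ] && return 1
    done
    return 0
}

if command -v ocamlc >/dev/null 2>&1; then
    have=$(ocamlc -version)
    if version_ge "$have" 3.08.02; then
        echo "OCaml $have is new enough"
    else
        echo "OCaml $have is older than 3.08.02" >&2
    fi
else
    echo "ocamlc was not found on PATH" >&2
fi
```

With a suitable compiler in place, build ocamorph: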

cd ocamorph
make

This builds the ocamorph executable. To install it, run

sudo make install

If you want to install somewhere else, use:

mkdir YOUR_DIR
make INSTALLPREFIX=YOUR_DIR install

Test your ocamorph installation:

ocamorph --help

Build binary language resource

To run ocamorph you’ll need the language-specific aff/dic resource files. If you don’t want to modify the resources, you can find precompiled aff/dic files in lexicons/morphdb.hu/out. To run ocamorph, type

echo "ablakot" | ocamorph --aff lexicons/morphdb.hu/out/morphdb_hu.aff --dic lexicons/morphdb.hu/out/morphdb_hu.dic

and you get

> ablakot
ablak/NOUN<CAS<ACC>>

As you can see, ocamorph reads from stdin and writes to stdout.
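
Because the analyzer works as a filter, a whole word list can be analyzed in one run. The sketch below is an illustration only: it assumes one word per line in a hypothetical input file words.txt, writes the result to a hypothetical analyses.txt, and uses only the --aff and --dic options shown above.

```shell
# Resource paths as used in this README; adjust them to your checkout.
AFF=lexicons/morphdb.hu/out/morphdb_hu.aff
DIC=lexicons/morphdb.hu/out/morphdb_hu.dic

# words.txt is a hypothetical input file (assumed format: one word per line).
printf '%s\n' ablakot ablak > words.txt

# Run the analyzer only if it is installed and the resources are present.
if command -v ocamorph >/dev/null 2>&1 && [ -f "$AFF" ] && [ -f "$DIC" ]; then
    ocamorph --aff "$AFF" --dic "$DIC" < words.txt > analyses.txt
else
    echo "ocamorph or the aff/dic files were not found; skipping" >&2
fi
```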

The warm-up time of ocamorph can be very long: it builds a trie from the lexicon and minimizes it. Ocamorph can save the minimized trie to a binary (platform-dependent) file. To build it, type:

echo "ablakot" | ocamorph --aff lexicons/morphdb.hu/out/morphdb_hu.aff \
    --dic lexicons/morphdb.hu/out/morphdb_hu.dic --bin morphdb_hu.bin

After this, you can use the binary resource:

echo "ablakot" | ocamorph --bin morphdb_hu.bin

Running ocamorph with the bin file is much faster, but the bin file has to be recreated on every platform, and also whenever you recompile ocamorph. If you’d like to modify the lexicon or the grammar, please refer to lexicons/morphdb.hu/README.
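
The rule for recreating the bin file can be sketched as a small shell helper. This is an illustration under stated assumptions, not part of the distribution: needs_rebuild is a hypothetical function that compares file modification times with the -nt test (available in bash, dash and ksh, though not required by POSIX), and the paths are the ones used in this README. Modification times cannot detect a change of platform, so a bin file copied from another machine still has to be rebuilt by hand.

```shell
AFF=lexicons/morphdb.hu/out/morphdb_hu.aff
DIC=lexicons/morphdb.hu/out/morphdb_hu.dic
BIN=morphdb_hu.bin

# needs_rebuild BIN SRC...: succeed if BIN is missing or older than any SRC.
# Hypothetical helper, not an ocamorph feature.
needs_rebuild() {
    bin=$1
    shift
    [ -f "$bin" ] || return 0
    for src in "$@"; do
        [ "$src" -nt "$bin" ] && return 0
    done
    return 1
}

# Locate the ocamorph executable; empty if it is not installed.
OCAMORPH=$(command -v ocamorph || true)

if [ -n "$OCAMORPH" ] && [ -f "$AFF" ] && [ -f "$DIC" ]; then
    # Rebuild when the resources or the ocamorph executable itself are newer.
    if needs_rebuild "$BIN" "$AFF" "$DIC" "$OCAMORPH"; then
        # Same build command as above; its analysis output is discarded.
        echo "ablakot" | ocamorph --aff "$AFF" --dic "$DIC" --bin "$BIN" > /dev/null
    fi
    echo "ablakot" | ocamorph --bin "$BIN"
else
    echo "ocamorph or the aff/dic files were not found; skipping" >&2
fi
```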

Please cite

  1. the paper about hunmorph: Hunmorph: Open source word analysis
  2. and the paper about morphdb.hu: Morphdb.hu: Hungarian lexical database and morphological grammar
