Our research group has been working on a Hungarian morphological analyzer since 2003. We first extended the codebase of MySpell, a reimplementation of the well-known Ispell spellchecker, into a generic word analysis library. Development of the library then forked: the extended MySpell, called HunSpell, is now part of the OpenOffice.org multilingual office suite, while hunmorph is the branch tuned to morphological analysis.
The hunmorph framework is built from three components:
- the ocamorph runtime analyzer is a language independent affix stripping implementation
- morphdb.hu is a lexical database and morphological grammar, which can be used by ocamorph (details can be found at http://mokk.bme.hu/resources/morphdb.hu)
- hunlex is an off-line resource management component, which complements the efficiency of our runtime layer with a high-level description language and a configurable precompiler.
The ocamorph analyzer reads language resources that are not human-readable: the so-called aff/dic files (the same format used by OO.org’s MySpell). The hunlex lexicon compiler produces the aff/dic files from the morphdb.hu resources. Because the aff/dic files are platform-independent, they are published with this distribution: if you don’t want to modify the lexicon or the grammar, you don’t need hunlex to create them.
Download
1. Download ocamorph source files from the public CVS
cvs -d :pserver:anonymous:anonymous@cvs.mokk.bme.hu:/local/cvs co ocamorph
2. Download the precompiled Hungarian language resources from
ftp://ftp.mokk.bme.hu/Tool/Hunmorph/Resources/Morphdb.hu/morphdb-hu-20060525.tgz
If you want to modify the resources, you need
3. the morphdb.hu source
cvs -d :pserver:anonymous:anonymous@cvs.mokk.bme.hu:/local/cvs co lexicons/morphdb.hu
4. and the lexicon compiler
cvs -d :pserver:anonymous:anonymous@cvs.mokk.bme.hu:/local/cvs co hunlex
Compile ocamorph
To compile ocamorph on Linux, OS X, or Cygwin, you need the OCaml compiler, version 3.08.02 or newer.
cd ocamorph
make
This compiles the ocamorph executable. You can install it with
sudo make install
If you want to install somewhere else, use:
mkdir YOUR_DIR
make INSTALLPREFIX=YOUR_DIR install
Test your ocamorph installation:
ocamorph --help
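The compiler requirement can be checked before running make. The following is a minimal sketch, not part of the ocamorph distribution: the function names version_ge and check_ocaml are hypothetical, the minimum version is written in OCaml's usual x.yy.z form (assumed here to be 3.08.2), and the comparison relies on a sort that supports the -V (version sort) option, as GNU coreutils does.

```shell
# Minimum OCaml version, in OCaml's usual x.yy.z spelling (assumption: 3.08.2).
REQUIRED_OCAML="3.08.2"

# version_ge A B: succeed if version A is greater than or equal to version B.
# Uses `sort -V`, so the smaller of the two versions is printed first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_ocaml: report whether the installed compiler is new enough.
check_ocaml() {
    if ! command -v ocamlc >/dev/null 2>&1; then
        echo "ocamlc not found in PATH" >&2
        return 1
    fi
    installed="$(ocamlc -version)"
    if version_ge "$installed" "$REQUIRED_OCAML"; then
        echo "OCaml $installed is new enough"
    else
        echo "OCaml $installed is older than $REQUIRED_OCAML" >&2
        return 1
    fi
}

# Usage: check_ocaml && make
```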
Build binary language resource
To run ocamorph you’ll need the language-specific aff/dic resource files. If you don’t want to modify the resources, you can find precompiled aff/dic files in lexicons/morphdb.hu/out. To run ocamorph, type
echo "ablakot" | ocamorph --aff lexicons/morphdb.hu/out/morphdb_hu.aff --dic lexicons/morphdb.hu/out/morphdb_hu.dic
and you get
> ablakot
ablak/NOUN<CAS<ACC>>
As you can see ocamorph reads from stdin and writes to stdout.
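Because ocamorph works as a stdin/stdout filter, a whole word list can be analyzed with ordinary shell redirection. The sketch below assumes the input file holds one word per line; the function name analyze_file and the file names in the usage comment are hypothetical, and only the --aff and --dic options shown above are used.

```shell
# Paths to the precompiled resources, as used in the example above.
AFF=lexicons/morphdb.hu/out/morphdb_hu.aff
DIC=lexicons/morphdb.hu/out/morphdb_hu.dic

# analyze_file WORDS OUT: feed the file WORDS to ocamorph on stdin
# and save everything it writes to stdout in the file OUT.
analyze_file() {
    ocamorph --aff "$AFF" --dic "$DIC" < "$1" > "$2"
}

# Usage (hypothetical file names):
#   printf '%s\n' ablakot ablak > words.txt
#   analyze_file words.txt analyses.txt
```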
The warm-up time of ocamorph can be very long: it builds a trie from the lexicon and minimizes it. Ocamorph can save the minimized trie to a binary (platform-dependent) file. To build it, type:
echo "ablakot" | ocamorph --aff lexicons/morphdb.hu/out/morphdb_hu.aff \
  --dic lexicons/morphdb.hu/out/morphdb_hu.dic --bin morphdb_hu.bin
After this you can use the binary resource
echo "ablakot" | ocamorph --bin morphdb_hu.bin
Running ocamorph with the bin file is much faster, but the bin file has to be recreated on every platform and whenever you recompile ocamorph. If you’d like to modify the lexicon or the grammar, please refer to lexicons/morphdb.hu/README.
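The rule for when the binary resource must be rebuilt can be wrapped in a small script. This is only a sketch under stated assumptions: the function names needs_rebuild and analyze are hypothetical, file modification times stand in for "the resources or the ocamorph executable have changed", and the -nt file test is a common shell extension (bash, dash, ksh) rather than strict POSIX. It uses only the --aff, --dic and --bin options shown above.

```shell
AFF=lexicons/morphdb.hu/out/morphdb_hu.aff
DIC=lexicons/morphdb.hu/out/morphdb_hu.dic
BIN=morphdb_hu.bin

# needs_rebuild: succeed if BIN is missing, or older than the aff/dic
# files or the ocamorph executable found in PATH.
needs_rebuild() {
    [ ! -f "$BIN" ] ||
    [ "$AFF" -nt "$BIN" ] ||
    [ "$DIC" -nt "$BIN" ] ||
    [ "$(command -v ocamorph)" -nt "$BIN" ]
}

# analyze: read words from stdin and write analyses to stdout,
# rebuilding BIN first when it is missing or out of date.
analyze() {
    if needs_rebuild; then
        # Same build command as above; its analysis output is discarded.
        echo "ablakot" | ocamorph --aff "$AFF" --dic "$DIC" --bin "$BIN" >/dev/null
    fi
    ocamorph --bin "$BIN"
}

# Usage: echo "ablakot" | analyze
```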
Please cite
- the paper about hunmorph: Hunmorph: Open source word analysis
- and about morphdb.hu: Morphdb.hu: Hungarian lexical database and morphological grammar