Hunglish Corpus

The Hunglish Corpus is a free sentence-aligned Hungarian-English parallel corpus of about 54.2 m words in 2.07 m sentences.

The corpus can be downloaded from our FTP server. If you have any questions, don’t hesitate to ask on the hunglish-corpus mailing list (the main language of the list is Hungarian, but you are welcome to post in English).

The corpus may be searched through our sentence search service. This can be used as a smart bilingual lexicon.

Coverage

The raw corpus was gathered from the Web:

  • Literature. The raw files for “classical” material no longer under copyright came from Project Gutenberg and the Hungarian Electronic Library. The sentence pair files for “modern” material, which is still under copyright and was made available to MOKK for research purposes, are all shuffled together, and no other formats are provided.
  • Legal texts. Files from the CELEX database and the EU Constitution.
  • Software documentation. The raw files come from OpenOffice.org, Mozilla, Gnome, KDE, and other major FOSS (Free Open Source Software) projects.
  • Movie subtitles. The raw files were provided to MOKK for research purposes only and cannot be republished in their raw form. The sentence pair files are given in a “shuffled” version: the lines are sorted alphabetically so that the original subtitle files cannot be reconstructed and republished. This data segment has many spelling errors owing to OCR text extraction.
  • Magazines and news. Translations of English magazines (not localized versions but real translations) including Diplomacy and Trade, National Geography, etc.
  • Financial reports of Hungarian companies – still under processing.

Statistics

subcorpus     size (Mb)   tokens (thousands)   sentences (thousands)
legal texts   233         3153                 1400 max
literary      85          1724                 990
software      8           127                  140
film          18          327                  390
magazines     5           36
business      under processing

Copyright Questions

Some raw materials used for the Hunglish corpus are under copyright (literature, film subtitles, magazines). We prevented the illegal use of copyrighted material by shuffling the texts at sentence level. This form is still useful for research purposes, while it does not infringe upon the rightholders’ interests. If you are a copyright holder and you consider the shuffled files infringing, please send us an email and we will remove the material in question from the corpus.

The Hunglish corpus is open for use (with the above restrictions) under a Creative Commons Attribution licence; please refer to our publication.

How The Alignment Was Performed

  • Raw text extraction: converting from PDF, HTML, RTF, DOC, etc. to plain ISO Latin text.
  • Sentence boundary detection with the rule-based huntoken tokenizer.
  • Stemming with the hunmorph morphological analyzer (using our Hungarian and English morphological databases).
  • Rough sentence translation using the Hunglish dictionary.
  • Sentence alignment with hunalign.

Our alignment method outperforms pure statistical methods by combining such methods with the extensive use of a bilingual dictionary. In the absence of a dictionary resource, the aligner can bootstrap itself in tandem with our automatic lexicon builder.
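To illustrate the idea of combining a statistical cue with a dictionary cue, here is a toy sketch in Python. It is not hunalign's actual scoring function: the function names, the token-count length cue, the equal weighting, and the tiny dictionary are all assumptions made for this example.

```python
def length_score(hu_tokens, en_tokens):
    """Statistical cue (assumed): ratio of the shorter to the longer token count."""
    longest = max(len(hu_tokens), len(en_tokens))
    if longest == 0:
        return 1.0
    return min(len(hu_tokens), len(en_tokens)) / longest


def dictionary_score(hu_tokens, en_tokens, dictionary):
    """Dictionary cue (assumed): share of Hungarian tokens that have at least
    one listed translation on the English side."""
    if not hu_tokens:
        return 0.0
    en_set = set(en_tokens)
    hits = sum(1 for tok in hu_tokens if dictionary.get(tok, set()) & en_set)
    return hits / len(hu_tokens)


def combined_score(hu_tokens, en_tokens, dictionary, weight=0.5):
    """Weighted mix of the two cues; the weight is an arbitrary illustrative choice."""
    return (weight * length_score(hu_tokens, en_tokens)
            + (1 - weight) * dictionary_score(hu_tokens, en_tokens, dictionary))


# Hypothetical mini-dictionary: Hungarian stem -> set of English translations.
toy_dictionary = {"kutya": {"dog"}, "ház": {"house"}}
good = combined_score(["a", "kutya", "ház"], ["the", "dog", "house"], toy_dictionary)
bad = combined_score(["a", "kutya", "ház"], ["completely", "unrelated"], toy_dictionary)
```

In this toy setting the matching pair receives a higher score than the mismatched one, which is the behaviour an aligner needs when choosing between candidate pairings.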

Data formats

The data is provided in various simple formats:

Sentence pair (.bi) files consist of tab-separated, matching sentence pairs. .bi files do not contain segments where deletion or contraction occurred. They are also filtered based on quality, so the full reconstruction of the raw texts is impossible from these files. Where copyright considerations made it necessary, the lines of .bi files were shuffled (sorted alphabetically).
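A minimal Python sketch of reading such sentence pairs is given below. It assumes that the Hungarian sentence comes first on each line (as in the .lad files) and that lines without exactly two columns can be skipped; the function name read_bi is made up for this example.

```python
def read_bi(lines):
    """Collect (hungarian, english) pairs from already-decoded .bi lines."""
    pairs = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        # Assumption: a well-formed line has exactly two tab-separated columns.
        if len(fields) != 2:
            continue
        pairs.append((fields[0], fields[1]))
    return pairs


sample = ["Jó napot!\tGood day!\n", "\n", "a line without a tab\n"]
pairs = read_bi(sample)
```

Because shuffled .bi files are sorted alphabetically, the order of the returned pairs carries no information about the original text.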

Alignment “ladder” (.lad) files preserve the whole of both input texts with ordering, even those segments that were not successfully aligned. In .lad files, every line is tab-separated into two columns. The first is a segment of the Hungarian text. The second is a (supposedly corresponding) segment of the English text. Such segments of the source or target text will generally consist of exactly one sentence on both sides, but can also consist of zero, or more than one, sentence. In the latter case, the special separating token “ ~~~ ” is placed between sentences. The reserved special sentence “<p>” is used as a paragraph delimiter.
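The ladder format can be unpacked with a few lines of Python. This is a sketch under two assumptions: the separating token appears with exactly one space on each side, and an empty column stands for zero sentences. The helper names are invented for this example.

```python
SENTENCE_SEP = " ~~~ "   # assumed to be surrounded by single spaces
PARAGRAPH_MARK = "<p>"


def split_segment(segment):
    """Return the sentences in one column of a ladder line (possibly none)."""
    if segment == "":
        return []
    return segment.split(SENTENCE_SEP)


def parse_lad_line(line):
    """Split one decoded .lad line into (hungarian_sentences, english_sentences)."""
    hu, _, en = line.rstrip("\r\n").partition("\t")
    return split_segment(hu), split_segment(en)


def is_paragraph_break(sentences):
    """True if a column holds only the reserved paragraph delimiter."""
    return sentences == [PARAGRAPH_MARK]


two_to_one = parse_lad_line("Első. ~~~ Második.\tFirst and second.\n")
zero_to_one = parse_lad_line("\tOnly in English.\n")
```

The lengths of the two returned lists give the alignment type of the line, for example 2-to-1 or 0-to-1 for the two sample lines.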

The encoding of the sentence pair and the alignment files is mixed: ISO Latin-2 on the Hungarian side, and ISO Latin-1 on the English side. The overwhelming majority of the texts use compatible subsets of these two encodings, so for viewing, the files can be considered ISO Latin-2 encoded.
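For exact decoding rather than viewing, one option is to split each line on the tab byte first and decode the two columns separately. The following Python sketch assumes the Hungarian column comes first; decode_pair is a name chosen for this example.

```python
def decode_pair(raw_line: bytes):
    """Decode one raw line: Hungarian column as ISO Latin-2, English as ISO Latin-1."""
    # The tab byte (0x09) is the same in both encodings, so splitting on bytes is safe.
    hu_bytes, _, en_bytes = raw_line.rstrip(b"\r\n").partition(b"\t")
    return hu_bytes.decode("iso-8859-2"), en_bytes.decode("iso-8859-1")


raw = "őr".encode("iso-8859-2") + b"\t" + "café guard".encode("iso-8859-1") + b"\n"
hu_text, en_text = decode_pair(raw)
```

Decoding the whole line as ISO Latin-2 gives the same result wherever the two encodings agree, as they do for “é” here; the per-column approach only matters for the few characters on which they differ.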

hu and en are the raw texts used, in ISO Latin-2 and ISO Latin-1 encoding respectively.

Authors

The corpus was created as part of the hunglish project by the joint work of the Media Research and Education Center at the Budapest University of Technology and Economics (Dániel Varga, Péter Halácsy, András Kornai, László Németh, and Viktor Trón), and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute of Linguistics (Tamás Váradi, Bálint Sass, Gergő Bottyán, Enikő Héja, Ágnes Gyarmati, Ágnes Mészáros and Dávid Labundy).

Acknowledgements

The Hunglish project is supported by an ITEM grant from the Hungarian Ministry of Informatics and Communication. András Aklán (BUTE) provided effective project management for the production process, and Mike Maxwell (LDC) advised us on the structure of the public area and found many bugs. We thank Magyar Telekom Rt. for infrastructure support.

Reference

If you use the corpus, please reference the following paper:

D. Varga, L. Németh, P. Halácsy, A. Kornai, V. Trón, V. Nagy (2005).
Parallel corpora for medium density languages
In Proceedings of the RANLP 2005, pages 590-596.
