Hunglish Corpus Version 2.0
The Hunglish Corpus is a free sentence-aligned Hungarian-English parallel corpus of about 120 million words in 4 million sentence pairs. This is the Version 2.0 release of the Corpus, approximately doubling the size of the original 1.0 release from 2005.
The Corpus can be downloaded from our ftp server. (The obsolete Version 1.0 is still available at its original location, to avoid dead links to the material.) If you have any questions, don’t hesitate to ask via the hunglish-corpus mailing list. (The main language of the list is Hungarian, but you are welcome to post in English.) The corpus may be searched through our sentence search service, which can be used as a smart bilingual lexicon.
Coverage
The raw corpus was gathered from the Web. It consists of several distinct subcorpora:
- Classical literature. (classical.lit) The raw files for “classical” material no longer under copyright came from Project Gutenberg and the Hungarian Electronic Library.
- Modern literature. (modern.lit) The sentence pair files for “modern” material still under copyright and made available to MOKK for research purposes are all shuffled together, and no other formats are provided. The genre is mostly popular fiction, predominantly fantasy, science fiction and crime fiction.
- Legal texts. (law) Files from the CELEX database and the EU Constitution.
- Software documentation. (softwaredoc) The raw files come from OpenOffice.org, Mozilla, Gnome, KDE, and other major FOSS (Free and Open Source Software) projects.
- Movie subtitles. (subtitles) The raw files were provided to MOKK for research purposes only and cannot be republished in their raw form. The sentence pair files are given in a “shuffled” version: the lines are sorted alphabetically so as to make republishing the original subtitle files impossible. This data segment has many spelling errors owing to OCR text extraction.
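The “shuffling” mentioned for the copyrighted subcorpora amounts to sorting the sentence pair lines alphabetically, which discards the original order of the text. A minimal sketch of the idea follows. The function name is hypothetical, and plain code point ordering is an assumption, since the collation actually used for the released files is not stated here.

```python
from typing import Iterable, List


def shuffle_bisentences(lines: Iterable[str]) -> List[str]:
    """Return tab-separated sentence pair lines in alphabetical order.

    Assumption: plain code point ordering; the released files may have
    been sorted with a different collation.
    """
    return sorted(lines)


# Two pairs in their original document order (made-up example data).
pairs = [
    "Másnap elutazott.\tHe left the next day.",
    "Esett az eső.\tIt was raining.",
]
print(shuffle_bisentences(pairs))
```

After sorting, neighbouring lines no longer come from neighbouring positions in the source text, so the original document cannot be read off the file.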
Structure
The main directory contains directories for the various subcorpora. In each of these, there is a directory called “bi” that contains the bisentence files. In the case of the classical.lit, law and softwaredoc directories, there are three additional subdirectories called hu, en and lad, for the raw Hungarian, raw English, and unfiltered alignment files respectively. See below for the format of these files.
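The layout described above can be written down as a short sketch. The subcorpus directory names are the ones given in the Coverage section. The root directory name "hunglish" and the function name are hypothetical and serve only as illustration.

```python
from pathlib import PurePosixPath
from typing import List

# Subcorpus directory names as listed under Coverage.
SUBCORPORA = ["classical.lit", "modern.lit", "law", "softwaredoc", "subtitles"]
# Only these subcorpora ship raw texts and unfiltered alignments.
WITH_RAW_FILES = {"classical.lit", "law", "softwaredoc"}


def expected_dirs(root: str) -> List[str]:
    """List the directories one would expect under the given root.

    The root name is an assumption; pass whatever directory the
    corpus was unpacked into.
    """
    dirs = []
    for name in SUBCORPORA:
        base = PurePosixPath(root) / name
        dirs.append(str(base / "bi"))  # bisentence files, always present
        if name in WITH_RAW_FILES:
            dirs.extend(str(base / sub) for sub in ("hu", "en", "lad"))
    return dirs


for path in expected_dirs("hunglish"):
    print(path)
```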
Statistics
subcorpus | number of documents | size, MB | tokens, million | bisentences, 1000 |
---|---|---|---|---|
Modern literature | 278 | 217 | 37.1 | 1670 |
Classical literature | 83 | 100 | 17.2 | 652 |
Movie subtitles | 437 | 19 | 3.2 | 343 |
Software docs | 9 | 9 | 1.2 | 135 |
Legal text | 20378 | 399 | 56.6 | 1351 |
Total | 21185 | 744 | 115.3 | 4151 |
Compared to the original Hunglish Corpus, there were two significant additions in Version 2.0. First, we added a new shuffled modern literature subcorpus. This amounted to 1220 thousand bisentences, increasing the size of the Modern literature subcorpus roughly fourfold. Second, we processed the CELEX legal material made available since the publication of Hunglish Version 1.0. This amounted to 400 thousand bisentences. All in all, the additions almost doubled the size of the corpus.
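The figures in the table and the paragraph above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the Total row and the growth factor of the Modern literature subcorpus. It assumes that the Version 2.0 bisentence count for that subcorpus is simply the earlier material plus the 1220 thousand added bisentences.

```python
# Rows of the statistics table:
# (documents, size in MB, tokens in millions, bisentences in thousands)
rows = {
    "Modern literature": (278, 217, 37.1, 1670),
    "Classical literature": (83, 100, 17.2, 652),
    "Movie subtitles": (437, 19, 3.2, 343),
    "Software docs": (9, 9, 1.2, 135),
    "Legal text": (20378, 399, 56.6, 1351),
}

# Column-wise sums, rounded to one decimal to absorb float noise.
totals = [round(sum(column), 1) for column in zip(*rows.values())]
print(totals)

# Assumption: modern literature before the addition = current - added.
modern_now = rows["Modern literature"][3]
modern_added = 1220
modern_growth = modern_now / (modern_now - modern_added)
print(round(modern_growth, 2))
```

Under that assumption the subcorpus grew by a factor of about 3.7, which matches the "about four" stated above.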
Copyright Questions
Some raw materials used for the Hunglish corpus are under copyright (modern literature, movie subtitles). We prevented the illegal use of copyrighted material by shuffling the texts at sentence level. This form is still useful for research purposes, while it does not infringe upon the rights holders’ interests. If you are a copyright holder and you consider the shuffled files infringing, please send us an email and we will remove the material in question from the corpus. The Hunglish Corpus is open for use (with the above restrictions) under a Creative Commons Attribution license; please refer to our publication.
How The Alignment Was Performed
- Raw text extraction: converting from pdf, html, rtf, doc etc. to plain ISO Latin text.
- Sentence boundary detection and tokenization with the rule-based huntoken tokenizer.
- Stemming with the hunmorph morphological analyzer (using our Hungarian and English morphological databases). The stemmed text is not provided; it is just an intermediate form used to improve the precision of the alignment step.
- Sentence alignment with hunalign.
Our alignment method outperforms pure statistical methods by combining such methods with the extensive use of a bilingual dictionary. In the absence of a dictionary resource, the aligner can bootstrap itself in tandem with our automatic lexicon builder.
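To make the idea of combining a statistical signal with a bilingual dictionary concrete, here is a toy sketch. It is not hunalign's actual scoring function: the function names, the whitespace tokenization, the equal weighting, and the tiny lexicon are all assumptions chosen for illustration.

```python
from typing import Dict, Set


def length_score(hu: str, en: str) -> float:
    """Statistical signal: ratio of shorter to longer length, in [0, 1]."""
    longer = max(len(hu), len(en))
    if longer == 0:
        return 1.0
    return min(len(hu), len(en)) / longer


def dictionary_score(hu: str, en: str, lexicon: Dict[str, Set[str]]) -> float:
    """Share of Hungarian tokens with a listed translation on the English side."""
    hu_tokens = hu.lower().split()
    en_tokens = set(en.lower().split())
    if not hu_tokens:
        return 0.0
    hits = sum(1 for token in hu_tokens if lexicon.get(token, set()) & en_tokens)
    return hits / len(hu_tokens)


def combined_score(hu: str, en: str, lexicon: Dict[str, Set[str]],
                   dict_weight: float = 0.5) -> float:
    """Weighted mix of both signals; the weight is an arbitrary choice."""
    return ((1 - dict_weight) * length_score(hu, en)
            + dict_weight * dictionary_score(hu, en, lexicon))


# Hypothetical two-entry lexicon and candidate sentences.
lexicon = {"kutya": {"dog"}, "ugat": {"barks"}}
hu = "a kutya ugat"
good = combined_score(hu, "the dog barks", lexicon)
bad = combined_score(hu, "it rains a lot", lexicon)
print(good > bad)
```

Both English candidates have a length similar to the Hungarian sentence, so the length signal alone barely separates them. The dictionary overlap is what favours the correct pairing in this toy case.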
Data formats
The data is provided in various simple formats:
- Sentence pair (.bi) files consist of tab-separated, matching sentence pairs. .bi files do not contain segments where deletion or contraction occurred. They are also filtered based on quality, so the full reconstruction of the raw texts is impossible from these files. Where copyright considerations made it necessary, the lines of .bi files were shuffled (sorted alphabetically).
- Alignment “ladder” (.lad) files preserve the whole of both input texts with ordering, even those segments that were not successfully aligned. In .lad files, every line is tab-separated into two columns. The first is a segment of the Hungarian text. The second is a (supposedly corresponding) segment of the English text. Such segments of the source or target text will generally consist of exactly one sentence on both sides, but can also consist of zero, or more than one, sentence. In the latter case, the special separating token “ ~~~ ” is placed between sentences. The reserved special sentence “<p>” is used as a paragraph delimiter.
- hu and en are the raw texts used, in ISO Latin-2 and ISO Latin-1 encoding respectively.
The encoding of the sentence pair and the alignment files is mixed: ISO Latin-2 on the Hungarian side, and ISO Latin-1 on the English side. The overwhelming majority of the texts use compatible subsets of these two encodings, so for viewing, the files can be considered ISO Latin-2 encoded.
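The formats described above can be read with a few lines of Python. The sketch below follows the description only: it decodes each line as ISO Latin-2 (the viewing approximation mentioned above), splits on the tab, and breaks multi-sentence segments at the ~~~ token. The function names are hypothetical, and ignoring any columns beyond the second is an assumption.

```python
from typing import List, Tuple

SENTENCE_SEPARATOR = "~~~"  # appears as " ~~~ " between sentences
PARAGRAPH_MARK = "<p>"      # reserved paragraph delimiter sentence


def parse_bi_line(raw: bytes) -> Tuple[str, str]:
    """Split one .bi line into its (Hungarian, English) sides."""
    line = raw.decode("iso-8859-2").rstrip("\r\n")
    columns = line.split("\t")
    hu = columns[0]
    en = columns[1] if len(columns) > 1 else ""
    return hu, en


def split_segment(segment: str) -> List[str]:
    """Break a segment into its sentences; an empty segment gives []."""
    return [s.strip() for s in segment.split(SENTENCE_SEPARATOR) if s.strip()]


def parse_lad_line(raw: bytes) -> Tuple[List[str], List[str]]:
    """Split one .lad line into sentence lists for both sides."""
    hu, en = parse_bi_line(raw)
    return split_segment(hu), split_segment(en)


def without_paragraph_marks(sentences: List[str]) -> List[str]:
    """Drop the reserved "<p>" sentences from a sentence list."""
    return [s for s in sentences if s != PARAGRAPH_MARK]


# Made-up example line: two Hungarian sentences aligned to one English segment.
example = "Jó napot! ~~~ Hogy van?\tHello! How are you?\n".encode("iso-8859-2")
print(parse_lad_line(example))
```

For a real file, open it in binary mode and pass each line to the parser, for example `for raw in open(path, "rb"): parse_lad_line(raw)`, where `path` points at a .lad file of your choice.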
Authors
The corpus was created as part of the Hunglish project by the joint work of the Media Research and Education Center at the Budapest University of Technology and Economics (Dániel Varga, Péter Halácsy, András Kornai, László Németh, and Viktor Trón), and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute of Linguistics (Tamás Váradi, Bálint Sass, Gergő Bottyán, Enikő Héja, Ágnes Gyarmati, Ágnes Mészáros and Dávid Labundy). Version 2.0 was built by Dániel Varga and András Rung.
Acknowledgements
The Hunglish project was supported by an ITEM grant from the Hungarian Ministry of Informatics and Communication. Version 2.0 of the corpus was created as part of the CESAR project. We wish to thank Gergő Péter Barna, András Farkas, Tamás Váradi and Attila Zséder for their invaluable help.
Reference
If you use the corpus, please cite the following paper:
D. Varga, L. Németh, P. Halácsy, A. Kornai, V. Trón, V. Nagy (2005). Parallel corpora for medium density languages. In Proceedings of the RANLP 2005, pages 590-596. (pdf)