The corpus consists of 18 million pages downloaded from the
.hu domain, and thus represents common written Hungarian fairly extensively. Texts that were present multiple times and files that contained no usable text were filtered out. We stratified the remainder into four sections according to the proportion of words on a page that were accepted by a spellchecker.
- Allowing at most 40% unrecognised words, non-Hungarian documents were filtered out with great reliability.
- At the 8% threshold, all pages written without accents disappeared, but pages containing internet slang and other jargon still remained.
- At the 4% limit, only pages that contain fewer mistakes than an average printed document remain. Decreasing the limit further would not increase the quality of the remaining text, but would eliminate all pages that do not adhere to a strict spelling norm.
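The stratification step above can be sketched as follows. This is a toy illustration, not the production pipeline: the `is_known` predicate stands in for a real Hungarian spellchecker, and the tiny `lexicon` below is invented for the example.

```python
def unrecognised_ratio(words, is_known):
    """Proportion of words NOT accepted by the spellchecker."""
    if not words:
        return 1.0
    unknown = sum(1 for w in words if not is_known(w))
    return unknown / len(words)

def stratum(words, is_known):
    """Assign a page to the strictest stratum whose threshold it meets,
    or None if it fails even the 40% threshold (likely not Hungarian)."""
    r = unrecognised_ratio(words, is_known)
    for limit in (0.04, 0.08, 0.40):  # the 4%, 8%, 40% thresholds
        if r <= limit:
            return limit
    return None

# Stand-in lexicon instead of a real spellchecker:
lexicon = {"alma", "körte", "szép", "ház"}
page = ["alma", "körte", "szép", "ház", "xyzzy"]  # 1 of 5 unrecognised = 20%
print(stratum(page, lexicon.__contains__))  # 0.4
```

A real implementation would call a spellchecker such as hunspell instead of a set lookup, but the thresholding logic is the same.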
The Size of the Corpus
|corpus||pages (million)||tokens (million)||types (million)|
The Webcorpus may be downloaded in two formats: as a frequency dictionary derived from the texts, or as the original texts themselves. If you use the corpus or the frequency dictionary, please cite:
- Halácsy Péter, Kornai András, Németh László, Rung András, Szakadát István, Trón Viktor (2004). Creating open language resources for Hungarian. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004). pdf
- Kornai, A., Halácsy, P., Nagy, V., Oravecz, Cs., Trón, V. and Varga, D. (2006). Web-based frequency dictionaries for medium density languages. In: Proceedings of the 2nd International Workshop on Web as Corpus, edited by Adam Kilgarriff and Marco Baroni, ACL-06, pages 1--9. pdf
Based on the complete collection of texts, we produced a frequency dictionary in the form of a simple tab-separated, ISO Latin-2 encoded Unix text file with five fields per row. The first field contains the word; the remaining four give the number of times it appeared in the complete webcorpus and in the 40%, 8%, and 4% strata, respectively. Since sorting such a large word list is no small task, we make the vocabulary available both in alphabetical order and in frequency order. We also created a list of the 100,000 most frequent word forms; this is sufficient for many tasks, as it covers roughly 95% of the word tokens found in everyday Hungarian.
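A minimal reader for the five-field format just described might look like this. The column order (word, then counts for the complete corpus and the 40%, 8%, and 4% strata) follows the text above; the function name is our own.

```python
import csv
import gzip

def read_freq_dict(path, encoding="latin2"):  # latin2 = ISO Latin-2
    """Yield (word, total, f40, f8, f4) tuples from a tab-separated
    frequency dictionary file, transparently handling .gz files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding=encoding) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            word, *freqs = row
            yield (word, *map(int, freqs))
```

For example, `read_freq_dict("web2.2-freq-sorted.txt.gz")` would iterate over the complete frequency-sorted dictionary without decompressing it on disk.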
|web2.2-alfa-sorted.txt.gz||~100Mb gzipped, ~400Mb uncompressed||the complete dictionary in alphabetical order|
|web2.2-freq-sorted.txt.gz||as above||the complete dictionary sorted by frequency|
|web2.2-freq-sorted.top100k.txt||~3Mb uncompressed||the 100,000 most frequent word forms in frequency order, with frequency data|
|web2.2-freq-sorted.top100k.nofreqs.txt||~1Mb uncompressed||the 100,000 most frequent word forms in frequency order, without frequency data (a simple wordlist)|
Only the 4% threshold corpus is available for download (the others are available upon request), since this stratum alone already contains 589 million words from 1.221 million Hungarian web pages, and comes to 4Gb even gzipped. It is published in 10 independent parts of roughly equal size, each as a tar.gz file.
|web2-4p-0.tar.gz||09-Jun-2004 22:15||365M|
|web2-4p-1.tar.gz||09-Jun-2004 22:22||377M|
|web2-4p-2.tar.gz||09-Jun-2004 22:30||371M|
|web2-4p-3.tar.gz||09-Jun-2004 22:38||373M|
|web2-4p-4.tar.gz||09-Jun-2004 22:46||366M|
|web2-4p-5.tar.gz||09-Jun-2004 22:54||370M|
|web2-4p-6.tar.gz||09-Jun-2004 23:01||372M|
|web2-4p-7.tar.gz||09-Jun-2004 23:09||371M|
|web2-4p-8.tar.gz||09-Jun-2004 23:17||370M|
|web2-4p-9.tar.gz||09-Jun-2004 23:25||375M|
Depending on how much of the corpus you need, download one or more files and unpack them in the same folder. The documents unpack into separate files in the
content/ folder, each segmented into words and sentences, in rough XML format (& signs are not escaped) with ISO Latin-2 encoding.
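Because the "rough XML" leaves & signs unescaped, a standard XML parser will reject the files as-is. One possible workaround (our suggestion, not part of the distribution) is to escape bare ampersands before parsing; the sample markup below is invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Match a & that does not already start a valid XML entity reference.
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)")

def parse_rough_xml(text):
    """Escape unescaped & signs, then parse with a standard XML parser."""
    return ET.fromstring(BARE_AMP.sub("&amp;", text))

doc = "<s><w>fekete&fehér</w> <w>kép</w></s>"  # invented sample markup
root = parse_rough_xml(doc)
print([w.text for w in root.iter("w")])  # ['fekete&fehér', 'kép']
```

When reading the files themselves, remember to open them with ISO Latin-2 decoding (`encoding="latin2"` in Python) before applying the fix-up.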
We provide a web-based query interface for the frequency dictionary. It is based on a morphologically analysed and disambiguated version of the Webcorpus; morphological disambiguation was performed with our hunpos tagger.