Hungarian Webcorpus

With over 1.48 billion words unfiltered (589 million words fully filtered), this is by far the largest Hungarian language corpus, and unlike the Hungarian National Corpus (125 million words), it is available in its entirety under a permissive Open Content license. The Hungarian Webcorpus was crawled in the winter of 2003 as part of the WordSword project at the Media Research and Education Centre.

The corpus consists of 18 million pages downloaded from the .hu domain, thus representing common written language fairly extensively. Duplicate texts and files that contained no usable text were filtered out. We stratified the remainder into four sections according to the proportion of words in a page that were accepted by a spellchecker.

  • Allowing only 40% unrecognised words, all non-Hungarian documents were filtered out with great reliability.
  • At a threshold of 8% all pages without accents disappeared, but those containing internet and other jargon still remained.
  • At the 4% limit only those pages remain which contain fewer mistakes than an average print document. Decreasing the limit further would not increase the quality of the remaining text, but it would eliminate all pages that do not adhere to a strict spelling norm.
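The stratification described above can be pictured with a minimal sketch. It is not the code used to build the corpus: the function names, the `is_known` spellchecker predicate, the handling of empty pages, and the choice of "at most" for each limit are assumptions made for illustration. The sketch also assumes the strata are nested, so a page that passes the 4% limit belongs to the 8% and 40% strata as well.

```python
from typing import Callable, Iterable

# Limits taken from the text: maximum share of unrecognised words per stratum.
THRESHOLDS = (0.40, 0.08, 0.04)


def unrecognised_ratio(words: Iterable[str], is_known: Callable[[str], bool]) -> float:
    """Return the share of words rejected by the spellchecker predicate."""
    words = list(words)
    if not words:
        return 1.0  # assumption: an empty page counts as entirely unusable
    unknown = sum(1 for w in words if not is_known(w))
    return unknown / len(words)


def strata_for(ratio: float) -> list[str]:
    """List every stratum a page belongs to (assumes nested strata)."""
    labels = ["full"]
    for limit in THRESHOLDS:
        if ratio <= limit:  # assumption: the limit itself is still accepted
            labels.append(f"{limit:.0%}")
    return labels
```

With a toy lexicon, `strata_for(unrecognised_ratio(page_words, lexicon.__contains__))` returns the labels of the strata that the page would fall into under these assumptions.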

The Size of the Corpus

corpus   pages (million)   tokens (million)   types (million)
full     3.5               1486               19.1
40%      3.125             1310               15.4
8%       1.918             928                10.9
4%       1.221             589                7.2

 

Download

The Webcorpus may be downloaded in two formats: as a frequency dictionary based on the texts and as the original texts. If you use the corpus or the frequency dictionary, please cite:

Halácsy, P., Kornai, A., Németh, L., Rung, A., Szakadát, I., and Trón, V. (2004).
Creating open language resources for Hungarian.
In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004).

Kornai, A., Halácsy, P., Nagy, V., Oravecz, Cs., Trón, V., and Varga, D. (2006).
Web-based frequency dictionaries for medium density languages.
In: Proceedings of the 2nd International Workshop on Web as Corpus,
edited by Adam Kilgarriff and Marco Baroni, ACL-06, pages 1-9.

Frequency dictionary

Based on the complete collection of texts, we produced a frequency dictionary in the form of a simple tab-separated, ISO Latin-2 encoded Unix text file with five fields per row. The first field contains the word; the other four give the number of times the word appeared in the complete Webcorpus and in the 40%, 8%, and 4% strata, in that order. Since sorting such a large word list is not an easy task, we make the vocabulary available both in alphabetical order and in order of frequency. We also provide a list of the 100,000 most frequent word forms. This is sufficient for many tasks, as it covers roughly 95% of the word tokens found in colloquial Hungarian.

filename                                 size                                       description
web2.2-alfa-sorted.txt.gz                ~100 MB compressed, ~400 MB uncompressed   the complete dictionary in alphabetical order
web2.2-freq-sorted.txt.gz                as above                                   the complete dictionary sorted by frequency
web2.2-freq-sorted.top100k.txt           ~3 MB uncompressed                         the 100,000 most frequent words in order of frequency, with frequency data
web2.2-freq-sorted.top100k.nofreqs.txt   ~1 MB uncompressed                         the 100,000 most frequent words in order of frequency, without frequency data (simple wordlist)
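As an illustration of the five-field format, the following sketch reads one of the compressed dictionary files and estimates how much of the token mass the most frequent words cover. The names `Entry`, `parse_line`, `read_entries`, and `coverage` are hypothetical, and the sketch assumes that the four frequency fields are plain integers in the order given above.

```python
import gzip
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    word: str
    full: int  # count in the complete Webcorpus
    s40: int   # count in the 40% stratum
    s8: int    # count in the 8% stratum
    s4: int    # count in the 4% stratum


def parse_line(line: str) -> Entry:
    """Split one tab-separated row into the word and its four counts."""
    word, *counts = line.rstrip("\r\n").split("\t")
    if len(counts) != 4:
        raise ValueError(f"expected 5 fields, got {len(counts) + 1}")
    return Entry(word, *(int(c) for c in counts))


def read_entries(path: str) -> Iterator[Entry]:
    """Stream entries from a gzip-compressed, ISO Latin-2 encoded file."""
    with gzip.open(path, "rt", encoding="iso-8859-2") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


def coverage(entries: Iterable[Entry], top_n: int) -> float:
    """Share of full-corpus tokens covered by the top_n most frequent words."""
    counts = sorted((e.full for e in entries), reverse=True)
    total = sum(counts)
    return sum(counts[:top_n]) / total if total else 0.0
```

For example, `coverage(read_entries("web2.2-freq-sorted.txt.gz"), 100_000)` would compute the coverage of the top 100,000 word forms over the complete dictionary. Note that `coverage` holds all counts in memory, which may be slow for the full file.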

Corpus

We make only the 4% threshold corpus available for download (the others are available upon request), since it already contains 589 million words from 1.221 million Hungarian web pages and takes up about 4 GB even when compressed. It is published in ten independent parts of nearly equal size, each in a tar.gz file.

web2-4p-0.tar.gz        09-Jun-2004 22:15  365M
web2-4p-1.tar.gz        09-Jun-2004 22:22  377M
web2-4p-2.tar.gz        09-Jun-2004 22:30  371M
web2-4p-3.tar.gz        09-Jun-2004 22:38  373M
web2-4p-4.tar.gz        09-Jun-2004 22:46  366M
web2-4p-5.tar.gz        09-Jun-2004 22:54  370M
web2-4p-6.tar.gz        09-Jun-2004 23:01  372M
web2-4p-7.tar.gz        09-Jun-2004 23:09  371M
web2-4p-8.tar.gz        09-Jun-2004 23:17  370M
web2-4p-9.tar.gz        09-Jun-2004 23:25  375M

Depending on how much of the corpus you need, download one or more files and unpack them in the same folder. The documents unpack into separate files in the content/ folder, each segmented into words and sentences, in rough XML format (& signs are not escaped) with ISO Latin-2 encoding.
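Because the & signs are not escaped, a strict XML parser is likely to reject the files as they are. The sketch below shows one possible way to read the documents straight from a downloaded part, decode them from ISO Latin-2, and escape bare ampersands first. The function names are hypothetical, and leaving anything that already looks like an entity (such as `&amp;`) untouched is an assumption; no particular tag names are assumed.

```python
import re
import tarfile
from typing import Iterator, Tuple

# Matches an "&" that does not already start an entity such as &amp; or &#225;
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text: str) -> str:
    """Replace unescaped & signs with &amp; so an XML parser can accept them."""
    return BARE_AMP.sub("&amp;", text)


def iter_documents(archive_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (member name, repaired text) for every regular file in one part."""
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories such as content/
            raw = archive.extractfile(member).read()
            yield member.name, escape_bare_ampersands(raw.decode("iso-8859-2"))
```

Iterating with `for name, text in iter_documents("web2-4p-0.tar.gz")` avoids unpacking the archive to disk, which may be convenient when only a sample of the corpus is needed.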

A morphologically annotated version of the corpus can be found here. It was morphologically analysed with our hunmorph tool and disambiguated at the inflection level with our hunpos tool.

 

Web-based frontend

We provide a web-based query interface for the frequency dictionary. It is based on a morphologically analysed and disambiguated version of the Webcorpus. Morphological disambiguation was performed with our hunpos system.
