Hungarian Webcorpus

With over 1.48 billion words unfiltered (589m words fully filtered), this is by far the largest Hungarian language corpus, and unlike the Hungarian National Corpus (125m words), it is available in its entirety under a permissive Open Content license. The Hungarian webcorpus was created in the winter of 2003 as part of the WordSword project at the Media Research and Education Centre.

The corpus is built from 18 million pages downloaded from the .hu domain, and thus represents common written language fairly extensively. Duplicate texts and files that contained no usable text were filtered out. We stratified the remainder into four strata according to the proportion of words in a page that were accepted by a spellchecker.

Allowing at most 40% unrecognised words filtered out all non-Hungarian documents with great reliability.
At a threshold of 8% all pages without accents disappeared, but those containing internet and other jargon still remained.
At the 4% limit only those pages remain which contain fewer mistakes than an average print document. Decreasing the limit further would not increase the quality of the remaining text, but would eliminate all pages that do not adhere to a strict spelling norm.
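
The stratification described above can be sketched as follows. This is a minimal illustration, not the project's actual pipeline: the function names, the `is_known` callback standing in for the spellchecker, and the choice of an inclusive comparison (`<=`) at each threshold are all assumptions. Only the three threshold values come from the text.

```python
# Thresholds from the text; everything else here is illustrative.
THRESHOLDS = (0.04, 0.08, 0.40)


def unrecognised_ratio(tokens, is_known):
    """Fraction of tokens rejected by the spellchecker stand-in `is_known`."""
    if not tokens:
        # Assumption: a page with no usable text counts as fully unrecognised.
        return 1.0
    rejected = sum(1 for tok in tokens if not is_known(tok))
    return rejected / len(tokens)


def strata_for(ratio, thresholds=THRESHOLDS):
    """Return every stratum a page falls into, loosest first.

    The strata are treated as nested: a page that passes the 4% limit
    also belongs to the 8% and 40% strata and to the full corpus.
    """
    labels = ["full"]
    for limit in sorted(thresholds, reverse=True):
        if ratio <= limit:
            labels.append(f"{round(limit * 100)}%")
    return labels
```

For example, a page in which a quarter of the words are rejected would land in the full corpus and the 40% stratum only.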

The Size of the Corpus

corpus	pages (million)	tokens (million)	types (million)
full	3.5	1486	19.1
40%	3.125	1310	15.4
8%	1.918	928	10.9
4%	1.221	589	7.2

Download

The Webcorpus may be downloaded in two formats: as a frequency dictionary based on the texts and as the original texts. If you use the corpus or the frequency dictionary, please cite:

Halácsy Péter, Kornai András, Németh László, Rung András,
Szakadát István, Trón Viktor (2004).
Creating open language resources for Hungarian.
In: Proceedings of the 4th International Conference on
Language Resources and Evaluation (LREC2004).

Kornai, A, Halácsy, P, Nagy, V, Oravecz, Cs, Trón, V, and Varga, D (2006).
Web-based frequency dictionaries for medium density languages.
In: Proceedings of the 2nd International Workshop on Web as Corpus,
edited by Adam Kilgarriff and Marco Baroni, ACL-06, pages 1-9.

Frequency dictionary

Based on the complete collection of texts, we produced a frequency dictionary in the form of a simple tab-separated text file with five fields per row. The first field contains the word; the remaining four give the number of times the word occurs in the complete webcorpus and in the 40%, 8%, and 4% strata, respectively. Since sorting such a big list of words is not an easy task, we make the vocabulary available both in alphabetical order and in order of frequency. Furthermore, we created a list of the 100,000 most frequent word forms. This is sufficient for many tasks, as it covers roughly 95% of the word tokens found in colloquial Hungarian.

filename	size	description
web2.2-alfa-sorted.txt.gz	~100 MB compressed, ~400 MB uncompressed	the complete dictionary in alphabetical order
web2.2-freq-sorted.txt.gz	as above	the complete dictionary in order of frequency
web2.2-freq-sorted.top100k.txt	~3 MB uncompressed	the 100,000 most frequent words in order of frequency, with frequency data
web2.2-freq-sorted.top100k.nofreqs.txt	~1 MB uncompressed	the 100,000 most frequent words in order of frequency, without frequency data (simple wordlist)
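
As an illustration of the five-field layout described above, the following sketch reads a frequency dictionary and estimates how much of the token mass the most frequent word forms cover. Several things here are assumptions rather than documented facts: the field names in `FreqRow`, the helper names, that the four frequency fields are plain integers, and the file encoding, which the text does not state (the `encoding` parameter defaults to UTF-8 and may need changing).

```python
import gzip
from typing import Iterator, NamedTuple


class FreqRow(NamedTuple):
    """One row of the dictionary; field names are illustrative."""
    word: str
    full: int  # count in the complete webcorpus
    p40: int   # count in the 40% stratum
    p8: int    # count in the 8% stratum
    p4: int    # count in the 4% stratum


def parse_line(line: str) -> FreqRow:
    """Split one tab-separated row into a word and four integer counts."""
    word, *counts = line.rstrip("\n").split("\t")
    if len(counts) != 4:
        raise ValueError(f"expected 5 tab-separated fields, got {len(counts) + 1}")
    full, p40, p8, p4 = (int(c) for c in counts)
    return FreqRow(word, full, p40, p8, p4)


def read_frequencies(path, encoding: str = "utf-8") -> Iterator[FreqRow]:
    """Yield rows from a plain or gzip-compressed dictionary file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding=encoding) as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


def coverage(rows, top_n: int) -> float:
    """Share of full-corpus tokens accounted for by the top_n most frequent words."""
    counts = sorted((row.full for row in rows), reverse=True)
    total = sum(counts)
    return sum(counts[:top_n]) / total if total else 0.0
```

Calling `coverage(read_frequencies("web2.2-freq-sorted.txt.gz"), 100_000)` would give the corresponding figure for the web text itself, which need not match the 95% quoted for colloquial Hungarian.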

Corpus

We only make the 4% threshold corpus available for download (the others are available upon request), since this one already has 589 million words from 1.221 million Hungarian web pages, and it is about 4 GB even in compressed format. It is published in ten independent parts of similar size, each in a tar.gz file.

web2-4p-0.tar.gz        09-Jun-2004 22:15  365M
web2-4p-1.tar.gz        09-Jun-2004 22:22  377M
web2-4p-2.tar.gz        09-Jun-2004 22:30  371M
web2-4p-3.tar.gz        09-Jun-2004 22:38  373M
web2-4p-4.tar.gz        09-Jun-2004 22:46  366M
web2-4p-5.tar.gz        09-Jun-2004 22:54  370M
web2-4p-6.tar.gz        09-Jun-2004 23:01  372M
web2-4p-7.tar.gz        09-Jun-2004 23:09  371M
web2-4p-8.tar.gz        09-Jun-2004 23:17  370M
web2-4p-9.tar.gz        09-Jun-2004 23:25  375M

Depending on the size of the corpus you need, download one or more files and extract them into the same folder. The documents are extracted as separate files in the content/ folder, each segmented into words and sentences, in rough XML format (& signs are not escaped).
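
Because the & signs are not escaped, a strict XML parser is likely to reject the files as they are. The sketch below shows one possible way to read documents straight from an archive and escape bare ampersands first. The function names are hypothetical, the file encoding is not stated in the text and is left as a parameter, and the regular expression assumes that anything already shaped like an entity (such as `&amp;` or `&#233;`) should be left alone.

```python
import re
import tarfile

# Matches an "&" that does not start something shaped like an XML entity.
_BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text: str) -> str:
    """Replace unescaped ampersands with &amp;, leaving entities untouched."""
    return _BARE_AMP.sub("&amp;", text)


def iter_documents(archive_path, encoding: str = "utf-8"):
    """Yield (member name, cleaned text) for each regular file in a tar.gz part.

    The encoding is an assumption; undecodable bytes are replaced
    rather than raising an error.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            raw = handle.read().decode(encoding, errors="replace")
            yield member.name, escape_bare_ampersands(raw)
```

Reading from the archive directly avoids unpacking several gigabytes to disk. The cleaned text can then be passed to whichever XML parser suits the "rough" markup.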
