Language Resources

We have built up one of the largest data resources for German, currently comprising about 50 million sentences of newspaper text collected since 1994 and about 13 million distinct word forms (types) (Wortschatz). For legal reasons, however, only the last two years of data can be accessed via the internet.

In collaboration with partners in the respective countries, we are extending our text collection in a principled manner to other languages worldwide, including Korean and Norwegian. For each language, the word list is generated according to the Zipfian distribution, which allows for a better comparison of the vocabularies and statistical properties of languages (Corpora).
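As a rough illustration of what a frequency-based word list looks like, the following minimal sketch counts word forms in a handful of sentences and orders them by descending frequency, so that each word receives a rank. The function name `ranked_word_list`, the simple regular-expression tokenizer, and the toy sentences are assumptions made for this example only; they do not describe the pipeline actually used to build the corpora.

```python
from collections import Counter
import re


def ranked_word_list(sentences):
    """Return (rank, word, frequency) tuples, most frequent word first.

    Hypothetical helper for illustration; tokenization is a naive
    lower-cased split on word characters.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"\w+", sentence.lower()))
    # Sort by descending frequency, then alphabetically so ties are stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]


# Toy input, invented for this example.
toy_sentences = [
    "der Hund sieht die Katze",
    "die Katze sieht den Hund",
    "die Katze schläft",
]

word_list = ranked_word_list(toy_sentences)
for rank, word, freq in word_list:
    print(rank, word, freq)
```

On a large corpus, Zipf's law predicts that frequency falls off roughly in inverse proportion to rank. Lists ordered this way can therefore be compared across languages rank by rank, independently of corpus size.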

The text resources are also available on a daily basis, which enables a very precise diachronic analysis of the usage and contexts of words (Wort-des-Tages).
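A day-by-day analysis of this kind can be sketched as a relative frequency per day: how often a word occurs on a given day, divided by the number of tokens seen that day. The function name `daily_relative_frequency`, the `(date, sentence)` input format, and the sample data below are hypothetical and serve only to make the idea concrete; the method behind Wort-des-Tages may well differ.

```python
from collections import Counter
import re


def daily_relative_frequency(dated_sentences, word):
    """Map each day to the share of that day's tokens equal to `word`.

    `dated_sentences` is an iterable of (day, sentence) pairs; this input
    format is an assumption made for the example.
    """
    target = word.lower()
    totals = Counter()  # tokens seen per day
    hits = Counter()    # occurrences of the target word per day
    for day, sentence in dated_sentences:
        tokens = re.findall(r"\w+", sentence.lower())
        totals[day] += len(tokens)
        hits[day] += tokens.count(target)
    # Skip days without tokens to avoid division by zero.
    return {day: hits[day] / totals[day]
            for day in sorted(totals) if totals[day]}


# Invented sample data; the dates and sentences are placeholders.
sample = [
    ("2009-01-01", "die Wahl beginnt"),
    ("2009-01-01", "die Stadt schläft"),
    ("2009-01-02", "Wahl Wahl Wahl heute"),
]

series = daily_relative_frequency(sample, "Wahl")
for day, share in series.items():
    print(day, round(share, 3))
```

Normalizing by the daily token count keeps days with more text from looking like spikes in usage.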

As part of the D-SPIN project, other linguistic data resources are being made available by way of a service-oriented architecture.

We also maintain access to digital resources of ancient texts as part of the eAQUA project (eAQUA).