Couranten Corpus

Version 2.0 (July 2025)

About the Corpus application

The corpus application is developed by the INT. The backend of the application is BlackLab, a Lucene-based search engine developed for corpora with token-based annotation (https://blacklab.ivdnt.org/). The web-based frontend is a further development of the corpus-frontend application developed by the INT (https://github.com/instituutnederlandsetaal/blacklab-frontend) in CLARIN and CLARIAH projects. Its design is inspired by the first version of the OpenSoNaR user interface by Tilburg University and Radboud University (https://github.com/Taalmonsters/WhiteLab2.0).

About the Couranten Corpus

The Couranten Corpus comprises seventeenth-century Dutch newspapers available on Delpher. The oldest surviving newspapers were published in 1618. The Koninklijke Bibliotheek in The Hague scanned the newspapers for the Delpher website, and the scans were then processed with optical character recognition (OCR). However, OCR could not cope with the old fonts and texts of these newspapers. For this reason the Meertens Institute set up a citizen science project, led by Nicoline van der Sijs. Using a collaborative web application created by Rob Zeeman, more than 300 volunteers of the Stichting Vrijwilligersnetwerk Nederlandse Taal transcribed and corrected all newspapers. Subsequently, interns Thomas Angenent, Aafje Baarslag, Rianne de Koning, Guido Moerdijk, Jeroen Pelkman and Lennart van Winzum checked and corrected the metadata and added new metadata, for instance on genre (advertisements, national news, international news, etc.). The final correction of the metadata was carried out at the INT.

This sizeable corpus currently contains the contents of 13 newspapers, 110,260 articles and 18,232,836 words. The information in these newspapers is of interest to researchers from various disciplines, ranging from historians to historical linguists, literary scholars and art historians.

In the future, transcriptions of newly digitised newspapers from the seventeenth century and newspapers from the eighteenth century will be added to the Couranten Corpus.

The first online accessible version of the Couranten Corpus was released on 12th May 2022.

This second online accessible version of the Couranten Corpus was released on 14th July 2025.

For background information about the origins, content and language use of 17th-century newspapers see: ‘De wereld in kranten’, in: Marc van Oostendorp & Nicoline van der Sijs (2019), ‘Een mooie mengelmoes’. Meertaligheid in de Gouden Eeuw, Amsterdam, available at https://library.oapen.org/handle/20.500.12657/24873

An overview of all newspapers published in the Low Countries during the 17th century is given by Arthur der Weduwen in Dutch and Flemish Newspapers of the Seventeenth Century, 1618-1700, Leiden: 2017.

Newspapers

The newspaper articles in this corpus are taken from the following thirteen newspaper titles:

Amsterdamse courant (1670-1699)
Courante uyt Italien, Duytslandt, &c. (1619-1669)
Europische courant (1642-1646)
Extraordinarisse Post-tijdinghe (was: Haegse Post-Tydingen) (1641)
Haegse post-tydinge (1663-1677)
Haerlemse courante (1659-1662)
Oprechte Haerlemsche courant (1659-1700)
Opregte Leydse courant (1698)
Ordinaris dingsdaeghse courante (1640-1670)
Ordinarisse middel-weeckse courante (1639-1669)
Tĳdinghe uyt verscheyde quartieren (1619-1671)
Utrechtse courant (1675-1698)
VVeeckelycke courante van Europa (1656-1658)

GiGaNT Lexicon service

To make the Couranten Corpus more accessible, suggestions for query expansion are given, using the INT lexicon service with the historical computational lexicon GiGaNT-HILEX.

The current version of GiGaNT-HILEX in the lexicon service contains the lexicon modules based on the Dictionary of the Dutch Language (Woordenboek der Nederlandsche Taal, WNT) and the Dictionary of Middle Dutch (Middelnederlandsch Woordenboek, MNW).

If you want to make use of this service, please contact Katrien Depuydt (katrien.depuydt@ivdnt.org).

Linguistic Annotation

It is also possible to use the linguistic annotation to search the corpus. The linguistic annotation has been created automatically by means of a tagger-lemmatizer based on the Hugging Face Transformers library, cf. https://github.com/instituutnederlandsetaal/int-huggingface-tagger. The tagger is available in the GaLAHaD platform for linguistic annotation of Historical Dutch as ‘hug-tdn-all-enhanced’.

Since linguistic enrichment took place automatically and it was not feasible to correct all data manually, some imperfections in the data are inevitable. On a manually corrected evaluation subcorpus of the Couranten Corpus, the tagger has an accuracy of 96% on PoS, and 92% on lemma, cf. https://portal.clarin.ivdnt.org/galahad/overview/benchmarks (choose option ‘couranten’).

The part of speech tagging has been done using the tagset and tagging principles for the annotation of diachronic corpora of historical Dutch, developed in the context of the CLARIAH+ project. This annotation layer has been added to the corpus, and can also be used to search the online corpus. A detailed description can be found here.
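As an illustration of searching on the annotation layer, the sketch below builds a single-token pattern in the Corpus Query Language that BlackLab accepts and wraps it in a BlackLab Server "hits" request URL. Several things here are assumptions rather than facts about this corpus: the annotation names `lemma` and `pos`, the example tag value, the server base URL, the corpus identifier `couranten` and the exact endpoint path, which differs between BlackLab versions. The code only builds strings and does not contact any server.

```python
from urllib.parse import urlencode


def build_cql(lemma=None, pos=None):
    """Build a one-token CQL pattern, e.g. [lemma="schip" & pos="NOU-C"].

    The annotation names 'lemma' and 'pos' are assumptions; check the
    corpus documentation for the names actually used.
    """
    constraints = []
    if lemma is not None:
        constraints.append(f'lemma="{lemma}"')
    if pos is not None:
        constraints.append(f'pos="{pos}"')
    # With no constraints this yields "[]", which matches any token.
    return "[" + " & ".join(constraints) + "]"


def build_hits_url(base_url, corpus, pattern, max_hits=20):
    """Build a BlackLab Server hits URL for a CQL pattern.

    base_url and corpus are hypothetical placeholders supplied by the caller.
    """
    query = urlencode({"patt": pattern, "number": max_hits, "outputformat": "json"})
    return f"{base_url.rstrip('/')}/{corpus}/hits?{query}"


if __name__ == "__main__":
    # Hypothetical server address and illustrative tag value.
    pattern = build_cql(lemma="schip", pos="NOU-C")
    print(pattern)
    print(build_hits_url("https://example.org/blacklab-server", "couranten", pattern))
```

The same pattern string can also be pasted into the expert search field of the web interface, provided the annotation names match those of the corpus.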

More information about the lemmatization principles used can be found in Lemmatiseerprincipes voor GiGaNT, het centrale lexicon van het INT.

Credits

We would like to thank the Koninklijke Bibliotheek (Royal Library of the Netherlands) for giving access to images, metadata and OCR of the newspapers.

We would like to thank the volunteers of the Stichting Vrijwilligersnetwerk Nederlandse Taal for transcribing and correcting all texts and metadata.

When referring to the Couranten Corpus, please use the following reference:

Couranten Corpus (version 2.0) (July 2025) [Online Service]. Available at the Dutch Language Institute: https://hdl.handle.net/10032/tm-a3-c2.

For information on the Couranten Corpus, please contact the project leader Nicoline van der Sijs.

For BlackLab:

Software available at https://github.com/instituutnederlandsetaal/BlackLab

Does, Jesse de, Jan Niestadt and Katrien Depuydt (2017), Creating research environments with BlackLab. In: Jan Odijk and Arjan van Hessen (eds.) CLARIN in the Low Countries, pp. 151-165. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi

For the corpus frontend:

Software available at: https://github.com/instituutnederlandsetaal/blacklab-frontend

Logo provenance:

David Teniers (1610-1690), Bauernstube mit Zeitungsleser. Kunsthistorisches Museum (Vienna), via Wikimedia Commons.

Version information

Version 2.0 (14th July 2025): the user interface has been updated with additional grouping functionality. Features have also been added to the part of speech tags. Colophons have been added (see under Text type).