Couranten Corpus

About the Corpus application

The corpus application was developed by the Dutch Language Institute (INT). Its backend is BlackLab, a Lucene-based search engine developed for corpora with token-based annotation (http://inl.github.io/BlackLab/). The web-based frontend is a further development of the corpus-frontend application developed by the INT (https://github.com/INL/corpus-frontend) in CLARIN and CLARIAH projects. Its design is inspired by the first version of the OpenSoNaR user interface by Tilburg University and Radboud University (https://github.com/Taalmonsters/WhiteLab2.0).

About the Couranten Corpus

The Couranten Corpus comprises the seventeenth-century Dutch newspapers available on Delpher. The oldest surviving newspapers were published in 1618. The Koninklijke Bibliotheek in The Hague scanned the newspapers for the Delpher website, and the scans were processed with optical character recognition (OCR). However, OCR could not cope with the old typefaces and texts of these newspapers. For that reason the Meertens Institute set up a citizen science project, led by Nicoline van der Sijs. Using a collaborative web application created by Rob Zeeman, more than 300 volunteers of the Stichting Vrijwilligersnetwerk Nederlandse Taal transcribed and corrected all newspapers. Subsequently, interns Thomas Angenent, Aafje Baarslag, Rianne de Koning, Guido Moerdijk, Jeroen Pelkman and Lennart van Winzum checked and corrected the metadata and added new metadata, for instance on genre (advertisements, national news, international news, etc.). The final round of metadata correction was carried out at the INT.

This sizeable corpus currently contains the contents of 13 newspapers, comprising 109,532 articles and 18,926,425 words. The information in these newspapers is of interest to researchers from various disciplines, ranging from historians to historical linguists, literary scholars and art historians.

In the future, transcriptions of newly digitised newspapers from the seventeenth century and newspapers from the eighteenth century will be added to the Couranten Corpus.

This first version of the Couranten Corpus to be accessible online was released on 12 May 2022.

Newspapers

The newspaper articles in this corpus are taken from the following thirteen newspaper titles:

  1. Amsterdamse courant (1670-1699)
  2. Courante uyt Italien, Duytslandt, &c. (1619-1669)
  3. Europische courant (1642-1646)
  4. Extraordinarisse Post-tijdinghe (was: Haegse Post-Tydingen) (1641)
  5. Haegse post-tydinge (1663-1677)
  6. Haerlemse courante (1659-1662)
  7. Oprechte Haerlemsche courant (1659-1700)
  8. Opregte Leydse courant (1698)
  9. Ordinaris dingsdaeghse courante (1640-1670)
  10. Ordinarisse middel-weeckse courante (1639-1669)
  11. Tijdinghe uyt verscheyde quartieren (1619-1671)
  12. Utrechtse courant (1675-1698)
  13. VVeeckelycke courante van Europa (1656-1658)

GiGaNT Lexicon service

To make the Couranten Corpus more accessible, suggestions for query expansion are given, using the INT lexicon service with the historical computational lexicon GiGaNT-HILEX.

The current version of GiGaNT-HILEX in the lexicon service contains the lexicon modules based on the Dictionary of the Dutch Language (Woordenboek der Nederlandsche Taal, WNT) and the Dictionary of Middle Dutch (Middelnederlandsch Woordenboek, MNW).

If you want to make use of this service, please contact Katrien Depuydt (katrien.depuydt@ivdnt.org).

Linguistic Annotation

It is also possible to use the linguistic annotation to search the corpus. The linguistic annotation was created by means of a statistical part-of-speech tagger based on a Support Vector Machine, trained on the Letters as Loot corpus, and a lemmatizer that uses the INT historical lexicon and a simple statistical model of spelling variation. It is an alpha version, since no effort has yet been made to train the tagger on more appropriate training material.

The part-of-speech tagging follows the tagset and tagging principles for the annotation of diachronic corpora of historical Dutch, developed in the context of the CLARIAH+ project. This annotation layer has been added to the corpus and can be used to search it online. A detailed description can be found here.

More information about the lemmatization principles used can be found in Marijke Mooijaart, Het lemma in the GiGaNT lexicon.
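
As an illustration of searching on the linguistic annotation, the sketch below builds a token pattern in Corpus Query Language (CQL), which BlackLab supports, and composes a request URL for the "hits" operation of a BlackLab Server. It is a minimal sketch under assumptions, not a description of how this corpus is actually configured: the server address, the corpus name "couranten", the annotation names "lemma" and "pos", and the tag value "NOU-C.*" are all hypothetical and should be replaced by the values that apply to the corpus being queried.

```python
from urllib.parse import urlencode


def cql_token(**constraints: str) -> str:
    """Build one CQL token, e.g. [lemma="schip" & pos="NOU-C.*"]."""
    parts = []
    for name, value in constraints.items():
        escaped = value.replace('"', '\\"')  # keep quotes inside values valid
        parts.append(f'{name}="{escaped}"')
    return "[" + " & ".join(parts) + "]"


def hits_url(base_url: str, corpus: str, pattern: str, number: int = 20) -> str:
    """Compose a BlackLab Server 'hits' request URL for a CQL pattern."""
    query = urlencode(
        {"patt": pattern, "number": number, "outputformat": "json"}
    )
    return f"{base_url.rstrip('/')}/{corpus}/hits?{query}"


# Hypothetical values: server address, corpus name, annotation names and tag.
pattern = cql_token(lemma="schip", pos="NOU-C.*")
url = hits_url("https://example.org/blacklab-server", "couranten", pattern)

print(pattern)
print(url)
```

Searching on the lemma rather than on the word form is what makes such a query robust against the spelling variation of seventeenth-century Dutch: a single lemma stands for all the spellings that the lemmatizer has assigned to it.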

Credits

We would like to thank the Koninklijke Bibliotheek (Royal Library of the Netherlands) for giving access to images, metadata and OCR of the newspapers.

We would like to thank the volunteers of the Stichting Vrijwilligersnetwerk Nederlandse Taal for transcribing and correcting all texts and metadata.

When referring to the Couranten Corpus, please use the following reference:

Couranten Corpus (version 1.0) (May 2022) [Online Service]. Available at the Dutch Language Institute: http://hdl.handle.net/10032/tm-a2-u9.

For information on the Couranten Corpus, please contact the project leader Nicoline van der Sijs.

For BlackLab:

Software available at https://github.com/INL/BlackLab

Does, Jesse de, Jan Niestadt and Katrien Depuydt (2017), Creating research environments with BlackLab. In: Jan Odijk and Arjan van Hessen (eds.) CLARIN in the Low Countries, pp. 151-165. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi

For the corpus frontend:

Software available at: https://github.com/INL/corpus-frontend