Historical Corpus of Dutch

About the Corpus application

The corpus application is developed by the Instituut voor de Nederlandse Taal (INT). The backend of the application is BlackLab, a Lucene-based search engine developed for corpora with token-based annotation (https://blacklab.ivdnt.org/). The web-based frontend is a further development of the corpus-frontend application developed by the INT (https://github.com/instituutnederlandsetaal/blacklab-frontend) in CLARIN and CLARIAH projects. Its design is inspired by the first version of the OpenSoNaR user interface by Tilburg University and Radboud University (https://github.com/Taalmonsters/WhiteLab2.0).

About the HCD

The Historisch Corpus van het Nederlands / Historical Corpus of Dutch (HCD) is a diachronic, regionally balanced, multigenre corpus of written Dutch. It aims to fill an important gap in the research infrastructure for historical Dutch, which has long lacked a balanced corpus with data from across the centuries and from various regions and genres. The HCD was built by researchers from the Vrije Universiteit Brussel and the Universiteit Leiden and is here made available by the Instituut voor de Nederlandse Taal.

Structure of the HCD

The HCD is constructed along three variational dimensions: time, region and genre.

Time: The HCD covers the sixteenth to the nineteenth century. Textual material was chosen from around the middle of each century: 1550, 1650, 1750, and 1850. For each of these dates, a margin of 20 years before and 20 years after the date was built in to find sufficient sources, resulting in four time periods: 1530-1570, 1630-1670, 1730-1770, and 1830-1870.
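The four periods follow directly from the mid-century dates and the 20-year margin. A minimal sketch in Python, with illustrative variable names that are not part of any HCD tooling:

```python
# Mid-century dates and margin as stated in the text above.
# The names MIDPOINTS, MARGIN and periods are illustrative only.
MIDPOINTS = [1550, 1650, 1750, 1850]
MARGIN = 20  # years before and after each date

# Each period runs from (date - margin) to (date + margin).
periods = [(year - MARGIN, year + MARGIN) for year in MIDPOINTS]

for start, end in periods:
    print(f"{start}-{end}")
```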

Region: The HCD comprises textual material from four regions in the northern and southern Low Countries: Holland and Zeeland in the north (in the present-day Netherlands), and Brabant and Flanders in the south (in present-day Belgium). Holland and Brabant can be considered central regions, while Zeeland and Flanders occupy a more peripheral position, so the corpus can also be used to investigate centre-periphery dynamics. Texts originate from larger cities such as Amsterdam, Antwerp, Middelburg, and Ghent, but also from smaller towns and villages (e.g. Arnemuiden, Strijpen).

Genre: The HCD comprises administrative texts, ego-documents, and pamphlets. The administrative texts are handwritten, formal texts, such as town council meeting reports and resolutions. The authors of these texts were generally accustomed to writing because of their profession. The sources for this genre relate to guilds or industry on the one hand, and to the general administration on the other. Ego-documents are less formal, handwritten texts such as travelogues, diaries and chronicles of local events or family history. The pamphlets are published texts, mostly commentaries or polemics about current affairs, politics or religious topics, but they also include public ordinances and regulations. Because of this variety, the printed pamphlets range along the continuum from more to less formal.

Procedure

All textual materials were manually transcribed from photographs of the original documents and checked multiple times. When we used existing transcriptions, as in the case of some administrative texts, these were checked against the original archival materials. References to publications, libraries and archives can be found in Van de Voorde (2022).

Size of the HCD

The HCD consists of 209 texts, amounting to 451,453 words. It comprises 58 administrative texts, 60 ego-documents, and 91 pamphlets. We aimed for 10,000 words per region and per period for each genre. For reasons of representativeness, these 10,000 words were preferably spread over multiple documents. In most cases, we are therefore dealing with fragments rather than complete texts. The figure below, taken from Van de Voorde et al. (2023), shows the word count per genre, period and region. Most of the deviations from the intended 10,000 words can be found in the sixteenth century. A smaller lacuna can be noted for the nineteenth-century ego-documents from Brabant.

Word count per genre, period, and region
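The corpus design can be read as a grid of genre, period and region cells, each with a target of 10,000 words. The following Python sketch works out the size of that grid and compares the resulting target with the reported total. It assumes that every cell was meant to reach exactly 10,000 words; the variable names are illustrative, and the net difference says nothing about which individual cells fall short or exceed the target.

```python
from itertools import product

# Dimensions and figures as reported in the text; names are illustrative.
GENRES = ["administrative texts", "ego-documents", "pamphlets"]
PERIODS = ["1530-1570", "1630-1670", "1730-1770", "1830-1870"]
REGIONS = ["Holland", "Zeeland", "Brabant", "Flanders"]
TARGET_PER_CELL = 10_000  # intended words per genre, period and region

REPORTED_WORDS = 451_453
TEXTS_PER_GENRE = {
    "administrative texts": 58,
    "ego-documents": 60,
    "pamphlets": 91,
}

# One cell for every combination of genre, period and region.
cells = list(product(GENRES, PERIODS, REGIONS))
target_total = len(cells) * TARGET_PER_CELL

# Net difference only (assumption: a uniform target for every cell).
net_difference = target_total - REPORTED_WORDS
total_texts = sum(TEXTS_PER_GENRE.values())

print(f"cells: {len(cells)}")
print(f"target total: {target_total}")
print(f"net difference from target: {net_difference}")
print(f"texts: {total_texts}")
```

Under these assumptions the grid has 48 cells and a nominal target of 480,000 words, and the three genre counts add up to the reported 209 texts.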

Funding

The HCD was co-developed by researchers from the Vrije Universiteit Brussel and the Universiteit Leiden. Particularly important was the research project ‘Pluricentricity in language history. Building blocks for an integrated history of Dutch (1550-1850)’, funded by the Fonds Wetenschappelijk Onderzoek (FWO).

References for extensive information on the compilation of the HCD

Van de Voorde, Iris. 2022. Pluricentriciteit in de taalgeschiedenis: Bouwstenen voor een geïntegreerde geschiedenis van het Nederlands (16de-19de eeuw). Amsterdam: LOT. Open access: https://www.lotpublications.nl/pluricentriciteit-in-de-taalgeschiedenis.

Van de Voorde, Iris, Gijsbert Rutten, Rik Vosters, Marijke van der Wal & Wim Vandenbussche. 2023. ‘Historical Corpus of Dutch: A new multi-genre corpus of Early and Late Modern Dutch’. Taal & Tongval 75: 114-132. Open access: https://www.aup-online.com/content/journals/10.5117/TET2023.1.006.VAND.

Credits

When referring to the HCD, please use the following reference:

Historical Corpus of Dutch. Compiled by Iris Van de Voorde, Gijsbert Rutten, Rik Vosters, Marijke van der Wal & Wim Vandenbussche, with the help of research assistants and volunteers. Vrije Universiteit Brussel & Universiteit Leiden. 1st release April 2025. Available at the Dutch Language Institute: https://hdl.handle.net/10032/tm-a3-a2.

For BlackLab:

Software available at: https://github.com/instituutnederlandsetaal/BlackLab

Does, Jesse de, Jan Niestadt & Katrien Depuydt (2017), Creating research environments with BlackLab. In: Jan Odijk and Arjan van Hessen (eds.) CLARIN in the Low Countries, pp. 151-165. London: Ubiquity Press. DOI: https://doi.org/10.5334/bbi

For the BlackLab Frontend:

Software available at: https://github.com/instituutnederlandsetaal/blacklab-frontend

Logo provenance:

Hand-painted glass plate in wooden frame for illumination cabinet, image of two men with pamphlet (between 1760-1790). Museum Rotterdam via Wikimedia Commons.