The Litkey Corpus: A richly annotated longitudinal corpus of German texts written by primary school children
Published in: | Behavior research methods 2019-08, Vol.51 (4), p.1889-1918 |
---|---|
Format: | Article |
Language: | English |
Summary: | Compared to early language development, later changes to the language system during orthography and literacy acquisition have not yet been researched in detail. We present a longitudinal corpus of texts on short picture stories written by German primary school children between grades 2 and 4 and grades 3 and 4. It includes 1,922 texts with 212,505 tokens (6,364 types) from 251 children. For each text, rich metadata is available, including age, grade and linguistic background (at least 60% of the children were multilingual). To our knowledge, our corpus is the largest longitudinal corpus of written texts by children at primary school age. Each word is included in its original spelling as well as in a normalized form (target hypothesis), specifying the intended word form, which we corrected for orthographic but not grammatical errors. Original and target word forms are aligned character-wise and the target word forms are enriched with phonological, syllabic, and morphological information. Additionally, for each target word form, we established key lexical variables, e.g., word frequency or summed bigram frequency, as specified in childLex. Where applicable, we also specify key features of German orthography (e.g., consonant doubling, vowel-lengthening). Taken together, this information allows for a detailed assessment of the properties of words that tend to increase the likelihood of spelling errors. The corpus is available in different formats: as tab-delimited annotated token- and type-based lists, in an XML format, and via the corpus search tool ANNIS. |
---|---|
ISSN: | 1554-3528 |
DOI: | 10.3758/s13428-019-01261-x |