Curating global datasets of structural linguistic features for independence

Bibliographic Details
Published in: Scientific data 2025-01, Vol.12 (1), p.106-23, Article 106
Main Authors: Graff, Anna, Chousou-Polydouri, Natalia, Inman, David, Skirgård, Hedvig, Lischka, Marc, Zakharko, Taras, Barbieri, Chiara, Bickel, Balthasar
Format: Article
Language: English
cited_by
cites cdi_FETCH-LOGICAL-c422t-9ca25cabbda9c56026791710833d431dcc82a544a71a1bcc008d865d048facd83
container_end_page 23
container_issue 1
container_start_page 106
container_title Scientific data
container_volume 12
creator Graff, Anna
Chousou-Polydouri, Natalia
Inman, David
Skirgård, Hedvig
Lischka, Marc
Zakharko, Taras
Barbieri, Chiara
Bickel, Balthasar
description The increasing availability of cross-linguistic databases dedicated to documenting morphosyntactic, lexical and phonological features has proliferated the use of such data for studies on language evolution and human history. However, most of these databases were not designed to ensure independence of features, such that it is not valid to jointly use all their features in large-scale statistical analyses assuming independence of inputs. Here, we curate published data from five large linguistic databases to generate two global-scale cross-linguistic datasets: GBI (from the Grambank dataset), and TLI (using inputs from the World Atlas of Language Structures, AUTOTYP, PHOIBLE and Lexibank). The datasets minimize logical dependencies of features and forms of strong statistical dependencies that go beyond phylogenetic and geographical signal. They are also made available in densified form, reducing the proportion of missing data. We document our curation principles and workflows to ensure reusability of this framework with other inputs or thresholds of independence. Our curation steps on both datasets reveal robust and comparable global patterns of structural linguistic diversity.
doi_str_mv 10.1038/s41597-024-04319-4
format article
fullrecord <record><control><sourceid>proquest_doaj_</sourceid><recordid>TN_cdi_doaj_primary_oai_doaj_org_article_14aa56a38c0a41d4a9b8fd64388dd570</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><doaj_id>oai_doaj_org_article_14aa56a38c0a41d4a9b8fd64388dd570</doaj_id><sourcerecordid>3156879440</sourcerecordid><originalsourceid>FETCH-LOGICAL-c422t-9ca25cabbda9c56026791710833d431dcc82a544a71a1bcc008d865d048facd83</originalsourceid><addsrcrecordid>eNp9kU1v1DAQhiMEolXpH-CAInHhEhjb49g5IbTio6ISFzhbE9sJWWXjxXaQ-Pd4m1JaDlxsa-bxOx9vVT1n8JqB0G8SMtmpBjg2gIJ1DT6qzjlI3iC24vG991l1mdIeAJhAkAqeVmei01xx7M6rz7s1Up6WsR7n0NNcO8qUfE51GOqU42pzAeZ6Lsg6pTzZevBUYj7VQ4j1tDh_9OVYrH9WPRloTv7y9r6ovn14_3X3qbn-8vFq9-66sch5bjpLXFrqe0edlS3wVnVMMdBCuDKJs1ZzkoikGLHeWgDtdCsdoB7IOi0uqqtN1wXam2OcDhR_mUCTuQmEOBqKpdPZG4ZEsiWhLRAyh9T1enAtCq2dK7soWm83rePaH7yzfsll3AeiDzPL9N2M4adhTCFXwIvCq1uFGH6sPmVzmJL180yLD2sygkkly-BwKvbyH3Qf1riUXZ2oVqsO8UTxjbIxpBT9cNcNA3Py3mzem-K9ufHeYPn04v4cd1_-OF0AsQGppJbRx7-1_yP7G61wuiQ</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3156879440</pqid></control><display><type>article</type><title>Curating global datasets of structural linguistic features for independence</title><source>PubMed (Medline)</source><source>Publicly Available Content (ProQuest)</source><source>Springer Nature - nature.com Journals - Fully Open Access</source><creator>Graff, Anna ; Chousou-Polydouri, Natalia ; Inman, David ; Skirgård, Hedvig ; Lischka, Marc ; Zakharko, Taras ; Barbieri, Chiara ; Bickel, Balthasar</creator><creatorcontrib>Graff, Anna ; Chousou-Polydouri, Natalia ; Inman, David ; Skirgård, Hedvig ; Lischka, Marc ; Zakharko, Taras ; Barbieri, Chiara ; Bickel, Balthasar</creatorcontrib><description>The increasing availability of cross-linguistic databases dedicated to documenting morphosyntactic, lexical and phonological features has proliferated the use of such data for studies on language evolution and human 
history. However, most of these databases were not designed to ensure independence of features, such that it is not valid to jointly use all their features in large-scale statistical analyses assuming independence of inputs. Here, we curate published data from five large linguistic databases to generate two global-scale cross-linguistic datasets: GBI (from the Grambank dataset), and TLI (using inputs from the World Atlas of Language Structures, AUTOTYP, PHOIBLE and Lexibank). The datasets minimize logical dependencies of features and forms of strong statistical dependencies that go beyond phylogenetic and geographical signal. They are also made available in densified form, reducing the proportion of missing data. We document our curation principles and workflows to ensure reusability of this framework with other inputs or thresholds of independence. Our curation steps on both datasets reveal robust and comparable global patterns of structural linguistic diversity.</description><identifier>ISSN: 2052-4463</identifier><identifier>EISSN: 2052-4463</identifier><identifier>DOI: 10.1038/s41597-024-04319-4</identifier><identifier>PMID: 39827249</identifier><language>eng</language><publisher>London: Nature Publishing Group UK</publisher><subject>631/181/1403/2473 ; 631/181/19 ; 706/689/19 ; Data Curation ; Data Descriptor ; Databases, Factual ; Datasets ; Humanities and Social Sciences ; Humans ; Language ; Linguistics ; multidisciplinary ; Science ; Science (multidisciplinary) ; Statistical analysis ; Statistics</subject><ispartof>Scientific data, 2025-01, Vol.12 (1), p.106-23, Article 106</ispartof><rights>The Author(s) 2025 corrected publication 2025</rights><rights>2025. 
The Author(s).</rights><rights>Copyright Nature Publishing Group 2025</rights><rights>The Author(s) 2025 2025</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c422t-9ca25cabbda9c56026791710833d431dcc82a544a71a1bcc008d865d048facd83</cites><orcidid>0000-0003-1892-591X ; 0000-0002-7748-2381 ; 0000-0002-7703-3471 ; 0000-0001-8827-5655 ; 0009-0007-9493-2392 ; 0000-0001-7601-8424 ; 0000-0002-9087-0565 ; 0000-0002-5693-975X</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.proquest.com/docview/3156879440/fulltextPDF?pq-origsite=primo$$EPDF$$P50$$Gproquest$$Hfree_for_read</linktopdf><linktohtml>$$Uhttps://www.proquest.com/docview/3156879440?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>230,314,723,776,780,881,25728,27898,27899,36986,36987,44563,53763,53765,75093</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/39827249$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Graff, Anna</creatorcontrib><creatorcontrib>Chousou-Polydouri, Natalia</creatorcontrib><creatorcontrib>Inman, David</creatorcontrib><creatorcontrib>Skirgård, Hedvig</creatorcontrib><creatorcontrib>Lischka, Marc</creatorcontrib><creatorcontrib>Zakharko, Taras</creatorcontrib><creatorcontrib>Barbieri, Chiara</creatorcontrib><creatorcontrib>Bickel, Balthasar</creatorcontrib><title>Curating global datasets of structural linguistic features for independence</title><title>Scientific data</title><addtitle>Sci Data</addtitle><addtitle>Sci Data</addtitle><description>The increasing availability of cross-linguistic databases dedicated to documenting morphosyntactic, lexical and phonological features has proliferated the use of such data for studies on language evolution and human history. 
However, most of these databases were not designed to ensure independence of features, such that it is not valid to jointly use all their features in large-scale statistical analyses assuming independence of inputs. Here, we curate published data from five large linguistic databases to generate two global-scale cross-linguistic datasets: GBI (from the Grambank dataset), and TLI (using inputs from the World Atlas of Language Structures, AUTOTYP, PHOIBLE and Lexibank). The datasets minimize logical dependencies of features and forms of strong statistical dependencies that go beyond phylogenetic and geographical signal. They are also made available in densified form, reducing the proportion of missing data. We document our curation principles and workflows to ensure reusability of this framework with other inputs or thresholds of independence. Our curation steps on both datasets reveal robust and comparable global patterns of structural linguistic diversity.</description><subject>631/181/1403/2473</subject><subject>631/181/19</subject><subject>706/689/19</subject><subject>Data Curation</subject><subject>Data Descriptor</subject><subject>Databases, Factual</subject><subject>Datasets</subject><subject>Humanities and Social Sciences</subject><subject>Humans</subject><subject>Language</subject><subject>Linguistics</subject><subject>multidisciplinary</subject><subject>Science</subject><subject>Science (multidisciplinary)</subject><subject>Statistical 
analysis</subject><subject>Statistics</subject><issn>2052-4463</issn><issn>2052-4463</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2025</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><sourceid>DOA</sourceid><recordid>eNp9kU1v1DAQhiMEolXpH-CAInHhEhjb49g5IbTio6ISFzhbE9sJWWXjxXaQ-Pd4m1JaDlxsa-bxOx9vVT1n8JqB0G8SMtmpBjg2gIJ1DT6qzjlI3iC24vG991l1mdIeAJhAkAqeVmei01xx7M6rz7s1Up6WsR7n0NNcO8qUfE51GOqU42pzAeZ6Lsg6pTzZevBUYj7VQ4j1tDh_9OVYrH9WPRloTv7y9r6ovn14_3X3qbn-8vFq9-66sch5bjpLXFrqe0edlS3wVnVMMdBCuDKJs1ZzkoikGLHeWgDtdCsdoB7IOi0uqqtN1wXam2OcDhR_mUCTuQmEOBqKpdPZG4ZEsiWhLRAyh9T1enAtCq2dK7soWm83rePaH7yzfsll3AeiDzPL9N2M4adhTCFXwIvCq1uFGH6sPmVzmJL180yLD2sygkkly-BwKvbyH3Qf1riUXZ2oVqsO8UTxjbIxpBT9cNcNA3Py3mzem-K9ufHeYPn04v4cd1_-OF0AsQGppJbRx7-1_yP7G61wuiQ</recordid><startdate>20250118</startdate><enddate>20250118</enddate><creator>Graff, Anna</creator><creator>Chousou-Polydouri, Natalia</creator><creator>Inman, David</creator><creator>Skirgård, Hedvig</creator><creator>Lischka, Marc</creator><creator>Zakharko, Taras</creator><creator>Barbieri, Chiara</creator><creator>Bickel, Balthasar</creator><general>Nature Publishing Group UK</general><general>Nature Publishing Group</general><general>Nature 
Portfolio</general><scope>C6C</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7X7</scope><scope>7XB</scope><scope>88E</scope><scope>8FE</scope><scope>8FH</scope><scope>8FI</scope><scope>8FJ</scope><scope>8FK</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BBNVY</scope><scope>BENPR</scope><scope>BHPHI</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FYUFA</scope><scope>GHDGH</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>K9.</scope><scope>LK8</scope><scope>M0S</scope><scope>M1P</scope><scope>M7P</scope><scope>PHGZM</scope><scope>PHGZT</scope><scope>PIMPY</scope><scope>PJZUB</scope><scope>PKEHL</scope><scope>PPXIY</scope><scope>PQEST</scope><scope>PQGLB</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>7X8</scope><scope>5PM</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0003-1892-591X</orcidid><orcidid>https://orcid.org/0000-0002-7748-2381</orcidid><orcidid>https://orcid.org/0000-0002-7703-3471</orcidid><orcidid>https://orcid.org/0000-0001-8827-5655</orcidid><orcidid>https://orcid.org/0009-0007-9493-2392</orcidid><orcidid>https://orcid.org/0000-0001-7601-8424</orcidid><orcidid>https://orcid.org/0000-0002-9087-0565</orcidid><orcidid>https://orcid.org/0000-0002-5693-975X</orcidid></search><sort><creationdate>20250118</creationdate><title>Curating global datasets of structural linguistic features for independence</title><author>Graff, Anna ; Chousou-Polydouri, Natalia ; Inman, David ; Skirgård, Hedvig ; Lischka, Marc ; Zakharko, Taras ; Barbieri, Chiara ; Bickel, 
Balthasar</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c422t-9ca25cabbda9c56026791710833d431dcc82a544a71a1bcc008d865d048facd83</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2025</creationdate><topic>631/181/1403/2473</topic><topic>631/181/19</topic><topic>706/689/19</topic><topic>Data Curation</topic><topic>Data Descriptor</topic><topic>Databases, Factual</topic><topic>Datasets</topic><topic>Humanities and Social Sciences</topic><topic>Humans</topic><topic>Language</topic><topic>Linguistics</topic><topic>multidisciplinary</topic><topic>Science</topic><topic>Science (multidisciplinary)</topic><topic>Statistical analysis</topic><topic>Statistics</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Graff, Anna</creatorcontrib><creatorcontrib>Chousou-Polydouri, Natalia</creatorcontrib><creatorcontrib>Inman, David</creatorcontrib><creatorcontrib>Skirgård, Hedvig</creatorcontrib><creatorcontrib>Lischka, Marc</creatorcontrib><creatorcontrib>Zakharko, Taras</creatorcontrib><creatorcontrib>Barbieri, Chiara</creatorcontrib><creatorcontrib>Bickel, Balthasar</creatorcontrib><collection>Springer Nature OA Free Journals</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Health &amp; Medical Collection</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>Medical Database (Alumni Edition)</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Natural Science Collection</collection><collection>Hospital Premium Collection</collection><collection>Hospital Premium Collection (Alumni Edition)</collection><collection>ProQuest Central 
(Alumni) (purchase pre-March 2016)</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>Biological Science Collection</collection><collection>AUTh Library subscriptions: ProQuest Central</collection><collection>ProQuest Natural Science Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central</collection><collection>Health Research Premium Collection</collection><collection>Health Research Premium Collection (Alumni)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Health &amp; Medical Complete (Alumni)</collection><collection>Biological Sciences</collection><collection>Health &amp; Medical Collection (Alumni Edition)</collection><collection>PML(ProQuest Medical Library)</collection><collection>Biological Science Database</collection><collection>ProQuest Central (New)</collection><collection>ProQuest One Academic (New)</collection><collection>Publicly Available Content (ProQuest)</collection><collection>ProQuest Health &amp; Medical Research Collection</collection><collection>ProQuest One Academic Middle East (New)</collection><collection>ProQuest One Health &amp; Nursing</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Applied &amp; Life Sciences</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>Scientific data</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Graff, 
Anna</au><au>Chousou-Polydouri, Natalia</au><au>Inman, David</au><au>Skirgård, Hedvig</au><au>Lischka, Marc</au><au>Zakharko, Taras</au><au>Barbieri, Chiara</au><au>Bickel, Balthasar</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Curating global datasets of structural linguistic features for independence</atitle><jtitle>Scientific data</jtitle><stitle>Sci Data</stitle><addtitle>Sci Data</addtitle><date>2025-01-18</date><risdate>2025</risdate><volume>12</volume><issue>1</issue><spage>106</spage><epage>23</epage><pages>106-23</pages><artnum>106</artnum><issn>2052-4463</issn><eissn>2052-4463</eissn><abstract>The increasing availability of cross-linguistic databases dedicated to documenting morphosyntactic, lexical and phonological features has proliferated the use of such data for studies on language evolution and human history. However, most of these databases were not designed to ensure independence of features, such that it is not valid to jointly use all their features in large-scale statistical analyses assuming independence of inputs. Here, we curate published data from five large linguistic databases to generate two global-scale cross-linguistic datasets: GBI (from the Grambank dataset), and TLI (using inputs from the World Atlas of Language Structures, AUTOTYP, PHOIBLE and Lexibank). The datasets minimize logical dependencies of features and forms of strong statistical dependencies that go beyond phylogenetic and geographical signal. They are also made available in densified form, reducing the proportion of missing data. We document our curation principles and workflows to ensure reusability of this framework with other inputs or thresholds of independence. 
Our curation steps on both datasets reveal robust and comparable global patterns of structural linguistic diversity.</abstract><cop>London</cop><pub>Nature Publishing Group UK</pub><pmid>39827249</pmid><doi>10.1038/s41597-024-04319-4</doi><tpages>23</tpages><orcidid>https://orcid.org/0000-0003-1892-591X</orcidid><orcidid>https://orcid.org/0000-0002-7748-2381</orcidid><orcidid>https://orcid.org/0000-0002-7703-3471</orcidid><orcidid>https://orcid.org/0000-0001-8827-5655</orcidid><orcidid>https://orcid.org/0009-0007-9493-2392</orcidid><orcidid>https://orcid.org/0000-0001-7601-8424</orcidid><orcidid>https://orcid.org/0000-0002-9087-0565</orcidid><orcidid>https://orcid.org/0000-0002-5693-975X</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 2052-4463
ispartof Scientific data, 2025-01, Vol.12 (1), p.106-23, Article 106
issn 2052-4463
2052-4463
language eng
recordid cdi_doaj_primary_oai_doaj_org_article_14aa56a38c0a41d4a9b8fd64388dd570
source PubMed (Medline); Publicly Available Content (ProQuest); Springer Nature - nature.com Journals - Fully Open Access
subjects 631/181/1403/2473
631/181/19
706/689/19
Data Curation
Data Descriptor
Databases, Factual
Datasets
Humanities and Social Sciences
Humans
Language
Linguistics
multidisciplinary
Science
Science (multidisciplinary)
Statistical analysis
Statistics
title Curating global datasets of structural linguistic features for independence
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-27T06%3A52%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_doaj_&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Curating%20global%20datasets%20of%20structural%20linguistic%20features%20for%20independence&rft.jtitle=Scientific%20data&rft.au=Graff,%20Anna&rft.date=2025-01-18&rft.volume=12&rft.issue=1&rft.spage=106&rft.epage=23&rft.pages=106-23&rft.artnum=106&rft.issn=2052-4463&rft.eissn=2052-4463&rft_id=info:doi/10.1038/s41597-024-04319-4&rft_dat=%3Cproquest_doaj_%3E3156879440%3C/proquest_doaj_%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c422t-9ca25cabbda9c56026791710833d431dcc82a544a71a1bcc008d865d048facd83%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3156879440&rft_id=info:pmid/39827249&rfr_iscdi=true