A computational analysis of transcribed speech of people living with dementia: The Anchise 2022 Corpus

•The variety of methods adopted to perform the automatic analysis, ranging from traditional morphosyntactic analysis based on statistical methods to transformer-based language models, sentiment and emotion analysis, and perplexity metrics. •The types of information automatically retrieved fr...

Bibliographic Details
Published in: Computer speech & language 2025-01, Vol.89, p.101691, Article 101691
Main Authors: Sigona, Francesco, Radicioni, Daniele P., Gili Fivela, Barbara, Colla, Davide, Delsanto, Matteo, Mensa, Enrico, Bolioli, Andrea, Vigorelli, Pietro
Format: Article
Language: English
cited_by
cites cdi_FETCH-LOGICAL-c222t-14afffd670da7b83b3093161b3f1c28d8efafc689696c9a47cad6e5043efc6193
container_end_page
container_issue
container_start_page 101691
container_title Computer speech & language
container_volume 89
creator Sigona, Francesco
Radicioni, Daniele P.
Gili Fivela, Barbara
Colla, Davide
Delsanto, Matteo
Mensa, Enrico
Bolioli, Andrea
Vigorelli, Pietro
description •The variety of methods adopted to perform the automatic analysis, ranging from traditional morphosyntactic analysis based on statistical methods to transformer-based language models, sentiment and emotion analysis, and perplexity metrics. •The types of information automatically retrieved from the dialogue transcripts, which concern lexical and morphosyntactic choices as well as speakers' emotions. •The analysis of a large and highly ecological corpus, the Anchise Corpus. Automatic linguistic analysis can provide cost-effective, valuable clues to the diagnosis of cognitive difficulties and to therapeutic practice, and hence impact positively on wellbeing. In this work, we analyzed transcribed conversations between elderly individuals living with dementia and healthcare professionals. The material came from the Anchise 2022 Corpus, a large collection of transcripts of conversations in Italian recorded in naturalistic conditions. The aim of the work was to test the effectiveness of a number of automatic analyses in finding correlations with the progression of dementia in individuals with cognitive decline as measured by the Mini-Mental State Examination (MMSE) score, which is the only psychometric-clinical information available on the participants in the conversations. Healthy controls (HC) were not considered in this study, nor does the corpus itself include HCs. The main innovations and strengths of the work lie in the high ecological validity of the language analyzed (most of the literature to date concerns controlled language experiments); in the use of Italian (there are few corpora for Italian); in the size of the analyzed data (more than 200 conversations were considered); and in the adoption of a wide range of NLP methods, spanning from traditional morphosyntactic investigation to deep linguistic models used for analyses such as perplexity, sentiment (polarity) and emotion analysis.
Analyzing real-world interactions not designed with computational analysis in mind, as is the case with the Anchise Corpus, is particularly challenging. To achieve the research goals, a wide variety of tools were employed. These included traditional morphosyntactic analysis based on digital linguistic biomarkers (DLBs), transformer-based language models, sentiment and emotion analysis, and perplexity metrics. Analyses were conducted both on the continuous range of MMSE values and on the severe/moderate/mild categorization suggested by AIFA (Italian M
doi_str_mv 10.1016/j.csl.2024.101691
format article
fullrecord <record><control><sourceid>elsevier_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1016_j_csl_2024_101691</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0885230824000743</els_id><sourcerecordid>S0885230824000743</sourcerecordid><originalsourceid>FETCH-LOGICAL-c222t-14afffd670da7b83b3093161b3f1c28d8efafc689696c9a47cad6e5043efc6193</originalsourceid><addsrcrecordid>eNp9kMtqwzAQRbVooenjA7rTDzjVw5HldhVCXxDoJl0LeTSqFRzLSE5K_r5O03U3M8yFM1wOIfeczTnj6mE7h9zNBRPl713zCzJjWi8KIZm-Itc5bxljalFWM-KXFOJu2I92DLG3HbXTOOaQafR0TLbPkEKDjuYBEdpTOmAcOqRdOIT-i36HsaUOd9iPwT7STYt02UMbMtKpgaCrmIZ9viWX3nYZ7_72Dfl8ed6s3or1x-v7arkuQAgxFry03nunKuZs1WjZSFZLrngjPQehnUZvPShdq1pBbcsKrFO4YKXEKea1vCH8_BdSzDmhN0MKO5uOhjNzkmG2ZpJjTnLMWc7EPJ0ZnIodAiaTIWAP6EJCGI2L4R_6B1K-b88</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>A computational analysis of transcribed speech of people living with dementia: The Anchise 2022 Corpus</title><source>Elsevier</source><creator>Sigona, Francesco ; Radicioni, Daniele P. ; Gili Fivela, Barbara ; Colla, Davide ; Delsanto, Matteo ; Mensa, Enrico ; Bolioli, Andrea ; Vigorelli, Pietro</creator><creatorcontrib>Sigona, Francesco ; Radicioni, Daniele P. ; Gili Fivela, Barbara ; Colla, Davide ; Delsanto, Matteo ; Mensa, Enrico ; Bolioli, Andrea ; Vigorelli, Pietro</creatorcontrib><description>•The variety of methods adopted to perform the automatic analysis, ranging from traditional morphosyntactic analysis based on statistical methods, transformers-based language models, sentiment and emotions analysis, and perplexity metrics.•The types of information which is automatically retrieved from the dialogue transcripts that regards lexical and morphosyntactic choices as well as speaker's emotions.•The analysis of highly ecological and large corpus, that is the Anchise Corpus. 
Automatic linguistic analysis can provide cost-effective, valuable clues to the diagnosis of cognitive difficulties and to therapeutic practice, and hence impact positively on wellbeing. In this work, we analyzed transcribed conversations between elderly individuals living with dementia and healthcare professionals. The material came from the Anchise 2022 Corpus, a large collection of transcripts of conversations in Italian recorded in naturalistic conditions. The aim of the work was to test the effectiveness of a number of automatic analyses in finding correlations with the progression of dementia in individuals with cognitive decline as measured by the Mini-Mental State Examination (MMSE) score, which is the only psychometric-clinical information available on the participants in the conversations. Healthy controls (HC) were not considered in this study, nor does the corpus itself include HCs. The main innovations and strengths of the work lie in the high ecological validity of the language analyzed (most of the literature to date concerns controlled language experiments); in the use of Italian (there are few corpora for Italian); in the size of the analyzed data (more than 200 conversations were considered); and in the adoption of a wide range of NLP methods, spanning from traditional morphosyntactic investigation to deep linguistic models used for analyses such as perplexity, sentiment (polarity) and emotion analysis. Analyzing real-world interactions not designed with computational analysis in mind, as is the case with the Anchise Corpus, is particularly challenging. To achieve the research goals, a wide variety of tools were employed. These included traditional morphosyntactic analysis based on digital linguistic biomarkers (DLBs), transformer-based language models, sentiment and emotion analysis, and perplexity metrics. 
Analyses were conducted both on the continuous range of MMSE values and on the severe/moderate/mild categorization suggested by AIFA (Italian Medicines Agency) guidelines, based on MMSE threshold values. Correlations between MMSE and individual DLBs were weak, reaching at most 0.19 for positive and -0.21 for negative correlations. Nevertheless, some correlations were statistically significant and consistent with the literature, suggesting that people with a greater degree of impairment tend to show a reduced vocabulary, to have anomia, to adopt a more informal linguistic register, and to display a simplified use of verbs, with a decrease in the use of participles, gerunds, subjunctive moods and modal verbs, as well as a flattening in the use of tenses towards the present to the detriment of the past. The -0.26 inverse correlation between perplexity and MMSE suggests that perplexity captures slightly more specific linguistic information, which can complement the MMSE scores. In the categorization tasks, the classifier based on DLBs achieved an F1 score of 0.79 for binary classification between SEVERE and MILD, and 0.61 for multi-label categorization. Sentiment and emotion analyses showed an inverse trend between joy and MMSE scores, suggesting that less impaired individuals were less joyful, or more “negative”, than others. Considering the real-world context, this is consistent with the hypothesis of a gradual reduction in awareness in individuals affected by dementia. 
Finally, integrating various profiles of analysis has been proved to be effective in offering a wider picture of linguistic and communication deficits, as well as more precise data regarding the progression of dementia.</description><identifier>ISSN: 0885-2308</identifier><identifier>DOI: 10.1016/j.csl.2024.101691</identifier><language>eng</language><publisher>Elsevier Ltd</publisher><subject>Automatic speech and language analysis ; Digital linguistic biomarkers ; Emotion analysis ; Enabling approach ; MMSE ; Naturalistic conversations ; NLP ; Perplexity</subject><ispartof>Computer speech &amp; language, 2025-01, Vol.89, p.101691, Article 101691</ispartof><rights>2024 The Authors</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c222t-14afffd670da7b83b3093161b3f1c28d8efafc689696c9a47cad6e5043efc6193</cites><orcidid>0000-0003-2939-0009</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,776,780,27903,27904</link.rule.ids></links><search><creatorcontrib>Sigona, Francesco</creatorcontrib><creatorcontrib>Radicioni, Daniele P.</creatorcontrib><creatorcontrib>Gili Fivela, Barbara</creatorcontrib><creatorcontrib>Colla, Davide</creatorcontrib><creatorcontrib>Delsanto, Matteo</creatorcontrib><creatorcontrib>Mensa, Enrico</creatorcontrib><creatorcontrib>Bolioli, Andrea</creatorcontrib><creatorcontrib>Vigorelli, Pietro</creatorcontrib><title>A computational analysis of transcribed speech of people living with dementia: The Anchise 2022 Corpus</title><title>Computer speech &amp; language</title><description>•The variety of methods adopted to perform the automatic analysis, ranging from traditional morphosyntactic analysis based on statistical methods, transformers-based language models, sentiment and emotions analysis, and perplexity metrics.•The types 
of information automatically retrieved from the dialogue transcripts, which concern lexical and morphosyntactic choices as well as speakers' emotions. •The analysis of a large and highly ecological corpus, the Anchise Corpus. Automatic linguistic analysis can provide cost-effective, valuable clues to the diagnosis of cognitive difficulties and to therapeutic practice, and hence impact positively on wellbeing. In this work, we analyzed transcribed conversations between elderly individuals living with dementia and healthcare professionals. The material came from the Anchise 2022 Corpus, a large collection of transcripts of conversations in Italian recorded in naturalistic conditions. The aim of the work was to test the effectiveness of a number of automatic analyses in finding correlations with the progression of dementia in individuals with cognitive decline as measured by the Mini-Mental State Examination (MMSE) score, which is the only psychometric-clinical information available on the participants in the conversations. Healthy controls (HC) were not considered in this study, nor does the corpus itself include HCs. The main innovations and strengths of the work lie in the high ecological validity of the language analyzed (most of the literature to date concerns controlled language experiments); in the use of Italian (there are few corpora for Italian); in the size of the analyzed data (more than 200 conversations were considered); and in the adoption of a wide range of NLP methods, spanning from traditional morphosyntactic investigation to deep linguistic models used for analyses such as perplexity, sentiment (polarity) and emotion analysis. Analyzing real-world interactions not designed with computational analysis in mind, as is the case with the Anchise Corpus, is particularly challenging. To achieve the research goals, a wide variety of tools were employed. 
These included traditional morphosyntactic analysis based on digital linguistic biomarkers (DLBs), transformer-based language models, sentiment and emotion analysis, and perplexity metrics. Analyses were conducted both on the continuous range of MMSE values and on the severe/moderate/mild categorization suggested by AIFA (Italian Medicines Agency) guidelines, based on MMSE threshold values. Correlations between MMSE and individual DLBs were weak, reaching at most 0.19 for positive and -0.21 for negative correlations. Nevertheless, some correlations were statistically significant and consistent with the literature, suggesting that people with a greater degree of impairment tend to show a reduced vocabulary, to have anomia, to adopt a more informal linguistic register, and to display a simplified use of verbs, with a decrease in the use of participles, gerunds, subjunctive moods and modal verbs, as well as a flattening in the use of tenses towards the present to the detriment of the past. The -0.26 inverse correlation between perplexity and MMSE suggests that perplexity captures slightly more specific linguistic information, which can complement the MMSE scores. In the categorization tasks, the classifier based on DLBs achieved an F1 score of 0.79 for binary classification between SEVERE and MILD, and 0.61 for multi-label categorization. Sentiment and emotion analyses showed an inverse trend between joy and MMSE scores, suggesting that less impaired individuals were less joyful, or more “negative”, than others. Considering the real-world context, this is consistent with the hypothesis of a gradual reduction in awareness in individuals affected by dementia. 
Finally, integrating various profiles of analysis has been proved to be effective in offering a wider picture of linguistic and communication deficits, as well as more precise data regarding the progression of dementia.</description><subject>Automatic speech and language analysis</subject><subject>Digital linguistic biomarkers</subject><subject>Emotion analysis</subject><subject>Enabling approach</subject><subject>MMSE</subject><subject>Naturalistic conversations</subject><subject>NLP</subject><subject>Perplexity</subject><issn>0885-2308</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2025</creationdate><recordtype>article</recordtype><recordid>eNp9kMtqwzAQRbVooenjA7rTDzjVw5HldhVCXxDoJl0LeTSqFRzLSE5K_r5O03U3M8yFM1wOIfeczTnj6mE7h9zNBRPl713zCzJjWi8KIZm-Itc5bxljalFWM-KXFOJu2I92DLG3HbXTOOaQafR0TLbPkEKDjuYBEdpTOmAcOqRdOIT-i36HsaUOd9iPwT7STYt02UMbMtKpgaCrmIZ9viWX3nYZ7_72Dfl8ed6s3or1x-v7arkuQAgxFry03nunKuZs1WjZSFZLrngjPQehnUZvPShdq1pBbcsKrFO4YKXEKea1vCH8_BdSzDmhN0MKO5uOhjNzkmG2ZpJjTnLMWc7EPJ0ZnIodAiaTIWAP6EJCGI2L4R_6B1K-b88</recordid><startdate>202501</startdate><enddate>202501</enddate><creator>Sigona, Francesco</creator><creator>Radicioni, Daniele P.</creator><creator>Gili Fivela, Barbara</creator><creator>Colla, Davide</creator><creator>Delsanto, Matteo</creator><creator>Mensa, Enrico</creator><creator>Bolioli, Andrea</creator><creator>Vigorelli, Pietro</creator><general>Elsevier Ltd</general><scope>6I.</scope><scope>AAFTH</scope><scope>AAYXX</scope><scope>CITATION</scope><orcidid>https://orcid.org/0000-0003-2939-0009</orcidid></search><sort><creationdate>202501</creationdate><title>A computational analysis of transcribed speech of people living with dementia: The Anchise 2022 Corpus</title><author>Sigona, Francesco ; Radicioni, Daniele P. 
; Gili Fivela, Barbara ; Colla, Davide ; Delsanto, Matteo ; Mensa, Enrico ; Bolioli, Andrea ; Vigorelli, Pietro</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c222t-14afffd670da7b83b3093161b3f1c28d8efafc689696c9a47cad6e5043efc6193</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2025</creationdate><topic>Automatic speech and language analysis</topic><topic>Digital linguistic biomarkers</topic><topic>Emotion analysis</topic><topic>Enabling approach</topic><topic>MMSE</topic><topic>Naturalistic conversations</topic><topic>NLP</topic><topic>Perplexity</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Sigona, Francesco</creatorcontrib><creatorcontrib>Radicioni, Daniele P.</creatorcontrib><creatorcontrib>Gili Fivela, Barbara</creatorcontrib><creatorcontrib>Colla, Davide</creatorcontrib><creatorcontrib>Delsanto, Matteo</creatorcontrib><creatorcontrib>Mensa, Enrico</creatorcontrib><creatorcontrib>Bolioli, Andrea</creatorcontrib><creatorcontrib>Vigorelli, Pietro</creatorcontrib><collection>ScienceDirect Open Access Titles</collection><collection>Elsevier:ScienceDirect:Open Access</collection><collection>CrossRef</collection><jtitle>Computer speech &amp; language</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Sigona, Francesco</au><au>Radicioni, Daniele P.</au><au>Gili Fivela, Barbara</au><au>Colla, Davide</au><au>Delsanto, Matteo</au><au>Mensa, Enrico</au><au>Bolioli, Andrea</au><au>Vigorelli, Pietro</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A computational analysis of transcribed speech of people living with dementia: The Anchise 2022 Corpus</atitle><jtitle>Computer speech &amp; 
language</jtitle><date>2025-01</date><risdate>2025</risdate><volume>89</volume><spage>101691</spage><pages>101691-</pages><artnum>101691</artnum><issn>0885-2308</issn><abstract>•The variety of methods adopted to perform the automatic analysis, ranging from traditional morphosyntactic analysis based on statistical methods to transformer-based language models, sentiment and emotion analysis, and perplexity metrics. •The types of information automatically retrieved from the dialogue transcripts, which concern lexical and morphosyntactic choices as well as speakers' emotions. •The analysis of a large and highly ecological corpus, the Anchise Corpus. Automatic linguistic analysis can provide cost-effective, valuable clues to the diagnosis of cognitive difficulties and to therapeutic practice, and hence impact positively on wellbeing. In this work, we analyzed transcribed conversations between elderly individuals living with dementia and healthcare professionals. The material came from the Anchise 2022 Corpus, a large collection of transcripts of conversations in Italian recorded in naturalistic conditions. The aim of the work was to test the effectiveness of a number of automatic analyses in finding correlations with the progression of dementia in individuals with cognitive decline as measured by the Mini-Mental State Examination (MMSE) score, which is the only psychometric-clinical information available on the participants in the conversations. Healthy controls (HC) were not considered in this study, nor does the corpus itself include HCs. 
The main innovations and strengths of the work lie in the high ecological validity of the language analyzed (most of the literature to date concerns controlled language experiments); in the use of Italian (there are few corpora for Italian); in the size of the analyzed data (more than 200 conversations were considered); and in the adoption of a wide range of NLP methods, spanning from traditional morphosyntactic investigation to deep linguistic models used for analyses such as perplexity, sentiment (polarity) and emotion analysis. Analyzing real-world interactions not designed with computational analysis in mind, as is the case with the Anchise Corpus, is particularly challenging. To achieve the research goals, a wide variety of tools were employed. These included traditional morphosyntactic analysis based on digital linguistic biomarkers (DLBs), transformer-based language models, sentiment and emotion analysis, and perplexity metrics. Analyses were conducted both on the continuous range of MMSE values and on the severe/moderate/mild categorization suggested by AIFA (Italian Medicines Agency) guidelines, based on MMSE threshold values. Correlations between MMSE and individual DLBs were weak, reaching at most 0.19 for positive and -0.21 for negative correlations. Nevertheless, some correlations were statistically significant and consistent with the literature, suggesting that people with a greater degree of impairment tend to show a reduced vocabulary, to have anomia, to adopt a more informal linguistic register, and to display a simplified use of verbs, with a decrease in the use of participles, gerunds, subjunctive moods and modal verbs, as well as a flattening in the use of tenses towards the present to the detriment of the past. The -0.26 inverse correlation between perplexity and MMSE suggests that perplexity captures slightly more specific linguistic information, which can complement the MMSE scores. 
In the categorization tasks, the classifier based on DLBs achieved an F1 score of 0.79 for binary classification between SEVERE and MILD, and 0.61 for multi-label categorization. Sentiment and emotion analyses showed an inverse trend between joy and MMSE scores, suggesting that less impaired individuals were less joyful, or more “negative”, than others. Considering the real-world context, this is consistent with the hypothesis of a gradual reduction in awareness in individuals affected by dementia. Finally, integrating various profiles of analysis has been proved to be effective in offering a wider picture of linguistic and communication deficits, as well as more precise data regarding the progression of dementia.</abstract><pub>Elsevier Ltd</pub><doi>10.1016/j.csl.2024.101691</doi><orcidid>https://orcid.org/0000-0003-2939-0009</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0885-2308
ispartof Computer speech & language, 2025-01, Vol.89, p.101691, Article 101691
issn 0885-2308
language eng
recordid cdi_crossref_primary_10_1016_j_csl_2024_101691
source Elsevier
subjects Automatic speech and language analysis
Digital linguistic biomarkers
Emotion analysis
Enabling approach
MMSE
Naturalistic conversations
NLP
Perplexity
title A computational analysis of transcribed speech of people living with dementia: The Anchise 2022 Corpus
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-27T22%3A25%3A09IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-elsevier_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20computational%20analysis%20of%20transcribed%20speech%20of%20people%20living%20with%20dementia:%20The%20Anchise%202022%20Corpus&rft.jtitle=Computer%20speech%20&%20language&rft.au=Sigona,%20Francesco&rft.date=2025-01&rft.volume=89&rft.spage=101691&rft.pages=101691-&rft.artnum=101691&rft.issn=0885-2308&rft_id=info:doi/10.1016/j.csl.2024.101691&rft_dat=%3Celsevier_cross%3ES0885230824000743%3C/elsevier_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c222t-14afffd670da7b83b3093161b3f1c28d8efafc689696c9a47cad6e5043efc6193%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true