SDCF: semi-automatically structured dataset of citation functions

There is increasing research interest in the automatic detection of citation functions, that is, the reasons why authors of academic papers cite previous works. A machine learning approach to this task requires a large dataset with varied citation-function labels. However, existing datasets contain few instances and a limited number of labels, and most labeling schemes have been built from narrow research fields. Addressing these issues, this paper proposes a semi-automatic approach to developing a large dataset of citation functions based on two types of datasets. The first type contains 5668 manually labeled instances used to develop a new labeling scheme of citation functions, and the second type is the final dataset, which is built automatically. Our labeling scheme covers papers from various areas of computer science, resulting in five coarse labels and 21 fine-grained labels. To validate the scheme, two annotators performed annotation experiments on 421 instances, producing Cohen's Kappa values of 0.85 for coarse labels and 0.71 for fine-grained labels. Following this, we performed two classification stages, i.e., filtering and fine-grained classification, to build models using the first dataset. The classification followed several scenarios, including active learning (AL) in a low-resource setting. Our experiments show that Bidirectional Encoder Representations from Transformers (BERT)-based AL achieved 90.29% accuracy, outperforming the other methods in the filtering stage. In the fine-grained stage, the SciBERT-based AL strategy achieved a competitive 81.15% accuracy, slightly lower than the non-AL strategy. These results show that AL is promising, since it requires less than half of the dataset. Considering the number of labels, this paper released the largest such dataset, consisting of 1,840,815 instances.
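
The annotation agreement reported above is Cohen's Kappa. As a minimal sketch of how such agreement scores can be computed from two annotators' label sequences, the snippet below uses scikit-learn's cohen_kappa_score; the label values are illustrative placeholders, not the paper's actual citation-function scheme.

from sklearn.metrics import cohen_kappa_score

# Two annotators' labels over the same instances (placeholders, not the
# paper's coarse or fine-grained citation-function labels).
annotator_a = ["background", "use", "background", "compare", "use", "background"]
annotator_b = ["background", "use", "compare", "compare", "use", "background"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.2f}")

The abstract also refers to a pool-based active learning (AL) setup in a low-resource setting. The sketch below illustrates one common variant, uncertainty sampling, under the assumption of a small labeled seed set and an unlabeled pool; the paper fine-tunes BERT/SciBERT as the classifier, whereas a TF-IDF plus logistic regression model stands in here so the example stays self-contained. All texts, labels, and the oracle are hypothetical.

import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labeled seed set and unlabeled pool (placeholders, not the SDCF data).
seed_texts = ["We build directly on the dataset of [1].",
              "[2] gives a broad survey of citation analysis."]
seed_labels = ["use", "background"]
pool_texts = ["Unlike [3], our approach needs no manual rules.",
              "Our encoder follows the design of [4].",
              "[5] reviews the related benchmarks."]
oracle = {0: "compare", 1: "use", 2: "background"}  # stands in for a human annotator

vectorizer = TfidfVectorizer().fit(seed_texts + pool_texts)
X_labeled = vectorizer.transform(seed_texts)
y_labeled = list(seed_labels)
pool = list(range(len(pool_texts)))

for _ in range(2):  # two query rounds with a budget of one instance each
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    X_pool = vectorizer.transform([pool_texts[i] for i in pool])
    confidence = clf.predict_proba(X_pool).max(axis=1)
    pick = pool[int(np.argmin(confidence))]  # uncertainty sampling: least confident
    X_labeled = vstack([X_labeled, vectorizer.transform([pool_texts[pick]])])
    y_labeled.append(oracle[pick])           # the annotator supplies the true label
    pool.remove(pick)

print(f"Labeled set grew to {len(y_labeled)} instances; classes: {sorted(set(y_labeled))}")

In the paper's low-resource scenario, the same kind of loop would query batches of instances for manual labeling until the annotation budget is spent, which is how AL can reach competitive accuracy while using less than half of the dataset.
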
Bibliographic Details
Published in: Scientometrics 2022-08, Vol.127 (8), p.4569-4608
Main Authors: Basuki, Setio; Tsuchiya, Masatoshi
Format: Article
Language: English
Subjects: Accuracy; Annotations; Business competition; Classification; Coders; Computer Science; Datasets; Filtration; Information Storage and Retrieval; Labeling; Labels; Library Science; Machine learning; Scientific papers; Strategy
Citations: Items that this one cites
Items that cite this one
Online Access: Get full text
DOI: 10.1007/s11192-022-04471-x
Publisher: Springer International Publishing, Cham
Rights: The Author(s) 2022; open access under the Creative Commons Attribution 4.0 license
ISSN: 0138-9130
EISSN: 1588-2861
Source: Library & Information Science Abstracts (LISA); Springer Nature