SDCF: semi-automatically structured dataset of citation functions

There is increasing research interest in the automatic detection of citation functions, that is, the reasons why authors of academic papers cite previous works. A machine learning approach to this task requires a large dataset with varied citation-function labels. However, existing datasets contain few instances and a limited number of labels, and most labeling schemes have been built from narrow research fields. Addressing these issues, this paper proposes a semi-automatic approach to developing a large dataset of citation functions based on two types of datasets. The first type contains 5668 manually labeled instances used to develop a new labeling scheme of citation functions, and the second type is the final dataset, which is built automatically. Our labeling scheme covers papers from various areas of computer science, resulting in five coarse labels and 21 fine-grained labels. To validate the scheme, two annotators performed annotation experiments on 421 instances, producing Cohen's Kappa values of 0.85 for coarse labels and 0.71 for fine-grained labels. Following this, we performed two classification stages, i.e., filtering and fine-grained classification, to build models using the first dataset. The classification followed several scenarios, including active learning (AL) in a low-resource setting. Our experiments show that Bidirectional Encoder Representations from Transformers (BERT)-based AL achieved 90.29% accuracy, outperforming the other methods in the filtering stage. In the fine-grained stage, the SciBERT-based AL strategy achieved a competitive 81.15% accuracy, slightly lower than the non-AL strategy. These results show that AL is promising, since it requires less than half of the dataset. Considering the number of labels, this paper released the largest such dataset, consisting of 1,840,815 instances.
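
The annotation agreement reported above is Cohen's Kappa. As a minimal sketch of how such agreement scores can be computed from two annotators' label sequences, the snippet below uses scikit-learn's cohen_kappa_score; the label values are illustrative placeholders, not the paper's actual citation-function scheme.

from sklearn.metrics import cohen_kappa_score

# Two annotators' labels over the same instances (placeholders, not the
# paper's coarse or fine-grained citation-function labels).
annotator_a = ["background", "use", "background", "compare", "use", "background"]
annotator_b = ["background", "use", "compare", "compare", "use", "background"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's Kappa: {kappa:.2f}")

The abstract also refers to a pool-based active learning (AL) setup in a low-resource setting. The sketch below illustrates one common variant, uncertainty sampling, under the assumption of a small labeled seed set and an unlabeled pool; the paper fine-tunes BERT/SciBERT as the classifier, whereas a TF-IDF plus logistic regression model stands in here so the example stays self-contained. All texts, labels, and the oracle are hypothetical.

import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labeled seed set and unlabeled pool (placeholders, not the SDCF data).
seed_texts = ["We build directly on the dataset of [1].",
              "[2] gives a broad survey of citation analysis."]
seed_labels = ["use", "background"]
pool_texts = ["Unlike [3], our approach needs no manual rules.",
              "Our encoder follows the design of [4].",
              "[5] reviews the related benchmarks."]
oracle = {0: "compare", 1: "use", 2: "background"}  # stands in for a human annotator

vectorizer = TfidfVectorizer().fit(seed_texts + pool_texts)
X_labeled = vectorizer.transform(seed_texts)
y_labeled = list(seed_labels)
pool = list(range(len(pool_texts)))

for _ in range(2):  # two query rounds with a budget of one instance each
    clf = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    X_pool = vectorizer.transform([pool_texts[i] for i in pool])
    confidence = clf.predict_proba(X_pool).max(axis=1)
    pick = pool[int(np.argmin(confidence))]  # uncertainty sampling: least confident
    X_labeled = vstack([X_labeled, vectorizer.transform([pool_texts[pick]])])
    y_labeled.append(oracle[pick])           # the annotator supplies the true label
    pool.remove(pick)

print(f"Labeled set grew to {len(y_labeled)} instances; classes: {sorted(set(y_labeled))}")

In the paper's low-resource scenario, the same kind of loop would query batches of instances for manual labeling until the annotation budget is spent, which is how AL can reach competitive accuracy while using less than half of the dataset.
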
Bibliographic Details
Published in: Scientometrics 2022-08, Vol.127 (8), p.4569-4608
Main Authors: Basuki, Setio; Tsuchiya, Masatoshi
Format: Article
Language: English
Subjects: Accuracy; Annotations; Business competition; Classification; Coders; Computer Science; Datasets; Filtration; Information Storage and Retrieval; Labeling; Labels; Library Science; Machine learning; Scientific papers; Strategy
Citations: Items that this one cites
Items that cite this one
Online Access: Get full text
DOI: 10.1007/s11192-022-04471-x
Publisher: Springer International Publishing, Cham
Rights: The Author(s) 2022; open access under the Creative Commons Attribution 4.0 license
ISSN: 0138-9130
EISSN: 1588-2861
Source: Library & Information Science Abstracts (LISA); Springer Nature