SDCF: semi-automatically structured dataset of citation functions
There is increasing research interest in the automatic detection of citation functions, that is, the reasons why authors of academic papers cite previous works. A machine learning approach to this task requires a large dataset covering varied citation-function labels. However, existing datasets contain only a few instances and a limited number of labels, and most labels were developed from narrow research fields. Addressing these issues, this paper proposes a semi-automatic approach to building a large dataset of citation functions based on two types of datasets. The first type contains 5668 manually labeled instances used to develop a new labeling scheme of citation functions; the second type is the final dataset, which is built automatically. The labeling scheme covers papers from various areas of computer science, resulting in five coarse labels and 21 fine-grained labels. To validate the scheme, two annotators carried out annotation experiments on 421 instances, producing Cohen's kappa values of 0.85 for coarse labels and 0.71 for fine-grained labels. Following this, we performed two classification stages, filtering and fine-grained classification, to build models using the first dataset. The classification followed several scenarios, including active learning (AL) in a low-resource setting. Our experiments show that Bidirectional Encoder Representations from Transformers (BERT)-based AL achieved 90.29% accuracy, outperforming the other methods in the filtering stage. In the fine-grained stage, the SciBERT-based AL strategy achieved a competitive 81.15% accuracy, slightly lower than the non-AL strategy. These results show that AL is promising, since it requires less than half of the dataset. Considering the number of labels, this paper releases the largest such dataset, consisting of 1,840,815 instances.
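To make the reported agreement figures concrete, the sketch below computes Cohen's kappa for two annotators with scikit-learn's `cohen_kappa_score`. The label names and the toy annotations are hypothetical stand-ins for illustration only; they are not the paper's actual labeling scheme or data.

```python
# Illustrative computation of inter-annotator agreement with Cohen's kappa.
# (The paper reports 0.85 for coarse labels and 0.71 for fine-grained labels;
# the toy annotations below are invented and do not reproduce those values.)
from sklearn.metrics import cohen_kappa_score

annotator_a = ["background", "use", "compare", "use", "background", "extend"]
annotator_b = ["background", "use", "compare", "compare", "background", "extend"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # chance-corrected agreement between annotators
```

Unlike raw percent agreement, kappa discounts the agreement that two annotators would reach by chance, which is why it is the usual check when validating a new labeling scheme.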
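The active-learning setup described in the abstract can be illustrated with a generic pool-based uncertainty-sampling loop. The paper fine-tunes BERT/SciBERT classifiers; the sketch below swaps in a lightweight TF-IDF plus logistic-regression stand-in and invented citation sentences so the acquisition loop itself stays short and runnable. The sentence texts, label names, and number of acquisition rounds are assumptions made for the example, not the authors' pipeline.

```python
# Minimal sketch of pool-based active learning with uncertainty sampling.
# Assumption: the paper fine-tunes BERT/SciBERT; a TF-IDF + logistic-regression
# classifier stands in here so the loop, not the model, is the focus.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled seed set and unlabeled pool of citation contexts.
labeled_texts = ["Smith (2010) proposed the method that we extend here.",
                 "Our results are consistent with those of Jones (2015)."]
labeled_y = ["use", "compare"]
pool_texts = ["We follow the evaluation protocol of Lee (2018).",
              "Unlike Kim (2019), we avoid hand-crafted features.",
              "The corpus was taken from Park (2020)."]

vec = TfidfVectorizer().fit(labeled_texts + pool_texts)

for _ in range(2):  # two acquisition rounds, purely for illustration
    clf = LogisticRegression(max_iter=1000).fit(vec.transform(labeled_texts), labeled_y)
    probs = clf.predict_proba(vec.transform(pool_texts))
    # Uncertainty sampling: query the pool item the model is least confident about.
    query_idx = int(np.argmin(probs.max(axis=1)))
    # A human annotator would supply the true label; a placeholder is used here.
    labeled_texts.append(pool_texts.pop(query_idx))
    labeled_y.append("use")

print(f"Labeled set grew to {len(labeled_texts)} instances; "
      f"{len(pool_texts)} remain in the pool.")
```

The point of the loop is the one the abstract makes: by labeling only the instances the current model is least sure about, a competitive classifier can be trained with a fraction of the annotation budget.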
Published in: | Scientometrics 2022-08, Vol.127 (8), p.4569-4608 |
---|---|
Main Authors: | Basuki, Setio; Tsuchiya, Masatoshi |
Format: | Article |
Language: | English |
Subjects: | Accuracy; Annotations; Business competition; Classification; Coders; Computer Science; Datasets; Filtration; Information Storage and Retrieval; Labeling; Labels; Library Science; Machine learning; Scientific papers; Strategy |
Online Access: | Get full text |
Field | Value
---|---|
container_end_page | 4608 |
container_issue | 8 |
container_start_page | 4569 |
container_title | Scientometrics |
container_volume | 127 |
creator | Basuki, Setio; Tsuchiya, Masatoshi
doi_str_mv | 10.1007/s11192-022-04471-x |
format | article |
fulltext | fulltext |
identifier | ISSN: 0138-9130 |
ispartof | Scientometrics, 2022-08, Vol.127 (8), p.4569-4608 |
issn | 0138-9130; 1588-2861
language | eng |
recordid | cdi_proquest_journals_2700751647 |
source | Library & Information Science Abstracts (LISA); Springer Nature |
subjects | Accuracy; Annotations; Business competition; Classification; Coders; Computer Science; Datasets; Filtration; Information Storage and Retrieval; Labeling; Labels; Library Science; Machine learning; Scientific papers; Strategy
title | SDCF: semi-automatically structured dataset of citation functions |