TTC-3600: A new benchmark dataset for Turkish text categorization
Owing to the rapid growth of the World Wide Web, the number of documents that can be accessed via the Internet explosively increases with each passing day. Considering news portals in particular, sometimes documents related to categories such as technology, sports and politics seem to be in the wron...
Published in: | Journal of information science 2017-04, Vol.43 (2), p.174-185 |
---|---|
Main Authors: | Kılınç, Deniz; Özçift, Akın; Bozyigit, Fatma; Yıldırım, Pelin; Yücalar, Fatih; Borandag, Emin |
Format: | Article |
Language: | English |
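Subjects: | Categories; Classification; Classifiers; Data mining; Datasets; Feature selection; Internet; Machine learning; News; Text categorization; Texts |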
cited_by | cdi_FETCH-LOGICAL-c404t-9f2ba383aaadeacfca097f74413da83d6fa0aa10fb8b8d0154dc9f01cbc5ffdc3 |
---|---|
cites | cdi_FETCH-LOGICAL-c404t-9f2ba383aaadeacfca097f74413da83d6fa0aa10fb8b8d0154dc9f01cbc5ffdc3 |
container_end_page | 185 |
container_issue | 2 |
container_start_page | 174 |
container_title | Journal of information science |
container_volume | 43 |
creator | Kılınç, Deniz; Özçift, Akın; Bozyigit, Fatma; Yıldırım, Pelin; Yücalar, Fatih; Borandag, Emin |
description | Owing to the rapid growth of the World Wide Web, the number of documents accessible via the Internet increases explosively with each passing day. On news portals in particular, documents belonging to categories such as technology, sports and politics are sometimes filed under the wrong category or placed in a generic category called others. Text categorization (TC), which is generally treated as a supervised learning task, addresses this problem. Although a substantial number of TC studies have been conducted in other languages, studies on Turkish are very limited owing to the poor accessibility and usability of existing datasets. In this paper, a new dataset named TTC-3600, which can be widely used in TC studies of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five widely used TC classifiers and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that, across all comparisons performed after the pre-processing and feature selection steps, the best accuracy, 91.03%, is obtained by combining the Random Forest classifier with the attribute ranking-based feature selection method. The publicly available TTC-3600 dataset and the experimental results of this study can be used in comparative experiments by other researchers. |
doi_str_mv | 10.1177/0165551515620551 |
format | article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_1913946279</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sage_id>10.1177_0165551515620551</sage_id><sourcerecordid>1913946279</sourcerecordid><originalsourceid>FETCH-LOGICAL-c404t-9f2ba383aaadeacfca097f74413da83d6fa0aa10fb8b8d0154dc9f01cbc5ffdc3</originalsourceid><addsrcrecordid>eNp1UEtLw0AQXkTBWr17XPAcnenuJllvpWgVCl7iOUz20abVpO5u8fHrTakHEWQOM_C9mI-xS4RrxKK4AcyVUjhMPoHhOGIjLCRmuSzVMRvt4WyPn7KzGNcAoLSQIzatqlkmcoBbPuWde-eN68zqlcKGW0oUXeK-D7zahU0bVzy5j8QNJbfsQ_tFqe27c3bi6SW6i589Zs_3d9XsIVs8zR9n00VmJMiUaT9pSJSCiKwj4w2BLnwhJQpLpbC5JyBC8E3ZlBZQSWu0BzSNUd5bI8bs6uC7Df3bzsVUr_td6IbIGjUKLfNJoQcWHFgm9DEG5-ttaId3PmuEel9U_beoQZIdJJGW7pfpf_xvgn5nTA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1913946279</pqid></control><display><type>article</type><title>TTC-3600: A new benchmark dataset for Turkish text categorization</title><source>Library & Information Science Abstracts (LISA)</source><source>Sage Journals Online</source><creator>Kılınç, Deniz ; Özçift, Akın ; Bozyigit, Fatma ; Yıldırım, Pelin ; Yücalar, Fatih ; Borandag, Emin</creator><creatorcontrib>Kılınç, Deniz ; Özçift, Akın ; Bozyigit, Fatma ; Yıldırım, Pelin ; Yücalar, Fatih ; Borandag, Emin</creatorcontrib><description>Owing to the rapid growth of the World Wide Web, the number of documents that can be accessed via the Internet explosively increases with each passing day. Considering news portals in particular, sometimes documents related to categories such as technology, sports and politics seem to be in the wrong category or documents are located in a generic category called others. At this point, text categorization (TC), which is generally addressed as a supervised learning task is needed. 
Although there are substantial number of studies conducted on TC in other languages, the number of studies conducted in Turkish is very limited owing to the lack of accessibility and usability of datasets created. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five widely used classifiers within the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that the best accuracy criterion value 91.03% is obtained with the combination of Random Forest classifier and attribute ranking-based feature selection method in all comparisons performed after pre-processing and feature selection steps. The publicly available TTC-3600 dataset and the experimental results of this study can be utilized in comparative experiments by other researchers.</description><identifier>ISSN: 0165-5515</identifier><identifier>EISSN: 1741-6485</identifier><identifier>DOI: 10.1177/0165551515620551</identifier><language>eng</language><publisher>London, England: SAGE Publications</publisher><subject>Categories ; Classification ; Classifiers ; Data mining ; Datasets ; Feature selection ; Internet ; Machine learning ; News ; Text categorization ; Texts</subject><ispartof>Journal of information science, 2017-04, Vol.43 (2), p.174-185</ispartof><rights>The Author(s) 
2015</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c404t-9f2ba383aaadeacfca097f74413da83d6fa0aa10fb8b8d0154dc9f01cbc5ffdc3</citedby><cites>FETCH-LOGICAL-c404t-9f2ba383aaadeacfca097f74413da83d6fa0aa10fb8b8d0154dc9f01cbc5ffdc3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925,34135,79364</link.rule.ids></links><search><creatorcontrib>Kılınç, Deniz</creatorcontrib><creatorcontrib>Özçift, Akın</creatorcontrib><creatorcontrib>Bozyigit, Fatma</creatorcontrib><creatorcontrib>Yıldırım, Pelin</creatorcontrib><creatorcontrib>Yücalar, Fatih</creatorcontrib><creatorcontrib>Borandag, Emin</creatorcontrib><title>TTC-3600: A new benchmark dataset for Turkish text categorization</title><title>Journal of information science</title><description>Owing to the rapid growth of the World Wide Web, the number of documents that can be accessed via the Internet explosively increases with each passing day. Considering news portals in particular, sometimes documents related to categories such as technology, sports and politics seem to be in the wrong category or documents are located in a generic category called others. At this point, text categorization (TC), which is generally addressed as a supervised learning task is needed. Although there are substantial number of studies conducted on TC in other languages, the number of studies conducted in Turkish is very limited owing to the lack of accessibility and usability of datasets created. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. 
Five widely used classifiers within the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that the best accuracy criterion value 91.03% is obtained with the combination of Random Forest classifier and attribute ranking-based feature selection method in all comparisons performed after pre-processing and feature selection steps. The publicly available TTC-3600 dataset and the experimental results of this study can be utilized in comparative experiments by other researchers.</description><subject>Categories</subject><subject>Classification</subject><subject>Classifiers</subject><subject>Data mining</subject><subject>Datasets</subject><subject>Feature selection</subject><subject>Internet</subject><subject>Machine learning</subject><subject>News</subject><subject>Text categorization</subject><subject>Texts</subject><issn>0165-5515</issn><issn>1741-6485</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2017</creationdate><recordtype>article</recordtype><sourceid>F2A</sourceid><recordid>eNp1UEtLw0AQXkTBWr17XPAcnenuJllvpWgVCl7iOUz20abVpO5u8fHrTakHEWQOM_C9mI-xS4RrxKK4AcyVUjhMPoHhOGIjLCRmuSzVMRvt4WyPn7KzGNcAoLSQIzatqlkmcoBbPuWde-eN68zqlcKGW0oUXeK-D7zahU0bVzy5j8QNJbfsQ_tFqe27c3bi6SW6i589Zs_3d9XsIVs8zR9n00VmJMiUaT9pSJSCiKwj4w2BLnwhJQpLpbC5JyBC8E3ZlBZQSWu0BzSNUd5bI8bs6uC7Df3bzsVUr_td6IbIGjUKLfNJoQcWHFgm9DEG5-ttaId3PmuEel9U_beoQZIdJJGW7pfpf_xvgn5nTA</recordid><startdate>20170401</startdate><enddate>20170401</enddate><creator>Kılınç, Deniz</creator><creator>Özçift, Akın</creator><creator>Bozyigit, Fatma</creator><creator>Yıldırım, Pelin</creator><creator>Yücalar, Fatih</creator><creator>Borandag, Emin</creator><general>SAGE Publications</general><general>Bowker-Saur 
Ltd</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>E3H</scope><scope>F2A</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20170401</creationdate><title>TTC-3600: A new benchmark dataset for Turkish text categorization</title><author>Kılınç, Deniz ; Özçift, Akın ; Bozyigit, Fatma ; Yıldırım, Pelin ; Yücalar, Fatih ; Borandag, Emin</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c404t-9f2ba383aaadeacfca097f74413da83d6fa0aa10fb8b8d0154dc9f01cbc5ffdc3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2017</creationdate><topic>Categories</topic><topic>Classification</topic><topic>Classifiers</topic><topic>Data mining</topic><topic>Datasets</topic><topic>Feature selection</topic><topic>Internet</topic><topic>Machine learning</topic><topic>News</topic><topic>Text categorization</topic><topic>Texts</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Kılınç, Deniz</creatorcontrib><creatorcontrib>Özçift, Akın</creatorcontrib><creatorcontrib>Bozyigit, Fatma</creatorcontrib><creatorcontrib>Yıldırım, Pelin</creatorcontrib><creatorcontrib>Yücalar, Fatih</creatorcontrib><creatorcontrib>Borandag, Emin</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>Library & Information Sciences Abstracts (LISA)</collection><collection>Library & Information Science Abstracts (LISA)</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Journal of information 
science</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kılınç, Deniz</au><au>Özçift, Akın</au><au>Bozyigit, Fatma</au><au>Yıldırım, Pelin</au><au>Yücalar, Fatih</au><au>Borandag, Emin</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>TTC-3600: A new benchmark dataset for Turkish text categorization</atitle><jtitle>Journal of information science</jtitle><date>2017-04-01</date><risdate>2017</risdate><volume>43</volume><issue>2</issue><spage>174</spage><epage>185</epage><pages>174-185</pages><issn>0165-5515</issn><eissn>1741-6485</eissn><abstract>Owing to the rapid growth of the World Wide Web, the number of documents that can be accessed via the Internet explosively increases with each passing day. Considering news portals in particular, sometimes documents related to categories such as technology, sports and politics seem to be in the wrong category or documents are located in a generic category called others. At this point, text categorization (TC), which is generally addressed as a supervised learning task is needed. Although there are substantial number of studies conducted on TC in other languages, the number of studies conducted in Turkish is very limited owing to the lack of accessibility and usability of datasets created. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five widely used classifiers within the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that the best accuracy criterion value 91.03% is obtained with the combination of Random Forest classifier and attribute ranking-based feature selection method in all comparisons performed after pre-processing and feature selection steps. 
The publicly available TTC-3600 dataset and the experimental results of this study can be utilized in comparative experiments by other researchers.</abstract><cop>London, England</cop><pub>SAGE Publications</pub><doi>10.1177/0165551515620551</doi><tpages>12</tpages><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 0165-5515 |
ispartof | Journal of information science, 2017-04, Vol.43 (2), p.174-185 |
issn | 0165-5515 1741-6485 |
language | eng |
recordid | cdi_proquest_journals_1913946279 |
source | Library & Information Science Abstracts (LISA); Sage Journals Online |
subjects | Categories; Classification; Classifiers; Data mining; Datasets; Feature selection; Internet; Machine learning; News; Text categorization; Texts |
title | TTC-3600: A new benchmark dataset for Turkish text categorization |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T12%3A24%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=TTC-3600:%20A%20new%20benchmark%20dataset%20for%20Turkish%20text%20categorization&rft.jtitle=Journal%20of%20information%20science&rft.au=K%C4%B1l%C4%B1n%C3%A7,%20Deniz&rft.date=2017-04-01&rft.volume=43&rft.issue=2&rft.spage=174&rft.epage=185&rft.pages=174-185&rft.issn=0165-5515&rft.eissn=1741-6485&rft_id=info:doi/10.1177/0165551515620551&rft_dat=%3Cproquest_cross%3E1913946279%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c404t-9f2ba383aaadeacfca097f74413da83d6fa0aa10fb8b8d0154dc9f01cbc5ffdc3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1913946279&rft_id=info:pmid/&rft_sage_id=10.1177_0165551515620551&rfr_iscdi=true |