TTC-3600: A new benchmark dataset for Turkish text categorization

Owing to the rapid growth of the World Wide Web, the number of documents that can be accessed via the Internet explosively increases with each passing day. Considering news portals in particular, sometimes documents related to categories such as technology, sports and politics seem to be in the wron...

Bibliographic Details
Published in: Journal of information science 2017-04, Vol.43 (2), p.174-185
Main Authors: Kılınç, Deniz, Özçift, Akın, Bozyigit, Fatma, Yıldırım, Pelin, Yücalar, Fatih, Borandag, Emin
Format: Article
Language: English
cited_by cdi_FETCH-LOGICAL-c404t-9f2ba383aaadeacfca097f74413da83d6fa0aa10fb8b8d0154dc9f01cbc5ffdc3
cites cdi_FETCH-LOGICAL-c404t-9f2ba383aaadeacfca097f74413da83d6fa0aa10fb8b8d0154dc9f01cbc5ffdc3
container_end_page 185
container_issue 2
container_start_page 174
container_title Journal of information science
container_volume 43
creator Kılınç, Deniz
Özçift, Akın
Bozyigit, Fatma
Yıldırım, Pelin
Yücalar, Fatih
Borandag, Emin
description Owing to the rapid growth of the World Wide Web, the number of documents that can be accessed via the Internet increases explosively with each passing day. On news portals in particular, documents related to categories such as technology, sports and politics are sometimes placed in the wrong category or in a generic category called "others". At this point, text categorization (TC), which is generally addressed as a supervised learning task, is needed. Although there is a substantial number of studies on TC in other languages, the number of studies conducted in Turkish is very limited owing to the lack of accessible and usable datasets. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five widely used classifiers within the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that the best accuracy, 91.03%, is obtained with the combination of the Random Forest classifier and the attribute ranking-based feature selection method, across all comparisons performed after the pre-processing and feature selection steps. The publicly available TTC-3600 dataset and the experimental results of this study can be utilized in comparative experiments by other researchers.
doi_str_mv 10.1177/0165551515620551
format article
fullrecord <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_1913946279</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sage_id>10.1177_0165551515620551</sage_id><sourcerecordid>1913946279</sourcerecordid><originalsourceid>FETCH-LOGICAL-c404t-9f2ba383aaadeacfca097f74413da83d6fa0aa10fb8b8d0154dc9f01cbc5ffdc3</originalsourceid><addsrcrecordid>eNp1UEtLw0AQXkTBWr17XPAcnenuJllvpWgVCl7iOUz20abVpO5u8fHrTakHEWQOM_C9mI-xS4RrxKK4AcyVUjhMPoHhOGIjLCRmuSzVMRvt4WyPn7KzGNcAoLSQIzatqlkmcoBbPuWde-eN68zqlcKGW0oUXeK-D7zahU0bVzy5j8QNJbfsQ_tFqe27c3bi6SW6i589Zs_3d9XsIVs8zR9n00VmJMiUaT9pSJSCiKwj4w2BLnwhJQpLpbC5JyBC8E3ZlBZQSWu0BzSNUd5bI8bs6uC7Df3bzsVUr_td6IbIGjUKLfNJoQcWHFgm9DEG5-ttaId3PmuEel9U_beoQZIdJJGW7pfpf_xvgn5nTA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1913946279</pqid></control><display><type>article</type><title>TTC-3600: A new benchmark dataset for Turkish text categorization</title><source>Library &amp; Information Science Abstracts (LISA)</source><source>Sage Journals Online</source><creator>Kılınç, Deniz ; Özçift, Akın ; Bozyigit, Fatma ; Yıldırım, Pelin ; Yücalar, Fatih ; Borandag, Emin</creator><creatorcontrib>Kılınç, Deniz ; Özçift, Akın ; Bozyigit, Fatma ; Yıldırım, Pelin ; Yücalar, Fatih ; Borandag, Emin</creatorcontrib><description>Owing to the rapid growth of the World Wide Web, the number of documents that can be accessed via the Internet explosively increases with each passing day. Considering news portals in particular, sometimes documents related to categories such as technology, sports and politics seem to be in the wrong category or documents are located in a generic category called others. At this point, text categorization (TC), which is generally addressed as a supervised learning task is needed. 
Although there are substantial number of studies conducted on TC in other languages, the number of studies conducted in Turkish is very limited owing to the lack of accessibility and usability of datasets created. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five widely used classifiers within the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that the best accuracy criterion value 91.03% is obtained with the combination of Random Forest classifier and attribute ranking-based feature selection method in all comparisons performed after pre-processing and feature selection steps. The publicly available TTC-3600 dataset and the experimental results of this study can be utilized in comparative experiments by other researchers.</description><identifier>ISSN: 0165-5515</identifier><identifier>EISSN: 1741-6485</identifier><identifier>DOI: 10.1177/0165551515620551</identifier><language>eng</language><publisher>London, England: SAGE Publications</publisher><subject>Categories ; Classification ; Classifiers ; Data mining ; Datasets ; Feature selection ; Internet ; Machine learning ; News ; Text categorization ; Texts</subject><ispartof>Journal of information science, 2017-04, Vol.43 (2), p.174-185</ispartof><rights>The Author(s) 
2015</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c404t-9f2ba383aaadeacfca097f74413da83d6fa0aa10fb8b8d0154dc9f01cbc5ffdc3</citedby><cites>FETCH-LOGICAL-c404t-9f2ba383aaadeacfca097f74413da83d6fa0aa10fb8b8d0154dc9f01cbc5ffdc3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925,34135,79364</link.rule.ids></links><search><creatorcontrib>Kılınç, Deniz</creatorcontrib><creatorcontrib>Özçift, Akın</creatorcontrib><creatorcontrib>Bozyigit, Fatma</creatorcontrib><creatorcontrib>Yıldırım, Pelin</creatorcontrib><creatorcontrib>Yücalar, Fatih</creatorcontrib><creatorcontrib>Borandag, Emin</creatorcontrib><title>TTC-3600: A new benchmark dataset for Turkish text categorization</title><title>Journal of information science</title><description>Owing to the rapid growth of the World Wide Web, the number of documents that can be accessed via the Internet explosively increases with each passing day. Considering news portals in particular, sometimes documents related to categories such as technology, sports and politics seem to be in the wrong category or documents are located in a generic category called others. At this point, text categorization (TC), which is generally addressed as a supervised learning task is needed. Although there are substantial number of studies conducted on TC in other languages, the number of studies conducted in Turkish is very limited owing to the lack of accessibility and usability of datasets created. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. 
Five widely used classifiers within the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that the best accuracy criterion value 91.03% is obtained with the combination of Random Forest classifier and attribute ranking-based feature selection method in all comparisons performed after pre-processing and feature selection steps. The publicly available TTC-3600 dataset and the experimental results of this study can be utilized in comparative experiments by other researchers.</description><subject>Categories</subject><subject>Classification</subject><subject>Classifiers</subject><subject>Data mining</subject><subject>Datasets</subject><subject>Feature selection</subject><subject>Internet</subject><subject>Machine learning</subject><subject>News</subject><subject>Text categorization</subject><subject>Texts</subject><issn>0165-5515</issn><issn>1741-6485</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2017</creationdate><recordtype>article</recordtype><sourceid>F2A</sourceid><recordid>eNp1UEtLw0AQXkTBWr17XPAcnenuJllvpWgVCl7iOUz20abVpO5u8fHrTakHEWQOM_C9mI-xS4RrxKK4AcyVUjhMPoHhOGIjLCRmuSzVMRvt4WyPn7KzGNcAoLSQIzatqlkmcoBbPuWde-eN68zqlcKGW0oUXeK-D7zahU0bVzy5j8QNJbfsQ_tFqe27c3bi6SW6i589Zs_3d9XsIVs8zR9n00VmJMiUaT9pSJSCiKwj4w2BLnwhJQpLpbC5JyBC8E3ZlBZQSWu0BzSNUd5bI8bs6uC7Df3bzsVUr_td6IbIGjUKLfNJoQcWHFgm9DEG5-ttaId3PmuEel9U_beoQZIdJJGW7pfpf_xvgn5nTA</recordid><startdate>20170401</startdate><enddate>20170401</enddate><creator>Kılınç, Deniz</creator><creator>Özçift, Akın</creator><creator>Bozyigit, Fatma</creator><creator>Yıldırım, Pelin</creator><creator>Yücalar, Fatih</creator><creator>Borandag, Emin</creator><general>SAGE Publications</general><general>Bowker-Saur 
Ltd</general><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>8FD</scope><scope>E3H</scope><scope>F2A</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope></search><sort><creationdate>20170401</creationdate><title>TTC-3600: A new benchmark dataset for Turkish text categorization</title><author>Kılınç, Deniz ; Özçift, Akın ; Bozyigit, Fatma ; Yıldırım, Pelin ; Yücalar, Fatih ; Borandag, Emin</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c404t-9f2ba383aaadeacfca097f74413da83d6fa0aa10fb8b8d0154dc9f01cbc5ffdc3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2017</creationdate><topic>Categories</topic><topic>Classification</topic><topic>Classifiers</topic><topic>Data mining</topic><topic>Datasets</topic><topic>Feature selection</topic><topic>Internet</topic><topic>Machine learning</topic><topic>News</topic><topic>Text categorization</topic><topic>Texts</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Kılınç, Deniz</creatorcontrib><creatorcontrib>Özçift, Akın</creatorcontrib><creatorcontrib>Bozyigit, Fatma</creatorcontrib><creatorcontrib>Yıldırım, Pelin</creatorcontrib><creatorcontrib>Yücalar, Fatih</creatorcontrib><creatorcontrib>Borandag, Emin</creatorcontrib><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Technology Research Database</collection><collection>Library &amp; Information Sciences Abstracts (LISA)</collection><collection>Library &amp; Information Science Abstracts (LISA)</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>Journal of information 
science</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kılınç, Deniz</au><au>Özçift, Akın</au><au>Bozyigit, Fatma</au><au>Yıldırım, Pelin</au><au>Yücalar, Fatih</au><au>Borandag, Emin</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>TTC-3600: A new benchmark dataset for Turkish text categorization</atitle><jtitle>Journal of information science</jtitle><date>2017-04-01</date><risdate>2017</risdate><volume>43</volume><issue>2</issue><spage>174</spage><epage>185</epage><pages>174-185</pages><issn>0165-5515</issn><eissn>1741-6485</eissn><abstract>Owing to the rapid growth of the World Wide Web, the number of documents that can be accessed via the Internet explosively increases with each passing day. Considering news portals in particular, sometimes documents related to categories such as technology, sports and politics seem to be in the wrong category or documents are located in a generic category called others. At this point, text categorization (TC), which is generally addressed as a supervised learning task is needed. Although there are substantial number of studies conducted on TC in other languages, the number of studies conducted in Turkish is very limited owing to the lack of accessibility and usability of datasets created. In this paper, a new dataset named TTC-3600, which can be widely used in studies of TC of Turkish news and articles, is created. TTC-3600 is a well-documented dataset and its file formats are compatible with well-known text mining tools. Five widely used classifiers within the field of TC and two feature selection methods are evaluated on TTC-3600. The experimental results indicate that the best accuracy criterion value 91.03% is obtained with the combination of Random Forest classifier and attribute ranking-based feature selection method in all comparisons performed after pre-processing and feature selection steps. 
The publicly available TTC-3600 dataset and the experimental results of this study can be utilized in comparative experiments by other researchers.</abstract><cop>London, England</cop><pub>SAGE Publications</pub><doi>10.1177/0165551515620551</doi><tpages>12</tpages><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier ISSN: 0165-5515
ispartof Journal of information science, 2017-04, Vol.43 (2), p.174-185
issn 0165-5515
1741-6485
language eng
recordid cdi_proquest_journals_1913946279
source Library & Information Science Abstracts (LISA); Sage Journals Online
subjects Categories
Classification
Classifiers
Data mining
Datasets
Feature selection
Internet
Machine learning
News
Text categorization
Texts
title TTC-3600: A new benchmark dataset for Turkish text categorization
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T12%3A24%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=TTC-3600:%20A%20new%20benchmark%20dataset%20for%20Turkish%20text%20categorization&rft.jtitle=Journal%20of%20information%20science&rft.au=K%C4%B1l%C4%B1n%C3%A7,%20Deniz&rft.date=2017-04-01&rft.volume=43&rft.issue=2&rft.spage=174&rft.epage=185&rft.pages=174-185&rft.issn=0165-5515&rft.eissn=1741-6485&rft_id=info:doi/10.1177/0165551515620551&rft_dat=%3Cproquest_cross%3E1913946279%3C/proquest_cross%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-c404t-9f2ba383aaadeacfca097f74413da83d6fa0aa10fb8b8d0154dc9f01cbc5ffdc3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=1913946279&rft_id=info:pmid/&rft_sage_id=10.1177_0165551515620551&rfr_iscdi=true