A Complete and Fast Scraping Method for Collecting Tweets

Bibliographic Details
Main Authors: You, Jaebeom; Lee, Jaekyu; Kwon, Hyuk-Yoon
Format: Conference Proceeding
Language: English
cited_by
cites
container_end_page 27
container_issue
container_start_page 24
container_title
container_volume
creator You, Jaebeom
Lee, Jaekyu
Kwon, Hyuk-Yoon
description In this paper, we propose DeepScrap, a scraping method for collecting tweets. DeepScrap completely scrapes the recent tweets that can be viewed on a specific user's page, and it crawls fast enough to overcome the rate limits of the Twitter APIs. In particular, to improve the crawling speed, we devise a multiprocessing architecture that assigns a different IP to each process in order to follow Twitter's robots.txt. This allows us to maximize the parallelism of crawling on one machine. By analyzing the tweets of 97 users, we show that DeepScrap crawls all of the tweets that the Twitter standard APIs crawl. Through extensive experiments, we show that DeepScrap can crawl all of the tweets of the 97 users, 222,194 tweets in total, whereas the Twitter standard API can crawl only 12,586 of them because of its constraints. We also show that, with four processes running simultaneously, multiprocessing DeepScrap achieves a 2.97-fold improvement over single-process DeepScrap when crawling the 222,194 tweets of the 97 users.
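The multiprocessing idea in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the proxy addresses, the round-robin split, and the names partition, crawl_chunk, and crawl_all are hypothetical, and the worker is a stub that performs no network access. The sketch only shows how a user list might be divided among worker processes that each hold a distinct outgoing address. For reference, the figures in the description work out to roughly 17.7 times as many tweets as the standard API (222,194 / 12,586) and a parallel efficiency of about 74% (2.97 / 4).

```python
from multiprocessing import Pool

# Hypothetical proxy endpoints, one per worker process.
# These are placeholders for illustration, not real servers.
PROXIES = [
    "socks5://127.0.0.1:9050",
    "socks5://127.0.0.1:9052",
    "socks5://127.0.0.1:9054",
    "socks5://127.0.0.1:9056",
]


def partition(users, n):
    """Split the user list into n round-robin chunks."""
    return [users[i::n] for i in range(n)]


def crawl_chunk(job):
    """Stub worker: pair each user with the proxy assigned to this process.

    A real worker would fetch each user's page through `proxy`;
    that step is deliberately left out of this sketch.
    """
    proxy, users = job
    return [(user, proxy) for user in users]


def crawl_all(users, proxies):
    """Run one worker process per proxy and merge the results."""
    jobs = list(zip(proxies, partition(users, len(proxies))))
    with Pool(processes=len(proxies)) as pool:
        results = pool.map(crawl_chunk, jobs)
    return [item for chunk in results for item in chunk]


if __name__ == "__main__":
    users = [f"user{i}" for i in range(97)]  # 97 users, as in the description
    assigned = crawl_all(users, PROXIES)
    print(len(assigned), "users assigned across", len(PROXIES), "processes")
    print("parallel efficiency:", round(2.97 / 4, 4))
```

Pairing each chunk with exactly one proxy keeps the address-to-process mapping fixed for the whole run, which is the property the description relies on; how the addresses are actually obtained is outside the scope of this sketch.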
doi_str_mv 10.1109/BigComp51126.2021.00014
format conference_proceeding
eisbn 9781728189246
1728189241
coden IEEPAD
publisher IEEE
date 2021-01
tpages 4
stitle BIGCOMP
linktohtml https://ieeexplore.ieee.org/document/9373132
fulltext fulltext_linktorsrc
identifier EISSN: 2375-9356
ispartof 2021 IEEE International Conference on Big Data and Smart Computing (BigComp), 2021, p.24-27
issn 2375-9356
language eng
recordid cdi_ieee_primary_9373132
source IEEE Xplore All Conference Series
subjects Blogs
Crawling
Multiprocessing
Parallel processing
Robots
Social networking (online)
Spatial databases
Tor Network
Tweets
Web pages
title A Complete and Fast Scraping Method for Collecting Tweets
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T14%3A14%3A07IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-ieee_CHZPO&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=proceeding&rft.atitle=A%20Complete%20and%20Fast%20Scraping%20Method%20for%20Collecting%20Tweets&rft.btitle=2021%20IEEE%20International%20Conference%20on%20Big%20Data%20and%20Smart%20Computing%20(BigComp)&rft.au=You,%20Jaebeom&rft.date=2021-01&rft.spage=24&rft.epage=27&rft.pages=24-27&rft.eissn=2375-9356&rft.coden=IEEPAD&rft_id=info:doi/10.1109/BigComp51126.2021.00014&rft.eisbn=9781728189246&rft.eisbn_list=1728189241&rft_dat=%3Cieee_CHZPO%3E9373132%3C/ieee_CHZPO%3E%3Cgrp_id%3Ecdi_FETCH-LOGICAL-i203t-80a6cbcfbf97d232f39296ba135ab97c28035472cad8a0bc28355847eaee926f3%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=9373132&rfr_iscdi=true