Design and analyses of web scraping on burstable virtual machines
Published in: | Concurrency and computation 2024-04, Vol.36 (9), p.n/a |
---|---|
Main Authors: | , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Summary: | Web scraping is a widely used technique for decision‐making, collecting, and structuring public data from the internet. As the volume of data continues to grow, the need for more efficient methods of data extraction becomes crucial. This article introduces a novel web scraping framework that utilizes Burstable virtual machines (VMs) on Amazon Web Services with the objective of reducing the monetary cost of execution while ensuring compliance with service level agreements (SLAs). To achieve this, the framework utilizes a combination of fixed and temporary Burstable VMs in a mixed cluster, which can be elastically scaled up to fulfill the SLA and scaled down to minimize monetary costs. Two strategies for handling VM allocation are proposed and evaluated: (i) a queue and SLA‐based strategy that employs queue size information and SLA criteria to determine the required number of VMs for the current scraping requests, and (ii) a credit‐based strategy that incorporates information about Burstable VM credits to effectively manage instance creation and termination. Experimental tests show that the proposed framework meets the defined SLA while achieving cost reductions of up to 74% compared to an approach that executes on fixed‐size clusters of Burstable instances. |
---|---|
ISSN: | 1532-0626 1532-0634 |
DOI: | 10.1002/cpe.7999 |
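
The queue and SLA-based strategy in the summary sizes the cluster from the number of pending requests and the SLA deadline. The sketch below is one possible reading of that idea, not the article's algorithm: the function name `required_vms`, its parameters, and the sizing rule (each VM handles requests sequentially at a fixed average time) are all assumptions made for illustration.

```python
import math


def required_vms(queue_size, seconds_per_request, sla_seconds, fixed_vms, max_vms):
    """Hypothetical sizing rule: smallest cluster that drains the queue within the SLA.

    All parameters are illustrative assumptions, not values from the article.
    """
    if queue_size <= 0:
        # Nothing pending: fall back to the always-on (fixed) part of the cluster.
        return fixed_vms
    # Requests a single VM can finish inside the SLA window (at least one).
    per_vm = max(1, int(sla_seconds // seconds_per_request))
    needed = math.ceil(queue_size / per_vm)
    # Never drop below the fixed VMs, never exceed the allowed cluster size.
    return min(max(needed, fixed_vms), max_vms)


if __name__ == "__main__":
    # 35 queued requests, 2 s each, 20 s SLA, 2 fixed VMs, at most 10 VMs in total.
    print(required_vms(35, 2.0, 20.0, fixed_vms=2, max_vms=10))
```

Under these assumptions, any VMs above `fixed_vms` would be the temporary instances that are created on demand and terminated once the queue shrinks. A credit-based variant would additionally consult each Burstable instance's CPU credit balance before creating or terminating instances.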