
Web Page Extraction and Classification Using JSOUP and Naïve Bayes


Bibliographic Details
Published in: IOP Conference Series: Materials Science and Engineering, 2020-06, Vol. 875 (1), p. 012089
Main Authors: Cokrowibowo, S, Nur, N, Irmayanti
Format: Article
Language: English
Description
Summary: Classifying web pages manually takes a long time because most available web pages are unstructured, so a fast and accurate classification method is needed. The Naïve Bayes algorithm takes a probabilistic approach to classifying web pages. Although it is a classical method built on a simple probability concept, it performs reasonably well in many modern cases with large data. To extract information from web pages, the authors propose JSOUP, a Java library that provides an API for extracting and manipulating data and for performing initial data cleaning using DOM and CSS methods. With JSOUP, web pages can be analyzed without saving the web documents to computer storage, so storage use stays constant even as the amount of training data grows. This study implements JSOUP as the tool for extracting information from web pages and the Naïve Bayes algorithm for classifying them.
ISSN: 1757-8981, 1757-899X
DOI: 10.1088/1757-899X/875/1/012089
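
The classification step described in the summary can be illustrated with a minimal sketch. This record does not say which Naïve Bayes variant the authors used, so the code below assumes a multinomial model with add-one (Laplace) smoothing over word counts. The class name `NaiveBayesSketch`, its methods, and the example labels are hypothetical and are not taken from the paper. The input is assumed to be plain text already extracted from a page; with jsoup, something like `Jsoup.connect(url).get().text()` would be one way to obtain it, which is left out here so the sketch needs only the Java standard library.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative multinomial Naive Bayes with add-one (Laplace) smoothing.
// All names are hypothetical; this is not the paper's implementation.
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> wordsPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Lowercase the text and split on runs of non-letter, non-digit characters.
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Record one training document (already-extracted page text) for a label.
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String token : tokenize(text)) {
            if (token.isEmpty()) continue; // split can yield an empty first token
            counts.merge(token, 1, Integer::sum);
            wordsPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Return the label with the highest log-posterior, or null if untrained.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // log P(label), estimated from document frequencies
            double score = Math.log((double) docsPerLabel.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int labelTotal = wordsPerLabel.getOrDefault(label, 0);
            for (String token : tokenize(text)) {
                if (token.isEmpty()) continue;
                int count = counts.getOrDefault(token, 0);
                // log P(token | label) with add-one smoothing
                score += Math.log((count + 1.0) / (labelTotal + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("sports", "football match score goal team");
        nb.train("sports", "team wins the league match");
        nb.train("tech", "java library parses html documents");
        nb.train("tech", "software api for html and css");
        System.out.println(nb.classify("the team scored a goal in the match"));
    }
}
```

Summing log-probabilities instead of multiplying raw probabilities avoids numeric underflow on long pages. Because only per-label word counts are kept, page text can be discarded after training, which fits the summary's point that storage use need not grow with the raw documents.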