Web Page Extraction and Classification Using JSOUP and Naïve Bayes
Published in: | IOP Conference Series: Materials Science and Engineering, 2020-06, Vol. 875 (1), p. 012089 |
---|---|
Main Authors: | , , |
Format: | Article |
Language: | English |
Summary: | Classifying web pages manually takes a long time because most available web pages are unstructured, so a fast and accurate classification method is needed. The Naïve Bayes algorithm takes a probabilistic approach to classifying web pages; it is a classical method built on a simple probability concept, yet it performs well in many modern cases involving large data. For extracting information from web pages, the study proposes JSOUP, a Java library that provides an API for extracting and manipulating data and for performing initial data cleaning using DOM and CSS methods. With the JSOUP library, web pages can be analyzed without saving the web documents to computer storage, so storage use stays constant even as the amount of training data increases. This study implements JSOUP as a tool for extracting information from web pages and the Naïve Bayes algorithm for classifying them. |
---|---|
ISSN: | 1757-8981 1757-899X |
DOI: | 10.1088/1757-899X/875/1/012089 |
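
The classification step described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a multinomial Naïve Bayes model with Laplace (add-one) smoothing, and the class name `NaiveBayesSketch`, its methods, and the sample texts are all hypothetical. It uses only the Java standard library; in a pipeline like the one the summary describes, the input strings would be the page text obtained in memory from JSOUP (for example via `Jsoup.connect(url).get().text()`) rather than hard-coded samples.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative multinomial Naive Bayes with add-one smoothing (hypothetical sketch). */
class NaiveBayesSketch {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> wordsPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Lowercases the text and splits it on runs of non-letter, non-digit characters. */
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^\\p{L}\\p{Nd}]+");
    }

    /** Adds one labeled document (already-extracted page text) to the model. */
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) continue; // split() can yield a leading empty token
            counts.merge(word, 1, Integer::sum);
            wordsPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    /** Returns the label with the highest log posterior, or null if nothing was trained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // log prior: share of training documents carrying this label
            double score = Math.log((double) docsPerLabel.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = wordsPerLabel.getOrDefault(label, 0);
            for (String word : tokenize(text)) {
                if (word.isEmpty()) continue;
                int c = counts.getOrDefault(word, 0);
                // add-one smoothing keeps unseen words from zeroing the probability
                score += Math.log((c + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("sports", "football match goal team score");
        nb.train("tech", "java library api code");
        System.out.println(nb.classify("the team scored a goal"));
    }
}
```

Because the model keeps only word counts per label, the raw pages never need to be stored, which matches the summary's point about constant storage use as training data grows.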