
Efficient Web Page Main Text Extraction towards Online News Analysis


Bibliographic Details
Main Authors: Baoyao Zhou, Yuhong Xiong, Wei Liu
Format: Conference Proceeding
Language: English
Description
Summary: We propose a simple approach for quickly extracting the main text content from Web pages, especially online news pages. Most existing approaches first construct the DOM tree from the HTML source of the Web page and then extract the important content by pruning or merging DOM branches and sub-trees. Such DOM tree processing is very time-consuming. Our solution processes the HTML source directly as a paragraphed text string and extracts the main text content by analyzing only the word count of the text paragraphs. Compared with existing DOM-based approaches, the proposed approach is simple and fast without losing accuracy. It can be applied in practical applications with critical efficiency requirements, such as online news analysis. The experimental results show that our solution can efficiently and effectively extract the news content from online news pages for further analysis.
DOI: 10.1109/ICEBE.2009.15
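
The summary describes treating the HTML source as a paragraphed text string and selecting content by paragraph word count, without building a DOM tree. The record does not give the paper's actual rules, so the following is only a minimal sketch of that general idea. The function name `extract_main_text`, the choice of block-level tags used as paragraph boundaries, and the `MIN_WORDS` threshold are all assumptions made for illustration, not details taken from the paper.

```python
import html
import re

# Hypothetical threshold: paragraphs with fewer words are treated as
# navigation, captions, or other boilerplate. Not taken from the paper.
MIN_WORDS = 10

# Remove script/style blocks and comments, whose contents are never visible text.
_NON_TEXT = re.compile(r"(?is)<(script|style)\b.*?</\1\s*>|<!--.*?-->")
# Treat common block-level tags as paragraph boundaries (an assumption).
_BLOCK_BREAK = re.compile(r"(?i)</?(?:p|div|br|li|ul|ol|tr|td|table|h[1-6])\b[^>]*>")
# Any remaining tag is treated as inline markup and dropped.
_ANY_TAG = re.compile(r"<[^>]+>")


def extract_main_text(source: str, min_words: int = MIN_WORDS) -> list[str]:
    """Return paragraphs with at least min_words words, without building a DOM."""
    text = _NON_TEXT.sub(" ", source)
    text = _BLOCK_BREAK.sub("\n", text)
    text = _ANY_TAG.sub("", text)
    paragraphs = []
    for line in text.split("\n"):
        # Decode entities and collapse runs of whitespace.
        line = " ".join(html.unescape(line).split())
        if len(line.split()) >= min_words:
            paragraphs.append(line)
    return paragraphs
```

Calling `extract_main_text(page_source)` on a news page would return the longer text paragraphs and drop short fragments such as menu links. A fixed threshold is the simplest possible selection rule; the paper's analysis of word counts may well be more refined.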