Efficient Web Page Main Text Extraction towards Online News Analysis
Main Authors: | , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Online Access: | Request full text |
Summary: | We propose a simple approach for quickly extracting the main text content from Web pages, especially online news pages. Most existing approaches first construct the DOM tree from the HTML source of the Web page and then extract the important content by pruning or merging DOM branches and sub-trees. Such DOM tree processing is very time-consuming. Our solution instead processes the HTML source directly as a paragraphed text string and extracts the main text content by analyzing only the word count of each text paragraph. Compared with existing DOM-based approaches, the proposed approach is simple and fast without sacrificing accuracy. It is therefore suited to practical applications with strict efficiency requirements, such as online news analysis. The experimental results show that our solution extracts news content from online news pages efficiently and effectively for further analysis. |
---|---|
DOI: | 10.1109/ICEBE.2009.15 |
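
The summary describes treating the HTML source as a paragraphed text string and selecting paragraphs by word count, without building a DOM tree. This record does not give the paper's exact rules, so the following is only a minimal sketch of that general idea under stated assumptions: the function name `extract_main_text`, the `MIN_WORDS` cutoff of 20, and the list of tags treated as paragraph breaks are all hypothetical choices made for illustration, not the authors' method.

```python
import html
import re

# Assumed cutoff (not from the paper): paragraphs with fewer words
# are treated as navigation, captions, or other boilerplate.
MIN_WORDS = 20

# Assumed set of tags whose boundaries separate paragraphs.
BLOCK_BREAK = re.compile(r"(?i)</?(?:p|div|br|li|tr|td|h[1-6]|table|ul|ol)\b[^>]*>")
# Script, style, and comment regions carry no readable text.
NON_TEXT = re.compile(r"(?is)<(script|style)\b.*?</\1\s*>|<!--.*?-->")
ANY_TAG = re.compile(r"<[^>]+>")


def extract_main_text(source: str, min_words: int = MIN_WORDS) -> str:
    """Return the paragraphs of an HTML string that have at least min_words words.

    The HTML is handled as a plain string: no DOM tree is constructed.
    """
    text = NON_TEXT.sub(" ", source)      # drop script/style/comment regions
    text = BLOCK_BREAK.sub("\n", text)    # block-level tags become paragraph breaks
    text = ANY_TAG.sub(" ", text)         # strip the remaining inline tags
    text = html.unescape(text)            # decode entities such as &amp;

    kept = []
    for line in text.split("\n"):
        words = line.split()
        if len(words) >= min_words:       # keep only word-rich paragraphs
            kept.append(" ".join(words))
    return "\n\n".join(kept)
```

Calling `extract_main_text(page_source)` on a news page would keep long body paragraphs and discard short menu items and footers. The cutoff is a tuning parameter: a lower value keeps short paragraphs such as pull quotes, at the cost of admitting more boilerplate.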