Loading…
Automatic Table-of-Contents Generation for Efficient Information Access
Purpose This paper presents a novel neural-based approach, applicable to any searchable PDF document that first detects the titles and then hierarchically orders them using a sequence labelling approach to generate automatically the Table of Contents (TOC). A TOC signals the main divisions and subdi...
Saved in:
Published in: | SN computer science 2020-09, Vol.1 (5), p.283, Article 283 |
---|---|
Main Authors: | , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Citations: | Items that this one cites Items that cite this one |
Online Access: | Get full text |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Purpose
This paper presents a novel neural-based approach, applicable to any searchable PDF document that first detects the titles and then hierarchically orders them using a sequence labelling approach to generate automatically the Table of Contents (TOC). A TOC signals the main divisions and subdivisions of a document to assist with navigation and information localisation.
Methods
Unlike previous methods, we do not assume the presence of parsable TOC pages in the document but infer the TOC from a data-driven analysis of sections titles, their order and their depth.
Results
We offer an exhaustive analysis of the proposed model and evaluate it on French and English using documents from the financial domain, which we release to increase community’s interest. We compare this model to state-of-the-art approaches and show its superiority in multiple experiments.
Conclusions
The approach described in this paper can easily be adapted to other domains and documents and its application to the analysis of financial prospectuses will be strengthened by the release of datasets. The TOC generation algorithms used in this paper obtain state-of-the-art results and provide strong baselines for future work. |
---|---|
ISSN: | 2662-995X 2661-8907 |
DOI: | 10.1007/s42979-020-00302-z |