Loading…

Automatic Table-of-Contents Generation for Efficient Information Access

Purpose This paper presents a novel neural-based approach, applicable to any searchable PDF document that first detects the titles and then hierarchically orders them using a sequence labelling approach to generate automatically the Table of Contents (TOC). A TOC signals the main divisions and subdi...

Full description

Saved in:
Bibliographic Details
Published in:SN computer science 2020-09, Vol.1 (5), p.283, Article 283
Main Authors: Bentabet, Najah-Imane, Juge, Rémi, El Maarouf, Ismaïl, Valsamou-Stanislawski, Dialekti, Ferradans, Sira
Format: Article
Language:English
Subjects:
Citations: Items that this one cites
Items that cite this one
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Purpose This paper presents a novel neural-based approach, applicable to any searchable PDF document that first detects the titles and then hierarchically orders them using a sequence labelling approach to generate automatically the Table of Contents (TOC). A TOC signals the main divisions and subdivisions of a document to assist with navigation and information localisation. Methods Unlike previous methods, we do not assume the presence of parsable TOC pages in the document but infer the TOC from a data-driven analysis of sections titles, their order and their depth. Results We offer an exhaustive analysis of the proposed model and evaluate it on French and English using documents from the financial domain, which we release to increase community’s interest. We compare this model to state-of-the-art approaches and show its superiority in multiple experiments. Conclusions The approach described in this paper can easily be adapted to other domains and documents and its application to the analysis of financial prospectuses will be strengthened by the release of datasets. The TOC generation algorithms used in this paper obtain state-of-the-art results and provide strong baselines for future work.
ISSN:2662-995X
2661-8907
DOI:10.1007/s42979-020-00302-z