The future of document indexing: GPT and Donut revolutionize table of content processing

Industrial projects rely heavily on lengthy, complex specification documents, making tedious manual extraction of structured information a major bottleneck. This paper introduces an innovative approach to automate this process, leveraging the capabilities of two cutting-edge AI models: Donut, a mode...

Bibliographic Details
Published in: arXiv.org 2024-03
Main Authors: Feyisa, Degaga Wolde; Berihun, Haylemicheal; Amanuel Zewdu; Najimoghadam, Mahsa; Zare, Marzieh
Format: Article
Language: English
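Subjects: Artificial intelligence; Construction specifications; Documents; Indexing; Information retrieval; Large language models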
container_title arXiv.org
creator Feyisa, Degaga Wolde
Berihun, Haylemicheal
Amanuel Zewdu
Najimoghadam, Mahsa
Zare, Marzieh
description Industrial projects rely heavily on lengthy, complex specification documents, making tedious manual extraction of structured information a major bottleneck. This paper introduces an innovative approach to automate this process, leveraging the capabilities of two cutting-edge AI models: Donut, a model that extracts information directly from scanned documents without OCR, and OpenAI GPT-3.5 Turbo, a robust large language model. The proposed methodology is initiated by acquiring the table of contents (ToCs) from construction specification documents and subsequently structuring the ToCs text into JSON data. Remarkable accuracy is achieved, with Donut reaching 85% and GPT-3.5 Turbo reaching 89% in effectively organizing the ToCs. This landmark achievement represents a significant leap forward in document indexing, demonstrating the immense potential of AI to automate information extraction tasks across diverse document types, boosting efficiency and liberating critical resources in various industries.
format article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2956475087</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2956475087</sourcerecordid><originalsourceid>FETCH-proquest_journals_29564750873</originalsourceid><addsrcrecordid>eNqNy70OgjAUhuHGxESj3MNJnElqSwFd_R0dGNwIwkFL8FRpa4xXLxovwOkbvvcZsLGQch6mkRAjFljbcM5FnAil5JgdswtC7Z3vEEwNlSn9FcmBpgqfms5L2B0yKKiCtSHvoMOHab3ThvQLwRWn9utKQ-7Dbp0p0doeTtmwLlqLwW8nbLbdZKt92Cd3j9bljfEd9VcuFiqOEsXTRP5XvQGWekIj</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2956475087</pqid></control><display><type>article</type><title>The future of document indexing: GPT and Donut revolutionize table of content processing</title><source>Access via ProQuest (Open Access)</source><creator>Feyisa, Degaga Wolde ; Berihun, Haylemicheal ; Amanuel Zewdu ; Najimoghadam, Mahsa ; Zare, Marzieh</creator><creatorcontrib>Feyisa, Degaga Wolde ; Berihun, Haylemicheal ; Amanuel Zewdu ; Najimoghadam, Mahsa ; Zare, Marzieh</creatorcontrib><description>Industrial projects rely heavily on lengthy, complex specification documents, making tedious manual extraction of structured information a major bottleneck. This paper introduces an innovative approach to automate this process, leveraging the capabilities of two cutting-edge AI models: Donut, a model that extracts information directly from scanned documents without OCR, and OpenAI GPT-3.5 Turbo, a robust large language model. The proposed methodology is initiated by acquiring the table of contents (ToCs) from construction specification documents and subsequently structuring the ToCs text into JSON data. Remarkable accuracy is achieved, with Donut reaching 85% and GPT-3.5 Turbo reaching 89% in effectively organizing the ToCs. 
This landmark achievement represents a significant leap forward in document indexing, demonstrating the immense potential of AI to automate information extraction tasks across diverse document types, boosting efficiency and liberating critical resources in various industries.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Artificial intelligence ; Construction specifications ; Documents ; Indexing ; Information retrieval ; Large language models</subject><ispartof>arXiv.org, 2024-03</ispartof><rights>2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.proquest.com/docview/2956475087?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>780,784,25753,37012,44590</link.rule.ids></links><search><creatorcontrib>Feyisa, Degaga Wolde</creatorcontrib><creatorcontrib>Berihun, Haylemicheal</creatorcontrib><creatorcontrib>Amanuel Zewdu</creatorcontrib><creatorcontrib>Najimoghadam, Mahsa</creatorcontrib><creatorcontrib>Zare, Marzieh</creatorcontrib><title>The future of document indexing: GPT and Donut revolutionize table of content processing</title><title>arXiv.org</title><description>Industrial projects rely heavily on lengthy, complex specification documents, making tedious manual extraction of structured information a major bottleneck. 
This paper introduces an innovative approach to automate this process, leveraging the capabilities of two cutting-edge AI models: Donut, a model that extracts information directly from scanned documents without OCR, and OpenAI GPT-3.5 Turbo, a robust large language model. The proposed methodology is initiated by acquiring the table of contents (ToCs) from construction specification documents and subsequently structuring the ToCs text into JSON data. Remarkable accuracy is achieved, with Donut reaching 85% and GPT-3.5 Turbo reaching 89% in effectively organizing the ToCs. This landmark achievement represents a significant leap forward in document indexing, demonstrating the immense potential of AI to automate information extraction tasks across diverse document types, boosting efficiency and liberating critical resources in various industries.</description><subject>Artificial intelligence</subject><subject>Construction specifications</subject><subject>Documents</subject><subject>Indexing</subject><subject>Information retrieval</subject><subject>Large language models</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><recordid>eNqNy70OgjAUhuHGxESj3MNJnElqSwFd_R0dGNwIwkFL8FRpa4xXLxovwOkbvvcZsLGQch6mkRAjFljbcM5FnAil5JgdswtC7Z3vEEwNlSn9FcmBpgqfms5L2B0yKKiCtSHvoMOHab3ThvQLwRWn9utKQ-7Dbp0p0doeTtmwLlqLwW8nbLbdZKt92Cd3j9bljfEd9VcuFiqOEsXTRP5XvQGWekIj</recordid><startdate>20240312</startdate><enddate>20240312</enddate><creator>Feyisa, Degaga Wolde</creator><creator>Berihun, Haylemicheal</creator><creator>Amanuel Zewdu</creator><creator>Najimoghadam, Mahsa</creator><creator>Zare, Marzieh</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240312</creationdate><title>The future of document indexing: GPT and Donut revolutionize table of content processing</title><author>Feyisa, Degaga Wolde ; Berihun, Haylemicheal ; Amanuel Zewdu ; Najimoghadam, Mahsa ; Zare, Marzieh</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_29564750873</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Artificial intelligence</topic><topic>Construction specifications</topic><topic>Documents</topic><topic>Indexing</topic><topic>Information retrieval</topic><topic>Large language models</topic><toplevel>online_resources</toplevel><creatorcontrib>Feyisa, Degaga Wolde</creatorcontrib><creatorcontrib>Berihun, Haylemicheal</creatorcontrib><creatorcontrib>Amanuel Zewdu</creatorcontrib><creatorcontrib>Najimoghadam, Mahsa</creatorcontrib><creatorcontrib>Zare, Marzieh</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni)</collection><collection>ProQuest Central</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering 
Collection</collection><collection>Engineering Database</collection><collection>Access via ProQuest (Open Access)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Feyisa, Degaga Wolde</au><au>Berihun, Haylemicheal</au><au>Amanuel Zewdu</au><au>Najimoghadam, Mahsa</au><au>Zare, Marzieh</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>The future of document indexing: GPT and Donut revolutionize table of content processing</atitle><jtitle>arXiv.org</jtitle><date>2024-03-12</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Industrial projects rely heavily on lengthy, complex specification documents, making tedious manual extraction of structured information a major bottleneck. This paper introduces an innovative approach to automate this process, leveraging the capabilities of two cutting-edge AI models: Donut, a model that extracts information directly from scanned documents without OCR, and OpenAI GPT-3.5 Turbo, a robust large language model. The proposed methodology is initiated by acquiring the table of contents (ToCs) from construction specification documents and subsequently structuring the ToCs text into JSON data. Remarkable accuracy is achieved, with Donut reaching 85% and GPT-3.5 Turbo reaching 89% in effectively organizing the ToCs. 
This landmark achievement represents a significant leap forward in document indexing, demonstrating the immense potential of AI to automate information extraction tasks across diverse document types, boosting efficiency and liberating critical resources in various industries.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-03
issn 2331-8422
language eng
recordid cdi_proquest_journals_2956475087
source Access via ProQuest (Open Access)
subjects Artificial intelligence
Construction specifications
Documents
Indexing
Information retrieval
Large language models
title The future of document indexing: GPT and Donut revolutionize table of content processing
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T12%3A55%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=The%20future%20of%20document%20indexing:%20GPT%20and%20Donut%20revolutionize%20table%20of%20content%20processing&rft.jtitle=arXiv.org&rft.au=Feyisa,%20Degaga%20Wolde&rft.date=2024-03-12&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2956475087%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_29564750873%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2956475087&rft_id=info:pmid/&rfr_iscdi=true