The future of document indexing: GPT and Donut revolutionize table of content processing
Industrial projects rely heavily on lengthy, complex specification documents, making tedious manual extraction of structured information a major bottleneck. This paper introduces an innovative approach to automate this process, leveraging the capabilities of two cutting-edge AI models: Donut, a mode...
Published in: | arXiv.org 2024-03 |
---|---|
Main Authors: | Feyisa, Degaga Wolde ; Berihun, Haylemicheal ; Amanuel Zewdu ; Najimoghadam, Mahsa ; Zare, Marzieh |
Format: | Article |
Language: | English |
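Subjects: | Artificial intelligence ; Construction specifications ; Documents ; Indexing ; Information retrieval ; Large language models |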
Online Access: | Get full text |
cited_by | |
---|---|
cites | |
container_end_page | |
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Feyisa, Degaga Wolde ; Berihun, Haylemicheal ; Amanuel Zewdu ; Najimoghadam, Mahsa ; Zare, Marzieh |
description | Industrial projects rely heavily on lengthy, complex specification documents, making tedious manual extraction of structured information a major bottleneck. This paper introduces an innovative approach to automate this process, leveraging the capabilities of two cutting-edge AI models: Donut, a model that extracts information directly from scanned documents without OCR, and OpenAI GPT-3.5 Turbo, a robust large language model. The proposed methodology is initiated by acquiring the table of contents (ToCs) from construction specification documents and subsequently structuring the ToCs text into JSON data. Remarkable accuracy is achieved, with Donut reaching 85% and GPT-3.5 Turbo reaching 89% in effectively organizing the ToCs. This landmark achievement represents a significant leap forward in document indexing, demonstrating the immense potential of AI to automate information extraction tasks across diverse document types, boosting efficiency and liberating critical resources in various industries. |
format | article |
date | 2024-03-12 |
oa | free_for_read |
publisher | Ithaca: Cornell University Library, arXiv.org |
rights | 2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-03 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2956475087 |
source | Access via ProQuest (Open Access) |
subjects | Artificial intelligence ; Construction specifications ; Documents ; Indexing ; Information retrieval ; Large language models |
title | The future of document indexing: GPT and Donut revolutionize table of content processing |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T12%3A55%3A48IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=The%20future%20of%20document%20indexing:%20GPT%20and%20Donut%20revolutionize%20table%20of%20content%20processing&rft.jtitle=arXiv.org&rft.au=Feyisa,%20Degaga%20Wolde&rft.date=2024-03-12&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2956475087%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_29564750873%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=2956475087&rft_id=info:pmid/&rfr_iscdi=true |