DOCMASTER: A Unified Platform for Annotation, Training, & Inference in Document Question-Answering

The application of natural language processing models to PDF documents is pivotal for various business applications, yet training models for this purpose remains a challenge for businesses because of specific hurdles. These include the complexity of working with PDF formats that necessitate parsing...

Bibliographic Details
Published in:	arXiv.org 2024-03
Main Authors:	Nguyen, Alex; Wang, Zilong; Shang, Jingbo; Mekala, Dheeraj
Format:	Article
Language:	English
Subjects:	Annotations; Inference; Layouts; Natural language processing; Portable document format; Privacy; Questions

container_title	arXiv.org
creator	Nguyen, Alex; Wang, Zilong; Shang, Jingbo; Mekala, Dheeraj
description	The application of natural language processing models to PDF documents is pivotal for various business applications, yet training models for this purpose remains a challenge for businesses because of specific hurdles. These include the complexity of working with PDF formats, which necessitate parsing text and layout information to curate training data, and the lack of privacy-preserving annotation tools. This paper introduces DOCMASTER, a unified platform for annotating PDF documents, training models, and running inference, tailored to document question-answering. The annotation interface lets users input questions and highlight text spans within the PDF file as answers, saving the layout information and text spans accordingly. Furthermore, DOCMASTER supports both state-of-the-art layout-aware models and text models for training. Importantly, because annotation, training, and inference all occur on-device, it also safeguards privacy. The platform has been instrumental in driving several research prototypes concerning document analysis, such as the AI assistant used by University of California San Diego's (UCSD) International Services and Engagement Office (ISEO) to process a substantial volume of PDF documents.
format	article
publisher	Ithaca: Cornell University Library, arXiv.org
date	2024-03-30
rights	2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
oa	free_for_read
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2024-03
issn	2331-8422
language	eng
recordid	cdi_proquest_journals_3030960489
source	Publicly Available Content Database
subjects	Annotations; Inference; Layouts; Natural language processing; Portable document format; Privacy; Questions
title	DOCMASTER: A Unified Platform for Annotation, Training, & Inference in Document Question-Answering
url	http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-04T14%3A05%3A07IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=DOCMASTER:%20A%20Unified%20Platform%20for%20Annotation,%20Training,%20&%20Inference%20in%20Document%20Question-Answering&rft.jtitle=arXiv.org&rft.au=Nguyen,%20Alex&rft.date=2024-03-30&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3030960489%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_30309604893%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3030960489&rft_id=info:pmid/&rfr_iscdi=true