Capturing Semantics for Imputation with Pre-trained Language Models

Existing imputation methods generally generate several possible fillings as candidates and then choose among them to impute the missing value; however, these methods ignore semantics. Recently, pre-trained language models have achieved strong performance on various language understanding tasks. Motivated by this, we propose IPM, which captures semantics for Imputation with Pre-trained language Models. A straightforward idea is to model imputation as a multiclass classification task, an approach we call IPM-Multi: it predicts the missing value by fine-tuning the pre-trained model. Owing to the low redundancy of databases and large domain sizes, IPM-Multi may suffer from over-fitting. We therefore develop a second approach, IPM-Binary, which first generates a set of uncertain candidates and then fine-tunes a pre-trained language model to select among them. Specifically, IPM-Binary models candidate selection as a binary classification problem: unlike IPM-Multi, it computes a probability for each candidate filling separately, taking both the complete attributes and the candidate filling as input. The attention mechanism strengthens IPM-Binary's ability to capture semantic information. Moreover, negative samples are drawn from neighbors rather than from the whole domain, which accelerates training and makes it more targeted and effective; as a result, IPM-Binary requires less data to converge. We compare IPM to state-of-the-art baselines on multiple datasets, and extensive experimental results show that IPM outperforms existing solutions. The evaluation validates our intuitions and demonstrates the effectiveness of the proposed optimizations.
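
The IPM-Binary formulation above maps naturally onto a sentence-pair classifier. Below is a minimal sketch of that idea, assuming a BERT backbone, a simple "attribute is value" serialization, and the HuggingFace Transformers API; the model name, serialization format, and example data are illustrative assumptions, not the authors' released implementation (which would also require fine-tuning on labeled pairs before the scores are meaningful).

```python
# Minimal sketch of the IPM-Binary idea from the abstract: serialize a tuple's
# complete attributes, pair them with each candidate filling, and let a binary
# classifier score every (attributes, candidate) pair. Backbone, serialization
# format, and example data are assumptions, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

BACKBONE = "bert-base-uncased"  # assumed pre-trained model
tokenizer = AutoTokenizer.from_pretrained(BACKBONE)
model = AutoModelForSequenceClassification.from_pretrained(BACKBONE, num_labels=2)
model.eval()

def serialize(attrs: dict) -> str:
    """Flatten the tuple's complete attributes into text (assumed format)."""
    return " ; ".join(f"{name} is {value}" for name, value in attrs.items())

def impute(attrs: dict, candidates: list) -> str:
    """Return the candidate with the highest predicted 'correct filling' score."""
    context = serialize(attrs)
    inputs = tokenizer([context] * len(candidates), candidates,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Probability of class 1 ("candidate is the correct filling") per pair.
        probs = torch.softmax(model(**inputs).logits, dim=-1)[:, 1]
    return candidates[int(probs.argmax())]

# Hypothetical example: impute a missing 'city' attribute.
row = {"name": "Tsinghua University", "country": "China"}
print(impute(row, ["Beijing", "Shanghai", "Chengdu"]))
```

Because each candidate is scored independently against the tuple's complete attributes, the output layer never grows with the attribute domain, which is what lets IPM-Binary sidestep the over-fitting that IPM-Multi faces on large domains.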

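The neighbor-based negative sampling mentioned at the end of the abstract can be sketched similarly: rather than sampling negatives uniformly from the attribute's whole domain, draw them from the most similar tuples, so training pairs are harder and more targeted. The function below is a hypothetical illustration; the similarity measure, data shapes, and names are all assumptions.

```python
# Hypothetical sketch of negative sampling from neighbors rather than domains:
# negatives for a training tuple come from its most similar tuples' values,
# yielding harder training pairs than uniform sampling over the domain.

def neighbor_negative_samples(rows, target_idx, attr, similarity, k=3):
    """Pick up to k negative fillings for rows[target_idx] from its neighbors.

    rows:       list of dicts, the complete tuples available for training
    attr:       the attribute being imputed
    similarity: assumed function scoring how alike two tuples are
    """
    true_value = rows[target_idx][attr]
    # Rank all other tuples by similarity to the target (nearest first).
    neighbors = sorted((i for i in range(len(rows)) if i != target_idx),
                       key=lambda i: similarity(rows[target_idx], rows[i]),
                       reverse=True)
    negatives = []
    for i in neighbors:
        value = rows[i][attr]
        if value != true_value and value not in negatives:
            negatives.append(value)
        if len(negatives) == k:
            break
    # Training pairs: (tuple, true_value) labeled 1, (tuple, negative) labeled 0.
    return negatives
```
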
Bibliographic Details
Main Authors: Mei, Yinan, Song, Shaoxu, Fang, Chenguang, Yang, Haifeng, Fang, Jingyun, Long, Jiang
Format: Conference Proceeding
Language: English
Published in: 2021 IEEE 37th International Conference on Data Engineering (ICDE), 2021, p.61-72
DOI: 10.1109/ICDE51399.2021.00013
EISSN: 2375-026X
EISBN: 9781728191843, 172819184X
Source: IEEE Xplore All Conference Series
Subjects: Computational modeling; Deep Learning; Filling; Imputation; Pre-trained Language Models; Redundancy; Semantics; Training; Training data
Online Access: Request full text