Capturing Semantics for Imputation with Pre-trained Language Models

Existing imputation methods generally generate several possible fillings as candidates and then choose among them to impute the missing value; however, these methods ignore semantics. Recently, pre-trained language models have achieved strong performance on various language understanding tasks. Motivated by this, we propose IPM, which captures semantics for Imputation with Pre-trained language Models. A straightforward idea is to model imputation as a multiclass classification task, an approach we call IPM-Multi: it predicts the missing value by fine-tuning the pre-trained model. Owing to the low redundancy of databases and large domain sizes, IPM-Multi may suffer from over-fitting. We therefore develop a second approach, IPM-Binary, which first generates a set of uncertain candidates and then fine-tunes a pre-trained language model to select among them. Specifically, IPM-Binary models candidate selection as a binary classification problem: unlike IPM-Multi, it computes a probability for each candidate filling separately, taking both the complete attributes and the candidate filling as input. The attention mechanism strengthens IPM-Binary's ability to capture semantic information. Moreover, negative samples are drawn from neighbors rather than from the whole domain, which accelerates training and makes it more targeted and effective; as a result, IPM-Binary requires less data to converge. We compare IPM to state-of-the-art baselines on multiple datasets, and extensive experimental results show that IPM outperforms existing solutions. The evaluation validates our intuitions and demonstrates the effectiveness of the proposed optimizations.
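
The IPM-Binary formulation above maps naturally onto a sentence-pair classifier. Below is a minimal sketch of that idea, assuming a BERT backbone, a simple "attribute is value" serialization, and the HuggingFace Transformers API; the model name, serialization format, and example data are illustrative assumptions, not the authors' released implementation (which would also require fine-tuning on labeled pairs before the scores are meaningful).

```python
# Minimal sketch of the IPM-Binary idea from the abstract: serialize a tuple's
# complete attributes, pair them with each candidate filling, and let a binary
# classifier score every (attributes, candidate) pair. Backbone, serialization
# format, and example data are assumptions, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

BACKBONE = "bert-base-uncased"  # assumed pre-trained model
tokenizer = AutoTokenizer.from_pretrained(BACKBONE)
model = AutoModelForSequenceClassification.from_pretrained(BACKBONE, num_labels=2)
model.eval()

def serialize(attrs: dict) -> str:
    """Flatten the tuple's complete attributes into text (assumed format)."""
    return " ; ".join(f"{name} is {value}" for name, value in attrs.items())

def impute(attrs: dict, candidates: list) -> str:
    """Return the candidate with the highest predicted 'correct filling' score."""
    context = serialize(attrs)
    inputs = tokenizer([context] * len(candidates), candidates,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Probability of class 1 ("candidate is the correct filling") per pair.
        probs = torch.softmax(model(**inputs).logits, dim=-1)[:, 1]
    return candidates[int(probs.argmax())]

# Hypothetical example: impute a missing 'city' attribute.
row = {"name": "Tsinghua University", "country": "China"}
print(impute(row, ["Beijing", "Shanghai", "Chengdu"]))
```

Because each candidate is scored independently against the tuple's complete attributes, the output layer never grows with the attribute domain, which is what lets IPM-Binary sidestep the over-fitting that IPM-Multi faces on large domains.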

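The neighbor-based negative sampling mentioned at the end of the abstract can be sketched similarly: rather than sampling negatives uniformly from the attribute's whole domain, draw them from the most similar tuples, so training pairs are harder and more targeted. The function below is a hypothetical illustration; the similarity measure, data shapes, and names are all assumptions.

```python
# Hypothetical sketch of negative sampling from neighbors rather than domains:
# negatives for a training tuple come from its most similar tuples' values,
# yielding harder training pairs than uniform sampling over the domain.

def neighbor_negative_samples(rows, target_idx, attr, similarity, k=3):
    """Pick up to k negative fillings for rows[target_idx] from its neighbors.

    rows:       list of dicts, the complete tuples available for training
    attr:       the attribute being imputed
    similarity: assumed function scoring how alike two tuples are
    """
    true_value = rows[target_idx][attr]
    # Rank all other tuples by similarity to the target (nearest first).
    neighbors = sorted((i for i in range(len(rows)) if i != target_idx),
                       key=lambda i: similarity(rows[target_idx], rows[i]),
                       reverse=True)
    negatives = []
    for i in neighbors:
        value = rows[i][attr]
        if value != true_value and value not in negatives:
            negatives.append(value)
        if len(negatives) == k:
            break
    # Training pairs: (tuple, true_value) labeled 1, (tuple, negative) labeled 0.
    return negatives
```
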
Bibliographic Details
Main Authors: Mei, Yinan, Song, Shaoxu, Fang, Chenguang, Yang, Haifeng, Fang, Jingyun, Long, Jiang
Format: Conference Proceeding
Language: English
Published in: 2021 IEEE 37th International Conference on Data Engineering (ICDE), 2021, p.61-72
DOI: 10.1109/ICDE51399.2021.00013
EISSN: 2375-026X
EISBN: 9781728191843, 172819184X
Source: IEEE Xplore All Conference Series
Subjects: Computational modeling; Deep Learning; Filling; Imputation; Pre-trained Language Models; Redundancy; Semantics; Training; Training data
Online Access: Request full text