Capturing Semantics for Imputation with Pre-trained Language Models
Main Authors: | Mei, Yinan; Song, Shaoxu; Fang, Chenguang; Yang, Haifeng; Fang, Jingyun; Long, Jiang |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | Computational modeling; Deep Learning; Filling; Imputation; Pre-trained Language Models; Redundancy; Semantics; Training; Training data |
container_end_page | 72 |
---|---|
container_start_page | 61 |
creator | Mei, Yinan; Song, Shaoxu; Fang, Chenguang; Yang, Haifeng; Fang, Jingyun; Long, Jiang |
description | Existing imputation methods generally generate several possible fillings as candidates and then select the value to impute from among them. However, these methods ignore semantics. Recently, pre-trained language models have achieved strong performance on various language understanding tasks. Motivated by this, we propose IPM, which captures semantics for Imputation with Pre-trained language Models. A straightforward idea is to model imputation as a multiclass classification task, named IPM-Multi. IPM-Multi predicts the missing values by fine-tuning the pre-trained model. Owing to the low redundancy of databases and large attribute domains, IPM-Multi may suffer from over-fitting. For this case, we develop another approach named IPM-Binary. IPM-Binary first generates a set of uncertain candidates and fine-tunes a pre-trained language model to select among them. Specifically, IPM-Binary models candidate selection as a binary classification problem. Unlike IPM-Multi, IPM-Binary computes a probability separately for each candidate filling, accepting both the complete attributes and the candidate filling as input. The attention mechanism enhances the ability of IPM-Binary to capture semantic information. Moreover, negative sampling from neighbors rather than from the whole domain is employed to accelerate training and make it more targeted and effective. As a result, IPM-Binary requires less data to converge. We compare our proposal IPM against state-of-the-art baselines on multiple datasets. Extensive experimental results show that IPM outperforms existing solutions. The evaluation of IPM validates our intuitions and demonstrates the effectiveness of the proposed optimizations. |
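The IPM-Binary scoring step described above lends itself to a compact illustration. Below is a minimal sketch, assuming a HuggingFace `transformers` BERT-style backbone; the tuple serialization, the `score_candidates` helper, and the example values are hypothetical assumptions, not the authors' released code, and the classifier head would first need to be fine-tuned on positives and neighbor-sampled negatives as the abstract describes.

```python
# Minimal sketch of IPM-Binary's inference-time scoring: each candidate
# filling is paired with the tuple's complete attributes and scored by a
# binary classifier on top of a pre-trained language model. Model choice
# and serialization below are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

BACKBONE = "bert-base-uncased"  # assumed pre-trained backbone
tokenizer = AutoTokenizer.from_pretrained(BACKBONE)
model = AutoModelForSequenceClassification.from_pretrained(BACKBONE, num_labels=2)
model.eval()

def score_candidates(complete_attrs: dict, candidates: list) -> list:
    """Return P(candidate is the correct filling) for each candidate.

    complete_attrs: the tuple's observed attribute/value pairs.
    candidates: possible fillings for the missing attribute.
    """
    # Serialize the complete attributes as segment A and the candidate as
    # segment B; self-attention over the joint sequence is what lets the
    # model relate the candidate to the tuple's semantics.
    context = " ; ".join(f"{k} is {v}" for k, v in complete_attrs.items())
    inputs = tokenizer([context] * len(candidates), candidates,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (num_candidates, 2)
    return torch.softmax(logits, dim=-1)[:, 1].tolist()

# Hypothetical usage: impute a missing "city" attribute.
cands = ["Beijing", "Shanghai", "Shenzhen"]
probs = score_candidates({"name": "Tsinghua University", "country": "China"}, cands)
print(max(zip(probs, cands)))  # candidate with the highest probability
```

Scoring each candidate independently, rather than classifying over the whole attribute domain as IPM-Multi does, keeps the output space small and makes neighbor-based negative sampling straightforward during fine-tuning.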
doi_str_mv | 10.1109/ICDE51399.2021.00013 |
format | conference_proceeding |
identifier | EISSN: 2375-026X |
ispartof | 2021 IEEE 37th International Conference on Data Engineering (ICDE), 2021, p.61-72 |
issn | 2375-026X |
language | eng |
source | IEEE Xplore All Conference Series |
subjects | Computational modeling; Deep Learning; Filling; Imputation; Pre-trained Language Models; Redundancy; Semantics; Training; Training data |
title | Capturing Semantics for Imputation with Pre-trained Language Models |