
HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning


Bibliographic Details
Published in: arXiv.org 2024-12
Main Authors: Bhattarai, Manish; Barron, Ryan; Eren, Maksim; Vu, Minh; Grantcharov, Vesselin; Boureima, Ismael; Stanev, Valentin; Matuszek, Cynthia; Valtchinov, Vladimir; Rasmussen, Kim; Alexandrov, Boian
Format: Article
Language: English
container_title arXiv.org
creator Bhattarai, Manish
Barron, Ryan
Eren, Maksim
Vu, Minh
Grantcharov, Vesselin
Boureima, Ismael
Stanev, Valentin
Matuszek, Cynthia
Valtchinov, Vladimir
Rasmussen, Kim
Alexandrov, Boian
description Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain's specialized content. Although full fine-tuning can align language models to specific domains, it is computationally intensive and demands substantial data. This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning to efficiently align LLM embeddings with domain-specific content. HEAL computes level/depth-wise contrastive losses and incorporates hierarchical penalties to align embeddings with the underlying relationships in label hierarchies. This approach enhances retrieval relevance and document classification, effectively reducing hallucinations in LLM outputs. In our experiments, we benchmark and evaluate HEAL across diverse domains, including Healthcare, Material Science, Cyber-security, and Applied Maths.
format article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3142373699</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3142373699</sourcerecordid><originalsourceid>FETCH-proquest_journals_31423736993</originalsourceid><addsrcrecordid>eNqNjM0KgkAURocgSMp3GGgt2Iw_2U7CMHAVrZNJrzaiM3ZHff4m6AFafRy-w1kRh3F-8I4BYxviGtP5vs-imIUhd8gjz9LiRHMJKLB6yUr0NBueUNdStTTtZasGUBMttDG00Uivw4h6gZreYEIJi_WF-tKIYKwpJqkVLUCgsoUdWTeiN-D-dkv2l-x-zj0bec9gprLTMyp7lfwQMB7zKEn4f9YH7eVDxA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3142373699</pqid></control><display><type>article</type><title>HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning</title><source>Publicly Available Content Database</source><creator>Bhattarai, Manish ; Barron, Ryan ; Eren, Maksim ; Vu, Minh ; Grantcharov, Vesselin ; Boureima, Ismael ; Stanev, Valentin ; Matuszek, Cynthia ; Valtchinov, Vladimir ; Rasmussen, Kim ; Alexandrov, Boian</creator><creatorcontrib>Bhattarai, Manish ; Barron, Ryan ; Eren, Maksim ; Vu, Minh ; Grantcharov, Vesselin ; Boureima, Ismael ; Stanev, Valentin ; Matuszek, Cynthia ; Valtchinov, Vladimir ; Rasmussen, Kim ; Alexandrov, Boian</creatorcontrib><description>Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain's specialized content. Although full fine-tuning can align language models to specific domains, it is computationally intensive and demands substantial data. 
This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning to efficiently align LLM embeddings with domain-specific content. HEAL computes level/depth-wise contrastive losses and incorporates hierarchical penalties to align embeddings with the underlying relationships in label hierarchies. This approach enhances retrieval relevance and document classification, effectively reducing hallucinations in LLM outputs. In our experiments, we benchmark and evaluate HEAL across diverse domains, including Healthcare, Material Science, Cyber-security, and Applied Maths.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Alignment ; Clustering ; Cybersecurity ; Documents ; Embedding ; Hierarchies ; Large language models ; Machine learning ; Retrieval</subject><ispartof>arXiv.org, 2024-12</ispartof><rights>2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://www.proquest.com/docview/3142373699?pq-origsite=primo$$EHTML$$P50$$Gproquest$$Hfree_for_read</linktohtml><link.rule.ids>776,780,25732,36991,44569</link.rule.ids></links><search><creatorcontrib>Bhattarai, Manish</creatorcontrib><creatorcontrib>Barron, Ryan</creatorcontrib><creatorcontrib>Eren, Maksim</creatorcontrib><creatorcontrib>Vu, Minh</creatorcontrib><creatorcontrib>Grantcharov, Vesselin</creatorcontrib><creatorcontrib>Boureima, Ismael</creatorcontrib><creatorcontrib>Stanev, Valentin</creatorcontrib><creatorcontrib>Matuszek, Cynthia</creatorcontrib><creatorcontrib>Valtchinov, Vladimir</creatorcontrib><creatorcontrib>Rasmussen, Kim</creatorcontrib><creatorcontrib>Alexandrov, Boian</creatorcontrib><title>HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning</title><title>arXiv.org</title><description>Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain's specialized content. Although full fine-tuning can align language models to specific domains, it is computationally intensive and demands substantial data. This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning to efficiently align LLM embeddings with domain-specific content. 
HEAL computes level/depth-wise contrastive losses and incorporates hierarchical penalties to align embeddings with the underlying relationships in label hierarchies. This approach enhances retrieval relevance and document classification, effectively reducing hallucinations in LLM outputs. In our experiments, we benchmark and evaluate HEAL across diverse domains, including Healthcare, Material Science, Cyber-security, and Applied Maths.</description><subject>Alignment</subject><subject>Clustering</subject><subject>Cybersecurity</subject><subject>Documents</subject><subject>Embedding</subject><subject>Hierarchies</subject><subject>Large language models</subject><subject>Machine learning</subject><subject>Retrieval</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>PIMPY</sourceid><recordid>eNqNjM0KgkAURocgSMp3GGgt2Iw_2U7CMHAVrZNJrzaiM3ZHff4m6AFafRy-w1kRh3F-8I4BYxviGtP5vs-imIUhd8gjz9LiRHMJKLB6yUr0NBueUNdStTTtZasGUBMttDG00Uivw4h6gZreYEIJi_WF-tKIYKwpJqkVLUCgsoUdWTeiN-D-dkv2l-x-zj0bec9gprLTMyp7lfwQMB7zKEn4f9YH7eVDxA</recordid><startdate>20241205</startdate><enddate>20241205</enddate><creator>Bhattarai, Manish</creator><creator>Barron, Ryan</creator><creator>Eren, Maksim</creator><creator>Vu, Minh</creator><creator>Grantcharov, Vesselin</creator><creator>Boureima, Ismael</creator><creator>Stanev, Valentin</creator><creator>Matuszek, Cynthia</creator><creator>Valtchinov, Vladimir</creator><creator>Rasmussen, Kim</creator><creator>Alexandrov, Boian</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20241205</creationdate><title>HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning</title><author>Bhattarai, Manish ; Barron, Ryan ; Eren, Maksim ; Vu, Minh ; Grantcharov, Vesselin ; Boureima, Ismael ; Stanev, Valentin ; Matuszek, Cynthia ; Valtchinov, Vladimir ; Rasmussen, Kim ; Alexandrov, Boian</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_31423736993</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Alignment</topic><topic>Clustering</topic><topic>Cybersecurity</topic><topic>Documents</topic><topic>Embedding</topic><topic>Hierarchies</topic><topic>Large language models</topic><topic>Machine learning</topic><topic>Retrieval</topic><toplevel>online_resources</toplevel><creatorcontrib>Bhattarai, Manish</creatorcontrib><creatorcontrib>Barron, Ryan</creatorcontrib><creatorcontrib>Eren, Maksim</creatorcontrib><creatorcontrib>Vu, Minh</creatorcontrib><creatorcontrib>Grantcharov, Vesselin</creatorcontrib><creatorcontrib>Boureima, Ismael</creatorcontrib><creatorcontrib>Stanev, Valentin</creatorcontrib><creatorcontrib>Matuszek, Cynthia</creatorcontrib><creatorcontrib>Valtchinov, Vladimir</creatorcontrib><creatorcontrib>Rasmussen, Kim</creatorcontrib><creatorcontrib>Alexandrov, Boian</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest 
Central (Alumni)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Databases</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Bhattarai, Manish</au><au>Barron, Ryan</au><au>Eren, Maksim</au><au>Vu, Minh</au><au>Grantcharov, Vesselin</au><au>Boureima, Ismael</au><au>Stanev, Valentin</au><au>Matuszek, Cynthia</au><au>Valtchinov, Vladimir</au><au>Rasmussen, Kim</au><au>Alexandrov, Boian</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning</atitle><jtitle>arXiv.org</jtitle><date>2024-12-05</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain's specialized content. Although full fine-tuning can align language models to specific domains, it is computationally intensive and demands substantial data. 
This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning to efficiently align LLM embeddings with domain-specific content. HEAL computes level/depth-wise contrastive losses and incorporates hierarchical penalties to align embeddings with the underlying relationships in label hierarchies. This approach enhances retrieval relevance and document classification, effectively reducing hallucinations in LLM outputs. In our experiments, we benchmark and evaluate HEAL across diverse domains, including Healthcare, Material Science, Cyber-security, and Applied Maths.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-12
issn 2331-8422
language eng
recordid cdi_proquest_journals_3142373699
source Publicly Available Content Database
subjects Alignment
Clustering
Cybersecurity
Documents
Embedding
Hierarchies
Large language models
Machine learning
Retrieval
title HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning
url http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-24T19%3A37%3A41IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=HEAL:%20Hierarchical%20Embedding%20Alignment%20Loss%20for%20Improved%20Retrieval%20and%20Representation%20Learning&rft.jtitle=arXiv.org&rft.au=Bhattarai,%20Manish&rft.date=2024-12-05&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3142373699%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_31423736993%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3142373699&rft_id=info:pmid/&rfr_iscdi=true