HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddi...
Published in: | arXiv.org 2024-12 |
---|---|
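Main Authors: | Bhattarai, Manish; Barron, Ryan; Eren, Maksim; Vu, Minh; Grantcharov, Vesselin; Boureima, Ismael; Stanev, Valentin; Matuszek, Cynthia; Valtchinov, Vladimir; Rasmussen, Kim; Alexandrov, Boian |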
Format: | Article |
Language: | English |
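Subjects: | Alignment; Clustering; Cybersecurity; Documents; Embedding; Hierarchies; Large language models; Machine learning; Retrieval |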
cited_by | |
---|---|
cites | |
container_end_page | |
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Bhattarai, Manish; Barron, Ryan; Eren, Maksim; Vu, Minh; Grantcharov, Vesselin; Boureima, Ismael; Stanev, Valentin; Matuszek, Cynthia; Valtchinov, Vladimir; Rasmussen, Kim; Alexandrov, Boian |
description | Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain's specialized content. Although full fine-tuning can align language models to specific domains, it is computationally intensive and demands substantial data. This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning to efficiently align LLM embeddings with domain-specific content. HEAL computes level/depth-wise contrastive losses and incorporates hierarchical penalties to align embeddings with the underlying relationships in label hierarchies. This approach enhances retrieval relevance and document classification, effectively reducing hallucinations in LLM outputs. In our experiments, we benchmark and evaluate HEAL across diverse domains, including Healthcare, Material Science, Cyber-security, and Applied Maths. |
format | article |
publisher | Ithaca: Cornell University Library, arXiv.org |
date | 2024-12-05 |
rights | 2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
oa | free_for_read |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-12 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3142373699 |
source | Publicly Available Content Database |
subjects | Alignment; Clustering; Cybersecurity; Documents; Embedding; Hierarchies; Large language models; Machine learning; Retrieval |
title | HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning |
url | http://sfxeu10.hosted.exlibrisgroup.com/loughborough?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-24T19%3A37%3A41IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=HEAL:%20Hierarchical%20Embedding%20Alignment%20Loss%20for%20Improved%20Retrieval%20and%20Representation%20Learning&rft.jtitle=arXiv.org&rft.au=Bhattarai,%20Manish&rft.date=2024-12-05&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3142373699%3C/proquest%3E%3Cgrp_id%3Ecdi_FETCH-proquest_journals_31423736993%3C/grp_id%3E%3Coa%3E%3C/oa%3E%3Curl%3E%3C/url%3E&rft_id=info:oai/&rft_pqid=3142373699&rft_id=info:pmid/&rfr_iscdi=true |
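
The description in this record says HEAL computes level/depth-wise contrastive losses and combines them with hierarchical penalties. The record does not give the formulas, so the following is only a minimal sketch of that general idea, not the authors' formulation: it applies a standard supervised contrastive loss at each hierarchy level and sums the results with per-level weights. The function names, the `temperature` value, and the example weights are all assumptions made for illustration.

```python
import numpy as np


def supcon_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss for one label set (one hierarchy level).

    Illustrative only: the record does not specify HEAL's per-level loss.
    """
    z = np.asarray(z, dtype=float)
    labels = np.asarray(labels)
    n = z.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    # Positives: other samples that share the anchor's label at this level.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0
    if not has_pos.any():
        return 0.0  # no anchor has a positive, so nothing to contrast

    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine similarity
    sim = z @ z.T / temperature
    sim = np.where(not_self, sim, -np.inf)  # exclude self-similarity
    # Numerically stable log-softmax over all other samples in the batch.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - row_max - log_norm
    # Mean log-probability of the positives for each anchor that has any.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / n_pos[has_pos]).mean())


def hierarchical_loss(z, level_labels, level_weights, temperature=0.1):
    """Weighted sum of per-level losses; the weights act as the penalties.

    level_labels: one label array per hierarchy level (hypothetical layout).
    level_weights: one non-negative weight per level (assumed, not sourced).
    """
    total = 0.0
    for labels, weight in zip(level_labels, level_weights):
        total += weight * supcon_loss(z, labels, temperature)
    return total


if __name__ == "__main__":
    # Four toy embeddings with a two-level label hierarchy.
    emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    coarse = [0, 0, 1, 1]  # level 1: two clusters
    fine = [0, 1, 2, 3]    # level 2: every sample in its own cluster
    print(hierarchical_loss(emb, [coarse, fine], [0.7, 0.3]))
```

In this sketch, a level where no sample shares a cluster with another contributes zero, and giving shallower levels larger weights is just one possible design choice. How the actual method derives the hierarchy (the record mentions hierarchical fuzzy clustering with matrix factorization) is outside the scope of this example.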