
LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked

Large language models (LLMs) are popular for high-quality text generation but can produce harmful content, even when aligned with human values through reinforcement learning. Adversarial prompts can bypass their safety measures. We propose LLM Self Defense, a simple approach to defend against these attacks by having an LLM screen the induced responses. Our method does not require any fine-tuning, input preprocessing, or iterative output generation. Instead, we incorporate the generated content into a pre-defined prompt and employ another instance of an LLM to analyze the text and predict whether it is harmful. We test LLM Self Defense on GPT 3.5 and Llama 2, two of the current most prominent LLMs, against various types of attacks, such as forcefully inducing affirmative responses to prompts and prompt engineering attacks. Notably, LLM Self Defense succeeds in reducing the attack success rate to virtually 0 using both GPT 3.5 and Llama 2.
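The screening step described in the abstract can be sketched in a few lines. The sketch below is an illustrative reconstruction, not the authors' released code: the `query_llm` helper, the exact wording of the harm-classification prompt, and the keyword check on the verdict are assumptions introduced here for illustration.

```python
# Minimal sketch of the LLM Self Defense idea described in the abstract:
# a candidate response is embedded in a pre-defined prompt, and a second
# LLM instance is asked to judge whether the text is harmful.
# `query_llm` is a hypothetical stand-in for whatever chat/completion API
# is in use (e.g. GPT 3.5 or Llama 2); it takes a prompt string and
# returns the model's text reply.

HARM_FILTER_PROMPT = (
    "Does the following text contain harmful content? "
    "Answer 'Yes, this is harmful' or 'No, this is not harmful'.\n\n"
    "Text:\n{response}"
)


def is_harmful(candidate_response: str, query_llm) -> bool:
    """Ask a separate LLM instance to classify the candidate response."""
    # Embed the generated content into the pre-defined screening prompt.
    prompt = HARM_FILTER_PROMPT.format(response=candidate_response)
    verdict = query_llm(prompt)
    # Assumed convention: the screening model answers starting with "Yes"/"No".
    return verdict.strip().lower().startswith("yes")


def respond_with_self_defense(user_prompt: str, query_llm) -> str:
    """Generate a reply, then screen it before returning it to the user."""
    candidate = query_llm(user_prompt)
    if is_harmful(candidate, query_llm):
        return "I'm sorry, I can't help with that."
    return candidate
```

Note that the defense requires no fine-tuning or input preprocessing: the only added cost is one extra LLM call per response, and the screening model can be the same model that produced the text.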

Bibliographic Details
Published in: arXiv.org, 2023-10-24
Main Authors: Phute, Mansi; Helbling, Alec; Hull, Matthew; Peng, ShengYun; Szyller, Sebastian; Cornelius, Cory; Chau, Duen Horng
Format: Article
Language: English
EISSN: 2331-8422
Publisher: Ithaca: Cornell University Library, arXiv.org
Subjects: Large language models
Online Access: Get full text