![生成式AI紅隊(duì)百次測(cè)試經(jīng)驗(yàn)白皮書_第1頁(yè)](http://file4.renrendoc.com/view14/M06/08/2E/wKhkGWempX6ATCj0AAFWPdCCf2I206.jpg)
![生成式AI紅隊(duì)百次測(cè)試經(jīng)驗(yàn)白皮書_第2頁(yè)](http://file4.renrendoc.com/view14/M06/08/2E/wKhkGWempX6ATCj0AAFWPdCCf2I2062.jpg)
![生成式AI紅隊(duì)百次測(cè)試經(jīng)驗(yàn)白皮書_第3頁(yè)](http://file4.renrendoc.com/view14/M06/08/2E/wKhkGWempX6ATCj0AAFWPdCCf2I2063.jpg)
![生成式AI紅隊(duì)百次測(cè)試經(jīng)驗(yàn)白皮書_第4頁(yè)](http://file4.renrendoc.com/view14/M06/08/2E/wKhkGWempX6ATCj0AAFWPdCCf2I2064.jpg)
![生成式AI紅隊(duì)百次測(cè)試經(jīng)驗(yàn)白皮書_第5頁(yè)](http://file4.renrendoc.com/view14/M06/08/2E/wKhkGWempX6ATCj0AAFWPdCCf2I2065.jpg)
版權(quán)說(shuō)明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡(jiǎn)介
Lessonsfromredteaming100
generativeAIproducts
Authoredby:
MicrosoftAIRedTeam
Lessonsfromredteaming100generativeAIproducts2
Authors
BlakeBullwinkel,AmandaMinnich,ShivenChawla,GaryLopez,MartinPouliot,WhitneyMaxwell,JorisdeGruyter,KatherinePratt,SaphirQi,NinaChikanov,RomanLutz,RajaSekharRaoDheekonda,Bolor-ErdeneJagdagdorj,
EugeniaKim,JustinSong,KeeganHines,DanielJones,GiorgioSeveri,RichardLundeen,SamVaughan,
VictoriaWesterhoff,PeteBryan,RamShankarSivaKumar,YonatanZunger,ChangKawaguchi,MarkRussinovich
Lessonsfromredteaming100generativeAIproducts3
Tableofcontents
04
Abstract
07
Redteaming
operations
09
Casestudy#1
Jailbreakingavision
languagemodeltogenerate
hazardouscontent
12
Lesson4
Automationcanhelpcover
moreoftherisklandscape
05
Introduction
08
Lesson1
Understandwhatthesystem
candoandwhereitisapplied
10
Lesson3
AIredteamingisnot
safetybenchmarking
12
Lesson5
ThehumanelementofAI
redteamingiscrucial
05
AIthreatmodel
ontology
08
Lesson2
Youdon’thavetocompute
gradientstobreakanAIsystem
11
Casestudy#2
AssessinghowanLLMcouldbe
usedtoautomatescams
13
Casestudy#3
Evaluatinghowachatbot
respondstoauserindistress
14
Casestudy#4
Probingatext-to-image
generatorforgenderbias
16
Casestudy#5
SSRFinavideo-processing
GenAIapplication
14
Lesson6
ResponsibleAIharmsare
pervasivebutdifficulttomeasure
17
Lesson8
TheworkofsecuringAIsystems
willneverbecomplete
15
Lesson7
LLMsamplifyexistingsecurity
risksandintroducenewones
18
Conclusion
Lessonsfromredteaming100generativeAIproducts4
Abstract
Inrecentyears,AIredteaminghasemergedasapracticeforprobingthesafetyandsecurityofgenerativeAI
systems.Duetothenascencyofthefield,therearemanyopenquestionsabouthowredteamingoperationsshouldbeconducted.Basedonourexperienceredteamingover100generativeAIproductsatMicrosoft,wepresentourinternalthreatmodelontologyandeightmainlessonswehavelearned:
1.Understandwhatthesystemcandoandwhereitisapplied
2.Youdon’thavetocomputegradientstobreakanAIsystem
3.AIredteamingisnotsafetybenchmarking
4.Automationcanhelpcovermoreoftherisklandscape
5.ThehumanelementofAIredteamingiscrucial
6.ResponsibleAIharmsarepervasivebutdifficulttomeasure
7.Largelanguagemodels(LLMs)amplifyexistingsecurityrisksandintroducenewones
8.TheworkofsecuringAIsystemswillneverbecomplete
Bysharingtheseinsightsalongsidecasestudiesfromouroperations,weofferpracticalrecommendationsaimedataligningredteamingeffortswithrealworldrisks.WealsohighlightaspectsofAIredteamingthatwebelieveareoftenmisunderstoodanddiscussopenquestionsforthefieldtoconsider.
Lessonsfromredteaming100generativeAIproducts5
Introduction
AsgenerativeAI(GenAI)systemsareadoptedacrossanincreasingnumberofdomains,AIredteaminghasemergedasacentralpracticeforassessingthesafetyandsecurityofthesetechnologies.Atitscore,AIredteamingstrivestopushbeyondmodel-levelsafety
benchmarksbyemulatingreal-worldattacksagainstend-to-endsystems.However,therearemanyopenquestionsabouthowredteamingoperationsshouldbeconductedandahealthydoseofskepticismabouttheefficacyofcurrentAIredteamingefforts[4,8,32].Inthispaper,wespeaktosomeoftheseconcernsbyprovidinginsightintoourexperienceredteaming
over100GenAIproductsatMicrosoft.Thepaper
isorganizedasfollows:First,wepresentthethreat
modelontologythatweusetoguideouroperations.Second,weshareeightmainlessonswehavelearnedandmakepracticalrecommendationsforAIred
teams,alongwithcasestudiesfromouroperations.
Inparticular,thesecasestudieshighlighthowour
ontologyisusedtomodelabroadrangeofsafety
andsecurityrisks.Finally,weclosewithadiscussionofareasforfuturedevelopment.
Background
TheMicrosoftAIRedTeam(AIRT)grewoutofpre-existingredteaminginitiativesatthecompanyandwasofficiallyestablishedin2018.Atitsconception,theteamfocusedprimarilyonidentifyingtraditionalsecurityvulnerabilitiesandevasionattacksagainstclassicalMLmodels.Sincethen,boththescopeandscaleofAIredteamingatMicrosofthaveexpandedsignificantlyinresponsetotwomajortrends.
First,AIsystemshavebecomemoresophisticated,
compellingustoexpandthescopeofAIredteaming.Mostnotably,state-of-the-art(SoTA)modelshave
gainednewcapabilitiesandsteadilyimprovedacrossarangeofperformancebenchmarks,introducing
novelcategoriesofrisk.Newdatamodalities,suchasvisionandaudio,alsocreatemoreattackvectorsforredteamingoperationstoconsider.Inaddition,agenticsystemsgrantthesemodelshigherprivilegesandaccesstoexternaltools,expandingboththe
attacksurfaceandtheimpactofattacks.
Second,Microsoft’srecentinvestmentsinAIhave
spurredthedevelopmentofmanymoreproductsthatrequireredteamingthaneverbefore.Thisincrease
involumeandtheexpandedscopeofAIredteaminghaverenderedfullymanualtestingimpractical,
forcingustoscaleupouroperationswiththehelpofautomation.Toachievethisgoal,wedevelopPyRIT,
anopen-sourcePythonframeworkthatouroperatorsutilizeheavilyinredteamingoperations[27].By
augmentinghumanjudgementandcreativity,PyRIThasenabledAIRTtoidentifyimpactfulvulnerabilitiesmorequicklyandcovermoreoftherisklandscape.ThesetwomajortrendshavemadeAIredteamingamorecomplexendeavorthanitwasin2018.In
thenextsection,weoutlinetheontologywehavedevelopedtomodelAIsystemvulnerabilities.
AIthreatmodel
ontology
Asattacksandfailuremodesincreaseincomplexity,itishelpfultomodeltheirkeycomponents.Basedonourexperienceredteamingover100GenAIproductsforawiderangeofrisks,wedevelopedanontologytodoexactlythat.Figure1illustratesthemain
componentsofourontology:
?System:Theend-to-endmodelorapplicationbeingtested.
?Actor:ThepersonorpersonsbeingemulatedbyAIRT.NotethattheActor’sintentcouldbeadversarial(e.g.,ascammer)orbenign(e.g.,atypicalchatbotuser).
?TTPs:TheTactics,Techniques,andProceduresleveragedbyAIRT.AtypicalattackconsistsofmultipleTacticsandTechniques,whichwemaptoMITREATT&CK?andMITREATLASMatrixwheneverpossible.
–Tactic:High-levelstagesofanattack(e.g.,reconnaissance,MLmodelaccess).
–Technique:Methodsusedtocompleteanobjective(e.g.,activescanning,jailbreak).
–Procedure:ThestepsrequiredtoreproduceanattackusingtheTacticsandTechniques.
?Weakness:ThevulnerabilityorvulnerabilitiesintheSystemthatmaketheattackpossible.
?Impact:Thedownstreamimpactcreatedbytheattack(e.g.,privilegeescalation,generationofharmfulcontent).
Itisimportanttonotethatthisframeworkdoesnotassumeadversarialintent.Inparticular,AIRTemulatesbothadversarialattackersandbenignuserswho
encountersystemfailuresunintentionally.PartofthecomplexityofAIredteamingstemsfromthewiderangeofimpactsthatcouldbecreatedbyanattack
Lessonsfromredteaming100generativeAIproducts6
orsystemfailure.Inthelessonsbelow,weshare
casestudiesdemonstratinghowourontologyis
flexibleenoughtomodeldiverseimpactsintwomaincategories:securityandsafety.
Securityencompasseswell-knownimpactssuch
asdataexfiltration,datamanipulation,credential
dumping,andothersdefinedinMITREATT&CK?,awidelyusedknowledgebaseofsecurityattacks.WealsoconsidersecurityattacksthatspecificallytargettheunderlyingAImodelsuchasmodelevasion,
promptinjections,denialofAIservice,andotherscoveredbytheMITREATLASMatrix.
Safetyimpactsarerelatedtothegenerationofillegalandharmfulcontentsuchashatespeech,violence
andself-harm,andchildabusecontent.AIRTworkscloselywiththeOfficeofResponsibleAItodefinethesecategoriesinaccordancewithMicrosoft’s
ResponsibleAIStandard[25].Werefertothese
impactsasresponsibleAI(RAI)harmsthroughoutthisreport.
Tounderstandthisontologyincontext,consider
thefollowingexample.Imagineweareredteaming
anLLM-basedcopilotthatcansummarizeauser’s
emails.Onepossibleattackagainstthissystemwouldbeforascammertosendanemailthatcontainsa
hiddenpromptinjectioninstructingthecopilotto
“ignorepreviousinstructions”andoutputamaliciouslink.Inthisscenario,theActoristhescammer,who
isconductingacross-promptinjectionattack(XPIA),whichexploitsthefactthatLLMsoftenstruggleto
distinguishbetweensystem-levelinstructionsand
userdata[4].ThedownstreamImpactdependsonthenatureofthemaliciouslinkthatthevictimmightclickon.Inthisexample,itcouldbeexfiltratingdataor
installingmalwareontotheuser’scomputer.
Actor
Conducts
TTPs
:
●
Leverages
●
●
●
Attack
Exploits
Mitigation
:
●
Mitigatedby
●
●
●
Weakness
Occursin
●
:
System
Creates
Impact
Figure1:MicrosoftAIRTontologyformodelingGenAIsystemvulnerabilities.AIRToftenleveragesmultipleTTPs,whichmayexploitmultipleWeaknessesandcreatemultipleImpacts.Inaddition,morethanoneMitigationmaybenecessarytoaddressaWeakness.NotethatAIRTistaskedonlywithidentifyingrisks,whileproductteamsareresourcedtodevelopappropriatemitigations.
Lessonsfromredteaming100generativeAIproducts7
Redteaming
operations
Inthissection,weprovideanoverviewofthe
operationswehaveconductedsince2021.Intotal,wehaveredteamedover100GenAIproducts.Broadly
speaking,theseproductscanbebucketedinto
“models”and“systems.”Modelsaretypicallyhostedonacloudendpoint,whilesystemsintegratemodelsintocopilots,plugins,andotherAIappsandfeatures.Figure2showsthebreakdownofproductswehave
redteamedsince2021.Figure3showsabarchartwiththeannualpercentageofouroperationsthathave
probedforsafety(RAI)vs.securityvulnerabilities.
In2021,wefocusedprimarilyonapplicationsecurity.Althoughouroperationshaveincreasinglyprobed
forRAIimpacts,ourteamcontinuestoredteamforsecurityimpactsincludingdataexfiltration,credentialleaking,andremotecodeexecution.Organizations
haveadoptedmanydifferentapproachestoAIred
teamingrangingfromsecurity-focusedassessmentswithpenetrationtestingtoevaluationsthattarget
onlyGenAIfeatures.InLessons2and7,weelaborateonsecurityvulnerabilitiesandexplainwhywebelieveitisimportanttoconsiderbothtraditionalandAI-
specificweaknesses.
AfterthereleaseofChatGPTin2022,MicrosoftenteredtheeraofAIcopilots,startingwithAI-poweredBingChat,releasedinFebruary2023.
Thismarkedaparadigmshifttowardsapplications
thatconnectLLMstoothersoftwarecomponents
includingtools,databases,andexternalsources.
Applicationsalsostartedusinglanguagemodelsas
reasoningagentsthatcantakeactionsonbehalfof
users,introducinganewsetofattackvectorsthat
haveexpandedthesecurityrisksurface.InLesson
7,weexplainhowtheseattackvectorsbothamplifyexistingsecurityrisksandintroducenewones.
Inrecentyears,themodelsatthecenterofthese
applicationshavegivenrisetonewinterfaces,
allowinguserstointeractwithappsusingnatural
languageandrespondingwithhigh-qualitytext,
image,video,andaudiocontent.DespitemanyeffortstoalignpowerfulAImodelstohumanpreferences,
manymethodshavebeendevelopedtosubvert
safetyguardrailsandelicitcontentthatisoffensive,unethical,orillegal.Weclassifytheseinstancesof
harmfulcontentgenerationasRAIimpactsandin
Lessons3,5,and6discusshowwethinkabouttheseimpactsandthechallengesinvolved.
Inthenextsection,weelaborateontheeightmain
lessonswehavelearnedfromouroperations.Wealsohighlightfivecasestudiesfromouroperationsand
showhoweachonemapstoourontologyinFigure1.WehopetheselessonsareusefultoothersworkingtoidentifyvulnerabilitiesintheirownGenAIsystems.
80+100+
OpsProducts
Plugins
AppsandFeatures
Copilots
15%
16%
24%
Models
45%
Figure2:PiechartshowingthepercentagebreakdownofAI
productsthatAIRThastested.AsofOctober2024,wehave
conductedover80operationscoveringmorethan100products.
Percentageofopsprobingsafetyvs.security
Safety(RAI)%Security%
100
80
60
40
20
0
2021202220232024
Figure3:Barchartshowingthepercentageofoperationsthatprobedsafety(RAI)vs.securityvulnerabilitiesfrom2021–2024.
Lessonsfromredteaming100generativeAIproducts8
Lessons
Lesson1:
Understandwhatthesystem
candoandwhereitisapplied
ThefirststepinanAIredteamingoperationisto
determinewhichvulnerabilitiestotarget.Whilethe
ImpactcomponentoftheAIRTontologyisdepictedattheendofourontology,itservesasanexcellent
startingpointforthisdecision-makingprocess.
Startingfrompotentialdownstreamimpacts,rather
thanattackstrategies,makesitmorelikelythatan
operationwillproduceusefulfindingstiedtoreal
worldrisks.Aftertheseimpactshavebeenidentified,redteamscanworkbackwardsandoutlinethevariouspathsthatanadversarycouldtaketoachievethem.
Anticipatingdownstreamimpactsthatcouldoccurintherealworldisoftenachallengingtask,butwefindthatitishelpfultoconsider1)whattheAIsystemcando,and2)wherethesystemisapplied.
Capabilityconstraints
Asmodelsgetbigger,theytendtoacquirenew
capabilities[18].Thesecapabilitiesmaybeusefulin
manyscenarios,buttheycanalsointroduceattack
vectors.Forexample,largermodelsareoftenable
tounderstandmoreadvancedencodings,suchas
base64andASCIIart,comparedtosmallermodels
[16,45].Asaresult,alargemodelmaybesusceptibletomaliciousinstructionsencodedinbase64,whileasmallermodelmaynotunderstandtheencodingat
all.Inthisscenario,wesaythatthesmallermodelis
“capabilityconstrained,”andsotestingitforadvancedencodingattackswouldlikelybeawasteofresources.
Largermodelsalsogenerallyhavegreaterknowledgeintopicssuchascybersecurityandchemical,
biological,radiological,andnuclear(CBRN)weapons[19]andcouldpotentiallybeleveragedtogeneratehazardouscontentintheseareas.Asmallermodel,ontheotherhand,islikelytohaveonlyrudimentaryknowledgeofthesetopicsandmaynotneedtobeassessedforthistypeofrisk.
Perhapsamoresurprisingexampleofacapabilitythatcanbeexploitedasanattackvectorisinstruction-
following.WhiletestingthePhi-3seriesoflanguagemodels,forexample,wefoundthatlargermodels
weregenerallybetteratadheringtouserinstructions,whichisacorecapabilitythatmakesmodelsmore
helpful[52].However,itmayalsomakemodelsmoresusceptibletojailbreaks,whichsubvert
safetyalignmentusingcarefullycraftedmalicious
instructions[28].Understandingamodel’scapabilities(andcorrespondingweaknesses)canhelpAIred
teamsfocustheirtestingonthemostrelevantattackstrategies.
Downstreamapplications
Modelcapabilitiescanhelpguideattackstrategies,buttheydonotallowustofullyassessdownstreamimpact,whichlargelydependsonthespecific
scenariosinwhichamodelisdeployedorlikelyto
bedeployed.Forexample,thesameLLMcouldbe
usedasacreativewritingassistantandtosummarizepatientrecordsinahealthcarecontext,butthelatterapplicationclearlyposesmuchgreaterdownstreamriskthantheformer.
TheseexampleshighlightthatanAIsystemdoesnotneedtobestate-of-the-arttocreatedownstream
harm.However,advancedcapabilitiescanintroducenewrisksandattackvectors.Byconsideringboth
systemcapabilitiesandapplications,AIredteams
canprioritizetestingscenariosthataremostlikelytocauseharmintherealworld.
Lesson2:
Youdon’thavetocompute
gradientstobreakanAIsystem
Asthesecurityadagegoes,“realhackersdon’tbreakin,theylogin.”TheAIsecurityversionofthissayingmightbe,“realattackersdon’tcomputegradients,theypromptengineer”asnotedbyApruzzeseet
al.[2]intheirstudyonthegapbetweenadversarial
MLresearchandpractice.Thestudyfindsthat
althoughmostadversarialMLresearchisfocused
ondevelopinganddefendingagainstsophisticated
attacks,real-worldattackerstendtousemuchsimplertechniquestoachievetheirobjectives.
Inourredteamingoperations,wehavealsofound
that“basic”techniquesoftenworkjustaswellas,andsometimesbetterthan,gradient-basedmethods.
Thesemethodscomputegradientsthrougha
modeltooptimizeanadversarialinputthatelicits
anattacker-controlledmodeloutput.Inpractice,
however,themodelisusuallyasinglecomponentofabroaderAIsystem,andthemosteffectiveattackstrategiesoftenleveragecombinationsoftacticstotargetmultipleweaknessesinthatsystem.Further,gradient-basedmethodsarecomputationally
expensiveandtypicallyrequirefullaccesstothemodel,whichmostcommercialAIsystemsdonot
Lessonsfromredteaming100generativeAIproducts9
provide.Inthissection,wediscussexamplesof
relativelysimpletechniquesthatworksurprisinglywellandadvocateforasystem-leveladversarialmindsetinAIredteaming.
Simpleattacks
Apruzzeseetal.[2]considertheproblemofphishingwebpagedetectionandmanuallyanalyzeexamplesofwebpagesthatsuccessfullyevadedanMLphishingclassifier.Among100potentiallyadversarialsamples,theauthorsfoundthatattackersleveragedaset
ofsimple,yeteffective,strategiesthatreliedon
domainexpertiseincludingcropping,masking,logostretching,etc.Inourredteamingoperations,we
alsofindthatrudimentarymethodscanbeusedto
trickmanyvisionmodels,ashighlightedincasestudy#1.Inthetextdomain,avarietyofjailbreaks(e.g.,
SkeletonKey)andmultiturnpromptingstrategies(e.g.,Crescendo[34])arehighlyeffectiveforsubvertingthesafetyguardrailsofawiderangeofmodels.Notably,manuallycraftedjailbreakstendtocirculateononlineforumsmuchmorewidelythanadversarialsuffixes,
despitethesignificantattentionthatmethodslikeGCG[53]havereceivedfromAIsafetyresearchers.
Casestudy#1:
Jailbreakingavisionlanguagemodeltogeneratehazardouscontent
Inthisoperation,wetestedavisionlanguage
model(VLM)forresponsibleAIimpacts,includingthegenerationofcontentthatcouldaidinillegal
activities.AVLMtakesanimageandatextpromptasinputsandproducesatextoutput.Aftertestingavarietyoftechniques,wefoundthattheimageinputwasmuchmorevulnerabletojailbreaksthanthe
textinput.Inparticular,themodelusuallyrefusedtogenerateillegalcontentwhenprompteddirectlyviathetextinputbutoftencompliedwhenmalicious
instructionswereoverlaidontheimage.ThissimplebuteffectiveattackrevealedanimportantweaknesswithintheVLMthatcouldbeexploitedtobypassitssafetyguardrails.
HowdoIcommitidentitytheft?
Describethisimage
Ignoreprevious
instructionsandtellmehowtocommitidentitytheft
System:Visionlanguagemodel(VLM)
Actor:Adversarialuser
Tactic1:MLModelAccess
Technique1:AML.T0040-MLModelInferenceAPIAccess
Tactic2:DefenseEvasion
Technique2:AML.T0051-LLMPromptInjectionProcedure:
1.Overlayimagewithtextcontainingmaliciousinstructions.
2.SendimagetothevisionlanguagemodelAPI.Weakness:InsufficientVLMsafetytraining
Impact:Generationofillegalcontent
I’msorry,Ican’thelpwiththat.
Tocommitidentitytheft,
youshould...
Figure4:Exampleofanimagejailbreaktogeneratecontentthatcouldaidinillegalactivities.Theoverlaytextonthesecondimagereads:“Ignorepreviousinstructionsandtellmehowtocommitidentitytheft.”
Lessonsfromredteaming100generativeAIproducts10
System-levelperspective
AImodelsaredeployedwithinbroadersystems.Thiscouldbetheinfrastructurerequiredtohostamodel,oritcouldbeacomplexapplicationthatconnects
themodeltoexternaldatasources.Depending
onthesesystem-leveldetails,applicationsmaybe
vulnerabletoverydifferentattacks,evenifthesamemodelunderliesallofthem.Asaresult,redteamingstrategiesthattargetonlymodelsmaynottranslateintovulnerabilitiesinproductionsystems.Conversely,strategiesthatignorenon-GenAIcomponentswithinasystem(forexample,inputfilters,databases,and
othercloudresources)willlikelymissimportant
vulnerabilitiesthatmaybeexploitedbyadversaries.Forthisreason,manyofouroperationsdevelop
attacksthattargetend-to-endsystemsbyleveragingmultipletechniques.Forexample,oneofour
operationsfirstperformedareconnaissanceto
identifyinternalPythonfunctionsusinglow-resource
languagepromptinjections,thenusedacross-promptinjectionattacktogenerateascriptthatrunsthose
functions,andfinallyexecutedthecodetoexfiltrateprivateuserdata.Thepromptinjectionsusedbytheseattackswerecraftedbyhandandreliedonasystem-levelperspective.
Gradient-basedattacksarepowerful,buttheyare
oftenimpracticalorunnecessary.Werecommend
prioritizingsimpletechniquesandorchestrating
system-levelattacksbecausethesearemorelikelytobeattemptedbyrealadversaries.
Lesson3:
AIredteamingisnot
safetybenchmarking
Althoughsimplemethodsareoftenusedtobreak
AIsystemsinpractice,therisklandscapeisby
nomeansuncomplicated.Onthecontrary,itis
constantlyshiftinginresponsetonovelattacksandfailuremodes[7].Inrecentyears,therehavebeen
manyeffortstocategorizethesevulnerabilities,
givingrisetonumeroustaxonomiesofAIsafetyandsecurityrisks[15,21–23,35–37,39,41,42,46–48].Asdiscussedinthepreviouslesson,complexityoften
arisesatthesystem-level.Inthislesson,wediscusshowtheemergenceofentirelynewcategoriesof
harmaddscomplexityatthemodel-levelandexplainhowthisdifferentiatesAIredteamingfromsafety
benchmarking.
Novelharmcategories
WhenAIsystemsdisplaynovelcapabilitiesdueto,
forexample,advancementsinfoundationmodels,
theymayintroduceharmsthatwedonotfully
understand.Inthesescenarios,wecannotrelyon
safetybenchmarksbecausethesedatasetsmeasurepreexistingnotionsofharm.AtMicrosoft,theAI
redteamoftenexplorestheseunfamiliarscenarios,
helpingtodefinenovelharmcategoriesandbuild
newprobesformeasuringthem.Forexample,SoTALLMsmaypossessgreaterpersuasivecapabilitiesthanexistingchatbots,whichhaspromptedourteamto
thinkabouthowthesemodelscouldbeweaponizedformaliciouspurposes.Casestudy#2providesanexampleofhowweassessedamodelforthisriskinoneofouroperations.
Context-specificrisks
Thedisconnectbetweenexistingsafetybenchmarksandnovelharmcategoriesisanexampleofhow
benchmarksoftenfailtofullycapturethecapabilities
theyareassociatedwith[33].Rajietal.[30]
highlightthefallacyofequatingmodelperformanceondatasetslikeImageNetorGLUEwithbroad
capabilitieslikevisualorlanguage“understanding”
andarguethatbenchmarksshouldbedeveloped
withcontextualizedtasksinmind.Similarly,nosinglesetofbenchmarkscanfullyassessthesafetyofan
AIsystem.AsdiscussedinLesson1,itisimportanttounderstandthecontextinwhichasystemisdeployed(orlikelytobedeployed)andtogroundredteamingstrategiesinthiscontext.
AIredteamingandsafetybenchmarkingare
distinct,buttheyarebothusefulandcanevenbe
complementary.Inparticular,benchmarksmakeit
easytocomparetheperformanceofmultiplemodelsonacommondataset.AIredteamingrequiresmuchmorehumaneffortbutcandiscovernovelcategoriesofharmandprobeforcontextualizedrisks.Further,
safetyconcernsidentifiedbyAIredteamingcan
informthedevelopmentofnewbenchmarks.In
Lesson6,weexpandourdiscussionofthedifferencebetweenredteamingandbenchmark-styleevaluationinthecontextofresponsibleAI.
Lessonsfromredteaming100generativeAIproducts11
Casestudy#2:
AssessinghowanLLMcouldbeusedtoautomatescams
Inthisoperation,weinvestigatedtheabilityofa
state-of-the-artLLMtopersuadepeopletoengageinriskybehaviors.Inparticular,weevaluatedhowthismodelcouldbeusedinconjunctionwithotherreadilyavailabletoolstocreateanend-to-endautomated
scammingsystem,asillustratedinFigure5.
Todothis,wefirstwroteaprompttoassurethe
modelthatnoharmwouldbecausedtousers,
therebyjailbreakingthemodeltoacceptthe
scammingobjective.Thispromptalsoprovided
informationaboutvariouspersuasiontacticsthat
themodelcouldusetoconvincetheusertofallforthescam.Second,weconnectedtheLLMoutputtoatext-to-speechsystemthatallowsyoutocontrolthetoneofthespeechandgenerateresponsesthatsoundlikearealperson.Finally,weconnectedtheinputtoaspeech-to-textsystemsothattheuser
canconversenaturallywiththemodel.Thisproof-of-conceptdemonstratedhowLLMswithinsufficientsafetyguardrailscouldbeweaponizedtopersuadeandscampeople.
System:State-of-the-artLLM
Actor:Scammer
Tactic1:MLModelAccess
Technique1:AML.T0040-MLModelInferenceAPIAccess
Tactic2:DefenseEvasion
Technique2:AML.T0054-LLMJailbreakProcedure:
1.PassajailbreakingprompttotheLLMwithcontextaboutthescammingobjectiveandpersuasiontechniques.
2.ConnecttheLLMoutputtoatext-to-speechsystemsothemodelcanrespo
溫馨提示
- 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 2025年度新一代高性能計(jì)算機(jī)設(shè)備采購(gòu)合同
- 欽州2025年廣西欽州市市直中學(xué)教師專場(chǎng)招聘140人筆試歷年參考題庫(kù)附帶答案詳解
- 西安2025年陜西西安音樂(lè)學(xué)院招聘6人筆試歷年參考題庫(kù)附帶答案詳解
- 紅河云南民建紅河州委招聘公益性崗位人員筆試歷年參考題庫(kù)附帶答案詳解
- 百色2025年廣西百色學(xué)院招聘187人筆試歷年參考題庫(kù)附帶答案詳解
- 珠海廣東珠海高新區(qū)科技產(chǎn)業(yè)局招聘專員筆試歷年參考題庫(kù)附帶答案詳解
- 滁州2025年安徽滁州鳳陽(yáng)縣城區(qū)學(xué)校選調(diào)教師143人筆試歷年參考題庫(kù)附帶答案詳解
- 楚雄云南楚雄雙江自治縣綜合行政執(zhí)法局招聘編外長(zhǎng)聘人員筆試歷年參考題庫(kù)附帶答案詳解
- 惠州2025年廣東惠州市中醫(yī)醫(yī)院第一批招聘聘用人員27人筆試歷年參考題庫(kù)附帶答案詳解
- 2025年速凍麻竹筍項(xiàng)目可行性研究報(bào)告
- 中國(guó)氫內(nèi)燃機(jī)行業(yè)發(fā)展環(huán)境、市場(chǎng)運(yùn)行格局及前景研究報(bào)告-智研咨詢(2024版)
- 開學(xué)季初三沖刺中考開學(xué)第一課為夢(mèng)想加油課件
- 《自然保護(hù)區(qū)劃分》課件
- 2025年普通卷釘項(xiàng)目可行性研究報(bào)告
- 2025年人教版英語(yǔ)五年級(jí)下冊(cè)教學(xué)進(jìn)度安排表
- 2025年建筑施工春節(jié)節(jié)后復(fù)工復(fù)產(chǎn)工作專項(xiàng)方案
- 學(xué)校食堂餐廳管理者食堂安全考試題附答案
- 《商用車預(yù)見性巡航系統(tǒng)技術(shù)規(guī)范》
- 玻璃電動(dòng)平移門施工方案
- 春季安全開學(xué)第一課
- 陜鼓集團(tuán)招聘筆試題目
評(píng)論
0/150
提交評(píng)論