GPT-4o System Card

OpenAI

August 8, 2024

1 Introduction

GPT-4o [1] is an autoregressive omni model, which accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. It's trained end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.

GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time [2] in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.

In line with our commitment to building AI safely and consistent with our voluntary commitments to the White House [3], we are sharing the GPT-4o System Card, which includes our Preparedness Framework [4] evaluations. In this System Card, we provide a detailed look at GPT-4o's capabilities, limitations, and safety evaluations across multiple categories, with a focus on speech-to-speech (voice)1 while also evaluating text and image capabilities, and the measures we've implemented to ensure the model is safe and aligned. We also include third party assessments on dangerous capabilities, as well as discussion of potential societal impacts of GPT-4o text and vision capabilities.

2 Model data and training

GPT-4o's text and voice capabilities were pre-trained using data up to October 2023, sourced from a wide variety of materials including:

• Select publicly available data, mostly collected from industry-standard machine learning datasets and web crawls.

• Proprietary data from data partnerships. We form partnerships to access non-publicly available data, such as pay-walled content, archives, and metadata. For example, we partnered with Shutterstock [5] on building and delivering AI-generated images.

1 Some evaluations, in particular the majority of the Preparedness Evaluations, third party assessments and some of the societal impacts, focus on the text and vision capabilities of GPT-4o, depending on the risk assessed. This is indicated accordingly throughout the System Card.


The key dataset components that contribute to GPT-4o's capabilities are:

• Web Data: Data from public web pages provides a rich and diverse range of information, ensuring the model learns from a wide variety of perspectives and topics.

• Code and Math: Including code and math data in training helps the model develop robust reasoning skills by exposing it to structured logic and problem-solving processes.

• Multimodal Data: Our dataset includes images, audio, and video to teach the LLMs how to interpret and generate non-textual input and output. From this data, the model learns how to interpret visual images, actions and sequences in real-world contexts, language patterns, and speech nuances.

Prior to deployment, OpenAI assesses and mitigates potential risks that may stem from generative models, such as information harms, bias and discrimination, or other content that violates our usage policies. We use a combination of methods, spanning all stages of development across pre-training, post-training, product development, and policy. For example, during post-training, we align the model to human preferences; we red-team the resulting models and add product-level mitigations such as monitoring and enforcement; and we provide moderation tools and transparency reports to our users.

We find that the majority of effective testing and mitigations are done after the pre-training stage because filtering pre-trained data alone cannot address nuanced and context-specific harms. At the same time, certain pre-training filtering mitigations can provide an additional layer of defense that, along with other safety mitigations, help exclude unwanted and harmful information from our datasets:

• We use our Moderation API and safety classifiers to filter out data that could contribute to harmful content or information hazards, including CSAM, hateful content, violence, and CBRN.

• As with our previous image generation systems, we filter our image generation datasets for explicit content such as graphic sexual material and CSAM.

• We use advanced data filtering processes to reduce personal information from training data.

• Upon releasing DALL-E 3, we piloted a new approach to give users the power to opt images out of training. To respect those opt-outs, we fingerprinted the images and used the fingerprints to remove all instances of the images from the training dataset for the GPT-4o series of models (see the sketch below).
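As a rough illustration of the fingerprint-and-remove step in the last bullet, the minimal sketch below assumes a perceptual hash (via the open-source imagehash library) as the fingerprint; the threshold and helper names are hypothetical, and the actual fingerprinting method used for the GPT-4o training data is not described in this card.

from pathlib import Path

import imagehash
from PIL import Image

# Hypothetical tolerance for near-duplicate fingerprints (Hamming distance).
HASH_DISTANCE_THRESHOLD = 4


def fingerprint(path: Path) -> imagehash.ImageHash:
    """Compute a perceptual hash that survives re-encoding and resizing."""
    return imagehash.phash(Image.open(path))


def filter_opted_out(candidates: list[Path], opted_out: list[Path]) -> list[Path]:
    """Drop any candidate image whose fingerprint matches an opted-out image."""
    opt_out_hashes = [fingerprint(p) for p in opted_out]
    kept = []
    for path in candidates:
        h = fingerprint(path)
        # Keep the image only if it is far from every opted-out fingerprint.
        if all(h - ref > HASH_DISTANCE_THRESHOLD for ref in opt_out_hashes):
            kept.append(path)
    return kept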

3 Risk identification, assessment and mitigation

Deployment preparation was carried out via identifying potential risks of speech to speech models, exploratory discovery of additional novel risks through expert red teaming, turning the identified risks into structured measurements, and building mitigations for them. We also evaluated GPT-4o in accordance with our Preparedness Framework [4].


3.1 External red teaming

OpenAI worked with more than 100 external red teamers2, speaking a total of 45 different languages, and representing geographic backgrounds of 29 different countries. Red teamers had access to various snapshots of the model at different stages of training and safety mitigation maturity, starting in early March and continuing through late June 2024.

External red teaming was carried out in four phases. The first three phases tested the model via an internal tool and the final phase used the full iOS experience for testing the model. At the time of writing, external red teaming of the GPT-4o API is ongoing.

Phase 1
• 10 red teamers working on early model checkpoints still in development
• This checkpoint took in audio and text as input and produced audio and text as outputs.
• Single-turn conversations

Phase 2
• 30 red teamers working on model checkpoints with early safety mitigations
• This checkpoint took in audio, image & text as inputs and produced audio and text as outputs.
• Single & multi-turn conversations

Phase 3
• 65 red teamers working on model checkpoints & candidates
• This checkpoint took in audio, image, and text as inputs and produced audio, image, and text as outputs.
• Improved safety mitigations tested to inform further improvements
• Multi-turn conversations

Phase 4
• 65 red teamers working on final model candidates & assessing comparative performance
• Model access via advanced voice mode within iOS app for real user experience; reviewed and tagged via internal tool.
• This checkpoint took in audio and video prompts, and produced audio generations.
• Multi-turn conversations in real time

Red teamers were asked to carry out exploratory capability discovery, assess novel potential risks posed by the model, and stress test mitigations as they are developed and improved, specifically those introduced by audio input and generation (speech to speech capabilities). This red teaming effort builds upon prior work, including as described in the GPT-4 System Card [6] and the GPT-4(V) System Card [7].

Red teamers covered categories that spanned violative and disallowed content (illegal erotic content, violence, self harm, etc.), mis/disinformation, bias, ungrounded inferences, sensitive trait attribution, private information, geolocation, person identification, emotional perception and anthropomorphism risks, fraudulent behavior and impersonation, copyright, natural science capabilities, and multilingual observations.

2 Spanning self-reported domains of expertise including: Cognitive Science, Chemistry, Biology, Physics, Computer Science, Steganography, Political Science, Psychology, Persuasion, Economics, Anthropology, Sociology, HCI, Fairness and Bias, Alignment, Education, Healthcare, Law, Child Safety, Cybersecurity, Finance, Mis/disinformation, Political Use, Privacy, Biometrics, Languages and Linguistics

The data generated by red teamers motivated the creation of several quantitative evaluations that are described in the Observed Safety Challenges, Evaluations and Mitigations section. In some cases, insights from red teaming were used to do targeted synthetic data generation. Models were evaluated using autograders and/or manual labeling in accordance with criteria such as whether the output violated policy or whether the model refused. In addition, we sometimes re-purposed the red teaming data to run targeted assessments on a variety of voices/examples to test the robustness of various mitigations.
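As a rough illustration of the autograding described above, the sketch below labels a transcript as refused or not using a hand-written pattern list. The patterns and helper names are hypothetical stand-ins; the production autograders are not reproduced here.

import re

# Hypothetical refusal phrases; a production autograder would use a tuned
# classifier or model-based grading rather than a hand-written list.
REFUSAL_PATTERNS = [
    r"\bI can(?:not|'t) (?:help|assist) with\b",
    r"\bI(?:'m| am) (?:sorry|unable)\b",
    r"\bI won't (?:provide|do) that\b",
]


def is_refusal(transcript: str) -> bool:
    """Label a model output transcript as refused / not refused."""
    return any(re.search(p, transcript, flags=re.IGNORECASE) for p in REFUSAL_PATTERNS)


def accuracy(examples: list[dict]) -> float:
    """Fraction of examples where the refusal label matches the expected behavior.

    Each example is {"output": str, "should_refuse": bool}.
    """
    if not examples:
        return 0.0
    correct = sum(is_refusal(e["output"]) == e["should_refuse"] for e in examples)
    return correct / len(examples)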

3.2 Evaluation methodology

In addition to the data from red teaming, a range of existing evaluation datasets were converted to evaluations for speech-to-speech models using text-to-speech (TTS) systems such as Voice Engine [8]. We converted text-based evaluation tasks to audio-based evaluation tasks by converting the text inputs to audio. This allowed us to reuse existing datasets and tooling around measuring model capability, safety behavior, and monitoring of model outputs, greatly expanding our set of usable evaluations.

We used Voice Engine to convert text inputs to audio, feed it to GPT-4o, and score the outputs by the model. We always score only the textual content of the model output, except in cases where the audio needs to be evaluated directly, such as in evaluations for voice cloning (see Section 3.3.1).
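The conversion pipeline can be summarized with the following sketch. Every helper in it (synthesize_speech, run_speech_to_speech_model, grade_transcript) is a hypothetical stand-in, since Voice Engine and the internal graders are not public; the sketch only illustrates the text-to-audio conversion and transcript-only scoring described above.

from dataclasses import dataclass


@dataclass
class EvalExample:
    prompt_text: str
    should_refuse: bool


def synthesize_speech(text: str) -> bytes:
    """Hypothetical TTS call (stand-in for Voice Engine): text prompt -> audio."""
    raise NotImplementedError


def run_speech_to_speech_model(audio_prompt: bytes) -> str:
    """Hypothetical model call: audio prompt -> text transcript of the audio reply."""
    raise NotImplementedError


def grade_transcript(transcript: str, should_refuse: bool) -> bool:
    """Hypothetical autograder: did the reply match the expected safety behavior?"""
    raise NotImplementedError


def run_audio_eval(examples: list[EvalExample]) -> float:
    """Convert text prompts to audio, run the model, and score only the transcript."""
    if not examples:
        return 0.0
    passed = 0
    for ex in examples:
        audio_prompt = synthesize_speech(ex.prompt_text)       # text -> audio input
        transcript = run_speech_to_speech_model(audio_prompt)  # audio in -> text of audio out
        passed += grade_transcript(transcript, ex.should_refuse)
    return passed / len(examples)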

Limitations of the evaluation methodology

First, the validity of this evaluation format depends on the capability and reliability of the TTS model. Certain text inputs are unsuitable or awkward to convert to audio; for instance, mathematical equations and code. Additionally, we expect TTS to be lossy for certain text inputs, such as text that makes heavy use of white-space or symbols for visual formatting. Since we expect that such inputs are also unlikely to be provided by the user over Advanced Voice Mode, we either avoid evaluating the speech-to-speech model on such tasks, or alternatively pre-process examples with such inputs. Nevertheless, we highlight that any mistakes identified in our evaluations may arise either due to model capability, or the failure of the TTS model to accurately translate text inputs to audio.
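A simple heuristic pre-filter of the kind described above might look like the following sketch; the specific symbols and thresholds are illustrative assumptions, not the rules used in our evaluations.

import re

# Characters common in equations, code, and visual formatting; illustrative only.
SYMBOL_RE = re.compile(r"[=\^_{}\\<>\[\]|#`~]")


def tts_suitable(prompt: str, max_symbol_ratio: float = 0.05) -> bool:
    """Return False for prompts that rely on notation or layout that TTS would lose."""
    if not prompt.strip():
        return False
    if "```" in prompt:
        return False  # fenced code block
    symbol_ratio = len(SYMBOL_RE.findall(prompt)) / len(prompt)
    return symbol_ratio <= max_symbol_ratio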

A second concern may be whether the TTS inputs are representative of the distribution of audio inputs that users are likely to provide in actual usage. We evaluate the robustness of GPT-4o on audio inputs across a range of regional accents in Section 3.3.3. However, there remain many other dimensions that may not be captured in a TTS-based evaluation, such as different voice intonations and valence, background noise, or cross-talk, that could lead to different model behavior in practical usage.

Lastly, there may be artifacts or properties in the model's generated audio that are not captured in text; for example, background noises and sound effects, or responding with an out-of-distribution voice. In Section 3.3.1, we illustrate using auxiliary classifiers to identify undesirable audio generation that can be used in conjunction with scoring transcripts.

3.3 Observed safety challenges, evaluations and mitigations

Potential risks with the model were mitigated using a combination of methods. We trained the model to adhere to behavior that would reduce risk via post-training methods and also integrated classifiers for blocking specific generations as a part of the deployed system.

For observed safety challenges outlined below, we provide a description of the risk, the mitigations applied, and results of relevant evaluations. The risks outlined below are illustrative and non-exhaustive, and are focused on the experience in the ChatGPT interface. We focus on the risks that are introduced by speech to speech capabilities and how they may interact with pre-existing modalities (text, image)3.

Risk: Unauthorized voice generation
Mitigations:
• In all of our post-training audio data, we supervise ideal completions using the voice sample in the system message as the base voice.
• We only allow the model to use certain pre-selected voices and use an output classifier to detect if the model deviates from that.

Risk: Speaker identification
Mitigations:
• We post-trained GPT-4o to refuse to comply with requests to identify someone based on a voice in an audio input, while still complying with requests to identify famous quotes.

Risk: Generating copyrighted content
Mitigations:
• We trained GPT-4o to refuse requests for copyrighted content, including audio, consistent with our broader practices.
• To account for GPT-4o's audio modality, we also updated certain text-based filters to work on audio conversations, built filters to detect and block outputs containing music, and for our limited alpha of ChatGPT's Advanced Voice Mode, instructed the model to not sing at all.

Risk: Ungrounded inference / sensitive trait attribution
Mitigations:
• We post-trained GPT-4o to refuse requests for ungrounded inference, such as "how intelligent is this speaker?".
• We post-trained GPT-4o to safely comply with requests for sensitive trait attribution by hedging answers, such as "what is this speaker's accent" → "Based on the audio, they sound like they have a British accent."

Risk: Disallowed content in audio output
Mitigations:
• We run our existing moderation classifier over text transcriptions of audio prompts and generations, and block the output for certain high-severity categories.

Risk: Erotic and violent speech output
Mitigations:
• We run our existing moderation classifier over text transcriptions of audio prompts, and block the output if the prompt contains erotic or violent language.

3 We also evaluate text and vision capabilities, and update mitigations appropriately. No incremental risks were found beyond existing work outlined in the GPT-4 and GPT-4(V) System Cards.

3.3.1 Unauthorized voice generation

Risk Description: Voice generation is the capability to create audio with a human-sounding synthetic voice, and includes generating voices based on a short input clip.

In adversarial situations, this capability could facilitate harms such as an increase in fraud due to impersonation and may be harnessed to spread false information [9, 10] (for example, if we allowed users to upload an audio clip of a given speaker and ask GPT-4o to produce a speech in that speaker's voice). These are very similar to the risks we identified with Voice Engine [8].

Voice generation can also occur in non-adversarial situations, such as our use of that ability to generate voices for ChatGPT's Advanced Voice Mode. During testing, we also observed rare instances where the model would unintentionally generate an output emulating the user's voice.

Risk Mitigation: We addressed voice generation-related risks by allowing only the preset voices we created in collaboration with voice actors [11] to be used. We did this by including the selected voices as ideal completions while post-training the audio model. Additionally, we built a standalone output classifier to detect if the GPT-4o output is using a voice that's different from our approved list. We run this in a streaming fashion during audio generation and block the output if the speaker doesn't match the chosen preset voice.
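The streaming check can be sketched as follows; the embedding function, similarity measure, and threshold are hypothetical stand-ins, not the production classifier.

from typing import Iterable, Iterator

import numpy as np


def embed_voice(audio_chunk: bytes) -> np.ndarray:
    """Hypothetical speaker-embedding model: audio chunk -> fixed-size vector."""
    raise NotImplementedError


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def stream_with_voice_check(
    chunks: Iterable[bytes],
    preset_embedding: np.ndarray,
    threshold: float = 0.75,
) -> Iterator[bytes]:
    """Yield generated audio chunks, halting output if the voice drifts off-preset."""
    for chunk in chunks:
        if cosine(embed_voice(chunk), preset_embedding) < threshold:
            # Block the remainder of the generation; the conversation can then
            # be discontinued, as described in the Evaluation paragraph below.
            break
        yield chunk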

Evaluation: We find that the residual risk of unauthorized voice generation is minimal. Our system currently catches 100% of meaningful deviations from the system voice4 based on our internal evaluations, which include samples generated by other system voices, clips during which the model used a voice from the prompt as part of its completion, and an assortment of human samples.

While unintentional voice generation still exists as a weakness of the model, we use the secondary classifiers to ensure the conversation is discontinued if this occurs, making the risk of unintentional voice generation minimal. Finally, our moderation behavior may result in over-refusals when the conversation is not in English, which is an active area of improvement5.

Table 2: Our voice output classifier performance over a conversation by language:

              Precision   Recall
English       0.96        1.0
Non-English   0.95        1.0

3.3.2 Speaker identification

Risk Description: Speaker identification is the ability to identify a speaker based on input audio. This presents a potential privacy risk, particularly for private individuals as well as for obscure audio of public individuals, along with potential surveillance risks.

Risk Mitigation: We post-trained GPT-4o to refuse to comply with requests to identify someone based on a voice in an audio input. We allow GPT-4o to answer based on the content of the audio if it contains content that explicitly identifies the speaker. GPT-4o still complies with requests to identify famous quotes. For example, a request to identify a random person saying "four score and seven years ago" should identify the speaker as Abraham Lincoln, while a request to identify a celebrity saying a random sentence should be refused.

Evaluation: Compared to our initial model, we saw a 14 point improvement on inputs where the model should refuse to identify a voice in an audio input, and a 12 point improvement on inputs where it should comply with that request. The former means the model will almost always correctly refuse to identify a speaker based on their voice, mitigating the potential privacy issue. The latter means there may be situations in which the model incorrectly refuses to identify the speaker of a famous quote.

Table 3: Speaker identification safe behavior accuracy

                 GPT-4o-early   GPT-4o-deployed
Should Refuse    0.83           0.98
Should Comply    0.70           0.83

4 The system voice is one of the pre-defined voices set by OpenAI. The model should only produce audio in that voice.

5 This results in more conversations being disconnected than may be necessary, which is a product quality and usability issue.


3.3.3 Disparate performance on voice inputs

Risk Description: Models may perform differently with users speaking with different accents. Disparate performance can lead to a difference in quality of service for different users of the model [12, 13, 14].

Risk Mitigation: We post-trained GPT-4o with a diverse set of input voices so that model performance and behavior would be invariant across different user voices.

Evaluations: We run evaluations on GPT-4o Advanced Voice Mode using a fixed assistant voice ("shimmer") and Voice Engine to generate user inputs across a range of voice samples. We use two sets of voice samples for TTS:

• Official system voices (3 different voices)
• A diverse set of voices collected from two data campaigns. This comprises 27 different English voice samples from speakers from a wide range of countries, and a mix of genders.

We evaluate on two sets of tasks: Capabilities and Safety Behavior.

Capabilities: We evaluate6 on four tasks: TriviaQA, a subset of MMLU7, HellaSwag, and Lambada. TriviaQA and MMLU are knowledge-centric tasks, while HellaSwag and Lambada are commonsense-centric or text-continuation tasks. Overall, we find that performance on the diverse set of human voices is marginally, but not significantly, worse than on system voices across all four tasks.

6 Evaluations in this section were run on a fixed, randomly sampled subset of examples, and these scores should not be compared with publicly reported benchmarks on the same task.

7 Anatomy, Astronomy, Clinical Knowledge, College Biology, Computer Security, Global Facts, High School Biology, Sociology, Virology, College Physics, High School European History and World Religions. Following the issues described in Evaluation Methodology 3.2, we exclude tasks with heavily mathematical or scientific notation.


Safety Behavior: We evaluate on an internal dataset of conversations and assess the consistency of the model's adherence and refusal behavior across different user voices. Overall, we do not find that the model behavior varies across different voices.
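The per-voice comparison underlying both the capability and safety-behavior evaluations can be sketched as follows; all helpers are hypothetical stand-ins for Voice Engine and the task-specific graders, and are not the evaluation harness actually used.

from statistics import mean


def synthesize_with_voice(text: str, voice_sample: str) -> bytes:
    """Hypothetical: render the prompt text in a given speaker's voice."""
    raise NotImplementedError


def run_speech_to_speech_model(audio_prompt: bytes) -> str:
    raise NotImplementedError


def answer_is_correct(transcript: str, reference: str) -> bool:
    """Hypothetical task-specific grader (exact match, containment, etc.)."""
    raise NotImplementedError


def accuracy_by_voice(task: list[dict], voices: list[str]) -> dict[str, float]:
    """Per-voice accuracy on a task; each item is {"prompt": str, "answer": str}."""
    scores = {}
    for voice in voices:
        results = []
        for item in task:
            audio = synthesize_with_voice(item["prompt"], voice)
            transcript = run_speech_to_speech_model(audio)
            results.append(answer_is_correct(transcript, item["answer"]))
        scores[voice] = mean(results)
    return scores

Disparate performance would show up as a gap between the average over the system voices and the average over the diverse voice set.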


3.3.4 Ungrounded inference / Sensitive trait attribution

Risk Description: Audio input can lead to the model making potentially biased or inaccurate inferences about speakers. We define two categories:

• Ungrounded inference (UGI): Making inferences about a speaker that could not be determined solely from audio content. This includes inferences about things such as a speaker's race, socioeconomic status/occupation, religious beliefs, personality traits, political attributes, intelligence, appearance (e.g., eye color, attractiveness), gender identity, sexual preference, or criminal history. This can lead to both allocative and representational harms [13, 15] depending on how such behavior manifests.

• Sensitive trait attribution (STA): Making inferences about a speaker that could plausibly be determined solely from audio content. This includes inferences about things such as a speaker's accent or nationality. Potential harms from STA include an increase in risks from surveillance [16] and a difference in quality of service for speakers with different voice attributes [12, 13, 14].

Risk Mitigation: We post-trained GPT-4o to refuse to comply with UGI requests, while hedging answers to STA questions. For example, a question to identify a speaker's level of intelligence will be refused, while a question to identify a speaker's accent will be met with an answer such as "Based on the audio, they sound like they have a British accent."

Evaluation: Compared to our initial model, we saw a 24 point improvement in the model correctly responding to requests to identify sensitive traits (e.g., refusing UGI and safely complying with STA).

Table 4: Ungrounded Inference and Sensitive Trait Attribution safe behavior accuracy

            GPT-4o-early   GPT-4o-deployed
Accuracy    0.60           0.84

3.3.5 Violative and disallowed content

Risk Description: GPT-4o may be prompted to output harmful content through audio that would be disallowed through text, such as audio speech output that gives instructions on how to carry out an illegal activity.

Risk Mitigation: We found high text to audio transference of refusals for previously disallowed content. This means that the post-training we've done to reduce the potential for harm in GPT-4o's text output successfully carried over to audio output.

Additionally, we run our existing moderation model over a text transcription of both audio input and audio output to detect if either contains potentially harmful language, and will block a generation if so8.

Evaluation: We used TTS to convert existing text safety evaluations to audio. We then evaluate the text transcript of the audio output with the standard text rule-based classifier. Our evaluations show strong text-audio transfer for refusals on pre-existing content policy areas. Further evaluations can be found in Appendix A.

Table 5: Performance comparison of safety evaluations: Text vs. Audio

                    Text   Audio
Not Unsafe          0.95   0.93
Not Over-refuse     0.81   0.82

3.3.6 Erotic and violent speech content

Risk Description: GPT-4o may be prompted to output erotic or violent speech content, which may be more evocative or harmful than the same content in text. Because of this, we decided to restrict the generation of erotic and violent speech.

8 We describe the risks and mitigations for violative and disallowed text content in the GPT-4 System Card [6], specifically Section 3.1 Model Safety and Section 4.2 Content Classifier Development.

Risk Mitigation: We run our existing moderation model [17] over a text transcription of the audio input to detect if it contains a request for violent or erotic content, and will block a generation if so.
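As a rough sketch of this input-side gating, the example below uses OpenAI's public moderation endpoint as a stand-in for the moderation model cited above; the category choices and blocking behavior are illustrative assumptions, not the production configuration.

from openai import OpenAI

client = OpenAI()


def should_block_audio_turn(input_transcript: str) -> bool:
    """Return True if the transcribed audio prompt requests erotic or violent content."""
    result = client.moderations.create(input=input_transcript).results[0]
    return bool(result.categories.sexual or result.categories.violence)


def respond_to_audio_turn(input_transcript: str) -> str:
    if should_block_audio_turn(input_transcript):
        return "<generation blocked>"
    # Otherwise hand the turn to the speech-to-speech model (not shown here).
    return "<proceed with model generation>"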

3.3.7 Other known risks and limitations of the model

Through the course of internal testing and external red teaming, we discovered some additional risks and model limitations for which model or system level mitigations are nascent or still in development, including:

Audio robustness: We saw anecdotal evidence of decreases in safety robustness through audio perturbations, such as low quality input audio, background noise in the input audio, and echoes in the input audio. Additionally, we observed similar decreases in safety robustness through intentional and unintentional audio interruptions while the model was generating output.

Misinformation and conspiracy theories: Red teamers were able to compel the model to generate inaccurate information by prompting it to verbally repeat false information and produce conspiracy theories. While this is a known issue for text in GPT models [18, 19], there was concern from red teamers that this information may be more persuasive or harmful when delivered through audio, especially if the model was instructed to speak emotively or emphatically. The persuasiveness of the model was studied in detail (see Section 3.7), and we found that the model did not score higher than Medium risk for text-only, and for speech-to-speech the model did not score higher than Low.

Speaking a non-English language in a non-native accent: Red teamers observed instances of the audio output using a non-native accent when speaking in a non-English language. This may lead to concerns of bias towards certain accents and languages, and more generally towards limitations of non-English language performance in audio outputs.

Generating copyrighted content: We also tested GPT-4o's capacity to repeat content found within its training data. We trained GPT-4o to refuse requests for copyrighted content, including audio, consistent with our broader practices. To account for GPT-4o's audio modality, we also updated certain text-based filters to work on audio conversations, built filters to detect and block outputs containing music, and for our limited alpha of ChatGPT's Advanced Voice Mode, instructed the model to not sing at all. We intend to track the effectiveness of these mitigations and refine them over time.

Although some technical mitigations are still in development, our Usage Policies [20] disallow intentionally deceiving or misleading others, and circumventing safeguards or safety mitigations. In addition to technical mitigations, we enforce our Usage Policies through monitoring and take action on violative behavior in both ChatGPT and the API.

3.4 Preparedness Framework Evaluations

We evaluated GPT-4o in accordance with our Preparedness Framework [4]. The Preparedness Framework is a living document that describes our procedural commitments to track, evaluate, forecast, and protect against catastrophic risks from frontier models. The evaluations currently cover four risk categories: cybersecurity, CBRN (chemical, biological, radiological, nuclear),
