Johns Hopkins University and Bloomberg: A Large Language Model for Finance (BloombergGPT: A Large Language Model for Finance)

BloombergGPT: A Large Language Model for Finance

Shijie Wu1,*, Ozan İrsoy1,*, Steven Lu1,*, Vadim Dabravolski1, Mark Dredze1,2, Sebastian Gehrmann1, Prabhanjan Kambadur1, David Rosenberg1, Gideon Mann1

1 Bloomberg, New York, NY USA
2 Computer Science, Johns Hopkins University, Baltimore, MD USA
*. Co-first authors.

Abstract

The use of NLP in the realm of financial technology is broad and complex, with applications ranging from sentiment analysis and named entity recognition to question answering. Large Language Models (LLMs) have been shown to be effective on a variety of tasks; however, no LLM specialized for the financial domain has been reported in literature. In this work, we present BloombergGPT, a 50 billion parameter language model that is trained on a wide range of financial data. We construct a 363 billion token dataset based on Bloomberg's extensive data sources, perhaps the largest domain-specific dataset yet, augmented with 345 billion tokens from general-purpose datasets. We validate BloombergGPT on standard LLM benchmarks, open financial benchmarks, and a suite of internal benchmarks that most accurately reflect our intended usage. Our mixed dataset training leads to a model that outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks. Additionally, we explain our modeling choices, training process, and evaluation methodology. As a next step, we plan to release training logs (Chronicles) detailing our experience in training BloombergGPT.

Contents

1 Introduction
  1.1 BloombergGPT
  1.2 Broader Contributions
2 Dataset
  2.1 Financial Datasets (363B tokens – 51.27% of training)
    2.1.1 Web (298B tokens – 42.01% of training)
    2.1.2 News (38B tokens – 5.31% of training)
    2.1.3 Filings (14B tokens – 2.04% of training)
    2.1.4 Press (9B tokens – 1.21% of training)
    2.1.5 Bloomberg (5B tokens – 0.70% of training)
  2.2 Public Datasets (345B tokens – 48.73% of training)
    2.2.1 The Pile (184B tokens – 25.9% of training)
    2.2.2 C4 (138B tokens – 19.48% of training)
    2.2.3 Wikipedia (24B tokens – 3.35% of training)
  2.3 Tokenization
3 Model
  3.1 Architecture
  3.2 Model Scaling
  3.3 Training Configuration
  3.4 Large-scale Optimization
4 Training Run
5 Evaluation
  5.1 Few-shot Methodology
  5.2 Heldout Loss
  5.3 Financial Tasks
    5.3.1 External Financial Tasks
    5.3.2 Internal Task: Sentiment Analysis
    5.3.3 Exploratory Task: NER
  5.4 BIG-bench Hard
  5.5 Knowledge Assessments
  5.6 Reading Comprehension
  5.7 Linguistic Tasks
  5.8 Summary
6 Qualitative Samples
7 Related Work
8 Ethics, Limitations, and Implications
  8.1 Ethical Use
9 Conclusion
A Architecture
  A.0 Notation
  A.1 Full Architecture
  A.2 SelfAttention with ALiBi (SA)
  A.3 LayerNorm (LN)
  A.4 FeedForwardNetwork (FFN)
  A.5 List of All Trainable Parameters
B Details on external financial tasks

1. Introduction

The release of GPT-3 in 2020 (Brown et al., 2020) demonstrated the powerful benefits of training very large auto-regressive language models (LLMs). GPT-3 had 175 billion parameters, a hundredfold increase over the previous GPT-2 model, and did remarkably well across a wide range of now popular LLM tasks, including reading comprehension, open-ended question answering, and code generation. This performance has been replicated across several other models (Chowdhery et al., 2022; Scao et al., 2022; Zhang et al., 2022a). Furthermore, evidence suggests that large models exhibit emergent behaviors; growth allows them to acquire abilities not present in smaller models (Wei et al., 2022a). A notable example of emergent behavior is the ability to perform tasks via few-shot prompting, where a model can learn a task from just a few examples. As language models grow in size, this ability improves to well above random performance. Broadly speaking, few-shot prompting dramatically expands the range of tasks supported by models and lowers the barrier to entry for users seeking automation for new language tasks.

After GPT-3, models grew in size to 280 billion (Gopher, Rae et al., 2021), 540 billion (PaLM, Chowdhery et al., 2022), and 1 trillion parameters (Megatron, Korthikanti et al., 2022). Work also explored other important aspects of achieving a high-performing LLM, such as different training objectives (Tay et al., 2022b), multilingual models (Scao et al., 2022), more efficient and smaller models (Black et al., 2022), and finding data and parameter-efficient training sizes (Hoffmann et al., 2022).

These efforts have almost exclusively focused on general LLMs, trained on datasets that cover a broad range of topics and domains. While these have included some datasets for specialized domains (e.g., code (Chen et al., 2021a) or biomedical articles (Gao et al., 2021)), the focus has been on building LLMs with broad capabilities. Recent efforts training models using only domain-specific data have yielded models that, while much smaller, beat general-purpose LLMs on tasks within those domains, such as science (Taylor et al., 2022) and medicine (Bolton et al., 2023; Luo et al., 2022; Lehman et al., 2023). These findings motivate further development of models focused on specific domains.

Financial Technology (FinTech) is a large and growing area with NLP technologies having an increasingly important role (Xing et al., 2018; Fisher et al., 2016; Dredze et al., 2016). Financial NLP tasks (Shah et al., 2022) include sentiment analysis (Araci, 2019), named entity recognition (Salinas Alvarado et al., 2015), news classification (Sinha and Khandait, 2020), and question answering (Chen et al., 2021b, 2022). While the range of tasks is similar to those found in general NLP benchmarks, the complexity and terminology of the financial domain warrant a domain-specific system. For all of the reasons generative LLMs are attractive in general (few-shot learning, text generation, conversational systems, etc.), it would be valuable to have an LLM focused on the financial domain. While there are masked language models tuned for the financial domain (Araci, 2019), no LLM has been tuned for or evaluated on tasks for this domain.

1.1 BloombergGPT

We train BloombergGPT, a 50 billion parameter language model that supports a wide range of tasks within the financial industry. Rather than building a general-purpose LLM, or a small LLM exclusively on domain-specific data, we take a mixed approach. General models cover many domains, are able to perform at a high level across a wide variety of tasks, and obviate the need for specialization during training time. However, results from existing domain-specific models show that general models cannot replace them. At Bloomberg, we support a very large and diverse set of tasks, well served by a general model, but the vast majority of our applications are within the financial domain, better served by a specific model. For that reason, we set out to build a model that achieves best-in-class results on financial benchmarks, while also maintaining competitive performance on general-purpose LLM benchmarks.

We achieve this goal by constructing the largest domain-specific dataset yet, drawing on existing data creation, collection, and curation resources at Bloomberg. As Bloomberg is primarily a financial data company, our data analysts have collected and curated financial language documents over the span of forty years. We have extensive archives of financial data that cover a range of topics, with careful tracking of data sources and usage rights. We add this data to public datasets to create a large training corpus with over 700 billion tokens. Using a portion of this training corpus, we train a BLOOM-style, 50 billion parameter model designed based on guidelines from Hoffmann et al. (2022) and Le Scao et al. (2022). We validate the model on standard LLM benchmarks, open financial benchmarks, and a suite of Bloomberg-internal benchmarks that most accurately reflect our intended use cases. Our results demonstrate that our mixed training approach leads to a model that vastly outperforms existing models on in-domain financial tasks while being on par or better on general NLP benchmarks.

1.2 Broader Contributions

Beyond the construction of an LLM for financial data, our goal is to contribute to the broader research community. Specifically, our experience documented in this paper provides evidence that further develops the community's understanding of several open questions in the literature.

Domain-specific LLMs. The few existing domain-specific LLMs are trained exclusively on domain-specific data sources (Luo et al., 2022; Bolton et al., 2023; Taylor et al., 2022), or adapt a very large general-purpose model to domain-specific tasks (Singhal et al., 2022; Lewkowycz et al., 2022). Our alternative approach, training an LLM on both domain-specific and general data sources, has not been studied so far. The resulting model does very well on domain-specific tasks, but also maintains strong performance on general-purpose benchmarks.

Training data. Nearly all language models rely in large part on web-scraped data, such as C4 (Raffel et al., 2020) and The Pile (Gao et al., 2021) (which includes OpenWebText2). This data may be cleaned or subsetted in various ways before use (Touvron et al., 2023; Rae et al., 2020; Scao et al., 2022; Jernite et al., 2022), but issues of data duplication (Carlini et al., 2020) and toxic language (Welbl et al., 2021) remain. Our training data is unusual for LLM training in that it includes a significant amount of curated and prepared data from reliable sources.

Evaluation. LLM evaluation remains a challenging and evolving problem (Gehrmann et al., 2022; Goyal et al., 2022), with new benchmarks trying to standardize evaluation across models (Liang et al., 2022; Srivastava et al., 2022). However, for domain-specific tasks, there remains a mismatch between evaluation and actual use cases. Evaluations are built on available datasets and not necessarily on how the model will be used in practice. We provide results on both public financial NLP benchmarks (Shah et al., 2022; Chen et al., 2021b) as well as a selection of internal Bloomberg tasks, which are better aligned with our intended use cases and directly evaluate our model's ability to perform tasks of interest.

Model Size. Early LLMs made a single training pass over a corpus of 200-400 billion tokens (Brown et al., 2020). Hoffmann et al. (2022) posited that such models were undertrained and instead focused on training smaller models with more data, a strategy most recently employed by Touvron et al. (2023). We select a model size motivated by Hoffmann et al. (2022) and train a 50 billion parameter model on 569 billion tokens from our corpus of over 700 billion tokens to produce a model that is competitive with larger models.

Tokenizer. After assembling training data, the critical step of tokenization transforms the text into a format suitable for the language model. The importance of this step is often overlooked (Mielke et al., 2021), and many older LLMs use the same tokenizer and vocabulary, meaning that we have little evidence to support other tokenizers. We take a different approach and use a Unigram model instead of greedy merge-based sub-word tokenizers since it saves probabilities, allowing for smarter tokenization at inference time (Kudo, 2018).

Model Building Challenges. GPT-3 and subsequent models were the work of large teams and required an enormous amount of computation. Initial work to reproduce these results, such as OPT (Zhang et al., 2022a), did not match the performance of the original model. With the release of each subsequent model, the community's understanding, experience, and software tools increase. In developing BloombergGPT, we benefited from existing code developed as part of the BLOOM effort (Scao et al., 2022), showing that a moderately sized team can produce a competitive model on domain-specific data. We describe our experiences training BloombergGPT in detail to support future training efforts and address each of the above topics.

2. Dataset

To train BloombergGPT, we construct “FinPile”, a comprehensive dataset consisting of a range of English financial documents including news, filings, press releases, web-scraped financial documents, and social media drawn from the Bloomberg archives. These documents have been acquired through our business process over the past two decades. We augment FinPile with public data widely used to train LLMs. The result is a training corpus that is roughly half domain-specific text and half general-purpose text. For a breakdown of the full training set, see Table 1. To improve data quality, we de-duplicate each dataset (The Pile, C4, Wikipedia, FinPile) according to Lee et al. (2022a); as a side-effect, the statistics reported in Table 1 might be different from those reported in other papers.

[Table 1. Columns: Dataset, Docs, C/D, Chars, C/T, Toks, T%. Rows: FinPile (Web, Filings, Bloomberg); PUBLIC (Pile-CC, GitHub, PubMed Central, ArXiv, OpenWebText2, DM Mathematics, Wikipedia (en), USPTO Backgrounds, PubMed Abstracts, OpenSubtitles, Gutenberg (PG-19), Ubuntu IRC, EuroParl, YouTubeSubtitles, BookCorpus2, PhilPapers, NIH ExPorter, Enron Emails, Wikipedia (7/1/22)); TOTAL.]

Table 1: Breakdown of the full training set used to train BloombergGPT. The statistics provided are the average number of characters per document (“C/D”), the average number of characters per token (“C/T”), and the percentage of the overall tokens (“T%”). Units for each column are denoted in the header.

2.1 Financial Datasets (363B tokens – 51.27% of training)

The Bloomberg Terminal has provided access to a comprehensive set of diverse structured and unstructured financial data and analytics for the past four decades. In serving this mission, Bloomberg analysts have curated a set of financial documents that were either created internally or acquired from external sources. We utilize this extensive collection of curated and maintained documents to create FinPile, which consists of company filings, financial news, and other data relevant to the financial markets.

Some documents included in the FinPile, such as company filings, are available to the general public, although collecting these documents and pre-processing them for LLM training is a non-trivial task. Other documents, such as (a subset of) Bloomberg news, must be purchased. The rest of the documents are private and available, among other sources, through the Bloomberg Terminal. Finally, we clean this data to strip off markup, special formatting, and templates.

Note that each document in FinPile is time-stamped, with dates ranging from 2007-03-01 to 2022-07-31; the quality and quantity of documents increase over this time range. While we do not utilize date information in this work, we plan to use it in the future, such as for evaluation of what the model learns about different time periods. While we cannot release FinPile, our experience training on a large, carefully curated, and clean domain-specific dataset may provide helpful insights to the community on the advantages and challenges of building a financial LLM in particular, and a domain-specific model in general. We provide a breakdown and analysis of FinPile in Table 2 and a brief description of the types of data included below.

2.1.1 Web (298B tokens – 42.01% of training)

Bloomberg collects web content by identifying sites that contain financially relevant information. While this category makes up the majority of FinPile, its classifications are rough, with content classified mainly by the location of the web domain. Within these location-specific sources, e.g. “US” (15.95% of total), “Asia-Pac” (4.72% of total), and “UK” (1.98% of total), document types are highly varied as would be expected from a web crawl. While web sources are common in existing public LLM training datasets, Bloomberg's web crawl is focused on high-quality websites that have financially relevant information, as opposed to a general-purpose crawl of the web.

2.1.2 News (38B tokens – 5.31% of training)

The News category includes all news sources excluding news articles written by Bloomberg journalists. Overall, there are hundreds of English news sources in FinPile including “Bloomberg Transcripts” (0.41% of total), which are transcripts of Bloomberg TV news. Generally, the content in this dataset comes from reputable sources of news that are relevant to the financial community so as to maintain factuality and reduce bias.

2.1.3 Filings (14B tokens – 2.04% of training)

Company Filings are financial statements prepared by (public) companies and made available to the general public. In some countries, like the US, public companies are mandated to prepare and submit their financial statements on a regular cadence; e.g., 10-K annual reports and 10-Q quarterly reports. In our dataset, a majority of the filings come from EDGAR, which is the SEC's online database (1.90% of total). Filings are typically long PDF documents with tables and charts that are dense in financial information, which are processed and normalized in Bloomberg. Filings are substantially different from the types of documents typically used to train LLMs, but contain critically important information for financial decision-making.

[Table 2. Columns: Date, Bloomberg, Filings, News, Press, Web, Total.]

Table 2: The number of tokens (in millions) contained within documents in FinPile, organized by year (rows) and type (column). Units are millions of tokens.

2.1.4 Press (9B tokens – 1.21% of training)

The Press category contains press releases typically issued by companies that are financially relevant. Taken together with filings, press releases represent most of the public communications of a company. However, unlike filings, press releases are similar to news stories in terms of content and style.

2.1.5 Bloomberg (5B tokens – 0.70% of training)

This category comprises Bloomberg-authored news and other documents such as opinions and analyses. The largest sources are “Bloomberg News” (0.44% of total) and “Bloomberg First Word” (0.13% of total), the Bloomberg-authored wire of real-time news. While Bloomberg News covers a wide range of topics, it typically focuses on content relevant to the financial community. This dataset contains documents of varying lengths.

2.2 Public Datasets (345B tokens – 48.73% of training)

We use three widely known and available public datasets in our training corpus.

2.2.1 The Pile (184B tokens – 25.9% of training)

The Pile (Gao et al., 2021) is the dataset used in GPT-Neo (Black et al., 2021), GPT-J (Wang and Komatsuzaki, 2021), and GPT-NeoX (20B) (Black et al., 2022). We include The Pile in our training data for the following reasons. First, it has been used to successfully train an LLM. Second, it has undergone significant data cleaning and pre-processing. Third, it includes multiple domains, and we believe such diverse data will aid generalization to new domains and may even support training on financial data. For example, domains such as FreeLaw and GitHub are useful to teams at Bloomberg that work on legal documents and software development, respectively. Creators of The Pile have deliberately chosen to include duplicate content, with the duplication factor being proportional to the perceived quality of the content. However, as we deduplicate each of our datasets, the size of The Pile is significantly reduced. Additionally, note that our tokenizer (§2.3) is trained on The Pile.

2.2.2 C4 (138B tokens – 19.48% of training)

The Colossal Clean Crawled Corpus (C4) is a common dataset used to train LLMs, and was introduced to support training T5 (Raffel et al., 2020). Although it overlaps with Pile-CC, C4 is cleaned and processed differently; hence, we feel that including C4 in addition to The Pile can add more value than duplicated documents would. We find that C4 contains high-quality natural language documents due to the layers of cleaning, though others have noted that the distribution across web domains is unusual, with a high fraction of data stemming from patents (Dodge et al., 2021).

2.2.3 Wikipedia (24B tokens – 3.35% of training)

Both The Pile and C4 include out-of-date copies of Wikipedia, so it could be beneficial for the factuality of the model to have up-to-date Wikipedia pages included. Therefore, we include a dump of English Wikipedia from July 1, 2022. This dataset is tokenized quite inefficiently (3.06 characters per token), indicating an above-average amount of markup, which suggests that further cleaning might benefit future model training.

2.3 Tokenization

We choose the Unigram tokenizer (Kudo,
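
The dataset mixture described in Section 2 can be sanity-checked with a few lines of arithmetic. The Python sketch below is illustrative only: the dictionary and function names are assumptions introduced here, not part of the paper, and the inputs are the rounded token counts and percentages printed in the Section 2.1.x and 2.2.x headings.

```python
# Token counts (billions) and reported shares (% of training tokens),
# copied from the Section 2.1.x and 2.2.x headings. Names are illustrative.
FINPILE = {
    "Web": (298, 42.01),
    "News": (38, 5.31),
    "Filings": (14, 2.04),
    "Press": (9, 1.21),
    "Bloomberg": (5, 0.70),
}
PUBLIC = {
    "The Pile": (184, 25.9),
    "C4": (138, 19.48),
    "Wikipedia": (24, 3.35),
}


def reported_share(group):
    """Sum the per-source percentages reported in the headings."""
    return round(sum(share for _, share in group.values()), 2)


def share_from_counts(group_tokens, total_tokens):
    """Recompute a percentage share directly from token counts."""
    return round(100.0 * group_tokens / total_tokens, 2)


if __name__ == "__main__":
    # Headline totals from the Section 2.1 and 2.2 headings (billions of tokens).
    finpile_total, public_total = 363, 345
    corpus_total = finpile_total + public_total

    print("FinPile, summed sub-shares:", reported_share(FINPILE))
    print("Public, summed sub-shares:", reported_share(PUBLIC))
    print("FinPile, from counts:", share_from_counts(finpile_total, corpus_total))
    print("Public, from counts:", share_from_counts(public_total, corpus_total))
```

Both routes agree: the five FinPile sub-shares sum to 51.27% and the three public sub-shares to 48.73%, and the same split follows from 363B and 345B tokens out of 708B. Because the per-source token counts in the headings are rounded to the nearest billion, summing them directly gives totals that differ from the headline figures by about 1B.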
