Distributed Machine Learning with Python
Accelerating model training and serving with distributed systems
Guanhua Wang
BIRMINGHAM—MUMBAI
Distributed Machine Learning with Python
Copyright ? 2022 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Publishing Product Manager: Ali Abidi
Senior Editors: Roshan Kumar, Nathanya Diaz
Content Development Editors: Tazeen Shaikh, Shreya Moharir
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Alishon Mendonca
Marketing Coordinators: Abeer Riyaz Dawe, Shifa Ansari
First published: May 2022
Production reference: 1040422
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-80181-569-7
To my parents, Ying Han and Xin Wang
To my girlfriend, Jing Yuan
– Guanhua Wang
Contributors
About the author
Guanhua Wang is a final-year computer science Ph.D. student in the RISELab at UC Berkeley, advised by Professor Ion Stoica. His research lies primarily in the machine learning systems area, including fast collective communication, efficient in-parallel model training, and real-time model serving. His research has gained lots of attention from both academia and industry. He was invited to give talks at top-tier universities (MIT, Stanford, CMU, Princeton) and big tech companies (Facebook/Meta, Microsoft). He received his master's degree from HKUST and a bachelor's degree from Southeast University in China. He has also done some cool research on wireless networks. He likes playing soccer and has run multiple half-marathons in the Bay Area of California.
About the reviewers
Jamshaid Sohail is passionate about data science, machine learning, computer vision, and natural language processing and has more than 2 years of experience in the industry. He previously worked as a data scientist at FunnelBeam, a Silicon Valley-based start-up whose founders are from Stanford University. Currently, he is working as a data scientist at Systems Limited. He has completed over 66 online courses from different platforms. He authored the book Data Wrangling with Python 3.X for Packt Publishing and has reviewed multiple books and courses. He is also developing a comprehensive course on data science at Educative and is in the process of writing books for multiple publishers.
Hitesh Hinduja is an ardent AI enthusiast working as a senior manager in AI at Ola Electric, where he leads a team of 20+ people in the areas of ML, statistics, CV, NLP, and reinforcement learning. He has filed 14+ patents in India and the US and has numerous research publications to his name. Hitesh has been involved in research roles at India's top business schools: the Indian School of Business, Hyderabad, and the Indian Institute of Management, Ahmedabad. He is also actively involved in training and mentoring and has been invited to be a guest speaker by various corporations and associations across the globe.
Table of Contents
Preface
Section 1 – Data Parallelism
1
Splitting Input Data
Single-node training is too slow
The mismatch between data loading bandwidth and model training bandwidth
Single-node training time on popular datasets
Accelerating the training process with data parallelism
Data parallelism – the high-level bits
Stochastic gradient descent
Model synchronization
Hyperparameter tuning
Global batch size
Learning rate adjustment
Model synchronization schemes
Summary
2
Parameter Server and All-Reduce
Technical requirements
Parameter server architecture
Communication bottleneck in the parameter server architecture
Sharding the model among parameter servers
Implementing the parameter server
Defining model layers
Defining the parameter server
Defining the worker
Passing data between the parameter server and worker
Issues with the parameter server
The parameter server architecture introduces a high coding complexity for practitioners
All-Reduce architecture
Reduce
All-Reduce
Ring All-Reduce
Collective communication
Broadcast
Gather
All-Gather
Summary
3
Building a Data Parallel Training and Serving Pipeline
Technical requirements
The data parallel training pipeline in a nutshell
Input pre-processing
Input data partition
Data loading
Training
Model synchronization
Model update
Single-machine multi-GPUs and multi-machine multi-GPUs
Single-machine multi-GPU
Multi-machine multi-GPU
Checkpointing and fault tolerance
Model checkpointing
Load model checkpoints
Model evaluation and hyperparameter tuning
Model serving in data parallelism
Summary

4
Bottlenecks and Solutions
Communication bottlenecks in data parallel training
Analyzing the communication workloads
Parameter server architecture
The All-Reduce architecture
The inefficiency of state-of-the-art communication schemes
Leveraging idle links and host resources
Tree All-Reduce
Hybrid data transfer over PCIe and NVLink
On-device memory bottlenecks
Recomputation and quantization
Recomputation
Quantization
Summary
Section 2 – Model Parallelism

5
Splitting the Model
Technical requirements
Single-node training error – out of memory
Fine-tuning BERT on a single GPU
Trying to pack a giant model inside one state-of-the-art GPU
ELMo, BERT, and GPT
Basic concepts
RNN
ELMo
BERT
GPT
Pre-training and fine-tuning
State-of-the-art hardware
P100, V100, and DGX-1
NVLink
A100 and DGX-2
NVSwitch
Summary

6
Pipeline Input and Layer Split
Vanilla model parallelism is inefficient
Forward propagation
Backward propagation
GPU idle time between forward and backward propagation
Pipeline input
Pros and cons of pipeline parallelism
Advantages of pipeline parallelism
Disadvantages of pipeline parallelism
Layer split
Notes on intra-layer model parallelism
Summary
7
Implementing Model Parallel Training and Serving Workflows
Technical requirements
Wrapping up the whole model parallelism pipeline
A model parallel training overview
Implementing a model parallel training pipeline
Specifying communication protocol among GPUs
Model parallel serving
Fine-tuning transformers
Hyperparameter tuning in model parallelism
Balancing the workload among GPUs
Enabling/disabling pipeline parallelism
NLP model serving
Summary
8
Achieving Higher Throughput and Lower Latency
Technical requirements
Freezing layers
Freezing layers during forward propagation
Reducing computation cost during forward propagation
Freezing layers during backward propagation
Exploring memory and storage resources
Understanding model decomposition and distillation
Model decomposition
Model distillation
Reducing bits in hardware
Summary
Section 3 – Advanced Parallelism Paradigms

9
A Hybrid of Data and Model Parallelism
Technical requirements
Case study of Megatron-LM
Layer split for model parallelism
Row-wise trial-and-error approach
Column-wise trial-and-error approach
Cross-machine for data parallelism
Implementation of Megatron-LM
Case study of Mesh-TensorFlow
Implementation of Mesh-TensorFlow
Pros and cons of Megatron-LM and Mesh-TensorFlow
Summary
10
Federated Learning and Edge Devices
Technical requirements
Sharing knowledge without sharing data
Recapping the traditional data parallel model training paradigm
No input sharing among workers
Communicating gradients for collaborative learning
Case study: TensorFlow Federated
Running edge devices with TinyML
Case study: TensorFlow Lite
Summary
11
Elastic Model Training and Serving
Technical requirements
Introducing adaptive model training
Traditional data parallel training
Adaptive model training in data parallelism
Adaptive model training (AllReduce-based)
Adaptive model training (parameter server-based)
Traditional model-parallel model training paradigm
Adaptive model training in model parallelism
Implementing adaptive model training in the cloud
Elasticity in model inference
Serverless
Summary
12
Advanced Techniques for Further Speed-Ups
Technical requirements
Debugging and performance analytics
General concepts in the profiling results
Communication results analysis
Computation results analysis
Job migration and multiplexing
Job migration
Job multiplexing
Model training in a heterogeneous environment
Summary

Index
Other Books You May Enjoy
Preface
Reducing time costs in machine learning leads to a shorter waiting time for model training and a faster model updating cycle. Distributed machine learning enables machine learning practitioners to shorten model training and inference time by orders of magnitude. With the help of this practical guide, you'll be able to put your Python development knowledge to work to get up and running with the implementation of distributed machine learning, including multi-node machine learning systems, in no time.
You'll begin by exploring how distributed systems work in the machine learning area and how distributed machine learning is applied to state-of-the-art deep learning models. As you advance, you'll see how to use distributed systems to enhance machine learning model training and serving speed. You'll also get to grips with applying data parallel and model parallel approaches before optimizing the in-parallel model training and serving pipeline in local clusters or cloud environments.
By the end of this book, you'll have gained the knowledge and skills needed to build and deploy an efficient data processing pipeline for machine learning model training and inference in a distributed manner.
Who this book is for
This book is for data scientists, machine learning engineers, and machine learning practitioners in both academia and industry. A fundamental understanding of machine learning concepts and a working knowledge of Python programming are assumed. Prior experience implementing machine learning/deep learning models with TensorFlow or PyTorch will be beneficial. You'll find this book useful if you are interested in using distributed systems to boost machine learning model training and serving speed.
What this book covers
Chapter 1, Splitting Input Data, shows how to distribute the machine learning training or serving workload on the input data dimension, which is called data parallelism.
Chapter 2, Parameter Server and All-Reduce, describes two widely adopted model synchronization schemes in the data parallel training process.
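As a quick, illustrative taste of the model synchronization idea these two chapters build toward, the following minimal sketch (our own example, not the book's code) averages gradients across data parallel workers with PyTorch's All-Reduce primitive; it assumes the torch.distributed process group has already been initialized (for example, via torchrun) and that every worker holds a replica of the same model:
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    # Average every parameter's gradient across all data parallel workers.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradient tensors from all workers in place...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide by the number of workers to get the mean.
            param.grad /= world_size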
Chapter 3, Building a Data Parallel Training and Serving Pipeline, illustrates how to implement data parallel training and the serving workflow.
Chapter 4, Bottlenecks and Solutions, describes how to improve data parallelism performance with some advanced techniques, such as more efficient communication protocols and reducing the memory footprint.
Chapter 5, Splitting the Model, introduces the vanilla model parallel approach in general.
Chapter 6, Pipeline Input and Layer Split, shows how to improve system efficiency with pipeline parallelism.
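To make the layer-splitting idea concrete, here is a minimal sketch (again our own illustration, not code from the book) that places the two halves of a toy PyTorch model on different GPUs and moves the activations between them; it assumes at least two CUDA devices are visible:
import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    # A toy model whose first half lives on cuda:0 and second half on cuda:1.
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.part1(x.to("cuda:0"))
        # Ship the intermediate activation to the second device.
        return self.part2(x.to("cuda:1"))

model = TwoGPUModel()
logits = model(torch.randn(32, 1024))  # the output tensor lives on cuda:1
Pipeline parallelism, covered in Chapter 6, improves on this vanilla split by feeding micro-batches through the stages so that neither GPU sits idle for the whole forward and backward pass.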
Chapter 7, Implementing Model Parallel Training and Serving Workflows, discusses how to implement model parallel training and serving in detail.
Chapter 8, Achieving Higher Throughput and Lower Latency, covers advanced schemes to reduce computation and memory consumption in model parallelism.
Chapter 9, A Hybrid of Data and Model Parallelism, combines data and model parallelism together as an advanced in-parallel model training/serving scheme.
Chapter 10, Federated Learning and Edge Devices, talks about federated learning and how edge devices are involved in this process.
Chapter 11, Elastic Model Training and Serving, describes a more efficient scheme that can change the number of accelerators used on the fly.
Chapter 12, Advanced Techniques for Further Speed-Ups, summarizes several useful tools, such as a performance debugging tool, job multiplexing, and heterogeneous model training.
To get the most out of this book
You will need to install PyTorch/TensorFlow successfully on your system. For distributed workloads, we suggest you have at least four GPUs in hand.
We assume you have Linux/Ubuntu as your operating system. We assume you use NVIDIA GPUs and have installed the proper NVIDIA driver as well. We also assume you have basic knowledge about machine learning in general and are familiar with popular deep learning models.
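As a quick sanity check of this setup, a short snippet such as the following (our own illustration, not part of the book's code bundle) confirms that PyTorch can see the driver and the GPUs:
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))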
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
Download the example code files
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Distributed-Machine-Learning-with-Python. If there's an update to the code, it will be updated in the GitHub repository.
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
Download the color images
We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801815697_ColorImages.pdf
Conventions used
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Replace YOUR_API_KEY_HERE with the subscription key of your Cognitive Services resource. Leave the quotation marks!"
A block of code is set as follows:
# Import the Azure SDK clients used below
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

# Connect to the API through the subscription key and endpoint
subscription_key = "<your-subscription-key>"
endpoint = "https://<your-cognitive-service>.cognitiveservices.azure.com/"

# Authenticate
credential = AzureKeyCredential(subscription_key)
cog_client = TextAnalyticsClient(endpoint=endpoint, credential=credential)
Bold: Indicates a new term, an important word, or words that you see on screen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select Review + Create."
Tips or Important Notes
Appear like this.
Get in touch
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If yo