
Distributed Machine Learning with Python

Accelerating model training and serving with distributed systems

Guanhua Wang

BIRMINGHAM—MUMBAI

Distributed Machine Learning with Python

Copyright © 2022 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

Publishing Product Manager: Ali Abidi
Senior Editors: Roshan Kumar, Nathanya Diaz
Content Development Editors: Tazeen Shaikh, Shreya Moharir
Technical Editor: Devanshi Ayare
Copy Editor: Safis Editing
Project Coordinator: Aparna Ravikumar Nair
Proofreader: Safis Editing
Indexer: Pratik Shirodkar
Production Designer: Alishon Mendonca
Marketing Coordinators: Abeer Riyaz Dawe, Shifa Ansari

First published: May 2022

Production reference: 1040422

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-80181-569-7

To my parents, Ying Han and Xin Wang

To my girlfriend, Jing Yuan

– Guanhua Wang

Contributors

About the author

Guanhua Wang is a final-year computer science Ph.D. student in the RISELab at UC Berkeley, advised by Professor Ion Stoica. His research lies primarily in the machine learning systems area, including fast collective communication, efficient in-parallel model training, and real-time model serving. His research has gained a lot of attention from both academia and industry. He has been invited to give talks at top-tier universities (MIT, Stanford, CMU, Princeton) and big tech companies (Facebook/Meta, Microsoft). He received his master's degree from HKUST and a bachelor's degree from Southeast University in China. He has also done some cool research on wireless networks. He likes playing soccer and has run multiple half-marathons in the Bay Area of California.

About the reviewers

Jamshaid Sohail is passionate about data science, machine learning, computer vision, and natural language processing and has more than 2 years of experience in the industry. He previously worked as a data scientist at FunnelBeam, a Silicon Valley-based start-up whose founders are from Stanford University. Currently, he is working as a data scientist at Systems Limited. He has completed over 66 online courses on different platforms. He authored the book Data Wrangling with Python 3.X for Packt Publishing and has reviewed multiple books and courses. He is also developing a comprehensive course on data science at Educative and is in the process of writing books for multiple publishers.

Hitesh Hinduja is an ardent AI enthusiast working as a senior manager in AI at Ola Electric, where he leads a team of 20+ people in the areas of ML, statistics, CV, NLP, and reinforcement learning. He has filed 14+ patents in India and the US and has numerous research publications to his name. Hitesh has been involved in research roles at India's top business schools: the Indian School of Business, Hyderabad, and the Indian Institute of Management, Ahmedabad. He is also actively involved in training and mentoring and has been invited to be a guest speaker by various corporations and associations across the globe.

Table of Contents

Preface

Section 1 – Data Parallelism

Chapter 1, Splitting Input Data
  Single-node training is too slow  4
  The mismatch between data loading bandwidth and model training bandwidth  5
  Single-node training time on popular datasets  6
  Accelerating the training process with data parallelism  8
  Data parallelism – the high-level bits  9
  Stochastic gradient descent  13
  Model synchronization  14
  Hyperparameter tuning  15
  Global batch size  16
  Learning rate adjustment  16
  Model synchronization schemes  17
  Summary  18

Chapter 2, Parameter Server and All-Reduce
  Technical requirements  20
  Parameter server architecture  21
  Communication bottleneck in the parameter server architecture  22
  Sharding the model among parameter servers  24
  Implementing the parameter server  26
  Defining model layers  26
  Defining the parameter server  27
  Defining the worker  28
  Passing data between the parameter server and worker  30
  Issues with the parameter server  32
  The parameter server architecture introduces a high coding complexity for practitioners  33
  All-Reduce architecture  34
  Reduce  34
  All-Reduce  36
  Ring All-Reduce  37
  Collective communication  40
  Broadcast  40
  Gather  41
  All-Gather  42
  Summary  43

Chapter 3, Building a Data Parallel Training and Serving Pipeline
  Technical requirements  46
  The data parallel training pipeline in a nutshell  46
  Input pre-processing  48
  Input data partition  49
  Data loading  50
  Training  50
  Model synchronization  51
  Model update  52
  Single-machine multi-GPUs and multi-machine multi-GPUs  52
  Single-machine multi-GPU  52
  Multi-machine multi-GPU  56
  Checkpointing and fault tolerance  64
  Model checkpointing  64
  Load model checkpoints  65
  Model evaluation and hyperparameter tuning  67
  Model serving in data parallelism  71
  Summary  73

Chapter 4, Bottlenecks and Solutions
  Communication bottlenecks in data parallel training  76
  Analyzing the communication workloads  76
  Parameter server architecture  77
  The All-Reduce architecture  80
  The inefficiency of state-of-the-art communication schemes  83
  Leveraging idle links and host resources  85
  Tree All-Reduce  85
  Hybrid data transfer over PCIe and NVLink  91
  On-device memory bottlenecks  93
  Recomputation and quantization  94
  Recomputation  95
  Quantization  98
  Summary  99

Section 2 – Model Parallelism

Chapter 5, Splitting the Model
  Technical requirements  104
  Single-node training error – out of memory  105
  Fine-tuning BERT on a single GPU  105
  Trying to pack a giant model inside one state-of-the-art GPU  107
  ELMo, BERT, and GPT  110
  Basic concepts  110
  RNN  114
  ELMo  117
  BERT  119
  GPT  121
  Pre-training and fine-tuning  122
  State-of-the-art hardware  123
  P100, V100, and DGX-1  123
  NVLink  124
  A100 and DGX-2  125
  NVSwitch  125
  Summary  125

Chapter 6, Pipeline Input and Layer Split
  Vanilla model parallelism is inefficient  128
  Forward propagation  130
  Backward propagation  131
  GPU idle time between forward and backward propagation  132
  Pipeline input  137
  Pros and cons of pipeline parallelism  141
  Advantages of pipeline parallelism  141
  Disadvantages of pipeline parallelism  142
  Layer split  142
  Notes on intra-layer model parallelism  145
  Summary  145

Chapter 7, Implementing Model Parallel Training and Serving Workflows
  Technical requirements  148
  Wrapping up the whole model parallelism pipeline  149
  A model parallel training overview  149
  Implementing a model parallel training pipeline  150
  Specifying communication protocol among GPUs  153
  Model parallel serving  158
  Fine-tuning transformers  162
  Hyperparameter tuning in model parallelism  163
  Balancing the workload among GPUs  163
  Enabling/disabling pipeline parallelism  164
  NLP model serving  164
  Summary  165

Chapter 8, Achieving Higher Throughput and Lower Latency
  Technical requirements  169
  Freezing layers  169
  Freezing layers during forward propagation  171
  Reducing computation cost during forward propagation  173
  Freezing layers during backward propagation  174
  Exploring memory and storage resources  177
  Understanding model decomposition and distillation  180
  Model decomposition  180
  Model distillation  183
  Reducing bits in hardware  184
  Summary  184

Section 3 – Advanced Parallelism Paradigms

Chapter 9, A Hybrid of Data and Model Parallelism
  Technical requirements  189
  Case study of Megatron-LM  189
  Layer split for model parallelism  189
  Row-wise trial-and-error approach  192
  Column-wise trial-and-error approach  196
  Cross-machine for data parallelism  200
  Implementation of Megatron-LM  201
  Case study of Mesh-TensorFlow  203
  Implementation of Mesh-TensorFlow  204
  Pros and cons of Megatron-LM and Mesh-TensorFlow  204
  Summary  205

Chapter 10, Federated Learning and Edge Devices
  Technical requirements  209
  Sharing knowledge without sharing data  209
  Recapping the traditional data parallel model training paradigm  210
  No input sharing among workers  211
  Communicating gradients for collaborative learning  212
  Case study: TensorFlow Federated  217
  Running edge devices with TinyML  219
  Case study: TensorFlow Lite  219
  Summary  220

Chapter 11, Elastic Model Training and Serving
  Technical requirements  223
  Introducing adaptive model training  223
  Traditional data parallel training  224
  Adaptive model training in data parallelism  226
  Adaptive model training (AllReduce-based)  226
  Adaptive model training (parameter server-based)  229
  Traditional model-parallel model training paradigm  231
  Adaptive model training in model parallelism  232
  Implementing adaptive model training in the cloud  235
  Elasticity in model inference  236
  Serverless  238
  Summary  238

Chapter 12, Advanced Techniques for Further Speed-Ups
  Technical requirements  241
  Debugging and performance analytics  241
  General concepts in the profiling results  243
  Communication results analysis  245
  Computation results analysis  246
  Job migration and multiplexing  249
  Job migration  250
  Job multiplexing  251
  Model training in a heterogeneous environment  251
  Summary  252

Index

Other Books You May Enjoy

Preface

Reducing time costs in machine learning leads to a shorter waiting time for model training and a faster model updating cycle. Distributed machine learning enables machine learning practitioners to shorten model training and inference time by orders of magnitude. With the help of this practical guide, you'll be able to put your Python development knowledge to work to get up and running with the implementation of distributed machine learning, including multi-node machine learning systems, in no time.

You'll begin by exploring how distributed systems work in the machine learning area and how distributed machine learning is applied to state-of-the-art deep learning models. As you advance, you'll see how to use distributed systems to enhance machine learning model training and serving speed. You'll also get to grips with applying data parallel and model parallel approaches before optimizing the in-parallel model training and serving pipeline in local clusters or cloud environments.

By the end of this book, you'll have gained the knowledge and skills needed to build and deploy an efficient data processing pipeline for machine learning model training and inference in a distributed manner.

Who this book is for

This book is for data scientists, machine learning engineers, and machine learning practitioners in both academia and industry. A fundamental understanding of machine learning concepts and working knowledge of Python programming are assumed. Prior experience implementing machine learning/deep learning models with TensorFlow or PyTorch will be beneficial. You'll find this book useful if you are interested in using distributed systems to boost machine learning model training and serving speed.


What this book covers

Chapter 1, Splitting Input Data, shows how to distribute the machine learning training or serving workload on the input data dimension, which is called data parallelism.

Chapter 2, Parameter Server and All-Reduce, describes two widely adopted model synchronization schemes in the data parallel training process.

Chapter 3, Building a Data Parallel Training and Serving Pipeline, illustrates how to implement data parallel training and the serving workflow.

Chapter 4, Bottlenecks and Solutions, describes how to improve data parallelism performance with some advanced techniques, such as more efficient communication protocols and a reduced memory footprint.

Chapter 5, Splitting the Model, introduces the vanilla model parallel approach in general.

Chapter 6, Pipeline Input and Layer Split, shows how to improve system efficiency with pipeline parallelism.

Chapter 7, Implementing Model Parallel Training and Serving Workflows, discusses how to implement model parallel training and serving in detail.

Chapter 8, Achieving Higher Throughput and Lower Latency, covers advanced schemes to reduce computation and memory consumption in model parallelism.

Chapter 9, A Hybrid of Data and Model Parallelism, combines data and model parallelism as an advanced in-parallel model training/serving scheme.

Chapter 10, Federated Learning and Edge Devices, talks about federated learning and how edge devices are involved in this process.

Chapter 11, Elastic Model Training and Serving, describes a more efficient scheme that can change the number of accelerators used on the fly.

Chapter 12, Advanced Techniques for Further Speed-Ups, summarizes several useful tools, such as a performance debugging tool, job multiplexing, and heterogeneous model training.
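As a small taste of the data parallelism that Section 1 develops in depth, the following minimal sketch (a generic illustration, not code from the book's bundle) wraps a toy PyTorch model in the built-in DistributedDataParallel wrapper and runs one synchronized training step on every local GPU:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # Each process owns one GPU and joins the same process group.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # DDP replicates the model and all-reduces gradients during backward().
    model = DDP(torch.nn.Linear(10, 1).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # One toy step on random data; in practice each rank loads its own data shard.
    inputs = torch.randn(32, 10, device=f"cuda:{rank}")
    targets = torch.randn(32, 1, device=f"cuda:{rank}")
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

Chapters 1 to 4 unpack what this wrapper hides: how the input data is split, how gradients are synchronized, and how the parameter server and All-Reduce architectures implement that synchronization.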


To get the most out of this book

You will need to install PyTorch/TensorFlow successfully on your system. For distributed workloads, we suggest you have at least four GPUs on hand.

We assume you have Linux/Ubuntu as your operating system. We assume you use NVIDIA GPUs and have installed the proper NVIDIA driver as well. We also assume you have basic knowledge about machine learning in general and are familiar with popular deep learning models.
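As a quick sanity check of this setup, the short snippet below (a minimal sketch, assuming a CUDA-enabled PyTorch build) confirms that PyTorch can see your NVIDIA GPUs before you run any of the distributed examples:

import torch

if torch.cuda.is_available():
    # List every GPU that PyTorch can use for the distributed examples.
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("No CUDA device detected; check the NVIDIA driver and your PyTorch install.")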

If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book's GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.

Download the example code files

You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Distributed-Machine-Learning-with-Python. If there's an update to the code, it will be updated in the GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Download the color images

We also provide a PDF file that has color images of the screenshots and diagrams used in this book. You can download it here: https://static.packt-cdn.com/downloads/9781801815697_ColorImages.pdf


Conventions used

There are a number of text conventions used throughout this book.

Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: "Replace YOUR_API_KEY_HERE with the subscription key of your Cognitive Services resource. Leave the quotation marks!"

A block of code is set as follows:

from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

# Connect to the API through the subscription key and endpoint
subscription_key = "<your-subscription-key>"
endpoint = "https://<your-cognitive-service>.cognitiveservices.azure.com/"

# Authenticate the client
credential = AzureKeyCredential(subscription_key)
cog_client = TextAnalyticsClient(endpoint=endpoint, credential=credential)

Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: "Select Review + Create."

Tips or Important Notes

Appear like this.

Get in touch

Feedback from our readers is always welcome.

General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.

Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.


