Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA
Deep Learning with NVIDIA Architecture Guide
Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula
Abstract
There has been an explosion of interest in Deep Learning, and the plethora of choices makes designing a solution complex and time consuming. Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA is a complete solution, designed to support all phases of Deep Learning, that incorporates the latest CPU, GPU, memory, network, storage, and software technologies with impressive performance for both the training and inference phases. The architecture of this Deep Learning solution is presented in this document.
August 2018
Dell EMC Reference Architecture
Revisions

Date           Description
August 2018    Initial release
The information in this publication is provided as is. Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
? August 2018 v1.0 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners.
Dell believes the information in this document is accurate as of its publication date. The information is subject to change without notice.
Table of Contents

Revisions
Table of Contents
Executive summary
Solution Overview
Solution Architecture
Head Node Configuration
Shared Storage via NFS over InfiniBand
Compute Node Configuration
GPU
Processor recommendation for Head Node and Compute Nodes
Memory recommendation for Head Node and Compute Nodes
Isilon Storage
Network
Software
Deep Learning Training and Inference Performance and Analysis
Deep Learning Training
FP16 vs FP32
V100 vs P100
V100-SXM2 vs V100-PCIe
Scaling Performance with Multi-GPU
Storage Performance
Deep Learning Inference
NVIDIA DIGITS Tool and the Deep Learning Solution
Containers for Deep Learning
Singularity Containers
Running NVIDIA GPU Cloud with the Ready Solutions for AI - Deep Learning
The Data Scientist Portal
Creating and Running a Notebook
Tensorboard Integration
Slurm Scheduler
Conclusions and Future Work
Executive summary

Deep Learning techniques have enabled great success in many fields such as computer vision, natural language processing (NLP), gaming and autonomous driving by enabling a model to learn from existing data and then make corresponding predictions. The success is due to a combination of improved algorithms, access to large datasets and increased computational power. To be effective at enterprise scale, the computational intensity of Deep Learning neural network training requires highly powerful and efficient parallel architectures. The choice and design of the system components, carefully selected and tuned for Deep Learning use cases, can make the difference in the business outcomes of applying Deep Learning techniques. In addition to several options for processors, accelerators and storage technologies, there are multiple Deep Learning software frameworks and libraries that must be considered. These software components are under active development, updated frequently and cumbersome to manage. It is complicated to simply build and run Deep Learning applications successfully, leaving little time to focus on the actual business problem.
To resolve this complexity challenge, Dell EMC has developed an architecture for Deep Learning that provides a complete, supported solution. This solution includes carefully selected technologies across all aspects of Deep Learning: processing capabilities, memory, storage and network technologies, as well as the software ecosystem. This document presents the architecture of this Deep Learning solution, including details on the design choice for each component. The performance aspects of this complete solution have also been characterized and are described here.
AUDIENCE
This document is intended for organizations interested in accelerating Deep Learning with advanced computing and data management solutions. Solution architects, system administrators and other interested readers within those organizations constitute the target audience.
Solution Overview

Dell EMC has developed an architecture for Deep Learning that provides a complete, supported solution. This solution includes carefully selected technologies across all aspects of Deep Learning: processing capabilities, memory, storage and network technologies, as well as the software ecosystem. This complete solution is provided as Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA. The solution includes fully integrated and optimized hardware, software, and services, including deployment, integration and support, making it easier for organizations to start and grow their Deep Learning practice.
The high level overview of Dell EMC Ready Solutions for AI - Deep Learning is shown in Figure 1.
Figure 1: Overview of Dell EMC Ready Solutions for AI - Deep Learning

Data Scientist Portal: This is a new portal for data scientists created for this solution. It enables data scientists, who should not need to be experts in cluster technologies, to use a simple web portal to take advantage of the underlying technology. The scientists can write, train and do inference for different Deep Learning models within Jupyter Notebook, which includes Python 2, Python 3, R and other kernels.
Bright Cluster Manager and Bright Machine Learning: Bright Cluster Manager is used for the monitoring, deployment, management, and maintenance of the cluster. Bright Machine Learning (ML) includes the deep learning frameworks, libraries, compilers and so on.
Deep Learning Frameworks and Libraries: This category includes TensorFlow, MXNet, Caffe2, CUDA, cuDNN, and NCCL. The latest versions of these frameworks and libraries are integrated into the solution.
Infrastructure: The infrastructure comprises a cluster with a master node, compute nodes, shared storage and networks. In this instance of the solution, the master node is a Dell EMC PowerEdge R740xd, each compute node is a PowerEdge C4140 with NVIDIA Tesla GPUs, the storage includes Network File System (NFS) and Isilon, and the networks include Ethernet and Mellanox InfiniBand.
Section 2 describes each of these solution components in more detail, covering the compute, network, storage and software configurations. Extensive performance analysis of this solution was conducted in the HPC and AI Innovation Lab, and those results are presented in Section 3. These include tests with training and inference workloads, conducted on different types of GPUs, using different floating point and integer precision arithmetic, and with different storage sub-systems for Deep Learning workloads. That is followed by Section 4, which describes containerization techniques for Deep Learning. Section 5 has details on the Data Scientist Portal developed by Dell EMC. Conclusions and future direction complete the document in Section 6.
Solution Architecture

The hardware comprises a cluster with a master node, compute nodes, shared storage and networks. The master node or head node roles can include deploying the cluster of compute nodes, managing the compute nodes, user logins and access, providing a compilation environment, and job submission to compute nodes. The compute nodes are the workhorses and execute the submitted jobs. Software from Bright Computing called Bright Cluster Manager is used to deploy and manage the whole cluster.
Figure 2 shows the high-level overview of the cluster, which includes one head node, n compute nodes, the local disks on the cluster head node exported over NFS, Isilon storage, and two networks. All compute nodes are interconnected through an InfiniBand switch. The head node is also connected to the InfiniBand switch as it uses IPoIB to export the NFS share to the compute nodes. All compute nodes and the head node are also connected to a 1 Gigabit Ethernet management switch which is used by Bright Cluster Manager to administer the cluster. An Isilon storage solution is connected to the FDR-40GigE gateway switch so that it can be accessed by the head node and all compute nodes.

Figure 2: The overview of the cluster
Head Node Configuration

The Dell EMC PowerEdge R740xd is recommended for the role of the head node. This is a two-socket, 2U rack server that can support the memory capacities, I/O needs and network options required of the head node. The head node performs the cluster administration, cluster management, NFS server, user login node and compilation node roles.
The suggested configuration of the PowerEdge R740xd is listed in Table 1. It includes 12 x 12TB NL-SAS local disks that are formatted as an XFS file system and exported via NFS to the compute nodes over IPoIB. RAID 50 is used instead of RAID 6/RAID 60 to take into consideration the faster rebuild time and capacity advantages provided by the former. Details of each configuration choice are described in the following sections. For more information on this server model please refer to the PowerEdge R740/740xd Technical Guide.
Table 1: PowerEdge R740xd configuration

Component                 Details
Server Model              PowerEdge R740xd
Processor                 2 x Intel Xeon Gold 6148 CPU @ 2.40GHz
Memory                    24 x 16GB DDR4 2666MT/s DIMMs - 384GB
Disks                     12 x 12TB NL-SAS, RAID 50 (recommended 10+ drives)
I/O & Ports               Network daughter card with 2 x 10GE + 2 x 1GE
Network Adapter           1 x InfiniBand EDR adapter
Out of Band Management    iDRAC9 Enterprise with Lifecycle Controller
Power Supplies            Titanium 1100W, Platinum
Storage Controllers       PowerEdge RAID Controller (PERC) H730P
Shared Storage via NFS over InfiniBand

The default shared storage system for the cluster is provided over NFS. It is built using 12 x 12TB NL-SAS disks that are local to the head node, configured in RAID 50 with two parity check disks. This provides a usable capacity of 120TB (109 TiB). RAID 50 was chosen because it has balanced performance and a shorter rebuild time compared to RAID 6 or RAID 60 (since RAID 50 has fewer parity disks than RAID 6 or RAID 60). This 120TB volume is formatted as an XFS file system and exported to the compute nodes via NFS over IPoIB.
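Written out, the usable capacity quoted above follows directly from the RAID 50 layout, with two of the twelve disks holding parity:

```latex
(12 - 2) \times 12\,\mathrm{TB} = 120\,\mathrm{TB}
\;\approx\; \frac{120 \times 10^{12}\,\mathrm{B}}{2^{40}\,\mathrm{B/TiB}} \;\approx\; 109\,\mathrm{TiB}
```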
In the default configuration, both home directories and shared application and library install locations are hosted on this NFS share. For solutions which require a larger capacity shared storage solution, the Isilon F800 is an alternative option and is described in Section 2.5. A comparison between various storage subsystems is provided in Section 3.1.5, including this NL-SAS NFS, the Isilon, and smaller test configurations using SSDs and NVMe devices.
Compute Node Configuration

Deep Learning methods would not have gained success without the computational power to drive the iterative training process. Therefore, a key component of Deep Learning solutions is highly capable nodes that can support compute intensive workloads. State-of-the-art neural network models in Deep Learning have more than 100 layers, which require the computation to be able to scale across many compute nodes in order to deliver timely results. The Dell EMC PowerEdge C4140, an accelerator-optimized, high density 1U rack server, is used as the compute node unit in this solution. The PowerEdge C4140 can support four NVIDIA Volta GPUs, both the V100-SXM2 as well as the V100-PCIe models. Figure 3 shows the CPU-GPU and GPU-GPU connection topology of a compute node.
The detailed configuration of each PowerEdge C4140 compute node is listed in Table 2.
Figure 3: The topology of a compute node

Table 2: PowerEdge C4140 configuration

Component                 Details
Server Model              PowerEdge C4140
Processor                 2 x Intel Xeon Gold 6148 CPU @ 2.40GHz
Memory                    24 x 16GB DDR4 2666MT/s DIMMs - 384GB
Local Disks               120GB SSD, 1.6TB NVMe
I/O & Ports               Network daughter card with 2 x 10GE + 2 x 1GE
Network Adapter           1 x InfiniBand EDR adapter
GPU                       4 x V100-SXM2 16GB
Out of Band Management    iDRAC9 Enterprise with Lifecycle Controller
Power Supplies            2000W hot-plug Redundant Power Supply Unit (PSU)
GPU
The NVIDIA Tesla V100 is the latest data center GPU available to accelerate Deep Learning. Powered by the NVIDIA Volta architecture, it enables engineers to tackle challenges that were once difficult. With 640 Tensor Cores, Tesla V100 is the first GPU to break the 100 teraflops (TFLOPS) barrier of Deep Learning performance.
Table 3: V100-SXM2 vs V100-PCIe

Description                               V100-PCIe    V100-SXM2
CUDA Cores                                5120         5120
GPU Max Clock Rate (MHz)                  1380         1530
Tensor Cores                              640          640
Memory Bandwidth (GB/s)                   900          900
NVLink Bandwidth (GB/s) (uni-direction)   N/A          300
Deep Learning (Tensor OPS)                112          120
TDP (Watts)                               250          300
The Tesla V100 product line includes two variations, V100-PCIe and V100-SXM2. The comparison of the two variants is shown in Table 3. With the V100-PCIe, all GPUs communicate with each other over PCIe buses. With the V100-SXM2 model, all GPUs are connected by NVIDIA NVLink. In use cases where multiple GPUs are required, the V100-SXM2 models provide the advantage of faster GPU-to-GPU communication over the NVLink interconnect when compared to PCIe. V100-SXM2 GPUs provide six NVLinks per GPU for bi-directional communication. The bandwidth of each NVLink is 25 GB/s in uni-direction, and all four GPUs within a node can communicate at the same time, therefore the theoretical peak bandwidth is 6 * 25 * 4 = 600 GB/s in bi-direction. However, the theoretical peak bandwidth using PCIe is only 16 * 2 = 32 GB/s as the GPUs can only communicate in order, which means the communication cannot be done in parallel. So in theory, data communication with NVLink could be up to 600 / 32 = 18x faster than with PCIe. The evaluation of this performance advantage on real models is discussed in Section 3.1.3. Because of this advantage, the PowerEdge C4140 compute node in the Deep Learning solution uses V100-SXM2 instead of V100-PCIe GPUs.
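The bandwidth comparison above can be written out explicitly; these are the same theoretical peak figures quoted in the text, not measured values:

```latex
\begin{aligned}
B_{\mathrm{NVLink}} &= 6~\text{links/GPU} \times 25~\mathrm{GB/s} \times 4~\text{GPUs} = 600~\mathrm{GB/s}\\
B_{\mathrm{PCIe}}   &= 16~\mathrm{GB/s} \times 2 = 32~\mathrm{GB/s}\\
B_{\mathrm{NVLink}} / B_{\mathrm{PCIe}} &= 600 / 32 \approx 18\times
\end{aligned}
```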
Processor recommendation for Head Node and Compute Nodes

The processor chosen for the head node and compute nodes is the Intel Xeon Gold 6148 CPU. This is the latest Intel Xeon Scalable processor, with 20 physical cores supporting 40 threads. Previous studies, as described in Section 3.1, have concluded that 16 threads are sufficient to feed the I/O pipeline for state-of-the-art convolutional neural networks, so the Gold 6148 CPU is a reasonable choice. Additionally, this CPU model is recommended for the compute nodes as well, making this a consistent choice across the cluster.
Memory recommendation for Head Node and Compute Nodes

The recommended memory for the head node is 24 x 16GB 2666MT/s DIMMs, for a total memory size of 384GB. This was chosen based on the following factors:
Capacity: An ideal configuration must support a system memory capacity that is larger than the total size of GPU memory. Each compute node has 4 GPUs and each GPU has 16GB of memory, so the system memory must be at least 16GB x 4 = 64GB. The head node memory also affects I/O performance. For the NFS service, larger memory will reduce disk read operations since the NFS service sends out data from memory. 16GB DIMMs demonstrate the best performance/dollar value.
DIMM configuration: Choices like 24 x 16GB or 12 x 32GB will provide the same capacity of 384GB system memory, but according to our studies, as shown in Figure 4, the combination of 24 x 16GB DIMMs provides 11% better performance than using 12 x 32GB. The results shown here were on the Intel Xeon Platinum 8180 processor, but the same trends apply across other models in the Intel Scalable Processor Family including the Gold 6148, although the actual percentage differences across configurations may vary. More details can be found in our Skylake memory study.
Serviceability: The head node and compute node memory configurations are designed to be similar to reduce parts complexity while satisfying performance and capacity needs. Fewer parts need to be stocked for replacement, and in urgent cases if a memory module in the head node needs to be
replaced immediately, a DIMM module from a compute node can temporarily be used to restore the head node until replacement modules arrive.

Figure 4: Relative memory bandwidth for different system capacities
Isilon Storage

Dell EMC Isilon is a proven scale-out network attached storage (NAS) solution that can handle the unstructured data prevalent in many different workflows. The Isilon storage architecture automatically aligns application needs with performance, capacity, and economics. As performance and capacity demands increase, both can be scaled simply and non-disruptively, allowing applications and users to continue working.
The Dell EMC Isilon OneFS operating system powers all Dell EMC Isilon scale-out NAS storage solutions and has the following features:
- A high degree of scalability, with grow-as-you-go flexibility
- High efficiency to reduce costs
- Multi-protocol support such as SMB, NFS, HTTP and HDFS to maximize operational flexibility
- Enterprise data protection and resiliency
- Robust security options
The recommended Isilon storage is the Isilon F800 all-flash scale-out NAS. Dell EMC Isilon F800 all-flash scale-out NAS storage is uniquely suited for modern Deep Learning applications, delivering the flexibility to deal with any data type, scalability for datasets ranging into the PBs, and concurrency to support massive Deep Learning workloads. Its scale-out architecture eliminates the I/O bottleneck between compute and storage, and it can scale out up to 68 PB with up to 540 GB/s of performance in a single cluster. This allows Isilon to accelerate AI innovation with faster model training and to provide more accurate insights with deeper datasets.
The Isilon storage can be used if the local NFS storage capacity is insufficient for the environment. If the Isilon is used in conjunction with the local NFS storage, user home directories and project results can be stored on
the Isilon, with applications installed on the local NFS. The performance comparison between Isilon and other storage solutions is shown in Section 3.1.6. The specifications of the Isilon F800 are listed in Table 4.
Table 4: Specification of Isilon F800 (attributes: Storage (external storage), Bandwidth, IOPS, Chassis Capacity (4RU), Cluster Capacity, Network)
Before doing deep learning model training, if a user wants to move very large data from outside the cluster described in Section 2 onto Isilon, the user can connect the server which stores the data to the FDR-40GigE gateway shown in Figure 2, so that the data can be moved onto Isilon without having to route it through the head node.
To monitor and analyze the performance and file system of the Isilon storage, the InsightIQ tool can be used. InsightIQ allows a user to monitor and analyze Isilon storage cluster activity using standard reports in the InsightIQ web-based application. The user can customize these reports to provide information about storage cluster hardware, software, and protocol operations. InsightIQ transforms data into visual information that highlights performance outliers and helps users diagnose bottlenecks and optimize workflows. In Section 3.1.5, InsightIQ was used to collect the average disk operation size, disk read IOPS, and file system throughput when running deep learning models. For more details about InsightIQ, refer to the Isilon InsightIQ User Guide.
Network
The solution comprises three network fabrics. The head node and all compute nodes are connected with a 1 Gigabit Ethernet fabric. The Ethernet switch recommended for this is the Dell Networking S3048-ON which has 48 ports. This connection is primarily used by Bright Cluster Manager for deployment, maintenance and monitoring of the solution.
The second fabric connects the head node and all compute nodes through 100 Gb/s EDR InfiniBand. The EDR InfiniBand switch is the Mellanox SB7800 which has 36 ports. This fabric is used for IPC by the applications as well as to serve NFS from the head node (IPoIB) and Isilon. GPU-to-GPU communication across servers can use a technique called GPUDirect Remote Direct Memory Access (RDMA) which is enabled by InfiniBand. This enables GPUs to communicate directly without the involvement of CPUs. Without GPUDirect, when GPUs across servers need to communicate, the GPU in one node has to copy data from its GPU memory to system memory, then that data is sent to the system memory of another node over the network, and finally the data is copied from the system memory of the second node to the receiving GPU memory. With GPUDirect however, the GPU on one node can send the data directly from its GPU memory to the GPU memory in another node, without going through the system memory of either node. Therefore GPUDirect decreases the GPU-GPU communication latency significantly.
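As a hedged illustration of how an application exercises this path, the sketch below performs a multi-node all-reduce with PyTorch's NCCL backend; NCCL uses InfiniBand and GPUDirect RDMA automatically when the drivers and topology support it. This is a minimal sketch, not part of the solution's documented software stack, and the commented tuning variables and HCA name are assumptions that depend on the actual nodes.

```python
# Minimal multi-node all-reduce sketch using the NCCL backend (assumes the
# script is launched with torchrun or an MPI wrapper that sets RANK,
# WORLD_SIZE, MASTER_ADDR and LOCAL_RANK).
import os
import torch
import torch.distributed as dist

# NCCL picks InfiniBand/GPUDirect RDMA automatically when available; the
# variables below are common knobs, shown commented out because their values
# are system-specific assumptions.
# os.environ["NCCL_IB_DISABLE"] = "0"      # keep the InfiniBand transport enabled
# os.environ["NCCL_IB_HCA"] = "mlx5_0"     # hypothetical HCA name

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # A gradient-sized tensor; inter-node all-reduce traffic is exactly what
    # GPUDirect RDMA accelerates by skipping the host-memory staging copies.
    grad = torch.ones(25_000_000, device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()
    if dist.get_rank() == 0:
        print("all-reduce complete, first element =", grad[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```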
The third switch in the solution, called a gateway switch in Figure 2, connects the Isilon F800 to the head node and compute nodes. The Isilon external interfaces are 40 Gigabit Ethernet. Hence, a switch which can serve as the gateway between the 40GbE Ethernet and InfiniBand networks is needed for connectivity to the head and compute nodes. The Mellanox SX6036 is used for this purpose. The gateway is connected to the InfiniBand EDR switch and the Isilon as shown in Figure 2.
Software
The software portion of the solution is provided by Dell EMC and Bright Computing. The software includes several pieces.
The first piece is Bright Cluster Manager, which is used to easily deploy and manage the clustered infrastructure and provides all cluster software including the operating system, GPU drivers and libraries, InfiniBand drivers and libraries, MPI middleware, the Slurm scheduler, etc.
The second piece is Bright Machine Learning (ML), which includes any deep learning library dependencies for the base operating system, deep learning frameworks including Caffe/Caffe2, PyTorch, Torch7, Theano, TensorFlow, Horovod, Keras, DIGITS, CNTK and MXNet, and deep learning libraries including cuDNN, NCCL, and the CUDA toolkit.
The third piece is the Data Scientist Portal, which was developed by Dell EMC. The portal was created to abstract the complexity of the deep learning ecosystem by providing a single pane of glass which gives users an interface to get started with their models. The portal includes a spawner for JupyterHub and integrates with:
- Resource managers and schedulers (Slurm)
- LDAP for user management
- Deep Learning framework environments (TensorFlow, Keras, MXNet, PyTorch, etc.), module environment, and Python 2, Python 3 and R kernel support
- Tensorboard
- Terminal CLI environments
It also provides templates to get started with various DL environments and adds support for Singularity containers. For more details about how to use the Data Scientist Portal, refer to Section 5.
Deep Learning Training and Inference Performance and Analysis

In this section, the performance of Deep Learning training as well as inference is measured using three open source Deep Learning frameworks: TensorFlow, MXNet and Caffe2. The experiments were conducted on an instance of the solution architecture described in Section 2. The test cluster used a PowerEdge R740xd head node, PowerEdge C4140 compute nodes, different storage sub-systems including Isilon, and an InfiniBand EDR network. A detailed testbed description is provided in the following section.
Deep Learning Training

The well-known ILSVRC 2012 dataset was used for benchmarking performance. This dataset contains 1,281,167 training images and 50,000 validation images totaling 140GB. All images are grouped into 1000 categories or classes. The overall size of ILSVRC 2012 leads to non-trivial training times and thus makes it more interesting for analysis. Additionally, this dataset is commonly used by Deep Learning researchers for benchmarking and comparison studies. Resnet50 is a computationally intensive network and was selected to stress the solution to its maximum capability. For the batch size parameter in Deep Learning, the maximum batch size that does not cause memory errors was selected; this translated to a batch size of 64 per GPU for MXNet and Caffe2, and 128 per GPU for TensorFlow. Horovod, a distributed TensorFlow framework, was used to scale the training across multiple compute nodes. Throughout this document, performance was measured using a metric of images/sec, which is a measure of throughput indicating how fast the system can complete training on the dataset.
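For illustration only, a minimal Horovod-over-Keras sketch of this kind of images/sec measurement is shown below. It uses synthetic ImageNet-shaped data so it is self-contained; the actual benchmarks read ILSVRC 2012 from the NFS or Isilon share, and the step count and synthetic input here are assumptions, not the harness used for the results in this section.

```python
# Hedged sketch: distributed ResNet-50 throughput (images/sec) with Horovod.
# Launch with e.g.: horovodrun -np 8 python resnet50_throughput.py
import time
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()
# Pin each process to one local GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

batch_size = 128  # per-GPU batch size used for TensorFlow in this study

# Synthetic 224x224 RGB batches stand in for the ILSVRC 2012 input pipeline.
images = tf.random.uniform([batch_size, 224, 224, 3])
labels = tf.random.uniform([batch_size], maxval=1000, dtype=tf.int32)
dataset = tf.data.Dataset.from_tensors((images, labels)).repeat()

model = tf.keras.applications.ResNet50(weights=None, classes=1000)
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.1 * hvd.size()))
model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)

steps = 100  # arbitrary number of timed steps for this sketch
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

start = time.time()
model.fit(dataset, steps_per_epoch=steps, epochs=1,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)
elapsed = time.time() - start

if hvd.rank() == 0:
    print("throughput: %.1f images/sec" % (steps * batch_size * hvd.size() / elapsed))
```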
The images/sec result was averaged across all iterations to take the deviations into account. The total number of iterations is equal to num_epochs * num_images / (batch_size * num_gpus), where num_epochs is the number of passes over all images of the dataset, num_images is the total number of images in the dataset, batch_size is the number of images that are processed in parallel by one GPU, and num_gpus is the total number of GPUs involved in the training.
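In formula form, with a worked instance using the dataset size and the TensorFlow per-GPU batch size stated above (one epoch on a single 4-GPU compute node):

```latex
\text{iterations} \;=\; \frac{\text{num\_epochs} \times \text{num\_images}}{\text{batch\_size} \times \text{num\_gpus}}
\;=\; \frac{1 \times 1{,}281{,}167}{128 \times 4} \;\approx\; 2502
```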
Before running any benchmark, the file system caches on the head node and compute node(s) were cleared.
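The document does not spell out the exact command used, but a minimal sketch of the usual Linux cache-dropping step (an assumption, and one that requires root privileges) looks like this:

```python
# Hedged sketch: drop the Linux page cache, dentries and inodes before a
# benchmark run. Requires root; run on the head node and each compute node.
import os

os.sync()  # flush dirty pages to disk first
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")  # 3 = free page cache + slab objects (dentries, inodes)
```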
The training tests were run for a single epoch, or one pass through the entire dataset, since the throughput is consistent across epochs for the MXNet and TensorFlow tests. Consistent throughput means that the performance variation was not significant across iterations; the tests measured less than 2% variation in performance.
However, two epochs were used for Caffe2 as it needs two epochs to stabilize the performance. This is because the performance (throughput or images/sec) is not stable (the performance variation between iterations is large) when the dataset is not fully loaded in memory.
For the MXNet framework, 16 CPU threads were used for dataset decoding, and the reason was explained in the Deep Learning on V100 study. Caffe2 does not provide a parameter for users to set the number of CPU threads.
For TensorFlow, the number of CPU threads used for dataset decoding is calculated by subtracting four threads per GPU from the total physical core count of the system. The four threads per GPU are used for GPU compute, memory copies, event monitoring, and so on.
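As a hedged sketch of how these decode-thread counts are typically expressed in the two frameworks, the snippet below sets 16 preprocessing threads for MXNet's ImageRecordIter and derives the TensorFlow value from the rule described above; the file paths and the hyper-threading assumption are illustrative only, not the exact benchmark configuration.

```python
# Illustrative data-decoding thread settings (paths are hypothetical).
import multiprocessing
import tensorflow as tf
import mxnet as mx

num_gpus = 4
physical_cores = multiprocessing.cpu_count() // 2   # assumes 2 hardware threads per core
tf_decode_threads = physical_cores - 4 * num_gpus   # e.g. 40 cores - 16 = 24 threads

# TensorFlow: parallel JPEG decode inside the tf.data input pipeline.
files = tf.data.Dataset.list_files("/data/imagenet/train/*/*.JPEG")
images = files.map(
    lambda path: tf.io.decode_jpeg(tf.io.read_file(path), channels=3),
    num_parallel_calls=tf_decode_threads)

# MXNet: ImageRecordIter exposes the decode thread count directly.
train_iter = mx.io.ImageRecordIter(
    path_imgrec="/data/imagenet/train.rec",
    data_shape=(3, 224, 224),
    batch_size=64,              # per-GPU batch size used for MXNet in this study
    preprocess_threads=16)      # 16 CPU threads for dataset decoding
```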