
Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA

Deep Learning with NVIDIA Architecture Guide

Authors: Rengan Xu, Frank Han, Nishanth Dandapanthula

Abstract

There has been an explosion of interest in Deep Learning, and the plethora of choices makes designing a solution complex and time consuming. Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA is a complete solution designed to support all phases of Deep Learning. It incorporates the latest CPU, GPU, memory, network, storage, and software technologies, with impressive performance for both the training and inference phases. The architecture of this Deep Learning solution is presented in this document.

August 2018

Dell EMC Reference Architecture

Revisions

Date           Description
August 2018    Initial release

[...] publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any software described in this publication requires an applicable software license.

© August 2018 v1.0 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Other trademarks may be trademarks of their respective owners.

Dell believes the information in this document is accurate as of its publication date. The information is subject to change without notice.


Table of Contents

Revisions
Table of Contents
Executive summary
Solution Overview
Solution Architecture
Head Node Configuration
Shared Storage via NFS over InfiniBand
Compute Node Configuration
GPU
Processor recommendation for Head Node and Compute Nodes
Memory recommendation for Head Node and Compute Nodes
Isilon Storage
Network
Software
Deep Learning Training and Inference Performance and Analysis
Deep Learning Training
FP16 vs FP32
V100 vs P100
V100-SXM2 vs V100-PCIe
Scaling Performance with Multi-GPU
Storage Performance
Deep Learning Inference
NVIDIA DIGITS Tool and the Deep Learning Solution
Containers for Deep Learning
Singularity Containers
Running NVIDIA GPU Cloud with the Ready Solutions for AI - Deep Learning
The Data Scientist Portal
Creating and Running a Notebook
Tensorboard Integration
Slurm Scheduler
Conclusions and Future Work

Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA an Architecture Guide | v1.0

Executive summary

Deep Learning techniques have enabled great success in many fields such as computer vision, natural language processing (NLP), gaming and autonomous driving by enabling a model to learn from existing data and then to make corresponding predictions. The success is due to a combination of improved algorithms, access to large datasets and increased computational power. To be effective at enterprise scale, the computational intensity of Deep Learning neural network training requires highly powerful and efficient parallel architectures. The choice and design of the system components, carefully selected and tuned for Deep Learning use-cases, can make the difference in the business outcomes of applying Deep Learning techniques. In addition to several options for processors, accelerators and storage technologies, there are multiple Deep Learning software frameworks and libraries that must be considered. These software components are under active development, updated frequently and cumbersome to manage. It is complicated to simply build and run Deep Learning applications successfully, leaving little time to focus on the actual business problem.

To resolve this complexity challenge, Dell EMC has developed an architecture for Deep Learning that provides a complete, supported solution. This solution includes carefully selected technologies across all aspects of Deep Learning: processing capabilities, memory, storage and network technologies, as well as the software ecosystem. This document presents the architecture of this Deep Learning solution, including details on the design choice for each component. The performance aspects of this complete solution have also been characterized and are described here.

AUDIENCE

This document is intended for organizations interested in accelerating Deep Learning with advanced computing and data management solutions. Solution architects, system administrators and other interested readers within those organizations constitute the target audience.


Solution Overview

Dell EMC has developed an architecture for Deep Learning that provides a complete, supported solution. This solution includes carefully selected technologies across all aspects of Deep Learning: processing capabilities, memory, storage and network technologies, as well as the software ecosystem. This complete solution is provided as Dell EMC Ready Solutions for AI - Deep Learning with NVIDIA. The solution includes fully integrated and optimized hardware, software, and services including deployment, integration and support, making it easier for organizations to start and grow their Deep Learning practice.

The high-level overview of Dell EMC Ready Solutions for AI - Deep Learning is shown in Figure 1.

Figure 1: Overview of Dell EMC Ready Solutions for AI - Deep Learning

Data Scientist Portal: This is a new portal for data scientists created for this solution. It enables data scientists, who should not need to be experts in cluster technologies, to use a simple web portal to take advantage of the underlying technology. The scientists can write, train and run inference for different Deep Learning models within Jupyter Notebook, which includes Python 2, Python 3, R and other kernels.

Bright Cluster Manager and Bright Machine Learning: Bright Cluster Manager is used for the monitoring, deployment, management, and maintenance of the cluster. Bright Machine Learning (ML) includes the deep learning frameworks, libraries, compilers, and so on.

Deep Learning Frameworks and Libraries: This category includes TensorFlow, MXNet, Caffe2, CUDA, cuDNN, and NCCL. The latest versions of these frameworks and libraries are integrated into the solution.

Infrastructure: The infrastructure comprises a cluster with a master node, compute nodes, shared storage and networks. In this instance of the solution, the master node is a Dell EMC PowerEdge R740xd, each compute node is a PowerEdge C4140 with NVIDIA Tesla GPUs, the storage includes Network File System (NFS) and Isilon, and the networks include Ethernet and Mellanox InfiniBand.

Section 2 describes each of these solution components in more detail, covering the compute, network, storage and software configurations. Extensive performance analysis on this solution was conducted in the HPC and AI Innovation Lab and those results are presented in Section 3. These include tests with training and inference workloads, conducted on different types of GPUs, using different floating point and integer precision arithmetic, and with different storage sub-systems for Deep Learning workloads. That is followed by Section 4, which describes containerization techniques for Deep Learning. Section 5 has details on the Data Scientist Portal developed by Dell EMC. Conclusions and future directions complete the document in Section 6.

Solution Architecture

The hardware comprises a cluster with a master node, compute nodes, shared storage and networks. The master node or head node roles can include deploying the cluster of compute nodes, managing the compute nodes, user logins and access, providing a compilation environment, and job submissions to compute nodes. The compute nodes are the workhorses and execute the submitted jobs. Software from Bright Computing called Bright Cluster Manager is used to deploy and manage the whole cluster.

Figure 2 shows the high-level overview of the cluster, which includes one head node, n compute nodes, the local disks on the cluster head node exported over NFS, Isilon storage, and two networks. All compute nodes are interconnected through an InfiniBand switch. The head node is also connected to the InfiniBand switch, as it uses IPoIB to export the NFS share to the compute nodes. All compute nodes and the head node are also connected to a 1 Gigabit Ethernet management switch, which is used by Bright Cluster Manager to administer the cluster. An Isilon storage solution is connected to the FDR-40GigE Gateway switch so that it can be accessed by the head node and all compute nodes.

Figure 2: The overview of the cluster

Head Node Configuration

The Dell EMC PowerEdge R740xd is recommended for the role of the head node. This [...] socket, 2U rack server that can support the memory capacities, I/O needs and network options required of the head node. The head node will perform the cluster administration, cluster management, NFS server, user login node and compilation node roles.

The suggested configuration of the PowerEdge R740xd is listed in Table 1. It includes 12 x 12TB NL SAS local disks that are formatted as an XFS file system and exported via NFS to the compute nodes over IPoIB. RAID 50 is used instead of RAID 6/RAID 60 because of the faster rebuild time and capacity advantages it provides. Details of each configuration choice are described in the following sections. For more information on this server model, please refer to the PowerEdge R740/740xd Technical Guide.

Table 1: PowerEdge R740xd configurations

Component                 Details
Server Model              PowerEdge R740xd
Processor                 2 x Intel Xeon Gold 6148 CPU @ 2.40GHz
Memory                    24 x 16GB DDR4 2666MT/s DIMMs - 384GB
Disks                     12 x 12TB NL SAS RAID 50 (Recommended 10+ drives)
I/O & Ports               Network daughter card with 2 x 10GE + 2 x 1GE
Network Adapter           1 x InfiniBand EDR adapter
Out of Band Management    iDRAC9 Enterprise with Lifecycle Controller
Power Supplies            Titanium 1100W, Platinum
Storage Controllers       PowerEdge RAID Controller (PERC) H730p

Shared Storage via NFS over InfiniBand

The default shared storage system for the cluster is provided over NFS. It is built using 12 x 12TB NL SAS disks that are local to the head node, configured in RAID 50 with two parity check disks. This provides usable capacity of 120TB (109TiB). RAID 50 was chosen because it has balanced performance and shorter rebuild time compared to RAID 6 or RAID 60 (since RAID 50 has fewer parity disks than RAID 6 or RAID 60). This 120TB volume is formatted as an XFS file system and exported to the compute nodes via NFS over IPoIB.
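The capacity figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes the simple model that every non-parity disk contributes its full size and that "TB" means 10^12 bytes while "TiB" means 2^40 bytes; the function names are illustrative only and are not part of any Dell EMC tool.

```python
def raid50_usable_tb(num_disks: int, disk_tb: float, parity_disks: int) -> float:
    """Usable capacity in TB, assuming each non-parity disk contributes its full size."""
    return (num_disks - parity_disks) * disk_tb


def tb_to_tib(tb: float) -> float:
    """Convert decimal terabytes (10**12 bytes) to binary tebibytes (2**40 bytes)."""
    return tb * 10**12 / 2**40


# Values taken from the text: 12 disks of 12TB each, two parity check disks.
usable_tb = raid50_usable_tb(num_disks=12, disk_tb=12, parity_disks=2)
usable_tib = tb_to_tib(usable_tb)

print(f"Usable capacity: {usable_tb:.0f} TB ({usable_tib:.0f} TiB)")
```

Under these assumptions the result matches the 120TB (109TiB) stated for the default NFS volume.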

In the default configuration, both home directories and shared application and library install locations are hosted on this NFS share. In addition, for solutions which require a larger capacity shared storage solution, the Isilon F800 is an alternative option and is described in Section 2.5. A comparison between various storage subsystems is provided in Section 3.1.5, including this NL SAS NFS, the Isilon, and smaller test configurations using SSDs and NVMe devices.

Compute Node Configuration

Deep Learning methods would not have gained success without the computational power to drive the iterative training process. Therefore, a key component of Deep Learning solutions is highly capable nodes that can support compute intensive workloads. The state-of-the-art neural network models in Deep Learning have more than 100 layers, which requires the computation to scale across many compute nodes in order to produce timely results. The Dell EMC PowerEdge C4140, an accelerator-optimized, high density 1U rack server, is used as the compute node unit in this solution. The PowerEdge C4140 can support four NVIDIA Volta GPUs, in either the V100-SXM2 or the V100-PCIe model. Figure 3 shows the CPU-GPU and GPU-GPU connection topology of a compute node.

The detailed configuration of each PowerEdge C4140 compute node is listed in Table 2.

Figure 3: The topology of a compute node

Table 2: PowerEdge C4140 Configurations

Component                 Details
Server Model              PowerEdge C4140
Processor                 2 x Intel Xeon Gold 6148 CPU @ 2.40GHz
Memory                    24 x 16GB DDR4 2666MT/s DIMMs - 384GB
Local Disks               120GB SSD, 1.6TB NVMe
I/O & Ports               Network daughter card with 2 x 10GE + 2 x 1GE
Network Adapter           1 x InfiniBand EDR adapter
GPU                       4 x V100-SXM2 16GB
Out of Band Management    iDRAC9 Enterprise with Lifecycle Controller
Power Supplies            2000W hot-plug Redundant Power Supply Unit (PSU)

GPU

The NVIDIA Tesla V100 is the latest data center GPU available to accelerate Deep Learning. Powered by [...] engineers to tackle challenges that were once difficult. With 640 Tensor Cores, Tesla V100 is the first GPU to break the 100 teraflops (TFLOPS) barrier of Deep Learning performance.

Table 3: V100-SXM2 vs V100-PCIe

Description                                  V100-PCIe    V100-SXM2
CUDA Cores                                   5120         5120
GPU Max Clock Rate (MHz)                     1380         1530
Tensor Cores                                 640          640
Memory Bandwidth (GB/s)                      900          900
NVLink Bandwidth (GB/s) (uni-direction)      N/A          300
Deep Learning (Tensor TFLOPS)                112          120
TDP (Watts)                                  250          300

The Tesla V100 product line includes two variations, V100-PCIe and V100-SXM2. The comparison of the two variants is shown in Table 3. In the V100-PCIe, all GPUs communicate with each other over PCIe buses. With the V100-SXM2 model, all GPUs are connected by NVIDIA NVLink. In use-cases where multiple GPUs are required, the V100-SXM2 models provide the advantage of faster GPU-to-GPU communication over the NVLink interconnect when compared to PCIe. V100-SXM2 GPUs provide six NVLinks per GPU for bi-directional communication. The bandwidth of each NVLink is 25GB/s in uni-direction and all four GPUs within a node can communicate at the same time, therefore the theoretical peak bandwidth is 6*25*4 = 600GB/s in bi-direction. However, the theoretical peak bandwidth using PCIe is only 16*2 = 32GB/s, as the GPUs can only communicate in order, which means the communication cannot be done in parallel. So in theory the data communication with NVLink could be up to 600/32 = 18.75, or roughly 18x, faster than PCIe. The evaluation of this performance advantage in real models will be discussed in Section 3.1.3. Because of this advantage, the PowerEdge C4140 compute node in the Deep Learning solution uses V100-SXM2 instead of V100-PCIe GPUs.
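The theoretical bandwidth comparison in the paragraph above is plain arithmetic, and a short sketch makes each factor explicit. The constant names are illustrative; the values are the ones quoted in the text, and the interpretation of the PCIe figure as 16GB/s per direction times two directions is an assumption drawn from the "16*2" expression.

```python
# Figures quoted in the text for a four-GPU V100-SXM2 node.
NVLINKS_PER_GPU = 6            # six NVLinks per GPU
NVLINK_GB_PER_S = 25           # per-link bandwidth, uni-direction
GPUS_PER_NODE = 4              # all four GPUs can communicate at the same time

# Assumed reading of "16*2": 16GB/s per direction, two directions.
PCIE_GB_PER_S_PER_DIRECTION = 16
PCIE_DIRECTIONS = 2

nvlink_peak = NVLINKS_PER_GPU * NVLINK_GB_PER_S * GPUS_PER_NODE
pcie_peak = PCIE_GB_PER_S_PER_DIRECTION * PCIE_DIRECTIONS
speedup = nvlink_peak / pcie_peak

print(f"NVLink theoretical peak: {nvlink_peak} GB/s")
print(f"PCIe theoretical peak:   {pcie_peak} GB/s")
print(f"Theoretical ratio:       {speedup:.2f}x")
```

These are theoretical peaks only; the measured difference on real models is the subject of the performance section.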

Processor recommendation for Head Node and Compute Nodes

The processor chosen for the head node and compute nodes is the Intel® Xeon® Gold 6148 CPU. This is the latest Intel® Xeon® Scalable processor with 20 physical cores which support 40 threads. Previous studies, as described in Section 3.1, have concluded that 16 threads are sufficient to feed the I/O pipeline for the state-of-the-art convolutional neural network, so the Gold 6148 CPU is a reasonable choice. Additionally, this CPU model is recommended for the compute nodes as well, making this a consistent choice across the cluster.

Memory recommendation for Head Node and Compute Nodes

The recommended memory for the head node is 24 x 16GB 2666MT/s DIMMs, for a total memory size of 384GB. This is chosen based on the following facts:

Capacity: An ideal configuration must support system memory capacity that is larger than the total size of GPU memory. Each compute node has 4 GPUs and each GPU has 16GB memory, so the system memory must be at least 16GB x 4 = 64GB. The head node memory also affects I/O performance. For NFS service, larger memory will reduce disk read operations since the NFS service needs to send out data from memory. 16GB DIMMs demonstrate the best performance/dollar value.

DIMM configuration: Choices like 24 x 16GB or 12 x 32GB will provide the same capacity of 384GB system memory, but according to our studies as shown in Figure 4, the combination of 24 x 16GB DIMMs provides 11% better performance than using 12 x 32GB. The results shown here were obtained on the Intel Xeon Platinum 8180 processor, but the same trends will apply across other models in the Intel Scalable Processor Family including the Gold 6148, although the actual percentage differences across configurations may vary. More details can be found in our Skylake memory study.

Serviceability: The head node and compute node memory configurations are designed to be similar to reduce parts complexity while satisfying performance and capacity needs. Fewer parts need to be stocked for replacement, and in urgent cases, if a memory module in the head node needs to be replaced immediately, a DIMM module from a compute node can be temporarily used to restore the head node until replacement modules arrive.
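The capacity reasoning above can be sketched as a small check: system memory should be at least the total GPU memory per node, and the two DIMM layouts discussed give the same total. The helper names are hypothetical and only illustrate the arithmetic; the 11% performance difference between the layouts is a measured result from the source and is not modeled here.

```python
def min_system_memory_gb(gpus_per_node: int, gpu_memory_gb: int) -> int:
    """Lower bound on system memory: the total GPU memory in one compute node."""
    return gpus_per_node * gpu_memory_gb


def total_dimm_capacity_gb(dimm_count: int, dimm_size_gb: int) -> int:
    """Total system memory for a given DIMM population."""
    return dimm_count * dimm_size_gb


# Values from the text: 4 GPUs with 16GB each per compute node.
required_gb = min_system_memory_gb(gpus_per_node=4, gpu_memory_gb=16)

# The two DIMM layouts compared in the text.
layout_24x16 = total_dimm_capacity_gb(24, 16)
layout_12x32 = total_dimm_capacity_gb(12, 32)

print(f"Minimum system memory: {required_gb} GB")
print(f"24 x 16GB = {layout_24x16} GB, 12 x 32GB = {layout_12x32} GB")
print(f"Recommended layout meets the minimum: {layout_24x16 >= required_gb}")
```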

Figure 4: Relative memory bandwidth for different system capacities

Isilon Storage

Dell EMC Isilon is a proven scale-out network attached storage (NAS) solution that can handle the unstructured data prevalent in many different workflows. The Isilon storage architecture automatically aligns application needs with performance, capacity, and economics. As performance and capacity demands increase, both can be scaled simply and non-disruptively, allowing applications and users to continue working.

The Dell EMC Isilon OneFS operating system powers all Dell EMC Isilon scale-out NAS storage solutions and has the following features.

A high degree of scalability, with grow-as-you-go flexibility
High efficiency to reduce costs
Multi-protocol support such as SMB, NFS, HTTP and HDFS to maximize operational flexibility
Enterprise data protection and resiliency
Robust security options

The recommended Isilon storage is Isilon F800 all-flash scale-out NAS storage. Dell EMC Isilon F800 all-flash scale-out NAS storage is uniquely suited for modern Deep Learning applications, delivering the flexibility to deal with any data type, scalability for datasets ranging in the PBs, and concurrency to support the massive [...] -out architecture eliminates the I/O bottleneck between [...] can scale-out up to 68PB with up to 540GB/s of performance in a single cluster. This allows Isilon to accelerate AI innovation with faster model training, provide more accurate insights with deeper datasets, and deliver a [...]

The Isilon storage can be used if the local NFS storage capacity is insufficient for the environment. If the Isilon is used in conjunction with the local NFS storage, user home directories and project results can be stored on the Isilon with applications installed on the local NFS. The performance comparison between Isilon and other storage solutions is shown in Section 3.1.6. The specifications of the Isilon F800 are listed in Table 4.

Table 4: Specification of Isilon F800

Storage
External storage
Bandwidth
IOPS
Chassis Capacity (4RU)
Cluster Capacity

Network

Before doing deep learning model training, if a user wants to move very large data from outside the cluster described in Section 2 to Isilon, the user can connect the server which stores the data to the FDR-40GigE gateway in Figure 2, so that the data can be moved onto Isilon without having to route it through the head node.

To monitor and analyze the performance and file system of Isilon storage, the tool InsightIQ can be used. InsightIQ allows a user to monitor and analyze Isilon storage cluster activity using standard reports in the InsightIQ web-based application. The user can customize these reports to provide information about storage cluster hardware, software, and protocol operations. InsightIQ transforms data into visual information that highlights performance outliers, and helps users diagnose bottlenecks and optimize workflows. In Section 3.1.5, InsightIQ was used to collect the average disk operation size, disk read IOPS, and file system throughput when running deep learning models. For more details about InsightIQ, refer to the Isilon InsightIQ User Guide.

Network

The solution comprises three network fabrics. The head node and all compute nodes are connected with a 1 Gigabit Ethernet fabric. The Ethernet switch recommended for this is the Dell Networking S3048-ON, which has 48 ports. This connection is primarily used by Bright Cluster Manager for deployment, maintenance and monitoring of the solution.

The second fabric connects the head node and all compute nodes through 100Gb/s EDR InfiniBand. The EDR InfiniBand switch is the Mellanox SB7800, which has 36 ports. This fabric is used for IPC by the applications as well as to serve NFS from the head node (IPoIB) and Isilon. GPU-to-GPU communication across servers can use a technique called GPUDirect Remote Direct Memory Access (RDMA), which is enabled by InfiniBand. This enables GPUs to communicate directly without the involvement of CPUs. Without GPUDirect, when GPUs across servers need to communicate, the GPU in one node has to copy data from its GPU memory to system memory, then that data is sent to the system memory of another node over the network, and finally the data is copied from the system memory of the second node to the receiving GPU memory. With GPUDirect, however, the GPU on one node can send the data directly from its GPU memory to the GPU memory in another node, without going through the system memory of either node. Therefore GPUDirect decreases the GPU-GPU communication latency significantly.

The third switch in the solution is called a gateway switch in Figure 2 and connects the Isilon F800 to the head [...]nal interfaces are 40 Gigabit Ethernet. Hence, a switch which can serve as the gateway between the 40GbE Ethernet and InfiniBand networks is needed for connectivity to the head and compute nodes. The Mellanox SX6036 is used for this purpose. The gateway is connected to the InfiniBand EDR switch and the Isilon as shown in Figure 2.

Software

The software portion of the solution is provided by Dell EMC and Bright Computing. The software includes several pieces.

The first piece is Bright Cluster Manager, which is used to easily deploy and manage the clustered infrastructure and provides all cluster software including the operating system, GPU drivers and libraries, InfiniBand drivers and libraries, MPI middleware, the Slurm scheduler, etc.

The second piece is Bright Machine Learning (ML), which includes any deep learning library dependencies to the base operating system, deep learning frameworks including Caffe/Caffe2, Pytorch, Torch7, Theano, Tensorflow, Horovod, Keras, DIGITS, CNTK and MXNet, and deep learning libraries including cuDNN, NCCL, and the CUDA toolkit.

The third piece is the Data Scientist Portal, which was developed by Dell EMC. The portal was created to abstract the complexity of the deep learning ecosystems by providing a single pane of glass which provides users with an interface to get started with their models. The portal includes a spawner for Jupyterhub and integrates with

Resource managers and schedulers (Slurm)
LDAP for user management
Deep Learning framework environments (TensorFlow, Keras, MXNet, Pytorch etc.), module environment, Python 2, Python 3 and R kernel support
Tensorboard
Terminal CLI environments.

It also provides templates to get started with for various DL environments and adds support for Singularity containers. For more details about how to use the Data Scientist Portal, refer to Section 5.

Deep Learning Training and Inference Performance and Analysis

In this section, the performance of Deep Learning training as well as inference is measured using three open source Deep Learning frameworks: TensorFlow, MXNet and Caffe2. The experiments were conducted on an instance of the solution architecture described in Section 2. The experiment test cluster used a PowerEdge R740xd head node, PowerEdge C4140 compute nodes, different storage sub-systems including Isilon, and an InfiniBand EDR network. A detailed testbed description is provided in the following section.

Deep Learning Training

The well-known ILSVRC2012 dataset was used for benchmarking performance. This dataset contains 1,281,167 training images and 50,000 validation images in 140GB. All images are grouped into 1000 categories or classes. The overall size of ILSVRC2012 leads to non-trivial training times and thus makes it more interesting for analysis. Additionally, this dataset is commonly used by Deep Learning researchers for benchmarking and comparison studies. Resnet50 is a computationally intensive network and was selected to stress the solution to its maximum capability. For the batch size parameter in Deep Learning, the maximum batch size that does not cause memory errors was selected; this translated to a batch size of 64 per GPU for MXNet and Caffe2, and 128 per GPU for TensorFlow. Horovod, a distributed TensorFlow framework, was used to scale the training across multiple compute nodes. Throughout this document, performance was measured in images/sec, a throughput metric that indicates how fast the system can complete training on the dataset.

The images/sec result was averaged across all iterations to take into account the deviations. The total number of iterations is equal to num_epochs * num_images / (batch_size * num_gpus), where num_epochs means the number of passes over all images of a dataset, num_images means the total number of images in the dataset, batch_size means the number of images that are processed in parallel by one GPU, and num_gpus means the total number of GPUs involved in the training.
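The iteration formula and the averaging step can be sketched as follows. This is a minimal illustration, not the benchmark harness used for the measurements: the function names are hypothetical, rounding the iteration count up to a whole number is an assumption (the text does not say how a partial final batch is handled), and the per-iteration throughput samples are made-up placeholder values.

```python
import math
from statistics import mean


def num_iterations(num_epochs: int, num_images: int, batch_size: int, num_gpus: int) -> float:
    """Total iterations = num_epochs * num_images / (batch_size * num_gpus)."""
    return num_epochs * num_images / (batch_size * num_gpus)


def average_images_per_sec(samples: list[float]) -> float:
    """Average the per-iteration images/sec readings across all iterations."""
    return mean(samples)


# Dataset size and batch sizes from the text; 4 GPUs corresponds to one compute node.
ILSVRC2012_TRAIN_IMAGES = 1_281_167

iters_batch64 = num_iterations(1, ILSVRC2012_TRAIN_IMAGES, batch_size=64, num_gpus=4)
iters_batch128 = num_iterations(1, ILSVRC2012_TRAIN_IMAGES, batch_size=128, num_gpus=4)

# Assumption: a partial final batch still counts as one iteration.
print("Batch size 64 per GPU: ", math.ceil(iters_batch64), "iterations per epoch")
print("Batch size 128 per GPU:", math.ceil(iters_batch128), "iterations per epoch")

# Placeholder readings, only to show the averaging step.
print("Average throughput:", average_images_per_sec([100.0, 102.0, 98.0]), "images/sec")
```

Doubling either the per-GPU batch size or the number of GPUs halves the number of iterations needed for one epoch, which is why throughput in images/sec, rather than iterations/sec, is the metric compared across configurations.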

Before running any benchmark, the cache on the head node and compute node(s) were cleared with the [...]

The training tests were run for a single epoch, or one pass through the entire dataset, since the throughput is consistent through epochs for MXNet and TensorFlow tests. Consistent throughput means that the performance variation was not significant across iterations; the tests measured less than 2% variation in performance.

However, two epochs were used for Caffe2 as it needs two epochs to stabilize the performance. This is because [...] (throughput or images/sec) is not stable (the performance variation between iterations is large) when the dataset is not fully loaded in memory.

For the MXNet framework, 16 CPU threads were used for dataset decoding and the reason was explained in the Deep Learning on V100. Caffe2 does not provide a parameter for users to set the number of CPU threads.

For TensorFlow, the number of CPU threads used for dataset decoding is calculated by subtracting four threads per GPU from the total physical core count of the system. The four threads per GPU are used for GPU compute, memory copies, event monitoring, and s[...]
