2024 Microsoft AI Systems Course

Course syllabus (lecture / notes):

Lecture 1: Course introduction — Overview and system/AI basics
Lecture 2: Overview of AI systems — System perspective of System for AI; System for AI: a historical view; fundamentals of neural networks; fundamentals of System for AI
Lecture 3: Fundamentals of DNN computation frameworks — Backprop and AD, Tensor, DAG, execution graph. Papers and systems: PyTorch, TensorFlow
Lecture 4: Matrix computation and computer architecture — Matrix computation, CPU/SIMD, GPGPU, ASIC/TPU. Papers and systems: BLAS, TPU
Lecture 5: Distributed training algorithms — Data parallelism, model parallelism, distributed SGD
Lecture 6: Distributed training systems — MPI, parameter servers, all-reduce, RDMA. Papers and systems: Horovod
Lecture 7: Scheduling and resource management for heterogeneous clusters — Running DNN jobs on a cluster: containers, resource allocation, scheduling. Papers and systems: KubeFlow, OpenPAI, Gandiva, HiveD
Lecture 8: Inference systems — Efficiency, latency, throughput, and deployment
Lecture 9: Computation graph compilation and optimization — IR, sub-graph pattern matching, matrix multiplication and memory optimization. Papers and systems: XLA, MLIR, TVM, NNFusion
Lecture 10: Model compression and sparsity — Efficiency via compression and sparsity; model compression, sparsity, pruning
Lecture 11: AutoML systems — Hyperparameter tuning, NAS. Papers and systems: Hyperband, SMAC, ENAS, AutoKeras, NNI
Lecture 12: Reinforcement learning systems — Theory of RL, systems for RL. Papers and systems: A3C, RLlib, AlphaZero
Lecture 13: Model security and privacy — Federated learning, security, privacy. Papers and systems: DeepFake
Lecture 14: AI for systems — AI for traditional systems problems and for system algorithms. Papers and systems: Learned Indexes, learned query path

Labs:
Lab 1 (weeks 1–2): Getting started with frameworks and tools — A simple end-to-end AI example from a system perspective; understand the systems from debugger info and system logs
Lab 2 (week 3): Customize a new tensor operator — Design and implement a customized operator (both forward and backward) in Python
Lab 3 (week 4): CUDA implementation and optimization — Add a CUDA implementation for the customized operator
Lab 4 (weeks 5–6): AllReduce implementation and optimization — Improve one of the AllReduce operators' implementations in Horovod
Lab 5 (weeks 7–8): Configure containers for training or inference in the cloud — Configure containers for customized training and inference
Lab 6: Scheduling and resource management systems — Get familiar with OpenPAI or KubeFlow
Lab 7: Distributed training exercises — Try different kinds of all-reduce implementations
Lab 8: AutoML exercises — Search for a new neural network structure for image/NLP tasks
Lab 9: RL systems exercises — Configure and get familiar with one of the following RL systems: RLlib, …

Representative AI applications: self-driving, surveillance detection, medical diagnostics, games, personal assistants.

Deep learning is transforming the world.

Application areas: art, image recognition, speech recognition, natural language, generative models, reinforcement learning (e.g., classifying images of cats, dogs, and honey badgers).

[Figure: a classifier distinguishing cat / dog / raccoon images; the prediction errors (loss) are backpropagated as gradients dw1 … dw5 to update the weights.]

Training at this scale relies on massive labeled data, e.g., datasets of 14M images.

Three forces drive deep learning: advances in deep learning algorithms, massive labeled data, and growing compute power, tied together by languages and frameworks — that is, advances in deep learning plus systems: programming languages, optimization, computer architecture, parallel computing, and distributed systems.

E.g., the image classification problem:
- MNIST: 60K samples, 10 categories
- ImageNet: 16M samples, 1000 categories
- Web images: billions of images, open categories

ImageNet test error rate (%) over time: LeNet (convolution, max-pooling, softmax, 1998); AlexNet, 16.4% (ReLU, Dropout, 2012); Inception, 6.7% (batch normalization, 2015); ResNet, 3.57% (residual connections, 2015); EfficientNet, 3.1% (NAS).

Application areas keep broadening: image recognition, speech recognition, natural language, reinforcement learning.

[Figure: performance (op/sec) of compute hardware from 1970 to 2019 — ENIAC ~5 Kops, Xeon E5 ~500 Gops under Moore's law, GPU V100 125 Tops, dedicated hardware such as TPU v1 90 Tops and TPU v3 360 Tops.]

Evolution of the deep learning software/hardware stack:
- Custom-purpose machine learning algorithms (Theano, DisBelief, Caffe) built on algebra and linear-algebra libraries over CPU/GPU dense-matmul engines.
- Deep learning frameworks that provide easier ways to leverage various libraries: language frontends (Swift for TensorFlow, MxNet, CNTK, PyTorch) and compiler backends (TVM, TensorFlow XLA), targeting GPUs, FPGAs, and special AI accelerators (TPU, GraphCore, other ASICs).
- A machine learning language and compiler: a powerful compiler infrastructure (code optimization, sparsity optimization, hardware targeting) plus a full-featured programming language for ML that is expressive and flexible (control flow, recursion, sparsity). Hardware trends: SIMD → MIMD, sparsity support, control flow and dynamicity, associated memory.

Full-stack view of an AI system:
- End-to-end AI user experiences: model, algorithm, pipeline, experiment, life-cycle management
- Programming interfaces: computation graph, (auto) gradient calculation, IR, compiler infrastructure
- Deep learning runtime: optimizer, planner, executor
- Hardware APIs (GPU, CPU, FPGA, ASIC); resource management / scheduler; scalable network stack (RDMA, IB, NVLink)
- Architecture (single node and cloud)

The broader AI system ecosystem covered by the later chapters (class 3 – class 8 and beyond):
- New machine learning paradigms (reinforcement learning)
- Deep learning algorithms and frameworks
- Automated machine learning (AutoML)
- Security and privacy
- Model inference, compression, and optimization
- Support for and evolution of general AI algorithms
- Deep neural network compiler architecture and optimization
- Runtime and optimization environments for deep learning jobs
- General resource management and scheduling systems
- Emerging hardware and the associated high-performance networking and compute stacks

Typical development workflow: (1) define the network structure, then (2) start training.

Defining the network structure: fully connected layers are typically used for the last few layers; convolutional neural networks suit data with strong locality (e.g., images); recurrent neural networks suit sequential data such as text and knowledge graphs; Transformer networks are also widely used, e.g., for text.

# A recursive TreeBank model in a dozen lines of JPL code
# Walk the tree, accumulating embedding vecs
# Word embedding model is used at the leaf node to map word
# index into high-dimensional semantic word representation.
# Get semantic representations for left and right children.
# A composition function is used to learn semantic
# representation for phrase at the internal node.
# Map tree embedding to sentiment

Model trends: more diverse structures, more complex dependencies, and finer-grained computation patterns.

Framework architecture: a front-end language binding (Python, Lua, R, C++) builds a graph definition (IR) — e.g., y = x * w + b — which is optimized (batching, cache, overlap) and executed by the runtime on CPU, GPU, and RDMA devices. The data-flow graph (DFG) serves as the intermediate representation.

[Figure: forward data-flow graph a = x * y, b = a + z, c = Σ b, and the mirrored backward graph produced by adding gradient-backpropagation nodes ∇x, ∇y, ∇z, ∇a, ∇b; TensorFlow lowers each node to CPU or GPU operator code.]

Seen as a compiler, a deep learning framework has the following layers:
- IDE: programming with VS Code, Jupyter Notebook
- Language: integrated with a mainstream programming language — PyTorch and TensorFlow live inside Python
- Compiler front end: basic data structure Tensor (lexical analysis: tokens); basic computation DAG (parsing: AST); advanced features such as control flow (semantic analysis: symbolic AD); general IRs such as MLIR
- Compiler optimization: user-controlled mini-batching, data parallelism and model parallelism, loop/nest analysis for pipeline parallelism and control flow; dataflow analysis for arithmetic simplification and fusion
- Code generation: hardware-dependent optimizations (matrix computation, layout); resource allocation and scheduling (memory, recomputation)
- Runtimes: single node (cuDNN); multi-node (parameter servers, AllReduce); cluster resource management and job scheduling
- Hardware: compute accelerators (CPU/GPU/ASIC/FPGA) and network accelerators (RDMA/IB/NVLink)

MLIR example (syntactically similar to LLVM):

func @testFunction(%arg0: i32) {
  %x = call @thingToCall(%arg0) : (i32) -> i32
  br ^bb1
^bb1:
  %y = addi %x, %x : i32
  return %y : i32
}

Deep learning depends heavily on both data scale and model scale.

Image recognition: AlexNet (2012), 8 layers, 1.4 GFLOP per inference, ~16% error; ResNet (2015), 152 layers, far more compute, much lower error. Speech recognition: Deep Speech 1 (2014), 80 GFLOP, 7,000 hours of data, 8% error; Deep Speech 2 (2015), 465 GFLOP, 12,000 hours of data, 5% error.

Faster training speeds up the development of deep learning models; deploying models at scale demands faster and more efficient inference (inference performance → serving latency). Frameworks must also support different architectures: CNN, RNN, Transformer, …

Workloads differ in compute requirements (model size, …) and goals (throughput, accuracy, …). An AI system should be transparent to these varied user requirements and apply across heterogeneous hardware environments, addressing scale-out, local efficiency, and memory effectiveness.

Systems, algorithms, and hardware must be designed together:
- Hardware: SSD, CPU/GPU/FPGA, InfiniBand/NVLink
- Hyper-parameters: optimizer, mini-batch size, learning rate
- Optimizations: caching, I/O overlap, compression, mixed precision
- Scheduling: I/O, computation, communication
- Parallelism: data parallelism, model parallelism, pipeline parallelism; a model can be trained with parameter-server or all-reduce communication
- Cluster management: managed by Kubernetes or Hadoop + AI; front ends such as VS Code, Jupyter Notebook, DLWorkspace, etc.

OpenPAI platform architecture (repository: /Microsoft/pai): a Web Portal and REST Server accept jobs; a Launcher and the PAI Runtime run AutoML, big-data, and deep learning jobs on YARN + AI and HDFS, with a PAI Monitor for observability.

What the platform provides: a runtime environment through which frameworks access compute resources (GPU/FPGA/ASIC, IB/RDMA) and storage (HDFS/NFS); Kubernetes cluster management on Docker/Ubuntu; efficient scheduling algorithms that allocate heterogeneous compute resources; failure recovery and fault tolerance; logging and performance monitoring; user and security management.

[Figure: energy efficiency (giga-operations per joule) from 1995 to 2020 — CPUs and GPUs each hit an energy-efficiency wall under Moore's law, while dedicated accelerators such as the TPU reach substantially higher efficiency.]

[Figure: a cat/dog/raccoon classifier; the error is backpropagated as derror/dw1 … derror/dw5 to update each weight.]

Expressing the same forward and backward computation directly in NumPy:

import numpy as np

N, D = 3, 4
x = np.random.randn(N, D)
y = np.random.randn(N, D)
z = np.random.randn(N, D)

# forward pass
a = x * y
b = a + z
c = np.sum(b)

# manual backward pass
grad_c = 1.0
grad_b = grad_c * np.ones((N, D))
grad_a = grad_b.copy()
grad_z = grad_b.copy()
grad_x = grad_a * y
grad_y = grad_a * x

The results grad_x, grad_y, grad_z are the gradients of c with respect to x, y, z.

Framework design is a trade-off between flexibility and efficiency. A Python-like style is flexible:

import xxlib
x, y = load_data()
y = xxlib.resnet152(x)

whereas a library, layer-based style is efficient but rigid — every operator is hand-written and registered per device, e.g.:

class AttentionLayer<CPU> {
  void forward(inputs...) { ... }
  void backward(inputs, grad) { ... }
};
class AttentionLayer<GPU> { ... };
REGISTER_LAYER("Attention", AttentionLayer);

Framework layers, revisited:
- Front-end programming languages and interfaces: Python, Lua, R, C++
- Automatic differentiation (AutoDiff)
- Unified model representation: the computation (data-flow) graph, e.g., y = x * w + b
- Graph optimization and scheduled execution: batching, cache, overlap
- Kernel optimization and compilation: GPU kernels, automatic kernel generation
- Compute hardware: CPU, GPU, RDMA devices

Typical operators: Add, Log, While, Sub, MatMul, Merge, Mul, Conv, BroadCast, Div, BatchNorm, Reduce, Relu, Loss, Map, Tanh, Transpose, Reshape, Exp, Concatenate, Select, Floor, Sigmoid, …

On top of the computed gradients the framework implements optimizers, e.g.
SGD: w ← w − η·∇w
SGD with momentum: v_t = γ·v_{t−1} + η·∇w, then w ← w − v_t
(see https://ruder.io/optimizing-gradient-descent/)
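To make the two update rules above concrete, here is a minimal NumPy sketch (illustrative only; w, grad, and v stand in for one parameter tensor, its gradient, and the momentum buffer):

import numpy as np

def sgd_step(w, grad, lr=0.01):
    # plain SGD: w <- w - lr * grad
    return w - lr * grad

def momentum_step(w, grad, v, lr=0.01, gamma=0.9):
    # momentum: v_t = gamma * v_{t-1} + lr * grad ; w <- w - v_t
    v = gamma * v + lr * grad
    return w - v, v

# usage on a toy objective ||w - 1||^2, whose gradient is 2 * (w - 1)
w = np.zeros(5)
v = np.zeros_like(w)
for _ in range(100):
    grad = 2 * (w - 1.0)
    w, v = momentum_step(w, grad, v)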


),→𝜕𝐿(𝑤)𝜕𝑤L𝑥 =expexp 𝑥 +exp𝑥 2 +sin(exp𝑥 +exp𝑥 2)𝜕𝐿(𝑤)𝜕𝑤 𝐿 𝑥 =exp exp𝑥 +exp𝑥 2 +sin(exp𝑥 +exp𝑥 2) xyz𝛻x𝛻y*a*xyz𝛻x𝛻y*a*𝐠𝛻z+bΣc+𝐠𝛻a𝛻bΣ𝐠图的优化与调度执行Batching,Cache,Overlap内核代码优化与编译GPUkernel,autokernely+*bxw统一模型表示:计算流图图的优化与调度执行Batching,Cache,Overlap内核代码优化与编译GPUkernel,autokernely+*bxw统一模型表示:计算流图前端编程语言和接口Python,Lua,R,C++自动求导(AutoDifferentiation)计算硬件计算硬件CPU,GPU,RDMAdevicesxwxw*b+yxwxw*b+yPAGE29PAGE29Batchsame-typeoperatorsleverageGPUmassiveparallelism++×𝝈+M𝝈×𝒕𝒂𝒏𝒉+ +MMMMMRf Rzht-1xtData-flowgraphofaGRUcellWzWoWfhtBatchsame-typeoperatorsleverageGPUmassiveparallelism+×𝝈+×𝝈+M𝝈×𝒕𝒂𝒏𝒉+ +MMMMMRf Rzht-1xtWzWoWfht+×𝝈×𝝈+M𝒕𝒂𝒏𝒉Mht-1RWxthtData-flowgraphofaGRUcellPAGEPAGE31xyz𝛻x𝛻y*a*𝐠𝛻z+xyz𝛻x𝛻y*a*𝐠𝛻z+bΣc+𝐠𝛻a𝛻bΣ𝐠1xyzxyz𝛻x*𝐠𝛻y*a𝛻z+bΣc+𝐠𝛻bΣ𝐠𝛻aGPU0显式图划分GPU0𝒀MatMul𝑯Sigmoid 𝑾𝟐𝒀MatMul𝑯Sigmoid 𝑾𝟐MatMulGPU133DispatchpartitionsPartitiongraph𝑯𝝈𝒀*DispatchpartitionsPartitiongraph𝑯𝝈𝒀*𝑾𝟐*𝑾𝟏 𝑿tensortransmissionmechanism𝑯𝝈Send*𝒀*Recv𝑾𝑾𝟏𝑿𝟐Server0ServerServer0Server136x y z

[Figure: each node of the forward/backward data-flow graph is lowered to CPU code or GPU code.]

Graph scheduling and execution. Batching same-type operators leverages the GPU's massive parallelism: in the data-flow graph of a GRU cell, the gate matrix multiplications (Wz, Wo, Wf and Rf, Rz applied to xt and ht−1) can be batched before the σ and tanh nodes that produce ht.

Explicit graph partitioning places sub-graphs on different devices: for H = Sigmoid(MatMul(W1, X)) and Y = MatMul(W2, H), the graph is partitioned, the partitions are dispatched to GPU0 / GPU1 (or Server0 / Server1), and Send/Recv nodes implement the tensor-transmission mechanism between them.

The framework stack supports several programming styles with different flexibility/efficiency trade-offs:
- Caffe: programming with configuration files, layer-based, large kernel granularity
- CNTK, Caffe2: declarative programming over a static graph, enabling graph optimization
- Python/SciPy: Python-like, no programming restrictions, but cannot leverage the GPU

More flexibility comes from dynamic graphs (define-by-run, imperative programming), as in DyNet and PyTorch — at the cost of losing whole-graph optimization:

import torch
from torch.autograd import Variable

N, D = 3, 4
x = Variable(torch.randn(N, D).cuda(), requires_grad=True)
y = Variable(torch.randn(N, D).cuda(), requires_grad=True)
z = Variable(torch.randn(N, D).cuda(), requires_grad=True)

c = torch.zeros(1).cuda()
for i in range(10):
    a = x * y          # the graph is built on the fly, step by step
    b = a + z
    c = c + torch.sum(b)
c.backward()

Framework comparison along the flexibility axis: Caffe (config-based, large kernels) → CNTK/Caffe2 (declarative, static graph, graph optimization) → DyNet/PyTorch (imperative, dynamic graph, no graph optimization) → Python/SciPy (fully flexible but cannot leverage the GPU).

A compiler is used to optimize the general framework to be more efficient, while keeping the existing flexibility — the same evolution sketched earlier, from custom-purpose ML algorithms (Theano, DisBelief, Caffe) to deep learning frameworks to a machine-learning language and compiler.

Matrix computation on hardware. A convolution layer can be lowered to a matrix multiplication Y = WᵀX, so convolution-layer computation becomes dense matmul that CPUs (SIMD, SSE/AVX vectorization), GPUs (e.g., CUTLASS CUDA linear algebra), and ASICs execute efficiently; a sketch of the im2col lowering follows below. Sources: /class/ee282h; /playgrounds/283/sse-avx-vectorization/what-is-sse-and-avx; /wp/wp-content/uploads/2013/05/poster_andresch_acaces2014.pdf; /cutlass-linear-algebra-cuda/.

An in-depth look at Google's first Tensor Processing Unit (TPU). TPU instructions and their functions:
- Read_Host_Memory: read data from memory
- Read_Weights: read weights from memory
- MatrixMultiply/Convolve: multiply or convolve with the data and weights, accumulate the results
- Activate: apply activation functions
- Write_Host_Memory: write result to memory
The TPU is built around a matrix multiplier unit.

On a single device, the model-related factors (Top-5 accuracy vs. computational complexity of a single forward pass; see "Benchmark Analysis of Representative Deep Neural Network Architectures", arXiv:1810.00736) are relatively fixed, and Moore's law offers relatively limited headroom; the variable factors — how the computation is mapped onto hardware — are the focus of systems work. This chapter covers the single-device case; the next chapter goes beyond it.
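A minimal NumPy sketch of the im2col lowering mentioned above (single channel, stride 1, no padding; the function name im2col and the shapes are illustrative, not a specific library's API):

import numpy as np

def im2col(x, kh, kw):
    # x: (H, W) single-channel input; returns a (kh*kw, out_h*out_w) matrix
    H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.zeros((kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

x = np.random.randn(5, 5)
w = np.random.randn(3, 3)          # one 3x3 filter
cols = im2col(x, 3, 3)             # each column is one receptive field
y = (w.ravel() @ cols).reshape(3, 3)   # convolution as a single dense matmul (Y = W^T X)

Because each column of cols is one receptive field, the convolution reduces to a single dense matrix multiplication — the operation that SIMD units, GPUs, and the TPU's matrix unit are built to execute.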

TensorFlow builds the same computation as a static data-flow graph (DFG) that is defined first and executed later (TensorFlow 1.x style):

import tensorflow as tf

x = tf.placeholder(tf.float32)
W = tf.Variable(1.0)
b = tf.Variable(0.0)
m = W * x
s = m + b
y = tf.reduce_sum(s)
grad_W, grad_b = tf.gradients(y, [W, b])
optimizer = tf.train.GradientDescentOptimizer(0.01)
update = optimizer.apply_gradients([(grad_W, W), (grad_b, b)])

[Figure: the corresponding data-flow graph — x, W, b feed the * and + nodes into Σ, mirrored by gradient nodes ∇W, ∇b, ∇m, ∇s; the same graph can then be scaled across multiple samples and across multiple operators.]
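To complete the define-then-run picture, a minimal sketch of executing the graph above (assuming the TensorFlow 1.x session API; the input values are arbitrary):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        # one update step: feed data through the placeholder and run the update op
        _, y_val = sess.run([update, y], feed_dict={x: [1.0, 2.0, 3.0]})

Because the whole graph is known before execution, the runtime can optimize it (batching, placement, kernel fusion) before any data flows through it.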

Scaling the graph across samples and across operators gives several axes of parallelism for tensor computation.

Inter-operator parallelism:
- Data parallelism: execute multiple samples in parallel on replicas of the model
- Model parallelism: execute multiple operators in parallel on different devices
- Hybrid parallelism: combinations of the above, stacked together

Intra-operator parallelism: parallelize the computation inside a single tensor operator (e.g., im2col convolution as matmul) across a GPU's many processing units.
(References: https://blog.skymind.ai/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks/; Joseph E. Gonzalez, AI-Systems: Distributed Training.)

[Figure: data parallelism — the forward/backward graph (x, W, b → m → s → Σ y with gradient nodes ∇W, ∇b, ∇m, ∇s) is replicated on GPU0 and GPU1, and an AllReduce combines ∇W and ∇b across the replicas; a sketch of the resulting training step follows below.]

Comparison of parallelization schemes (per device, relative to non-parallel execution):
- Samples per step: non-parallel 1; data parallel 1/N; model parallel 1
- Data transferred: non-parallel 0; data parallel the model size (gradients); model parallel the activation size
- Memory footprint: non-parallel 1; data parallel N (the model is replicated); model parallel 1
- Load balance: data parallel strong; model parallel weak
- Limit on parallelism: data parallel the per-step sample count (mini-batch size); model parallel the number of operators
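A minimal sketch of one synchronous data-parallel step with gradient all-reduce (assuming torch.distributed with an already-initialized process group; in practice Horovod's DistributedOptimizer or PyTorch DDP wraps this pattern):

import torch
import torch.distributed as dist

def data_parallel_step(model, loss_fn, x, y, lr=0.01):
    loss = loss_fn(model(x), y)      # forward on this worker's shard of the mini-batch
    loss.backward()                  # local gradients
    world = dist.get_world_size()
    with torch.no_grad():
        for p in model.parameters():
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # sum gradients across workers
            p.grad /= world                                  # average
            p -= lr * p.grad                                 # identical update on every replica
            p.grad.zero_()
    return loss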


Synchronization models for distributed training: with a synchronization barrier (BSP — every worker waits at a barrier after each iteration, so stragglers cause wasted time), without any barrier (ASP — fully asynchronous), and with a bounded, relaxed barrier (SSP — stale-synchronous). [Figure: compute/communicate timelines for machines 1–3 showing barriers and wasted time; Joseph E. Gonzalez, AI-Systems: Distributed Training.]

Representative systems and papers: Downpour SGD; "Scaling Distributed Machine Learning with the Parameter Server"; "GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism"; "PipeDream: Generalized Pipeline Parallelism for DNN Training"; "Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD".

Distributed-training support in current frameworks:
- Horovod: DistributedOptimizer; communication via Gloo/MPI; supports multiple frameworks
- PyTorch: point-to-point plus collective communication (Gloo, MPI); single framework
- TensorFlow: a distribution-strategy user interface (strategy list: PS, AllReduce) layered over single-node TensorFlow training, with communication coordinated via gRPC, libRDMA, NCCL
(See /tutorials/intermediate/dist_tuto.html, the Horovod repository, and the slide deck /AlexanderSergeev4/horovod-distributed-tensorflow-made-easy.)

TensorFlow distribution strategies vs. training APIs (tensorflow.org/guide/distributed_training):
- Keras API: MirroredStrategy supported; TPUStrategy, MultiWorkerMirroredStrategy, CentralStorageStrategy experimental; ParameterServerStrategy support planned post 2.0; OneDeviceStrategy supported
- Custom training loop: MirroredStrategy and TPUStrategy experimental; MultiWorkerMirroredStrategy and CentralStorageStrategy support planned post 2.0; ParameterServerStrategy no support yet; OneDeviceStrategy supported
- Estimator API: limited support for all strategies, except TPUStrategy, which is not supported

Execution mechanisms. For model parallelism, the graph is partitioned, the partitions are dispatched to different servers, and Send/Recv nodes transmit tensors between them. For data parallelism with a parameter server, communication happens once per mini-batch: each worker generates gradients (GenGrad) and sends them to the server, which applies them (ApplyGrad) to the weights it owns; a sketch of this pull/compute/push loop follows below.
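A toy sketch of the parameter-server pattern just described (illustrative only; ParameterServer, worker_step, and compute_grad are made-up names, not any framework's real API):

import numpy as np

class ParameterServer:
    """Conceptually runs on the server; owns the weights."""
    def __init__(self, init_w, lr=0.01):
        self.w, self.lr = init_w.astype(float), lr
    def pull(self):
        return self.w.copy()                # workers fetch the latest weights
    def push(self, grad):
        self.w -= self.lr * grad            # ApplyGrad: one update per received mini-batch gradient

def worker_step(server, batch, compute_grad):
    w = server.pull()                       # pull current weights
    grad = compute_grad(w, batch)           # GenGrad on the worker's local mini-batch
    server.push(grad)                       # send the gradient back to the server

# usage with a trivial quadratic objective ||w - mean(batch)||^2
server = ParameterServer(np.zeros(4))
worker_step(server, np.ones((8, 4)),
            compute_grad=lambda w, b: 2 * (w - b.mean(axis=0)))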

Chapter overview: running jobs on a multi-tenant GPU cluster. Several users share one GPU cluster: one holds a 10-GPU quota and submits a TFJob, another holds a 20-GPU quota and submits a PyTorchJob, a third holds a 100-GPU quota and submits an MXNetJob. Many users and many jobs share multi-GPU servers; jobs differ widely in environment and resource requirements, while the server software environment is uniform.

Why scheduling and resource management systems matter: they provide the AI infrastructure — scheduling and managing deep learning jobs and the heterogeneous hardware beneath them — and they raise productivity: users focus on model innovation instead of system deployment and management, and sharing models, code, and data accelerates research and innovation.

Life cycle of a deep learning job: submission and queuing; resource allocation and scheduling on the GPU cluster; execution as jobs, images, and containers.

Running a job on a dedicated server is the simple case: an exclusive environment with no environment or resource-isolation concerns — local /anaconda3, local /usr/local/cuda, local /data — and the job is launched directly with its startup script:
python train.py --batch_size=256 --model_name=resnet50

Submitting the job to a shared platform instead requires a job description, e.g. (OpenPAI-style JSON):

{
  "jobName": "resnet",
  "image": "example.tensorflow:stable",
  "dataDir": "/tmp/data",
  "outputDir": "/tmp/output",
  ...
  "taskRoles": [
    {
      "taskNumber": 1,
      "cpuNumber": 8,
      "memoryMB": 32768,
      "gpuNumber": 1,
      "command": "python train.py --batch_size=256 --model_name=resnet50"
    }
  ]
}

Here image specifies the environment dependencies, dataDir the data and code, taskNumber the number of tasks, cpuNumber/memoryMB/gpuNumber the resource request, and command the job launch command.

The job environment-dependency problem. Servers do not come with each user's personalized environment pre-installed; different jobs need different frameworks, dependencies, and versions, so installation is tedious and repetitive; servers accumulate large numbers of duplicated libraries that waste space. Specific to deep learning: each job must install its dependencies and a deep learning framework. Goal: reuse the overall installed environment while creating new environments, building dependencies in layers so each layer can be reused.

The runtime resource-isolation problem. Cluster resources are shared: how do we keep jobs from interfering with each other or taking more than their share? How can different jobs run different operating systems and namespaces? How do we keep isolation while starting jobs as quickly as possible? Specific to deep learning: how to isolate the GPU and GPU memory. Goals: resource isolation with lightweight startup.

Images and containers: Docker. A Docker image packages the job's environment dependencies.

Docker workflow: build a Dockerfile into an Image, push the image to a Registry so it can be shared, pull it on other machines, and run it as a Container.

Images consist of read-only layers plus a read-write layer, combined with a union mount; multiple file systems are supported, and common base layers are reused across jobs.
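The build → push → pull → run cycle above, as shell commands (the registry address and image tag here are made-up placeholders):

$ docker build -t registry.example.com/team/dl-job:v1 .     # build an image from the Dockerfile
$ docker push registry.example.com/team/dl-job:v1           # share it through a registry
$ docker pull registry.example.com/team/dl-job:v1           # fetch it on a cluster node
$ docker run --rm registry.example.com/team/dl-job:v1 \
      python train.py --batch_size=256 --model_name=resnet50   # run the job in a container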

Container images are layered. Dockerfile example (from the pytorch/pytorch repository, build/1.3.0/docker/pytorch/Dockerfile; elided steps shown as …):

FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu16.04
ARG PYTHON_VERSION=3.6
…
RUN apt-get update && apt-get install -y --no-install-recommends \
    …
RUN curl -o ~/miniconda.sh … \
    /opt/conda/bin/conda install -y -c pytorch magma-cuda100 && \
    …
WORKDIR /opt/pytorch
COPY . .
…
WORKDIR /workspace
RUN chmod -R a+w .
…

Image file-system example: files (file1 … file4) are distributed across an image base layer, image layer 1, and image layer 2, and the AUFS storage driver unions them into one view.

Containers: multiple deep learning jobs (DLJob1, DLJob2, DLJob3) run side by side, each in its own container. "Linux containers are implementations of operating-system-level virtualization for the Linux system." The supporting kernel technologies are cgroups and namespaces. Namespace isolation types include pid, net, mnt, ipc, user, …

PID namespaces illustrate the idea: the root PID namespace sees the real PIDs (pid1 … pid5), while each nested PID namespace shows its processes a private numbering (getpid() returns the in-namespace PID).

Control groups (cgroups): Linux uses cgroups to control, account for, and isolate the resources of a process group — CPU, memory, network, and storage I/O. Cgroup controller types include cpu, memory, block I/O, network, ….

Container resource allocation and isolation subsystems:
- cpuset (isolation): confine processes to processor and memory node subsets
- ns (isolation): show a private view (namespace) of the system to processes in the cgroup
- cpu (control): share CPU bandwidth between groups
- cpuacct (accounting): the CPU Accounting subsystem generates automatic reports on CPU resources
- memory (control): the memory controller supports reporting and limiting of process memory, kernel memory, and swap used by cgroups
- devices (isolation): controls which processes may create (mknod) devices as well as open them for reading or writing
- rdma (control): the RDMA controller permits limiting the use of RDMA/IB-specific resources per cgroup
- blk_io (control): the blkio cgroup controls and limits access to specified block devices by applying I/O control

GPU resource allocation and isolation: NVIDIA Docker. What it provides: isolation at whole-GPU granularity. Problems: the GPU cannot be time-shared the way a traditional OS time-shares the CPU, and GPU memory cannot be isolated. Potential remedies: NVIDIA MPS and similar techniques.

NVIDIA Docker examples:

# Test nvidia-smi with the latest official CUDA image
$ docker run --gpus all nvidia/cuda:9.0-base nvidia-smi
# Start a GPU enabled container on two GPUs
$ docker run --gpus 2 nvidia/cuda:9.0-base nvidia-smi
# Starting a GPU enabled container on specific GPUs
$ docker run --gpus '"device=1,2"' nvidia/cuda:9.0-base nvidia-smi

Summary and discussion: containers and images solve environment dependencies and resource isolation, laying the foundation for multi-tenant scheduling systems. Compared with a traditional OS, which capabilities are still missing from the GPU technology stack?

Scheduling. This part covers gang scheduling, DRF (Dominant Resource Fairness) scheduling, capacity scheduling, preemption, and recent scheduling algorithms. When scheduling a batch of jobs, the usual objectives are throughput, makespan / average completion time, fairness, utilization / efficiency, and service-level agreements (SLA).

What problems arise when scheduling parallel or distributed jobs? Deep learning training is GPU-centric, and the goals are high throughput, high utilization, and short completion times.

[Figure: without gang scheduling, Job A's tasks a1 – a5 and Job B's tasks b1 – b5 are placed incrementally on the GPUs of Node 1 and Node 2; partial allocations hold GPUs idle (wasted resources), and because a5 never obtains a GPU, Job A can never start training.]

Gang scheduling. Wiki definition: "A scheduling algorithm for parallel systems that schedules threads or processes to run simultaneously on different processors."

[Figure: with gang scheduling, Jobs A, B, and C are each launched only when all of their tasks can run at the same time, across time slots t1 – t4.]

Policy: launch all processes of a deep learning job simultaneously (see the sketch below).
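A minimal all-or-nothing placement sketch in the spirit of gang scheduling (illustrative only; the job and node structures are made up, not any scheduler's real API):

def try_gang_schedule(task_gpu_demands, free_gpus_per_node):
    """Return a placement for ALL tasks of one job, or None if any task cannot be placed."""
    free = dict(free_gpus_per_node)          # e.g. {"node1": 4, "node2": 2}
    placement = []
    for need in task_gpu_demands:            # e.g. [1, 1, 1, 1, 1] for a 5-task job
        node = next((n for n, g in free.items() if g >= need), None)
        if node is None:
            return None                      # cannot start the whole gang -> keep the job queued
        free[node] -= need
        placement.append((node, need))
    return placement                         # all tasks start together, none of them earlier

print(try_gang_schedule([1] * 5, {"node1": 4, "node2": 4}))   # placed
print(try_gang_schedule([1] * 5, {"node1": 2, "node2": 2}))   # None: like Job A in the figure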

Scheduling a GPU cluster with a job queue: e.g., Job1 asks for tasks of 1 GPU + 4 GB RAM, Job2 for tasks of 2 GPU + 2 GB and 3 GPU + 2 GB. The scheduler must allocate not only CPU and host memory but also GPUs and GPU memory, with throughput as a design goal. Cluster resources: [10 GPU, 20 GB RAM, …].

DRF (Dominant Resource Fairness). Optimization goal: DRF tries to maximize the smallest dominant share in the system. Policy: determine each job's dominant resource and apply max-min fairness across multiple resource types (e.g., GPU, CPU) — a multi-resource generalization of max-min fair scheduling.

Worked DRF example with cluster resources [10 GPU, 20 GB RAM]:
- Job1 (two tasks of 1 GPU, 4 GB): total memory 4 + 4 = 8 GB, memory share 8/20 = 0.4; dominant resource is memory, share = 0.4.
- Job2 (tasks of 2 GPU, 2 GB and 3 GPU, 2 GB): total GPU 2 + 3 = 5, GPU share 5/10 = 0.5; total memory 2 + 2 = 4 GB, memory share 4/20 = 0.2; dominant resource is GPU, share = 0.5.
Job1 has higher priority than Job2 because Job1's dominant share (0.4) is smaller than Job2's (0.5). (A small script reproducing this calculation follows below.)

How can multiple teams share one cluster? With static partitions, free resources are wasted: Team A has used 45% of its capacity while 45% sits free, Team B has used only 10%, and Team C has no free resources left, so its job cannot be scheduled even though the cluster as a whole has idle capacity. The organization also wants minimum-capacity guarantees per team, and GPU and GPU memory must be part of the accounting. Design goals: utilization and guaranteed capacity.

Capacity scheduling. Policy: each queue has a minimum capacity (e.g., MinCapacity 10%) and a maximum capacity (e.g., MaxCapacity 30%), so a queue can borrow idle capacity (bonus) beyond its guarantee; a UserLimitFactor (e.g., 3, 1, and 0.25 for Queues A, B, and C) caps the maximum resources a single user may take; preemption reclaims borrowed resources when the guaranteed owner needs them.

Virtual clusters. Purpose: raise resource utilization. Policy: carve the physical cluster (racks of 8-GPU nodes) into per-tenant virtual clusters.
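The dominant-share computation from the DRF example above, as a small sketch (the dictionaries simply restate the numbers given in the example):

cluster = {"gpu": 10, "mem_gb": 20}                  # total cluster resources
usage = {
    "job1": {"gpu": 1 + 1, "mem_gb": 4 + 4},         # two tasks of (1 GPU, 4 GB)
    "job2": {"gpu": 2 + 3, "mem_gb": 2 + 2},         # tasks of (2 GPU, 2 GB) and (3 GPU, 2 GB)
}

def dominant_share(u):
    shares = {r: u[r] / cluster[r] for r in cluster}
    resource = max(shares, key=shares.get)           # the dominant resource
    return resource, shares[resource]

for job, u in usage.items():
    print(job, dominant_share(u))
# job1 -> ('mem_gb', 0.4), job2 -> ('gpu', 0.5): DRF offers the next allocation to the job
# with the smaller dominant share, so job1 is served before job2.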

[Figure: a rack of 8-GPU nodes (each with two CPUs connected over QPI and GPUs 1–8) partitioned into virtual clusters for Tenant A, Tenant B, and Tenant C.]

How can scheduling also honor SLAs? Without preemptive scheduling, resources stay busy but the SLA of high-priority work cannot be guaranteed; the challenge is that, unlike a traditional OS, a GPU job cannot simply be context-switched. Design goal: balance utilization of limited resources against service-level agreements (SLA). With preemptive scheduling, App2 finishes sooner while App1 is preempted and resumed later.

Motivation 1: how the GPU cluster affects deep learning jobs. Multi-GPU training is sensitive to the GPU interconnect topology and to interference from jobs co-located on the same server. This motivates affinity (topology-aware) scheduling.

Motivation 2: characteristics of deep learning jobs. Jobs can be split into tasks over small time windows; checkpoints taken at different points have different data sizes; resource consumption is predictable and can be obtained through runtime monitoring. This motivates time slicing and oversubscription, packing, migration, ….

Gandiva scheduling policy. Design goals: early feedback, cluster efficiency, and fairness, realized through two modes.
- Reactive mode: reacts to job arrivals, departures, and failures with affinity-aware placement — prefer nodes with the same affinity, then nodes with no affinity, then nodes with a different affinity; oversubscription via suspend-resume on same-affinity nodes; otherwise the job is queued.
- Introspective mode: continuously monitors and periodically optimizes the placement of running jobs, providing early feedback; mechanisms include packing, migration, grow-shrink, and time slicing.

What other properties of deep learning affect scheduling? Jobs have diverse resource footprints, which easily leads to low utilization and resource fragmentation after allocation; and deep learning models keep getting larger, …
