版權(quán)說(shuō)明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡(jiǎn)介
NVIDIA LLM Full-Stack Solution: Best Practices for Usage and Optimization

Agenda
- NVIDIA Full-Stack Solution for LLM
- Best Practices of NVIDIA Megatron-Core for LLM Training
- Best Practices of NVIDIA TensorRT-LLM for LLM Inference
- Best Practices of NVIDIA Triton Inference Server for LLM

NVIDIA Full-Stack Solution for LLM
- NVIDIA Megatron-Core (M-Core) for LLM training
- NVIDIA TensorRT-LLM for LLM inference

Overview of NVIDIA's Large Language Model Offerings for Training
Solutions at each level of the stack:
- NeMo Framework: easy-to-use, out-of-the-box framework with a large model collection
- Megatron-LM: a lightweight reference framework for using Megatron-Core
- Megatron-Core: a library of GPU-optimized techniques for LLM training
- Transformer Engine: Hopper-accelerated Transformer models

Why We Need NVIDIA Megatron-Core

NVIDIA TensorRT-LLM
- Builds on FasterTransformer to leverage its optimized kernels for performance
- Other components for the customization of LLM inference, such as CUTLASS

Key Features in NVIDIA TensorRT-LLM

What is NVIDIA Triton Inference Server?
Features of Triton Inference Server

Best Practices for NVIDIA Megatron-Core
- Enable the distributed optimizer to shard optimizer states across data-parallel ranks (--use-distributed-optimizer)
- Enable Transformer Engine (--transformer-impl transformer_engine)
- Enable FlashAttention (--use-flash-attn)
- Enable communication overlapping
- Enable kernel fusions
These switches are combined in the launch sketch below.

[Figure: Megatron-LM and Megatron-Core architecture. The Megatron-LM training loop sits on top of Megatron-Core building blocks: embeddings, attention, normalization, MLP, transformer layer and transformer block, pipeline schedule and communication, distributed checkpointing, activation recompute, sequence parallelism, distributed optimizer, and model Config/Spec customization.]
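To make the flags concrete, here is a minimal launch sketch for Megatron-LM's pretrain_gpt.py that combines the switches above. The flag names follow Megatron-LM's command line; the model sizes, batch sizes, paths, training schedule, and the specific overlap flags (--overlap-grad-reduce, --overlap-param-gather) are illustrative assumptions to adapt to your own setup, not tuned recommendations.

    #!/bin/bash
    # Sketch: single-node, 8-GPU Megatron-LM pretraining with the best-practice
    # switches enabled. All sizes and paths are placeholders.
    ARGS=(
      --num-layers 32 --hidden-size 4096 --num-attention-heads 32  # placeholder 7B-class sizes
      --seq-length 4096 --max-position-embeddings 4096
      --micro-batch-size 1 --global-batch-size 256
      --tensor-model-parallel-size 2 --sequence-parallel           # TP plus sequence parallelism
      --use-distributed-optimizer                                  # shard optimizer states
      --overlap-grad-reduce --overlap-param-gather                 # communication overlapping
      --transformer-impl transformer_engine                        # Transformer Engine kernels
      --use-flash-attn                                             # FlashAttention
      --bf16
      --lr 3.0e-4 --min-lr 3.0e-5 --lr-decay-style cosine --train-iters 50000
      --data-path /path/to/dataset_prefix                          # placeholder dataset
      --tokenizer-type GPT2BPETokenizer                            # placeholder tokenizer
      --vocab-file /path/to/vocab.json --merge-file /path/to/merges.txt
      # Kernel fusions (e.g., bias+GeLU, fused softmax) are enabled by default.
    )
    torchrun --nproc_per_node 8 pretrain_gpt.py "${ARGS[@]}"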
How to Use NVIDIA TensorRT-LLM
- Ease of use: a few commands take a Hugging Face checkpoint to runnable engines, as shown below.

    # Convert huggingface llama-7b model to trt-llm checkpoint
    # Optionally with tensor and/or pipeline parallelism, e.g., tp=2
    python examples/llama/convert_checkpoint.py \
        --model_dir llama-7b-hf \
        --dtype float16 \
        --tp_size 2 \
        --output_dir tllm_ckpt/llama-7b-fp16-tp2

    # Quantize huggingface llama-7b and export to trt-llm checkpoint
    # Optionally with tensor and/or pipeline parallelism, e.g., tp=2
    python examples/quantization/quantize.py \
        --model_dir llama-7b-hf \
        --dtype float16 \
        --qformat fp8 \
        --tp_size 2 \
        --output_dir tllm_ckpt/llama-7b-fp8-tp2

    # Build trt-llm engines from trt-llm checkpoint
    # Optionally enable/disable building options
    trtllm-build --checkpoint_dir tllm_ckpt/llama-7b-fp8-tp2 \
        --gemm_plugin float16 \
        --output_dir tllm_engines/llama-7b-fp8-tp2 \
        --workers 2

    # Run inference with the trt-llm engines
    mpirun -n 2 --allow-run-as-root python examples/run.py \
        --engine_dir tllm_engines/llama-7b-fp8-tp2 \
        --tokenizer_dir llama-7b-hf \
        --max_output_len 30 \
        --input_text "Born in north-east France, Soyer trained as a"

    # Example generated output
    Output [Text 0 Beam 0]: "chef in Paris and London before moving to New York in 1850. He was the first chef to be hired by the newly"

TensorRT-LLM checkpoint format
- One or more safetensors files storing the weights of each rank
- Each file saves a dict mapping weight names to tensors:

    {
        'transformer.vocab_embedding.weight': torch.Tensor(...),
        'transformer.layers.0.attention.qkv.weight': torch.Tensor(...),
        'transformer.layers.0.attention.dense.weight': torch.Tensor(...),
        'transformer.layers.0.mlp.fc.weight': torch.Tensor(...),
        'transformer.layers.0.mlp.proj.weight': torch.Tensor(...),
        ...
        'lm_head.weight': torch.Tensor(...)
    }

Build Options
- In-flight batching is enabled by default with trtllm-build; it requires the GPT attention plugin, paged KV cache, and input-padding removal.
- Custom AllReduce plugin: recommended to enable for NVLink-based nodes.
- Embedding parallelism and sharing features: recommended to enable to improve throughput and reduce memory usage.

Runtime Options (see the configuration sketch after this list)
- gpt_model_type: use inflight_fused_batching to increase throughput and reduce latency.
- batch_scheduler_policy: start with guaranteed_no_evict, then switch to max_utilization for possibly higher throughput.
- kv_cache_free_gpu_mem_fraction (default 0.9) is preferred over max_tokens_in_paged_kv_cache for ease of use.
- enable_trt_overlap: start with false.
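For reference, a sketch of how these runtime options are typically set in the tensorrtllm_backend model repository using its tools/fill_template.py helper. The helper's interface and the exact parameter names are assumptions based on the backend's config.pbtxt templates and can differ across releases; check the tensorrtllm_backend README for your version.

    # Assumed interface: fill the tensorrt_llm model's config.pbtxt template in place.
    cd tensorrtllm_backend
    python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
        gpt_model_type:inflight_fused_batching,gpt_model_path:/path/to/engines,batch_scheduler_policy:guaranteed_no_evict,kv_cache_free_gpu_mem_fraction:0.9,enable_trt_overlap:false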
Performance Best Practices: Quantization
- Weight-only quantization: reduces latency; get the scales from external libraries.
- Weight and activation quantization.

How to Use NVIDIA Triton Inference Server
- Option 1: Use the pre-built Triton TRT-LLM container from NGC (e.g., nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3, used in the launch example below).
- Option 2: Build via the dockerfile, which can be modified easily.

    # Update the submodules
    cd tensorrtllm_backend
    git lfs install
    git submodule update --init --recursive

    # Use the Dockerfile to build the backend in a container
    # For x86_64
    DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
    # For aarch64
    DOCKER_BUILDKIT=1 docker build -t triton_trt_llm --build-arg TORCH_INSTALL_TYPE="src_non_cxx11_abi" -f dockerfile/Dockerfile.trt_llm_backend .

- Option 3: Build via the build.py script from the Triton Server repo.

    # Prepare the TRT-LLM base image using the dockerfile from tensorrtllm_backend.
    cd tensorrtllm_backend
    # Specify the build args for the dockerfile.
    BASE_IMAGE=nvcr.io/nvidia/tritonserver:24.01-py3-min
    TRT_VERSION=9.2.0.5
    TRT_URL_x86=/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-.linux.x86_64-gnu.cuda-12.2.tar.gz
    TRT_URL_ARM=/downloads/compute/machine-learning/tensorrt/9.2.0/tensorrt-.Ubuntu-22.04.aarch64-gnu.cuda-12.2.tar.gz

    docker build -t trtllm_base \
        --build-arg BASE_IMAGE="${BASE_IMAGE}" \
        --build-arg TRT_VER="${TRT_VERSION}" \
        --build-arg RELEASE_URL_TRT_x86="${TRT_URL_x86}" \
        --build-arg RELEASE_URL_TRT_ARM="${TRT_URL_ARM}" \
        -f dockerfile/Dockerfile.triton.trt_llm_backend .

    # Run the build script from the Triton Server repo.
    # The flags for some features or endpoints can be removed if not needed.
    TRTLLM_BASE_IMAGE=trtllm_base
    cd server
    ./build.py -v --no-container-interactive --enable-logging --enable-stats --enable-tracing \
        --enable-metrics --enable-gpu-metrics --enable-cpu-metrics \
        --filesystem=gcs --filesystem=s3 --filesystem=azure_storage \
        --endpoint=http --endpoint=grpc --endpoint=sagemaker --endpoint=vertex-ai \
        --backend=ensemble --enable-gpu \
        --image=base,${TRTLLM_BASE_IMAGE} \
        --backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG} \
        --backend=python:${PYTHON_BACKEND_REPO_TAG}

Prepare the TensorRT-LLM engines

    # Go to the tensorrt_llm/examples/llama directory
    cd tensorrt_llm/examples/llama

    # Convert the LLaMA model into tensorrt-llm checkpoint format.
    python convert_checkpoint.py --model_dir /path/to/llama-7b-hf \
        --output_dir ./tllm_checkpoint_1gpu_fp16 \
        --dtype float16

    # Build the LLaMA 7B model using a single GPU and FP16.
    trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
        --output_dir ./llama_model/fp16/1-gpu \
        --gemm_plugin float16 \
        --gpt_attention_plugin float16 \
        --context_fmha enable \
        --paged_kv_cache enable \
        --remove_input_padding enable \
        --max_beam_width 1 \
        --max_batch_size 8 \
        --max_input_len <max_input_len>

The model repository is an ensemble of four models:
- preprocessing: converts prompts (string) to input_ids (list of ints).
- tensorrt_llm: runs the TRT-LLM engine for inference.
- postprocessing: converts output_ids (list of ints) back to outputs (string).
- ensemble: connects the input and output tensors of the preprocessing, tensorrt_llm, and postprocessing models together, reducing the number of requests that must be sent to Triton. Also supports more features.

Launch the server

    # Enter Triton NGC container
    docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
        --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend \
        nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash

    # Launch Triton server
    cd /tensorrtllm_backend
    # --world_size is the number of GPUs you want to use for serving
    python3 scripts/launch_triton_server.py --world_size=4 \
        --model_repo=/tensorrtllm_backend/all_models/inflight_batcher_llm

    # Expected output when the server is ready:
    +--------------+---------+--------+
    | Model        | Version | Status |
    +--------------+---------+--------+
    | <model_name> | <v>     | READY  |
    | ...          | ...     | ...    |
    +--------------+---------+--------+
    I0919 14:52:10.475738 293 grpc_server.cc:2451] Started GRPCInferenceService at 0.0.0.0:8001
    I0919 14:52:10.475968 293 http_server.cc:3558] Started HTTPService at 0.0.0.0:8000
    I0919 14:52:10.517138 293 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

Send requests with the client

    cd /tensorrtllm_backend
    # Use inflight_batcher_llm_client.py
    python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 \
        --tokenizer-dir /path/to/llama/tokenizer \
        --text "Born in north-east France, Soyer trained as a"
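Once the server reports READY, it can also be smoke-tested over the HTTP and metrics ports shown in the log above. The health and metrics routes are standard Triton endpoints; the input names and shapes in the infer call (text_input, max_tokens, bad_words, stop_words) are assumptions taken from the default inflight_batcher_llm ensemble configuration and may differ by version.

    # Standard Triton readiness probe and metrics scrape (ports from the log above).
    curl -v localhost:8000/v2/health/ready
    curl -s localhost:8002/metrics | head

    # Sketch of a KServe-style infer request against the ensemble model;
    # input names and shapes are assumed from the default ensemble config.
    curl -s -X POST localhost:8000/v2/models/ensemble/infer -d '{
      "inputs": [
        {"name": "text_input", "datatype": "BYTES", "shape": [1, 1], "data": ["Born in north-east France, Soyer trained as a"]},
        {"name": "max_tokens", "datatype": "INT32", "shape": [1, 1], "data": [30]},
        {"name": "bad_words",  "datatype": "BYTES", "shape": [1, 1], "data": [""]},
        {"name": "stop_words", "datatype": "BYTES", "shape": [1, 1], "data": [""]}
      ]
    }'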