
AI System

NSCC Training, 17 December 2020

NSCC AI System


Expectations

The DGX-1 nodes are most suited to large, batch workloads

e.g. training complex models with large datasets

We encourage users to do development and preliminary testing on local resources

Users are encouraged to use the optimized NVIDIA GPU Cloud Docker images

Utilisation

Access is through the PBS job scheduler

We encourage workloads which can scale up to utilise all 8 GPUs on a node or run across multiple nodes

Users can request fewer than 8 GPUs

Multiple jobs will then run on a node with GPU resource isolation (using cgroups)

You will only see the number of GPUs you request

System Overview

Main components: login nodes, DGX-1 nodes, InfiniBand network, PBS job scheduler, storage

Login addresses (NSCC networks):

aspire.nscc.sg (NSCC VPN) - external outgoing access

astar.nscc.sg - external outgoing access

ntu.nscc.sg - no internet access

nus.nscc.sg - no internet access

Login hosts: nscc0[1-2], nscc0[3-4], ntu0[1-4], nus0[1-4]

On NUS and NTU login nodes, for external outgoing access: ssh nscc04-ib0

DGX-1 nodes dgx410[1-6]: no direct incoming access, external outgoing access

Project ID

Project IDs provide access to computational resources and project storage.

AI project allocations are in GPU hours

Only AI project codes can run on the dgx queues

In the following material, where you see $PROJECT, replace it with the code for your project; for example, the stakeholder pilot project code was 41000001

Filesystems

There are multiple filesystems available on the NSCC systems

/home     GPFS filesystem exported to the DGX nodes as an NFS filesystem

/scratch  high-performance Lustre filesystem

/raid     local SSD filesystem on each of the DGX nodes

I/O-intensive workloads should use either the Lustre /scratch filesystem or the local SSD /raid filesystem

Path                     Login nodes  DGX host O/S  DGX containers  Description
/home/users/ORG/USER     YES          YES           YES             Home directory ($HOME), 50 GB limit
/home/projects/$PROJECT  YES          YES           YES             Project directory, larger storage limits
/scratch/users/ORG/USER  YES          YES           YES             High-performance Lustre filesystem, soft-linked to $HOME/scratch. No quota; purged when the filesystem is full.
/raid/users/ORG/USER     NO           YES           YES             Local SSD filesystem on each DGX node; 7 TB, only visible on that specific node. No quota; purged when the filesystem is full.

Filesystems

The /home filesystem (home and project directories) is mounted and visible on all login and DGX nodes and inside Docker containers. This filesystem should be used for storing job scripts, logs and archival of inactive datasets. Active datasets which are being used in calculations should be placed on either the Lustre /scratch filesystem or the local SSD /raid filesystem.

Intensive I/O workloads on large datasets should use the Lustre filesystem. The Lustre /scratch directory is now mounted directly on the DGX nodes and automatically mounted inside Docker containers (previously it was only visible on login nodes and mounted into Docker containers).

The local SSD /raid filesystem is fast but only visible on a specific DGX node. It can be used for temporary files during a run or for static long-term datasets.

Datasets with very large numbers of small files (e.g. 100,000 files of approx. 1 kB each) MUST use the local SSD (/raid) filesystem or the Lustre (/scratch) filesystem.

Network filesystems (/home & /scratch) are not suited to datasets with very large numbers of small files because metadata operations on network filesystems are slow.
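As a concrete illustration of this placement advice, a minimal job-script sketch that stages an archived dataset onto the node-local SSD before training and copies results back afterwards (the dataset path, output path and training script are hypothetical placeholders):

# stage data onto the node-local SSD (fast, but specific to this DGX node)
STAGE=/raid/users/$USER/run.$PBS_JOBID      # adjust to your actual /raid/users/ORG/USER path
mkdir -p "$STAGE"
cp -a "$HOME/datasets/mydata" "$STAGE/"     # hypothetical dataset location in /home

# run the training against the local copy (hypothetical script)
python train.py --data "$STAGE/mydata" --out "$STAGE/results"

# copy results back to the project directory and clean up the SSD
cp -a "$STAGE/results" "/home/projects/$PROJECT/"
rm -rf "$STAGE"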

PBS Queue Configuration

User queues: dgx, dgx-dev

Execution queues: dgx-03g-04h, dgx-03g-24h, dgx-48g-04h, dgx-48g-24h

Per-user run limits, per-queue run limits and node assignment are used to control utilisation

Shorter queues have higher priority

dgx-dev: half of a node for shared interactive testing & development

Typical PBS Node Configuration

In a typical configuration the batch queues (dgx-48g-*, dgx-03g-*) are assigned dgx4101, dgx4102, dgx4103, dgx4104, dgx4105 and 4 GPUs of dgx4106, while dgx-dev uses the remaining 4 GPUs of dgx4106

Different queues can access different sets of nodes

Shorter queues have been given higher priority

Queue limits on the 48 hour queue are very strict, so wait times in that queue are extremely long (throughput is much better in the 4 hour and 24 hour queues)

Configuration may be changed to match requirements based on the load in the queues

Interactive Use – Access

Shared access to half of a DGX node (4 GPUs) is available for testing of workflows before submission to the batch queues

To open an interactive session use the following qsub command from a login node:

user@nscc:~$ qsub -I -q dgx-dev -l walltime=8:00:00 -P $PROJECT

# $PROJECT = 41000001 or 22270170

Resources are shared between all users; check activity before use

Usage of the dgx-dev queue is not charged against your project quota
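One simple way to check activity once the interactive session starts (a minimal sketch; nvidia-smi is provided by the NVIDIA driver on the DGX nodes):

# show per-GPU utilisation and memory in use before starting your own work
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv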

Interactive Use – Docker

To run an interactive session in a Docker container, add the "-t" flag to the "nscc-docker run" command:

user@dgx:~$ nscc-docker run -t nvcr.io/nvidia/tensorflow:latest
$ ls
README.md  docker-examples  nvidia-examples
$ tty
/dev/pts/0

The -t flag will cause the job to fail if used in a batch script; only use it for interactive sessions:

user@dgx:~$ echo tty | nscc-docker run -t nvcr.io/nvidia/tensorflow:latest
the input device is not a TTY

Batch scheduler

Accessing the batch scheduler generally involves 3 commands:

Submitting a job:             qsub
Querying the status of a job: qstat
Killing a job:                qdel

qsub job.pbs        # submit a PBS job script to the scheduler
qstat               # query the status of your jobs
qdel 11111.wlm01    # terminate the job with id 11111.wlm01

See https://help.nscc.sg/user-guide/ for more information on how to use the PBS scheduler

Introductory workshops are held regularly; more information at https://www.nscc.sg/hpc-calendar/

Example PBS Job Script (Headers)

#!/bin/sh

## Lines which start with #PBS are directives for the scheduler
## Directives in job scripts are superseded by command line options passed to qsub

## The following line requests the resources for 1 DGX node
#PBS -l select=1:ncpus=40:ngpus=8

## Run for 1 hour, modify as required
#PBS -l walltime=1:00:00

## Submit to the correct queue for DGX access
#PBS -q dgx

## Specify project ID
# Replace $PROJECT with a Project ID such as 41000001 or 22270170
#PBS -P $PROJECT

## Job name
#PBS -N mxnet

## Merge standard output and error from the PBS script
#PBS -j oe

Example PBS Script (Commands)

# Change to the directory where the job was submitted
cd "$PBS_O_WORKDIR" || exit $?

# Specify which Docker image to use for the container
image="nvcr.io/nvidia/tensorflow:latest"

# Pass the commands that you wish to run inside the container to the standard input of "nscc-docker run"
nscc-docker run $image < stdin > stdout.$PBS_JOBID 2> stderr.$PBS_JOBID
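Putting the header and command slides together, a complete job script might look like the following sketch; the heredoc replaces the separate "stdin" file, and the commands inside it (nvidia-smi and a TensorFlow version check) are illustrative placeholders:

#!/bin/sh
#PBS -l select=1:ncpus=40:ngpus=8
#PBS -l walltime=1:00:00
#PBS -q dgx
#PBS -P 41000001                  # replace with your own project ID
#PBS -N tf-example
#PBS -j oe

cd "$PBS_O_WORKDIR" || exit $?
image="nvcr.io/nvidia/tensorflow:latest"

# everything between the EOF markers runs inside the container
nscc-docker run "$image" > stdout.$PBS_JOBID 2> stderr.$PBS_JOBID <<EOF
nvidia-smi
python -c "import tensorflow as tf; print(tf.__version__)"
EOF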

Hands-on

/home/projects/ai/examples

Example PBS job scripts to demonstrate how to:

submit a job to run on a DGX-1 node

start a container

run a standard MXNet training job

install a python package inside a container

See https://help.nscc.sg/user-guide/ for more information on how to use the NSCC systems

Hands-on

Step 1: Log on to an NSCC machine

Step 2: Run the following commands and confirm that they work:

cp -a /home/projects/ai/examples .

# submit first basic example
cd examples/1-basic-job && \
qsub submit.pbs

# run a training job
cd ../../examples/2-mxnet-training && \
qsub train.pbs

# install a python package inside container
cd ../../examples/3-pip-install && \
qsub pip.pbs

Use qstat to check job status and, when the jobs have finished, examine the output files to confirm everything is working correctly
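For example, from the directory where the examples were copied (the exact output file names depend on the job names set in each example script, so treat these as illustrative):

qstat -u "$USER"                     # jobs disappear from this listing once they finish
ls -lt examples/*/*.o* 2>/dev/null   # PBS writes <jobname>.o<jobid> files next to each submitted script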

Partial Node Job Submission

Specify the required ngpus resource in the job script:

#PBS -l select=1:ngpus=N:ncpus=5N

where N is the number of GPUs required

e.g. "-l select=1:ngpus=4:ncpus=20"

$ echo nvidia-smi | qsub -l select=1:ncpus=5:ngpus=1 -l walltime=0:05:00 -q fj5 -P 41000001
7590401.wlm01
$ grep Tesla STDIN.o7590401
|   0  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |

$ echo nvidia-smi | qsub -l select=1:ncpus=10:ngpus=2 -l walltime=0:05:00 -q fj5 -P 41000001
7590404.wlm01
$ grep Tesla STDIN.o7590404
|   0  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
|   1  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |

$ echo nvidia-smi | qsub -l select=1:ncpus=20:ngpus=4 -l walltime=0:05:00 -q fj5 -P 41000001
7590408.wlm01
$ grep Tesla STDIN.o7590408
|   0  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
|   1  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
|   2  Tesla V100-SXM2...  On   | 00000000:0B:00.0 Off |                    0 |
|   3  Tesla V100-SXM2...  On   | 00000000:85:00.0 Off |                    0 |

NOTE: the interactive queue (dgx-dev) will still give shared access to a set of GPUs on the test & dev node

Checking where a job is running

4 available options to see which host a job is running on:

$ qstat -f JOBID
Job Id: 7008432.wlm01
<snip>
    comment = Job run at Wed May 30 at 13:25 on (dgx4106:ncpus=40:ngpus=8)
<snip>

$ qstat -wan JOBID
wlm01:
                                                           Req'd  Req'd   Elap
Job ID        Username Queue Jobname SessID NDS TSK Memory Time  S Time
------------- -------- ----- ------- ------ --- --- ------ ----- - --------
7008432.wlm01 fsg3     fj5   STDIN    67452   1  40    --  01:00 R 00:05:09
   dgx4106/0*40

$ pbsnodes -Sj dgx410{1..6}
                                        mem         ncpus  nmics  ngpus
vnode    state     njobs run susp       f/t          f/t    f/t    f/t  jobs
-------- --------- ----- --- ---- ----------- ------ ----- ------ --------
dgx4101  free          0   0    0 504gb/504gb  40/40   0/0    8/8  --
dgx4102  free          0   0    0 504gb/504gb  40/40   0/0    8/8  --
dgx4103  free          0   0    0 504gb/504gb  40/40   0/0    8/8  --
dgx4104  free          0   0    0 504gb/504gb  40/40   0/0    8/8  --
dgx4105  free          0   0    0 504gb/504gb  40/40   0/0    8/8  --
dgx4106  job-busy      1   1    0 504gb/504gb   0/40   0/0    0/8  7008432

$ gstat -dgx    # similar information to the above commands, but shows jobs from all users and is cached so has a quicker response (data may be up to 5 minutes old)

Attaching an ssh Session to a PBS Job

If you ssh to a node where you are running a job, the ssh session will be attached to the cgroup for your job.

If you have multiple jobs running on a node you can select which job to be attached to with the command "pbs-attach"

$ pbs-attach -l    # list available jobs
7590741.wlm01  7590751.wlm01
$ pbs-attach 7590751.wlm01
executing: cgclassify -g devices:/7590751.wlm01 43840
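A quick way to find which node to ssh to, sketched under the assumption that the job id is known (qstat -f reports the execution host; pbs-attach is then only needed if several of your jobs share that node):

JOBID=7590751.wlm01                                            # hypothetical job id
NODE=$(qstat -f "$JOBID" | sed -n 's/.*exec_host = \([^/]*\).*/\1/p')
ssh "$NODE"            # the session is placed in that job's cgroup automatically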

Available workflows

Docker containers (recommended)

Optimized DL frameworks from NVIDIA GPU Cloud (fully supported)

Singularity containers (best effort support)

https://sylabs.io/docs/

Applications installed by the user in the home directory (e.g. Anaconda) (best effort support)

Docker Images

The "nscc-docker images" command shows all images currently in the repository

Currently installed images include:

nvcr.io/nvidia/{pytorch,tensorflow,mxnet}:*

nvcr.io/nvidia/cuda:*

Older images will be removed if they have not been used recently; if you need a specific version it can be pulled on request

Contact help@nscc.sg or https://servicedesk.nscc.sg

NVIDIA GPU Cloud

To see which optimised DL frameworks are available from NVIDIA, create an account on the NVIDIA GPU Cloud (NGC) registry website

Using Docker on the DGX-1

Direct access to the docker command or docker group is not possible for technical reasons

Utilities provide pre-defined, templated Docker commands:

nscc-docker run image
    runs: nvidia-docker run -u $UID:$GID \
              -v /home:/home -v /scratch:/scratch -v /raid:/raid \
              --rm -i --shm-size=1g --ulimit memlock=-1 \
              --ulimit stack=67108864 image /bin/sh

nscc-docker images
    runs: docker images

nscc-docker ps
    runs: docker ps

Docker wrapper

$ nscc-docker run -h
Usage: nscc-docker run [--net=host] [--ipc=host] [--pid=host] [-t] [-h] IMAGE
  --net=host    adds docker option --net=host
  --ipc=host    adds docker option --ipc=host
  --pid=host    adds docker option --pid=host
  -t            adds docker option -t
  -h            display this help and exit
  --help        display this help and exit
  --usage       display this help and exit

The following options are added to the docker command by default:

-u UID:GID --group-add GROUP \
-v /home:/home -v /raid:/raid -v /scratch:/scratch \
--rm -i --ulimit memlock=-1 --ulimit stack=67108864

If --ipc=host is not specified then the following option is also added:

--shm-size=1g
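For instance, frameworks that use a lot of shared memory for data-loader worker processes (PyTorch is the common case) may need host IPC instead of the default 1 GB --shm-size; a sketch using the wrapper flag documented above (the training command and image tag are assumptions):

# run a hypothetical training command with the host IPC namespace
echo "python train.py" | nscc-docker run --ipc=host nvcr.io/nvidia/pytorch:latest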

Singularity

Singularity is an alternative container technology

Can be used as a normal user

Commonly used at HPC sites

Images are flat files (or directories) rather than layers

Latest NGC Docker images converted to Singularity images and available in:

/home/projects/ai/singularity

Example job script in:

/home/projects/ai/examples/singularity

https://www.sylabs.io/docs/

/docker-compatibility-singularity-hpc/
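As a rough sketch of how one of the converted images could be used (the image filename here is a guess; list /home/projects/ai/singularity for the actual names, and see the example job script above for the supported invocation):

# run a command inside a converted NGC image with GPU support (--nv)
singularity exec --nv /home/projects/ai/singularity/tensorflow_latest.simg \
    python -c "import tensorflow as tf; print(tf.__version__)"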

Multinode Training with Horovod

Horovod is a distributed training framework for TensorFlow, Keras, and PyTorch.

Can be used for:

multi-GPU parallelization in a single node

multi-node parallelization across multiple nodes

Uses NCCL and MPI

/uber/horovod

Example job script for multi-node Horovod using Singularity to run across multiple nodes:

/home/projects/ai/examples/horovod
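For orientation, a Horovod launch for 2 DGX nodes x 8 GPUs looks roughly like the sketch below (the script name and MPI flags are illustrative assumptions; the provided example job script in /home/projects/ai/examples/horovod shows the invocation supported on this system):

# one MPI rank per GPU: 16 ranks across the two allocated nodes
mpirun -np 16 --hostfile "$PBS_NODEFILE" \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x PATH -x LD_LIBRARY_PATH \
    python train_hvd.py          # hypothetical Horovod-enabled training script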

Custom Images (Method 1)

User creates the Docker image locally and sends the Dockerfile to NSCC admin:

1. User creates and tests a Dockerfile on a local resource
2. User sends the Dockerfile to NSCC Admin
3. NSCC admin performs "docker build" and synchronizes the image on all DGX nodes
4. User logs in to NSCC and performs "nscc-docker run" on a DGX-1 node

Custom Images (Method 2)

User creates the Docker image locally and pushes the image to Docker Hub:

1. User creates a Dockerfile on a local resource
2. User performs "docker build"
3. User performs "docker push" to Docker Hub
4. User requests NSCC to pull the image
5. NSCC admin performs "docker pull" on all DGX nodes
6. User performs "nscc-docker run" on NSCC

Custom python packages

# "pip install" fails due to a permissions error
# "pip install --user" installs into ~/.local
#   This is not best practice as it is external to the container
#   It can also cause unexpected conflicts
# Use PYTHONUSERBASE to install packages inside the container

nscc-docker run nvcr.io/nvidia/tensorflow:latest <<EOF
mkdir /workspace/.local
export PYTHONUSERBASE=/workspace/.local
pip install --user scikit-learn
EOF

# Packages installed will be wiped out when the container stops
# For a permanent solution build a custom image
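Since the packages vanish when the container exits, install and use them in the same container session; a minimal sketch (the training script is a hypothetical placeholder):

nscc-docker run nvcr.io/nvidia/tensorflow:latest <<EOF
mkdir /workspace/.local
export PYTHONUSERBASE=/workspace/.local
pip install --user scikit-learn
python train.py          # hypothetical script that imports sklearn
EOF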

Custom python packages (virtualenv)

# Install into a virtualenv (not installed in the default image)
nscc-docker run nscc/local/tensorflow:latest <<EOF
virtualenv $HOME/mypython
. $HOME/mypython/bin/activate
pip install scikit-learn
EOF

# The virtualenv is in the home directory so it persists after the container stops
# Therefore the virtualenv can be reused
# Not best practice as it affects portability and replicability

nscc-docker run nscc/local/tensorflow:latest <<EOF
. $HOME/mypython/bin/activate
python script.py
EOF

ssh miscellany

# ProxyCommand can make a 2-hop ssh connection appear direct
# On the local machine do:

cat <<EOF >> .ssh/config
host dgx410?
    ProxyCommand ssh aspire.nscc.sg nc %h %p
    user myusername
host aspire.nscc.sg
    user myusername
EOF
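With that config in place (and myusername replaced by your actual NSCC username), connecting to a DGX node where one of your jobs is running then looks like a direct hop, e.g.:

ssh dgx4106        # tunnelled through aspire.nscc.sg by the ProxyCommand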
