Kubeflow Training Operators

MPIJob Operator, TFJob Operator, PyTorchJob Operator

Kubeflow Training Operators include the MPIJob Operator, TFJob Operator, and PyTorchJob Operator. They provide distributed training for ML models on Kubernetes, which can reduce the time training takes.

To understand the Training Operators, you first need to know the Kubernetes Operator concept; see the [Kubernetes Operator] article.

The sections below look at each Training Operator that Kubeflow provides.

MPIJob Operator

With the MPIJob Operator, you can easily run Horovod-based All Reduce distributed training on Kubernetes.

What sets the MPIJob Operator apart from the TFJob Operator and PyTorchJob Operator is that code written in a variety of deep learning frameworks, such as TensorFlow, Keras, PyTorch, and MXNet, can run as distributed training with only minor modifications.

The MPIJob Operator components are split by role into a Launcher and Workers, both of which are defined in fields under spec in the MPIJob CRD.
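As a rough illustration of how the Launcher and Worker roles appear under spec, the following sketch builds an MPIJob manifest as a plain Python dict. It assumes the `kubeflow.org/v1` layout (`mpiReplicaSpecs`, `slotsPerWorker`); the job name, image, container name, and training command are hypothetical placeholders, and the field names should be checked against the CRD version installed in your cluster.

```python
import json


def build_mpijob(name, image, workers, slots_per_worker=1):
    """Return an MPIJob manifest as a dict (kubeflow.org/v1 layout assumed)."""

    def replica(replicas, command=None):
        # Container name "mpi" is illustrative, not mandated by this sketch.
        container = {"name": "mpi", "image": image}
        if command:
            container["command"] = command
        return {
            "replicas": replicas,
            "template": {"spec": {"containers": [container]}},
        }

    total_slots = workers * slots_per_worker
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "slotsPerWorker": slots_per_worker,
            "mpiReplicaSpecs": {
                # The Launcher runs mpirun; the Workers host the training processes.
                "Launcher": replica(
                    1, ["mpirun", "-np", str(total_slots), "python", "/train.py"]
                ),
                "Worker": replica(workers),
            },
        },
    }


# Hypothetical job: 2 workers with 4 slots each.
manifest = build_mpijob("horovod-demo", "example.com/horovod-train:latest", 2, 4)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and submitted with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.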

To understand the MPIJob Operator, you first need to understand Horovod-based All Reduce distributed training; see the [Distributed Training] article.
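For intuition only, the sketch below simulates the result of an averaging all-reduce in plain Python: each worker starts with its own gradient vector, and afterwards every worker holds the same element-wise mean. It does not model the communication pattern Horovod actually uses between workers, and the function name is made up for this example.

```python
def allreduce_mean(worker_grads):
    """Simulate an averaging all-reduce.

    worker_grads: one gradient list per worker, all of the same length.
    Returns a new list in which every worker holds the element-wise mean.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    mean = [sum(g[i] for g in worker_grads) / n for i in range(length)]
    # Every worker receives its own copy of the reduced result.
    return [list(mean) for _ in range(n)]


grads = [[1.0, 2.0], [3.0, 4.0]]  # gradients computed by two workers
synced = allreduce_mean(grads)
print(synced)  # [[2.0, 3.0], [2.0, 3.0]]
```

Because every worker applies the same averaged gradient, the model replicas stay identical after each step without a central parameter server.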

MPIJob execution procedure

When an MPIJob CR changes, the MPIJob Controller in the MPIJob Operator runs the MPIJob through the following steps.

1) Create a ConfigMap. The ConfigMap contains the hostfile and the kubexec.sh shell script used by mpirun. The hostfile entry lists the Worker StatefulSet pods, and the kubexec.sh entry defines a script that registers hostnames in the /etc/hosts file.
2) Create a Role, RoleBinding, and ServiceAccount.
3) Create the Worker StatefulSet pods.
4) Wait until the Worker StatefulSet pods are ready.
5) Create the Launcher Job, set the environment variables needed to run mpirun remotely, and run the mpirun command. When the Launcher Job pod is initialized, kubectl is copied into it, and kubexec.sh is then run against each Worker StatefulSet pod.
6) When the Launcher Job completes, set the Worker StatefulSet replicas to 0. For example, if workerReplicas was set to 2, it is changed to 0.
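Two of the steps above lend themselves to a small sketch: generating the hostfile stored in the ConfigMap (step 1) and deciding the Worker replica count once the Launcher finishes (step 6). The code assumes the Worker StatefulSet is named `<job-name>-worker`, so that its pods are `<job-name>-worker-0`, `<job-name>-worker-1`, and so on, and that each hostfile line uses the `slots=` form that mpirun hostfiles accept. The function names are hypothetical and this is not the controller's actual code.

```python
def worker_pod_names(job_name, worker_replicas):
    """StatefulSet pods are named <statefulset-name>-<ordinal>."""
    return [f"{job_name}-worker-{i}" for i in range(worker_replicas)]


def build_hostfile(job_name, worker_replicas, slots_per_worker):
    """Build the hostfile text: one worker pod per line with its slot count."""
    lines = [
        f"{pod} slots={slots_per_worker}"
        for pod in worker_pod_names(job_name, worker_replicas)
    ]
    return "\n".join(lines) + "\n"


def desired_worker_replicas(worker_replicas, launcher_done):
    """Step 6: scale the Worker StatefulSet to 0 once the Launcher Job completes."""
    return 0 if launcher_done else worker_replicas


print(build_hostfile("horovod-demo", 2, 4), end="")
print(desired_worker_replicas(2, launcher_done=True))
```

Scaling the StatefulSet to 0 rather than deleting it releases the workers' resources while keeping the object around as a record of the finished job.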

Source: https://github.com/kubeflow/community/blob/master/proposals/mpi-operator-proposal.md

TFJob Operator

The TFJob Operator is a Kubeflow Operator that runs TensorFlow distributed training Jobs on Kubernetes. It makes it easy to build the cluster environment for distributed training and to run the training inside that cluster. The following is the process for running TensorFlow distributed training with a TFJob CR.

1) The user writes a TFJob CR based on the tfjobs.kubeflow.org CRD.
2) When the user registers the TFJob CR with kube-apiserver using kubectl, the TFJob Operator reads the TFJob CR spec, builds the cluster that will perform the distributed training, and starts the training once the cluster is ready.
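A minimal sketch of step 1, writing a TFJob CR, is shown here as a Python dict serialized to JSON. It assumes the `kubeflow.org/v1` API with `tfReplicaSpecs` and a training container named `tensorflow`; the job name, image, and replica counts are hypothetical.

```python
import json


def build_tfjob(name, image, workers, ps=0):
    """Return a TFJob manifest as a dict (kubeflow.org/v1 layout assumed)."""

    def replica(replicas):
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {"containers": [{"name": "tensorflow", "image": image}]}
            },
        }

    specs = {"Worker": replica(workers)}
    if ps > 0:
        # Parameter servers are optional; add them only when requested.
        specs["PS"] = replica(ps)
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {"tfReplicaSpecs": specs},
    }


# Hypothetical job: 2 workers and 1 parameter server.
manifest = build_tfjob("mnist-demo", "example.com/tf-mnist:latest", workers=2, ps=1)
print(json.dumps(manifest, indent=2))
```

For step 2, the JSON output can be saved to a file such as `tfjob.json` and registered with `kubectl apply -f tfjob.json`.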

PyTorchJob Operator

The PyTorchJob Operator is a Kubeflow Operator that runs PyTorch distributed training Jobs on Kubernetes. Its training flow is similar to that of TFJob.
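Because the flow mirrors TFJob, a PyTorchJob CR can be sketched the same way. The example assumes the `kubeflow.org/v1` API, where replicas are declared under `pytorchReplicaSpecs` as a single Master plus Workers and the training container is named `pytorch`; the job name and image are hypothetical.

```python
import json


def build_pytorchjob(name, image, workers):
    """Return a PyTorchJob manifest as a dict (kubeflow.org/v1 layout assumed)."""

    def replica(replicas):
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {"containers": [{"name": "pytorch", "image": image}]}
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),  # exactly one Master in this sketch
                "Worker": replica(workers),
            }
        },
    }


# Hypothetical job: 1 master and 2 workers.
manifest = build_pytorchjob("pytorch-demo", "example.com/pytorch-train:latest", 2)
print(json.dumps(manifest, indent=2))
```

As with a TFJob, the resulting manifest is registered with `kubectl apply -f`, after which the operator creates the Master and Worker pods.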

References

https://github.com/kubeflow/
https://www.kubeflow.org/docs/components/training/
