PyTorch Multi-GPU notes (work in progress)

Terminology

[Node]

(In distributed processing, a machine that has GPUs attached is apparently called a node.)

e.g., one computer means 1 node

e.g., two computers means 2 nodes

[World Size]

Number of processes participating in the job

In other words, the total number of GPUs used for distributed training, assuming the usual setup of one process per GPU.

[Rank]

Rank is the ID of a process running under DistributedDataParallel. Ranks run from 0 to world size - 1.

Global Rank: the process ID across all nodes

Local Rank: the process ID within a single node
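To see how node, world size, and rank fit together, here is a minimal sketch for a hypothetical job with 2 nodes and 4 processes (GPUs) per node. The helper names and the formula `global_rank = node_rank * procs_per_node + local_rank` are assumptions for illustration; this is the layout you commonly get when every node runs the same number of processes.

```python
def world_size(num_nodes: int, procs_per_node: int) -> int:
    """Total number of processes in the job (one process per GPU assumed)."""
    return num_nodes * procs_per_node


def global_rank(node_rank: int, procs_per_node: int, local_rank: int) -> int:
    """Global rank of a process, assuming every node runs procs_per_node processes."""
    return node_rank * procs_per_node + local_rank


# Hypothetical cluster shape, chosen only for this example.
NUM_NODES = 2
PROCS_PER_NODE = 4

# One (node, local rank, global rank) entry per process.
layout = [
    (node, local, global_rank(node, PROCS_PER_NODE, local))
    for node in range(NUM_NODES)
    for local in range(PROCS_PER_NODE)
]

print("world size:", world_size(NUM_NODES, PROCS_PER_NODE))
for node, local, rank in layout:
    print(f"node {node}  local rank {local}  global rank {rank}")
```

With this shape the world size is 8, and the process with local rank 3 on node 1 gets global rank 7.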

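[Local Rank]

The rank of a process within its own node

By convention the local rank doubles as the GPU index, so a process with local rank 0 seems to do its work on GPU 0 of its node.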
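A rough sketch of that convention: launchers such as torchrun export a `LOCAL_RANK` environment variable for each process, and the training script turns it into a device. The helper `device_for_local_rank` below is hypothetical and only builds the device string, so it runs without a GPU.

```python
import os
from typing import Mapping


def device_for_local_rank(env: Mapping[str, str] = os.environ) -> str:
    """Build the device string for this process from LOCAL_RANK (defaults to 0)."""
    local_rank = int(env.get("LOCAL_RANK", "0"))
    return f"cuda:{local_rank}"


# Simulate the environment a launcher would give the second process on a node.
device = device_for_local_rank({"LOCAL_RANK": "1"})
print(device)  # cuda:1
```

In a real script the resulting string would typically be passed to `torch.device(...)`, or the integer to `torch.cuda.set_device(...)`, before building the model.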

Functions

To use distributed processing, the following method has to be called first.

(Initialize default distributed process group)

torch.distributed.init_process_group()

The arguments usually set here are backend, init_method, world_size, and rank.

world_size and rank should simply match the definitions given above.
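As a minimal sketch of the call with all four arguments spelled out, the snippet below starts a process group containing a single process on the local machine. The choices here are assumptions made so that it runs on a CPU-only machine: the gloo backend, a `tcp://` address on 127.0.0.1, and the hypothetical helpers `find_free_port` and `init_and_report`. A real job would start one such process per GPU, each with its own rank.

```python
import socket

import torch.distributed as dist


def find_free_port() -> int:
    """Ask the OS for an unused TCP port (helper for this demo only)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def init_and_report() -> tuple:
    """Initialize a one-process group, read back its settings, then tear it down."""
    dist.init_process_group(
        backend="gloo",  # CPU-friendly backend; GPU training would normally use "nccl"
        init_method=f"tcp://127.0.0.1:{find_free_port()}",
        world_size=1,    # total number of processes in the job
        rank=0,          # global rank of this process
    )
    info = (dist.get_rank(), dist.get_world_size(), dist.get_backend())
    dist.destroy_process_group()
    return info


info = init_and_report()
print(info)
```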

[backend]

The backend options are:

Gloo

MPI

NCCL

NCCL is generally said to be the best choice for GPU-based training.

According to the link below,

NCCL is for distributed GPU training,

Gloo is for distributed CPU training,

and MPI is for high-performance computing setups.

https://pytorch.org/docs/stable/distributed.html
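The rule of thumb above can be written as a small selection function. This is only a sketch: `pick_backend` is a hypothetical helper, and it takes plain booleans so that it runs without PyTorch installed. In a real script the two flags would usually come from `torch.cuda.is_available()` and `torch.distributed.is_nccl_available()`.

```python
def pick_backend(cuda_available: bool, nccl_available: bool) -> str:
    """Return "nccl" for GPU training when NCCL is usable, otherwise fall back to "gloo"."""
    if cuda_available and nccl_available:
        return "nccl"
    return "gloo"


print(pick_backend(cuda_available=True, nccl_available=True))    # nccl
print(pick_backend(cuda_available=False, nccl_available=False))  # gloo
```

MPI is left out of the sketch on purpose, since it is only usable when PyTorch has been built from source against an MPI installation.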


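[init_method]

URL specifying how to initialize the process group

init_method is how the processes find each other.

It tells PyTorch how to set up and verify the process group over the chosen communication backend.

If init_method is not specified, PyTorch falls back to the environment variable initialization method (env://).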
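With `env://`, the rendezvous information is read from environment variables: `MASTER_ADDR` and `MASTER_PORT`, plus `RANK` and `WORLD_SIZE` when they are not passed as arguments. A launcher normally sets these for every process. The sketch below sets them by hand for a single local process, which is an assumption made only so the example can run on its own; `find_free_port` and `init_from_env` are hypothetical helpers.

```python
import os
import socket

import torch.distributed as dist


def find_free_port() -> int:
    """Ask the OS for an unused TCP port (helper for this demo only)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def init_from_env() -> tuple:
    """Initialize through env:// and return (rank, world_size) before cleaning up."""
    # Stand-ins for the values a launcher would export for this process.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(find_free_port())
    os.environ["RANK"] = "0"
    os.environ["WORLD_SIZE"] = "1"

    # No rank or world_size arguments: both are taken from the environment.
    dist.init_process_group(backend="gloo", init_method="env://")
    result = (dist.get_rank(), dist.get_world_size())
    dist.destroy_process_group()
    return result


env_result = init_from_env()
print(env_result)
```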

[torch.distributed.barrier]

Synchronizes all processes.

This collective blocks each process until the whole group has entered the function,

either when async_op is False

or when wait() is called on the async work handle.
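A common use of the barrier is to let rank 0 do some one-off preparation while the other ranks wait. The sketch below shows that pattern under stated assumptions: it runs as a single process (gloo backend, world size 1) so it can be tried on any machine, which means the barrier returns immediately. `prepare_then_sync` and `find_free_port` are hypothetical helpers, and the "preparation" is just a log entry.

```python
import socket

import torch.distributed as dist


def find_free_port() -> int:
    """Ask the OS for an unused TCP port (helper for this demo only)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def prepare_then_sync(log: list) -> None:
    """Rank 0 prepares shared data; every rank waits at the barrier before moving on."""
    if dist.get_rank() == 0:
        # Stand-in for real work such as downloading or preprocessing a dataset.
        log.append("rank 0: prepared shared data")
    dist.barrier()  # blocks until every rank in the group has reached this call
    log.append(f"rank {dist.get_rank()}: past barrier")


dist.init_process_group(
    backend="gloo",
    init_method=f"tcp://127.0.0.1:{find_free_port()}",
    rank=0,
    world_size=1,
)
events = []
prepare_then_sync(events)
dist.destroy_process_group()
print(events)
```

In a multi-process job every rank would run the same function, and no rank would get past `dist.barrier()` until rank 0 had finished its preparation.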

References

https://docs.microsoft.com/ko-kr/azure/machine-learning/how-to-train-distributed-gpu

Distributed GPU training guide - Azure Machine Learning

https://hongl.tistory.com/292

Pytorch - DistributedDataParallel (1) - Overview

https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group

Distributed communication package - torch.distributed - PyTorch 1.10.0 documentation (init_process_group section)

https://pytorch.org/docs/stable/distributed.html

Distributed communication package - torch.distributed - PyTorch 1.10.0 documentation
