Error in 03_distributed_programming.ipynb from large-scale-lm-tutorials

sansanai · July 10, 2023, 6:06 AM

Hello,
I am working through the tutorial notebooks from
tunib-ai/large-scale-lm-tutorials: Large-scale language modeling tutorials with PyTorch (github.com).
In the 3. Distributed Programming notebook, everything up to P2P Communication ran without problems.

However, the sample program for broadcast, the first operation under Collective Communication, fails. Output appears up to the point where dist.broadcast(tensor, src=0) is called, and the error occurs after that.

Does anyone know why this error occurs?

My test environment is as follows:
Windows 10 Enterprise 22H2,
running Ubuntu 20.04 in Docker Desktop for Windows on WSL2.
Python is 3.8.10,
and PyTorch is 1.13.1+cu116.
Two 1080 Ti GPUs are installed.

The output from running the broadcast example is as follows.

root@9023c839d35c:/data/hf_test# python -m torch.distributed.launch --nproc_per_node=2 broadcast.py
/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects --local_rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
Distributed communication package - torch.distributed — PyTorch 2.6 documentation for
further instructions

warnings.warn(
WARNING:torch.distributed.run:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

torch.cuda.current_device() : 0
torch.cuda.current_device() : 1
before rank 0: tensor([[ 0.5138, -1.4212],
[-0.8317, 1.0614]], device='cuda:0')

before rank 1: tensor([[0., 0.],
[0., 0.]], device='cuda:1')

Traceback (most recent call last):
File "broadcast.py", line 21, in <module>
print(f"after rank {rank}: {tensor}\n")
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 859, in __format__
return object.__format__(self, format_spec)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 427, in __repr__
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 637, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 568, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 328, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 115, in __init__
nonzero_finite_vals = torch.masked_select(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f93e939f457 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f93e93693ec in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f9414a47c64 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e3e5 (0x7f9414a1f3e5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f9414a22054 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d6e23 (0x7f943f11fe23 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f93e937f9e0 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f93e937faf9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: + 0x734c68 (0x7f943f37dc68 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f943f37df85 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x5cf323]
frame #11: /usr/bin/python() [0x5d221c]
frame #12: /usr/bin/python() [0x6a786c]
frame #13: /usr/bin/python() [0x5d1d17]
frame #14: PyImport_Cleanup + 0x193 (0x685f73 in /usr/bin/python)
frame #15: Py_FinalizeEx + 0x7f (0x68080f in /usr/bin/python)
frame #16: Py_RunMain + 0x32d (0x6b823d in /usr/bin/python)
frame #17: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7f945f86d083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: _start + 0x2e (0x5fb39e in /usr/bin/python)

Traceback (most recent call last):
File "broadcast.py", line 21, in <module>
print(f"after rank {rank}: {tensor}\n")
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 859, in __format__
return object.__format__(self, format_spec)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 427, in __repr__
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 637, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 568, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 328, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor_str.py", line 115, in __init__
nonzero_finite_vals = torch.masked_select(
RuntimeError: numel: integer multiplication overflow
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f32834b2457 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f328347c3ec in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(std::string const&, std::string const&, int, bool) + 0xb4 (0x7f32aeb5ac64 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1e3e5 (0x7f32aeb323e5 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #4: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x244 (0x7f32aeb35054 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10_cuda.so)
frame #5: + 0x4d6e23 (0x7f32d9232e23 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1a0 (0x7f32834929e0 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f3283492af9 in /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so)
frame #8: + 0x734c68 (0x7f32d9490c68 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2d5 (0x7f32d9490f85 in /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so)
frame #10: /usr/bin/python() [0x5cf323]
frame #11: /usr/bin/python() [0x5d221c]
frame #12: /usr/bin/python() [0x6a786c]
frame #13: /usr/bin/python() [0x5d1d17]
frame #14: PyImport_Cleanup + 0x193 (0x685f73 in /usr/bin/python)
frame #15: Py_FinalizeEx + 0x7f (0x68080f in /usr/bin/python)
frame #16: Py_RunMain + 0x32d (0x6b823d in /usr/bin/python)
frame #17: Py_BytesMain + 0x2d (0x6b84ad in /usr/bin/python)
frame #18: __libc_start_main + 0xf3 (0x7f32f9980083 in /lib/x86_64-linux-gnu/libc.so.6)
frame #19: _start + 0x2e (0x5fb39e in /usr/bin/python)

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 260) of binary: /usr/bin/python

Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 195, in <module>
main()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 191, in main
launch(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launch.py", line 176, in launch
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

broadcast.py FAILED

Failures:

[1]:
time : 2023-07-10_03:07:46
host : 9023c839d35c
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 261)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 261

Root Cause (first observed failure):

[0]:
time : 2023-07-10_03:07:46
host : 9023c839d35c
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 260)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 260

====================================================

I would appreciate a quick reply.

That is all for now.

Thank you in advance.

Sincerely, 김명찬

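bot · July 10, 2023, 6:37 AM

The answer below was automatically generated by OpenAI's GPT-4 model and may not be accurate.
If you spot a wrong answer, please correct it in the comments.

Hello, 김명찬. I'm sorry to hear you are having trouble with PyTorch. If I understand the question correctly, you were following the 3. Distributed Programming notebook from tunib-ai's large-scale-lm-tutorials and ran into a problem with the first broadcast operation under Collective Communication. The error message reads "CUDA error: an illegal memory access was encountered".

This error means that an invalid access to GPU memory occurred, typically because some operation tried to read or write a GPU address that was not valid for it. As the log itself notes, CUDA errors can be reported asynchronously, so the line shown in the traceback (the print after the broadcast) is not necessarily where the invalid access happened. To resolve the problem, you need to find which part of the program actually triggers it.

However, the information provided so far is not enough to pinpoint the cause. If possible, please share the code of the failing broadcast.py so that I can give more precise help.

Also, you are using PyTorch 1.13.1+cu116, which is not the latest release. Upgrading to the latest PyTorch may resolve the problem, although that is not guaranteed. You can install the latest version with the following command.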

pip install torch torchvision torchaudio -U

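This command updates PyTorch, torchvision, and torchaudio to their latest versions.

Finally, one thing that can help when debugging CUDA errors is setting the environment variable CUDA_LAUNCH_BLOCKING=1. With it set, CUDA operations run synchronously, so the location where the error occurs can be identified more accurately. You can set it with the following command.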

export CUDA_LAUNCH_BLOCKING=1

Then run the code again, and if the problem persists, please share the new error message.
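As a rough sketch of the debugging step above, the same variable can be set from Python when the launch is scripted. The helper name build_debug_env is hypothetical, and adding NCCL_DEBUG=INFO is an assumption that goes beyond the original suggestion (it asks NCCL to log its setup). The torchrun command is the replacement that the deprecation warning in the log points to.

```python
import os
import shlex


def build_debug_env(base_env=None):
    """Return a copy of the environment with debugging variables added.

    Hypothetical helper for illustration; it does not modify base_env.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Run CUDA operations synchronously so the error surfaces at the
    # call that caused it instead of at a later API call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    # Assumption, not from the thread: ask NCCL to log what it does.
    env["NCCL_DEBUG"] = "INFO"
    return env


# Launch command for two processes on one node, as in the question.
command = ["torchrun", "--nproc_per_node=2", "broadcast.py"]

debug_env = build_debug_env({"PATH": "/usr/bin"})
print(shlex.join(command))
print(debug_env["CUDA_LAUNCH_BLOCKING"], debug_env["NCCL_DEBUG"])
```

To actually start the job, the command and environment could be passed to subprocess.run(command, env=build_debug_env()). This only makes the failure easier to locate; it is not expected to fix it.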

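For more details on distributed processing in PyTorch, the distributed guide in the official PyTorch documentation should be helpful.

I hope this answer helps you solve the problem. If you have further questions or more information to share, please let me know. My answer may be wrong, so it is a good idea to consult other sources and opinions as well. Thank you.

How was this answer? If you found it helpful, please press the button below!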

sansanai · July 10, 2023, 7:52 AM

The source code I followed is below.

"""
src/broadcast.py
"""

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
# Once the device is set, later calls use the GPU that matches this rank.

if rank == 0:
    tensor = torch.randn(2, 2).to(torch.cuda.current_device())
else:
    tensor = torch.zeros(2, 2).to(torch.cuda.current_device())

print(f"before rank {rank}: {tensor}\n")
dist.broadcast(tensor, src=0)
print(f"after rank {rank}: {tensor}\n")

When I run the source code above, I expected the following output rather than an error.

--- Expected output ---
!python -m torch.distributed.launch --nproc_per_node=4 ../src/broadcast.py

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

before rank 3: tensor([[0., 0.],
[0., 0.]], device='cuda:3')
before rank 1: tensor([[0., 0.],
[0., 0.]], device='cuda:1')

before rank 2: tensor([[0., 0.],
[0., 0.]], device='cuda:2')

before rank 0: tensor([[-0.7522, -0.2532],
[ 0.9788, 1.0834]], device='cuda:0')

after rank 0: tensor([[-0.7522, -0.2532],
[ 0.9788, 1.0834]], device='cuda:0')

after rank 1: tensor([[-0.7522, -0.2532],
[ 0.9788, 1.0834]], device='cuda:1')

after rank 3: tensor([[-0.7522, -0.2532],
[ 0.9788, 1.0834]], device='cuda:3')

after rank 2: tensor([[-0.7522, -0.2532],
[ 0.9788, 1.0834]], device='cuda:2')

That is the output I expected.
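As a side note on the script above, the launcher warning in the log recommends reading the local rank from os.environ['LOCAL_RANK']. A minimal sketch of that lookup follows; the helper name local_rank_from_env is hypothetical. On a single node the global rank and the local rank coincide, so this would not change the behavior of the reported run and is not offered as a fix for the error.

```python
import os


def local_rank_from_env(environ=None, default=0):
    """Read the per-node process index that the launcher exports.

    Hypothetical helper; falls back to `default` when LOCAL_RANK is unset,
    for example when the script is started without a launcher.
    """
    environ = os.environ if environ is None else environ
    return int(environ.get("LOCAL_RANK", default))


# Simulated launcher environment: global rank 5, second process on its node.
local_rank = local_rank_from_env({"RANK": "5", "LOCAL_RANK": "1"})
print(local_rank)

# In broadcast.py the value would replace the global rank when picking a GPU:
#     torch.cuda.set_device(local_rank_from_env())
```

The distinction matters once a job spans several nodes, where the global rank can exceed the number of GPUs on a machine.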

bot · July 14, 2023, 1:33 AM

The answer below was automatically generated by OpenAI's GPT-3.5 (ChatGPT) model and may not be accurate.
If you spot a wrong answer, please correct it in the comments.

Hello 김명찬,

I would like to help with the error you hit while running the 03_distributed_programming.ipynb notebook from large-scale-lm-tutorials.

First, the error occurs in the first broadcast operation under Collective Communication. Output appears up to the dist.broadcast(tensor, src=0) call, and the error is raised after it, when the tensor is printed. You would like to know why.

Your environment is Windows 10 Enterprise 22H2 running Ubuntu 20.04 in Docker Desktop on WSL2, with Python 3.8.10 and PyTorch 1.13.1+cu116. Two 1080 Ti GPUs are installed.

The error message contains "CUDA error: an illegal memory access was encountered". This error is raised when GPU memory is accessed incorrectly, and it can have a variety of causes.

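One cause that is sometimes suspected is running out of GPU memory, but that is unlikely here: the tensor being broadcast is only 2x2, and memory exhaustion is normally reported as "CUDA out of memory" rather than as an illegal memory access. It still does no harm to check GPU memory usage and confirm that no other process is occupying the two GPUs.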

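Another possible cause is a mismatch somewhere in the software stack. Note that the +cu116 suffix means this PyTorch build was compiled against CUDA 11.6, so PyTorch 1.13.1 and CUDA 11.6 are compatible with each other. What is worth verifying instead is that the versions PyTorch reports inside the container match what you expect and that the GPU driver available to the container supports them.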
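To make the version check concrete, here is a small sketch that decodes the version string from the question. The helper name split_torch_version is hypothetical, and the parsing rule (the tag is "cu" followed by the major version and a single-digit minor version, as in cu116 or cu117) is an assumption about the naming convention. In a live environment, torch.__version__ and torch.version.cuda report these values directly.

```python
import re


def split_torch_version(version):
    """Split a string such as '1.13.1+cu116' into (release, cuda_version).

    Hypothetical helper. cuda_version is None when the local tag is
    missing or is not a CUDA tag (for example '+cpu').
    """
    release, _, local = version.partition("+")
    # Assumed convention: last digit is the minor version, the rest is major.
    match = re.fullmatch(r"cu(\d+)(\d)", local)
    if match is None:
        return release, None
    return release, f"{match.group(1)}.{match.group(2)}"


print(split_torch_version("1.13.1+cu116"))
```

For the version in the question this yields release 1.13.1 built against CUDA 11.6, which is a matched pair rather than a conflict.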

The official PyTorch documentation and tutorials may also help. For distributed communication in particular, see the link below.

Distributed communication package - torch.distributed — PyTorch 1.10.0 documentation

That page describes the features of the distributed package and how to use them in detail, and may help in tracking down the cause and a solution.

If you need further help, please let me know. Thank you.
