Question about a CUDA out of memory error

edward0210 · April 15, 2022, 3:00 AM

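PyTorch keeps throwing an out-of-memory error…
I searched around quite a bit, but every case I found was one where the GPU had genuinely run out of memory…
In my case I'm using a 3090, so I have 24 GB.
This is a company PC, so copying the error output is difficult, but typed out roughly it looks like this:
runtime error : cuda out of memory. tried to allocate 20.00mib (gpu 0; 24gib total capacity; 6.57gib already allocate; 11.62gib free; 6.69gib reserved in total by pytorch)
About 11 GB is free and it is only trying to allocate 20 MB, so why does it fail…?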
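
As a rough way to read the numbers in a message like this, the sketch below pulls the fields out with a regular expression and compares them. The function name `parse_oom_message` and the dictionary keys are made up for illustration. The interpretation assumes that "free" and "total capacity" are device-wide figures while "reserved" covers only this PyTorch process, and the message above was retyped by hand, so the values are approximate.

```python
import re

# Convert every unit to MiB so the fields can be compared directly.
_TO_MIB = {"kib": 1 / 1024, "mib": 1.0, "gib": 1024.0}

# Map the labels used in the message to short keys (hypothetical names).
# "cached" is the label older PyTorch versions used for reserved memory.
_KEYS = {
    "total capacity": "total",
    "already allocate": "allocated",
    "free": "free",
    "reserved": "reserved",
    "cached": "reserved",
}

_REQUEST = re.compile(r"tried to allocate\s+(\d+(?:\.\d+)?)\s*([kmg]ib)", re.IGNORECASE)
_FIELD = re.compile(
    r"(\d+(?:\.\d+)?)\s*([kmg]ib)\s+(total capacity|already allocate|free|reserved|cached)",
    re.IGNORECASE,
)


def parse_oom_message(message):
    """Return the sizes found in a CUDA OOM message, in MiB."""
    sizes = {}
    request = _REQUEST.search(message)
    if request:
        sizes["requested"] = float(request.group(1)) * _TO_MIB[request.group(2).lower()]
    for value, unit, label in _FIELD.findall(message):
        sizes[_KEYS[label.lower()]] = float(value) * _TO_MIB[unit.lower()]
    return sizes


msg = (
    "runtime error : cuda out of memory. tried to allocate 20.00mib "
    "(gpu 0; 24gib total capacity; 6.57gib already allocate; "
    "11.62gib free; 6.69gib reserved in total by pytorch)"
)
info = parse_oom_message(msg)

# Memory that is neither free nor reserved by this PyTorch process.
unaccounted = info["total"] - info["free"] - info["reserved"]
print(f"requested: {info['requested']:.2f} MiB")
print(f"not free and not reserved by PyTorch: {unaccounted / 1024:.2f} GiB")
```

With these numbers, a little under 6 GiB is neither free nor reserved by this process. Under the assumptions above, that memory would belong to something outside the PyTorch allocator, such as another process or the CUDA context. It does not by itself explain why a 20 MiB request fails.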

9bow · April 16, 2022, 11:36 PM

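Hello, @edward0210.

I haven't run into a similar problem myself yet, so I did some searching, and there are reports that it was resolved by
(1) reducing the batch size, or
(2) clearing the cache.

How about trying a smaller batch size, or clearing the cache as shown below?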

import torch, gc
gc.collect()
torch.cuda.empty_cache()

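There is a similar question on StackOverflow as well, and the answers there likewise suggest clearing the cache or reducing the batch size.

Please give it a try and let us know the result; it would help other people too. ^^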

edward0210 · April 17, 2022, 12:56 PM

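First of all, thank you for the reply.
I had already tried the methods you mentioned T_T
From what I've seen, those cases usually occur when the amount being allocated really is larger than the free memory on the graphics card…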

Sung_Sue_Hwang · April 18, 2022, 1:24 AM

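I think you need to check whether any other process is using the GPU.
I'm also not sure that the utilization or remaining-memory figures reported in the error message can be trusted.
It would be a good idea to run your code step by step, one line at a time, and check the memory state as you go.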
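
A minimal sketch of such a step-by-step check, assuming a PyTorch version that provides `torch.cuda.mem_get_info`. The helper names `cuda_memory_snapshot` and `log_memory` are hypothetical, and the import is guarded only so the sketch still runs on a machine without PyTorch or CUDA.

```python
try:
    import torch  # guarded so the sketch degrades gracefully without PyTorch
except ImportError:
    torch = None

MIB = 1024 ** 2


def cuda_memory_snapshot(device=0):
    """Return memory figures in MiB, or None when CUDA is not usable."""
    if torch is None or not torch.cuda.is_available():
        return None
    # Device-wide numbers reported by the driver (includes other processes).
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return {
        # Memory held by live tensors in this process.
        "allocated": torch.cuda.memory_allocated(device) / MIB,
        # Memory kept by PyTorch's caching allocator (live tensors plus cache).
        "reserved": torch.cuda.memory_reserved(device) / MIB,
        "free": free_bytes / MIB,
        "total": total_bytes / MIB,
    }


def log_memory(tag, device=0):
    """Print one line per checkpoint so successive readings can be compared."""
    snapshot = cuda_memory_snapshot(device)
    if snapshot is None:
        print(f"[{tag}] CUDA not available")
    else:
        fields = ", ".join(f"{name}={value:.1f} MiB" for name, value in snapshot.items())
        print(f"[{tag}] {fields}")
    return snapshot


log_memory("startup")
```

Calling `log_memory` after model creation, after the forward pass, and after the backward pass shows at which step the numbers jump. A large gap between "total minus free" and "reserved" would suggest memory held outside this process.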

edward0210 · April 18, 2022, 2:51 PM

When I checked with nvidia-smi there weren't any… Is there another way to check?
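
For reference, the per-process check with nvidia-smi can also be scripted so it is easy to repeat while the training job runs. This is a sketch under stated assumptions: it uses the `--query-compute-apps` query of nvidia-smi, the function names are hypothetical, and on some setups the per-process memory column is reported as not available rather than as a number.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, name, used' rows into dictionaries."""
    rows = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue  # skip blank or unexpected lines
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        used = used.strip()
        rows.append({
            "pid": int(pid.strip()),
            "name": name.strip(),
            # None when the driver does not report a number for this process.
            "used_mib": int(used) if used.isdigit() else None,
        })
    return rows


def list_gpu_processes():
    """Return compute processes on the GPUs, or None if the query cannot run."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return None
    return parse_compute_apps(result.stdout)


print(list_gpu_processes())
```

An empty list means no compute process was reported at that moment, so running the check repeatedly during training is more informative than a single reading.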

9bow · April 18, 2022, 11:41 PM

Issue #16417 in the official PyTorch repository seems to be the go-to thread (something of a pilgrimage site ;;) for this problem.

github.com/pytorch/pytorch

RuntimeError: CUDA out of memory. Tried to allocate 12.50 MiB (GPU 0; 10.92 GiB total capacity; 8.57 MiB already allocated; 9.28 GiB free; 4.68 MiB cached)

opened 10:01AM - 27 Jan 19 UTC

closed 06:56AM - 28 Jan 19 UTC

EMarquer

needs reproduction

## CUDA Out of Memory error but CUDA memory is almost empty

I am currently tr…aining a lightweight model on very large amount of textual data (about 70GiB of text). For that I am using a machine on a cluster (['grele' of the grid5000 cluster network](https://www.grid5000.fr/mediawiki/index.php/Nancy:Hardware#grele_.28production_queue.29)).

I am getting after 3h of training this very strange CUDA Out of Memory error message: `RuntimeError: CUDA out of memory. Tried to allocate 12.50 MiB (GPU 0; 10.92 GiB total capacity; 8.57 MiB already allocated; 9.28 GiB free; 4.68 MiB cached)`. According to the message, I have the required space but it does not allocate the memory. Any idea what might cause this ?

For information, my preprocessing relies on `torch.multiprocessing.Queue` and an iterator over the lines of my source data to preprocess the data on the fly.

### Full stacktrace

```
Traceback (most recent call last):
  File "/home/emarquer/miniconda3/envs/pytorch/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/emarquer/miniconda3/envs/pytorch/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/emarquer/miniconda3/envs/pytorch/lib/python3.6/site-packages/memory_profiler.py", line 1228, in <module>
    exec_with_profiler(script_filename, prof, args.backend, script_args)
  File "/home/emarquer/miniconda3/envs/pytorch/lib/python3.6/site-packages/memory_profiler.py", line 1129, in exec_with_profiler
    exec(compile(f.read(), filename, 'exec'), ns, ns)
  File "run.py", line 293, in <module>
    main(args, save_folder, load_file)
  File "run.py", line 272, in main
    trainer.all_epochs()
  File "/home/emarquer/papud-bull-nn/trainer/trainer.py", line 140, in all_epochs
    self.single_epoch()
  File "/home/emarquer/papud-bull-nn/trainer/trainer.py", line 147, in single_epoch
    tracker.add(*self.single_batch(data, target))
  File "/home/emarquer/papud-bull-nn/trainer/trainer.py", line 190, in single_batch
    result = self.model(data)
  File "/home/emarquer/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/emarquer/papud-bull-nn/model/model.py", line 54, in forward
    emb = self.emb(input)
  File "/home/emarquer/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/emarquer/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 118, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/home/emarquer/miniconda3/envs/pytorch/lib/python3.6/site-packages/torch/nn/functional.py", line 1454, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA out of memory. Tried to allocate 12.50 MiB (GPU 0; 10.92 GiB total capacity; 8.57 MiB already allocated; 9.28 GiB free; 4.68 MiB cached)
```

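A great many people have reported the problem and shared fixes there; roughly summarized, the cases seem to be as follows.

There are so many comments that I only got through early 2021, but it would be worth going over the items above and checking each one.

Below I've excerpted a few comments and passages that seem worth consulting.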

From RuntimeError: CUDA out of memory. Tried to allocate 12.50 MiB (GPU 0; 10.92 GiB total capacity; 8.57 MiB already allocated; 9.28 GiB free; 4.68 MiB cached) · Issue #16417 · pytorch/pytorch · GitHub:

Hi guys, I got into this issue many times. Now I'd try to summarize the possible solutions:

If 'CUDA out of memory' error msg pops up even no iteration goes, then check the batch size you set and decrease it.

If 'CUDA out of memory' error msg pops up after some iterations, then check all the variables in the computing graph, and detach the unnecessary variables. I usually run into this situation, e.g., I did not detach the inputs of my measure/loss functions which are not involved in backward propagation.

A practical way to tell what's happening is using nvidia-smi command every few seconds to check whether the CUDA memory usage is increasing. If it is, I think you may run into the second situation.
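
The second situation in the quoted comment can be sketched as follows. This is an illustrative CPU-only toy loop, not code from the issue: the model, data, and variable names are invented, and the point is only that values kept across iterations should be plain numbers or detached tensors.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

running_loss = 0.0  # plain Python float, holds no autograd history
history = []        # detached tensors only

for step in range(3):
    inputs = torch.randn(8, 4)
    targets = torch.randn(8, 1)

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # Pattern to avoid: `running_loss += loss` or `history.append(loss)`.
    # Both keep tensors that still reference autograd history, so memory
    # use can grow with every iteration.
    running_loss += loss.item()    # convert to a Python number
    history.append(loss.detach())  # or keep a tensor cut off from the graph

print(f"mean loss: {running_loss / 3:.4f}")
```

If memory use reported by nvidia-smi keeps climbing across iterations, logged metrics and cached outputs like these are a reasonable first place to look.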

From RuntimeError: CUDA out of memory. Tried to allocate 12.50 MiB (GPU 0; 10.92 GiB total capacity; 8.57 MiB already allocated; 9.28 GiB free; 4.68 MiB cached) · Issue #16417 · pytorch/pytorch · GitHub:

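import gc
import torch

# model_x and model_y are the user's own models, defined elsewhere.

# Evaluate mode
model_x.eval()
model_y.eval()
with torch.no_grad():
    # <validation>
    # <...>
    pass  # placeholder so the block is valid Python; the validation loop goes here

# Train mode
model_x.train()
model_y.train()
# Collect unreachable Python objects, then release cached, unused GPU blocks
gc.collect()
torch.cuda.empty_cache()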

Many people also seemed to run into problems related to item 9 above. The related comments in the same issue (#16417) are worth reading as well.

Alternatively, it is also worth checking whether the machine is running short of main memory (RAM).

github.com/pytorch/pytorch

cuda out of memory , but there is enough memory

opened 04:01AM - 14 Jun 20 UTC

closed 04:07PM - 10 Nov 20 UTC

ahmadalzoubi13579

module: cuda module: memory usage triaged

i am training binary classification model on gpu using pytorch, and i get cuda m…emory error , but i have enough free memory as the message say: error : `RuntimeError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 6.00 GiB total capacity; 693.78 MiB already allocated; 3.91 GiB free; 18.22 MiB cached)` it is trying to allocate 50 MB but i have 3.91 GB free, so what is the problem ??? note: gpu: gtx 1060 6 GB torch: 1.2.0 cc @ngimel

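It won't be easy, but I'd suggest reading through the comments on the issues above and trying, one at a time, anything that applies to your case.
If you do manage to solve it, please share what worked for the benefit of others.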

anstmdwn34 · March 26, 2024, 2:06 AM

This problem had been giving me a headache, but thanks to this thread I got it resolved. Thank you.