2021-11-06 09:52:22,967 - mmdet - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.7.10 (default, Feb 26 2021, 18:47:35) [GCC 7.3.0]
....
2021-11-06 09:52:51,385 - mmdet - INFO - workflow: [('train', 1)], max: 12 epochs
/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /opt/conda/conda-bld/pytorch_1623448265233/work/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/workspace/mmdet/core/anchor/anchor_generator.py:324: UserWarning: ``grid_anchors`` would be deprecated soon. Please use ``grid_priors``
warnings.warn('``grid_anchors`` would be deprecated soon. '
/workspace/mmdet/core/anchor/anchor_generator.py:361: UserWarning: ``single_level_grid_anchors`` would be deprecated soon. Please use ``single_level_grid_priors``
'``single_level_grid_anchors`` would be deprecated soon. '
2021-11-06 09:53:33,805 - mmdet - INFO - Epoch [1][50/29317] lr: 1.978e-03, eta: 3 days, 10:34:55, time: 0.845, data_time: 0.061, memory: 6497, loss_rpn_cls: 0.5175, loss_rpn_bbox: 0.0986, loss_cls: 0.6558, acc: 89.7168, loss_bbox: 0.1074, loss: 1.3793
....
2021-11-06 15:30:17,031 - mmdet - INFO - Epoch [1][25750/29317] lr: 2.000e-02, eta: 2 days, 23:12:32, time: 1.068, data_time: 0.049, memory: 6498, loss_rpn_cls: 0.0657, loss_rpn_bbox: 0.0642, loss_cls: 0.3354, acc: 90.6543, loss_bbox: 0.2861, loss: 0.7514
Killed
While training MMDetection on the COCO dataset, the run kept stopping as shown above: no error message, just a single line reading "Killed". So I searched Google.
While looking, I found the issue below, which resembles my case.
https://github.com/tensorflow/models/issues/3497
According to the people in the linked issue, the problem comes from running out of RAM, not GPU memory.
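One rough way to check the RAM explanation is to watch how much host memory is still available while training runs. The sketch below reads /proc/meminfo (Linux only); the helper names parse_meminfo and available_fraction are made up for this example, and it assumes the kernel reports a MemAvailable field.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only well-formed "Key: <number> [unit]" lines.
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_fraction(info):
    """Fraction of total RAM the kernel estimates is still available."""
    return info["MemAvailable"] / info["MemTotal"]


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    if "MemAvailable" in mem and "MemTotal" in mem:
        print(f"MemTotal:     {mem['MemTotal'] / 1024 ** 2:.1f} GiB")
        print(f"MemAvailable: {mem['MemAvailable'] / 1024 ** 2:.1f} GiB")
        print(f"Available:    {available_fraction(mem):.1%}")
```

Running this periodically in a second shell during training should show whether available memory keeps shrinking before the process is killed. If it does, that would be consistent with the RAM explanation, though it does not prove it.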
I can't buy more RAM right away, so I'll try giving the Docker container more shared memory and test again.
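Before changing the container settings, it may help to see how much shared memory the container currently has. Docker mounts /dev/shm with a 64 MB default unless the container is started with a larger --shm-size, and PyTorch DataLoader worker processes pass batches through shared memory. The sketch below only reports the size of a mount point; usage_gib is a name made up for this example.

```python
import os
import shutil


def usage_gib(path):
    """Return (total, used, free) space for the filesystem at path, in GiB."""
    total, used, free = shutil.disk_usage(path)
    gib = 1024 ** 3
    return total / gib, used / gib, free / gib


if __name__ == "__main__" and os.path.isdir("/dev/shm"):
    total, used, free = usage_gib("/dev/shm")
    print(f"/dev/shm total: {total:.2f} GiB, used: {used:.2f} GiB, free: {free:.2f} GiB")
```

Running it inside the container before and after restarting with a larger --shm-size (for example, docker run --shm-size=8g ..., where the 8g value is only an illustration) should confirm the new size took effect. Note that shared memory and the container's overall RAM limit are separate settings, so this check covers only the former.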