Open
Labels
bug: Something isn't working
Description
Search before asking
- I have searched the RF-DETR issues and found no similar bug report.
Bug
I am trying to fine-tune the Medium model on my dataset, which contains a single class. This is my training script:
from rfdetr import RFDETRMedium
model = RFDETRMedium()
model.train(
dataset_dir="nov11",
epochs=100,
batch_size=4,
grad_accum_steps=4,
lr=1e-5,
num_workers=1,
output_dir="rfdet_nov11",
resolution=1232,
device='cuda',
wandb=True,
project="ball_det",
early_stopping=True,
early_stopping_patience=10
)
It runs into this error before finishing the first epoch:
...
.0049) loss_giou_2_unscaled: 0.1385 (0.1908) cardinality_error_2_unscaled: 0.7500 (1.7778) loss_ce_enc_unscaled: 0.5104 (0.6461) loss_bbox_enc_unscaled: 0.0036 (0.0056) loss_giou_enc_unscaled: 0.1650 (0.2092) cardinality_error_enc_unscaled: 0.5000 (0.6321) time: 0.6672 data: 0.0072 max mem: 10285
Traceback (most recent call last):
File "/home/quidich/Documents/train_rf.py", line 5, in <module>
model.train(
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/detr.py", line 83, in train
self.train_from_config(config, **kwargs)
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/detr.py", line 191, in train_from_config
self.model.train(
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/main.py", line 341, in train
train_stats = train_one_epoch(
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/engine.py", line 88, in train_one_epoch
for data_iter_step, (samples, targets) in enumerate(
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/util/misc.py", line 239, in log_every
for obj in iterable:
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
data = self._next_data()
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1482, in _next_data
idx, data = self._get_data()
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1444, in _get_data
success, data = self._try_get_data()
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1275, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/queues.py", line 122, in get
return _ForkingPickler.loads(res)
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
fd = df.detach()
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/resource_sharer.py", line 57, in detach
with _resource_sharer.get_connection(self._id) as conn:
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/resource_sharer.py", line 86, in get_connection
c = Client(address, authkey=process.current_process().authkey)
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/connection.py", line 514, in Client
deliver_challenge(c, authkey)
File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/connection.py", line 750, in deliver_challenge
raise AuthenticationError('digest received was wrong')
multiprocessing.context.AuthenticationError: digest received was wrong
Initially, num_workers was not set in my script. I set it to 1 after getting this error, but the error persists.
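For context, the last frames of the traceback are in Python's standard `multiprocessing.connection` module, where the two ends of a connection verify each other with an HMAC challenge based on a shared `authkey`. The sketch below uses only the standard library to show that check producing the same message. It is not a reproduction of the DataLoader failure and does not establish the cause here; the helper name `handshake_error` and the keys are hypothetical, and in this sketch the error surfaces on the listener side, whereas in the traceback above it is raised on the client side.

```python
import threading
from multiprocessing import AuthenticationError
from multiprocessing.connection import Client, Listener


def handshake_error(server_key: bytes, client_key: bytes) -> str:
    """Return the listener-side handshake error message, or '' on success.

    Hypothetical helper for illustration only; it is not part of RF-DETR
    or PyTorch.
    """

    def connect(address):
        # The client answers the listener's challenge using client_key.
        try:
            with Client(address, authkey=client_key):
                pass
        except AuthenticationError:
            # With mismatched keys the client is rejected as well.
            pass

    # Port 0 asks the OS for any free local port.
    with Listener(("127.0.0.1", 0), authkey=server_key) as listener:
        worker = threading.Thread(target=connect, args=(listener.address,))
        worker.start()
        try:
            # accept() runs the challenge/response check with server_key.
            conn = listener.accept()
            conn.close()
            return ""
        except AuthenticationError as exc:
            return str(exc)
        finally:
            worker.join()


if __name__ == "__main__":
    print(repr(handshake_error(b"key-a", b"key-b")))
    print(repr(handshake_error(b"same-key", b"same-key")))
```

With matching keys the handshake completes and the helper returns an empty string; with different keys the listener's verification fails with the message seen at the end of the traceback.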
Environment
RF-DETR 1.3.0
OS Ubuntu 24.04.3
Python 3.10.0
PyTorch 2.9.0
CUDA/cuDNN V12.0.140
GPU 4090Ti
Minimal Reproducible Example
Just run the training script above. The dataset is in COCO format.
Additional
No response
Are you willing to submit a PR?
- Yes, I'd like to help by submitting a PR!