multiprocessing.context.AuthenticationError: digest received was wrong #450

@pra-dan

Search before asking

  • I have searched the RF-DETR issues and found no similar bug report.

Bug

I am trying to fine-tune the Medium model on my dataset, which contains a single class. This is my training script:

from rfdetr import RFDETRMedium

model = RFDETRMedium()

model.train(
    dataset_dir="nov11",
    epochs=100,
    batch_size=4,
    grad_accum_steps=4,
    lr=1e-5,
    num_workers=1, 
    output_dir="rfdet_nov11",
    resolution=1232,
    device='cuda',
    wandb=True,
    project="ball_det",
    early_stopping=True,
    early_stopping_patience=10
)

Training fails with the following error before the first epoch finishes:

...
.0049)  loss_giou_2_unscaled: 0.1385 (0.1908)  cardinality_error_2_unscaled: 0.7500 (1.7778)  loss_ce_enc_unscaled: 0.5104 (0.6461)  loss_bbox_enc_unscaled: 0.0036 (0.0056)  loss_giou_enc_unscaled: 0.1650 (0.2092)  cardinality_error_enc_unscaled: 0.5000 (0.6321)  time: 0.6672  data: 0.0072  max mem: 10285
Traceback (most recent call last):
  File "/home/quidich/Documents/train_rf.py", line 5, in <module>
    model.train(
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/detr.py", line 83, in train
    self.train_from_config(config, **kwargs)
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/detr.py", line 191, in train_from_config
    self.model.train(
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/main.py", line 341, in train
    train_stats = train_one_epoch(
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/engine.py", line 88, in train_one_epoch
    for data_iter_step, (samples, targets) in enumerate(
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/rfdetr/util/misc.py", line 239, in log_every
    for obj in iterable:
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
    data = self._next_data()
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1482, in _next_data
    idx, data = self._get_data()
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1444, in _get_data
    success, data = self._try_get_data()
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1275, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 541, in rebuild_storage_fd
    fd = df.detach()
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/connection.py", line 514, in Client
    deliver_challenge(c, authkey)
  File "/home/quidich/miniconda3/envs/rfdetr/lib/python3.10/multiprocessing/connection.py", line 750, in deliver_challenge
    raise AuthenticationError('digest received was wrong')
multiprocessing.context.AuthenticationError: digest received was wrong

Initially, num_workers was not set in my script. I set it to 1 after first hitting this error, but the error persists.
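
For context on the last frames of the traceback: the exception is raised by the authkey handshake in the standard library's multiprocessing.connection module, which the resource sharer uses when a tensor's file descriptor is passed between processes. The sketch below uses only the standard library (no RF-DETR or PyTorch code) to show that handshake succeeding and failing. The helper name `handshake` and the keys are hypothetical, and the sketch does not claim that mismatched keys are the cause of the failure reported here; it only shows where an AuthenticationError of this kind originates.

```python
import threading
from multiprocessing import AuthenticationError
from multiprocessing.connection import Client, Listener


def handshake(server_key: bytes, client_key: bytes) -> dict:
    """Run one Listener/Client handshake and report the outcome on each side.

    Hypothetical helper for illustration only.
    """
    results = {}
    # Port 0 lets the OS pick a free loopback port.
    listener = Listener(("127.0.0.1", 0), authkey=server_key)

    def serve():
        try:
            conn = listener.accept()  # performs the challenge/response
            conn.close()
            results["server"] = "ok"
        except AuthenticationError as exc:
            results["server"] = str(exc)

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    try:
        Client(listener.address, authkey=client_key).close()
        results["client"] = "ok"
    except AuthenticationError as exc:
        results["client"] = str(exc)
    thread.join(timeout=5)
    listener.close()
    return results


if __name__ == "__main__":
    print(handshake(b"secret", b"secret"))  # both sides authenticate
    print(handshake(b"secret", b"other"))   # both sides report a digest error
```

With matching keys both sides report "ok"; with different keys each side raises AuthenticationError, which is the same exception class shown at the end of the traceback above.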

Environment

RF-DETR 1.3.0
OS Ubuntu 24.04.3
Python 3.10.0
PyTorch 2.9.0
CUDA/cuDNN V12.0.140
GPU 4090Ti

Minimal Reproducible Example

Run the training script above. The dataset is in COCO format.

Additional

No response

Are you willing to submit a PR?

  • Yes, I'd like to help by submitting a PR!

Labels: bug