Description
I just started working with RF-DETR, and while training is going smoothly, validation crashes on almost every epoch (though not every one), not at a consistent point or with a consistent error; typically it's a torch.distributed.elastic.multiprocessing.errors.ChildFailedError (I am training on two GPUs).
Rather than trying to debug this, I'd like to disable validation entirely during training and then run validation on every checkpoint at the end. Is that possible? I don't see any equivalent of the "run_test" argument (e.g. "run_val" or "disable_val"). If there's no functionality for this, is there a recommended workaround, e.g. is an empty "valid" folder allowed? Or is having a single image in the "valid" folder the closest I can do?
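For context, here is a minimal sketch of the single-image workaround I have in mind, assuming the Roboflow-style COCO layout that RF-DETR expects (a "train" and "valid" folder, each with an "_annotations.coco.json"); the dataset path is a placeholder. It just copies one training image and its annotations into a tiny valid split so per-epoch validation has almost nothing to do:

```python
import json
import shutil
from pathlib import Path

DATASET_DIR = Path("dataset")        # placeholder path to the dataset root
TRAIN_DIR = DATASET_DIR / "train"
VALID_DIR = DATASET_DIR / "valid"
ANN_NAME = "_annotations.coco.json"  # Roboflow COCO export convention

VALID_DIR.mkdir(parents=True, exist_ok=True)

# Load the training annotations and pick a single image to reuse.
with open(TRAIN_DIR / ANN_NAME) as f:
    train_coco = json.load(f)

image = train_coco["images"][0]
image_anns = [a for a in train_coco["annotations"] if a["image_id"] == image["id"]]

# Write a minimal valid split containing only that one image.
valid_coco = {
    "info": train_coco.get("info", {}),
    "licenses": train_coco.get("licenses", []),
    "categories": train_coco["categories"],
    "images": [image],
    "annotations": image_anns,
}
with open(VALID_DIR / ANN_NAME, "w") as f:
    json.dump(valid_coco, f)

# Copy the image file itself into the valid folder.
shutil.copy(TRAIN_DIR / image["file_name"], VALID_DIR / image["file_name"])
```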
Thanks!