
[bug] Detectron2 errors when installing on PyTorch DLC #1782

@austinmw

Description

Checklist

Concise Description:

Detectron2 fails at import time when installed on top of the pytorch-training container. The failure appears to be related to smdebug.

How to reproduce:

> nvidia-docker run -it 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker
> pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu113/torch1.10/index.html
> python -c "from detectron2 import model_zoo"

DLC image/dockerfile:

763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.10.2-gpu-py38-cu113-ubuntu20.04-sagemaker

Current behavior:

Traceback:

root@fe0954d71a8e:/# python -c "from detectron2 import model_zoo"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/detectron2/model_zoo/__init__.py", line 8, in <module>
    from .model_zoo import get, get_config_file, get_checkpoint_url, get_config
  File "/opt/conda/lib/python3.8/site-packages/detectron2/model_zoo/model_zoo.py", line 9, in <module>
    from detectron2.modeling import build_model
  File "/opt/conda/lib/python3.8/site-packages/detectron2/modeling/__init__.py", line 2, in <module>
    from detectron2.layers import ShapeSpec
  File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/__init__.py", line 2, in <module>
    from .batch_norm import FrozenBatchNorm2d, get_norm, NaiveSyncBatchNorm, CycleBatchNormList
  File "/opt/conda/lib/python3.8/site-packages/detectron2/layers/batch_norm.py", line 4, in <module>
    from fvcore.nn.distributed import differentiable_all_reduce
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/__init__.py", line 4, in <module>
    from .focal_loss import (
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 52, in <module>
    sigmoid_focal_loss_jit: "torch.jit.ScriptModule" = torch.jit.script(sigmoid_focal_loss)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 1310, in script
    fn = torch._C._jit_script_compile(
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_recursive.py", line 838, in try_compile_fn
    return torch.jit.script(fn, _rcb=rcb)
  File "/opt/conda/lib/python3.8/site-packages/torch/jit/_script.py", line 1310, in script
    fn = torch._C._jit_script_compile(
RuntimeError:
undefined value has_torch_function_variadic:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/smdebug.py", line 2962
        >>> loss.backward()
        """
        if has_torch_function_variadic(input, target, weight, pos_weight):
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
            return handle_torch_function(
                binary_cross_entropy_with_logits,
'binary_cross_entropy_with_logits' is being compiled since it was called from 'sigmoid_focal_loss'
  File "/opt/conda/lib/python3.8/site-packages/fvcore/nn/focal_loss.py", line 36
        targets = targets.float()
        p = torch.sigmoid(inputs)
        ce_loss = F.binary_cross_entropy_with_logits(inputs, targets, reduction="none")
                  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        p_t = p * targets + (1 - p) * (1 - targets)
        loss = ce_loss * ((1 - p_t) ** gamma)
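The traceback above looks like an instance of a general TorchScript failure mode: `torch.jit.script` raises `undefined value ...` when a scripted function references a global name the compiler cannot resolve. A minimal sketch (my own illustration, not code from this image — it assumes only that torch is installed):

```python
import torch

# Hypothetical illustration: scripting a function that calls a name the
# TorchScript compiler cannot resolve fails the same way as the smdebug-patched
# binary_cross_entropy_with_logits does with has_torch_function_variadic.
def calls_unknown(x):
    return some_missing_helper(x)  # deliberately undefined

try:
    torch.jit.script(calls_unknown)
except RuntimeError as e:
    # The compiler reports the unresolved global, e.g.
    # "undefined value some_missing_helper"
    print("undefined value" in str(e))
```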

Expected behavior:

No error on import
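One possible workaround — an assumption based on the traceback (that the conflict comes from smdebug's patched copies of torch functions), not verified on this DLC image:

```shell
# Hypothetical workaround: remove smdebug so its torch patches are not
# applied, then retry the import. Not verified on this image.
pip uninstall -y smdebug
python -c "from detectron2 import model_zoo"
```

Note that uninstalling smdebug disables SageMaker Debugger hooks for the container, so this trades debugger functionality for a working detectron2 import.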
