
Commit a4cdf17

[doc] update mvit metafile and readme (#2210)
Parent: 8af2fa9

File tree: 2 files changed (+81, -7 lines)


configs/recognition/mvit/README.md

Lines changed: 9 additions & 3 deletions
```diff
@@ -26,12 +26,12 @@ well as 86.1% on Kinetics-400 video classification.
 ## Results and Models
 
 1. Models with * in `Inference results` are ported from the repo [SlowFast](https://github.com/facebookresearch/SlowFast/) and tested on our data, and models in `Training results` are trained in MMAction2 on our data.
-2. The values in columns named after `reference` are copied from paper, and `reference*` are results use [SlowFast](https://github.com/facebookresearch/SlowFast/) repo and trained on our data.
+2. The values in columns named after `reference` are copied from the paper, and those under `reference*` are results obtained with the [SlowFast](https://github.com/facebookresearch/SlowFast/) repo trained on our data.
 3. The validation set of Kinetics400 we used consists of 19796 videos. These videos are available at [Kinetics400-Validation](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155136485_link_cuhk_edu_hk/EbXw2WX94J1Hunyt3MWNDJUBz-nHvQYhO9pvKqm6g39PMA?e=a9QldB). The corresponding [data list](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_val_list.txt) (each line is of the format 'video_id, num_frames, label_index') and the [label map](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_class2ind.txt) are also available.
-4. MaskFeat fine-tuning experiment is based on pretrain model from [MMSelfSup](https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/projects/maskfeat_video), and corresponding reference result is based on pretrain model from [SlowFast](https://github.com/facebookresearch/SlowFast/).
+4. The MaskFeat fine-tuning experiment is based on the pretrained model from [MMSelfSup](https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/projects/maskfeat_video), and the corresponding reference result is based on the pretrained model from [SlowFast](https://github.com/facebookresearch/SlowFast/).
 5. Due to the different versions of Kinetics-400, our training results are different from paper.
 6. Due to the training efficiency, we currently only provide MViT-small training results, we don't ensure other config files' training accuracy and welcome you to contribute your reproduction results.
-7. We use `repeat augment` in MViT training configs following [SlowFast](https://github.com/facebookresearch/SlowFast/). [Repeat augment](https://arxiv.org/pdf/1901.09335.pdf) takes multiple times of random augment for one video, this way can improve the generalization of model and relieve the IO stress of loading videos. And please note that, the actual batch size is `num_repeats` times of `batch_size` in `train_dataloader`.
+7. We use `repeat augment` in MViT training configs following [SlowFast](https://github.com/facebookresearch/SlowFast/). [Repeat augment](https://arxiv.org/pdf/1901.09335.pdf) applies random augmentation multiple times to each video, which improves the generalization of the model and relieves the IO stress of loading videos. Please note that the actual batch size is `num_repeats` times the `batch_size` in `train_dataloader`.
 
 ### Inference results
 
```
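The repeat-augment note (item 7) can be illustrated with a minimal sketch. This is not MMAction2's actual implementation (which lives in its dataloader/sampler); `repeat_augment` and the toy augmentation below are hypothetical names used only to show why the effective batch size grows by `num_repeats` while IO cost does not:

```python
import random

def repeat_augment(samples, augment, num_repeats=2):
    """Yield `num_repeats` independently augmented copies of each sample.

    Each source sample is produced (e.g. decoded from disk) only once,
    so IO stays constant while the batch grows by `num_repeats`x.
    """
    for sample in samples:
        for _ in range(num_repeats):
            yield augment(sample)

# Toy augmentation: pair each (fake) decoded clip with a random jitter value.
clips = ["clip_a", "clip_b"]
augmented = list(repeat_augment(clips, lambda c: (c, random.random()), num_repeats=2))
print(len(augmented))  # 4: two augmented views per source clip

# The actual batch size is num_repeats * batch_size from train_dataloader.
batch_size, num_repeats = 16, 2
print(batch_size * num_repeats)  # 32
```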

```diff
@@ -60,6 +60,12 @@ well as 86.1% on Kinetics-400 video classification.
 | 16x4x1 | 224x224 | MViTv2-S | From scratch | 80.6 | 94.7 | [80.8](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.6](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 64G | 34.5M | [config](configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb_20230201-23284ff3.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.log) |
 | 16x4x1 | 224x224 | MViTv2-S | K400 MaskFeat | 81.8 | 95.2 | [81.5](https://github.com/facebookresearch/SlowFast/blob/main/projects/maskfeat/README.md) | [94.9](https://github.com/facebookresearch/SlowFast/blob/main/projects/maskfeat/README.md) | 10 clips x 1 crop | 71G | 36.4M | [config](/configs/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb_20230201-5bced1d0.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.log) |
 
+The corresponding result without repeat augment is as follows:
+
+| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference\* top1 acc | reference\* top5 acc | testing protocol | FLOPs | params |
+| :---------------------: | :--------: | :------: | :----------: | :------: | :------: | :--------------------------------------------------: | :--------------------------------------------------: | :--------------: | :---: | :----: |
+| 16x4x1 | 224x224 | MViTv2-S | From scratch | 79.4 | 93.9 | [80.8](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.6](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 64G | 34.5M |
+
 #### Something-Something V2
 
 | frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
```
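Note 3 in the README quotes the per-line layout of the validation data list ('video_id, num_frames, label_index'). A minimal parser for that layout, assuming the comma-separated format exactly as quoted (the `VideoRecord` type and function name are illustrative, not MMAction2 API):

```python
from typing import Iterable, List, NamedTuple

class VideoRecord(NamedTuple):
    video_id: str
    num_frames: int
    label_index: int

def parse_data_list(lines: Iterable[str]) -> List[VideoRecord]:
    """Parse 'video_id, num_frames, label_index' records, one per line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        video_id, num_frames, label_index = (f.strip() for f in line.split(","))
        records.append(VideoRecord(video_id, int(num_frames), int(label_index)))
    return records

records = parse_data_list(["abcd1234, 300, 7", ""])
print(records[0].num_frames)  # 300
print(records[0].label_index)  # 7
```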

configs/recognition/mvit/metafile.yml

Lines changed: 72 additions & 4 deletions
```diff
@@ -6,8 +6,8 @@ Collections:
     Title: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection"
 
 Models:
-  - Name: mvit-small-p244_16x4x1_kinetics400-rgb
-    Config: configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py
+  - Name: mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb_infer
+    Config: configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py
     In Collection: MViT
     Metadata:
       Architecture: MViT-small
```
```diff
@@ -24,6 +24,28 @@ Models:
       Top 5 Accuracy: 94.7
     Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_16x4x1_kinetics400-rgb_20221021-9ebaaeed.pth
 
+  - Name: mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb
+    Config: configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py
+    In Collection: MViT
+    Metadata:
+      Architecture: MViT-small
+      Batch Size: 16
+      Epochs: 100
+      FLOPs: 64G
+      Parameters: 34.5M
+      Resolution: 224x224
+      Training Data: Kinetics-400
+      Training Resources: 32 GPUs
+      Modality: RGB
+    Results:
+      - Dataset: Kinetics-400
+        Task: Action Recognition
+        Metrics:
+          Top 1 Accuracy: 80.6
+          Top 5 Accuracy: 94.7
+    Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.log
+    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb_20230201-23284ff3.pth
+
   - Name: mvit-base-p244_32x3x1_kinetics400-rgb
     Config: configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py
     In Collection: MViT
```
```diff
@@ -60,8 +82,8 @@ Models:
       Top 5 Accuracy: 94.7
     Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_40x3x1_kinetics400-rgb_20221021-11fe1f97.pth
 
-  - Name: mvit-small-p244_u16_sthv2-rgb
-    Config: configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py
+  - Name: mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb_infer
+    Config: configs/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb.py
     In Collection: MViT
     Metadata:
       Architecture: MViT-small
```
```diff
@@ -78,6 +100,29 @@ Models:
       Top 5 Accuracy: 91.0
     Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_u16_sthv2-rgb_20221021-65ecae7d.pth
 
+  - Name: mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb
+    Config: configs/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb.py
+    In Collection: MViT
+    Metadata:
+      Architecture: MViT-small
+      Batch Size: 16
+      Epochs: 100
+      FLOPs: 64G
+      Parameters: 34.4M
+      Pretrained: Kinetics-400
+      Resolution: 224x224
+      Training Data: SthV2
+      Training Resources: 16 GPUs
+      Modality: RGB
+    Results:
+      - Dataset: SthV2
+        Task: Action Recognition
+        Metrics:
+          Top 1 Accuracy: 68.2
+          Top 5 Accuracy: 91.3
+    Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb.log
+    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb_20230201-4065c1b9.pth
+
   - Name: mvit-base-p244_u32_sthv2-rgb
     Config: configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py
     In Collection: MViT
```
```diff
@@ -113,3 +158,26 @@ Models:
           Top 1 Accuracy: 73.2
           Top 5 Accuracy: 94.0
     Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_u40_sthv2-rgb_20221021-61696e07.pth
+
+  - Name: mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb
+    Config: configs/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.py
+    In Collection: MViT
+    Metadata:
+      Architecture: MViT-small
+      Batch Size: 32
+      Epochs: 100
+      FLOPs: 71G
+      Parameters: 36.4M
+      Pretrained: Kinetics-400 MaskFeat
+      Resolution: 224x224
+      Training Data: Kinetics-400
+      Training Resources: 8 GPUs
+      Modality: RGB
+    Results:
+      - Dataset: Kinetics-400
+        Task: Action Recognition
+        Metrics:
+          Top 1 Accuracy: 81.8
+          Top 5 Accuracy: 95.2
+    Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.log
+    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb_20230201-5bced1d0.pth
```
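Hand-edited metafile entries like the ones above are prone to copy-paste slips (e.g. a model's `Config` pointing at another model's config file). A small consistency check over an already-parsed entry can catch that; `check_model_entry` is a hypothetical helper sketched here, not part of MMAction2 tooling, and it only assumes the field names visible in the diff above:

```python
def check_model_entry(model: dict) -> list:
    """Flag obvious inconsistencies in one metafile model entry:
    the model Name (minus any `_infer` suffix) should appear in its
    Config path and Weights URL, and Config should be a .py path."""
    problems = []
    name = model["Name"]
    base = name[: -len("_infer")] if name.endswith("_infer") else name
    if not model["Config"].endswith(".py"):
        problems.append(f"{name}: Config is not a .py path")
    if base not in model["Config"]:
        problems.append(f"{name}: Config path does not mention the model name")
    if "Weights" in model and base not in model["Weights"]:
        problems.append(f"{name}: Weights URL does not mention the model name")
    return problems

entry = {
    "Name": "mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb",
    "Config": "configs/recognition/mvit/"
              "mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py",
    "Weights": "https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/"
               "mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb/"
               "mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb_20230201-23284ff3.pth",
}
print(check_model_entry(entry))  # [] -> entry is self-consistent
```

Running the same check on an entry whose `Config` was pasted from a different model would return a non-empty list of problems.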
