[doc] update mvit metafile and readme (#2210)

cir7 · web-flow · commit a4cdf1778cd3 · 2023-02-10T11:45:40.000+08:00
diff --git a/configs/recognition/mvit/README.md b/configs/recognition/mvit/README.md
@@ -26,12 +26,12 @@ well as 86.1% on Kinetics-400 video classification.
 ## Results and Models
 
 1. Models with * in `Inference results` are ported from the repo [SlowFast](https://github.com/facebookresearch/SlowFast/) and tested on our data, and models in `Training results` are trained in MMAction2 on our data.
-2. The values in columns named after `reference` are copied from paper, and `reference*` are results use [SlowFast](https://github.com/facebookresearch/SlowFast/) repo and trained on our data.
+2. The values in columns named after `reference` are copied from the paper, and `reference*` are results obtained with the [SlowFast](https://github.com/facebookresearch/SlowFast/) repo and trained on our data.
 3. The validation set of Kinetics400 we used consists of 19796 videos. These videos are available at [Kinetics400-Validation](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155136485_link_cuhk_edu_hk/EbXw2WX94J1Hunyt3MWNDJUBz-nHvQYhO9pvKqm6g39PMA?e=a9QldB). The corresponding [data list](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_val_list.txt) (each line is of the format 'video_id, num_frames, label_index') and the [label map](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_class2ind.txt) are also available.
-4. MaskFeat fine-tuning experiment is based on pretrain model from [MMSelfSup](https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/projects/maskfeat_video), and corresponding reference result is based on pretrain model from [SlowFast](https://github.com/facebookresearch/SlowFast/).
+4. The MaskFeat fine-tuning experiment is based on the pretrained model from [MMSelfSup](https://github.com/open-mmlab/mmselfsup/tree/dev-1.x/projects/maskfeat_video), and the corresponding reference result is based on the pretrained model from [SlowFast](https://github.com/facebookresearch/SlowFast/).
 5. Due to the different versions of Kinetics-400, our training results are different from paper.
 6. Due to the training efficiency, we currently only provide MViT-small training results, we don't ensure other config files' training accuracy and welcome you to contribute your reproduction results.
-7. We use `repeat augment` in MViT training configs following [SlowFast](https://github.com/facebookresearch/SlowFast/). [Repeat augment](https://arxiv.org/pdf/1901.09335.pdf) takes multiple times of random augment for one video, this way can improve the generalization of model and relieve the IO stress of loading videos. And please note that, the actual batch size is `num_repeats` times of `batch_size` in `train_dataloader`.
+7. We use `repeat augment` in MViT training configs following [SlowFast](https://github.com/facebookresearch/SlowFast/). [Repeat augment](https://arxiv.org/pdf/1901.09335.pdf) applies data augmentation multiple times to one video, which can improve the generalization of the model and relieve the IO stress of loading videos. Please note that the actual batch size is `num_repeats` times the `batch_size` in `train_dataloader`.
 
 ### Inference results
 
@@ -60,6 +60,12 @@ well as 86.1% on Kinetics-400 video classification.
 |         16x4x1          |  224x224   | MViTv2-S | From scratch  |   80.6   |   94.7   | [80.8](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.6](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop  |  64G  | 34.5M  | [config](configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb_20230201-23284ff3.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.log) |
 |         16x4x1          |  224x224   | MViTv2-S | K400 MaskFeat |   81.8   |   95.2   | [81.5](https://github.com/facebookresearch/SlowFast/blob/main/projects/maskfeat/README.md) | [94.9](https://github.com/facebookresearch/SlowFast/blob/main/projects/maskfeat/README.md) | 10 clips x 1 crop |  71G  | 36.4M  | [config](/configs/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb_20230201-5bced1d0.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.log) |
 
+The corresponding result without repeat augment is as follows:
+
+| frame sampling strategy | resolution | backbone |   pretrain   | top1 acc | top5 acc |                 reference\* top1 acc                 |                 reference\* top5 acc                 | testing protocol | FLOPs | params |
+| :---------------------: | :--------: | :------: | :----------: | :------: | :------: | :--------------------------------------------------: | :--------------------------------------------------: | :--------------: | :---: | :----: |
+|         16x4x1          |  224x224   | MViTv2-S | From scratch |   79.4   |   93.9   | [80.8](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.6](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop |  64G  | 34.5M  |
+
 #### Something-Something V2
 
 | frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc |      reference top1 acc       |       reference top5 acc       | testing protocol | FLOPs | params |       config       |       ckpt       |       log       |
diff --git a/configs/recognition/mvit/metafile.yml b/configs/recognition/mvit/metafile.yml
@@ -6,8 +6,8 @@ Collections:
     Title: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection"
 
 Models:
-  - Name: mvit-small-p244_16x4x1_kinetics400-rgb
-    Config: configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py
+  - Name: mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb_infer
+    Config: configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py
     In Collection: MViT
     Metadata:
       Architecture: MViT-small
@@ -24,6 +24,28 @@ Models:
         Top 5 Accuracy: 94.7
     Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_16x4x1_kinetics400-rgb_20221021-9ebaaeed.pth
 
+  - Name: mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb
+    Config: configs/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.py
+    In Collection: MViT
+    Metadata:
+      Architecture: MViT-small
+      Batch Size: 16
+      Epochs: 200
+      FLOPs: 64G
+      Parameters: 34.5M
+      Resolution: 224x224
+      Training Data: Kinetics-400
+      Training Resources: 32 GPUs
+    Modality: RGB
+    Results:
+    - Dataset: Kinetics-400
+      Task: Action Recognition
+      Metrics:
+        Top 1 Accuracy: 80.6
+        Top 5 Accuracy: 94.7
+    Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb.log
+    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb/mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb_20230201-23284ff3.pth
+
   - Name: mvit-base-p244_32x3x1_kinetics400-rgb
     Config: configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py
     In Collection: MViT
@@ -60,8 +82,8 @@ Models:
         Top 5 Accuracy: 94.7
     Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_40x3x1_kinetics400-rgb_20221021-11fe1f97.pth
 
-  - Name: mvit-small-p244_u16_sthv2-rgb
-    Config: configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py
+  - Name: mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb_infer
+    Config: configs/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb.py
     In Collection: MViT
     Metadata:
       Architecture: MViT-small
@@ -78,6 +100,29 @@ Models:
         Top 5 Accuracy: 91.0
     Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_u16_sthv2-rgb_20221021-65ecae7d.pth
 
+  - Name: mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb
+    Config: configs/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb.py
+    In Collection: MViT
+    Metadata:
+      Architecture: MViT-small
+      Batch Size: 16
+      Epochs: 100
+      FLOPs: 64G
+      Parameters: 34.4M
+      Pretrained: Kinetics-400
+      Resolution: 224x224
+      Training Data: SthV2
+      Training Resources: 16 GPUs
+    Modality: RGB
+    Results:
+    - Dataset: SthV2
+      Task: Action Recognition
+      Metrics:
+        Top 1 Accuracy: 68.2
+        Top 5 Accuracy: 91.3
+    Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb.log
+    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb/mvit-small-p244_k400-pre_16xb16-u16-100e_sthv2-rgb_20230201-4065c1b9.pth
+
   - Name: mvit-base-p244_u32_sthv2-rgb
     Config: configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py
     In Collection: MViT
@@ -113,3 +158,26 @@ Models:
         Top 1 Accuracy: 73.2
         Top 5 Accuracy: 94.0
     Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_u40_sthv2-rgb_20221021-61696e07.pth
+
+  - Name: mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb
+    Config: configs/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.py
+    In Collection: MViT
+    Metadata:
+      Architecture: MViT-small
+      Batch Size: 32
+      Epochs: 100
+      FLOPs: 71G
+      Parameters: 36.4M
+      Pretrained: Kinetics-400 MaskFeat
+      Resolution: 224x224
+      Training Data: Kinetics-400
+      Training Resources: 8 GPUs
+    Modality: RGB
+    Results:
+    - Dataset: Kinetics-400
+      Task: Action Recognition
+      Metrics:
+        Top 1 Accuracy: 81.8
+        Top 5 Accuracy: 95.2
+    Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb.log
+    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb/mvit-small-p244_k400-maskfeat-pre_8xb32-16x4x1-100e_kinetics400-rgb_20230201-5bced1d0.pth
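
The README note on repeat augment says the actual batch size is `num_repeats` times the `batch_size` in `train_dataloader`. The sketch below works through that arithmetic. It assumes the `NxbM` token in a config name means N GPUs with M samples per GPU, which matches the `Batch Size` and `Training Resources` fields in the metafile entries above. The helper names and the `num_repeats=2` value are hypothetical and are not taken from the configs.

```python
import re
from typing import Tuple


def parse_gpus_and_batch(config_name: str) -> Tuple[int, int]:
    """Return (num_gpus, per_gpu_batch_size) from an `NxbM` token.

    Assumption: `32xb16` in a config name means 32 GPUs x 16 samples per GPU.
    """
    match = re.search(r"(\d+)xb(\d+)", config_name)
    if match is None:
        raise ValueError(f"no NxbM token found in {config_name!r}")
    return int(match.group(1)), int(match.group(2))


def effective_batch_size(num_gpus: int, batch_size: int,
                         num_repeats: int) -> Tuple[int, int]:
    """Return (per_gpu, total) batch sizes once repeat augment is applied.

    Each loaded video yields `num_repeats` augmented clips, so the per-GPU
    batch seen by the model is `num_repeats * batch_size`.
    """
    per_gpu = num_repeats * batch_size
    return per_gpu, per_gpu * num_gpus


name = "mvit-small-p244_32xb16-16x4x1-200e_kinetics400-rgb"
gpus, batch = parse_gpus_and_batch(name)
# num_repeats=2 is an illustrative value only; check the actual config.
per_gpu, total = effective_batch_size(gpus, batch, num_repeats=2)
print(f"{name}: per-GPU batch {per_gpu}, total batch {total}")
```

This kind of check can help when scaling the learning rate or comparing a run against the result without repeat augment, since the two runs see different numbers of clips per step.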