Paper link: https://arxiv.org/abs/2104.14294
Code link: github repository
1. Official README
Pretrained models
You can download either the weights of the pretrained backbone alone, for use in downstream tasks, or a full checkpoint that contains the backbone and projection head weights of both the student and teacher networks. We also provide the backbone in ONNX format, along with the detailed arguments and the training/evaluation logs. Note that the "DeiT-S" and "ViT-S" names refer to the same architecture.
Pretrained models on PyTorch Hub
import torch

vits16 = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
vits8 = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
vitb16 = torch.hub.load('facebookresearch/dino:main', 'dino_vitb16')
vitb8 = torch.hub.load('facebookresearch/dino:main', 'dino_vitb8')
xcit_small_12_p16 = torch.hub.load('facebookresearch/dino:main', 'dino_xcit_small_12_p16')
xcit_small_12_p8 = torch.hub.load('facebookresearch/dino:main', 'dino_xcit_small_12_p8')
xcit_medium_24_p16 = torch.hub.load('facebookresearch/dino:main', 'dino_xcit_medium_24_p16')
xcit_medium_24_p8 = torch.hub.load('facebookresearch/dino:main', 'dino_xcit_medium_24_p8')
resnet50 = torch.hub.load('facebookresearch/dino:main', 'dino_resnet50')
Training
Documentation
Please install PyTorch and download the ImageNet dataset. This codebase was developed with Python 3.6, PyTorch 1.7.1, CUDA 11.0, and torchvision 0.8.2. The exact arguments that reproduce the models presented in the paper can be found in the "args" column of the pretrained models section. To see the full documentation of DINO training, run:
python main_dino.py --help
DINO training:
Run DINO for 100 epochs with the ViT-small network on a single node with 8 GPUs using the following command. Training takes about 1.75 days, and the resulting checkpoint should reach 69.3% on k-NN evaluation and 74.0% on linear evaluation. We provide training and linear evaluation logs (with batch size 256 at evaluation time) to help reproducibility.
python -m torch.distributed.launch --nproc_per_node=8 main_dino.py --arch vit_small --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
Multi-node training
We use Slurm and submitit (pip install submitit). To train on 2 nodes with 8 GPUs each (16 GPUs total), run one of the following; the first command trains ViT-small and the second trains ViT-base:
python run_with_submitit.py --nodes 2 --ngpus 8 --arch vit_small --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
python run_with_submitit.py --nodes 2 --ngpus 8 --use_volta32 --arch vit_base --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
Improving DINO performance
You can improve the performance of a run by:
- training for more epochs: --epochs 300,
- increasing the teacher temperature: --teacher_temp 0.07 --warmup_teacher_temp_epochs 30,
- removing the last-layer normalization (only safe with --arch vit_small): --norm_last_layer false.
python run_with_submitit.py --arch vit_small --epochs 300 --teacher_temp 0.07 --warmup_teacher_temp_epochs 30 --norm_last_layer false --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
The resulting pretrained model should reach 73.3% on k-NN evaluation and 76.0% on linear evaluation. Training takes 2.6 days on 16 GPUs. We provide training and linear evaluation logs (with batch size 256 at evaluation time) to help reproducibility.
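The --teacher_temp and --warmup_teacher_temp_epochs flags describe a linear warmup of the teacher temperature followed by a constant value. A minimal pure-Python sketch of that schedule is given here, mirroring the np.linspace / np.ones construction used in DINOLoss (section 2.2.4). The function name teacher_temp_schedule is illustrative and does not exist in the repository.

```python
def teacher_temp_schedule(warmup_temp, final_temp, warmup_epochs, total_epochs):
    """Per-epoch teacher temperature: linear warmup, then a constant value."""
    if warmup_epochs <= 1:
        # np.linspace with 0 or 1 points yields [] or [warmup_temp]
        ramp = [warmup_temp] * warmup_epochs
    else:
        step = (final_temp - warmup_temp) / (warmup_epochs - 1)
        ramp = [warmup_temp + i * step for i in range(warmup_epochs)]
    return ramp + [final_temp] * (total_epochs - warmup_epochs)


# Values taken from the "improved" command: 0.04 -> 0.07 over 30 of 300 epochs.
schedule = teacher_temp_schedule(0.04, 0.07, warmup_epochs=30, total_epochs=300)
```

With these values the temperature starts at 0.04, reaches 0.07 at the end of epoch 30, and stays there for the remaining 270 epochs.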
ResNet-50 and other convolutional neural network training
This code also works for training DINO on convolutional networks such as ResNet-50. In this case, we strongly recommend adapting some of the optimization arguments. For example, the following command trains DINO on a ResNet-50 for 100 epochs on a single node with 8 GPUs. We provide the training logs and final checkpoint for this run.
python -m torch.distributed.launch --nproc_per_node=8 main_dino.py --arch resnet50 --optimizer sgd --lr 0.03 --weight_decay 1e-4 --weight_decay_end 1e-4 --global_crops_scale 0.14 1 --local_crops_scale 0.05 0.14 --data_path /path/to/imagenet/train --output_dir /path/to/saving_dir
Self-attention visualization
You can look at the self-attention of the [CLS] token on the different heads of the last layer by running:
python visualize_attention.py
Evaluation: k-NN Classification on ImageNet
To evaluate a simple k-NN classifier on a pretrained model using a single GPU, run:
python -m torch.distributed.launch --nproc_per_node=1 eval_knn.py --data_path /path/to/imagenet
If you choose not to specify --pretrained_weights, DINO reference weights are used by default. If you want to evaluate checkpoints from your own runs, you can run for example:
python -m torch.distributed.launch --nproc_per_node=1 eval_knn.py --pretrained_weights /path/to/checkpoint.pth --checkpoint_key teacher --data_path /path/to/imagenet
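To make the idea behind the k-NN evaluation concrete, here is a toy sketch of a similarity-weighted k-NN vote on frozen features. It assumes L2-normalized features, cosine similarity, and votes weighted by exp(similarity / temperature); the temperature of 0.07, the function names, and the two-dimensional toy features are all assumptions made for illustration, not the code of eval_knn.py.

```python
import math
from collections import defaultdict


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def knn_predict(query, train_feats, train_labels, k=3, temperature=0.07):
    """Predict a label by a similarity-weighted vote among the k nearest features."""
    q = l2_normalize(query)
    sims = []
    for feat, label in zip(train_feats, train_labels):
        f = l2_normalize(feat)
        sims.append((sum(a * b for a, b in zip(q, f)), label))
    sims.sort(key=lambda s: s[0], reverse=True)  # most similar first
    votes = defaultdict(float)
    for sim, label in sims[:k]:
        votes[label] += math.exp(sim / temperature)
    return max(votes, key=votes.get)


# Toy "frozen features" standing in for backbone outputs.
train_feats = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
train_labels = ["cat", "cat", "dog", "dog"]
prediction = knn_predict([0.8, 0.3], train_feats, train_labels, k=3)
```

No parameters are trained in this procedure, which is why k-NN accuracy is a cheap probe of feature quality.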
Evaluation: Linear Classification on ImageNet
To train a supervised linear classifier on frozen weights on a single node with 8 GPUs, run:
python -m torch.distributed.launch --nproc_per_node=8 eval_linear.py --data_path /path/to/imagenet
We release the logs and weights from evaluating the different models.
You can check the performance of pretrained weights on the ImageNet validation set by running the following command line:
python eval_linear.py --evaluate --arch vit_small --patch_size 16 --data_path /path/to/imagenet/train
python eval_linear.py --evaluate --arch vit_small --patch_size 8 --data_path /path/to/imagenet/train
python eval_linear.py --evaluate --arch vit_base --patch_size 16 --n_last_blocks 1 --avgpool_patchtokens true --data_path /path/to/imagenet/train
python eval_linear.py --evaluate --arch vit_base --patch_size 8 --n_last_blocks 1 --avgpool_patchtokens true --data_path /path/to/imagenet/train
python eval_linear.py --evaluate --arch resnet50 --data_path /path/to/imagenet/train
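The --n_last_blocks and --avgpool_patchtokens flags change the size of the feature vector fed to the linear classifier. The sketch below assumes that the classifier input is the concatenation of the [CLS] tokens of the last n blocks, plus the average-pooled patch tokens when --avgpool_patchtokens is true, and that --n_last_blocks defaults to 4 when omitted. The helper name and the embedding-width table are illustrative.

```python
# Embedding widths of the ViT variants (assumed: tiny 192, small 384, base 768).
EMBED_DIM = {"vit_tiny": 192, "vit_small": 384, "vit_base": 768}


def linear_head_input_dim(arch, n_last_blocks=4, avgpool_patchtokens=False):
    """Width of the frozen feature vector seen by the linear classifier."""
    return EMBED_DIM[arch] * (n_last_blocks + int(avgpool_patchtokens))


# ViT-small commands above omit the flags; ViT-base uses 1 block plus avgpool.
dim_small = linear_head_input_dim("vit_small")
dim_base = linear_head_input_dim("vit_base", n_last_blocks=1, avgpool_patchtokens=True)
```

Under these assumptions both settings produce a 1536-dimensional input, even though the two backbones have different widths.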
2. Procedure
2.1 Program structure
- main_dino.py: training script (argument parsing, DINO loss, training loop)
- vision_transformer.py: Vision Transformer backbone and DINO head
- eval_knn.py: k-NN evaluation
- eval_linear.py: linear evaluation
- eval_video_segmentation.py: video object segmentation evaluation
- visualize_attention.py: self-attention visualization
2.2 main_dino.py
2.2.1 Import package
import argparse
import os
import sys
import datetime
import time
import math
import json
from pathlib import Path

import numpy as np
from PIL import Image
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.backends.cudnn as cudnn
import torch.nn.functional as F
from torchvision import datasets, transforms
from torchvision import models as torchvision_models

import utils
import vision_transformer as vits
from vision_transformer import DINOHead
2.2.2 Parameter settings
# names of the torchvision models that can be used as backbone
torchvision_archs = sorted(name for name in torchvision_models.__dict__
    if name.islower() and not name.startswith("__")
    and callable(torchvision_models.__dict__[name]))

def get_args_parser():
    parser = argparse.ArgumentParser('DINO', add_help=False)

    # Model parameters: backbone architecture
    parser.add_argument('--arch', default='vit_small', type=str,
        choices=['vit_tiny', 'vit_small', 'vit_base', 'xcit', 'deit_tiny', 'deit_small'] \
                + torchvision_archs + torch.hub.list("facebookresearch/xcit:main"),
        help="""Name of architecture to train. For quick experiments with ViTs,
        we recommend using vit_tiny or vit_small.""")
    # A smaller ViT patch size gives better performance but needs more memory.
    # Applies only to ViTs (vit_tiny, vit_small and vit_base).
    parser.add_argument('--patch_size', default=16, type=int, help="""Size in pixels
        of input square patches - default 16 (for 16x16 patches). Using smaller
        values leads to better performance but requires more memory. Applies only
        for ViTs (vit_tiny, vit_small and vit_base). If <16, we recommend disabling
        mixed precision training (--use_fp16 false) to avoid unstabilities.""")
    # Output dimension of the DINO head
    parser.add_argument('--out_dim', default=65536, type=int, help="""Dimensionality of
        the DINO head output. For complex and large datasets large values (like 65k) work well.""")
    # Whether to weight-normalize the last layer of the DINO head
    parser.add_argument('--norm_last_layer', default=True, type=utils.bool_flag,
        help="""Whether or not to weight normalize the last layer of the DINO head.
        Not normalizing leads to better performance but can make the training unstable.
        In our experiments, we typically set this paramater to False with vit_small and True with vit_base.""")
    # Teacher update parameters
    parser.add_argument('--momentum_teacher', default=0.996, type=float, help="""Base EMA
        parameter for teacher update. The value is increased to 1 during training with cosine schedule.
        We recommend setting a higher value with small batches: for example use 0.9995 with batch size of 256.""")
    # Whether to use batch normalization in the projection head
    parser.add_argument('--use_bn_in_head', default=False, type=utils.bool_flag,
        help="Whether to use batch normalizations in projection head (Default: False)")

    # Teacher temperature parameters
    parser.add_argument('--warmup_teacher_temp', default=0.04, type=float,
        help="""Initial value for the teacher temperature: 0.04 works well in most cases.
        Try decreasing it if the training loss does not decrease.""")
    parser.add_argument('--teacher_temp', default=0.04, type=float, help="""Final value (after linear warmup)
        of the teacher temperature. For most experiments, anything above 0.07 is unstable. We recommend
        starting with the default value of 0.04 and increase this slightly if needed.""")
    # note: the help string below says 30, but the default set here is 0
    parser.add_argument('--warmup_teacher_temp_epochs', default=0, type=int,
        help='Number of warmup epochs for the teacher temperature (Default: 30).')

    # Training/optimization parameters
    # Half precision speeds up training and saves memory, but can cause
    # instability and a slight drop in performance.
    parser.add_argument('--use_fp16', type=utils.bool_flag, default=True, help="""Whether or not
        to use half precision for training. Improves training time and memory requirements,
        but can provoke instability and slight decay of performance. We recommend disabling
        mixed precision if the loss is unstable, if reducing the patch size or if training with bigger ViTs.""")
    parser.add_argument('--weight_decay', type=float, default=0.04, help="""Initial value of the
        weight decay. With ViT, a smaller value at the beginning of training works well.""")
    parser.add_argument('--weight_decay_end', type=float, default=0.4, help="""Final value of the
        weight decay. We use a cosine schedule for WD and using a larger decay by
        the end of training improves performance for ViTs.""")
    parser.add_argument('--clip_grad', type=float, default=3.0, help="""Maximal parameter
        gradient norm if using gradient clipping. Clipping with norm .3 ~ 1.0 can
        help optimization for larger ViT architectures. 0 for disabling.""")
    parser.add_argument('--batch_size_per_gpu', default=64, type=int,
        help='Per-GPU batch-size : number of distinct images loaded on one GPU.')
    parser.add_argument('--epochs', default=100, type=int, help='Number of epochs of training.')
    parser.add_argument('--freeze_last_layer', default=1, type=int, help="""Number of epochs
        during which we keep the output layer fixed. Typically doing so during
        the first epoch helps training. Try increasing this value if the loss does not decrease.""")
    parser.add_argument("--lr", default=0.0005, type=float, help="""Learning rate at the end of
        linear warmup (highest LR used during training). The learning rate is linearly scaled
        with the batch size, and specified here for a reference batch size of 256.""")
    parser.add_argument("--warmup_epochs", default=10, type=int,
        help="Number of epochs for the linear learning-rate warm up.")
    parser.add_argument('--min_lr', type=float, default=1e-6, help="""Target LR at the
        end of optimization. We use a cosine LR schedule with linear warmup.""")
    parser.add_argument('--optimizer', default='adamw', type=str,
        choices=['adamw', 'sgd', 'lars'], help="""Type of optimizer. We recommend using adamw with ViTs.""")
    parser.add_argument('--drop_path_rate', type=float, default=0.1, help="stochastic depth rate")

    # Multi-crop parameters
    parser.add_argument('--global_crops_scale', type=float, nargs='+', default=(0.4, 1.),
        help="""Scale range of the cropped image before resizing, relatively to the origin image.
        Used for large global view cropping. When disabling multi-crop (--local_crops_number 0), we
        recommand using a wider range of scale ("--global_crops_scale 0.14 1." for example)""")
    parser.add_argument('--local_crops_number', type=int, default=8, help="""Number of small
        local views to generate. Set this parameter to 0 to disable multi-crop training.
        When disabling multi-crop we recommend to use "--global_crops_scale 0.14 1." """)
    parser.add_argument('--local_crops_scale', type=float, nargs='+', default=(0.05, 0.4),
        help="""Scale range of the cropped image before resizing, relatively to the origin image.
        Used for small local view cropping of multi-crop.""")

    # Misc: data path, output path, etc.
    # ..................(omitted)
    return parser
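The help strings mention two mechanisms worth spelling out: the learning rate is scaled linearly with the total batch size (reference batch size 256), and the learning rate, weight decay, and teacher momentum each follow a cosine schedule, with a linear warmup for the learning rate. A minimal pure-Python sketch of both is given here. The function names scaled_lr and cosine_schedule are illustrative, and iters_per_epoch=10 is a toy value chosen to keep the lists short.

```python
import math


def scaled_lr(base_lr, batch_size_per_gpu, world_size, reference_batch=256):
    """Linear scaling rule: lr grows in proportion to the total batch size."""
    return base_lr * batch_size_per_gpu * world_size / reference_batch


def cosine_schedule(base_value, final_value, epochs, iters_per_epoch,
                    warmup_epochs=0, start_warmup_value=0.0):
    """One value per iteration: optional linear warmup, then cosine interpolation."""
    warmup_iters = warmup_epochs * iters_per_epoch
    total_iters = epochs * iters_per_epoch
    schedule = []
    for it in range(total_iters):
        if it < warmup_iters:
            # linear ramp from start_warmup_value up to base_value
            frac = it / max(warmup_iters - 1, 1)
            value = start_warmup_value + frac * (base_value - start_warmup_value)
        else:
            # cosine interpolation from base_value towards final_value
            progress = (it - warmup_iters) / max(total_iters - warmup_iters, 1)
            value = final_value + 0.5 * (base_value - final_value) * (1 + math.cos(math.pi * progress))
        schedule.append(value)
    return schedule


# Defaults from the parser above, on 8 GPUs with 64 images each.
peak_lr = scaled_lr(0.0005, batch_size_per_gpu=64, world_size=8)
lr_schedule = cosine_schedule(peak_lr, 1e-6, epochs=100, iters_per_epoch=10, warmup_epochs=10)
wd_schedule = cosine_schedule(0.04, 0.4, epochs=100, iters_per_epoch=10)
momentum_schedule = cosine_schedule(0.996, 1.0, epochs=100, iters_per_epoch=10)
```

The same helper covers all three uses because a cosine schedule only needs a start value and an end value: the learning rate decays, while the weight decay and the teacher momentum increase.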
2.2.3 main function
if __name__ == '__main__':
    # get the parameter settings
    parser = argparse.ArgumentParser('DINO', parents=[get_args_parser()])
    args = parser.parse_args()
    # create the output folder
    Path(args.output_dir).mkdir(parents=True, exist_ok=True)
    # start training
    train_dino(args)
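The main function relies on argparse's parents mechanism: get_args_parser builds a parser without its own help option, and the entry point wraps it in a second parser. A small self-contained sketch of the pattern follows. bool_flag here is a hypothetical stand-in for utils.bool_flag, and only three of the arguments are reproduced.

```python
import argparse


def bool_flag(s):
    """Hypothetical stand-in for utils.bool_flag: parse 'true'/'false'-style strings."""
    truthy, falsy = {"on", "true", "1"}, {"off", "false", "0"}
    if s.lower() in truthy:
        return True
    if s.lower() in falsy:
        return False
    raise argparse.ArgumentTypeError("invalid value for a boolean flag")


def get_args_parser():
    # add_help=False so the wrapping parser can add -h/--help without a conflict
    parser = argparse.ArgumentParser("DINO", add_help=False)
    parser.add_argument("--arch", default="vit_small", type=str)
    parser.add_argument("--norm_last_layer", default=True, type=bool_flag)
    parser.add_argument("--epochs", default=100, type=int)
    return parser


# The wrapping parser inherits every argument of its parents.
parser = argparse.ArgumentParser("DINO", parents=[get_args_parser()])
args = parser.parse_args(["--epochs", "300", "--norm_last_layer", "false"])
```

This layout lets other scripts reuse the same argument definitions by listing get_args_parser() as a parent of their own parser.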
2.2.4 LOSS
class DINOLoss(nn.Module):
    def __init__(self, out_dim, ncrops, warmup_teacher_temp, teacher_temp,
                 warmup_teacher_temp_epochs, nepochs, student_temp=0.1,
                 center_momentum=0.9):
        super().__init__()
        self.student_temp = student_temp
        self.center_momentum = center_momentum
        self.ncrops = ncrops
        self.register_buffer("center", torch.zeros(1, out_dim))
        # we apply a warm up for the teacher temperature because
        # a too high temperature makes the training instable at the beginning
        self.teacher_temp_schedule = np.concatenate((
            np.linspace(warmup_teacher_temp,
                        teacher_temp, warmup_teacher_temp_epochs),
            np.ones(nepochs - warmup_teacher_temp_epochs) * teacher_temp
        ))

    def forward(self, student_output, teacher_output, epoch):
        """
        Cross-entropy between softmax outputs of the teacher and student networks.
        """
        student_out = student_output / self.student_temp
        student_out = student_out.chunk(self.ncrops)

        # teacher centering and sharpening: subtracting the center smooths the
        # distribution, and the low-temperature softmax then sharpens it
        temp = self.teacher_temp_schedule[epoch]
        teacher_out = F.softmax((teacher_output - self.center) / temp, dim=-1)
        teacher_out = teacher_out.detach().chunk(2)

        total_loss = 0
        n_loss_terms = 0
        for iq, q in enumerate(teacher_out):
            for v in range(len(student_out)):
                if v == iq:
                    # we skip cases where student and teacher operate on the same view
                    continue
                loss = torch.sum(-q * F.log_softmax(student_out[v], dim=-1), dim=-1)
                total_loss += loss.mean()
                n_loss_terms += 1
        total_loss /= n_loss_terms
        self.update_center(teacher_output)
        return total_loss

    # center update
    @torch.no_grad()
    def update_center(self, teacher_output):
        """
        Update center used for teacher output.
        """
        batch_center = torch.sum(teacher_output, dim=0, keepdim=True)
        dist.all_reduce(batch_center)
        batch_center = batch_center / (len(teacher_output) * dist.get_world_size())

        # ema update
        self.center = self.center * self.center_momentum + batch_center * (1 - self.center_momentum)
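To see what a single term of this loss computes, the following pure-Python sketch works through one teacher/student pair on a three-dimensional toy output. It follows the same steps as forward (centering, temperature-scaled softmax for the teacher, temperature-scaled log-softmax for the student, cross-entropy) but is not the repository code; all function names and the toy logits are illustrative.

```python
import math


def softmax(logits, temperature):
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - lse for x in scaled]


def dino_term(teacher_logits, student_logits, center, teacher_temp=0.04, student_temp=0.1):
    """Cross-entropy between the centered, sharpened teacher and the student."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    q = softmax(centered, teacher_temp)
    log_p = log_softmax(student_logits, student_temp)
    return -sum(qi * lpi for qi, lpi in zip(q, log_p))


def update_center(center, batch_mean, momentum=0.9):
    """Exponential moving average of the teacher's mean output."""
    return [c * momentum + b * (1 - momentum) for c, b in zip(center, batch_mean)]


def n_loss_terms(ncrops, n_global=2):
    """Teacher/student view pairs, excluding pairs that use the same view."""
    return n_global * (ncrops - 1)


teacher = [2.0, 1.0, 0.5]
center = [0.0, 0.0, 0.0]
loss_agree = dino_term(teacher, [2.0, 1.0, 0.5], center)
loss_disagree = dino_term(teacher, [0.5, 1.0, 2.0], center)
```

The loss is small when the student ranks the output dimensions like the teacher and large when it does not. With the default of 2 global and 8 local crops, n_loss_terms(10) gives 18 pairs per batch.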
2.2.5 Training function
def train_one_epoch(student, teacher, teacher_without_ddp, dino_loss, data_loader,
                    optimizer, lr_schedule, wd_schedule, momentum_schedule, epoch,
                    fp16_scaler, args):
    metric_logger = utils.MetricLogger(delimiter=" ")
    header = 'Epoch: [{}/{}]'.format(epoch, args.epochs)
    for it, (images, _) in enumerate(metric_logger.log_every(data_loader, 10, header)):
        # update weight decay and learning rate according to their schedule
        it = len(data_loader) * epoch + it  # global training iteration
        for i, param_group in enumerate(optimizer.param_groups):
            param_group["lr"] = lr_schedule[it]
            if i == 0:  # only the first group is regularized
                param_group["weight_decay"] = wd_schedule[it]

        # move images to gpu
        images = [im.cuda(non_blocking=True) for im in images]
        # teacher and student forward passes + compute dino loss
        with torch.cuda.amp.autocast(fp16_scaler is not None):
            teacher_output = teacher(images[:2])  # only the 2 global views pass through the teacher
            student_output = student(images)
            loss = dino_loss(student_output, teacher_output, epoch)

        if not math.isfinite(loss.item()):
            # the force keyword is handled by the print wrapper that utils installs for distributed runs
            print("Loss is {}, stopping training".format(loss.item()), force=True)
            sys.exit(1)

        # student update
        optimizer.zero_grad()
        param_norms = None
        if fp16_scaler is None:
            loss.backward()
            if args.clip_grad:
                param_norms = utils.clip_gradients(student, args.clip_grad)
            utils.cancel_gradients_last_layer(epoch, student,
                                              args.freeze_last_layer)
            optimizer.step()
        else:
            fp16_scaler.scale(loss).backward()
            if args.clip_grad:
                fp16_scaler.unscale_(optimizer)  # unscale the gradients of optimizer's assigned params in-place
                param_norms = utils.clip_gradients(student, args.clip_grad)
            utils.cancel_gradients_last_layer(epoch, student,
                                              args.freeze_last_layer)
            fp16_scaler.step(optimizer)
            fp16_scaler.update()

        # EMA update for the teacher
        with torch.no_grad():
            m = momentum_schedule[it]  # momentum parameter
            for param_q, param_k in zip(student.module.parameters(), teacher_without_ddp.parameters()):
                param_k.data.mul_(m).add_((1 - m) * param_q.detach().data)

        # logging
        torch.cuda.synchronize()
        metric_logger.update(loss=loss.item())
        metric_logger.update(lr=optimizer.param_groups[0]["lr"])
        metric_logger.update(wd=optimizer.param_groups[0]["weight_decay"])
    # gather the stats from all processes
    metric_logger.synchronize_between_processes()
    print("Averaged stats:", metric_logger)
    return {k: meter.global_avg for k, meter in metric_logger.meters.items()}
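The teacher receives no gradients; after each student step its parameters move towards the student's by an exponential moving average, param_k = m * param_k + (1 - m) * param_q. A minimal sketch of that update on plain Python floats follows; ema_update is an illustrative name and the one-element "parameter lists" are toy values.

```python
def ema_update(teacher_params, student_params, m):
    """Move each teacher parameter a fraction (1 - m) of the way to the student."""
    return [m * t + (1 - m) * s for t, s in zip(teacher_params, student_params)]


teacher_params = [0.0]
student_params = [1.0]
for _ in range(3):
    # three updates with a fixed momentum of 0.9
    teacher_params = ema_update(teacher_params, student_params, m=0.9)

# with m = 1.0 the teacher is frozen
frozen = ema_update([0.5], [1.0], m=1.0)
```

After three steps the teacher has covered 1 - 0.9^3 = 0.271 of the distance to the student. Since momentum_schedule raises m towards 1 over training, the teacher changes more and more slowly.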