Fast ImageNet training on the TU Delft HPC with PyTorch using TFRecords and DALI.
Attila Lengyel committed

Tested with PyTorch 1.12.1, CUDA 11.6 and NVIDIA DALI 1.22.0.
Install instructions for DALI: https://www.github.com/NVIDIA/DALI.

##### Description of files

* `imagenet_tfrecord.py` Python script containing the ImageNet data loader. Use this for your own project.
* `main.py` Ready-to-run ImageNet training script for ResNet18. Training finishes in ~24 hours.
* `imagenet.sbatch` Sbatch script with recommended settings.

##### Usage

* Set `OUT_DIR` and `WANDB_DIR` environment variables (for example in `imagenet.sbatch`).
* Run `sbatch imagenet.sbatch` to start training.
* Training ResNet18 requires 4x 1080Ti GPUs (batch size 64 each) or 1x A40 GPU (batch size 256).
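
The two GPU setups above use the same total batch size per step, which the following minimal sketch makes explicit. It also shows one way a script could check the two environment variables before training starts. The function names `read_output_dirs` and `global_batch_size` are hypothetical and not taken from `main.py`; how `main.py` actually reads `OUT_DIR` and `WANDB_DIR` is an assumption.

```python
import os


def read_output_dirs(env=os.environ):
    """Return (OUT_DIR, WANDB_DIR), failing early if either is unset or empty.

    Hypothetical helper: the real training script may read these differently.
    """
    missing = [name for name in ("OUT_DIR", "WANDB_DIR") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["OUT_DIR"], env["WANDB_DIR"]


def global_batch_size(num_gpus, per_gpu_batch_size):
    """Total number of samples per optimizer step in data-parallel training."""
    return num_gpus * per_gpu_batch_size


# Both recommended setups process the same number of images per step.
print(global_batch_size(4, 64))   # 4x 1080Ti -> 256
print(global_batch_size(1, 256))  # 1x A40    -> 256
```

Keeping the total batch size fixed means the same learning-rate settings should carry over between the two hardware configurations.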

##### Performance

Performance of ResNet18 is on par with the pre-trained torchvision model.

|                                                              | Top-1 error % | Top-5 error % |
| ------------------------------------------------------------ | ------------- | ------------- |
| ResNet18 - DALI [ours]                                       | 29.99         | 10.79         |
| ResNet18 - Torchvision [[link](https://pytorch.org/docs/stable/torchvision/models.html)] | 30.24         | 10.92         |
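
For readers unfamiliar with the metrics in the table, here is a minimal pure-Python sketch of how a top-k error percentage can be computed from per-class scores. It illustrates the definition only; the function `topk_error` and the toy data are hypothetical, and the evaluation code in `main.py` may be implemented differently.

```python
def topk_error(scores, labels, k):
    """Percentage of samples whose true label is not among the k highest scores."""
    wrong = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return 100.0 * wrong / len(labels)


# Toy example with 4 samples and 3 classes (made-up numbers).
scores = [
    [0.1, 0.7, 0.2],  # highest score: class 1
    [0.5, 0.3, 0.2],  # highest score: class 0
    [0.2, 0.3, 0.5],  # highest score: class 2
    [0.6, 0.1, 0.3],  # highest score: class 0
]
labels = [1, 2, 2, 2]

print(topk_error(scores, labels, k=1))  # 50.0
print(topk_error(scores, labels, k=2))  # 25.0
```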




##### Limitations

* Because all JPEG decoding and data augmentation run on the GPU, less GPU memory is available for your network. If you hit OOM errors, you can try to (1) use more GPUs, or (2) enable `dali_cpu` (possibly slower).
* I'm not sure exactly how batches are shuffled; the order might be less "random" than when loading individual JPEG files.
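
To illustrate why shuffling from sequential record files can be less random, here is a small pure-Python simulation of a fixed-size shuffle buffer. DALI's readers document a `random_shuffle` option backed by a buffer of `initial_fill` samples; whether `imagenet_tfrecord.py` uses it, and with what size, is an assumption. The function `buffered_shuffle` is a hypothetical stand-in, not DALI code.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in the order produced by a fixed-size shuffle buffer.

    Items are read sequentially; each output is drawn at random from the
    buffer, and the freed slot is refilled with the next sequential item.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # initial fill
            continue
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    rng.shuffle(buffer)  # drain whatever is left at the end
    yield from buffer


n, buffer_size = 1000, 10
order = list(buffered_shuffle(range(n), buffer_size))

# Every sample appears exactly once ...
assert sorted(order) == list(range(n))
# ... but a sample can never appear more than buffer_size - 1 positions early,
# so the start of the epoch only ever contains samples from the start of the file.
assert all(item - pos < buffer_size for pos, item in enumerate(order))
print(max(order[:10]))  # always below 19 for this buffer size
```

With a buffer much smaller than the dataset, samples stored close together in the record files tend to end up in nearby batches. A larger buffer moves the result closer to a full shuffle at the cost of more memory.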

##### Other

* Short video explanation of DistributedDataParallel: https://youtu.be/a6_pY9WwqdQ?t=95
* DALI example on loading data from TFRecords: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/general/data_loading/dataloading_tfrecord.html
* List of supported DALI operations: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/supported_ops.html