Deep Learning Paper: TResNet: High Performance GPU-Dedicated Architecture and Its PyTorch Implementation

Reader submission 834 2022-10-06


TResNet: High Performance GPU-Dedicated Architecture
PDF:

1 Overview

TResNet models achieve higher accuracy and better efficiency than previous ConvNets. Using a TResNet model with GPU throughput similar to ResNet50, the authors reach 80.7% top-1 accuracy on ImageNet.

2 TResNet Design

2-1 Stem Design

import torch.nn as nn


class SpaceToDepth(nn.Module):
    def __init__(self, block_size=4):
        super().__init__()
        assert block_size == 4
        self.bs = block_size

    def forward(self, x):
        N, C, H, W = x.size()
        x = x.view(N, C, H // self.bs, self.bs, W // self.bs, self.bs)  # (N, C, H//bs, bs, W//bs, bs)
        x = x.permute(0, 3, 5, 1, 2, 4).contiguous()  # (N, bs, bs, C, H//bs, W//bs)
        x = x.view(N, C * (self.bs ** 2), H // self.bs, W // self.bs)  # (N, C*bs^2, H//bs, W//bs)
        return x
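To make the rearrangement concrete, the following sketch applies the same view/permute/view steps to a 224x224 RGB tensor and checks where the pixels end up. The function name `space_to_depth` and the input size are illustrative choices, not taken from the paper.

```python
import torch


def space_to_depth(x: torch.Tensor, bs: int = 4) -> torch.Tensor:
    """Move each bs x bs spatial block into the channel dimension."""
    n, c, h, w = x.shape
    x = x.view(n, c, h // bs, bs, w // bs, bs)
    x = x.permute(0, 3, 5, 1, 2, 4).contiguous()
    return x.view(n, c * bs * bs, h // bs, w // bs)


image = torch.arange(1 * 3 * 224 * 224, dtype=torch.float32).view(1, 3, 224, 224)
stem_input = space_to_depth(image)
print(stem_input.shape)  # torch.Size([1, 48, 56, 56])

# Output channel index is (row_offset * bs + col_offset) * C + c, so channel 0
# holds the top-left pixel of every 4x4 block of input channel 0.
assert torch.equal(stem_input[0, 0], image[0, 0, ::4, ::4])
```

The spatial resolution drops by a factor of 4 in each dimension while the channel count grows by 16, so no pixel information is discarded before the first convolution.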

Mark Sandler, Jonathan Baccash, Andrey Zhmoginov, and Andrew Howard. Non-discriminative data or weak model? On the relative importance of data and model resolution. arXiv preprint arXiv:1909.03205, 2019.

2-2 Anti-Alias Downsampling (AA)

import torch
import torch.nn as nn
import torch.nn.functional as F


class AADownsample(nn.Module):
    def __init__(self, filt_size=3, stride=2, channels=None):
        super(AADownsample, self).__init__()
        self.filt_size = filt_size
        self.stride = stride
        self.channels = channels

        assert self.filt_size == 3
        a = torch.tensor([1., 2., 1.])

        # 3x3 blur kernel, normalized so that its weights sum to 1
        filt = (a[:, None] * a[None, :])
        filt = filt / torch.sum(filt)

        # One copy of the kernel per channel; registered as a buffer so it follows the module's device
        self.register_buffer('filt', filt[None, None, :, :].repeat((self.channels, 1, 1, 1)))

    def forward(self, input):
        input_pad = F.pad(input, (1, 1, 1, 1), 'reflect')
        return F.conv2d(input_pad, self.filt, stride=self.stride, padding=0, groups=input.shape[1])
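As a quick check of what this layer computes, the sketch below builds the same 3x3 blur kernel by hand and applies it as a depthwise stride-2 convolution. The channel count and input size are arbitrary values chosen for illustration.

```python
import torch
import torch.nn.functional as F

# 3x3 binomial blur kernel: outer product of [1, 2, 1], normalized to sum to 1.
a = torch.tensor([1., 2., 1.])
kernel = a[:, None] * a[None, :]
kernel = kernel / kernel.sum()

channels = 8
filt = kernel[None, None].repeat(channels, 1, 1, 1)  # one copy per channel (depthwise)

x = torch.randn(2, channels, 32, 32)
x_pad = F.pad(x, (1, 1, 1, 1), mode="reflect")
y = F.conv2d(x_pad, filt, stride=2, padding=0, groups=channels)
print(y.shape)  # torch.Size([2, 8, 16, 16])

# A normalized blur leaves constant regions unchanged.
ones = torch.ones(1, channels, 32, 32)
blurred = F.conv2d(F.pad(ones, (1, 1, 1, 1), mode="reflect"), filt, stride=2, groups=channels)
assert torch.allclose(blurred, torch.ones(1, channels, 16, 16))
```

Setting `groups` equal to the channel count makes the convolution depthwise, so each channel is blurred and subsampled independently with a fixed, non-learned filter.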

Richard Zhang. Making convolutional networks shift-invariant again. In ICML, 2019.

2-3 In-Place Activated BatchNorm (Inplace-ABN)

Replacing every BatchNorm+ReLU pair with Inplace-ABN greatly reduces GPU memory consumption. Leaky-ReLU is also used in place of ReLU, which improves accuracy at very little cost.
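For readers without the Inplace-ABN package installed, the layer's forward computation can be approximated with stock PyTorch modules. This is only a stand-in: it applies BatchNorm followed by Leaky-ReLU but does not provide the memory saving of the fused in-place layer. The helper name `bn_leaky_relu` and the slope of 0.01 are assumptions made for this sketch, not values from the paper.

```python
import torch
import torch.nn as nn


def bn_leaky_relu(channels: int, slope: float = 0.01) -> nn.Sequential:
    """Plain BatchNorm + Leaky-ReLU (hypothetical helper).

    Same forward computation that Inplace-ABN fuses into one layer,
    but without its memory saving. The slope value is an assumption.
    """
    return nn.Sequential(
        nn.BatchNorm2d(channels),
        nn.LeakyReLU(negative_slope=slope, inplace=True),
    )


layer = bn_leaky_relu(16)
out = layer(torch.randn(4, 16, 8, 8))
print(out.shape)  # torch.Size([4, 16, 8, 8])
```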

Samuel Rota Bulo, Lorenzo Porzi, and Peter Kontschieder. In-place activated batchnorm for memory-optimized training of DNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

2-4 Blocks Selection

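The paper's figure shows the BasicBlock used by ResNet34 (left) and the Bottleneck used by ResNet50 (right). Bottleneck has higher GPU usage but gives better accuracy, while BasicBlock has a larger receptive field.

TResNet therefore uses BasicBlock in the first two stages and Bottleneck in the last two stages.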

2-5 SE Layers

SE layers are added in the first three stages, placed within the blocks as shown in the paper's figure.
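For reference, a generic squeeze-and-excitation layer can be sketched as below. This is a minimal textbook version, not the exact TResNet module: the class name `SEModule`, the reduction factor of 4, and the example tensor sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class SEModule(nn.Module):
    """Generic squeeze-and-excitation layer (illustrative sketch)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        # Squeeze: global average over the spatial dimensions -> (N, C, 1, 1)
        s = x.mean(dim=(2, 3), keepdim=True)
        # Excite: bottleneck MLP producing per-channel gates between 0 and 1
        s = self.gate(self.fc2(self.relu(self.fc1(s))))
        return x * s


se = SEModule(64)
features = torch.randn(2, 64, 14, 14)
out = se(features)
print(out.shape)  # torch.Size([2, 64, 14, 14])
```

The layer rescales each channel by a learned gate and leaves the tensor shape unchanged, so it can be dropped into a residual block without touching the surrounding layers.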

Together these choices give the proposed TResNet architecture.
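The stage-level choices from sections 2-4 and 2-5 can be summarized in a short sketch. It records only what the text above states (block type per stage and where SE layers are used); layer counts and channel widths are deliberately left out. `StageSpec` and `tresnet_stage_plan` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageSpec:
    stage: int    # 1-based stage index
    block: str    # residual block type used in this stage
    use_se: bool  # whether SE layers are added in this stage


def tresnet_stage_plan():
    """Block type and SE usage for the four stages (hypothetical helper)."""
    plan = []
    for stage in (1, 2, 3, 4):
        # First two stages use BasicBlock, last two use Bottleneck.
        block = "BasicBlock" if stage <= 2 else "Bottleneck"
        # SE layers appear only in the first three stages.
        plan.append(StageSpec(stage, block, use_se=stage <= 3))
    return plan


for spec in tresnet_stage_plan():
    print(spec)
```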

3 Code Optimizations

3-1 JIT Compilation

JIT accelerated SpaceToDepth module

import torch


@torch.jit.script
class SpaceToDepthJit(object):
    def __call__(self, x: torch.Tensor):
        # assuming hard-coded that block_size==4 for acceleration
        N, C, H, W = x.size()
        x = x.view(N, C, H // 4, 4, W // 4, 4)  # (N, C, H//bs, bs, W//bs, bs)
        x = x.permute(0, 3, 5, 1, 2, 4).contiguous()  # (N, bs, bs, C, H//bs, W//bs)
        x = x.view(N, C * 16, H // 4, W // 4)  # (N, C*bs^2, H//bs, W//bs)
        return x
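A simple way to confirm that scripting does not change the result is to compile a plain function with `torch.jit.script` and compare it against eager execution. The sketch below uses a standalone function rather than a scripted class; the function names and the input size are illustrative.

```python
import torch


def space_to_depth_eager(x: torch.Tensor) -> torch.Tensor:
    # Block size hard-coded to 4, as in the scripted module above.
    n, c, h, w = x.size()
    x = x.view(n, c, h // 4, 4, w // 4, 4)
    x = x.permute(0, 3, 5, 1, 2, 4).contiguous()
    return x.view(n, c * 16, h // 4, w // 4)


# torch.jit.script compiles the Python function to TorchScript.
space_to_depth_scripted = torch.jit.script(space_to_depth_eager)

x = torch.randn(2, 3, 64, 64)
assert torch.equal(space_to_depth_scripted(x), space_to_depth_eager(x))
print(space_to_depth_scripted(x).shape)  # torch.Size([2, 48, 16, 16])
```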

JIT accelerated AA downsampling module

import torch
import torch.nn.functional as F


@torch.jit.script
class AADownsampleJIT(object):
    def __init__(self, filt_size: int = 3, stride: int = 2, channels: int = 0):
        self.stride = stride
        self.filt_size = filt_size
        self.channels = channels

        assert self.filt_size == 3
        assert stride == 2
        a = torch.tensor([1., 2., 1.])

        filt = (a[:, None] * a[None, :]).clone().detach()
        filt = filt / torch.sum(filt)
        # The filter is created on the GPU in half precision
        self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda().half()

    def __call__(self, input: torch.Tensor):
        # Fall back to a float32 filter when the input is not half precision
        if input.dtype != self.filt.dtype:
            self.filt = self.filt.float()
        input_pad = F.pad(input, (1, 1, 1, 1), 'reflect')
        return F.conv2d(input_pad, self.filt, stride=2, padding=0, groups=input.shape[1])

3-2 Fixed Global Average Pooling

AvgPool2d is faster than AdaptiveAvgPool2d, but computing global average pooling with view and mean is about 5 times faster than AvgPool2d.

class FastGlobalAvgPool2d(nn.Module):
    def __init__(self, flatten=False):
        super(FastGlobalAvgPool2d, self).__init__()
        self.flatten = flatten

    def forward(self, x):
        if self.flatten:
            in_size = x.size()
            return x.view((in_size[0], in_size[1], -1)).mean(dim=2)
        else:
            return x.view(x.size(0), x.size(1), -1).mean(-1).view(x.size(0), x.size(1), 1, 1)
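The view-and-mean formulation should give the same values as the standard pooling layer. The sketch below checks this against `nn.AdaptiveAvgPool2d(1)` on a random tensor; it verifies numerical equivalence only and makes no claim about speed. The function name `fast_global_avg_pool` and the tensor size are illustrative.

```python
import torch
import torch.nn as nn


def fast_global_avg_pool(x: torch.Tensor, flatten: bool = False) -> torch.Tensor:
    """Global average pooling via view + mean."""
    n, c = x.size(0), x.size(1)
    pooled = x.view(n, c, -1).mean(dim=2)  # (N, C)
    return pooled if flatten else pooled.view(n, c, 1, 1)


x = torch.randn(4, 32, 7, 7)
reference = nn.AdaptiveAvgPool2d(1)(x)  # (N, C, 1, 1)

assert torch.allclose(fast_global_avg_pool(x), reference, atol=1e-6)
assert fast_global_avg_pool(x, flatten=True).shape == (4, 32)
```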

3-3 Inplace Operations

Use inplace operations wherever possible, for example in residual connections, SE layers, and the blocks' final activation.
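As an illustration of the idea, the toy block below uses an in-place residual add and an in-place final activation. It is a simplified sketch rather than a TResNet block; the class name `TinyResidualBlock` and the layer sizes are made up for this example.

```python
import torch
import torch.nn as nn


class TinyResidualBlock(nn.Module):
    """Toy residual block using in-place ops (illustrative sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)  # overwrites its input buffer

    def forward(self, x):
        out = self.bn(self.conv(x))
        out += x  # in-place residual add, no extra tensor allocated
        return self.act(out)


block = TinyResidualBlock(8)
x = torch.randn(2, 8, 16, 16, requires_grad=True)
y = block(x)
y.sum().backward()  # autograd still works with these in-place ops
print(y.shape)  # torch.Size([2, 8, 16, 16])
```

In-place operations avoid allocating a new tensor for each intermediate result, which lowers peak memory and can allow a larger batch size.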

4 Experimental Results

4-1 Basic

4-2 Ablation
