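Introduction

  • This is a simple demonstration of how to implement video action recognition from scratch with a pretrained 3D ResNet model
  • The example code is based on the kenshohara/3D-ResNets-PyTorch project and is implemented directly with common Python modules such as Paddle, OpenCV, NumPy, and PIL, with no additional code libraries required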

Demo

  • Three video action clips, "testifying", "playing basketball", and "washing dishes", are used for the demonstration


References

@misc{hara2018spatiotemporal,
      title={Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?}, 
      author={Kensho Hara and Hirokatsu Kataoka and Yutaka Satoh},
      year={2018},
      eprint={1711.09577},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{carreira2019short,
      title={A Short Note on the Kinetics-700 Human Action Dataset}, 
      author={Joao Carreira and Eric Noland and Chloe Hillier and Andrew Zisserman},
      year={2019},
      eprint={1907.06987},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

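Overview

  • In general, running inference with a pretrained deep learning model involves the following steps:
  • Data processing: process the data into an input format the model accepts
  • Model construction: build the model according to its network architecture
  • Model loading: instantiate the model and load the pretrained parameters
  • Model computation: run a forward pass of the loaded model on the processed data to obtain the results
  • Result processing: convert the output into a form that is easy to read or understand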

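Dataset

  • The video data used here comes from the Kinetics 700 dataset, a large-scale action recognition dataset covering 700 action classes
  • One video was picked at random from the Kinetics 700 validation set; its annotation is as follows: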
"---QUuC4vJs": {
    "annotations": {
        "label": "testifying",
        "segment": [
            84.0,
            94.0
        ]
    },
    "duration": 10.0,
    "subset": "validate",
    "url": "https://www.youtube.com/watch?v=---QUuC4vJs"
}
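  • The annotation contains the video URL, the time interval of the extracted clip, the clip duration, the dataset split it belongs to, and the action label of the clip
  • The video has already been downloaded in advance (Figure: preview of the extracted clip)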
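The annotation fields described above can be read with the standard `json` module. The following is a minimal sketch, assuming the entry is wrapped in an outer pair of braces so that it forms valid JSON; `parse_annotation` is a hypothetical helper name, not part of the dataset tooling.

```python
import json

# Annotation entry copied from the example above, wrapped in braces to form valid JSON
annotation_text = '''
{
    "---QUuC4vJs": {
        "annotations": {"label": "testifying", "segment": [84.0, 94.0]},
        "duration": 10.0,
        "subset": "validate",
        "url": "https://www.youtube.com/watch?v=---QUuC4vJs"
    }
}
'''


def parse_annotation(text):
    """Return (video_id, label, start_sec, end_sec) for a single-entry annotation dict."""
    entries = json.loads(text)
    video_id, entry = next(iter(entries.items()))
    start_sec, end_sec = entry["annotations"]["segment"]
    return video_id, entry["annotations"]["label"], start_sec, end_sec


video_id, label, start_sec, end_sec = parse_annotation(annotation_text)
print(video_id, label, start_sec, end_sec)
```

The two segment values are the start and end times in seconds that the video-reading step later uses to cut out the clip.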

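Data Processing

  • Here is a brief demonstration of how to process a video into data the model can accept
  • The processing has three main parts: image processing of the video frames, temporal processing of the video, and reading the video data
  • (Figure: data processing pipeline at training time, from the paper)
  • (Figure: data processing pipeline at model inference time)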

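Image Processing of Video Frames

  • Resize by the short side -> square center crop -> rescale pixel values -> convert to Tensor -> normalize with the dataset's mean and standard deviation -> add a dimension
  • All of these operations can be implemented with the built-in functions of paddle.vision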
from paddle.vision.transforms import Compose, Normalize, Resize, CenterCrop, ToTensor

sample_size = 112
mean = [0.4345, 0.4051, 0.3775]
std = [0.2768, 0.2713, 0.2737]

test_spatial_transforms = Compose([
    Resize(sample_size),
    CenterCrop(sample_size),
    ToTensor(),
    Normalize(mean, std),
    lambda x: x[None, ...]
])
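To make the geometry of the first two steps concrete, the sketch below computes the size after a short-side resize and the box of the square center crop in plain Python. The 340x256 input frame size is an assumption chosen for illustration, both helper names are hypothetical, and the exact rounding inside the library may differ slightly.

```python
def short_side_resize(width, height, size):
    """Scale (width, height) so that the shorter side equals `size`, keeping the aspect ratio."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size


def center_crop_box(width, height, size):
    """Return (left, top, right, bottom) of a centered size x size crop."""
    left = int(round((width - size) / 2.0))
    top = int(round((height - size) / 2.0))
    return left, top, left + size, top + size


# Assumed 340x256 frame, resized so the short side is 112, then cropped to 112x112
w, h = short_side_resize(340, 256, 112)
box = center_crop_box(w, h, 112)
print((w, h), box)
```

For a landscape frame the crop discards equal margins on the left and right and keeps the full height.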

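Temporal Processing of the Video

  • Because a video sequence is long, it must be sampled before it can be used as model input
  • At inference time, the paper uses sliding-window sampling, which converts one long sequence into several shorter sub-sequences of equal length
  • Fusing the results of multiple sub-sequences rather than relying on a single sample can effectively improve inference accuracy; this is a form of TTA (Test Time Augmentation)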
class LoopPadding(object):
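    '''
    Sequence padding: when the sequence is shorter than the required minimum length,
    repeat its indices cyclically from the start until it reaches that length
    '''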
    def __init__(self, size):
        self.size = size

    def __call__(self, frame_indices):
        out = frame_indices

        for index in out:
            if len(out) >= self.size:
                break
            out.append(index)

        return out
    
class SlidingWindow(object):
    '''
    Sliding window: sample a long sequence with the given window size and stride,
    producing multiple sub-sequences of equal length
    '''
    def __init__(self, size, stride=0):
        self.size = size
        if stride == 0:
            self.stride = self.size
        else:
            self.stride = stride
        self.loop = LoopPadding(size)

    def __call__(self, frame_indices):
        out = []
        for begin_index in frame_indices[::self.stride]:
            end_index = min(frame_indices[-1] + 1, begin_index + self.size)
            sample = list(range(begin_index, end_index))

            if len(sample) < self.size:
                out.append(self.loop(sample))
                break
            else:
                out.append(sample)

        return out

    
sample_duration = 16
inference_stride = 16

test_temporal_transforms = SlidingWindow(sample_duration, inference_stride)
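As a usage illustration, the sketch below re-implements the same windowing logic as a single standalone function and applies it to a 40-frame sequence. The function name `sliding_windows` and the frame count are assumptions for this example. The last window is short, so it is padded by cycling through its own indices, as the LoopPadding class above does.

```python
def sliding_windows(num_frames, size, stride):
    """Split range(num_frames) into windows of `size` indices, stepping by `stride`.

    A final short window is padded by cycling through its own indices,
    after which no further windows are produced.
    """
    windows = []
    for begin in range(0, num_frames, stride):
        window = list(range(begin, min(num_frames, begin + size)))
        if len(window) < size:
            short = list(window)
            i = 0
            while len(window) < size:
                window.append(short[i % len(short)])
                i += 1
            windows.append(window)
            break
        windows.append(window)
    return windows


windows = sliding_windows(num_frames=40, size=16, stride=16)
for w in windows:
    print(w)
```

With 40 frames this yields two full windows and one padded window that holds frames 32 to 39 twice.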

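Reading the Video Data

  • Extract the required segment according to the dataset annotation -> split the video into frames -> preprocess the frame images -> cut the video into sampled clips with the sliding window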
import cv2
import paddle
import PIL.Image as Image

start_time = 84
end_time = 94
video_file = 'data/data121477/testifying.mp4'


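# Open the video and seek to the start of the annotated segment
cap = cv2.VideoCapture(video_file)
cap.set(cv2.CAP_PROP_POS_MSEC, start_time*1000)
frames = []
while True:
    # Stop once the end of the annotated segment is reached
    if cap.get(cv2.CAP_PROP_POS_MSEC) >= end_time*1000:
        break
    success, frame = cap.read()
    if success:
        # OpenCV decodes frames as BGR; convert to RGB before the PIL-based transforms
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        frames.append(test_spatial_transforms(Image.fromarray(frame)))
    else:
        break
# Release the video file handle
cap.release()
# Stack the per-frame tensors into [num_frames, 3, 112, 112]
frames = paddle.concat(frames, 0)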

samples = []
for indexs in test_temporal_transforms(range(len(frames))):
    # Select the frames of one window: [T, C, H, W] -> [C, T, H, W], then add a batch dimension
    sample = frames[indexs].transpose((1, 0, 2, 3))[None, ...]
    samples.append(sample)
samples = paddle.concat(samples, 0)
print(f'Input sample shape: {samples.shape}')
Input sample shape: [19, 3, 16, 112, 112]
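The shape printed above can be reproduced with simple bookkeeping. In the sketch below the total of 300 frames is an assumption (a 10 s clip at 30 fps); the real count depends on the video file. `permute` is a hypothetical helper that mimics how a transpose reorders axes.

```python
import math


def permute(shape, perm):
    """Reorder a shape tuple the way a tensor transpose with `perm` reorders axes."""
    return tuple(shape[i] for i in perm)


num_frames = 300          # assumption: 10 s at 30 fps
window, stride = 16, 16   # sample_duration and inference_stride from above

frames_shape = (num_frames, 3, 112, 112)                 # [T_total, C, H, W]
one_window = (window,) + frames_shape[1:]                # [T, C, H, W] after picking 16 frames
clip_shape = (1,) + permute(one_window, (1, 0, 2, 3))    # [1, C, T, H, W]
num_windows = math.ceil(num_frames / stride)             # holds when stride == window size

samples_shape = (num_windows,) + clip_shape[1:]
print(samples_shape)
```

Each window becomes one item in the batch dimension, so the model scores all windows in a single forward pass.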

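Class Labels

  • Because the algorithm works purely with numeric values, a list is needed to map each output ID to its actual class name
  • Before inference, the same label list that was used in training must be loaded so that the output labels are consistent
  • The label list used here therefore comes from the json file the authors used for training (stored here as the plain-text file kinetics700.txt)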
label_file = 'data/data121477/kinetics700.txt'

label_list = []
with open(label_file, 'r', encoding='UTF-8') as f:
    for line in f:
        # Drop the trailing newline and the leading ID field, keep the label text
        label = ' '.join(line.rstrip('\n').split(' ')[1:])
        label_list.append(label)

print(f'There are {len(label_list)} class labels in total: {label_list}')
There are 700 class labels in total: ['slicing onion', 'jetskiing', 'building cabinet', 'eating hotdog', 'building lego', 'jumping jacks', 'kissing', 'bee keeping', 'blowing glass', 'curling eyelashes', 'shining shoes', 'flipping pancake', 'putting on eyeliner', 'salsa dancing', 'tasting wine', 'pinching', 'massaging back', 'getting a piercing', 'bouncing ball (not juggling)', 'playing gong', 'karaoke', 'battle rope training', 'pumping fist', 'pretending to be a statue', 'ski ballet', 'stretching leg', 'decoupage', 'tossing coin', 'ironing', 'washing hands', 'doing laundry', 'dancing charleston', 'opening coconuts', 'opening wine bottle', 'looking in mirror', 'cracking neck', 'springboard diving', 'golf chipping', 'repairing puncture', 'crocheting', 'talking on cell phone', 'opening present', 'leatherworking', 'pulling rope (game)', 'giving or receiving award', 'tiptoeing', 'eating nachos', 'catching or throwing baseball', 'bowling', 'playing recorder', 'whistling', 'parkour', 'trimming shrubs', 'exercising arm', 'clay pottery making', 'playing controller', 'playing ukulele', 'training dog', 'using segway', 'roasting pig', 'shoot dance', 'throwing snowballs', 'using remote controller (not gaming)', 'riding or walking with horse', 'hugging (not baby)', 'playing bass guitar', 'squeezing orange', 'waxing armpits', 'recording music', 'belly dancing', 'brushing teeth', 'applying cream', 'chiseling wood', 'head stand', 'directing traffic', 'photocopying', 'yawning', 'checking watch', 'writing', 'sausage making', 'crossing eyes', 'falling off bike', 'blending fruit', 'skiing slalom', 'pouring wine', 'throwing tantrum', 'celebrating', 'fidgeting', 'shearing sheep', 'tagging graffiti', 'diving cliff', 'throwing knife', 'scrapbooking', 'cartwheeling', 'hurdling', 'getting a haircut', 'slacklining', 'arm wrestling', 'playing with trains', 'mopping floor', 'scrambling eggs', 'playing tennis', 'blowing nose', 'playing flute', 'pushing car', 'vacuuming floor', 'getting a tattoo', 'raising 
eyebrows', 'abseiling', 'playing road hockey', 'headbanging', 'spraying', 'chasing', 'playing billiards', 'motorcycling', 'air drumming', 'changing gear in car', 'riding unicycle', 'somersaulting', 'making snowman', 'using circular saw', 'attending conference', 'massaging feet', 'playing field hockey', 'playing scrabble', 'polishing metal', 'assembling computer', 'riding snow blower', 'news anchoring', 'flying kite', 'breakdancing', 'eating cake', 'counting money', 'breathing fire', 'carving ice', 'shining flashlight', 'archaeological excavation', 'calculating', 'snatch weight lifting', 'playing accordion', 'stacking cups', 'playing cards', 'playing badminton', 'stomping grapes', 'playing didgeridoo', 'chiseling stone', 'folding napkins', 'snowmobiling', 'backflip (human)', 'golf putting', 'lock picking', 'lawn mower racing', 'making the bed', 'brushing floor', 'peeling banana', 'front raises', 'wood burning (art)', 'hurling (sport)', 'long jump', 'squat', 'playing checkers', 'bouncing on trampoline', 'playing oboe', 'playing blackjack', 'kitesurfing', 'hockey stop', 'weaving basket', 'playing shuffleboard', 'ice fishing', 'bookbinding', 'playing laser tag', 'dining', 'casting fishing line', 'delivering mail', 'tobogganing', 'washing hair', 'square dancing', 'playing american football', 'exercising with an exercise ball', 'krumping', 'skydiving', 'bathing dog', 'hand washing clothes', 'bouncing on bouncy castle', 'making tea', 'watching tv', 'cleaning toilet', 'playing darts', 'jaywalking', 'kicking field goal', 'punching person (boxing)', 'scrubbing face', 'skiing mono', 'huddling', 'ice swimming', 'playing xylophone', 'unloading truck', 'gargling', 'twiddling fingers', 'using a microscope', 'sweeping floor', 'passing soccer ball', 'wading through mud', 'cumbia', 'cracking knuckles', 'waiting in line', 'cleaning gutters', 'mixing colours', 'playing paintball', 'stretching arm', 'using megaphone', 'treating wood', 'petting horse', 'dyeing eyebrows', 'drooling', 
'swimming butterfly stroke', 'roller skating', 'punching bag', 'laying stone', 'sharpening pencil', 'pushing wheelbarrow', 'catching or throwing softball', 'laying tiles', 'poking bellybutton', 'riding scooter', 'skiing crosscountry', 'bending metal', 'shooting basketball', 'card stacking', 'parasailing', 'ski jumping', 'waxing legs', 'making cheese', 'sanding floor', 'playing cricket', 'separating eggs', 'shaving head', 'carrying weight', 'presenting weather forecast', 'waking up', 'picking apples', 'mountain climber (exercise)', 'popping balloons', 'docking boat', 'peeling apples', 'tapping pen', 'using inhaler', 'petting animal (not cat)', 'running on treadmill', 'hugging baby', 'juggling soccer ball', 'snowkiting', 'licking', 'contorting', 'cooking scallops', 'inflating balloons', 'gymnastics tumbling', 'grinding meat', 'playing drums', 'playing maracas', 'washing feet', 'eating burger', 'rock climbing', 'carving pumpkin', 'shucking oysters', 'lunge', 'shaping bread dough', 'calligraphy', 'extinguishing fire', 'tickling', 'swinging on something', 'carving wood with a knife', 'pushing cart', 'sewing', 'swimming with sharks', 'baking cookies', 'playing volleyball', 'feeding birds', 'baby waking up', 'country line dancing', 'tasting food', 'clean and jerk', 'playing netball', 'building shed', 'auctioning', 'walking with crutches', 'arranging flowers', 'person collecting garbage', 'wrapping present', 'drinking shots', 'spray painting', 'passing American football (in game)', 'sign language interpreting', 'bandaging', 'making a sandwich', 'pull ups', 'juggling balls', 'playing poker', 'climbing ladder', 'cutting cake', 'side kick', 'laying decking', 'fly tying', 'cheerleading', 'being excited', 'dealing cards', 'vacuuming car', 'card throwing', 'breaking glass', 'shooting off fireworks', 'egg hunting', 'mushroom foraging', 'cosplaying', 'herding cattle', 'weaving fabric', 'pushing wheelchair', 'swimming with dolphins', 'drawing', 'fencing (sport)', 'swimming front 
crawl', 'closing door', 'sword swallowing', 'eating doughnuts', 'sipping cup', 'throwing ball (not baseball or American football)', 'waxing eyebrows', 'jumping into pool', 'cooking egg', 'packing', 'preparing salad', 'winking', 'gold panning', 'spelunking', 'swimming breast stroke', 'jogging', 'fixing bicycle', 'filling cake', 'embroidering', 'texting', 'pouring beer', 'playing hand clapping games', 'feeding fish', 'bartending', 'putting wallpaper on wall', 'tackling', 'sailing', 'playing kickball', 'paragliding', 'laughing', 'disc golfing', 'bending back', 'clam digging', 'swinging baseball bat', 'home roasting coffee', 'moon walking', 'dodgeball', 'brush painting', 'busking', 'push up', 'deadlifting', 'shooting goal (soccer)', 'hula hooping', 'waxing back', 'watering plants', 'arresting', 'burping', 'picking blueberries', 'making horseshoes', 'entering church', 'rolling eyes', 'jumpstyle dancing', 'making balloon shapes', 'crossing river', 'shaving legs', 'high jump', 'luge', 'waxing chest', 'carving marble', 'using bagging machine', 'applauding', 'feeding goats', 'riding elephant', 'playing piano', 'beatboxing', 'water skiing', 'zumba', 'ironing hair', 'cooking sausages (not on barbeque)', 'using atm', 'dancing macarena', 'snorkeling', 'playing cymbals', 'sword fighting', 'digging', 'playing clarinet', 'doing jigsaw puzzle', 'shoveling snow', 'cracking back', 'washing dishes', 'marriage proposal', 'playing pan pipes', 'changing oil', 'playing organ', 'pouring milk', 'robot dancing', 'letting go of balloon', 'cleaning pool', 'sieving', 'arguing', 'blowing leaves', 'headbutting', 'surfing crowd', 'shot put', 'passing American football (not in game)', 'cleaning shoes', 'lighting candle', 'bench pressing', 'unboxing', 'reading newspaper', 'breading or breadcrumbing', 'yarn spinning', 'geocaching', 'pumping gas', 'playing guitar', 'alligator wrestling', 'shuffling cards', 'carrying baby', 'historical reenactment', 'blasting sand', 'walking the dog', 'photobombing', 
'drumming fingers', 'riding a bike', 'riding mechanical bull', 'chopping wood', 'playing ocarina', 'playing beer pong', 'pulling espresso shot', 'capoeira', 'milking goat', 'massaging legs', 'catching or throwing frisbee', 'using a sledge hammer', 'doing aerobics', 'trapezing', 'surfing water', 'dunking basketball', 'playing keyboard', 'cooking chicken', 'playing polo', 'biking through snow', 'playing rubiks cube', 'juggling fire', 'spinning poi', 'playing pinball', 'drop kicking', 'canoeing or kayaking', 'braiding hair', 'shaking hands', 'coughing', 'playing nose flute', 'playing trombone', 'driving tractor', 'adjusting glasses', 'shredding paper', 'making bubbles', 'smoking hookah', 'playing cello', 'folding clothes', 'making paper aeroplanes', 'filling eyebrows', 'making sushi', 'bobsledding', 'making pizza', 'riding mule', 'surveying', 'dyeing hair', 'mowing lawn', 'combing hair', 'cutting nails', 'shopping', 'playing rounders', 'making jewelry', 'walking through snow', 'pole vault', 'driving car', 'decorating the christmas tree', 'listening with headphones', 'lifting hat', 'climbing a rope', 'wading through water', 'slapping', 'eating carrots', 'cutting pineapple', 'eating spaghetti', 'tightrope walking', 'standing on hands', 'tossing salad', 'ripping paper', 'polishing furniture', 'peeling potatoes', 'checking tires', 'putting on mascara', 'curling hair', 'bodysurfing', 'playing violin', 'smoking', 'blowdrying hair', 'frying vegetables', 'using puppets', 'making slime', 'barbequing', 'dumpster diving', 'bottling', 'setting table', 'triple jump', 'chopping meat', 'doing nails', 'playing lute', 'tapping guitar', 'smelling feet', 'playing harmonica', 'making a cake', 'planting trees', 'tying shoe laces', 'javelin throw', 'scuba diving', 'laying bricks', 'hitting baseball', 'mosh pit dancing', 'tasting beer', 'windsurfing', 'tap dancing', 'sleeping', 'testifying', 'tying necktie', 'putting on foundation', 'brushing hair', 'putting on shoes', 'saluting', 
'singing', 'crying', 'marching', 'visiting the zoo', 'land sailing', 'swing dancing', 'yoga', 'shaking head', 'tying bow tie', 'cutting apple', 'hoverboarding', 'rock scissors paper', 'eating chips', 'lighting fire', 'petting cat', 'flipping bottle', 'planing wood', 'pillow fight', 'using a paint roller', 'playing squash or racquetball', 'massaging neck', 'spinning plates', 'coloring in', 'falling off chair', 'using a power drill', 'roasting marshmallows', 'catching fish', 'snowboarding', 'seasoning food', 'sneezing', 'poaching eggs', 'playing marbles', 'skipping stone', 'doing sudoku', 'assembling bicycle', 'steer roping', 'longboarding', "massaging person's head", 'smoking pipe', 'blowing bubble gum', 'opening refrigerator', 'trimming or shaving beard', 'sanding wood', 'reading book', 'being in zero gravity', 'sawing wood', 'bulldozing', 'playing basketball', 'moving furniture', 'playing dominoes', 'welding', 'cleaning windows', 'putting on sari', 'steering car', 'needle felting', 'clapping', 'dancing ballet', 'walking on stilts', 'situp', 'rope pushdown', 'grooming cat', 'base jumping', 'skateboarding', 'grooming dog', 'playing trumpet', 'playing saxophone', 'contact juggling', 'playing monopoly', 'playing chess', 'helmet diving', 'bungee jumping', 'eating watermelon', 'opening door', 'moving baby', 'moving child', 'cooking on campfire', 'blowing out candles', 'jumping bicycle', 'high kick', 'swimming backstroke', 'staring', 'pirouetting', 'looking at phone', 'grooming horse', 'hopscotch', 'crawling baby', 'making latte art', 'hammer throw', 'archery', 'milking cow', 'golf driving', 'tai chi', 'playing bagpipes', 'skipping rope', 'playing ice hockey', 'taking photo', 'metal detecting', 'finger snapping', 'shuffling feet', 'climbing tree', 'curling (sport)', 'capsizing', 'holding snake', 'putting on lipstick', 'chewing gum', 'dancing gangnam style', 'plastering', 'stacking dice', 'fixing hair', 'tying knot (not on a tie)', 'smashing', 'splashing water', 'changing 
wheel (not on bike)', 'putting in contact lenses', 'kicking soccer ball', 'high fiving', 'silent disco', 'building sandcastle', 'uncorking champagne', 'tie dying', 'sucking lolly', 'playing slot machine', 'trimming trees', 'dribbling basketball', 'using a wrench', 'throwing water balloon', 'gospel singing in church', 'opening bottle (not wine)', 'acting in play', 'rolling pastry', 'sharpening knives', 'answering questions', 'folding paper', 'knitting', 'threading needle', 'tango dancing', 'waving hand', 'playing piccolo', 'sticking tongue out', 'playing ping pong', 'laying concrete', 'ice climbing', 'throwing discus', 'installing carpet', 'flint knapping', 'cutting orange', 'faceplanting', 'ice skating', 'eating ice cream', 'cutting watermelon', 'playing mahjong', 'playing harp', 'shouting', 'jumping sofa', 'sled dog racing', 'wrestling', 'water sliding', 'throwing axe', 'breaking boards', 'riding camel']
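The parsing step above can be tried on a few in-memory lines. This is a minimal sketch, assuming each line of the label file has the form `<id> <label>`, which is what the loading code implies; the sample lines and the helper name `parse_label_line` are hypothetical.

```python
# Hypothetical excerpt in the assumed "<id> <label>" format, one entry per line
sample_lines = [
    "0 slicing onion\n",
    "1 jetskiing\n",
    "18 bouncing ball (not juggling)\n",
]


def parse_label_line(line):
    """Drop the leading ID and return the label text, which may itself contain spaces."""
    _, _, label = line.rstrip("\n").partition(" ")
    return label


labels = [parse_label_line(line) for line in sample_lines]
print(labels)
```

Only the first space separates the ID from the label, so labels such as "bouncing ball (not juggling)" stay intact.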

3D ResNet

  • (Figure: structure of the basic blocks)
  • (Figure: overall network architecture)

Model Accuracy

  • Kinetics 700 validation set

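Building the Model Network

  • The 3D ResNet architecture is very close to the original 2D ResNet; the 2D operations are simply replaced by their 3D counterparts, e.g. 2D convolutions become 3D convolutions
  • The original ResNet implementation, taken from paddle.vision, is shown below: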
import paddle
import paddle.nn as nn


class BasicBlock(nn.Layer):
    expansion = 1

    def __init__(self,
                 inplanes,
                 planes,
                 stride=1,
                 downsample=None,
                 groups=1,
                 base_width=64,
                 dilation=1,
                 norm_layer=None):
        super(BasicBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2D

        if dilation > 1:
            raise NotImplementedError(
                "Dilation > 1 not supported in BasicBlock")

        self.conv1 = nn.Conv2D(
            inplanes, planes, 3, padding=1, stride=stride, bias_attr=False)
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2D(planes, planes, 3, padding=1, bias_attr=False)
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out


class BottleneckBlock(nn.Layer):

    expansion = 4

    def __init__(self,
                 inplanes,
                 planes,
                 stride=1,
                 downsample=None,
                 groups=1,
                 base_width=64,
                 dilation=1,
                 norm_layer=None):
        super(BottleneckBlock, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2D
        width = int(planes * (base_width / 64.)) * groups

        self.conv1 = nn.Conv2D(inplanes, width, 1, bias_attr=False)
        self.bn1 = norm_layer(width)

        self.conv2 = nn.Conv2D(
            width,
            width,
            3,
            padding=dilation,
            stride=stride,
            groups=groups,
            dilation=dilation,
            bias_attr=False)
        self.bn2 = norm_layer(width)

        self.conv3 = nn.Conv2D(
            width, planes * self.expansion, 1, bias_attr=False)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU()
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out


class ResNet(nn.Layer):
    def __init__(self, block, depth, num_classes=1000, with_pool=True):
        super(ResNet, self).__init__()
        layer_cfg = {
            18: [2, 2, 2, 2],
            34: [3, 4, 6, 3],
            50: [3, 4, 6, 3],
            101: [3, 4, 23, 3],
            152: [3, 8, 36, 3]
        }
        layers = layer_cfg[depth]
        self.num_classes = num_classes
        self.with_pool = with_pool
        self._norm_layer = nn.BatchNorm2D

        self.inplanes = 64
        self.dilation = 1

        self.conv1 = nn.Conv2D(
            3,
            self.inplanes,
            kernel_size=7,
            stride=2,
            padding=3,
            bias_attr=False)
        self.bn1 = self._norm_layer(self.inplanes)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2D(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        if with_pool:
            self.avgpool = nn.AdaptiveAvgPool2D((1, 1))

        if num_classes > 0:
            self.fc = nn.Linear(512 * block.expansion, num_classes)

    def _make_layer(self, block, planes, blocks, stride=1, dilate=False):
        norm_layer = self._norm_layer
        downsample = None
        previous_dilation = self.dilation
        if dilate:
            self.dilation *= stride
            stride = 1
        if stride != 1 or self.inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2D(
                    self.inplanes,
                    planes * block.expansion,
                    1,
                    stride=stride,
                    bias_attr=False),
                norm_layer(planes * block.expansion), )

        layers = []
        layers.append(
            block(self.inplanes, planes, stride, downsample, 1, 64,
                  previous_dilation, norm_layer))
        self.inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, norm_layer=norm_layer))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        if self.with_pool:
            x = self.avgpool(x)

        if self.num_classes > 0:
            x = paddle.flatten(x, 1)
            x = self.fc(x)

        return x
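One consequence of swapping 2D convolutions for 3D ones is that each kernel gains a temporal axis, so the weight count grows by a factor of the kernel size. A small sketch with a hypothetical helper `conv_weight_count`, using a 64-to-64-channel layer as an example:

```python
def conv_weight_count(in_channels, out_channels, kernel_size, dims):
    """Number of weights in a bias-free convolution whose kernel has the same size on every axis."""
    return in_channels * out_channels * kernel_size ** dims


w2d = conv_weight_count(64, 64, 3, dims=2)   # 3x3 kernel
w3d = conv_weight_count(64, 64, 3, dims=3)   # 3x3x3 kernel
print(w2d, w3d, w3d // w2d)
```

The 3x3x3 layer holds three times as many weights as its 3x3 counterpart, which is part of why 3D networks are heavier than 2D ones of the same depth.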

Building the 3D ResNet Model

  • Starting from the original ResNet, only minor changes are needed: replace the 2D operations with 3D operations
from functools import partial

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

def get_inplanes():
    return [64, 128, 256, 512]


def conv3x3x3(in_planes, out_planes, stride=1):
    return nn.Conv3D(in_planes,
                     out_planes,
                     kernel_size=3,
                     stride=stride,
                     padding=1,
                     bias_attr=False)


def conv1x1x1(in_planes, out_planes, stride=1):
    return nn.Conv3D(in_planes,
                     out_planes,
                     kernel_size=1,
                     stride=stride,
                     bias_attr=False)


class BasicBlock(nn.Layer):
    expansion = 1

    def __init__(self, in_planes, planes, stride=1, downsample=None):
        super().__init__()

        self.conv1 = conv3x3x3(in_planes, planes, stride)
        self.bn1 = nn.BatchNorm3D(planes)
        self.relu = nn.ReLU()
        self.conv2 = conv3x3x3(planes, planes)
        self.bn2 = nn.BatchNorm3D(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out


class Bottleneck(nn.Layer):
    expansion = 4

    def __init__(self, in_planes, planes, stride=1, downsample=None):
        super().__init__()

        self.conv1 = conv1x1x1(in_planes, planes)
        self.bn1 = nn.BatchNorm3D(planes)
        self.conv2 = conv3x3x3(planes, planes, stride)
        self.bn2 = nn.BatchNorm3D(planes)
        self.conv3 = conv1x1x1(planes, planes * self.expansion)
        self.bn3 = nn.BatchNorm3D(planes * self.expansion)
        self.relu = nn.ReLU()
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out


class ResNet(nn.Layer):

    def __init__(self,
                 block,
                 layers,
                 block_inplanes,
                 n_input_channels=3,
                 conv1_t_size=7,
                 conv1_t_stride=1,
                 no_max_pool=False,
                 shortcut_type='B',
                 widen_factor=1.0,
                 n_classes=400):
        super().__init__()

        block_inplanes = [int(x * widen_factor) for x in block_inplanes]

        self.in_planes = block_inplanes[0]
        self.no_max_pool = no_max_pool

        self.conv1 = nn.Conv3D(n_input_channels,
                               self.in_planes,
                               kernel_size=(conv1_t_size, 7, 7),
                               stride=(conv1_t_stride, 2, 2),
                               padding=(conv1_t_size // 2, 3, 3),
                               bias_attr=False)
        self.bn1 = nn.BatchNorm3D(self.in_planes)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool3D(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, block_inplanes[0], layers[0],
                                       shortcut_type)
        self.layer2 = self._make_layer(block,
                                       block_inplanes[1],
                                       layers[1],
                                       shortcut_type,
                                       stride=2)
        self.layer3 = self._make_layer(block,
                                       block_inplanes[2],
                                       layers[2],
                                       shortcut_type,
                                       stride=2)
        self.layer4 = self._make_layer(block,
                                       block_inplanes[3],
                                       layers[3],
                                       shortcut_type,
                                       stride=2)

        self.avgpool = nn.AdaptiveAvgPool3D((1, 1, 1))
        self.fc = nn.Linear(block_inplanes[3] * block.expansion, n_classes)

        for m in self.sublayers():
            if isinstance(m, nn.Conv3D):
                nn.initializer.KaimingNormal()(m.weight)
            elif isinstance(m, nn.BatchNorm3D):
                nn.initializer.Constant(1.0)(m.weight)
                nn.initializer.Constant(0.0)(m.bias)

    def _downsample_basic_block(self, x, planes, stride):
        out = F.avg_pool3d(x, kernel_size=1, stride=stride)
        # Zero-pad the channel dimension up to `planes` (type-A shortcut)
        zero_pads = paddle.zeros([out.shape[0], planes - out.shape[1], out.shape[2],
                                  out.shape[3], out.shape[4]], dtype=out.dtype)

        out = paddle.concat([out, zero_pads], axis=1)

        return out

    def _make_layer(self, block, planes, blocks, shortcut_type, stride=1):
        downsample = None
        if stride != 1 or self.in_planes != planes * block.expansion:
            if shortcut_type == 'A':
                downsample = partial(self._downsample_basic_block,
                                     planes=planes * block.expansion,
                                     stride=stride)
            else:
                downsample = nn.Sequential(
                    conv1x1x1(self.in_planes, planes * block.expansion, stride),
                    nn.BatchNorm3D(planes * block.expansion))

        layers = []
        layers.append(
            block(in_planes=self.in_planes,
                  planes=planes,
                  stride=stride,
                  downsample=downsample))
        self.in_planes = planes * block.expansion
        for i in range(1, blocks):
            layers.append(block(self.in_planes, planes))

        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        if not self.no_max_pool:
            x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)

        x = x.flatten(1)
        x = self.fc(x)

        return x
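To see how a 16x112x112 clip shrinks on its way through this network, the sketch below applies the standard output-size formula for convolution and pooling to each stage. It assumes the configuration used in this article (conv1_t_size=7, conv1_t_stride=1, max pooling enabled); `conv_out` and `trace_shapes` are hypothetical helper names.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(t, h, w):
    """Follow (T, H, W) through conv1, maxpool, and layer1..layer4 of the 3D ResNet above."""
    shapes = {}
    # conv1: kernel (7, 7, 7), stride (1, 2, 2), padding (3, 3, 3)
    t, h, w = conv_out(t, 7, 1, 3), conv_out(h, 7, 2, 3), conv_out(w, 7, 2, 3)
    shapes["conv1"] = (t, h, w)
    # maxpool: kernel 3, stride 2, padding 1 on every axis
    t, h, w = (conv_out(s, 3, 2, 1) for s in (t, h, w))
    shapes["maxpool"] = (t, h, w)
    # layer1 keeps the resolution; layer2..layer4 each use one stride-2 3x3x3 convolution
    shapes["layer1"] = (t, h, w)
    for name in ("layer2", "layer3", "layer4"):
        t, h, w = (conv_out(s, 3, 2, 1) for s in (t, h, w))
        shapes[name] = (t, h, w)
    return shapes


for name, shape in trace_shapes(16, 112, 112).items():
    print(name, shape)
```

After layer4 the temporal axis has collapsed to a single step, and the adaptive average pooling reduces the remaining grid to one feature vector per clip.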

Loading the Model

  • Instantiate a model
  • Load the pretrained model parameters
  • Switch the model to evaluation mode
ckpt = 'data/data121477/r3d50_K_200ep.pdparams'
model_configs = {
    'n_classes': 700,
    'n_input_channels': 3,
    'shortcut_type': 'B',
    'conv1_t_size': 7,
    'conv1_t_stride': 1,
    'no_max_pool': False,
    'widen_factor': 1.0,
}

model = ResNet(Bottleneck, [3, 4, 6, 3], get_inplanes(), **model_configs)

params = paddle.load(ckpt)
model.set_dict(params)
model.eval()
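When loading converted weights, it can help to confirm that the parameter names in the checkpoint match those the model expects. The sketch below shows one way to compare two sets of keys; the helper name and the key names are hypothetical and serve only as illustration.

```python
def compare_state_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected): keys the model needs but the checkpoint lacks, and the reverse."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return sorted(model_keys - checkpoint_keys), sorted(checkpoint_keys - model_keys)


# Hypothetical key names for illustration only
model_keys = ["conv1.weight", "bn1.weight", "bn1.bias", "fc.weight", "fc.bias"]
checkpoint_keys = ["conv1.weight", "bn1.weight", "bn1.bias", "fc.weight", "classifier.bias"]

missing, unexpected = compare_state_keys(model_keys, checkpoint_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

With the real objects, the two inputs would be `model.state_dict().keys()` and `params.keys()`; both lists should come back empty when the checkpoint matches the model.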

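Model Inference

  • To run inference, simply feed the preprocessed data into the model that was just loaded
  • Inference needs no gradients, so gradient computation can be disabled to reduce memory usage
  • The outputs of the sub-sequences are averaged to obtain the overall score
  • The IDs and confidences of the five highest-scoring classes are then extracted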
with paddle.no_grad():
    outputs = model(samples)
    sum_outputs = outputs.mean(axis=0)
    prob = F.softmax(sum_outputs, axis=-1)
    top5_prob, top5_indexs = [t.numpy() for t in paddle.topk(prob, k=5, axis=-1)]
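The fusion step can be illustrated without any framework. The sketch below averages per-window class scores, applies softmax, and picks the top entries in plain Python; the scores are made-up toy numbers and the helper names are hypothetical.

```python
import math


def average_scores(window_scores):
    """Average per-class scores over all windows (rows)."""
    n = len(window_scores)
    return [sum(col) / n for col in zip(*window_scores)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, k):
    """Return (index, probability) pairs of the k largest probabilities."""
    return sorted(enumerate(probs), key=lambda p: p[1], reverse=True)[:k]


# Toy scores: 3 windows x 4 classes (made-up numbers)
window_scores = [
    [2.0, 0.5, 0.1, -1.0],
    [1.5, 1.0, 0.0, -0.5],
    [2.5, 0.0, 0.2, -1.5],
]
probs = softmax(average_scores(window_scores))
print(top_k(probs, k=2))
```

Averaging before the softmax means a window with an unusual score shifts the fused result only in proportion to its share of the clip.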

Result Output

  • Use the label list to convert the output IDs into class names in text form
# Print the results
print('Action recognition results: ')
for index, prob in zip(top5_indexs, top5_prob):
    class_name = label_list[index]
    prob *= 100
    print(f'ID: {index}, class: {class_name}, confidence: {prob:.2f} %')
Action recognition results: 
ID: 534, class: testifying, confidence: 99.84 %
ID: 671, class: answering questions, confidence: 0.07 %
ID: 44, class: giving or receiving award, confidence: 0.02 %
ID: 121, class: attending conference, confidence: 0.01 %
ID: 128, class: news anchoring, confidence: 0.01 %

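Summary

  • The most likely action in the model's prediction, testifying, matches the annotated label, which shows that the model can recognize the action in this video quite accurately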