Video Captioning via Hierarchical Reinforcement Learning
Paper Information
- Paper: Video Captioning via Hierarchical Reinforcement Learning
- Source code: None
- Note author: 朱正源, a graduate student at Beijing University of Posts and Telecommunications; research interests: multimodal learning and cognitive computing.
Why This Paper Is Recommended
Fine-grained action description remains a major challenge in video captioning. The paper's contributions are twofold: 1. a hierarchical reinforcement learning (HRL) framework, in which a high-level manager identifies coarse-grained video information and sets the goals that steer caption generation, while a low-level worker recognizes fine-grained actions and fulfills those goals; 2. the Charades Captions dataset.
Model Framework
- Processing pipeline:
  - Pretrained CNN encoding stage: extract video frame features $v=\{v_i\}$, where $i$ is the frame index.
  - Language-model encoding stage: a low-level Bi-LSTM encoder yields the worker features $h^{E_w}=\{h_i^{E_w}\}$, and a high-level LSTM encoder yields the manager features $h^{E_m}=\{h_i^{E_m}\}$ (a shape-level sketch follows this list).
  - HRL agent decoding stage: generate the language description $a_{1}a_{2}\dots a_{T}$, where $T$ is the length of the generated caption.
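A minimal, shape-level sketch of the two encoding stages, assuming PyTorch; the frame count, the 2048-d CNN features, and the 512-d hidden sizes are illustrative assumptions, not the paper's configuration:

    import torch
    import torch.nn as nn

    # Assumed sizes: 40 frames, 2048-d CNN features, 512-d hidden states.
    frame_feats = torch.randn(1, 40, 2048)       # v = {v_i} from the pretrained CNN

    # Low-level Bi-LSTM encoder -> worker features h^{E_w}
    worker_enc = nn.LSTM(2048, 512, batch_first=True, bidirectional=True)
    h_Ew, _ = worker_enc(frame_feats)            # shape (1, 40, 1024)

    # High-level LSTM encoder -> manager features h^{E_m}
    manager_enc = nn.LSTM(1024, 512, batch_first=True)
    h_Em, _ = manager_enc(h_Ew)                  # shape (1, 40, 512)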
- Details of the HRL agent:
  - High-level manager:
    - Operates at a lower temporal resolution.
    - Emits a goal for the worker to accomplish.
  - Low-level worker:
    - Generates a word at each time step by following the goal.
  - Internal critic:
    - Determines whether the worker has accomplished the goal.
- Details of the policy network:
  - Attention module (a minimal sketch follows this item):
    - At each time step $t$: $c_t^W=\sum_{i}\alpha_{t,i}^{W}h^{E_w}_i$.
    - The attention scores are $\alpha_{t,i}^{W}=\frac{\exp(e_{t,i})}{\sum_{k=1}^{n}\exp(e_{t,k})}$, where $e_{t,i}=w^{T}\tanh(W_{a} h_{i}^{E_w} + U_{a} h^{W}_{t-1})$.
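A minimal sketch of this additive attention, assuming PyTorch; the dimensions and the class name are assumptions for illustration:

    import torch
    import torch.nn as nn

    class AdditiveAttention(nn.Module):
        # e_{t,i} = w^T tanh(W_a h_i^{E_w} + U_a h_{t-1}^W);
        # alpha = softmax(e); c_t^W = sum_i alpha_{t,i} h_i^{E_w}.
        # All dimensions below are illustrative assumptions.
        def __init__(self, enc_dim=1024, dec_dim=512, attn_dim=256):
            super().__init__()
            self.W_a = nn.Linear(enc_dim, attn_dim, bias=False)
            self.U_a = nn.Linear(dec_dim, attn_dim, bias=False)
            self.w = nn.Linear(attn_dim, 1, bias=False)

        def forward(self, enc_states, dec_prev):
            # enc_states: (n, enc_dim); dec_prev: (dec_dim,)
            e = self.w(torch.tanh(self.W_a(enc_states) + self.U_a(dec_prev))).squeeze(-1)
            alpha = torch.softmax(e, dim=0)                    # alpha_{t,i}
            return (alpha.unsqueeze(-1) * enc_states).sum(0)   # context c_t^W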
  - Manager and worker (sketched below):
    - Manager: takes $[c_t^M, h_t^M]$ as input and produces the goal $g_t$ through an MLP.
    - Worker: receives the goal $g_t$, takes the concatenation of $c_t^W$, $g_t$, and $a_{t-1}$ as input, and outputs the probability distribution $\pi_t$ over all actions $a_t$.
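A sketch of one manager/worker decoding step, again assuming PyTorch; the layer sizes and the exact wiring of the LSTM inputs are illustrative assumptions:

    import torch
    import torch.nn as nn

    class Manager(nn.Module):
        def __init__(self, ctx_dim=512, hid_dim=512, goal_dim=256):
            super().__init__()
            self.lstm = nn.LSTMCell(ctx_dim + hid_dim, hid_dim)
            self.goal_mlp = nn.Sequential(nn.Linear(hid_dim, goal_dim), nn.Tanh())

        def forward(self, c_M, h_M, state):
            # c_M: (B, ctx_dim), h_M: (B, hid_dim); returns goal g_t and new state
            h, c = self.lstm(torch.cat([c_M, h_M], dim=-1), state)
            return self.goal_mlp(h), (h, c)

    class Worker(nn.Module):
        def __init__(self, ctx_dim=1024, goal_dim=256, emb_dim=300,
                     hid_dim=512, vocab_size=10000):
            super().__init__()
            self.lstm = nn.LSTMCell(ctx_dim + goal_dim + emb_dim, hid_dim)
            self.out = nn.Linear(hid_dim, vocab_size)

        def forward(self, c_W, g_t, a_prev, state):
            # a_prev: embedding of the previous word a_{t-1}
            h, c = self.lstm(torch.cat([c_W, g_t, a_prev], dim=-1), state)
            return torch.softmax(self.out(h), dim=-1), (h, c)  # pi_t over a_t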
  - Internal critic (sketched below):
    - Evaluates the worker's progress: an RNN takes the generated word sequence as input and discriminates whether the current goal has been reached.
    - At each step, the critic RNN takes $h^I_{t-1}$ and $a_t$ as input and outputs the probability $p(z_t)$ that the goal is accomplished.
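A sketch of the internal critic; the GRU cell, the sigmoid output head, and all sizes are assumptions for illustration:

    import torch
    import torch.nn as nn

    class InternalCritic(nn.Module):
        def __init__(self, emb_dim=300, hid_dim=512):
            super().__init__()
            self.rnn = nn.GRUCell(emb_dim, hid_dim)   # assumed cell type
            self.head = nn.Sequential(nn.Linear(hid_dim, 1), nn.Sigmoid())

        def forward(self, a_t, h_prev):
            # a_t: (B, emb_dim) embedding of the current word; h_prev: h^I_{t-1}
            h_t = self.rnn(a_t, h_prev)
            return self.head(h_t), h_t    # p(z_t): probability the goal is done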
- Details of learning:
  - Reward definition: $R(a_t)=\sum_{k\ge 0} \gamma^{k} f(a_{t+k})$, where $f(x)=\mathrm{CIDEr}(sent+x)-\mathrm{CIDEr}(sent)$ and $sent$ is the previously generated caption; i.e., the reward of a word is the discounted sum of the CIDEr gains contributed by it and its successors (a small computation sketch follows).
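A small sketch of this reward computation; `cider` is a hypothetical scorer that returns the CIDEr score of a (partial) caption:

    def delta_rewards(words, cider, gamma=0.9):
        # f(a_k) = CIDEr(sent + a_k) - CIDEr(sent), with sent the words so far
        f, sent = [], []
        for w in words:
            prev = cider(sent)          # `cider` is a hypothetical scorer
            sent = sent + [w]
            f.append(cider(sent) - prev)
        # R(a_t) = sum_{k>=0} gamma^k * f(a_{t+k}), computed right-to-left
        returns, running = [0.0] * len(f), 0.0
        for t in reversed(range(len(f))):
            running = f[t] + gamma * running
            returns[t] = running
        return returns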
- Pseudocode of the HRL training algorithm (a cleaned-up sketch; the helper names below are illustrative, not the paper's code):
    # Alternating training: the worker and the manager are optimized in turn,
    # each with the other's parameters frozen.
    load(training_pairs)                          # (video, caption) pairs
    load(pretrained_CNN, internal_critic)

    for i in range(M):                            # M training iterations
        minibatch = sample_random_minibatch(training_pairs)
        if train_worker:
            disable_goal_exploration()            # manager emits goals deterministically
            sampled_cap = sample_caption(minibatch)   # a_1, a_2, ..., a_T
            rewards = calculate_R(sampled_cap)    # discounted delta-CIDEr rewards
            freeze(Manager)                       # manager is fixed
            update_with_policy_gradient(Worker, rewards)
        elif train_manager:
            init_random_process(N)                # exploration noise on the goals
            greedy_cap = greedy_decode(minibatch) # worker decodes greedily
            rewards = calculate_R(greedy_cap)
            freeze(Worker)                        # worker is fixed
            update_with_policy_gradient(Manager, rewards)
All-in-one overview (framework diagram omitted)
Datasets
- MSR-VTT
10,000 video clips (41.2 hours of video in total) with 200,000 clip-sentence pairs.
- Charades
Charades Captions: 9,848 videos of indoor activities, with 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions.
Experimental Results
Experiment Visualization
Model Comparison
Comments