Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation

1 ETH Zurich, Switzerland
2 Massachusetts Institute of Technology, USA
3 NVIDIA
* Equal Contribution
CVPR 2024
Overview

Abstract

We study object interaction anticipation in egocentric videos. This task requires an understanding of the spatio-temporal context formed by past actions on objects, coined "action context". We propose TransFusion, a multimodal transformer-based architecture for short-term object interaction anticipation. Our method exploits the representational power of language by summarizing the action context textually, after leveraging pre-trained image captioning and vision-language models to extract the action context from past video frames. The summarized action context and the last observed video frame are processed by the multimodal fusion module to forecast the next object interaction. Experiments on the Ego4D dataset show the effectiveness of our multimodal fusion model and highlight the benefits of using the power of vision-language models and language-based context summaries in a task where vision seems to suffice. Our novel approach outperforms all state-of-the-art methods on the Ego4D next active object interaction dataset.

Video

We present a method for next-active-object interaction anticipation using vision-language-model–generated summaries of the past together with single images.

BibTeX

@inproceedings{
  pasca2024transfusion,
  title={Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation},
  author={Pasca, Razvan and Gavryushin, Alexey and Hamza, Muhammad and Kuo, Yen-Ling and Mo, Kaichun and Van Gool, Luc and Hilliges, Otmar and Wang, Xi},
  booktitle={Conference on Computer Vision and Pattern Recognition 2024},
  year={2024},
  url={https://eth-ait.github.io/transfusion-proj/}
}