Poster: Towards Efficient Spatio-Temporal Video Grounding in Pervasive Mobile Devices

PROCEEDINGS OF THE 2024 THE 22ND ANNUAL INTERNATIONAL CONFERENCE ON MOBILE SYSTEMS, APPLICATIONS AND SERVICES, MOBISYS 2024(2024)

引用 0|浏览7
暂无评分
摘要
As the use of pervasive devices expands into complex collaborative tasks such as cognitive assistants and interactive AR/VR companions, they are equipped with a myriad of sensors facilitating natural interactions, such as voice commands. Spatio-Temporal Video Grounding (STVG), the task of identifying the target object in the field-of-view referred to in a language instruction, is a key capability needed for such systems. However, current STVG models tend to be resource-intensive, relying on multiple cross-attentional transformers applied to each video frame. This results in runtime complexity that increases linearly with video length. Furthermore, deploying these models on mobile devices while maintaining a low-latency poses additional challenges. Hence, this paper explores the latency and energy requirements for implementing STVG models on a pervasive device.
更多
查看译文
关键词
Human-AI Collaboration,Spatio-Temporal Video Grounding
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要