TY - CONF
T1 - Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms
AU - Guo, Diandian
AU - Lin, Manxi
AU - Pei, Jialun
AU - Tang, He
AU - Jin, Yueming
AU - Heng, Pheng-Ann
PY - 2024
AB - A comprehensive understanding of surgical scenes enables monitoring of the surgical process, reducing the occurrence of accidents and enhancing efficiency for medical professionals. Semantic modeling within operating rooms, framed as a scene graph generation (SGG) task, is challenging because it involves the consecutive recognition of subtle surgical actions over prolonged periods. To address this challenge, we propose a Tri-modal (i.e., images, point clouds, and language) confluence with Temporal dynamics framework, termed TriTemp-OR. Diverging from previous approaches that integrate temporal information via memory graphs, our method offers two advantages: 1) it directly exploits bi-modal temporal information from the video stream for hierarchical feature interaction, and 2) it embeds prior knowledge from Large Language Models (LLMs) to alleviate the class-imbalance problem in the operating theatre. Specifically, our model performs temporal interactions across 2D frames and 3D point clouds through a scale-adaptive multi-view temporal interaction (ViewTemp) and a geometric-temporal point aggregation (PointTemp). Furthermore, we transfer knowledge from the biomedical LLM LLaVA-Med to deepen the comprehension of intraoperative relations. The proposed TriTemp-OR aggregates tri-modal features through relation-aware unification to predict relations and generate scene graphs. Experimental results on the 4D-OR benchmark demonstrate the superior performance of our model on long-term OR streams. Code is available at https://github.com/RascalGdd/TriTemp-OR.
KW - Surgical scene understanding
KW - Scene graph generation
KW - Temporal OR interaction
KW - Multi-modality learning
DO - 10.1007/978-3-031-72089-5_67
M3 - Article in proceedings
SN - 978-3-031-72088-8
T3 - Lecture Notes in Computer Science
SP - 714
EP - 724
BT - Proceedings of the 27th International Conference on Medical Image Computing and Computer Assisted Intervention – MICCAI 2024
PB - Springer
T2 - 27th International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI 2024
Y2 - 6 October 2024 through 10 October 2024
ER -