X-Review – Page 13 – Robotics and Computer Vision Lab

[CVPR 2025] What’s in the Image? A Deep-Dive into the Vision of Vision Language Models

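Hello. For my first X-Review of the new year, I brought a paper on VLMs rather than the AVQA papers I have been reading so far. I had the feeling that my view was getting confined to a single task, so a somewhat different view…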

[arXiv 2025] OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation

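Hello. The paper I am reviewing this time is OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation. It was posted to arXiv around September–October 2025, and having read it, applying it to the mobile platform our lab currently runs as well…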

Paper X-Review

[ICRA 2023] Cross-Modality Time-Variant Relation Learning for Generating Dynamic Scene Graphs

Hello, this is 허재연. The paper I am reviewing today was published at ICRA 2023. To overcome the limitations of existing models, which struggle to capture how relations change between adjacent frames, Cross-Modality…

X-Review

[arXiv 2025] Is Image-based Object Pose Estimation Ready to Support Grasping?

Hello, this is 손우진. The paper I brought today was accepted to IROS 2025. However, the version I read turned out to be the one uploaded to arXiv as v2… quite a lot of content is missing, so how did it get accepted…

X-Review

[arXiv 2025] Dream2Flow: Bridging Video Generation and Open-World Manipulation with 3D Object Flow

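Hello, this week I would like to review a study on open-world manipulation that uses the concept of 3D object flow. As the physical and visual expressiveness of recent video models has surged, making it possible to naturally generate manipulation videos…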

Conference X-Review

[NeurIPS 2025] MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

Intro: The task in this paper is estimating depth from a monocular image input, and the paper concerns foundation models such as the DepthAnything series and marigold. For the authors, an ideal Depth foundation model should have…

Conference X-Review

[EMNLP 2025] X-CoT: Explainable Text-to-Video Retrieval via LLM-based Chain-of-Thought Reasoning

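I read this paper because it seems LLMs and CoT have now been introduced into text-to-video retrieval research as well. 1. Introduction: This paper, regarding existing text-to-video retrieval systems and “why this video was retrieved”…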

Paper X-Review

[RA-L 2024] LCPR: A Multi-Scale Attention-Based LiDAR-Camera Fusion Network for Place Recognition

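Today I would like to review a paper on LiDAR-Camera Place Recognition, which relates to the experiments I am currently running. Conceptually it is not a particularly novel paper, but I brought it to organize my notes given its relevance…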

Paper X-Review

[WACV 2024] CAD – Contextual Multi-modal Alignment for Dynamic AVQA

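The paper I am reviewing this time also addresses the Audio Visual Question Answering task. It achieves performance similar to what we are seeing in our own experiments, and as for the Audio-related results among its experiments, our…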

Paper X-Review

[arXiv 2025] Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

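So AI is supposedly that good, but how far can Video Understanding actually go with current technology? I introduce a paper that can serve as an answer to questions like this. This paper, through Agentic Search, on long-video benchmarks…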
