X-Review – Page 7 – Robotics and Computer Vision Lab

[ICLR 2025] GENERATIVE REPRESENTATIONAL INSTRUCTION TUNING

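Hello. Today I would like to review the GRIT (Generative Representational Instruction Tuning) paper, which aims to unify an LLM's generation and embedding capabilities in a single model. Recently, attempts to also use MLLM-based generative models for retrieval…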

X-Review

[arXiv 2026] EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data

Hello, this week I would like to review a study recently published by NVIDIA. I have always been curious about how data from domains other than robot data can be used for training, and in this study…

Conference X-Review

[CVPR 2025] RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete

The PhysToolBench paper I reviewed last time included the RoboBrain paper, so I became curious and read it. It was released in February 2025, and the RoboBrain 2.0 report appears to have followed in September. Abstract: Recent MLLMs'…

Paper X-Review

[AAAI 2026] VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning

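Hello! The paper I am introducing this time proposes the Chain-of-Shot framework (VideoChat-A1), a shot-level progressive reasoning approach, to address the difficulty of effectively understanding long videos in Long Video Understanding. In this paper, existing…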

Paper X-Review

[arXiv 2025] LongVideoAgent: Multi-Agent Reasoning with Long Videos

Why was it proposed? "Crucially, most prior systems are non-agentic models: they process a static, pre-encoded or down-sampled video." Prior studies carried out their analysis with pre-designed (pre-encoded) architectures. Such…

Paper X-Review

[arXiv 2025] LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry

Hello, the paper I am reviewing this time is LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry, which was posted to arXiv two months ago. So far, navigation papers based on image goals and language prompts…

Paper X-Review

[CVPR 2025] Apollo: An Exploration of Video Understanding in Large Multimodal Models

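Hello, my third X-Review covers a paper called Apollo. It first points out the problems in video-LLM research to date (as of the paper's writing) and then proposes the authors' own model, so even readers unfamiliar with the LVU task should find it fairly(?) enjoyable…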

X-Review

[arXiv 2025] PhysToolBench: Benchmarking Physical Tool Understanding for MLLMs

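This paper was released on arXiv last October, and I read it because I was curious about its evaluation of MLLMs' tool-understanding ability. I am not sure where it was submitted, but its stepwise division by difficulty level…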

X-Review

[ICCV 2025] STORM: Token-Efficient Long Video Understanding for Multimodal LLMs

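Hello, the paper I am introducing this time was published by NVIDIA. For long-video understanding, it applies token compression with a Mamba-based model to strengthen temporal modeling, improving both performance and efficiency. 1….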

X-Review

[CVPR 2025] Co-op: Correspondence-based Novel Object Pose Estimation

Hello, this is Son Woojin. Today I would like to review a paper on 6D pose estimation from a single RGB image. Measuring 6D pose would normally require depth, but image-based 6D pose prediction without depth…

