Paper – Page 7 – Robotics and Computer Vision Lab

[CVPR 2025] CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

Hello. The paper I'm reviewing this time is CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos, published at CVPR 2025. Let's get straight into the review. Introduction: In dynamic urban environments…

Paper X-Review

[arXiv 2026] Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

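The paper introduced today covers an analysis of LLM evaluation and the methods used for that analysis. Typical benchmarks evaluate on the basis of accuracy. However, whether the LLM actually has no knowledge of that information (empty…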

Paper X-Review

[arXiv 2026] Agentic Very Long Video Understanding

Hello. The paper I'm reviewing this time deals with long video understanding, though not merely "long" at around one hour but "very long"!! VU of up to about 50 hours. Let's begin the review. Intro: In this paper, "very long…

Paper X-Review

[EMNLP 2023] Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Hello. The paper I've brought this time analyzes whether VLMs experience visual illusions in much the same way humans do. Let's begin the review. Abstract: Vision-Language Models (VLMs), the vast human-generated…

Paper X-Review

[arXiv 2025] DREAMGEN: Unlocking Generalization in Robot Learning through Video World Model

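Hello. Today I've brought a paper about robot data: a method called DreamGen, proposed by NVIDIA. The more I look at VLAs, the more I notice that, because the amount of data is limited, they tend to be biased toward particular data…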

Paper X-Review

[CVPR 2025] Self-Supervised Spatial Correspondence Across Modalities

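Hello. I'd like to introduce a brand-new paper accepted to CVPR 2025 that currently has one citation. The problem it tackles is matching when no GT is available. As the figure above shows, not only multi-spectral but also cases like photo-sketch…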

Paper X-Review

[TMLR 2026] A Survey of Token Compression for Efficient Multimodal Large Language Models (1)

Hello. Today's X-Review introduces a survey paper on token compression for images, video, and audio in MLLMs. After submitting a paper on the Audio-Visual Question Answering task last week, before I graduate, VLMs…

Paper X-Review

[RA-L 2022] Socially CompliAnt Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation

Hello. The paper I'm reviewing this time is a dataset paper published in RA-L 2022: Socially CompliAnt Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation. Let's get straight into the review…

Paper X-Review

[AAAI 2026] VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning

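Hello! The paper I'm introducing this time proposes the Chain-of-Shot framework (VideoChat-A1), a progressive, shot-level reasoning approach, to address the difficulty of effectively understanding long videos in Long Video Understanding. This paper, existing…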

Paper X-Review

[arXiv 2025] LongVideoAgent: Multi-Agent Reasoning with Long Videos

Why was it proposed? Crucially, most prior systems are non-agentic models: they process a static, pre-encoded or down-sampled video. Prior work performed its analysis with pre-designed (pre-encoded) architectures. These…
