Paper – Page 2 – Robotics and Computer Vision Lab

[ICCV 2025] Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs

Video Large Language Models (Video LLMs) have gained strong video understanding capabilities by using spatiotemporal tokens, but their computational cost grows quadratically as the number of tokens increases. To address this, the authors…

Paper X-Review

[AAAI 2024] SA2VP: Spatially Aligned-and-Adapted Visual Prompt

Hello, this is my 4th X-review. This time I have brought a paper of a different flavor: Visual Prompt Tuning. To explain Visual Prompt Tuning (hereafter VPT) in a bit more detail, in LLMs a prompt is used so that the overall…

Paper X-Review

[WACV 2023] MixVPR: Feature Mixing for Visual Place Recognition

Hello, this is my 3rd X-review. For the time being I will probably keep bringing reviews of VPR (visual place recognition) papers. The paper I am introducing today is MixVPR. In brief, even without heavy transformer computation,…

Paper X-Review

[arXiv 2025] WorldVLA: Towards Autoregressive Action World Model

Hello, today I will be explaining WorldVLA. I have been reading VLA papers steadily of late, and I came to feel that the ability to generalize about the world is something these models really need…

Paper X-Review

[CVPR 2025] CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

Hello, the paper I am reviewing this time is CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos, published at CVPR 2025. Let's get straight into the review. Introduction: In dynamic urban environments,…

Paper X-Review

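[arXiv 2026] Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

The paper I am introducing today analyzes LLM evaluation and proposes a method for that analysis. Typical benchmarks evaluate on the basis of accuracy. However, this does not reveal whether the LLM actually lacks knowledge of the information (empty…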

Paper X-Review

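[arXiv 2026] Agentic Very Long Video Understanding

Hello. The paper I am reviewing this time is about long video understanding, but not "long" in the sense of roughly one hour: it deals with "very long" VU of up to around 50 hours. Let's begin the review. Intro: In this paper, “very long…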

Paper X-Review

[EMNLP 2023] Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Hello, the paper I have brought this time analyzes whether VLMs experience visual illusions in a way similar to humans. Let's begin the review. Abstract: Vision-Language Models, or VLMs, on a vast amount of human-generated…

Paper X-Review

[arXiv 2025] DREAMGEN: Unlocking Generalization in Robot Learning through Video World Model

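Hello, today I have brought a paper about robot data. It is a method called DreamGen, proposed by NVIDIA. The more I look at VLAs, the more I see that, because the amount of data is limited, they tend to become biased toward particular data…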

Paper X-Review

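[CVPR 2025] Self-Supervised Spatial Correspondence Across Modalities

Hello, I would like to introduce a fresh paper accepted to CVPR 2025 that currently has a citation count of 1. The problem this paper tries to solve is matching when no GT is available. As you can see in the figure above, not only multi-spectral but also cases like photo-Sketch…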
