조 원 – Page 5 – Robotics and Computer Vision Lab

Author: 조 원

[arXiv2021] Are Convolutional Neural Networks or Transformers more like human vision? – [1]

Protected: [Review] Multimodal Video-to-Video Retrieval

Protected: 김형준 [ICCV2021 PeerReview] 2364

Protected: [ICCV2021 PeerReview] Cross-Modal Feature Fusion for Object Detection without Depth Supervision

[arXiv2021] MLP-Mixer: An all-MLP Architecture for Vision

[Challenge] ActivityNet Challenge 2020

[CVPR2015] ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding

[arXiv2021] ViViT: A Video Vision Transformer

[arXiv2021] Is Space-Time Attention All You Need for Video Understanding?

[ECCV2016] Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
