We investigate representations from pre-trained text-to-image diffusion models for control and demonstrate competitive performance across a wide range of tasks.
We conduct a study on using pre-trained visual representations (PVRs) to train robots for real-world tasks.
We present the largest and most comprehensive empirical study of visual foundation models for Embodied AI (EAI).
In this work, we propose OVRL, a two-stage representation learning strategy for visual navigation tasks in Embodied AI.