We investigate representations from pre-trained text-to-image diffusion models for control and demonstrate competitive performance across a wide range of tasks.
We conduct an empirical study on using pre-trained visual representations (PVRs) to train robots for real-world tasks.
We present the largest and most comprehensive empirical study of visual foundation models for Embodied AI (EAI).
We propose a combined simulation and real-world benchmark for Open-Vocabulary Mobile Manipulation (OVMM).