UMI on Legs is a framework for combining real-world human demonstrations with simulation trained whole-body controllers, providing a scalable approach for manipulation skills on robot dogs with arms.


The best part? You can plug-and-play your existing visuomotor policies onto a quadruped, making your manipulation policies mobile!


Technical Summary Video

Robot Data without Robots

Robots are the reason why it's been hard to collect a lot of robot data. They are expensive 💸, tricky to control for dexterous tasks 🎾, and can punch a hole in the wall (or in themselves) if you're not careful ☠️. What if we could collect robot data without robots?

UMI is a handheld gripper with a GoPro attached. With UMI, we can collect real-world robot demonstrations anywhere for any manipulation skill, without any robots. So just walk out into the wild with UMI, and start collecting data!

Training on UMI demonstrations gives a policy that outputs gripper movements from image inputs. But how should the dog's legs move to track those gripper movements? 🤔

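Below is a minimal sketch, not the project's actual API, of what that hand-off looks like: the visuomotor policy consumes an image and emits a short trajectory of gripper poses, and the whole-body controller is then responsible for making the real gripper follow it. All names, the horizon, and the waypoint fields here are illustrative assumptions.

# Hypothetical sketch of the policy -> whole-body controller interface.
from dataclasses import dataclass
import numpy as np

@dataclass
class GripperWaypoint:
    time_s: float                 # time offset relative to "now"
    position_m: np.ndarray        # (3,) gripper position in the world frame
    orientation_quat: np.ndarray  # (4,) gripper orientation (x, y, z, w)
    gripper_width_m: float        # commanded gripper opening

def fake_visuomotor_policy(rgb_image: np.ndarray) -> list:
    """Stand-in for a trained policy: image in, gripper trajectory out."""
    horizon, dt = 8, 0.1
    return [
        GripperWaypoint(
            time_s=i * dt,
            position_m=np.array([0.4, 0.0, 0.3 + 0.01 * i]),
            orientation_quat=np.array([0.0, 0.0, 0.0, 1.0]),
            gripper_width_m=0.06,
        )
        for i in range(horizon)
    ]

if __name__ == "__main__":
    trajectory = fake_visuomotor_policy(np.zeros((224, 224, 3), dtype=np.uint8))
    # The whole-body controller's job is to move the legs and arm so that the
    # real gripper follows these world-frame waypoints.
    print(f"policy produced {len(trajectory)} waypoints")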

Task Tracking without Simulating Tasks

The promise of simulation engines is that of infinite data. However, hidden in the terms and conditions that no one reads is the painful process of acquiring assets, defining dense rewards, and rendering diverse scenes, not to mention the sim-to-real gap 🙃. All these problems are side-stepped or well-studied if we only use simulation to learn whole-body controllers.

In a massively parallelized simulation, our robot learns through trial and error how to track the UMI gripper.
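
To make that concrete, here is a hedged sketch of the kind of dense tracking reward such a setup could use across thousands of parallel environments; the weights and function names are illustrative, not the paper's exact formulation.

import torch

def gripper_tracking_reward(
    ee_pos: torch.Tensor,       # (num_envs, 3) simulated gripper positions
    ee_quat: torch.Tensor,      # (num_envs, 4) simulated gripper orientations
    target_pos: torch.Tensor,   # (num_envs, 3) world-frame target positions
    target_quat: torch.Tensor,  # (num_envs, 4) world-frame target orientations
    pos_scale: float = 5.0,     # illustrative weights, not tuned values
    rot_scale: float = 1.0,
) -> torch.Tensor:
    """One reward value per parallel environment."""
    pos_err = torch.norm(ee_pos - target_pos, dim=-1)
    # Quaternion distance: identical orientations give |dot| = 1, i.e. zero error.
    quat_dot = torch.abs(torch.sum(ee_quat * target_quat, dim=-1)).clamp(max=1.0)
    rot_err = 2.0 * torch.acos(quat_dot)
    # Exponential shaping keeps each term dense and bounded in (0, 1].
    return torch.exp(-pos_scale * pos_err) + torch.exp(-rot_scale * rot_err)

if __name__ == "__main__":
    n = 4096  # number of parallel environments
    rand_quat = lambda: torch.nn.functional.normalize(torch.rand(n, 4), dim=-1)
    reward = gripper_tracking_reward(torch.rand(n, 3), rand_quat(),
                                     torch.rand(n, 3), rand_quat())
    print(reward.shape)  # torch.Size([4096])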

What makes a Whole-body Controller "Manipulation-Centric"?

Manipulation policies typically predict gripper movements in a fixed world frame, and assume that the robot tracks those movements with stability ⚖️ and precision 🎯. This is exactly what we set out to learn with our manipulation-centric whole-body controller.

Stability ⚖️ Our controller tracks gripper movements in the world frame, instead of the body frame as most prior works do. This means if you push the robot's body, its arm will move in the opposite direction to compensate, as seen above. Now that's what I call a 6 DoF chicken head 🐔!
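
A minimal sketch of why world-frame tracking produces this compensation (illustrative code, not the project's): the fixed world-frame target is re-expressed relative to the base at every control step, so any unexpected base motion shows up as an equal-and-opposite correction for the arm.

import numpy as np

def quat_to_rot(q: np.ndarray) -> np.ndarray:
    """Rotation matrix from a unit quaternion (x, y, z, w)."""
    x, y, z, w = q
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ])

def target_in_body_frame(target_world, base_pos_world, base_quat_world):
    """Express a fixed world-frame gripper target relative to the robot's base."""
    R = quat_to_rot(base_quat_world)
    return R.T @ (target_world - base_pos_world)

if __name__ == "__main__":
    target = np.array([0.5, 0.0, 0.4])         # world-frame gripper target
    base_quat = np.array([0.0, 0.0, 0.0, 1.0]) # base orientation (identity)
    before = target_in_body_frame(target, np.array([0.0, 0.0, 0.3]), base_quat)
    # Push the base 10 cm to its left: the body-frame target shifts 10 cm the
    # other way, so the arm moves in the opposite direction to compensate.
    after = target_in_body_frame(target, np.array([0.0, 0.1, 0.3]), base_quat)
    print(before, after)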

Precision 🎯 We give our controller a trajectory of gripper targets into the future, which allows it to anticipate future gripper movements and reach all targets precisely. For instance, when tossing a tennis ball, the robot can brace its body for a high-velocity toss, planting its front legs into the ground to supply enough tossing force. Meanwhile, when pushing a heavy kettlebell, the robot can mobilize all its legs, knowing that it will have to continue pushing forward in the seconds to come.
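
Here is an illustrative sketch (assumed names and numbers, not the paper's) of the look-ahead part of such an observation: at each control step, the controller sees a few upcoming waypoints sampled from the commanded gripper trajectory, which is what lets it brace or keep pushing before those targets arrive.

import numpy as np

def future_targets(
    trajectory: np.ndarray,     # (T, 3) world-frame gripper positions, evenly spaced
    traj_dt: float,             # spacing between trajectory points (seconds)
    t_now: float,               # current time along the trajectory (seconds)
    num_future: int = 4,        # how many upcoming targets the controller observes
    lookahead_dt: float = 0.2,  # spacing between the observed future targets
) -> np.ndarray:
    """Return (num_future, 3) upcoming targets, clamped to the trajectory's end."""
    times = t_now + lookahead_dt * np.arange(1, num_future + 1)
    idx = np.clip(np.round(times / traj_dt).astype(int), 0, len(trajectory) - 1)
    return trajectory[idx]

if __name__ == "__main__":
    # Toy 2-second trajectory at 50 Hz: the gripper accelerates upward for a toss.
    t = np.arange(0.0, 2.0, 0.02)
    traj = np.stack([0.4 * np.ones_like(t), np.zeros_like(t), 0.3 + 0.2 * t ** 2], axis=-1)
    obs_targets = future_targets(traj, traj_dt=0.02, t_now=0.5)
    print(obs_targets.shape)  # (4, 3) -- these would feed into the controller's observation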

Meet Espresso and Oat Milk!

Following our lab's caffeinated drinks naming tradition, I've decided to name our quadruped Espresso and our new arm Oat Milk. My hope is that, when combined, Espresso and Oat Milk will be as capable as Latte (our UR5, which has unfolded and folded cloths and washed dishes). Deploying policies from Latte to Espresso and Oat Milk, as we've done, is the first step 🚀.


Espresso + Oat Milk comes with a GoPro on its head 🎩, a 3D-printed gripper at the end of its arm 🦾, and an iPhone on its butt 🍑. The GoPro streams visual observations through a capture card, serving as the policy's observation. Meanwhile, the iPhone runs a custom iOS app we developed, which streams the robot's body pose for world-frame stabilization.
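
A rough sketch of that deployment-time I/O, under two assumptions that are mine rather than the project's: the capture card shows up as a regular UVC webcam that OpenCV can read, and the iPhone app sends pose packets over UDP whose layout (7 floats: position plus quaternion) is purely hypothetical.

import socket
import struct

import cv2

def read_camera_frame(device_index: int = 0):
    """Grab one RGB frame from the capture card (exposed as a UVC webcam)."""
    cap = cv2.VideoCapture(device_index)
    ok, frame_bgr = cap.read()
    cap.release()
    return cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB) if ok else None

def read_body_pose(port: int = 5555, timeout_s: float = 2.0):
    """Receive one body-pose packet (hypothetical layout: x, y, z, qx, qy, qz, qw)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("0.0.0.0", port))
        sock.settimeout(timeout_s)  # don't block forever if no packets arrive
        try:
            data, _ = sock.recvfrom(1024)
        except socket.timeout:
            return None
    x, y, z, qx, qy, qz, qw = struct.unpack("<7f", data[:28])
    return (x, y, z), (qx, qy, qz, qw)

if __name__ == "__main__":
    frame = read_camera_frame()  # -> visual observation for the manipulation policy
    pose = read_body_pose()      # -> body pose for world-frame stabilization
    print(None if frame is None else frame.shape, pose)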

Try out UMI on Legs!

Quadruped manipulation has lots of moving parts, and any small bug can lead to a big mess 🤕. We've spent a lot of time figuring out the networking, operating systems, and hardware. We've broken 3 quadruped legs, fried 1 Jetson, and ripped one pair of pants, so you don't have to 👖.

Our Team

¹Stanford University, ²Columbia University, ³Google DeepMind, *Equal contribution

@misc{ha2024umilegs,
      title={{UMI} on Legs: Making Manipulation Policies Mobile with Manipulation-Centric Whole-body Controllers}, 
      author={Huy Ha and Yihuai Gao and Zipeng Fu and Jie Tan and Shuran Song},
      year={2024},
      eprint={2407.10353},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2407.10353},
}

If you have any questions, please contact Huy Ha and Yihuai Gao 🐕

Questions & Answers

What can't this framework do?

There is only one-way communication from the manipulation visuo-motor policy to the whole-body controller, via an end-effector trajectory interface. This has two main drawbacks:

(1) Guarantee Reachability. Sometimes the manipulation policy asks the controller to move to target poses it can't track, like ones that are too high or that rotate the gripper too far. Ideally, the controller could also communicate to the manipulation policy what the hardware is capable of.

(2) Track Multiple End-effectors. Ideally, our robot would involve its entire body in manipulation tasks, not just its gripper. Towards involving the robot's feet, body, and arms in manipulation as well, the real-world data collection platform, the manipulation-controller interface, and the whole-body controller formulation all have some work to do.

I want to build a foundation policy for robotics. What can UMI on Legs teach me about how to go about it?

Tossing was flashy, but the in-the-wild cup rearrangement task (shown below) changed my perspective on what a foundation policy for robotics should look like. We were able to just plug-and-play UMI's publicly released policy onto our robot dog (seen above in 1x speed)! With an expressive embodiment-agnostic interface in place, I'm optimistic about a world where people can separately develop ever more general visuo-motor policies and ever more robust WBCs, knowing that they can be plugged-and-played together in the end.

Can learning to track end-effector trajectories really result in kettlebell pushing?

Yes, and no 🙃 Let me explain.

Yes, because the kettlebell does get pushed to its target zone in our experiments, which technically counts as a success. No, because the robot's behavior was not very elegant, despite the hardware being physically capable of much smoother, stronger pushes. We've included mass, center-of-mass, and other joint domain randomization during training, but additionally including force-torque perturbations at the end-effector during WBC training could take us a step closer to more robust pushing behaviors.
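
As a sketch of that suggested extension (my assumption of how one might set it up, not something in the released training code): sample a random wrench per environment every so often and hand it to whatever rigid-body force API your simulator exposes; apply_wrench_to_ee below is a placeholder for that call.

import torch

def sample_ee_wrench(
    num_envs: int,
    max_force_n: float = 30.0,   # illustrative magnitudes, not tuned values
    max_torque_nm: float = 5.0,
    device: str = "cpu",
) -> torch.Tensor:
    """Return a (num_envs, 6) tensor of random [force_xyz, torque_xyz] wrenches."""
    force = (torch.rand(num_envs, 3, device=device) * 2.0 - 1.0) * max_force_n
    torque = (torch.rand(num_envs, 3, device=device) * 2.0 - 1.0) * max_torque_nm
    return torch.cat([force, torque], dim=-1)

def apply_wrench_to_ee(wrench: torch.Tensor) -> None:
    """Placeholder: forward the wrench to the simulator's rigid-body force API."""
    pass

if __name__ == "__main__":
    wrench = sample_ee_wrench(num_envs=4096)
    apply_wrench_to_ee(wrench)  # would be re-sampled and applied every few control steps
    print(wrench.shape)         # torch.Size([4096, 6])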

What simulator do you use? Why does it look so realistic?

In this project, I use the IsaacGym simulator for physics, but like many of my prior works, I use Blender for rendering (e.g., Scaling Up, FlingBot).


Also, as in my prior works, I've published instructions for importing simulations into Blender. If you're using IsaacGym, you can refer to this project's codebase. If you're using PyBullet, you can use the plugin I developed for my multi-arm motion planning project. If you're using MuJoCo, you can refer to the visualization instructions in my Scaling Up codebase.

Isn't this just DeepWBC or Visual Whole-Body Control?

These whole-body controllers are not manipulation-centric: they do not track end-effector trajectories in the task frame. Instead, they track instantaneous end-effector targets in the body frame and depend on the quadruped's base command. This means their controllers can't be used to deploy existing table-top visuo-motor manipulation policies.