This is a technical diary documenting the common pitfalls of training robotic policies to solve challenging tasks in the real world. I record the bugs I encountered during research, along with retrospections and takeaways. I hope that by writing this blog, fewer people will make the same mistakes I did. I will also include some random philosophical thoughts on robot intelligence in general. Let's buckle up and set out on a rough road to general robot intelligence.
Over the past two days, I have been struggling to deploy a Pi0.5 policy on a 4-YAM teleoperation hardware suite. As a newbie in deploying manipulation policies in the real world, I did not know what the best hyperparameters should be. So I consulted Grok, Claude Opus 4.5, and Gemini 3 Pro for advice, and eventually came up with a hyperparameter suite that actually didn't work well in the real world.
Anyway, I got stuck for a whole day. The major headache was that the first action of my Pi0.5 policy landed super far away from the grippers, and the subsequent action frames drove the grippers somewhere out of reach. This is ridiculous. I suspected there could be several issues.
So I spent extra money to let OpenAI Codex help me debug. Unfortunately, that wasted my whole day and produced code I couldn't parse. So the next day I got quite furious, decided to throw away everything the AI said, and went back to good old-fashioned debugging: print logs and compare against oracles.
What I did the next day included:
I loaded the policy on my training machine and added code to load my last checkpoint and print the velocity loss and the action reconstruction loss on the dataset (in both L1 and L2 norm). The L1/L2 losses came out huge. After consulting GPT, I discovered that I had not normalized over the batch of data; once I fixed that, the reconstruction loss dropped to around the 1e-3 level, and so did the velocity loss. This means the policy did converge on the training dataset.
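The loss check can be sketched like this, with hypothetical `pred_actions`/`gt_actions` arrays standing in for the model outputs and the dataset labels:

```python
import numpy as np

def reconstruction_losses(pred_actions, gt_actions):
    """Batch-normalized L1 and L2 reconstruction losses.

    pred_actions, gt_actions: (batch, horizon, action_dim) arrays.
    Averaging over the batch (and element) count is the normalization
    I had originally forgotten, which made the losses look huge.
    """
    diff = pred_actions - gt_actions
    l1 = np.abs(diff).mean()          # mean over batch, horizon, dims
    l2 = np.sqrt((diff ** 2).mean())  # RMS error, also batch-normalized
    return l1, l2

# Hypothetical sanity check: a per-element error of 1e-3 should yield
# losses at the 1e-3 level, independent of batch size.
pred = np.zeros((8, 16, 7))
gt = np.full((8, 16, 7), 1e-3)
l1, l2 = reconstruction_losses(pred, gt)
```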
Then, was the training data correct? I was forced to write an evaluation script on the deployment machine to replay my dataset actions, first in sim and then in real. It worked quite well. So, since my training converged on real data, and my real data was correct, what happened?
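The replay check boils down to feeding the recorded dataset actions straight back to the robot (or sim), bypassing the policy entirely. A minimal sketch, where `send_action` is a hypothetical callback standing in for the real sim/hardware interface:

```python
import time

def replay_episode(actions, send_action, hz=15.0):
    """Replay recorded actions open-loop at a fixed control rate.

    actions: iterable of action vectors exactly as stored in the dataset.
    send_action: callable that forwards one action to sim or real hardware.
    If this replay fails, the dataset itself is broken; if it succeeds,
    the bug lives somewhere between training and inference.
    """
    period = 1.0 / hz
    for a in actions:
        send_action(a)
        time.sleep(period)

# Usage sketch: collect what would be "sent" instead of driving hardware.
log = []
replay_episode([[0.1], [0.2]], log.append, hz=1000.0)
```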
I then thoroughly checked all the inference code and discovered that Pi does not provide a perfect codebase that streamlines their normalization pipeline. I decided to be less ambitious and started from something that worked: my training script. I simplified it to a minimal codebase that loads and builds the policy successfully, and migrated it to my eval machine. It OOMed when I ran the eval there to test whether the model fit the training dataset, but after deleting some large local variables each step, it ran fine on the eval machine.
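The OOM fix amounts to keeping only scalars across iterations and explicitly dropping the large per-step intermediates. A sketch under that assumption, with a hypothetical `run_policy` callable; with torch you would additionally call `torch.cuda.empty_cache()` after the `del`:

```python
import gc
import numpy as np

def eval_on_dataset(batches, run_policy):
    """Evaluate batch by batch without accumulating intermediates.

    run_policy: callable mapping a batch to (loss, actions).
    Only the scalar loss survives each iteration; the big arrays are
    deleted before the next step so peak memory stays at one batch.
    """
    losses = []
    for batch in batches:
        loss, actions = run_policy(batch)
        losses.append(float(loss))  # keep a scalar, not the array
        del batch, actions          # free large locals each step
        gc.collect()
    return float(np.mean(losses))
```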
Then I thoroughly inspected the normalization pipeline. Just when I thought everything should work, the real eval failed again. I was devastated.
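Inspecting the pipeline largely amounts to checking that normalize and unnormalize are exact inverses, using the same stats on both ends. A sketch with hypothetical mean/std stats (the epsilon guards against zero-variance dimensions like a constant gripper channel):

```python
import numpy as np

EPS = 1e-8

def normalize(x, mean, std):
    return (x - mean) / (std + EPS)

def unnormalize(x, mean, std):
    return x * (std + EPS) + mean

# Round-trip check: train-time and eval-time stats must match exactly,
# otherwise actions come out in the wrong frame.
stats = {"mean": np.array([0.5, -0.2]), "std": np.array([2.0, 0.1])}
x = np.array([1.0, 0.3])
roundtrip = unnormalize(normalize(x, **stats), **stats)
```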
Then I decided to print out my actions. It suddenly dawned on me that they were way different from the actions at training time, which matched the dataset. Another clue: when I computed the reconstruction loss on the train set, I did not apply output normalization, yet the predictions still fit the train set. That was weird to me. So after canceling output normalization in the eval script too, magic happened: the first action chunk came out near the gripper, and the robot started picking up the t-shirt. This is crazy. I then inspected the robot dataset and found that the normalized state/actions did not have zero mean, so my dataset was wrong in the first place! It was never actually normalized.
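The root cause could have been caught with one assertion over the dataset: if the stored states/actions were truly normalized, their empirical mean should be near zero and their std near one. A sketch, assuming the supposedly normalized arrays can be stacked in memory:

```python
import numpy as np

def check_normalized(arr, name, atol=0.1):
    """Warn if supposedly normalized data is not ~zero-mean, ~unit-std."""
    mean = arr.mean(axis=0)
    std = arr.std(axis=0)
    ok = np.all(np.abs(mean) < atol) and np.all(np.abs(std - 1.0) < atol)
    if not ok:
        print(f"{name}: mean={mean}, std={std} -- NOT normalized!")
    return bool(ok)

# A properly normalized array passes; raw joint values would not.
rng = np.random.default_rng(0)
good = rng.standard_normal((10000, 7))
good = (good - good.mean(axis=0)) / good.std(axis=0)
raw = good * 3.0 + 5.0  # e.g. unnormalized joint angles
```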