
A Rough Road Towards Robot Intelligence

A Technical Diary on Real-World Robot Policy Deployment
Tonghe Zhang
January 22nd, 2025

This is a technical diary documenting common pitfalls of training robotic policies to solve challenging tasks in the real world. I record the bugs I encountered during research, along with retrospectives and takeaways. I hope that by writing this blog, fewer people will make the same mistakes I did. I will also include some random philosophical thoughts on robot intelligence in general. Let's buckle up and set out on the rough road to general robot intelligence.

Deployment Struggles: The Pi0.5 Policy

Over the past two days, I have been struggling to deploy a Pi0.5 policy on a 4-YAM teleoperation hardware suite. As a newbie at deploying manipulation policies in the real world, I did not know what the best hyperparameters should be, so I consulted Grok, Claude Opus 4.5, and Gemini 3 Pro for advice and eventually came up with a hyperparameter suite that, as it turned out, didn't work well in the real world.

Anyway, I got stuck for a whole day. The major headache was that the first action my Pi0.5 policy predicted was far from the grippers' actual position, and the subsequent action frames drove the grippers somewhere out of reach. This is ridiculous. I figured there could be several issues:

  1. The fine-tuned checkpoint did not converge, or overfitted. This turned out to be false: the loss went down to 5e-3 after 8k steps, and the curve looked good on WandB's log scale.
  2. The evaluation codebase was wrong (highly possible).
  3. The fine-tuning hyperparameter settings were not good enough.

So I spent extra money to let OpenAI Codex help me debug. Unfortunately, that wasted my whole day and left me with code I couldn't parse. The next day I got quite furious and decided to throw away everything the AI said and go back to good old-fashioned debugging: print logs and compare against oracles.

The Debugging Process

What I did the next day included:

1. Check Training Bugs

I loaded the policy on my training machine and added code to load my last checkpoint and print the velocity loss and the action reconstruction loss on the dataset (in L1 and L2 norm). I discovered that the L1/L2 losses were huge. After consulting GPT, I realized that I had not normalized (averaged) over the batch of data; after fixing that, the loss dropped to around the 1e-3 level, as did the velocity loss. This means the policy did converge on the training dataset.
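
The check can be sketched in a few lines (a minimal sketch with made-up tensor shapes; `recon_losses` is my own helper name, not part of any Pi codebase):

```python
import numpy as np

def recon_losses(pred, target):
    """L1/L2 action-reconstruction losses, averaged per element.

    pred, target: (B, H, D) arrays (batch, chunk horizon, action dim).
    """
    diff = pred - target
    # Averaging over ALL dimensions, including the batch, is the fix:
    # summing over the batch is what made my losses look huge.
    l1 = np.abs(diff).mean()
    l2 = np.sqrt((diff ** 2).mean())
    return l1, l2
```

On a checkpoint that truly converged, both numbers should sit near the training-loss scale (around 1e-3 in my case) rather than blowing up with batch size.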

2. Verify Training Data

Then: was the training data correct? I was forced to write an evaluation script on the deployment machine to replay my dataset actions, first in sim and then on the real robot. It worked quite well. So, since my training converged on real data, and my real data was correct, what happened?
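
The replay loop itself can stay tiny. Here is a sketch, assuming the recorded actions are absolute joint targets; `send_action` is a stand-in for whatever driver call your stack exposes, and the per-step clamp is an extra safety guard I would add, not something from the Pi codebase:

```python
import numpy as np

def replay_episode(actions, send_action, max_delta=0.05):
    """Replay recorded dataset actions frame by frame.

    actions: (T, D) array of absolute joint targets from the dataset.
    send_action: callback pushing one action to sim or the real robot
        (hypothetical interface; substitute your own driver call).
    max_delta: clamp on per-joint change between consecutive frames,
        so one bad frame cannot send the arm somewhere out of reach.
    """
    prev = np.asarray(actions[0], dtype=float)
    for a in actions:
        a = np.clip(a, prev - max_delta, prev + max_delta)
        send_action(a)
        prev = a
```

Replaying in sim first with a clamp like this is cheap insurance before the same actions ever reach real hardware.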

3. Inspect Inference Code

I then thoroughly checked all the inference code and discovered that Pi did not provide a perfect codebase that streamlines their normalization pipeline. I decided to be less ambitious and start from something that worked: my training script. I simplified it to a minimal codebase that loads and builds the policy successfully, and migrated it to my eval machine. However, it hit an OOM when I ran the eval there to test whether the model fits the training dataset; after deleting some local variables at each step, it worked on the eval machine.

4. Normalization Pipeline Inspection

Then I thoroughly inspected the normalization pipeline. Just as I thought everything should work, the real eval failed again. I was devastated.

5. The Breakthrough

Then I decided to print out my actions. It suddenly dawned on me that the eval-time actions were way different from the training-time ones, which matched the dataset. Another oddity: when I computed the reconstruction loss on the train set, I did not apply the output-normalization step, and the predictions still fit the train set. This was weird to me. So after canceling output normalization in the eval script too, magic happened. The first action chunk landed near the gripper, and the robot started picking up the t-shirt. This is crazy. I then inspected the robot dataset and found that the "normalized" states/actions did not have zero mean, so my dataset was wrong in the first place! It was never actually normalized.
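
In hindsight, a check of a few lines would have caught this on day one. A sketch, assuming mean/std normalization (the helper name and tolerance are my own choices):

```python
import numpy as np

def check_normalized(arr, tol=0.1):
    """Verify that a supposedly mean/std-normalized (N, D) array of
    states or actions really has ~zero mean and ~unit std per dim."""
    mean, std = arr.mean(axis=0), arr.std(axis=0)
    ok = bool(np.all(np.abs(mean) < tol) and np.all(np.abs(std - 1.0) < tol))
    return ok, mean, std
```

Run it once on a batch straight out of the dataloader: if `ok` is False, the normalization pipeline never ran, no matter what the config says.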

Key Takeaways

Lessons Learned

  1. Log more than just loss during training. Periodically log more information; don't just log your loss. Also add action reconstruction errors (in L1/L2), etc. The more offline metrics you record, the faster you figure out what's wrong. It reminds me of doing RL: I always log excessively, and it really helps me tune hyperparameters quickly. Supervised learning, though conceptually simpler, should be treated with equal respect.
  2. Pay the highest attention to your data. Inspect the data files yourself, and try to memorize the normalization statistics for each robot joint. Then, when you debug, you can spot what's wrong just by reading the output actions.
  3. Add sanity checks after letting AI write code. So far, AI assistants never fully solve problems, and you must be the final verifier. After normalization, quickly fetch a batch of data and compute its per-sample mean to see whether it is zero (if you use mean/std normalization). Printing out the data's min/max distributions is also a great idea if you use other normalization schemes. Also dig into your data, both in your dataset and during model forward passes.
  4. Invest more time in building a debugger. Build debugging tools on your deploy machine and during training. Use more visualization for your images, and print out the statistics of your state/actions. These tools will save you in the end.
  5. Check port-forwarding issues. If you're using server-client communication but the visualization tools show invalid addresses, check the "Ports" window of your IDE to see whether the port got forwarded somewhere else.
  6. Real-world deployment reveals issues simulation doesn't. I never realized that action jerkiness was such a big problem until I deployed my model on real robots. Spend less time on simulation-only research and more time on the robot.
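
To make lessons 1 and 4 concrete, here is the kind of statistics helper I'd now keep on both the training and deploy machines (a sketch; the metric names are my own convention):

```python
import numpy as np

def tensor_stats(name, arr):
    """Offline statistics worth logging for every state/action tensor,
    right next to the training loss (e.g. passed to wandb.log)."""
    arr = np.asarray(arr, dtype=float)
    return {
        f"{name}/mean": float(arr.mean()),
        f"{name}/std": float(arr.std()),
        f"{name}/min": float(arr.min()),
        f"{name}/max": float(arr.max()),
    }
```

Glancing at `actions/mean` once per epoch would have revealed my never-normalized dataset long before any real-robot run.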