This project implements a reinforcement learning system designed to teach a humanoid (the `humanoid.xml` model from MuJoCo) locomotion without requiring reference motions or pre-trained weights. The implementation uses the MuJoCo physics engine for accurate simulation and Stable Baselines3's PPO algorithm for training, with a focus on creating a flexible and extensible framework for robotic motion research that allows custom reward functions to be added and selected.
- Direct MuJoCo physics engine integration for precise simulation
- Flexible environment configuration system
- Customizable reward functions; the implemented ones focus on:
  - Forward velocity matching
  - Postural stability
  - Energy efficiency
  - Balance maintenance
- Proximal Policy Optimization (PPO) implementation with optimized hyperparameters:
  - Two-layer MLP network (256 units each)
  - ReLU activation functions
  - Carefully tuned learning rates and batch sizes
- Parallel environment training using `SubprocVecEnv`
- Comprehensive configuration system:
```python
config = {
    # Environment parameters
    "env_kwargs": {
        "total_timesteps": 20_000_000,
        "render_interval": 5000,
        "n_envs": 8,
        "reward_function": "stand",
        "frame_skip": 3,
        "framerate": 60
    },
    # PPO parameters
    "ppo_kwargs": {
        "learning_rate": 5e-5,
        "n_steps": 2048,
        "batch_size": 128,
        "n_epochs": 10,
        "gamma": 0.99,
        "gae_lambda": 0.95,
        "clip_range": 0.2,
        "ent_coef": 0.002,
        "policy_kwargs": {
            "activation_fn": "ReLU",  # Will be converted to torch.nn.ReLU
            "net_arch": {
                "pi": [256, 256],
                "vf": [256, 256]
            }
        }
    }
}
```
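The `activation_fn` string is resolved to the corresponding `torch.nn` class before being handed to Stable Baselines3. As a rough sketch of how such a config can be consumed (assuming `custom_env.py` exposes a `make_env()` factory; the actual wiring lives in `train_sb3.py`):

```python
# Sketch only: feeding a config dict like the one above into SB3.
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

from config import config          # the dict shown above
from custom_env import make_env    # assumed environment factory

if __name__ == "__main__":         # required for SubprocVecEnv on spawn-based platforms
    env_kwargs = config["env_kwargs"]
    ppo_kwargs = dict(config["ppo_kwargs"])

    # Resolve the "ReLU" string to the actual torch activation class
    policy_kwargs = dict(ppo_kwargs.pop("policy_kwargs"))
    policy_kwargs["activation_fn"] = getattr(nn, policy_kwargs["activation_fn"])

    # Parallel rollout collection across n_envs subprocesses
    vec_env = make_vec_env(make_env, n_envs=env_kwargs["n_envs"],
                           vec_env_cls=SubprocVecEnv)

    model = PPO("MlpPolicy", vec_env, policy_kwargs=policy_kwargs, **ppo_kwargs)
    model.learn(total_timesteps=env_kwargs["total_timesteps"])
```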
- Command-line interface for customizing training parameters (an example invocation follows this list):
  - Environment Parameters:
    - `--total_timesteps`: Total timesteps for training (default: 20M)
    - `--render_interval`: Interval between video recordings (default: 2500)
    - `--n_envs`: Number of parallel environments (default: 8)
    - `--reward_function`: Type of reward function to use (default: "stand")
    - `--frame_skip`: Number of frames to skip (default: 3)
    - `--framerate`: Framerate for rendering (default: 60)
  - PPO Hyperparameters:
    - `--learning_rate`: Learning rate (default: 1e-4)
    - `--n_steps`: Number of steps per update (default: 2048)
    - `--batch_size`: Minibatch size (default: 128)
    - `--n_epochs`: Number of epochs (default: 20)
    - `--gamma`: Discount factor (default: 0.99)
    - `--gae_lambda`: GAE lambda parameter (default: 0.95)
    - `--clip_range`: Clipping parameter (default: 0.2)
    - `--ent_coef`: Entropy coefficient (default: 0.0)
  - Network Architecture:
    - `--net_arch_pi`: Policy network architecture (default: [64, 64])
    - `--net_arch_vf`: Value function network architecture (default: [64, 64])
  - Configuration File:
    - `--config`: Path to config file for loading predefined settings
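For example, a run that overrides a few of these defaults from the command line might look like the following (the flag values here are purely illustrative, not recommended settings):

```bash
python main.py --reward_function kneeling --n_envs 8 \
               --learning_rate 1e-4 --batch_size 128 --ent_coef 0.002
```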
- Real-time rendering during training, at the interval set by `render_interval`, saved to the recordings directory
- Detailed state monitoring including:
  - Tensorboard data saved in the `sb3_results` directory
  - Critical values concerning learning output to the command line
- The trained policy can be saved into an XML file that can be loaded into MuJoCo, where the keyframes clearly show the torque values in the joints (see `generate_trajectories.py`)
- The trained policy can also be rendered and saved as a video into the recordings directory
  - Saved as `episode_-1.mp4`
  - A key point for this rendering is that actions are deterministic: `model.predict(obs, deterministic=True)`
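In outline, such a deterministic rollout looks like the sketch below; the checkpoint path and the `make_env()` factory are assumptions, the custom environment is assumed to follow the Gymnasium step API, and the actual code is in `render_policy.py`:

```python
# Illustrative deterministic evaluation loop (not the exact render_policy.py code).
from stable_baselines3 import PPO
from custom_env import make_env   # assumed environment factory

model = PPO.load("sb3_results/ppo_humanoid")   # hypothetical checkpoint path
env = make_env()

obs, _ = env.reset()
done = False
while not done:
    # deterministic=True uses the policy mean instead of sampling an action
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
```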
- Conda package manager
- MuJoCo physics engine
- Python 3.10
```bash
conda env create -f environment.yml
conda activate mujo
```

The system supports various training configurations through command-line arguments or configuration files:

```bash
python main.py --config config.py
```

Key configuration options include:
- Learning rate: 5e-5
- Batch size: 128
- Network architecture: [256, 256] for both policy and value functions
- Frame skip: 3 (for simulation efficiency)
- Multiple parallel environments: 8 (default)
Render the deterministic, learned policy and save it into the `recordings/` directory:

```bash
python render_policy.py
```

Save the XML file containing all of the steps of the deterministic actions taken by the trained policy. This allows the XML file to be loaded into MuJoCo and the resulting keyframes to be inspected:

```bash
python generate_trajectories.py
```

The environment provides comprehensive state information including:
- Joint positions and velocities
- Center of mass position and velocity
- Contact forces
- Actuator states
Note: for all of the reward functions the policy was trained with joint positions and velocities only, but the other values above can easily be used for training.
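As an illustration (not the exact `custom_env.py` code), the trained observation vector can be assembled directly from the MuJoCo state, with the other quantities available on `MjData` if needed:

```python
import numpy as np

def get_obs(data):
    """Illustrative observation: joint positions and velocities only."""
    # Other available signals (not used for training here):
    #   data.subtree_com      - center of mass positions
    #   data.cfrc_ext         - external contact forces
    #   data.actuator_force   - actuator states
    return np.concatenate([data.qpos.ravel(), data.qvel.ravel()])
```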
A variety of reward functions were created and experimented with. Two key reward functions that proved quite successful are `stand_reward`, identified by the key `stand`, and `robust_kneeling_reward`, identified by the key `kneeling`. These keys are used in the config or via argparse to specify the reward function for training.
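For orientation, a standing-style reward of this kind typically combines a height target, an uprightness term, and an energy penalty. The sketch below is illustrative only, with made-up weights and an assumed torso body index; the implemented `stand_reward` and `robust_kneeling_reward` live in `reward_functions.py`:

```python
import numpy as np

def stand_reward_sketch(data, target_height=1.282):
    """Illustrative standing reward, not the implemented stand_reward."""
    height = data.qpos[2]                        # root height (free-joint z)
    upright = data.xmat[1].reshape(3, 3)[2, 2]   # z-alignment; body 1 assumed to be the torso
    effort = np.sum(np.square(data.ctrl))        # energy expenditure proxy

    height_term = np.exp(-8.0 * (height - target_height) ** 2)
    return height_term + 0.5 * upright - 0.01 * effort
```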
- Policy network: Dual 256-unit hidden layers with ReLU activation
- Value network: Matching architecture for stable learning
- The policy and value networks do not share parameters
- PPO-specific parameters optimized for humanoid locomotion
- Main training implementation: `train_sb3.py`
- Environment definition: `custom_env.py`
- Reward functions: `reward_functions.py`
- Configuration: `config.py`
- Visualization tools: `render_policy.py`, `generate_trajectories.py`
- Main file for starting training: `main.py`
The agent has been trained with an observation space of the positions and velocities of the joints (without any additional data). Furthermore, unlike the Humanoid-v5 environment in Gymnasium, no restriction is placed on the action space (Gymnasium clips the values to [-0.4, 0.4], but this environment does not clip the action space). The agent is free to exert maximum torque in the joints by sending the maximum control signal. However, in both of the results below the agent is penalized for energy expenditure, so the policy has to learn the optimal way to move the body to achieve the desired pose while minimizing energy expenditure.
Additionally, the starting position and the initial velocity of the humanoid are perturbed during training to test the robustness of the policy and improve its generalization.
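In practice such a perturbation is just small noise added to the initial state at reset; a minimal sketch follows (the uniform distribution and noise scale are assumptions, not the values used in `custom_env.py`):

```python
import mujoco
import numpy as np

def perturbed_reset(model, data, noise_scale=0.01, rng=None):
    """Illustrative reset that jitters initial joint positions and velocities."""
    rng = rng or np.random.default_rng()
    mujoco.mj_resetData(model, data)
    data.qpos[:] += rng.uniform(-noise_scale, noise_scale, size=model.nq)
    data.qvel[:] += rng.uniform(-noise_scale, noise_scale, size=model.nv)
    mujoco.mj_forward(model, data)   # recompute derived quantities after the jitter
```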
The key results that were achieved are the standing and kneeling poses. The duration of each episode is 10 s for standing and 3 s for kneeling.
For the hyperparameters, take a look at the `config.py` file.
The standing reward function successfully achieved its primary objective of maintaining a stable upright posture. The agent learned to:
- Maintain target height of 1.282m
- Keep balanced orientation with minimal deviation
- Efficiently use joint torques
- Distribute weight evenly between feet
The kneeling reward function produced an unexpected but interesting result. While originally designed for standing, the agent discovered a stable kneeling posture that:
- Minimizes energy expenditure
- Maintains stable orientation
- Achieves good balance
- Effectively distributes contact forces
Both reward functions demonstrate the ability of the PPO algorithm to find stable solutions, even if they're not the initially intended ones. The kneeling behavior emerged as a locally optimal solution that satisfied the reward criteria in an unexpected way.