12,847 agents training now · 500k timesteps

Your agent learns.
You get certified.

Train real agents in live environments. Watch reward curves converge. Earn certifications that prove you understand policy gradients beyond the textbook.

epoch.ai/dashboard/run/humanoid-ppo-v3
LIVE

Episode Reward

+324.6

▲ +38.2% over the last 10k steps
Convergence chart (x-axis: 0 to 500k timesteps)

MuJoCo Humanoid v4 — mid-stride

episode 2,944

Hyperparameters

lr: 3e-4
γ: 0.99
clip_ε: 0.2
n_steps: 2048
batch: 64
epochs: 10
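For a sense of scale, the hyperparameters above imply a fixed amount of optimization work per rollout. The sketch below is illustrative only: it assumes a single environment and the common PPO convention that n_steps is the rollout length, batch is the minibatch size, and epochs is the number of passes over each rollout. The dashboard does not state these conventions, and the names PPOConfig and update_budget are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    # Values copied from the Hyperparameters panel.
    lr: float = 3e-4
    gamma: float = 0.99
    clip_eps: float = 0.2
    n_steps: int = 2048   # assumed: rollout length per environment
    batch: int = 64       # assumed: minibatch size
    epochs: int = 10      # assumed: passes over each rollout


def update_budget(cfg: PPOConfig, total_timesteps: int, n_envs: int = 1) -> dict:
    """Count the rollouts and gradient steps implied by the config."""
    rollout_size = cfg.n_steps * n_envs
    rollouts = total_timesteps // rollout_size  # complete rollouts only
    minibatches = rollout_size // cfg.batch
    grad_steps_per_rollout = minibatches * cfg.epochs
    return {
        "rollouts": rollouts,
        "minibatches_per_epoch": minibatches,
        "grad_steps_per_rollout": grad_steps_per_rollout,
        "total_grad_steps": rollouts * grad_steps_per_rollout,
    }


budget = update_budget(PPOConfig(), total_timesteps=500_000)
print(budget)
```

Under these assumptions, a 500k-timestep run is 244 complete rollouts of 32 minibatches each, or 78,080 gradient steps in total.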
policy_loss: -0.0234 value_loss: 0.4821 entropy: 0.6103
timestep: 487,332 episodes: 2,841 mean_reward: 234.7
policy_loss: -0.0198 value_loss: 0.4205 entropy: 0.5987
timestep: 492,100 episodes: 2,876 mean_reward: 267.3
policy_loss: -0.0156 value_loss: 0.3891 entropy: 0.5744
timestep: 496,800 episodes: 2,910 mean_reward: 298.1
[CHECKPOINT] saved at timestep 496,800
policy_loss: -0.0112 value_loss: 0.3344 entropy: 0.5501
timestep: 500,000 episodes: 2,944 mean_reward: 324.6
[CONVERGENCE] reward threshold exceeded — agent certified
policy_loss: -0.0089 value_loss: 0.2987 entropy: 0.5312
timestep: 504,200 episodes: 2,978 mean_reward: 341.2
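The [CONVERGENCE] line in the log above suggests a simple threshold check on mean_reward. The sketch below parses the timestep lines and reports the first timestep where the reward exceeds a threshold. The threshold value of 300 is an assumption (the log only shows that 298.1 did not trigger certification and 324.6 did), and the function names are hypothetical.

```python
import re

# Timestep lines copied from the training log.
LOG = """\
timestep: 487,332 episodes: 2,841 mean_reward: 234.7
timestep: 492,100 episodes: 2,876 mean_reward: 267.3
timestep: 496,800 episodes: 2,910 mean_reward: 298.1
timestep: 500,000 episodes: 2,944 mean_reward: 324.6
timestep: 504,200 episodes: 2,978 mean_reward: 341.2
"""

PATTERN = re.compile(
    r"timestep: ([\d,]+) episodes: ([\d,]+) mean_reward: ([\d.]+)"
)


def parse_log(log: str) -> list[tuple[int, int, float]]:
    """Return (timestep, episodes, mean_reward) tuples in log order."""
    rows = []
    for match in PATTERN.finditer(log):
        timestep, episodes, reward = match.groups()
        rows.append(
            (int(timestep.replace(",", "")),
             int(episodes.replace(",", "")),
             float(reward))
        )
    return rows


def first_crossing(rows, threshold: float):
    """Return the first timestep whose mean reward exceeds the threshold."""
    for timestep, _episodes, reward in rows:
        if reward > threshold:
            return timestep
    return None


REWARD_THRESHOLD = 300.0  # assumption: the real threshold is not shown
rows = parse_log(LOG)
crossed_at = first_crossing(rows, REWARD_THRESHOLD)
print(crossed_at)
```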
Quantified proof

Numbers from the last training run

Not projected. Not estimated. Pulled from the dashboard 27 minutes ago.

Live metric

Agents trained this month

across 47 environments

+23% vs last month
Live metric

Certification pass rate

on first attempt

Industry avg: 61%
Live metric

Faster curriculum completion

vs traditional self-study

Median: 6.4 weeks
Training environments

Pick your environment.
Start the run.

47 environments available
CartPole-v1
Beginner

CartPole — Policy Gradient Foundations

Start here. Balance a pole on a cart. Sounds trivial — the reward curve will humble you before it teaches you.

REINFORCE · Baseline · Value Functions
~50k timesteps · 4,821 enrolled
Completion rate: 78%
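The REINFORCE and Baseline topics on the CartPole card come down to two small computations: discounted returns, and subtracting a baseline to reduce variance. The sketch below is a minimal illustration, not the course's implementation. It borrows γ = 0.99 from the dashboard's hyperparameter panel and uses the mean return as a constant baseline, which is an assumption; a learned value function is another common choice.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def advantages_with_mean_baseline(returns):
    """Subtract the mean return, a simple constant baseline (assumed here)."""
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]


# CartPole gives +1 reward per step while the pole stays up.
rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.99)
advantages = advantages_with_mean_baseline(returns)
print(returns, advantages)
```

In REINFORCE, each step's log-probability gradient is weighted by its advantage, so early steps in a long episode receive more credit than the final steps before the pole falls.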
LunarLander-v2
Intermediate

LunarLander — Proximal Policy Optimization

Land a spacecraft by controlling its main and side engines. PPO's clipped objective will click when you watch the agent overcorrect, then stop overcorrecting.

PPO · Clipped Surrogate · GAE
~200k timesteps · 3,204 enrolled
Completion rate: 61%
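The clipped surrogate named on the LunarLander card can be sketched in a few lines. This is a plain-Python illustration of the standard PPO clipped objective, using the clip_ε = 0.2 shown in the dashboard; the function name and the list-based inputs are assumptions made for readability, and a real training loop would use tensors and automatic differentiation.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO clipped objective over a batch (higher is better)."""
    total = 0.0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# A policy that raised an action's probability by 50% gets no extra
# credit beyond the 1 + clip_eps bound when the advantage is positive.
objective = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
print(objective)
```

The clipping is what damps overcorrection: once the probability ratio leaves the range [1 - clip_ε, 1 + clip_ε] in the direction that would improve the objective, the gradient through that sample vanishes.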
Humanoid-v4
Advanced

MuJoCo Humanoid — Sim-to-Real Transfer

Train a 17-DOF humanoid to walk. Then transfer the policy to hardware. This is the capstone that gets you hired.

SAC · Domain Randomization · Sim2Real
~500k timesteps · 1,847 enrolled
Completion rate: 34%
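Domain randomization, one of the techniques tagged on the Humanoid card, means resampling simulator physics at the start of each episode so the policy cannot overfit to a single set of dynamics. The sketch below shows only the sampling step. Every parameter name and range in it is hypothetical; the card does not say which quantities the lab randomizes or how the values are applied to the simulator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicsParams:
    # Hypothetical multipliers applied to the simulator's nominal values.
    friction: float
    mass_scale: float
    motor_strength: float


# Hypothetical ranges: +/-30% friction, +/-20% mass, +/-10% motor strength.
RANGES = {
    "friction": (0.7, 1.3),
    "mass_scale": (0.8, 1.2),
    "motor_strength": (0.9, 1.1),
}


def sample_params(rng: random.Random) -> PhysicsParams:
    """Draw one set of physics multipliers uniformly from RANGES."""
    return PhysicsParams(
        **{name: rng.uniform(low, high) for name, (low, high) in RANGES.items()}
    )


rng = random.Random(0)  # seeded so runs are reproducible
episode_params = [sample_params(rng) for _ in range(3)]
for params in episode_params:
    print(params)
```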
TradingEnv-v3
Advanced

Quant Trading Agent — Market Microstructure

Build an agent that reads Level 2 order books and executes trades. The reward function is a modified Sharpe ratio.

DDPG · Order Book · Sharpe Reward
~1M timesteps · 2,103 enrolled
Completion rate: 29%
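The trading card describes its reward as a modified Sharpe ratio without saying what the modification is. As a reference point, the sketch below computes the plain, unmodified Sharpe ratio of a series of per-step returns: mean excess return divided by sample standard deviation. The function name, the zero default for the risk-free rate, and the choice to return 0.0 for a flat series are all assumptions.

```python
import math


def sharpe_ratio(returns, risk_free=0.0):
    """Plain Sharpe ratio: mean excess return over its sample std dev."""
    excess = [r - risk_free for r in returns]
    n = len(excess)
    if n < 2:
        return 0.0  # not enough data for a sample standard deviation
    mean = sum(excess) / n
    variance = sum((x - mean) ** 2 for x in excess) / (n - 1)
    std = math.sqrt(variance)
    if std == 0.0:
        return 0.0  # assumed convention for a zero-volatility series
    return mean / std


step_returns = [0.01, 0.03]
print(sharpe_ratio(step_returns))
```

Using a risk-adjusted quantity as the reward, rather than raw profit, penalizes an agent for achieving the same return with more volatility.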
MARL-Grid-v2
Advanced

Multi-Agent Coordination — Cooperative RL

Five agents. One shared reward. Watch emergent communication protocols develop by episode 3,000.

MADDPG · Credit Assignment · Emergent
~800k timesteps · 987 enrolled
Completion rate: 22%

Free tier unlocks three environments, including CartPole and LunarLander, plus the sandbox dashboard, immediately

No credit card · 3 environments free · Sandbox dashboard included
After the run

What convergence gets you

Certification outcomes, salary data, and capstone projects from engineers who shipped.

Salary delta

+$45k median increase

After Epoch certification. Based on 847 alumni reports, 2025.

Pre-cert: $142k
Post-cert: $187k
Capstone: $224k
Top 10%: $273k

n=847 · self-reported · US market · 2025

Hiring partners

38 companies hiring

Epoch-certified engineers get fast-tracked at these firms.

DeepMind
OpenAI
Waymo
Two Sigma
Boston Dynamics
Cohere
This quarter
2,341

Certifications issued

Q1 2026

94% pass rate

Avg completion: 6.4 weeks

Capstone projects

Real projects. Real reward curves. Real hires.

+412 reward at 500k steps

Bipedal Walker with Curriculum Learning

Priya Krishnamurthy

ML Engineer → Robotics at Waymo

Sharpe 2.4 after 1M steps

Market-Making Agent with Risk Constraints

Marcus Webb

Quant Dev → RL Researcher at Two Sigma

87% zero-shot transfer rate

Sim-to-Real Locomotion Transfer

Yuki Tanaka

Robotics Researcher → Lead at Boston Dynamics

Full Curriculum Map

9 weeks · 3 tracks · 15 lab environments

Included in free tier · Pro required
Markov Decision Processes: from theory to code · 3h
Dynamic Programming: Value & Policy Iteration · 4h
Monte Carlo Methods in Practice · 3h
Temporal Difference Learning · 4h
Q-Learning & SARSA: CartPole lab · 5h
Deep Q-Networks: Atari from scratch · 6h
Policy Gradient Theorem: derivation + code · 5h
Actor-Critic Methods · 4h
PPO: LunarLander lab · 8h
SAC for continuous control · 6h
Model-Based RL & World Models · 6h
Multi-Agent RL & Emergent Behavior · 8h
Sim-to-Real Transfer: MuJoCo lab · 10h
RL for Trading: order book environments · 8h
Capstone: Choose your environment · 20h
47 engineers started training in the last hour

The reward curve won't wait.
Neither should you.

Free tier. Three environments. Sandbox dashboard. No credit card. Your first agent could be converging in 30 minutes.

No credit card · 3 environments free · Instant access