12,847 agents training now · 500k timesteps

Your agent learns.
You get certified.

Train real agents in live environments. Watch reward curves converge. Earn certifications that prove you understand policy gradients beyond the textbook.

epoch.ai/dashboard/run/humanoid-ppo-v3

LIVE

Episode Reward

+324.6

▲ +38.2% last 10k steps

MuJoCo Humanoid v4 — mid-stride

step 2,944

Hyperparameters

lr: 3e-4
γ: 0.99
clip_ε: 0.2
n_steps: 2048
batch: 64
epochs: 10
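To see how these settings interact, here is a minimal sketch that works out how many gradient updates one rollout produces. The PPOConfig class and its method names are hypothetical, and n_envs=1 is an assumption, since the panel does not say how many environments run in parallel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PPOConfig:
    """Hypothetical container for the values shown in the Hyperparameters panel."""

    lr: float = 3e-4
    gamma: float = 0.99
    clip_eps: float = 0.2
    n_steps: int = 2048
    batch: int = 64
    epochs: int = 10
    n_envs: int = 1  # assumption: not stated on the panel

    def minibatches_per_epoch(self) -> int:
        # One rollout holds n_steps * n_envs transitions, split into minibatches.
        return (self.n_steps * self.n_envs) // self.batch

    def updates_per_rollout(self) -> int:
        # Every epoch passes over all minibatches once.
        return self.minibatches_per_epoch() * self.epochs

    def rollouts_for(self, total_timesteps: int) -> int:
        # Ceiling division: the last rollout may overshoot the budget.
        per_rollout = self.n_steps * self.n_envs
        return -(-total_timesteps // per_rollout)


cfg = PPOConfig()
print(cfg.minibatches_per_epoch())  # 32
print(cfg.updates_per_rollout())    # 320
print(cfg.rollouts_for(500_000))    # 245
```

Under that single-environment assumption, a 500k-timestep run is roughly 245 rollouts of 320 gradient updates each.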

policy_loss: -0.0234 value_loss: 0.4821 entropy: 0.6103

timestep: 487,332 episodes: 2,841 mean_reward: 234.7

policy_loss: -0.0198 value_loss: 0.4205 entropy: 0.5987

timestep: 492,100 episodes: 2,876 mean_reward: 267.3

policy_loss: -0.0156 value_loss: 0.3891 entropy: 0.5744

timestep: 496,800 episodes: 2,910 mean_reward: 298.1

[CHECKPOINT] saved at timestep 496,800

policy_loss: -0.0112 value_loss: 0.3344 entropy: 0.5501

timestep: 500,000 episodes: 2,944 mean_reward: 324.6

[CONVERGENCE] reward threshold exceeded — agent certified

policy_loss: -0.0089 value_loss: 0.2987 entropy: 0.5312

timestep: 504,200 episodes: 2,978 mean_reward: 341.2
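If you want to chart these lines yourself, a small parser is enough. This is a sketch that assumes the progress lines always follow the "timestep / episodes / mean_reward" layout shown above. LogRecord and parse_progress are illustrative names, not part of any Epoch tooling.

```python
import re
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    timestep: int
    episodes: int
    mean_reward: float


# Assumed layout: "timestep: 487,332 episodes: 2,841 mean_reward: 234.7"
_PROGRESS = re.compile(
    r"timestep:\s*([\d,]+)\s+episodes:\s*([\d,]+)\s+mean_reward:\s*(-?[\d.]+)"
)


def parse_progress(line: str) -> Optional[LogRecord]:
    """Return a LogRecord for a progress line, or None for any other line."""
    match = _PROGRESS.search(line)
    if match is None:
        return None
    timestep, episodes, reward = match.groups()
    return LogRecord(
        int(timestep.replace(",", "")),
        int(episodes.replace(",", "")),
        float(reward),
    )


lines = [
    "policy_loss: -0.0234 value_loss: 0.4821 entropy: 0.6103",
    "timestep: 487,332 episodes: 2,841 mean_reward: 234.7",
    "[CHECKPOINT] saved at timestep 496,800",
    "timestep: 500,000 episodes: 2,944 mean_reward: 324.6",
]
records = [r for r in map(parse_progress, lines) if r is not None]
for record in records:
    print(record)
```

Loss lines and [CHECKPOINT] markers fall through as None, so only the reward curve points are kept.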


Quantified proof

Numbers from the last training run

Not projected. Not estimated. Pulled from the dashboard 27 minutes ago.

Live metric

Agents trained this month

across 47 environments

▲ +23% vs last month

Live metric

Certification pass rate

on first attempt

▲ Industry avg: 61%

Live metric

0.0x

Faster curriculum completion

vs traditional self-study

▲ Median: 6.4 weeks

Training environments

Pick your environment.
Start the run.

47 environments available

CartPole-v1

Beginner

CartPole — Policy Gradient Foundations

Start here. Balance a pole on a cart. Sounds trivial — the reward curve will humble you before it teaches you.

REINFORCE · Baseline · Value Functions

~50k timesteps · 4,821 enrolled

Completion rate: 78%
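For a feel of what the REINFORCE-with-baseline topic covers, here is a minimal sketch of the two quantities it relies on: discounted returns-to-go and baseline-subtracted advantages. It assumes CartPole's +1 reward per step and uses the episode's mean return as a simple constant baseline. A learned value function is the usual next step, and this is not the course's reference code.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 0.99) -> List[float]:
    """Return-to-go G_t = r_t + gamma * G_{t+1}, computed backwards in one pass."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def baseline_advantages(returns: Sequence[float]) -> List[float]:
    """Subtract a constant baseline (the mean return) to reduce gradient variance."""
    baseline = sum(returns) / len(returns)
    return [g - baseline for g in returns]


# A five-step episode: CartPole pays +1 for every step the pole stays up.
rewards = [1.0] * 5
returns = discounted_returns(rewards)
advantages = baseline_advantages(returns)
print(returns)
print(advantages)
```

Each advantage then weights the log-probability gradient of the action taken at that step. Early steps get positive weight and late steps negative, because the returns-to-go shrink as the episode ends.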

LunarLander-v2

Intermediate

LunarLander — Proximal Policy Optimization

Land a spacecraft using continuous thrust control. PPO's clipped objective will click when you watch the agent overcorrect, then stop overcorrecting.

PPO · Clipped Surrogate · GAE

~200k timesteps · 3,204 enrolled

Completion rate: 61%
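The clipped surrogate named on this card fits in a few lines. The sketch below shows the per-sample objective for a single probability ratio and advantage, with clip_eps=0.2 taken from the dashboard panel. It is a scalar illustration only, not a full PPO update.

```python
def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio is pi_new(a|s) / pi_old(a|s); the objective is maximized, so the
    training loss is its negative, averaged over a minibatch.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, policy already moved far toward it: the gain is capped.
print(clipped_surrogate(ratio=1.5, advantage=1.0))
# Bad action, policy moved toward it anyway: the full penalty is kept.
print(clipped_surrogate(ratio=1.5, advantage=-1.0))
# Bad action, policy already moved far away: no extra credit for going further.
print(clipped_surrogate(ratio=0.5, advantage=-1.0))
```

The asymmetry is the point. The objective stops rewarding steps that move too far from the old policy, but it never hides a mistake, and that is what damps the overcorrection.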

Humanoid-v4

Advanced

MuJoCo Humanoid — Sim-to-Real Transfer

Train a 17-DOF humanoid to walk. Then transfer the policy to hardware. This is the capstone that gets you hired.

SAC · Domain Randomization · Sim2Real

~500k timesteps · 1,847 enrolled

Completion rate: 34%
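Domain randomization, one of the topics on this card, amounts to resampling the simulator's physics every episode so the policy cannot overfit to one set of constants. Here is a rough sketch. The parameter names and ranges are made up for illustration and are not the lab's actual settings.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicsParams:
    """Hypothetical per-episode physics settings."""

    friction: float
    mass_scale: float
    motor_strength: float


# Illustrative ranges only; real ranges depend on the target hardware.
RANGES = {
    "friction": (0.5, 1.5),
    "mass_scale": (0.8, 1.2),
    "motor_strength": (0.9, 1.1),
}


def sample_physics(rng: random.Random) -> PhysicsParams:
    """Draw each parameter uniformly from its range."""
    return PhysicsParams(
        **{name: rng.uniform(low, high) for name, (low, high) in RANGES.items()}
    )


rng = random.Random(0)  # seeded so runs are reproducible
for episode in range(3):
    params = sample_physics(rng)
    # A real training loop would apply params to the simulator before reset.
    print(episode, params)
```

A policy that walks across the whole sampled range has a better chance of treating the real robot as one more variation.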

TradingEnv-v3

Advanced

Quant Trading Agent — Market Microstructure

Build an agent that reads Level 2 order books and executes trades. The reward function is a modified Sharpe ratio.

DDPG · Order Book · Sharpe Reward

~1M timesteps · 2,103 enrolled

Completion rate: 29%
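The card does not say how the Sharpe ratio is modified, so the sketch below shows only the plain version over a window of per-step returns. The small eps in the denominator is our own assumption, added to keep a flat window from dividing by zero. sharpe_reward is a hypothetical name.

```python
import statistics
from typing import Sequence


def sharpe_reward(
    step_returns: Sequence[float],
    risk_free: float = 0.0,
    eps: float = 1e-8,  # assumption: stabilizer, not taken from the course
) -> float:
    """Mean excess return divided by its standard deviation over the window."""
    excess = [r - risk_free for r in step_returns]
    mean = statistics.fmean(excess)
    std = statistics.pstdev(excess)
    return mean / (std + eps)


steady = [0.01, 0.02, 0.03]
choppy = [0.10, -0.08, 0.04]
print(sharpe_reward(steady))
print(sharpe_reward(choppy))
```

Both windows average a 2% return per step, but the choppy one scores far lower. That is the behavior a risk-adjusted reward is meant to teach.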

MARL-Grid-v2

Advanced

Multi-Agent Coordination — Cooperative RL

Five agents. One shared reward. Watch emergent communication protocols develop by episode 3,000.

MADDPG · Credit Assignment · Emergent

~800k timesteps · 987 enrolled

Completion rate: 22%

Free tier unlocks CartPole, LunarLander, and the sandbox dashboard immediately

No credit card · 3 environments free · Sandbox dashboard included

After the run

What convergence gets you

Certification outcomes, salary data, and capstone projects from engineers who shipped.

Salary delta

+$45k median increase

After Epoch certification. Based on 847 alumni reports, 2025.

Pre-cert: $142k
Post-cert: $187k
Capstone: $224k
Top 10%: $273k

n=847 · self-reported · US market · 2025

Hiring partners

38 companies hiring

Epoch-certified engineers get fast-tracked at these firms.

DeepMind

OpenAI

Waymo

Two Sigma

Boston Dynamics

Cohere

This quarter

2,341

Certifications issued

Q1 2026

Avg completion: 6.4 weeks

Capstone projects

Real projects. Real reward curves. Real hires.

+412 reward at 500k steps

Bipedal Walker with Curriculum Learning

Priya Krishnamurthy

ML Engineer → Robotics at Waymo

Sharpe 2.4 after 1M steps

Market-Making Agent with Risk Constraints

Marcus Webb

Quant Dev → RL Researcher at Two Sigma

87% zero-shot transfer rate

Sim-to-Real Locomotion Transfer

Yuki Tanaka

Robotics Researcher → Lead at Boston Dynamics

Full Curriculum Map

9 weeks · 3 tracks · 15 lab environments

Included in free tier · Pro required

Markov Decision Processes — from theory to code

Dynamic Programming: Value & Policy Iteration

Monte Carlo Methods in Practice

Temporal Difference Learning

Q-Learning & SARSA — CartPole lab

Deep Q-Networks — Atari from scratch

Policy Gradient Theorem — derivation + code

Actor-Critic Methods

PPO — LunarLander lab

SAC for continuous control

Model-Based RL & World Models

Multi-Agent RL & Emergent Behavior

Sim-to-Real Transfer — MuJoCo lab · 10h

RL for Trading — order book environments

Capstone: Choose your environment · 20h

47 engineers started training in the last hour

The reward curve won't wait.
Neither should you.

Free tier. Three environments. Sandbox dashboard. No credit card. Your first agent could be converging in 30 minutes.

No credit card · 3 environments free · Instant access
