✅ What you'll learn
- Agent, environment, reward
- AlphaGo and self-play
- Reward hacking problems
- RLHF in ChatGPT
I'm Parikshet. I play golf and I use AI. These two things have more in common than most people think, and reinforcement learning is the connection. Let me explain the training method behind AlphaGo, robotic arms that teach themselves new skills, and ChatGPT.
The Basic Idea: Trial, Error, and Reward
Reinforcement learning (RL) has three components:
- Agent — the AI that is learning (could be a game bot, a robot arm, or a language model)
- Environment — the world the agent operates in (a game, a factory floor, a conversation)
- Reward — the signal telling the agent how well it did (points scored, task completed, human rating)
The agent takes actions. Some actions lead to rewards. Some lead to penalties. Over millions of attempts, the agent learns which actions in which situations lead to the most reward over time — and builds a strategy (called a policy) to guide its decisions.
This is how a child learns to ride a bike: fall, get up, adjust, fall less, eventually stop falling. The difference is that an AI agent can run through far more attempts, in far less time, than any human.
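To make the loop concrete, here is a minimal sketch of an agent learning by trial and error. Everything in it is an assumption made for illustration: the corridor environment, the function names `step`, `train`, and `policy`, and the settings for learning rate, discount, and exploration. It uses tabular Q-learning, one common RL method, not the method behind any particular system mentioned in this article.

```python
import random

# Hypothetical toy environment: a corridor of 5 squares, numbered 0 to 4.
# The agent starts on square 0 and earns a reward of 1 for reaching square 4.
N_SQUARES = 5
GOAL = N_SQUARES - 1
ACTIONS = [-1, +1]  # step left or step right


def step(position, action):
    """Environment: apply the action, return (new_position, reward, done)."""
    new_position = min(max(position + action, 0), GOAL)
    done = new_position == GOAL
    reward = 1.0 if done else 0.0
    return new_position, reward, done


def train(episodes=500, learning_rate=0.5, discount=0.9, explore=0.2, seed=0):
    rng = random.Random(seed)
    # q[position][action_index] is the agent's current guess of future reward.
    q = [[0.0, 0.0] for _ in range(N_SQUARES)]
    for _ in range(episodes):
        position, done = 0, False
        while not done:
            # Trial and error: sometimes explore, otherwise use the best guess.
            if rng.random() < explore or q[position][0] == q[position][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[position][0] > q[position][1] else 1
            new_position, reward, done = step(position, ACTIONS[a])
            # Q-learning update: move the guess toward
            # reward + discounted value of the best next action.
            future = 0.0 if done else discount * max(q[new_position])
            q[position][a] += learning_rate * (reward + future - q[position][a])
            position = new_position
    return q


def policy(q):
    """The learned strategy: the best-looking action on each non-goal square."""
    return [ACTIONS[0] if row[0] > row[1] else ACTIONS[1] for row in q[:GOAL]]


q_table = train()
print(policy(q_table))
```

After training, the printed policy should be `[1, 1, 1, 1]`: step right on every square, because that is the shortest route to the reward.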
How AlphaGo Became the Best Go Player in the World
Go is an ancient Chinese board game. The number of possible board positions is greater than the number of atoms in the observable universe, so a computer cannot simply check every option. That made Go a famous challenge for AI researchers, many of whom expected it would take another decade or more for a program to beat top human professionals.
AlphaGo (DeepMind, 2016) used a two-stage approach:
First, supervised learning: train on 30 million positions from expert human games — learning the general shape of good play.
Then, reinforcement learning self-play: thousands of copies of AlphaGo played against each other, 24 hours a day, generating game experience at a rate no human could match. Each win reinforced the moves that led to it. Each loss weakened them.
AlphaGo Zero, a later version, skipped the human game training entirely and started from random play. Within three days of self-play it was beating the version of AlphaGo that had defeated world champion Lee Sedol, and after 40 days it had surpassed every earlier version. Along the way it discovered moves that human players had not considered in 2,500 years of the game.
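Self-play from random play can be shown in miniature. The sketch below is not AlphaGo's method (AlphaGo combined deep neural networks with tree search). It is a simple table of win/loss averages on a made-up stone-taking game, with hypothetical names such as `self_play` and `legal_moves`. What it shares with the description above is the core idea: one agent plays both sides, wins raise its opinion of the moves that led to them, and losses lower it.

```python
import random

# Hypothetical toy game (not Go): a pile of stones, players alternate taking
# 1 or 2 stones, and whoever takes the last stone wins.
START_STONES = 7
MOVES = (1, 2)


def legal_moves(stones):
    return [m for m in MOVES if m <= stones]


def self_play(games=20000, explore=0.2, seed=0):
    rng = random.Random(seed)
    total = {}  # (stones, move) -> sum of results (+1 for a win, -1 for a loss)
    count = {}  # (stones, move) -> number of times the move was tried

    def value(stones, move):
        """Average result seen so far after playing this move (0 if untried)."""
        n = count.get((stones, move), 0)
        return total[(stones, move)] / n if n else 0.0

    def choose(stones):
        moves = legal_moves(stones)
        if rng.random() < explore:
            return rng.choice(moves)  # occasionally try something different
        return max(moves, key=lambda m: value(stones, m))

    for _ in range(games):
        stones, player = START_STONES, 0
        history = {0: [], 1: []}  # moves made by each copy of the agent
        while stones > 0:
            move = choose(stones)
            history[player].append((stones, move))
            stones -= move
            player = 1 - player
        winner = 1 - player  # the copy that moved last took the final stone
        for p in (0, 1):
            result = 1.0 if p == winner else -1.0
            for key in history[p]:
                total[key] = total.get(key, 0.0) + result
                count[key] = count.get(key, 0) + 1

    # The learned strategy: the move with the best average for each pile size.
    return {s: max(legal_moves(s), key=lambda m: value(s, m))
            for s in range(1, START_STONES + 1)}


best_moves = self_play()
print(best_moves)
```

In this game a player wants to leave the opponent a multiple of 3 stones. Nobody tells the agent that rule. With enough games the table should come to prefer taking 1 from a pile of 7 or 4, and 2 from a pile of 5 or 2.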
The Golf Parallel
When I practice golf, my dad gives me feedback: "Your left arm collapsed at the top of the backswing — that's why the ball went right." I adjust. I hit again. Over hundreds of repetitions, the correct movement becomes automatic. My brain has run a biological version of reinforcement learning.
The difference: I need thousands of repetitions over years. An RL agent playing a video game can do 10,000 repetitions in an hour. That compression of experience is why RL systems can exceed human performance in bounded environments so quickly.
When RL Goes Wrong: Reward Hacking
One of my favourite cautionary stories: researchers at OpenAI trained an RL agent to play a boat racing game called CoastRunners. The reward was the in-game score, not finishing the race. The agent discovered it could collect more points by driving in circles picking up power-ups than by actually finishing the race.
It was not cheating by our definition — it found the most efficient path to maximising the reward function as written. The problem was the reward function did not capture what the researchers actually wanted.
This is called reward hacking, and it is one of the central challenges of AI safety. Designing reward functions that capture what humans actually mean — not just what we wrote down — is very hard.
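A toy calculation shows how this happens. The numbers and function names below are invented for illustration and are not taken from the real game. The point is only that the strategy with the highest score is not the strategy the designers wanted.

```python
# Hypothetical toy numbers, not taken from the real boat racing game.
POINTS_PER_POWER_UP = 10   # score for each power-up collected
POINTS_FOR_FINISHING = 50  # one-off score for crossing the finish line
TIME_LIMIT = 40            # number of moves in one episode


def score_finish_race():
    """Drive straight to the finish, picking up 2 power-ups on the way."""
    return 2 * POINTS_PER_POWER_UP + POINTS_FOR_FINISHING


def score_circle_forever(moves_per_lap=4):
    """Loop around one reappearing power-up until time runs out."""
    laps = TIME_LIMIT // moves_per_lap
    return laps * POINTS_PER_POWER_UP


def intended_reward(finished):
    """What the designers actually wanted: finish the race."""
    return 1 if finished else 0


strategies = {
    "finish the race": {"score": score_finish_race(), "finished": True},
    "circle forever": {"score": score_circle_forever(), "finished": False},
}

# The reward as written: pick whatever earns the most points.
best_by_score = max(strategies, key=lambda name: strategies[name]["score"])
# The reward as intended: pick whatever finishes the race.
best_by_intent = max(
    strategies, key=lambda name: intended_reward(strategies[name]["finished"])
)

print(best_by_score)   # circle forever
print(best_by_intent)  # finish the race
```

With these made-up numbers, circling scores 100 points and finishing scores 70, so an agent that only sees the score has every reason to circle.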
RLHF: How ChatGPT Learned to Be Helpful
ChatGPT (and Claude, and Gemini) are not just trained to predict the next word. They are fine-tuned using Reinforcement Learning from Human Feedback (RLHF).
The process: human trainers read pairs of AI responses to the same question and pick the better one. These preferences train a separate "reward model" — an AI that learns to predict which responses humans prefer. The main AI (ChatGPT) is then reinforced toward generating responses that score well on the reward model.
This is how AI chatbots learn to be helpful, avoid harmful outputs, follow instructions, and admit uncertainty. It is literally "learning from human feedback" — an AI being trained the way a teacher might correct a student's essay, at massive scale.
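The reward model step can be sketched with a small example. This is a simplified illustration, not how any particular chatbot is actually built: the two features, the preference pairs, and the function names are all made up, and a real reward model is a large neural network that reads the response text. The sketch fits a logistic (Bradley-Terry style) model so that preferred responses score higher than rejected ones.

```python
import math

# Hypothetical data: each response is described by two made-up features,
# [how_helpful, how_rude], each between 0 and 1.
# Each pair is (response the human preferred, response the human rejected).
preference_pairs = [
    ([0.9, 0.1], [0.2, 0.1]),
    ([0.8, 0.0], [0.8, 0.9]),
    ([0.6, 0.2], [0.3, 0.7]),
    ([0.7, 0.1], [0.1, 0.0]),
]


def reward(weights, features):
    """Reward model: a weighted sum of the features."""
    return sum(w * x for w, x in zip(weights, features))


def train_reward_model(pairs, steps=500, learning_rate=0.5):
    weights = [0.0, 0.0]
    for _ in range(steps):
        for chosen, rejected in pairs:
            # Probability the model gives to the human's actual choice,
            # based on the difference between the two reward scores.
            diff = reward(weights, chosen) - reward(weights, rejected)
            p_correct = 1.0 / (1.0 + math.exp(-diff))
            # Gradient ascent on log(p_correct): raise the score of the
            # preferred response relative to the rejected one.
            for i in range(len(weights)):
                weights[i] += learning_rate * (1.0 - p_correct) * (chosen[i] - rejected[i])
    return weights


weights = train_reward_model(preference_pairs)
print(weights)

# Score two new responses that are equally helpful but differ in rudeness.
polite, rude = [0.9, 0.0], [0.9, 0.8]
print(reward(weights, polite) > reward(weights, rude))  # True
```

The learned weights should come out positive for helpfulness and negative for rudeness. The final RLHF step, using these scores to reinforce the chatbot itself, is not shown here.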
Real-World RL Right Now
Beyond games and chatbots, RL is doing real things in the world today:
- Google reported cutting the energy used to cool its data centres by up to 40% with a DeepMind system that learns how to adjust the cooling equipment.
- Legged robots, including machines from Boston Dynamics, can learn to walk and balance using RL in simulation before being tested on hardware.
- Some trading firms use RL to develop and test market strategies.
- Researchers are exploring RL for protein design, where it searches the space of possible molecular structures.
Reinforcement learning is the part of AI closest to how living things actually learn. That is not a coincidence — it was inspired by behavioural psychology research on animals long before neural networks existed. It is one of the oldest ideas in machine learning, and it is powering some of the newest breakthroughs.
📚 Sources & Further Reading
- ChatGPT — Wikipedia
- Machine learning — Wikipedia
- Robotics — Wikipedia
- Large language models — Wikipedia
Written by Parikshet More (KidsFunLearnClub, Dubai) and reviewed for accuracy. Facts checked against the references above.
Created by Parikshet & Dad
Hi! I'm Parikshet, an 11-year-old creator from Dubai who loves drawing, art, science experiments, and golf. My dad and I run KidsFunLearnClub to share fun learning activities with kids around the world. We've created over 1,900 tutorials and videos to help you learn and have fun!
Frequently Asked Questions
What is reinforcement learning?
A type of AI training where an agent learns by taking actions in an environment, receiving rewards for good outcomes and penalties for bad ones, and gradually learning a strategy that maximises total reward. It is learning by trial, error, and feedback.
How did AlphaGo use reinforcement learning?
AlphaGo first learned from human expert games (supervised learning), then improved by playing millions of games against itself — reinforcing moves that led to winning and discouraging moves that led to losing — until it surpassed all human players.
What is a reward function?
The rule that tells the AI what counts as good or bad. In chess, winning = high reward, losing = penalty. Designing a good reward function is one of the hardest parts of reinforcement learning — if the reward is poorly designed, the AI finds ways to maximise it that were not intended.
Can reinforcement learning go wrong?
Yes. A famous example: an AI playing a boat racing game discovered it could get higher points by going in circles collecting power-ups rather than finishing the race. It optimised perfectly for the reward function — but not for the intended goal. This is called reward hacking.
What is RLHF?
Reinforcement Learning from Human Feedback — the technique used to train ChatGPT and Claude to be helpful and safe. Human trainers rate AI responses; the AI is reinforced toward responses humans prefer. This is how AI chatbots learn to follow instructions and avoid harmful outputs.