I'm Parikshet. I play golf and I use AI. These two things have more in common than most people think, and reinforcement learning is the connection. Let me explain the training method behind AlphaGo, self-taught robotic arms, and ChatGPT.

The Basic Idea: Trial, Error, and Reward

Reinforcement learning (RL) has three components:

  • Agent — the AI that is learning (could be a game bot, a robot arm, or a language model)
  • Environment — the world the agent operates in (a game, a factory floor, a conversation)
  • Reward — the signal telling the agent how well it did (points scored, task completed, human rating)

The agent takes actions. Some actions lead to rewards. Some lead to penalties. Over millions of attempts, the agent learns which actions in which situations lead to the most reward over time — and builds a strategy (called a policy) to guide its decisions.
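
Here is a minimal sketch of that loop in Python, using a classic method called Q-learning. Everything in it is a toy I made up for illustration: the five-square corridor, the names `step`, `train` and `policy`, and the learning settings. The agent starts on square 0 and only gets a reward when it reaches square 4.

```python
import random

# Hypothetical toy environment: a corridor of 5 squares, numbered 0 to 4.
# The agent starts on square 0 and earns +1 for reaching square 4.
N_STATES = 5
ACTIONS = [-1, 1]  # step left or step right

def step(state, action):
    """Environment: apply the action, return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Agent: learn by trial and error which action is best on each square."""
    rng = random.Random(seed)
    # q[state][i] estimates the long-run reward of taking ACTIONS[i] in state
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Explore at random sometimes (or on a tie), otherwise use the best-known action
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                i = rng.randrange(2)
            else:
                i = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[i])
            # Q-learning update: move the estimate toward reward + discounted future value
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][i] += alpha * (target - q[state][i])
            state = next_state
    return q

def policy(q):
    """The learned strategy: the preferred action on each non-goal square."""
    return [ACTIONS[0] if row[0] > row[1] else ACTIONS[1] for row in q[:-1]]
```

Calling `policy(train())` gives `[1, 1, 1, 1]`, meaning "step right" on every square. Nobody told the agent that rule. It worked it out from the reward alone.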

This is how a child learns to ride a bike: fall, get up, adjust, fall less, eventually don't fall. The difference is that a computer can run through its attempts far faster than any human.

How AlphaGo Became the Best Go Player in the World

Go is an ancient Chinese board game. The number of possible board positions (about 10^170) is far greater than the number of atoms in the observable universe (about 10^80), so a computer cannot simply check every possibility. Many AI researchers believed it would take decades for a machine to beat human professionals.

AlphaGo (DeepMind, 2016) used a two-stage approach:

First, supervised learning: train on 30 million positions from expert human games — learning the general shape of good play.
Then, reinforcement learning through self-play: AlphaGo played game after game against earlier versions of itself, generating experience at a rate no human could match. Each win reinforced the moves that led to it. Each loss weakened them.
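
To show the "wins reinforce, losses weaken" idea, here is a toy self-play sketch. It is not AlphaGo's method: AlphaGo used deep neural networks and tree search, while this uses a simple table of move preferences. The game is my own choice for illustration, a stick game where players take turns removing 1 to 3 sticks and whoever takes the last stick wins. The names `self_play` and `best_move` and the update sizes are assumptions.

```python
import random

MAX_TAKE = 3      # a player may take 1, 2 or 3 sticks per turn
START_PILE = 10   # hypothetical starting pile; taking the last stick wins

def legal_moves(pile):
    return list(range(1, min(MAX_TAKE, pile) + 1))

def self_play(games=5000, seed=0):
    """One shared table of move preferences plays both sides of every game."""
    rng = random.Random(seed)
    # weights[pile][move]: how strongly the policy prefers that move
    weights = {p: {m: 1.0 for m in legal_moves(p)}
               for p in range(1, START_PILE + 1)}
    for _ in range(games):
        pile, player = START_PILE, 0
        history = {0: [], 1: []}  # (pile, move) choices made by each player
        while pile > 0:
            moves = legal_moves(pile)
            # Pick a move at random, favouring moves with higher weight
            move = rng.choices(moves, weights=[weights[pile][m] for m in moves])[0]
            history[player].append((pile, move))
            pile -= move
            player = 1 - player
        winner = 1 - player  # the player who just took the last stick
        for p, m in history[winner]:
            weights[p][m] += 1.0                           # reinforce winning moves
        for p, m in history[1 - winner]:
            weights[p][m] = max(0.1, weights[p][m] - 1.0)  # weaken losing moves
    return weights

def best_move(weights, pile):
    """The move the trained table prefers most for a given pile size."""
    return max(weights[pile], key=weights[pile].get)
```

After training, `best_move(weights, 2)` is 2 and `best_move(weights, 3)` is 3: the table has learned to grab the win when it is available. For bigger piles, good play means leaving your opponent a multiple of 4, and you can print `best_move` for each pile to see how much of that the table picked up.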

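AlphaGo Zero, a later version, skipped the human game training entirely and started from random play. After 3 days of self-play it beat the version of AlphaGo that had defeated world champion Lee Sedol, winning 100 games to 0. After 40 days it had surpassed every earlier version. Along the way it rediscovered much of what humans had worked out over the game's roughly 2,500-year history, and found strategies of its own.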

The Golf Parallel

When I practice golf, my dad gives me feedback: "Your left arm collapsed at the top of the backswing — that's why the ball went right." I adjust. I hit again. Over hundreds of repetitions, the correct movement becomes automatic. My brain has run a biological version of reinforcement learning.

The difference: I need thousands of repetitions over years. An RL agent playing a video game can do 10,000 repetitions in an hour. That compression of experience is why RL systems can exceed human performance in bounded environments so quickly.

When RL Goes Wrong: Reward Hacking

One of my favourite cautionary stories: researchers at OpenAI trained an RL agent to play a boat racing game called CoastRunners. The reward was based on the game score, not on completing the race. The agent discovered it could collect more points by driving in circles and picking up the same reappearing power-ups than by actually finishing the race.

The agent was not cheating in any way it could understand. It found the most efficient path to maximising the reward function as written. The problem was that the reward function did not capture what the researchers actually wanted.

This is called reward hacking, and it is one of the central challenges of AI safety. Designing reward functions that capture what humans actually mean — not just what we wrote down — is very hard.
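
The boat race story can be put into numbers. This is a sketch with made-up values (the point totals, the time limit and the lap timings are all my assumptions, not figures from the real game). It compares the score each strategy earns with what the designers actually wanted.

```python
# Hypothetical numbers, chosen only to illustrate the idea.
FINISH_BONUS = 100     # points for crossing the finish line
POWER_UP_POINTS = 10   # points per power-up collected
TIME_LIMIT = 60        # seconds in one episode

def score_finish_race():
    """Drive the course properly: 3 power-ups on the way, then finish."""
    return 3 * POWER_UP_POINTS + FINISH_BONUS

def score_loop_forever(seconds_per_lap=4, power_ups_per_lap=3):
    """Circle a small loop of reappearing power-ups and never finish."""
    laps = TIME_LIMIT // seconds_per_lap
    return laps * power_ups_per_lap * POWER_UP_POINTS

def intended_reward(finished):
    """What the designers really cared about: finishing the race."""
    return 1 if finished else 0

# Each strategy maps to (score the agent is trained on, what humans wanted)
strategies = {
    "finish the race": (score_finish_race(), intended_reward(True)),
    "loop forever": (score_loop_forever(), intended_reward(False)),
}
best_by_score = max(strategies, key=lambda name: strategies[name][0])
best_by_intent = max(strategies, key=lambda name: strategies[name][1])
```

With these numbers, finishing scores 130 and looping scores 450, so `best_by_score` is "loop forever" while `best_by_intent` is "finish the race". An agent that only sees the score will pick the loop every time.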

RLHF: How ChatGPT Learned to Be Helpful

Chatbots such as ChatGPT, Claude, and Gemini are not just trained to predict the next word. They are also fine-tuned using Reinforcement Learning from Human Feedback (RLHF).

The process: human trainers read pairs of AI responses to the same question and pick the better one. These preferences train a separate "reward model" — an AI that learns to predict which responses humans prefer. The main AI (ChatGPT) is then reinforced toward generating responses that score well on the reward model.
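
The reward model step can be sketched in a few lines. This is a heavily simplified toy, not how any real chatbot is trained: real reward models are large neural networks that read the full text, while this one scores a response from three hand-made features that I invented for illustration. The training rule is the standard pairwise (Bradley-Terry) idea: raise the score of the response the human picked relative to the one they rejected.

```python
import math

# Hypothetical features for a response: [is_polite, answers_question, is_rude]
# Each pair is (features of the preferred response, features of the rejected one).
PAIRS = [
    ([1, 1, 0], [0, 0, 1]),
    ([1, 1, 0], [1, 0, 0]),
    ([0, 1, 0], [0, 0, 1]),
    ([1, 1, 0], [0, 1, 1]),
]

def reward(weights, features):
    """Toy reward model: a weighted sum of the features."""
    return sum(w * x for w, x in zip(weights, features))

def train_reward_model(pairs, steps=200, lr=0.1):
    weights = [0.0, 0.0, 0.0]
    for _ in range(steps):
        for chosen, rejected in pairs:
            # Probability the model gives to the human's choice (logistic of the score gap)
            margin = reward(weights, chosen) - reward(weights, rejected)
            p_chosen = 1.0 / (1.0 + math.exp(-margin))
            # Gradient ascent on log(p_chosen): push the chosen score up, the rejected one down
            for i in range(len(weights)):
                weights[i] += lr * (1.0 - p_chosen) * (chosen[i] - rejected[i])
    return weights

def pick_best(weights, candidates):
    """Stand-in for the last step: prefer the candidate the reward model scores highest."""
    return max(candidates, key=lambda features: reward(weights, features))
```

After `weights = train_reward_model(PAIRS)`, the model scores every preferred response above its rejected partner, and the weight on `is_rude` ends up negative. In real RLHF the final step is not a simple pick like `pick_best`. The chatbot's own parameters are updated with an RL algorithm so that it produces high-scoring responses by itself.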

This is how AI chatbots learn to be helpful, avoid harmful outputs, follow instructions, and admit uncertainty. It is literally "learning from human feedback" — an AI being trained the way a teacher might correct a student's essay, at massive scale.

Real-World RL Right Now

Beyond games and chatbots, RL is doing real things in the world today:
- Google reported cutting the energy used to cool its data centres by up to 40% with a DeepMind AI system that learns which cooling settings save the most energy.
- Boston Dynamics uses RL, trained in simulation and then tested on hardware, to help its robots walk and balance.
- Some trading firms and hedge funds experiment with RL to develop market strategies.
- Researchers are testing RL in protein design, where it searches the space of possible molecular structures.

Reinforcement learning is the part of AI closest to how living things actually learn. That is not a coincidence — it was inspired by behavioural psychology research on animals long before neural networks existed. It is one of the oldest ideas in machine learning, and it is powering some of the newest breakthroughs.

Written by Parikshet More (KidsFunLearnClub, Dubai) and reviewed for accuracy.