Learning Without a Teacher

Much of the AI you have heard about, such as image recognition, learns from examples that humans label. Someone looks at a million photos and tags each one: "this is a cat," "this is not a cat." The AI learns from those labels.

But reinforcement learning is completely different. The AI teaches itself — through trial, error, and a reward signal — with no human labelling required. It is the same way you learned to ride a bike: you fell off, got back on, adjusted your balance, and gradually got better. Nobody programmed the exact movements into you. You discovered them through experience.

How the Reward Signal Works

In reinforcement learning, an AI agent takes actions in an environment. After each action, it receives a reward: positive if the action was helpful, negative (or zero) if it was not. The AI's goal is to learn which sequence of actions leads to the highest total reward over time.
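Here is a minimal sketch of that loop in Python. Everything in it is made up for illustration: the "environment" is a corridor of five squares with a treasure at the far end, and the names `step`, `run_episode`, `always_right` and `dither` are hypothetical. It only shows how actions, rewards and total reward fit together; there is no learning yet.

```python
# A tiny, made-up environment: a corridor of squares 0..4 with treasure on square 4.
GOAL = 4

def step(position, action):
    """Apply an action (-1 = left, +1 = right); return (new_position, reward, done)."""
    new_position = max(0, min(GOAL, position + action))
    if new_position == GOAL:
        return new_position, 10, True   # reached the treasure: big positive reward
    return new_position, -1, False      # any other step costs a little

def run_episode(policy, max_steps=20):
    """Let the agent act until the episode ends; return the total reward."""
    position, total_reward = 0, 0
    for _ in range(max_steps):
        action = policy(position)
        position, reward, done = step(position, action)
        total_reward += reward
        if done:
            break
    return total_reward

def always_right(position):
    return +1

def dither(position):
    # Steps right from even squares and left from odd ones, so it never arrives.
    return +1 if position % 2 == 0 else -1

print(run_episode(always_right))  # 7   (three steps at -1, then +10)
print(run_episode(dither))        # -20 (twenty wasted steps)
```

The two policies take the same kinds of actions, but one collects far more total reward. A reinforcement learning agent's job is to find the better policy on its own.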

A simple chess-playing AI might be rewarded like this: capturing a piece = positive reward. Losing its queen = negative reward. Winning the game = big positive reward. The AI plays millions of games, keeps track of which moves led to wins, and gradually learns a strategy that maximises reward. DeepMind's AlphaZero went further: it was rewarded only for the final result of each game. It was told the rules but given no strategy, and it discovered good play entirely through self-play. (Source: DeepMind research papers on AlphaZero)
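The idea of keeping track of which moves led to wins can be sketched in a few lines. This is not how AlphaZero works; it is a much simpler technique (often called epsilon-greedy), shown here under made-up assumptions: a hypothetical game with three moves, each with a hidden chance of winning stored in `WIN_CHANCE`. The agent never sees those numbers. It only sees whether it won.

```python
import random

# Hypothetical game: three possible moves, each with a hidden chance of winning.
WIN_CHANCE = {"A": 0.2, "B": 0.5, "C": 0.8}

def learn_by_playing(games=5000, explore=0.1, seed=0):
    """Play many games, tracking the win rate of each move."""
    rng = random.Random(seed)
    wins = {move: 0 for move in WIN_CHANCE}
    plays = {move: 0 for move in WIN_CHANCE}

    def win_rate(move):
        return wins[move] / plays[move] if plays[move] else 0.0

    for _ in range(games):
        if rng.random() < explore:
            move = rng.choice(list(WIN_CHANCE))    # sometimes try a random move
        else:
            move = max(WIN_CHANCE, key=win_rate)   # otherwise use the best so far
        reward = 1 if rng.random() < WIN_CHANCE[move] else 0  # 1 = win, 0 = loss
        plays[move] += 1
        wins[move] += reward

    return {move: win_rate(move) for move in WIN_CHANCE}

rates = learn_by_playing()
best = max(rates, key=rates.get)
print("Estimated win rates:", rates)
print("Move the agent prefers:", best)
```

The small amount of random exploration matters. Without it, the agent could settle on the first move that happens to win and never find out that a better one exists.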

AlphaGo: The Match That Shocked the World

Go is a board game that has been played in China for over 2,500 years. It is vastly more complex than chess: there are more possible Go positions than atoms in the observable universe. For decades, many computer scientists expected master-level Go to stay out of reach for AI, because the search space was too large for traditional search algorithms.

In March 2016, DeepMind's AlphaGo played Lee Sedol, one of the greatest Go players of all time, and won 4-1. AlphaGo first studied games played by human experts, then improved through reinforcement learning by playing millions of games against itself. One of its moves, Move 37 in Game 2, was so unexpected that human commentators thought it was a mistake. AlphaGo itself estimated that a human would play it only about 1 time in 10,000, and it proved to be a turning point in a game AlphaGo went on to win. (Source: DeepMind, Nature journal, 2016)

In 2017, DeepMind unveiled AlphaGo Zero, a version that learned from scratch with zero human game data, knowing only the rules. After just three days of self-play it defeated the version of AlphaGo that beat Lee Sedol 100 games to 0, and after 40 days it had surpassed every earlier version.

Beyond Games: Where Reinforcement Learning Goes Next

Games are the training ground, but the real applications are much bigger. Google has used DeepMind's AI to manage the cooling systems in its data centres: the system learned to cut the energy used for cooling by up to 40%, saving money and significant carbon emissions. (Source: Google DeepMind, 2016)

Robotics labs use reinforcement learning to teach robot arms to grasp objects, walk on uneven terrain, and perform factory assembly tasks, by letting robots fail thousands of times (often in simulation) until they find strategies that work. Some self-driving car systems use reinforcement learning components to improve decision-making in complex traffic scenarios.

The One Catch: Defining the Right Reward

Reinforcement learning is powerful, but it is only as good as its reward function. In one famous experiment by OpenAI, an AI playing the boat-racing game CoastRunners learned to drive in circles, hitting the same power-ups again and again, rather than finishing the race. The game's score rewarded power-ups more than crossing the finish line, so the AI perfectly optimised for the reward it was given, not the reward the humans intended.

This is called reward hacking, and it is one of the central problems in AI safety research. Getting the reward function right, so the AI learns what you actually want, turns out to be one of the hardest problems in AI. The next time you think an AI is "cheating," it has probably found a loophole in the rules that humans did not anticipate.
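The boat-racing problem comes down to simple arithmetic, which the sketch below works through. All the numbers and names (`POWER_UP_POINTS`, `FINISH_POINTS`, and so on) are invented for illustration and are not taken from the real experiment. The point is only that an agent comparing total reward will pick whichever strategy scores highest, even if it is not what the designers wanted.

```python
# Made-up numbers for illustration; not taken from the real experiment.
POWER_UP_POINTS = 5   # points each time the boat hits a power-up
FINISH_POINTS = 50    # one-off points for crossing the finish line
GAME_LENGTH = 100     # number of time steps the agent is allowed
STEPS_PER_LAP = 4     # steps to circle once and hit a respawning power-up

def reward_for_finishing():
    # Assume two power-ups lie on the way; crossing the finish line ends the game.
    return 2 * POWER_UP_POINTS + FINISH_POINTS

def reward_for_circling():
    # Ignore the race and keep looping past the same power-up until time runs out.
    laps = GAME_LENGTH // STEPS_PER_LAP
    return laps * POWER_UP_POINTS

strategies = {
    "finish the race": reward_for_finishing(),
    "circle forever": reward_for_circling(),
}
best = max(strategies, key=strategies.get)

print(strategies)                # {'finish the race': 60, 'circle forever': 125}
print("Highest reward:", best)   # Highest reward: circle forever
```

With these numbers, raising `FINISH_POINTS` above 115 would make finishing the better choice. Fixing a reward function often means adjusting it like this until the highest-scoring behaviour is the one you actually intended.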
