This Is How Machines Learn! Reinforcement Learning (Part 4)
Our goal with this series is to enable everyone to understand AI phenomena in their daily lives, as well as to actively shape the growing influence of AI on our society. Therefore, we do not go into technical details or provide an introduction to specific machine learning frameworks. Instead, we focus on explaining the underlying ideas of machine learning, which empower everyone to understand and shape the digital world that surrounds us.
“Ouch, that’s hot!” — After their first encounter with a hot plate, children quickly learn not to touch it… That’s because children learn through direct interaction with and feedback from their environment.
Underlying Idea
Reinforcement learning is a type of machine learning that works similarly — and it’s inspired by psychology: the agent — a computer program capable of autonomous behavior — learns to better predict the success of its actions. Through continuous interaction with its environment, it is repeatedly rewarded or punished and thus optimizes its strategy.
Unlike the two types of machine learning we already looked at, reinforcement learning does not require large amounts of data (labeled or unlabeled). With reinforcement learning, the agent instead pursues a goal, for example, to successfully play a game of Snake, or, in the case of our robot, to create a meadow with lots of flowers. However, it first needs to learn a proper strategy.
The steps below describe the activities in the figure.
① First, the agent captures the state, i.e., the relevant aspects of its environment. For our robot, this is primarily determined by the growth status of the flowers. In the game Snake, the positions of the food and of all parts of the snake’s body would form the state.
② Within the environment, the agent can perform actions that it selects from a set of available actions depending on the state of the environment. Our robot has the same two possible actions to choose from in every state: watering or planting seedlings with a spade. In Snake, the snake can move up, down, left, or right. When the agent performs one of the available actions, the state of the environment changes.
③ The agent is then rewarded or punished according to predefined rules. If our robot is in the state “seedlings planted” and contributes to the growth of the seedlings by “watering”, it is rewarded with a fixed number of coins. If it instead digs up the seedlings with the spade, it is punished and coins are taken away. Similarly, the snake is rewarded when it eats the food and punished when it touches its own tail. The way rewards and punishments are awarded has a significant impact on what the agent ends up learning. It is not unthinkable, for example, that an autonomous car learns never to accelerate: standing still is not punished, whereas causing an accident would be punished much more severely.
④ Rewards make the agent show the rewarded behavior more frequently, while punishments make the punished behavior less frequent. Successful actions are thus “reinforced”, unsuitable actions “unlearned”. In this way, the agent adjusts its strategy, which is stored in its model. When we speak of the agent’s learning process, we refer to this adjustment of the model. Our robot manages its strategy using the shelf, where it maintains an up-to-date evaluation of all possible actions for each state.
Due to its lack of experience, the agent will initially take an exploratory approach and select actions randomly. By repeatedly completing the cycle of capturing the state, selecting and executing an action, and receiving a reward or punishment, the agent gradually optimizes its strategy.
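To make this cycle a little more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the two states and two actions are a toy version of the robot’s world, the reward values (coins) and the learning rate are arbitrary, and the update only looks at the immediate reward rather than also considering future rewards, as full reinforcement learning methods do. The “shelf” is simply a table with one evaluation per state-action pair.

```python
import random

# Toy version of the gardening robot's world (illustration only).
STATES = ["nothing planted", "seedlings planted"]
ACTIONS = ["water", "use spade"]

# The "shelf": one evaluation per state-action pair, initially unknown (zero).
shelf = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def step(state, action):
    """Toy environment: returns the next state and a reward (coins)."""
    if state == "nothing planted":
        if action == "use spade":
            return "seedlings planted", 1    # planting seedlings is useful here
        return "nothing planted", 0          # watering bare soil achieves nothing
    if action == "water":
        return "nothing planted", 5          # flowers grow, then the bed is empty again
    return "nothing planted", -5             # the spade digs up the seedlings

EPSILON = 0.2        # probability of exploring with a random action
LEARNING_RATE = 0.5  # how strongly one experience adjusts the evaluation

state = "nothing planted"
for _ in range(1000):
    # ① capture the state, ② choose an action: mostly the best-rated one,
    # sometimes a random one to keep exploring
    if random.random() < EPSILON:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: shelf[(state, a)])

    # ③ perform the action and receive a reward or punishment
    next_state, reward = step(state, action)

    # ④ adjust the evaluation on the shelf toward the observed reward
    shelf[(state, action)] += LEARNING_RATE * (reward - shelf[(state, action)])

    state = next_state

for (s, a), value in sorted(shelf.items()):
    print(f"{s:18} | {a:9} | {value:+.2f}")
```

After enough rounds, the evaluations on the shelf show that using the spade pays off on an empty bed, while watering pays off once seedlings are planted, so the best-rated choice in each state is the sensible one.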
Application Areas
Games are a famous field of application for reinforcement learning. The state of the environment is fairly easy to capture, but the best action or move depends on many factors. Because of this complexity, it is difficult to design a traditional algorithm that can deal with all eventualities of the game. Through reinforcement learning, the agent is able to master a wide range of games after playing many rounds, as the sketch below illustrates.
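As a rough sketch of what “playing many rounds” means in code: each round is one episode, and each episode is one pass through the cycle above until the game is over. The environment below is a deliberately trivial stand-in; its reset()/step() interface and the random agent are assumptions made for this illustration, not a particular game library, and a real Snake environment would track the snake, the food, and collisions.

```python
import random

class DummyGame:
    """Stand-in environment: every round lasts three moves, rewards are random."""

    def reset(self):
        self.moves_left = 3
        return "game state"                  # e.g. positions of food and snake

    def step(self, action):
        self.moves_left -= 1
        reward = random.choice([-1, 0, 1])   # e.g. food eaten or tail touched
        done = self.moves_left == 0          # the round is over
        return "game state", reward, done

def play_rounds(env, rounds):
    for round_no in range(1, rounds + 1):
        state = env.reset()                  # start a new round
        total_reward = 0
        done = False
        while not done:
            # this agent only moves randomly; a learning agent would pick the
            # best-rated action and update its strategy after every step,
            # as in the shelf example above
            action = random.choice(["up", "down", "left", "right"])
            state, reward, done = env.step(action)
            total_reward += reward
        print(f"round {round_no}: total reward {total_reward}")

play_rounds(DummyGame(), rounds=3)
```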
Robots or self-driving cars can also be trained using reinforcement learning. Often, however, simulated environments are used before the robots or cars are allowed to gather real-world experience.
Another area of application is process optimization: problems that are hard to solve mathematically and for which ideal strategies are not known in advance. Examples include controlling the air conditioning in a server farm or planning public transit schedules.
That’s it for part 4 of our series. Thank you for reading. If you have any questions, feel free to ask them in the comments. Click here to go to part 5.
Written by Stefan Seegerer, Tilman Michaeli and Ralf Romeike.
The robot is adapted from https://openclipart.org/detail/191072/blue-robot and licensed under CC0. The article and the derived graphics are licensed under CC-BY.