Reinforcement learning is a type of machine learning often applied to problems involving indirect rewards – that is, in which making a correct choice gives a delayed reward, possibly dependent on future actions. This class of problems is extremely general and includes many areas in which AI is being employed today.
When putting together a reinforcement learning algorithm, one must specify a number of parameters that determine how the algorithm will be trained; these are known as hyperparameters. In the modern reinforcement learning literature, the choice of hyperparameters often takes a back seat when evaluating the efficacy of an algorithm. This is partly because the complexity of many modern reinforcement learning approaches results in a large number of hyperparameters to optimize, making a full grid search all but impossible given time constraints. Often, most hyperparameters are fixed, while one or two seemingly arbitrarily chosen parameters are allowed to vary across 10 or 15 fixed values.
A delicate balance of hyperparameter exploration and experimental evaluation, combined with a tool like Zegami for analyzing graphical results effectively, provides a novel approach to this problem. To demonstrate, we'll apply our approach in a common testing ground for RL algorithms: OpenAI Gym's Lunar Lander environment, in which a reward is given for successfully landing a spaceship on a landing pad. Here's a video of an agent trying to tackle this environment without any training:

Now, we will attempt to produce an agent that can solve the environment. We'll start with a boilerplate RL agent, an algorithm known as a Deep Q-Network, or DQN for short (see https://medium.com/@jonathan_hui/rl-dqn-deep-q-network-e207751f7ae4 for an introduction). This agent uses a neural network to learn a function that assigns a value to each action it could take. The following hyperparameters must be chosen:
- Sizes of network layers
- Learning rate – this is the rate at which the network’s weights are updated
- Target sync frequency – how often the separate target network is synchronized with the main network
- Epsilon – the probability with which the agent chooses a random move, to promote strategy exploration
- Epsilon decay – whether or not the above epsilon value is adjusted as the agent learns
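The last two hyperparameters govern the exploration/exploitation trade-off. As a rough sketch (the function names and decay schedule here are illustrative, not the exact implementation used in our experiments), epsilon-greedy action selection with optional decay looks like this:

```python
import random

def select_action(q_values, epsilon):
    """Epsilon-greedy: pick a random action with probability epsilon,
    otherwise pick the action with the highest estimated value."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def decay_epsilon(epsilon, decay_rate=0.995, epsilon_min=0.01):
    """Multiplicative decay, floored so the agent never stops exploring entirely."""
    return max(epsilon_min, epsilon * decay_rate)
```

With decay enabled, the agent explores heavily at first and gradually shifts toward exploiting what it has learned.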
To start, we define discrete ranges for each hyperparameter:
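A search space of this kind can be expressed as a dictionary of candidate values, from which each trial samples one value per hyperparameter. The specific numbers below are illustrative placeholders, not the exact ranges we used:

```python
import random

# Hypothetical discrete ranges; the exact values are illustrative.
hyperparameter_ranges = {
    "layer_1_size":     [64, 128, 256, 512],
    "layer_2_size":     [32, 64, 128, 256],
    "learning_rate":    [1e-4, 5e-4, 1e-3, 5e-3, 1e-2],
    "target_sync_freq": [10, 100, 500, 1000],
    "epsilon":          [0.05, 0.1, 0.2, 0.5],
    "epsilon_decay":    [True, False],
}

def sample_hyperparameters(ranges):
    """Draw one value uniformly at random from each discrete range."""
    return {name: random.choice(values) for name, values in ranges.items()}
```

Each call to `sample_hyperparameters` yields one candidate configuration for a training run.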
Our algorithm will randomly sample each of these ranges, construct a DQN with those hyperparameters, and train it for 1000 episodes. This number of episodes is generally not enough for a vanilla DQN to solve (achieve a perfect score on) this environment, but it is enough to get a feel for how well the agent is learning. We store the results of these episodes as a time series and plot it along with a moving estimate of its mean and variance. Under the rules we have defined, 200 is a perfect score, which means that in the chart below, where the y-axis is score and the x-axis is episodes, our algorithm failed to learn how to land the spaceship.
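The moving mean and variance can be computed with a simple trailing window over the episode scores. This is a minimal sketch (window size and implementation details are assumptions, not our exact plotting code):

```python
import numpy as np

def rolling_stats(scores, window=50):
    """Trailing-window mean and variance of a series of episode scores."""
    scores = np.asarray(scores, dtype=float)
    means, variances = [], []
    for i in range(len(scores)):
        chunk = scores[max(0, i - window + 1): i + 1]
        means.append(chunk.mean())
        variances.append(chunk.var())
    return np.array(means), np.array(variances)
```

Plotting the mean smooths out episode-to-episode noise, while the variance band shows how consistent the agent's landings are.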
Once we have a good sample of these agents, we can load them into Zegami and start our analysis.
Here, we have plotted the agents’ final scores against their learning rates. We see that the highest-performing agents had learning rates of around 0.0005-0.001.
Here we have plotted the agents' learning rates against their second layers' sizes; we can use the graphs themselves to reason about each agent's learning ability. We can see that the agents with a high learning rate (circled in red) have too much variability in their rewards; we also see that a larger layer-2 size generally causes the agents with a lower learning rate (circled in blue) to reach a high score faster.
Clearly the best-performing agents have small second layers, probably because this allows them to be trained more quickly with little loss of accuracy. After performing this analysis on every hyperparameter, we construct new ranges for each one.
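Based on the observations above (learning rates around 0.0005-0.001 and smaller second layers performing best), a narrowed search space might look like the following. The exact values are hypothetical, chosen only to illustrate how the ranges tighten around the promising region:

```python
# Illustrative refined ranges after the first round of analysis.
refined_ranges = {
    "layer_1_size":     [128, 256],
    "layer_2_size":     [16, 32, 64],          # smaller second layers performed best
    "learning_rate":    [5e-4, 7.5e-4, 1e-3],  # the 0.0005-0.001 band identified above
    "target_sync_freq": [100, 500],
    "epsilon":          [0.1, 0.2],
    "epsilon_decay":    [True],
}
```

The same random-sampling procedure is then repeated over this smaller space.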
And we run it again. In our new collection, all of our agents figured out how to balance the lander within 100 episodes.
And lo, one of our agents learned how to land perfectly almost every time within 1000 episodes of training, an impressive feat for a vanilla DQN!