Internally Rewarded Reinforcement Learning
International Conference on Machine Learning (ICML), 2023
- Mengdi Li*
- Xufeng Zhao*
- Jae Hee Lee
- Cornelius Weber
- Stefan Wermter
* Equal contribution
Abstract
We study a class of reinforcement learning problems where the reward signals for policy learning are generated by a discriminator that is dependent on and jointly optimized with the policy. This interdependence between the policy and the discriminator leads to an unstable learning process because reward signals from an immature discriminator are noisy and impede policy learning, and conversely, an untrained policy impedes discriminator learning. We call this learning setting Internally Rewarded Reinforcement Learning (IRRL) as the reward is not provided directly by the environment but internally by the discriminator. In this paper, we formally formulate IRRL and present a class of problems that belong to IRRL. We theoretically derive and empirically analyze the effect of the reward function in IRRL and based on these analyses propose the clipped linear reward function. Experimental results show that the proposed reward function can consistently stabilize the training process by reducing the impact of reward noise, which leads to faster convergence and higher performance compared with baselines in diverse tasks.
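To make the policy–discriminator interdependence concrete, here is a minimal sketch of an internally rewarded training loop, assuming a REINFORCE-style policy update and a cross-entropy discriminator update on a toy task. The module sizes, the placeholder environment, and the way the reward is derived from the discriminator's output are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of an internally rewarded RL loop: the policy's reward is computed
# from the jointly trained discriminator, not provided by the environment.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, num_classes, steps_per_episode = 8, 4, 10, 5

# Placeholder networks; the paper's models (e.g., RAM/DT-RAM) are more involved.
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))
discriminator = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, num_classes))
policy_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
disc_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

def rollout():
    """Collect one toy episode; random observations stand in for a real environment."""
    obs, log_probs = [], []
    o = torch.randn(obs_dim)
    for _ in range(steps_per_episode):
        dist = torch.distributions.Categorical(logits=policy(o))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        o = torch.randn(obs_dim)              # placeholder transition
        obs.append(o)
    label = torch.randint(num_classes, ())    # placeholder ground-truth label
    return torch.stack(obs), torch.stack(log_probs), label

for episode in range(200):
    obs, log_probs, label = rollout()
    logits = discriminator(obs[-1])           # discriminator reads the collected observation

    # Internal reward: a function of the discriminator's probability of the true label.
    # Using the raw probability is a placeholder; the paper compares several reward shapes.
    reward = F.softmax(logits, dim=-1)[label].detach()

    # Joint update: REINFORCE for the policy, cross-entropy for the discriminator.
    policy_loss = -(reward * log_probs.sum())
    disc_loss = F.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0))
    policy_opt.zero_grad(); disc_opt.zero_grad()
    (policy_loss + disc_loss).backward()
    policy_opt.step(); disc_opt.step()
```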
Experiments
We compare the proposed clipped linear reward with the conventional clipped logarithmic reward and the accuracy-based reward on various tasks. Selected experimental results on the hard-attention digit recognition task and the robotic object counting task are presented below. Please see our paper for details about the models and experimental setups.
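For intuition, below is a hedged sketch of the three reward shapes being compared, written as functions of the discriminator's predicted class distribution. The clipping bound and the chance-level offset used here are illustrative placeholders; the exact definitions are given in the paper.

```python
# Hedged sketch of the three reward shapes compared in the experiments, computed
# from the discriminator's predicted distribution over labels. The bounds and
# offsets below are illustrative placeholders, not the paper's definitions.
import numpy as np

def accuracy_reward(probs: np.ndarray, label: int) -> float:
    """1 if the discriminator's top prediction matches the label, else 0."""
    return float(np.argmax(probs) == label)

def clipped_log_reward(probs: np.ndarray, label: int, lower: float = -2.0) -> float:
    """Log-probability of the true label, clipped from below (placeholder bound)."""
    return float(max(np.log(probs[label] + 1e-8), lower))

def clipped_linear_reward(probs: np.ndarray, label: int) -> float:
    """True-label probability minus chance level, clipped at zero
    (an illustrative clipped linear form)."""
    chance = 1.0 / len(probs)
    return float(max(probs[label] - chance, 0.0))

# Example: a 10-class discriminator that is only mildly confident in the true class.
probs = np.full(10, 0.05); probs[3] = 0.55
print(accuracy_reward(probs, 3), clipped_log_reward(probs, 3), clipped_linear_reward(probs, 3))
```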
RAM on the digit recognition task
Visualize the performance of RAM models trained with different reward functions. RAM executes a fixed number of exploration steps (18) before performing the final digit recognition. Use the slider to change the training epoch at which the model is evaluated. Samples are randomly selected from the evaluation dataset. The starting, intermediate, and stopping glimpses are marked by yellow, green, and red boxes, respectively. You can change the random seed to see more cases.
[Interactive demo: random seed selector and epoch slider, with glimpse visualizations for RAM models trained with the clipped linear, clipped logarithmic, and accuracy-based rewards, followed by their training curves.]
DT-RAM on the digit recognition task
Visualize the performance of DT-RAM models trained with different reward functions. Unlike RAM, DT-RAM learns to terminate the exploration before reaching the maximum number of movement steps (18). Use the slider to change the training epoch at which the model is evaluated. Samples are randomly selected from the evaluation dataset. The starting, intermediate, and stopping glimpses are marked by yellow, green, and red boxes, respectively. You can change the random seed to see more cases.
[Interactive demo: random seed selector and epoch slider, with glimpse visualizations for DT-RAM models trained with the clipped linear, clipped logarithmic, and accuracy-based rewards, followed by their training curves.]
Robotic object counting
Visualize the performance of agents trained with different reward functions on the robotic object counting task. Given the name of a goal object, e.g., "cube_small_blue", the agent is expected to predict the number of instances of that object in tabletop scenes with occlusion. We provide 50 randomly selected samples. Use the slider to change the sample index.
[Interactive demo: sample selector showing the goal object and its ground-truth count alongside the predictions of agents trained with the clipped linear, clipped logarithmic, and accuracy-based rewards.]
Citation
Website template borrowed from Jon Barron.