Internally Rewarded Reinforcement Learning


International Conference on Machine Learning (ICML), 2023

Abstract

We study a class of reinforcement learning problems where the reward signals for policy learning are generated by a discriminator that is dependent on and jointly optimized with the policy. This interdependence between the policy and the discriminator leads to an unstable learning process because reward signals from an immature discriminator are noisy and impede policy learning, and conversely, an untrained policy impedes discriminator learning. We call this learning setting Internally Rewarded Reinforcement Learning (IRRL) as the reward is not provided directly by the environment but internally by the discriminator. In this paper, we formally formulate IRRL and present a class of problems that belong to IRRL. We theoretically derive and empirically analyze the effect of the reward function in IRRL and, based on these analyses, propose the clipped linear reward function. Experimental results show that the proposed reward function can consistently stabilize the training process by reducing the impact of reward noise, which leads to faster convergence and higher performance compared with baselines in diverse tasks.
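The interplay described in the abstract can be summarized as a joint training loop in which the policy is updated with a reward computed by the discriminator, while the discriminator is trained on the trajectories the policy collects. The following is a minimal sketch under these assumptions, using a REINFORCE-style policy update and a cross-entropy discriminator update; the names `rollout`, `policy`, `discriminator`, and `reward_fn` are hypothetical placeholders, not the released implementation.

```python
import torch
import torch.nn.functional as F

def irrl_training_step(policy, discriminator, reward_fn, rollout,
                       policy_opt, disc_opt, env):
    """One joint update in internally rewarded RL (illustrative sketch only).

    The reward does not come from `env`: it is computed by `discriminator`,
    which is itself trained on the trajectories collected by `policy`.
    """
    # 1. The policy collects a batch of trajectories. The environment provides
    #    observations and ground-truth labels, but no reward signal.
    #    obs_summary: (B, ...), log_probs: (B, T), labels: (B,) long
    obs_summary, log_probs, labels = rollout(policy, env)

    # 2. The discriminator predicts the labels from the collected observations
    #    (detached so the policy and discriminator updates stay separate).
    logits = discriminator(obs_summary.detach())          # (B, num_classes)

    # 3. The internal reward is computed from the discriminator's output,
    #    e.g. with one of the reward functions compared in the experiments.
    rewards = reward_fn(logits.detach(), labels)          # (B,)

    # 4. Policy update: REINFORCE with the internally generated reward.
    policy_loss = -(rewards * log_probs.sum(dim=1)).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # 5. Discriminator update: supervised cross-entropy on the same batch.
    disc_loss = F.cross_entropy(logits, labels)
    disc_opt.zero_grad()
    disc_loss.backward()
    disc_opt.step()

    return rewards.mean().item(), disc_loss.item()
```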


Experiments

We compare the proposed clipped linear reward with the conventional clipped logarithmic reward and the accuracy-based reward on various tasks. Selected experimental results on the hard-attention digit recognition task and the robotic object counting task are presented below. Please see our paper for details of the models and experimental setups.
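For reference, the three reward types can be sketched as functions of the discriminator's output as follows. This is an illustrative sketch: the clipping threshold `log_clip` and the offset `offset` are hypothetical constants, and the exact forms used in the paper may differ.

```python
import torch

def discriminator_reward(logits, label, kind="clipped_linear",
                         log_clip=-5.0, offset=0.0):
    """Map a discriminator's output to a scalar reward (illustrative forms).

    logits: unnormalized class scores for one sample, shape (num_classes,)
    label:  integer index of the ground-truth class
    """
    probs = torch.softmax(logits, dim=-1)
    p_true = probs[label]                    # confidence in the true label
    if kind == "accuracy":
        # Sparse 0/1 reward: non-zero only when the argmax prediction is correct.
        return (logits.argmax() == label).float()
    if kind == "clipped_log":
        # Logarithmic reward, clipped from below so an immature discriminator
        # cannot emit arbitrarily large negative rewards.
        return torch.clamp(torch.log(p_true), min=log_clip)
    if kind == "clipped_linear":
        # Reward linear in the discriminator's confidence, clipped to stay
        # non-negative, which bounds the noise that the logarithm would
        # amplify at small p_true.
        return torch.clamp(p_true - offset, min=0.0)
    raise ValueError(f"unknown reward kind: {kind}")
```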

RAM on the digit recognition task

Visualize the performance of RAM models trained with different reward functions. RAM executes a fixed number of exploration steps (18) before performing the final digit recognition. Use the slider to change the training epoch at which the model is evaluated. Samples are randomly selected from the evaluation dataset. The starting, intermediate, and stopping glimpses are marked by yellow, green, and red boxes, respectively. Change the random seed to see more cases.
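As a rough illustration of this setup (not the released code), one RAM episode under IRRL can be sketched as follows. The attribute names on `policy` and `classifier` are hypothetical, and `reward_fn` stands for a reward function like the sketch above.

```python
import torch

def run_ram_episode(policy, classifier, reward_fn, image, label, num_glimpses=18):
    """One fixed-length RAM episode: 18 glimpses, then digit classification.

    `policy` chooses glimpse locations; `classifier` is the internal
    discriminator that produces both the prediction and the reward.
    """
    state = policy.initial_state()
    loc = policy.initial_location()
    log_probs = []
    for _ in range(num_glimpses):
        glimpse = policy.extract_glimpse(image, loc)   # crop a patch around `loc`
        state, loc_dist = policy.step(state, glimpse)  # recurrent update + location policy
        loc = loc_dist.sample()
        log_probs.append(loc_dist.log_prob(loc))
    logits = classifier(state)                         # classify from the final state
    reward = reward_fn(logits.detach(), label)         # internal reward, e.g. clipped linear
    return reward, torch.stack(log_probs), logits
```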

[Interactive viewer: select a random seed; glimpse trajectories are shown side by side for models trained with the clipped linear reward, the clipped logarithmic reward, and the accuracy-based reward.]

Training curves

DT-RAM on the digit recognition task

Visualize the performance of DT-RAM models trained with different reward functions. Unlike RAM, DT-RAM learns to terminate exploration before reaching the maximum number of movement steps (18). Use the slider to change the training epoch at which the model is evaluated. Samples are randomly selected from the evaluation dataset. The starting, intermediate, and stopping glimpses are marked by yellow, green, and red boxes, respectively. Change the random seed to see more cases.
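The only structural change relative to the RAM sketch above is a learned termination decision at every step. A hypothetical version of the episode loop, with the same placeholder names as before, could look like this:

```python
import torch

def run_dtram_episode(policy, classifier, reward_fn, image, label, max_steps=18):
    """DT-RAM episode: the policy may stop before taking `max_steps` glimpses."""
    state = policy.initial_state()
    loc = policy.initial_location()
    log_probs = []
    for _ in range(max_steps):
        glimpse = policy.extract_glimpse(image, loc)
        # The step additionally returns a Bernoulli "stop" distribution.
        state, loc_dist, stop_dist = policy.step(state, glimpse)
        stop = stop_dist.sample()
        log_probs.append(stop_dist.log_prob(stop))
        if stop.item() == 1:                   # learned early termination
            break
        loc = loc_dist.sample()
        log_probs.append(loc_dist.log_prob(loc))
    logits = classifier(state)                 # classify from the state at termination
    reward = reward_fn(logits.detach(), label)
    return reward, torch.stack(log_probs), logits
```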

[Interactive viewer: select a random seed; glimpse trajectories are shown side by side for models trained with the clipped linear reward, the clipped logarithmic reward, and the accuracy-based reward.]

Training curves

Robotic object counting

Visualize the performance of agents trained with different reward functions on the robotic object counting task. Given the name of a goal object, e.g., "cube_small_blue", the agent is expected to predict the number of instances of that object in table-top scenes with occlusion. We provide 50 randomly selected samples; use the slider to change the sample index.
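The internal reward carries over from digit recognition by treating the count as a class label. The snippet below is a hypothetical illustration of this idea (the count range, clip value, and stand-in tensors are placeholders, not values from the paper).

```python
import torch

# Hypothetical illustration: the count prediction is a classification over
# possible counts (here 0..10), so the internal reward is computed from the
# discriminator's count distribution just as in the digit-recognition sketches.
count_logits = torch.randn(11)                  # stand-in discriminator output
true_count = 3                                  # stand-in ground-truth count
probs = torch.softmax(count_logits, dim=-1)

reward_accuracy = (count_logits.argmax() == true_count).float()       # sparse 0/1
reward_clipped_log = torch.clamp(torch.log(probs[true_count]), min=-5.0)
reward_clipped_linear = torch.clamp(probs[true_count], min=0.0)       # illustrative form
```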

[Interactive viewer: select a sample to see the goal object, the ground-truth label, and the count predictions of agents trained with the clipped linear reward, the clipped logarithmic reward, and the accuracy-based reward.]

Citation

Website template borrowed from Jon Barron.