Flipping a pancake is a complex move that humans need to learn through experimentation and repetition. So, how could a robot learn to flip a pancake? Let’s imagine a robotic arm capable of performing the same moves as a human arm. The hand is holding a frying pan, and the batter is cooked and has the shape of a perfect pancake.

Let’s also imagine that we are able to write the code needed to make the robotic arm move the hand so that the pancake flips in the air and falls back onto the frying pan. Our first attempt would most probably fail.

A human would stop and think about what went wrong. Was the move too strong? Too weak? Should I have expected the pancake to fall more to the right or to the left? When we think about the possible consequences of our moves and then see the result, we learn. And so do machines.

A team led by Petar Kormushev used machine learning to train a Barrett WAM robot to flip pancakes by reinforcement learning. After 50 attempts, the robot was capable of turning a pancake. Using a complex mixture of techniques based on Markov Decision Processes, the team designed a system capable of understanding what worked and what went wrong.

At BBVA Data & Analytics, we are exploring this same model to optimize pricing strategies, which can help increase revenue and retain customers. In the following lines, we explain the basic pillars of this methodology.

## Markov Decision Process

Reinforcement learning consists of learning to decide, in a given situation, which action is best for achieving an objective. A hardware or software agent is connected to its environment via perception and action. At each instant, the agent receives the current state $s$ from the environment through its sensors. The agent then decides to execute an action $a$, which it generates as an output. This output changes the state of the environment to $s'$, which is transmitted to the agent together with a reward signal $r$. This signal informs the agent of the utility of executing action $a$ from state $s$ to achieve a specific goal.

The agent’s behavior, or policy, must be such that it chooses actions that increase the sum of all reward signals received over time. Formally, the model consists of a state space, $S$, in which the agent can be found, and an action space, $A$, that the agent can execute. The model also includes two functions that are unknown in principle: a function $T$, which performs the state transitions, and a function $R$, which calculates the reward that the agent receives at each moment. These elements complete the description of a Markov Decision Process (MDP).
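The perceive-act loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `run_episode`, `transition`, `reward` and `policy` are hypothetical, the transition is taken to be deterministic, and the reward is assumed to depend on the state, the action and the resulting state.

```python
from typing import Callable, Hashable, Tuple

State = Hashable
Action = Hashable


def run_episode(
    transition: Callable[[State, Action], State],      # state-transition function
    reward: Callable[[State, Action, State], float],   # reward function
    policy: Callable[[State], Action],                 # the agent's behavior
    start: State,
    steps: int,
) -> Tuple[State, float]:
    """Run the agent-environment loop for a fixed number of steps."""
    state, total = start, 0.0
    for _ in range(steps):
        action = policy(state)                      # the agent decides
        next_state = transition(state, action)      # the environment changes
        total += reward(state, action, next_state)  # reward signal received
        state = next_state
    return state, total


# Toy environment (hypothetical): states are integers, actions are steps.
def toy_transition(s, a):
    return s + a


def toy_reward(s, a, s_next):
    return float(s_next)


def always_right(s):
    return 1


final_state, total_reward = run_episode(
    toy_transition, toy_reward, always_right, start=0, steps=3
)
```

In this toy run the agent visits states 1, 2 and 3, so the accumulated reward is the sum of those values. In a real problem the two environment functions are not known in advance, which is what the learning methods discussed later address.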

The goal of the agent is to find a policy $\pi$ that decides, for each state, which action to take so that some measure of long-term reward is maximized. This measure is called the optimality criterion, and it may differ depending on the problem to be solved.

Solving pricing strategy optimization (PSO) as a reinforcement learning (RL) problem requires modelling PSO as a Markov Decision Process.

## Optimization Criteria

The optimization of a pricing strategy for renewal portfolios requires handling two main goal functions:

- Revenue, defined in terms of a Bernoulli-distributed variable that indicates acceptance of a given renewal ratio.
- Retention, defined in terms of the probability of acceptance of a given renewal ratio.

We maximize the revenue function, which implicitly accounts for retention.

## State space

States are the possible conditions of the object before and after an action is executed. A simple way of explaining this is to distinguish the initial conditions from the expected optimal or worse conditions. In our example, a state is a tuple of global features and customer features.

## Action space

Actions refer to the way in which we consciously interact with the object, and they can take different forms: the pressure or softening that we apply onto a surface, or the movement of the arm, in the case of the robot and the pancake.

In PSO, the action is the renewal ratio for the current client. We assume a finite, discrete action space. The renewal ratios for each client are bounded by global constraints that depend on the client’s variables, so each customer has their own range of admissible renewal ratios. We divide that range so that each client has 10 possible actions (increments or decrements), including the value 1 (the price stays the same at the next renewal).
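One possible way to build such an action set is sketched below. The function name `renewal_actions`, the evenly spaced grid and the example bounds are assumptions made for illustration. The text only states that there are 10 actions per client and that the ratio 1 is among them.

```python
def renewal_actions(lo, hi, n_actions=10):
    """Split a client's admissible ratio range [lo, hi] into n_actions values.

    Assumption: the grid is evenly spaced, and the grid point closest to 1.0
    is snapped to exactly 1.0 so that "keep the same price" is always an
    available action. Requires n_actions >= 2.
    """
    if not lo <= 1.0 <= hi:
        raise ValueError("the range must contain the ratio 1.0")
    step = (hi - lo) / (n_actions - 1)
    grid = [lo + i * step for i in range(n_actions)]
    nearest = min(range(n_actions), key=lambda i: abs(grid[i] - 1.0))
    grid[nearest] = 1.0
    return grid


# Hypothetical client whose constraints allow ratios between 0.9 and 1.2.
actions = renewal_actions(0.9, 1.2)
```

Snapping the nearest grid point keeps the list sorted, because 1.0 always lies within half a step of that point.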

## Transition function

The transition function performs the transition from a given state to a different one after a specific action is executed. In PSO, we use a probability model (logistic regression) as a simulator that gives us feedback in the form of the probability that each client accepts a given renewal ratio (action). We then update the global variables depending on the final value of the class (renewal or not).
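A minimal sketch of such a simulator follows. Everything specific in it is an assumption made for illustration: the coefficients, the single customer feature, and the choice of `offers`, `renewals` and `revenue` as global variables are hypothetical and do not come from the model used in the project.

```python
import math
import random


def acceptance_probability(customer_features, ratio, coefs, ratio_coef, intercept):
    """Logistic model of P(renewal | customer features, renewal ratio)."""
    z = intercept + ratio_coef * ratio
    z += sum(c * x for c, x in zip(coefs, customer_features))
    return 1.0 / (1.0 + math.exp(-z))


def transition(global_state, customer, ratio, model, rng):
    """Simulate one renewal offer; return (new_global_state, renewed, p_accept)."""
    p_accept = acceptance_probability(customer["features"], ratio, **model)
    renewed = rng.random() < p_accept  # Bernoulli draw of the class
    new_state = dict(global_state)     # leave the previous state untouched
    new_state["offers"] += 1
    if renewed:
        new_state["renewals"] += 1
        new_state["revenue"] += customer["price"] * ratio
    return new_state, renewed, p_accept


# Hypothetical model and customer; a negative ratio_coef makes price
# increases less likely to be accepted.
model = {"coefs": [0.5], "ratio_coef": -4.0, "intercept": 4.0}
customer = {"features": [1.0], "price": 100.0}
state = {"offers": 0, "renewals": 0, "revenue": 0.0}

state, renewed, p = transition(state, customer, 1.0, model, random.Random(0))
```

Passing the random generator in explicitly keeps the simulation reproducible, which is convenient when the same experiences are generated repeatedly.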

## Reward function

The reward is a measure of how good the action applied to the state is. When an action is part of an optimal policy, the reward signals it, and that is when learning happens. Since our optimization criterion is to maximize the revenue function, the reward in PSO is the difference between the value of the revenue function in the previous state and in the current one.
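Reading this difference as the gain in revenue produced by the action, the reward could be written as follows. The symbol $\mathrm{Rev}$ and the sign convention are assumptions made for illustration, not notation from the original work.

$$r_{t+1} = \mathrm{Rev}(s_{t+1}) - \mathrm{Rev}(s_t)$$

Under this convention the rewards along an episode of $N$ steps telescope, $\sum_{t=0}^{N-1} r_{t+1} = \mathrm{Rev}(s_N) - \mathrm{Rev}(s_0)$, so maximizing the undiscounted sum of rewards amounts to maximizing the final revenue.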

We have complete knowledge of the model of a finite MDP if we know the state space, the action space, the probabilities that define its dynamics, and the reward function.

## Value functions

Many reinforcement learning algorithms attempt to learn the policy by approximating value functions. These functions estimate, given a policy, how good it is for an agent to be in a particular state, or how good it is to execute a certain action from a particular state. The value of a state $s$ under a policy $\pi$, denoted $V^{\pi}(s)$, is the reward expected to be obtained if we are guided by policy $\pi$ from state $s$ onwards, to infinity:

$$V^{\pi}(s) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s \right\}$$

where $E_{\pi}$ denotes the expected value given that the agent follows policy $\pi$. This function is called the state-value function.

In the same way, $Q^{\pi}(s, a)$ can be defined as the value of executing a given action $a$ from a state $s$ following a policy $\pi$, i.e. the reward expected to be obtained if we are guided by policy $\pi$ after executing action $a$ from state $s$:

$$Q^{\pi}(s, a) = E_{\pi}\left\{ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \mid s_t = s, a_t = a \right\}$$

This function is called the action-value function. In both equations the parameter $\gamma$ has been introduced as a discount factor for future rewards, following the discounted infinite-horizon optimality criterion:

$$E\left( \sum_{t=0}^{\infty} \gamma^{t} r_t \right)$$

There are always one or more policies that are better than or equal to all other policies; these are the optimal policies. An optimal policy is denoted $\pi^{*}$. All optimal policies share a single optimal state-value function:

$$V^{*}(s) = \max_{\pi} V^{\pi}(s)$$

There is also an optimal action-value function, $Q^{*}$, which is also unique and maximizes the value of any state-action pair over all policies:

$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$$

Optimal policies can easily be derived from these functions. Many methods can be applied to learn the value functions, and therefore the optimal policies, for a given finite MDP:

- Dynamic programming
  - Value Iteration
  - Policy Iteration
- Model-free methods (on-policy and off-policy methods)
  - Temporal-Difference Learning (Q-Learning)
  - Monte Carlo with Exploring Starts
- Model-based methods
  - Certainty Equivalence Method
  - Dyna-Q
  - Prioritized Sweeping
  - Queue-Dyna
  - Real Time Dynamic Programming

In the pricing strategy optimization problem, we used Q-learning to learn the value functions and the optimal policies. The implementation phase consisted of exploring the domain by executing different strategies, normalizing and clustering the states, building a Q-table initialized to zero with a number of states equal to the number of clusters, and replaying all experience tuples through the Q-learning update. For each experience tuple, the states must be normalized and discretized using the normalization factors and clusters obtained above.
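The steps above can be sketched with a small tabular Q-learning routine. This is a simplified illustration, not the project's implementation: the function names, the nearest-centroid discretization, the learning rate `alpha`, the discount `gamma` and the two-action toy data are all assumptions (the real setting uses 10 actions per client).

```python
def nearest_cluster(state, scale, centroids):
    """Normalize a raw state by per-feature factors and return the index
    of the closest centroid (squared Euclidean distance)."""
    z = [x / s for x, s in zip(state, scale)]
    dists = [sum((a - b) ** 2 for a, b in zip(z, c)) for c in centroids]
    return dists.index(min(dists))


def q_learning(experiences, n_states, n_actions, alpha=0.1, gamma=0.9, sweeps=1):
    """Replay (state, action, reward, next_state) tuples through the
    standard Q-learning update, starting from a zero-initialized table."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        for s, a, r, s_next in experiences:
            target = r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
    return q


def greedy_policy(q):
    """For each state, pick the action with the highest learned value."""
    return [row.index(max(row)) for row in q]


# Hypothetical normalization factors, centroids and raw experience tuples.
scale = [10.0, 2.0]
centroids = [[0.0, 0.0], [1.0, 1.0]]
raw = [
    ([1.0, 0.2], 1, 5.0, [9.0, 1.8]),
    ([9.0, 1.8], 0, 1.0, [1.0, 0.2]),
]
experiences = [
    (nearest_cluster(s, scale, centroids), a, r, nearest_cluster(s2, scale, centroids))
    for s, a, r, s2 in raw
]

q = q_learning(experiences, n_states=len(centroids), n_actions=2, alpha=0.5)
policy = greedy_policy(q)
```

Because Q-learning is off-policy, the experience tuples can come from any exploration strategy, which fits the approach of first collecting experiences with different strategies and then learning from them.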

### Fernando Fernández Rebollo, an expert in Reinforcement Learning at the Carlos III University of Madrid, collaborated with us on the work described.

Have you used Reinforcement Learning for other purposes? Share your examples with us on the BBVA Data & Analytics Twitter channel!