*I had a hard time wrapping my head around Reinforcement Learning, and I am still thinking through the possible implementations, implications and ramifications; that's why there will be a lot of simplifications.*

To spare you the mathematical baggage normally required to start with basic RL (MDPs, probability models and so on), I will try a different way of explaining the core of Reinforcement Learning, using very simple math and some bold assumptions :). Along the way, I hope, you will also understand where the basic RL formula comes from.

For starters, let me see if I can summarize what RL is in one line. Ready?

Exponentially weighted averaging of discounted future rewards.

Got it? If you do, there is no need to read further :), but please do ...

Basically, the idea of RL is to find the best **ACTION** to execute given the knowledge accumulated so far.
The way to do that is by using rewards and punishments to guide the algorithm towards a general goal.
RL works in an uncertain environment; what is expected from us is to have some idea of how to **REWARD** the agent when it goes in the right direction.
That is the greatest strength and the greatest weakness of RL: you absolutely must know your *rewards*; most of the other requirements are more or less optional.

A reward is expressed as a number indicating how good or bad the action the agent took was.
Rewards are just an immediate bonus; to solve the **RL** problem we need to know the **VALUE** of an action, i.e. the long-term benefit of doing something.

Here is the obligatory pictorial representation:

So the **AGENT** interacts with the **ENVIRONMENT**, transitioning from **STATE** to **STATE**. By taking different **ACTIONS** the agent receives **REWARDS** and **PUNISHMENTS** (i.e. negative rewards). The agent decides between actions depending on the long-term **VALUE** (of future accumulated rewards).
(We will not talk about **POLICIES** in this article, to simplify the discussion.)

The question then is: how do we calculate an **ACTION's long-term VALUE**, so that we can make the best decision on how to act?

The most complete way would be to sum all the future rewards (the expected return: Gt) from our current state, for all possible paths of actions, and pick the best one.

The formula for the expected return of a sequence is:

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + ... $$

As you may suspect, that is unfeasible from a computational point of view.

Another hurdle is that if we have a continuous process (T = infinity), **Gt** will be infinite for all paths-of-actions, i.e. we cannot distinguish which one is the best.
The solution is to instead find the *best next action* to take; this way we don't have to calculate all action-paths, but care only about near-term prospects.

We will be using time series again, like we did in Threensitions.

In Threensitions we approximated the y-values in the series by a number of states **S={s1,s2,s3, ... sn}**, i.e. we substituted real numbers with finite discrete values. (The lesson from our previous experiment is that in most cases 200-300 states are more than enough to represent the data range for the test signals we tried.)
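As a minimal sketch of that discretization step (this is not the Threensitions code itself, just an illustration using `numpy.digitize` with equal-width bins):

```python
import numpy as np

def discretize(values, vmin, vmax, nstates):
    """Map continuous values to integer state ids in [0, nstates-1]."""
    # nstates equal-width bins spanning [vmin, vmax]
    edges = np.linspace(vmin, vmax, nstates + 1)
    # digitize returns bin indices starting at 1; clip handles out-of-range values
    return np.clip(np.digitize(values, edges) - 1, 0, nstates - 1)

signal = np.array([0.0, 7.4, 29.9, 30.0])
states = discretize(signal, vmin=0, vmax=30, nstates=200)
```

Every real y-value now maps to one of 200 state numbers, which is what lets us index a transition table with it.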

This time we will use a 2D table to represent transitions from one state to another, like in a Markov order-1 chain. There will be a twist on that, but be patient.

The RL algorithm requires us to have **ACTIONS**; we will use the **state transition** as our action. (In normal RL, actions can encode much richer behavior, but we will stick with this simple notion for now.)

So our goal is to award the transition/action a better score if it happens often enough. In Threensitions we used **counting** for that. But as we discussed there, a model with a simple 2D counting table is not rich enough to represent complex time series; that's why we used a 3D cube, so that the statistics can be more sparse.

The twist with RL is that instead of counting we will use **discounting**.
What does this mean? It means we add smaller and smaller parts of every next reward, instead of simply summing all the rewards:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... $$

Here **"gamma"** is the discount factor in (0, 1] (kept strictly below 1 for continuing processes, so the sum stays finite).

Why would we do that? It provides a mathematical tool to solve the infinity problem and, as we will see in a moment, it allows us to do step-by-step calculation.
The bigger **gamma** is, the more influence we receive from rewards further away.
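To see why discounting kills the infinity problem, here is a tiny sketch: with a constant reward of 1 at every step and gamma below 1, the discounted return converges to a finite value, 1 / (1 - gamma), instead of blowing up.

```python
# With a constant reward of 1 and gamma < 1, the discounted return
# G = 1 + gamma + gamma^2 + ... converges to 1 / (1 - gamma)
gamma = 0.9
G = sum(gamma ** k for k in range(1000))  # 1000 steps is effectively "infinity"
print(G)               # ~10.0
print(1 / (1 - gamma)) # 10.0
```

With plain summing (gamma = 1) the same loop would just grow without bound, so every path-of-actions would look equally "infinite".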

So the first trick is to use **DISCOUNTING** rather than simple **COUNTING**.

In normal RL the environment awards different rewards, but in our simple case we will award any successful transition/action a reward of **1**.

**Now what is the goal of our model?** The goal is to correctly predict what the next state is. This is what a normal transition model does too, but by using discounting we don't just look at the transition itself; the subsequent transitions also have a say in how valuable the action is.
There is one small problem with our current formula of discounting: we have to wait many steps before we can apply the discounts. So let's refactor it a bit:

$$G_t = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + ...) = R_{t+1} + \gamma G_{t+1}$$

Now that is more like it: we have to wait just one step to apply the discounted rewards. There is one additional subtlety: Gt and Gt+1 are both predictions; we will mention this again later.
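A quick sanity check that the one-step recursive form gives the same number as summing the discounted rewards directly (a sketch with a made-up reward list):

```python
gamma = 0.9
rewards = [1.0, 2.0, 3.0, 4.0]  # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

# direct definition: G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
direct = sum(gamma ** k * r for k, r in enumerate(rewards))

# recursive form, computed backward: G_t = R_{t+1} + gamma * G_{t+1}
G = 0.0
for r in reversed(rewards):
    G = r + gamma * G

print(direct, G)  # the two forms agree
```

The recursive form is what makes an online, step-by-step algorithm possible: at each step you only need the next reward and the next return estimate.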

The second trick **RL** uses is the so-called **temporal difference**. What does that mean?

As we said, every transition/action accumulates *value* depending on how often it is visited (counting) and how often the states that follow it are visited (discounting).
In the long run there will be some tendency toward an average value, depending on which path-of-actions is winning.

One way to do this is to apply the full reward at every step:

$$Q_t = Q_t + R_t$$

(**Q** is the state-action score, i.e. this is the value we store for every transition between state-A =to=> state-B in the table.)

But this is a cumulative reward, which, if you remember, suffers from the infinity problem.

In our reward=1 scenario this formula is equivalent to counting. In the Threensitions app it would overflow much earlier, at 65535 (the uint16 limit), to be precise. In our simple scenario we didn't hit this problem, but with a continuous signal we would eventually reach this limit and wrap around.
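Just to make the wrap-around concrete, here is a one-liner emulating a 16-bit counter in pure Python (the real table would use a uint16 type):

```python
# A uint16 counter holds 0..65535; one more increment wraps it back to 0
counter = 65535
counter = (counter + 1) & 0xFFFF  # keep only the low 16 bits, like uint16 arithmetic
print(counter)  # 0
```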

What if instead we adjust it slightly towards an average **Q** value, like this:

$$Q_t = Q_t + \alpha (R_t - Q_t)$$

**alpha** is the learning rate and describes how much of the difference between the current **Q-value** (the one in the table) and the current **R**eward is applied. If the rewards trend up/down, Q will slowly follow.
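A tiny sketch of that update rule in action: with a constant reward of 1 and alpha = 0.5, Q is nudged halfway toward the reward at every step and quickly converges to 1 instead of growing without bound.

```python
alpha = 0.5
Q = 0.0
for _ in range(20):
    R = 1.0                  # constant reward
    Q = Q + alpha * (R - Q)  # nudge Q toward R by a fraction alpha
print(Q)  # very close to 1.0
```

Compare this with the cumulative version `Q = Q + R`, which after the same 20 steps would already be at 20 and still climbing.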

What is left now is to plug in, in place of the absolute **R**eward, the **discounted reward** we mentioned earlier.

So let's substitute:

$$Q_t = Q_t + \alpha * (G_t - Q_t)$$

$$Q_t = Q_t + \alpha * (R_{t+1} + \gamma G_{t+1} - Q_t)$$

So now the state-action score (Q-value) trends toward the discounted reward.

Did you notice something icky? Yep: when we write the algorithm we have to wait one step (t+1) before we do the calculation, and then apply it backward to the previous state in the table, i.e. the update happens at time *t+1* rather than time *t*, and we update the *Q(st, st+1)* transition.

Also, we use the **predicted discounted reward**, because we don't have the **real discounted reward**. In the table we store a temporary value that tends toward the real value.

So how do we get the prediction, then? Easy:

$$G \sim max(Q)$$

Let's substitute:

$$Q_t = Q_t + \alpha * (R_{t+1} + \gamma * max(Q_{t+1}) - Q_t)$$

or shorter, for programmers:

$$Q_t \mathrel{+=} \alpha * (R_{t+1} + \gamma * max(Q_{t+1}) - Q_t)$$

The **max()** in the formula is the **prediction** part, i.e. we pick the biggest Q-value among all subsequent states. That is the most probable direction the signal will go, according to the information we currently have.

You can visualize it like this: *Every row in the table holds the Q-values between a state and each possible next state. Whichever column has the highest Q-value is the next predicted state (number).*
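That row-wise lookup can be sketched in a couple of lines, assuming a numpy Q-table (the values below are made up for illustration):

```python
import numpy as np

nstates = 5
Q = np.zeros((nstates, nstates))  # Q[s, s_next] = value of transition s -> s_next
Q[2] = [0.1, 0.0, 0.3, 0.9, 0.2]  # hypothetical learned values for state 2

current_state = 2
predicted_next = int(np.argmax(Q[current_state]))
print(predicted_next)  # 3 : the column with the highest Q-value in row 2
```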

Let's rewrite the formula with the states included, so that you can see what I just explained:

$$Q_t(s_{t}, s_{t+1}) = Q_t(s_{t}, s_{t+1}) + \alpha * ( R_{t+1}(s_{t}, s_{t+1}) + \gamma * max(Q_{t+1}(s_{t+1}) ) - Q_t(s_{t}, s_{t+1}))$$

or, in our case where the reward is always ONE:

$$Q_t(s_{t}, s_{t+1}) = Q_t(s_{t}, s_{t+1}) + \alpha * (1 + \gamma * max( Q_{t+1}(s_{t+1})) - Q_t(s_{t}, s_{t+1}) )$$

I want to draw your attention to one more thing. There is a second way to interpret the temporal difference mechanism.
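Putting the whole update together, here is a minimal sketch of the learning loop on a toy repeating signal (this is an illustration of the formula above, not the actual QLearn module; the helper name `q_update` is mine):

```python
import numpy as np

def q_update(Q, s, s_next, alpha=0.5, gamma=0.9, reward=1.0):
    """One Q-learning step for the transition s -> s_next (reward is always 1 here)."""
    # max over the next state's row is the predicted discounted return for t+1
    td_target = reward + gamma * Q[s_next].max()
    Q[s, s_next] += alpha * (td_target - Q[s, s_next])

nstates = 4
Q = np.zeros((nstates, nstates))
sequence = [0, 1, 2, 0, 1, 2, 0, 1, 2]  # a repeating toy "signal" of state ids
for s, s_next in zip(sequence, sequence[1:]):
    q_update(Q, s, s_next)

# the repeatedly seen transitions 0->1, 1->2, 2->0 accumulate the highest values
print(int(np.argmax(Q[0])))  # 1
```

Note how the update for a transition happens only once the next state has been observed, exactly the one-step delay discussed above.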

If you look at the formula :

$$Q_t = Q_t + \alpha (R_t - Q_t)$$

$$Q_t = Q_t + \alpha R_t - \alpha Q_t$$

$$Q_t = (1 - \alpha) Q_t + \alpha * R_t$$

it is exactly the same as the formula for an *exponentially weighted moving average*. Now the heading of this article makes sense.
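The algebra above is easy to verify numerically; both forms of the update produce the same number for any Q, R and alpha (the values here are arbitrary):

```python
alpha = 0.3
Q, R = 2.0, 5.0

td_form   = Q + alpha * (R - Q)           # temporal-difference style update
ewma_form = (1 - alpha) * Q + alpha * R   # exponentially weighted moving average

print(td_form, ewma_form)  # identical up to floating-point rounding
```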

DEW = Discounted Exponentially Weighted moving average

Now that we have the model, let's see the implementation.

It is split in two parts: QLearn, which is an implementation of the Q-learning algorithm, i.e. the formula we saw above.

Then we have the wrapper around it, the DEW average module, which mostly handles graphing, stats and the data.

Here is how to use it :

In [3]:

```
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (20.0, 10.0)
import sys
sys.path.append('../lib')
from dew_avg import *
#those DataSets are not available in the repo, it is external library and will take me some time to integrate it
# ... for now what you need is to load your data-set in 1D numpy array by some other means. Sorry..
from data_sources.data_sets import DataSet
ny_taxi = DataSet(fname='../data/nyc_taxi.csv', field='passenger_count')
```

In [22]:

```
ny_taxi.data[:1000].max()
```

Out[22]:

In [17]:

```
dew = DEWAvg(nstates=200, vmin=0, vmax=30000, learn_rate=0.5, gamma=0.9)
```

In [18]:

```
dew.batch_train(ny_taxi.data[:1000], log=False)
```

You can look at Threensitions for more info on the stats and an explanation of the graphing.

In [19]:

```
dew.plot(nope=True)
```

We can also print the Q-table; darker colors mean higher values.

In [20]:

```
dew.ql.qmap.max()
```

Out[20]:

In [21]:

```
dew.ql.show_map()
```

So the basic idea of RL is temporal difference learning of discounted rewards.

That's all folks .....
