{"nbformat":4,"nbformat_minor":0,"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.7.3"},"colab":{"name":"Qlearning-example.ipynb","provenance":[],"collapsed_sections":[]}},"cells":[{"cell_type":"code","metadata":{"id":"oo6oN95NW5Jv","colab_type":"code","colab":{}},"source":["import math, sys \n","import numpy as np\n","\n","from mpl_toolkits.mplot3d import Axes3D\n","import matplotlib.pyplot as plt\n","from matplotlib import cm\n","from os import path\n","count = 0\n","\n","from matplotlib.backends.backend_pdf import PdfPages"],"execution_count":0,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"zBBpBiwhW5Jo","colab_type":"text"},"source":["## The problem\n"," \n","There is a biased coin with $p \\in (0,1)$, $p\\neq \\frac12$, probability of getting heads and $q=1-p$ probability of getting tails. \n"," \n","We will start with an initial wealth $x = i \\delta$, $i \\in \\mathbb N$ with $i \\frac12$ then $\\alpha(x) = 1$ is the optimal strategy. \n","So we see obtain the linear functional equation\n","$$\n","v(x) = p v (x+a\\delta) + q v(x-a\\delta)\\,,\\,\\,\\, v(0) = 0\\,,\\,\\,\\, v(m\\delta) = m\\delta\\,.\n","$$\n","This can be solved iteratively (value iteration). "]},{"cell_type":"code","metadata":{"id":"tMBTCAbHwrfl","colab_type":"code","outputId":"09cc0b2d-4b91-43a7-e3e3-4543804ad3af","executionInfo":{"status":"ok","timestamp":1580461978895,"user_tz":0,"elapsed":940,"user":{"displayName":"David Siska","photoUrl":"https://lh3.googleusercontent.com/a-/AAuE7mBDNE5Njm-BPE60xNUkoyILKN35TdHSFWbKwDXI=s64","userId":"03664128930488109577"}},"colab":{"base_uri":"https://localhost:8080/","height":51}},"source":["p = 51.0/100.0\n","delta = 1\n","m = 3\n","\n","x_vals = np.linspace(0,delta*m,m)\n","a_vals = np.linspace(-1,1,2)\n","\n","# Lookup table for Q\n","v_new = np.zeros(np.size(x_vals))\n","v_old = np.zeros(np.size(x_vals))\n","v_old[0] = 0; v_old[np.size(x_vals)-1] = m*delta\n","\n","N_iter = 1000;\n","\n","a = -1;\n","if p > 0.5:\n"," a = 1;\n","\n","tol = 1e-8\n","\n","for k in range(0,N_iter):\n"," # update value function\n"," v_new[0] = 0; v_new[np.size(x_vals)-1] = m*delta\n"," for i in range(1,np.size(x_vals)-1):\n"," v = p*v_old[i+a] + (1-p)*v_old[i-a]\n"," #print(v)\n"," v_new[i] = v\n","\n"," #check for convergence\n"," err_v = np.abs(v_new - v_old)\n"," err = np.linalg.norm(err_v)\n"," if (err < tol):\n"," print(\"Converged in %d iterations\" % k)\n"," break;\n","\n"," #ready for next iteration\n"," v_old = v_new;\n"," v_new = np.zeros(np.size(x_vals))\n","\n","\n","print(v_old[1:m-1])\n"],"execution_count":5,"outputs":[{"output_type":"stream","text":["Converged in 1 iterations\n","[1.53]\n"],"name":"stdout"}]},{"cell_type":"markdown","metadata":{"id":"on5UAMVAW5Jt","colab_type":"text"},"source":["## Action-values or Q-values\n","\n","Unlike in a control problem setting we don't assume the knowledge of the value of $p$.\n","\n","For a policy $\\alpha = \\alpha(x)$ define Q values (or action-values) as: \n","$$\n","Q^\\alpha(x,a) := \\mathbb E \\Big[ J^\\alpha(X_1) | X_0 = x, \\alpha_0 = a \\Big]\\,.\n","$$\n","We see that this is expected reward for executing action $a$ in state $x$ and then following policy $\\alpha$ thereafter. \n","At this point we still don't know how this shall be computed. "]},{"cell_type":"markdown","metadata":{"id":"AZ3e8UD9W5Jt","colab_type":"text"},"source":["## The Q-learning algorithm \n","\n","We need to fix the learning rate at each step: $(\\gamma_k)_{k\\in \\mathbb N}$ such that $\\gamma_k \\in (0,1)$, $\\sum_k \\gamma_k^2 < \\infty$ and $\\sum_k \\gamma_k = \\infty$. We can take $\\gamma_k = \\frac1k$ to begin with.\n","\n","We start by making an initial guess for $Q$, call it $Q_0 = Q_0(x,a)$.\n","The learning proceeds in episodes, where at the $k$-th episode:\n","1. We are in the state $x_k$ (this can be either the resulting state of a previous episode or one chosen at random).\n","1. We select and perform action $a_k$ (randomly seems fine).\n","1. We observe the state we landed in $y_k$, if $y_k = 0$ then set $R_k = 0$ (we lost), if $y_k = m\\delta$ then set $R_k = m\\delta$ (we won), otherwise we are still in the game and $R_k = V_{k-1}(y_k)$, where\n","$$\n","V_{k-1}(y_k) := \\max_{b\\in A} Q_{k-1}(y_k,b)\\,. \n","$$\n","1. We adjust the $Q_{k-1}(x_k,a_k)$ to $Q_k(x_k,a_k)$ using $\\gamma_k$ as\n","$$\n","Q_k(x_k,a_k) = (1-\\gamma_k) Q_{k-1}(x_k,a_k) + \\gamma_k R_k\n","$$\n","and we leave the remaining values of $Q_k=Q_{k-1}$ in this episode.\n","\n","That's all. "]},{"cell_type":"markdown","metadata":{"id":"IZGiSt_SW5Ju","colab_type":"text"},"source":["## Conclusion\n","\n","Once (if) the algorithm converges so that $Q^*(x,a) = \\lim_{k\\to \\infty} Q_k(x,a)$ we then have $v^*(x) = \\max_a Q^*(x,a)$ for all $x,a$."]},{"cell_type":"code","metadata":{"id":"O8cxpHdLW5Jz","colab_type":"code","outputId":"1e108446-1edd-44ac-bc3c-e0eed5247d6d","executionInfo":{"status":"ok","timestamp":1579546905880,"user_tz":0,"elapsed":5666,"user":{"displayName":"David Siska","photoUrl":"https://lh3.googleusercontent.com/a-/AAuE7mBDNE5Njm-BPE60xNUkoyILKN35TdHSFWbKwDXI=s64","userId":"03664128930488109577"}},"colab":{"base_uri":"https://localhost:8080/","height":34}},"source":["#p = 44.0/100.0\n","#delta = 1\n","#m = 5\n","\n","x_vals = np.linspace(0,delta*m,m)\n","a_vals = np.linspace(-1,1,2)\n","\n","# Lookup table for Q\n","Q_0 = np.zeros((np.size(x_vals),np.size(a_vals)))\n","\n","\n","i_0 = np.random.randint(1,m-1) # includes first but excludes last\n","\n","def choose_action(Q, i):\n"," a_idx = np.random.randint(0,2) # will return either 0 or 1\n"," return int(a_vals[a_idx]),a_idx\n"," \n","def update_V(Q,i):\n"," return np.max(Q[i,:]) \n"," \n","N_learnsteps = 10**5\n","\n","win_count = 0\n","lose_count = 0\n","\n","for k in range(1,N_learnsteps+1):\n"," a, a_idx = choose_action(Q_0,i_0)\n"," coin = -1+2*int(np.random.binomial(1,p))\n"," i_1 = i_0 + coin*a\n"," #print(\"Old idx: %d, Action: %d, Coin: %d, New idx: %d\" % (i_0, a, coin, i_1))\n"," R = 0\n"," if (i_1 <= 0):\n"," # R stays at 0 as set above\n"," i_1 = np.random.randint(0,m) # gamme ended, learning hasn't so we restart\n"," lose_count = lose_count + 1\n"," elif (i_1 >= m-1):\n"," R = m*delta\n"," i_1 = np.random.randint(0,m) # gamme ended, learning hasn't so we restart\n"," # print('Reward %f' % R)\n"," win_count = win_count + 1\n"," else:\n"," R = update_V(Q_0,i_1)\n"," \n"," # We adjust Q values\n"," #print('Reward %f' % R)\n"," \n"," #gamma = 1.0/k\n"," gamma = np.power(1.0/k,0.55)\n"," #gamma = 1.0/10\n"," if (i_0 == 0):\n"," #print('Loosing position at iteration %d' % k)\n"," Q_0[i_0,a_idx] = 0\n"," elif (i_0 == m-1):\n"," #print('Winning position at iteration %d' % k)\n"," Q_0[i_0,a_idx] = m * delta \n"," else: \n"," Q_0[i_0,a_idx] = (1-gamma)*Q_0[i_0,a_idx] + gamma*R \n"," \n"," # And we get ready for next round\n"," i_0 = i_1\n"," #i_0 = np.random.randint(1,m)\n"," \n"," \n","print('Win count: %d, Lose count: %d' % (win_count,lose_count))"],"execution_count":0,"outputs":[{"output_type":"stream","text":["Win count: 8517, Lose count: 8351\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"NI-gPunlW5J3","colab_type":"code","outputId":"ab23b0a6-3e7a-4cb0-c0b7-e7bdc27ca889","executionInfo":{"status":"ok","timestamp":1579546911137,"user_tz":0,"elapsed":951,"user":{"displayName":"David Siska","photoUrl":"https://lh3.googleusercontent.com/a-/AAuE7mBDNE5Njm-BPE60xNUkoyILKN35TdHSFWbKwDXI=s64","userId":"03664128930488109577"}},"colab":{"base_uri":"https://localhost:8080/","height":136}},"source":["Q_0"],"execution_count":0,"outputs":[{"output_type":"execute_result","data":{"text/plain":["array([[0. , 0. ],\n"," [1.44395548, 4.65082863],\n"," [5.12334816, 6.25878317],\n"," [6.41451261, 6.76310919],\n"," [6.81485198, 6.92786616],\n"," [6.94621711, 6.98206721],\n"," [7. , 7. ]])"]},"metadata":{"tags":[]},"execution_count":5}]},{"cell_type":"code","metadata":{"id":"T08GWL9PW5J9","colab_type":"code","outputId":"abcea29b-d346-45b4-c1da-14c71fae2d6b","executionInfo":{"status":"ok","timestamp":1579546912565,"user_tz":0,"elapsed":504,"user":{"displayName":"David Siska","photoUrl":"https://lh3.googleusercontent.com/a-/AAuE7mBDNE5Njm-BPE60xNUkoyILKN35TdHSFWbKwDXI=s64","userId":"03664128930488109577"}},"colab":{"base_uri":"https://localhost:8080/","height":119}},"source":["# Let us calculate V^* - the optimal payoff:\n","V_ast = np.zeros(np.size(x_vals))\n","\n","for i in range(0,np.size(x_vals)):\n"," V_ast[i] = np.max(Q_0[i,:])\n","\n","print(\"Value from Q-learning\")\n","print(V_ast[1:m-1])\n","\n","print (\"Value exact\")\n","print(v_old[1:m-1])\n","\n","print(\"Difference\")\n","print(np.abs(v_old[1:m-1] - V_ast[1:m-1]))\n","# Let us calculate a^* - the optimal action:\n","a_ast = np.zeros(np.size(x_vals))\n","\n","for i in range(0,np.size(x_vals)):\n"," a_ast[i] = a_vals[np.argmax(Q_0[i,:])]\n","\n","\n","#print(a_ast[1:m-1])\n","\n"],"execution_count":0,"outputs":[{"output_type":"stream","text":["Value from Q-learning\n","[4.65082863 6.25878317 6.76310919 6.92786616 6.98206721]\n","Value exact\n","[4.6730769 6.23076922 6.74999999 6.92307692 6.98076923]\n","Difference\n","[0.02224827 0.02801395 0.0131092 0.00478924 0.00129799]\n"],"name":"stdout"}]},{"cell_type":"code","metadata":{"id":"Z-B6DMdyW5KG","colab_type":"code","colab":{}},"source":[""],"execution_count":0,"outputs":[]}]}