1 Introduction 3,2 The Classical MAB Problem 3,2 1 A Bandit Process 3. 2 2 Extension to MAB process 4,2 3 The MAB Problem Statement 4. 3 Solving The MAB Problem 5,3 1 Stochastic Dynamic Programming Solution. Backward Induction 5,3 2 Gittins Index Theorem,Forward Induction 6. 3 3 Advantage of Gittins Index Theorem 8,4 Proof of Gittins Index Theorem 8. 4 1 An Intuitive Proof via Interchange Argument 8,4 2 A Rigorous Proof by Tsitsiklis 9. 5 References 13,1 Introduction, It is quite common in real world situations that a decision maker is faced with the problem of. allocating limited available resources to a number of competing projects For example machine. job scheduling project management and clinical trials are all concerned with the reallocation. of limited available resources to competing jobs projects or trials These examples belong. to a class of sequential resource allocation problems known as Multi armed bandit MAB. problems From a paradigmatic point of view MAB problems highlight the fundamental. con ict between exploitation choosing the best decision to maximize the immediate expected. payo and exploration trying other decisions to better understand their expected payo. In fact the name multi armed bandit was traditionally motivated by the single arm bandit. the slot machine In the MAB case the slot machine has several arms and the gambler is. faced with the decision of whether to pull the best arm based on current knowledge or try. other arms in hope to maximize future payo s The report will focus on the classical MAB. problem and its optimal solution,2 The Classical MAB Problem. First we de ne the trivial case of a single armed bandit process and then extend it to the. multi armed bandit process Next the MAB problem is stated as a maximization problem. 2 1 A Bandit Process, A bandit process refers to a single armed bandit problem The arm is selected pulled re. peatedly generating two random sequences,X 0 X 1 X 2. R X 0 R X 1 R X 2, where X n R refers to the state of the bandit process after being selected n times and. R X n R refers to the reward obtained as a result of being in state X n after the nth. selection We will assume the bandit process is a time homogeneous Markov process and thus. the state transition is modeled as,X n 1 f X n W n n 0 1 2. where f is given and W n R is a sequence of independent random variables that are also. independent of X 0,2 2 Extension to MAB process, A k armed bandit process consists of k independent single arms where only one arm is played. at every decision time This generates k independent single armed bandit processes as de. scribed earlier When the decision maker selects an arm the other arms are frozen and their. corresponding processes are frozen as a result Only the process of the selected arm is con. tinued leading to a state change and a reward being obtained The whole system evolves. according to the following model,f Xi n Wi n,i i if Ui n 1. where i 1 2 k denotes the ith arm and Ui is the decision to select 1 or freeze 0 the. ith arm Therefore the sequence U 0 U 1 U 2 where U n U1 n U2 n Uk n is. a time homogenous Markov decision policy of the form. where X n X n X2 n Xk n Here it is su cient to con ne our search within the set. of Markov decision policies as a result of assuming a Markov MAB process In other words. the optimal policy is going to be a Markov decision policy The sequence U 0 U 1 U 2. is referred to as a scheduling policy,2 3 The MAB Problem Statement. The problem can be stated as the following,Determine a scheduling policy g that maximizes. J E Ri Xi t Ui t X 0 3, subject to 1 and 2 B 1 1 is a discount factor which guarantees convergence of the summa. tion over in nite time horizon,3 Solving The MAB Problem. The MAB problem involves sequential decision making and therefore it can be solved via. stochastic dynamic programming backward induction Albeit being intractable stochastic. dynamic programming was the only known approach and presented a limited understanding. of the structure of the optimal solution until Gittins was able to formulate and prove the. optimal solution to be simple and of an index type via forward induction We will show the. intractability of backward induction formulation and proceed to forward induction formulation. in deriving the index type optimal solution,3 1 Stochastic Dynamic Programming Solution. Backward Induction, The standard in nite horizon dynamic programming formulation can be applied to the MAB. problem The value function can be written as following. V x sup E R X U V f X W U X x U u 4,where x Rk and ui 0 1. We ve assumed earlier that the MAB process is a time homogenous Markov process We add. a second assumption that the reward function R n is bounded The result is a controlled. Markov decision problem for which the in nite horizon dynamic program converges to a unique. xed point solution that is optimal The resulting optimal policy is also a Markov decision. policy as mentioned earlier, Unfortunately the dynamic program su ers from the curse of dimensionality in which the. size of the program grows exponentially with number of states of each arm So if we have k. arms and each arm can occupy N number of states then the size of the dynamic program is. k N possible states The curse of dimensionality makes the dynamic program computationally. infeasible and hence o ers little information about the structure of the optimal solution Next. we present the approach of forward induction to tackle this problem. 3 2 Gittins Index Theorem,Forward Induction, Stochastic dynamic programming solution proceeds backward on induction and hence leads to. no loss of optimality at the expense of computational complexity On the other hand forward. induction reduces the computational complexity at the expense of possible loss of optimality. To begin with we consider a myopic policy which maximizes the conditional expected reward. over the next stage i e one step look ahead Although myopic policies are much simpler to. compute they are generally not optimal, Next we can improve upon myopic policies by maximizing the conditional expected total. reward over the next T stages In other words we consider T step look ahead policies where T. is a xed number As T increases the optimality of the T step look ahead policies improves at. the expense of computational simplicity Nonetheless such policies are generally suboptimal. Furthermore the notion of T step look ahead policies can be extended by varying number of. look ahead stages call it In this case it does not make sense to maximize the conditional. expected total reward because will grow arbitrarily large and make the comparison among. the decision rules meaningless in addition to worsening computation feasibility Instead we. maximize over the conditional expected total reward rate Then we pick the decision rule. with the best reward rate and select it to run times The maximization is repeated at the. end of runs Policies generated by this procedure are called forward induction policies and. they are generally suboptimal except for certain stochastic decision problems to which our. de ned classical MAB problem belongs to, In order to have an intuitive understanding of why forward induction policies may be subopti. mal consider the example of a car traveling in one direction and there are several intersecting. routes along that direction Each route has a certain speed limit and we are interested in. maximizing the total discounted distance traveled over an in nite time horizon Rationally. we would want to maximize the immediate distance traveled by picking the fastest road before. the discount factor becomes smaller and smaller with time So a forward induction policy. would pick the route with the highest distance rate speed limit as long as we do not intersect. a route with a higher speed limit It is possible to run into a situation where we prefer our. route over a slower route that leads later on to a route much faster than ours Moreover. it is also possible that our route ends later on at an intersection which provides alternative. routes that are much slower than our route and the previous routes we rejected Overall it is. possible that picking the fastest route via forward induction policy lead to accumulating less. discounted distance especially when the discount factor is closer to unity Hence the forward. induction policy may be suboptimal, On the other hand there is one situation where we are guaranteed to always accumulate. the most traveled distance by always picking the fastest route We would require to always. have access to the slower routes we rejected earlier at each intersection If we compare this. example to the MAB bandit and equate the intersecting routes to the arms of the bandit and. traveling speed to reward rate then the forward induction policy is optimal for the classical. MAB problem due to the following assumptions we made. 1 Only one arm is played at each decision time, 2 Frozen arms that we rejected are always available for continuation at next decision time. 3 The freezing time does not a ect the state and reward sequence after continuing a frozen arm. 4 Frozen arms contribute no rewards while frozen, Therefore a forward induction policy is optimal for the MAB problem and can be enumerated. as follows, 1 At time t for each arm i 1 k maximize over the conditional expected reward rate. s Ri Xi t s xi t,vi xi t max hP i 5,E s t x i t,2 Pick the arm with the highest reward rate. i max vi xi t 6, 3 Run the process for the duration of by repeatedly playing arm i. 4 Repeat 1 3 at next decision time t, The value vi xi t is called dynamic allocation index or Gittins index of arm i at state xi t. The above result can be restated by what is known as Gittins Index Theorem. Reward is maximized by always continuing the bandit having greatest value of dynamic. allocation index,3 3 Advantage of Gittins Index Theorem. Gittins index theorem based on forward induction simpli es the computational complexity. signi cantly when compared to backward induction Recall that under backward induction. the computational complexity grows exponentially and the size of the problem is exponential. in N Under Gittins index theorem the problem reduces to a size that is linear in N i e. k N where k is the number of arms and N is the number of states Moreover Gittins index. theorem exposes the nature of the optimal policy to be of an index type. 4 Proof of Gittins Index Theorem, Over the last 40 years Gittins index theorem had been proven and reproved several times. These proofs vary in di culty and interpretation of the multi armed bandit problem Here. we present a simple intuitive proof based on an interchange argument 2 and then proceed. to a more rigorous proof by Tsitsiklis 3,4 1 An Intuitive Proof via Interchange Argument. The intuition behind the interchange argument is that it is optimal to select earlier the arm. with greatest Gittins index conditional expected reward rate because this allows us to. accumulate more rewards as soon as possible before the discount factor increases geometrically. with time The proof goes as follows, Let there be two arms which belong to a 2 armed bandit process At a decision time t arm. B1 has a Gittins index v B1 and an optimal stopping time and arm B2 has a Gittins index. v B2 and an optimal stopping time Suppose Gittins index of arm B1 is greater than Gittins. index of arm B2 i e v B1 v B2 There are two possible decision rules give priority to arm. B1 followed by playing arm B2 or give priority to arm B2 followed by playing arm B1 This. is illustrated by the following diagram where each arm is run for the duration of its optimal. stopping time, Then the conditional expected total reward over the duration of for the policy that gives. priority to arm B1 is R B1 E R B2 whereas the conditional expected total reward for. the policy that gives priority to arm B2 is R B2 E R B1 R B1 and R B2 are the. numerators of the Gittins index of arm B1 and arm B2 respectively We have assumed that. arm B1 has a greater Gittins index than arm B2 Therefore. R X s 0 R X s 0,E s 0 B1 B1 x B1 E s 0 B2 B2 x B2,v B1 v B2 hP i hP i. E s 0 xB1 0 E s 0 xB2 0,R B1 E R B2 R B2 E R B1, This implies that giving priority to the arm with the greater Gittins index i e arm B1 yields. higher conditional expected total reward,4 2 A Rigorous Proof by Tsitsiklis. The following proof by Tsitsiklis is complete as it addresses the bandit problem for multiple. arms and compares policies in terms of the conditional expected total reward over an in nite. horizon In order to simplify the calculations Tsitsiklis assumes that the MAB process is a. semi Markov process in continuous time with an exponential discount rate He also assumes. that the arms have disjoint nite state spaces The process is time homogenous and the arms. are independent All these assumptions conform to out de nition of the classical multi arm. bandit problem, We are interested in maximizing the conditional expected total reward. e ti Ri X 0 8, where 1 and e t is the discount factor as a function of continuous time Ri is the reward. received as a result of the ith play The discrete time expected discounted reward resulting. from the ith play is,e ti E R xi xi 9, where xi is the state of the arm played at the ith play Tsitsi. ECSE 506 Project Report Multi armed Bandit Problems ymanA Elkasrawy 260458854 Hussam Nosair 260454183 April 16 2012 1

1 Most production applications today use Python 2 7 2 Python 3 is ready for the production deployment of applications today 3 Python 2 7 will only receive necessary security updates until 20206 4 The brand name Python encapsulates both Python 3 and Python 2 1 1 2Recommendations Note The use of Python 3 is highly preferred over Python 2 Consider upgrading your applications and

How many books have you got I ve got six seven eight books How many pupils are there There are twenty thirty pupils What s colour is it It is blue grey What s your favourite colour My favourite colour is pink purple brown grey How many colours are there There is one colour It s blue There are three colours blue purple and yellow Les go ts

INSTRUCTOR S MANUAL TO ACCOMPANY ITERATURE An Introduction to Fiction Poetry Drama and Writing TWELFTH EDITION Boston Columbus Indianapolis New York San Francisco Upper Saddle River Amsterdam Cape Town Dubai London Madrid Milan Munich Paris Montreal Toronto Delhi Mexico City S o Paulo Sydney Hong Kong Seoul Singapore Taipei Tokyo KENN 7938 bkfm i lii KENN 7938 bkfm i lii 2 13 12 12 31 PM

Writing Skills Practice Book for EFL Beginning Intermediate Level D e v e l o p i n g W r i t i n g W R I T I N G S K I L L S P R A C T I E B O O K F O R E F L P E T E R S O N 4155 DevelopingDeveloping Writingriting Developing Writing Writing Skills Practice Book for EFL P AT R I C I A W I L C O X P E T E R S O N Each of the twenty chapters in Developing Writing is

pada perang Badar karena sibuk mengurusi putri Rasulullah n istri beliau yang sedang sakit Jadi beliau hanya tinggal di Madinah 5 Lihat Al Musnad jld 1 h 57 Hadits ini dishahihkan oleh Ahmad Syakir 6 Dinukil dari Utsman bin Affan karya Muhammad Husain Haekal penerjemah Ali Audah Jakarta Pustaka Litera Antar Nusa h 35 3 Laporan Reguler SYAMINA 12 September 2016 Ketika istri

Seluruh laporan kami bisa diunduh di website www syamina org 2 Laporan Bulanan SYAMINA XV Desember 2016 umat Islam dan tuntutan untuk memproses para pembunuh Utsman tidak menemukan titik temu Atas provokasi para pembunuh Utsman terjadilah perang Jamal antara pasukan Ali di satu pihak dan pasukan Ummul Mukminin Aisyah Thalhah bin Ubaidillah dan Zubair bin al Awwam di pihak yang lain Dan

DOS Les programmes tels que DEBUG ou EDIT sont parfois appel s utilitaires Fichiers bat Les commandes du DOS peuvent tre enregistr es dans un fichier texte auquel on donne l extension bat ou cmd Ces fichiers son t alors en quelque sorte des programmes interpr t s des scripts

Pr face Japonismes 2018 les mes en r sonance est une manifestation d une ampleur in gal e initi e par le gouvernement japonais afin de mieux faire conna tre la culture de l Archipel

Tadao Ando grande figure de l architecture contemporaine L exposition interroge les principes de la cr ation de Tadao Ando comme son usage du b ton lisse la pr minence donn e aux volumes g om triques simples l int gration des l ments naturels tels que la lumi re ou l eau dans ses dispositifs spatiaux ou encore l importance qu il accorde l intensit de l