We employed bandits in a product I worked on. It was selecting which piece of content to show in a certain context, optimizing for clicks. It did a great job, but there were implications that I wish we understood from the start.
There was a constant stream of new content (i.e., arms for the bandits) to choose from. Instead of running manual experiments (e.g., A/B tests or other designs), the bandits would sample the new set of options and arrive at a new optimal mix much more quickly.
But we did want to run experiments with other things around the content that was managed by the bandits (e.g., UI flow, overall layout, other algorithmic things, etc.). It turns out bandits complicate these experiments significantly. Any changes to the context in which the bandits operate lead them to shift things more towards exploration to find a new optimal mix, hurting performance for some period of time.
We had a choice to make here... treat all traffic, regardless of cohort, as a single universe that the bandits are managing (so they would optimize for the mix of cohorts as a whole), or set up separate bandit stats for each cohort. If things are combined, then we can't use an experiment design that assumes independence between cohorts (e.g., A/B testing) because the bandits break independence. But the optimal mix will likely look different for one cohort vs. another vs. all of them combined. So it's better for experiment validity to isolate the bandits for each cohort. But then small cohorts can take quite a while to converge before we can measure how well things work. All of this puts a real limit on iteration speed.
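To make the two options concrete, here is a minimal sketch using Beta-Bernoulli Thompson sampling; the arm and cohort names are purely illustrative, not how our system was actually keyed.

```python
import random
from collections import defaultdict

class BetaBernoulliBandit:
    """Thompson sampling over click / no-click rewards."""

    def __init__(self, arms):
        # Beta(1, 1) prior on each arm's click-through rate.
        self.alpha = {arm: 1.0 for arm in arms}
        self.beta = {arm: 1.0 for arm in arms}

    def select(self):
        # Sample a plausible CTR for each arm and play the best sample.
        return max(self.alpha,
                   key=lambda a: random.betavariate(self.alpha[a], self.beta[a]))

    def update(self, arm, clicked):
        if clicked:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

arms = ["content_1", "content_2", "content_3"]   # hypothetical arm ids

# Option 1: one shared universe -- cohorts pollute each other's posteriors,
# which breaks the independence assumption behind A/B analysis.
shared = BetaBernoulliBandit(arms)

# Option 2: separate state per experiment cohort -- no cross-cohort leakage,
# but small cohorts converge slowly.
per_cohort = defaultdict(lambda: BetaBernoulliBandit(arms))

arm = per_cohort["cohort_B"].select()
per_cohort["cohort_B"].update(arm, clicked=True)
```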
Things also become very difficult to reason about because there is state in the bandit stats that is being used to optimize things. You can often think of that as a black box, but sometimes you need to look inside, and that can be very difficult.
Much (all?) of this comes from bandits being feedback loops - these same problems are present in other approaches where feedback loops are used (e.g., control theory based approaches). Feedback mechanisms are incredibly powerful, but they couple things together in ways that can be difficult to tease apart.
>Things also become very difficult to reason about because there is state in the bandit stats that is being used to optimize things. You can often think of that as a black box, but sometimes you need to look inside, and that can be very difficult.
One way to peek into the state is to use Bayesian models to represent the "belief" state of the bandits. For example, an arm's "utility" can be a linear function of the features of the arm. At each period, you can inspect the coefficients (and their distribution) for each arm.
See this package:
https://github.com/bayesianbandits/bayesianbandits
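To make "inspect the coefficients" concrete, here is a generic conjugate Bayesian linear regression sketch (a minimal illustration, not the linked package's API): the posterior mean and covariance over the weights are the bandit's belief state, and you can dump them at any time.

```python
import numpy as np

class BayesianLinearBelief:
    """Bayesian linear regression mapping arm features to expected reward.

    The posterior over the weight vector is the bandit's "belief" state,
    so it can be inspected instead of treated as a black box.
    """

    def __init__(self, n_features, prior_var=1.0, noise_var=1.0):
        self.precision = np.eye(n_features) / prior_var   # posterior precision
        self.b = np.zeros(n_features)                      # precision-weighted mean
        self.noise_var = noise_var

    def update(self, x, reward):
        x = np.asarray(x, dtype=float)
        self.precision += np.outer(x, x) / self.noise_var
        self.b += x * reward / self.noise_var

    def posterior(self):
        cov = np.linalg.inv(self.precision)
        return cov @ self.b, cov

belief = BayesianLinearBelief(n_features=3)
belief.update([1.0, 0.2, 0.0], reward=1.0)   # hypothetical arm features and reward
mean, cov = belief.posterior()
print("coefficient means:", mean)
print("coefficient std devs:", np.sqrt(np.diag(cov)))
```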
The real trouble with bandits is that people don't bother to look into what the real potential benefit is for the target they're optimizing. Despite theoretically loving bandit techniques, I've convinced multiple teams not to go down that path, because the real advantage of using them is a slightly more optimal mix of people across experiment arms than if you ran the experiments manually.
At some scales it can make sense, but for the vast majority of products/companies the statistical benefits don't outweigh the added complexity.
Bandits work best for relatively simple optimization choices at very large scale. If you care at all about the information gained in an experiment, it's fairly unlikely bandits are a good choice for you (or any reinforcement learning algorithm at that point).
Agreed. In the case I was describing above, new arms were constantly being introduced (often several times a day for each of hundreds of thousands of scenarios). Manual experiments weren't an option. This also meant we were in a constant state of partial convergence for most scenarios, but the same would be true with experiments.
How to cull arms, so that there are enough samples for any kind of convergence, is another problem in this setup. We eventually built an ML model to select arms and used bandits to pick between them. This proved "too effective". The arms were all user-generated content, and the bandits on top of the model, both set up to maximize clicks, were stupidly good at surfacing inappropriate content because it got a lot of clicks. We ended up having to put more safeties on arm selection for certain categories of our content where we had the most inappropriate submissions.
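The shape of that pipeline might look something like the sketch below; the category names, the moderation_score function, and the thresholds are hypothetical stand-ins for whatever safety signals were actually available.

```python
HIGH_RISK_CATEGORIES = {"memes", "open_comments"}   # hypothetical category names

def moderation_score(arm):
    # Hypothetical stand-in for a content-safety classifier (higher = riskier).
    return arm.get("risk", 0.0)

def eligible_arms(candidates, category, model_score, k=20):
    """Safety-screen user-generated arms before the bandit ever sees them."""
    # Categories with a history of inappropriate submissions get a stricter gate.
    threshold = 0.2 if category in HIGH_RISK_CATEGORIES else 0.6
    screened = [arm for arm in candidates if moderation_score(arm) < threshold]
    # The arm-selection model culls the pool to something the bandit can
    # actually converge on; the bandit explores only within this shortlist.
    return sorted(screened, key=model_score, reverse=True)[:k]

candidates = [{"id": 1, "risk": 0.1}, {"id": 2, "risk": 0.5}, {"id": 3, "risk": 0.9}]
shortlist = eligible_arms(candidates, "memes", model_score=lambda a: 1.0 - a["risk"])
```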
I’ve actually run into the exact same issue. At the time we similarly had to scrap bandits. Since then I’ve had the opportunity to do a fair amount of research into hierarchical Dirichlet processes in an unrelated field.
On a random day, a light went off in my head: hierarchy perfectly addresses the stratification vs. aggregation problems that arise in bandits. Unfortunately I’ve never had a chance to apply this (and thus see the issues) in a relevant setting since.
You can do fairly well here with ridge regression as a poor man's hierarchical model. We've used this library's Bayesian ridge regression to support a geo-pricing strategy (and it contains the Dirichlet-Multinomial approach as well): https://github.com/bayesianbandits/bayesianbandits
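A rough sketch of that idea using scikit-learn's BayesianRidge rather than the linked library (so this is not its API, and the geos, prices, and rewards are made up): encode a global price effect plus per-geo price terms, and let the ridge prior shrink the per-geo terms toward the pooled behavior.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

geos = ["us", "de", "br"]                      # hypothetical geo segments
# (geo, price shown, observed reward) -- toy data.
rows = [("us", 9.99, 1.0), ("us", 14.99, 0.0), ("de", 9.99, 1.0), ("br", 4.99, 1.0)]

def features(geo, price):
    # One global price term plus a per-geo price term; the ridge prior shrinks
    # the per-geo terms toward zero, i.e. toward the pooled/global effect.
    return [price] + [price if g == geo else 0.0 for g in geos]

X = np.array([features(g, p) for g, p, _ in rows])
y = np.array([r for _, _, r in rows])

model = BayesianRidge().fit(X, y)
mean, std = model.predict(np.array([features("de", 12.99)]), return_std=True)
print(mean, std)   # posterior mean and uncertainty can drive an explore/exploit rule
```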
Ahh, hierarchical Dirichlet processes. Sounds like you were reading the literature on Bayesian diffusion modelling / diffusion trees. I studied that stuff almost 20 years ago now, really takes me back.
Haha I’ve actually never heard of that field. My work was focused on applying Chinese restaurant process models to text analysis. But very curious what you were working on?
“If things are combined, then we can't use an experiment design that assumes independence between cohorts (e.g., A/B testing) because the bandits break independence.”
Can you explain, please?
Not the OP. I think what they are driving at is that if knowledge is discovered during exploration in cohort A, cohort B can exploit it. Then, the whole A/B test breaks down to which cohort got to benefit more from the bandit learnings.
Yes, this is exactly the kind of scenario I was alluding to.
For example, cohorts with very light traffic are likely to get an undue benefit: a lot of exploration may already have been done before the smaller cohort needs to select an arm, so things are closer to convergence by the time it does.
Another example is wildly different outcomes between cohorts. More of the exploration will be done in the cohorts with more traffic, leading the bandit's optimization to fit large cohorts better than lower-traffic ones.
Even if you do manage to make things independent, you have to wait for bandit convergence before you know what converged results will look like. That requires being able to measure convergence, which isn't always easy, especially if you didn't know you'd need to do that before designing the system.
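One concrete way to measure convergence for a Beta-Bernoulli bandit is to estimate the probability that the current leader really is the best arm, and only trust read-outs once that clears some threshold. A minimal sketch (the counts and the 0.95 threshold are illustrative):

```python
import numpy as np

def prob_best(alpha, beta, n_samples=10_000):
    """Monte Carlo estimate of P(arm i has the highest CTR) for each arm."""
    draws = np.random.beta(alpha[:, None], beta[:, None], size=(len(alpha), n_samples))
    wins = np.bincount(draws.argmax(axis=0), minlength=len(alpha))
    return wins / n_samples

# Posterior counts per arm: 1 + clicks, 1 + non-clicks.
alpha = np.array([121.0, 96.0, 16.0])
beta = np.array([881.0, 906.0, 186.0])

p = prob_best(alpha, beta)
converged = p.max() > 0.95
print(p, converged)
```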
With all of these problems, we kept bandits, and even expanded their application. At least for the 10 years I was still around. They are incredibly powerful. But there was a lot of "I wish those damned bandits didn't work so well!"
For anyone who is not aware, A/B tests assume cohorts behave independently of each other. The less true that is, the less reliable the results are. This was even worse for us in parts of our system where there were no bandits, but there were direct interactions between individuals.
Could you expand on control theory? Is it PID?
I was just mentioning that any feedback loops have the potential downside of coupling things together in ways that make experiments (and analysis, improvements, etc.) difficult.
I used control theory as an example of another type of feedback loop because we were using controllers in other parts of the system. One was roughly the proportional term of a PID controller. Another was both the proportional and integral terms, but with a lot of domain specific heuristics added on (e.g., ML models to set baselines that help the controllers understand how quickly to move values). This was all incredibly powerful, but also required experiment design and analysis that was far more complicated than we would have liked.
This was not a ride-sharing company, but you can imagine dynamic pricing for Uber or Lyft working this way. Use a controller to try to keep the number of available riders and drivers in balance. If riders exceed drivers, increase payment to drivers (and therefore cost to riders), thereby attracting more drivers. If drivers are sitting idle, decrease payment. Ideally, this is done constantly to respond to time of day, weather, traffic, etc. (with some predictive baseline expectations to keep it from being too slow to respond). When you get it right, riders get the cheapest price that ensures a driver is available. In economic terminology, this is finding the market clearing price. A controller can get you a long way in a hurry on such a system. But it's certainly not the only approach that can work, and it takes a lot to get it right (I have no idea how Uber and Lyft actually do any of this).
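As a toy sketch of the proportional-plus-integral idea (the gains, bounds, and baseline here are made up, not anything a real marketplace uses):

```python
def price_multiplier(waiting_riders, idle_drivers, state,
                     kp=0.005, ki=0.001, baseline=1.0):
    """PI-style controller nudging price toward supply/demand balance.

    `state` carries the accumulated (integral) error between calls, and
    `baseline` is where a predictive model would set the starting point.
    """
    error = waiting_riders - idle_drivers          # > 0 means demand exceeds supply
    state["integral"] = state.get("integral", 0.0) + error
    multiplier = baseline + kp * error + ki * state["integral"]
    return max(0.5, min(3.0, multiplier))          # clamp to sane bounds

state = {}
for riders, drivers in [(120, 100), (140, 105), (90, 110)]:
    print(price_multiplier(riders, drivers, state))
```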
Does the name "Multi-Armed Bandits" have anything to do with "One Armed Bandits" - old style slot machine/fruit machines/gambling machines with a big lever that people would pull?
edit: ah ok yes it does: https://en.wikipedia.org/wiki/Multi-armed_bandit
Multi-Armed Bandits running on move trees (aka the UCT algorithm) was the invention that laid the foundation for computers cracking Go.
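For reference, the heart of UCT is just the UCB1 rule applied at each node of the search tree; schematically:

```python
import math

def uct_select(children, exploration=1.4):
    """Pick the child maximizing mean value plus an exploration bonus.

    Each child is a (total_value, visit_count) pair; parent visits is their sum.
    """
    parent_visits = sum(n for _, n in children)

    def score(child):
        value, visits = child
        if visits == 0:
            return float("inf")   # always try unvisited moves first
        return value / visits + exploration * math.sqrt(math.log(parent_visits) / visits)

    return max(range(len(children)), key=lambda i: score(children[i]))

print(uct_select([(6.0, 10), (3.0, 4), (0.0, 0)]))   # -> 2, the unvisited child
```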
Hey, an area I actually work on and use in production! MAB in our case has shown improvements in our A/B metrics. My general takeaways from my experience with MAB:
* We have a single optimization goal (i.e., the metric to minimize or maximize). The hard part isn't defining an optimization goal. The hard part is identifying what tradeoff between different metrics stakeholders actually want. If you have goal A and goal B, where B is more aggressive than A, then getting agreement on which is better is hard. This is a people problem.
* MAB seems to be a good proof of concept that something can be optimized, but it isn't an "end game" optimization path.
* MAB for A/B testing is going to mess up your A/B data and make everything more difficult. You should have a treatment that uses the MAB algorithm and a treatment that doesn't.
All of the above is for non-contextual MAB. I am currently learning about different MAB algorithms, although each of them is pretty similar. The ones I've read about are all effectively linear/logistic regression, and the differences come from the exploration mechanism and how uncertainty is represented. Epsilon-greedy has no notion of uncertainty; exploration just happens a fixed percentage of the time. UCB is optimistic about uncertainty, and the amount of optimism controls exploration. Thompson sampling represents uncertainty with (Beta) distributions; exploration happens less as confidence about a particular set of options increases.
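For anyone curious, here is what those three exploration mechanisms look like side by side for a plain Bernoulli (click / no-click) bandit; a toy sketch, not production code:

```python
import math
import random

def epsilon_greedy(succ, trials, eps=0.1):
    # No notion of uncertainty: explore a fixed fraction of the time.
    if random.random() < eps:
        return random.randrange(len(succ))
    return max(range(len(succ)), key=lambda i: succ[i] / max(trials[i], 1))

def ucb1(succ, trials):
    # Optimism under uncertainty: the bonus shrinks as an arm is tried more.
    total = sum(trials)

    def score(i):
        if trials[i] == 0:
            return float("inf")
        return succ[i] / trials[i] + math.sqrt(2 * math.log(total) / trials[i])

    return max(range(len(succ)), key=score)

def thompson(succ, trials):
    # Sample each arm's CTR from its Beta posterior; exploration fades on its
    # own as the posteriors sharpen.
    return max(range(len(succ)),
               key=lambda i: random.betavariate(1 + succ[i], 1 + trials[i] - succ[i]))

succ, trials = [12, 30, 2], [100, 120, 10]   # clicks and impressions per arm
print(epsilon_greedy(succ, trials), ucb1(succ, trials), thompson(succ, trials))
```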
Overall it's a fun area to work in that is quite different from your typical CRUD development, which is a nice change of pace.
It's very easy to implement toy multi-armed bandits, and even getting them into production code isn't so bad. It's the reporting, education, and adding features without breaking everything that gets very hard very quickly.
> It's the reporting, education, and adding features without breaking everything that gets very hard very quickly.
What's challenging about that though?
In my experience it's all been just as straightforward as implementing the MAB itself.
I used bandits to select the best vendor/model to use on a per-customer basis for a task we had a reward signal for. It worked amazingly well, but the more important thing we gained was the ability to accurately estimate the improvement other features/changes had. Getting a hard number for 'feature X improved over the previous baseline by Y' is a game changer for a lot of product and development work. You only think feature X is a massive improvement, until you see the numbers and realize that the spacing displayed to the user was 10x more valuable than the Really Big Feature you thought was going to make a difference.
One way to address the https://en.wikipedia.org/wiki/Exploration%E2%80%93exploitati...
Just from a philosophical perspective I think multi-armed bandits are a fascinating idea and a great technique for thinking about many kinds of decision processes.