At my last company I helped build an experimentation platform that processes millions of requests per day. I have some thoughts:
- The most useful resource we've found was from Spotify, of all places: https://engineering.atspotify.com/category/data-science/
- For hashing, an md5 hash of (user-id + a/b-test-id) is sufficient (quick sketch after this list). In practice we had no issues with split bias. Don't get too clever with hashing: stick to something reliable and widely supported to make any post-experiment analysis easier. You definitely want to log the user-id-to-version mapping somewhere.
- As for in-house vs external, I would probably go in-house, though that depends on the system you're A/B testing. In practice the amount of work needed to integrate a third-party tool was roughly the same as building the platform, but building the platform meant we could test more bespoke features.
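For what it's worth, here's a minimal Python sketch of that hash-based assignment, assuming md5 over a "user:experiment" string (the separator, function name, and variant labels are mine, not anything standard):

    import hashlib

    def assign_variant(user_id: str, experiment_id: str,
                       variants=("control", "treatment")) -> str:
        """Deterministically map (user, experiment) to a variant.
        md5 is used only for its uniform spread, not for security."""
        digest = hashlib.md5(f"{user_id}:{experiment_id}".encode("utf-8")).hexdigest()
        return variants[int(digest, 16) % len(variants)]

    # The same user always lands in the same bucket within an experiment, but can
    # land in different buckets across experiments (the test id salts the hash).
    # Still worth logging the returned assignment somewhere durable, as noted above.
    print(assign_variant("user-42", "checkout-redesign"))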
As someone with a lot of experience using all the main tools on the market, I’d advise that traffic split and assignment is just the tip of the iceberg when it comes to making your own tool.
If it’s a tool you ever want to get into production, you need to focus on which metrics you’re collecting, how you’re collecting them, how you’re defining a session, how you’re defining a user, how you’re complying with data privacy laws, what tools and data platforms you’re integrating with, what safeguards you’re putting in place to make sure you’re not leaking sensitive information, how you’re handling code delivery, how you’re accounting for ad blockers… to name a few.
I’d personally look for some open source tools and assess what features they provide/how they provide them to get a good idea of what is required. Better yet, I’d try and get some trials on the current market leaders such as AB Tasty, Optimizely, Web Trends Optimize, Monetate and VWO.
You’ll notice that they all offer more than just A/B testing…
Thanks for responding, I appreciate the insights here! I'm not focused on offering every single feature at the moment. This is a brand-new project and also is not in the same market as the products you have mentioned (AB Tasty, Optimizely, etc.). Those products may offer a lot more than A/B testing, but for my situation, I have a single, clear problem that I want to solve which doesn't require much.
No probs at all mate, feel free to reach out if you need to bounce ideas. As I say, I’ve got quite a bit of experience from leading the development offering at an agency and building a bespoke A/B testing IDE.
I'm not clear why you are focusing on hashing user ids. Nor how you landed on a 50/50 split. When someone logs in, write a record of which one they got. No need for hashing. The algorithm for choosing can be any percentage. And yes, just a simple lookup table is fine.
But the work comes when analyzing the data later, so that is where you need to put in more thought. What are you measuring? How are you logging the events that will track that measurement? Those are both harder and more important questions. Based on what you are measuring, you might want only specific subsets of users to test things - why assign 50% of an entire userbase to a feature being tested that only 10% of the users touch? You need to assign a portion of that 10%, not a global 50%.
Basically, you need to start with the end data you intend to act on and work backwards, not start with a hashing algorithm that might not even help.
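If it helps, here's a rough sketch of that record-at-login approach, assuming a SQLite table and a hypothetical uses_feature() targeting check (the schema and names are mine, purely illustrative):

    import random
    import sqlite3

    conn = sqlite3.connect("experiments.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS assignments (
            user_id       TEXT NOT NULL,
            experiment_id TEXT NOT NULL,
            variant       TEXT NOT NULL,
            assigned_at   TEXT DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (user_id, experiment_id)
        )
    """)

    def uses_feature(user_id: str) -> bool:
        """Hypothetical targeting check: only users who actually touch the
        feature (the ~10% above) should enter the experiment."""
        return True  # stand-in; replace with real eligibility logic

    def get_or_assign(user_id, experiment_id, treatment_share=0.5):
        """Return the stored variant, or roll the dice once at login and persist it."""
        if not uses_feature(user_id):
            return None  # not in the experiment at all
        row = conn.execute(
            "SELECT variant FROM assignments WHERE user_id = ? AND experiment_id = ?",
            (user_id, experiment_id),
        ).fetchone()
        if row:
            return row[0]
        variant = "treatment" if random.random() < treatment_share else "control"
        conn.execute(
            "INSERT INTO assignments (user_id, experiment_id, variant) VALUES (?, ?, ?)",
            (user_id, experiment_id, variant),
        )
        conn.commit()
        return variant

No hashing required, and treatment_share can be any fraction of the eligible users.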
Thank you for helping!
> I'm not clear why you are focusing on hashing user ids. Nor how you landed on a 50/50 split
I landed on hashing and splitting from my research on building A/B tools, but none of that research was targeted towards building real, enterprise products (which is why I asked the question here). From your reply, I take it that this isn't as important as I read about earlier?
> When someone logs in, write a record of which one they got.
I'm confused about what you mean by "which one they got". How do I know which version to assign them? This is what I assumed hashing would solve - we'd have a reliable way to "choose" a version for any given user.
> why assign 50% of an entire userbase to a feature being tested that only 10% of the users touch?
This makes sense, I'm not sure why I had landed on 50%. So the percentage difference does not matter? I had assumed that we need a way to enforce a certain percentage split - how do I prevent a "feature" from only reaching 0.01% of the userbase, whereas the other feature reaches 99.99%?
Thanks again for your reply, you've been really helpful.
There's a pragprog book currently in beta on A/B testing: https://pragprog.com/titles/abtestprac/next-level-a-b-testin...
Thanks, will look into it
If Hash(attribute) % buckets == 0
That's basically all you need for this. The hash algorithm doesn't matter as long as it's fast and returns a number.
For your example it's "hash(user_id) % 2" because a 50/50 split has two buckets.
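A quick sketch of that in Python, generalized to 100 buckets so you can enforce splits other than 50/50 (this is also how you'd handle the earlier 0.01%/99.99% worry); the function names and percentages are just examples:

    import hashlib

    def bucket(attribute: str, buckets: int = 100) -> int:
        """Hash(attribute) % buckets. One caveat: Python's built-in hash() is
        randomized per process for strings, so use a stable hash such as md5
        if assignments must be reproducible across servers and restarts."""
        return int(hashlib.md5(attribute.encode("utf-8")).hexdigest(), 16) % buckets

    def variant_for(user_id: str, treatment_pct: int = 50) -> str:
        """With 100 buckets you can enforce any whole-percent split."""
        return "A" if bucket(user_id) < treatment_pct else "B"

    print(variant_for("user-42"))                    # 50/50 split
    print(variant_for("user-42", treatment_pct=10))  # 10/90 split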
I had thought about this but didn't know if it was reliable for large-scale applications. Thank you!
https://www.manning.com/books/experimentation-for-engineers
Thanks!
Can't you just concat a random number to their hash at the time of hashing? Like userId-1 to userId-100. Doesn't matter what hashing algorithm you use then. Then just divide the range by the number of buckets: 1–50 goes to A, 51–100 goes to B. The larger number just means you can split them into more buckets later if needed.
I’ve always wanted to build a self-optimizing system based on multi-armed bandit algorithms. Split testing tools are all the rage, but imagine using GenAI to come up with and automatically test variations of <form> submissions.
The ability to constantly adapt and backtest as your audience changes (because source traffic is dynamic, more akin to a river than something static you can test once and be done with) always sounded powerful.
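Not the GenAI part, but the bandit loop itself is small. A toy Thompson-sampling sketch (one of the standard bandit algorithms), with made-up variant names and conversion counts:

    import random

    # Observed conversions / non-conversions per form variant (made-up numbers).
    stats = {
        "form_v1": {"success": 120, "failure": 880},
        "form_v2": {"success": 145, "failure": 855},
        "form_v3": {"success": 90,  "failure": 610},
    }

    def choose_variant() -> str:
        """Thompson sampling: draw from each variant's Beta posterior and serve
        the highest draw. Strong variants get most of the traffic, weak ones
        still get an occasional look, and the allocation keeps adapting as
        traffic shifts."""
        draws = {name: random.betavariate(s["success"] + 1, s["failure"] + 1)
                 for name, s in stats.items()}
        return max(draws, key=draws.get)

    def record_outcome(variant: str, converted: bool) -> None:
        stats[variant]["success" if converted else "failure"] += 1

    served = choose_variant()
    record_outcome(served, converted=random.random() < 0.13)  # simulated feedback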
I think this is what you need:
https://github.com/facebookarchive/planout
It’s just the bones of a factorial design framework for online experiments.
It’s simple enough to copy/paste and roll yourself. Many have translated it into their preferred language.
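If anyone does roll it themselves, the core trick is just deterministic per-parameter hashing. A rough sketch of that idea below (not PlanOut's actual API; the parameter names and choices are invented), showing how independently salted parameters give you a factorial design:

    import hashlib

    def hash_to_unit(salt: str, unit: str) -> float:
        """Deterministic pseudo-random float in [0, 1) for (salt, unit)."""
        digest = hashlib.sha1(f"{salt}.{unit}".encode("utf-8")).hexdigest()
        return int(digest[:15], 16) / 16**15

    def uniform_choice(choices, salt: str, unit: str):
        """Pick one of `choices`, deterministically per (salt, unit)."""
        index = int(hash_to_unit(salt, unit) * len(choices))
        return choices[min(index, len(choices) - 1)]

    def assign(user_id: str) -> dict:
        """Two independently randomized parameters -> a 2x2 factorial design."""
        return {
            "button_color": uniform_choice(["blue", "green"], "button_color", user_id),
            "button_text":  uniform_choice(["Sign up", "Join now"], "button_text", user_id),
        }

    print(assign("user-42"))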
This looks great, thank you!
Hey there, have you tried putting this through AI to help you get started?
I can see you going from idea to prototype with 3 or 4 agents to help you: system design, data design, tooling, deployment.