
Setting Your Hypothesis Test Up For Success

Setting up your hypothesis test for success as a data scientist is critical. I want to go deep with you on exactly how I work with stakeholders ahead of launching a test.  This step is crucial to make sure that once a test is done running, we'll actually be able to analyze it.  This includes:

  • A well defined hypothesis

  • A solid test design

  • Knowing your sample size

  • Understanding potential conflicts

  • Population criteria (who are we testing)

  • Test duration (it's like the cousin of sample size)

  • Success metrics

  • Decisions that will be made based on results

This is obviously a lot of information.  Before we jump in, here is how I keep it all organized: I recently created a Google Doc at work so that stakeholders and analytics could align on all the information needed to fully scope a test upfront.  This also gives you (the analyst/data scientist) a bit of an insurance policy.  It's possible the business decides to go with a design or a sample size that wasn't your recommendation.  If things end up working out less than stellar (not enough data, a design that is basically impossible to analyze), you have your original suggestions documented.

In my previous article I wrote:

"Sit down with marketing and other stakeholders before the launch of the A/B test to understand the business implications, what they’re hoping to learn, who they’re testing, and how they’re testing.  In my experience, everyone is set up for success when you’re viewed as a thought partner in helping to construct the test design, and have agreed upon the scope of the analysis ahead of launch."

Well, this is literally what I'm talking about.  This document was born of things we often see in industry:

Hypothesis: I've seen scenarios that look like "we're going to make this change, and then we'd like you to read out on the results."  So, your hypothesis is what?  You're going to make this change, and what do you expect to happen?  Why are we doing this?  A hypothesis clearly states the change that is being made, the impact you expect it to have, and why you think it will have that impact.  It's not an open-ended statement.  You are testing a measurable response to a change.  It's OK to be a stickler; this is your foundation.

Test Design: The test design needs to be solid, so you'll want to understand exactly what change is being made between test and control.  If you're approached by a stakeholder with a design that won't allow you to accurately measure the success criteria, you'll want to coach them on how they could design the test more effectively to read out on the results.  I cover test design a bit in my article here.

Sample Size: You need to understand the sample size and your expected effect size ahead of launch.  If you run with a small sample and need an unreasonable effect size for the result to be significant, the test is most likely not worth running; time to rethink your sample and your design.  Sarah Nooravi recently wrote a great article on determining sample size for a test.  You can find Sarah's article here.

  • An example might be that you want to test the effect of offering a service credit to select customers.  You have a certain budget worth of credits you're allowed to give out, so you're hoping you can have 1,500 in test and 1,500 in control (this is small).  The test experience sees the service along with a coupon, and the control experience sees content advertising the service but no mention of the credit.  If the average purchase rate is 13.3%, you would need roughly a 2.6 point increase (to 15.9%) in the test group over the control to see significance at 0.95 confidence.  That is a large effect size that we probably won't achieve (unless the credit is AMAZING).  It's good to know these things upfront so that you can make changes (for instance, reduce the amount of the credit to allow for additional sample size, ask for extra budget, etc.).  A quick way to sanity-check this in R is sketched below.
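Here's a minimal sketch of that check using base R's power.prop.test.  The 13.3% baseline and the 1,500-per-group budget are the hypothetical numbers from the example above; the 80% power target is an assumption you'd agree on with stakeholders.

```r
# Sample size needed per group to detect a lift from 13.3% to 15.9%
# at alpha = 0.05 with 80% power (the power level is an assumed choice).
power.prop.test(p1 = 0.133, p2 = 0.159, sig.level = 0.05, power = 0.80)

# Flip it around: with only 1,500 customers per group, what lift would we
# need to be able to detect? Supply n and leave p2 out so it gets solved for.
power.prop.test(n = 1500, p1 = 0.133, sig.level = 0.05, power = 0.80)
```

Either call prints the missing quantity, which gives you a quick gut check before anyone commits budget.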

Potential Conflicts: It's possible that two different groups in your organization could be running tests at the same time that conflict with each other, resulting in data that is junk for potentially both tests.  (I actually used to run a "testing governance" meeting at my previous job to proactively identify these cases; this might be something you want to consider.)

  • An example of a conflict might be that the acquisition team is running an ad in Google advertising 500 business cards for $10.  If, at the same time, another team is running a pricing test on the business card product page that doesn't respect the ad driving that traffic, the acquisition team's test is not delivering the experience they thought it was.  Customers will see a different price than what was advertised, and this has negative implications all around.

  • It is so important in a large analytics organization to be collaborating across teams and have an understanding of the tests in flight and how they could impact your test.

Population criteria: Obviously you want to target the correct people.  But often I've seen criteria so specific that the results of the test need to be caveated with "These results are not representative of our customer base; this effect is for people who [[lists criteria here]]."  If your test targeted super performers, you know that it doesn't apply to everyone in the base, but you want to make sure that is spelled out and doesn't get miscommunicated to a broader audience.

Test duration: This is often directly related to sample size. (see Sarah's article) You'll want to estimate how long you'll need to run the test to achieve the required sample size.  Maybe you're randomly sampling from the base and already have sufficient population to choose from.  But often we're testing an experience for new customers, or we're testing a change on the website and we need to wait for traffic to visit the site and view the change.  If it's going to take 6 months of running to get the required sample size, you probably want to rethink your population criteria or what you're testing.  And better to know that upfront.
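Estimating duration is usually just arithmetic once you have the sample size.  Here's a rough sketch in R, with placeholder numbers (plug in your own power-calculation output and traffic estimates):

```r
# Rough test-duration estimate: total required sample divided by the number
# of eligible customers (or visitors) who qualify each week.
required_n_per_group <- 2900   # placeholder; take this from your power calculation
groups               <- 2      # test + control
eligible_per_week    <- 1200   # placeholder traffic estimate

weeks_needed <- ceiling(required_n_per_group * groups / eligible_per_week)
weeks_needed
```

If the answer comes back measured in months rather than weeks, that's your cue to revisit the population criteria or the design before launch.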

Success Metrics: This is an important one to talk through.  If you've been running tests previously, I'm sure you've had stakeholders ask you for the kitchen sink in terms of analysis.  If your hypothesis is that a message about a new feature on the website will drive people to go see that feature, it is reasonable to check how many people visited that page and whether or not people downloaded/used that feature.  A change like this is probably too benign to cause cancellations or affect upsell/cross-sell metrics, so make sure you're clear about what the analysis will and will not include.  And try not to make a mountain out of a molehill unless you're testing something that is a dramatic change and has large implications for the business.
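When the success metric is agreed upfront, the read-out itself can stay small.  For example, here's a minimal sketch in R of comparing the feature-page visit rate between test and control; the counts below are made up for illustration:

```r
# Hypothetical counts: how many people in each group visited the feature page.
visited <- c(test = 310, control = 245)
totals  <- c(test = 1500, control = 1500)

# Two-sample test of equal proportions for the agreed-upon success metric.
prop.test(visited, totals)
```

Anything beyond the metrics you scoped together (cancellations, upsell, cross-sell) stays out of the read-out unless you agreed otherwise.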

Decisions! Getting agreement ahead of time on what decisions will be made based on the results of the test is imperative.  Have you ever been in a situation where the business tests something, it's not significant, and then they roll it out anyway?  Well, then it really didn't need to be a test; they could have just rolled it out.  There are endless opportunities for tests that will guide the direction of the business, so don't get caught up in a test that isn't actually a test.

Conclusion: Of course, each of these areas could have been explained in much more depth.  But the main point is that there are a number of items you want to discuss before a test launches.  Especially if you're on the hook for doing the analysis, you want the complete picture and context so that you can analyze the test appropriately.  I hope this helps you be more collaborative with your business partners and more "proactive" rather than "reactive".

No one has any fun when you run a test and then later find out it should have been scoped differently.  Adding a little extra work and clarification upfront can save you some heartache later on.  Consider creating a document like the one I have pictured above for scoping your future tests, and you'll have a full understanding of the goals and implications of your next test ahead of launch. :)


What I Enjoyed Most at ODSC East 2018

Last week I had the opportunity to attend the Open Data Science Conference (ODSC) in Boston.  It was awesome to see people just walking around whom I had previously read about or follow on Twitter.  It was even nicer to meet some of these people, and I was amazed at how friendly everyone was.

Of course you can't attend everything at a conference like this; at one point there were 11 different sessions going on at once.  It was really difficult to determine which sessions to attend given the number of great options, but I tried to align the information I'd be consuming closely with what I'd be able to bring back to my day job and implement.

In this article I'll cover some learnings and favorite moments from:

  • one of the trainings
  • a couple different workshops
  • the sweet conference swag
  • one of the keynotes

Trainings: My original plan was to take an R training on Tuesday morning and a Python training that afternoon.  What really happened was that the morning R training left me feeling so jazzed about R that I ended up going to another R training in the afternoon (instead of the Python training I had originally planned on).  The morning R training I took was "Getting to grips with the tidyverse (R)", given by Dr. Colin Gillespie.  This was perfect, because I had been struggling with dplyr (an R package) the night before, and this training walked through parts of dplyr with great explanations along the way.  Colin also showed us how to create plots using the "plotly" package.  This was my first time creating an interactive graph in R: easy to use, and super cool.  He was also nice enough to take a look at the code I was currently working on, which I definitely appreciated.
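For anyone curious what that kind of workflow looks like, here's a minimal sketch of a dplyr summary piped into a plotly chart.  The customers data frame and its columns are hypothetical, not from the training materials.

```r
library(dplyr)
library(plotly)

# Hypothetical data: one row per customer with a segment label and a purchase flag.
summary_df <- customers %>%
  group_by(segment) %>%
  summarise(purchase_rate = mean(purchased), n = n())

# A simple interactive bar chart: hover to see the exact rate and sample size.
plot_ly(summary_df,
        x = ~segment,
        y = ~purchase_rate,
        type = "bar",
        text = ~paste("n =", n))
```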

The afternoon R training I attended was given by Jared Lander, entitled "Intermediate RMarkdown in Shiny".  It was my first introduction to Shiny.  I had heard about it but had never ventured to use it, and now I don't know what I was waiting for.  If you ever have the opportunity to hear Jared speak, take it; I found him incredibly entertaining, and he explained the material clearly, making it super accessible.  I like to think Jared also enjoyed my overly animated crowd participation.
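If, like me, you'd never touched Shiny before, the canonical starting point is tiny.  This isn't from Jared's materials, just the standard minimal app structure:

```r
library(shiny)

# A bare-bones Shiny app: one slider input and one reactive plot output.
ui <- fluidPage(
  sliderInput("n", "Number of points:", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Random draws")
  })
}

shinyApp(ui = ui, server = server)
```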
Workshops:

On Thursday I attended "Uplift Modeling and Uplift Prescriptive Analytics: Introduction and Advanced Topics" by Victor Lo, PhD.  This information really resonated with me.  Dr. Lo spoke about the common scenario in data science where you build a model to try to predict something like customer attrition.  You might take the bottom three deciles (the people with the highest probability of cancelling their subscription) and run an A/B test with some treatment to try to encourage those customers to stay.

In the end, during analysis, you'd find that you did not have a statistically significant lift in test over control with the usual methods.  You end up in a situation where the marketers would be saying "hey, this model doesn't work" and the data scientist would be saying "what? It's a highly predictive model".  It's just that this is not the way that you should be going about trying to determine the uplift.  Dr. Lo spoke about 3 different methods and showed their results.  

These included:

  • Two Model Approach
  • Treatment Dummy Approach
  • Four Quadrant Method
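The first approach on that list, the Two Model approach, is the easiest to sketch: fit one response model on the treated group and one on the control group, and score everyone on the difference.  This is my own minimal illustration (with hypothetical train and customers data frames), not Dr. Lo's implementation:

```r
# Two Model approach to uplift, sketched with logistic regression.
# `train` has a treated flag (1/0), a converted outcome (1/0), and features x1, x2.
m_treat   <- glm(converted ~ x1 + x2, family = binomial,
                 data = subset(train, treated == 1))
m_control <- glm(converted ~ x1 + x2, family = binomial,
                 data = subset(train, treated == 0))

# Predicted uplift = P(convert | treated) - P(convert | untreated),
# scored on the full customer base.
uplift <- predict(m_treat,   newdata = customers, type = "response") -
          predict(m_control, newdata = customers, type = "response")

# Target the customers with the largest predicted uplift, rather than
# simply the ones most likely to churn.
head(order(uplift, decreasing = TRUE), 10)
```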

Here is the link to his ODSC slides from 2015 where he also covered these 3 models (with similar slides): here 

I've experienced this scenario before myself, where the marketing team will ask for a model and want to approach testing this way.  I'm super excited to use these methods to determine uplift in the near future.

Another workshop I attended was "R Packages as Collaboration Tools" by Stephanie Kirmer (slides).  Stephanie spoke about creating R packages as a way to automate repeated tasks.  She also showed us how incredibly easy it is to take your code and turn it into an R package for internal use.  This is another case that is directly applicable at my work.  I don't have reports due on a regular cadence, but we could certainly automate part of the test analysis process, and there are ongoing requests of Analytics in our organization that could be automated.  Test analysis is done in a different department, but if automated, this would save time on analysis, reduce the potential for human error, and free up bandwidth for higher-value work.
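As a rough illustration of Stephanie's point, here's what the scaffolding for a small internal package might look like using usethis and devtools.  The package and function names below are made up:

```r
# Scaffold an internal package for repeated test-analysis tasks.
# install.packages(c("usethis", "devtools"))
usethis::create_package("testtools")   # hypothetical package name
usethis::use_r("readout")              # creates R/readout.R

# In R/readout.R -- a documented helper that other analysts can reuse:

#' Two-sample read-out of a conversion metric
#'
#' @param successes integer vector of conversions, c(test, control)
#' @param totals    integer vector of group sizes,  c(test, control)
#' @export
readout <- function(successes, totals) {
  prop.test(successes, totals)
}

# Then document and install it for the team:
# devtools::document()
# devtools::install()
```

Once that exists, the repeated parts of a test read-out become a one-line function call instead of copy-pasted scripts.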

SWAG: Although conference swag probably doesn't really need a place in this article, Figure Eight gave out a really cool little vacuum that said "CLEAN YOUR DATA", so I thought I'd share a picture with you.  Also, my daughter loved the DataRobot stickers and little wooden robots they gave out.  She fashioned the sticker around her wrist and wore it as a bracelet.  Three-year-olds love conference swag.

Keynote: The keynote was Thursday morning.  I LOVED the talk given by Cathy O'Neil; a link to her TED talk is here.  She spoke about the importance of ethics in data science, and how algorithms have to use historical data and will therefore perpetuate our current social biases.  I love a woman who is direct, cares about ethics, and has some hustle.  Go get 'em, girl.  I made sure to get a chance to tell her how awesome her keynote was afterwards.  And of course I went home and bought her book "Weapons of Math Destruction".  I fully support awesome.

Summary: I had an incredible time at ODSC.  Everyone was so friendly, my questions were met with patience, and it was clear that many attendees and speakers had a true desire to help others learn.  I could feel the sense of community.  I highly suggest that if you ever get the opportunity to attend, go!  I'm returning to work with a ton of new information that I can begin using immediately at my current job; it was a valuable experience.  I hope to see you there next year.

For more on data science, visit www.datamovesme.com.