Beginning the Data Science Pipeline - Meetings
I spoke in a webinar recently about how to get into Data Science. One of the questions asked was "What does a typical day look like?" I think there is a big opportunity to explain what really happens before any machine learning takes place for a large project. I've previously written about thinking creatively for feature engineering, but there is even more to getting ready for a data science project: you need to get buy-in on the project from other areas of the business to ensure you're delivering insights that the business wants and needs.
It may be that the business has a high-priority problem for you to solve, but often you'll identify projects with a high ROI and want to show others the value you could provide if you were given the opportunity to work on the project you've come up with.
The road to getting to the machine learning algorithm looks something like:
Plenty of meetings
Data gathering (often from multiple sources)
Exploratory data analysis
Feature engineering
Researching the best methodology (if it's not standard)
Machine learning
We're literally going to cover the 1st bullet here in this article. There are a ton of meetings that take place before I ever write a line of SQL for a big project. If you read enough comments/blogs about Data Science, you'll see people say it's 90% data aggregation and 10% modeling (or some other similar split), but that's also not quite the whole picture. I'd love for you to fully understand what you're signing up for when you become a data scientist.
Meetings: As I mentioned, the first step is really getting buy-in on your project. It's important that, as an Analytics department, we're working to solve the needs of the business. We want to help the rest of the business understand the value that a project could deliver by pitching the idea in meetings with these stakeholders. Just to be clear, I'm also not a one-woman show. My boss takes the opportunity to talk about what we could potentially learn and act on with this project whenever he gets the chance (in additional meetings). After meetings at all different levels with all sorts of stakeholders, we might now have agreement that this project should move forward.
More Meetings: At this point I'm not just diving right into SQL. There may be members of my team who have ideas for data that I'm not aware of that might be relevant. Other areas of the business can also help give input into what variables might be relevant (they don't know the database, but they have the business context, and this project is supposed to SUPPORT their work).
There is potentially a ton of data living somewhere that has yet to be analyzed. The databases of a typical organization are quite large, and unless you've been at a company for years, there is most likely useful data that you are not aware of.
The first step was meeting with my team to discuss every piece of data that we could think of that might be relevant. Thinking of things like:
Whether something might be a proxy for customers who are more "tech savvy". Maybe this is having a business email address as opposed to a Gmail address (or any non-business email address), or maybe customers who utilize more advanced features of our product are the ones we'd consider tech savvy. It all depends on context and could be answered in multiple ways. It's an art.
Census data could tell us whether a customer's zip code is in a rural or urban area. Urban and rural customers might have different needs and behave differently, or maybe the extra work to aggregate by rural/urban isn't necessary for this particular project. Bouncing ideas off others and including your teammates and stakeholders will directly impact your effectiveness.
What is available in the BigData environment? In the Data Warehouse? In other data sources within the company? When you really look to list everything, you find that this can be a large undertaking and you'll want feedback from others.
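To make the "tech savvy" proxy idea a little more concrete, here is a minimal Python sketch of the business-email flag. Everything in it is an assumption for illustration: the `PERSONAL_EMAIL_DOMAINS` list, the `has_business_email` helper, and the sample customers are hypothetical, and a real project would settle on its own definition in those team meetings.

```python
# Hypothetical list of personal email providers; extend it for your own data.
PERSONAL_EMAIL_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}


def has_business_email(email: str) -> bool:
    """Return True when the address is not on a known personal-email domain."""
    _local, separator, domain = email.strip().lower().rpartition("@")
    if not separator or not domain:
        # Malformed address: we cannot call it a business email.
        return False
    return domain not in PERSONAL_EMAIL_DOMAINS


# Made-up customers, purely for illustration.
customers = [
    {"customer_id": 1, "email": "pat@gmail.com"},
    {"customer_id": 2, "email": "lee@acme-widgets.com"},
]
for customer in customers:
    customer["business_email"] = has_business_email(customer["email"])
```

A flag like this is only one candidate feature; whether it really tracks "tech savvy" is exactly the kind of question to bounce off teammates.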
After we have a list of potential data to find, the meetings start to help track all that data down. You certainly don't want to reinvent the wheel here. No one gets brownie points for writing all of the SQL themselves when it would have taken half the time to leverage previously written queries from teammates. If I know of a project where someone has already created a few cool features, I email them and ask for their code. We're a team. For a previous project I worked on, there were 6 different people outside of my team that I needed to connect with who knew these tables or data sources better than members of my team. So it's time to ask those other people about those tables, and that means scheduling more meetings.
Summary: I honestly enjoy this process. It's an opportunity to learn about the data we have, work with others, and think of cool opportunities for feature engineering. The mental picture often painted is of data scientists sitting in a corner by themselves for months and then coming back with a model. But by getting buy-in and collaborating with other teams and your own team members, you can keep stakeholders informed through the process and feel confident that you'll deliver what they're hoping for. You can be a thought partner who is proactively delivering solutions.
Designing and Learning With A/B Testing
I've spent the last 6 years of my life heavily involved in A/B testing and other testing methodologies, whether it was the performance of an email campaign to drive health outcomes, product changes, or website changes; the example list goes on. A few of these tests have been full-factorial MVT tests (my fave). I wanted to share some testing best practices and examples in marketing, so that you can feel confident about how you're designing and thinking about A/B testing.
As a Data Scientist, you may be expected to be the subject matter expert on how to test correctly. Or it may be that you've just built a product recommendation engine (or some other model), and you want to see how much better you're performing compared to the previously used model or business logic, so you'll test the new model vs. whatever is currently in production.
There is SO MUCH more to the world of testing than is contained here, but what I'm looking to cover is:
Determining test and control populations
Scoping the test ahead of launch
A test design that will allow us to read the results we’re hoping to measure
Test Analysis
Thoughts on automating test analysis
Choosing Test and Control Populations
This is where the magic starts. The only way to determine a causal relationship is by having randomized populations (and a correct test design). So it's imperative that our populations are drawn correctly if we want to learn anything from our A/B test. In general, the population you want to target will be specific to what you're testing. If this is a site test for an Ecommerce company, you hope that visitors are randomized to test and control upon visiting the website. If you're running an email campaign or some other type of test, then you'll pull all of the relevant customers/people from a database or BigData environment who meet the criteria for being involved in your A/B test. If this is a large list you'll probably want to take a random sample of customers over some time period. This is called a simple random sample. A simple random sample is a subset of your population, where every member had an equal probability of being chosen to be in the sample.
Here is a great example on how to pull a random sample from Hive: here
Also, just to be clear, writing a "select top 1000 * from table" in SQL is NOT A RANDOM SAMPLE. There are a couple different ways to get a random sample in SQL, but how to do it will depend on the "flavor" of SQL you're using.
Here is an example pulling a random sample in SQL Server: here
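If your list of eligible customers is already in memory, the same idea can be sketched in Python. This is a minimal illustration, not the Hive or SQL Server approach referenced above: the `eligible_customer_ids` list is made up, and the seed value is arbitrary.

```python
import random

# Hypothetical list of eligible customer IDs pulled from the database.
eligible_customer_ids = list(range(1, 10_001))

# A fixed seed makes the draw reproducible, so the sample can be audited later.
rng = random.Random(42)

# Simple random sample: every customer has the same chance of being chosen,
# and no customer is chosen twice.
sample_ids = rng.sample(eligible_customer_ids, k=1_000)

# For contrast, taking the first 1,000 rows is NOT random; it just reflects
# whatever order the table happened to be stored or returned in.
first_rows = eligible_customer_ids[:1_000]
```

Keeping the seed alongside the analysis means anyone on the team can regenerate exactly the same sample.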
Now that you have your sample, you'll randomly assign these people to test and control groups.
There are times when we'll need to be a little more sophisticated. Let's say that the marketing team wants to learn about the ability to drive engagement by industry (and that you have industry data). Some of the industries are probably going to contain fewer members than others, meaning that if you just split a portion of your population into two groups, you might not have a high enough sample size in certain industries that you care about to determine statistical significance.
Rather than putting in all the effort of running the A/B test only to find out that you can't learn about an industry you care about, use stratified sampling (this would involve doing a simple random sample within each group of interest).
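Stratified sampling followed by random assignment could look something like the sketch below. The `stratified_assignment` function, the `industry` field, and the customer counts are all hypothetical; the point is only that each industry gets its own simple random sample and its own even split into test and control.

```python
import random
from collections import defaultdict


def stratified_assignment(customers, strata_key, per_stratum, seed=0):
    """Sample up to `per_stratum` customers within each stratum, then split
    each stratum's sample evenly into test and control."""
    rng = random.Random(seed)

    # Group customers by the stratum we care about (for example, industry).
    by_stratum = defaultdict(list)
    for customer in customers:
        by_stratum[customer[strata_key]].append(customer)

    assignments = {}
    for members in by_stratum.values():
        # Never ask for more customers than the stratum contains.
        k = min(per_stratum, len(members))
        # sample() returns the selection in random order, so slicing it
        # in half is itself a random split.
        sampled = rng.sample(members, k)
        half = k // 2
        for customer in sampled[:half]:
            assignments[customer["customer_id"]] = "test"
        for customer in sampled[half:]:
            assignments[customer["customer_id"]] = "control"
    return assignments


# Made-up population: one large industry and one small one.
customers = (
    [{"customer_id": i, "industry": "retail"} for i in range(1000)]
    + [{"customer_id": 1000 + i, "industry": "healthcare"} for i in range(60)]
)
assignments = stratified_assignment(customers, "industry", per_stratum=50)
```

With a plain random sample of 100 from this population, healthcare would typically contribute only a handful of customers; sampling within each industry guarantees 50 from each.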
Scoping Ahead of Launch
I've seen in practice that when the marketing team doesn't see the results they want, they say "We're going to let this A/B test run for two more weeks to see what happens". Especially for site tests, if you run anything long enough, tiny effect sizes can become statistically significant. Before you launch, you should have an idea of how much traffic you're getting to the particular webpage and how long the A/B test should run. Otherwise, what is to stop us from just running the A/B test until we get the result that we want?
Sit down with marketing and other stakeholders before the launch of the A/B test to understand the business implications, what they're hoping to learn, who they're testing, and how they're testing. In my experience, everyone is set up for success when you're viewed as a thought partner in helping to construct the test design, and have agreed upon the scope of the analysis ahead of launch.
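One way to put a number on "how long should this run" before launch is a standard sample size calculation for comparing two conversion rates. The sketch below uses the usual normal-approximation formula; the baseline rate, the lift you hope to detect, the 5% significance level, the 80% power, and the daily traffic figure are all assumptions you would replace with values agreed on with your stakeholders.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH group to detect the difference
    between two conversion rates with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def days_to_run(n_per_group, daily_visitors, n_groups=2):
    """Days of traffic needed if all daily visitors are split across groups."""
    return math.ceil(n_per_group * n_groups / daily_visitors)


# Hypothetical scenario: 10% baseline conversion, hoping to detect a move to 12%,
# with 1,000 visitors per day reaching the page.
n = sample_size_per_group(baseline_rate=0.10, expected_rate=0.12)
days = days_to_run(n, daily_visitors=1_000)
```

Agreeing on `n` and `days` up front gives everyone a stopping rule, which takes away the temptation to keep the test running until the result looks good.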
Test Design
For each cell in an A/B test, you can only make ONE change. For instance, if we have:
Cell A: $15 price point
Cell B: $25 price point
Cell C: UI change and a $30 price point
You just lost valuable information. Adding a UI change AND a different price option makes it impossible to parse out what effect was due to the UI change or the $30 price point. We'll only know how that cell performed in aggregate.
Iterative A/B testing is when you take the winner from one test and make it the control for a subsequent A/B test. This method is going to result in a loss of information. What if the combination of the loser from test 1 and the winner from test 2 is actually the winner? We'd never know!
Sometimes iterating like this makes sense (maybe you don't have enough traffic for more test cells), but we'd want to talk about all potential concessions ahead of time.
Another type of test design is MVT (Multivariate). Here we'll look at a full-factorial MVT. There are more types of multivariate tests, but full-factorial is the easiest to analyze.
MVT is better for more subtle optimizations (A/B testing should be used if you think the test will have a huge impact)
Rule of thumb is at least 100,000 unique visitors per month.
You'll need to know how to use ANOVA to analyze (I will provide a follow-up article with code and explanation for how to do this analysis and link it here later)
One illustrative example of an MVT test is below. On the left are the control experiences, and on the right are the 3 test treatments. This results in 2^3 = 8 treatments, because we'll look at each possible combination of test and control.
On the left: The controls would be the current experience
On the right: Cell A could be new photography (ex: friendly waving stick figure), Cell B could reference a sale, and Cell C could show new content.
We can learn about all the interactions! Understanding the interactions and finding the optimal treatment when changing multiple items is the big benefit of MVT testing. The chart below shows you how each person would be assigned to one of the 8 treatments in this example.
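To see where the 8 treatments come from, here is a small Python sketch that enumerates every control/new combination for the three elements in this example and assigns visitors to one of them. The factor names and the `assign_treatment` helper are hypothetical, and seeding on the visitor ID is just one possible way to keep a returning visitor in the same cell.

```python
import itertools
import random

# The three elements being tested, each with a control and a new version.
factors = {
    "photography": ["control", "new"],
    "sale_message": ["control", "new"],
    "content": ["control", "new"],
}

# Full factorial: every combination of levels, 2 * 2 * 2 = 8 treatments.
treatments = [
    dict(zip(factors, levels))
    for levels in itertools.product(*factors.values())
]


def assign_treatment(visitor_id, seed="mvt-example"):
    """Map a visitor to one of the treatments with equal probability.
    Seeding on the visitor ID keeps a returning visitor in the same cell."""
    rng = random.Random(f"{seed}:{visitor_id}")
    return rng.randrange(len(treatments))


for index, treatment in enumerate(treatments):
    print(index, treatment)
```

The first treatment in the list is the all-control experience and the last is all three changes together; the six in between are what let you estimate the interactions.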
In a future article I'll write up one of my previous MVT tests that I've analyzed, with R code.
A/B Test Analysis
One of the most important parts of test analysis is to have consistency across the business in how we analyze tests. You don't want to say something had a causal effect, when if another person had analyzed the same test, they might have reached a different conclusion. In addition to having consistent ways of determining conclusions, you'll also want to have a consistent way of communicating these results with the rest of the business. For example, "Do we share results we find with a p-value greater than .05?" Maybe we do, maybe we don't, but make sure the whole team is being consistent in their communication with marketing and other teams. Confidence intervals should always be given! You don't want to say "Wow! This is worth $700k a year", when really it's worth somewhere between $100k and $1.3m. That's a big difference and could have an impact on decisions whether to roll out the change or not.
Let's Automate our A/B Test Analysis!
Why spend multiple hours analyzing each A/B test, when we can:
Automate removal of outliers
Skip calculating statistical significance when the sample is not yet large enough
Determine statistical significance of metrics with confidence intervals and engaging graphs
See how A/B tests are performing soon after launch to make sure there aren’t any bugs messing with our results or large drops in revenue.
This also reduces opportunity for error in analysis
With a couple data entries and button pushes!
This would take a while to build, and will not be a one size fits all for all of your tests. Automating even a portion could greatly reduce the amount of time spent analyzing tests!
I hope this article gave you some things to be on the lookout for when testing. If you're still in school to become a Data Scientist, taking a general statistics class that covers which statistics to use and how to calculate confidence intervals is something that will benefit you throughout your career in Data Science. Otherwise, there is certainly tons of information on the internet to give you an overview of how to calculate these statistics. I personally prefer Coursera, because it's nice to sit back and watch videos on the content, knowing that the content is from well known universities.
You can learn a ton through properly executed testing. Happy learning!
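As a parting sketch of what a small piece of that automation might look like, here is a Python function that reports the difference in conversion rate between test and control with a confidence interval, and declines to report anything when either group is still too small. It uses the normal approximation for two proportions; the `conversion_lift_summary` name, the `min_per_group` threshold, and the example counts are assumptions for illustration, not a description of any particular production tool.

```python
import math
from statistics import NormalDist


def conversion_lift_summary(test_conversions, test_n,
                            control_conversions, control_n,
                            min_per_group=1_000, confidence=0.95):
    """Difference in conversion rate (test minus control) with a
    normal-approximation confidence interval."""
    # Guard: do not report significance before the sample is large enough.
    if min(test_n, control_n) < min_per_group:
        return {"status": "sample too small", "difference": None,
                "interval": None, "significant": None}

    p_test = test_conversions / test_n
    p_control = control_conversions / control_n
    difference = p_test - p_control
    std_error = math.sqrt(p_test * (1 - p_test) / test_n
                          + p_control * (1 - p_control) / control_n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    interval = (difference - z * std_error, difference + z * std_error)

    # The result is significant at this level when the interval excludes zero.
    significant = interval[0] > 0 or interval[1] < 0
    return {"status": "ok", "difference": difference,
            "interval": interval, "significant": significant}


# Hypothetical results: 600 of 5,000 converted in test, 500 of 5,000 in control.
result = conversion_lift_summary(600, 5_000, 500, 5_000)
too_early = conversion_lift_summary(12, 100, 10, 100)
```

Reporting the whole interval, rather than just the point estimate, is what keeps a "$700k a year" headline honest about the range it could really fall in.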