12/30/19 12/30/19

Hands-on SQL Practice For A Data Science Interview

I bet you're searching the internet for a fantastic reference to help you get ready for your SQL interview. You've landed in the right place.

Let's jump right in and get started. I have a SQL browser available for you to use here: Show me the data

The data is fictitious, but extensive and useful for writing queries.

You can use this browser to answer all of the questions below. When you're done, if you weren't able to answer a couple of the questions, you can give me your email address in the email form near the bottom of this article and I'll send you the query solutions.

This article is designed to help you with "white boarding" SQL questions. We're not going to cover any theory here. Just a bunch of questions and how to answer them leveraging SQL.

I wish you a ton of luck on your interview, I hope it results in an offer! If you're looking to further your SQL skills for data science, I have also created the ultimate course in SQL for data science. We cover tons of material that you won't see here, because feature engineering, handling NULLs, working with datetimes, etc., is not typically part of the interview. But if you want to really hit the ground running at your new job, I'd highly suggest this course. It's free and you can find it here

SQL Questions using just the select statement:

Write a query to determine the number of rows in the customer table. Answer: 351,962
What was the maximum commission paid in the customer table? The median? Answer: Max -$10,295, Average -$66.30
Write a query to that returns the customer_id, business_type and Country from the customer table.

SQL Questions using a where statement:

How many customers do we have "has_instagram" information for in the customer table? i.e. - How many rows are not NULL? Answer: 128,449
How many customers have a "First_conversion_date" greater than 1/1/2016 in the customer table? Answer: 54,397

SQL Question using a group by statement:

How many customers have "has_facebook" =1 in the customer table? Use a group by statement. Answer: 60,894

SQL Question using a group by and order statement:

Which state has the most customers? How many customers live in that state in the customer table? Answer: California, 43,736 customers

SQL Question using a having statement and subquery:

Using the billedservices table, how many customers had more than 1 billed service? Use a subquery to answer this question. Answer: 44

SQL Question requiring a join:

How many customers from OUTSIDE the United States have an entry in the billed services table? Answer: 89

In an interview, they'll typically place two or three pieces of paper up on the whiteboard. This will have your data. Obviously, this means that the data you'll be working with is much smaller.

Take home tests are typically much more difficult than what was covered here. During an in-person interview, they'll typically only have 30 minutes to an hour to assess your SQL knowledge. Most often, they just want to know that if you have SQL listed on your resume, that you can write some simple queries like up above.

I've never personally been asked to whiteboard the solution for creating a table, updating a column, etc., but obviously any SQL questions are fair game.

If you had no difficulty answering these questions, you're likely to do fine on your SQL interview.

Want to further your SQL skills for data science? Check out the Ultimate SQL for Data Science course.

Looking for the solutions to the questions above? I'll send them directly to your inbox :)

7/3/18 7/3/18

Beginning the Data Science Pipeline - Meetings

I spoke in a Webinar recently about how to get into Data Science. One of the questions asked was "What does a typical day look like?" I think there is a big opportunity to explain what really happens before any machine learning takes place for a large project. I've previously written about thinking creatively for feature engineering, but there is even more to getting ready for a data science project, you need to get buy in on the project from other areas of the business to ensure you're delivery insights that the business wants and needs.It may be that the business has a high priority problem for you to solve, but often you'll identify projects with a high ROI and want to show others the value you could provide if you were given the opportunity to work on the project you've come up with.The road to getting to the machine learning algorithm looks something like:

Plenty of meetings
Data gathering (often from multiple sources)
Exploratory data analysis
Feature engineering
Researching the best methodology (if it's not standard)
Machine learning

We're literally going to cover the 1st bullet here in this article. There are a ton of meetings that take place before I ever write a line of SQL for a big project. If you read enough comments/blogs about Data Science, you'll see people say it's 90% data aggregation and 10% modeling (or some other similar split), but that's also not quite the whole picture. I'd love for you to fully understand what you're signing up for when you become a data scientist.

Meetings: As I mentioned, the first step is really getting buy in on your project. It's important that as an Analytics department, we're working to solve the needs of the business. We want to help the rest of the business understand the value that a project could deliver, through pitching the idea in meetings with these stakeholders. Just to be clear, I'm also not a one woman show. My boss takes the opportunity to talk about what we could potentially learn and action on with this project whenever he gets the chance (in additional meetings). After meetings at all different levels with all sorts of stakeholders, we might now have agreement that this project should move forward.

More Meetings: At this point I'm not just diving right into SQL. There may be members of my team who have ideas for data that I'm not aware of that might be relevant. Other areas of the business can also help give inputs into what variables might be relevant (they don't know they database, but they have the business context, and this project is supposed to SUPPORT their work).There is potentially a ton of data living somewhere that has yet to be analyzed, the databases of a typical organization are quite large, unless you've been at a company for years, there is most likely useful data that you are not aware of.

The first step was meeting with my team to discuss every piece of data that we could think of that might be relevant. Thinking of things like:

If something might be a proxy for customers who are more "tech savvy". Maybe this is having a business email address as opposed to a gmail address (or any non-business email address), or maybe customers who utilize more advanced features of our product are the ones we'd consider tech savvy. It all depends on context and could be answered in multiple ways. It's an art.
Census data could tell us if a customers zip code is in a rural or urban area? Urban or rural customers might have different needs and behave differently, maybe the extra work to aggregate by rural/urban isn't necessary for this particular project. Bouncing ideas off other and including your teammates and stakeholders will directly impact your effectiveness.
What is available in the BigData environment? In the Data Warehouse? Other data sources within the company. When you really look to list everything, you find that this can be a large undertaking and you'll want the feedback from others.

After we have a list of potential data to find, then the meetings start to help track all that data down. You certainly don't want to reinvent the wheel here. No one gets brownie points for writing all of the SQL themselves when it would have taken you half the time if you leveraged previously written queries from teammates. If I know of a project where someone had already created a few cool features, I email them and ask for their code, we're a team. For a previous project I worked on, there were 6 different people outside of my team that I needed to connect with who knew these tables or data sources better than members of my team. So it's time to ask those other people about those tables, and that means scheduling more meetings.

Summary: I honestly enjoy this process, it's an opportunity to learn about the data we have, work with others, and think of cool opportunities for feature engineering. The mental picture is often painted of data scientists sitting in a corner by themselves, for months, and then coming back with a model. But by getting buy in, collaborating with other teams, and your team members, you can keep stakeholders informed through the process and feel confident that you'll deliver what they're hoping. You can be a thought partner that is proactively delivering solutions.