SQL is for all UX researchers

TL;DR: UX researchers should learn SQL to pull their own participant lists and better understand company metrics and product questions. When learning SQL, keep in mind that you must learn not only the syntax but also the unique ways your organization has built its data infrastructure.

In all tech companies, and even most companies outside of tech these days, data is the lifeblood of the business. It’s how the company understands its product’s use and success. SQL is most often the way someone gets a pulse on what is going on with the data.


SQL can seem intimidating to most UX researchers – it’s typically the domain of data scientists, after all. In reality, SQL can be used quite simply, with great impact on a UX researcher’s work.

Personally, I had the title of “quantitative” UX researcher before I knew any SQL at all. As the sub-field grows, it’s more and more a requirement on quantitative UXR job postings. Even if quant isn’t your focus, qualitative UX researchers can also get a lot of value from learning SQL.

In this post, I’ll lay out why learning SQL is so useful, along with some critical things I learned on my own journey. This post won’t teach you SQL but will show you what to focus on. Because I knew R first, you’ll find a few R-specific tips as well.

Note: Tess Rothstein wrote another great article in 2018 with a similar set of reasons why you should learn SQL. I am writing this article to expand on the core concepts with some specific experiences I’ve had.

The value of SQL

Unless you’re a quantitative UX researcher, why should you care about SQL at all? There are a few reasons that can help you unblock your own work and better understand the big picture.

Learn about your users

Have you ever thought to yourself, “How many users are in X country?” or “How many users have used feature Y?” SQL is the way to answer these questions. UX researchers are all about reducing assumptions and uncovering facts. It’s a nice superpower to be able to get answers to these questions without sending a survey or running an interview.


SQL can help you answer the factual1 questions about product usage, user characteristics, and more.

Recruit your users

Working at Meta, I often had to spend time and resources getting a product manager to align data scientist (DS) resources to write queries for me – it could be a major project blocker. DS colleagues are busy, and I’ve found my work to be much faster when I don’t have to wait to secure DS resources to build my queries. Even qualitative researchers use similar queries to find participants for interviews or usability tests. Queries are an essential part of getting participants for UX researchers.

In practice, you can build a query from scratch, or with some DS help, and reuse that. You don’t necessarily have to make a brand new query each time, but you’d be able to run it yourself with minimal effort. You also can adjust the query in simple ways to fit each project. This is tactically the most valuable reason why SQL will improve your day-to-day work as a UX researcher.

Understand your company better

Tech businesses are metrics-driven, and UX researchers need to be focused on impacting the business. It’s easy to hear about metrics, even company north-star metrics like active users, without knowing exactly what they mean. Once you know your company’s tables and how they’re made, you’re likely to get a clearer understanding of exactly what those metrics are made of.

Knowing the details of metrics will help your work better impact them. (And this is what I’d consider the gold standard of measuring UX research impact.) Beyond impact, knowing SQL will also let you better interrogate metrics. I’ve seen UX research projects uncover why current metrics are bad for users and the company, and then create new metrics that deliver more value to the company and its users.

Hopefully it’s clear now why you should learn SQL as a UX researcher. How should you go about learning SQL?

How to learn SQL

There are two major components to learning SQL: the formal syntax (coding language) and the way your messy data exists in the wild at your organization (yes, it is all messy, even the good data). 

When I first started learning SQL at Meta, I was getting stuck and frustrated quite often. In retrospect, much of this frustration was because I didn’t know the unique ways the infrastructure worked rather than my syntax being inherently wrong. 

As you start to learn SQL and run into roadblocks, keep in mind that your code may be fine but you need to learn about the organizational choices that have gone into the data you’re querying. 

Syntax

SQL syntax is easy to learn and difficult to master. As a UX researcher, you’re luckily probably only aiming to learn it, not master it, unlike our colleagues in data science. I estimate I spent only about 2-3 hours learning the simplest syntax before diving into the real tables. From there, I learned more difficult syntax along the way.

Basics

You can find numerous places to learn the basics. Personally, I used W3Schools as a simple resource for syntax that also lets you run queries in your browser. 

Each SQL query has just a few elements at the core:

  • SELECT chooses the variables you want
    • * selects all variables in the table
  • FROM chooses the table you want to select variables from

One small jump in complexity may have clauses like:

  • WHERE filters the rows returned, based on values in a column
  • AND and OR let you combine multiple conditions in a WHERE clause

These clauses follow a set order, which I can show in an example.

Example query:

SELECT user_id, email, country, subscription
FROM all_users_daily
WHERE subscription = 'premium'

With just these simple pieces of code, you can start to get the data you need for a qualitative study or survey sample. If you are a whiz in R or Excel, you also have the building blocks to do your own deeper aggregations and analysis. Early on, I did a lot in R that I now do natively in SQL, simply because it was faster for me to work in R while I got up to speed on SQL.

Two commands that are conceptually important, but that I almost never use, are UPDATE and DELETE: in the tech world we’re not often making our own tables – they populate from pipelines and logs.
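If you want to experiment with this shape of query before you have warehouse access, you can run it against a toy table. This is a minimal sketch using Python’s built-in sqlite3 module; the all_users_daily table and every value in it are hypothetical stand-ins:

```python
import sqlite3

# Build a tiny in-memory stand-in for a hypothetical all_users_daily table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE all_users_daily (user_id INT, email TEXT, country TEXT, subscription TEXT)"
)
conn.executemany(
    "INSERT INTO all_users_daily VALUES (?, ?, ?, ?)",
    [
        (1, "a@example.com", "US", "premium"),
        (2, "b@example.com", "BR", "free"),
        (3, "c@example.com", "US", "premium"),
    ],
)

# The same SELECT / FROM / WHERE shape as the example query above
rows = conn.execute(
    "SELECT user_id, email, country, subscription "
    "FROM all_users_daily "
    "WHERE subscription = 'premium'"
).fetchall()

print(rows)  # only the two premium users come back
```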

Aggregation


If you have a table, you can add certain commands that let you aggregate (summarize) the data you’ve returned. This is particularly useful if you want to learn about a question you have (how many users do we have in each country?) but don’t need a list of user ids for a study:

  • GROUP BY is what lets the following aggregation functions work together
    • You need to decide what you want to aggregate by using this command
  • COUNT is used on a column and returns a count of its values, split by your GROUP BY
  • MAX or MIN will show the largest or smallest value in a column, also using GROUP BY

Example query:

SELECT 
 count(user_id) as mycount, 
 country
FROM 
 test
GROUP BY country
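To see the aggregation work end to end, here is a minimal sketch using Python’s built-in sqlite3 module with a made-up test table (the table name and values are hypothetical):

```python
import sqlite3

# A toy "test" table: one row per user, with a country column
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (user_id INT, country TEXT)")
conn.executemany(
    "INSERT INTO test VALUES (?, ?)",
    [(1, "US"), (2, "US"), (3, "BR")],
)

# Count users per country, mirroring the GROUP BY example above
rows = conn.execute(
    "SELECT count(user_id) as mycount, country FROM test GROUP BY country"
).fetchall()

print(sorted(rows))  # [(1, 'BR'), (2, 'US')] -- two US users, one BR user
```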

Advanced

Calling the next commands advanced is not totally accurate, because the advanced category goes quite deep. I also don’t recommend starting with these if you are just beginning with SQL. I’m sharing them here because they are what enabled me to go from beginner to my intermediate/advanced level. Consider tackling these after you are comfortable with the basics:

  • JOIN is a way to tie two tables together (VLOOKUP in Excel is similar)
    • If you use R, dplyr:: uses the same naming terminology for joins, which makes it simple to transfer your knowledge
  • WITH AS lets you define subqueries within one query and then reference them together
    • This allows you to stack longer queries with multiple stages, with no need to save individual tables along the way.
  • DECLARE lets you make variables, like writing a date once instead of rewriting it 10 times across subqueries. 

In the next subsection for R users, you can see an example of a WITH AS command that calculates two averages in one subquery, and then sums those averages together in the next query (without saving each table separately).
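For a taste of what JOIN does, here is a minimal sketch using Python’s built-in sqlite3 module. The users and subscriptions tables and their values are hypothetical:

```python
import sqlite3

# Two made-up tables that share a user_id key
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INT, country TEXT);
    CREATE TABLE subscriptions (user_id INT, plan TEXT);
    INSERT INTO users VALUES (1, 'US'), (2, 'BR');
    INSERT INTO subscriptions VALUES (1, 'premium');
""")

# JOIN ties the two tables together on the shared user_id column,
# much like VLOOKUP matching on a key in Excel
rows = conn.execute("""
    SELECT u.user_id, u.country, s.plan
    FROM users u
    JOIN subscriptions s ON u.user_id = s.user_id
""").fetchall()

print(rows)  # [(1, 'US', 'premium')] -- user 2 has no subscription row, so it drops out
```

Note that the default (inner) JOIN drops rows with no match in the other table; a LEFT JOIN would keep user 2 with a NULL plan instead.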

A tip for R users

The last major consideration I have for syntax is something I learned coming from R. In R, you can stack many functions together. If you want to calculate the means of two columns and then add those means together, you can do it all at once.

Example in R:

ABSum = mean(ColA) + mean(ColB)

If you try to do this in SQL, it will not run. SQL happens in more discrete stages. I like to visualize the table I’ll receive from each subquery, and then think about how I’ll make a new subquery to do the next step. It’s a bit clunky coming from R, but that’s just how it works.

This will fail:

SELECT 
 avg(ColA) as ColAavg, 
 avg(ColB) as ColBavg, 
 ColAavg + ColBavg as ABSum 
FROM 
  test

This will run:

WITH avgs AS(
 SELECT 
  avg(ColA) as ColAavg, 
  avg(ColB) as ColBavg 
 FROM 
  test
) 
SELECT 
 (ColAavg + ColBavg) as ABSum 
FROM 
 avgs
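You can verify the staged version for yourself. This sketch runs the same WITH AS pattern on a made-up test table via Python’s built-in sqlite3 module (the column values are hypothetical):

```python
import sqlite3

# A toy "test" table with two numeric columns
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test (ColA REAL, ColB REAL);
    INSERT INTO test VALUES (1.0, 10.0), (3.0, 30.0);
""")

# Stage 1 (the WITH AS subquery) computes the two averages;
# stage 2 references that result and sums them
(absum,) = conn.execute("""
    WITH avgs AS (
        SELECT avg(ColA) AS ColAavg, avg(ColB) AS ColBavg FROM test
    )
    SELECT (ColAavg + ColBavg) AS ABSum FROM avgs
""").fetchone()

print(absum)  # 22.0 -> mean(1, 3) + mean(10, 30) = 2 + 20
```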

I wouldn’t try to fully wrap your head around WITH AS clauses or JOIN clauses right now before you get your hands on real data. Like I said before, half of what you will be learning for the first time is how your data exists in the organizational context.

In the wild

Learning SQL for the first time means learning the syntax and the organizational context at the same time, which makes it extra challenging. If you later move to another company, you only need to learn the new context without worrying about syntax – you can feel confident about your code and recognize that a roadblock is an idiosyncrasy of your organizational context. Here are some key things to keep in mind so you can diagnose problems more accurately.

Access and tools

First, you need access. You need to learn who owns access to the data and convince them to let you use it. At some companies, this was my biggest blocker and I never got around it. At others, it is as simple as a ticket that is automatically approved.

Then, you need to learn how you can access the data – this is what your approval ticket will give you access to. At Meta, I used daiquery, a homegrown tool. Many places use enterprise solutions like BigQuery or Redshift – this is likely what you’ll run into at a tech company. These tools let you write SQL queries, run them, and save the data in some way (among many other capabilities). Along with these tools, you’ll need to identify which specific SQL dialect you need to use.

SQL is a language with slightly varied dialects, such as GoogleSQL, T-SQL (SQL Server), or MySQL. The core aspects are the same, but some more advanced functions have unique names or conventions. It shouldn’t be a big impediment, but keep it in mind when you start to search online for help with queries. A simple SELECT * FROM will work anywhere, but things like converting date/time columns may not.

These tools will get you the data, but they are often limited in the analysis you can do with the data you got from the query. I use R for more advanced analysis, but you can use whatever you’re comfortable with, like Excel. I do find that R or Python quickly show their worth when you get larger data sets.

Data warehouses and inner workings

Once you have your tooling set up, you can start querying. Without knowing at least a little bit of the inner workings of the system, you’ll quickly find yourself stuck.

Data tables are the thing you query, and data tables are stored in a data warehouse2. In smaller companies you may just have one data warehouse. In larger companies you may have multiple warehouses – this can impact your work if you need different access permissions for each or want to join up tables from across warehouses (at Meta we had a tool called “Uhaul” to combine queries across warehouses). 


Data enters the warehouse in several ways. The foundational path is often logging implemented in the application itself, like tracking whether a user changed a setting or when an account was created. Data scientists or data engineers can then take log data and transform it (ETL) into a pre-aggregated table, like a daily snapshot of how many users logged in or how many transactions an account had.

Why do you need to know this? Here are some things I’ve run into:

  • Partitioned tables: some tables overwrite themselves, like overwriting the data in a spreadsheet. More commonly, data tables use partitions. This means a new set of rows is generated in the table each day with the latest data. For example, a user id would be repeated every day with that day’s latest account balance or friend count. This way you can access historical data.
    • Where I got stuck: partitioned tables often require a WHERE clause to get a certain day’s data, rather than returning a massive table of all historical data for every account or user (sucking up unneeded money/resources in your query).
    • Where I also got stuck: some tables only update when values change, so a user id may be missing from a day’s partition for friend counts if they added no new friends that day.
  • Different warehouses: if you have different warehouses, you may need to transfer data across them to use them together – this likely has a certain tool and process.
    • Where I got stuck: when you query a table you don’t have access to, the system may simply tell you the table doesn’t exist – that might not be true; you just need to get access first.
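To make the partition idea concrete, here is a toy sketch using Python’s built-in sqlite3 module. The ds (datestamp) column name is one common convention; the table, column, and values are all hypothetical, and sqlite doesn’t actually partition – the ds column just plays the role of the partition key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- A toy stand-in for a daily-partitioned table: one batch of rows per day
    CREATE TABLE friend_counts (ds TEXT, user_id INT, friend_count INT);
    INSERT INTO friend_counts VALUES
        ('2024-01-01', 1, 10),
        ('2024-01-01', 2, 5),
        ('2024-01-02', 1, 12);  -- user 2 added no friends, so no row this day
""")

# Without the ds filter you'd pull every historical partition;
# pinning one day keeps the query small and cheap
rows = conn.execute(
    "SELECT user_id, friend_count FROM friend_counts WHERE ds = '2024-01-02'"
).fetchall()

print(rows)  # [(1, 12)] -- only user 1 has a row in that day's partition
```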

I only scratched the surface here. Companies not only employ data scientists to work on these but specialized data engineers that focus solely on creating, storing, and moving data. Again, data is the lifeblood of tech companies, so all of this effort goes a long way in making an informed and effective set of business/product teams.

The human side

Okay, we’ve talked about the syntax, the tools, and the infrastructure. What else could there possibly be? 

All these data exist because humans made a series of choices leading up to their existence. It’s no small task to organize these fire hoses of data in a way that allows others to do multitudes of tasks with differing goals. I’ve worked at places where data is organized in better and worse ways, but it is always messy. This is the point where you want to approach your friends in data science for help.

Once you have access to the data, you may still need to find where the data is that you want. Each team probably knows certain tables better (like bespoke tables for the team they work on). Certain tables are probably used by everyone at the company (like high level usage metrics or account data). Work with a DS friend to learn about the repository, wiki, or explorer that can help you find the table you need. Even with those tools, I often still have to ask a real human to help me make sense of them.

Putting it all together

As you undertake this process, I’d suggest finding a DS colleague that you’ve worked with before. Ask to see their queries, try to find out what each piece does, and modify some of it. Try to work through your problems as far as you can, organize your thoughts, and then reach out to them when you need help.

The process of learning SQL involves reading the basics in documentation, but it’s mostly trial and error, plus reading tons of posts on StackExchange. This is true for beginners or pros. Don’t feel any imposter syndrome because you’re looking things up as you tweak your query 50+ times.

ChatGPT and other LLMs have made some shortcuts easier, but I find these tools much more useful when I understand the output I get back from them at a base level. ChatGPT will rarely take me from 0% to 100% in a project but does a solid job helping me go from 80% to 100% (even reaching slightly past my current abilities in advanced queries). 

I believe you can learn SQL with little or no prior coding knowledge – it is a simple language that only goes deep when you need it to. If you already know some R from academic training, I’d suggest reading this article: How to Learn SQL as an R User in Academia.

What if you aren’t in a role right now or one where you can access SQL? Because the syntax is only half the work with SQL, it’s hard to really learn how to use it without actual data. I’d suggest learning the SQL syntax and then confidently putting down “beginner” on your resume, even if you feel shaky. More than anything in UX research, I think SQL is easily and best learned on the job. 

Wrap up

SQL is a useful tool for any UX researcher, even a qualitative UX researcher. You can unblock your own work by querying users for participants. You can better understand your users and better learn how to impact your business.

To learn SQL, you need to learn the syntax and the organizational context/infrastructure. Spend just a bit of time getting the basics of the syntax, and then dig into the real data to start getting the data you need. SQL is something you can take from company to company and you will not regret having this tool in your toolbox.


This was also featured in the UserWeekly newsletter.

Appendix

  1. Factual is a bit of a misnomer. All metrics are imbued with human decision points and chances for measurement error. The more you know about your company’s data, the more you’ll understand what I mean. ↩︎
  2. I’m way over simplifying here but this isn’t a data engineering article. ↩︎