Merge projects seamlessly with our NEW GitLab integrationLearnmore
Data

How to level up your SQL skills

Peter Bex

If you're like most programmers, you know some SQL but aren't fluent in it. Using an ORM insulates you from writing SQL by hand, so there's not much incentive to learn SQL well. There are good reasons to learn it, though:

  • SQL is framework-independent. That means you can carry your knowledge of SQL over if you drop your current ORM, framework or even programming language.
  • SQL has been around since the seventies and the standard was last updated in 2016. It is here to stay. Time spent learning it is a good investment.
  • For performance-critical tasks, sometimes it is essential to use SQL directly.
  • Even if you avoid it in your own code, you'll still run across other people's code. So it is likely you'll run across SQL.

In this post I'll touch upon some skills that you'll find useful to write good SQL.

Thinking in set operations

The first thing you'll need to get used to, is to think in terms of set operations. I've often run into codebases where you see something like this (pseudo-code):

for (obj in Mymodel.all()) { # SELECT * FROM mymodel;

obj.value += 1;
obj.save(); # UPDATE mymodel SET value = $1 WHERE id = $2;
}

This is a big no-no. For one, this fetches all objects from the database into memory. The database must first fetch the complete records from disk, serialize the values and send them across the wire. Then, your framework deserializes (or "hydrates") the objects into programming language value types. Finally, you loop over each, increment one value and save.

I've added the SQL statements that are executed under the hood as comments. Each save() call is a round trip to the database to send an UPDATE command. This implies a lot of latency.

Think of it like getting 10 bottles of milk from the grocery store. You wouldn't want to go to the store, get one bottle, go home, put it in the fridge, go back to the store, get another bottle, and so on. Think of each query you're executing as a trip to the store. Sending a query to the database usually isn't that expensive, but it does add overhead, and they do add up over time. If your database is on another machine, you can add network latency for each query too.

It is faster to update all the values in one go:

UPDATE mymodel SET value = value + 1;

This avoids all the overhead I just mentioned: we don't fetch all the records from the database, we don't serialize the values and we only send one query. Another advantage is that this operation is atomic; even if you're not in a transaction, the values are incremented "all at once". Consider the loop above; if another process is writing to the same table, you might be clobbering some records with their old value incremented by one! An `UPDATE` statement always operates on a consistent snapshot of the database.

Writing maintainable SQL

Sometimes, SQL can get messy. It doesn't compose particularly well, and if you start using subqueries it tends to be difficult to figure out how to read the query. If you are programming in a programming language, you wouldn't nest complicated expressions willy-nilly. Instead, you would use temporary variables so the code can be read from top to bottom.

The same way, in SQL, you can use "common table expressions" (CTEs, aka WITH-statements) to cut a massive query into little pieces that are incrementally understandable and testable. Here's a simple example which shows you which actors current earns the highest salary. We only consider first billed (lead) actors who starred in movies that start with an "a":

WITH lead_actors AS ( 
SELECT DISTINCT ON (r.movie_id) r.actor_id
FROM roles AS r
INNER JOIN movies AS m
ON m.id = r.movie_id
WHERE m.title ILIKE 'a%'
ORDER BY r.movie_id, r.billing_order
), current_salaries AS (
SELECT DISTINCT ON (actor_id) actor_id, amount
FROM salaries
INNER JOIN lead_actors USING (actor_id)
ORDER BY actor_id, year DESC

SELECT *
FROM current_salaries AS cs
INNER JOIN actors AS a
ON cs.actor_id = a.id
ORDER BY cs.amount DESC
LIMIT 1;

If you wanted to debug this (quite large) query, you could start by dropping the final `SELECT` and changing it to `SELECT * FROM lead_actors`. That would allow you to verify that the lead actors are correct. Then, you could `SELECT * FROM current_salaries` and verify that those are correct. And it also makes it more readable and maintainable for the next programmer who comes along!

One note of warning though: in PostgreSQL, WITH can act as a performance barrier. This means that queries inside WITH evaluate one at a time. Starting with PostgreSQL 12, this is no longer true if the CTE is used more than once. You can still opt to have the old behaviour by explicitly saying so with the new (non-standard) `MATERIALIZED` keyword.

These are just two ways of improving your SQL skills. There are many more things you can do. I would recommend studying the manual of your SQL system every once in a while and trying out new keywords. But remember, the best way to improve your skill is to practice!