The Cartesian Product Trap in Django ORM
I hit a performance wall at work recently after adding a single extra Count() to a Django queryset that looked harmless. Luckily, the problematic code didn't make it to production, but I had to think hard to figure out what went wrong and find a solution.
The Demo Models
All the following models and entities are made up to illustrate the problem.
Imagine we have a PostgreSQL database storing information about stores, their departments, employees, and products. The real tables could contain many more fields, but in our case we only care about the relations between the models.
So in Django, models.py would look like this:
from django.db.models import Model, CharField, ForeignKey, CASCADE


class Store(Model):
    name = CharField(max_length=200)
    # ...


class Department(Model):
    store = ForeignKey(
        Store, related_name="departments", on_delete=CASCADE
    )
    # ...


class Employee(Model):
    department = ForeignKey(
        Department, related_name="employees", on_delete=CASCADE
    )
    # ...


class Product(Model):
    store = ForeignKey(
        Store, related_name="products", on_delete=CASCADE
    )
    # ...
Seed data (roughly what bit me):
- 2 stores,
- 10 departments per store,
- 500 employees per department (10 000 employees overall),
- 2500 products per store.
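If you want to reproduce it locally, here's a minimal seeding sketch (a throwaway script of my own, not production code; it assumes the models above live in an app called shop, matching the shop_* table names you'll see in the SQL below, and that they have no other required fields):

from shop.models import Store, Department, Employee, Product

for s in range(2):
    store = Store.objects.create(name=f"Store {s}")
    # 2500 products attached directly to the store
    Product.objects.bulk_create(
        Product(store=store) for _ in range(2500)
    )
    for _ in range(10):
        department = Department.objects.create(store=store)
        # 500 employees per department
        Employee.objects.bulk_create(
            Employee(department=department) for _ in range(500)
        )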
In one place in our system, we show a list of the stores, including the number of employees in each one. Pretty easy with the Django ORM, right?
from django.db.models import Count

stores = Store.objects.annotate(
    total_employees=Count("departments__employees")
).values(
    "id",
    # ...
    "total_employees",
)
This query is relatively fast and works like a charm.
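By the way, whenever a queryset surprises you, you can inspect the SQL Django will generate without running anything:

# str(queryset.query) renders approximate SQL (parameters are
# inlined naively), which is enough for spotting extra JOINs
print(str(stores.query))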
The Problem
Now let's imagine we were asked to add another counter: the number of products per store. We already have total_employees, so why not just add total_products?
Well, most likely we already have a unit test for this piece of code that checks the logic on a small amount of data. Before releasing, we notice that another JOIN was added and some rows got duplicated, so instead of a plain COUNT(...) we switch to COUNT(DISTINCT ...), eventually coming up with something like this:
stores = Store.objects.annotate(
    total_employees=Count("departments__employees", distinct=True),
    total_products=Count("products", distinct=True),
).values(
    "id",
    # ...
    "total_employees",
    "total_products",
)
Looks safe to commit, push, wait for the green tests on CI, merge and deploy.
However, after the deployment you'll almost immediately see that everything hangs. As I said, there aren't that many stores (only 2 for now), nor that many employees or departments.
And yet it takes, like, 10 seconds to fetch the numbers! What's wrong with it?
Let's take a closer look at the generated SQL for this seemingly innocent queryset:
SELECT
"shop_store"."id",
COUNT(DISTINCT "shop_product"."id") AS "total_products",
COUNT(DISTINCT "shop_employee"."id") AS "total_employees"
FROM "shop_store"
LEFT OUTER JOIN "shop_product"
ON "shop_store"."id" = "shop_product"."store_id"
LEFT OUTER JOIN "shop_department"
ON "shop_store"."id" = "shop_department"."store_id"
LEFT OUTER JOIN "shop_employee"
ON "shop_department"."id" = "shop_employee"."department_id"
GROUP BY "shop_store"."id"
And let's check the actual query plan:
GroupAggregate ... rows=2 ...
-> Nested Loop Left Join ... rows=25000000 ...
...
...
Execution Time: 11376.072 ms
Translation: these three JOINs turn into a 25-million-row cartesian mess before GROUP BY and COUNT(DISTINCT) are applied:

Products × Departments × Employees
    = 2500 × 10 × 500
    = 12 500 000 (per store)
    × 2 stores
    = 25 000 000 joined rows
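You don't even need psql to get such a plan; Django querysets can run EXPLAIN directly (note that analyze=True actually executes the query, so be careful with it on production data):

# Runs EXPLAIN ANALYZE on PostgreSQL and returns the plan as text
print(stores.explain(analyze=True))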
The Fix
There are multiple ways of handling this case, but the easiest fix is to use subqueries:
from django.db.models import Count, IntegerField, OuterRef, Subquery
from django.db.models.functions import Coalesce

subquery_products = Subquery(
    Product.objects.filter(store_id=OuterRef("pk"))
    .values("store_id")
    .annotate(count=Count("pk"))
    .values("count"),
    output_field=IntegerField(),
)
subquery_employees = Subquery(
    Employee.objects.filter(department__store_id=OuterRef("pk"))
    .values("department__store_id")
    .annotate(count=Count("pk"))
    .values("count"),
    output_field=IntegerField(),
)

stores = Store.objects.annotate(
    total_products=Coalesce(subquery_products, 0),
    total_employees=Coalesce(subquery_employees, 0),
).values("id", "total_products", "total_employees")
SQL query:
SELECT "shop_store"."id",
COALESCE((
SELECT COUNT(U0."id") AS "count"
FROM "shop_product" U0
WHERE U0."store_id" = "shop_store"."id"
GROUP BY U0."store_id"), 0
) AS "total_products",
COALESCE((
SELECT COUNT(U0."id") AS "count"
FROM "shop_employee" U0
INNER JOIN "shop_department" U1
ON U0."department_id" = U1."id"
WHERE U1."store_id" = "shop_store"."id"
GROUP BY U1."store_id"), 0
) AS "total_employees"
FROM "shop_store";
Now this one takes a couple of milliseconds with a pretty modest and predictable plan:
Seq Scan on shop_store ... rows=2 ...
SubPlan 1
-> Seq Scan on shop_product u0 ... rows=2500 ...
SubPlan 2
-> Hash Join ... rows=5000 ...
-> Seq Scan on shop_employee u0_1 ... rows=10000 ...
-> Hash ... rows=10 ...
...
Execution Time: 5.600 ms
No giant intermediate data sets, just two tiny scans:
- before: 11 376 ms (~11 seconds)
- after: 5.6 ms (2000x faster)
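Subqueries aren't the only option, though. Another approach I considered (a sketch, not what I shipped) is to aggregate each branch in its own query and stitch the results together in Python:

from django.db.models import Count

product_counts = dict(
    Product.objects.values_list("store_id").annotate(n=Count("pk"))
)
employee_counts = dict(
    Employee.objects.values_list("department__store_id").annotate(n=Count("pk"))
)

stores = [
    {
        "id": store_id,
        "total_products": product_counts.get(store_id, 0),
        "total_employees": employee_counts.get(store_id, 0),
    }
    for store_id in Store.objects.values_list("id", flat=True)
]

Three plain queries, no correlation, and nothing for the planner to multiply.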
Takeaways
- COUNT(DISTINCT) with multi-branch LEFT JOINs makes the database loop through the entire cartesian product.
- Correlated subqueries aggregate each branch separately and scale linearly with data size.
- Always test aggregate queries against production-sized data before you ship.
Why Django's override_settings Sometimes Fails (and How reload + patch Saved Me)
Sometimes @override_settings just doesn’t cut it.
I ran into a nasty issue while testing a Django module that relies on global state initialized during import. The usual test approach didn’t work. Here’s what happened and how I solved it.
The Setup
We had a module that builds a global dictionary from Django settings at import time. Let’s call it dragon.py; it reads settings.PUT_EGGS, which is False by default:
from django.conf import settings

DRAGON = {}
...

if settings.PUT_EGGS:
    DRAGON["eggs"] = "spam"
Another module uses DRAGON for its core logic, e.g. mario.py:
from myproject.dragon import DRAGON


def find_eggs():
    if "eggs" in DRAGON:
        return "Found eggs!"
    return "Eggs not found"
Now I wanted to write a test that tweaks DRAGON and expects the logic to behave differently. Easy, right?
@override_settings(PUT_EGGS=True)
def test_find_eggs():
    assert find_eggs() == "Found eggs!"
Wrong. The test failed.
The Problem
override_settings works, but only for code that reads settings at runtime.
In my case, DRAGON was already built at import time, before the override kicked in. So it used the old value of PUT_EGGS, no matter what I did in the test.
This is the classic trap of global state baked during import. Welcome to pain town.
The Fix: reload + patch
Here's how I got out:
import importlib
from unittest.mock import patch

from django.test import override_settings

from myproject.mario import find_eggs


@override_settings(PUT_EGGS=True)
def test_find_eggs():
    # Reload the dragon module so DRAGON is rebuilt
    # with the updated settings
    from myproject import dragon

    new_dragon = importlib.reload(dragon)

    # Patch the logic module to use the reloaded DRAGON
    with patch("myproject.mario.DRAGON", new_dragon.DRAGON):
        result = find_eggs()
        assert result == "Found eggs!"
Why This Works
- importlib.reload(dragon) forces a fresh import of dragon, rebuilding DRAGON with the overridden settings;
- dragon.DRAGON is updated in the scope of the test only, i.e. the mario module still has the stale version of DRAGON;
- patch(...) solves this problem by swapping the old DRAGON in mario with the freshly rebuilt one.
This is surgical. Ugly, but effective.
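One caveat I'd add: importlib.reload leaves the rebuilt module in sys.modules, so the egg-filled DRAGON can leak into later tests. A sketch of a cleanup fixture (assuming pytest; the fixture name is made up) that reloads dragon again on teardown, after override_settings has restored the original value:

import importlib

import pytest

from myproject import dragon


@pytest.fixture
def fresh_dragon():
    # The test body runs while override_settings is still active
    yield importlib.reload(dragon)
    # Teardown runs after override_settings has been reverted,
    # so this reload rebuilds DRAGON with the original settings
    importlib.reload(dragon)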
Lessons Learned
- Avoid putting non-trivial logic at module scope, especially if it depends on Django settings. Wrap it in a function or lazy loader (see the sketch after this list).
- If you're stuck with global state, reload() and patch() give you a way out - just be careful about cascading dependencies.
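For the first point, a minimal lazy-loader sketch (names are made up): build the dict on first use instead of at import time, and override_settings just works:

from django.conf import settings


def get_dragon():
    # Built on every call, so it always sees the current settings;
    # cache it only if you're prepared to invalidate the cache in tests
    dragon = {}
    if settings.PUT_EGGS:
        dragon["eggs"] = "spam"
    return dragon


def find_eggs():
    if "eggs" in get_dragon():
        return "Found eggs!"
    return "Eggs not found"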
If you’ve ever had a test mysteriously fail after overriding settings, this might be why.
Under the Hood of Spense.app: The Code.
While Spense v0.2 is under development, I want to talk about the internal organization of the application from a technical perspective. This article is mainly for web developers, and it's written in the corresponding language, so if you're reading it and don't understand anything, that's okay; you can just skip it.
In a Nutshell
Backend on Django (Python), frontend on Django templates and Bootstrap, with a pinch of JavaScript and some htmx (not anymore).
Why So Boring?
Doesn't sound very hype, right? But remember that Spense in its current state is not a full-fledged product. It's more of a prototype in which I often need to change things and test ideas. If the ideas work, I'll throw this code away and write new code; if they don't, I'll just keep it as a memento.
So, I chose Django not because I love it (actually, I hate it), but because I've gotten used to it over the last year, and it allows me to prototype the application easily and quickly.
This Week in Changelogs: Django and faker
Django 4.1.6, 4.1.7
- 4.1.6: release notes, blog post
- 4.1.7: release notes, blog post
9d7bd5a
An interesting bug in parsing the Accept-Language header. The format of the header value is complex, so there's a bunch of regular expressions and @functools.lru_cache(maxsize=1000) for caching the result. However, you can pass a huge header multiple times, causing a DoS, so they added two if statements:
- one that checks whether the length is less than ACCEPT_LANGUAGE_HEADER_MAX_LENGTH;
- a second one for checking the comma-separated parts.

So they decided not to just raise an exception or truncate the string with [:ACCEPT_LANGUAGE_HEADER_MAX_LENGTH], but to truncate the value in a safe way, so it can still be parsed into a meaningful result. Good job!
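My paraphrase of the idea (a sketch, not the actual Django code): cut the over-long value at the limit, then back off to the last comma so every remaining language entry is intact:

ACCEPT_LANGUAGE_HEADER_MAX_LENGTH = 500  # illustrative; see Django's source

def truncate_accept_lang(value):
    # Short enough - nothing to do
    if len(value) <= ACCEPT_LANGUAGE_HEADER_MAX_LENGTH:
        return value
    truncated = value[:ACCEPT_LANGUAGE_HEADER_MAX_LENGTH]
    # Drop the possibly cut-off last entry so the rest parses cleanly
    return truncated.rsplit(",", 1)[0]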
26b7a25
There was a bug in the generated SQL, caused by the .desc() in a model's Meta.constraints:
constraints = [
    UniqueConstraint(
        Lower("name").desc(), name="unique_lower_name"
    )
]
which resulted in <...> WHERE LOWER("myapp_foo"."name") DESC <...> when checking for uniqueness. Apparently, Django can check the constraints itself, without delegating it to the underlying database.
Although the fix is trivial, the case is not, and it wasn't covered in the initial implementation.
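That Python-side checking is the constraint validation added in Django 4.1: full_clean() now verifies Meta.constraints with its own query, which is exactly where the stray DESC leaked in. A sketch, reusing the Foo model implied by the SQL above:

from django.core.exceptions import ValidationError

foo = Foo(name="Bar")  # suppose "bar" already exists in myapp_foo
try:
    # full_clean() validates Meta.constraints with a SELECT like
    # the one above instead of waiting for the database error
    foo.full_clean()
except ValidationError as e:
    print(e.messages)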
By the way, I like how they use typographic double quotes:
msg = "Constraint “name_lower_uniq_desc” is violated."
a637d0b f3b6a4f
Those black updates are annoying, mainly because they make git blame misleading. However, there's a solution I didn't know about:
- git blame --ignore-revs-file <file> - ignore the commits listed in the file;
- a .git-blame-ignore-revs file - make GitHub ignore them as well.
590a92e
The bug was caused by the commit we've already seen. Now you can safely raise ValidationError without the code argument.
628b33a
One more DoS fix; this one is about the number of open files when you put too many of them in one multipart payload. The fix introduces the TooManyFilesSent exception, which results in HTTP 400 (DATA_UPLOAD_MAX_NUMBER_FILES = 100 by default).
I like this fragment:
try:
    return self._parse()
except Exception:
    if hasattr(self, "_files"):
        for _, files in self._files.lists():
            for fileobj in files:
                fileobj.close()
    raise
Don't forget to free your resources; the garbage collector can't help you all the time!
faker 16.6.1..17.0.0
Their CHANGELOG is quite descriptive, so I'll just highlight something that I liked.
faker can generate valid image URLs pointing at specific websites (TIL). One of them, PlaceIMG, is shutting down, so they removed it from the list. The announcement is included in all the generated images:

- Biased booleans introduced.
- Added an emoji provider 🎉 🥳
- Added a new es_AR provider, but for some reason it's not reflected in the CHANGELOG.
- Black formatting - always beautiful.
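A quick sketch to try the new toys (I'm assuming the emoji provider exposes an emoji() method):

from faker import Faker

fake = Faker()
print(fake.emoji())       # from the new emoji provider

fake_ar = Faker("es_AR")  # the new Argentinian Spanish locale
print(fake_ar.name())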
In addition, it turned out that GitHub can put those linter errors from the Actions right into the code view. I don't know yet how to set this up, but I definitely want it!
