dima on software

Posts tagged "django":

The Cartesian Product Trap in Django ORM

I hit a performance wall at work recently after adding a single extra Count() on a Django queryset that looked harmless. Luckily, the problematic code didn't make it into the production environment, but I had to think hard to figure out what went wrong and find a solution.

The Demo Models

All the models and data below are made up to illustrate the problem.

Imagine we have a PostgreSQL database storing information about stores, their departments, their employees, and the stores' products:

Database schema showing Store, Department, Employee, and Product relationships

The real tables contain many more fields, but here we only care about the relations between the models.

So in Django, models.py would look like this:

from django.db.models import Model, CharField, ForeignKey, CASCADE

class Store(Model):
    name = CharField(max_length=200)
    # ...

class Department(Model):
    store = ForeignKey(
        Store, related_name="departments", on_delete=CASCADE
    )
    # ...

class Employee(Model):
    department = ForeignKey(
        Department, related_name="employees", on_delete=CASCADE
    )
    # ...

class Product(Model):
    store = ForeignKey(
        Store, related_name="products", on_delete=CASCADE
    )
    # ...

Seed data (roughly what bit me):

  • 2 stores,
  • 10 departments per store,
  • 500 employees per department - 10 000 employees overall,
  • 2500 products per store.

In one place in our system, we show a list of the stores, including the number of employees in each store. Pretty easy with the Django ORM, right?

from django.db.models import Count

stores = Store.objects.annotate(
    total_employees=Count("departments__employees")
).values(
    "id",
    # ...
    "total_employees"
)

This query is relatively fast and works like a charm.

The Problem

Now let's imagine we're asked to add another counter: the number of products per store. We already have total_employees, so why not just add total_products?

Well, most likely we already have a unit test for this piece of code that checks the logic on a small amount of data. Before releasing, we notice that the extra JOIN duplicates some rows, so instead of a plain COUNT(...) we switch to COUNT(DISTINCT ...), eventually coming up with something like this:

stores = Store.objects.annotate(
    total_employees=Count("departments__employees",
                          distinct=True),
    total_products=Count("products", distinct=True),
).values(
    "id",
    # ...
    "total_employees",
    "total_products",
)

Looks safe to commit, push, wait for the green tests on CI, merge and deploy.

However, after the deployment you'll almost immediately see that everything hangs. As I said, there aren't that many stores, only 2 for now, not that many employees, and not that many departments.

And it takes, like, 10 seconds to fetch the numbers! What's wrong with it?

Let's take a closer look at the generated SQL for this seemingly innocent queryset:

SELECT 
    "shop_store"."id",
    COUNT(DISTINCT "shop_product"."id") AS "total_products", 
    COUNT(DISTINCT "shop_employee"."id") AS "total_employees"
FROM "shop_store"
  LEFT OUTER JOIN "shop_product"
    ON "shop_store"."id" = "shop_product"."store_id"
  LEFT OUTER JOIN "shop_department"
    ON "shop_store"."id" = "shop_department"."store_id"
  LEFT OUTER JOIN "shop_employee"
    ON "shop_department"."id" = "shop_employee"."department_id" 
GROUP BY "shop_store"."id"

And let's check the actual query plan:

GroupAggregate ... rows=2 ...
  ->  Nested Loop Left Join ... rows=25000000 ...
  ...
...
Execution Time: 11376.072 ms

Translation: these three JOINs turn into a 25-million-row Cartesian mess before GROUP BY and COUNT(DISTINCT) get to run:

Products × Departments × Employees
= 2500 × 10 × 500
= 12 500 000  (per store)
× 2 stores
= 25 000 000 joined rows
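
By the way, you don't need a database console to get these two artifacts; the ORM can show both the SQL and the plan:

# The SQL Django generates for the queryset
print(stores.query)

# EXPLAIN ANALYZE through the ORM (PostgreSQL; Django 2.1+)
print(stores.explain(analyze=True))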

The Fix

There are multiple ways to handle this case, but the easiest fix is to use correlated subqueries:

from django.db.models import Count, IntegerField, OuterRef, Subquery
from django.db.models.functions import Coalesce

subquery_products = Subquery(
    Product.objects.filter(store_id=OuterRef("pk"))
        .values("store_id")
        .annotate(count=Count("pk"))
        .values("count"),
    output_field=IntegerField()
)
subquery_employees = Subquery(
    Employee.objects.filter(department__store_id=OuterRef("pk"))
        .values("department__store_id")
        .annotate(count=Count("pk"))
        .values("count"),
    output_field=IntegerField()
)
stores = Store.objects.annotate(
    total_products=Coalesce(subquery_products, 0),
    total_employees=Coalesce(subquery_employees, 0),
).values("id", "total_products", "total_employees")

The generated SQL:

SELECT "shop_store"."id",
       COALESCE((
         SELECT COUNT(U0."id") AS "count"  
         FROM "shop_product" U0  
         WHERE U0."store_id" = "shop_store"."id"  
         GROUP BY U0."store_id"), 0
       ) AS "total_products",
       COALESCE((
         SELECT COUNT(U0."id") AS "count"
         FROM "shop_employee" U0
           INNER JOIN "shop_department" U1
             ON U0."department_id" = U1."id"
         WHERE U1."store_id" = "shop_store"."id"
         GROUP BY U1."store_id"), 0
       ) AS "total_employees"
FROM "shop_store";

Now this one takes a couple of milliseconds, with a pretty modest and predictable plan:

Seq Scan on shop_store ... rows=2 ...
  SubPlan 1
     -> Seq Scan on shop_product u0 ... rows=2500 ...
  SubPlan 2
     -> Hash Join ... rows=5000 ...
        -> Seq Scan on shop_employee u0_1 ... rows=10000 ...
        -> Hash ... rows=10 ...
...
Execution Time: 5.600 ms

No giant intermediate data sets, just two tiny scans:

  • before: 11 376 ms (~11 seconds)
  • after: 5.6 ms (2000x faster)

Takeaways

  • COUNT(DISTINCT) with multi-branch LEFT JOINs forces the database to churn through the entire Cartesian product.
  • Correlated subqueries aggregate each branch separately and scale linearly with data size.
  • Always test aggregate queries against production-sized data before you ship (a seeding sketch follows this list).
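
On that last point: a quick way to get production-like volumes locally is a throwaway seeding script. A minimal sketch against the demo models above (assuming they live in an app called shop, as the table names in the SQL suggest):

from shop.models import Department, Employee, Product, Store

def seed():
    # Volumes from this post: 2 stores, 10 departments per store,
    # 500 employees per department, 2500 products per store.
    for s in range(2):
        store = Store.objects.create(name=f"Store {s}")
        for _ in range(10):
            department = Department.objects.create(store=store)
            Employee.objects.bulk_create(
                [Employee(department=department) for _ in range(500)]
            )
        Product.objects.bulk_create(
            [Product(store=store) for _ in range(2500)]
        )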

Why Django's override_settings Sometimes Fails (and How reload + patch Saved Me)

Sometimes @override_settings just doesn’t cut it.

I ran into a nasty issue while testing a Django module that relies on global state initialized during import. The usual test approach didn’t work. Here’s what happened and how I solved it.

The Setup

We had a module that builds a global dictionary from Django settings at import time. Let's call it dragon.py; it reads settings.PUT_EGGS, which is False by default:

from django.conf import settings

DRAGON = {}
...
if settings.PUT_EGGS:
    DRAGON["eggs"] = "spam"

Another module uses DRAGON for core logic, e.g. mario.py:

from myproject.dragon import DRAGON

def find_eggs():
    if "eggs" in DRAGON:
        return "Found eggs!"
    return "Eggs not found"

Now I wanted to write a test that tweaks DRAGON and expects the logic to behave differently. Easy, right?

from django.test import override_settings
from myproject.mario import find_eggs

@override_settings(PUT_EGGS=True)
def test_find_eggs():
    assert find_eggs() == "Found eggs!"

Wrong. The test failed.

The Problem

override_settings works, but only for code that reads settings at runtime.

In my case, DRAGON was already built at import time, before the override kicked in. So it used the old value of PUT_EGGS, no matter what I did in the test.

This is the classic trap of global state baked during import. Welcome to pain town.

The Fix: reload + patch

Here's how I got out:

import importlib
from django.test import override_settings
from unittest.mock import patch
from myproject.mario import find_eggs

@override_settings(PUT_EGGS=True)
def test_find_eggs():
    # Reload the dragon module so DRAGON is rebuilt
    # with updated settings
    from myproject import dragon
    new_dragon = importlib.reload(dragon)

    # Patch the logic module to use the reloaded DRAGON
    with patch('myproject.mario.DRAGON', new_dragon.DRAGON):
        result = find_eggs()
        assert result == "Found eggs!"

Why This Works

  • importlib.reload(dragon) forces a fresh import of dragon, rebuilding DRAGON with the overridden settings;
  • however, mario imported DRAGON directly, so it still holds a reference to the stale object;
  • patch(...) solves this by swapping the old DRAGON in mario with the freshly rebuilt one.

This is surgical. Ugly, but effective.

Lessons Learned

  • Avoid putting non-trivial logic at module scope, especially if it depends on Django settings. Wrap it in a function or lazy loader (see the sketch below).
  • If you're stuck with global state, reload() and patch() give you a way out - just be careful about cascading dependencies.
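
For the first point, here's a minimal sketch of what dragon.py could look like with the state made lazy (my hypothetical rewrite, not the original module), so that plain override_settings works again:

from django.conf import settings

def get_dragon():
    # Build the dict at call time instead of import time, so the
    # current (possibly overridden) settings are always respected.
    dragon = {}
    if settings.PUT_EGGS:
        dragon["eggs"] = "spam"
    return dragon

find_eggs() would then call get_dragon() instead of importing the module-level DRAGON, and the whole reload/patch dance becomes unnecessary.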

If you’ve ever had a test mysteriously fail after overriding settings, this might be why.

Under the Hood of Spense.app: The Code.

This article is a translation and adaptation of my article in Russian.

While Spense v0.2 is under development, I want to talk about the internal organization of the application from a technical perspective. This article is mainly for web developers and is written in the corresponding language, so if you don't understand any of it, that's okay, just skip it.

In a Nutshell

Backend on Django (Python), frontend on Django templates and Bootstrap, with a pinch of JavaScript and some htmx (not anymore).

Why So Boring?

Doesn't sound very hip, right? But remember that Spense in its current state is not a full-fledged product. It's more of a prototype in which I often need to change things and test ideas. If the ideas work, I'll throw this code away and write new code; if they don't, I'll just keep it as a memento.

So, I chose Django not because I love it (actually, I hate it), but because I've gotten used to it over the last year, and it allows me to prototype the application easily and quickly.

Read more

This Week in Changelogs: Django and faker

Django 4.1.6, 4.1.7

9d7bd5a An interesting bug in parsing the Accept-Language header. The format of the header value is complex, so there's a bunch of regular expressions and @functools.lru_cache(maxsize=1000) for caching the result. However, an attacker can pass a huge header multiple times, causing a DoS, so they added two if statements:

  • one that checks that the length is less than ACCEPT_LANGUAGE_HEADER_MAX_LENGTH;
  • a second one that checks the comma-separated entries. They decided not to just raise an exception or truncate the string with [:ACCEPT_LANGUAGE_HEADER_MAX_LENGTH], but to truncate the value in a safe way, so it can still be parsed into a meaningful result (sketched below). Good job!
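
A rough sketch of that safe-truncation idea (not Django's actual code, and the constant's value here is made up):

ACCEPT_LANGUAGE_HEADER_MAX_LENGTH = 500  # illustrative value

def safe_truncate(accept_language: str) -> str:
    # Short values pass through untouched.
    if len(accept_language) <= ACCEPT_LANGUAGE_HEADER_MAX_LENGTH:
        return accept_language
    # Cut at the limit, then drop the possibly mangled last entry,
    # so what remains is still a valid comma-separated list.
    truncated = accept_language[:ACCEPT_LANGUAGE_HEADER_MAX_LENGTH]
    return truncated.rsplit(",", 1)[0]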

26b7a25 There was a bug in generated SQL, caused by a .desc() in the model's Meta.constraints:

constraints = [
    UniqueConstraint(
        Lower("name").desc(), name="unique_lower_name"
    )
]

which resulted in <...> WHERE LOWER("myapp_foo"."name") DESC <...> when checking uniqueness. Apparently, Django can check constraints itself rather than delegating them to the underlying database.

Although the fix is trivial, the case is not, and it wasn't covered in the initial implementation.

By the way, I like how they use typographic double quotes:

msg = "Constraint “name_lower_uniq_desc” is violated."

a637d0b f3b6a4f Those black reformatting commits are annoying, mainly because they make git blame misleading. However, there's a solution I didn't know about:

  • git blame --ignore-revs-file <file> - ignore commits listed in the file.
  • .git-blame-ignore-revs file - make GitHub ignore them as well.

590a92e The bug was caused by a commit we've already seen. Now you can safely raise ValidationError without the code argument.

628b33a One more DoS fix, this time about the number of open files when you put too many of them in one multipart payload. The fix introduces a TooManyFilesSent exception, which results in HTTP 400 (DATA_UPLOAD_MAX_NUMBER_FILES = 100 by default).
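
The limit is an ordinary setting, so it can be tightened per project (a settings.py sketch):

# settings.py
DATA_UPLOAD_MAX_NUMBER_FILES = 50  # the default is 100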

I like this fragment:

try:
    return self._parse()
except Exception:
    if hasattr(self, "_files"):
        for _, files in self._files.lists():
            for fileobj in files:
                fileobj.close()
    raise

Remember to free your resources; the garbage collector can't help you all the time!

faker 16.6.1..17.0.0

Their CHANGELOG is quite descriptive, so I'll just highlight something that I liked.

  • faker can generate valid image URLs pointing at specific placeholder websites (TIL). One of them, PlaceIMG, is shutting down, so they removed it from the list; the shutdown announcement is included in all images the service now generates.

In addition, it turned out that GitHub can show those linter errors from Actions right in the code view. I don't know yet how to set this up, but I definitely want it!