dima on software

The Cartesian Product Trap in Django ORM

2025-07-09

I hit a performance wall at work recently after adding a single extra Count() on a Django queryset that looked harmless. Luckily, the problematic code didn't make it into the production environment, but I had to think hard to figure out what went wrong and find a solution.

The Demo Models

All the following models and entities are made up by me to illustrate the problem.

Imagine we have a PostgreSQL database storing information about separate stores, their departments, employees and products of the stores:

Database schema showing Store, Department, Employee, and Product relationships

The DB tables can contain much more fields, but in our case we only care about relations between the models.

So in Django, models.py would look like this:

from django.db.models import Model, CharField, ForeignKey, CASCADE

class Store(Model):
    name = CharField(max_length=200)
    # ...

class Department(Model):
    store = ForeignKey(
        Store, related_name="departments", on_delete=CASCADE
    )
    # ...

class Employee(Model):
    department = ForeignKey(
        Department, related_name="employees", on_delete=CASCADE
    )
    # ...

class Product(Model):
    store = ForeignKey(
        Store, related_name="products", on_delete=CASCADE
    )
    # ...

Seed data (roughly what bit me):

2 stores,
10 departments per store,
500 employees per department - 10 000 employees overall,
2500 products per store.

In one of the places of our system, we show a list of the stores, including the amount of employees in each store. Pretty easy with Django ORM, right?

stores = Store.objects.annotate(
    total_employees=Count("departments__employees")
).values(
    "id",
    # ...
    "total_employees"
)

This query is relatively fast and works like a charm.

The Problem

Now let's imagine we were asked to add another counter: a number of products per store. We already have total_employees, why not just add total_products?

Well, most likely we already have some unit test for our piece of code, which checks the logic on a small amount of data, and before releasing the code, we can figure out that another JOIN was added, some data is duplicated, and instead of just COUNT(...), we switch to COUNT(DISTINCT ...), eventually coming up with something like this:

stores = Store.objects.annotate(
    total_employees=Count("departments__employees",
                          distinct=True),
    total_products=Count("products", distinct=True),
).values(
    "id",
    # ...
    "total_employees",
    "total_products",
)

Looks safe to commit, push, wait for the green tests on CI, merge and deploy.

However, after the deployment you'll almost immediately see that everything hangs. As I said, there's not that many stores, only 2 for now, not that many employees, and not that many departments.

And it takes, like, 10 seconds to fetch the numbers! What's wrong with it?

Let's take a closer look at the generated SQL for this seemingly innocent queryset:

SELECT 
    "shop_store"."id",
    COUNT(DISTINCT "shop_product"."id") AS "total_products", 
    COUNT(DISTINCT "shop_employee"."id") AS "total_employees"
FROM "shop_store"
  LEFT OUTER JOIN "shop_product"
    ON "shop_store"."id" = "shop_product"."store_id"
  LEFT OUTER JOIN "shop_department"
    ON "shop_store"."id" = "shop_department"."store_id"
  LEFT OUTER JOIN "shop_employee"
    ON "shop_department"."id" = "shop_employee"."department_id" 
GROUP BY "shop_store"."id"

And let's check the actual query plan:

GroupAggregate ... rows=2 ...
  ->  Nested Loop Left Join ... rows=25000000 ...
  ...
...
Execution Time: 11376.072 ms

Translation: these three JOINs turn into a 25-million-row cartesian mess before GROUP BY and COUNT(DISTINCT):

Products × Departments × Employees
= 2500 × 10 × 500
= 12 500 000  (per store)
× 2 stores
= 25 000 000 joined rows

The Fix

There are multiple ways of handling this case, but the easiest fix is to use subqueries:

subquery_products = Subquery(
    Product.objects.filter(store_id=OuterRef("pk"))
        .values("store_id")
        .annotate(count=Count("pk"))
        .values("count"),
    output_field=IntegerField()
)
subquery_employees = Subquery(
    Employee.objects.filter(department__store_id=OuterRef("pk"))
        .values("department__store_id")
        .annotate(count=Count("pk"))
        .values("count"),
    output_field=IntegerField()
)
stores = Store.objects.annotate(
    total_products=Coalesce(subquery_products, 0),
    total_employees=Coalesce(subquery_employees, 0),
).values("id", "total_products", "total_employees")

SQL query:

SELECT "shop_store"."id",
       COALESCE((
         SELECT COUNT(U0."id") AS "count"  
         FROM "shop_product" U0  
         WHERE U0."store_id" = "shop_store"."id"  
         GROUP BY U0."store_id"), 0
       ) AS "total_products",
       COALESCE((
         SELECT COUNT(U0."id") AS "count"
         FROM "shop_employee" U0
           INNER JOIN "shop_department" U1
             ON U0."department_id" = U1."id"
         WHERE U1."store_id" = "shop_store"."id"
         GROUP BY U1."store_id"), 0
       ) AS "total_employees"
FROM "shop_store";

Now this one takes a couple of milliseconds with pretty modest and predictable plan:

Seq Scan on shop_store ... rows=2 ...
  SubPlan 1
     -> Seq Scan on shop_product u0 ... rows=2500 ...
  SubPlan 2
     -> Hash Join ... rows=5000 ...
        -> Seq Scan on shop_employee u0_1 ... rows=10000 ...
        -> Hash ... rows=10 ...
...
Execution Time: 5.600 ms

No giant intermediate data sets, just two tiny scans:

before: 11 376 ms (~11 seconds)
after: 5.6 ms (2000x faster)

Takeaways

COUNT(DISTINCT) with multi-branch LEFT JOINs makes the database loop through the entire cartesian product.
Correlated subqueries aggregate each branch separately and scale linearly with data size.
Always test aggregate queries against production-sized data before you ship.

backend python django sql postgres

The Hidden Cost of Test Inheritance

2025-06-23

I'm subscribed to Adam Johnson's blog and usually really enjoy his writing - it's practical, deep, and no-bullshit. But one recent post, Python: sharing common tests in unittest, caught me off guard.

It describes a "neat" pattern: write reusable test logic in a base class, subclass it to test multiple objects, hiding the base class from unittest discovery. While the intent is fine - DRYing out duplicated test code - the result is fragile, confusing, and just not worth it.

Here's why.

The Pattern: DRY Tests via Subclassing

# Sample units to test
class Armadillo:
    def speak(self) -> str:
        return "Hrrr!"

class Okapi:
    def speak(self) -> str:
        return "Gronk!"

# Test module
class BaseAnimalTests(TestCase):
    animal_class: type

    def test_speak(self):
        sound = self.animal_class().speak()
        self.assertIsInstance(sound, str)
        self.assertGreater(len(sound), 0)

class ArmadilloTests(BaseAnimalTests):
    animal_class = Armadillo

class OkapiTests(BaseAnimalTests):
    animal_class = Okapi

del BaseAnimalTests

Yes, it works and it reduces duplication. But it comes at the cost of everything else that makes tests maintainable.

The Problems

IDE and DX Pain

IDE

When a test fails, I want to jump to it in my IDE, set a breakpoint, and debug. With this pattern - good luck.

The method doesn't exist in ArmadilloTests, it's buried in a deleted parent class. You have to manually hunt it down, re-declare the test method just to put a breakpoint and debug it, and pray the animal_class setup matches what failed:

class ArmadilloTests(TestCase):
    animal_class = Armadillo

    def test_speak(self):
        super().test_speak()

breakpoint

It's tedious and wastes time. All this to avoid writing a 3-line test twice?

class ArmadilloTests(TestCase):
    def test_speak(self):
        sound = Armadillo().speak()
        self.assertIsInstance(sound, str)
        self.assertGreater(len(sound), 0)

Clear, simple, debug-friendly. Worth the few extra lines.

CI Failures Are Confusing

If a shared test fails in CI, you get something like:

test_speak (tests.ArmadilloTests.test_speak) ... FAIL
...
Traceback (most recent call last):
  File ".../tests.py", line 20, in test_speak
    self.assertGreater(len(sound), 0)
AssertionError: 0 not greater than 0

But the method isn't defined in ArmadilloTests, and Search everywhere won't help at all:

nothing found

So now you have to reverse-engineer which base class it came from and how to recreate it locally.

This isn't clever. It's just fragile.

When It Kinda Makes Sense

There are rare cases:

dozens of classes implementing the same interface
you're the only one maintaining the codebase
you run everything headless in CI

But even then, you're building test framework plumbing to save what, a hundred lines?

The Clean Alternative: Parametrize It

Pytest Style

@pytest.mark.parametrize('animal_class', [Armadillo, Okapi])
def test_speak(animal_class):
    sound = animal_class().speak()
    assert isinstance(sound, str)
    assert len(sound) > 0

You see all the parameters. You see where the test lives. Failures are explicit:

test_speak[Armadillo] FAILED
test_speak[Okapi] PASSED

You can re-run just the failing test. You can debug with a conditional breakpoint. You don't need to explain how the tests are wired together - because they're not.

unittest Style (Optional, Not Ideal)

from parameterized import parameterized_class

@parameterized_class([
    {'animal_class': Armadillo},
    {'animal_class': Okapi},
], class_name_func=get_class_name)
class AnimalTests(TestCase):
    def test_speak(self):
        sound = self.animal_class().speak()
        self.assertIsInstance(sound, str)
        self.assertGreater(len(sound), 0)

Using parameterized_class from parameterized is still better than inheritance, but clunkier. Output is readable if you customize class_name_func. IDE support isn't great. Pytest remains the better option for anything dynamic.

Final Verdict

Tests should fail clearly, debug easily, and be readable years later. This pattern fails all three.

DRY is good. But in tests, visible duplication beats invisible abstraction.

Adam's trick technically works, but in practice, it makes tests harder to navigate, harder to trust, and harder to work with.

Stick to the boring version - you'll thank yourself later.

backend python testing pytest unittest

Why Django's override_settings Sometimes Fails (and How reload + patch Saved Me)

2025-06-13

Sometimes @override_settings just doesn’t cut it.

I ran into a nasty issue while testing a Django module that relies on global state initialized during import. The usual test approach didn’t work. Here’s what happened and how I solved it.

The Setup

We had a module that builds a global dictionary from Django settings at import time. Let’s call it dragon.py, which takes settings.PUT_EGGS, which is False by default:

from django.conf import settings

DRAGON = {}
...
if settings.PUT_EGGS:
    DRAGON["eggs"] = "spam"

Another module uses DRAGON for core logic, e.g. mario.py:

from myproject.dragon import DRAGON

def find_eggs():
    if "eggs" in DRAGON:
        return "Found eggs!"
    return "Eggs not found"

Now I wanted to write a test that tweaks DRAGON and expects the logic to behave differently. Easy, right?

@override_settings(PUT_EGGS=True)
def test_find_eggs():
    assert find_eggs() == "Found eggs!"

Wrong. The test failed.

The Problem

override_settings works, but only for code that reads settings at runtime.

In my case, DRAGON was already built at import time , before the override kicked in. So it used the old value of PUT_EGGS, no matter what I did in the test.

This is the classic trap of global state baked during import. Welcome to pain town.

The Fix: reload + patch

Here's how I got out:

import importlib
from django.test import override_settings
from unittest.mock import patch
from myproject.mario import find_eggs

@override_settings(PUT_EGGS=True)
def test_find_eggs():
    # Reload the dragon module so DRAGON is rebuilt
    # with updated settings
    from myproject import dragon
    new_dragon = importlib.reload(dragon)

    # Patch the logic module to use the reloaded DRAGON
    with patch('myproject.mario.DRAGON', new_dragon.DRAGON):
        result = find_eggs()
        assert result == "Found eggs!"

Why This Works

importlib.reload(dragon) forces a fresh import of dragon, rebuilding DRAGON with the overridden settings;
dragon.DRAGON is updated in the scope of the test only, i.e. mario module still has the stale version of DRAGON;
patch(...) solves this problem by swapping the old DRAGON in mario with the freshly rebuilt one.

This is surgical. Ugly, but effective.

Lessons Learned

Avoid putting non-trivial logic at module scope, especially if it depends on Django settings. Wrap it in a function or lazy loader.
If you're stuck with global state, reload() and patch() give you a way out - just be careful about cascading dependencies.

If you’ve ever had a test mysteriously fail after overriding settings, this might be why.

backend python django mock testing importlib

Calculating the next run date of a Celery periodic task

2025-06-11

The problem

You have a periodic task in Celery defined with a crontab(...) schedule, and you want to calculate the next time it's supposed to run.

Example: you want to find out when crontab(hour=12, minute=0) will trigger next after now.

Simple, right? There’s croniter library, which seems to be designed to solve this exact problem. Just use it, right?

Well.

First mistake: trying croniter with crontab

So my first instinct was to use croniter like this:

from celery.schedules import crontab
from croniter import croniter
from datetime import datetime

schedule = crontab(hour=12, minute=0)
cron = croniter(schedule, datetime.now())
next_run = cron.get_next(datetime)

Boom 💥 doesn’t work. Because Celery’s crontab is not a string and croniter expects a string like "0 12 * * *":

AttributeError: 'crontab' object has no attribute 'lower'

And no, crontab() does not have a nice .as_cron_string() method either.

So now you’re stuck parsing crontab's internal fields (._orig_minute, ._orig_hour, etc) just to reconstruct a string - and it starts to smell like overengineering for something that should be simple.

The right way (which I learned too late)

Turns out Celery’s crontab (and all schedules derived from celery.schedules.BaseSchedule) already has a method for this:

from datetime import datetime
from celery.schedules import crontab

schedule = crontab(hour=12, minute=0)
now = datetime.now()
# `now` is datetime.datetime(2025, 6, 11, 0, 16, 58, 484085)
next_run = now + schedule.remaining_delta(now)[1]
# `next_run` is datetime.datetime(2025, 6, 11, 12, 0)

That’s it. You don’t need croniter at all. Celery knows how to calculate the delta to the next run. It just doesn’t shout about it in the docs.

Summary

don’t reinvent Celery’s scheduling logic - it already has what you need;
crontab is not a cron string, don’t try to treat it like one;
use .remaining_delta(last_run) to calculate when the next run will happen.

Hope this saves someone the 2 hours I wasted trying to do it the wrong way.

backend cron celery python croniter

Pytest Fish shell autocompletion

2024-12-29

^TL;DR https://github.com/ddoroshev/pytest.fish

Typing repetitive commands or copying and pasting test names can eat up valuable time. To help, I've created pytest.fish - a Fish shell plugin that simplifies your pytest workflow. It's lightweight, simple to set up, and makes testing more efficient.

How to Use

Autocomplete test paths

Type pytest and hit TAB to get suggestions for test paths and functions:

Support for `-k` filter

Narrow down tests with -k and get name suggestions:

The plugin dynamically scans your project, so suggestions stay up-to-date.

Installation

Install with Fisher:

fisher install ddoroshev/pytest.fish

Or manually copy the files from the repository into your Fish configuration.

How It Works

The plugin doesn't rely on pytest directly (yet). Instead, it scans the current directory for test files and searches for test functions inside them, making the process relatively fast and efficient.

Other shells?

Since I primarily use Fish in my local development environment, I created a plugin specifically for this shell. However, if you use Bash or Zsh, feel free to create your own - or switch to Fish already. 😉

pytest fish productivity open-source

Spense.app v0.2

2024-03-01 RU

Spense.app is under active development and is not available for public use.

This article is a translation and adaptation of my article in Russian.

Hey everyone! I've finished working on the next version of Spense with a bunch of improvements and as per tradition I'm sharing the most interesting parts.

Accounts and Wallets Page

In the app interface, you can now manage your wallets and view the current balance:

Spense.app changelog

Under the Hood of Spense.app: The Code.

2024-01-29 RU

This article is a translation and adaptation of my article in Russian.

While Spense v0.2 is under development, I want to talk about the internal organization of the application from a technical perspective. This article is mainly for web developers, and it's written in the corresponding language, so if you're reading and don't understand anything, that's okay, you can just skip it.

In a Nutshell

Backend on Django (Python), frontend on Django templates and Bootstrap, with a pinch of JavaScript ~~and some htmx~~ (not anymore).

Why So Boring?

Sounds not very hype, right. But remember that Spense in its current state is not a full-fledged product. It's more of a prototype, in which I often need to change things and test ideas. If the ideas work, I'll throw this code away and write another; if they don't, I'll just keep it for memory.

So, I chose Django not because I love it (actually, I hate it), but because I've gotten used to it over the last year, and it allows me to prototype the application easily and quickly.

Spense.app python django css js programming

Spense.app v0.1

2024-01-20 RU

Spense.app is under active development and is not available for public use.

This article is a translation and adaptation of my article in Russian.

I've rolled out version 0.1 of my app for tracking finances and decided to report on the progress. To be honest, it's not an actual mobile app, but a Progressive Web App - essentially a website that you can open on your phone, add to the Home Screen, and it will launch in full screen without browser panels. I don't yet know how to make real apps, but even in its current form, it works quite well.

By the way, there's even an icon already, which I created in ChatGPT/DALL-E:

As you've already guessed, I named it Spense. I came up with this name a long time ago by trimming the word "expense", and I bought the domain back then, which I'm only now starting to use more or less.

Currently, I'm the only user, there's no registration, and I think at least until version 1.0 everything will remain closed, but I'm happy to show you what's going on there now and what has changed.

Spense.app changelog

I started making a personal finance app. Why?

2024-01-07 RU

This article is a translation and adaptation of my article in Russian.

Since around 2014, I've been keeping track of my finances in Google Spreadsheets. It always went like this: 2-3 times a week, I'd sit down at the computer, gather receipts, go through the transaction history in banking apps, recall expenses from memory, and record them in a spreadsheet.

The spreadsheet had one row for each day, and columns for accounts, wallets, and a couple of calculated fields. There was also a "Notes" field where I would describe in almost free form where the money went.

An example from 2016. I have converted the original prices from Russian roubles to euros for clarity.

As you can see, I didn't exactly have a lot of money. Probably because I mostly ate and drank coffee instead of working, but the point is that I didn't want to economize on these daily things at all. Saving on everyday items and cutting back is fundamentally unnatural for me, and I've always tried to avoid it. Therefore, the question "Where does the money go?" didn't concern me at that moment, but more interesting questions did:

How much money do I have right now?
How much did I have a month/six months/a year ago? Have I become richer or poorer?
Can I afford to spend on a vacation/buy a new phone/go to a private clinic right now? Will I go into the red by the next paycheck?
What should my income be so that with my current spending, I start saving any money at all?
How soon will I start starving if I lose my job?

Spense.app money idea programming

Abstractions and Inheritance in C - Elegant Foot-Shooting

2023-10-01 RU

^TL;DR https://github.com/ddoroshev/c-inheritance

Sometimes you want to abstract and generalize something in C code. For example, if you want to print the contents of a structure multiple times, you end up writing printf("%s %d %f\n", foo->bar, foo->baz, foo->boom) everywhere like a fool, and it intuitively seems that there should be a way to do foo->print(foo), and not just with foo, but with any structure.

Let's take an example: there is a guy with a first name and a last name, and there is a bird that has a name and an owner.

typedef struct Person Person;
struct Person {
    char *first_name;
    char *last_name;
};

typedef struct Bird Bird;
struct Bird {
    char *name;
    Person *owner;
};

To print information about these animals, a cunning C programmer would simply write two functions:

void Person_Print(Person *p) {
    printf("%s %s\n", p->first_name, p->last_name);
}

void Bird_Print(Bird *b) {
    printf("%s of %s %s\n", b->name, b->owner->first_name, b->owner->last_name);
}

And they would be right! But what if we have many such structures and our brains are corrupted by OOP?

c oop insane

GraphQL and Python

2023-05-31

python graphql pain oss

This Week in Changelogs: flask, pytest, IPython, etc

2023-03-07

pyenv 2.3.13, 2.3.14

Highlights from the changelog:

added versions 3.10.10, 3.11.2, and 3.12.0a5;
fixed versions 3.5.10 and 3.6.15 for macOS and modern 64-bit platforms.

This one made me laugh a bit:

a7b181c introduce an indentation issue
3829742 fix the indendation issue.

That's how programming actually works!

TIL: head -n123 is a part of POSIX, head -123 is a shorthand that can be missing in some operating systems (pull request).

IPython 8.11.0

Highlights from the changelog:

%autoreload supports meaningful parameters (%autoreload all, %autoreload off, etc), not only numbers (%autoreload 0, %autoreload 2, etc).

I like the log of the pull request, it illustrates the approach of implementing a feature step-by-step, one frame at a time:

a7c76db get a dumb and working solution
d8bb14b optimize it
809ebdf and make it nice

Also, this fragment is quite interesting, print and logger.info need to be used carefully for logging and protected from being overwritten during hot-reload:

p = print
logger = logging.getLogger("autoreload")
l = logger.info

def pl(msg):
    p(msg)
    l(msg)

Everything you wanted to know about GitHub actions:

flask 2.2.3

Although the changelog is not that big, I like the thing about flask run --debug.

Previously, it was flask --debug run , and it was awkward. The fix itself is quite small, but there's a lot of changes in docs, and also a PyCharm screenshot was changed. Nice and pure!

pytest 7.2.1, 7.2.2

The changelogs contains mostly bug fixes. One of them is about pytest.approx() causing ZeroDivisionError on dicts.

Another one fixes type checkers behaviour for the following code, which I think should be illegal:

with pytest.raises(RuntimeError) if val else contextlib.nullcontext() as excinfo:

(Please, don't write the code like this.)

And they fixed a race condition when creating directories in parallel, using os.makedirs(..., exists_ok=True). Simple, but helpful.

whitenoise 6.4.0

The changelog mentions support for Django 4.2. It was good to know, by the way, that STATICFILES_STORAGE is going to be changed to STORAGES dict (pull request).

django-cors-headers 3.14.0

Changelog:

added support for Django 4.2,
switched from urlparse to urlsplit.

The latter is the most interesting, urlsplit is slightly faster. Also, it's cached, so sometimes you gain a huge performance.

The difference between these functions is that urlparse includes parsing of the "parameters" section of a URL:

scheme://netloc/path;parameters?query#fragment
                     ^ this

Since it's not widely used, in most cases it's safe to switch from urlparse to urlsplit.

twic TIL pyenv ipython flask pytest

This Week in Changelogs: Django and faker

2023-02-17

Django 4.1.6, 4.1.7

4.1.6: release notes, blog post
4.1.7: release notes, blog post

9d7bd5a An interesting bug of parsing the Accept-Language header. The format of the header value is complex, so there's a bunch of regular expressions and @functools.lru_cache(maxsize=1000) for caching the result. However, you can pass a huge header multiple times, causing DoS, so they added two if statements:

one that checks if the length is less than ACCEPT_LANGUAGE_HEADER_MAX_LENGTH
second - for checking the comma-separated strings. So they decided not to just raise an exception or truncate the string by [:ACCEPT_LANGUAGE_HEADER_MAX_LENGTH], but truncate the value in a safe way, so it can be parsed in a meaningful result. Good job!

26b7a25 There was a bug in generated SQL, caused by that .desc() in the model's Meta.constraints:

constraints = [
    UniqueConstraint(
        Lower("name").desc(), name="unique_lower_name"
    )
]

which resulted in <...> WHERE LOWER("myapp_foo"."name") DESC <...> when checking the uniqueness. Apparently, Django can check the constraints itself, not delegating it to the underlying database.

Although the fix is trivial, the case is not, and it wasn't covered in the initial implementation.

By the way, I like how they use typographic double quotes:

msg = "Constraint “name_lower_uniq_desc” is violated."

a637d0b f3b6a4f Those black updates are annoying, mainly because they make git blame misleading. However, there's a solution I didn't know about:

git blame --ignore-revs-file <file> - ignore commits listed in the file.
.git-blame-ignore-revs file - make GitHub ignore them as well.

590a92e The bug was caused by the commit which we've already seen. Now you can safely raise ValidationError without the code.

628b33a One more DoS fix, now it's about number of opened files when you put too many of them in one multipart payload. The fix introduces TooManyFilesSent exception, which results in HTTP 400 (DATA_UPLOAD_MAX_NUMBER_FILES = 100 by default).

I like this fragment:

try:
    return self._parse()
except Exception:
    if hasattr(self, "_files"):
        for _, files in self._files.lists():
            for fileobj in files:
                fileobj.close()
    raise

Beware of freeing your resources, garbage collector can't help you all the time!

faker 16.6.1..17.0.0

Their CHANGELOG is quite descriptive, so I'll just highlight something that I liked.

faker can generate valid image URLs using specific websites (TIL), and one of them, PlaceIMG, is shutting down, and they removed it from the list. The announcement is included in all the generated images:

Biased booleans introduced.
Added emoji provider 🎉 🥳
Added new es_AR provider, but for some reason it's not in the reflected in the CHANGELOG.
Black formatting - always beautiful.

In addition, it turned out that GitHub can put those linter errors from the actions right in the code. I don't know yet how to add this, but I definitely want it!

twic TIL django faker

Go: Delayed file system events handling

2022-06-20 RU

Suppose you need to do something when some file system event occurs. For example, restart the web server when files change. Quite a common practice in development: recompile the project and restart it on the fly immediately after editing the source files.

This site is written in Go, and I recently decided to add a similar hot-reload for markdown files with articles: make the web server notice a new file in it's data directories and repopulate the internal in-memory-storage without restarting itself. And I wanted to "listen" to the file system, not scan it every few seconds.

Fortunately, there is already a good listener fsnotify, which can observe given directories. (Not recursively though, but I don't have that many directories.)

The README gives a pretty clear example. I wrapped it in the Watcher() function and added my channel there, which I send something to once an event happen. I don't care about the type of events, so I just always send 1. Also, I wrapped the Watcher() function into another Watch() function, which executes the reload function passed to it at every FS change.

Something like this (error handling and some unimportant stuff intentionally left out):

func main() {
    dirs := []string[
        "/foo",
        "/bar",
    ]
    go Watch(dirs, func() {
        reloadStuff()
    })
}

func Watch(dirs []string, reload func()) {
    ch := make(chan int)
    go watcher(dirs, ch)

    // Execute the reload function on each
    // file system event
    for range ch {
        reload()
    }
}

func Watcher(dirs []string, ch chan int) {
    watcher, _ := fsnotify.NewWatcher()
    defer watcher.Close()

    done := make(chan bool)
    go func() {
        for {
            select {
            case event, _ := <-watcher.Events:
                // Send a notification to the channel
                // on any event from the watcher
                ch <- 1
            case err, _ := <-watcher.Errors:
                log.Println("error:", err)
            }
        }
    }()
    for _, dir := range dirs {
        watcher.Add(dir)
    }
    <-done
}

I tried to run ran some tests and found that more than one event can be triggered when changing a file (e.g. CHMOD and WRITE sequentially). And if multiple files are changed at once (git checkout, rsync, touch *.*), there'll be even more events, and my hot-reload is triggered on each of them.

In fact, I only need to trigger it once, if a lot of events came in a short period of time. That is, accumulate them, wait half a second, and if nothing else came, do the thing.

To my shame, I couldn't come up with a good solution on my own, but I noticed that CompileDaemon, which I use in development to recompile the source code, works exactly as I want. The solution from there is elegant (as elegant as it can be in Go), and it's about using time.After(): it starts a timer and sends the current time to the return channel after a specified interval.

As a result, the Watch() function has become the following:

func Watch(dirs []string, reload func()) {
    ch := make(chan int)
    go watcher(dirs, ch)

    // A function that returns the channel, which receives
    // the current time at the end of the specified time interval.
    createThreshold := func() <-chan time.Time {
        return time.After(time.Duration(500 * time.Millisecond))
    }

    // `threshold := createThreshold()` is also acceptable,
    // if you want to trigger reload() on the first run.
    // I don't need that, so an empty channel is enough.
    threshold := make(<-chan time.Time)
    for {
        select {
        case <-ch:
            // At each event, we simply recreate the threshold
            // so that `case <-threshold` is delayed for another 500ms.
            threshold = createThreshold()
        case <-threshold:
            // If nothing else comes into the `ch` channel within 500ms,
            // trigger the reload() and wait for the next burst of events.
            reload()
        }
    }
}

It worked exactly as I wanted it to, and now I can update all the data files through rsync without restarting the web server within 500ms.

I invented PHP.

go concurrency channels fs

Docker Buildkit: the proper usage of --mount=type=cache

2022-02-17 RU

^TL;DR The contents of directories mounted with --mount=type=cache are not stored in the docker image, so it makes sense to cache intermediate directories, rather than target ones.

In dockerfile:1.3 there is a feature of mounting file system directories during the build process, that can be used for caching downloaded packages or compilation artifacts.

For example, the uwsgi package must be compiled every time it is installed, and at first glance, build times can be reduced by making the entire Python package directory cacheable:

# syntax=docker/dockerfile:1.3
FROM python:3.10

RUN mkdir /pip-packages

RUN --mount=type=cache,target=/pip-packages \
      pip install --target=/pip-packages uwsgi

> docker build -t pip-cache -f Dockerfile.pip .
# ...
[+] Building 14.6s (7/7) FINISHED

Looks like everything went well, but the target directory is empty:

> docker run -it --rm pip-cache ls -l /pip-packages
total 0

Something is definitely wrong. You can see that during the build uWSGI was compiled and installed. You can even check it adding ls in the build process:

RUN --mount=type=cache,target=/pip-packages \
      pip install --target=/pip-packages uwsgi \
      && ls -1 /pip-packages

> docker build -t pip-cache --progress=plain -f Dockerfile.pip .
<...>
#6 12.48 Successfully installed uwsgi-2.0.20
<...>
#6 12.91 __pycache__
#6 12.91 bin
#6 12.91 uWSGI-2.0.20.dist-info
#6 12.91 uwsgidecorators.py
#6 DONE 13.0s
<...>

Everything is in its place. But the final image is empty again:

> docker run -it --rm pip-cache ls -l /pip-packages
total 0

The thing is, the /pip-packages catalog, that is inside the image, and the catalog, that is in RUN --mount=type=cache,target=<dirname>, are completely different. Let's try to put something inside this directory and track its contents during the build process:

RUN mkdir /pip-packages \
    && touch /pip-packages/foo \
    && ls -1 /pip-packages

RUN --mount=type=cache,target=/pip-packages \
    ls -1 /pip-packages \
    && pip install --target=/pip-packages uwsgi \
    && ls -1 /pip-packages

RUN ls -1 /pip-packages

> docker build -t pip-cache --progress=plain -f Dockerfile.pip-track .
<...>
#5 [stage-0 2/4] RUN mkdir /pip-packages
      && touch /pip-packages/foo
      && ls -1 /pip-packages
#5 sha256:fb542<...>
#5 0.211 foo  👈1️⃣
#5 DONE 0.2s

#6 [stage-0 3/4] RUN --mount=type=cache,target=/pip-packages
      ls -1 /pip-packages
      && pip install --target=/pip-packages uwsgi
      && ls -1 /pip-packages
#6 sha256:10ed6<...>
#6 0.292 __pycache__            👈2️⃣
#6 0.292 bin
#6 0.292 uWSGI-2.0.20.dist-info
#6 0.292 uwsgidecorators.py
#6 2.802 Collecting uwsgi       🤔3️⃣
#6 3.189   Downloading uwsgi-2.0.20.tar.gz (804 kB)
#6 4.400 Building wheels for collected packages: uwsgi
<...>
#6 13.34 __pycache__            👈4️⃣
#6 13.34 bin
#6 13.34 uWSGI-2.0.20.dist-info
#6 13.34 uwsgidecorators.py
#6 DONE 13.4s

#7 [stage-0 4/4] RUN ls -1 /pip-packages
#7 sha256:fb6f4<...>
#7 0.227 foo  👈5️⃣
#7 DONE 0.2s
<...>

1️⃣ file foo created successfully
2️⃣ the directory with the results of the previous docker build was mounted, and there's no foo file
3️⃣ uWSGI is downloaded, compiled and installed again
4️⃣ an updated uWSGI package appeared in the catalog
5️⃣ only the file foo is left in the directory

This means that --mount=type=cache only works in the context of a single RUN instruction, replacing the directory created inside the image with RUN mkdir /pip-packages and then reverting it back. Also, caching turned out to be ineffective because pip reinstalled uWSGI with a full compilation.

In this case, it would be correct to cache not the target directory, but /root/.cache, where pip stores all the artifacts:

RUN --mount=type=cache,target=/root/.cache \
    pip install --target=/pip-packages uwsgi

> docker build -t pip-cache -f Dockerfile.pip-right .
> docker run -it --rm pip-cache ls -1 /pip-packages
__pycache__
bin
uWSGI-2.0.20.dist-info
uwsgidecorators.py

Now everything is in place, the installed packages have not disappeared.

Let's check the effectiveness of caching by adding the requests package:

RUN --mount=type=cache,target=/root/.cache \
    pip install --target=/pip-packages uwsgi requests
                                                👆

> docker build -t pip-cache --progress=plain -f Dockerfile.pip-right .
<...>
#6 6.297 Collecting uwsgi
#6 6.297   Using cached uWSGI-<...>.whl  👈
#6 6.561 Collecting requests
#6 6.980   Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
<...>

pip used the pre-built wheel file from /root/.cache and installed a ready-to-use package from it.

All sources are available on GitHub.

docker cache python

Fast commit and push

2021-09-10 RU

I want to share a shell-function called gacp (Git Add, Commit and Push), which I came up with a few months ago and have been using about every hour since then:

# fish
function gacp
    git add .
    git commit -m "$argv"
    git push origin HEAD
end

# bash/zsh
function gacp() {
    git add .
    git commit -m "$*"
    git push origin HEAD
}

Usage example:

> gacp add some new crazy stuff
[master fb8dcc9] add some new crazy stuff
 <...>
Enumerating objects: 12, done.
 <...>
To github.com:foo/bar.git
   912c95d..fb9dcc9  master -> master

No more chains with && and quotes for verbose messages!

git shell

The Demo Models

The Problem

The Fix

Takeaways

The Pattern: DRY Tests via Subclassing

The Problems

IDE and DX Pain

CI Failures Are Confusing

When It Kinda Makes Sense

The Clean Alternative: Parametrize It

Pytest Style

unittest Style (Optional, Not Ideal)

Final Verdict

The Setup

The Problem

The Fix: reload + patch

Why This Works

Lessons Learned

The problem

First mistake: trying croniter with crontab

The right way (which I learned too late)

Summary

How to Use

Autocomplete test paths

Support for -k filter

Installation

How It Works

Other shells?

Accounts and Wallets Page

In a Nutshell

Why So Boring?

pyenv 2.3.13, 2.3.14

IPython 8.11.0

flask 2.2.3

pytest 7.2.1, 7.2.2

whitenoise 6.4.0

django-cors-headers 3.14.0

Django 4.1.6, 4.1.7

faker 16.6.1..17.0.0

Support for `-k` filter