Django + Postgres: The Hunt for Long Running Queries
Using django-pgactivity for application-level monitoring of database queries.
Database queries sometimes take a lot longer to complete than the average developer would expect. Seemingly innocent actions such as searching can produce very complex SELECT statements which take what feels like an eternity to finish. If you get a few users performing these heavy queries in parallel, your app’s performance may begin to suffer.
In this article we reflect on how we at Intempus implemented some basic tooling to help prevent getting into the scenario described above. In particular, we describe a way to detect long running active queries and also a way in which we can set a timeout on a query-by-query basis. We speak in terms of implementing these ideas in Django and Postgres, but most concepts should be transferable to other frameworks and databases.
The Set-Up
I work as a Software Engineer at Intempus, a time-registration software company based in Denmark. I help develop the backend of the app and website, which is built with Django on top of a Postgres database.
Our website offers users the ability to filter data on an almost limitless number of combinations of different columns. It also allows users to perform full-text searches which match on multiple “simple” columns together (e.g. “name” and “number”).
The Incident
Recently we noticed some transient but significant performance degradation of the website as a whole. We were eventually able to trace this back to one of our bigger customers repeatedly performing the same innocent-looking, but apparently very expensive, search query. Without going into too much detail, the user was essentially attempting to filter by ID on a specific foreign key column combined with a search on a “name” column. The below is a simplified version of how this would be mapped into SQL:
SELECT * FROM "table_1" INNER JOIN "table_2" ON ("table_1"."foreign_key_id" = "table_2"."id") WHERE (UPPER("table_2"."name"::text) LIKE UPPER('%Paul%') AND "table_1"."another_foreign_key_id" = 12345)
In reality, the SQL produced on our server was much more complex, since the search must also pass through an involved authorisation layer. Still, it was surprising to us that this query would take on the order of minutes to complete, especially when similar looking queries were so fast.
What next?
There are a lot of different ways to start tackling this issue. Some possibilities are:
Perform deep analysis on why this specific query is so slow and attempt to optimise it.
Restructure our authorisation framework to play more nicely with these kinds of searches.
Limit the kinds of searches a user can perform. For example, can we move away from expensive LIKE-type queries?
All of these approaches are valid and likely to be fruitful. However, none of them guarantee that another slow search query won’t slip through the cracks and hurt our app’s performance. We want a way to detect these queries before they cause an issue. Even better, we want to find a way to ensure these queries automatically cancel themselves if they’re running over a certain time limit.
The Hunt
To hunt for long running active queries, we made use of the wonderful django-pgactivity library, written by Wes Kendall. As the README states, this library “makes it easy to view, filter, and kill active Postgres queries”. The crucial thing here is that it allows us to do this easily from the application level using Python and Django abstractions. This breaks down the barriers that many Django developers may feel are present when trying to inspect something in the database which would otherwise need to be done with a raw SQL query.
django-pgactivity allows us to monitor queries with a Django management command. If we want to set up monitoring for all active queries which have been running for longer than 1 minute, it's as simple as having some process run the following command:
python manage.py pgactivity -f state=ACTIVE -f 'duration__gt=1 minute'
This will produce some output in the format below:
39225 | 0:01:32 | IDLE_IN_TRANSACTION | None | lock auth_user in access exclusiv
39299 | 0:00:15 | ACTIVE | None | SELECT "auth_user"."id", "auth_user"."password
39315 | 0:00:00 | ACTIVE | None | WITH _pgactivity_activity_cte AS ( SELECT pid
You can refer to the django-pgactivity documentation to find out more about what the columns and values in this output represent.
Tracing queries back to their origin
As application developers, even when we have access to a log full of long running queries in the form of SQL statements, it can be hard to know what to do next. Ideally, we would like a way to easily trace back to where in the application these SQL queries came from. Once an application developer has that information, they have a starting point for investigating how to optimise or eliminate the slow queries.
django-pgactivity gives us the ability to attach context to an SQL query. For our purposes, we can make use of its middleware. With the middleware installed, any SQL query issued during a request’s lifecycle will have a comment injected into its SQL statement. This comment contains information on both the URL and the HTTP method of the request which issued the SQL query.
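Enabling this is a settings change. Below is a minimal sketch of the relevant parts of a Django settings module; the pgactivity entries follow the library's documentation, while the other middleware lines are just placeholders for whatever your project already uses:

```python
# settings.py (sketch): register django-pgactivity and its middleware.
INSTALLED_APPS = [
    # ... your existing apps ...
    "pgactivity",
]

MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
    # ... your other middleware ...
    # Injects a comment with the request's URL and method into each query.
    "pgactivity.middleware.ActivityMiddleware",
]
```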
With this additional context, django-pgactivity’s output will look something like the below:
39299 | 0:00:15 | ACTIVE | {"url": "intempus.dk/web/employee/search/", "method": "GET"} | SELECT "auth_use ...
Now we can not only see which SQL queries are taking a long time, but also where in the application they came from. The context tells us explicitly that the long running queries originate from the intempus.dk/web/employee/search/ endpoint. Django developers at Intempus don’t need to be Postgres experts to know where to look: that endpoint is what we must try to optimise.
The Kill
Although we were feeling more secure with our new monitoring in place, we felt more could be done. In particular, we felt there should be a Postgres-level timeout on queries coming from searches, one which meets the following criteria:
The timeout applies only to queries coming from the search. We do not want a global timeout to be applied across all database queries (since some actions are expected to take longer than others).
We want to be alerted when the query does time out and we want that alert to contain information about the query.
To achieve the above, we can utilise django-pgactivity’s timeout decorator, which ensures that any SQL query issued within its scope is automatically terminated once it runs longer than a specified timeout.
For our purposes, we can use it to decorate our search view:
import pgactivity
from django.http import HttpRequest


# Cancel any query issued while handling a search request
# once it has been running for longer than 60 seconds.
@pgactivity.timeout(60)
def search(request: HttpRequest):
    ...
Now whenever a user performs a search which leads to a query taking more than 1 minute, a Django OperationalError will be raised and the underlying SQL query cancelled. Under the hood, pgactivity sets a local PostgreSQL statement_timeout value, which applies to any transaction within the function scope.
We as application developers can then decide how to handle the OperationalError. One useful course of action would be to log the exception along with information from the request, and return a user-friendly error message to the user.
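As a sketch of that pattern: the helper below is our own invention, not part of django-pgactivity, and the fallback OperationalError definition only exists so the snippet runs outside a Django project (inside one, the real django.db.utils.OperationalError is raised by the timed-out query):

```python
import logging

try:
    # Inside a Django project, this is the exception raised
    # when Postgres cancels a statement that hit its timeout.
    from django.db.utils import OperationalError
except ImportError:
    # Fallback so this sketch runs standalone, outside Django.
    class OperationalError(Exception):
        pass

logger = logging.getLogger(__name__)


def run_search_safely(request, run_search):
    """Run a search, converting a query timeout into a friendly response.

    `run_search` stands in for whatever function actually executes
    the (timeout-decorated) search query.
    """
    try:
        return {"status": 200, "results": run_search(request)}
    except OperationalError:
        # Log enough context to trace the slow query back to its origin.
        logger.exception(
            "Search query timed out: path=%s", getattr(request, "path", "?")
        )
        return {
            "status": 503,
            "error": "Your search took too long. Please narrow it and try again.",
        }
```

In a real Django view you would return an HttpResponse rather than a dict; the dict just keeps the sketch framework-free.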
For some of my other work, see https://github.com/PaulGilmartin/