Finding Lookup Items that Are Not Used
The Django ORM is quite flexible and provides a lot of ways to get at the data you need without having to write raw SQL, although that option is always available. Every once in a while someone on the #django IRC channel will ask about LEFT JOIN functionality. Although, the ORM doesn’t support LEFT JOINs “out of the box” (as best as I can tell), there are creative approaches to the problem.
Setting Up
Say we have a hypothetical problem of trying to find all items in a lookup table that are not being used in the master table. Why don’t you join me (excuse the pun) and we can have some shell fun. Start with the following model structure:
class Section(models.Model):
name = models.CharField(maxlength=50)
slug = models.SlugField(prepopulate_from=("name",),)
position = models.IntegerField(default=1)
class Post(models.Model):
pub_date = models.DateTimeField()
title = models.CharField(maxlength=255)
slug = models.SlugField(prepopulate_from=("title",), unique_for_date='pub_date')
summary = models.TextField(help_text='Use regular non formatted text.')
body = models.TextField()
section = models.ForeignKey(Section)
I’ve removed all the noise so we can keep things clean. The key items here is that there is a ForeignKey
relationship between Posts and Sections. In Rails terminology we would say that each Post belongs to a Section and each Section has many Posts. You get the idea.
Let’s continue by putting in some data so we have something to work with:
>>> import datetime
>>> from blog.models import Section, Post
>>> s = Section.objects.create(name='Django', slug='django', position=1)
>>> Section.objects.create(name='Javascript', slug='javascript', position=2)
>>> Section.objects.create(name='Rails', slug='rails', position=3)
>>> Section.objects.all()
[, , ]
>>> p = Post(pub_date=datetime.datetime.now(), title='Django Rocks!', slug='django-rocks', summary='Article in which I explain the rockingness of Django.', body='Enough said', section=s)
>>> p.save()
>>> p
At this point we should have three Sections: Django, Javascript, and Rails, and one post that is assigned to the Django Section. Now that we have everything set up and we’re all on the same page, let’s play around a bit.
Common Foreign Key Usage
It’s pretty easy to find the Section for a Post:
>>> p.section
If a model has a ForeignKey
we can access the related object through the model attribute.
Likewise if we have a particular section we can find all Posts that tied to it:
>>> s.post_set.all()
[]
Django automatically interjects a manager to the foreign master table called FOO_set, by default, where FOO is the master table name in lowercase. You can override this default using the related_name
parameter. The Django documentation explains this quite well.
If our ForeignKey
field (in this case section) allows nulls, we can easily find all Posts where Section is not specified. Note we didn’t set up our model or data to handle this scenario, but it is here just for completeness.
>>> Post.objects.filter(section__isnull=True)
[]
How it Looks in SQL
So back to our original problem. Things get a bit more difficult when we want to find “orphaned” records. What if we want to find all Sections that do not have any Posts?
Sometimes it is easiest to think about how you would get what you need in SQL and then work backwards from there. Normally in SQL I would use one of the following queries:
SELECT *
FROM blog_section
LEFT OUTER JOIN blog_post
ON blog_section.id = blog_post.section_id
WHERE blog_post.id IS NULL
or
SELECT *
FROM blog_section
WHERE blog_section.id NOT IN (SELECT DISTINCT blog_post.section_id FROM blog_post)
My personal preference is the second approach, but both will get you the same results. I’ve kept any kind of optimizations out of the equation to keep things simple.
Throughout I will be viewing the generated SQL using the helper method I posted about previously. If you’re following along, be sure to import it.
Using Q Objects
So back to the Django ORM. Regarding the first approach, although using LEFT JOINs are not possible with the standard ORM, they can be accomplished using a custom Q object. For instance, some folks have created custom Left Outer Join Q objects here and here. When get_sql()
is called on a Q object, it returns a tuple of joins
, where
, and params
. The joins
argument is a dictionary that describes the join condition for the generated sql clause. The join condition can be modified to do what we want. Using one of these snippets I can do the following:
>>> from django.db.models.query import Q, QAnd, QOr
>>> from utils import QLeftOuterJoins
>>> q = Section.objects.filter(QLeftOuterJoins(Q(post__id__isnull=True)))
>>> q
[, ]
>>> show_sql(q)
'SELECT "blog_section"."id", "blog_section"."name", "blog_section"."slug",
"blog_section"."position"
FROM "blog_section"
LEFT OUTER JOIN "blog_post" AS "blog_section__post"
ON "blog_section"."id" = "blog_section__post"."section_id"
WHERE ("blog_section__post"."id" IS NULL)
ORDER BY "blog_section"."position" ASC'
Looks like we nailed it with a little bit of outside help.
Using Custom SQL Or Multiple Queries
The only two methods I’ve come up with for duplicating the second (NOT IN) approach is to:
1. Write custom sql for the sub select. There are lots of approaches here, but here’s just one.
>>> q = Section.objects.extra(where=["id NOT IN (SELECT DISTINCT section_id FROM blog_post)"])
>>> q
[, ]
>>> show_sql(q)
'SELECT "blog_section"."id", "blog_section"."name", "blog_section"."slug",
"blog_section"."position"
FROM "blog_section"
WHERE id NOT IN (SELECT DISTINCT section_id FROM blog_post)
ORDER BY "blog_section"."position" ASC'
2. Use multiple queries. For instance we can find the distinct items that exist in the master table and use a NOT IN syntax to exclude those items. For example (note my Python is not very good, so if there’s a more concise way, please let me know):
>>> posts = Post.objects.all()
>>> s = [p.section_id for p in posts]
>>> s
[1]
>>> q = Section.objects.exclude(id__in=s)
>>> q
[, ]
>>> show_sql(q)
'SELECT "blog_section"."id", "blog_section"."name", "blog_section"."slug",
"blog_section"."position"
FROM "blog_section"
WHERE ((NOT ("blog_section"."id" IN (1))))
ORDER BY "blog_section"."position" ASC'
So we get the right answer, but it’s surely not the most efficient method. We could optimize it a bit by finding the distinct section_id
s in use, but then we’re back to writing custom sql.
Using Extra()
Although it may not be a common usage approach there is actually a third way we can find the “orphaned” sections using SQL. Although the results turn out to be quite odd if you’re using straight SQL, it will help us get to where we need to be:
SELECT *
FROM "blog_section" , "blog_post"
WHERE blog_section.id NOT IN (blog_post.section_id)
As it turns out we can achieve the above using the standard Django ORM through the use of the extra()
method. James Bennett mentioned that it was possible, but I never caught the solution. So I poked around for a few hours and was finally able to come up with an answer:
>>> q = Section.objects.extra(tables=["blog_post",], where=["blog_section.id NOT IN (blog_post.section_id)",])
>>> q
[, ]
>>> show_sql(q)
'SELECT "blog_section"."id", "blog_section"."name", "blog_section"."slug",
"blog_section"."position"
FROM "blog_section" , "blog_post"
WHERE blog_section.id NOT IN (blog_post.section_id)
ORDER BY "blog_section"."position" ASC'
Perfect! That’s what we were looking for.
I don’t know about you, but I’m definitely exhausted. Now clearly there are probably ten more ways to attack this problem. What other approaches have you used to arrived at a solution to this problem?