django query filter manytomany count exists

Django: Filter Query Objects Where ManyToMany Exists

Django comes with a powerful ORM framework by which one can easily design, implement, and interact with databases. The ManyToMany relationship field provides intuitive access to semi-complex data relationships. Knowing how to leverage it within Django filters can make your queries more efficient.

Example Database Models

To illustrate the ManyToMany field in action, let’s create some simple database models representing publicly-traded securities and their historic market prices. This could be used to model price changes over a period of time for a stock.

from django.db.models import Model, CharField, ManyToManyField, FloatField, DateField


class Stock(Model):
    """
    Model for single security
    """
    symbol  = CharField(max_length=8, unique=True)
    prices  = ManyToManyField('Price', related_name='price_stock')
    
    
class Price(Model):
    """
    Model for historic Stock quotes
    """
    price   = FloatField()
    date    = DateField()

With this design in place, one can query Stock objects, Price objects, or functional relationships between the two—Stocks with Prices, for example. The use-case here is to generate a query that returns all Stock objects with at least one Price relationship. Simply put: that’s retrieving any stock that has pricing data.

Example Dataset

First, let’s generate some sample data for Stocks, Prices, and relationships between the two. There’s a lot of shortcuts in the following code, but it’s effectively creating a number of Stock entries then adding historic pricing data for a sample of those entries.

# Generate random securities
random_symbols = set([
    # Generate 50 random series of characters between 1 and 7 characters in length
    "".join(random.choice(string.ascii_letters).upper() for x in range(random.choice(range(1, 8)))) for y in range(50)
])

# Add all symbol data as Stock objects
for random_symbol in random_symbols:

    # Create a new stock object
    # Note: the use of get_or_create here negates the need for unique symbol values
    stock, created = Stock.objects.get_or_create(symbol=random_symbol)


# Add pricing data for 10 randomly-selected symbols.
# Note: Symbols omitted to demonstrate query filtering later.
for random_stock in random.sample([x for x in Stock.objects.all()], 10):
    
    # Create a new pandas dataframe with randomized pricing data for 1 years
    price_data            = pd.DataFrame(columns=['date', 'price'])
    price_data['date']    = pd.date_range(start='1/1/2020', end='1/1/2021', freq='D')
    price_data['price']   = [random.uniform(1.5, 5.0) for date in range(price_data.shape[0])]

    # Create a Price entry and add as relationship to Stock object for each entry
    for index, row in price_data.iterrows():

        # Create new price entry
        price_entry = Price.objects.create(price=row['price'], date=row['date'])

        # Add relationship to stock
        random_stock.prices.add(price_entry)

The above code has added a year’s worth of pricing data (1/1/2020-1/1/2021) for 10 randomly-selected Stock objects from the 50 added to the database. To get an idea of what things look like:

# import models
from stocks.models import Stock, Price

# Get count of total DB Stock entries
Stock.objects.count()
>>> 50

# Get count of total DB Price entries
Price.objects.count()
>>> 3670

Querying for Models with ManyToMany Entries

Now that our test data is created, we can begin to approach the problem: querying for all Stock entries that have price data. In other words, we don’t want to return any Stock object without pricing data. While Django’s ORM is incredible, this use-case can be approached via several ways—each with different performances.

# import models
from stocks.models import Stock, Price
import time

# First Approach — doesn't return expected data
st = time.time()
Stock.objects.filter(prices=True).count()
>>> 1
time.time() - st
>>> 0.003003358840942383

# Second Approach - doesn't filter data
st = time.time()
Stock.objects.exclude(prices=False).count()
>>> 50
time.time() - st
>>> 0.0010004043579101562

# Third Approach - returns expected data super fast
st = time.time()
Stock.objects.exclude(prices=None).count()
>>> 10
time.time() - st
>>> 0.0

# Fourth Approach - returns expected data
st = time.time()
Stock.objects.exclude(prices__isnull=True).count()
>>> 10
time.time() - st
>>> 0.0010006427764892578


# Fifth Approach - returns expected data, but much more slowly
st = time.time()
len([stock for stock in Stock.objects.all() if stock.prices.count() > 0])
>>> 10
time.time() - st
>>> 0.019017934799194336

To be honest, I’m not entirely sure why the first two examples don’t return the expected data. The syntax, while valid, doesn’t seem to produce the query I’d expect. The remaining examples, however, each return the 10 Stock objects for which pricing data was added.

The Stock.objects.exclude(prices=None).count() produces the fastest query, though admittedly there could be much more test data to magnify the differences. Aside from the illustration of effective syntax, I’d like to emphasize that remaining within the syntactic realm of Django’s Queryset is crucial to optimize query speed. In the last example using Python’s list comprehension—the query is nearly 10-times as slow!

Known Issues

  1. The generation of random symbols is willy-nilly and doesn’t always produce the same number of unique values.
  2. The time-series data doesn’t account for days that actual security wouldn’t be trading—weekends, holidays, etc.
  3. Pricing data is completely random in range and doesn’t attempt to model any type of trend (price increase, splits, etc.)
  4. Adding this much sample data to a standard SQLite database is slow. Using another database type (like PostgreSQL) that accommodates parallel writes, in combination with Python’s multiprocessing library, is recommended.

Final Thoughts

I find myself using Django quite often simply for its ORM system. When I need to iterate quickly with some data that necessitates a relational design I often use while validating an idea/notion/curiosity. The Django ORM makes data modeling and basic querying a breeze.

For complex and—in the case here—the syntax of even semi-complex lookups can get a little quirky. Django provides the Q object to simplify more complex lookups. I would recommend anyone that finds themselves without a clue as to the proper query syntax to check that class out. If for no other reason, it’s insightful as to the possibilities of DB queries Django supports natively.