Brenton Cleeland

Respectfully Requesting RSS Feeds with Python

Published on

I found myself writing code to fetch RSS (and JSON) feeds, and was immediately reminded of Rachel By The Bay's posts about feed reader behaviour [1, 2, 3]. Those posts go into detail about their expectations for feed readers. This post will show you how to satisfy those requirements and a few others that I'm going to add.

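Feeds, updates, 200s, 304s, and now 429s can be summarised as requiring these steps:

Poll feeds at a reasonable interval
Make conditional requests with If-Modified-Since or If-None-Match
Handle 304 responses correctly
Back off if you receive a 429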

I'm adding two other requirements:

Use a User-Agent that lets server admins know who you are
Allow responses to be compressed

Let's start with a simple Django model that represents our Feed. We'll slowly update the fetch() function to retrieve the feed using thttp.

class Feed(models.Model):
    url = models.URLField()

    def fetch(self):
        pass

I'm using Django and Python, but don't let that put you off! The example should be trivial enough to convert to most other languages.

Poll feeds at a reasonable interval

Rachel talks about feed readers requesting feeds every two minutes. This is clearly unreasonable for a blog that posts a few times a month at most. But what is a reasonable frequency to request a feed of posts?

You need to think about your use case and decide how important it is that you get the latest post immediately, as well as how frequently the feed will change. Most blogs, and even most news sites, aren't adding new content every few minutes.

If you are presenting the posts in a feed reader then hourly (or even slower) likely makes sense. For use cases that trigger events on a new post (e.g. a push notification) you might request the feed more frequently (or use a push mechanism like rssCloud).

The best option is to implement a "delay" that's based on the type of feed that you are requesting. Use a sensible default (say 30-60 minutes), then either allow the user to change that delay or use the posting frequency to decide on the delay.
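
To make the posting-frequency idea concrete, here's a minimal sketch that derives a delay from the publish dates of recent posts. The suggest_check_interval() helper, its constants, and the "a quarter of the average gap between posts" rule are all my own assumptions for illustration; tune them to your use case.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical limits, all in seconds
DEFAULT_INTERVAL = 60 * 60       # used when we can't measure posting frequency
MIN_INTERVAL = 30 * 60           # never poll more often than every 30 minutes
MAX_INTERVAL = 24 * 60 * 60      # never wait longer than a day


def suggest_check_interval(published):
    """Return a polling delay in seconds from a list of post datetimes."""
    if len(published) < 2:
        return DEFAULT_INTERVAL

    ordered = sorted(published)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    average_gap = sum(gaps) / len(gaps)

    # Check roughly four times per average gap, clamped to our limits
    return int(min(max(average_gap / 4, MIN_INTERVAL), MAX_INTERVAL))


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
weekly_posts = [start + timedelta(weeks=n) for n in range(4)]
hourly_posts = [start + timedelta(hours=n) for n in range(4)]
```

With this rule a blog that posts weekly is capped at one check per day, while a busy feed that posts hourly is still never polled more than once every 30 minutes.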

Our model needs to keep two things in order to make this happen: the last time the feed was requested, and the minimum interval between checks. In the fetch() function we will check if the check_interval has passed and return early if it hasn't.

class Feed(models.Model):
    url = models.URLField(help_text="URL for the RSS/Atom feed")

    last_checked_utc = models.DateTimeField(blank=True, null=True)
    check_interval = models.IntegerField(default=3600)

    def fetch(self):
        # ensure the check_interval has passed
        if (
            self.last_checked_utc
            and (datetime.now(tz=zoneinfo.ZoneInfo("UTC")) - self.last_checked_utc).total_seconds() < self.check_interval
        ):
            logger.info(f"[fetch] {self.url} was checked less than {self.check_interval} seconds ago")
            return

Because Django's DateTimeField has timezone support we are explicitly using UTC timestamps for our comparison here.
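
One detail worth checking in a comparison like this: a timedelta's seconds attribute only holds the seconds component (0 to 86399) and ignores whole days, while total_seconds() returns the full duration. Here's a small standalone sketch of the same check; is_due() is a hypothetical helper name, and it takes "now" as a parameter purely so it's easy to test.

```python
from datetime import datetime, timedelta, timezone


def is_due(last_checked, check_interval, now=None):
    """Return True if at least check_interval seconds have passed."""
    if last_checked is None:
        # never fetched before, so fetch now
        return True
    now = now or datetime.now(tz=timezone.utc)
    return (now - last_checked).total_seconds() >= check_interval


now = datetime(2024, 1, 2, 12, 0, 0, tzinfo=timezone.utc)
elapsed = timedelta(days=1, seconds=30)

# .seconds drops the whole day, total_seconds() keeps it
seconds_component = elapsed.seconds
full_duration = elapsed.total_seconds()
```

Here seconds_component is 30 while full_duration is 86430.0, so a feed last checked just over a day ago would look like it was checked 30 seconds ago if you compared against the seconds attribute.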

Make conditional requests with If-Modified-Since or If-None-Match

The If-Modified-Since and If-None-Match headers are part of the HTTP spec and give servers the option of returning 304 Not Modified instead of the regular response when handling GET requests. They use slightly different mechanisms to achieve this.

If-Modified-Since tells the server to only return content if it has been modified after the date provided. In our example we will set this date to the last time that we requested the feed.

If-None-Match relies on an ETag (or entity tag) returned by the server. When sending a response the server generates a unique ETag value for that specific version of the response. Our client keeps that value and sends it in the If-None-Match header for subsequent requests. If the generated tag would be the same then the server returns an empty response.

In both cases a status code of 304 is returned to indicate that there are no updates.

We need to keep track of the ETag value if it's returned, and have two headers that we need to (optionally) send on the request.

class Feed(models.Model):
    url = models.URLField(help_text="URL for the RSS/Atom feed")

    last_checked_utc = models.DateTimeField(blank=True, null=True)
    check_interval = models.IntegerField(default=3600)

    last_etag = models.CharField(max_length=500, blank=True, default="")

    def fetch(self):
        # ensure the check_interval has passed
        # ...

        # send conditional request headers
        headers = {}
        if self.last_etag:
            headers["If-None-Match"] = self.last_etag

        if self.last_checked_utc:
            headers["If-Modified-Since"] = self.last_checked_utc.strftime("%a, %d %b %Y %H:%M:%S GMT")

        response = thttp.request(self.url, headers=headers)

        # save ETag and last checked timestamp
        if response.headers.get("etag"):
            self.last_etag = response.headers["etag"]

        self.last_checked_utc = datetime.now(tz=zoneinfo.ZoneInfo("UTC"))
        self.save()

The If-None-Match header will take precedence over the If-Modified-Since value if both are provided, but there's really no harm in keeping the code simple and providing both.
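
If you'd rather not hand-write the date format, here's a sketch of building both headers using only the standard library. HTTP dates look like "Fri, 05 Jan 2024 09:30:00 GMT", and email.utils.format_datetime() produces that shape when given a UTC datetime and usegmt=True. The conditional_headers() function is a hypothetical helper, and it assumes last_checked is a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_headers(last_etag, last_checked):
    """Build the optional conditional request headers for a feed."""
    headers = {}

    if last_etag:
        # send the ETag back exactly as the server gave it to us
        headers["If-None-Match"] = last_etag

    if last_checked:
        # usegmt=True requires tzinfo to be exactly timezone.utc
        as_utc = last_checked.astimezone(timezone.utc)
        headers["If-Modified-Since"] = format_datetime(as_utc, usegmt=True)

    return headers


example = conditional_headers(
    '"abc123"', datetime(2024, 1, 5, 9, 30, 0, tzinfo=timezone.utc)
)
```

A side benefit of format_datetime() is that the day and month names are always English, whereas strftime's %a and %b follow the process locale.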

Handle 304 responses correctly

When making conditional requests the 304 Not Modified response indicates that there is no new content. The response body is likely empty in this case and you should stop processing that feed.

class Feed(models.Model):
    url = models.URLField(help_text="URL for the RSS/Atom feed")

    last_checked_utc = models.DateTimeField(blank=True, null=True)
    check_interval = models.IntegerField(default=3600)

    last_etag = models.CharField(max_length=500, blank=True, default="")

    def fetch(self):
        # ensure the check_interval has passed
        # ...

        # send conditional request headers
        # ...

        # return immediately if 304 is returned
        if response.status == 304:
            # content is unchanged
            logger.info(f"[fetch] {self.url} is unchanged")
            self.last_checked_utc = datetime.now(tz=zoneinfo.ZoneInfo("UTC"))
            self.save()
            return

        # save ETag and last checked timestamp
        # ...

Because the last_checked_utc value is used to determine whether we should make future requests we make sure to update it here.

Back off if you receive a 429

A 429 Too Many Requests response is used to tell you that you're making requests too quickly. There are many strategies for managing backoffs (also called retry strategies) for HTTP requests, but here ours will be pretty simple:

Let's double the check interval if we receive a response from the server that isn't a 200 or 304, up to a maximum of 24 hours.

class Feed(models.Model):
    url = models.URLField(help_text="URL for the RSS/Atom feed")

    last_checked_utc = models.DateTimeField(blank=True, null=True)
    check_interval = models.IntegerField(default=3600)

    last_etag = models.CharField(max_length=500, blank=True, default="")

    def fetch(self):
        # ensure the check_interval has passed
        # ...

        # send conditional request headers
        # ...

        # return immediately if 304 is returned
        # ...

        # backoff if we receive a 429 response
        if response.status != 200:
            logger.info(f"[fetch] {self.url} returned {response.status}")
            self.check_interval = self.check_interval + self.check_interval
            if self.check_interval > 60 * 60 * 24:
                self.check_interval = 60 * 60 * 24
            self.last_checked_utc = datetime.now(tz=zoneinfo.ZoneInfo("UTC"))
            self.save()
            return

        # save ETag and last checked timestamp
        # ...
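
The status handling so far can be sketched as a small pure function, which makes the order of the checks and the effect of repeated failures easier to see. The after_response() name and the action strings are hypothetical, chosen just for this example.

```python
MAX_INTERVAL = 60 * 60 * 24  # 24 hours, in seconds


def after_response(status, check_interval):
    """Return (action, new_check_interval) for a response status code."""
    if status == 304:
        # nothing new, keep polling at the same rate
        return "unchanged", check_interval

    if status != 200:
        # 429s and any other failure: double the delay, up to 24 hours
        return "back_off", min(check_interval * 2, MAX_INTERVAL)

    return "process", check_interval


# Simulate a feed that keeps returning 429 from the default interval
interval = 3600
history = []
for _ in range(6):
    _, interval = after_response(429, interval)
    history.append(interval)
```

Starting from the default of one hour, history ends up as 7200, 14400, 28800, 57600, 86400, 86400, so a persistently failing feed settles at one request per day after five failures. Like the model above, this sketch never shrinks the interval again after a successful response.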

Use a User-Agent that lets server admins know who you are

Our two bonus requirements are both fairly straightforward and are implemented as simple request headers.

While it might be tempting to set the User-Agent header to mimic a browser, you should really give the servers you are making requests to some information about your client. Feeds are intended to be read by bots and servers should not block non-browser user agents.

Some example feed reader agents are:

Feedly/1.0 (+http://www.feedly.com/fetcher.html; XXXX subscribers;)
Feedbin feed-id:XXXXXXX - XXXX subscribers
Slackbot 1.0 (+https://api.slack.com/robots)
Overcast/1.0 Podcast Sync (XXXX subscribers; feed-id=XXXXXXX; +http://overcast.fm/)
CommaFeed/3.9.0 (https://github.com/Athou/commafeed)

Although not true of all of the above, we will include both the name of our bot and a URL that can be used to find more details. Convention dictates that URLs are preceded by "+" in your User-Agent.

class Feed(models.Model):
    url = models.URLField(help_text="URL for the RSS/Atom feed")

    last_checked_utc = models.DateTimeField(blank=True, null=True)
    check_interval = models.IntegerField(default=3600)

    last_etag = models.CharField(max_length=500, blank=True, default="")

    def fetch(self):
        # ensure the check_interval has passed
        # ...

        # send our user agent for all requests
        headers = {
            "User-Agent": "ExampleBot (+https://brntn.me/blog/respectfully-requesting-rss-feeds/)"
        }

        # send conditional request headers
        # ...

        # return immediately if 304 is returned
        # ...

        # backoff if we receive a 429 response
        # ...

        # save ETag and last checked timestamp
        # ...

Don't forget to update that URL if you are copying this code into your project!
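
If you'd like to build the string rather than hard-code it, here's a small sketch that follows the "name, then +URL in brackets" convention. The build_user_agent() helper, the version number, and the optional subscriber count (modelled loosely on the Feedly example) are assumptions for illustration.

```python
def build_user_agent(name, version, info_url, subscribers=None):
    """Build a descriptive User-Agent string for a feed fetcher."""
    details = [f"+{info_url}"]
    if subscribers is not None:
        # lets the site owner see how many readers one request represents
        details.append(f"{subscribers} subscribers")
    return f"{name}/{version} ({'; '.join(details)})"


plain = build_user_agent("ExampleBot", "1.0", "https://example.com/bot")
counted = build_user_agent("ExampleBot", "1.0", "https://example.com/bot", 12)
```

Here plain is "ExampleBot/1.0 (+https://example.com/bot)" and counted adds "; 12 subscribers" inside the brackets.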

Allow responses to be compressed

If the library we are using to make HTTP requests supports it, we should set the appropriate Accept-Encoding headers to let the server compress the response. RSS feeds that contain many posts and full post bodies can get quite large. The best case scenario for a server is to return a cached version of the compressed feed.

thttp has a deliberately limited feature set but it does support gzip compression. We can update our headers dictionary to allow this before making the request.

class Feed(models.Model):
    url = models.URLField(help_text="URL for the RSS/Atom feed")

    last_checked_utc = models.DateTimeField(blank=True, null=True)
    check_interval = models.IntegerField(default=3600)

    last_etag = models.CharField(max_length=500, blank=True, default="")

    def fetch(self):
        # ensure the check_interval has passed
        # ...

        # send our user agent for all requests
        headers = {
            "User-Agent": "ExampleBot (+https://brntn.me/blog/respectfully-requesting-rss-feeds/)",
            "Accept-Encoding": "gzip",
        }

        # send conditional request headers
        # ...

        # return immediately if 304 is returned
        # ...

        # backoff if we receive a 429 response
        # ...

        # save ETag and last checked timestamp
        # ...
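
If you port this to an HTTP library that doesn't decompress responses for you, the body needs to be unpacked based on the Content-Encoding response header. Here's a minimal sketch using Python's gzip module; decode_body() is a hypothetical helper, and the sample feed is made up to show how well repetitive XML compresses.

```python
import gzip


def decode_body(body, content_encoding):
    """Return the raw feed bytes, decompressing gzip bodies if needed."""
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(body)
    # no encoding header (or one we didn't ask for): return as-is
    return body


# A made-up feed with lots of repeated markup
raw_feed = b"<item><title>Post</title><description>Hello</description></item>" * 200
compressed_feed = gzip.compress(raw_feed)

sizes = (len(raw_feed), len(compressed_feed))
```

The compressed version of a feed like this is a small fraction of the original size, which is the saving the server gets when we send Accept-Encoding: gzip.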

Code Complete

That's it! I hope this deep dive into feed fetching was helpful!

The code below is a complete Django model that fetches an RSS feed in the most respectful way possible.

The next step (for you!) is to actually process the feed and save those entries. In Python I would recommend feedparser for that job. Most languages will have libraries for parsing the RSS feed's XML body so that you don't have to.

import logging
import thttp
import zoneinfo

from datetime import datetime
from django.db import models

logger = logging.getLogger("django")


class Feed(models.Model):
    url = models.URLField(help_text="URL for the RSS/Atom feed")

    last_checked_utc = models.DateTimeField(blank=True, null=True)
    check_interval = models.IntegerField(default=3600)

    last_etag = models.CharField(max_length=500, blank=True, default="")

    def fetch(self):
        # ensure the check_interval has passed
        if (
            self.last_checked_utc
            and (
                datetime.now(tz=zoneinfo.ZoneInfo("UTC")) - self.last_checked_utc
            ).total_seconds()
            < self.check_interval
        ):
            logger.info(
                f"[fetch] {self.url} was checked less than {self.check_interval} seconds ago"
            )
            return

        # send our user agent for all requests
        headers = {
            "User-Agent": "ExampleBot (+https://brntn.me/blog/respectfully-requesting-rss-feeds/)",
            "Accept-Encoding": "gzip",
        }

        # send conditional request headers
        if self.last_etag:
            headers["If-None-Match"] = self.last_etag

        if self.last_checked_utc:
            headers["If-Modified-Since"] = self.last_checked_utc.strftime(
                "%a, %d %b %Y %H:%M:%S GMT"
            )

        response = thttp.request(self.url, headers=headers)

        # return immediately if 304 is returned
        if response.status == 304:
            # content is unchanged
            logger.info(f"[fetch] {self.url} is unchanged")
            self.last_checked_utc = datetime.now(tz=zoneinfo.ZoneInfo("UTC"))
            self.save()
            return

        # backoff if we receive a 429 response
        if response.status != 200:
            logger.info(f"[fetch] {self.url} returned {response.status}")
            self.check_interval = self.check_interval + self.check_interval
            if self.check_interval > 60 * 60 * 24:
                self.check_interval = 60 * 60 * 24
            self.last_checked_utc = datetime.now(tz=zoneinfo.ZoneInfo("UTC"))
            self.save()
            return

        # save ETag and last checked timestamp
        if response.headers.get("etag"):
            self.last_etag = response.headers["etag"]

        self.last_checked_utc = datetime.now(tz=zoneinfo.ZoneInfo("UTC"))
        self.save()

        # TODO: Process the feed items and save them in our database
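
As a starting point for that TODO, here's a rough sketch of pulling items out of an RSS 2.0 body. To keep it dependency-free it uses the standard library's xml.etree.ElementTree as a stand-in for feedparser, so it only handles plain RSS 2.0 (no Atom, no namespaces, no malformed feeds). The parse_rss_items() helper and the sample feed are made up for this example.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document standing in for a fetched response body
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first/</link>
      <guid>https://example.com/first/</guid>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second/</link>
      <guid>https://example.com/second/</guid>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return a list of dicts with the title, link and guid of each item."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append(
            {
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                "guid": item.findtext("guid", default=""),
            }
        )
    return items


items = parse_rss_items(SAMPLE_FEED)
```

The guid is the natural key for deciding whether an entry has already been saved. For real-world feeds a dedicated parser is the safer choice, since ElementTree is strict about well-formed XML and isn't hardened against malicious input.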