Roland's homepage

My random knot in the Web

A simple feed reader for YouTube

As an exercise, I wrote a small script to read the Atom feeds for some favourite YouTube channels. Of course I could have installed a “real” feed-reader, but that would be overkill and not half as much fun. :-)

Find the URI and channel-id

In this blog post I saw how to access the feeds. The beginning of the URI is https://www.youtube.com/feeds/videos.xml?channel_id=, and it is followed by the channel-id. This channel-id starts with UC and looks like it is base64 encoded data.

Note

The blog post above mentions channel-external-id. However, I did not find that in the pages for the channels I looked at. I did find the channelId, though.

Open the homepage of the channel and view its source code. Look for the channelId and copy it.
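The manual lookup could also be scripted. What follows is only a minimal sketch: it assumes that the id shows up in the page source as a JSON-style "channelId":"UC..." pair made up of URL-safe base64 characters, which is a guess about YouTube's markup that may not hold for every page. The function name find_channel_id and the sample string are made up for illustration.

```python
import re


def find_channel_id(page_source):
    """Return the first channel-id found in the page source, or None.

    Assumption: the id appears as "channelId":"UC..." and only uses
    URL-safe base64 characters. This is a guess, not a documented format.
    """
    match = re.search(r'"channelId"\s*:\s*"(UC[0-9A-Za-z_-]+)"', page_source)
    return match.group(1) if match else None


# Hypothetical fragment of a channel page, not real downloaded data.
sample = '{"videoId":"d6SbRkM0C3s","channelId":"UC5NO8MgTQKHAWXp6z8Xl7yQ"}'
print(find_channel_id(sample))
```

If the function returns None, looking at the page source by hand is still the fallback.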

Download the data in XML format

Take the URI https://www.youtube.com/feeds/videos.xml?channel_id= and append the channel-id. Use that to download the data. For this I use the awesome requests module.

import requests
import re
import datetime

base = "https://www.youtube.com/feeds/videos.xml?channel_id="
channel = 'UC5NO8MgTQKHAWXp6z8Xl7yQ'

uri = base + channel
res = requests.get(uri)

At this point, we should check that res.ok is True; if so, the information was successfully retrieved.
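One way that check could look is sketched below. The helper feed_text is a hypothetical name, and the two SimpleNamespace objects are stand-ins for real responses so that the example runs without network access. It only relies on the ok, status_code and text attributes that requests responses have.

```python
from types import SimpleNamespace


def feed_text(res):
    """Return the body of a response, or raise if the request failed."""
    if not res.ok:
        raise RuntimeError(f"download failed with HTTP status {res.status_code}")
    return res.text


# Stand-ins for requests responses, so the sketch runs offline.
good = SimpleNamespace(ok=True, status_code=200, text="<feed/>")
bad = SimpleNamespace(ok=False, status_code=404, text="")

print(feed_text(good))
```

With a real response, calling res.raise_for_status() is an alternative that raises an exception for error status codes.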

Extracting the data

The downloaded data is in res.text, which is a chunk of XML. Looking at the data it seems to be in Atom format rather than RSS; it uses entry instead of item tags.

Since I only need a couple of pieces of information, I’m not going to bother with an XML parser and just use regular expressions to extract the <title>...</title>, <link rel="alternate" href="..."/> and <published>...</published> data.

In this case, the XML is formatted nicely. All the tags I want are on their own line. This is significant, because by default the . (dot) special character in Python regular expressions does not match a newline. So here we can get away with using the standard greedy capturing group (.*).

If the XML contains newlines in weird places, I would remove those from the text (using the replace method of strings) and then use the non-greedy capturing group (.*?) to make sure that we only capture between the closest adjacent begin and end tags.
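To illustrate the difference, here is a small sketch on a made-up fragment (not real feed data) in which a title is split over two lines. After removing the newline, everything sits on one line, and only the non-greedy group yields the separate titles.

```python
import re

# Hypothetical fragment; the first title contains a newline.
xml = (
    "<entry><title>First \nvideo</title></entry>"
    "<entry><title>Second</title></entry>"
)

# Remove the newlines so that the dot can match across the whole text.
flat = xml.replace("\n", "")

# Greedy: runs from the first <title> to the very last </title>.
greedy = re.findall("<title>(.*)</title>", flat)
# Non-greedy: stops at the closest </title>.
lazy = re.findall("<title>(.*?)</title>", flat)

print(greedy)
print(lazy)
```

The greedy pattern returns a single string that swallows the tags in between, while the non-greedy one returns 'First video' and 'Second'.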

I’m translating the publication times to datetime.datetime instances, for easier comparison later.

titles = re.findall('<title>(.*)</title>', res.text, re.IGNORECASE)
links = re.findall('<link rel="alternate" href="(.*)"/>', res.text, re.IGNORECASE)
published = [
    datetime.datetime.fromisoformat(pt) for pt in
    re.findall('<published>(.*)</published>', res.text, re.IGNORECASE)
]

The first title and link are those of the channel itself.

In [17]: titles[:3]
Out[17]:
['This Old Tony',
'ALL IN for PROJECT EGRESS',
'Moto Erratum! &amp; A Flat Tire']

In [18]: links[:3]
Out[18]:
['https://www.youtube.com/channel/UC5NO8MgTQKHAWXp6z8Xl7yQ',
'https://www.youtube.com/watch?v=d6SbRkM0C3s',
'https://www.youtube.com/watch?v=WDcBLPCYQpU']

So, let’s combine them and skip the channel entry.

In [20]: list(zip(titles, links, published))[1:3]
Out[20]:
[('ALL IN for PROJECT EGRESS',
  'https://www.youtube.com/watch?v=d6SbRkM0C3s',
  datetime.datetime(2019, 7, 13, 14, 32, 29, tzinfo=datetime.timezone.utc)),
 ('Moto Erratum! &amp; A Flat Tire',
  'https://www.youtube.com/watch?v=WDcBLPCYQpU',
  datetime.datetime(2019, 7, 8, 17, 22, 48, tzinfo=datetime.timezone.utc))]

The titles still contain HTML escapes. And I want to limit the videos to those published within a configurable number of days. Let’s fix that.

import html

now = datetime.datetime.now(tz=datetime.timezone.utc)
limit = 7

items = tuple(
    (html.unescape(title), link, date)
    for title, link, date in zip(titles, links, published)
    if 'watch' in link and (now-date).days < limit
)

Note how relatively easy this is. The difference between two datetime objects is a timedelta object, which has a days attribute that holds the number of whole days.
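A short sketch of that subtraction, using the publication time of one of the videos shown earlier and a fixed, made-up value for now so that the result is repeatable:

```python
import datetime

utc = datetime.timezone.utc
# Fixed stand-in for "now"; the script itself uses datetime.datetime.now().
now = datetime.datetime(2019, 7, 15, 12, 0, 0, tzinfo=utc)
published = datetime.datetime(2019, 7, 8, 17, 22, 48, tzinfo=utc)

age = now - published  # a datetime.timedelta
print(type(age).__name__, age.days)
```

This prints timedelta 6. Only whole days are counted, so a video that is 6 days and 18 hours old still passes a limit of 7.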

Configuring the script

For simplicity I could just have put the list of channels that I like into the script itself, but that isn’t very user-friendly. Usually I publish scripts like this in a GitHub repository so others can use them as well. And that means they should not contain configuration information that is specific to me.

So I decided to put the list of channels, and the limit on how old videos may be to still show up, in a JSON file that the script reads. This file should be named .youtube-feedrc and should be located in your $HOME directory.

The format of the file is shown below:

{
    "limit": 7,
    "channels": {
        "first channel name": "UC7QoixBiFO2zstyEXkbVpVg",
        "second channel name": "UCvY8pgX_3ksKvZBNFHVoUAQ",
        "third channel name": "UCk3wqxZummX-HTzYGjHDBpw"
    }
}

Note

The channel-ids above should not be used; they’re randomly generated examples.

You can have as many channels as you like. They will be visited in the order that they are listed in the file.
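Reading such a file could look roughly like the sketch below. The function name read_config and its optional path parameter are hypothetical and not necessarily how the published script does it. The file location and the "limit" and "channels" keys are those described above.

```python
import json
import os


def read_config(path=None):
    """Return (limit, channels) from the JSON configuration file.

    When no path is given, fall back to .youtube-feedrc in the
    home directory of the current user.
    """
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".youtube-feedrc")
    with open(path, encoding="utf-8") as rcfile:
        config = json.load(rcfile)
    return config["limit"], config["channels"]
```

Since Python 3.7 a dict keeps its insertion order, and json.load fills the dict in file order, so looping over channels.items() visits the channels in the order in which they are listed.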


For comments, please send me an e-mail.

