A simple feed reader for YouTube
As an exercise, I wrote a small script to read the Atom feeds of some of my favourite YouTube channels. Of course I could have installed a “real” feed-reader, but that would be overkill and not half as much fun. :-)
Find the URI and channel-id
In this blog post I saw how to access the feeds. The beginning of the URI is https://www.youtube.com/feeds/videos.xml?channel_id= and it is followed by the channel-id. This channel-id starts with UC and looks like it is base64 encoded data.
Note
The blog post above mentions channel-external-id. However, I did not find that in the pages for the channels I looked at. I did find the channelId, though.
Open the homepage of the channel and view its source code. Look for the channelId and copy it.
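Instead of copying the id by hand, the lookup could be sketched with a regular expression. This is only an illustration: the helper name find_channel_id is hypothetical, and it assumes that the page source embeds the id as a JSON-style "channelId":"UC..." pair, which may not hold for every page or stay that way.

```python
import re


def find_channel_id(page_source):
    """Return the first channelId found in the page source, or None.

    Assumption: the id appears as a JSON-style key/value pair such as
    "channelId":"UC...". This layout is a guess, not a documented format.
    """
    match = re.search(r'"channelId"\s*:\s*"(UC[0-9A-Za-z_-]+)"', page_source)
    return match.group(1) if match else None


# A made-up fragment standing in for the real page source.
sample = '{"header": {"channelId":"UC5NO8MgTQKHAWXp6z8Xl7yQ", "title": "x"}}'
print(find_channel_id(sample))
print(find_channel_id("no id in here"))
```

The function returns None rather than raising when nothing matches, so the caller can decide how to report a page without an id.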
Download the data in XML format
Take the URI https://www.youtube.com/feeds/videos.xml?channel_id= and append the channel-id. Use that to download the data. For this I use the awesome requests module.
import requests
import re
import datetime
base = "https://www.youtube.com/feeds/videos.xml?channel_id="
channel = 'UC5NO8MgTQKHAWXp6z8Xl7yQ'
uri = base + channel
res = requests.get(uri)
At this point, we should check that res.ok is True. In that case the information was successfully retrieved.
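The check on res.ok could look something like the sketch below. The helper name feed_text is hypothetical. To keep the example runnable without network access, it uses stand-in objects that only mimic the ok, status_code and text attributes of a requests response.

```python
from types import SimpleNamespace


def feed_text(res):
    """Return the response body, or None if the request failed."""
    if not res.ok:
        # requests sets ok to False for 4xx and 5xx status codes.
        print(f"download failed with status {res.status_code}")
        return None
    return res.text


# Stand-ins for real responses (an assumption made for this sketch only).
good = SimpleNamespace(ok=True, status_code=200, text="<feed>...</feed>")
bad = SimpleNamespace(ok=False, status_code=404, text="")

print(feed_text(good))
print(feed_text(bad))
```

In the script itself, the value passed in would be the result of requests.get(uri).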
Extracting the data
The downloaded data is in res.text, which is a chunk of XML. Looking at the data, it seems to be in Atom format rather than RSS; it uses entry instead of item tags.
Since I only need a couple of pieces of information, I’m not going to bother with an XML parser and just use regular expressions to extract the <title>...</title>, <link rel="alternate" href="..."/> and <published>...</published> data.
In this case, the XML is formatted nicely. All the tags I want are on their own line. This is significant, because by default the . (dot) special character in Python regular expressions does not match a newline. So here we can get away with using the standard greedy capturing group (.*).
If the XML contained newlines in weird places, I would remove those from the text (using the replace method of strings) and then use the non-greedy capturing group (.*?) to make sure that we only capture between the closest adjacent begin and end tags.
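To illustrate the difference, here is a small sketch on a made-up XML fragment (not real feed data) that has a newline inside a title and several tags on one line. It removes the newlines with replace and then compares the greedy and non-greedy groups.

```python
import re

# Made-up fragment: a newline inside a title, and two entries on one line.
xml = "<entry><title>First\nvideo</title></entry><entry><title>Second</title></entry>"

# Replace the newlines so that the dot can match across the whole text.
flat = xml.replace("\n", " ")

# Greedy: runs from the first <title> to the very last </title>.
greedy = re.findall("<title>(.*)</title>", flat)
# Non-greedy: stops at the closest </title> after each <title>.
lazy = re.findall("<title>(.*?)</title>", flat)

print(greedy)
print(lazy)
```

The greedy pattern returns a single match that swallows the tags in between, while the non-greedy one returns the two titles separately.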
I’m translating the publication times to datetime.datetime instances, for easier comparison later.
titles = re.findall('<title>(.*)</title>', res.text, re.IGNORECASE)
links = re.findall('<link rel="alternate" href="(.*)"/>', res.text, re.IGNORECASE)
published = [
    datetime.datetime.fromisoformat(pt) for pt in
    re.findall('<published>(.*)</published>', res.text, re.IGNORECASE)
]
The first title and link are those of the channel intro.
In [17]: titles[:3]
Out[17]:
['This Old Tony',
'ALL IN for PROJECT EGRESS',
'Moto Erratum! & A Flat Tire']
In [18]: links[:3]
Out[18]:
['https://www.youtube.com/channel/UC5NO8MgTQKHAWXp6z8Xl7yQ',
'https://www.youtube.com/watch?v=d6SbRkM0C3s',
'https://www.youtube.com/watch?v=WDcBLPCYQpU']
So, let’s combine them and filter out the channel intro.
In [20]: list(zip(titles, links, published))[1:3]
Out[20]:
[('ALL IN for PROJECT EGRESS',
'https://www.youtube.com/watch?v=d6SbRkM0C3s',
datetime.datetime(2019, 7, 13, 14, 32, 29, tzinfo=datetime.timezone.utc)),
('Moto Erratum! & A Flat Tire',
'https://www.youtube.com/watch?v=WDcBLPCYQpU',
datetime.datetime(2019, 7, 8, 17, 22, 48, tzinfo=datetime.timezone.utc))]
The titles still contain HTML escapes. And I want to limit the videos to those published within a configurable number of days. Let’s fix that.
import html
now = datetime.datetime.now(tz=datetime.timezone.utc)
limit = 7
items = tuple(
    (html.unescape(title), link, date)
    for title, link, date in zip(titles, links, published)
    if 'watch' in link and (now - date).days < limit
)
Note how relatively easy this is. The difference between two datetime objects is a timedelta object, which has a days attribute.
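As a small worked example of that comparison, the sketch below uses one of the publication times shown earlier and a fixed, made-up value for now, so that the result is reproducible. Note that days only counts whole days; the remainder lives in the seconds attribute of the timedelta.

```python
import datetime

limit = 7
published = datetime.datetime.fromisoformat("2019-07-08T17:22:48+00:00")

# A fixed "now" chosen for illustration, instead of datetime.datetime.now().
now = datetime.datetime(2019, 7, 15, 12, 0, 0, tzinfo=datetime.timezone.utc)

age = now - published        # a datetime.timedelta
print(age.days)              # whole days only
print(age.days < limit)

# Six hours later the video has crossed the seven-day mark.
later = now + datetime.timedelta(hours=6)
print((later - published).days < limit)
```

Both datetime objects carry a timezone here. Subtracting a naive datetime from an aware one raises a TypeError, which is why the script creates now with tz=datetime.timezone.utc.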
Configuring the script
For simplicity I could just have put the list of channels that I like into the script itself, but that isn’t very user-friendly. Usually I publish scripts like this in a GitHub repository so others can use them as well. And that means they should not contain configuration information that is specific to me.
So I decided to put the list of channels and the maximum age (in days) of the videos to show in a JSON file that the script reads. This file should be named .youtube-feedrc and should be located in your $HOME directory.
The format of the file is shown below:
{
    "limit": 7,
    "channels": {
        "first channel name": "UC7QoixBiFO2zstyEXkbVpVg",
        "second channel name": "UCvY8pgX_3ksKvZBNFHVoUAQ",
        "third channel name": "UCk3wqxZummX-HTzYGjHDBpw"
    }
}
Note
The channel-ids above should not be used; they’re randomly generated examples.
You can have as many channels as you like. They will be visited in the order that they are listed in the file.
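Reading this file could be sketched as follows. The function names load_config and feed_uris are hypothetical and not necessarily what the published script uses. The sketch relies on json.load returning a dict that keeps the key order of the file, so the channels come out in the order in which they are listed.

```python
import json
import os


def load_config(path=None):
    """Return (limit, channels) from the JSON configuration file.

    If no path is given, .youtube-feedrc in the home directory is used.
    """
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".youtube-feedrc")
    with open(path, encoding="utf-8") as rcfile:
        config = json.load(rcfile)
    return config["limit"], config["channels"]


def feed_uris(channels):
    """Return (name, feed URI) pairs in the order listed in the file."""
    base = "https://www.youtube.com/feeds/videos.xml?channel_id="
    return [(name, base + channel_id) for name, channel_id in channels.items()]
```

A call such as limit, channels = load_config() followed by a loop over feed_uris(channels) would then visit each feed in turn. A missing file or a missing key raises an exception here; a real script would probably want to report that more gently.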
For comments, please send me an e-mail.