A simple feed reader for Youtube
As an exercise, I wrote a small script to read the Atom feeds for some favourite youtube channels. Of course I could have installed a “real” feed-reader, but that would be overkill and not half as much fun. :-)
Find the URI and channel-id
In this blog post I saw how to access the feeds. The beginning of the URI is
https://www.youtube.com/feeds/videos.xml?channel_id=
, and it is followed
by the channel-id. This channel-id starts with UC
and looks like it is
base64 encoded data.
Note
The blog post above mentions channel-external-id
. However
did not find them in the pages for the channels I looked at. I did find
the channelId
, though.
Open the homepage of the channel and view its source code. Look for the
channelId
and copy it.
Download the data in XML format
Take the URI https://www.youtube.com/feeds/videos.xml?channel_id=
and
append the channel-id. Use that to download the data. For this I use the
awesome requests module.
import requests
import re
import datetime
base = "https://www.youtube.com/feeds/videos.xml?channel_id="
channel = 'UC5NO8MgTQKHAWXp6z8Xl7yQ'
uri = base + channel
res = requests.get(uri)
At this point, we should check that res.ok
is True
. In that case the
information was successfully retrieved.
Extracting the data
The downloaded data is in res.text
, which is a chunk of XML. Looking at
the data it seems to be in Atom format rather than RSS; it uses
entry
instead of item
tags.
Since I only need a couple of pieces of information, I’m not going to bother
with an XML parser and just use regular expressions to extract the
<title>...</title>
, <link rel="alternate" href="..."/>
and
'<published>...</published>
data.
In this case, the XML is formatted nicely. All the tags I want are on their
own line. This is significant, because by default the .
(dot) special
character in Python regular expressions does not match a newline. So here we
can get away with using the standard greedy capturing group (.*)
.
If the XML contains newlines in weird places, I would remove those from the
text (using the replace
method of strings) and then use the non-greedy
capturing group (.*?)
to make sure that we only capture between the
closest adjacent begin and end tags.
I’m translating the publication times to datetime.datetime
instances, for
easier comparison later.
titles = re.findall('<title>(.*)</title>', res.text, re.IGNORECASE)
links = re.findall('<link rel="alternate" href="(.*)"/>', res.text, re.IGNORECASE)
published = [
datetime.datetime.fromisoformat(pt) for pt in
re.findall('<published>(.*)</published>', res.text, re.IGNORECASE)
]
The first title and link are that of the channel intro.
In [17]: titles[:3]
Out[17]:
['This Old Tony',
'ALL IN for PROJECT EGRESS',
'Moto Erratum! & A Flat Tire']
In [18]: links[:3]
Out[18]:
['https://www.youtube.com/channel/UC5NO8MgTQKHAWXp6z8Xl7yQ',
'https://www.youtube.com/watch?v=d6SbRkM0C3s',
'https://www.youtube.com/watch?v=WDcBLPCYQpU']
So, let’s combine them and filter out the channel intro.
In [20]: items = list(zip(titles, links, published))[1:3]
Out[20]:
[('ALL IN for PROJECT EGRESS',
'https://www.youtube.com/watch?v=d6SbRkM0C3s',
datetime.datetime(2019, 7, 13, 14, 32, 29, tzinfo=datetime.timezone.utc)),
('Moto Erratum! & A Flat Tire',
'https://www.youtube.com/watch?v=WDcBLPCYQpU',
datetime.datetime(2019, 7, 8, 17, 22, 48, tzinfo=datetime.timezone.utc))]
The titles still contain HTML escapes. And I want to limit the video’s to those published in a configurable number of days. Let’s fix that.
import html
now = datetime.datetime.now(tz=datetime.timezone.utc)
limit = 7
items = tuple(
(html.unescape(title), link, date)
for title, link, date in zip(titles, links, published)
if 'watch' in link and (now-date).days < limit
)
Note how relatively easy this is. The difference between two datetime
objects is a timedelta
object, which has a days
property.
Configuring the script
For simplicity I could just have put the list of channels that I like into the script itself, but that isn’t very user-friendly. Usually I publish scripts like this in a github repository so others can use them as well. And that means they should not contain configuration information that is specific for me.
So I decided to put the list of channels and the limit for how old the video’s
should be to show up in a JSON file that the script reads. This file should be
named .youtube-feedrc
and should be located in your $HOME
directory.
The format of the file is shown below:
{ "limit": 7, "channels": { "first channel name": "UC7QoixBiFO2zstyEXkbVpVg", "second channel name": "UCvY8pgX_3ksKvZBNFHVoUAQ", "third channel name": "UCk3wqxZummX-HTzYGjHDBpw" } }
Note
The channel-id’s above should not be used; they’re randomly generated examples.
You can have as many channels as you like. They will be visited in the order that they are listed in the file.
Related articles
- On the nature of GUI programs
- Have Python log to syslog
- Including binary data in Python scripts
- Attempting a conky replacement in Python (part 2)
- Attempting a conky replacement in Python (part 1)