Reading xlsx files with Python
This article is a write-up of what I learned trying to extract information from
ms-office xlsx
files using Python.
Introduction
For my work I need to keep track of the hours that I spend on projects. Traditionally this is done on pre-printed paper sheets. This is somewhat cumbersome and the data is not available for analysis.
My handwriting is generally considered illegible by anyone but myself. Additionally I wanted the data in a format that I could query in a meaningful way, as in “how many hours did I spend on project X in year Y”. So I was looking for an alternative.
A colleague of mine created an xlsx
spreadsheet for tracking hours. This
was at least readable, and it also calculates durations of items. But since
every day is a separate worksheet in the file, it was not very useful for
data analysis as-is.
So I got to work investigating the file format.
Why not use the specifications?
The “Office Open XML” format is standardised both by ECMA and ISO.
When these standards were first published I had a quick look at them. The documentation runs in the thousands of pages and does not make for light reading. Life’s too short to plow through that.
In my estimation it would be easier to look into some files and see how I could extract the data that I needed. So that’s what I did.
Tools
What I wanted to do (for instance tally up hours by project) could probably be achieved by using the BASIC variant built into ms-office. But I’m not familiar with it, nor do I especially want to be. It’s very verbose and limited to a platform that I don’t prefer anyway.
My tools of choice tend to vary with the task at hand. For very simple things I’ll probably use the power of UNIX utilities in a shell-script. For those tasks where speed is critical I will use C. For most things in between I prefer Python 3. In my view it strikes a good balance between expressiveness, power and readability. And it has an excellent standard library.
By the time I was doing this some brave souls had already created a library for reading and writing these files, openpyxl. But this library is very much overkill for what I had in mind. And I preferred to have a standard-library only solution. So I went the DIY route.
During the process I used xmllint --format
to make the xml
files human-readable.
File structure
Basics
An xlsx
file is basically a zip-file with a standard directory structure
and a gaggle of XML files. When unpacking these files, I generally found the
following directories.
> mkdir foo
> cd foo
> unzip ../foo.xlsx
> find . -type d|sort
.
./_rels
./docProps
./xl
./xl/_rels
./xl/printerSettings
./xl/theme
./xl/worksheets
./xl/worksheets/_rels
(The lines starting with >
are commands, the rest is their output.)
The _rels directory
In the files that I looked at, the _rels
directory was empty, so I skipped that.
The docProps directory
This directory contains two files; app.xml
and core.xml
.
The app.xml
file basically contained a list of the titles as seen on the tabs
on the bottom of the worksheets. The titles are listed in this file in the
sequence they appear in the xlsx
file from left to right. They are
bracketed between <vt:lpstr>
and </vt:lpstr>
tags.
The core.xml
contains information about creation, modification and
printing of the file. The relevant contents are:
<dc:creator>Foo Bar</dc:creator>
<cp:lastModifiedBy>Spam Eggs</cp:lastModifiedBy>
<cp:lastPrinted>2011-12-15T16:46:09Z</cp:lastPrinted>
<dcterms:created xsi:type="dcterms:W3CDTF">2011-08-16T07:14:54Z</dcterms:created>
<dcterms:modified xsi:type="dcterms:W3CDTF">2011-12-16T15:24:35Z</dcterms:modified>
They might come in handy but I haven’t used them yet.
The xl directory
Files
Before we go into the subdirectories let’s have a look at the files in the
xl
directory.
> find . -type f -depth 1|sort
./calcChain.xml
./sharedStrings.xml
./styles.xml
./vbaProject.bin
./workbook.xml
Looking at the name, calcChain.xml
has information about the recalculation
sequence of cells. I skipped this.
As the name implies sharedStrings.xml
contains a list of strings that are
used in multiple places in the worksheets. Each string is bracketed between
nested <si><t>
and </t></si>
tags. We’ll come back to this later.
The styles.xml
file contains information about the styles used. This
includes things like fonts, borders, cell formatting. Not relevant if you’re
only interested in the content.
The vbaProject.bin
is a binary blob that probably contains BASIC code.
Also not relevant if you just want to read the cell contents.
The workbook.xml
file contains a number of sheet definitions. These link
the name of the worksheet to several numbers. Each sheet has a single tag with
attributes, like this.
<sheet name="template" sheetId="4" r:id="rId1"/>
In the subdirectory xl/worksheets
there is a number of XML files named
sheetN.xml
, where N is a number. One might expect that N corresponds
with the sheetId
. But that turns out not to be the case. The sheet number
N actually is the number in the r:id
attribute after the rId
text. So
in the example above, the worksheet named template
is
xl/worksheets/sheet1.xml
.
The xl/_rels directory
This contained only one file workbook.xml.rels
. This contains a mapping
between Relationship Id
and Target
file paths.
The xl/printerSettings directory
This directory contained a binary blob for each worksheet. Inspection with
hexdump
showed me that these blobs contain specific information for
a chosen printer. I’ve skipped this.
The xl/theme directory
In this directory you can find at least one themeN.xml
file (where N is
a number). This contains information about the presentation of the file.
Colors, fonts, fill styles et cetera. This is also not really relevant to the
contents, so I’ve skipped this as well.
The xl/worksheets directory
In the _rels
subdirectory you’ll find files linking the sheets to printer
settings. That is also not relevant to the contents.
Now we come to the most relevant part, the actual worksheets. The worksheets
folder contains a number of XML files named sheetN.xml
, where N is
a number without leading zeroes. In these files the only tags that we’re
interested in are the cell tags. These come in different kinds.
A cell with just a number in it looks like this when formatted.
<c r="D8" s="28">
<v>13.15</v>
</c>
The r
attribute of the c
tag is the column/row address. The data
between the v
tags is the value. So cell D8 holds the value 13.15.
What the s
attribute is I don’t know for sure. It could refer to the
cell’s style. It does not seem relevant to the content.
A variant of this has a formula (a set of f
tags) after the opening c
tag but before the v
tags. I ignore the formula and just read the value.
The following form refers to a shared string.
<c r="E1" s="11" t="s">
<v>8</v>
</c>
It uses 0-based indexing so this means that cell E1 contains the ninth string from the shared string list.
Extracting the information
To read the XLSX file we use Python’s zipfile
module.
from zipfile import ZipFile
z = ZipFile('test.xlsx')
We then proceed to create a list of shared strings using regular expressions. Contrary to polular notion, it is perfectly doable to use regular expressions to extract data from xml files. Using a full-blown xml parser it not necessary in this case.
import re
shre = re.compile('<t[^>]*>(.*?)</t>')
with z.open('xl/sharedStrings.xml') as sstr:
ssdata = sstr.read().decode('utf-8')
shared_strings = shre.findall(ssdata)
Next we create a dictionary where the worksheet name is the key and the sheet number the value, since worksheet names must be unique.
wbre = re.compile('<sheet name="([^"]+)".*?"rId([0-9]{1,3})"/>')
with z.open('xl/workbook.xml') as wb:
wbdata = wb.read().decode('utf-8')
worksheets = dict(wbre.findall(wbdata))
We could then iterate over the worksheets and read them. I use two regular expressions to extract non-empty cell data.
def conv(n):
"""Convert to appropriate type."""
try:
r = float(n)
except ValueError:
return n # Keep it a str.
if round(r) == r:
return int(r) # It's an int.
return r # It's a float.
numre = re.compile('<c r="([A-Z]+)([0-9]+)" s="[^"]+">'
'(?:<f[^<]+</f>)?<v>(.*?)</v></c>')
strre = re.compile('<c r="([A-Z]+)([0-9]+)" s="[^"]+" t="s"><v>(.*?)</v></c>')
for name, num in worksheets:
with z.open('xl/worksheets/sheet{}.xml'.format(num)) as ws:
wsdata = ws.read().decode('utf-8')
C = {(k, int(p)): conv(v) for k, p, v in numre.findall(wsdata)}
Cs = {(k, int(p)): shared_strings[int(v)] for k, p, v
in strre.findall(wsdata)}
C.update(Cs)
The C
dictionary now contains the content of all non-empty cells. So we
can now for instance query all cells in row 5.
In [49]: {k: v for k, v in C.items() if k[1] == 5}
Out[49]:
{('A', 5): 42571,
('C', 5): '001',
('D', 5): 10.3,
('E', 5): 12,
('F', 5): 1.5,
('I', 5): '3D modelling'}
This means that I spent 1.5 hours on 3D modelling for project 42571.
For comments, please send me an e-mail.