Roland's homepage

My random knot in the Web

Reading xlsx files with Python

This article is a write-up of what I learned trying to extract information from ms-office xlsx files using Python.

Introduction

For my work I need to keep track of the hours that I spend on projects. Traditionally this is done on pre-printed paper sheets. This is somewhat cumbersome and the data is not available for analysis.

My handwriting is generally considered illegible by anyone but myself. Additionally I wanted the data in a format that I could query in a meaningful way, as in “how many hours did I spend on project X in year Y”. So I was looking for an alternative.

A colleague of mine created an xlsx spreadsheet for tracking hours. This was at least readable, and it also calculates durations of items. But since every day is a separate worksheet in the file, it was not very useful for data analysis as-is.

So I got to work investigating the file format.

Why not use the specifications?

The “Office Open XML” format is standardised both by ECMA and ISO.

When these standards were first published I had a quick look at them. The documentation runs in the thousands of pages and does not make for light reading. Life’s too short to plow through that.

In my estimation it would be easier to look into some files and see how I could extract the data that I needed. So that’s what I did.

Tools

What I wanted to do (for instance tally up hours by project) could probably be achieved by using the BASIC variant built into ms-office. But I’m not familiar with it, nor do I especially want to be. It’s very verbose and limited to a platform that I don’t prefer anyway.

My tools of choice tend to vary with the task at hand. For very simple things I’ll probably use the power of UNIX utilities in a shell-script. For those tasks where speed is critical I will use C. For most things in between I prefer Python 3. In my view it strikes a good balance between expressiveness, power and readability. And it has an excellent standard library.

By the time I was doing this some brave souls had already created a library for reading and writing these files, openpyxl. But this library is very much overkill for what I had in mind. And I preferred to have a standard-library only solution. So I went the DIY route.

During the process I used xmllint --format to make the xml files human-readable.

File structure

Basics

An xlsx file is basically a zip-file with a standard directory structure and a gaggle of XML files. When unpacking these files, I generally found the following directories.

> mkdir foo
> cd foo
> unzip ../foo.xlsx
> find . -type d|sort
.
./_rels
./docProps
./xl
./xl/_rels
./xl/printerSettings
./xl/theme
./xl/worksheets
./xl/worksheets/_rels

(The lines starting with > are commands, the rest is their output.)

The _rels directory

In the files that I looked at, the _rels directory was empty, so I skipped that.

The docProps directory

This directory contains two files; app.xml and core.xml.

The app.xml file basically contained a list of the titles as seen on the tabs on the bottom of the worksheets. The titles are listed in this file in the sequence they appear in the xlsx file from left to right. They are bracketed between <vt:lpstr> and </vt:lpstr> tags.

The core.xml contains information about creation, modification and printing of the file. The relevant contents are:

<dc:creator>Foo Bar</dc:creator>
<cp:lastModifiedBy>Spam Eggs</cp:lastModifiedBy>
<cp:lastPrinted>2011-12-15T16:46:09Z</cp:lastPrinted>
<dcterms:created xsi:type="dcterms:W3CDTF">2011-08-16T07:14:54Z</dcterms:created>
<dcterms:modified xsi:type="dcterms:W3CDTF">2011-12-16T15:24:35Z</dcterms:modified>

They might come in handy but I haven’t used them yet.

The xl directory

Files

Before we go into the subdirectories let’s have a look at the files in the xl directory.

> find . -type f -depth 1|sort
./calcChain.xml
./sharedStrings.xml
./styles.xml
./vbaProject.bin
./workbook.xml

Looking at the name, calcChain.xml has information about the recalculation sequence of cells. I skipped this.

As the name implies sharedStrings.xml contains a list of strings that are used in multiple places in the worksheets. Each string is bracketed between nested <si><t> and </t></si> tags. We’ll come back to this later.

The styles.xml file contains information about the styles used. This includes things like fonts, borders, cell formatting. Not relevant if you’re only interested in the content.

The vbaProject.bin is a binary blob that probably contains BASIC code. Also not relevant if you just want to read the cell contents.

The workbook.xml file contains a number of sheet definitions. These link the name of the worksheet to several numbers. Each sheet has a single tag with attributes, like this.

<sheet name="template" sheetId="4" r:id="rId1"/>

In the subdirectory xl/worksheets there is a number of XML files named sheetN.xml, where N is a number. One might expect that N corresponds with the sheetId. But that turns out not to be the case. The sheet number N actually is the number in the r:id attribute after the rId text. So in the example above, the worksheet named template is xl/worksheets/sheet1.xml.

The xl/_rels directory

This contained only one file workbook.xml.rels. This contains a mapping between Relationship Id and Target file paths.

The xl/printerSettings directory

This directory contained a binary blob for each worksheet. Inspection with hexdump showed me that these blobs contain specific information for a chosen printer. I’ve skipped this.

The xl/theme directory

In this directory you can find at least one themeN.xml file (where N is a number). This contains information about the presentation of the file. Colors, fonts, fill styles et cetera. This is also not really relevant to the contents, so I’ve skipped this as well.

The xl/worksheets directory

In the _rels subdirectory you’ll find files linking the sheets to printer settings. That is also not relevant to the contents.

Now we come to the most relevant part, the actual worksheets. The worksheets folder contains a number of XML files named sheetN.xml, where N is a number without leading zeroes. In these files the only tags that we’re interested in are the cell tags. These come in different kinds.

A cell with just a number in it looks like this when formatted.

<c r="D8" s="28">
  <v>13.15</v>
</c>

The r attribute of the c tag is the column/row address. The data between the v tags is the value. So cell D8 holds the value 13.15. What the s attribute is I don’t know for sure. It could refer to the cell’s style. It does not seem relevant to the content.

A variant of this has a formula (a set of f tags) after the opening c tag but before the v tags. I ignore the formula and just read the value.

The following form refers to a shared string.

<c r="E1" s="11" t="s">
  <v>8</v>
</c>

It uses 0-based indexing so this means that cell E1 contains the ninth string from the shared string list.

Extracting the information

To read the XLSX file we use Python’s zipfile module.

from zipfile import ZipFile

z = ZipFile('test.xlsx')

We then proceed to create a list of shared strings using regular expressions. Contrary to polular notion, it is perfectly doable to use regular expressions to extract data from xml files. Using a full-blown xml parser it not necessary in this case.

import re

shre = re.compile('<t[^>]*>(.*?)</t>')

with z.open('xl/sharedStrings.xml') as sstr:
    ssdata = sstr.read().decode('utf-8')
shared_strings = shre.findall(ssdata)

Next we create a dictionary where the worksheet name is the key and the sheet number the value, since worksheet names must be unique.

wbre = re.compile('<sheet name="([^"]+)".*?"rId([0-9]{1,3})"/>')
with z.open('xl/workbook.xml') as wb:
    wbdata = wb.read().decode('utf-8')
worksheets = dict(wbre.findall(wbdata))

We could then iterate over the worksheets and read them. I use two regular expressions to extract non-empty cell data.

def conv(n):
    """Convert to appropriate type."""
    try:
        r = float(n)
    except ValueError:
        return n  # Keep it a str.
    if round(r) == r:
        return int(r)  # It's an int.
    return r  # It's a float.


numre = re.compile('<c r="([A-Z]+)([0-9]+)" s="[^"]+">'
                   '(?:<f[^<]+</f>)?<v>(.*?)</v></c>')
strre = re.compile('<c r="([A-Z]+)([0-9]+)" s="[^"]+" t="s"><v>(.*?)</v></c>')

for name, num in worksheets:
    with z.open('xl/worksheets/sheet{}.xml'.format(num)) as ws:
        wsdata = ws.read().decode('utf-8')
   C = {(k, int(p)): conv(v) for k, p, v in numre.findall(wsdata)}
   Cs = {(k, int(p)): shared_strings[int(v)] for k, p, v
         in strre.findall(wsdata)}
   C.update(Cs)

The C dictionary now contains the content of all non-empty cells. So we can now for instance query all cells in row 5.

In [49]: {k: v for k, v in C.items() if k[1] == 5}
Out[49]:
{('A', 5): 42571,
('C', 5): '001',
('D', 5): 10.3,
('E', 5): 12,
('F', 5): 1.5,
('I', 5): '3D modelling'}

This means that I spent 1.5 hours on 3D modelling for project 42571.


For comments, please send me an e-mail.


←  Finding your own files, fast Installing asymptote 2.38 in TeXLive 2016  →