Roland's homepage

My random knot in the Web

Extracting data from XML with regular expressions

While regular expressions can’t fully parse XML, they can be sufficient to extract data from it. In cases where the dataset is large and we are only interested in a small part of the data, this can be significantly faster than using a full XML parser.

In this article I’ll cover two cases: getting all the contents of a single tag, and converting non-nested tags to a dictionary. My language of choice is Python, specifically version 3. Regular expressions are part of its standard library.

Retrieving the contents of a single tag

If you know the tag you are looking for, creating a regular expression to find it is pretty easy.

I wrote the following Python function to get the contents of a single tag.

import re

def xmlfind(text, tag, attributes=None):
    """Isolate the content of a tag from XML text.

    Arguments:
        text: The text to look in.
        tag: The tag to look for. In '<lat>23.7</lat>' the tag is 'lat'.
        attributes: A dict of attributes that are part of the start tag.
                    E.g. the dict {"unit": "degrees"} would produce
                    '<lat unit="degrees">'.

    Returns:
        A list containing the data in the requested tag.
    """
    if attributes:
        xs = ' '.join(['{}="{}"'.format(k, v) for k, v in attributes.items()])
        rx = '<{0} {1}>(.*?)</{0}>'.format(tag, xs)
    else:
        rx = '<{0}>(.*?)</{0}>'.format(tag)
    return re.findall(rx, text, re.DOTALL)

The last line basically does all the work. The rest is for dealing with attributes in the start tag.
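
As a small sketch of how the attributes argument behaves, the example below repeats xmlfind in condensed form so that it runs on its own. The sample string is a made-up illustration, not part of the weather report. It shows that the start tag has to match exactly: a tag written with attributes is not found when the attributes are left out of the call.

```python
import re

def xmlfind(text, tag, attributes=None):
    """Condensed copy of xmlfind, repeated so this example is self-contained."""
    if attributes:
        xs = ' '.join('{}="{}"'.format(k, v) for k, v in attributes.items())
        rx = '<{0} {1}>(.*?)</{0}>'.format(tag, xs)
    else:
        rx = '<{0}>(.*?)</{0}>'.format(tag)
    return re.findall(rx, text, re.DOTALL)

# Hypothetical sample text, made up for this illustration.
sample = '<lat unit="degrees">23.7</lat><lon unit="degrees">5.1</lon>'

# The attributes are part of the pattern, so the start tag matches exactly.
with_attr = xmlfind(sample, 'lat', {'unit': 'degrees'})
# Without attributes the pattern looks for a literal '<lat>', which is absent.
without_attr = xmlfind(sample, 'lat')

print(with_attr)     # ['23.7']
print(without_attr)  # []
```

With more than one attribute, the order of the items in the dict has to be the same as the order in the start tag, because the pattern is built by simple string formatting.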

In the case without attributes, and with the tag lat, the regular expression would look like <lat>(.*?)</lat>. It has three parts. Translated into English they mean:

  • Match the string <lat>.
  • Match a group of 0 or more characters (including newlines, because of re.DOTALL), but as few as possible.
  • Match the string </lat>.
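
To see why the "as few as possible" part matters, the sketch below compares the non-greedy group (.*?) with its greedy counterpart (.*). The text with two lat tags is a made-up example.

```python
import re

# Hypothetical text containing the same tag twice.
text = '<lat>23.7</lat><lat>51.2</lat>'

# Non-greedy: each match stops at the first closing tag it can reach.
lazy = re.findall('<lat>(.*?)</lat>', text, re.DOTALL)
# Greedy: the group runs on to the last closing tag in the text.
greedy = re.findall('<lat>(.*)</lat>', text, re.DOTALL)

print(lazy)    # ['23.7', '51.2']
print(greedy)  # ['23.7</lat><lat>51.2']
```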

The findall function returns all the found groups. Let’s use the XML weather report on this page as a sample.

In the example below, the text from the weather report is stored as data and the xmlfind function is defined as above. We will use it to extract the information in the image tag.

In [8]: xmlfind(data, 'image')
Out[8]: ["\n  <url>http://weather.gov/images/xml_logo.gif</url>\n"\
         "  <title>NOAA's National Weather Service</title>\n"\
         "  <link>http://weather.gov</link>\n"]

Note that this is a list which contains one string. (The string is re-formatted to fit this page better.) If the tag occurs multiple times in the given text, you will get multiple matches.
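
Because the result is plain text, the same technique can be applied again to the extracted string. The sketch below assumes a reduced stand-in for the weather report that contains only the image tag shown above, and then pulls the url out of it.

```python
import re

# Reduced stand-in for the weather report: only the image element.
data = (
    "<image>\n"
    "  <url>http://weather.gov/images/xml_logo.gif</url>\n"
    "  <title>NOAA's National Weather Service</title>\n"
    "  <link>http://weather.gov</link>\n"
    "</image>"
)

# First isolate the contents of the image tag ...
image = re.findall('<image>(.*?)</image>', data, re.DOTALL)
# ... then search inside that (much smaller) string for the url tag.
urls = re.findall('<url>(.*?)</url>', image[0], re.DOTALL)

print(urls)  # ['http://weather.gov/images/xml_logo.gif']
```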

Converting non-nested tags to a dict

Suppose we have some non-nested tags, as shown below.

<weather>A Few Clouds</weather>
<temp_f>11</temp_f>
<temp_c>-12</temp_c>
<relative_humidity>36</relative_humidity>
<wind_dir>West</wind_dir>
<wind_degrees>280</wind_degrees>
<wind_mph>18.4</wind_mph>
<wind_gust_mph>29</wind_gust_mph>
<pressure_mb>1023.6</pressure_mb>
<pressure_in>30.23</pressure_in>
<dewpoint_f>-11</dewpoint_f>
<dewpoint_c>-24</dewpoint_c>
<windchill_f>-7</windchill_f>
<windchill_c>-22</windchill_c>
<visibility_mi>10.00</visibility_mi>

The regular expression <(?P<tag>\S+).*>(.*?)</(?P=tag)> and the findall function can extract the tags and contents in one go. In the following example, the text above is available as data.

In [2]: import re

In [3]: rv = re.findall(r'<(?P<tag>\S+).*>(.*?)</(?P=tag)>', data, re.DOTALL)

In [4]: rv
Out[4]:
[('weather', 'A Few Clouds'), ('temp_f', '11'),
('temp_c', '-12'), ('relative_humidity', '36'),
('wind_dir', 'West'), ('wind_degrees', '280'),
('wind_mph', '18.4'), ('wind_gust_mph', '29'),
('pressure_mb', '1023.6'), ('pressure_in', '30.23'),
('dewpoint_f', '-11'), ('dewpoint_c', '-24'),
('windchill_f', '-7'), ('windchill_c', '-22'),
('visibility_mi', '10.00')]

The findall function returns a list of 2-tuples each containing a tag and its value. For simply reporting all the data, this would work fine.
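
In this pattern, (?P<tag>\S+) captures the tag name in a group called tag, and (?P=tag) requires the same text to appear again in the closing tag. One thing to be aware of is that the greedy .* combined with re.DOTALL can skip over whole elements when a tag name occurs more than once. The sketch below shows this on made-up data and compares it with a stricter variant. The stricter pattern is an assumption of this example, not the pattern used in the article.

```python
import re

# Hypothetical data in which the tag temp_f occurs twice.
data = '<temp_f>11</temp_f>\n<temp_c>-12</temp_c>\n<temp_f>12</temp_f>\n'

# The pattern from the article.
original = r'<(?P<tag>\S+).*>(.*?)</(?P=tag)>'
# Hypothetical stricter variant: neither the tag name nor the rest of the
# start tag may contain '>', so a match cannot run past the start tag.
stricter = r'<(?P<tag>[^\s>]+)[^>]*>(.*?)</(?P=tag)>'

loose_result = re.findall(original, data, re.DOTALL)
strict_result = re.findall(stricter, data, re.DOTALL)

print(loose_result)   # [('temp_f', '12')]
print(strict_result)  # [('temp_f', '11'), ('temp_c', '-12'), ('temp_f', '12')]
```

For data like the weather report above, where every tag occurs once, both patterns give the same result.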

To use part of the data without depending on the order of the list items, a dictionary is better suited. The dict built-in function converts the list of 2-tuples into a dictionary.

In [5]: dict(rv)
Out[5]:
{'dewpoint_c': '-24',
'dewpoint_f': '-11',
'pressure_in': '30.23',
'pressure_mb': '1023.6',
'relative_humidity': '36',
'temp_c': '-12',
'temp_f': '11',
'visibility_mi': '10.00',
'weather': 'A Few Clouds',
'wind_degrees': '280',
'wind_dir': 'West',
'wind_gust_mph': '29',
'wind_mph': '18.4',
'windchill_c': '-22',
'windchill_f': '-7'}
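
All values in the dictionary are strings, so numeric values have to be converted before calculating with them. A minimal sketch, using a shortened copy of the tags above and the hypothetical names report and temp_c:

```python
import re

# Shortened copy of the non-nested tags shown earlier.
data = (
    '<weather>A Few Clouds</weather>\n'
    '<temp_c>-12</temp_c>\n'
    '<wind_mph>18.4</wind_mph>\n'
)

rv = re.findall(r'<(?P<tag>\S+).*>(.*?)</(?P=tag)>', data, re.DOTALL)
report = dict(rv)

# Look values up by tag name; convert to float where a number is needed.
temp_c = float(report['temp_c'])

print(report['weather'])  # A Few Clouds
print(temp_c)             # -12.0
```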

Closing remarks

The purpose of this article is not to convince you that an XML parser is unnecessary. If you need to read a whole XML file or a deeply nested one, the solutions presented here will not be sufficient.

But in special cases they can be quite useful.

