Roland's homepage

My random knot in the Web

Profiling Python scripts (1): stl2pov

This is the first in a series of articles that covers analyzing and improving performance bottlenecks in Python scripts.

Note

The other articles in this series will be listed under “Related Articles” at the bottom of the page.

My stltools programs will serve as examples because they can have significant run times. It must be said that these scripts are not unoptimized! On and off I’ve been working on them since 2011. Meanwhile I’ve used numpy, pypy and discovered ways to make it faster without them.

But as will be seen, there is still room for improvement.

Profiling stl2pov

First the script that started it all; stl2pov. Profiling is done as follows using the cProfile module, from the source directory of the project.

Generating a mesh object

python -m cProfile \
-s tottime \
stl2pov.py \
../testdata-stltools/headrest_mesh.stl | less

The test file headrest_mesh.stl is a binary STL file because those tend to be the largest. It has 277648 facets.

The profile is sorted by total time spent, so only the first lines are most important. Only items with total time >0.1 s are shown below:

     2236215 function calls (2235779 primitive calls) in 3.181 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
277650    1.828    0.000    1.828    0.000 {method 'format' of 'str' objects}
832945    0.313    0.000    0.470    0.000 stl.py:82(_getbp)
     3    0.295    0.098    0.295    0.098 {method 'write' of '_io.TextIOWrapper' objects}
277649    0.132    0.000    0.132    0.000 utils.py:85(take)
     1    0.118    0.118    2.077    2.077 stl2pov.py:90(<listcomp>)

Of the 3.1 seconds the script takes, 1.8 s is spent in formatting strings. So there is clearly a major bottleneck in this script. Looking at the source code for stl2pov.py, there is only one format call used, in the function mesh1:

fct = (
    "  triangle {{\n    <{0}, {2}, {1}>,\n    <{3}, {5}, {4}>,\n"
    "    <{6}, {8}, {7}>\n  }}"
)
lines += [fct.format(*f) for f in facets]

So the current bottleneck is formatting the floating point coordinates of the vertices. Honestly, I had not expected that.

Generating a mesh2 object

Note

This section was added after other improvements to the code were also made. So the lines referenced here might be different from those in the rest of the article.

What happens if we generate a mesh2 object instead of the default mesh:

         3902230 function calls (3901785 primitive calls) in 1.727 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.332    0.332    0.332    0.332 stl2pov.py:114(<listcomp>)
   832945    0.262    0.000    0.505    0.000 vecops.py:120(<genexpr>)
   832945    0.250    0.000    0.256    0.000 stl.py:82(_getbp)
   832988    0.176    0.000    0.176    0.000 {method 'setdefault' of 'dict' objects}
   277649    0.133    0.000    0.133    0.000 utils.py:85(take)
        1    0.122    0.122    0.122    0.122 stl2pov.py:117(<listcomp>)
        1    0.091    0.091    0.346    0.346 stl.py:78(<listcomp>)
   277649    0.087    0.000    0.220    0.000 stl.py:161(<genexpr>)
        1    0.077    0.077    0.582    0.582 vecops.py:69(indexate)
833894/833805    0.068    0.000    0.068    0.000 {built-in method builtins.len}
        1    0.037    0.037    1.727    1.727 stl2pov.py:9(<module>)
        1    0.025    0.025    0.827    0.827 stl.py:147(toindexed)
        1    0.021    0.021    1.678    1.678 stl2pov.py:123(main)
      499    0.010    0.000    0.010    0.000 {method 'join' of 'str' objects}

In this case the vertices have duplicates removed. So much less data needs to be processed. In contrast to the default, there is no one item that dominates the total time. The top three items account for about half the runtime.

The line stl2pov.py:114 is:

lines += [f"    <{p[0]}, {p[2]}, {p[1]}>," for p in points]

And vecops.py:120 is:

indices = tuple(pd.setdefault(p, len(pd)) for p in points)

Investigating string formatting

Let’s get a baseline of how long a format action takes. This is done using IPython because of its built-in %timeit command.

In [1]: facet = (
...:     139.3592529296875, 69.91365814208984, 97.01824951171875,
...:     138.14346313476562, 70.0313949584961, 97.50768280029297,
...:     138.79571533203125, 69.03594970703125, 97.875244140625
...: );

In [2]: fct = (
...:     "  triangle {{\n    <{0}, {2}, {1}>,\n    <{3}, {5}, {4}>,\n"
...:     "    <{6}, {8}, {7}>\n  }}"
...: );

In [3]: %timeit fct.format(*facet)
6.39 µs ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

If you don’t want to install IPython, it could also be done like this in the Python REPL using timeit from the standard library.

>>> facet = (
...     139.3592529296875, 69.91365814208984, 97.01824951171875,
...     138.14346313476562, 70.0313949584961, 97.50768280029297,
...     138.79571533203125, 69.03594970703125, 97.875244140625
... )
>>> fct = (
...     "  triangle {{\n    <{0}, {2}, {1}>,\n    <{3}, {5}, {4}>,\n"
...     "    <{6}, {8}, {7}>\n  }}"
... )
>>> import timeit
>>> timeit.timeit("fct.format(*facet)", number=10000, globals=globals())/10000
6.748177204281092e-06

On my machine, running the above code yields that it takes ≈6.5 μs to format the string. Replacing the format method with an f-string yields the same result.

In [5]: f = (
...:     139.3592529296875, 69.91365814208984, 97.01824951171875,
...:     138.14346313476562, 70.0313949584961, 97.50768280029297,
...:     138.79571533203125, 69.03594970703125, 97.875244140625
...: );

In [6]: %timeit f"  triangle {{\n    <{f[0]}, {f[2]}, {f[1]}>,\n    <{f[3]}, {f[5]}, {f[4]}>,\n    <{f[6]
...: }, {f[8]}, {f[7]}>\n  }}"
6.79 µs ± 4.79 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So let’s look at float to str conversion in detail.

In [1]: f = 139.3592529296875;

In [3]: %timeit str(f)
724 ns ± 0.31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [4]: %timeit "{0}".format(f)
734 ns ± 1.21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [5]: %timeit "{0:.5f}".format(f)
314 ns ± 0.0365 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [7]: %timeit "{0:f}".format(f)
313 ns ± 7.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [8]: %timeit "{0:e}".format(f)
382 ns ± 0.171 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

The str() function and format without specifier take approximately the same time, as expected. But if a type is added to the format specifier, the time needed is approximately halved!

Improvements

mesh object

So the format string in stl2pov.py is changed to:

fct = (
    "  triangle {{\n    <{0:e}, {2:e}, {1:e}>,\n    <{3:e}, {5:e}, {4:e}>,\n"
    "    <{6:e}, {8:e}, {7:e}>\n  }}"
)

And then the profiling is run again:

     2236215 function calls (2235779 primitive calls) in 2.082 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
277650    0.863    0.000    0.863    0.000 {method 'format' of 'str' objects}
832945    0.308    0.000    0.465    0.000 stl.py:82(_getbp)
     3    0.195    0.065    0.195    0.065 {method 'write' of '_io.TextIOWrapper' objects}
     1    0.117    0.117    1.089    1.089 stl2pov.py:90(<listcomp>)
277649    0.109    0.000    0.109    0.000 utils.py:85(take)

This simple 2-line change has halved the time the formatting takes, and reduced the total runtime from 3 s to 2 s!

mesh2 object

The formatting for points a mesh2 in stl2pov.py is changed to:

lines += [f"    <{p[0]:e}, {p[2]:e}, {p[1]:e}>," for p in points]

This has the following effect on the profile:

> python -m cProfile -s tottime stl2pov.py -2 ../testdata-stltools/headrest_mesh.stl | less
         3902230 function calls (3901785 primitive calls) in 1.578 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   832945    0.264    0.000    0.509    0.000 vecops.py:120(<genexpr>)
   832945    0.253    0.000    0.259    0.000 stl.py:82(_getbp)
        1    0.180    0.180    0.180    0.180 stl2pov.py:114(<listcomp>)
   832988    0.178    0.000    0.178    0.000 {method 'setdefault' of 'dict' objects}
   277649    0.133    0.000    0.133    0.000 utils.py:85(take)
        1    0.123    0.123    0.123    0.123 stl2pov.py:117(<listcomp>)
        1    0.092    0.092    0.350    0.350 stl.py:78(<listcomp>)
   277649    0.087    0.000    0.220    0.000 stl.py:161(<genexpr>)
        1    0.077    0.077    0.586    0.586 vecops.py:69(indexate)
833894/833805    0.068    0.000    0.068    0.000 {built-in method builtins.len}
        1    0.037    0.037    1.578    1.578 stl2pov.py:9(<module>)
        1    0.025    0.025    0.831    0.831 stl.py:147(toindexed)
        1    0.020    0.020    1.529    1.529 stl2pov.py:123(main)

There is an improvement but it is not as pronounced because the original line doesn’t really dominate the run time as in the other case. Still, this one-line edit makes that the line now takes 16.7% of the total time instead of 19.2%. And the total run time is reduced by 8.6%.

Using strings natively?

If the facets were to contain strings instead of floating point numbers, the time for formatting drops from ≈6.5 μs to ≈1 μs!

In [1]: facet = (
...:     "139.3592529296875", "69.91365814208984", "97.01824951171875",
...:     "138.14346313476562", "70.0313949584961", "97.50768280029297",
...:     "138.79571533203125", "69.03594970703125", "97.875244140625"
...: );

In [2]: fct = (
...:     "  triangle {{\n    <{0}, {2}, {1}>,\n    <{3}, {5}, {4}>,\n"
...:     "    <{6}, {8}, {7}>\n  }}"
...: );

In [3]: %timeit fct.format(*facet)
930 ns ± 0.318 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

And if that was to be changed that into a string join, it would be even faster.

In [4]: %timeit "".join(["  triangle {{\n    <", facet[0], ", ", facet[2], ", ", facet[1],
...: ">,\n    <", facet[3], ", ", facet[5], ", ", facet[4],
...: ">,\n    <", facet[6], ", ", facet[8], ", ", facet[7],
...: ">\n  }}"])
547 ns ± 0.433 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

So for text STL files, changing the parser and keeping the text representations of the coordinates could result in a significant performance increase. But at the cost of other functionality like calculating a bounding box or transforming the object.

For binary STL files, the conversion from float to str would still have to be done somewhere. In that case the performance cost of that conversion would probably cancel the gains from the string formatting. Unless we remove duplicates from the vertex list first. But that will be saved for a later time.

Lessons learned

  • Often the bottleneck is not what you expect.
  • You don’t know until you measure!

For comments, please send me an e-mail.


Related articles


←  Simple setup.py for python scripts Profiling Python scripts (2): stlinfo  →