Profiling Python scripts (1): stl2pov
This is the first in a series of articles that covers analyzing and improving performance bottlenecks in Python scripts.
Note
The other articles in this series will be listed under “Related Articles” at the bottom of the page.
My stltools programs will serve as examples because they can have significant
run times.
It must be said that these scripts are not unoptimized!
On and off I’ve been working on them since 2011. Meanwhile I’ve used
numpy
, pypy
and discovered ways to make it faster without them.
But as will be seen, there is still room for improvement.
Profiling stl2pov
First the script that started it all; stl2pov
.
Profiling is done as follows using the cProfile module, from the source
directory of the project.
Generating a mesh
object
python -m cProfile \
-s tottime \
stl2pov.py \
../testdata-stltools/headrest_mesh.stl | less
The test file headrest_mesh.stl
is a binary STL file because those tend
to be the largest.
It has 277648 facets.
The profile is sorted by total time spent, so only the first lines are most important. Only items with total time >0.1 s are shown below:
2236215 function calls (2235779 primitive calls) in 3.181 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 277650 1.828 0.000 1.828 0.000 {method 'format' of 'str' objects} 832945 0.313 0.000 0.470 0.000 stl.py:82(_getbp) 3 0.295 0.098 0.295 0.098 {method 'write' of '_io.TextIOWrapper' objects} 277649 0.132 0.000 0.132 0.000 utils.py:85(take) 1 0.118 0.118 2.077 2.077 stl2pov.py:90(<listcomp>)
Of the 3.1 seconds the script takes, 1.8 s is spent in formatting strings.
So there is clearly a major bottleneck in this script.
Looking at the source code for stl2pov.py
, there is only one format
call
used, in the function mesh1
:
fct = (
" triangle {{\n <{0}, {2}, {1}>,\n <{3}, {5}, {4}>,\n"
" <{6}, {8}, {7}>\n }}"
)
lines += [fct.format(*f) for f in facets]
So the current bottleneck is formatting the floating point coordinates of the vertices. Honestly, I had not expected that.
Generating a mesh2
object
Note
This section was added after other improvements to the code were also made. So the lines referenced here might be different from those in the rest of the article.
What happens if we generate a mesh2
object instead of the default
mesh
:
3902230 function calls (3901785 primitive calls) in 1.727 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 1 0.332 0.332 0.332 0.332 stl2pov.py:114(<listcomp>) 832945 0.262 0.000 0.505 0.000 vecops.py:120(<genexpr>) 832945 0.250 0.000 0.256 0.000 stl.py:82(_getbp) 832988 0.176 0.000 0.176 0.000 {method 'setdefault' of 'dict' objects} 277649 0.133 0.000 0.133 0.000 utils.py:85(take) 1 0.122 0.122 0.122 0.122 stl2pov.py:117(<listcomp>) 1 0.091 0.091 0.346 0.346 stl.py:78(<listcomp>) 277649 0.087 0.000 0.220 0.000 stl.py:161(<genexpr>) 1 0.077 0.077 0.582 0.582 vecops.py:69(indexate) 833894/833805 0.068 0.000 0.068 0.000 {built-in method builtins.len} 1 0.037 0.037 1.727 1.727 stl2pov.py:9(<module>) 1 0.025 0.025 0.827 0.827 stl.py:147(toindexed) 1 0.021 0.021 1.678 1.678 stl2pov.py:123(main) 499 0.010 0.000 0.010 0.000 {method 'join' of 'str' objects}
In this case the vertices have duplicates removed. So much less data needs to be processed. In contrast to the default, there is no one item that dominates the total time. The top three items account for about half the runtime.
The line stl2pov.py:114
is:
lines += [f" <{p[0]}, {p[2]}, {p[1]}>," for p in points]
And vecops.py:120
is:
indices = tuple(pd.setdefault(p, len(pd)) for p in points)
Investigating string formatting
Let’s get a baseline of how long a format action takes.
This is done using IPython because of its built-in %timeit
command.
In [1]: facet = (
...: 139.3592529296875, 69.91365814208984, 97.01824951171875,
...: 138.14346313476562, 70.0313949584961, 97.50768280029297,
...: 138.79571533203125, 69.03594970703125, 97.875244140625
...: );
In [2]: fct = (
...: " triangle {{\n <{0}, {2}, {1}>,\n <{3}, {5}, {4}>,\n"
...: " <{6}, {8}, {7}>\n }}"
...: );
In [3]: %timeit fct.format(*facet)
6.39 µs ± 11.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
If you don’t want to install IPython, it could also be done like this in
the Python REPL using timeit
from the standard library.
>>> facet = (
... 139.3592529296875, 69.91365814208984, 97.01824951171875,
... 138.14346313476562, 70.0313949584961, 97.50768280029297,
... 138.79571533203125, 69.03594970703125, 97.875244140625
... )
>>> fct = (
... " triangle {{\n <{0}, {2}, {1}>,\n <{3}, {5}, {4}>,\n"
... " <{6}, {8}, {7}>\n }}"
... )
>>> import timeit
>>> timeit.timeit("fct.format(*facet)", number=10000, globals=globals())/10000
6.748177204281092e-06
On my machine, running the above code yields that it takes ≈6.5 μs to format the
string.
Replacing the format
method with an f-string yields the same result.
In [5]: f = (
...: 139.3592529296875, 69.91365814208984, 97.01824951171875,
...: 138.14346313476562, 70.0313949584961, 97.50768280029297,
...: 138.79571533203125, 69.03594970703125, 97.875244140625
...: );
In [6]: %timeit f" triangle {{\n <{f[0]}, {f[2]}, {f[1]}>,\n <{f[3]}, {f[5]}, {f[4]}>,\n <{f[6]
...: }, {f[8]}, {f[7]}>\n }}"
6.79 µs ± 4.79 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So let’s look at float
to str
conversion in detail.
In [1]: f = 139.3592529296875;
In [3]: %timeit str(f)
724 ns ± 0.31 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [4]: %timeit "{0}".format(f)
734 ns ± 1.21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [5]: %timeit "{0:.5f}".format(f)
314 ns ± 0.0365 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [7]: %timeit "{0:f}".format(f)
313 ns ± 7.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [8]: %timeit "{0:e}".format(f)
382 ns ± 0.171 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The str()
function and format
without specifier take approximately the
same time, as expected.
But if a type is added to the format specifier, the time needed is
approximately halved!
Improvements
mesh
object
So the format string in stl2pov.py
is changed to:
fct = (
" triangle {{\n <{0:e}, {2:e}, {1:e}>,\n <{3:e}, {5:e}, {4:e}>,\n"
" <{6:e}, {8:e}, {7:e}>\n }}"
)
And then the profiling is run again:
2236215 function calls (2235779 primitive calls) in 2.082 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 277650 0.863 0.000 0.863 0.000 {method 'format' of 'str' objects} 832945 0.308 0.000 0.465 0.000 stl.py:82(_getbp) 3 0.195 0.065 0.195 0.065 {method 'write' of '_io.TextIOWrapper' objects} 1 0.117 0.117 1.089 1.089 stl2pov.py:90(<listcomp>) 277649 0.109 0.000 0.109 0.000 utils.py:85(take)
This simple 2-line change has halved the time the formatting takes, and reduced the total runtime from 3 s to 2 s!
mesh2
object
The formatting for points a mesh2
in stl2pov.py
is changed to:
lines += [f" <{p[0]:e}, {p[2]:e}, {p[1]:e}>," for p in points]
This has the following effect on the profile:
> python -m cProfile -s tottime stl2pov.py -2 ../testdata-stltools/headrest_mesh.stl | less 3902230 function calls (3901785 primitive calls) in 1.578 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 832945 0.264 0.000 0.509 0.000 vecops.py:120(<genexpr>) 832945 0.253 0.000 0.259 0.000 stl.py:82(_getbp) 1 0.180 0.180 0.180 0.180 stl2pov.py:114(<listcomp>) 832988 0.178 0.000 0.178 0.000 {method 'setdefault' of 'dict' objects} 277649 0.133 0.000 0.133 0.000 utils.py:85(take) 1 0.123 0.123 0.123 0.123 stl2pov.py:117(<listcomp>) 1 0.092 0.092 0.350 0.350 stl.py:78(<listcomp>) 277649 0.087 0.000 0.220 0.000 stl.py:161(<genexpr>) 1 0.077 0.077 0.586 0.586 vecops.py:69(indexate) 833894/833805 0.068 0.000 0.068 0.000 {built-in method builtins.len} 1 0.037 0.037 1.578 1.578 stl2pov.py:9(<module>) 1 0.025 0.025 0.831 0.831 stl.py:147(toindexed) 1 0.020 0.020 1.529 1.529 stl2pov.py:123(main)
There is an improvement but it is not as pronounced because the original line doesn’t really dominate the run time as in the other case. Still, this one-line edit makes that the line now takes 16.7% of the total time instead of 19.2%. And the total run time is reduced by 8.6%.
Using strings natively?
If the facets were to contain strings instead of floating point numbers, the time for formatting drops from ≈6.5 μs to ≈1 μs!
In [1]: facet = (
...: "139.3592529296875", "69.91365814208984", "97.01824951171875",
...: "138.14346313476562", "70.0313949584961", "97.50768280029297",
...: "138.79571533203125", "69.03594970703125", "97.875244140625"
...: );
In [2]: fct = (
...: " triangle {{\n <{0}, {2}, {1}>,\n <{3}, {5}, {4}>,\n"
...: " <{6}, {8}, {7}>\n }}"
...: );
In [3]: %timeit fct.format(*facet)
930 ns ± 0.318 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
And if that was to be changed that into a string join, it would be even faster.
In [4]: %timeit "".join([" triangle {{\n <", facet[0], ", ", facet[2], ", ", facet[1],
...: ">,\n <", facet[3], ", ", facet[5], ", ", facet[4],
...: ">,\n <", facet[6], ", ", facet[8], ", ", facet[7],
...: ">\n }}"])
547 ns ± 0.433 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
So for text STL files, changing the parser and keeping the text representations of the coordinates could result in a significant performance increase. But at the cost of other functionality like calculating a bounding box or transforming the object.
For binary STL files, the conversion from float
to str
would still
have to be done somewhere.
In that case the performance cost of that conversion would probably cancel the
gains from the string formatting.
Unless we remove duplicates from the vertex list first.
But that will be saved for a later time.
Lessons learned
- Often the bottleneck is not what you expect.
- You don’t know until you measure!
For comments, please send me an e-mail.
Related articles
- Profiling Python scripts (4): vecops.indexate
- Profiling Python scripts (3): stl2ps
- Profiling Python scripts (2): stlinfo
- Profiling Python scripts(6): auto-orient
- Profiling with pyinstrument