Profiling Python scripts (5): ent_without_numpy
This is the fifth in a series of articles that covers analyzing and improving
performance bottlenecks in Python scripts.
The program ent_without_numpy.py
is a lot slower than the version that
uses numpy
.
The performance of ent_without_numpy.py
is the topic of this article.
Note
The other articles in this series will be listed under “Related Articles” at the bottom of the page.
The profiling is done as follows:
> python -m cProfile -s tottime ent_without_numpy.py test/random.dat
- Entropy is 7.999982 bits per byte.
- Optimum compression would reduce the size of this 10485760 byte file by 0%.
- χ² distribution for 10485760 samples is 259.03, and randomly would exceed this value 41.80% of the times. According to the χ² test, this sequence looks random.
- Arithmetic mean value of data bytes is 127.5116 (random = 127.5).
- Monte Carlo value for π is 3.139877754 (error 0.05%).
- Serial correlation coefficient is -0.000296 (totally uncorrelated = 0.0).
         69915993 function calls (69915730 primitive calls) in 21.497 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.931    3.931    7.373    7.373 statistics.py:123(_sum)
  3495261    2.891    0.000    5.944    0.000 {built-in method builtins.sum}
 10485761    2.703    0.000    2.703    0.000 statistics.py:219(_exact_ratio)
 10485757    2.437    0.000    4.271    0.000 ent_without_numpy.py:269(<genexpr>)
 10485757    1.833    0.000    1.833    0.000 ent_without_numpy.py:268(<genexpr>)
        1    1.672    1.672    6.589    6.589 ent_without_numpy.py:271(<listcomp>)
 10485761    1.305    0.000    1.305    0.000 ent_without_numpy.py:183(<genexpr>)
        1    1.301    1.301    1.301    1.301 ent_without_numpy.py:181(<listcomp>)
 10485761    1.069    0.000    1.069    0.000 ent_without_numpy.py:185(<genexpr>)
 10485876    0.740    0.000    0.740    0.000 {method 'get' of 'dict' objects}
        1    0.555    0.555    0.555    0.555 {built-in method _collections._count_elements}
  1747627    0.369    0.000    0.680    0.000 ent_without_numpy.py:275(<genexpr>)
  1747627    0.311    0.000    0.311    0.000 ent_without_numpy.py:274(<genexpr>)
        1    0.194    0.194   21.491   21.491 ent_without_numpy.py:28(main)
        1    0.146    0.146    5.892    5.892 ent_without_numpy.py:170(correlation)
        1    0.027    0.027    7.468    7.468 ent_without_numpy.py:256(monte_carlo)
The items with a tottime of <0.01 s have been omitted.
According to time(1), the program uses around 11.1 s of real time and 10.8 s of user time:
> /usr/bin/time python ent_without_numpy.py test/random.dat >/dev/null
- Entropy is 7.999982 bits per byte.
- Optimum compression would reduce the size of this 10485760 byte file by 0%.
- χ² distribution for 10485760 samples is 259.03, and randomly would exceed this value 41.80% of the times. According to the χ² test, this sequence looks random.
- Arithmetic mean value of data bytes is 127.5116 (random = 127.5).
- Monte Carlo value for π is 3.139877754 (error 0.05%).
- Serial correlation coefficient is -0.000296 (totally uncorrelated = 0.0).
       11.16 real        10.82 user         0.33 sys
This illustrates perfectly the overhead that profiling can add. So with profiling, it is not the amount of time in itself that is important, but how much it can be improved. As will become clear, a reduction in run time under the profiler should also result in a reduction in run time when not running under the profiler.
For comparison, the version that uses numpy runs in about 0.4 s on the same machine on the same dataset:
> /usr/bin/time python ent.py test/random.dat
- Entropy is 7.999982 bits per byte.
- Optimum compression would reduce the size of this 10485760 byte file by 0%.
- χ² distribution for 10485760 samples is 259.03, and randomly would exceed this value 41.80% of the times. According to the χ² test, this sequence looks random.
- Arithmetic mean value of data bytes is 127.5116 (random = 127.5).
- Monte Carlo value for π is 3.139877754 (error 0.05%).
- Serial correlation coefficient is -0.000296 (totally uncorrelated = 0.0).
        0.41 real         0.29 user         0.11 sys
Note that the output of both versions is identical.
Improvements
Looking at the source of statistics.py, statistics.mean()
is responsible for the first three lines in the profile.
Its source code shows that this function is actually pretty complex,
because it has to deal with different types of data.
So let’s replace it with statistics.fmean()
, a fast floating point
arithmetic mean.
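As an aside, the difference between the two is easy to measure. The snippet below is a quick sketch of mine (it is not part of ent_without_numpy.py), using a 1 MB buffer of random bytes as a stand-in for the test file. Note that fmean() requires Python 3.8 or later.

# Quick comparison of statistics.mean() and statistics.fmean().
import os
import statistics as stat
import timeit

data = os.urandom(1_000_000)  # stand-in for the contents of test/random.dat
print("mean: ", timeit.timeit(lambda: stat.mean(data), number=1))
print("fmean:", timeit.timeit(lambda: stat.fmean(data), number=1))

Since fmean() skips the exact-ratio bookkeeping that mean() does for mixed numeric types, it should come out clearly faster for plain byte data.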
For those that are not familiar with “unified diff” output, a “-” in front of
a line means it has been removed, while a “+” means it has been added.
@@ -75,7 +75,7 @@ def terseout(data, e, chi2, p, d, scc, mc):
"""
print("0,File-bytes,Entropy,Chi-square,Mean," "Monte-Carlo-Pi,Serial-Correlation")
n = len(data)
- m = stat.mean(data)
+ m = stat.fmean(data)
print(f"1,{n},{e:.6f},{chi2:.6f},{m:.6f},{mc:.6f},{scc}")
@@ -109,7 +109,7 @@ def textout(data, e, chi2, p, d, scc, mc):
print("is close to random, but not perfect.")
else:
print("looks random.")
- m = stat.mean(data)
+ m = stat.fmean(data)
print(f"- Arithmetic mean value of data bytes is {m:.4f} (random = 127.5).")
err = 100 * (math.fabs(PI - mc) / PI)
print(f"- Monte Carlo value for π is {mc:.9f} (error {err:.2f}%).")
And profile again:
> python -m cProfile -s cumtime ent_without_numpy.py test/random.dat | tail -n +10 | head -n 19
         48944433 function calls (48944172 primitive calls) in 13.888 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  3495260    2.843    0.000    5.816    0.000 {built-in method builtins.sum}
 10485757    2.410    0.000    4.143    0.000 ent_without_numpy.py:269(<genexpr>)
 10485757    1.734    0.000    1.734    0.000 ent_without_numpy.py:268(<genexpr>)
        1    1.634    1.634    6.410    6.410 ent_without_numpy.py:271(<listcomp>)
 10485761    1.257    0.000    1.257    0.000 ent_without_numpy.py:183(<genexpr>)
        1    1.238    1.238    1.238    1.238 ent_without_numpy.py:181(<listcomp>)
 10485761    1.052    0.000    1.052    0.000 ent_without_numpy.py:185(<genexpr>)
        1    0.537    0.537    0.537    0.537 {built-in method _collections._count_elements}
  1747627    0.363    0.000    0.663    0.000 ent_without_numpy.py:275(<genexpr>)
  1747627    0.300    0.000    0.300    0.000 ent_without_numpy.py:274(<genexpr>)
        1    0.188    0.188   13.881   13.881 ent_without_numpy.py:28(main)
        1    0.159    0.159    0.159    0.159 {built-in method math.fsum}
        1    0.132    0.132    5.724    5.724 ent_without_numpy.py:170(correlation)
        1    0.027    0.027    7.266    7.266 ent_without_numpy.py:256(monte_carlo)
The run time according to time(1)
is 6.8 s of real time and 6.5 s of user
time.
This two-line change has reduced the run time by more than a third!
The built-in sum
function now takes the most time.
Built-ins are not something we can easily alter, so let's look at the next
three items in the profile instead.
Line 268 is a simple float conversion of the data. (MONTEN
is a constant.)
d = (float(j) for j in d[: len(d) // MONTEN * MONTEN])
And line 269 uses that to calculate an intermediate.
intermediate = (i * j for i, j in zip(d, it.cycle([256 ** 2, 256, 1])))
Since d
is not used anywhere else, both lines can be combined.
Both of these are generator expressions.
That means that they do not have to build intermediate lists; values are
produced as needed.
Pretty neat if you are working with large data sets.
It also means that when such generator expressions feed each other, as is done
here, the benefit of combining them can be limited.
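A tiny example (mine, not from the program) shows the idea:

# Chained generator expressions: nothing is computed until the values are consumed.
nums = range(5)
floats = (float(n) for n in nums)  # first stage
scaled = (2 * f for f in floats)   # second stage feeds on the first
print(list(scaled))  # [0.0, 2.0, 4.0, 6.0, 8.0]; all the work happens here

Every value still passes through both stages one at a time, which is why merging the two stages into a single expression only removes a modest amount of per-item overhead.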
@@ -265,8 +265,12 @@ def monte_carlo(d):
"""
MONTEN = 6
incirc = (256.0 ** (MONTEN // 2) - 1) ** 2
- d = (float(j) for j in d[: len(d) // MONTEN * MONTEN])
- intermediate = (i * j for i, j in zip(d, it.cycle([256 ** 2, 256, 1])))
+ intermediate = (
+ float(i) * j
+ for i, j in zip(
+ d[: len(d) // MONTEN * MONTEN], it.cycle((256.0 ** 2, 256.0, 1.0))
+ )
+ )
args = [intermediate] * 3
values = [sum(j) for j in it.zip_longest(*args)]
montex = values[0::2]
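The two unchanged lines with args and it.zip_longest() use a trick worth pointing out: because args holds three references to the same generator, zip_longest() pulls three consecutive values into every tuple. A small standalone illustration (mine, not from the program):

import itertools as it

gen = iter(range(9))
args = [gen] * 3  # three references to the *same* iterator
print(list(it.zip_longest(*args)))  # [(0, 1, 2), (3, 4, 5), (6, 7, 8)]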
After this change, we profile again:
> python -m cProfile -s tottime ent_without_numpy.py test/random.dat | tail -n +10 | head -n 18
         38458676 function calls (38458415 primitive calls) in 11.774 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  3495260    2.794    0.000    5.795    0.000 {built-in method builtins.sum}
 10485757    2.018    0.000    2.018    0.000 ent_without_numpy.py:268(<genexpr>)
        1    1.593    1.593    4.190    4.190 ent_without_numpy.py:275(<listcomp>)
        1    1.311    1.311    1.311    1.311 ent_without_numpy.py:181(<listcomp>)
 10485761    1.272    0.000    1.272    0.000 ent_without_numpy.py:183(<genexpr>)
 10485761    1.068    0.000    1.068    0.000 ent_without_numpy.py:185(<genexpr>)
        1    0.537    0.537    0.537    0.537 {built-in method _collections._count_elements}
  1747627    0.361    0.000    0.662    0.000 ent_without_numpy.py:279(<genexpr>)
  1747627    0.300    0.000    0.300    0.000 ent_without_numpy.py:278(<genexpr>)
        1    0.188    0.188   11.768   11.768 ent_without_numpy.py:28(main)
        1    0.159    0.159    0.159    0.159 {built-in method math.fsum}
        1    0.132    0.132    5.831    5.831 ent_without_numpy.py:170(correlation)
        1    0.027    0.027    5.045    5.045 ent_without_numpy.py:256(monte_carlo)
The run time according to time(1)
is now 6.3 s of real time and 6.0 s of user time.
So not a huge improvement, but still on the order of 10%.
Lines 181–185 are in the function correlation.
As the next step, in the calculations for the scct values, the bytes d are
replaced by the list of floats a.
@@ -180,9 +180,9 @@ def correlation(d):
totalc = len(d)
a = [float(j) for j in d]
b = a[1:] + [a[0]]
- scct1 = sum(i * j for i, j in zip(d, b))
- scct2 = sum(d) ** 2
- scct3 = sum(j * j for j in d)
+ scct1 = sum(i * j for i, j in zip(a, b))
+ scct2 = sum(a) ** 2
+ scct3 = sum(j * j for j in a)
scc = totalc * scct3 - scct2
if scc == 0:
raise ValueError
This shortened the real run-time to 5.4 s and the user time to 5.0 s:
> /usr/bin/time python ent_without_numpy.py test/random.dat | grep correlation
- Serial correlation coefficient is -0.000296 (totally uncorrelated = 0.0).
        5.40 real         5.03 user         0.37 sys
At this point I didn’t see any obvious improvements anymore. Time to let it rest and revisit it later.
Update end of January 2022
A simpler monte_carlo
was implemented.
@@ -16,7 +16,8 @@ See https://www.fourmilab.ch/random/ for the original.
import argparse
import collections
-import itertools as it
+
+# import itertools as it
import math
import statistics as stat
import sys
@@ -263,19 +264,14 @@ def monte_carlo(d):
Returns:
Approximation of π
"""
- MONTEN = 6
- incirc = (256.0 ** (MONTEN // 2) - 1) ** 2
- intermediate = (
- float(i) * j
- for i, j in zip(
- d[: len(d) // MONTEN * MONTEN], it.cycle((256.0 ** 2, 256.0, 1.0))
- )
- )
- values = [sum(j) for j in zip(intermediate, intermediate, intermediate)]
+ values = [
+ a * 65536.0 + b * 256.0 + c * 1.0 for a, b, c in zip(d[0::3], d[1::3], d[2::3])
+ ]
montex = values[0::2]
montey = values[1::2]
dist2 = (i * i + j * j for i, j in zip(montex, montey))
- inmont = sum(j <= incirc for j in dist2)
+ # constant in the next line is (256.0 ** 3 - 1) ** 2
+ inmont = sum(k <= 281474943156225.0 for k in dist2)
montepi = 4 * inmont / len(montex)
This reduced the run-time significantly:
> /usr/bin/time python ent_without_numpy.py test/random.dat | grep Monte
- Monte Carlo value for π is 3.139875958 (error 0.05%).
        4.08 real         3.74 user         0.33 sys
While the output of the Monte Carlo calculation has changed, the difference is very small (0.0000018) and not enough to change the error value. In my eyes this is acceptable.
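For reference, the simplified calculation amounts to the standalone function below. This is my reconstruction based on the diff above, not a verbatim copy of the program:

import os

def monte_carlo_pi(d):
    # Every six bytes form one (x, y) point with 24-bit coordinates.
    values = [
        a * 65536.0 + b * 256.0 + c for a, b, c in zip(d[0::3], d[1::3], d[2::3])
    ]
    montex = values[0::2]
    montey = values[1::2]
    incirc = (256.0 ** 3 - 1) ** 2  # squared radius of the circle, 281474943156225.0
    inmont = sum(x * x + y * y <= incirc for x, y in zip(montex, montey))
    # The fraction of points inside the quarter circle approximates π/4.
    return 4 * inmont / len(montex)

print(monte_carlo_pi(os.urandom(10 ** 6)))  # prints a value close to 3.14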
Update August 2022
At this point, I initially used time.process_time()
to check how long the
functions in main()
took.
This was done to get a better understanding of the bottlenecks in the process.
The logging.debug
function was used to report it:
DEBUG: reading data took 0.456 s.
DEBUG: calculating entropy took 0.000 s.
DEBUG: calculating Χ² took 0.000 s.
DEBUG: calculating probability of χ² test value took 0.000 s.
DEBUG: calculating Monte Carlo value of π took 0.927 s.
DEBUG: calculating serial correlation coefficient took 2.517 s.
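The pattern is roughly as shown below; compute() is only a placeholder for calls like correlation(data) in the real script, not something that exists in it:

import logging
import time

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")

def compute():  # placeholder for e.g. correlation(data)
    return sum(j * j for j in range(10 ** 6))

start = time.process_time()
compute()
logging.debug(f"calculating serial correlation coefficient took {time.process_time() - start:.3f} s.")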
To confirm this, ent_without_numpy.py
was profiled again (program output removed):
> python -m cProfile -s cumtime ent_without_numpy.py test/random.dat | head -n 32
         24477882 function calls (24477621 primitive calls) in 6.773 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     14/1    0.000    0.000    6.773    6.773 {built-in method builtins.exec}
        1    0.000    0.000    6.773    6.773 ent_without_numpy.py:9(<module>)
        1    0.171    0.171    6.768    6.768 ent_without_numpy.py:29(main)
        1    0.125    0.125    4.677    4.677 ent_without_numpy.py:171(correlation)
        8    1.751    0.219    4.076    0.510 {built-in method builtins.sum}
        1    0.029    0.029    1.290    1.290 ent_without_numpy.py:257(monte_carlo)
        1    1.130    1.130    1.130    1.130 ent_without_numpy.py:182(<listcomp>)
 10485761    0.973    0.000    0.973    0.000 ent_without_numpy.py:184(<genexpr>)
 10485761    0.834    0.000    0.834    0.000 ent_without_numpy.py:186(<genexpr>)
        1    0.607    0.607    0.607    0.607 ent_without_numpy.py:267(<listcomp>)
  1747627    0.278    0.000    0.518    0.000 ent_without_numpy.py:274(<genexpr>)
        1    0.000    0.000    0.453    0.453 ent_without_numpy.py:120(readdata)
        1    0.000    0.000    0.450    0.450 __init__.py:581(__init__)
        1    0.000    0.000    0.450    0.450 __init__.py:649(update)
        1    0.450    0.450    0.450    0.450 {built-in method _collections._count_elements}
  1747627    0.240    0.000    0.240    0.000 ent_without_numpy.py:272(<genexpr>)
        1    0.000    0.000    0.171    0.171 ent_without_numpy.py:83(textout)
        1    0.000    0.000    0.171    0.171 statistics.py:321(fmean)
When we look at the functions (except main
), it is clear that
correlation
uses the most time, followed by monte_carlo
.
As expected, readdata
is next down the list.
Let’s time it without profiling:
> /usr/bin/time python ent_without_numpy.py test/random.dat >/dev/null
        4.09 real         3.76 user         0.33 sys
Using readdata() and the timeit module in IPython, I found that
converting the bytes data to a list of floats, as is done in correlation,
is relatively slow, taking roughly 1.2 seconds:
In [4]: import collections

In [5]: d, c = readdata("test/random.dat");

In [6]: len(d), len(c)
Out[6]: (10485760, 256)

In [7]: import timeit

In [8]: timeit.timeit('[float(j) for j in d]',number=1,globals=globals())
Out[8]: 1.2279654936864972
After unsuccessful experimentation with array.array, I decided to try using
bytes directly instead of float for the calculation of the
scct values:
@@ -179,11 +177,10 @@ def correlation(d):
Serial correlation coeffiecient.
"""
totalc = len(d)
- a = [float(j) for j in d]
- b = a[1:] + [a[0]]
- scct1 = sum(i * j for i, j in zip(a, b))
- scct2 = sum(a) ** 2
- scct3 = sum(j * j for j in a)
+ b = d[1:] + bytes(d[0])
+ scct1 = sum(i * j for i, j in zip(d, b))
+ scct2 = sum(d) ** 2
+ scct3 = sum(j * j for j in d)
scc = totalc * scct3 - scct2
if scc == 0:
raise ValueError
It should be noted here that when we iterate over bytes
, we automatically
get integers:
In [1]: foo = bytes([24, 128, 72, 254, 2, 12])
Out[1]: b'\x18\x80H\xfe\x02\x0c'

In [2]: [type(j) for j in foo]
Out[2]: [int, int, int, int, int, int]
And since Python integers have arbitrary precision, we don’t have to worry about overflow:
In [3]: [j * j for j in foo]
Out[3]: [576, 16384, 5184, 64516, 4, 144]

In [4]: sum(j * j for j in foo)
Out[4]: 86808
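Putting it together, the byte-based calculation looks roughly like the function below. The combination of the three sums into the coefficient is my reconstruction (that part is not shown in the diffs), and I use d[:1] instead of bytes(d[0]) for the wrap-around element:

def serial_correlation(d):
    # d is a bytes object; iterating over it yields integers in the range 0..255.
    n = len(d)
    b = d[1:] + d[:1]  # d shifted by one position, wrapping around
    scct1 = sum(i * j for i, j in zip(d, b))
    scct2 = sum(d) ** 2
    scct3 = sum(j * j for j in d)
    denominator = n * scct3 - scct2
    if denominator == 0:
        raise ValueError("undefined for constant data")
    return (n * scct1 - scct2) / denominator

print(serial_correlation(bytes([24, 128, 72, 254, 2, 12])))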
After the change we profile again:
> python -m cProfile -s cumtime ent_without_numpy.py test/random.dat | tail -n +9 | head -n 27
- Serial correlation coefficient is -0.000296 (totally uncorrelated = 0.0).
         24477881 function calls (24477620 primitive calls) in 5.804 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     14/1    0.000    0.000    5.804    5.804 {built-in method builtins.exec}
        1    0.000    0.000    5.804    5.804 ent_without_numpy.py:9(<module>)
        1    0.051    0.051    5.798    5.798 ent_without_numpy.py:27(main)
        8    1.942    0.243    4.494    0.562 {built-in method builtins.sum}
        1    0.004    0.004    3.837    3.837 ent_without_numpy.py:169(correlation)
        1    0.029    0.029    1.295    1.295 ent_without_numpy.py:254(monte_carlo)
 10485761    1.086    0.000    1.086    0.000 ent_without_numpy.py:181(<genexpr>)
 10485761    0.944    0.000    0.944    0.000 ent_without_numpy.py:183(<genexpr>)
        1    0.605    0.605    0.605    0.605 ent_without_numpy.py:264(<listcomp>)
  1747627    0.280    0.000    0.522    0.000 ent_without_numpy.py:271(<genexpr>)
        1    0.000    0.000    0.442    0.442 ent_without_numpy.py:118(readdata)
        1    0.000    0.000    0.440    0.440 __init__.py:581(__init__)
        1    0.000    0.000    0.440    0.440 __init__.py:649(update)
        1    0.440    0.440    0.440    0.440 {built-in method _collections._count_elements}
  1747627    0.242    0.000    0.242    0.000 ent_without_numpy.py:269(<genexpr>)
        1    0.000    0.000    0.168    0.168 ent_without_numpy.py:81(textout)
        1    0.000    0.000    0.168    0.168 statistics.py:321(fmean)
        1    0.168    0.168    0.168    0.168 {built-in method math.fsum}
Note that the time needed for correlation
has been reduced, but the result
is unchanged.
Time without profiling:
> /usr/bin/time python ent_without_numpy.py test/random.dat >/dev/null
        3.04 real         2.94 user         0.09 sys
This is a reduction in runtime of about 25%, not bad.
Update July 2024
In the meantime, the default Python on FreeBSD was updated to 3.11.
Time without profiling:
> /usr/bin/time python ent_without_numpy.py test/random.dat >/dev/null
        2.81 real         2.72 user         0.09 sys
In July of 2024 I repeated the profiling with pyinstrument
under Python 3.11:
> python -m pyinstrument ent_without_numpy.py test/random.dat
- Entropy is 7.999982 bits per byte.
- Optimum compression would reduce the size of this 10485760 byte file by 0%.
- χ² distribution for 10485760 samples is 259.03, and randomly would exceed this value 41.80% of the times. According to the χ² test, this sequence looks random.
- Arithmetic mean value of data bytes is 127.5116 (random = 127.5).
- Monte Carlo value for π is 3.139875958 (error 0.05%).
- Serial correlation coefficient is -0.000296 (totally uncorrelated = 0.0).

  _     ._   __/__   _ _  _  _ _/_   Recorded: 11:37:07  Samples:  6106
 /_//_/// /_\ / //_// / //_'/ //     Duration: 7.947     CPU time: 7.850
/   _/                      v4.6.2

Program: /home/rsmith/.local/lib/python3.11/site-packages/pyinstrument/__main__.py ent_without_numpy.py test/random.dat

7.928 <module>  ent_without_numpy.py:1
└─ 7.915 main  ent_without_numpy.py:28
   ├─ 5.225 correlation  ent_without_numpy.py:170
   │  ├─ 2.004 <genexpr>  ent_without_numpy.py:184
   │  ├─ 1.793 [self]  ent_without_numpy.py
   │  └─ 1.352 <genexpr>  ent_without_numpy.py:186
   ├─ 1.964 monte_carlo  ent_without_numpy.py:257
   │  ├─ 0.951 <listcomp>  ent_without_numpy.py:267
   │  ├─ 0.800 <genexpr>  ent_without_numpy.py:274
   │  │  ├─ 0.450 <genexpr>  ent_without_numpy.py:272
   │  │  └─ 0.350 [self]  ent_without_numpy.py
   │  └─ 0.214 [self]  ent_without_numpy.py
   ├─ 0.539 readdata  ent_without_numpy.py:119
   │  ├─ 0.440 Counter.__init__  collections/__init__.py:587
   │  │     [3 frames hidden]  collections, <built-in>
   │  └─ 0.099 BufferedReader.read  <built-in>
   └─ 0.144 textout  ent_without_numpy.py:82
      └─ 0.144 fmean  statistics.py:436
            [2 frames hidden]  statistics, <built-in>

To view this report with different options, run:
    pyinstrument --load-prev 2024-07-14T11-37-07 [options]
This was worse than before, so I reverted the change to use arrays.
This resulted in:
> /usr/bin/time python ent_without_numpy.py test/random.dat > /dev/null
        2.65 real         2.60 user         0.04 sys
And:
> python -m pyinstrument ent_without_numpy.py test/random.dat
- Entropy is 7.999982 bits per byte.
- Optimum compression would reduce the size of this 10485760 byte file by 0%.
- χ² distribution for 10485760 samples is 259.03, and randomly would exceed this value 41.80% of the times. According to the χ² test, this sequence looks random.
- Arithmetic mean value of data bytes is 127.5116 (random = 127.5).
- Monte Carlo value for π is 3.139875958 (error 0.05%).
- Serial correlation coefficient is -0.000296 (totally uncorrelated = 0.0).

  _     ._   __/__   _ _  _  _ _/_   Recorded: 11:58:27  Samples:  5888
 /_//_/// /_\ / //_// / //_'/ //     Duration: 7.555     CPU time: 7.554
/   _/                      v4.6.2

Program: /home/rsmith/.local/lib/python3.11/site-packages/pyinstrument/__main__.py ent_without_numpy.py test/random.dat

7.552 <module>  ent_without_numpy.py:1
└─ 7.547 main  ent_without_numpy.py:27
   ├─ 5.002 correlation  ent_without_numpy.py:169
   │  ├─ 1.901 <genexpr>  ent_without_numpy.py:181
   │  ├─ 1.797 [self]  ent_without_numpy.py
   │  └─ 1.244 <genexpr>  ent_without_numpy.py:183
   ├─ 1.941 monte_carlo  ent_without_numpy.py:254
   │  ├─ 0.933 <listcomp>  ent_without_numpy.py:264
   │  ├─ 0.831 <genexpr>  ent_without_numpy.py:271
   │  │  ├─ 0.497 <genexpr>  ent_without_numpy.py:269
   │  │  └─ 0.334 [self]  ent_without_numpy.py
   │  └─ 0.177 [self]  ent_without_numpy.py
   ├─ 0.417 readdata  ent_without_numpy.py:118
   │  └─ 0.413 Counter.__init__  collections/__init__.py:587
   │        [3 frames hidden]  collections, <built-in>
   └─ 0.145 textout  ent_without_numpy.py:81
      └─ 0.145 fmean  statistics.py:436
            [2 frames hidden]  statistics, <built-in>

To view this report with different options, run:
    pyinstrument --load-prev 2024-07-14T11-58-27 [options]
Lessons learned
- For dealing with large amounts of numeric data, numpy can be a huge improvement w.r.t. performance.
- Profiling has significant overhead. Run your scripts under time(1) to look at the unencumbered run time.
- Even when not using numpy, float is often faster than int.
- But not converting the data at all also saves time.
- If you don’t see improvements anymore, let it rest and come back later.
- Newer Python versions can change the results.
For comments, please send me an e-mail.
Related articles
- Profiling Python scripts (6): auto-orient
- Profiling with pyinstrument
- Profiling Python scripts (4): vecops.indexate
- Profiling Python scripts (3): stl2ps
- Profiling Python scripts (2): stlinfo