Profiling Python scripts (5): ent_without_numpy

This is the fifth in a series of articles that covers analyzing and improving performance bottlenecks in Python scripts. The program ent_without_numpy.py is a lot slower than the version that uses numpy. The performance of ent_without_numpy.py is the topic of this article.

Note

The other articles in this series will be listed under “Related Articles” at the bottom of the page.

The profiling is done as follows:

> python -m cProfile -s tottime ent_without_numpy.py test/random.dat
- Entropy is 7.999982 bits per byte.
- Optimum compression would reduce the size
  of this 10485760 byte file by 0%.
- χ² distribution for 10485760 samples is 259.03, and randomly
  would exceed this value 41.80% of the times.
  According to the χ² test, this sequence looks random.
- Arithmetic mean value of data bytes is 127.5116 (random = 127.5).
- Monte Carlo value for π is 3.139877754 (error 0.05%).
- Serial correlation coefficient is -0.000296 (totally uncorrelated = 0.0).

     69915993 function calls (69915730 primitive calls) in 21.497 seconds

  Ordered by: internal time

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       1    3.931    3.931    7.373    7.373 statistics.py:123(_sum)
 3495261    2.891    0.000    5.944    0.000 {built-in method builtins.sum}
10485761    2.703    0.000    2.703    0.000 statistics.py:219(_exact_ratio)
10485757    2.437    0.000    4.271    0.000 ent_without_numpy.py:269(<genexpr>)
10485757    1.833    0.000    1.833    0.000 ent_without_numpy.py:268(<genexpr>)
       1    1.672    1.672    6.589    6.589 ent_without_numpy.py:271(<listcomp>)
10485761    1.305    0.000    1.305    0.000 ent_without_numpy.py:183(<genexpr>)
       1    1.301    1.301    1.301    1.301 ent_without_numpy.py:181(<listcomp>)
10485761    1.069    0.000    1.069    0.000 ent_without_numpy.py:185(<genexpr>)
10485876    0.740    0.000    0.740    0.000 {method 'get' of 'dict' objects}
       1    0.555    0.555    0.555    0.555 {built-in method _collections._count_elements}
 1747627    0.369    0.000    0.680    0.000 ent_without_numpy.py:275(<genexpr>)
 1747627    0.311    0.000    0.311    0.000 ent_without_numpy.py:274(<genexpr>)
       1    0.194    0.194   21.491   21.491 ent_without_numpy.py:28(main)
       1    0.146    0.146    5.892    5.892 ent_without_numpy.py:170(correlation)
       1    0.027    0.027    7.468    7.468 ent_without_numpy.py:256(monte_carlo)

The items with a tottime of <0.01 s have been omitted.
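The same kind of report can also be collected from within Python using cProfile and pstats, which makes it easy to sort and truncate the output; a minimal sketch (the work() function is just a stand-in for the real script):

```python
import cProfile
import pstats


def work():
    # stand-in for the code being profiled
    return sum(i * i for i in range(100000))


# requires Python 3.8+ for the context-manager form of Profile
with cProfile.Profile() as pr:
    work()

stats = pstats.Stats(pr)
# sort on internal time, like "-s tottime", and show only the top 5 entries
stats.sort_stats("tottime").print_stats(5)
```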

According to time(1), the program uses around 11.1 s of real time and 10.8 s of user time:

> /usr/bin/time python ent_without_numpy.py test/random.dat >/dev/null
- Entropy is 7.999982 bits per byte.
- Optimum compression would reduce the size
  of this 10485760 byte file by 0%.
- χ² distribution for 10485760 samples is 259.03, and randomly
  would exceed this value 41.80% of the times.
  According to the χ² test, this sequence looks random.
- Arithmetic mean value of data bytes is 127.5116 (random = 127.5).
- Monte Carlo value for π is 3.139877754 (error 0.05%).
- Serial correlation coefficient is -0.000296 (totally uncorrelated = 0.0).
      11.16 real        10.82 user         0.33 sys

This illustrates perfectly the overhead that profiling can add. So with profiling, it is not the amount of time in itself that is important, but how much it can be improved. As will become clear, a reduction in run time under the profiler also results in a reduction in run time when not running under the profiler.

For comparison, the version that uses numpy runs in about 0.4 s on the same machine with the same dataset:

> /usr/bin/time python ent.py test/random.dat
- Entropy is 7.999982 bits per byte.
- Optimum compression would reduce the size
  of this 10485760 byte file by 0%.
- χ² distribution for 10485760 samples is 259.03, and randomly
  would exceed this value 41.80% of the times.
  According to the χ² test, this sequence looks random.
- Arithmetic mean value of data bytes is 127.5116 (random = 127.5).
- Monte Carlo value for π is 3.139877754 (error 0.05%).
- Serial correlation coefficient is -0.000296 (totally uncorrelated = 0.0).
        0.41 real         0.29 user         0.11 sys

Note that the outputs of the two versions are identical.

Improvements

Looking at the source code of the statistics module, it becomes clear that statistics.mean() is responsible for the first three lines in the profile, and that this function is fairly complex because it has to deal with different types of data.

So let’s replace it with statistics.fmean(), which converts its input to float and is much faster. For those who are not familiar with “unified diff” output: a “-” in front of a line means that it has been removed, while a “+” means that it has been added.

@@ -75,7 +75,7 @@ def terseout(data, e, chi2, p, d, scc, mc):
    """
    print("0,File-bytes,Entropy,Chi-square,Mean," "Monte-Carlo-Pi,Serial-Correlation")
    n = len(data)
-    m = stat.mean(data)
+    m = stat.fmean(data)
    print(f"1,{n},{e:.6f},{chi2:.6f},{m:.6f},{mc:.6f},{scc}")


@@ -109,7 +109,7 @@ def textout(data, e, chi2, p, d, scc, mc):
        print("is close to random, but not perfect.")
    else:
        print("looks random.")
-    m = stat.mean(data)
+    m = stat.fmean(data)
    print(f"- Arithmetic mean value of data bytes is {m:.4f} (random = 127.5).")
    err = 100 * (math.fabs(PI - mc) / PI)
    print(f"- Monte Carlo value for π is {mc:.9f} (error {err:.2f}%).")
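The speed difference between the two functions is easy to check in isolation; a small comparison sketch (the data size and repeat count are arbitrary, and absolute timings will differ per machine):

```python
import statistics as stat
import timeit

# a stand-in for the byte data: 25600 integers with a known mean of 127.5
data = list(range(256)) * 100

t_mean = timeit.timeit(lambda: stat.mean(data), number=10)
t_fmean = timeit.timeit(lambda: stat.fmean(data), number=10)
print(f"mean: {t_mean:.4f} s, fmean: {t_fmean:.4f} s")

# both produce the same value for this data; fmean just gets there faster
assert stat.mean(data) == stat.fmean(data) == 127.5
```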

And profile again:

> python -m cProfile -s tottime ent_without_numpy.py test/random.dat | tail -n +10 | head -n 19
        48944433 function calls (48944172 primitive calls) in 13.888 seconds

  Ordered by: internal time

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 3495260    2.843    0.000    5.816    0.000 {built-in method builtins.sum}
10485757    2.410    0.000    4.143    0.000 ent_without_numpy.py:269(<genexpr>)
10485757    1.734    0.000    1.734    0.000 ent_without_numpy.py:268(<genexpr>)
       1    1.634    1.634    6.410    6.410 ent_without_numpy.py:271(<listcomp>)
10485761    1.257    0.000    1.257    0.000 ent_without_numpy.py:183(<genexpr>)
       1    1.238    1.238    1.238    1.238 ent_without_numpy.py:181(<listcomp>)
10485761    1.052    0.000    1.052    0.000 ent_without_numpy.py:185(<genexpr>)
       1    0.537    0.537    0.537    0.537 {built-in method _collections._count_elements}
 1747627    0.363    0.000    0.663    0.000 ent_without_numpy.py:275(<genexpr>)
 1747627    0.300    0.000    0.300    0.000 ent_without_numpy.py:274(<genexpr>)
       1    0.188    0.188   13.881   13.881 ent_without_numpy.py:28(main)
       1    0.159    0.159    0.159    0.159 {built-in method math.fsum}
       1    0.132    0.132    5.724    5.724 ent_without_numpy.py:170(correlation)
       1    0.027    0.027    7.266    7.266 ent_without_numpy.py:256(monte_carlo)

The run time according to time(1) is 6.8 s of real time and 6.5 s of user time. This two-line change has reduced the run time by a third!

The built-in sum function now takes the most time. Built-ins are not something we can easily change, so the next three items in the profile are looked at instead.

Line 268 is a simple float conversion of the data. (MONTEN is a constant.)

d = (float(j) for j in d[: len(d) // MONTEN * MONTEN])

And line 269 uses that to calculate an intermediate.

intermediate = (i * j for i, j in zip(d, it.cycle([256 ** 2, 256, 1])))

Since d is not used anywhere else, both lines can be combined. Both are generator expressions, which means they do not build intermediate lists; values are produced as needed. That is very useful when working with large data sets. It also means that when generator expressions feed into each other, as happens here, the benefit of combining them can be limited.

@@ -265,8 +265,12 @@ def monte_carlo(d):
    """
    MONTEN = 6
    incirc = (256.0 ** (MONTEN // 2) - 1) ** 2
-    d = (float(j) for j in d[: len(d) // MONTEN * MONTEN])
-    intermediate = (i * j for i, j in zip(d, it.cycle([256 ** 2, 256, 1])))
+    intermediate = (
+        float(i) * j
+        for i, j in zip(
+            d[: len(d) // MONTEN * MONTEN], it.cycle((256.0 ** 2, 256.0, 1.0))
+        )
+    )
    args = [intermediate] * 3
    values = [sum(j) for j in it.zip_longest(*args)]
    montex = values[0::2]
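That the combined generator expression produces the same values as the chained pair can be verified on a small sample (MONTEN and the weights are taken from the source; the sample data is mine):

```python
import itertools as it

MONTEN = 6
data = bytes(range(24))  # small stand-in for the real byte data

# original: two chained generator expressions
d = (float(j) for j in data[: len(data) // MONTEN * MONTEN])
chained = (i * j for i, j in zip(d, it.cycle([256 ** 2, 256, 1])))

# patched: one combined generator expression
combined = (
    float(i) * j
    for i, j in zip(
        data[: len(data) // MONTEN * MONTEN], it.cycle((256.0 ** 2, 256.0, 1.0))
    )
)

assert list(chained) == list(combined)
```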

After this change, we profile again:

> python -m cProfile -s tottime ent_without_numpy.py test/random.dat | tail -n +10 | head -n 18
        38458676 function calls (38458415 primitive calls) in 11.774 seconds

  Ordered by: internal time

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 3495260    2.794    0.000    5.795    0.000 {built-in method builtins.sum}
10485757    2.018    0.000    2.018    0.000 ent_without_numpy.py:268(<genexpr>)
       1    1.593    1.593    4.190    4.190 ent_without_numpy.py:275(<listcomp>)
       1    1.311    1.311    1.311    1.311 ent_without_numpy.py:181(<listcomp>)
10485761    1.272    0.000    1.272    0.000 ent_without_numpy.py:183(<genexpr>)
10485761    1.068    0.000    1.068    0.000 ent_without_numpy.py:185(<genexpr>)
       1    0.537    0.537    0.537    0.537 {built-in method _collections._count_elements}
 1747627    0.361    0.000    0.662    0.000 ent_without_numpy.py:279(<genexpr>)
 1747627    0.300    0.000    0.300    0.000 ent_without_numpy.py:278(<genexpr>)
       1    0.188    0.188   11.768   11.768 ent_without_numpy.py:28(main)
       1    0.159    0.159    0.159    0.159 {built-in method math.fsum}
       1    0.132    0.132    5.831    5.831 ent_without_numpy.py:170(correlation)
       1    0.027    0.027    5.045    5.045 ent_without_numpy.py:256(monte_carlo)

The run time according to time(1) is now 6.3 s of real time and 6.0 s of user time. Not a huge improvement, but still on the order of 10%.

Lines 181–185 are in the function correlation. As the next step, the bytes d in the calculations of the scct values are replaced with the list of floats a.

@@ -180,9 +180,9 @@ def correlation(d):
    totalc = len(d)
    a = [float(j) for j in d]
    b = a[1:] + [a[0]]
-    scct1 = sum(i * j for i, j in zip(d, b))
-    scct2 = sum(d) ** 2
-    scct3 = sum(j * j for j in d)
+    scct1 = sum(i * j for i, j in zip(a, b))
+    scct2 = sum(a) ** 2
+    scct3 = sum(j * j for j in a)
    scc = totalc * scct3 - scct2
    if scc == 0:
        raise ValueError
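For reference, the complete function at this stage can be sketched as follows; the final division is my assumption, following the formula from the original ent:

```python
def correlation(d):
    """Serial correlation coefficient of a bytes object.

    Sketch of the patched function; the return expression is assumed to
    follow the formula from fourmilab's ent.
    """
    totalc = len(d)
    a = [float(j) for j in d]
    b = a[1:] + [a[0]]  # the data shifted by one position, wrapping around
    scct1 = sum(i * j for i, j in zip(a, b))
    scct2 = sum(a) ** 2
    scct3 = sum(j * j for j in a)
    scc = totalc * scct3 - scct2
    if scc == 0:
        raise ValueError("serial correlation is undefined for this data")
    return (totalc * scct1 - scct2) / scc


print(correlation(bytes([1, 2, 3, 4])))  # → -0.2
```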

This shortened the real run-time to 5.4 s and the user time to 5.0 s:

> /usr/bin/time python ent_without_numpy.py test/random.dat|grep correlation
- Serial correlation coefficient is -0.000296 (totally uncorrelated = 0.0).
        5.40 real         5.03 user         0.37 sys

At this point I didn’t see any obvious improvements anymore. Time to let it rest and revisit it later.

Update end of January 2022

A simpler monte_carlo was implemented.

@@ -16,7 +16,8 @@ See https://www.fourmilab.ch/random/ for the original.

import argparse
import collections
-import itertools as it
+
+# import itertools as it
import math
import statistics as stat
import sys
@@ -263,19 +264,14 @@ def monte_carlo(d):
    Returns:
        Approximation of π
    """
-    MONTEN = 6
-    incirc = (256.0 ** (MONTEN // 2) - 1) ** 2
-    intermediate = (
-        float(i) * j
-        for i, j in zip(
-            d[: len(d) // MONTEN * MONTEN], it.cycle((256.0 ** 2, 256.0, 1.0))
-        )
-    )
-    values = [sum(j) for j in zip(intermediate, intermediate, intermediate)]
+    values = [
+        a * 65536.0 + b * 256.0 + c * 1.0 for a, b, c in zip(d[0::3], d[1::3], d[2::3])
+    ]
    montex = values[0::2]
    montey = values[1::2]
    dist2 = (i * i + j * j for i, j in zip(montex, montey))
-    inmont = sum(j <= incirc for j in dist2)
+    # constant in the next line is (256.0 ** 3 - 1) ** 2
+    inmont = sum(k <= 281474943156225.0 for k in dist2)
    montepi = 4 * inmont / len(montex)
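The patched calculation works as a standalone function on any bytes object; a sketch of the new approach (the wrapper and the random sample are mine):

```python
import os


def monte_carlo(d):
    """Monte Carlo estimate of π from a bytes object, per the patched code."""
    # combine every three bytes into one 24-bit coordinate value
    values = [
        a * 65536.0 + b * 256.0 + c * 1.0
        for a, b, c in zip(d[0::3], d[1::3], d[2::3])
    ]
    montex = values[0::2]
    montey = values[1::2]
    dist2 = (i * i + j * j for i, j in zip(montex, montey))
    # the constant is (256.0 ** 3 - 1) ** 2, the squared circle radius
    inmont = sum(k <= 281474943156225.0 for k in dist2)
    return 4 * inmont / len(montex)


estimate = monte_carlo(os.urandom(600_000))
print(estimate)  # close to 3.14 for random input
```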

This reduced the run-time significantly:

> /usr/bin/time python ent_without_numpy.py test/random.dat | grep Monte
- Monte Carlo value for π is 3.139875958 (error 0.05%).
        4.08 real         3.74 user         0.33 sys

While the output of the Monte Carlo calculation has changed, the difference is very small (0.0000018) and not enough to change the error value. In my eyes this is acceptable.

Update August 2022

At this point, I initially used time.process_time() to measure how long the functions in main() took, to get a better understanding of the remaining bottlenecks. The logging.debug function was used to report the results:

DEBUG: reading data took 0.456 s.
DEBUG: calculating entropy took 0.000 s.
DEBUG: calculating Χ² took 0.000 s.
DEBUG: calculating probability of χ² test value took 0.000 s.
DEBUG: calculating Monte Carlo value of π took 0.927 s.
DEBUG: calculating serial correlation coefficient took 2.517 s.
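A minimal version of that timing pattern looks like this (the helper and the names are placeholders, not the actual code):

```python
import logging
import time

logging.basicConfig(level=logging.DEBUG, format="%(levelname)s: %(message)s")


def timed(label, func, *args):
    """Run func(*args), log how long it took as CPU time, and return its result."""
    start = time.process_time()
    result = func(*args)
    logging.debug(f"{label} took {time.process_time() - start:.3f} s.")
    return result


total = timed("summing the squares", sum, (i * i for i in range(1_000_000)))
```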

To confirm this, ent_without_numpy.py was profiled again (the program’s own output has been omitted):

> python -m cProfile -s cumtime ent_without_numpy.py test/random.dat | head -n 32

24477882 function calls (24477621 primitive calls) in 6.773 seconds

  Ordered by: cumulative time

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    14/1    0.000    0.000    6.773    6.773 {built-in method builtins.exec}
        1    0.000    0.000    6.773    6.773 ent_without_numpy.py:9(<module>)
        1    0.171    0.171    6.768    6.768 ent_without_numpy.py:29(main)
        1    0.125    0.125    4.677    4.677 ent_without_numpy.py:171(correlation)
        8    1.751    0.219    4.076    0.510 {built-in method builtins.sum}
        1    0.029    0.029    1.290    1.290 ent_without_numpy.py:257(monte_carlo)
        1    1.130    1.130    1.130    1.130 ent_without_numpy.py:182(<listcomp>)
10485761    0.973    0.000    0.973    0.000 ent_without_numpy.py:184(<genexpr>)
10485761    0.834    0.000    0.834    0.000 ent_without_numpy.py:186(<genexpr>)
        1    0.607    0.607    0.607    0.607 ent_without_numpy.py:267(<listcomp>)
  1747627    0.278    0.000    0.518    0.000 ent_without_numpy.py:274(<genexpr>)
        1    0.000    0.000    0.453    0.453 ent_without_numpy.py:120(readdata)
        1    0.000    0.000    0.450    0.450 __init__.py:581(__init__)
        1    0.000    0.000    0.450    0.450 __init__.py:649(update)
        1    0.450    0.450    0.450    0.450 {built-in method _collections._count_elements}
  1747627    0.240    0.000    0.240    0.000 ent_without_numpy.py:272(<genexpr>)
        1    0.000    0.000    0.171    0.171 ent_without_numpy.py:83(textout)
        1    0.000    0.000    0.171    0.171 statistics.py:321(fmean)

When we look at the functions (except main), it is clear that correlation uses the most time, followed by monte_carlo. As expected, readdata is next down the list.

Let’s time it without profiling:

> /usr/bin/time python ent_without_numpy.py test/random.dat >/dev/null
    4.09 real         3.76 user         0.33 sys

Using readdata() and the timeit module in IPython, I found that converting the bytes data to a list of floats as is done in correlation is relatively slow, taking roughly 1.2 seconds:

In [4]: import collections

In [5]: d, c = readdata("test/random.dat");

In [6]: len(d), len(c)
Out[6]: (10485760, 256)

In [7]: import timeit

In [8]: timeit.timeit('[float(j) for j in d]',number=1,globals=globals())
Out[8]: 1.2279654936864972

After unsuccessful experimentation with array.array, I decided to use the bytes directly instead of a list of floats for the calculation of the scct values:

@@ -179,11 +177,10 @@ def correlation(d):
        Serial correlation coeffiecient.
    """
    totalc = len(d)
-    a = [float(j) for j in d]
-    b = a[1:] + [a[0]]
-    scct1 = sum(i * j for i, j in zip(a, b))
-    scct2 = sum(a) ** 2
-    scct3 = sum(j * j for j in a)
+    b = d[1:] + d[:1]
+    scct1 = sum(i * j for i, j in zip(d, b))
+    scct2 = sum(d) ** 2
+    scct3 = sum(j * j for j in d)
    scc = totalc * scct3 - scct2
    if scc == 0:
        raise ValueError

It should be noted here that when we iterate over bytes, we automatically get integers:

In [1]: foo = bytes([24, 128, 72, 254, 2, 12])
Out[1]: b'\x18\x80H\xfe\x02\x0c'

In [2]: [type(j) for j in foo]
Out[2]: [int, int, int, int, int, int]

So we don’t have to worry about overflow:

In [3]: [j * j for j in foo]
Out[3]: [576, 16384, 5184, 64516, 4, 144]

In [4]: sum(j * j for j in foo)
Out[4]: 86808

After the change we profile again:

> python -m cProfile -s cumtime ent_without_numpy.py test/random.dat | tail -n +9 | head -n 27
- Serial correlation coefficient is -0.000296 (totally uncorrelated = 0.0).
        24477881 function calls (24477620 primitive calls) in 5.804 seconds

  Ordered by: cumulative time

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    14/1    0.000    0.000    5.804    5.804 {built-in method builtins.exec}
        1    0.000    0.000    5.804    5.804 ent_without_numpy.py:9(<module>)
        1    0.051    0.051    5.798    5.798 ent_without_numpy.py:27(main)
        8    1.942    0.243    4.494    0.562 {built-in method builtins.sum}
        1    0.004    0.004    3.837    3.837 ent_without_numpy.py:169(correlation)
        1    0.029    0.029    1.295    1.295 ent_without_numpy.py:254(monte_carlo)
10485761    1.086    0.000    1.086    0.000 ent_without_numpy.py:181(<genexpr>)
10485761    0.944    0.000    0.944    0.000 ent_without_numpy.py:183(<genexpr>)
        1    0.605    0.605    0.605    0.605 ent_without_numpy.py:264(<listcomp>)
  1747627    0.280    0.000    0.522    0.000 ent_without_numpy.py:271(<genexpr>)
        1    0.000    0.000    0.442    0.442 ent_without_numpy.py:118(readdata)
        1    0.000    0.000    0.440    0.440 __init__.py:581(__init__)
        1    0.000    0.000    0.440    0.440 __init__.py:649(update)
        1    0.440    0.440    0.440    0.440 {built-in method _collections._count_elements}
  1747627    0.242    0.000    0.242    0.000 ent_without_numpy.py:269(<genexpr>)
        1    0.000    0.000    0.168    0.168 ent_without_numpy.py:81(textout)
        1    0.000    0.000    0.168    0.168 statistics.py:321(fmean)
        1    0.168    0.168    0.168    0.168 {built-in method math.fsum}

Note that the time needed for correlation has been reduced, but the result is unchanged.

Time without profiling:

> /usr/bin/time python ent_without_numpy.py test/random.dat >/dev/null
        3.04 real         2.94 user         0.09 sys

This is a reduction in run time of about 25%; not bad.

Lessons learned

  • For dealing with large amounts of numeric data, numpy can be a huge performance improvement.
  • Profiling has significant overhead. Run your scripts under time(1) to see the unencumbered run time.
  • Even when not using numpy, float is often faster than int.
  • But not converting the data at all saves even more time.
  • If you don’t see any more improvements, let the problem rest and come back to it later.

For comments, please send me an e-mail.


Related articles


←  Profiling Python scripts (4): vecops.indexate
Python & standard output redirection on ms-windows  →