Improving Python's speed by 40% when running Home Assistant
We use Alpine for most of our containers. It is the perfect distribution for containers because it is small (BusyBox based), available for a lot of CPU architectures, and has a slim package system. Alpine uses musl as its C library instead of the more commonly used glibc.
Alpine and musl are relatively young compared to their peers (15 and 9 years old, respectively) but have seen a significant pace of development. Because things move so fast, a lot of misconceptions exist about both, based on things that are no longer true. The goal of this post is to address a couple of those and explain how we have solved them.
This blog post is not meant as a musl vs. glibc flamewar. Each use case is different and has its own trade-offs. For example, we use glibc in our OS.
For the tests, I used the images from the Docker Python library, and the results are published to our base images. I used pyperformance for lab testing and the Home Assistant internal benchmark tools for a more real-life comparison. The test environment was running inside a container on the same Docker host.
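As a rough sketch, the lab numbers can be reproduced like this (the exact invocation and file names are assumptions, not our benchmark scripts):

```sh
# Run the suite inside each container and compare the results.
pip3 install pyperformance

pyperformance run -o alpine.json     # inside the stock Alpine image
pyperformance run -o optimized.json  # inside the optimized image

# pyperf ships with pyperformance and can diff two result files.
python3 -m pyperf compare_to alpine.json optimized.json --table
```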
C/POSIX standard library
I often read: Python is slower when it uses musl as the default C library. This claim is not 100% correct. If the Python runtime is compiled with the same GCC and with `-O3`, the glibc variant is a bit faster in the lab benchmark, but in the real world, the difference is insignificant. Alpine compiles Python with `-Os`, while most other distributions compile it with `-O2`. This causes the often-reported difference between the Python runtime interpreters. When the same compiler optimizations are used, musl-based Python runtimes have no negative side effects.
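You can easily check which optimization level a given interpreter was built with; for example:

```sh
# Print the CFLAGS the running interpreter was compiled with,
# to see whether the distribution used -Os, -O2, or -O3.
python3 -c "import sysconfig; print(sysconfig.get_config_var('CFLAGS'))"
```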
But there is a game-changer that makes the musl-based runtime more useful than the glibc-based one: the memory allocator jemalloc, a general-purpose malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support. There is an interesting effect, which I found in a blog post about Rust: some developers saw that musl is much faster when using jemalloc, while glibc is slower when using jemalloc. To be sure, the benefit of jemalloc with glibc is not speed, as it optimizes memory management, but musl gets both benefits. While the difference between pure musl and glibc can be ignored, the difference between musl + jemalloc and glibc is substantial (with GCC's built-in memory allocator optimization disabled). Yes, today's jemalloc is compatible with musl (there was a time when it was not).
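A quick way to try this effect yourself is to preload jemalloc without rebuilding Python (the package name and library path below are Alpine assumptions and may differ on your image):

```sh
# Install jemalloc and preload it so Python's allocations are served
# by jemalloc instead of musl's built-in allocator.
apk add jemalloc
LD_PRELOAD=/usr/lib/libjemalloc.so.2 python3 -m pyperformance run -b json_loads
```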
Compiler
How you compile Python is also essential. There were statements from Fedora/Red Hat about disabling semantic interposition to get a high performance boost. I was not able to reproduce this on GCC 9.3.0, but I also saw no adverse side effects. I can recommend disabling the semantics, like the built-in allocator optimization, and linking jemalloc at build time. I also recommend using the `-O3` optimization level. We never saw an issue with these aggressive optimizations on our targeted platforms. I need to say that, unlike the distro Python runtime interpreters, we don't need to run everywhere, so we can use `--enable-optimizations` without any overrides and add more flags. I can say that today, PGO/LTO/`-O3` make Python faster, and it works on our target CPUs.
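Put together, such a build could look roughly like this (a sketch for Alpine; the exact flags and package names are assumptions, not our literal build scripts):

```sh
# Build dependencies; jemalloc-dev provides the library to link against.
apk add build-base jemalloc-dev

# -fno-builtin-* disables GCC's built-in allocator optimization so every
# allocation really goes through the linked jemalloc.
export CFLAGS="-O3 -fno-semantic-interposition \
  -fno-builtin-malloc -fno-builtin-calloc \
  -fno-builtin-realloc -fno-builtin-free"
export LDFLAGS="-ljemalloc"

# --enable-optimizations turns on PGO; --with-lto adds LTO.
./configure --enable-optimizations --with-lto
make -j"$(nproc)" && make install
```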
Python packages
Alpine indeed has no manylinux compatibility because it uses musl. If you don't cache your builds, pip needs to compile the C extensions of packages that require them at install time. This process takes time, just like when you cross-build with QEMU for different CPU architectures. You cannot get precompiled binaries from PyPI. This is not a problem for us, as the binaries provided on PyPI are mostly not optimized for our target systems anyway.
To fix the installation times of Python packages, we created our own wheel index and a backend that compiles all needed wheels and keeps them up to date using CI agents. We pre-build over 1,000 packages for each CPU architecture, so the build time of the Dockerfile hardly matters at all.
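Installing from such an index is just a pip flag away (the URL below is a placeholder, not our actual index):

```sh
# Prefer pre-built musl wheels from a custom index, falling back to
# PyPI for anything not mirrored there.
pip3 install --extra-index-url https://wheels.example.com/alpine/amd64/ \
    homeassistant
```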
Alpine Linux
Alpine is a great base system for containers and allows us to provide the best experience to our users. A big thanks to Alpine Linux, musl, and jemalloc, which make all of this possible.
The table below compares Alpine Linux's default Python runtime with our optimized build (GCC 9.3.0/musl). All tests were done using Python 3.8.3.
Benchmark | Alpine | Optimized |
---|---|---|
2to3 | 924 ms | 699 ms: 1.32x faster (-24%) |
chameleon | 37.9 ms | 25.6 ms: 1.48x faster (-33%) |
chaos | 393 ms | 273 ms: 1.44x faster (-31%) |
crypto_pyaes | 373 ms | 245 ms: 1.52x faster (-34%) |
deltablue | 22.8 ms | 16.4 ms: 1.39x faster (-28%) |
django_template | 184 ms | 145 ms: 1.27x faster (-21%) |
dulwich_log | 157 ms | 122 ms: 1.29x faster (-22%) |
fannkuch | 1.81 sec | 1.32 sec: 1.38x faster (-27%) |
float | 363 ms | 263 ms: 1.38x faster (-28%) |
genshi_text | 113 ms | 83.9 ms: 1.34x faster (-26%) |
genshi_xml | 226 ms | 171 ms: 1.32x faster (-24%) |
go | 816 ms | 598 ms: 1.36x faster (-27%) |
hexiom | 36.8 ms | 24.2 ms: 1.52x faster (-34%) |
json_dumps | 34.8 ms | 25.6 ms: 1.36x faster (-26%) |
json_loads | 61.2 us | 47.4 us: 1.29x faster (-23%) |
logging_format | 30.0 us | 23.5 us: 1.28x faster (-22%) |
logging_silent | 673 ns | 486 ns: 1.39x faster (-28%) |
logging_simple | 27.2 us | 21.3 us: 1.27x faster (-22%) |
mako | 54.5 ms | 35.6 ms: 1.53x faster (-35%) |
meteor_contest | 344 ms | 219 ms: 1.57x faster (-36%) |
nbody | 526 ms | 305 ms: 1.73x faster (-42%) |
nqueens | 368 ms | 246 ms: 1.49x faster (-33%) |
pathlib | 64.4 ms | 45.2 ms: 1.42x faster (-30%) |
pickle | 20.3 us | 17.1 us: 1.19x faster (-16%) |
pickle_dict | 40.2 us | 33.6 us: 1.20x faster (-16%) |
pickle_list | 6.77 us | 5.88 us: 1.15x faster (-13%) |
pickle_pure_python | 1.85 ms | 1.27 ms: 1.45x faster (-31%) |
pidigits | 274 ms | 222 ms: 1.24x faster (-19%) |
pyflate | 2.53 sec | 1.74 sec: 1.45x faster (-31%) |
python_startup | 14.9 ms | 12.1 ms: 1.23x faster (-19%) |
python_startup_no_site | 9.84 ms | 8.24 ms: 1.19x faster (-16%) |
raytrace | 1.61 sec | 1.23 sec: 1.30x faster (-23%) |
regex_compile | 547 ms | 398 ms: 1.38x faster (-27%) |
regex_dna | 445 ms | 484 ms: 1.09x slower (+9%) |
regex_effbot | 10.3 ms | 9.96 ms: 1.03x faster (-3%) |
regex_v8 | 81.8 ms | 71.6 ms: 1.14x faster (-12%) |
richards | 265 ms | 182 ms: 1.46x faster (-31%) |
scimark_fft | 1.31 sec | 851 ms: 1.54x faster (-35%) |
scimark_lu | 616 ms | 384 ms: 1.61x faster (-38%) |
scimark_monte_carlo | 390 ms | 248 ms: 1.57x faster (-36%) |
scimark_sor | 838 ms | 571 ms: 1.47x faster (-32%) |
scimark_sparse_mat_mult | 19.0 ms | 13.2 ms: 1.43x faster (-30%) |
spectral_norm | 567 ms | 388 ms: 1.46x faster (-32%) |
sqlalchemy_declarative | 364 ms | 286 ms: 1.27x faster (-21%) |
sqlalchemy_imperative | 60.3 ms | 46.8 ms: 1.29x faster (-22%) |
sqlite_synth | 6.88 us | 5.09 us: 1.35x faster (-26%) |
sympy_expand | 1.39 sec | 1.05 sec: 1.32x faster (-24%) |
sympy_integrate | 67.3 ms | 49.5 ms: 1.36x faster (-26%) |
sympy_sum | 505 ms | 389 ms: 1.30x faster (-23%) |
sympy_str | 945 ms | 656 ms: 1.44x faster (-31%) |
telco | 17.9 ms | 12.5 ms: 1.44x faster (-31%) |
tornado_http | 347 ms | 273 ms: 1.27x faster (-21%) |
unpack_sequence | 232 ns | 212 ns: 1.09x faster (-9%) |
unpickle | 41.6 us | 30.7 us: 1.36x faster (-26%) |
unpickle_list | 10.5 us | 9.24 us: 1.14x faster (-12%) |
unpickle_pure_python | 1.28 ms | 945 us: 1.36x faster (-26%) |
xml_etree_parse | 335 ms | 292 ms: 1.15x faster (-13%) |
xml_etree_iterparse | 281 ms | 226 ms: 1.24x faster (-20%) |
xml_etree_generate | 330 ms | 219 ms: 1.51x faster (-34%) |
xml_etree_process | 263 ms | 181 ms: 1.45x faster (-31%) |