Improving Python's speed by 40% when running Home Assistant

We use Alpine for most of our Containers. It is the perfect distribution for containers because it is small (BusyBox based), available for a lot of CPU architectures, and the package system is slim. Alpine uses musl as their C library instead of the more commonly used glibc.

Alpine with musl are relatively young compared to their peers (15 and 9 years old) but have seen a significant development pace. Because things move so fast, a lot of misconceptions exist about both based on things that are no longer true. The goal of this post is to address a couple of those and how we have solved them.

This blogpost is not meant as a musl vs. glibc flamewar. Each use case is different and has its own trade-offs. For example, we use glibc in our OS.

For the tests, I used the images from Docker Python library, and the result is published to our base images. I used pyperformance for lab testing and the Home Assistant internal benchmark tools for more real-life comparison. The test environment was running inside a container on the same Docker host.

C/POSIX standard library

I often read: Python is slower when it uses musl as the default C library. This fact is not 100% correct. If the Python runtime was compiled with the same GCC and with -O3, the glibc variant is a bit faster in the lab benchmark, but in the real world, the difference is insignificant. Alpine compiles it with -Os while most other distributions compile it with -O2. This causes the often written difference between the Python runtime interpreters. But when using the same compiler optimizations, musl based Python runtimes have no negative side-effects.

But there is a game-changer, which makes the musl one more useful compared to the glibc-based runtime. It is the memory allocator jemalloc, a general-purpose malloc implementation that emphasizes fragmentation avoidance and scalable concurrency support. There is an interesting effect, which I found on some blogpost about Rust. There were some developers who saw that musl is much faster when using jemalloc compared to glibc, while glibc is slower when using jemalloc. For sure, the benefit with glibc and jemalloc is not the speed as they optimize memory management, but musl get both benefits. While the difference between pure musl and glibc can be ignored, the difference between musl + jemalloc and glibc are substantial (with disabled GCC memory allocator built-in optimization). Yes, today's jemalloc is compatible with musl (there was a time which it was not).

Compiler

How you compile Python is also essential. There were statements from Fedora or Redhat about disable semantic-interposition to get a high-performance boost. I was not able to reproduce this on GCC 9.3.0, but I also saw no adverse side-effects. I can recommend disabling the semantics like the built-in allocator optimization and link jemalloc at build time. I will also recommend using the -O3 optimization. We never saw an issue with these aggressive optimizations on our targeted platforms. I need to say, unlike the distro Python runtime interpreters, we don't need to run everywhere. So we can use the --enable-optimizations without any overwrite and add more flags. I can say today, PGO/LTO/O3 make Python faster and it works on our target CPUs.

Python packages

Alpine indeed has no manylinux compatibility with musl. If you don't cache your builds, it needs to compile the C extensions when installing packages that require it. This process takes time, just like if you would cross-build with Qemu for different CPU architectures. You cannot get precompiled binaries from PyPi. This is not a problem for us as the provided binaries on PyPI are mostly not optimized for our target systems.

To fix installation times of Python package, we created our own wheel index and backend to compile all needed wheels and keep it up to date using CI agents. We pre-build over 1k packages for each CPU architecture, and the build time of the Docker file is not so important at all.

Alpine Linux

Alpine is a great base system for Container and allows us to provide the best experience to our user. A big thanks to Alpine Linux, musl, and jemalloc, which make this all possible.

The table shows the results comparing the Alpine Linux's Python runtime and our optimization (GCC 9.3.0/musl). All tests done using Python 3.8.3.

BenchmarkAlpineOptimized
2to3924 ms699 ms: 1.32x faster (-24%)
chameleon37.9 ms25.6 ms: 1.48x faster (-33%)
chaos393 ms273 ms: 1.44x faster (-31%)
crypto_pyaes373 ms245 ms: 1.52x faster (-34%)
deltablue22.8 ms16.4 ms: 1.39x faster (-28%)
django_template184 ms145 ms: 1.27x faster (-21%)
dulwich_log157 ms122 ms: 1.29x faster (-22%)
fannkuch1.81 sec1.32 sec: 1.38x faster (-27%)
float363 ms263 ms: 1.38x faster (-28%)
genshi_text113 ms83.9 ms: 1.34x faster (-26%)
genshi_xml226 ms171 ms: 1.32x faster (-24%)
go816 ms598 ms: 1.36x faster (-27%)
hexiom36.8 ms24.2 ms: 1.52x faster (-34%)
json_dumps34.8 ms25.6 ms: 1.36x faster (-26%)
json_loads61.2 us47.4 us: 1.29x faster (-23%)
logging_format30.0 us23.5 us: 1.28x faster (-22%)
logging_silent673 ns486 ns: 1.39x faster (-28%)
logging_simple27.2 us21.3 us: 1.27x faster (-22%)
mako54.5 ms35.6 ms: 1.53x faster (-35%)
meteor_contest344 ms219 ms: 1.57x faster (-36%)
nbody526 ms305 ms: 1.73x faster (-42%)
nqueens368 ms246 ms: 1.49x faster (-33%)
pathlib64.4 ms45.2 ms: 1.42x faster (-30%)
pickle20.3 us17.1 us: 1.19x faster (-16%)
pickle_dict40.2 us33.6 us: 1.20x faster (-16%)
pickle_list6.77 us5.88 us: 1.15x faster (-13%)
pickle_pure_python1.85 ms1.27 ms: 1.45x faster (-31%)
pidigits274 ms222 ms: 1.24x faster (-19%)
pyflate2.53 sec1.74 sec: 1.45x faster (-31%)
python_startup14.9 ms12.1 ms: 1.23x faster (-19%)
python_startup_no_site9.84 ms8.24 ms: 1.19x faster (-16%)
raytrace1.61 sec1.23 sec: 1.30x faster (-23%)
regex_compile547 ms398 ms: 1.38x faster (-27%)
regex_dna445 ms484 ms: 1.09x slower (+9%)
regex_effbot10.3 ms9.96 ms: 1.03x faster (-3%)
regex_v881.8 ms71.6 ms: 1.14x faster (-12%)
richards265 ms182 ms: 1.46x faster (-31%)
scimark_fft1.31 sec851 ms: 1.54x faster (-35%)
scimark_lu616 ms384 ms: 1.61x faster (-38%)
scimark_monte_carlo390 ms248 ms: 1.57x faster (-36%)
scimark_sor838 ms571 ms: 1.47x faster (-32%)
scimark_sparse_mat_mult19.0 ms13.2 ms: 1.43x faster (-30%)
spectral_norm567 ms388 ms: 1.46x faster (-32%)
sqlalchemy_declarative364 ms286 ms: 1.27x faster (-21%)
sqlalchemy_imperative60.3 ms46.8 ms: 1.29x faster (-22%)
sqlite_synth6.88 us5.09 us: 1.35x faster (-26%)
sympy_expand1.39 sec1.05 sec: 1.32x faster (-24%)
sympy_integrate67.3 ms49.5 ms: 1.36x faster (-26%)
sympy_sum505 ms389 ms: 1.30x faster (-23%)
sympy_str945 ms656 ms: 1.44x faster (-31%)
telco17.9 ms12.5 ms: 1.44x faster (-31%)
tornado_http347 ms273 ms: 1.27x faster (-21%)
unpack_sequence232 ns212 ns: 1.09x faster (-9%)
unpickle41.6 us30.7 us: 1.36x faster (-26%)
unpickle_list10.5 us9.24 us: 1.14x faster (-12%)
unpickle_pure_python1.28 ms945 us: 1.36x faster (-26%)
xml_etree_parse335 ms292 ms: 1.15x faster (-13%)
xml_etree_iterparse281 ms226 ms: 1.24x faster (-20%)
xml_etree_generate330 ms219 ms: 1.51x faster (-34%)
xml_etree_process263 ms181 ms: 1.45x faster (-31%)