To optimize that code snippet, use temporary local variables instead of member lookups on self to avoid slow attribute loads and stores (getattr/setattr). It still won’t beat a compiled language; number crunching is the worst sport for Python.
That is why, in practice, you pay the cost of moving your data into a native module (numpy/pandas/polars), do all your number crunching over there, and then pull the result back.
Not saying it's ideal but it's a solved problem and Python is eating good in terms of quality dataframe libraries.
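For what it's worth, here is a rough sketch of that round trip with numpy. It assumes the per-word work is the same kind of shift-and-mask sequence as in the snippet under discussion, which gathers the even-numbered bits of a 16-bit word into one byte. The function names and the sample words are made up for illustration, and how the words get out of the real decoder is left out entirely.

```python
import numpy as np


def compact_even_bits_scalar(word: int) -> int:
    """Pure-Python reference: gather the even-numbered bits of a 16-bit word into one byte."""
    x = word & 0x5555
    x = (x + (x >> 1)) & 0x3333
    x = (x + (x >> 2)) & 0x0F0F
    return (x + (x >> 4)) & 0x00FF


def compact_even_bits_batch(words: np.ndarray) -> np.ndarray:
    """Same shift-and-mask steps, applied to a whole array of words inside numpy."""
    x = words.astype(np.uint16) & 0x5555
    x = (x + (x >> 1)) & 0x3333
    x = (x + (x >> 2)) & 0x0F0F
    x = (x + (x >> 4)) & 0x00FF
    return x.astype(np.uint8)  # every value fits in a byte after the last mask


# Hypothetical input: move the data over once, crunch it there, pull the result back.
words = np.array([0x5555, 0xAAAA, 0x4000, 0x0004], dtype=np.uint16)
result = compact_even_bits_batch(words).tolist()
```

The catch is that this only pays off if you can collect the words into a batch first. If the decoder has to react to every pulse one at a time, the per-call overhead stays in Python.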
All those attributes are already in __slots__, so in theory it shouldn't matter. Your advice is good
self.shift_index -= 16
shift_byte = (self.shift >> self.shift_index) & 0x5555
shift_byte = (shift_byte + (shift_byte >> 1)) & 0x3333
shift_byte = (shift_byte + (shift_byte >> 2)) & 0x0F0F
self.shift_byte = (shift_byte + (shift_byte >> 4)) & 0x00FF
but only by 2-4 milliseconds per 1 million pulses :) Declaring a local variable in a tight loop forces Python into a cycle of memory allocations and garbage collection, negating potential gains :(
SWAR : 0.288 seconds -> 0.33 MiB/s
SWAR local : 0.284 seconds -> 0.33 MiB/s
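If anyone wants to reproduce this kind of comparison, a minimal timeit sketch could look like the one below. The Decoder class is a hypothetical stand-in that only models the three attributes from the snippet. Restoring shift_index at the end of each call is an assumption added so the call can be repeated; it is not part of the real decoder, and the timings will differ by machine and Python version.

```python
import timeit


class Decoder:
    """Hypothetical stand-in; only the three attributes from the snippet are modelled."""

    __slots__ = ("shift", "shift_index", "shift_byte")

    def __init__(self, shift):
        self.shift = shift
        self.shift_index = 16
        self.shift_byte = 0

    def step_attr(self):
        # Every intermediate value is stored to and loaded from the instance.
        self.shift_index -= 16
        self.shift_byte = (self.shift >> self.shift_index) & 0x5555
        self.shift_byte = (self.shift_byte + (self.shift_byte >> 1)) & 0x3333
        self.shift_byte = (self.shift_byte + (self.shift_byte >> 2)) & 0x0F0F
        self.shift_byte = (self.shift_byte + (self.shift_byte >> 4)) & 0x00FF
        self.shift_index += 16  # assumption: restore state so the call is repeatable

    def step_local(self):
        # Intermediates live in a local; only the final result is stored on self.
        self.shift_index -= 16
        shift_byte = (self.shift >> self.shift_index) & 0x5555
        shift_byte = (shift_byte + (shift_byte >> 1)) & 0x3333
        shift_byte = (shift_byte + (shift_byte >> 2)) & 0x0F0F
        self.shift_byte = (shift_byte + (shift_byte >> 4)) & 0x00FF
        self.shift_index += 16  # assumption: restore state so the call is repeatable


def bench(n=1000000):
    """Time n calls of each variant and return the wall-clock seconds."""
    decoder = Decoder(0x5555)
    return {
        "attr": timeit.timeit(decoder.step_attr, number=n),
        "local": timeit.timeit(decoder.step_local, number=n),
    }


if __name__ == "__main__":
    for name, seconds in bench().items():
        print("{:5s}: {:.3f} s per 1,000,000 calls".format(name, seconds))
```

Both variants still pay for the method call and for creating the intermediate int objects, so a small gap between them would not be surprising.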
This whole snippet is maybe what, 50-100 x86 opcodes? Native code runs at >100 MB/s while Python 3.14 struggles at around 300 KB/s. Python 3.4 (a hardcoded Sigrok requirement) is even worse: