SWAR (SIMD Within A Register)
SWAR (SIMD Within A Register) treats a single machine word as a vector of smaller lanes (e.g. a 64-bit register as 8 bytes or 32 bit-pairs) and operates on all lanes at once with ordinary integer instructions. Carefully chosen masks stop carries, borrows, and shifted-out bits from crossing lane boundaries. It is the data-parallel half of bit-manipulation (the complement to branchless-programming’s control-flow trick): one cheap word-wide op does the work of a loop over lanes, with no dedicated SIMD hardware required.
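Here is a minimal C sketch of the lane view. It assumes a 64-bit word split into 8 unsigned byte lanes, and the function names are illustrative rather than taken from any library. Halving every lane shows why the masks matter. A plain word shift drags each lane's low bit into the lane below, and a 0x7f-per-lane mask clears exactly those leaked bits. Broadcasting a byte to every lane is the other basic building block.

```c
#include <assert.h>
#include <stdint.h>

/* View a uint64_t as 8 independent unsigned byte lanes. */

/* Replicate one byte into every lane. Each partial product b << (8*k)
   occupies its own byte, so the multiply never carries between lanes. */
static uint64_t broadcast_byte(uint8_t b) {
    return (uint64_t)b * 0x0101010101010101ULL;
}

/* Halve every byte lane at once. The word-wide shift moves each lane's
   low bit into the top bit of the lane below; masking with 0x7f in
   every lane clears exactly those leaked bits. */
static uint64_t halve_bytes(uint64_t x) {
    return (x >> 1) & 0x7f7f7f7f7f7f7f7fULL;
}
```

For example, halve_bytes(0x10FF030200010880) yields 0x087F010100000440. That is the same result as eight separate 8-bit shifts, one per byte.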
The canonical example — parallel popcount
The constant-time population-count in bit-twiddling-hacks (and hackers-delight) is SWAR. For a 32-bit word, the 0x55555555 / 0x33333333 / 0x0f0f0f0f mask sequence adds bits in 2-bit, then 4-bit, then 8-bit lanes in parallel (a logarithmic-depth tree of masked adds). A final multiply by 0x01010101 followed by a right shift of 24 then sums the four byte lanes into the top byte. The 64-bit version simply widens each constant. No branches, no table, no per-bit loop: the whole word’s bits are counted in about 12 operations.
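Below is a sketch of the 32-bit mask ladder in C. It uses the widely cited form in which the first step is written as a subtraction, x - ((x >> 1) & 0x55555555). This gives the same per-pair sums as the masked add and saves one AND. The naive loop is there only as a reference to check against. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free, table-free population count of a 32-bit word. */
static unsigned popcount32_swar(uint32_t x) {
    /* 16 two-bit lanes: a pair with value 2a+b becomes a+b (0..2).
       Subtracting the high bit never borrows across lanes. */
    x = x - ((x >> 1) & 0x55555555u);
    /* 8 four-bit lanes: add adjacent 2-bit counts (each sum <= 4). */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    /* 4 byte lanes: add adjacent nibble counts. Each sum is <= 8 and
       fits in 4 bits, so masking once after the add is safe. */
    x = (x + (x >> 4)) & 0x0f0f0f0fu;
    /* Multiplying by 0x01010101 accumulates all four bytes into the
       top byte (the total is <= 32, so it cannot overflow the byte). */
    return (x * 0x01010101u) >> 24;
}

/* Reference: one bit per iteration, for comparison only. */
static unsigned popcount32_naive(uint32_t x) {
    unsigned n = 0;
    while (x) {
        n += x & 1u;
        x >>= 1;
    }
    return n;
}
```

For example, popcount32_swar(0x12345678) returns 13, which matches the naive loop.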
Where else it shows up
- Byte-pattern detection: testing “does this word contain a zero byte / a given byte?” in one word-wide expression (the basis of fast strlen/memchr); a staple of bit-twiddling-hacks.
- Parallel add/compare across lanes with mask-isolated carries: packed arithmetic before SSE/NEON.
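The next sketch shows the zero-byte and given-byte tests on a 64-bit word. It uses the well-known (x - 0x0101010101010101) & ~x & 0x8080808080808080 expression from the bit-twiddling references. The function names are illustrative. One subtlety matters: a borrow out of a genuine zero byte can also flag a higher byte. The result is therefore exact as a yes/no answer but not as a per-byte map.

```c
#include <assert.h>
#include <stdint.h>

static const uint64_t ONES  = 0x0101010101010101ULL; /* 0x01 in every byte lane */
static const uint64_t HIGHS = 0x8080808080808080ULL; /* 0x80 in every byte lane */

/* Returns 1 iff some byte of x is 0x00. Subtracting 1 from a zero byte
   borrows into its high bit; "& ~x" discards bytes whose high bit was
   already set. Borrows may also mark bytes above a real zero, so only
   the overall nonzero/zero answer is exact. */
static int has_zero_byte(uint64_t x) {
    return ((x - ONES) & ~x & HIGHS) != 0;
}

/* Returns 1 iff some byte of x equals b: XOR with b broadcast to every
   lane turns matching bytes into 0x00. */
static int has_byte(uint64_t x, uint8_t b) {
    return has_zero_byte(x ^ (ONES * b));
}
```

A word-at-a-time strlen or memchr applies this test to each 8-byte chunk. It drops to a byte-by-byte scan only in the chunk where the test fires. A real implementation also has to handle alignment so it never reads past the end of a page.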
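Packed byte arithmetic can be sketched the same way: eight independent 8-bit additions (and subtractions) in one 64-bit word, each wrapping modulo 256. For addition, summing only the low 7 bits of each lane cannot carry into the next lane, and the lanes' top bits are then patched in with XOR. Subtraction forces each lane of x to at least 0x80 and each lane of y to at most 0x7f, so no lane borrows from its neighbour. This is a standard mask-isolated formulation, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static const uint64_t H = 0x8080808080808080ULL; /* top bit of every byte lane */

/* Lane-wise (x + y) mod 256 for 8 byte lanes. The low-7-bit sum of a lane
   is at most 0xFE, so it never carries out; XOR then restores each lane's
   top bit as x7 ^ y7 ^ (carry into bit 7). */
static uint64_t packed_add_u8(uint64_t x, uint64_t y) {
    uint64_t low = (x & ~H) + (y & ~H);
    return low ^ ((x ^ y) & H);
}

/* Lane-wise (x - y) mod 256. Setting x's top bits and clearing y's keeps
   every lane difference >= 1, so no borrow crosses a lane; XOR fixes the
   top bit, which must equal x7 ^ y7 ^ (borrow into bit 7). */
static uint64_t packed_sub_u8(uint64_t x, uint64_t y) {
    uint64_t low = (x | H) - (y & ~H);
    return low ^ ((x ^ ~y) & H);
}
```

For example, adding 0xFF01807F00000010 and 0x0101808001000020 lane by lane gives 0x000200FF01000030. Each byte wraps on its own, with nothing carried into its neighbour.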
The standing caveat applies, with a twist
Like the rest of the spoke (branchless-programming’s lesson), much classic SWAR is now matched by true SIMD (SSE/AVX/NEON) and dedicated instructions (POPCNT); hardware caught up. But SWAR’s value is more durable here. It needs no SIMD ISA at all, so it still wins on minimal/embedded targets, in portable code, and inside word-at-a-time inner loops that the vector units can’t reach. It is the clearest case of bit manipulation as parallelism extracted from plain integer hardware.
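Where a compiler builtin exists, SWAR becomes the portable fallback rather than the first choice. The sketch below assumes GCC or Clang. Their __builtin_popcountll may compile to POPCNT when the target enables it, e.g. with -mpopcnt on x86. Any other compiler gets the 64-bit SWAR version, which needs only plain integer operations. The function names and the dispatch scheme are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit SWAR popcount: the same mask ladder as the 32-bit version with
   every constant widened, and a final multiply summing 8 byte lanes. */
static unsigned popcount64_swar(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    /* Sum of all bytes (<= 64) lands in the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Prefer the compiler builtin when one is known to exist; otherwise
   fall back to SWAR, which every integer ISA can run. */
static unsigned popcount64(uint64_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcountll(x);
#else
    return popcount64_swar(x);
#endif
}
```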
Related
bit-manipulation · population-count · branchless-programming · bit-twiddling-hacks · hackers-delight