gh-124951: Optimize base64 encode & decode for an easy 2-3x speedup [no SIMD] (GH-143262)

Optimize base64 encoding/decoding by eliminating loop-carried dependencies. Key changes:
- Add `base64_encode_trio()` and `base64_decode_quad()` helper functions that process complete groups independently
- Add `base64_encode_fast()` and `base64_decode_fast()` wrappers
- Update `b2a_base64` and `a2b_base64` to use the fast path for complete groups
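
For illustration only, the idea behind these changes can be sketched as below. This is not the code from the PR: the `sketch_*` names, the signatures, and the branch-based character lookup are all assumptions made for this example. It shows how each complete 3-byte (encode) or 4-character (decode) group is converted using only its own inputs, so no shift register or leftover-bit state is carried from one loop iteration to the next.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard base64 alphabet (RFC 4648). */
static const char sketch_b64_table[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Hypothetical helper: encode one complete 3-byte group into 4 characters. */
static inline void
sketch_encode_trio(const unsigned char *in, unsigned char *out)
{
    uint32_t v = ((uint32_t)in[0] << 16) | ((uint32_t)in[1] << 8) | in[2];
    out[0] = (unsigned char)sketch_b64_table[(v >> 18) & 0x3f];
    out[1] = (unsigned char)sketch_b64_table[(v >> 12) & 0x3f];
    out[2] = (unsigned char)sketch_b64_table[(v >> 6) & 0x3f];
    out[3] = (unsigned char)sketch_b64_table[v & 0x3f];
}

/* Hypothetical wrapper: encode every complete group and return the number
 * of input bytes consumed. The caller handles the 0-2 trailing bytes. */
static size_t
sketch_encode_fast(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t groups = len / 3;
    for (size_t i = 0; i < groups; i++) {
        /* Each iteration depends only on its own 3 input bytes. */
        sketch_encode_trio(in + 3 * i, out + 4 * i);
    }
    return groups * 3;
}

/* Map one character to its 6-bit value, or -1 if it is not in the alphabet.
 * A real implementation would more likely use a 256-entry lookup table. */
static inline int
sketch_decode_char(unsigned char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Hypothetical helper: decode one complete 4-character group into 3 bytes.
 * Returns 1 on success, 0 if any character is outside the alphabet. */
static inline int
sketch_decode_quad(const unsigned char *in, unsigned char *out)
{
    int a = sketch_decode_char(in[0]);
    int b = sketch_decode_char(in[1]);
    int c = sketch_decode_char(in[2]);
    int d = sketch_decode_char(in[3]);
    if (a < 0 || b < 0 || c < 0 || d < 0) {
        return 0;
    }
    uint32_t v = ((uint32_t)a << 18) | ((uint32_t)b << 12)
               | ((uint32_t)c << 6) | (uint32_t)d;
    out[0] = (unsigned char)(v >> 16);
    out[1] = (unsigned char)((v >> 8) & 0xff);
    out[2] = (unsigned char)(v & 0xff);
    return 1;
}

/* Hypothetical wrapper: decode complete, fully valid groups and return the
 * number of input characters consumed. It stops at the first group that
 * contains padding, whitespace, or any other non-alphabet character, so a
 * slower general-purpose path can take over from there. */
static size_t
sketch_decode_fast(const unsigned char *in, size_t len, unsigned char *out)
{
    size_t groups = len / 4;
    size_t i;
    for (i = 0; i < groups; i++) {
        if (!sketch_decode_quad(in + 4 * i, out + 3 * i)) {
            break;
        }
    }
    return i * 4;
}
```

In this sketch the wrappers report how much input they consumed, so a caller can hand the remaining tail (partial groups, padding, or characters that need error handling) to the existing general-purpose loop.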

Performance gains (encode/decode speedup vs main, PGO builds):
```
             64 bytes    64K        1M
  Zen2:      1.2x/1.8x   1.7x/2.8x  1.5x/2.8x
  Zen4:      1.2x/1.7x   1.6x/3.0x  1.5x/3.0x  [older measurement; likely faster now]
  M4:        1.3x/1.9x   2.3x/2.8x  2.4x/2.9x  [older measurement; likely faster now]
  RPi5-32:   1.2x/1.2x   2.4x/2.4x  2.0x/2.1x
```

Based on my exploratory work done in https://github.com/python/cpython/compare/main...gpshead:cpython:claude/vectorize-base64-c-S7Hku

See the PR and issue for further thoughts on SIMD-vectorized versions of this, which are sometimes MUCH faster.

Gregory P. Smith committed 3mo ago

61fc72a4a431cbfd42f22e2af76177c73431c3e6

Parent: 6b9a6c6

Committed by GitHub <noreply@github.com> on 1/2/2026, 6:03:05 AM