A ByteString performance mystery

Aug. 23 2017haskell

This post will recount an interesting performance mystery from work a long way back. I wasn't involved in this murder mystery at the time, but that's all the more reason to get involved writing it up!

The problem is pretty simple: we have a buffer that is assumed to be ASCII alphanumerics, and we want to convert it to lowercase for comparison. It's in an inner loop, so it should be as fast as possible, though we are not quite willing to implement the entire program in C. We still want a typed interface and Hedgehog coverage.

If we want this to be a fast inner loop, we probably want to avoid unnecessary branching. Luckily, our input is in the right shape. There's a fairly well-known trick for upper- and lower-casing ASCII alphanumeric strings. Upper and lower case are exactly 0x20 apart, so it is common to check the case, then add or subtract 0x20. However, checking the case first would introduce a branch, so we can instead perform bitwise AND 0x20 (upper) or OR 0x20 (lower). This will produce some unusual output for characters that aren't alphanumeric, but we don't care in this particular case.

We can implement this function in fairly idiomatic Haskell using ByteString primitives:

import           Data.Bits ((.|.))
import           Data.ByteString (ByteString)
import qualified Data.ByteString as B

toLowerNative :: ByteString -> ByteString
toLowerNative =
  B.map (.|. 0x20)

... and we can see that it works given ASCII:

λ> toLowerNative "ABCDE"
λ> toLowerNative "ABCDE123123123"
λ> toLowerNative "ABCDE123123123abcde"

... and we can write a benchmark to demonstrate that it's linear in the size of the input:

Benchmark bench: RUNNING...
benchmarking toLowerNative/100
time                 699.9 ns   (689.4 ns .. 712.1 ns)

benchmarking toLowerNative/500
time                 3.133 μs   (3.092 μs .. 3.193 μs)

benchmarking toLowerNative/1000
time                 6.245 μs   (6.128 μs .. 6.384 μs)

benchmarking toLowerNative/2000
time                 12.66 μs   (12.39 μs .. 12.99 μs)

... but we don't have anything to benchmark against! We shouldn't be especially surprised that a map across ByteStrings runs in linear time.

The interesting bit came when it was benchmarked against the exact same function written in C and invoked via the FFI. So, let's do that:

void to_lower (char* str, int len) {
  for (int i = 0; i < len; i++) {
    str[i] |= 0x20;

We can bind it in Haskell and marshal our ByteStrings in the usual fashion. This function has only benign side effects. It idempotently modifies memory provided by ByteString's useAsCStringLen function, but touches nothing else. So, it is safe to use unsafePerformIO, convincing GHC to treat it as an ordinary pure function.

{-# LANGUAGE ForeignFunctionInterface #-}
import           Data.ByteString (ByteString)
import qualified Data.ByteString as B
import           Foreign.C.Types (CChar)
import           Foreign.Ptr (Ptr)
import qualified System.IO.Unsafe as Unsafe

toLowerFFI :: ByteString -> ByteString
toLowerFFI bs =
  Unsafe.unsafePerformIO . B.useAsCStringLen bs $ \(ptr, len) -> do
    to_lower ptr len
    B.packCStringLen (ptr, len)

foreign import ccall to_lower :: Ptr CChar -> Int -> IO ()

... and when we add it to the benchmark, it's an order of magnitude faster!

benchmarking toLowerFFI/100
time                 236.0 ns   (233.6 ns .. 239.1 ns)

benchmarking toLowerFFI/500
time                 349.4 ns   (334.2 ns .. 366.1 ns)

benchmarking toLowerFFI/1000
time                 477.6 ns   (432.3 ns .. 528.6 ns)

benchmarking toLowerFFI/2000
time                 637.8 ns   (586.2 ns .. 690.2 ns)

Assembly dump

We know from various wonderful Haskell performance guides that there's a bunch of stuff we can try next to improve this native version. I'm not going to go there. Let's just peek at the assembly for the speedy version.

$ objdump -S lower.o

lower.o:        file format Mach-O 64-bit x86-64

Disassembly of section __TEXT,__text:
; void to_lower (char* str, int len) {
       0:       55      pushq   %rbp
       1:       48 89 e5        movq    %rsp, %rbp
; for (int i = 0; i < len; i++) {
       4:       85 f6   testl   %esi, %esi
       6:       7e 63   jle     99 <_to_lower+0x6B>
       8:       89 f0   movl    %esi, %eax
; str[i] |= 0x20;
       a:       83 fe 1f        cmpl    $31, %esi
       d:       76 45   jbe     69 <_to_lower+0x54>
       f:       83 e6 1f        andl    $31, %esi
      12:       49 89 c0        movq    %rax, %r8
      15:       49 29 f0        subq    %rsi, %r8
      18:       74 3a   je      58 <_to_lower+0x54>
      1a:       48 8d 57 10     leaq    16(%rdi), %rdx
      1e:       0f 28 05 4b 00 00 00    movaps  75(%rip), %xmm0
; for (int i = 0; i < len; i++) {
      25:       4c 89 c1        movq    %r8, %rcx
      28:       0f 1f 84 00 00 00 00 00         nopl    (%rax,%rax)
; str[i] |= 0x20;
      30:       0f 10 4a f0     movups  -16(%rdx), %xmm1
      34:       0f 10 12        movups  (%rdx), %xmm2
      37:       0f 56 c8        orps    %xmm0, %xmm1
      3a:       0f 56 d0        orps    %xmm0, %xmm2
      3d:       0f 11 4a f0     movups  %xmm1, -16(%rdx)
      41:       0f 11 12        movups  %xmm2, (%rdx)
      44:       48 83 c2 20     addq    $32, %rdx
      48:       48 83 c1 e0     addq    $-32, %rcx
      4c:       75 e2   jne     -30 <_to_lower+0x30>
      4e:       85 f6   testl   %esi, %esi
      50:       75 05   jne     5 <_to_lower+0x57>
      52:       eb 17   jmp     23 <_to_lower+0x6B>
      54:       45 31 c0        xorl    %r8d, %r8d
      57:       4c 01 c7        addq    %r8, %rdi
      5a:       4c 29 c0        subq    %r8, %rax
      5d:       0f 1f 00        nopl    (%rax)
      60:       80 0f 20        orb     $32, (%rdi)
; for (int i = 0; i < len; i++) {
      63:       48 ff c7        incq    %rdi
      66:       48 ff c8        decq    %rax
      69:       75 f5   jne     -11 <_to_lower+0x60>
; };
      6b:       5d      popq    %rbp
      6c:       c3      retq

Making sense of it

Not sure about you, but I'm not very good at reading assembly. I know what lots of the instructions do, and can follow simple control flow, but keeping track of complex changes in state and carry flags takes a lot of energy. I don't take great insight from this assembly dump; I just try to zoom in on the interesting bits.

Edit 24/08: Turns out I'm really bad at reading x86! I had originally confused a call for uninlined ByteString.Map for the inner loop. I've removed a little of the Twitch Reads ASM-style content for now.


The native version is going one byte at a time in a loop. That's about it. I had some incorrect ASM tea-leaf reading here before, but no longer.


The C version has compiled to a much larger instruction pile. We've got nice sourcemaps, because I compiled with debugging symbols enabled (-g), allowing us to relate each line of C to the generated code. This helps somewhat if you're a rube like myself.

The second occurrence of |= 0x20 is the meat of the loop, and it goes a long way to explaining the dramatic performance gap. It's using orps on the xmm registers to do the heavy lifting. A quick Google will reveal these to be parts of Intel's SIMD Streaming Extensions, or SSE.

In short, the compiled C is making use of CPU vector operations. It's packing the 128-bit registers with characters and OR'ing up to 16 characters at once. When it doesn't have enough characters to pack those registers, it seems to just proceed a byte at a time until the input is exhausted.

Glancing back at our benchmark, each C result is about 16 times faster, give or take. SIMD explains most of the gap away.


So, why doesn't GHC generate vectorised code in this case?

You probably shouldn't ask me, I learned everything I know about SSE from Wikipedia just now.

I'd like to hear about the backstory here, and why it's difficult. Get in touch if you can be bothered! I will go read up a little.

Edit 24/08: Austin Seipp has provided a great wealth of information on automatic vectorisation on Reddit , read it!

Edit 24/08: Thanks to Ilya Yanok for pointing out some of my incorrect comments on the native ASM.

The code in this post is truncated. Get the full source here.