I've been writing about some Haskell black magic lately, and I'm sorry to say that continues here: we're going to implement a couple of different sorts of fusion.

*Fusion* is one of those words that made me recoil for quite a few years, since it sounds extremely complicated and arcane. A few years into building applications in Haskell, I still hadn't forced the thunk.

Fusion powers libraries Haskell developers use all day, like `vector`, `conduit`, `pipes`, and `text`. It's clearly a valuable technique that library authors should understand, and that application programmers can blissfully ignore.

In this post I find that it's not so scary, though there is a *lot* of overwhelming material out there.

As per usual, I am writing to formalise my own notes and understanding.

These notes are even rougher than usual.

I have compiled a list of superior resources on this topic. I recommend you consult these before taking my notes too seriously.

Please do get in touch if I'm extremely wrong about something, but don't be a pedant.

When you first pick up Haskell, it's common to spend some time implementing the list functions from the Prelude. For instance, the canonical implementation of my favourite function, list append, uses plain old recursion:

```
(++) :: [a] -> [a] -> [a]
(++) [] ys = ys
(++) (x:xs) ys = x : xs ++ ys
```

That little function, the first I ever wrote, is in the Prelude verbatim. This is great! Little general combinators are right at home in our standard library.

If you look a little closer at the Prelude's source code, however, some terrifying magic sits alongside our good friend `(++)`:

```
{-# NOINLINE [1] (++) #-} -- We want the RULE to fire first.
{-# RULES
"++" [~1] forall xs ys. xs ++ ys = augment (\c n -> foldr c n xs) ys
#-}
```

When optimisations are enabled, a rewrite rule swaps out our simple implementation for a *completely different* inline function. There's also a `NOINLINE` pragma for the `(++)` symbol. The integer subscripts and tilde are key to understanding what GHC will do here^{1}:

- GHC's simplifier runs in numbered phases, which count down from N to 0.
- `(++)` will not be inlined until phase 1. It may be inlined during phase 1 or 0.
- The rewrite rule is encouraged to fire up until phase 1. It will not fire in phase 1 or 0.

In sum, we make quite an effort to replace the list append operator with a fold. Rewrite rules are subject to GHC heuristics, and are not guaranteed to fire. At the last second, if it has not been replaced with an inline fold, `(++)` is treated like an ordinary function again.
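To make the phase dance concrete, here's a minimal sketch you could drop into any module. `double` and the rule are my own toy inventions; only the pragmas, which mirror the shape of the `(++)` setup above, are the point.

```haskell
-- A toy consumer. NOINLINE [1] keeps this definition intact until
-- phase 1, giving the rewrite rule below a chance to fire first.
double :: [Int] -> [Int]
double = map (* 2)
{-# NOINLINE [1] double #-}

-- Active only *before* phase 1 (the [~1]), just like the "++" rule in
-- base. If it fires, two traversals collapse into one.
{-# RULES
"double/double" [~1] forall xs. double (double xs) = map (* 4) xs
  #-}
```

Whether or not the rule fires, the program means the same thing; the pragmas only change which implementation survives optimisation.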

OK, so GHC has been programmed to favour a different implementation of the function when optimisations are enabled. It does this for most of the list functions defined in `base`. Why?

Function composition is very intuitive. Every composition saves us from reimplementing features. Unfortunately, naive function composition chains can lead to inefficiency: each function must produce intermediate values to pass to the next.

Let's choose a function chain to benchmark, and compile with `-O0`. We should be able to demonstrate excessive allocation, especially when compared against `-O2`:

```
main :: IO ()
main =
  print expr

expr :: [Int]
expr =
  take 100000 . map (+1) . filter (\x -> x `mod` 2 == 0) $ [1..1000000]
```

```
$ ghc -rtsopts -prof -O0 Plain.hs
$ ./Plain +RTS -pa
$ head Plain.prof
<snip>
total time = 0.58 secs (578 ticks @ 1000 us, 1 processor)
total alloc = 78,136,000 bytes (excludes profiling overheads)
```

```
$ ghc -rtsopts -prof -O2 Plain.hs
$ ./Plain +RTS -pa
$ head Plain.prof
<snip>
total time = 0.56 secs (563 ticks @ 1000 us, 1 processor)
total alloc = 39,735,832 bytes (excludes profiling overheads)
```

Optimisations are cutting our allocation load in half, though the time isn't affected much, since this is a simple linear-time program.

This is a pretty coarse metric, and it's hard to know whether this change is entirely thanks to short-cut fusion. Before I run off to decipher the Core, let's redefine all the functions we're using, so we can use `-O2` without the rewrite rules in question:

```
import Prelude hiding (map, take, filter)

main :: IO ()
main =
  print expr

expr :: [Int]
expr =
  take 100000 . map (+1) . filter (\x -> x `mod` 2 == 0) $ [1..1000000]

map :: (a -> b) -> [a] -> [b]
map _ []     = []
map f (x:xs) = f x : map f xs

take :: Int -> [a] -> [a]
take n _ | n <= 0 = []
take _ []         = []
take n (x:xs)     = x : take (n-1) xs

filter :: (a -> Bool) -> [a] -> [a]
filter _pred []    = []
filter pred (x:xs)
  | pred x    = x : filter pred xs
  | otherwise = filter pred xs
```

```
$ ghc -rtsopts -prof -O2 Plain.hs
$ ./Plain +RTS -pa
$ head Plain.prof
<snip>
total time = 0.56 secs (561 ticks @ 1000 us, 1 processor)
total alloc = 67,736,504 bytes (excludes profiling overheads)
```

Better than `-O0`, but with about 30MB of additional allocations.

Looking at the Core, our recursive functions pasted from the Haskell Report have not been inlined. GHC is relatively powerless to optimise naively-recursive functions. Inlining recursive definitions is usually a bad idea, since it will either fail to terminate or replace a finite loop with a huge unrolled function. If we can't inline, we can't take full advantage of GHC's wonderful optimisations. This compels us toward more interesting techniques.

The fusion scheme used in `base` for list functions is known as *short-cut fusion*. It originated in the 1993 paper A Short Cut to Deforestation and can be found written up on pretty much every Haskell-inclined blog in the world.

## `foldr/build`

This technique involves rewriting all our list *consumer* functions in terms of `foldr`, rewriting all our list *producers* in terms of a dual combinator called `build`, and letting GHC rewrite `foldr/build` chains into smaller functions via a known identity.^{2}

```
foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f z []     = z
foldr f z (x:xs) = f x (foldr f z xs)

build :: ((a -> [a] -> [a]) -> [b] -> c) -> c
build g = g (:) []
```

It's usually straightforward to rewrite our consumer functions in terms of `foldr`:

```
map f xs = foldr (\a b -> f a : b) [] xs
filter f xs = foldr (\a b -> if f a then a:b else b) [] xs
```

... though `take` is a little messy, so I'll elide it and point you at GHC.List.
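For the curious, here is a rough sketch of the usual trick, with my own name `takeFB` — this is not GHC's actual definition, just the shape of it: the fold builds a *function* that consumes the count.

```haskell
-- take via foldr: each step of the right fold produces a function
-- awaiting the remaining count, so the counter threads left-to-right.
takeFB :: Int -> [a] -> [a]
takeFB n xs = foldr step (const []) xs n
  where
    step x rest k
      | k <= 0    = []
      | otherwise = x : rest (k - 1)
```

GHC's real version is fussier about laziness and phase control, which is exactly why it's messy.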

If you look carefully at the type signature, the first argument to `build` is a specialised right fold. `build` takes this fold function and pushes cons through it:

```
> build foldr [1..10]
[1,2,3,4,5,6,7,8,9,10]
```

In other words, `build` finalises a suspended fold. If we see a `build` on the right-hand side of a `foldr`, we can eliminate the `foldr` entirely, perpetuating the more interesting suspended one instead. This forms an identity that is expressed in GHC as a rewrite rule:

`{-# RULES "my foldr/build" forall g k z. foldr k z (build g) = g k z #-}`

It is also helpful to reimplement the enumeration function, `from`, in terms of `build`. `from` takes two bounds, and produces a specialised fold wrapped in a `build`. It is the simplest list producer. Note that it is defined recursively.

```
from :: Int -> Int -> [Int]
from a b = build (from' a b)
  where from' c d = \k n -> if c > d then n else k c (from' (c+1) d k n)
```

The last part of the puzzle is inlining. If our `foldr`-based functions don't get inlined, the rewrite rule cannot be applied, since rewriting is entirely syntactic.

```
{-# INLINE map #-}
{-# INLINE filter #-}
```

Now, our chain of list functions is inlined into a chain of `foldr`s, terminated by a `build`. The rewrite rule is able to eliminate all but the outer `foldr`. We have a single loop body that GHC is free to optimise, and we do not perform redundant allocation.
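To see the whole pipeline in one place, here is a self-contained sketch of the scheme — my own transcription, not GHC's actual code. Note that I use the standard rank-2 type for `build` here, rather than the simplified signature shown earlier, since the rewrite rule only typechecks against the general version:

```haskell
{-# LANGUAGE RankNTypes #-}

import Prelude hiding (foldr, map, filter)

foldr :: (a -> b -> b) -> b -> [a] -> b
foldr f z []     = z
foldr f z (x:xs) = f x (foldr f z xs)

-- The rank-2 type guarantees g can only assemble its result from the
-- supplied cons and nil, which is what makes the rule sound.
build :: (forall b. (a -> b -> b) -> b -> b) -> [a]
build g = g (:) []

-- Consumers, written as folds and marked INLINE so the rule can see them.
map :: (a -> b) -> [a] -> [b]
map f xs = foldr (\a b -> f a : b) [] xs
{-# INLINE map #-}

filter :: (a -> Bool) -> [a] -> [a]
filter p xs = foldr (\a b -> if p a then a : b else b) [] xs
{-# INLINE filter #-}

-- The producer: an enumeration wrapped in build. The recursion lives
-- in go, so from itself is safe to inline.
from :: Int -> Int -> [Int]
from lo hi = build (\k n -> go lo k n)
  where go c k n = if c > hi then n else k c (go (c + 1) k n)
{-# INLINE from #-}

-- The identity that lets GHC collapse consumer/producer pairs.
{-# RULES
"foldr/build" forall k z (g :: forall b. (a -> b -> b) -> b -> b).
  foldr k z (build g) = g k z
  #-}
```

With `-O2`, a chain like `map (+ 1) (filter even (from 1 10))` — which evaluates to `[3,5,7,9,11]` either way — can be collapsed into a single loop with no intermediate lists.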

Rules are based on heuristics, so they don't always fire. If short-cut fusion does *not* take place, these `foldr`-based functions are likely to be worse than the naive recursive versions. This is one motivation for the sneaky rewrite-based redefinition of functions like `(++)`: we get to keep the naive implementation when things don't go our way.

## `destroy/unfoldr`

`foldr/build` was not the last word in fusion. It has proved very practical and is still in use, but it suits some functions better than others.

It was very unintuitive to implement `take` in terms of `foldr` and `build`. We didn't try to implement `zip`, but rest assured it is both difficult and disappointing: intermediate lists still get produced.

To this end, an alternative scheme for short-cut fusion was devised. This scheme is dual to `foldr/build`, and is known as `destroy/unfoldr`. It is implemented via similar rewrite mechanisms.

From Shortcut Fusion for Accumulating Parameters and Zip-like Functions by Svenningsson:

```
unfoldr :: (b -> Maybe (a, b)) -> b -> [a]
unfoldr f b =
  case f b of
    Nothing      -> []
    Just (a, b') -> a : unfoldr f b'

destroy :: (forall a. (a -> Maybe (b, a)) -> a -> c) -> [b] -> c
destroy g xs = g listpsi xs
  where listpsi :: [a] -> Maybe (a, [a])
        listpsi []     = Nothing
        listpsi (x:xs) = Just (x, xs)
```

If you squint at the type signatures of `unfoldr` and `destroy`, you will see a familiar thing at play: `destroy` expects a generalised unfold as its first argument, much like `build` expected a specialised fold. It takes that suspended unfold and applies it to the least interesting unfold possible, the list unravel:

```
> destroy unfoldr [1..10]
[1,2,3,4,5,6,7,8,9,10]
```

The identity we use to eliminate redundant destroy/unfoldr pairs will also look a little familiar, though inverted:

`{-# RULES "my destroy/unfoldr" forall g psi e. destroy g (unfoldr psi e) = g psi e #-}`

... and functions implemented via these combinators look a little bit vexing:

```
from n m = unfoldr (\i -> if i > m then Nothing else Just (i, succ i)) n

zip xs ys =
  destroy (\psi1 e1 ->
    destroy (\psi2 e2 ->
      unfoldr (zipDU psi1 psi2) (e1, e2)
    ) ys
  ) xs
  where zipDU psi1 psi2 (e1, e2) =
          case psi1 e1 of
            Nothing -> Nothing
            Just (x, xs') ->
              case psi2 e2 of
                Nothing -> Nothing
                Just (y, ys') -> Just ((x, y), (xs', ys'))
```

Though that implementation of `zip` looks complicated, the `destroy/unfoldr` rewrite rules do their thing, and GHC is able to eliminate the `Maybe` values entirely. Skim the paper; it's concise.
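To make the pay-off concrete, here is a hand-fused sketch of what the rule leaves behind — the names are mine, and this is a hand evaluation rather than actual GHC output. After both `destroy`s meet their `unfoldr`s, `zip` becomes a single loop over a pair of seeds, with no intermediate lists:

```haskell
-- zip (unfoldr psiA a0) (unfoldr psiB b0) rewrites towards
-- unfoldr (zipDU psiA psiB) (a0, b0); inlined, that is one loop:
zipFused :: (s -> Maybe (a, s)) -> (t -> Maybe (b, t)) -> s -> t -> [(a, b)]
zipFused psiA psiB a0 b0 = go (a0, b0)
  where
    go (a, b) =
      case psiA a of
        Nothing      -> []
        Just (x, a') ->
          case psiB b of
            Nothing      -> []
            Just (y, b') -> (x, y) : go (a', b')
```

For example, stepping two counters with `\i -> if i > m then Nothing else Just (i, i + 1)` zips the two enumerations directly off their seeds.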

GHC did not adopt `destroy/unfoldr`.

I am extremely unqualified to elaborate on why GHC uses `foldr/build` over `destroy/unfoldr`, or stream fusion for that matter. I would love to read a folklore piece to that effect.

Given that all the loop fusion frameworks we come across seem to be based on either folds or unfolds, it seems inevitable that there's some more general theoretical framework tying it all together.

There's a lot of work on the correctness of short-cut fusion. Of course, I have read none of it, so I cannot point you in the right direction. I'm so tired, friends. I can't fuse any more streams today.

I did, however, quite enjoy reading Thomas Harper and Ralf Hinze's treatment. In Theory and Practice of Fusion, `foldr/build` and `destroy/unfoldr` are presented as dual Church and co-Church encodings in a uniform setting.

Stream fusion is a newer framework. It emerged from libraries like `bytestring` and `vector` about ten years ago. It's easy enough to implement on your own; I just wore myself out trying to produce nice canned examples. I recommend reading From Lists to Streams to Nothing at All^{3}, which lays it all out pretty clearly.

I had originally set out to write about stream fusion today, since it is so easy to implement. I've decided instead to talk about short-cut fusion, and sit on the stream fusion post until I properly explain how it generalises destroy/unfoldr. It does!

Roman Leshchinskiy has a nice article on that front.

We've fused loops in Haskell before! We've rolled our folds together, and we've mashed up our fmaps. However, fusion is very different to *runtime* constructions like Yoneda and Foldl. There's no runtime overhead, the results can be much more dramatic, and the user does not need to know about it.

Compile-time fusion is a form of metaprogramming. We use GHC directives to control the way functions are substituted and inlined. We are free to subvert our own function definitions, replacing them with equivalent representations that compose well and can be optimised down to nothing at all.

`RULES` and `NOINLINE` pragmas can be used in any Haskell module. Very aggressive optimisation schemes can be pursued by library authors without any need to poke around inside GHC. As with our original `(++)`, any non-optimising Haskell compiler can ignore the pragmas as comments and still end up with a functional program.

Doing this well requires intimate knowledge of GHC's optimisation passes, though there is significant prior art to learn from, especially in the field of streams.

There is also risk and frustration in this approach, due to its reliance on GHC heuristics. Users may hold your library incorrectly, producing code that GHC can't optimise for complex reasons. Failure to fuse in a performance-sensitive library may be a catastrophic outcome. A more explicit metaprogramming approach may be necessary for extreme applications.

- A Short Cut to Deforestation
- Shortcut Fusion for Accumulating Parameters and Zip-like Functions
- From Lists To Streams To Nothing At All
- Wadler, Recursive types for free!
- A Library Writer's Guide to Shortcut Fusion, Thomas Harper
- Theory and Practice of Fusion, Hinze, Harper, James

- Franklin Chen's good post
- Chris Done's good motivation
- Kwang Yul Seo's good motivation
- Shortcut fusion for pipes, Gabriel Gonzalez
- Response to Reddit thread, Gabriel Gonzalez
- Response to Reddit thread, Michael T
- GHC Optimization and Fusion

If you haven't seen these pragmas before, I recommend briefly skimming the GHC user's guide, particularly the section on pragmas and phase control.↩

If you're wondering about the `augment` function we saw before, it is just a variant of `build` that provides an initial list, rather than `[]`. It is especially useful for combinators like `(++)`, where the right-hand side of the list should not be folded over. It is defined as `augment g xs = g (:) xs`.↩

Speaking of *nothing at all*, I know *nothing at all* about the theory of stream fusion. I've read the paper, and I've looked at the source code for a few libraries that rely on fusion for performance, but besides that I am just regular Joey Haskell: armed with functions, datatypes, and sprinkles of wonderful GHC bullshit. You should really read the paper, especially if you were considering taking this post seriously.↩