Safe map merge

Published: 2017-03-03
Words: 1740

One of my most frequent bug sources is Data.Map.union, a left-biased map merge that discards the right-hand side value when it encounters on duplicates. The Monoid instance for Map uses this as its associative function (mappend), so it's easy as hell to just cram two maps together and ignore the consequences.

Most of the bugs I cause / encounter via union are cousins of shadowing, i.e. accidental reuse of a name that was supposed to be unique. This means a user down the line will encounter absolutely baffling behaviour - something they referred to was something else! They're particularly insidious bugs, as concise as <> and easily palmed off as user error, to be dealt with after v0.

If you do take the time to do things correctly the first time, Data.Map provides further friction: unionWith does not allow us to merge with monadic or applicative effects. You'll need to write your own slowpoke union function out of sequenceA and unionWith, and we don't have time for that!

Luckily, upstream has your back. containers >= 0.5.9.1 provides a joyous API for fast and precise map merges, of both the pure and effectful variety. Jump straight to the new API, Data.Map.Merge, if you like reading documentation.

Quick start

The new merge API comes in both strict and lazy variants, Data.Map.Merge.Strict and Data.Map.Merge.Lazy. Much like the rest of Data.Map, these share the underlying Map type; they can be used interchangeably, affecting only the strictness of the leaves.

The most general function is mergeA, which takes three merge tactics parameterised by some Applicative, and uses them to merge two maps in that Applicative. There's also merge, which specialises mergeA to Identity.

The three tactic parameters cover every case we may encounter when merging two maps: keys defined in one map but not the other (WhenMissing), and keys defined in both maps (WhenMatched).

I mostly use zipWithAMatched, traverseMissing, and preserveMissing, but there are a whole bunch of tactics available, supporting pretty much any map operation you can think of. For instance, zipWithMaybeAMatched allows us to conditionally remove matching items from the result map during a merge, something that was quite irritating before.

Union with effects

It's now quick and easy to perform error handling while taking the union. For instance, if duplicate keys are a problem, we can include an effectful zipWithMatched tactic, and turn them into Left:

import           Data.Map.Strict (Map)
import qualified Data.Map.Merge.Strict as M

-- | Union two maps, applying some effectful function to duplicates.
unionWithA ::
     (Applicative f, Ord k)
  => (k -> a -> a -> f a)
  -> Map k a
  -> Map k a
  -> f (Map k a)
unionWithA f m1 m2 =
  M.mergeA
    M.preserveMissing -- Preserve keys found in m1 but not m2
    M.preserveMissing -- Preserve keys found in m2 but not m1
    (M.zipWithAMatched f) -- Apply f when keys in both m1 and m2
    m1
    m2

More concretely, you might want to specialise the above to Either or EitherT:

newtype Dict = Dict (Map Text Text)

data DictError =
    ReusedName Text
    
unionDict :: Dict -> Dict -> Either DictError Dict
unionDict (Dict m1) (Dict m2) =
  Dict $ unionWithA (\k _ _ -> Left (ReusedName k)) m1 m2

Validating records

With so many tactics, we don't need to be implementing dour functions like unionWithA all day; we can do almost anything we like involving two maps. For instance, I recently found myself using mergeA in a type checker to provide fine-grained errors when comparing two records for field name equality:

newtype Record = Record (Map Text Text)

data RecordError =
    MissingField Text
  | ExtraneousField Text

-- | Validate that a user-supplied record fits its declaration.
validateRecord :: Record -> Record -> Either RecordError ()
validateRecord (Record have) (Record want) =
  void $ M.mergeA
    -- Fields we have, but don't want
    (M.traverseMissing (\k _ -> Left (ExtraneousField k)))
    -- Fields we want, but don't have
    (M.traverseMissing (\k _ -> Left (MissingField k)))
    -- Fields defined in both - all good!
    -- We are free to change the value to anything, like ()
    (M.zipWithAMatched (\_ _ _ -> pure ()))
    want
    have

(It was not much of a stretch to use mostGeneralUnifier as the WhenMatched tactic in this same merge, unifying the records and throwing all relevant errors in just three lines.)

This really does happen

Isolated in a blog post like this, map merge bugs look a little bit trivial. In practice, they occur somewhere in the last mile of your project, at the edges of the system, when most of the exciting work has been done already and your life is comprised mostly of plumbing.

Let's take a step back in time, all the way back to 2015. We're working on a production application at a large enterprise, in Haskell of course. It's slow going. Not much is novel here.

From amongst the boilerplate, in a big pile of Either and EitherT, two maps appear, and need to be combined. There's nothing like unionWithA or mergeA handy, because we're all the way back in 2015! A bad solution is only two-to-four keystrokes away.

To illustrate how this kind of thing sneaks up on you and exploits your innate human weakness, I'm going to close out this post with a vaguely-realistic worked example. It's intended to be dull as dishwater - try to follow along closely, get appropriately bored, and see if you spot the union.

Suppose we're implementing a small interface description language. This is just another way of serialising the definitions of datatypes, something a little like protobuf:

struct Foo {
    foo : String
  , bar : Int32
  }

Let's get some imports out of the way:

{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import           Control.Monad
import           Control.Monad.IO.Class (MonadIO (..))
import           Control.Monad.Trans.Except (ExceptT (..), runExceptT, withExceptT)

import           Data.Bifunctor (first)
import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as T
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as M
import qualified Data.Map.Merge.Strict as M
import           Data.Foldable
import           Data.Traversable

import           System.Environment (getArgs)
import           System.IO (FilePath, IO, hPutStrLn, stderr)

In our IDL, each structure looks a lot like a record. Records have names and a list of fields. Each field has a name and a type.

newtype StructName = StructName {
    unStructName :: Text
  } deriving (Eq, Ord, Show)

data Struct = Struct {
    structFields :: [(StructFieldName, StructFieldType)]
  } deriving (Eq, Ord, Show)

newtype StructFieldName = StructFieldName {
    unStructFieldName :: Text
  } deriving (Eq, Ord, Show)

data StructFieldType
  = StructString
  | StructVar StructName
  deriving (Eq, Ord, Show)

Each structure can freely refer to the others, so the structure names must be globally unique. We might represent a parsed file as a map from StructName to Struct:

newtype Definitions = Definitions {
    unDefinitions :: Map StructName Struct
  } deriving (Eq, Ord, Show, Monoid)

... and let's assume someone's sketched out a useful error type for us, along with a parser:

data ParseError = ParseError
  deriving (Eq, Ord, Show)

-- Parse a file into a set of definitions.
parseFile :: FilePath -> ExceptT ParseError IO Definitions
parseFile = undefined <=< (liftIO . T.readFile) -- implement me!

We're not monsters, or so we think. Non-monsters (nonmons) like to provide as much feedback as possible to their users. We, us, the good programmers, would like to make sure that every StructVar points to something that's been defined before, somewhere. This means we can't do anything without all the definitions, all together, all at once.

It's 5pm now, and the colour is draining from your face. The coffee machine has been switched off. People are dribbling out of the office and into the streets. But, your users! Your users are still at risk. We must gather and validate these datatypes now, quickly, at speed, and get this into production before it is too late:

-- Top-level application errors, composing our various other error types.
data IDLError
  = IDLParseError ParseError
  | IDLDefinitionsError DefinitionsError
  deriving (Eq, Ord, Show)

-- Semantic errors that can occur in the definitions.
data DefinitionsError
  = UnknownStruct StructName
  deriving (Eq, Ord, Show)

-- don't use Show on errors! write this nice rendering function!
renderIDLError :: IDLError -> Text
renderIDLError =
  undefined -- implement me!

main :: IO ()
main = do
  args <- getArgs -- wow, I'm really in a hurry!
  e <- flip fmap (runExceptT (parseFiles args)) $ \ee -> do
    p <- first IDLParseError ee
    first IDLDefinitionsError (validateDefinitions p)
  either
    (hPutStrLn stderr . T.unpack . renderIDLError)
    (const (putStrLn "Well done! I'll give you your datatypes tomorrow"))
    e

-- | Decide if this definition set is valid, by checking that each
-- 'StructVar' points to a known definition, else throwing a 'DefinitionsError'.
validateDefinitions :: Definitions -> Either DefinitionsError ()
validateDefinitions (Definitions defs) =
  for_ defs $ \(Struct fields) ->
    for_ fields $ \(_fieldname, fieldtype) ->
      case fieldtype of
        StructString ->
          pure ()
        StructVar n ->
          maybe (Left (UnknownStruct n)) (const (pure ())) (M.lookup n defs)

-- | Parse a big set of definition files into a single definition set.
parseFiles :: [FilePath] -> ExceptT ParseError IO Definitions
parseFiles fps =
  fmap fold (traverse parseFile fps)

Spot the bug! It's a huge one, and a bad one, and you most definitely do think about it a little as you commit it. Maybe you leave a little comment or file an issue saying 'fix map merges'. Doing the right thing is pretty tedious, and can totally be a v1 thing. Right? Ship it.

(The bug is in parseFiles, namely the use of fold. This is using the Monoid instance, which uses Data.Map.union. If two structures have the same name, we just keep whichever one we saw first, and throw the other one away.)

It'd be really tedious to deal with this properly in 2015. Luckily, in 2017, we can reach straight for mergeA and our unionWithA function, safely eliminating the fold with very little API friction and nary a second thought:

-- | Parse and validate a set of definition files.
main' :: [FilePath] -> ExceptT IDLError IO Definitions
main' files = do
  defs <- withExceptT IDLParseError (traverse parseFile files)
  ExceptT (return (first IDLDefinitionsError (validateDefinitions' defs)))

-- | Validate a set of parsed definition files.
validateDefinitions' :: [Definitions] -> Either DefinitionsError Definitions
validateDefinitions' defs =
  fmap Definitions $
     foldM
       (unionWithA (\k _ _ -> Left (UnknownStruct k))
       mempty 
       (fmap unDefinitions defs)

Feature history

This API was contributed by David Feuer. It appeared under Data.Map.Strict.Merge and Data.Map.Lazy.Merge in containers-0.5.8.1. Those modules were quickly deprecated in favour of the current Data.Map.Merge.Strict/Data.Map.Merge.Lazy family in containers-0.5.9.1.