Skip to content

Data Generic programming

The other day this tweet showed up.

I like those questions because they ask for multiple questions at once:

  • What APIs are available to achieve this goal?
  • What tradeoffs are available in this particular situation?
  • What is the most idiomatic in this language?

And finding an answer that fits all those questions is quite a challenge

In this specific instance there seem to have two main solutions:

1. Use a fold

fold is called reduce in switft but the idea is the same:

func split3<A, S>(arr: S, test: (A) -> Three) -> ([A],[A],[A]) where S: Sequence, S.Element == A {
    arr.reduce(into: ([],[],[])) { acc, i in
        switch test(i) {
        case .first: acc.0.append(i)
        case .second: acc.1.append(i)
        case .third: acc.2.append(i)
        }
    }
}

Here i'm using a custom data type Three which has three possible values:

enum Three: Equatable {
    case first, second, third
}

The reduce implementation is using an accumulator of three empty arrays and adds element to the correct one depending on which of the three value the original array maps into.

This has the benefit to iterate the array only once, it's also a single expression instead of using multiple imperative statements, that meanst that you can't reorder the lines without stopping the program from compiling. The problem is that it's not particularly readable

2. Use a loop

One of the bread and butters of c programmers is to use loops instead of higher order functions, this situation lends itself very well to a loop since accessing the arrays from within the loop feels very natural:


func splitLoop<A, S>(arr: S, test: (A) -> Three) -> ([A],[A],[A]) where S: Sequence, S.Element == A {
    var arr1: [A] = []
    var arr2: [A] = []
    var arr3: [A] = []
    for i in arr {
        switch test(i) {
        case .first: arr1.append(i)
        case .second: arr2.append(i)
        case .third: arr3.append(i)
        }
    }
    return (arr1,arr2,arr3)
}

This code is a bit longer but is fairly easy to understand, however it's a bit verbose and it introduces a new possible source of errors compared to the previous option: Nothing stops you from returning (arr1, arr1, arr1) or (arr1, arr2, arr1). Of course that's a trivial bug to fix but not a trivial bug to detect since the compiler won't warn you.

The pragmatic solution

Using 3 filter turns out to produce a solution that's short, easy to read and easy to understand:

func splitFilter<A, S>(arr: S, test: (A) -> Three) -> ([A],[A],[A]) where S: Sequence, S.Element == A {
    let arr1 = arr.filter { i in test(i) == .first }
    let arr2 = arr.filter { i in test(i) == .second }
    let arr3 = arr.filter { i in test(i) == .third }
    return (arr1, arr2, arr3)
}

This has the same problem that you can flip around the return value but the rest of the body of the function is really nice to read! It's a bit unsatisfying because it traverses the list three times instead of once but you could argue that's just a constant factor in an otherwise O(n) solution.

Solving the generic version

The question that popped into my mind was as follow:

if filter allows you to partition your functions in 2, and split partitions our function in 3, is there a version that splits any array into n?

There are many ways of solving this problem but they all involve some sort of gymnasitc to do with meta-programming or macros. Thankfully you need none of that if you use Idris.

There are two key observations to make to reach a solution:

  • We need to be generic over n which will be the number of ways we partition our original array
  • We need to be generic over the type of the predicate: It needs to have n different values in order to discriminate which "bucket" each value is going to end up in.

Using the first observation we know we have to implement a function that looks a bit like this:

splitN : (n : Nat) -> (pred : a -> ???) -> (ls : List a) -> Vect n (List a)
  • The first argument n is the number of ways we split our list
  • pred is the predicate that tells us which "bucket" each element is supposed to land in
  • ls : List a is the original list we want to split.
  • Vect n (List a) are our n buckets that contains all the elements from the original list but placed in their corresponding batch.

I've left the return type of pred a mystery because before I tell you what it is we need to understand what it does. pred is a function that, given any value of the array, tells you "which bucket" the value belongs to. If there are three possible buckets, like in the original tweet, then ??? must be a type like Three that we saw earlier. But if there are only 2 buckets, then ??? should be Bool! because we want to split our list into two possible partitions: the one for which pred is true and the one for which pred is false.

If we are looking to split our array in 4 then we need a type that has four possible values, and elements that map into the first value will end up in the first bucket, elements that map to the second value end up in the second bucket, etc.

So that means the type of ??? is a type that has exactly as many inhabitants (aka as many values) as n.

Generic data descriptors

The following is the foundation of what makes Typedefs possible: an algebra of types. This interpretation is given by the wonderful thesis by Peter Morris Constructing Universes for Generic Programming (You can probably find it on the internet if you look hard enough). Long story short we need this data type:

-- CFT stands for "Context Free Type"
data CFT : Type where

  -- Void
  Zero : CFT

  -- Unit
  One : CFT

  -- Algebraic definitions
  (+) : (a, b : CFT) -> CFT
  (*) : (a, b : CFT) -> CFT

(I've eluded variables, application and Mu because they are not used for what we are doing now)

This data type describes strictly positive types and allows you to build bigger types out of smaller ones. Here Zero represents the type with 0 inhabitants, and One the type with exactly one value, this is often referred to as the unit type. * and + refer to products and coproducts of types, in layman's words, "pairing" of types and "eithering" of types.

In order to represent a value with 2 in habitants it is enough to write

Two : CFT
Two = One + One

Three can be defined as Two + One, and Four as Three + One etc. each of them allow you describe a type of exactly the same number of different values as their names. So Two has two possible values.

We can automate the process by writing a function that converts from Nat to CFT by creating a description for a type with the number of values corresponding to the Nat value:

TyFromNat : Nat -> CFT
TyFromNat Z = Zero
TyFromNat (S k) = One + TyFromNat k

Since CFT is just a description for a type we need to instanciate it as a Type in the host language, Idris:

data Ty : CFT -> Type where
  Left : Ty s -> Ty (s + t)
  Right : Ty t -> Ty (s + t)
  Unit : Ty One
  Pair : Ty s -> Ty t -> Ty (s * t)

(Again I've eluded things we don't care about like contexts and Mu)

This allow us to create values for each of our type descriptions:

FunnyBool : Type
FunnyBool = Ty (TyFromNat 2)

FunnyTrue : FunnyBool
FunnyTrue = Left Unit

FunnyFalse : FunnyBool
FunnyFalse = Right (Left Unit)

Here Left Unit has type Ty (TyFromNat 2), you might be confused as to why this means and that's because it means nothing. However, we, as humans, are going to assign the meaning "true" to this value, that's why It's called "FunnyTrue" and has type "funnyBool", same thing for "False"

Equipped with this power we can defined a function generic over the number of buckets we want to sort our list into, as well as the type of the function necessary in order to select the correct buckets:

Implementing the generic split function

The full type signature makes use of what we've just defined with CFT, TyFromNat and Ty:

splitN : (n : Nat) -> (pred : a -> Ty (TyFromNat n)) -> (ls : List a) -> Vect n (List a)

Now we see that the predicate must return a value which precisely tells us which bucket the value ends up in, if n is two then there are two buckets and our function is basically a boolean predicate, if n is three we are in the same position as our original problem.

The full implementation is fairly straightforward despite the complexity of the type signature. It involves doing nothing when the list is empty, and making a recursive call when it's not. In the recursive case we need to remember to insert the head of the list in the correct bucket, for this we use an auxillary function addToCorrectBucket:

addToCorrectBucket : (value : a)
                  -> {n : Nat}
                  -> (choice : Ty (TyFromNat n) [])
                  -> (vect : Vect n (List a))
                  -> Vect n (List a)
addToCorrectBucket value {n = 0} choice [] = [] -- when there are no buckets
                                                -- then don't do anything
addToCorrectBucket value {n = (S k)} (Left Unit) (y :: xs) =
  (value :: y) :: xs -- if we found the bucket (y), leave the rest of the list
                     -- as is (xs) , but add the value to the bucket
addToCorrectBucket value {n = (S k)} (Right x) (y :: xs) =
  y :: addToCorrectBucket value x xs -- if this bucket (y) is not the correct
                                     -- bucket then just keep going and don't
                                     -- modify it

splitN : (n : Nat) -> (pred : a -> Ty (TyFromNat n)) -> (ls : List a) -> Vect n (List a)
splitN n p [] = replicate n [] -- If we've gone through the entire list
                               -- retrn `n` empty buckets
splitN n p (x :: xs) = let check = p x
                           rec = splitN n p xs in
                           addToCorrectBucket x check rec

With this we can write a simple test to partition a list in two:

splitN 2 (fromBool . isEven) [1 .. 4]

(fromBool is just a way to conver from Bool to Ty (TyFromNat 2) so that the types align correctly)

and we get [[2,4], [1,3]] as expected.

Just to make sure I also wrote a test that sorts out polynomial into 3 buckets: the first one if they have 2 real solutions, the second one if they have 1 solution and the last one if they have no solutions:

record Polynomial where
  constructor MkPoly
  x2 : Int -- x squared term
  x : Int -- x term
  c : Int -- constant term

-- A polynomial a * x^2 + b * x + c has
-- 2 solutions if b^2 - 4*a*c > 0
-- 1 solution if b^2 - 4*a*c === 0
-- 0 solutions if b^2 - 4*a*c < 0
solutions : Polynomial -> Ty (TyFromNat 3) []
solutions (MkPoly a b c) =
  let v = b * b - 4 * a * c in
      if v > 0 then Left Unit -- 2 solutions
               else if v == 0 then Right (Left Unit) -- 1 solution
               else Right (Right (Left Unit)) -- no solutions

poly1 : Polynomial
poly1 = MkPoly 2 1 3 -- 0 solutions

poly2 : Polynomial
poly2 = MkPoly (-1) 2 3 -- 2 solutions

poly3 : Polynomial
poly3 = MkPoly 1 2 1 -- 1 solution

split3Test : splitN 3 Splits.solutions [Splits.poly1, Splits.poly2, Splits.poly3]
         === [[Splits.poly2],[Splits.poly3],[Splits.poly1]]

The last line tells us that splitN 3 solutions [poly1, poly2, poly3] indeed sorts each polynomial into the correct bucket:

[ [poly2] -- 2 solutions
, [poly3] -- 1 solution
, [poly1] -- 0 solutions
]

Conclusion

Type descriptions are nice, we can do all sorts of things without having to resort to macros or code-generation. This means we can use the same function for multiple uses. In this instance we could highly optimise our traversal of the list and every call to splitN would benefit from it. It also makes the code a lot more uniform to read since now there is no difference anymore between partitionning in 2, 3 or 4 parts, no need to write special functions for those cases.

See https://github.com/idris-lang/Idris2/wiki/Types-meta-programming-manifesto for a list of idea of things we can try with them, additionally typedefs was written using the same principles.