AFC - Abacus Formula Compiler for Java

LOOKUP, MATCH, and INDEX

The LOOKUP function, and its foundations MATCH and INDEX, are likely to be heavily used in Abacus’ financial applications. So they must be implemented efficiently both space- and runtime-wise.

Fixed-size Tables

Most lookup tables are going to be constant, I predict. Often they will be largish tables of data imported from official sources, like tax rates of different sorts.

MATCH

An efficient implementation of MATCH for such constant tables uses an array. The array allows for memory-efficient representation of the table data, and for time-efficient search. Consider:

B C D E F G H
1 2
=MATCH(C1,D1:H1)
(MC_Out)
15
(MC_In)
10 15 20 25 30

which is implemented as:

final double get$0() {
    return (double) RuntimeDouble_v2.fun_MATCH_Ascending(get$1(),
                                                         $constarr$0());
}
final double[] $constarr$0() {
    if ($constarr$0 == null)
        $constarr$0 = new double[] { 10.0, 15.0, 20.0, 25.0, 30.0 };
    return $constarr$0;
}

where fun_MATCH_Ascending performs a binary search, as Excel does. If the array contains values computed at runtime as in:

B C D E F G H
2 2
=MATCH(C2,D2:H2)
(MI_Out)
15
(MI_In)
10 15
(MI_In_1)
20
(MI_In_2)
25 30

we augment this to:

public final void reset() {
    $arr$0$init = false;
}
final double get$0() {
    return (double) RuntimeDouble_v2.fun_MATCH_Ascending(get$1(),
                                                         $arr$0());
}
final double[] $arr$0() {
    if (!$arr$0$init) {
        $arr$0$init = true;
        double[] ds = $constarr$0();
        ds[1] = get$2();
        ds[2] = get$3();
        return ds;
    }
    return $constarr$0();
}
final double[] $constarr$0() {
    if ($constarr$0 == null)
        $constarr$0 = new double[] { 10.0, 0.0, 0.0, 25.0, 30.0 };
    return $constarr$0;
}

I did not merge $arr$0() into $constarr$0() because this allows me to reuse the constant part of the array when the computation is reset().

Unrolling

I also considered unrolling the array and search, but for large tables this would lead to significant code bloat, which hurts memory-efficiency and JIT performance. Whereas the predefined search functions on the runtime used above can be JITted efficiently.

When most or all array values are computed, and maybe even expensive to compute, then unrolling is probably the better approach as it avoids computing cells not touched by the search. But this sounds like a highly improbable scenario, so I won’t address it.

Lookups into very small tables might also perform better if unrolled. Array setup might just turn out to be too expensive for them. But, while nifty, this adds too much complexity for too little gain, I believe. And people needing the speed can rewrite their small lookups to nested IFs anyway.

INDEX

For INDEX I see two approaches: using arrays as above, or using a big switch. I did a quick check and it turns out the using switch is actually more space-efficient in the .class file, as well as more memory and time-efficient. The main reason is the way constant arrays are built in Java. They are internally built as if you had assigned each element in turn. In addition to this, switch also has the advantage that only the table values actually used have to be computed. So for:

B C D E F G H
4 20
=INDEX(D4:H4,1.0,C4)
(IC_Out)
3
(IC_In)
10 15 20 25 30

we get:

final double get$0() {
    return $idx$0((int) Runtime_v2.checkDouble(get$1()) - 1);
}
final double $idx$0(int i) {
    switch (i) {
    case 0:
        return 10.0;
    case 1:
        return 15.0;
    case 2:
        return 20.0;
    case 3:
        return 25.0;
    case 4:
        return 30.0;
    default:
        throw new FormulaException
                  ("#VALUE/REF! because index is out of range in INDEX");
    }
}

which extends very naturally to computed values:

final double get$0() {
    return $idx$0((int) Runtime_v2.checkDouble(get$3()) - 1);
}
final double $idx$0(int i) {
    switch (i) {
    case 0:
        return 10.0;
    case 1:
        return get$1();
    case 2:
        return get$2();
    case 3:
        return 25.0;
    case 4:
        return 30.0;
    default:
        throw new FormulaException
                  ("#VALUE/REF! because index is out of range in INDEX");
    }
}

LOOKUP etc.

Since all the variants of LOOKUP are mapped to INDEX and MATCH, this covers the lot. There is one minor peephole optimization at work, though. Whenever there is a call to MATCH directly within INDEX, then this gets rewritten to MATCH_INT returning an integer instead of double or bigdec to avoid the unnecessary conversions from and back to integer. So:

B C D E F G H I
7 12
=LOOKUP(C7,D7:F7,G7:I7)
(LC_Out)
2
(LC_In)
1 2 3 11 12 13

is:

final double get$0() {
    return $idx$0(RuntimeDouble_v2.fun_MATCH_Ascending(get$1(),
                                                       $constarr$0()) - 1);
}

The equivalent, but explicitly written:

B C D E F G H I
10 12
=INDEX(G10:I10,1.0,MATCH(C10,D10:F10))
(IMC_Out)
2
(IMC_In)
1 2 3 11 12 13

is also:

final double get$0() {
    return $idx$0(RuntimeDouble_v2.fun_MATCH_Ascending(get$1(),
                                                       $constarr$0()) - 1);
}

Multiple Lookups

What if multiple, different values are looked up? Consider:

B C D E F G H
12 2
=MATCH(C12,D12:H12)
(MMC_Out_1)
15
(MMC_In_1)
10 15 20 25 30
13 4
=MATCH(C13,D12:H12)
(MMC_Out_2)
27
(MMC_In_2)

We clearly need to factor the common construction of the array for D12:H12 out of the generated code for the cells B12 and B13. To do this, we first need to recognise the commonality. This is hard to do efficiently for the computation model compiler as it has no notion of cell adjacency and, thus, no notion of contiguous arrays. So the spreadsheet compiler passes on the both the range origin and extent for instances of ExpressionNodeForArrayReference it generates to help the computation model compiler recognize repeated occurrences. Resulting in the single instance $constarr$0():

final double get$0() {
    return (double) RuntimeDouble_v2.fun_MATCH_Ascending(get$1(),
                                                         $constarr$0());
}
final double get$2() {
    return (double) RuntimeDouble_v2.fun_MATCH_Ascending(get$3(),
                                                         $constarr$0());
}

The same goes for INDEX, but the common code is the generated switch:

final double get$0() {
    return $idx$0((int) Runtime_v2.checkDouble(get$1()) - 1);
}
final double get$2() {
    return $idx$0((int) Runtime_v2.checkDouble(get$3()) - 1);
}

Putting it together, we get this for LOOKUP:

final double get$0() {
    return $idx$0(RuntimeDouble_v2.fun_MATCH_Ascending(get$1(),
                                                       $constarr$0()) - 1);
}
final double get$2() {
    return $idx$0(RuntimeDouble_v2.fun_MATCH_Ascending(get$3(),
                                                       $constarr$0()) - 1);
}

Finally, since we pass the range origin to the compiler, the rewriter can properly identify common sub-arrays such as are generated by rewriting H/VLOOKUP. Consider:

B C D E F G
19 12
=HLOOKUP(C19,D19:G21,2.0)
(SAC_Out_1)
19
(SAC_In_1)
10 15 20 25
20 22
=HLOOKUP(C19,D19:G21,3.0)
(SAC_Out_2)
11 12 13 14
21 2
=MATCH(C19,D19:G19)
(SAC_Out_3)
21 22 23 24

Both of the lookups are rewritten to MATCH and INDEX. The arrays referenced by all three instances of MATCH are then identical, but two are generated by the expression rewriter, not the spreadsheet loader. All are identified properly:

final double get$0() {
    return $idx$0(RuntimeDouble_v2.fun_MATCH_Ascending(get$1(),
                                                       $constarr$0()) - 1);
}
final double get$2() {
    return $idx$1(RuntimeDouble_v2.fun_MATCH_Ascending(get$1(),
                                                       $constarr$0()) - 1);
}
final double get$3() {
    return (double) RuntimeDouble_v2.fun_MATCH_Ascending(get$1(),
                                                         $constarr$0());
}

Note that only the array accessors are factored out. AFC does not do full common subexpression elimination (CSE). But if you turn on caching and rewrite the two HLOOKUPs to applications of INDEX and the central third MATCH, then you get proper reuse of the MATCH value.

Passing the origin might, though, be a problem for users (still hypothetical) of the model compiler itself. What origins should they use? They might not have a cube model at all. So for the future I could say that if you set the origin, then it’s used. If not, then subarrays use the array’s name with a special suffix.

H/VLOOKUP with computed sub-array index

Above we saw HLOOKUP with a constant lookup row index in action. A computed row index simply results in another switch to select the appropriate sub-array. So:

B C D E F G H
23 12
=HLOOKUP(C23,D23:G25,H23)
(HLI_Out_1)
19
(HLI_In)
10 15 20 25 2
(HLI_In_1)
24 22
=HLOOKUP(C23,D23:G25,H24)
(HLI_Out_2)
11 12 13 14 3
(HLI_In_2)
25 21 22 23 24

results in:

final double get$0() {
    return get$5((double) RuntimeDouble_v2.fun_MATCH_Ascending
                          (get$1(), $constarr$0()));
}
final double get$5(double d) {
    switch ((int) Runtime_v2.checkDouble(get$2())) {
    case 1:
        return $idx$0((int) Runtime_v2.checkDouble(d) - 1);
    case 2:
        return $idx$1((int) Runtime_v2.checkDouble(d) - 1);
    case 3:
        return $idx$2((int) Runtime_v2.checkDouble(d) - 1);
    default:
        Runtime_v2.fun_ERROR
            ("#VALUE/REF! because index is out of range in H/VLOOKUP");
        return (double) -1;
    }
}

This still leaves quite a bit to be desired, like

  • Unnecessary casts to double and then back to int.
  • Non-factored subtraction (int) d - 1.

But, most importantly, this is just the code for cell B23. It is duplicated entirely for cell B24, except for the $idx$ methods. Eliminating this would again require full blown common subexpression elimination.

Input Tables of Varying Size

What if the lookup table is a repeating section? For the moment, I will not support this. Still, here are my current ideas on it.

In the current implementation of repeating sections where every element is represented by its own sub-computation instance, I see two options:

  • Build the arrays used above from the repeating data, then proceed as before. Adds additional arrays, but also caches lookup tables. This sounds like a good choice for MATCH.
  • Generate code that directly works on the internal sub-instance arrays for MATCH and INDEX. Avoids the additional arrays, but repeatedly evaluates the values touched during lookup. This sounds like a good choice for INDEX.

If AFC ever supports memory-efficient forward-only iteration of largish repeating sections, then this would have to change. MATCH and INDEX would have to sequentially scan the section until the result is found. LOOKUP would have to be implemented directly as scanning twice for MATCH and then INDEX would be silly.