 # `LOOKUP`, `MATCH`, and `INDEX`

The `LOOKUP` function, and its foundations `MATCH` and `INDEX`, are likely to be heavily used in Abacus’ financial applications. So they must be implemented efficiently both space- and runtime-wise.

## Fixed-size Tables

Most lookup tables are going to be constant, I predict. Often they will be largish tables of data imported from official sources, like tax rates of different sorts.

### `MATCH`

An efficient implementation of `MATCH` for such constant tables uses an array. The array allows for memory-efficient representation of the table data, and for time-efficient search. Consider:

 B C D E F G H 1 2=MATCH(C1,D1:H1)(MC_Out) 15(MC_In) 10 15 20 25 30

which is implemented as:

```final double get\$0() {
return (double) RuntimeDouble_v2.fun_MATCH_Ascending(get\$1(),
\$constarr\$0());
}
final double[] \$constarr\$0() {
if (\$constarr\$0 == null)
\$constarr\$0 = new double[] { 10.0, 15.0, 20.0, 25.0, 30.0 };
return \$constarr\$0;
}```

where `fun_MATCH_Ascending` performs a binary search, as Excel does. If the array contains values computed at runtime as in:

 B C D E F G H 2 2=MATCH(C2,D2:H2)(MI_Out) 15(MI_In) 10 15(MI_In_1) 20(MI_In_2) 25 30

we augment this to:

```public final void reset() {
\$arr\$0\$init = false;
}
final double get\$0() {
return (double) RuntimeDouble_v2.fun_MATCH_Ascending(get\$1(),
\$arr\$0());
}
final double[] \$arr\$0() {
if (!\$arr\$0\$init) {
\$arr\$0\$init = true;
double[] ds = \$constarr\$0();
ds = get\$2();
ds = get\$3();
return ds;
}
return \$constarr\$0();
}
final double[] \$constarr\$0() {
if (\$constarr\$0 == null)
\$constarr\$0 = new double[] { 10.0, 0.0, 0.0, 25.0, 30.0 };
return \$constarr\$0;
}```

I did not merge `\$arr\$0()` into `\$constarr\$0()` because this allows me to reuse the constant part of the array when the computation is `reset()`.

##### Unrolling

I also considered unrolling the array and search, but for large tables this would lead to significant code bloat, which hurts memory-efficiency and JIT performance. Whereas the predefined search functions on the runtime used above can be JITted efficiently.

When most or all array values are computed, and maybe even expensive to compute, then unrolling is probably the better approach as it avoids computing cells not touched by the search. But this sounds like a highly improbable scenario, so I won’t address it.

Lookups into very small tables might also perform better if unrolled. Array setup might just turn out to be too expensive for them. But, while nifty, this adds too much complexity for too little gain, I believe. And people needing the speed can rewrite their small lookups to nested `IF`s anyway.

### `INDEX`

For `INDEX` I see two approaches: using arrays as above, or using a big `switch`. I did a quick check and it turns out the using `switch` is actually more space-efficient in the .class file, as well as more memory and time-efficient. The main reason is the way constant arrays are built in Java. They are internally built as if you had assigned each element in turn. In addition to this, `switch` also has the advantage that only the table values actually used have to be computed. So for:

 B C D E F G H 4 20=INDEX(D4:H4,1.0,C4)(IC_Out) 3(IC_In) 10 15 20 25 30

we get:

```final double get\$0() {
return \$idx\$0((int) Runtime_v2.checkDouble(get\$1()) - 1);
}
final double \$idx\$0(int i) {
switch (i) {
case 0:
return 10.0;
case 1:
return 15.0;
case 2:
return 20.0;
case 3:
return 25.0;
case 4:
return 30.0;
default:
throw new FormulaException
("#VALUE/REF! because index is out of range in INDEX");
}
}```

which extends very naturally to computed values:

```final double get\$0() {
return \$idx\$0((int) Runtime_v2.checkDouble(get\$3()) - 1);
}
final double \$idx\$0(int i) {
switch (i) {
case 0:
return 10.0;
case 1:
return get\$1();
case 2:
return get\$2();
case 3:
return 25.0;
case 4:
return 30.0;
default:
throw new FormulaException
("#VALUE/REF! because index is out of range in INDEX");
}
}```

### `LOOKUP` etc.

Since all the variants of `LOOKUP` are mapped to `INDEX` and `MATCH`, this covers the lot. There is one minor peephole optimization at work, though. Whenever there is a call to `MATCH` directly within `INDEX`, then this gets rewritten to `MATCH_INT` returning an integer instead of double or bigdec to avoid the unnecessary conversions from and back to integer. So:

 B C D E F G H I 7 12=LOOKUP(C7,D7:F7,G7:I7)(LC_Out) 2(LC_In) 1 2 3 11 12 13

is:

```final double get\$0() {
return \$idx\$0(RuntimeDouble_v2.fun_MATCH_Ascending(get\$1(),
\$constarr\$0()) - 1);
}```

The equivalent, but explicitly written:

 B C D E F G H I 10 12=INDEX(G10:I10,1.0,MATCH(C10,D10:F10))(IMC_Out) 2(IMC_In) 1 2 3 11 12 13

is also:

```final double get\$0() {
return \$idx\$0(RuntimeDouble_v2.fun_MATCH_Ascending(get\$1(),
\$constarr\$0()) - 1);
}```

### Multiple Lookups

What if multiple, different values are looked up? Consider:

 B C D E F G H 12 2=MATCH(C12,D12:H12)(MMC_Out_1) 15(MMC_In_1) 10 15 20 25 30 13 4=MATCH(C13,D12:H12)(MMC_Out_2) 27(MMC_In_2)

We clearly need to factor the common construction of the array for `D12:H12` out of the generated code for the cells `B12` and `B13`. To do this, we first need to recognise the commonality. This is hard to do efficiently for the computation model compiler as it has no notion of cell adjacency and, thus, no notion of contiguous arrays. So the spreadsheet compiler passes on the both the range origin and extent for instances of `ExpressionNodeForArrayReference` it generates to help the computation model compiler recognize repeated occurrences. Resulting in the single instance `\$constarr\$0()`:

```final double get\$0() {
return (double) RuntimeDouble_v2.fun_MATCH_Ascending(get\$1(),
\$constarr\$0());
}
final double get\$2() {
return (double) RuntimeDouble_v2.fun_MATCH_Ascending(get\$3(),
\$constarr\$0());
}```

The same goes for `INDEX`, but the common code is the generated `switch`:

```final double get\$0() {
return \$idx\$0((int) Runtime_v2.checkDouble(get\$1()) - 1);
}
final double get\$2() {
return \$idx\$0((int) Runtime_v2.checkDouble(get\$3()) - 1);
}```

Putting it together, we get this for `LOOKUP`:

```final double get\$0() {
return \$idx\$0(RuntimeDouble_v2.fun_MATCH_Ascending(get\$1(),
\$constarr\$0()) - 1);
}
final double get\$2() {
return \$idx\$0(RuntimeDouble_v2.fun_MATCH_Ascending(get\$3(),
\$constarr\$0()) - 1);
}```

Finally, since we pass the range origin to the compiler, the rewriter can properly identify common sub-arrays such as are generated by rewriting `H/VLOOKUP`. Consider:

 B C D E F G 19 12=HLOOKUP(C19,D19:G21,2.0)(SAC_Out_1) 19(SAC_In_1) 10 15 20 25 20 22=HLOOKUP(C19,D19:G21,3.0)(SAC_Out_2) 11 12 13 14 21 2=MATCH(C19,D19:G19)(SAC_Out_3) 21 22 23 24

Both of the lookups are rewritten to `MATCH` and `INDEX`. The arrays referenced by all three instances of `MATCH` are then identical, but two are generated by the expression rewriter, not the spreadsheet loader. All are identified properly:

```final double get\$0() {
return \$idx\$0(RuntimeDouble_v2.fun_MATCH_Ascending(get\$1(),
\$constarr\$0()) - 1);
}
final double get\$2() {
return \$idx\$1(RuntimeDouble_v2.fun_MATCH_Ascending(get\$1(),
\$constarr\$0()) - 1);
}
final double get\$3() {
return (double) RuntimeDouble_v2.fun_MATCH_Ascending(get\$1(),
\$constarr\$0());
}```

Note that only the array accessors are factored out. AFC does not do full common subexpression elimination (CSE). But if you turn on caching and rewrite the two `HLOOKUP`s to applications of `INDEX` and the central third `MATCH`, then you get proper reuse of the `MATCH` value.

Passing the origin might, though, be a problem for users (still hypothetical) of the model compiler itself. What origins should they use? They might not have a cube model at all. So for the future I could say that if you set the origin, then it’s used. If not, then subarrays use the array’s name with a special suffix.

### `H/VLOOKUP` with computed sub-array index

Above we saw `HLOOKUP` with a constant lookup row index in action. A computed row index simply results in another `switch` to select the appropriate sub-array. So:

 B C D E F G H 23 12=HLOOKUP(C23,D23:G25,H23)(HLI_Out_1) 19(HLI_In) 10 15 20 25 2(HLI_In_1) 24 22=HLOOKUP(C23,D23:G25,H24)(HLI_Out_2) 11 12 13 14 3(HLI_In_2) 25 21 22 23 24

results in:

```final double get\$0() {
return get\$5((double) RuntimeDouble_v2.fun_MATCH_Ascending
(get\$1(), \$constarr\$0()));
}
final double get\$5(double d) {
switch ((int) Runtime_v2.checkDouble(get\$2())) {
case 1:
return \$idx\$0((int) Runtime_v2.checkDouble(d) - 1);
case 2:
return \$idx\$1((int) Runtime_v2.checkDouble(d) - 1);
case 3:
return \$idx\$2((int) Runtime_v2.checkDouble(d) - 1);
default:
Runtime_v2.fun_ERROR
("#VALUE/REF! because index is out of range in H/VLOOKUP");
return (double) -1;
}
}```

This still leaves quite a bit to be desired, like

• Unnecessary casts to `double` and then back to `int`.
• Non-factored subtraction `(int) d - 1`.

But, most importantly, this is just the code for cell `B23`. It is duplicated entirely for cell `B24`, except for the `\$idx\$` methods. Eliminating this would again require full blown common subexpression elimination.

## Input Tables of Varying Size

What if the lookup table is a repeating section? For the moment, I will not support this. Still, here are my current ideas on it.

In the current implementation of repeating sections where every element is represented by its own sub-computation instance, I see two options:

• Build the arrays used above from the repeating data, then proceed as before. Adds additional arrays, but also caches lookup tables. This sounds like a good choice for `MATCH`.
• Generate code that directly works on the internal sub-instance arrays for `MATCH` and `INDEX`. Avoids the additional arrays, but repeatedly evaluates the values touched during lookup. This sounds like a good choice for `INDEX`.

If AFC ever supports memory-efficient forward-only iteration of largish repeating sections, then this would have to change. `MATCH` and `INDEX` would have to sequentially scan the section until the result is found. `LOOKUP` would have to be implemented directly as scanning twice for `MATCH` and then `INDEX` would be silly.