AFC - Abacus Formula Compiler for Java

Ranges Overlapping Section Bounds

I have started to implement support for DFOLD() over dynamic sections in the data. For this to work reasonably, I have to first support range references that overlap section bounds. This is because DSUM() et al. only take a single range reference in their arguments.

Why Support Overlaps?

The referenced data table, in which I want to support sections, consists of the top row with the labels and the subsequent data rows. Because you can only specify a single range reference, you cannot simply union the label range and the data range separately. So SEJ must detect the overlap by itself.

Shapeless Ranges

SEJ could, in fact, properly take apart a range reference that spans multiple, possibly nested, dynamic sections. Currently, it not even issues a proper error message, but fails later on with rather cryptic errors because it thinks there are plain cell outer refs to inner cells. I am going to address that with new tests.

Such generality is only reasonable for shapeless ranges, however. A shapeless range is one whose matrix layout is irrelevant. Only the cell set it spans is used. This is the case with arguments to to SUM() et al. but not, for example, to INDEX()).

Shaped Ranges

For shaped ranges, the discussion is most pertinent to vertically repeating sections (such as in tables accessed by DSUM()). However, for functions like INDEX() the principles laid out apply to horizontally repeating sections in an analogous way.

For shaped ranges, the only thing SEJ should reasonably support is (assuming vertical repetition):

  • A mix of static rows and vertical sections which fully cover the width of the range.
  • Nested vertical subsections. They do not have to cover the range width fully. This means SEJ will support SQL-like repetition of the 1: side in a 1:n join. In the example sheet below, Section and Subsection form such a constellation.

By requiring the top-level sections encountered in a shaped range to cover the range width fully, I avoid the problem of what to do with outer cells to the left or right of the sections. Should the value at the height of the first row of the section be repeated for the whole height of the section? What about the other values specified beside the sample data of the section? They would be dropped silently. Since constant values in the first row of a section are repeated anyway, this is not restrictive.

Shaped vs. Shapeless

In the example below, what is SUM(C3:C6)? Is it

r = 0
for (e : section())
    r += e.value();

or is it

r = 0
for (e : section()) 
    for (se : e.subsection())
        r += e.value();

In other words, does SUM() respect the 1:n repetition of the 1: side? I am tempted to say its the first option, as SUM’s argument is shapeless. This would make it possible to work with hierarchical data in a natural, XML-like fashion.

But then, the result of SUM() will differ from the result of DSUM(), which takes a shaped argument!

One solution might be to let the user indicate the desired behaviour using the values of the repeated cells of the 1: side in the sample data:

  • Repeating the values of the first row/column indicate 1:n join behaviour (shaped).
  • Blank values indicate hierarchical, XML-like behaviour (shapeless).

SEJ should then check that shaped references always repeat the first row values in the sample data as they cannot be treated in a shapeless fashion.

To achieve this, SEJ could, during construction of the computation model, replace such a repeating 1: cell by a reference of the form:

subelement.parent().value()

and include it within the subrange. A peephole optimization could later optimize this again to element.value() in the generated code. It is probably reasonable to require that a section with a nested subsection be either fully repeating (shaped), or not at all (shapeless). If not that, then at least all the repeating cells should be adjacent to the subsection so the latter can be extended to encompass the repeating cells.

Test Setup

Here is the setup for some fairly complex test cases:

A B C D E F G H I J K L M
1 Vertically repeating section: Name Value RepValue SubName SubValue SubRef InnerRef InnerSum SelfSum OuterSum Value2 Outer cells
2 Outer cells: -1 -1 -10
3 One 1 1 One.A 10 11
=F3+C3
2
=C3+1.0
62
=SUM(F3:G4)
7
=SUM(C$3:C$6)+C3
241
=SUM(F$2:F$11)+C3
3 4711
4 1 One.B 20 21
=F4+C3
4712
5 Two 2 2 Two.A 30 32
=F5+C5
3
=C5+1.0
62
=SUM(F5:G5)
8
=SUM(C$3:C$6)+C5
242
=SUM(F$2:F$11)+C5
2 4713
6 Three 3 3 Three.A 40 43
=F6+C6
4
=C6+1.0
83
=SUM(F6:G6)
9
=SUM(C$3:C$6)+C6
243
=SUM(F$2:F$11)+C6
1 4714
7 Outer cells: 4 4 50
8 Another vertical section: Eins 1 1 10 1 2 3
9 with horizontal section to the right: Zwei 2 2 20 0.1 0.2 0.3
10 Drei 3 3 30 0.01 0.02 0.03
11 Outer cells: 5 5 40

B3:L6 (Section)
B8:F10 (Section2)
G8:I10 (Section3)
E3:G4 (Subsection)

Test Cases

And here are the tests:

A B
14 This works:
15 Nested sum 100
=SUM(F3:F6)
16 Sum spanning multiple sections 20
=SUM(C2:C11)
17 Sum spanning multiple sections (shaped) 41
=SUM(C2:D11)
18 Sum spanning multiple and nested sections 240
=SUM(F2:F11)
19 Sum over inner refs 9
=SUM(H3:H6)
20 Sum over inner sums 207
=SUM(I3:I6)
21 Sum over inner self sums 24
=SUM(J3:J6)
22 Sum over inner outer nested sums 726
=SUM(K3:K6)
23 Sum over nested and inner section in parallel 116
=SUM(G3:H6)
24 Sum over section and cells in parallel 18856
=SUM(L3:M6)
25 Sum over mixed section extents 66.66
=SUM(F8:I10)
26
27 Errors:
28 Only nested inner section summed 30
=SUM(F3:F4)
29 Only part of section summed 2
=SUM(C2:C5)
30 ... Again 9
=SUM(C4:C7)

What should these tests do (in pseudo-code)?

Nested sum
r = 0
for (e : section()) {
    for (se : e.subsection()) {
        r += se.subvalue();
    }
}
Sum spanning multiple sections
r = -1 + 4 + 5
for (e : section()) {
    r += e.value();
}
for (e : othersection()) {
    r += e.value();
}
Sum spanning multiple sections (shaped)
r = -1 + -1 + 4 + 4 + 5 + 5
for (e : section()) {
    r += e.value();
    for (se : e.subsection()) {
        r += se.parent().repvalue();
    }
}
for (e : section2()) {
    r += e.value();
}

Peephole optimization might change ... + se.parent.repvalue() to ... + e.repvalue().

Sum spanning multiple and nested sections
r = -10 + 50 + 40
for (e : section()) {
    for (se : e.subsection()) {
        r += se.subvalue();
    }
}
for (e : section2()) {
    r += e.value2();
}
Sum over inner refs
r = 0
for (e : section()) {
    r += e.value();
}
Sum over inner sums
r = 0
for (e : section()) {
    r += helper( e );
}

where helper( e ) is

r = 0
for (se : e.subsection()) {
    r + = se.subvalue() + se.subvalue() + se.parent.value();
}
Sum over inner self sums
r = 0
for (e : section()) {
    r += helper( e.parent ) + e.value();
}

where helper( p ) is

r = 0
for (e : section()) {
    r += e.value();
}
Sum over inner outer nested sums
r = 0
for (e : section()) {
    r += helper( e.parent ) + e.value();
}

where helper( p ) is

r = -10 + 50 + 40
for (e : p.section()) {
    for (se : e.subsection()) {
        r += se.subvalue();
    }
}
for (e : p.section2()) {
    r += e.value2();
}
Sum over nested and inner section in parallel
r = 0
for (e : section()) {
    r += e.value();
    for (se : e.subsection()) {
        r + = se.subvalue() + se.subvalue() + se.parent.value();
    }
}

Again, peephole optimization might change ... + se.parent.value() to ... + e.value().

Sum over section and cells in parallel
r = 4711 + 4712 + 4713 + 4714
for (e : section()) {
    r += e.value2();
}
Sum over mixed section extents
r = 0
for (e : section2()) {
    r += e.value2();
}
for (e : section3()) {
    r += e.value1() + e.value2() + e.value3();
}

Roadmap

The most important part right now is to support DFOLD() over vertical sections, excluding the label row. So that comes first.

Next is supporting DFOLD() over nested vertical sections to enable 1:n join processing. This makes detection of 1:n repetition necessary.

Finally, all currently unsupported cases should be rejected with helpful error messages.

All the rest is nice, but not essential, as it can be worked around by judicious use of intermediate results (intermediate sums, for instance).