AFC - Abacus Formula Compiler for Java

Design Overview

The Abacus Formula Compiler (AFC) makes it possible to use computations defined in spreadsheet files within normal Java applications. The original spreadsheet files can be created, for example, in Microsoft Excel or OpenOffice Calc. AFC reads the spreadsheet files and builds corresponding Java-based formula evaluation engines. Such an engine computes output values given a set of input values.

Overview

Here’s an overview of how AFC does this:

Overview

The colours meaning your code, AFC’s code, generated things, and external dependencies.

Example

For example, compute the value of the cell B3 given the values of the cells B1 and B2 as inputs in:

A B
1 Demo computation for SEJ
2
3 Inputs
4 A 10
5 B $100.00
6
7 Outputs
8 Result $1,000.00
=B4*B5

The following code does this:

// Get an engine builder (represents AFC's simplified API).
EngineBuilder builder = SpreadsheetCompiler.newEngineBuilder();

// Load and parse the spreadsheet file into memory.
builder.loadSpreadsheet( getSpreadsheetFile() );

// Set the factory interface to implement. This interface defines the method
// Outputs newInstance( Inputs _inputs ), from which AFC derives the input
// and output interfaces.
builder.setFactoryClass( OutputFactory.class );

// Define which cells will be variable inputs to the engine, and which will be
// computable outputs, by cell name. All cells whose name correspond to a method
// on the output interface will be outputs, and similarly for inputs.
// Inputs are bound to your input methods that will be called to obtain their value.
// Outputs are bound to your output methods that are implemented by the engine.
builder.bindAllByName();

// Build an engine for the given spreadsheet, inputs, and outputs.
Engine engine = builder.compile();

// Get the factory instance from the compiled engine.
OutputFactory factory = (OutputFactory) engine.getComputationFactory();

// ---- Computation
// Compute an actual output value for a given set of actual input values.
// This code is not dependent on AFC. It is a simple instance of the strategy
// pattern.
Inputs inputs = new Inputs( 4, 40 );
Outputs outputs = factory.newInstance( inputs );
double result = outputs.getResult();
// ---- Computation

return result;

Generated Code

The call to builder.compile() internally generates JVM byte code. When decompiled, the code looks as follows. It consists of the computation engine:

package org.formulacompiler.gen;
import org.formulacompiler.examples.Inputs;
import org.formulacompiler.examples.Outputs;
import org.formulacompiler.runtime.Computation;
import org.formulacompiler.runtime.internal.Environment;

final class $Root implements Computation, Outputs
{
    private final Inputs $inputs;
    final Environment $environment;
    
    $Root(Inputs inputs, Environment environment) {
        $environment = environment;
        $inputs = inputs;
    }
    
    final double get$0() {
        return get$1() * get$2();
    }
    
    public final double getResult() {
        return get$0();
    }
    
    final double get$1() {
        return $inputs.getA();
    }
    
    final double get$2() {
        return $inputs.getB();
    }
}

and its factory:

package org.formulacompiler.gen;
import org.formulacompiler.examples.Inputs;
import org.formulacompiler.examples.OutputFactory;
import org.formulacompiler.examples.Outputs;
import org.formulacompiler.runtime.Computation;
import org.formulacompiler.runtime.ComputationFactory;
import org.formulacompiler.runtime.internal.Environment;

public final class $Factory implements ComputationFactory, OutputFactory
{
    private final Environment $environment;
    
    public $Factory(Environment environment) {
        $environment = environment;
    }
    
    public final Computation newComputation(Object object) {
        return new $Root((Inputs) object, $environment);
    }
    
    public final Outputs newInstance(Inputs inputs) {
        return new $Root(inputs, $environment);
    }
}

Design Goals

Multiple Data Formats

AFC should be usable with at least Microsoft Excel XLS and XML files, and OpenOffice Calc files.

High Performance Computation

The generated engines should be suitable for use on application servers:

  • Low memory consumption
  • High performance
  • Thread safe

High Performance Engine Construction

The generation of an engine from a spreadsheet file, however, is considered a rarer task. So, while the above goals are still desirable for it, the focus is clearly on making the engines perform exceptionally well, not engine generation. Nevertheless, engine generation must be suitable for being performed occasionally by an application server.

Cheap Instantiation

Because generation of an engine may be an expensive task, it must be possible to store generated engines and then later efficiently instantiate them from the storage. The original spreadsheet file should not be required for this instantiation.

Usability

Errors in the spreadsheet file or errors during engine generation due to limitations of AFC must be pinpointed and described in a way that allows typical spreadsheet users to correct or work around the problem.

Documentation

The range of supported formula elements must be clearly documented using automated tests.

Trustworthiness

AFC in its standard operation generates byte code for the Java VM. This is not something most programmers read and understand with ease. AFC must therefore be able to generate Java source code. All automated tests should ascertain that the bytecode compiled from the source by the Java compiler is the same as the one compiled by directly by AFC. This way, users who mistrust bytecode generation, or who need to see what an engine generated by AFC does internally, can rely on the generated source code.

Design

Derived from the design goals above is a clear separation of phases and concerns into individual components:

Compiler Overview

API Facade
(Not shown in diagram.) Defines the public API to all of AFC. For example, the interfaces Engine and Computation.
Spreadsheet Model
An in-memory representation of a spreadsheet which is constructed by loaders and consumed by engine builders. It thus decouples the engine compilers from the various spreadsheet formats supported by AFC. Cell formulas are represented by expression trees with special nodes denoting cell references. Parsing of cell formulas into expression trees can be done lazily so that only the formulas actually used by an engine builder are parsed.
Computation Model
An in-memory representation of the relevant computation extracted from a spreadsheet model. This model is hierarchical to represent dynamic sections and has no longer any notion of rows and columns, only of expressions referencing inputs or other expressions.
Expression Trees
A syntax tree representation of the formulas found in a spreadsheet or an internal computation model.
Spreadsheet Compiler
Compiles a spreadsheet model into an internal computation model. Uses the engine definition (inputs and outputs) to extract only the relevant parts.
Optimizer
Performs constant folding and expression inlining on the internal computation model.
Spreadsheet Loader
Subsystem containing the different spreadsheet file loaders (Excel .xls and .xml, and OpenOffice). Implements lazy parsing of formulas into AFC’s expression trees. The expression parser is generated using the open-source JavaCC library.
Excel .xls Loader
Loads spreadsheet files in Microsoft Excel’s native .xls format into an AFC spreadsheet model. Uses the open-source JExcelAPI library to parse the .xls format.
Excel .xml Loader
Loads spreadsheet files in Microsoft Excel’s .xml format into an AFC spreadsheet model.
OpenOffice .ods Loader
Envisioned loader(s) for the OpenOffice spreadsheet file format(s).
Bytecode Compiler
Builds an AFC engine given an internal computation model. Uses the open-source ASM library to generate the new engine class and the methods which compute the spreadsheet formulas. The generated class can be used directly to instantiate an engine, or else stored somewhere for later instantiation. It is thus possible to build the engine in a separate configuration program and then install only the generated class file with its minimal dependencies in a server application.
Engine Instance
Expression evaluation engine that computes formulas using native Java byte code methods. Some computations of core spreadsheet functions are delegated to a AFC runtime library (not shown). This implementation of an engine is very fast and space efficient. It also has no other dependencies on external components. It is, therefore, an ideal choice for heavily used server-side computations.
Bytecode Decompiler
Generates the Java source code for a compiled AFC engine using the open-source JODE library. This is mainly for debugging and for inspecting generated engines.

Realization of Design Goals

How are the design goals addressed by this design?

Multiple Data Formats

The design decouples the engines and engine builders completely from the physical representation of spreadsheets in files. This makes it fairly straightforward and inexpensive to add new formats. Use of JavaCC to generate the formula expression parsers makes supporting new expression syntaxes fairly easy.

AFC has a simple workbook loader registry. Once appropriate loaders are registered, a spreadsheet file can be opened with a generic call. The loaders look at the file’s extension to determine the appropriate one to use.

High Performance Computation

The separation of the Engine and Computation interfaces allows the concurrent use of an engine across multiple threads. Each thread simply creates its own computations on the central engine. Engines are thus fully thread-safe and non-blocking.

// Compute an actual output value for a given set of actual input values.
// This code is not dependent on AFC. It is a simple instance of the strategy
// pattern.
Inputs inputs = new Inputs( 4, 40 );
Outputs outputs = factory.newInstance( inputs );
double result = outputs.getResult();

The byte code engine is extremely fast as all necessary expressions from the spreadsheet model – after constant folding – have been compiled to native Java byte code. A byte code engine depends neither on the spreadsheet model nor on anything else except for AFC’s runtime classes.

High Performance Engine Construction

As stated above, in most scenarios performance of the construction of engines is not as critical as performance of the constructed engines themselves. Nevertheless, the design should allow for optimization of engine construction should it become necessary.

Engine construction can be separated into two distinct steps:

  1. Parse the spreadsheet file into an AFC spreadsheet model.
  2. Construct an engine (or multiple different engines) from the model for a given set of inputs from which the engine is to be able to compute a given set of outputs.
// Get an engine builder (represents AFC's simplified API).
EngineBuilder builder = SpreadsheetCompiler.newEngineBuilder();

// Load and parse the spreadsheet file into memory.
builder.loadSpreadsheet( getSpreadsheetFile() );

// Set the factory interface to implement. This interface defines the method
// Outputs newInstance( Inputs _inputs ), from which AFC derives the input
// and output interfaces.
builder.setFactoryClass( OutputFactory.class );

// Define which cells will be variable inputs to the engine, and which will be
// computable outputs, by cell name. All cells whose name correspond to a method
// on the output interface will be outputs, and similarly for inputs.
// Inputs are bound to your input methods that will be called to obtain their value.
// Outputs are bound to your output methods that are implemented by the engine.
builder.bindAllByName();

// Build an engine for the given spreadsheet, inputs, and outputs.
Engine engine = builder.compile();

// Get the factory instance from the compiled engine.
OutputFactory factory = (OutputFactory) engine.getComputationFactory();

AFC’s spreadsheet model should be designed so as to allow loaders to instantiate parts of the model lazily. This will in principle allow engines to be constructed efficiently which access only a small part of an otherwise large spreadsheet. The true need for this is not totally clear at the moment, however, because I expect users to write dedicated spreadsheets that supply computations to business applications. Currently, lazy parsing of expressions is already implemented and thus verified. Lazy instantiation of individual sheets, rows, and cells is not yet implemented and may still require some interface changes affecting both loaders and engine builders.

Use of JavaCC should ensure efficient expression parsers.

Construction of an engine from an AFC model is fairly straightforward and thus fairly cheap.

Cheap Instantiation

First of all, an engine, once compiled, does not depend on the original spreadsheet file, its internal model, or the compiler used to compile the engine. It also does not have any dependencies on third party libraries. So if an engine can be serialized to persistent storage, it can be deserialized again with little cost and no external dependencies. It is therefore a possible and, indeed, recommended scenario that applications should take spreadsheet files as input during configuration and compile the necessary computation engines from them. They should then store the engines somewhere for use later on at run-time. In addition to improving the production server’s memory footprint and performance, this will also minimize the security risk introduced by the access to arbitrary, user-supplied spreadsheet files. This access can be delegated to an offline configuration editor, instead of being part of the production server.

Here’s the general outline of such an approach:

Separate Compilation

First, the configuration, which compiles an engine and stores it somewhere (a file in this example):

// Build an engine for the given spreadsheet, inputs, and outputs.
EngineBuilder builder = SpreadsheetCompiler.newEngineBuilder();
builder.loadSpreadsheet( getSpreadsheetFile() );
builder.setFactoryClass( OutputFactory.class );
builder.bindAllByName();
SaveableEngine compiledEngine = builder.compile();

// Write the engine out to its serialized form, then drop the reference to it.
File engineSerializationFile = new File( TEMP_ENGINE_JAR );
OutputStream outStream = new BufferedOutputStream( new FileOutputStream( engineSerializationFile ) );
try {
  compiledEngine.saveTo( outStream );
}
finally {
  outStream.close();
}

Second, the code that would run in a production server:

// Instantiate an engine from the serialized form.
File engineSerializationFile = new File( TEMP_ENGINE_JAR );
InputStream inStream = new BufferedInputStream( new FileInputStream( engineSerializationFile ) );
Engine loadedEngine = FormulaRuntime.loadEngine( inStream );
OutputFactory factory = (OutputFactory) loadedEngine.getComputationFactory();

// Compute an actual output value for a given set of actual input values.
Inputs inputs = new Inputs( 4, 40 );
Outputs outputs = factory.newInstance( inputs );
double result = outputs.getResult();

return result;

Note that the production server must have called StandardEngine.register() somewhere so the EngineFactory knows what type of engine to instantiate.

The byte code engine is really a Java class generated at run-time. Its serialization is, therefore, simply the byte array that would be stored in the .class file. Note, however, that this byte array can be stored anywhere, it need not be an actual .class file somewhere on your classpath. Instantiation of a byte code engine from external storage is, therefore, exactly what a class loader does.

It should be noted that this design emphasizes using a few engines often as opposed to use of a huge number of different engines, each only once or twice. This is due to class loading and JIT overhead.

Usability

This goal has not been specifically addressed in the current design. However, the use of JavaCC to generate the expression parsers should lead to fairly good error messages when unsupported or erroneous expressions are encountered.

Documentation

Use of JavaCC to generate the expression parsers leads to a fairly readable documentation of the supported (or, at least, parsed) expression syntax.

Test-driven development should ensure that all supported features are clearly documented and verified by tests. The infrastructure to add tests in a consistent way is not documented yet.

Trustworthiness

AFC uses JODE to decomile generated engines. This lets you have a look at the generated code in an understandable format. AFC takes great care to produce byte-code that is the equivalent of code produced by javac from Java sources.

The tests, however, do not yet verify that the decompiled code, when run through javac again, corresponds exactly to the bytecode produced by AFC.