## Shennat dissertation: Dimensional analysis of Lucid programs [410 views]

I recently graduated my 17th and last PhD student, Monem Shennat. This is the abstract of his dissertation with my annotations (the abstract of a University of Victoria dissertation is limited to 500 words).

The problem he tackled was that of dimensional analysis of multidimensional Lucid programs. This means determining, for each variable in the program, the set of relevant dimensions, those whose coordinates are necessary for evaluating individual components.

Objective: to design Dimensional Analysis (DA) algorithms for the multidimensional dialect PyLucid of Lucid, the equational dataﬂow language.

Dataflow is hardly an unknown concept but most dataflow systems are stream based – there is only one dimension, time. Lucid, by contrast allows multiple dimensions. Evaluation is demand-driven in which demands are for values of variables at given coordinates. These demands generate demands for possibly different variables at possibly different coordinates. Values computed are cached, labeled by the coordinates needed.

Significance: DA is indispensable for an efficient implementation of multidimensional Lucid and should aid the implementation of other data flow systems, such as Google’s TensorFlow.

DA is indispensable because to retrieve a value from the cache we need to know which coordinates will form the label to be searched for. Without DA the cache would fill with duplicate entries labeled by irrelevant dimensions.

Dataflow is a form of computation in which components of multidimensional datasets (MDDs) travel on communication lines in a network of processing stations. Each processing station incrementally transforms its input MDDs to its output, another (possibly very different) MDD.

MDDs are very common in Health Information Systems and data science in general. An important concept is that of relevant dimension. A dimension is relevant if the coordinate of that dimension is required to extract a value. It is very important that in calculating with MDDs we avoid non-relevant dimensions, otherwise we duplicate entries (say, in a cache) and waste time and space.

Suppose, for example, that we are measuring rainfall in a region. Each individual measurement (say, of an hour’s worth of rain) is determined by location (one dimension), day, (a second dimension) and time of day (a third dimension). All three dimensions are a priorirelevant.

Now suppose we want the total rainfall for each day. In this MDD (call it N) the relevant dimensions are location and day, but time of day is no longer relevant and must be removed. Normally this is done manually.

Research question: can this process be automated?

We answer this question affirmatively by devising and testing algorithms that produce useful and reliable approximations (specifically, upper bounds) for the dimensionalities of the variables in a program. By dimensionality we mean the set of relevant dimensions. For example, if M is the MDD of raw rain measurements, its dimensionality is {location, day, hour}, and that of N is {location, day}. Note that the dimensionality is more than just the rank, which is simply the number of dimensions.

Background: There is extensive research on dataflow itself, which we summarize. However, an exhaustive literature search uncovered no relevant previous DA work other than that of the GLU (Granular LUcid) project in the 90s. Unfortunately the GLU project was funded privately and remains proprietary – not even the author has access to it.

The GLU project was funded at SRI in Stanford by Mitsubishi. The GLU project did carry out some DA.

Methodology: We proceeded incrementally, solving increasingly difficult instances of DA corresponding to increasingly sophisticated language features. We solved the case of one dimension (time), two dimensions (time and space), and multiple dimensions.

These algorithms proceed by accumulating approximations. For example, to determine which variables are time sensitive, we start with those that obviously are: those defined by a fby expression. Then we add those that are defined in terms of a time sensitive variable by a data operation or a next operator. Any variable defined by first is definitively not time sensitive.

The accumulation process continues until it settles down and no new time sensitive variables are discovered. At that point we have to assume that all variables discovered are actually time sensitive but we can be sure that variables not added are time constant.

We also solved the difficult problem (which the GLU team never solved) of determining the dimensionality of programs that include user defined functions, including recursively defined functions. We do this by adapting the PyLucid interpreter (to produce the DAM interpreter) to evaluating the entire program over the (finite) domain of dimensionalities.

This is tricky because this evaluation normally will not terminate. So we instrument the evaluator by e.g. counting calls to the evaluation function. Then we cap this count, evaluate, increase the cap, re-evaluate, etc., until the values settle down. (In practice very small caps suffice.)

Results: Experimentally validated algorithms that produce useful upper bounds for the dimensionalities of the variables in multidimensional PyLucid programs, including those with user defined functions.

Our results are purely experimental, we do not provide formal proofs. But the experiments were 100% successful. 