Intermediate Representation

Intermediate Representation

Cetus’ IR contrasts with the Polaris Fortran translator’s IR in that it uses a hierarchical statement structure. The Cetus IR directly reflects the block structure of a program. Polaris lists the statements of each procedure in a flat way, with a reference to the outer statement being the only way for determining the block structure. There are also important differences in the representation of expressions, which further reflects the differences between C and Fortran. The Polaris IR includes assignment statements, whereas Cetus represents assignments in the form of expressions. This corresponds to the C language’s feature to include assignment side effects in any expression.

The IR is structured such that the original source program can be reproduced, but this is where source-to-source translators face an intrinsic dilemma. Keeping the IR and output similar to the input will make it easy for the user to recognize the transformations applied by the compiler. On the other hand, keeping the IR language independent leads to a simpler compiler architecture, but may make it impossible to reproduce the original source code as output. In Cetus, the concept of statements and expressions are closely related to the syntax of the C language, facilitating easy source-to-source translation. However, the drawback is increased complexity for pass writers (since they must think in terms of C syntax) and limited extensibility of Cetus to additional languages. That problem is mitigated by the provision of several abstract classes, which represent generic control constructs. Generic passes can then be written using the abstract interface, while more language-specific passes can use the derived classes. We feel it is important to work with multiple languages at an early stage, so that our result is not simply a design that is extensible in theory but also in practice. Toward this goal, we have begun adding a C++ front end and generalizing the IR so that we can evaluate these design trade-offs. Other potential front ends are Java and Fortran 90.

Class Hierarchy Design

Our design goal was a simple IR class hierarchy easily understood by users. It should also be easy to maintain, while being rich enough to enable future extension without major modification. The basic building blocks of a program are the translation units, which represent the content of a source file, and procedures, which represent individual functions. Procedures include a list of simple or compound statements, representing the program control flow in a hierarchical way. That is, compound statements, such as IF-constructs and FOR-loops include inner (simple or compound) statements, representing then and else blocks or loop bodies, respectively. Expressions represent the operations being done on variables, including the assignments to variables.

Major Classes

All of the classes are documented in detail with javadoc. The API can be found at http://cetus.ecn.purdue.edu/ in the Documentation section.

A brief discussion of important base classes is below.

Program

A Program object represents the entire program. It is the root of the syntax tree.

TranslationUnit

A TranslationUnit represents a single source file. It may only appear as the immediate child of the Program.

Declarations

Declarations appear in many places. As children of TranslationUnit, they represent global declarations of variables, functions, or classes. As parameters of a Procedure, they represent declarations of formal parameters. As children of ClassDeclaration, they representmethods and data members.

Expressions

Expressions are normally of most interest to optimization passes. All assignments, function calls, and mathematical computations are Expressions.

Statements

Statements serve two purposes: to provide control constructs, and to provide wrappers for Declarations and Expressions. Statements such as IfStatement provide a control construct, whereas DeclarationStatement and ExpressionStatement simply allow Declarations and Expressions to appear wherever a Statement may appear.

Specifiers

Specifiers are most often found in lists. For example, a variable declaration may be prefixed with a list of specifiers such as const and float.

Relationship Between Grammar and IR

When designing any class hierarchy, some general principles are followed. The main principle is that a derived class satisfies an is a relationship with its parent, such that the data and methods of the parent make sense in the context of the child. This principle is followed to some extent in Cetus, but it would be more accurate to say that a derived class can appear in the grammar wherever its parent can appear.

There is a distinction between the class hierarchy and the syntax tree. For example, in the syntax tree, the parent of a TranslationUnit is a Program, however neither TranslationUnit nor Program have a parent in the class hierarchy.

Syntax Tree Invariants

One important aspect that makes an infrastructure useful is providing a good set of tools to help debug future compiler passes. Cetus provides basic debugging support through the Java language, which contains exceptions and assertions as built-in features. Cetus executes within a Java virtual machine, so a full stack trace including source line numbers is available whenever an exception is caught or the compiler terminates abnormally.

Such error reporting is useless unless the IR is designed to prevent programmers from corrupting the program representation. In other words, there must be a way of detecting the state of the IR is not as it should be, in order to report an error. To provide error checking and detect common errors by pass writers, the Cetus IR maintains several invariants. Violating one of the invariants below will probably result in an exception, to the extent that it is possible for Cetus to recognize what you have done.

Invariant

  1. If an object has a parent, then it has exactly one parent.

Consequences

  • You may not take an object that has a parent and add it as the child of another object.
  • If you want to use the same object in more than one place in the syntax tree, you must clone the original object.
  • Clones are identical to the originals except their parent is null.

Invariant

  1. An object P can enumerate all of its children.

Consequences

  • An iterator object can be created with P that can iterate over P‘s children. In some cases, the iterator will not visit data that is actually part of P itself. Nearly everything is kept in the list of children, and we may more move data into the list in the future.

Invariant

  1. An object P controls the addition and removal of its children.

Consequences

  • An object C cannot become the child of an object P without P‘s permission.
  • Before an object C can set its parent reference to an object PP must already recognize C is a child. I.e. C must already be in the list of P‘s children.
  • The child reference and parent reference (i.e. downlink and uplink) must be set in that order. The addXYZ methods of many classes will take care of this for you. There are also atomic swap methods that respect the ordering.

Annotations

Cetus extends our initial implementation of Annotations to a completely new IR structure and API. Comments, pragmas and other annotations were initially parsed in as Annotations and enclosed inside DeclarationStatements (in most cases). In the new implementation, parsed in annotations are converted to a new internal representation through the AnnotationParser. The AnnotationParser stores annotations under corresponding subclasses:

  • PragmaAnnotation
    • CetusAnnotation (#pragma cetus …)
    • OmpAnnotation (#pragma omp …)
  • CommentAnnotation (e.g. /* … */)
  • CodeAnnotation (Raw printing)

Annotations can be attached to existing IR where the semantics define so or can be inserted as stand-alone IR. In the above mentioned subclasses, Cetus and Omp Annotations have specific semantics and hence if present in the input source, they are attached to corresponding source IR. However, this is not possible for Comments as it is difficult to determine their association with the IR except for in certain cases (which are not handled currently). Hence, all Comments and other annotations are inserted as stand-alone.

Annotations that need to be attached to existing IR are stored as member objects of this Annotatable IR (Statements and Declarations), the new API for manipulating these is hence available through Annotatable. Standalone annotations are enclosed in special IR such as AnnotationDeclaration or AnnotationStatement (note that AnnotationStatement is different from previous release). The API in Annotatable *COMPLETELY REPLACES* previous functionality provided through Tools.java.

See the latest Cetus tutorial at http://cetus.ecn.purdue.edu for examples on how to use the new Annotation and Annotatable API during analysis and transformations.