Hydride: A Retargetable and Extensible Synthesis-based Compiler for Modern Hardware Architectures

As hardware architectures evolve increasingly diverse and complex instruction sets to meet the performance demands of modern workloads in image processing, deep learning, and related domains, it has become ever more crucial for compilers to evolve their internal abstractions and offer retargetable code generation that keeps pace with emerging instruction sets. We propose Hydride, a novel approach to compiling for complex, emerging hardware architectures. Hydride uses vendor-defined pseudocode specifications of multiple hardware ISAs to automatically design retargetable instructions for AutoLLVM IR, an extensible compiler IR consisting of formally defined, language-independent and target-independent LLVM IR instructions, along with automatically generated instruction selection passes that lower AutoLLVM IR to each of the specified hardware ISAs. Hydride also includes a code synthesizer that automatically generates code generation support for schedule-based languages, such as Halide, to optimally generate AutoLLVM IR. Our results show that Hydride can represent 3,557 instructions across the x86, Hexagon, and ARM architectures, spanning the (Intel) SSE2, SSE4, AVX, AVX2, and AVX512, (Qualcomm) Hexagon HVX, and (ARM) NEON vector ISAs, using only 397 AutoLLVM IR instructions. We also created a new Halide compiler with Hydride that targets all three architectures; its performance is evaluated below.

Overview of the Hydride workflow for compiling Halide

Hydride Similarity Analysis

Hydride parses vendor-provided pseudocode specifications into executable formal semantics. These formal semantics allow the Hydride Automatic IR Generator to analyze similarities across ISA operations. To define what Hydride means by similarity, we first define equivalent ISA operations: a pair of ISA operations are equivalent if they produce equivalent outputs when executed on the same inputs. Note that this definition implies that the types (i.e., vector register sizes and element bit-widths) of both the inputs and the outputs must be the same. Accordingly, an element-wise vector addition on 512-bit registers is not equivalent to an element-wise vector addition on 1024-bit registers. Semantically, however, they perform the same computation, differing only in details such as element bit-width and vector register size. Intuitively, we can abstract both operations into a more general vector addition primitive with additional parameters for bit-width and vector size. These two ISA operations are therefore not equivalent, but they are similar.

Hydride generalizes this notion of similarity and automates the process of identifying similar ISA operations and generating the abstract instructions that represent them, which together form AutoLLVM IR. Hydride currently supports compilation to three target architectures: x86, HVX, and ARM. The table below shows the size of the AutoLLVM IR generated by Hydride's Automatic IR Generator for each target architecture, as well as for combinations of architectures.

Architecture    | ISA Size | AutoLLVM IR Size | AutoLLVM IR Size as % of ISA Size
x86             | 2,029    | 136              | 6.7%
HVX             | 307      | 115              | 37.5%
ARM             | 1,221    | 177              | 14.5%
x86 + HVX       | 2,336    | 232              | 9.9%
x86 + ARM       | 3,250    | 302              | 9.3%
HVX + ARM       | 1,528    | 286              | 18.7%
x86 + HVX + ARM | 3,557    | 397              | 11.2%
AutoLLVM IR results for each architecture

For architectures such as x86 and ARM, Hydride generates a significantly smaller AutoLLVM IR, at 6.7% and 14.5% of the original ISA sizes respectively. This is because both architectures offer many variants of each ISA operation across vector register sizes, element bit-widths, and signed/unsigned types, and these variants can be abstracted into a single parameterized IR operation (a sketch of such a parameterized operation is shown below). For HVX, the reduction is more modest: HVX is a Digital Signal Processor (DSP) architecture whose specialized operations do not come in as many variants and are tied to specific element types. HVX also supports fewer vector register sizes, which further limits the reduction.
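To make the idea of a parameterized abstract operation concrete, the following is a minimal C++ sketch of our own (not Hydride's actual implementation or IR syntax) of a generalized element-wise addition primitive. Concrete ISA operations such as the 128-bit _mm_add_epi16 and the 256-bit _mm256_add_epi32 differ only in which lane count and element bit-width they instantiate:

#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical reference semantics for a generalized element-wise vector add.
// The vector register size and element bit-width, which distinguish concrete
// instructions such as _mm_add_epi16 (8 x 16-bit lanes) from _mm256_add_epi32
// (8 x 32-bit lanes), become explicit parameters of one abstract operation.
std::vector<uint64_t> vadd(const std::vector<uint64_t> &a,
                           const std::vector<uint64_t> &b,
                           unsigned lanes, unsigned bitwidth) {
  assert(a.size() == lanes && b.size() == lanes);
  const uint64_t mask = (bitwidth == 64) ? ~0ULL : ((1ULL << bitwidth) - 1);
  std::vector<uint64_t> out(lanes);
  for (unsigned i = 0; i < lanes; ++i)
    out[i] = (a[i] + b[i]) & mask;  // per-lane addition with wrap-around
  return out;
}

// _mm_add_epi16(a, b)    behaves like vadd(a, b, /*lanes=*/8, /*bitwidth=*/16)
// _mm256_add_epi32(a, b) behaves like vadd(a, b, /*lanes=*/8, /*bitwidth=*/32)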

Hydride Code Synthesizer

The Hydride Code Synthesizer compiles programs written in domain-specific languages such as Halide to the AutoLLVM IR generated offline by Hydride. It is designed to be front-end and target agnostic: the same code synthesizer framework can be used for different input languages, such as Halide or MLIR, and for different target architectures, such as x86, Hexagon, and ARM. This eliminates the need to create target-specific backends for front-end compilers. A key underlying technology used by the Hydride Code Synthesizer is Syntax-Guided Program Synthesis (SyGuS). Briefly, the SyGuS problem in this context is to find an AutoLLVM IR program equivalent to a given input Halide or MLIR program. The constraints are semantic, in that the synthesized program must exhibit the same logical behavior as the input program for all possible inputs, and syntactic, in that it must do so using instructions from AutoLLVM IR as its building blocks. The Hydride Code Synthesizer requires the formal semantics of both the front-end IR and AutoLLVM IR to achieve this goal.
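To make the shape of the synthesis problem concrete, here is a deliberately tiny C++ sketch of our own (not Hydride's implementation): candidates are drawn from a small "grammar" of AutoLLVM-like element-wise operations (the syntactic constraint), and a candidate is accepted only if it agrees with the specification (the semantic constraint). For brevity the sketch checks agreement on random inputs, whereas a real synthesizer verifies equivalence for all inputs using the formal semantics and a solver:

#include <cstdint>
#include <functional>
#include <random>
#include <string>
#include <utility>
#include <vector>

using Vec = std::vector<int8_t>;
using Op  = std::function<Vec(const Vec &, const Vec &)>;

// A toy "grammar" of AutoLLVM-like element-wise operations to choose from.
std::vector<std::pair<std::string, Op>> grammar() {
  auto lanewise = [](auto f) -> Op {
    return [f](const Vec &a, const Vec &b) {
      Vec r(a.size());
      for (size_t i = 0; i < a.size(); ++i) r[i] = f(a[i], b[i]);
      return r;
    };
  };
  return {
      {"autollvm.add", lanewise([](int8_t p, int8_t q) { return (int8_t)(p + q); })},
      {"autollvm.min", lanewise([](int8_t p, int8_t q) { return (int8_t)(p < q ? p : q); })},
      {"autollvm.max", lanewise([](int8_t p, int8_t q) { return (int8_t)(p > q ? p : q); })},
  };
}

// Syntactic constraint: choose an operation from the grammar.  Semantic
// constraint: it must agree with the specification (here tested on random
// inputs; Hydride checks equivalence over all inputs).
std::string synthesize(const Op &spec, unsigned lanes = 64, unsigned trials = 256) {
  std::mt19937 rng(0);
  std::uniform_int_distribution<int> dist(-128, 127);
  for (auto &[name, candidate] : grammar()) {
    bool matches = true;
    for (unsigned t = 0; matches && t < trials; ++t) {
      Vec a(lanes), b(lanes);
      for (auto &v : a) v = (int8_t)dist(rng);
      for (auto &v : b) v = (int8_t)dist(rng);
      matches = (candidate(a, b) == spec(a, b));
    }
    if (matches) return name;  // e.g. "autollvm.max" for an element-wise max spec
  }
  return "<no single-operation candidate found>";
}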

As an example, consider the common operation of Max Pooling used in neural networks. The code snippet below shows the computation expressed in the Halide DSL. It performs max pooling across a parameterized window. The code depicting the handling of boundary conditions has been omitted for brevity.

// Clamped Max Pool
maximum(c,x,y,b) = output_min_;
maximum(c,x,y,b) = max(maximum(c,x,y,b), input(c, x + r.x, y + r.y, b));
output_(c,x,y,b) = min(maximum(c,x,y,b), output_max_);

Once the programmer has described the algorithm in Halide (as shown above), they can specify how it should be executed using a Halide schedule. Here, we reorder the loop nest and vectorize along the c axis with a vectorization factor of 64. The snippet below illustrates the resulting vectorized computation in Halide IR: it performs an element-wise maximum over 8-bit elements in a 64-wide vector. (A sketch of a complete pipeline with such a schedule follows the snippet.)

for(max.iter, 0, t123){
  maximum[ramp(0,1,64)] = max(maximum[ramp(0,1,64)], input(...))
}
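For reference, a self-contained Halide pipeline with a schedule of this kind might be written roughly as follows. This is a sketch under assumptions of our own: the Func, Var, and parameter names mirror the snippets above, the pooling window is fixed at 2x2 via an RDom, and the exact schedule used in practice may differ:

#include "Halide.h"
using namespace Halide;

int main() {
  // Clamped max pool over a hypothetical 2x2 window (boundary handling omitted,
  // as in the snippet above).
  ImageParam input(Int(8), 4, "input");                 // laid out as (c, x, y, b)
  Param<int8_t> output_min_("output_min_"), output_max_("output_max_");
  RDom r(0, 2, 0, 2, "r");                              // assumed pooling window

  Var c("c"), x("x"), y("y"), b("b");
  Func maximum("maximum"), output_("output_");

  maximum(c, x, y, b) = output_min_;
  maximum(c, x, y, b) = max(maximum(c, x, y, b), input(c, x + r.x, y + r.y, b));
  output_(c, x, y, b) = min(maximum(c, x, y, b), output_max_);

  // Schedule: reorder the loop nest and vectorize along c with a factor of 64,
  // which produces the 64-wide element-wise max shown in the Halide IR above.
  output_.reorder(c, x, y, b).vectorize(c, 64);
  maximum.compute_at(output_, x).vectorize(c, 64);
  maximum.update().vectorize(c, 64);

  // Emit the lowered Halide IR for inspection.
  output_.compile_to_lowered_stmt("max_pool.stmt", {input, output_min_, output_max_});
  return 0;
}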

The Hydride Code Synthesizer takes the Halide IR expression shown above as input and performs program synthesis to generate an equivalent program in AutoLLVM IR. It finds that the @autollvm.max operation with the parameterization shown below is equivalent to our max pooling operation. The instruction selector generated offline by the Hydride IR Generator then maps this AutoLLVM IR operation to the _mm512_max_epi8 instruction on x86. Thus we have successfully compiled our max pooling program for x86!

%result = <64 x i8> @autollvm.max(<64 x i8> %maximum, <64 x i8> %input, i64 /* lanes */ 64, i64 /* bitwidth */ 8, i64 /* signed? */ 1)
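Conceptually, the generated instruction selector is a collection of parameterized rewrite rules. The following is a small C++ sketch of our own (not the generated pass itself) of what the lowering decision for this particular operation looks like:

#include <cstdint>
#include <string>

// Hypothetical view of one generated instruction-selection rule: an AutoLLVM
// operation together with its parameterization is matched to a target intrinsic.
struct AutoLLVMOp {
  std::string name;   // e.g. "autollvm.max"
  uint64_t lanes;     // e.g. 64
  uint64_t bitwidth;  // e.g. 8
  bool is_signed;     // e.g. true
};

std::string select_x86(const AutoLLVMOp &op) {
  // 64 lanes of signed 8-bit max fit exactly in one AVX-512 instruction.
  if (op.name == "autollvm.max" && op.lanes == 64 && op.bitwidth == 8 && op.is_signed)
    return "_mm512_max_epi8";
  // ...many more rules, generated offline from the ISA semantics...
  return "<no single-instruction lowering; fall back to narrower vectors>";
}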

An added complexity is that there may be multiple equivalent AutoLLVM IR programs for a given input expression, so how do we select the one we want? The Hydride Code Synthesizer automatically generates a cost model that guides program synthesis toward desirable outputs. Currently, we use the sum of the latencies of the target instructions as a proxy for performance and minimize that sum when generating an equivalent program. Such a cost model must be target-specific, since different architectures have different performance characteristics. Other cost models can easily be substituted without requiring any modifications to the code synthesizer.
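A minimal sketch of such a latency-sum cost model, with placeholder (not measured) latency values and names of our own choosing, might look like this:

#include <map>
#include <string>
#include <vector>

// Hypothetical per-target latency table; a real table would be populated from
// the target's published instruction timings.
using LatencyTable = std::map<std::string, double>;

// Cost of a candidate program = sum of the latencies of the target instructions
// that its AutoLLVM operations lower to.  Among equivalent candidates, the
// synthesizer prefers the one with the smallest cost.
double program_cost(const std::vector<std::string> &target_instrs,
                    const LatencyTable &latency) {
  double total = 0.0;
  for (const auto &instr : target_instrs) {
    auto it = latency.find(instr);
    total += (it != latency.end()) ? it->second : 1.0;  // default for unknown ops
  }
  return total;
}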

Hydride Performance Evaluation

We compare Hydride against Halide's back-end code generators for x86, Hexagon, and ARM on a wide range of image processing and machine learning kernels, all implemented in the Halide programming language. We also compare Hydride against Halide's plain LLVM IR back end (which uses no architecture-specific intrinsics) to gauge the effectiveness of LLVM's own optimizations.

Across most benchmarks for x86, Hexagon, and ARM, Hydride's performance is comparable to that of Halide with its architecture-specific back ends. In some cases we observe small slowdowns because the current implementation of the Hydride code synthesizer synthesizes AutoLLVM instructions from a small window of input IR operations at a time; these slowdowns can be averted by making synthesis scale to larger windows of operations in a given program.