



# **FPGA CODE ACCELERATION**

opportunities & challenges











|            | HLL to circo                             | <b>lit</b>                |
|------------|------------------------------------------|---------------------------|
|            | Imperative HLL                           | Circuit                   |
| language   | algorithmic, procedural time insensitive | behavioral,<br>timed      |
| sequencing | temporal & sequential                    | concurrent                |
| memory     | central, virtual                         | distributed,<br>registers |
| limit      | memory size                              | circuit area              |

| Challenge<br>Algorithm                                                           |                                                                                                           |  |  |  |
|----------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|--|--|--|
| Stored Program                                                                   | Circuit                                                                                                   |  |  |  |
| minimize number of steps<br>per unit data                                        | minimize area used per<br>unit data<br>maximize unit data per<br>clock cycle<br>minimize clock cycle time |  |  |  |
| Needed: complete application restructuring using spatially concurrent algorithms |                                                                                                           |  |  |  |
| 11                                                                               | 1                                                                                                         |  |  |  |



| Data                      | access         |      |
|---------------------------|----------------|------|
|                           | Stored Program | FPGA |
| cache memory              |                | х    |
| virtual memory            |                | Х    |
| memory mapped I/O         |                | х    |
| standard memory interface |                | x    |





### What is ROCCC?

 Riverside Optimizing Compiler for Configurable Circuits
 Designed for the generation of FPGA code accelerators from C

- Not an EDA tool!
- Focus is on loop nests and streaming data
  - extensive loop and array analysis for parallelism
- Designed to:
  - achieve same clock speed as hand-written HDL
  - keep the user in full control:
    - algorithm and design space exploration
    - · only what is well understood is automated

# Origins Of ROCCC 1.0 •SA-C (Single Assignment C): Colorado State

- Language + compiler project
  - demonstrated efficient hardware based upon single assignment property

#### **OStreamsC: LANL**

- Addition of streaming mechanisms to C, explicit parallelism
- Now ImpulseC

#### **OROCCC 1.0**

- Purely top down approach
- Limits the design space that can be explored



















# **ROCCC 2.0**

### O Enable higher level programming of FPGAs

- Transform a subset of C
- Focus on hardware-appropriate algorithms as opposed to generic C

#### O Enable reusability, composability, and portability

- Modular approach to hardware construction
- Separate platform interface from compiler
- Tuning of applications to particular platform accomplished through optimizations GUI

#### O Generate high throughput circuits

- Extensive optimizations

• Free, open source toolset



# Modular Bottom-up Design

#### • Modules and Systems

#### • Modules:

- Self contained hardware computational blocks, expressed in C
- Bottom-up composition of modules
- Standard database (SQLite3) contains all information on modules either in C, VHDL, or netlists

#### o Systems

- Process streams of data in critical loops
- Can use modules
- Many user controllable optimizations

|                                                                       | lule Code                                                                                                                   |  |  |
|-----------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|--|--|
|                                                                       | // Interface<br>typedef struct                                                                                              |  |  |
| <ul> <li>Interface</li> <li>A struct that specifies</li> </ul>        | {     int realOne_in ;     int imagOne_in ;     int realTwo_in ; input registers     int imaTwo_in ;     int realOmega in ; |  |  |
| inputs and outputs                                                    | int imagOmega_in ;                                                                                                          |  |  |
| <ul> <li>Implementation</li> <li>A function that describes</li> </ul> | <pre>int A0_out ; int A1_out ; Output registers int A2_out ; int A3_out ; } FFT_t ;</pre>                                   |  |  |
| computation inside the<br>black box                                   | // Implementation<br>FFT_t FFT(FFT_t f)                                                                                     |  |  |
| <ul> <li>All outputs must be<br/>assigned</li> </ul>                  | int tmp1 ; internal registers<br>int tmp2 ;                                                                                 |  |  |
|                                                                       | <pre>tmp1 = f.realOmega_in * f.realTwo_in ; tmp2 = f.imagOmega_in * f.imagTwo_in ;</pre>                                    |  |  |
|                                                                       | <pre>f.A0_out = f.realOne_in + tmp1 - tmp2 ; // The other outputs computations go here return f ;</pre>                     |  |  |

| Module                                                                                | Instantiation                                                                                     |
|---------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
|                                                                                       | <pre>#include "roccc-library.h"</pre>                                                             |
|                                                                                       | typedef struct                                                                                    |
| o All Modules accessible as<br>functions                                              | {     int input0_in ;     // Other inputs                                                         |
| - "roccc-library.h"                                                                   | <pre>int tmp0_out ; // Other outputs</pre>                                                        |
| - Called as any other C                                                               | } FFTOneStage_t ;                                                                                 |
| function                                                                              | <pre>FFTOneStage_t FFTOneStage(FFTOneStage_t t) {</pre>                                           |
| <ul> <li>Standard database<br/>maintains all module<br/>information</li> </ul>        | FFT(t.input0_in,<br>t.omega0_in,<br>t.input16_in,<br>t.omega1_in,<br>t.input17_in                 |
| <ul> <li>GUI supports automatic<br/>insertion of module<br/>instantiations</li> </ul> | t.input1_in, THOUSE HIStantiation<br>t.input1_in,<br>t.temp0_out,<br>t.temp1_out,<br>t.temp2_out, |
| Instantiations                                                                        | t.temp3_out) ;                                                                                    |
|                                                                                       | // Others                                                                                         |
|                                                                                       | return t ;<br>}                                                                                   |

| System Code                                                                                                                                 |                                                               |  |  |  |
|---------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------|--|--|--|
| <pre>#include "roccc-library.h"</pre>                                                                                                       |                                                               |  |  |  |
| <pre>void firSystem() {     int A[100] ;     int B[100] ; </pre>                                                                            | t & output streams                                            |  |  |  |
| int i ;<br>int endValue ; <b>d</b><br>int myTmp ;                                                                                           | actual value passed to hardware at runtime                    |  |  |  |
| <pre>for(i = 0 ; i &lt; endValue ; ++i) {     // Data reuse is detected in     // loops by the compiler     FIR(A[i], A[i+1], A[i+2],</pre> | module instantiation can be<br>duplicated if loop is unrolled |  |  |  |
| }                                                                                                                                           |                                                               |  |  |  |

# **ROCCC Transformations**

| o Standard                                                                                                                                                                                                                             | OLoop                                                                                                                                                   | oArray                                                                                                                                                                                                                                          |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>CFG, DFG, and UD/DU</li> <li>Constant propagation/<br/>folding</li> <li>Copy propagation</li> <li>Dead and unreachable<br/>code elimination</li> <li>Scalar renaming</li> <li>Common subexpression<br/>elimination</li> </ul> | <ul> <li>Unrolling</li> <li>Fusing</li> <li>Interchange</li> <li>Peeling</li> <li>Tiling</li> <li>Unswitching</li> <li>Invariant code motion</li> </ul> | <ul> <li>Scalar replacement</li> <li>Array renaming</li> <li>RAW and WAW<br/>elimination</li> <li>Feedback elimination</li> <li>Systolic array<br/>generation</li> <li>Sequential loop<br/>pipelined unrolling</li> <li>Temporal CSE</li> </ul> |
|                                                                                                                                                                                                                                        | spec                                                                                                                                                    | cific to hardware                                                                                                                                                                                                                               |

### Hardware Optimizations

#### **ORedundancy Specification**

- **OPipelining and retiming** 
  - Ability to specify weights for basic operations to guide the number of pipeline stages

#### **OSmart Buffer Generation**

- Keep memory accesses that will be reused to minimize off chip requests

#### OSystolic array generation

- Wavefront algorithms are coded as nested *for* loops that iterate over a two-dimensional array
- Specific set of optimizations to transform into a systolic array

#### O Temporal common subexpression elimination

- Remove common code across loop iterations
- Extends to the removal of modules across loop iterations and addition of feedback variables

# Platform Independent Interfacing

### **o**ROCCC Generates Generic Hooks

- Memories address generator
- Streams FIFOs

### Each interface configurable by the user on a stream-by-stream basis

- Number of outgoing memory requests
- Number of reads/writes per clock cycle

### • Optimized for maximum throughput

- Can support reading values every clock cycle or as fast as can be fed





# **ROCCC 2.0 Features**

## O Support for modular circuit design in C

- Reusable modules, in C, VHDL or cores
- Bottom-up design and top-down designs supported
- External IP cores integrated in ROCCC designs
- Partial compilation and partial synthesis with reuse
- Inlining and black box instantiation supported
- O Subset of C with no additional keywords
  - ROCCC-code compiled and run with a software compiler
- O Support for variable bit width
  - User control over precision of arithmetic operations

# **O**Floating point operations

- half (16-bit), single (32-bit) or double precision (64-bit)

# **ROCCC 2.0 Features**

- O Automatic VHDL testbench code generation
- Platform independent code generation
  - allows fast re-targeting
- OUser control over all transformations, optimizations
  - Automate only what is fully understood, by us and the user!
  - High-level transformations: loops and arrays
  - Low-level optimizations: DFG and circuit levels
  - Fine-grained control over low level optimizations
    - pipelining depth, fanout tree generation
    - parallelization of input and output streams

# **PERFORMANCE?**

dynamic time warping

a data mining example





```
bsequenceSearch(T,Q)
series of n points
ime series of m points
e(Q)
to n-m+1
.ize(C<sub>s,m</sub>)
D(C_{s,m},Q)
ninimum if necessary
```

```
c,Q)

series of n points, C(0) = \infty

series of m points, Q(0) = \infty

to m

= |C(1)-Q(i)|

/ xor operation

to n

0 to m

= |C(j)-Q(i)| + d(i-1,s\oplus 1))
```

,s⊕1)





Window Length

Οu





# Details

## O Clock rate

- Normalization Module runs at 174.6 MHz.
- Warping Matrix runs at 240 MHz.

## **O**Area

- Normalization Module requires 13% of FPGA Area
- Warping Matrix requires 7% of FPGA Area

# • Throughput:

- Normalization unit generates results every clock cycle
- Warping Matrix generates a result every 128 cycles
- 8 Warping Matrix units are used

# **PERFORMANCE?**

comparing to GPUs on image algorithms

#### **Evaluation Platforms** OGPU - NVidia Tesla processor - 30 Streaming Multiprocessors (8 cores each) total of 240 cores - Both GPGPU-Sim and measured **O**Benchmarks values running on a Tesla processor - Brightness Filter - 1, 4, 10, and 30 Streaming Multiprocessor configurations tested - Color Extraction **OFPGA - Virtex 6 LX760** - Box Filter - Gaussian Blur - Programmed in ROCCC compliant C - Blend - Unrolled for different levels of parallelism - Sobel • One loop body, 4x4 loop bodies, and 4x8 loop - Median Filter bodies - Pixelation - All hardware for one application created through the tuning parameters of ROCCC and one source file





# **Productivity** face detection example

# **Viola-Jones Face Detection**

# O Detect forward facing faces

- Works on a sliding window over an image
- Window must be scaled several times to detect different sized faces

## O 24 classifier stages

- Many features in each stage
- Each feature is identical computation with different values
- Software runs at 1.5 seconds / frame (.66 frames/ sec) on average
  - 600x600 sized images





# Results

# **OProductivity gains**

- Well known, de facto standard, available in code repositories, OpenCV, 17,000 lines of C code
- Two engineers, with no prior knowledge of code or application domain (Computer Vision)
- read original paper, open the code distribution
- ROCCC ported algorithm developed in 3 days, resulting in synthesizable VHDL

## OApproximately 3000 features in software

- 1 feature 0.2% of the FPGA
- Approximately 510 features per FPGA, 2040 total on the HC-1

## OStage optimizations

- Last 8 stages (1467 total features) very close to original algorithm
- 343 frames per second
- 520X improvement over software





