

# **OSCAR Automatic Parallelizing Compiler: Automatic Speedup and Power Reduction**

Kasahara & Kimura Lab, Waseda University, TOKYO
http://www.kasahara.cs.waseda.ac.jp

## **OSCAR Automatic Parallelizing Compiler**

To improve effective performance, cost-performance and software productivity, and to reduce power consumption.

**Multigrain Parallelization**

**Coarse-grain parallelism** among loops and subroutines and near-fine-grain parallelism among statements, in addition to **loop** parallelism

#### **Cancer Treatment Carbon Ion Radiotherapy**
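As a rough illustration of the coarse-grain part, the sketch below gives each task in a small dependence graph a level equal to the longest dependence chain leading to it. Two tasks on the same level have no dependence path between them, so they are candidates for execution on different cores. The `MacroTask` type, `MAX_PREDS` and `compute_levels` are hypothetical names introduced for this example; they are not taken from OSCAR Compiler, whose actual analysis is not described here.

```c
#include <assert.h>

#define MAX_PREDS 4

/* Hypothetical macro-task: a loop or subroutine plus the tasks it depends on. */
typedef struct {
    int num_preds;
    int preds[MAX_PREDS];   /* indices of tasks that must finish first */
} MacroTask;

/* Sets level[i] to the length of the longest dependence chain ending at task i.
 * Assumes tasks are listed in topological order (every predecessor has a
 * smaller index). Returns the number of distinct levels. */
int compute_levels(const MacroTask *tasks, int n, int *level)
{
    int max_level = -1;
    for (int i = 0; i < n; i++) {
        level[i] = 0;
        for (int k = 0; k < tasks[i].num_preds; k++) {
            int p = tasks[i].preds[k];
            assert(p >= 0 && p < i);          /* topological order required */
            if (level[p] + 1 > level[i])
                level[i] = level[p] + 1;
        }
        if (level[i] > max_level)
            max_level = level[i];
    }
    return max_level + 1;
}
```

For a diamond-shaped graph (task 0 feeds tasks 1 and 2, which both feed task 3), tasks 1 and 2 land on the same level and could be assigned to different cores.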

### **Data Localization**

Automatic data management for distributed shared memory, cache and local memory
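One way to picture data localization is a pair of dependent loops restructured so that the data passed between them stays in fast memory. The sketch below is a hand-written illustration under that assumption, not compiler output: `two_loops_localized` and the `chunk` parameter (the number of elements assumed to fit in local memory or cache) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Runs two dependent loops chunk by chunk, so the part of b produced by the
 * first loop is consumed by the second loop while it is still in fast memory.
 * chunk is a hypothetical tuning parameter: elements that fit in local memory. */
void two_loops_localized(const double *a, double *b, double *c,
                         size_t n, size_t chunk)
{
    assert(chunk > 0);
    for (size_t start = 0; start < n; start += chunk) {
        size_t end = (n - start < chunk) ? n : start + chunk;
        for (size_t i = start; i < end; i++)   /* loop 1 on this chunk */
            b[i] = 2.0 * a[i];
        for (size_t i = start; i < end; i++)   /* loop 2 reuses b[start..end) */
            c[i] = b[i] + 1.0;
    }
}
```

The result is identical to running the two loops back to back over the whole arrays; only the order of memory accesses changes.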

**Data Transfer Overlapping**

Overlapping of data transfers with computation using Data Transfer Controllers (DMAs)
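Overlapping of this kind is commonly structured as double buffering: the transfer of the next chunk is issued before the current chunk is processed. The sketch below shows only that structure. `start_transfer` is a hypothetical stand-in implemented with a synchronous `memcpy`; a real data transfer controller would copy in the background, and a wait for completion would be needed before the buffer is read. `BUF_ELEMS` and `sum_double_buffered` are likewise assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_ELEMS 4   /* hypothetical local buffer size in elements */

/* Stand-in for starting a DMA transfer. Here it copies synchronously;
 * a real controller would return immediately and copy in the background. */
static void start_transfer(double *local, const double *remote, size_t count)
{
    assert(count <= BUF_ELEMS);
    memcpy(local, remote, count * sizeof(double));
}

/* Sums n remote elements using two local buffers: the transfer of chunk k+1
 * is issued before chunk k is processed. */
double sum_double_buffered(const double *remote, size_t n)
{
    double buf[2][BUF_ELEMS];
    double total = 0.0;
    size_t cur = 0;

    if (n == 0)
        return 0.0;

    start_transfer(buf[cur], remote, (n < BUF_ELEMS) ? n : BUF_ELEMS);
    for (size_t start = 0; start < n; start += BUF_ELEMS) {
        size_t count = (n - start < BUF_ELEMS) ? n - start : BUF_ELEMS;
        size_t next = start + count;
        if (next < n) {                        /* issue the next transfer */
            size_t next_count = (n - next < BUF_ELEMS) ? n - next : BUF_ELEMS;
            start_transfer(buf[1 - cur], remote + next, next_count);
        }
        for (size_t i = 0; i < count; i++)     /* compute on the current buffer */
            total += buf[cur][i];
        cur = 1 - cur;                         /* swap buffers */
    }
    return total;
}
```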

**Power Reduction**

Reduction of power consumption through compiler-controlled DVFS and power gating, with hardware support.
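The basic idea behind DVFS control can be sketched as choosing the lowest clock frequency that still meets a deadline. The function below is a minimal illustration under that assumption; `pick_frequency_mhz`, the frequency table and the cycle counts are hypothetical and do not describe how OSCAR Compiler or any particular hardware actually selects frequencies. Power gating is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the lowest frequency (in MHz) from an ascending table that can
 * execute `cycles` cycles within `deadline_us` microseconds, or 0 if no
 * level is fast enough. */
unsigned pick_frequency_mhz(const unsigned *levels_mhz, size_t num_levels,
                            unsigned long long cycles,
                            unsigned long long deadline_us)
{
    assert(levels_mhz != NULL || num_levels == 0);
    for (size_t i = 0; i < num_levels; i++) {
        /* A core running at f MHz executes f cycles per microsecond. */
        unsigned long long capacity =
            (unsigned long long)levels_mhz[i] * deadline_us;
        if (capacity >= cycles)
            return levels_mhz[i];
    }
    return 0;
}
```

With hypothetical levels of 150, 300 and 600 MHz, a task of 250,000 cycles and a 1,000 microsecond deadline would run at 300 MHz, since 150 MHz provides only 150,000 cycles in that time.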

### **Software Coherent Cache**

Parallelizing compiler directed software coherence technique for shared memory multicore systems without hardware cache coherence control
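To make the idea concrete, the toy model below simulates two private write-back caches over one shared memory with no hardware coherence. Without explicit operations a consumer can read a stale value; a writeback on the producer side followed by a self-invalidate on the consumer side restores a consistent view. The `Cache` type and the `cache_*` functions are hypothetical and exist only for this sketch; they are not the OSCAR API or NIOS II cache instructions, and the placement of these operations by the compiler is not shown.

```c
#include <assert.h>
#include <stdbool.h>

#define MEM_WORDS 8

int shared_mem[MEM_WORDS];          /* main memory shared by all PEs */

/* Hypothetical private cache of one PE: one slot per memory word. */
typedef struct {
    int  data[MEM_WORDS];
    bool valid[MEM_WORDS];
    bool dirty[MEM_WORDS];
} Cache;

int cache_read(Cache *c, int addr)
{
    assert(addr >= 0 && addr < MEM_WORDS);
    if (!c->valid[addr]) {          /* miss: fetch from shared memory */
        c->data[addr] = shared_mem[addr];
        c->valid[addr] = true;
        c->dirty[addr] = false;
    }
    return c->data[addr];           /* hit: may be stale */
}

void cache_write(Cache *c, int addr, int value)
{
    assert(addr >= 0 && addr < MEM_WORDS);
    c->data[addr] = value;          /* write-back: memory is not updated yet */
    c->valid[addr] = true;
    c->dirty[addr] = true;
}

/* Producer side: flush a dirty word to shared memory. */
void cache_writeback(Cache *c, int addr)
{
    assert(addr >= 0 && addr < MEM_WORDS);
    if (c->valid[addr] && c->dirty[addr]) {
        shared_mem[addr] = c->data[addr];
        c->dirty[addr] = false;
    }
}

/* Consumer side: drop a possibly stale clean copy so the next read refetches. */
void cache_self_invalidate(Cache *c, int addr)
{
    assert(addr >= 0 && addr < MEM_WORDS);
    assert(!(c->valid[addr] && c->dirty[addr]));   /* would lose local writes */
    c->valid[addr] = false;
}
```

In this model, if PE1 has already cached a word and PE0 then writes it, PE1 keeps reading the old value until PE0 calls `cache_writeback` and PE1 calls `cache_self_invalidate` on that word.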

Speedup by software cache coherence control

- **Advantages**
  - Smaller hardware and lower power consumption, achieved by removing the expensive hardware cache coherence mechanism
  - Higher performance through the compiler's careful scheduling of cache operations as well as memory optimization
- **Evaluation**
  - Number of PEs: 1, 2, 4
  - NIOS II multicore system implemented in Arria10 SoC FPGA
    - I\$: 32 KB / D\$: 32 KB (each PE)
- **Application**
  - NAS Parallel Benchmarks
  - Matrix Multiply (size: 100x100)

**3.6 times speedup for NAS Parallel Benchmarks CG** (Conjugate Gradient)



Parallelization by OSCAR Compiler of GMS, the earthquake wave simulation of the National Research Institute for Earth Science and Disaster Resilience



#### **Parallelization Speedup of GMS**



#### Task graph of OSCAR API Program



#### Execution environment: Hitachi SR16000 Model VM1 (IBM POWER7 processor, 128 cores)

**SC18**



# **Parallel Processing of MATLAB/Simulink by OSCAR Compiler on Intel, ARM & Renesas Multicores**

**OSCAR Compiler**: MATLAB/Simulink Multigrain Parallelization

## Automatic Parallelization of MATLAB/Simulink by OSCAR Compiler



Various 4-core multicores (Intel Xeon, ARM Cortex A15 and Renesas SH4A)



- Road Tracking, Image Compression: http://www.mathworks.co.jp/jp/help/vision/examples
- Buoy Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/44706-buoy-detection-using-simulink
- Color Edge Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/28114-fast-edges-of-a-color-image--actual-color--not-converting-to-grayscale-/
- Vessel Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/24990-retinal-blood-vessel-extraction/

#### (2) Generate Gantt chart → Scheduling in a multicore

#### (3) Generate parallelized C code using the OSCAR API → Multiplatform execution (Intel, ARM and SH etc.)

    void VesselExtraction_step ( )
    {
      int thr1 ;
      int thr2 ;
      int thr3 ;
      {
        oscar_thread_create ( & thr1 , thread_function_001 , (void*)1 ) ;
        oscar_thread_create ( & thr2 , thread_function_002 , (void*)2 ) ;
        oscar_thread_create ( & thr3 , thread_function_003 , (void*)3 ) ;

        VesselExtraction_step_PE0 ( ) ;

        oscar_thread_join ( thr1 ) ;
        oscar_thread_join ( thr2 ) ;
        oscar_thread_join ( thr3 ) ;
      }
    }

#### Speedups: 3.6x for Intel, 3.1x for ARM, 3.5x for Renesas
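The reported speedups can be read as parallel efficiency, that is, speedup $S$ divided by the number of cores $p$. The figures below assume that each speedup is measured against single-core execution on the same 4-core processor; that baseline is an assumption, since it is not stated here.

$$
E = \frac{S}{p}, \qquad
E_{\text{Intel}} = \frac{3.6}{4} = 0.90, \qquad
E_{\text{ARM}} = \frac{3.1}{4} = 0.775, \qquad
E_{\text{Renesas}} = \frac{3.5}{4} = 0.875
$$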



# **Vector Processing of Parallelized Program** by OSCAR Compiler on NEC SX-Aurora TSUBASA

**OSCAR Compiler SX-Aurora TSUBASA** 

