### Multigrain Parallelization and Compiler/Architecture Co-design for 30 Years with LCPC

### Hironori Kasahara

Professor, Dept. of Computer Science & Engineering; Director, Advanced Multicore Processor Research Institute

#### Waseda University, Tokyo, Japan

IEEE Computer Society President Elect 2017, President 2018

- 1980 BS, 1982 MS, 1985 Ph.D., Dept. EE, Waseda Univ.
- 1985 Visiting Scholar: U. of California, Berkeley
- 1986 Assistant Prof., 1988 Associate Prof., 1997 Prof., Dept. of EECE, Waseda Univ. Now Dept. of Computer Sci. & Eng.
- 1989-90 Research Scholar: U. of Illinois, Urbana-Champaign, Center for Supercomputing R&D
- 1987 IFAC World Congress Young Author Prize
- 1997 IPSJ Sakai Special Research Award
- 2005 STARC Academia-Industry Research Award
- 2008 LSI of the Year Second Prize
- 2008 Intel Asia Academic Forum Best Research Award
- 2010 IEEE CS Golden Core Member Award
- 2014 Minister of Edu., Sci. & Tech. Research Prize
- 2015 IPSJ Fellow
- 2017 IEEE Fellow
- Reviewed Papers: 214, Invited Talks: 145, Published Unexamined Patent Applications: 59 (Japan, US, GB, China; Granted Patents: 30), Articles in Newspapers, Web News, Media incl. TV etc.: 572
- **Committees in Societies and Government**: 245
  - **IEEE Computer Society President 2018**, BoG (2009-14), Multicore STC Chair (2012-), Japan Chair (2005-07), IPSJ Chair: HG for Mag. & J. Edit, Sig. on ARC.
  - [METI/NEDO] Project Leaders: Multicore for Consumer Electronics, Advanced Parallelizing Compiler, Chair: Computer Strategy Committee
  - [Cabinet Office] CSTP Supercomputer Strategic ICT PT, Japan Prize Selection Committees, etc.
  - [MEXT] Info. Sci. & Tech. Committee, Supercomputers (Earth Simulator, HPCI Promo., Next Gen. Supercomputer K) Committees, etc.

# **OSCAR Parallelizing Compiler**

### To improve effective performance, cost-performance and software productivity and reduce power

**Multigrain Parallelization** (LCPC 1991, 2001, 2004): coarse-grain parallelism among loops and subroutines (2000 on SMP) and near-fine-grain parallelism among statements (1992), in addition to loop parallelism

### **Data Localization**

Automatic data management for distributed shared memory, cache and local memory (Local Memory 1995, 2016 on RP2; Cache 2001, 2003). Software Coherent Control (2017)

#### Data Transfer Overlapping (2016 partially)

Data transfer overlapping using Data Transfer Controllers (DMAs)

### **Power Reduction**

(2005 for Multicore, 2011 Multi-processes, 2013 on ARM)

Reduction of power consumption by compiler-controlled DVFS and power gating with hardware support.



# Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)



# MTG of Su2cor-LOOPS-DO400

### Coarse grain parallelism PARA\_ALD = 4.3
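One way to read a coarse grain parallelism figure like this is as the ratio of total macro-task work to the critical path length of the macro-task graph. The sketch below assumes that reading; whether PARA\_ALD is defined exactly this way is an assumption. The task names and costs are made up, and only data dependences are modeled, whereas earliest executable condition analysis also considers control dependences.

```python
def coarse_grain_parallelism(costs, preds):
    """Total work divided by critical path length of a task graph.

    costs: {task: execution time}
    preds: {task: [tasks whose results it needs]} (data dependences only)
    """
    finish = {}

    def earliest_finish(task):
        # A task can start once all of its predecessors have finished.
        if task not in finish:
            start = max((earliest_finish(p) for p in preds.get(task, ())), default=0)
            finish[task] = start + costs[task]
        return finish[task]

    critical_path = max(earliest_finish(t) for t in costs)
    return sum(costs.values()) / critical_path


# Hypothetical four-task graph: MT2 and MT3 can run in parallel after MT1.
costs = {"MT1": 2, "MT2": 3, "MT3": 3, "MT4": 2}
preds = {"MT2": ["MT1"], "MT3": ["MT1"], "MT4": ["MT2", "MT3"]}
parallelism = coarse_grain_parallelism(costs, preds)  # 10 / 7, about 1.43
```

A value of 4.3 under this reading would mean that, with unlimited processors, the graph could run at most about 4.3 times faster than sequential execution.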



# **Data-Localization: Loop Aligned Decomposition**

- Decompose multiple loops (Doall and Seq) into CARs and LRs considering inter-loop data dependence.
  - Most data in LR can be passed through LM.
  - LR: Localizable Region, CAR: Commonly Accessed Region









- An automatic parallelizing compiler available on the market gave us no speedup on 64 cores against execution time on 1 core
  - Execution time with 128 cores was slower than with 1 core (0.9 times speedup)
- Advanced OSCAR parallelizing compiler gave us 211 times speedup with 128 cores against execution time with 1 core using the commercial compiler
  - OSCAR compiler gave us 2.1 times speedup on 1 core against the commercial compiler by global cache optimization
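The two figures above share the same baseline (the commercial compiler on 1 core), so dividing one by the other gives the speedup of OSCAR on 128 cores relative to OSCAR on 1 core. The short calculation below only rearranges the numbers on the slide; the function and variable names are illustrative.

```python
def self_relative_speedup(speedup_vs_baseline, single_core_gain):
    """Speedup of N cores over 1 core when both use the same compiler."""
    return speedup_vs_baseline / single_core_gain


oscar_128_vs_oscar_1 = self_relative_speedup(211, 2.1)  # about 100.5
parallel_efficiency = oscar_128_vs_oscar_1 / 128         # about 0.78
```

Read this way, roughly a factor of 2.1 comes from single-core cache optimization and roughly a factor of 100 from parallel scaling.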

## **110 Times Speedup against the Sequential Processing for GMS Earthquake Wave Propagation Simulation on Hitachi SR16000** (Power7 Based 128 Core Linux SMP) (LCPC2015)



### Performance on Multicore Server for Latest Cancer Treatment Using Heavy Particle (Proton, Carbon Ion): 327 times speedup on 144 cores

Hitachi 144cores SMP Blade Server BS500: Xeon E7-8890 V3(2.5GHz 18core/chip) x8 chip



- Original sequential execution time 2948 sec (50 minutes) using GCC was reduced to 9 sec with 144 cores (327.6 times speedup)
  - Reduction of treatment cost and reservation waiting period is expected



## Model Base Designed Engine Control on V850

### **Multicore with Denso**

Though parallel processing of engine control on multicore has so far been very difficult, Denso and Waseda succeeded in achieving a 1.95 times speedup on a 2-core V850 multicore processor.



#### Speedup with 2cores for Engine Crankshaft Handwritten Program on **RPX Multi-core Processor**

- 1.6 times speedup by 2 cores against 1 core
- Figure: macrotask graph with a lot of conditional branches, and macrotask graph after task fusion
- Branches are fused to macrotasks for static scheduling
- Grain is too fine (us) for dynamic scheduling


# **OSCAR Compile Flow for Simulink Applications**



# Speedups of MATLAB/Simulink Image Processing on Various 4core Multicores

(Intel Xeon, ARM Cortex A15 and Renesas SH4A)



#### to-grayscale-/

Vessel Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/24990-retinal-blood-vessel-extraction/





- On 3 cores, automatic power reduction control reduced power to 1/7 of that consumed without power reduction control.
- 3 cores with the compiler power reduction control consumed 1/3 of the power of ordinary 1 core execution.



On the same 3 cores, the compiler power optimization reduced power from 41.6W to 9.6W (1/4).

Power with 3 cores (9.6W) was 1/3 of that with 1 core (29.3W).
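The rounded fractions quoted above can be checked directly from the wattages on the slide. This is plain arithmetic on those three numbers; the function name is illustrative.

```python
def reduction_factor(before_watts, after_watts):
    """How many times smaller the power is after optimization."""
    return before_watts / after_watts


same_three_cores = reduction_factor(41.6, 9.6)  # about 4.33, quoted as 1/4
versus_one_core = reduction_factor(29.3, 9.6)   # about 3.05, quoted as 1/3
```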

# **OSCAR Heterogeneous Multicore**



### An Image of Static Schedule for Heterogeneous Multicore with Data Transfer Overlapping and Power Control



### OSCAR API Ver. 2.0 for Homogeneous/Heterogeneous Multicores and Manycores (LCPC2009Homo, 2010 Hetero)

### List of Directives (22 directives)

**Parallel Execution API**

- parallel sections (\*)
- flush (\*)
- critical (\*)
- execution

**Memory Mapping API**

- threadprivate (\*)
- distributedshared
- onchipshared

**Synchronization API**

- groupbarrier

**Data Transfer API**

- dma transfer
- dma contiguous parameter
- dma stride parameter
- dma flag check
- dma flag send

**Power Control API**

- fvcontrol
- get fvstatus

**Timer API**

- get current time

**Accelerator**

- accelerator task entry

**Cache Control**

- cache writeback
- cache selfinvalidate
- complete memop
- noncacheable
- aligncache

**2 hint directives for OSCAR compiler**

- accelerator task
- oscar comment

(\* from OpenMP)



**Automatic Local Memory Management**

**Data Localization: Loop Aligned Decomposition**

- Decompose loops into LRs and CARs
  - LR (Localizable Region): Data can be passed through LDM
  - CAR (Commonly Accessed Region): Data transfers are required among processors
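The LR/CAR distinction can be illustrated with a deliberately simplified case, which is not the compiler's actual algorithm: a producer loop writes `A[i]`, and a consumer loop iteration `j` reads `A[j-halo..j+halo]`. The consumer loop is split evenly across processors. A producer iteration whose data is read by only one processor falls in an LR; one whose data is read by two processors falls in a CAR. The access pattern, the even split, and all names below are assumptions made for illustration.

```python
def classify_iterations(n, parts, halo):
    """Label each producer-loop iteration as ("LR", p) or ("CAR", (p, q))."""
    # Even split of the consumer loop's iteration space over `parts` processors.
    bounds = [(p * n // parts, (p + 1) * n // parts) for p in range(parts)]

    def owner(j):
        # Processor that executes consumer-loop iteration j.
        for p, (lo, hi) in enumerate(bounds):
            if lo <= j < hi:
                return p
        raise ValueError(j)

    labels = []
    for i in range(n):
        # Consumer iterations that read A[i] lie within `halo` of i.
        readers = {owner(j) for j in range(max(0, i - halo), min(n, i + halo + 1))}
        if len(readers) == 1:
            labels.append(("LR", min(readers)))
        else:
            labels.append(("CAR", tuple(sorted(readers))))
    return labels


labels = classify_iterations(n=12, parts=3, halo=1)
```

With 12 iterations, 3 processors and a halo of 1, iterations 3, 4, 7 and 8 sit on partition boundaries and are labeled CAR; the other 8 are labeled LR and their data can stay in one processor's local memory.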

**Single dimension Decomposition** 







# Adjustable Blocks

- Handling a suitable block size for each application
  - different from a fixed block size in cache
  - each block can be divided into smaller blocks with integer divisible size to handle small arrays and scalar variables

Notation: Block<sub>Number</sub><sup>Level</sup>

| Level   | Blocks |
|---------|--------|
| Level 0 | Block<sub>0</sub><sup>0</sup> |
| Level 1 | Block<sub>0</sub><sup>1</sup>, Block<sub>1</sub><sup>1</sup> |
| Level 2 | Block<sub>0</sub><sup>2</sup>, Block<sub>1</sub><sup>2</sup>, Block<sub>2</sub><sup>2</sup>, Block<sub>3</sub><sup>2</sup> |
| Level 3 | B<sub>0</sub><sup>3</sup>, B<sub>1</sub><sup>3</sup>, B<sub>2</sub><sup>3</sup>, B<sub>3</sub><sup>3</sup>, B<sub>4</sub><sup>3</sup>, B<sub>5</sub><sup>3</sup>, B<sub>6</sub><sup>3</sup>, B<sub>7</sub><sup>3</sup> |
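The block hierarchy above can be sketched as a simple address calculation. The sketch assumes each level halves the blocks of the level above (1, 2, 4, 8 blocks, as in the table); the slide only requires an integer-divisible size, so the factor of two, the function name, and the 16 KB local memory size are illustrative assumptions.

```python
def block_range(ldm_size, level, number):
    """Return (start_offset, size) of block `number` at `level`.

    Assumes level k splits the local memory into 2**k equal blocks.
    """
    blocks = 2 ** level
    if not 0 <= number < blocks:
        raise IndexError("block number out of range for this level")
    if ldm_size % blocks:
        raise ValueError("local memory size must divide evenly at this level")
    size = ldm_size // blocks
    return number * size, size


# Hypothetical 16 KB local memory.
whole = block_range(16384, 0, 0)    # (0, 16384)
quarter = block_range(16384, 2, 3)  # (12288, 4096)
eighth = block_range(16384, 3, 5)   # (10240, 2048)
```

Large arrays would be assigned a low-level (large) block, while small arrays and scalars would take high-level (small) blocks carved out of the same space.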

# Multi-dimensional Template Arrays for Improving Readability

- a mapping technique for arrays with varying dimensions
  - each block on LDM corresponds to multiple empty arrays with varying dimensions
  - these arrays have an additional dimension to store the corresponding block number
    - TA[Block#][] for single dimension
    - TA[Block#][][] for double dimension
    - TA[Block#][][][] for triple dimension
    - ...
- LDM is represented as a one-dimensional array
  - without Template Arrays, multidimensional arrays have complex index calculations
    - A[i][j][k] -> TA[offset + i' \* L + j' \* M + k']
  - Template Arrays provide readability
    - A[i][j][k] -> TA[Block#][i'][j'][k']
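The two index forms above address the same storage. The sketch below models LDM as a flat Python list and wraps it in a small helper class so that `ta[block, i, j, k]` resolves to `offset + i * L + j * M + k`. It assumes row-major layout, so `L` and `M` are taken to be the strides of the first and second dimensions; the class name and sizes are hypothetical.

```python
class TemplateArray:
    """Multi-dimensional view over a flat local-memory buffer."""

    def __init__(self, ldm, dims):
        self.ldm = ldm    # flat one-dimensional buffer
        self.dims = dims  # (d1, d2, d3) extents of one block
        d1, d2, d3 = dims
        self.block_size = d1 * d2 * d3

    def flat_index(self, block, i, j, k):
        _, d2, d3 = self.dims
        offset = block * self.block_size  # start of the block in LDM
        stride_i = d2 * d3                # "L" in the slide's formula
        stride_j = d3                     # "M" in the slide's formula
        return offset + i * stride_i + j * stride_j + k

    def __getitem__(self, key):
        return self.ldm[self.flat_index(*key)]

    def __setitem__(self, key, value):
        self.ldm[self.flat_index(*key)] = value


ldm = [0] * 48                       # two blocks of 2 x 3 x 4 elements
ta = TemplateArray(ldm, (2, 3, 4))
ta[1, 1, 2, 3] = 7                   # readable form
flat = 1 * 24 + 1 * 12 + 2 * 4 + 3   # explicit form, equals 47
```

Both forms touch `ldm[47]`; the template form keeps the block number and the per-dimension indices visible in the generated code.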



#### **Speedups by the Local Memory Management Compared with Utilizing Shared Memory on Benchmark Applications using RP2**



20.12 times speedup for AACenc with 8-core execution using local memory, against sequential execution using the off-chip shared memory of RP2

# Software Coherence Control Method on OSCAR Parallelizing Compiler

- Coarse grain task parallelization with earliest executable condition analysis (control and data dependency analysis to detect parallelism among coarse grain tasks).
- OSCAR compiler automatically controls coherence using the following simple program restructuring methods:
  - To cope with the stale data problem:
    - Data synchronization by compilers
  - To cope with the false sharing problem:
    - Data Alignment
    - Array Padding
    - Non-cacheable Buffer
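Array padding against false sharing can be sketched as a size calculation: each core's portion of an array is rounded up to a whole number of cache lines, so no line holds data written by two cores. The sketch assumes the array base is already line-aligned (the data alignment step) and uses a hypothetical 32-byte line; it illustrates the idea rather than the compiler's actual transformation.

```python
def padded_chunk(elements_per_core, element_size, line_size):
    """Elements per core after padding each chunk to whole cache lines."""
    per_line = line_size // element_size
    lines = -(-elements_per_core // per_line)  # ceiling division
    return lines * per_line


# 10 doubles (8 bytes each) per core with a hypothetical 32-byte cache line.
chunk = padded_chunk(10, 8, 32)                      # 12 elements, 2 of them padding
starts = [core * chunk * 8 for core in range(4)]     # byte offset of each core's chunk
```

Every entry in `starts` is a multiple of the line size, so two cores never write into the same cache line.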



MTG generated by earliest executable condition analysis

# Automatic Software Coherent Control for Manycores

# Performance of Software Coherence Control by OSCAR Compiler on 8-core RP2



### **1987 OSCAR (Optimally Scheduled Advanced Multiprocessor)**

**Co-design of Compiler and Architecture** 

Looking at various applications, design a parallelizing compiler and design a multiprocessor/multicore-processor to support compiler optimization



### **OSCAR (Optimally Scheduled Advanced Multiprocessor)**



#### **OSCAR Memory Space (Global Address Space)**



#### LOCAL MEMORY SPACE


Hierarchical Barrier Synchronization

- Specifying a hierarchical group barrier
  - #pragma oscar group\_barrier (C)
  - !\$oscar group\_barrier (Fortran)



Kasahara-Kimura Lab., Waseda University

# NWT

| Item               | Specification |
|--------------------|---------------|
| Machine Cycle Time | 9.5ns (105MHz) |
| PE Performance     | 1.68GFlops |
| PE Memory Size     | 256MB/PE |
| Crossbar Bandwidth | 4B/cycle x 2 (send/receive simultaneous)/PE = 421MB/s x 2 /PE |
| Mass               | |
| Number of PEs      | 140PEs + 2 Control Proc. |

NAL computer center, Chofu, Tokyo, Feb. 1, 1993



#### Fujitsu Vector Parallel Supercomputer with Crossbar to a Chip

# VPP500/NWT






# Earth Simulator

(http://www.es.jamstec.go.jp/)

- Earth environmental simulations such as global warming, El Niño and plate movement, for all life on this planet.
- Developed in Mar. 2002 by STA (MEXT) and NEC with 400 M\$ investment under Dr. Miyoshi's direction.





Mr. Hajime Miyoshi



### 4 core multicore RP1 (2007), 8 core multicore RP2 (2008) and 15 core Heterogeneous multicore RPX (2010) developed in NEDO Projects with Hitachi and Renesas



### Automatic Parallelization of JPEG-XR for Drinkable Inner Camera (Endo Capsule)

10 times more speedup needed after parallelization for 128 cores of Power 7. Less than 35mW power consumption is required.

- TILEPro64





## **OSCAR Vector Multicore and Compiler for Embedded to Servers with OSCAR Technology**



#### **Summary**

- To get speedup and power reduction on homogeneous and heterogeneous multicore systems, collaboration of architecture and compiler will be more important.
- The automatic parallelizing and power reducing compiler has succeeded in speedup and/or power reduction of scientific applications including "Earthquake Wave Propagation", medical applications including "Cancer Treatment Using Carbon Ion" and "Drinkable Inner Camera", and industry applications including "Automobile Engine Control" and "Wireless Communication Base Band Processing" on various multicores.
  - For example, the automatic parallelization gave us 110 times speedup for "Earthquake Wave Propagation Simulation" on 128 cores of IBM Power 7 against 1 core, 327 times speedup for "Heavy Particle Radiotherapy Cancer Treatment" on a 144-core Hitachi Blade Server using Intel Xeon E7-8890, 1.95 times for "Automobile Engine Control" on Renesas 2 cores using SH4A or V850, and 55 times for "JPEG-XR Encoding for Capsule Inner Cameras" on the Tilera 64-core Tile64 manycore.
- In automatic power reduction, power consumption for real-time multimedia applications such as human face detection, H.264, MPEG-2 and optical flow was reduced to 1/2 or 1/3 using 3 cores of ARM Cortex A9 and Intel Haswell, and to 1/4 using 8 cores of Renesas SH4A, against ordinary single core execution.
- For more speedup and power reduction, we have been developing a new architecture/compiler co-designed multicore with vector accelerator based on vector pipelining with vector registers, chaining, load-store pipeline, advanced DMA controller without need of modification of CPU instruction set.