

## **Green Multicore Computing**

Hironori Kasahara, Ph.D., IEEE Fellow, IPSJ Fellow

Senior Executive Vice President, Waseda University

IEEE Computer Society President 2018

URL: http://www.kasahara.cs.waseda.ac.jp/

| Career | Publications and committee service |
|--------|------------------------------------|
| <b>1980 BS, 82 MS, 85 Ph.D.</b>, Dept. EE, Waseda Univ.<br>1985 Visiting Scholar: U. of California, Berkeley<br>1986 Assistant Prof., 1988 Associate Prof.<br>1989-90 Research Scholar: U. of Illinois, Urbana-Champaign, Center for Supercomputing R&D<br>1997 Prof.<br>2004 Director, Advanced Multicore Research Institute<br>2017 member: the Engineering Academy of Japan and the Science Council of Japan<br>2018 Nov. Senior Vice President, Waseda Univ. | Reviewed Papers: 218, Invited Talks: 186, Granted Patents: 52 (Japan, US, GB, China), Articles in Newspapers, Web News, Media incl. TV etc.: 615<br>Committees in Societies and Government: 260<br>IEEE Computer Society: President 2018, Executive Committee (2017-2019), BoG (2009-14), Strategic Planning Committee Chair 2018, Multicore STC Chair (2012-), Japan Chair (2005-07)<br>IPSJ Chair: HG for Magazine. & J. Edit, Sig. on ARC.<br>[METI/NEDO] Project Leaders: Multicore for Consumer Electronics, Advanced Parallelizing Compiler, Chair: Computer Strategy Committee<br>[Cabinet Office] CSTP Supercomputer Strategic ICT PT, Japan Prize Selection Committees, etc.<br>[MEXT] Info. Sci. & Tech. Committee, Supercomputers (Earth Simulator, HPCI Promo., Next Gen. Supercomputer K) Committees, etc. |
| Awards:<br>1987 IFAC World Congress Young Author Prize<br>1997 IPSJ Sakai Special Research Award<br>2005 STARC Academia-Industry Research Award<br>2008 LSI of the Year Second Prize<br>2008 Intel Asia Academic Forum Best Research Award<br>2010 IEEE CS Golden Core Member Award<br>2014 Minister of Edu., Sci. & Tech. Research Prize<br>2015 IPSJ Fellow, 2017 IEEE Fellow, Eta Kappa Nu | |


### **1987 OSCAR**(<u>Optimally Scheduled Advanced Multiprocessor</u>)

**<u>Co-design of Compiler and Architecture</u>** 

Looking at various applications, design a parallelizing compiler and design a multiprocessor/multicore-processor to support compiler optimization



### **OSCAR**(<u>O</u>ptimally <u>S</u>cheduled <u>A</u>dvanced Multiprocesso<u>r</u>)



# NWT

- Machine Cycle Time: 9.5ns (105MHz)
- PE Performance: 1.68GFlops
- PE Memory Size: 256MB/PE
- Crossbar Bandwidth: 4B/cycle x 2 (send/receive simultaneous)/PE = 421MB/s x 2 /PE
- Number of PEs: 140PEs + 2 Control Proc.

Mass

Museum

NAL computer center, Chofu, Tokyo, Feb. 1, 1993

# VPP500/NWT



# Earth Simulator

(http://www.es.jamstec.go.jp/)

- Earth environmental simulation, such as global warming, El Nino and plate movement, for all life on this planet.
- Developed in Mar. 2002 by STA (MEXT) and NEC with a 400 M\$ investment under Dr. Miyoshi's direction.





Mr. Hajime Miyoshi









### THE(Times Higher Education) World Academic Summit, ETH (Zurich), 2019.9.10





## Oxford University, 11/12-13 (Research Collaboration)

- Vice Chancellor: Prof. Louise Richardson (keynote speech at Wol 2020, planned)
- Head of Astrophysics: Prof. Rob Fender
- Dept. of Physics: Prof. Ian Shipsey
- Astrophysics: Prof. H. Falche, et al.
- Merton College Warden: Prof. Irene Tracy, Fellow: Dr. Peter Braam, Sub Warden: Prof. Judy Armitage
- CS: Prof. Jeremy Gibbons

![](_page_9_Picture_3.jpeg)

Choral Evensong, 750<sup>th</sup> Anniversary Room

## Waseda Open Innovation Valley Project

![](_page_10_Figure_1.jpeg)

## Waseda University Open Innovation Ecosystem

![](_page_11_Figure_1.jpeg)

![](_page_11_Picture_2.jpeg)

## **IEEE Computer Society**

![](_page_12_Picture_1.jpeg)

#### The first President from outside the USA and Canada in the 72-year history of IEEE CS

**Bjarne Stroustrup:** Morgan Stanley & Columbia Univ. **2018 IEEE Computer Society Computer Pioneer Award** IEEE COMPSAC2018 Keynote & Award Ceremony

![](_page_12_Picture_4.jpeg)

#### **IEEE CS Awards Ceremonies with CS President 2018**

![](_page_13_Picture_1.jpeg)

![](_page_13_Picture_2.jpeg)

June BoG Award Dinner with CS Award Winners and their Families, Phoenix

![](_page_13_Picture_4.jpeg)

Technical Achievement Award, in COMPSAC, Tokyo

![](_page_13_Picture_6.jpeg)

Computer Pioneer Award to Bjarne Stroustrup (C++) in COMPSAC, Tokyo

![](_page_13_Picture_8.jpeg)

B. Ramakrishna Rau Award in MICRO, Fukuoka

![](_page_13_Picture_10.jpeg)

Award Ceremony in SC (Super Computing 2018, with 13 thousand participants), Dallas

#### Cooperation with International Organizations in 2018

![](_page_14_Picture_1.jpeg)

**IPSJ Leaders**, March, **IPSJ Convention**, Tokyo

Japan (IPSJ), China(CCF), Korea(KIISE) in March, Waseda U., Tokyo

![](_page_14_Picture_4.jpeg)

**Okawa Foundation**, CS Japan **Chapter, Multicore STC &** Japanese Government Symp.

![](_page_14_Picture_6.jpeg)

**MoU with UN ITU** in AI for Good, May, Geneva

**CCF China National Computer Congress, Oct.**, Hangzhou

![](_page_14_Picture_10.jpeg)

MoU with Baidu, July, Green Comp. C., Tokyo

![](_page_14_Picture_12.jpeg)

![](_page_14_Picture_13.jpeg)

**IEEE CS China Office** moderated Tencent-Waseda Univ. Joint Symposium, Nov., Waseda U., Tokyo

**Russian Academy of Science: Russian Computer Science 70th** Anniversary, Nov., Moscow

## ACM/IEEE SC (SuperComputing) 19, Denver, Nov.17-22, 2019

![](_page_15_Picture_1.jpeg)

Cornell Univ. Prof. Steven Squyres: Mars Exploration, Caltech Dr. Katie Bouman: Visualization of Black Hole

# **Multicores for Performance and Low Power**

Power consumption is one of the biggest problems for performance scaling from smartphones to cloud servers and supercomputers ("K" more than 10MW).

![](_page_16_Figure_2.jpeg)

IEEE ISSCC08: Paper No. 4.5, M.ITO, ... and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"

Power ∝ Frequency \* Voltage<sup>2</sup> (Voltage ∝ Frequency)

▶ Power ∝ Frequency<sup>3</sup>

If <u>Frequency</u> is reduced to <u>1/4</u> (Ex. 4GHz→1GHz), Power is reduced to 1/64 and Performance falls to 1/4.

<<u>Multicores</u>> If <u>8 cores</u> are integrated on a chip, Power is still <u>1/8</u> and **<u>Performance</u>** becomes <u>2 times</u>.
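The arithmetic on this slide can be checked with a minimal sketch. It assumes the slide's model (Power ∝ Frequency<sup>3</sup>) and, as an added assumption, ideal parallel scaling across cores; the function names are illustrative only.

```python
def relative_power(freq_scale: float, cores: int = 1) -> float:
    """Power relative to one core at full frequency, assuming Power ∝ Frequency^3."""
    return cores * freq_scale ** 3


def relative_performance(freq_scale: float, cores: int = 1) -> float:
    """Performance relative to one core at full frequency.

    Assumes ideal scaling across cores, which real programs only approach.
    """
    return cores * freq_scale


# One core slowed from 4 GHz to 1 GHz (frequency scale 1/4).
print(relative_power(0.25), relative_performance(0.25))        # 0.015625 0.25
# Eight such cores on one chip.
print(relative_power(0.25, 8), relative_performance(0.25, 8))  # 0.125 2.0
```

Under these assumptions, eight cores at a quarter of the frequency use 1/8 of the power of one full-speed core while doubling the performance.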

![](_page_17_Figure_0.jpeg)

- A commercially available automatic parallelizing compiler gave no speedup on 64 cores against execution time on 1 core
- Execution with 128 cores was slower than with 1 core (0.9 times speedup)
- Advanced OSCAR parallelizing compiler gave 211 times speedup with 128 cores against execution time with 1 core using the commercial compiler
  - OSCAR compiler gave 2.1 times speedup on 1 core against the commercial compiler by global cache optimization

# **OSCAR Parallelizing Compiler**

## To improve effective performance, cost-performance and software productivity and reduce power

**Multigrain Parallelization**(LCPC1991,2001,04) coarse-grain parallelism among loops and subroutines (2000 on SMP), near fine grain parallelism among statements (1992) in addition to loop parallelism

### **Data Localization**

Automatic data management for distributed shared memory, cache and local memory (Local Memory 1995, 2016 on RP2,Cache2001,03) Software Coherent Control (2017)

### Data Transfer Overlapping(2016 partially)

Data transfer overlapping using Data Transfer Controllers (DMAs)

### **Power Reduction**

(2005 for Multicore, 2011 Multi-processes, 2013 on ARM)

**Reduction of consumed power by compiler control DVFS and Power gating with hardware supports.** 

![](_page_18_Figure_10.jpeg)

![](_page_19_Figure_0.jpeg)

#### Demo of NEDO Green Multicore Processor for Real Time Consumer Electronics at the Council for Science and Technology Policy on April 10, 2008

http://www8.cao.go.jp/cstp/gaiyo/honkaigi/74index.html

**Codesign of Compiler and Multiprocessor Architecture** since 1985

4 core multicore RP1 (2007), 8 core multicore RP2 (2008) and 15 core Heterogeneous multicore RPX (2010) developed in NEDO Projects with Hitachi and Renesas

|                | RP-1 (ISSCC2007 #5.3)           | RP-2 (ISSCC2008 #4.5)             | RP-X (ISSCC2010 #5.3)                  |
|----------------|---------------------------------|-----------------------------------|----------------------------------------|
| Process        | 90nm, 8-layer, triple-Vth, CMOS | 90nm, 8-layer, triple-Vth, CMOS   | 45nm, 8-layer, triple-Vth, CMOS        |
| Chip size      | 97.6 mm<sup>2</sup> (9.88 x 9.88 mm) | 104.8 mm<sup>2</sup> (10.61 x 9.88 mm) | 153.8 mm<sup>2</sup> (12.4 x 12.4 mm) |
| Supply voltage | 1.0V (internal), 1.8/3.3V (I/O) | 1.0-1.4V (internal), 1.8/3.3V (I/O) | 1.0-1.2V (internal), 1.2-3.3V (I/O)  |
| Performance    | 600MHz, 4.32 GIPS, 16.8 GFLOPS  | 600MHz, 8.64 GIPS, 33.6 GFLOPS    | 648MHz, 13.7 GIPS, 115 GOPS, 36.2 GFLOPS |

Power efficiency (32-bit equivalent), in slide order: 11.4 GOPS/W, 18.3 GOPS/W, 37.3 GOPS/W

Photos (1)-(4): scenes from the 74th Council for Science and Technology Policy meeting, April 10, 2008

**Prime Minister FUKUDA** is touching our multicore chip during execution.

# **Green Computing Systems R&D Center** Waseda University

**Established by Prof. Kasahara supported by METI (Mar. 2011)** 

<R & D Target>

Hardware, Software, Application for Super Low-Power Manycore

- More than 64 cores
- Natural air cooling (no fan): Cool, Compact, Clear, Quiet
- **Operational by Solar Panel**

<Industry, Government, Academia>

Hitachi, Fujitsu, NEC, Renesas, Olympus, Toyota, Denso, Mitsubishi, Toshiba, **OSCAR** Technology, etc.

**<Ripple Effect>**

- Low CO<sub>2</sub> (Carbon Dioxide) Emissions
- Creation of Value Added Products
  - Automobiles, Medical, IoT, Servers

![](_page_21_Picture_12.jpeg)

![](_page_21_Picture_13.jpeg)

![](_page_21_Picture_14.jpeg)

**Beside Subway Waseda Station,** Near Waseda Univ. Main Campus

# **Generation of Coarse Grain Tasks**

## Macro-tasks (MTs)

- Block of Pseudo Assignments (BPA): Basic Block (BB)
- Repetition Block (RB): natural loop
- Subroutine Block (SB): subroutine

![](_page_22_Figure_5.jpeg)

# Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)

![](_page_23_Figure_1.jpeg)

#### **Earliest Executable Conditions**

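| Macrotask No. | Earliest Executable Condition                            |
|---------------|----------------------------------------------------------|
| 1             |                                                          |
| 2             | 1<sub>2</sub>                                            |
| 3             | (1)<sub>3</sub>                                          |
| 4             | 2<sub>4</sub> OR (1)<sub>3</sub>                         |
| 5             | (4)<sub>5</sub> AND [ 2<sub>4</sub> OR (1)<sub>3</sub> ] |
| 6             | 3 OR (2)<sub>4</sub>                                     |
| 7             | 5 OR (4)<sub>6</sub>                                     |
| 8             | (2)<sub>4</sub> OR (1)<sub>3</sub>                       |
| 9             | (8)<sub>9</sub>                                          |
| 10            | (8)<sub>10</sub>                                         |
| 11            | 8<sub>9</sub> OR 8<sub>10</sub>                          |
| 12            | 11<sub>12</sub> AND [ 9 OR (8)<sub>10</sub> ]            |
| 13            | 11<sub>13</sub> OR 11<sub>12</sub>                       |
| 14            | (8)<sub>9</sub> OR (8)<sub>10</sub>                      |
| 15            | 2<sub>15</sub>                                           |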
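The conditions in this table are AND/OR expressions over events, so a scheduler can test them mechanically. The following is only a sketch of such a test, not the OSCAR implementation: each term of the table is treated as an opaque event label (written here with an underscore before the second number, for example `(2)_4`), and the tuple encoding and function names are hypothetical.

```python
def is_satisfied(cond, occurred):
    """Evaluate a condition against the set of events that have occurred.

    cond is either an event label (str) or a tuple ("AND" | "OR", sub1, sub2, ...).
    """
    if isinstance(cond, str):
        return cond in occurred
    op, *subs = cond
    results = [is_satisfied(sub, occurred) for sub in subs]
    if op == "AND":
        return all(results)
    if op == "OR":
        return any(results)
    raise ValueError(f"unknown operator: {op}")


# Rows 5 and 6 of the table, with event labels kept as opaque strings.
CONDITIONS = {
    5: ("AND", "(4)_5", ("OR", "2_4", "(1)_3")),
    6: ("OR", "3", "(2)_4"),
}


def ready_macrotasks(conditions, occurred):
    """Macro-tasks whose earliest executable condition currently holds."""
    return sorted(mt for mt, cond in conditions.items() if is_satisfied(cond, occurred))


print(ready_macrotasks(CONDITIONS, {"3"}))               # [6]
print(ready_macrotasks(CONDITIONS, {"(4)_5", "(1)_3"}))  # [5]
```

A dynamic scheduler would call such a check each time a new event is recorded and dispatch every macro-task that has just become ready.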

### Automatic processor assignment in 103.su2cor

• Using 14 processors

**Coarse grain parallelization within DO400** 

![](_page_25_Figure_3.jpeg)

### MTG of Su2cor-LOOPS-DO400

- Coarse grain parallelism PARA\_ALD = 4.3

![](_page_26_Figure_1.jpeg)

# **Data-Localization: Loop Aligned Decomposition**

- Decompose multiple loops (Doall and Seq) into CARs and LRs considering inter-loop data dependence.
  - Most data in LR can be passed through LM.
  - LR: Localizable Region, CAR: Commonly Accessed Region

![](_page_27_Figure_4.jpeg)

![](_page_28_Figure_0.jpeg)

# Inter-loop data dependence analysis in TLG

- Define exit-RB in TLG as Standard-Loop
- Find the iterations on which an iteration of the Standard-Loop is data dependent
  - e.g. the K<sub>th</sub> iteration of RB3 is data-dependent on the (K-1)<sub>th</sub> and K<sub>th</sub> iterations of RB2, and on the (K-1)<sub>th</sub>, K<sub>th</sub> and (K+1)<sub>th</sub> iterations of RB1
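One way to picture this analysis is to describe each inter-loop dependence as a set of iteration offsets and compose them along the chain of loops. The sketch below is illustrative only: the RB3-to-RB2 offsets come from the example above, while the RB2-to-RB1 offsets are an assumption chosen so that the composition reproduces the RB1 offsets stated in the example.

```python
def compose_offsets(first, second):
    """Offsets reached by following dependence `first`, then dependence `second`."""
    return sorted({a + b for a in first for b in second})


def dependent_iterations(k, offsets):
    """Iterations of the earlier loop that iteration k depends on."""
    return [k + o for o in offsets]


rb3_on_rb2 = [-1, 0]  # from the example: K of RB3 depends on K-1, K of RB2
rb2_on_rb1 = [0, 1]   # ASSUMPTION for illustration: K of RB2 depends on K, K+1 of RB1

rb3_on_rb1 = compose_offsets(rb3_on_rb2, rb2_on_rb1)
print(rb3_on_rb1)                            # [-1, 0, 1]
print(dependent_iterations(10, rb3_on_rb1))  # [9, 10, 11]
```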

![](_page_29_Figure_4.jpeg)

Example of TLG

# **Decomposition of RBs in TLG**

- Decompose GCIR into  $DGCIR^p(1 \le p \le n)$ 
  - n: (a multiple of) the number of PCs, DGCIR: Decomposed GCIR
- Generate CAR on which DGCIR<sup>p</sup>&DGCIR<sup>p+1</sup> are data-dep.
- Generate LR on which DGCIR<sup>p</sup> is data-dep.
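The decomposition steps above can be sketched for one earlier loop as follows. This is a simplified illustration, not the compiler's algorithm: it assumes the standard loop has iterations 1..N, that iteration K depends on iterations K-1, K and K+1 of the earlier loop, and that every chunk is longer than the dependence span; `decompose` and its parameters are hypothetical names.

```python
def decompose(n_iters, n_parts, offsets=(-1, 0, 1)):
    """Split iterations 1..n_iters into n_parts chunks and derive LRs and CARs.

    Returns (chunks, lrs, cars): the chunks of the standard loop, the iterations
    of the earlier loop needed by exactly one chunk (LR), and the iterations
    needed by two neighbouring chunks (CAR).
    """
    base, extra = divmod(n_iters, n_parts)
    chunks, start = [], 1
    for p in range(n_parts):
        size = base + (1 if p < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    # Iterations of the earlier loop that each chunk depends on.
    needed = [
        {k + o for k in chunk for o in offsets if 1 <= k + o <= n_iters}
        for chunk in chunks
    ]
    cars = [sorted(needed[p] & needed[p + 1]) for p in range(n_parts - 1)]
    shared = set().union(*cars)
    lrs = [sorted(s - shared) for s in needed]
    return chunks, lrs, cars


chunks, lrs, cars = decompose(12, 3)
print(cars)  # [[4, 5], [8, 9]]
print(lrs)   # [[1, 2, 3], [6, 7], [10, 11, 12]]
```

In this toy case each processor keeps its LR iterations in local memory, and only the two iterations in each CAR are needed by both neighbours.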

![](_page_30_Figure_5.jpeg)

![](_page_31_Figure_0.jpeg)

## Data Layout for Removing Line Conflict Misses by Array Dimension Padding

Declaration part of arrays in spec95 swim

- before padding: `PARAMETER (N1=513, N2=513)`
- after padding: `PARAMETER (N1=513, N2=544)`
- unchanged in both versions: `COMMON U(N1,N2), V(N1,N2), P(N1,N2), UNEW(N1,N2), VNEW(N1,N2), PNEW(N1,N2), UOLD(N1,N2), VOLD(N1,N2), POLD(N1,N2), CU(N1,N2), CV(N1,N2), Z(N1,N2), H(N1,N2)`

Figure labels: 4MB, 4MB, padding

Box: Access range of DLG0

## 110 Times Speedup against the Sequential Processing for GMS Earthquake Wave Propagation Simulation on Hitachi SR16000

### (Power7 Based 128 Core Linux SMP) (LCPC2015)

![](_page_33_Figure_2.jpeg)

## Performance on Multicore Server for Latest Cancer Treatment Using Heavy Particle (Proton, Carbon Ion)

327 times speedup on 144 cores

Hitachi 144cores SMP Blade Server BS500: Xeon E7-8890 V3(2.5GHz 18core/chip) x8 chip

![](_page_34_Figure_3.jpeg)

- Original sequential execution time of 2948 sec (50 minutes) using GCC was reduced to 9 sec with 144 cores (327.6 times speedup)
  - Reduction of treatment cost and reservation waiting period is expected

# Parallelization of 3D-FFT for New Magnetic Material Computation on Hitachi SR16000 Power7 CC-Numa Server

![](_page_35_Figure_1.jpeg)

### **OSCAR** optimization

Reducing the number of data transposes with interchange, code motion and loop fusion


## Engine Control by multicore with Denso

Although parallel processing of engine control on multicore has so far been very difficult, Denso and Waseda achieved a 1.95 times speedup on a 2-core V850 multicore processor.



Hard real-time automobile engine control by multicore using local memories

Millions of lines of C code consisting of conditional branches and basic blocks





#### **Macro Task Fusion for Static Task Scheduling**



## **3.1 Restructuring : Inline Expansion**

Inline expansion is effective

- To increase coarse grain parallelism
- Expands functions having inner parallelism
- Improves coarse grain parallelism



MTG before inline expansion

MTG after inline expansion

#### **3.2 Restructuring: Duplicating If-statements**

Duplicating if-statements is effective

- To increase coarse grain parallelism
- Duplicates fused tasks having inner parallelism



#### MTG of Crankshaft Program Using Inline Expansion and Duplicating If-statements



## **Evaluation of Crankshaft Program** with Multi-core Processors



- Attained 1.54 times speedup on RPX
  - The program has no loops, only many conditional branches and small basic blocks, which makes it difficult to parallelize
- This result shows the possibility of multi-core processors for engine control programs

# Infineon AURIX TC277

Abbreviations:

- PCACHE: Program Cache
- DCACHE: Data Cache
- DSPR: Data Scratch-Pad RAM
- PSPR: Program Scratch-Pad RAM
- BROM: Boot ROM
- PFlash: Program Flash
- DFlash: Data Flash (EEPROM)
- S: SRI Slave Interface
- M: SRI Master Interface



# Macrotask Graph, Dependence details and schedules





#### Automatic Parallelization of an Engine Control C Program with 400 thousand lines on AUTOSAR on 2 cores of Infineon AURIX TC277

- Original sequential execution time on 1 core: 145500 cycles
- Sequential execution time by OSCAR on 1 core: 29700 cycles
  - 4.9 times speedup on 1 core against original execution, by the OSCAR Compiler's automatic data allocation for local scratch-pad memory and flash memory modules
- 2 core execution by OSCAR Compiler: 16400 cycles
  - 1.81 times speedup with 2 cores against 1 core execution with OSCAR Compiler
  - 8.9 times speedup against original sequential execution (145500 / 16400 cycles)
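The speedup figures follow directly from the three cycle counts listed above. A small sketch of the arithmetic, with illustrative variable names:

```python
# Cycle counts taken from the list above.
original_1core = 145_500  # original sequential code on 1 core
oscar_1core = 29_700      # OSCAR-compiled code on 1 core
oscar_2core = 16_400      # OSCAR-compiled code on 2 cores


def speedup(before_cycles, after_cycles):
    """Speedup as the ratio of execution cycles before and after."""
    return before_cycles / after_cycles


print(round(speedup(original_1core, oscar_1core), 2))  # 4.9
print(round(speedup(oscar_1core, oscar_2core), 2))     # 1.81
print(round(speedup(original_1core, oscar_2core), 2))  # 8.87
```

The overall figure is the product of the single-core gain from data allocation and the gain from using the second core.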



MTG – 16ms

#### **OSCAR Compile Flow for Simulink Applications**



#### Speedups of MATLAB/Simulink Image Processing on Various 4core Multicores

(Intel Xeon, ARM Cortex A15 and Renesas SH4A)



Buoy Detection : http://www.mathworks.co.jp/matlabcentral/fileexchange/44706-buoy-detection-using-simulink

Color Edge Detection : http://www.mathworks.co.jp/matlabcentral/fileexchange/28114-fast-edges-of-a-color-image--actual-color--not-convertingto-grayscale-/

Vessel Detection : http://www.mathworks.co.jp/matlabcentral/fileexchange/24990-retinal-blood-vessel-extraction/

#### Automatic Parallelization Tool of MATLAB/Simulink: OSCAR Tech "OSCARator" https://www.oscartech.jp/en/

- OSCARator is a simulation accelerator for MATLAB/Simulink on multicore processors
  - based on "OSCAR Compiler" Automatic Parallelization Technology developed by Kasahara and Kimura Lab., Waseda University



#### Speedup of Simulink Models by OSCARator on 4 cores Intel Core i5 Processor

https://www.oscartech.jp/en/

#### 6.51 times speedup on 4 cores against 1 core MATLAB Accelerator Mode for VesselExtraction



(Compared with MATLAB Accelerator Mode Simulation)

#### Power Reduction by Power Supply, Clock Frequency and Voltage Control by OSCAR Compiler

- Shortest execution time mode
- Realtime processing mode with deadline constraints



## An Example of Machine Parameters for the Power Saving Scheme

- Functions of the multiprocessor
  - Frequency of each proc. is changed to several levels
  - Voltage is changed together with frequency
  - Each proc. can be powered on/off

| state          | FULL | MID   | LOW  | OFF |
|----------------|------|-------|------|-----|
| frequency      | 1    | 1/2   | 1/4  | 0   |
| voltage        | 1    | 0.87  | 0.71 | 0   |
| dynamic energy | 1    | 3 / 4 | 1/2  | 0   |
| static power   | 1    | 1     | 1    | 0   |
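The voltage and dynamic energy rows of this table are consistent with the usual CMOS approximation that dynamic energy per unit of work scales with Voltage<sup>2</sup>. The sketch below checks this; treating the table as derived from that relation is an assumption, and the dictionary layout and function names are illustrative.

```python
# Values copied from the table (OFF omitted because nothing executes there).
STATES = {
    "FULL": {"frequency": 1.0, "voltage": 1.0, "dynamic_energy": 1.0},
    "MID": {"frequency": 0.5, "voltage": 0.87, "dynamic_energy": 0.75},
    "LOW": {"frequency": 0.25, "voltage": 0.71, "dynamic_energy": 0.5},
}


def energy_from_voltage(voltage):
    """Relative dynamic energy per unit of work, assuming Energy ∝ Voltage^2."""
    return voltage ** 2


def power_from(frequency, voltage):
    """Relative dynamic power, assuming Power ∝ Frequency * Voltage^2."""
    return frequency * voltage ** 2


for name, s in STATES.items():
    estimate = energy_from_voltage(s["voltage"])
    power = power_from(s["frequency"], s["voltage"])
    print(f"{name}: V^2 = {estimate:.3f}, table = {s['dynamic_energy']}, power = {power:.3f}")
```

Lowering the state therefore saves dynamic energy for the same work, at the cost of a longer execution time.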

#### State transition overhead

Delay time [u.t.]

| state | FULL | MID | LOW | OFF |
|-------|------|-----|-----|-----|
| FULL  | 0    | 40k | 40k | 80k |
| MID   | 40k  | 0   | 40k | 80k |
| LOW   | 40k  | 40k | 0   | 80k |
| OFF   | 80k  | 80k | 80k | 0   |

Energy overhead [µJ]

| state | FULL | MID | LOW | OFF |
|-------|------|-----|-----|-----|
| FULL  | 0    | 20  | 20  | 40  |
| MID   | 20   | 0   | 20  | 40  |
| LOW   | 20   | 20  | 0   | 40  |
| OFF   | 40   | 40  | 40  | 0   |
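To illustrate how such parameters could drive a power-saving decision, the sketch below picks the state that minimizes dynamic energy for one task while still meeting a deadline. It is a simplified illustration and not the OSCAR scheduling algorithm: it assumes the task length is given in unit times at FULL, counts one transition from the current state, and ignores static power and the OFF state. All names are hypothetical.

```python
# Parameter values taken from the machine-parameter and overhead tables.
FREQUENCY = {"FULL": 1.0, "MID": 0.5, "LOW": 0.25}
DYNAMIC_ENERGY = {"FULL": 1.0, "MID": 0.75, "LOW": 0.5}
TRANSITION_DELAY_UT = 40_000  # any change among FULL, MID and LOW
TRANSITION_ENERGY_UJ = 20.0   # any change among FULL, MID and LOW


def pick_state(work_ut, deadline_ut, energy_at_full_uj, current="FULL"):
    """Return (state, energy_uj, time_ut) with minimum energy, or None if infeasible."""
    best = None
    for state, freq in FREQUENCY.items():
        changed = state != current
        time_ut = work_ut / freq + (TRANSITION_DELAY_UT if changed else 0)
        energy_uj = energy_at_full_uj * DYNAMIC_ENERGY[state] + (
            TRANSITION_ENERGY_UJ if changed else 0.0
        )
        if time_ut <= deadline_ut and (best is None or energy_uj < best[1]):
            best = (state, energy_uj, time_ut)
    return best


# A task of 1,000,000 u.t. at FULL that costs 1000 uJ there, with a loose deadline.
print(pick_state(1_000_000, 2_500_000, 1000.0))  # ('MID', 770.0, 2040000.0)
```

With a tighter deadline only FULL remains feasible, and with a looser one LOW becomes the cheapest choice.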

## **Power Reduction Scheduling**







Fig. 6. V/F control of applu(4proc.)

#### Low-Power Optimization with OSCAR API



### Speedup for H.264 and Optical Flow on ARM Cortex-A9 Android 3 cores by OSCAR Automatic Parallelization



#### Automatic Power Reduction on ARM CortexA9 with Android

http://www.youtube.com/channel/UCS43INYEIkC8i\_KIgFZYQBQ

ODROID X2: Samsung Exynos4412 Prime, ARM Cortex-A9 Quad core 1.7GHz~0.2GHz, used by Samsung's Galaxy S3



- Power for 3 cores was reduced to $1/5 \sim 1/7$ compared with execution without software power control
- Power for 3 cores was reduced to $1/2 \sim 1/3$ compared with ordinary 1-core execution

## Automatic Power Reduction on Intel Haswell H.264 decoder & Optical Flow (3 cores)



- Power for 3 cores was reduced to $1/3 \sim 1/4$ compared with execution without software power control
- Power for 3 cores was reduced to $2/5 \sim 1/3$ compared with ordinary 1-core execution

## **OSCAR Heterogeneous Multicore**



#### An Image of Static Schedule for Heterogeneous Multicore with Data Transfer Overlapping and Power Control





Power Reduction in a real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)



#### Automatic Power Reduction of OpenCV Face Detection on big.LITTLE ARM Processor



- Samsung Exynos 5422 Processor
  - 4x Cortex-A15 2.0GHz, 4x Cortex-A7 1.4GHz big.LITTLE Architecture
  - 2GB LPDDR3 RAM
  - Frequency can be changed per cluster

#### OSCAR API Ver. 2.0 for Homogeneous/Heterogeneous Multicores and Manycores

(LCPC2009 Homogeneous, 2010 Heterogeneous)

Specification: http://www.kasahara.cs.waseda.ac.jp/api/regist.php?lang=en&ver=2.1

#### List of Directives (22 directives)

- Parallel Execution API: parallel sections (\*), flush (\*), critical (\*), execution
- Memory Mapping API: threadprivate (\*), distributedshared, onchipshared
- Synchronization API: groupbarrier
- Data Transfer API: dma transfer, dma contiguous parameter, dma stride parameter, dma flag check, dma flag send
- Power Control API: fvcontrol, get fvstatus
- Timer API: get current time
- Accelerator: accelerator task entry
- Cache Control: cache writeback, cache selfinvalidate, complete memop, noncacheable, aligncache
- 2 hint directives for OSCAR compiler: accelerator task, oscar comment

Legend: (\*) from OpenMP; the slide also marks directives added from V2.0.

## Software Coherence Control Method on OSCAR Parallelizing Compiler

- Coarse grain task parallelization with earliest executable condition analysis (control and data dependency analysis to detect parallelism among coarse grain tasks).
- OSCAR compiler automatically controls coherence using the following simple program restructuring methods:
  - To cope with stale data problems:
    - Data synchronization by compilers
  - To cope with false sharing problems:
    - Data Alignment
    - Array Padding
    - Non-cacheable Buffer
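As a rough illustration of the alignment and padding idea, the sketch below checks whether the boundary between two cores' consecutive array chunks falls inside a cache line, and computes a padded chunk length that avoids it. It is a sketch under stated assumptions (the array starts on a line boundary and the line size is a multiple of the element size), not the compiler's method; the function names are hypothetical.

```python
def shares_line(chunk_len, elem_size, line_size):
    """True if two consecutive chunks of chunk_len elements share a cache line.

    Assumes the array itself starts on a cache-line boundary.
    """
    return (chunk_len * elem_size) % line_size != 0


def padded_length(chunk_len, elem_size, line_size):
    """Smallest length >= chunk_len whose byte size is a multiple of line_size.

    Assumes line_size is a multiple of elem_size.
    """
    per_line = line_size // elem_size
    return -(-chunk_len // per_line) * per_line  # ceiling division


# 100 four-byte elements per core with 32-byte cache lines.
print(shares_line(100, 4, 32))    # True: the chunk boundary is mid-line
print(padded_length(100, 4, 32))  # 104
print(shares_line(104, 4, 32))    # False after padding
```

Once each core's chunk ends on a line boundary, no line is written by two cores, so stale copies of a shared line cannot overwrite each other.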



MTG generated by earliest executable condition analysis

## 8 Core RP2 Chip Block Diagram



## Automatic Software Coherent Control for Manycores

#### Performance of Software Coherence Control by OSCAR Compiler on 8-core RP2



## **OSCAR Software Cache Coherent Control for NIOS and RISCV cores on FPGA**

3.57x Speedups for NIOS and 3.68x for RISCV using 4 cores for NPB CG



## **Automatic Local Memory Management Data Localization: Loop Aligned Decomposition**

- Decomposed loop into LRs and CARs
  - LR (Localizable Region): Data can be passed through LDM
  - CAR (Commonly Accessed Region): Data transfers are required among processors

**Single dimension Decomposition** 







## **Adjustable Blocks**

Handling a suitable block size for each application

- different from a fixed block size in cache
- each block can be divided into smaller blocks with integer divisible size to handle small arrays and scalar variables
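A minimal sketch of choosing a sub-block size for a variable is shown below. The slide only says that blocks are divided into integer-divisible sizes; repeated halving and the `min_size` limit are assumptions made for this illustration, and the function name is hypothetical.

```python
def sub_block_size(block_size, var_size, min_size=1):
    """Smallest sub-block obtained by repeated halving that still holds var_size bytes.

    ASSUMPTION: blocks are divided by powers of two, down to min_size.
    """
    if var_size > block_size:
        raise ValueError("variable does not fit in one block")
    size = block_size
    while size % 2 == 0 and size // 2 >= max(var_size, min_size):
        size //= 2
    return size


print(sub_block_size(16_384, 3_000))          # 4096
print(sub_block_size(16_384, 8, min_size=64)) # 64
```

Small arrays and scalars then occupy a fraction of a block instead of a whole one, leaving the remaining sub-blocks for other data.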



# **Block Replacement Policy**

#### Compiler Control Memory block Replacement

- using live, dead and reuse information of each variable from the scheduled result
- different from LRU in cache that does not use data dependence information

#### Block Eviction Priority Policy

- 1. (Dead) Variables that will not be accessed later in the program
- 2. Variables that are accessed only by other processor cores
- 3. Variables that will be later accessed by the current processor core
- 4. Variables that will immediately be accessed by the current processor core
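The eviction priority can be sketched as a simple ordering over the four categories. The category values mirror the numbered list above; how the compiler actually derives each variable's category from the schedule is not shown here, and the names `Use` and `choose_victim` are hypothetical.

```python
from enum import IntEnum


class Use(IntEnum):
    """Categories from the eviction priority list; lower values are evicted first."""
    DEAD = 1
    OTHER_CORES_ONLY = 2
    LATER_ON_THIS_CORE = 3
    IMMEDIATE_ON_THIS_CORE = 4


def choose_victim(blocks):
    """Pick the block to evict from a mapping of variable name -> Use category.

    Ties are broken by name so that the result is deterministic.
    """
    return min(blocks, key=lambda name: (blocks[name], name))


local_memory = {
    "a": Use.LATER_ON_THIS_CORE,
    "b": Use.DEAD,
    "c": Use.IMMEDIATE_ON_THIS_CORE,
    "d": Use.OTHER_CORES_ONLY,
}
print(choose_victim(local_memory))  # b
```

Unlike LRU, this choice depends on future accesses known from the scheduled result rather than on past accesses.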

#### Speedups by OSCAR Automatic Local Memory Management compared to Executions Utilizing Centralized Shared Memory on Embedded and Scientific Application on RP2 8core Multicore



Maximum of 20.44 times speedup on 8 cores using local memory against sequential execution using off-chip shared memory

#### Automatic Parallelization of JPEG-XR for Drinkable Inner Camera (Endo Capsule)

10 times more speedup is needed after parallelization for 128 cores of Power 7. Less than 35mW power consumption is required.

- TILEPro64




## **OSCAR Vector Multicore and Compiler for Embedded to Servers with OSCAR Technology**





# Future Multicore Products with Automatic Parallelizing Compiler

**Advanced medical systems** 



### **Next Generation Automobiles**

- Safer, more comfortable, energy efficient, environment friendly
- Cameras, radar, car2car communication and internet information integrated with brake, steering, engine and motor control

#### Smart phones



- From everyday recharging to less than once a week
- Solar powered operation in emergency conditions
- Keep health



#### Cancer treatment, Drinkable inner camera

- Emergency solar powered
- No cooling fan, no dust, clean and usable inside the OP room

#### Personal / Regional Supercomputers



Solar powered, with more than 100 times better power efficiency (FLOPS/W)

Regional Disaster Simulators saving lives from tornadoes, localized heavy rain, and fires with earthquakes

#### **Summary**

- Waseda University Green Computing Systems R&D Center, supported by METI, has been researching low-power, high-performance Green Multicore hardware, software and applications with industry including Hitachi, Fujitsu, NEC, Renesas, Denso, Toyota, Olympus and OSCAR Technology.
- OSCAR Automatic Parallelizing and Power Reducing Compiler has achieved speedup and/or power reduction of scientific applications including "Earthquake Wave Propagation", medical applications including "Cancer Treatment Using Carbon Ion" and "Drinkable Inner Camera", and industry applications including "Automobile Engine Control", "Smartphone" and "Wireless Communication Base Band Processing" on various multicores from different vendors including Intel, ARM, IBM, AMD, Qualcomm, Freescale, Renesas and Fujitsu.
- In automatic parallelization: 110 times speedup for "Earthquake Wave Propagation Simulation" on 128 cores of IBM Power 7 against 1 core, 55 times speedup for "Carbon Ion Radiotherapy Cancer Treatment" on 64 cores of IBM Power7, 1.95 times for "Automobile Engine Control" on Renesas 2 cores using SH4A or V850, and 55 times for "JPEG-XR Encoding for Capsule Inner Cameras" on the Tilera 64-core Tile64 manycore.
  - The compiler will be available on the market from OSCAR Technology.
- In <u>automatic power reduction</u>, the power consumed by real-time multimedia applications such as human face detection, H.264, MPEG-2 and optical flow was reduced to 1/2 or 1/3 using 3 cores of ARM Cortex A9 and Intel Haswell, and to 1/4 using 8 cores of Renesas SH4A, against ordinary single core execution.
- Local memory management for automobiles and software coherent control have been patented and already realized by the OSCAR compiler.