#### **OSCAR Codesigned Compiler and Multicore Architecture**



#### Prof. Hironori Kasahara, IEEE Life-Fellow, IPSJ Fellow Senior Executive Vice President (2018-2022), Waseda University




IEEE Computer Society President 2018; Board Member: The Academy of Engineering of Japan

URL: http://www.kasahara.cs.waseda.ac.jp/

1980 BS, 82 MS, 85 Ph.D., Dept. EE, Waseda Univ.; 1985 Visiting Scholar: U. of California, Berkeley; 1986 Assistant Prof., 1988 Associate Prof.; 1989-90 Research Scholar: U. of Illinois, Urbana-Champaign, Center for Supercomputing R&D; 1997 Prof.; 2004 Director, Advanced Multicore Research Institute; 2017 member: the Engineering Academy of Japan (2020- Board Mem) and the Science Council of Japan; **2018** IEEE Computer Society President; Senior Vice President, Waseda Univ. (2018 Nov.-2022 Sept.)

**AWARD:** 1987 IFAC World Congress Young Author Prize, 1997 IPSJ Sakai Special Research Award, 2005 STARC Academia-Industry Research Award, 2008 LSI of the Year Second Prize, 2008 Intel Asia Academic Forum Best Research Award, 2010 IEEE CS Golden Core Member Award, 2014 Minister of Edu., Sci. & Tech. Research Prize, 2015 IPSJ Fellow, 2017 IEEE Fellow, Eta Kappa Nu, 2019 Spirit of IEEE Computer Society Award, 2020 IPSJ Contribution Award, etc.

Reviewed Papers: 232, Invited Talks: 230, Granted Patents: 67 (Japan, US, GB, DE, China), Articles in News Papers, Web News, TV etc.: 697

**Committees in Societies and Government 287**

- IEEE Computer Society: President 2018, Executive Committee (2017-2019), BoG (2009-14), Strategic Planning Committee Chair 2018, Multicore STC Chair (2012-), Japan Chair (2005-07)
- **IPSJ** Chair: HG for Magazine. & J. Edit, Sig. on ARC.
- [METI/NEDO] Project Leaders: Multicore for Consumer Electronics, Advanced Parallelizing Compiler; Chair: Computer Strategy Committee
- Cabinet Office CSTP Supercomputer Strategic ICT PT, Japan Prize Selection Committees, etc.
- [MEXT] Info. Sci. & Tech. Committee, Supercomputers (Earth Simulator, HPCI Promo., Next Gen. Supercomputer K) Committees
- JST Moonshot Project G3 Robot & AI Vice Chair
- **[COCN]** Board Member in Council of Competitiveness Nippon, etc.

The 36th LCPC2023, October 11-13, 2023, Lexington, Kentucky, USA

#### Some papers written during and just after the Ph.D. course at Waseda U.

IEEE TRANSACTIONS ON COMPUTERS, VOL. C-33, NO. 11, NOVEMBER 1984


#### Practical Multiprocessor Scheduling Algorithms for Efficient Parallel Processing

HIRONORI KASAHARA, MEMBER, IEEE, AND SEINOSUKE NARITA, SENIOR MEMBER, IEEE

IEEE JOURNAL OF ROBOTICS AND AUTOMATION, VOL. RA-1, NO. 2, JUNE 1985

#### Parallel Processing of Robot-Arm Control Computation on a Multimicroprocessor System

HIRONORI KASAHARA, MEMBER, IEEE, AND SEINOSUKE NARITA, SENIOR MEMBER, IEEE

2nd International Conference on Supercomputing, Santa Clara, CA, USA, May 3-8, 1987

A PARALLEL PROCESSING SCHEME FOR THE SOLUTION OF SPARSE LINEAR EQUATIONS USING STATIC OPTIMAL-MULTIPROCESSOR-SCHEDULING ALGORITHMS

H. Kasahara, T. Fujii, H. Nakayama, S. Narita, and Leon O. Chua

 Dept. of Electrical Eng., Waseda University, Tokyo, 160, Japan
 Dept. of Electrical Eng. and Computer Sciences, University of California, Berkeley, CA 94720, U.S.A.

Copyright © IFAC 10th Triennial World Congress, Munich, FRG, 1987


#### PARALLEL PROCESSING OF ROBOT MOTION SIMULATION

#### H. Kasahara, H. Fujii and M. Iwata

Department of Electrical Engineering, Waseda University, 3–4–1 Ohkubo Shinjuku-ku, Tokyo 160, Japan






## The First Compiler Codesigned Multiprocessor OSCAR (Optimally Scheduled Advanced Multiprocessor) in 1987



AMD29325 32-bit Floating-point unit

H. Kasahara, "OSCAR Fortran Multigrain Compiler", Stanford University, hosted by Professor John L. Hennessy and Professor Monica Lam, May 15, 1995.




## **SIGMA-1**

**ETL**(Electrotechnical Laboratory), Ministry of International Trade and Industry, **Japan.** (AIST, METI)

**SIGMA-1** is a large-scale computer based on a fine-grain dataflow architecture, designed to demonstrate the feasibility of fine-grain dataflow computers for highly parallel computation compared with conventional von Neumann computers. The SIGMA-1 project was started in 1984, and the 128-processing-element (PE) system started working in 1988. SIGMA-1 was designed and built at the Electrotechnical Laboratory (ETL for short), Ministry of International Trade and Industry, Japan. SIGMA-1 is still the largest-scale dataflow computer built so far, and it achieved more than 100 Mflops as the maximum measured performance of the total system. As the language for SIGMA-1, ETL developed the **Dataflow-C language** as a subset of the C programming language. Dataflow-C is a single-assignment language whose compiler translates C-like source programs into highly parallel executable dataflow machine code (Fig. 1).

https://link.springer.com/referenceworkentry/10.1007/978-0-387-09766-4\_287



The SIGMA-1 Dataflow Computer

Toshitsugu Yuba, Kei Hiraki, Toshio Shimada, Satoshi Sekiguchi and Kenji Nishida ACM '87: Proceedings of the 1987 Fall Joint Computer Conference on Exploring technology: today and tomorrow December 1987, Pages 578–585





## 1993: Supercomputer VPP500 and Numerical Wind Tunnel (NWT), presented at ACM/IEEE SC '94, Washington, D.C., November 1994

## Mr. Hajime Miyoshi





#### Exterior of the NWT supercomputer









#### Commercial VPP5000 (Météo-France and others)



## **Cedar Supercomputer**

University of Illinois at Urbana-Champaign, CSRD (Center for Supercomputing R&D)

#### **Prof. David Kuck**



Hironori visited CSRD from 1989 to 1990.

2021 Nobel Prize in Physics: <u>Dr. Syukuro Manabe, Princeton University</u>, coupled atmosphere-ocean general circulation model

- Earth environmental simulation, such as global warming, El Niño, and plate movement, for all life on this planet (http://www.es.jamstec.go.jp/).
- Developed in Mar. 2002 by STA (MEXT) and NEC with a 400 M\$ investment under Dr. Miyoshi's direction.

(Dr. Miyoshi passed away in Nov. 2001. NWT, VPP500, SX6)

![](_page_7_Picture_4.jpeg)

**Earth Simulator** 

![](_page_7_Picture_6.jpeg)

Mr. Hajime Miyoshi

- 40 TFLOPS peak (40\*10<sup>12</sup>), 35.6 TFLOPS Linpack
- June 2002 Top 1: Cores: 5,120, Rmax: 35.86 TFlop/s, Rpeak: 40.96 TFlop/s, Power: 3.2 MW

![](_page_7_Picture_9.jpeg)

#### "K" Supercomputer by Riken, No.1 in TOP500, June 20 & Nov.2,2011

![](_page_8_Figure_1.jpeg)

## **OSCAR Parallelizing Compiler**

### To improve effective performance, cost-performance and software productivity and reduce power

**Multigrain Parallelization**(LCPC1991,2001,04) coarse-grain parallelism among loops and subroutines (2000 on SMP), near fine grain parallelism among statements (1992) in addition to loop parallelism

#### **Data Localization**

Automatic data management for distributed shared memory, cache and local memory (Local Memory 1995, 2016 on RP2,Cache2001,03) Software Coherent Control (2017)

#### Data Transfer Overlapping(2016 partially)

Data transfer overlapping using Data Transfer Controllers (DMAs)

#### **Power Reduction**

(2005 for Multicore, 2011 Multi-processes, 2013 on ARM)

Reduction of power consumption through compiler-controlled DVFS and power gating, with hardware support.

![](_page_9_Figure_10.jpeg)

## Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)

![](_page_10_Figure_1.jpeg)

![](_page_10_Figure_2.jpeg)

## Generation of Coarse Grain Tasks Macro-tasks (MTs)

- Block of Pseudo Assignments (BPA): Basic Block (BB)
- Repetition Block (RB) : natural loop
- Subroutine Block (SB): subroutine
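
The macro-task classes above become the nodes of a macro-task graph (MTG), which is then scheduled onto processors. As a minimal sketch only (not the OSCAR compiler's actual algorithm), the Python below list-schedules a hypothetical four-task graph onto two processors. It models data dependences only, whereas earliest executable condition analysis also takes control dependences into account. The task names, costs, and the `list_schedule` helper are assumptions made for illustration.

```python
from functools import lru_cache

# Hypothetical macro-task graph: name -> (cost, predecessor names).
MTG = {
    "MT1": (2, []),
    "MT2": (3, ["MT1"]),
    "MT3": (4, ["MT1"]),
    "MT4": (1, ["MT2", "MT3"]),
}

def list_schedule(mtg, num_procs):
    """Greedy list scheduling. Returns {task: (processor, start, finish)}."""
    succs = {t: [] for t in mtg}
    for task, (_, preds) in mtg.items():
        for pred in preds:
            succs[pred].append(task)

    @lru_cache(maxsize=None)
    def path_len(task):
        # Longest cost path from this task to an exit task (its priority).
        return mtg[task][0] + max((path_len(s) for s in succs[task]), default=0)

    proc_free = [0] * num_procs
    sched = {}
    while len(sched) < len(mtg):
        # A task is ready once all of its predecessors have been scheduled.
        ready = [t for t in mtg
                 if t not in sched and all(p in sched for p in mtg[t][1])]
        task = max(ready, key=lambda t: (path_len(t), t))
        cost, preds = mtg[task]
        data_ready = max((sched[p][2] for p in preds), default=0)
        # Choose the processor on which the task can start earliest.
        proc = min(range(num_procs), key=lambda i: max(proc_free[i], data_ready))
        start = max(proc_free[proc], data_ready)
        sched[task] = (proc, start, start + cost)
        proc_free[proc] = start + cost
    return sched

schedule = list_schedule(MTG, num_procs=2)
makespan = max(finish for _, _, finish in schedule.values())
print(schedule, makespan)
```

In this toy graph MT2 and MT3 run in parallel after MT1, so the schedule length equals the critical path rather than the sum of all task costs.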

![](_page_11_Figure_4.jpeg)

## MTG of Su2cor-LOOPS-DO400 Coarse grain parallelism PARA\_ALD = 4.3

![](_page_12_Figure_1.jpeg)

![](_page_13_Figure_0.jpeg)

## Performance of APC Compiler on IBM pSeries690 16 Processors High-end Server

- IBM XL Fortran for AIX Version 8.1
  - Sequential execution: -O5 -qarch=pwr4
  - Automatic loop parallelization: -O5 -qsmp=auto -qarch=pwr4
  - OSCAR compiler: -O5 -qsmp=noauto -qarch=pwr4 (su2cor: -O4 -qstrict)

![](_page_14_Figure_6.jpeg)

## Performance of Multigrain Parallel Processing for 102.swim on IBM pSeries690

![](_page_15_Figure_1.jpeg)

#### **NEC/ARM MPCore Embedded 4 core SMP**

![](_page_16_Figure_1.jpeg)

3.48 times speedup by OSCAR compiler against sequential processing


![](_page_17_Figure_0.jpeg)

OSCAR compiler gave us 2.32 times speedup against Intel Fortran Itanium Compiler revision 10.1

![](_page_18_Figure_0.jpeg)

## **Decomposition of RBs in TLG**

- Decompose GCIR into $DGCIR^p$ $(1 \le p \le n)$
  - $n$: (multiple) number of PCs, DGCIR: Decomposed GCIR
- Generate CAR on which $DGCIR^p$ and $DGCIR^{p+1}$ are data-dependent.
- Generate LR on which $DGCIR^p$ is data-dependent.

![](_page_19_Figure_5.jpeg)

#### An Example of Data Localization for Spec95 Swim

![](_page_20_Figure_1.jpeg)

(a) An example of target loop group for data localization

## Data Layout for Removing Line Conflict Misses by Array Dimension Padding Declaration part of arrays in spec95 swim

#### before padding

after padding

![](_page_21_Figure_3.jpeg)
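
Why padding an array dimension removes line conflict misses can be seen with a small address calculation. The sketch below assumes a hypothetical direct-mapped 1 MiB cache with 64-byte lines and three 512x512 arrays of 4-byte elements declared back to back; these sizes are illustrative assumptions, not the actual swim declarations or the cache of any machine in this talk.

```python
CACHE_BYTES = 1 << 20   # hypothetical 1 MiB direct-mapped cache
LINE_BYTES = 64         # hypothetical cache line size
ELEM_BYTES = 4          # 4-byte REAL elements

def line_index(byte_addr):
    """Cache line slot that a byte address maps to in a direct-mapped cache."""
    return (byte_addr % CACHE_BYTES) // LINE_BYTES

def base_addresses(rows, cols, num_arrays):
    """Start addresses of arrays laid out back to back in memory."""
    size = rows * cols * ELEM_BYTES
    return [k * size for k in range(num_arrays)]

# Before padding: each array is exactly one cache size, so the same element
# of every array maps to the same line and the arrays evict each other.
before = [line_index(b) for b in base_addresses(512, 512, 3)]

# After padding one dimension by 16 elements, the array bases shift apart.
after = [line_index(b) for b in base_addresses(512, 528, 3)]

print(before)
print(after)
```

With these assumed sizes, the unpadded arrays all start on line 0, while the padded arrays start on distinct lines, so corresponding elements no longer compete for the same slot.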

## 8 Core RP2 Chip Block Diagram

![](_page_22_Figure_1.jpeg)

#### Demo of NEDO Green Multicore Processor for Real Time Consumer Electronics at Council of Science and Engineering Policy on April 10, 2008

http://www8.cao.go.jp/cstp/gaiyo/honkaigi/74index.html

![](_page_23_Figure_2.jpeg)

**Prime Minister** FUKUDA is touching our multicore chip during execution.

![](_page_24_Figure_0.jpeg)

## Power Reduction by Power Supply, Clock Frequency and Voltage Control by OSCAR Compiler

- Shortest execution time mode

![](_page_25_Figure_2.jpeg)

(Figure axes: time, with the deadline marked.)

## An Example of Machine Parameters for the Power Saving Scheme

- Functions of the multiprocessor
  - Frequency of each proc. is changed to several levels
  - Voltage is changed together with frequency
  - Each proc. can be powered on/off

| state          | FULL | MID  | LOW  | OFF |
|----------------|------|------|------|-----|
| frequency      | 1    | 1/2  | 1/4  | 0   |
| voltage        | 1    | 0.87 | 0.71 | 0   |
| dynamic energy | 1    | 3/4  | 1/2  | 0   |
| static power   | 1    | 1    | 1    | 0   |
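
The dynamic energy row is consistent with the common CMOS approximation that dynamic energy per cycle scales with the square of the supply voltage. The check below is a sketch under that assumption (the slide itself does not state the formula); the voltages are copied from the table and the `dynamic_energy` helper is hypothetical.

```python
# Relative supply voltage per state, copied from the table above.
VOLTAGE = {"FULL": 1.0, "MID": 0.87, "LOW": 0.71, "OFF": 0.0}

def dynamic_energy(state):
    """Relative dynamic energy per cycle under the assumed E ~ V^2 model."""
    return VOLTAGE[state] ** 2

for state in VOLTAGE:
    print(state, round(dynamic_energy(state), 2))
```

Under this model MID comes out close to 3/4 and LOW close to 1/2, matching the table.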

#### State transition overhead

Delay time [u.t.]

| state | FULL | MID | LOW | OFF |
|-------|------|-----|-----|-----|
| FULL  | 0    | 40k | 40k | 80k |
| MID   | 40k  | 0   | 40k | 80k |
| LOW   | 40k  | 40k | 0   | 80k |
| OFF   | 80k  | 80k | 80k | 0   |

Energy overhead [µJ]

| state | FULL | MID | LOW | OFF |
|-------|------|-----|-----|-----|
| FULL  | 0    | 20  | 20  | 40  |
| MID   | 20   | 0   | 20  | 40  |
| LOW   | 20   | 20  | 0   | 40  |
| OFF   | 40   | 40  | 40  | 0   |
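
The transition delays matter when deciding whether slowing a processor down still meets a deadline. The sketch below is a simplified illustration, not the compiler's scheduling algorithm: it picks the lowest-frequency state whose transition delay plus execution time fits the deadline. It assumes that one unit time equals one clock cycle at FULL frequency, and it ignores the energy overhead and static power; the `slowest_feasible_state` helper is hypothetical.

```python
from fractions import Fraction

# Relative frequency per state, from the machine parameter table.
FREQ = {"FULL": Fraction(1), "MID": Fraction(1, 2), "LOW": Fraction(1, 4)}

# Transition delay in unit times [u.t.], from the state transition table.
DELAY = {
    "FULL": {"FULL": 0, "MID": 40_000, "LOW": 40_000},
    "MID": {"FULL": 40_000, "MID": 0, "LOW": 40_000},
    "LOW": {"FULL": 40_000, "MID": 40_000, "LOW": 0},
}

def slowest_feasible_state(current, cycles, deadline):
    """Pick the lowest-frequency state that still meets the deadline.

    Assumes one u.t. equals one clock cycle at FULL frequency.
    Returns None when even FULL misses the deadline.
    """
    for state in ("LOW", "MID", "FULL"):
        finish = DELAY[current][state] + cycles / FREQ[state]
        if finish <= deadline:
            return state
    return None

print(slowest_feasible_state("FULL", 100_000, 500_000))
print(slowest_feasible_state("FULL", 100_000, 300_000))
print(slowest_feasible_state("FULL", 100_000, 150_000))
```

A fuller model would also weigh the energy overhead of each transition against the energy saved at the lower voltage.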

## **Power Reduction Scheduling**

![](_page_27_Figure_1.jpeg)

![](_page_27_Figure_2.jpeg)

![](_page_27_Figure_3.jpeg)

Fig. 6. V/F control of applu(4proc.)

## Low-Power Optimization with OSCAR API

![](_page_28_Figure_1.jpeg)

#### Multicore Program Development Using OSCAR API V2.0

![](_page_29_Figure_1.jpeg)

#### https://www.computer.org/product/education/multi-core-video-lectures-bundle

#### Created by Multicore STC Chair Hironori Kasahara

## Multi-core Video Lecture Series

Multi-Core Lecture Series consists of 11 one-hour lectures by some of the world's leading researchers in the field. This series is not a course; it consists of presentations for those who are in the research field, and it is intended more for research information sharing than for educational training. Topics covered during these lectures are listed below. The series also includes an hour-long discussion among the lecturers.

#### Video Presentations:

- Automatic Parallelization by David Padua
- Autoparallelization for GPUs by Wen-Mei Hwu
- Dependences and Dependence Analysis by Utpal Banerjee
- Dynamic Parallelization by Rudolf Eigenmann
- Instruction Level Parallelization by Alexandru Nicolau
- Multigrain Parallelization and Power Reduction by Hironori Kasahara
- The Polyhedral Model by Paul Feautrier
- Vector Computation by David Kuck
- Vectorization by P. Sadayappan
- Vectorization/Parallelization in the IBM Compiler by Yaoqing Gao
- Vectorization/Parallelization in the Intel Compiler by Peng Tu
- Roundtable Discussion by all presenters


#### Multi-core Roundtable Discussion Video Lecture


![](_page_30_Picture_20.jpeg)

#### Dependences and Dependence Analysis Video Lecture


![](_page_30_Picture_23.jpeg)

Dependences and Dependence Analysis by Utpal Banerjee. Utpal Banerjee's research interests in computer science are in the general area of parallel processing. He has published four books on loop transformations and dependence analysis, with a fifth one on instruction level parallelism on the way.

#### Multigrain Parallelization and Power Reduction Video Lecture

![](_page_30_Picture_26.jpeg)

Multigrain Parallelization and Power Reduction by Hironori Kasahara. Professor Kasahara has been researching the OSCAR Automatic Parallelizing and Power Reducing Compiler and the OSCAR Multicore architecture for more than 30 years, and has led four Japanese national projects on parallelizing compilers, multicores, and green computing.

![](_page_30_Picture_28.jpeg)

#### **IEEE Computer Society** The first President from outside North America in the 72-year history of IEEE CS

![](_page_31_Picture_1.jpeg)

#### ACM/IEEE SC (SuperComputing) 19, Denver, Nov.17-22, 2019

![](_page_32_Picture_1.jpeg)

Cornell Univ. Prof. Steven Squyres: Mars Exploration; Caltech Dr. Katie Bouman: Visualization of Black Hole

#### Intel Xeon E5-2650v4 – Benchmark results on up to 8 cores

- x86-64 based architecture, 12 cores, 2.2 GHz - 2.9 GHz
- 30 MiB shared L3 cache
- L3 cache: shared by all cores
- Speedup relative to sequential version
- gcc as backend

- NAS parallel benchmark suite
  - BT: Block Tri-diagonal solver
  - CG: Conjugate Gradient computation
  - SP: Scalar Penta-diagonal solver
- SPEC2000
  - art: Image recognition / Neural networks
  - equake: Seismic wave propagation simulation
  - swim: Shallow water modelling Fortran 77
- MediaBench II: MPEG2 encoding

![](_page_33_Figure_15.jpeg)

#### AMD EPYC 7702P – Benchmark results on up to 8 cores

![](_page_34_Figure_1.jpeg)

- 16 MiB L3 cache per 4-core cluster, shared within the cluster
- Speedup relative to sequential version

- NAS parallel benchmark suite
  - BT: Block Tri-diagonal solver
  - CG: Conjugate Gradient computation
  - SP: Scalar Penta-diagonal solver
- SPEC2000
  - art: Image recognition / Neural networks
  - equake: Seismic wave propagation simulation
  - swim: Shallow water modelling Fortran 77

![](_page_34_Figure_6.jpeg)

#### NVIDIA Carmel ARMv8.2 – Benchmark results on up to 4 cores

- Arm v8.2 based architecture, 6 cores, 1.4 GHz
- 4 MiB shared L3 cache
- L3 cache: shared across all cores
- Speedup relative to sequential version
- gcc as backend

- NAS parallel benchmark suite
  - BT: Block Tri-diagonal solver
  - CG: Conjugate Gradient computation
  - SP: Scalar Penta-diagonal solver
- SPEC2000
  - art: Image recognition / Neural networks
  - equake: Seismic wave propagation simulation
  - swim: Shallow water modelling Fortran 77
- MediaBench II: MPEG2 encoding

![](_page_35_Figure_14.jpeg)

overall good speedup is observed

equake: seq.: 19.0 s, 4 core OSCAR: 7.18 s

#### SiFive Freedom U740 – Benchmark results on up to 4 cores

- RISC-V based architecture, 4 cores, 1.2 GHz
- 2 MiB shared L2 cache
- L2 cache: shared across all cores
- Speedup relative to sequential version
- gcc as backend

- NAS parallel benchmark suite
  - BT: Block Tri-diagonal solver
  - CG: Conjugate Gradient computation
  - SP: Scalar Penta-diagonal solver
- SPEC2000
  - art: Image recognition / Neural networks
  - equake: Seismic wave propagation simulation
  - swim: Shallow water modelling Fortran 77
- MediaBench II: MPEG2 encoding

![](_page_36_Figure_14.jpeg)

overall good speedup is observed, swim superlinear

BT: seq.: 2041 s, 4 core OSCAR: 551 s
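
The timings quoted on the Carmel and U740 slides translate into speedup and parallel efficiency as follows. This is plain arithmetic on the numbers given (equake: 19.0 s sequential versus 7.18 s on 4 cores; BT: 2041 s versus 551 s on 4 cores); the helper names are illustrative.

```python
def speedup(seq_seconds, par_seconds):
    """Sequential time divided by parallel time."""
    return seq_seconds / par_seconds

def efficiency(seq_seconds, par_seconds, cores):
    """Speedup per core; 1.0 would be ideal linear scaling."""
    return speedup(seq_seconds, par_seconds) / cores

equake_carmel = speedup(19.0, 7.18)   # NVIDIA Carmel, 4 cores
bt_u740 = speedup(2041, 551)          # SiFive Freedom U740, 4 cores

print(round(equake_carmel, 2), round(efficiency(19.0, 7.18, 4), 2))
print(round(bt_u740, 2), round(efficiency(2041, 551, 4), 2))
```

An efficiency above 1.0 on the same scale would indicate the superlinear behavior noted for swim.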

## **Heterogeneous Multicore Architecture** targeted by OSCAR API

![](_page_37_Figure_1.jpeg)

#### OSCAR API Ver. 2.0 for Homogeneous/Heterogeneous Multicores and Manycores

#### (LCPC2009 Homogeneous, 2010 Heterogeneous)

Specification: http://www.kasahara.cs.waseda.ac.jp/api/regist.php?lang=en&ver=2.1

## List of Directives (22 directives)

- Parallel Execution API
  - parallel sections (\*)
  - flush (\*)
  - critical (\*)
  - execution
- Memory Mapping API
  - threadprivate (\*)
  - distributedshared
  - onchipshared
- Synchronization API
  - groupbarrier
- Data Transfer API
  - dma\_transfer
  - dma\_contiguous\_parameter
  - dma\_stride\_parameter
  - dma\_flag\_check
  - dma\_flag\_send
- Power Control API
  - fvcontrol
  - get\_fvstatus
- Timer API
  - get\_current\_time
- Accelerator
  - accelerator\_task\_entry
- Cache Control
  - cache\_writeback
  - cache\_selfinvalidate
  - complete\_memop
  - noncacheable
  - aligncache

(\* from OpenMP)

2 hint directives for OSCAR compiler

- accelerator\_task
- oscar\_comment

from V2.0

#### An Image of Static Schedule for Heterogeneous Multicore with Data Transfer Overlapping and Power Control

![](_page_39_Figure_1.jpeg)

#### Speedups & Power Reduction on RP-X Heterogeneous Multicore with 8 CPUs and 4 DRPs

![](_page_40_Figure_1.jpeg)

(Optical Flow with a hand-tuned library)

## Power Reduction in a real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)

![](_page_40_Figure_4.jpeg)

## **Green Computing Systems R&D Center** Waseda University

Established by Prof. Kasahara supported by METI (Mar. 2011)

< R & D Target>

Hardware, Software, Application for Super Low-Power Manycore

- More than 64 cores
- Natural air cooling (no fan): Cool, Compact, Clear, Quiet
- Operational by solar panel

<Industry, Government, Academia> Hitachi, Fujitsu, NEC, Renesas, Olympus, Toyota, Denso, Mitsubishi, Toshiba, **OSCAR** Technology, etc.

<Ripple Effect>

- Low CO<sub>2</sub> (carbon dioxide) emissions
- Creation of value-added products
  - Automobiles, Medical, IoT, Servers

![](_page_41_Picture_9.jpeg)

Fujitsu M9000 SPARC VII 256 core SMP

![](_page_41_Picture_11.jpeg)

**Beside Subway Waseda Station**, Near Waseda Univ. Main Campus

<u>The 25<sup>th</sup> International Workshop on Languages and Compilers for</u> <u>Parallel Computing (LCPC2012), September 11-13, 2012</u> <u>at Waseda U. Green Computing Center</u>

![](_page_42_Picture_1.jpeg)

#### A Strategic Initiative of Computing: Systems and Applications (SISA)- Integrating HPC, Big Data, AI and Beyond, Jan. 18-19, 2017


**Opening:** Prof. Gao, Prof. Kasahara, Waseda VP Shuji Hashimoto

#### I. Architecture and Applications

Keynote: William J. Dally, NVIDIA and Stanford University, USA

- Depei Qian, BUAA, China
- Kimihiko Hirao, RIKEN, Japan
- G. W. Yang, Tsinghua Univ., China
- J. Sexton, IBM, USA

#### II. System Software and Applications

Keynote: Rick Stevens, ANL, USA

- Mikhail Smelyanskiy, Intel, USA
- Fred Streitz, LLNL, USA
- R. Govind, IIS, India
- Hironori Kasahara, Waseda Univ.

#### III. Extreme Scale and Beyond

Keynote: Paul Messina, ANL, USA

- Motoaki Saito, PEZY, Japan
- Eiji Ishida, MEXT, Japan
- Toshiyuki Shimizu, Fujitsu, Japan

#### IV. Integration of HPC, Big Data, and AI

Keynote: Thomas Sterling, Indiana Univ., USA

- Masaru Kitsuregawa, NII and Univ. of Tokyo, Japan
- Thomas Schulthess, ETH, Switzerland
- Moriyuki Takamura/Toshiaki Kitamura, Oscar Tech, Japan

![](_page_43_Picture_21.jpeg)

![](_page_43_Picture_22.jpeg)

![](_page_43_Picture_23.jpeg)

## **10** Times Speedup against the Sequential Processing for GMS Earthquake Wave Propagation Simulation on Hitachi SR16000

(Power7 Based 128 Core Linux SMP) (LCPC2015)

![](_page_44_Figure_2.jpeg)

![](_page_45_Figure_0.jpeg)

- The automatic parallelizing compiler available on the market gave no speedup on 64 cores against execution time on 1 core.
  - Execution time with 128 cores was slower than with 1 core (0.9 times speedup).
- The advanced OSCAR parallelizing compiler gave 211 times speedup with 128 cores against execution time with 1 core using the commercial compiler.
  - The OSCAR compiler gave 2.1 times speedup on 1 core against the commercial compiler through global cache optimization.

## Automatic Parallelization of JPEG-XR for Drinkable Inner Camera (Endo Capsule) on Tilera

![](_page_46_Figure_1.jpeg)

## Performance on Multicore Server for Latest Cancer Treatment Using Heavy Particle (Proton, Carbon Ion) 327 times speedup on 144 cores

Hitachi 144cores SMP Blade Server BS500: Xeon E7-8890 V3(2.5GHz 18core/chip) x8 chip

![](_page_47_Figure_2.jpeg)

Original sequential execution time 2948 sec (50 minutes) using GCC was reduced to 9 sec with 144 cores(327.6 times speedup)

Reduction of treatment cost and reservation waiting period is expected

Engine control computation for a Japanese passenger car achieved a 1.95 times speedup on a Denso 2-core ECU. (Mikami, Umeda)

![](_page_48_Picture_1.jpeg)

Although parallel processing of engine control on multicore has been very difficult, Denso and Waseda achieved a 1.95 times speedup on a 2-core V850 multicore processor.

![](_page_48_Figure_3.jpeg)

Engine control computation for a European agricultural work vehicle achieved an 8.7 times speedup on an Infineon 2-core processor.

Automatic Parallelization of an Engine Control C Program with 400 thousands lines on AUTOSAR on 2 cores of Infineon AURIX TC277

## **Macro Task Fusion for Static Task Scheduling**

![](_page_49_Figure_1.jpeg)

MFG of sample program before macro task fusion

MFG of sample program after macro task fusion

MTG of sample program after macro task fusion

## Speedups of NPB/CG Scientific Code by OSCAR Compiler on NEC SX-Aurora TSUBASA A100-1 8 cores 10C VE <u>57 times speedup</u> for <u>8 vector cores</u> by OSCAR Parallelization & NEC Vectorization against NEC 1 core Vectorization

![](_page_50_Figure_1.jpeg)

https://www.hpc.nec/documents/guide/pdfs/AuroraVE TuningGuide.pdf

## 8 Core RP2 Chip Block Diagram

![](_page_51_Figure_1.jpeg)

## Software Coherence Control Method on OSCAR Parallelizing Compiler

- Coarse grain task parallelization with earliest executable condition analysis (control and data dependency analysis to detect parallelism among coarse grain tasks).
- The OSCAR compiler automatically controls coherence using the following simple program restructuring methods:
  - To cope with stale data problems:
    - Data synchronization by the compiler
  - To cope with the false sharing problem:
    - Data alignment
    - Array padding
    - Non-cacheable buffer
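
How alignment and padding avoid false sharing can be illustrated by counting the cache lines written by more than one core. The sketch below is an assumption-laden toy model (64-byte lines, each core writing one contiguous chunk), not the compiler's implementation; `shared_lines` and `padded_size` are hypothetical helpers.

```python
LINE_BYTES = 64  # hypothetical cache line size

def lines_touched(start_byte, num_bytes):
    """Set of cache line numbers covered by a byte range."""
    first = start_byte // LINE_BYTES
    last = (start_byte + num_bytes - 1) // LINE_BYTES
    return set(range(first, last + 1))

def shared_lines(chunk_bytes, stride_bytes, num_cores):
    """Lines written by more than one core.

    Core k writes chunk_bytes starting at byte offset k * stride_bytes.
    """
    owners = {}
    for core in range(num_cores):
        for line in lines_touched(core * stride_bytes, chunk_bytes):
            owners.setdefault(line, set()).add(core)
    return {line for line, cores in owners.items() if len(cores) > 1}

def padded_size(chunk_bytes):
    """Round a chunk up to a whole number of cache lines."""
    return -(-chunk_bytes // LINE_BYTES) * LINE_BYTES

# Chunks packed back to back: neighbouring cores share boundary lines.
print(shared_lines(100, 100, 4))
# Chunks padded and aligned to line boundaries: no line has two writers.
print(shared_lines(100, padded_size(100), 4))
```

Without hardware coherence, a line with two writers can lose updates on write-back, which is why the restructuring removes such lines or places the data in a non-cacheable buffer.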

![](_page_52_Figure_9.jpeg)

MTG generated by earliest executable condition analysis

## Automatic Software Coherent Control for Manycores

## Preliminary Performance of Software Coherence Control by OSCAR Compiler on 8-core RP2

![](_page_53_Figure_2.jpeg)

## **OSCAR Software Cache Coherent Control for NIOS and RISCV cores on FPGA**

3.57x Speedups for NIOS and 3.68x for RISCV using 4 cores for NPB CG

![](_page_54_Figure_2.jpeg)

## **Automatic Local Memory Management Data Localization: Loop Aligned Decomposition**

- **Decomposed loop into LRs and CARs** 
  - LR (Localizable Region): Data can be passed through LDM
  - CAR (Commonly Accessed Region): Data transfers are required among processors
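
The LR/CAR split can be pictured with a simplified one-dimensional model. The sketch below is not the compiler's inter-loop dependence analysis: it assumes an index range divided evenly among processors, with a fixed halo width around each partition boundary standing in for the data that neighbouring partitions both access. The `decompose` helper and the `halo` parameter are hypothetical.

```python
def decompose(num_elems, num_procs, halo):
    """Split indices 0..num_elems-1 into per-processor LRs and boundary CARs.

    Returns (lrs, cars): lrs[p] is the range local to processor p, and
    cars[p] is the range around the boundary between processors p and p+1.
    Assumes num_elems is divisible by num_procs.
    """
    chunk = num_elems // num_procs
    lrs, cars = [], []
    for p in range(num_procs):
        lo = p * chunk + (halo if p > 0 else 0)
        hi = (p + 1) * chunk - (halo if p < num_procs - 1 else 0)
        lrs.append(range(lo, hi))
        if p < num_procs - 1:
            cars.append(range(hi, hi + 2 * halo))
    return lrs, cars

lrs, cars = decompose(num_elems=16, num_procs=4, halo=1)
print([list(r) for r in lrs])
print([list(r) for r in cars])
```

Data in each LR can stay in one processor's local memory across the loops, while only the small CARs need transfers among processors.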

**Single dimension Decomposition** 

![](_page_55_Figure_6.jpeg)

**Multi-dimension Decomposition** 

![](_page_55_Figure_8.jpeg)

## Adjustable Blocks managed by Compiler

- <u>A suitable block size for each application</u>
  - different from a fixed block size in cache
  - each block can be divided into smaller blocks with integer divisible size to handle small arrays and scalar variables
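
One way to picture adjustable blocks is to keep halving a block until it is just large enough for the variable. The sketch below assumes power-of-two subdivision and a minimum sub-block size, neither of which is specified on the slide; `pick_block_size` is a hypothetical helper.

```python
def pick_block_size(var_bytes, block_bytes, min_bytes):
    """Smallest block obtained by repeated halving that still holds the variable.

    Returns None when the variable is larger than a full block; splitting
    it across several blocks is not modeled here.
    """
    if var_bytes > block_bytes:
        return None
    size = block_bytes
    # Halve while the half-size block still fits the variable and the minimum.
    while size % 2 == 0 and size // 2 >= max(var_bytes, min_bytes):
        size //= 2
    return size

print(pick_block_size(8, 16384, 64))       # scalar variable
print(pick_block_size(5000, 16384, 64))    # small array
print(pick_block_size(16384, 16384, 64))   # array that fills a block
```

Because every sub-block size divides the full block size, freed sub-blocks can later be merged back into a full block without fragmentation.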

![](_page_56_Figure_4.jpeg)

## Block Replacement Policy by OSCAR Compiler

### Compiler-Controlled Memory Block Replacement

- using live, dead and reuse information of each variable from the scheduled result
- different from LRU in cache that does not use data dependence information

## Block Eviction Priority Policy

1. (Dead) Variables that will not be accessed later in the program
2. Variables that are accessed only by other processor cores
3. Variables that will be accessed later by the current processor core
4. Variables that will be accessed immediately by the current processor core
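
The four-level eviction priority can be sketched as a classification followed by a minimum search. The code below is an illustration only: the `Block` fields, the `soon` threshold that separates "immediately" from "later", and the tie-break by farthest next use are assumptions, not details from the slide.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional

class Priority(IntEnum):
    DEAD = 1               # evicted first
    OTHER_CORES_ONLY = 2
    USED_LATER = 3
    USED_IMMEDIATELY = 4   # evicted last

@dataclass
class Block:
    name: str
    next_use_here: Optional[int]  # steps until this core uses it; None = never
    used_by_others: bool

def classify(block, soon=1):
    """Map a block to one of the four eviction classes."""
    if block.next_use_here is None:
        return Priority.OTHER_CORES_ONLY if block.used_by_others else Priority.DEAD
    if block.next_use_here <= soon:
        return Priority.USED_IMMEDIATELY
    return Priority.USED_LATER

def choose_victim(blocks):
    """Evict the lowest class; within a class, the farthest next use."""
    return min(blocks, key=lambda b: (classify(b), -(b.next_use_here or 0)))

blocks = [
    Block("A", 1, False),
    Block("B", 10, False),
    Block("C", None, True),
    Block("D", None, False),
]
print(choose_victim(blocks).name)
```

Unlike LRU, this ordering relies on knowledge of future accesses, which the compiler has from the static schedule.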

#### Speedups by OSCAR Automatic Local Memory Management compared to Executions Utilizing Centralized Shared Memory on Embedded and Scientific Application on RP2 8core Multicore

![](_page_58_Figure_1.jpeg)

against sequential execution using off-chip shared memory

#### Speedups of Deep Learning Winograd 2D-Convolution generated by TVM on NEC **Personal Vector Supercomputer SX-Aurora TSUBASA 8 Core Type 10C** OSCAR Parallelization and NEC Vectorization gave us 363x Speedup against a Scalar Core

![](_page_59_Figure_1.jpeg)


Performance for Multimedia & Scientific Applications on OSCAR Vector Accelerator (A Vector Processor with Local Memory or Distributed Shared Memory and DMA Controller Managed by the OSCAR Compiler Redesigned Improving Japanese Supercomputer Technology in 1980-2000)

![](_page_60_Figure_1.jpeg)

![](_page_61_Picture_0.jpeg)

## Future Multicore Products with Automatic Parallelizing Compiler

![](_page_61_Picture_2.jpeg)

#### **Next Generation Automobiles**

- Safer, more comfortable, energy efficient, environment friendly

- Cameras, radar, car2car communication, internet information integrated brake, steering, engine, motor control

#### Smart phones

![](_page_61_Figure_8.jpeg)

- From everyday recharging to less than once a week

- Solar powered operation in emergency condition

- Keep health

#### Advanced medical systems

#### Personal / Regional Supercomputers

![](_page_61_Picture_14.jpeg)

![](_page_61_Picture_15.jpeg)

#### Cancer treatment, drinkable inner camera

- Emergency solar powered
- No cooling fan, no dust, clean, usable inside OP room

## Solar powered, with more than 100 times better power efficiency (FLOPS/W)

 Regional Disaster Simulators saving lives from tornadoes, localized heavy rain, fires with earth quakes

![](_page_62_Picture_0.jpeg)