### Green Multicore Computing: Low Power High Performance



Hironori Kasahara, Ph.D., IEEE Fellow:

**IEEE Computer Society President 2018** 

Professor, Dept. of Computer Science & Engineering
Director, Advanced Multicore Processor Research Institute
Senior Executive VP, Waseda University, Japan

URL: http://www.kasahara.cs.waseda.ac.jp/

1980 BS, 1982 MS, 1985 Ph.D., Dept. of EE, Waseda Univ.; 1985 Visiting Scholar, U. of California, Berkeley; 1986 Assistant Prof.; 1988 Associate Prof.; 1989-90 Research Scholar, U. of Illinois, Urbana-Champaign, Center for Supercomputing R&D; 1997 Prof.; 2004 Director, Advanced Multicore Research Institute; 2017 Member, the Engineering Academy of Japan and the Science Council of Japan; Nov. 2018 Senior Vice President, Waseda Univ.

1987 IFAC World Congress Young Author Prize
1997 IPSJ Sakai Special Research Award
2005 STARC Academia-Industry Research Award
2008 LSI of the Year Second Prize
2008 Intel Asia Academic Forum Best Research Award
2010 IEEE CS Golden Core Member Award
2014 Minister of Edu., Sci. & Tech. Research Prize
2015 IPSJ Fellow, 2017 IEEE Fellow, Eta Kappa Nu

Reviewed Papers: 216, Invited Talks: 162, Granted Patents: 43 (Japan, US, GB, China), Articles in newspapers, web news and other media incl. TV: 584

Committees in Societies and Government: 255

- IEEE Computer Society: President 2018, BoG (2009-14), Executive Committee (2017-), Multicore STC Chair (2012-), Japan Chair (2005-07)
- IPSJ Chair: HG for Magazine. & J. Edit, Sig. on ARC.
- [METI/NEDO] Project Leaders: Multicore for Consumer Electronics, Advanced Parallelizing Compiler; Chair: Computer Strategy Committee
- [Cabinet Office] CSTP Supercomputer Strategic ICT PT, Japan Prize Selection Committees, etc.
- [MEXT] Info. Sci. & Tech. Committee, Supercomputers (Earth Simulator, HPCI Promo., Next Gen. Supercomputer K) Committees, etc.

## **Multicores for Performance and Low Power**

Power consumption is one of the biggest problems for performance scaling, from smartphones to cloud servers and supercomputers ("K": more than 10 MW).



IEEE ISSCC08: Paper No. 4.5, M.ITO, ... and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"

Power ∝ Frequency × Voltage<sup>2</sup>
(Voltage ∝ Frequency)

If <u>Frequency</u> is reduced to $1/4$ (e.g. 4 GHz $\rightarrow$ 1 GHz),

**Power** is reduced to 1/64 and Performance falls to 1/4.

< Multicores >

If **8 cores** are integrated on a chip,

**Power** is still 1/8 and

Performance becomes 2 times.
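
The figures above follow from a cubic dependence of power on frequency. A short worked derivation, assuming (as the slide does) that voltage scales linearly with frequency and, additionally, that performance scales linearly with both frequency and core count, i.e. ideal parallelization. The symbols $P$ (power), $f$ (frequency) and $V$ (voltage) are introduced here for illustration:

$$
\begin{aligned}
P &\propto f\,V^{2}, \qquad V \propto f \;\Longrightarrow\; P \propto f^{3} \\
P\!\left(\tfrac{f}{4}\right) &= \left(\tfrac{1}{4}\right)^{3} P(f) = \tfrac{1}{64}\,P(f) \\
P_{8\ \text{cores}} &= 8 \cdot \tfrac{1}{64}\,P(f) = \tfrac{1}{8}\,P(f) \\
\text{Performance}_{8\ \text{cores}} &= 8 \cdot \tfrac{1}{4} = 2 \quad \text{(relative to one core at } f\text{)}
\end{aligned}
$$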

#### Parallel software is important for scalable performance of multicores (LCPC2015)

Just adding more cores does not give a speedup.

The development cost and time of parallel software are becoming a bottleneck in the development of embedded systems, e.g. IoT and automobiles.

Earthquake wave propagation simulation GMS, developed by the National Research Institute for Earth Science and Disaster Resilience (NIED)



**Multicore Server** 



- A commercially available automatic parallelizing compiler gave no speedup on 64 cores relative to execution on 1 core
  - Execution with 128 cores was slower than with 1 core (0.9 times speedup)
- The advanced OSCAR parallelizing compiler gave a 211 times speedup with 128 cores relative to execution on 1 core with the commercial compiler
  - The OSCAR compiler gave a 2.1 times speedup on 1 core over the commercial compiler through global cache optimization

### 1987 OSCAR (Optimally Scheduled Advanced Multiprocessor)

#### **Co-design of Compiler and Architecture**

Looking at various applications, design a parallelizing compiler and design a multiprocessor/multicore-processor to support compiler optimization



## NWT



## Earth Simulator

(http://www.es.jamstec.go.jp/)

- Earth environmental simulations such as global warming, El Niño and plate movement, for all life on this planet.
- Developed in Mar. 2002 by STA (MEXT) and NEC with a 400 M\$ investment under Dr. Miyoshi's direction. (Dr. Miyoshi: passed away in Nov. 2001. NWT, VPP500, SX6)



Mr. Hajime Miyoshi

Image of Earth Simulator

4 Tennis Courts



40 TFLOPS peak (40 × 10<sup>12</sup> FLOPS), 35.6 TFLOPS Linpack



## 4 core multicore RP1 (2007), 8 core multicore RP2 (2008) and 15 core Heterogeneous multicore RPX (2010) developed in NEDO Projects with Hitachi and Renesas



## Power Reduction of MPEG2 Decoding to 1/4 on 8 Core Homogeneous Multicore RP-2 by OSCAR Parallelizing Compiler

MPEG2 decoding with 8 CPU cores:

- Without power control (voltage: 1.4 V): avg. power 5.73 [W]
- With power control (frequency control; resume standby: power shutdown and voltage lowering 1.4 V to 1.0 V): avg. power 1.52 [W]

73.5% power reduction

## **Demo of NEDO Multicore for Real Time Consumer Electronics** at the Council for Science and Technology Policy (CSTP) on April 10, 2008

#### 74th Council for Science and Technology Policy [April 10, 2008]







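Scene from the 74th Council for Science and Technology Policy (2)

Scene from the 74th Council for Science and Technology Policy

Scene from the 74th Council for Science and Technology Policy (4)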

**CSTP Members**

- **Prime Minister:** Mr. Y. FUKUDA
- **Minister of State for Science, Technology and Innovation Policy:** Mr. F. KISHIDA
- **Chief Cabinet Secretary:** Mr. N. MACHIMURA
- **Minister of Internal Affairs and Communications:** Mr. H. MASUDA
- **Minister of Finance:** Mr. F. NUKAGA
- **Minister of Education, Culture, Sports, Science and Technology:** Mr. K. TOKAI
- **Minister of Economy, Trade and Industry:** Mr. A. AMARI

## Green Computing Systems R&D Center

## **Waseda University**

## **Supported by METI (Mar. 2011 Completion)**

< R&D Target >

Hardware, Software, Application for Super Low-Power Manycore

- More than 64 cores
- Natural air cooling (no fan): Cool, Compact, Clear, Quiet
- Operational by solar panel

< Industry, Government, Academia >

Hitachi, Fujitsu, NEC, Renesas, Olympus, Toyota, Denso, Mitsubishi, Toshiba, OSCAR Technology, etc.

< Ripple Effect >

- Low CO<sub>2</sub> (carbon dioxide) emissions
- Creation of value-added products
- Automobiles, medical, IoT, servers



Beside Subway Waseda Station, Near Waseda Univ. Main Campus

## **OSCAR Parallelizing Compiler**

To improve effective performance, cost-performance and software productivity and reduce power

#### **Multigrain Parallelization** (LCPC1991, 2001, 04)

Coarse-grain parallelism among loops and subroutines (2000 on SMP) and near-fine-grain parallelism among statements (1992), in addition to loop parallelism
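
The difference between coarse-grain and loop-level parallelism can be sketched by hand. The example below is only an illustration written with standard OpenMP directives, not output of the OSCAR compiler (which finds such parallelism automatically); the function name `multigrain_sum` and the array size `N` are assumptions made for this sketch. Two loops that do not depend on each other run as separate coarse-grain tasks, and the loop that consumes both results is parallelized across its iterations.

```c
#include <assert.h>
#include <stddef.h>

#define N 1024  /* assumed array length for this sketch */

/* Fills a[] and b[] in two independent coarse-grain tasks, then sums
   a[i] + b[i] with loop-level parallelism. Without OpenMP support the
   pragmas are ignored and the code runs sequentially with the same result. */
double multigrain_sum(double *a, double *b) {
    double sum = 0.0;
    assert(a != NULL && b != NULL && a != b);

    /* Coarse grain: the two loops touch different arrays, so each loop
       as a whole can be assigned to a different core. */
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < N; i++) a[i] = 2.0 * i; }
        #pragma omp section
        { for (int i = 0; i < N; i++) b[i] = 1.0; }
    }

    /* Loop level: iterations are independent apart from the reduction. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] + b[i];

    return sum;
}
```

Near-fine-grain parallelism among individual statements is not shown here; it relies on static scheduling by the compiler rather than on directives of this kind.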

#### **Data Localization**

Automatic data management for distributed shared memory, cache and local memory (Local Memory 1995, 2016 on RP2, Cache 2001, 03)
Software Coherent Control (2017)
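
A minimal hand-written sketch of the idea behind data localization, assuming a chunk size `CHUNK` that fits in cache or local memory (the value and the function names are hypothetical). Instead of sweeping the whole array once per loop, both loops are divided into aligned chunks and the chunks that touch the same data run back to back, so the intermediate data is reused while it is still local. This is an illustration of the principle, not the OSCAR compiler's actual algorithm.

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK 256  /* assumed number of elements that fit in local memory */

/* Baseline: each loop sweeps the whole array, so for large n the data
   written by the first loop has left the cache before the second reads it. */
void two_pass(const double *x, double *tmp, double *y, size_t n) {
    assert(n == 0 || (x != NULL && tmp != NULL && y != NULL));
    for (size_t i = 0; i < n; i++) tmp[i] = x[i] * x[i];
    for (size_t i = 0; i < n; i++) y[i] = tmp[i] + 1.0;
}

/* Localized: the same two loops, divided into aligned chunks that are
   executed consecutively on the same block of data. */
void localized(const double *x, double *tmp, double *y, size_t n) {
    assert(n == 0 || (x != NULL && tmp != NULL && y != NULL));
    for (size_t base = 0; base < n; base += CHUNK) {
        size_t end = (n - base < CHUNK) ? n : base + CHUNK;
        for (size_t i = base; i < end; i++) tmp[i] = x[i] * x[i];
        for (size_t i = base; i < end; i++) y[i] = tmp[i] + 1.0;
    }
}
```

Both functions compute the same values; only the order in which memory is touched differs.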

#### Data Transfer Overlapping (2016 partially)

Data transfer overlapping using Data Transfer Controllers (DMAs)
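
The schedule that makes such overlapping possible is usually a double-buffering loop, sketched below. `dma_start` and `dma_wait` are hypothetical stand-ins, not a real controller API: here they are backed by `memcpy`, so the copy completes immediately and nothing actually overlaps. On hardware with a data transfer controller, `dma_start` would return at once and the computation on tile `t` would proceed while tile `t + 1` is being fetched. `TILE` is an assumed tile size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TILE 64  /* assumed tile size in elements */

/* Hypothetical stand-ins for a data transfer controller. */
static void dma_start(double *dst, const double *src, size_t count) {
    memcpy(dst, src, count * sizeof *dst);
}
static void dma_wait(void) { /* the stand-in copy has already finished */ }

/* Number of elements in tile t of an n-element array. */
static size_t tile_len(size_t n, size_t t) {
    size_t remaining = n - t * TILE;
    return remaining < TILE ? remaining : TILE;
}

/* Sums n doubles from "remote" memory through two local buffers. */
double sum_double_buffered(const double *remote, size_t n) {
    double local[2][TILE];
    double sum = 0.0;
    size_t tiles = (n + TILE - 1) / TILE;

    assert(remote != NULL || n == 0);
    if (tiles == 0) return 0.0;

    dma_start(local[0], remote, tile_len(n, 0));   /* prefetch tile 0 */
    for (size_t t = 0; t < tiles; t++) {
        dma_wait();                                /* tile t is now local */
        if (t + 1 < tiles)                         /* request tile t + 1 ... */
            dma_start(local[(t + 1) % 2], remote + (t + 1) * TILE,
                      tile_len(n, t + 1));
        for (size_t i = 0; i < tile_len(n, t); i++)
            sum += local[t % 2][i];                /* ... while using tile t */
    }
    return sum;
}
```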

#### **Power Reduction**

(2005 for Multicore, 2011 Multi-processes, 2013 on ARM)

Reduction of power consumption by compiler-controlled DVFS and power gating with hardware support.
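
One decision behind DVFS control for a real-time task is picking the lowest frequency that still meets the deadline. The sketch below assumes the task's cycle count is known in advance and that the available frequencies are listed in ascending order; the function name and the frequency values in the usage note are hypothetical, and this is not the OSCAR compiler's scheduling algorithm.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the lowest frequency in freqs_mhz[] (sorted in
   ascending order) at which `cycles` complete within `deadline_us`
   microseconds, or -1 if even the highest frequency is too slow.
   A clock of f MHz executes f cycles per microsecond. */
int pick_frequency(const unsigned *freqs_mhz, size_t n,
                   unsigned long long cycles, unsigned long long deadline_us) {
    assert(freqs_mhz != NULL);
    for (size_t i = 0; i < n; i++) {
        if (cycles <= (unsigned long long)freqs_mhz[i] * deadline_us)
            return (int)i;
    }
    return -1;
}
```

For example, with assumed frequencies of 150, 300 and 600 MHz and a 33 ms frame period, a task of 9,000,000 cycles fits at 300 MHz but not at 150 MHz, so index 1 is returned. Any slack left before the deadline is where clock or power gating could be applied.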



### MTG of Su2cor-LOOPS-DO400

### Coarse grain parallelism PARA\_ALD = 4.3



#### Data Localization

(Figure panels: MTG; MTG after Division; A schedule for two processors, PE0 and PE1. Data Localization Groups: dlg0, dlg2, dlg3)

### Multicore Program Development Using OSCAR API V2.0

#### **Sequential Application Program in Fortran or C**

(Consumer Electronics, Automobiles, Medical, Scientific computation, etc.)

Homogeneous

Hetero

Manual parallelization / power reduction

#### **Accelerator Compiler/ User**

Add "hint" directives before a loop or a function to specify that it is executable by the accelerator and in how many clocks

#### Waseda OSCAR **Parallelizing Compiler**

- Coarse grain task parallelization
- **Data Localization**
- **DMAC** data transfer
- Power reduction using **DVFS, Clock/ Power gating**

Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.

**OSCAR API for Homogeneous and/or Heterogeneous Multicores and manycores** 

Directives for thread generation, memory, data transfer using DMA, and power management

**Parallelized API F or C program**

- Proc0: code with directives (Thread 0)
- Proc1: code with directives (Thread 1)
- Accelerator 1 code
- Accelerator 2 code

- **Low Power Homogeneous Multicore Code Generation**: API Analyzer + existing sequential compiler
- **Low Power Heterogeneous Multicore Code Generation**: API Analyzer (available from Waseda) + existing sequential compiler
- **Server Code Generation**: OpenMP compiler

**OSCAR:** Optimally Scheduled Advanced Multiprocessor; **API:** Application Program Interface

**Generation of** parallel machine codes using sequential compilers



Targets (various multicores):

- Homogeneous multicores from Vendor A (SMP servers)
- Heterogeneous multicores from Vendor B
- Shared memory servers

## 110 Times Speedup against the Sequential Processing for GMS Earthquake Wave Propagation Simulation on Hitachi SR16000

(Power7 Based 128 Core Linux SMP) (LCPC2015)



## Performance on Multicore Server for Latest Cancer Treatment Using Heavy Particle (Proton, Carbon Ion)

327 times speedup on 144 cores



- The original sequential execution time of 2948 sec (about 50 minutes) using GCC was reduced to 9 sec with 144 cores (327.6 times speedup)
  - A reduction in treatment cost and reservation waiting time is expected
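
The speedup figure can be reproduced from the two execution times. A small sketch with hypothetical helper names; the per-core efficiency is an extra quantity computed here for illustration and is not stated on the slide.

```c
#include <assert.h>

/* Speedup of a parallel run over the sequential baseline (times in seconds). */
double speedup(double t_sequential, double t_parallel) {
    assert(t_parallel > 0.0);
    return t_sequential / t_parallel;
}

/* Speedup per core: 1.0 means linear scaling, above 1.0 is superlinear. */
double efficiency(double t_sequential, double t_parallel, int cores) {
    assert(cores > 0);
    return speedup(t_sequential, t_parallel) / (double)cores;
}
```

With the numbers above, `speedup(2948.0, 9.0)` is about 327.6 and `efficiency(2948.0, 9.0, 144)` is about 2.27. A value above 1 means the gain exceeds the core count, so it cannot come from the number of cores alone; note that the baseline is the original sequential code compiled with GCC.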

## Parallelization of 3D-FFT for New Magnetic Material Computation on Hitachi SR16000 Power7 CC-NUMA Server



#### **OSCAR** optimization

Reducing the number of data transposes with loop interchange, code motion and loop fusion
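
As a generic illustration of one of these transformations (not the actual 3D-FFT code), the sketch below shows how loop interchange can turn a strided traversal into a unit-stride one without moving any data, which is the kind of effect a transpose is often used to obtain. The function names and the column-sum kernel are assumptions chosen for brevity; the matrix is stored row-major in a flat array.

```c
#include <assert.h>
#include <stddef.h>

/* Before: the inner loop walks down a column of a row-major matrix,
   touching memory with a stride of `cols` elements. */
void col_sums_strided(const double *m, size_t rows, size_t cols, double *out) {
    assert(m != NULL && out != NULL);
    for (size_t j = 0; j < cols; j++) {
        out[j] = 0.0;
        for (size_t i = 0; i < rows; i++)
            out[j] += m[i * cols + j];
    }
}

/* After interchange: the inner loop walks along a row with unit stride.
   Each out[j] still accumulates its terms in the same order (i = 0, 1, ...),
   so the results are identical. */
void col_sums_interchanged(const double *m, size_t rows, size_t cols, double *out) {
    assert(m != NULL && out != NULL);
    for (size_t j = 0; j < cols; j++)
        out[j] = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            out[j] += m[i * cols + j];
}
```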

## **Engine Control by multicore with Denso**

Although parallel processing of engine control on multicores has so far been very difficult, Denso and Waseda achieved a 1.95 times speedup on a 2-core V850 multicore processor.

Millions of lines of C code consisting of conditional branches and basic blocks





## **OSCAR Compile Flow for Simulink Applications**



Generate C code using Embedded Coder





/\* Model step function \*/

void VesselExtraction\_step(void)
{
 int32\_T i;
 real\_T u0;

 /\* DataTypeConversion: '<S1>/Data Type Conversion' incorporates:
 \* Inport: '<Root>/In1'
 \*/
 for (i = 0; i < 16384; i++) {
 VesselExtraction\_B.DataTypeConversion[i] = VesselExtraction\_U.In1[i];
 }

 /\* End of DataTypeConversion: '<S1>/Data Type Conversion' \*/
 /\* Outputs for Atomic SubSystem: '<S1>/2Dfilter' \*/
 /\* Constant: '<S1>/h1' \*/
 VesselExtraction\_Dfilter(VesselExtraction\_B.DataTypeConversion,
 VesselExtraction\_P.h1\_Value, &VesselExtraction\_B.Dfilter,
 (P\_Dfilter\_VesselExtraction\_T \*)&VesselExtraction\_P.Dfilter);

 /\* End of Outputs for SubSystem: '<S1>/2Dfilter' \*/
 /\* Outputs for Atomic SubSystem: '<S1>/2Dfilter1' \*/
 /\* Constant: '<S1>/h2' \*/
 VesselExtraction\_Dfilter(VesselExtraction\_B.DataTypeConversion,
 VesselExtraction\_P.h2\_Value, &VesselExtraction\_B.Dfilter1,
 (P\_Dfilter\_VesselExtraction\_T \*)&VesselExtraction\_P.Dfilter1);

#### Simulink model

#### C code



## Speedups of MATLAB/Simulink Image Processing on Various 4core Multicores

(Intel Xeon, ARM Cortex A15 and Renesas SH4A)



Road Tracking, Image Compression: http://www.mathworks.co.jp/jp/help/vision/examples

Buoy Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/44706-buoy-detection-using-simulink

Color Edge Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/28114-fast-edges-of-a-color-image--actual-color--not-converting-to-grayscale-/

Vessel Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/24990-retinal-blood-vessel-extraction/

## **Low-Power Optimization with OSCAR API**



## Automatic Power Reduction on ARM Cortex-A9 with Android

http://www.youtube.com/channel/UCS43INYEIkC8i\_KIgFZYQBQ

ODROID X2: Samsung Exynos4412 Prime, ARM Cortex-A9 quad core 1.7 GHz to 0.2 GHz, used by Samsung's Galaxy S3



- Power for 3 cores was reduced to $1/5 \sim 1/7$ of the power without software power control
- Power for 3 cores was reduced to $1/2 \sim 1/3$ of the power of ordinary 1-core execution

### **Automatic Power Reduction on Intel Haswell**

## H.264 decoder & Optical Flow (3cores)



- Power for 3 cores was reduced to $1/3 \sim 1/4$ of the power without software power control
- Power for 3 cores was reduced to $2/5 \sim 1/3$ of the power of ordinary 1-core execution

## An Image of Static Schedule for Heterogeneous Multicore with Data Transfer Overlapping and Power Control



## 33 Times Speedup Using OSCAR Compiler and OSCAR API on RP-X



## Power Reduction in a real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)

- Without power reduction: average 1.76 [W]
- With power reduction by OSCAR Compiler: average 0.54 [W] (70% power reduction)

(Plots: power [W] and voltage [V] over time for both cases; 1 cycle: 33 [ms] $\rightarrow$ 30 [fps])

## Automatic Parallelization of JPEG-XR for Drinkable Inner Camera (Endo Capsule)

A further 10 times speedup is needed after parallelization for 128 cores of Power 7. Power consumption of less than 35 mW is required.





## OSCAR Vector Multicore and Compiler for Embedded to Servers with OSCAR Technology



#### **Target:**

- Solar powered
- Compiler power reduction
- Fully automatic parallelization and vectorization, including local memory management and data transfer

#### **Vector Accelerator**

#### **Features**

- Attachable to any CPU (Intel, ARM, IBM)
- Data driven initiation by sync flags



#### Function Units [tentative]

- Vector Function Unit
  - 8 double precision ops/clock
  - 64 characters ops/clock
  - Variable vector register length
  - Chaining LD/ST & Vector pipes
- Scalar Function Unit

#### Registers[tentative]

- Vector Register 256 Bytes/entry, 32 entries
- Scalar Register 8Bytes/entry
- Floating Point Register 8Bytes/entry
- Mask Register 32Bytes/entry

## OSCAR Technology Corp.

Started up on Feb. 28, 2013: licensing all the patents and the OSCAR compiler from Waseda Univ.

Founder and CEO: Dr. T. Ono (Ex-CEO of First Section-listed Company, Director of National U., Invited Prof. of Waseda U.)

Executives:

- Dr. M. Ohashi: COO (Ex- OO of Ono Sokki)
- Mr. A. Nodomi: CTO (Ex-Spansion)
- Mr. N. Ito (Ex-Visiting Prof., Tokyo Agricult. and Tech. U.)
- **Dr. K. Shirai** (Ex-President of Waseda U., Ex-Chairman of Japanese Open U.)
- Mr. K. Ashida (Ex-VP Sumitomo Trading, Ashida Consult. CEO)
- Mr. S. Tsuchida (Co-Chief Investment Officer of Innovation Network Corp. of Japan)

Auditor:

- Mr. S. Honda (Ex-Senior VP and General Manager of MUFG)
- Dr. S. Matuda (Emeritus Prof. of Waseda U., Chairman of WERU INVESTME
- Mr. Y. Hirowatari (President of AGS Consulting)

Advisors:

- Prof. H. Kasahara (Waseda U.)
- **Prof. K. Kimura** (Waseda U.)















## Future Multicore Products with Automatic Parallelizing Compiler



#### **Next Generation Automobiles**

- Safer, more comfortable, energy efficient, environment friendly
- Integrated control of brakes, steering, engine and motor using cameras, radar, car-to-car communication and internet information

#### **Smart phones**



- From everyday recharging to less than once a week
- Solar powered operation in emergency condition
- Keep health

#### **Advanced medical systems**



- Cancer treatment, drinkable inner camera
- Emergency solar powered
- No cooling fan, no dust, clean, usable inside operating rooms

#### **Personal / Regional Supercomputers**

- Solar powered, with more than 100 times better power efficiency (FLOPS/W)
- Regional disaster simulators saving lives from tornadoes, localized heavy rain, and fires with earthquakes