## What Is the World's Fastest Computer Like?



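## Hironori Kasahara, Professor, Department of Computer Science and Engineering, Faculty of Science and Engineering, Waseda University; IEEE Computer Society President 2018; Senior Executive Vice President, Waseda University (2018-2022)

Chair, JST doctoral student support programs SPRING/BOOST; General Co-Chair, ACM/IEEE ISCA (International Symposium on Computer Architecture) 2025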

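1980 B.S., Department of Electrical Engineering, Waseda University; 1982 M.S., Waseda University

1985 Ph.D. (Doctor of Engineering), Graduate School of Waseda University; JSPS postdoctoral fellow (first cohort); Visiting Researcher, University of California, Berkeley

1986 Full-time Lecturer, School of Science and Engineering, Waseda University; 1988 Associate Professor

1989-1990 Visiting Research Scholar, Center for Supercomputing R&D, University of Illinois

1997 Professor; currently Department of Computer Science and Engineering, Faculty of Science and Engineering

2004 Director, Advanced Multicore Research Institute

2017 Member, Engineering Academy of Japan; Associate Member, Science Council of Japan

2018 President, IEEE Computer Society; Senior Executive Vice President, Waseda University (until September 2022)

2019-2023 Board member, Council on Competitiveness-Nippon (COCN)

2020-2024 Board member, Engineering Academy of Japan

2023- General Co-Chair, ACM/IEEE ISCA2025@Tokyo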

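### Awards

1987 IFAC World Congress Young Author Prize

1997 Information Processing Society of Japan (IPSJ) Sakai Memorial Special Award

2005 Semiconductor Technology Academic Research Center (STARC) Joint Research Award

2008 LSI of the Year 2008, Second Grand Prize,

**Intel Asia Academic Forum Best Research Award** 

2010 IEEE CS Golden Core Member Award

2014 MEXT Minister's Commendation for Science and Technology, Research Category

2015 IPSJ Fellow

2017 IEEE Fellow, IEEE Eta Kappa Nu

2019 IEEE CS Spirit of Computer Society Award

2020 IPSJ Contribution Award, SCAT President's Grand Prize

**2023 IEEE Life Fellow** 

<u>234 peer-reviewed papers, 244 invited talks, 70 international patents granted (US, UK, China, Japan, etc.),</u> 709 appearances in newspapers, web articles, TV, and other media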



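[Government and academic society committees] 300 positions held

IEEE Computer Society: President 2018; Chair, Executive Committee; Board of Governors (2009-14); Chair, Strategic Planning Committee; Chair, Nomination Committee; Chair, Multicore STC; Chair, IEEE CS Japan; IEEE technical committee member; IEEE Medal selection committee member;

ACM/IEEE SC'21 keynote selection committee member, etc.

[METI / NEDO] Project leader for the Multicore for Consumer Electronics, Advanced Parallelizing Compiler, and Green Computing projects; Chair, NEDO Computer Strategy Committee, etc.

[Cabinet Office] Supercomputer strategy committee member; government procurement complaint review committee member; Council for Science and Technology Policy ICT PT, R&D infrastructure area and security/software review member; Japan Prize selection committee member

[MEXT] Earth Simulator (ES) interim evaluation committee member; Information Science and Technology Committee member; HPCI plan promotion committee member; next-generation supercomputer (K) interim evaluation and conceptual design evaluation committee member; Chair, Earth Simulator ES2 introduction technical advisory committee, etc.

JST: Moonshot Goal 3 (robots & AI) advisor; Chair, doctoral student support programs SPRING/BOOST; Chair, SBIR Phase 1; Governing Board member, university-originated new industry creation fund program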



## **WASEDA University**

Tokyo, Japan

[Photos: visitors to Waseda University, including Chinese President Hu Jintao, Mr. Jack Ma, British Prime Minister Boris Johnson, and Dr. Bill Gates of Microsoft.]

## **Waseda University**

| Item | Figure |
|------|--------|
| International students | 7,942 (125 countries and territories) |
| Alumni CEOs in Japan | 10,606 |
| Graduate employability | Ranked among private universities in Japan (#2 in Japan, #27 in the world) |
| Faculty | 5,468 |
| Library holdings (books) | 5,800,000 |
| Alumni | 630,000 |
| Partner institutions | 848 (93 countries) |
| Enrollment | 49,436 |
| Undergraduate students | 41,051 |
| Graduate students | 8,385 |



**Prime Ministers:** 8th and 17th Shigenobu Okuma, 55th Tanzan Ishibashi, 74th Noboru Takeshita, 76th Toshiki Kaifu, 84th Keizo Obuchi, 85th Yoshiro Mori, 91st Yasuo Fukuda, 95th Yoshihiko Noda, 100<sup>th</sup> Fumio Kishida

**Business Leaders:** Masaru IBUKA, Tadashi YANAI

### Aiji TANAKA

President. International Political Science Association (IPSA) President 2016.

### Hironori KASAHARA

Senior Executive Vice President. **IEEE Computer Society President** 2018. The first president from outside the USA and Canada in the 72-year history of the CS. The CS has 84,000 members from 168 countries.

### Toshio FUKUDA

Waseda alumnus, Prof. Emeritus, Nagoya Univ., Prof., Meijo Univ. IEEE President 2020. The first from Asia in the 135-year history of the IEEE. The IEEE has 420,000 members.

Sayuri Yoshinaga, Haruki MURAKAMI, Hirokazu KOREEDA, Yuzuru HANYU, Yui Suzaki

## Invited Talk on Green Computing by Kasahara at Oxford Univ. (Nov. 13, 2019)

## Waseda is the only university in Japan with a university-level agreement with Oxford

## Oxford has ranked No. 1 in the THE university rankings for eight consecutive years since 2017

Former Vice-Chancellor: Prof. Louise Richardson (scheduled keynote at WoI 2020)

Current Vice-Chancellor: Prof. Irene Tracey, former Warden of Merton College

Head of Astrophysics: Prof. Rob Fender

Dept. of Physics: Prof. Ian Shipsey

Astrophysics: Prof. H.Falche, et al.

Fellow: Dr. Peter Braam

**Sub Warden: Prof. Judy Armitage** 

**CS: Prof. Jeremy Gibbons** 



https://www.top500.org/
















2021 ACM A.M. Turing Award: Prof. Jack Dongarra, Univ. of Tennessee

https://amturing.acm.org/award_winners/dongarra_3406337.cfm

For his pioneering contributions to numerical algorithms and libraries that enabled high performance computational software to keep pace with


TOP500, June 2023:

| Rank | System | Configuration | Site | Country | Cores | Rmax (PFlop/s) | Power (MW) |
|------|----------|-------------------------------------------------------------------------------------------|---------------|---------|-----------|---------|------|
| 1 | Frontier | HPE Cray EX235a, AMD Opt 3rd Gen EPYC (64C 2GHz), AMD Instinct MI250X, Slingshot-11 | DOE/SC/ORNL | USA | 8,699,904 | 1,194.0 | 22.7 |
| 2 | Fugaku | Fujitsu A64FX (48C, 2.2GHz), Tofu Interconnect D | RIKEN R-CCS | Japan | 7,630,848 | 442.0 | 29.9 |
| 3 | LUMI | HPE Cray EX235a, AMD Opt 3rd Gen EPYC (64C 2GHz), AMD Instinct MI250X, Slingshot-11 | EuroHPC/CSC | Finland | 2,220,288 | 309.0 | 6.01 |
| 4 | Leonardo | Atos BullSequana, Intel Xeon (32C, 2.6 GHz), NVIDIA A100, quad-rail NVIDIA HDR100 Infiniband | EuroHPC/CINECA | Italy | 1,824,768 | 238.7 | 7.40 |
| 5 | Summit | IBM POWER9 (22C, 3.07GHz), NVIDIA Volta GV100 (80C), Dual-Rail Mellanox EDR Infiniband | DOE/SC/ORNL | USA | 2,414,592 | 148.6 | 10.1 |
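The Rmax and power columns of the June 2023 table can be combined into an energy-efficiency figure. The short sketch below does only that unit conversion; the dictionary and function names are illustrative and not part of the slides.

```python
# Rmax (PFlop/s) and power (MW), copied from the June 2023 table above.
SYSTEMS = {
    "Frontier": (1194.0, 22.7),
    "Fugaku": (442.0, 29.9),
    "LUMI": (309.0, 6.01),
    "Leonardo": (238.7, 7.40),
    "Summit": (148.6, 10.1),
}


def gflops_per_watt(rmax_pflops: float, power_mw: float) -> float:
    """Convert PFlop/s and MW into GFlop/s per watt."""
    gflops = rmax_pflops * 1e6  # 1 PFlop/s = 10^6 GFlop/s
    watts = power_mw * 1e6      # 1 MW = 10^6 W
    return gflops / watts


# Energy efficiency of each system in GFlop/s per watt.
efficiency = {name: gflops_per_watt(*row) for name, row in SYSTEMS.items()}

if __name__ == "__main__":
    for name, eff in sorted(efficiency.items(), key=lambda kv: -kv[1]):
        print(f"{name:9s} {eff:5.1f} GFlop/s per W")
```

On these figures the two GPU-accelerated HPE Cray systems (Frontier and LUMI) come out at roughly 50 GFlop/s per watt, several times the efficiency of the older systems in the list.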



No. 1, June 2022 to 2023: Frontier, 8.7 million processor cores, 22.7MW

HPE Cray EX235a, 8,699,904 total cores,

AMD Optimized 3rd Generation EPYC 64C 2GHz,

**AMD** Instinct MI250X accelerators,

**HPE Slingshot-11 interconnect** 

Oak Ridge National Laboratory (ORNL), USA

Rmax: 1.19 ExaFlop/s, Rpeak: 1.68 ExaFlop/s (1.68 × 10<sup>18</sup> operations per second)

HPCG:14,054[TFlop/s]
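The three Frontier figures above measure different things: Rpeak is the theoretical maximum, Rmax is what Linpack achieves, and HPCG stresses memory and communication. A minimal sketch of how they relate, using only the numbers quoted on this slide (the variable names are illustrative):

```python
RMAX_EFLOPS = 1.19     # Linpack result quoted above
RPEAK_EFLOPS = 1.68    # theoretical peak quoted above
HPCG_TFLOPS = 14_054   # HPCG result quoted above

# Fraction of the theoretical peak that Linpack actually reaches.
linpack_efficiency = RMAX_EFLOPS / RPEAK_EFLOPS

# Fraction of the theoretical peak reached on the HPCG benchmark
# (convert both values to Flop/s before dividing).
hpcg_fraction_of_peak = (HPCG_TFLOPS * 1e12) / (RPEAK_EFLOPS * 1e18)

if __name__ == "__main__":
    print(f"Linpack efficiency: {linpack_efficiency:.1%}")
    print(f"HPCG / Rpeak:       {hpcg_fraction_of_peak:.2%}")
```

Linpack reaches about 70% of peak, while HPCG reaches under 1%, which illustrates how far application-like workloads can sit below the headline number.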




### https://en.wikipedia.org/wiki/5_nm_process

MOSFET scaling (process nodes)

- 10 μm: 1971
- 6 μm: 1974
- 3 μm: 1977
- 1.5 μm: 1981
- 1 μm: 1984
- 800 nm: 1987
- 600 nm: 1990
- 350 nm: 1993
- 250 nm: 1996
- 180 nm: 1999
- 130 nm: 2001
- 90 nm: 2003
- 65 nm: 2005
- 45 nm: 2007
- 32 nm: 2009
- 22 nm: 2012
- 14 nm: 2014
- 10 nm: 2016
- 7 nm: 2018
- 5 nm: 2020
- Future: 3 nm (~2022), 2 nm (~2023)

https://hc2023.hotchips.org/

AMD Next Generation "Zen 4" Core and 4th Gen AMD EPYC™ 9004 Server CPU

### Kai Troester

AMD Fellow, Silicon Design Engineer, "Zen 4" Lead Architect

### Ravi Bhargava

AMD Senior Fellow, "Zen 4" EPYC™ Performance Architect

Hot Chips 2023





## "Zen 4c" Doubles Cores Per Compute Die

[Figure: floorplans of the "Zen 4" CCD and the "Zen 4c" CCD (5nm), built from "Core + L2" blocks arranged around 16 MB L3 slices.]

|               | "Zen 4" CCD | "Zen 4c" CCD |
|---------------|------|------|
| Cores         | 8    | 16   |
| L2 cache/core | 1 MB | 1 MB |
| L3 cache/core | 4 MB | 2 MB |

## AMD Chiplet for 64 cores: Zen2

Processor Chiplet 7nm, Memory Chiplet: 12nm

## **Technology-optimized Chiplet Organization**













### Move cores and cache to 7nm



86% of die area benefits

### https://olcf.ornl.gov/wp-content/uploads/Frontiers-Architecture-Frontier-Training-Series-final.pdf

## What is a Dragonfly topology?

- A set of groups that are connected all-to-all
  - Every group has one or more links to every other group

Another view of a Dragonfly Topology

- A group of endpoints connected to switches that are connected all-to-all
- A set of groups that are connected all-to-all
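The two views above can be made concrete with a small graph builder. This is a minimal sketch, assuming one global link per pair of groups and a simple round-robin choice of which switch carries each global link; real systems such as Frontier use more links and different assignments. The function names are hypothetical.

```python
from itertools import combinations


def build_dragonfly(num_groups: int, switches_per_group: int) -> set:
    """Return the undirected switch-to-switch links of a small dragonfly.

    A switch is identified by (group, index). Endpoints are omitted.
    """
    links = set()

    # Local links: switches inside one group are connected all-to-all.
    for g in range(num_groups):
        for a, b in combinations(range(switches_per_group), 2):
            links.add(((g, a), (g, b)))

    # Global links: every pair of groups gets one link (assumption),
    # spread over the switches of each group in round-robin order.
    next_port = [0] * num_groups
    for g1, g2 in combinations(range(num_groups), 2):
        s1 = next_port[g1] % switches_per_group
        s2 = next_port[g2] % switches_per_group
        next_port[g1] += 1
        next_port[g2] += 1
        links.add(((g1, s1), (g2, s2)))
    return links


def groups_fully_connected(links: set, num_groups: int) -> bool:
    """Check that every pair of groups shares at least one global link."""
    pairs = {frozenset((a[0], b[0])) for a, b in links if a[0] != b[0]}
    return len(pairs) == num_groups * (num_groups - 1) // 2


if __name__ == "__main__":
    topo = build_dragonfly(num_groups=4, switches_per_group=3)
    print(len(topo), "links; groups all-to-all:", groups_fully_connected(topo, 4))
```

Because both levels are all-to-all, any two switches are at most three hops apart (local, global, local) under minimal routing.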





Reference: John Kim (Northwestern University), William J. Dally (Stanford University), Steve Scott (Cray Inc.), and Dennis Abts (Google Inc.), "Technology-Driven, Highly-Scalable Dragonfly Topology," ISCA 2008, IEEE Computer Society, DOI 10.1109/ISCA.2008.19.
















TOP500, May 2024 (https://www.top500.org/):

| Rank | System | Configuration | Site | Country | Cores | Rmax (PFlop/s) | Power (MW) |
|------|----------|---------------------------------------------------------------------------------------------------------------------|-----------------|---------|-----------|-----------------|-------------|
| 1 | Frontier | HPE Cray EX235a, AMD Opt 3rd Gen EPYC (64C 2GHz), AMD Instinct MI250X, Slingshot-11 | DOE/SC/ORNL | USA | 8,699,904 | 1,206.0 | 22.7 |
| 2 | Aurora | HPE Cray EX - Intel Exascale Compute Blade, Xeon CPU Max 9470 (52C 2.4GHz), Intel Data Center GPU Max, Slingshot-11 | DOE/SC/ANL | USA | 9,264,128 | 1,012.0 | 38.7 |
| 3 | Eagle | Microsoft NDv5, Xeon Platinum 8480C (48C 2GHz), NVIDIA H100, NVIDIA Infiniband NDR | Microsoft Azure | USA | 1,123,200 | 561.2 | |
| 4 | Fugaku | Fujitsu A64FX (48C, 2.2GHz), Tofu Interconnect D | RIKEN R-CCS | Japan | 7,630,848 | 442.0 | 29.9 |
| 5 | LUMI | HPE Cray EX235a, AMD Opt 3rd Gen EPYC (64C 2GHz), AMD Instinct MI250X, Slingshot-11 | EuroHPC/CSC | Finland | 2,220,288 | 379.7 | 6.01 |








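## More Than 20 Billion Processors Are Produced Every Year

Multicore processors: smartphones, tablets, IoT devices, automotive control, servers, supercomputers, etc.

Examples: ARM, IBM Power, Renesas RH850, Infineon, SPARC, RISC-V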



64-bit A15, iPhone 13, 2021

https://www.apple.com/jp/shop/buy-iphone

Launched **September 14, 2021** 

**Designed by** Apple Inc.

Common manufacturer(s):

Max. CPU clock rate: up to 3.23 GHz in iPhone 13 Pro

**Technology node:** 5 nm

6 cores: 2 "Avalanche" high-performance cores and 4 "Blizzard" energy-efficient cores

Instruction set: A64, Transistors: 15 billion

**5 core GPU(s):** Apple-designed **GPU** in iPhone 13

https://en.wikipedia.org/wiki/Apple_A15

### RIKEN Fugaku Supercomputer: World No. 1 from June 2020 to November 2021

**RIKEN** Center for Computational Science,

Fujitsu (Arm-based processor)

Cores: 7,299,072; Memory: 4,866,048GB;

Processor: A64FX 48 cores, 2.2GHz

Interconnect: Tofu interconnect D

Linpack (Rmax): 415,530 TFlop/s;

Theoretical Peak (Rpeak): 513 PFLOPS

HPCG: 13,366.4 TFlop/s; **Power: 28.3MW** 

48 cores/chip, 2.2GHz, 7 nm FinFET,

About 7.3 million cores, 28MW

Theoretical peak performance: about 5.1 × 10<sup>17</sup> floating-point operations per second

As of June 2020





## AIREC (AI-driven Robot for Embrace and Care) Led by Prof. Sugano, Supported by the Japanese Government "Moonshot" Project from 2020







## Self-Driving Cars

Connected, Security, Big Data, Traffic Cloud

http://self-drivings.com/self-driving-cars-updated-market-analysis/

Image recognition by deep learning (multilayer neural networks)







## NVIDIA Automotive Accelerator: DRIVE PX XAVIER

### NVIDIA DRIVE Functional Safety Architecture

System Operates Safely Even when Faults Detected
Holistic System — Process & Methods, Processor Design, Software, Algorithms, System Design, Validation
ISO 26262 ASIL-D Safety Level | Partnership with BlackBerry QNX and TTTECH | New AutoSIM Virtual Reality 3D Simulator

### NVIDIA END TO END DRIVE Platform

### **DRIVE AGX PEGASUS**

ROBOTAXI DRIVE PLATFORM: Level 5 fully autonomous driving

- Xavier (Volta GPU integrated) x 2
- Next generation discrete-GPU x 2
- 320 TOPS CUDA TensorCore
- ASIL D Certification
- Combined Memory Bandwidth: >1TBytes/sec
- Automotive I/Os
  - 16x GMSL High-speed Camera Inputs
  - Multiple 10Gbit Ethernet
  - CAN, Flexray
- 400W
- Late Q1 Early Access Partners
- Supercomputing Data Center in your Trunk

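## Nobel Prize in Physics Goes to US and Canadian Researchers Who Laid the Foundations of Machine Learning (Science Portal)

2024.10.08

The Royal Swedish Academy of Sciences announced on the 8th that the 2024 Nobel Prize in Physics will be awarded to two researchers who opened up the field of machine learning with artificial neural networks. The laureates are Professor Emeritus John Hopfield (91) of Princeton University in the US and Professor Emeritus Geoffrey Hinton (76) of the University of Toronto in Canada. No Japanese researcher won the physics prize this year; the most recent was Syukuro Manabe (a US citizen) in 2021.

[Figure: illustration of Hopfield (left) and Hinton, the two Nobel Prize in Physics laureates, by Niklas Elmehed.]

In the 1960s, the early days of artificial intelligence (AI), knowledge was converted into data so that computers could recognize it. By contrast, artificial neural networks process data with a method modeled on how nerve cells work in the brains of humans and other animals. Nerve cells are represented as "nodes" and influence one another through synapse-like connections. Important research on artificial neural networks has advanced since the 1980s.

https://scienceportal.jst.go.jp/newsflash/20241008_n01/

[Figure: nerve cells forming a network through synapses, and an image of an artificial neural network (courtesy of the Nobel Foundation).]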

## ACM Turing Award

## The "Nobel Prize" of Computing

| Year | Recipient | Field |
|------|-----------|-------|
| 2020 | Alfred Aho, Jeffrey Ullman | Algorithms and programming languages |
| 2019 | Pat Hanrahan | Animation and 3D graphics |
| 2018 | Geoffrey Hinton, Yoshua Bengio, Yann LeCun | AI, deep learning |
| 2017 | John L. Hennessy (Chairman of Alphabet, Google's parent company) | Computer architecture, RISC: from smartphones to supercomputers |
| 2016 | Tim Berners-Lee | World Wide Web |
| 2015 | Whitfield Diffie, Martin Hellman | Public-key cryptography |

99% of the more than 20 billion processors produced every year are RISC.

**Prof. Yoshua Bengio**, Université de Montréal, at the **Okawa Prize** award ceremony, March 7, 2024.



## 99% of the 32-bit and 64-bit Processors Produced Every Year Are RISC

Examples: ARM, IBM Power, Renesas RH850, Infineon, SPARC, RISC-V

**★A15** (64-bit, iPhone 13, 2021): https://www.apple.com/jp/shop/buy-iphone

**IBM Watson**: http://watson2016.com/

Renesas

RIKEN Fugaku

## Amazon buys nuclear-powered data center from Talen

Thu, Mar 7, 2024, 10:01PM Nuclear News



Susquehanna nuclear plant in Salem Township, Penn., along with the data center in foreground. (Photo: Talen Energy)

Talen Energy announced its sale of a 960-megawatt data center campus to cloud service provider Amazon Web Services (AWS), a subsidiary of Amazon, for \$650 million.

The data center, Cumulus Data Assets, sits on a 1,200-acre campus in Pennsylvania and is directly powered by the adjacent Susquehanna Steam Electric Station, which generates 2.5 gigawatts of power.

## Representative International Papers Written During Doctoral Studies at Waseda University and Right After Receiving the Degree

IEEE TRANSACTIONS ON COMPUTERS, VOL. C-33, NO. 11, NOVEMBER 1984, p. 1023

## Practical Multiprocessor Scheduling Algorithms for Efficient Parallel Processing

HIRONORI KASAHARA, MEMBER, IEEE, AND SEINOSUKE NARITA, SENIOR MEMBER, IEEE

IEEE JOURNAL OF ROBOTICS AND AUTOMATION, VOL. RA-1, NO. 2, JUNE 1985, p. 104

## Parallel Processing of Robot-Arm Control Computation on a Multimicroprocessor System

HIRONORI KASAHARA, MEMBER, IEEE, AND SEINOSUKE NARITA, SENIOR MEMBER, IEEE

2nd International Conference on Supercomputing, May 3-8, 1987, Santa Clara, CA, USA

## A PARALLEL PROCESSING SCHEME FOR THE SOLUTION OF SPARSE LINEAR EQUATIONS USING STATIC OPTIMAL-MULTIPROCESSOR-SCHEDULING ALGORITHMS

H. Kasahara, T. Fujii, H. Nakayama, and S. Narita (Dept. of Electrical Eng., Waseda University, Tokyo, 160, Japan), and Leon O. Chua (Dept. of Electrical Eng. and Computer Sciences, University of California)

Copyright © IFAC 10th Triennial World Congress, Munich, FRG, 1987

## PARALLEL PROCESSING OF ROBOT MOTION SIMULATION

H. Kasahara, H. Fujii and M. Iwata

Department of Electrical Engineering, Waseda University, 3-4-1 Ohkubo Shinjuku-ku, Tokyo 160, Japan



## The First Compiler Codesigned Multiprocessor

## OSCAR (Optimally Scheduled Advanced Multiprocessor) in 1987



AMD29325 32-bit Floating-point unit

H. Kasahara, "OSCAR Fortran Multigrain Compiler", Stanford University, Hosted by Professor John L. Hennessy and Professor Monica Lam, May. 15. 1995.



## Three World No. 1 Supercomputers That Hironori Kasahara Helped Design and Develop

"NWT: Numerical Wind Tunnel", "Earth Simulator", and "K"

Father of Japanese Supercomputers: Mr. Hajime Miyoshi, National Aerospace Laboratory, mathematics graduate of Waseda University,

Waseda Alumnus, <u>Leader of NWT,</u> <u>Earth Simulator</u>

NWT (Numerical Wind Tunnel), 1993, 1.68GFLOPS <Fujitsu VPP 500, 5000>



Earth Simulator, 2002

40 TFLOPS Peak (40\*10<sup>12</sup>) 35.6 TFLOPS Linpack 3.2MW





## **IEEE Computer Society**

The first President from outside North America in 72 years history of IEEE CS



## ACM/IEEE SC (SuperComputing) 19, Denver, Nov.17-22, 2019



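## The End of Moore's Law

<u>Moore's law:</u> an empirical rule proposed in a 1965 paper by Gordon E. Moore (IEEE Computer Pioneer Award), one of the founders of Intel. It is popularly quoted as "semiconductor integration density doubles every 18 months"; the 1965 paper itself projected a doubling every year, which Moore revised in 1975 to a doubling every two years.

Multicore is essential for making computers both faster and lower in power consumption.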



IEEE ISSCC08: Paper No. 4.5, M. Ito, ... and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"

**Power** ∝ Frequency \* Voltage<sup>2</sup>

(Voltage ∝ Frequency)

**Power** ∝ Frequency<sup>3</sup>

If the <u>frequency</u> is reduced to <u>1/4</u> (e.g., 4GHz→1GHz), power consumption falls to 1/64 while performance falls to 1/4.

**Multicore**

If **8 cores** are integrated on a chip at that reduced frequency, total power is still only 1/8 (8 × 1/64) while performance doubles (8 × 1/4).
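The arithmetic on this slide can be checked directly. The sketch below assumes the idealized model stated above (power proportional to the cube of frequency, and performance scaling linearly with both frequency and core count, i.e. perfect parallel speedup); the function names are illustrative.

```python
from fractions import Fraction


def relative_power(freq_scale, cores=1):
    """Dynamic power relative to one core at full frequency, assuming P ∝ f^3."""
    return cores * freq_scale ** 3


def relative_performance(freq_scale, cores=1):
    """Throughput relative to one core at full frequency (perfect speedup assumed)."""
    return cores * freq_scale


quarter = Fraction(1, 4)  # e.g., 4 GHz -> 1 GHz

# One core at 1/4 frequency: (power, performance)
single = (relative_power(quarter), relative_performance(quarter))

# Eight cores at 1/4 frequency: (power, performance)
multi = (relative_power(quarter, cores=8), relative_performance(quarter, cores=8))

if __name__ == "__main__":
    print("1 core :", single)  # power 1/64, performance 1/4
    print("8 cores:", multi)   # power 1/8,  performance 2
```

In practice the speedup depends on how much parallelism the compiler can extract, so the factor of 2 is an upper bound under this model.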

## Power Reduction by Power Supply, Clock Frequency and Voltage Control by OSCAR Compiler

Frequency and voltage scaling (DVFS), clock gating, and power gating of each core are scheduled together with the task schedule, because dynamic power is proportional to the cube of the frequency F (F<sup>3</sup>) and leakage power (static power) can be cut by power gating (power off).
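One way to picture this trade-off is a task with a real-time deadline: run it at the lowest frequency that still meets the deadline, then power-gate the core. This is a simplified sketch of that idea only, not the OSCAR compiler's actual algorithm; the frequency list, cycle count, and function names are hypothetical.

```python
def pick_frequency(cycles, deadline_s, freqs_hz):
    """Return the lowest available frequency that still meets the deadline, or None."""
    feasible = [f for f in sorted(freqs_hz) if cycles / f <= deadline_s]
    return feasible[0] if feasible else None


def relative_energy(f, f_max):
    """Dynamic energy relative to running the same task at f_max.

    Assumes P ∝ f^3 and that the core is power-gated once the task finishes,
    so energy = power * time ∝ f^3 * (cycles / f) = cycles * f^2.
    """
    return (f / f_max) ** 2


FREQS_HZ = [0.5e9, 1.0e9, 2.0e9]  # hypothetical DVFS operating points
TASK_CYCLES = 40e6                # hypothetical work per frame
DEADLINE_S = 1 / 15               # one frame at 15 frames per second

chosen = pick_frequency(TASK_CYCLES, DEADLINE_S, FREQS_HZ)

if __name__ == "__main__":
    print(f"chosen frequency: {chosen / 1e9:.1f} GHz")
    print(f"energy vs. 2 GHz: {relative_energy(chosen, max(FREQS_HZ)):.2f}")
```

With these hypothetical numbers the 0.5 GHz point misses the deadline, so 1 GHz is chosen and the dynamic energy is a quarter of that at 2 GHz.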



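## Information Devices That Run on Solar Power

Hardware and software cooperate to reduce the power consumption of computers, so devices can keep running even when the power supply is lost.

Real-time MPEG2 decoding on the 8-core homogeneous multicore RP2, with power consumption reduced to 1/4

## Can Be Driven by Solar Cells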



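## Demonstration of the NEDO Real-Time Multicore Chip for Consumer Electronics (Project Leader: Kasahara) at the Council for Science and Technology Policy (April 10, 2008)

http://www8.cao.go.jp/cstp/gaiyo/honkaigi/74index.html

74th Council for Science and Technology Policy meeting (April 10, 2008)

[Photos (1) to (4): scenes from the 74th Council for Science and Technology Policy meeting.]

Research on multiprocessors co-designed across compiler (software) and architecture (hardware) since 1985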

4 core multicore RP1 (2007), 8 core multicore RP2 (2008) and 15 core Heterogeneous multicore RPX (2010) developed in NEDO Projects with Hitachi and Renesas

| | <b>RP-1</b> (ISSCC2007 #5.3) | <b>RP-2</b> (ISSCC2008 #4.5) | <b>RP-X</b> (ISSCC2010 #5.3) |
|---|--------------------------------------------|------------------------------------------------|----------------------------------------------------------------|
| Process | 90nm, 8-layer, triple-Vth, CMOS | 90nm, 8-layer, triple-Vth, CMOS | 45nm, 8-layer, triple-Vth, CMOS |
| Die size | 97.6 mm<sup>2</sup> (9.88 x 9.88 mm) | 104.8 mm<sup>2</sup> (10.61 x 9.88 mm) | 153.8 mm<sup>2</sup> (12.4 x 12.4 mm) |
| Supply voltage | 1.0V (internal), 1.8/3.3V (I/O) | 1.0-1.4V (internal), 1.8/3.3V (I/O) | 1.0-1.2V (internal), 1.2-3.3V (I/O) |
| Clock and performance | 600MHz, 4.32 GIPS, 16.8 GFLOPS | 600MHz, 8.64 GIPS, 33.6 GFLOPS | 648MHz, 13.7 GIPS, 115 GOPS, 36.2 GFLOPS |
| Power efficiency | 11.4 GOPS/W (32-bit equivalent) | 18.3 GOPS/W (32-bit equivalent) | 37.3 GOPS/W (32-bit equivalent) |

[Figure: die photos of RP-1 (cores 0 to 3), RP-2 (8 cores), and RP-X (CPU cores, media processors, PCIe, S-ATA, bus router).]

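## RP2: the World's First Compiler-Cooperative Green Multicore Chip, Jointly Developed by Waseda University, Hitachi, and Renesas Electronics in a METI/NEDO Project

Developed as a general-purpose multicore for automobiles, smartphones, data centers, servers, and more.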



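## World-Leading Compiler Technology for Multicores

## Technologies Unique to the OSCAR Compiler

## 1. Multigrain parallelization (exploits all available parallelism)

Multigrain parallelization combines coarse-grain task parallelization, loop parallelization, and near-fine-grain parallelization to exploit parallelism across the whole program. It extracts more parallelism than conventional instruction-level parallelism and delivers speedups across multiple multicores.

## 2. Memory usage optimization across the whole program

The compiler decomposes data and places it in local memory, and DMA controllers transfer data overlapped with task execution, which minimizes memory access and data transfer overhead.

## 3. Automatic power saving by stopping processors, memories, networks, etc. and controlling their operating speed

Using its low-power control functions, the compiler applies fine-grained frequency and voltage control and power shutdown inside an application to reduce power consumption.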
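To give a feel for what statically scheduling coarse-grain tasks onto cores involves, here is a minimal sketch of generic critical-path list scheduling on a small task graph. It is a textbook-style illustration and not the OSCAR compiler's actual algorithm; the task graph, durations, and function name are hypothetical.

```python
def list_schedule(durations, deps, num_cores):
    """Static list scheduling of a task graph onto identical cores.

    durations: {task: execution time}
    deps: {task: set of predecessor tasks}
    Returns {task: (core, start, finish)}.
    """
    succs = {t: [] for t in durations}
    for t, preds in deps.items():
        for p in preds:
            succs[p].append(t)

    # Priority = longest path from the task to the end of the graph.
    level = {}

    def longest_path(t):
        if t not in level:
            level[t] = durations[t] + max(
                (longest_path(s) for s in succs[t]), default=0
            )
        return level[t]

    for t in durations:
        longest_path(t)

    core_free = [0.0] * num_cores  # time at which each core becomes idle
    schedule = {}
    remaining = set(durations)
    while remaining:
        # A task is ready once all of its predecessors have been scheduled.
        ready = sorted(
            t for t in remaining if all(p in schedule for p in deps.get(t, ()))
        )
        task = max(ready, key=lambda t: level[t])
        core = min(range(num_cores), key=lambda c: core_free[c])
        earliest = max((schedule[p][2] for p in deps.get(task, ())), default=0.0)
        start = max(core_free[core], earliest)
        finish = start + durations[task]
        schedule[task] = (core, start, finish)
        core_free[core] = finish
        remaining.remove(task)
    return schedule


# Hypothetical task graph: A -> {B, C} -> D
DURATIONS = {"A": 2, "B": 3, "C": 1, "D": 2}
DEPS = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}

schedule = list_schedule(DURATIONS, DEPS, num_cores=2)
makespan = max(finish for _, _, finish in schedule.values())

if __name__ == "__main__":
    for task, (core, start, finish) in sorted(schedule.items()):
        print(f"{task}: core {core}, {start:.0f} to {finish:.0f}")
    print("makespan:", makespan)
```

On two cores, B and C run in parallel after A, so the makespan is 7 time units instead of 8 for sequential execution. The idle time that remains on the second core is the kind of slack that frequency reduction or power gating can exploit.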



### https://www.computer.org/product/education/multi-core-video-lectures-bundle





Multi-Core Lecture Series consists of 11 one-hour lectures by some of the world's leading researchers in the field. This series is not a course and it consists of the presentation for those who are in the research field. This is more intended for research information sharing than educational training. Topics that are covered during these lectures are listed below. This series also includes an hour discussion of the lecturers.

### Video Presentations:

- Automatic Parallelization by David Padua
- Autoparallelization for GPUs by Wen-Mei Hwu
- Dependences and Dependence Analysis by Utpal Banerjee
- Dynamic Parallelization by Rudolf Eigenmann
- Instruction Level Parallelization by Alexandru Nicolau
- Multigrain Parallelization and Power Reduction by Hironori Kasahara
- The Polyhedral Model by Paul Feautrier
- Vector Computation by David Kuck
- Vectorization by P. Sadayappan
- Vectorization/Parallelization in the IBM Compiler by Yaoqing Gao
- Vectorization/Parallelization in the Intel Compiler by Peng Tu
- Roundtable Discussion by all presenters


### Multi-core Roundtable Discussion Video Lecture

MULTI-CORE VIDEO SERIES



## Dependences and Dependence Analysis Video Lecture




Dependences and Dependence Analysis by Utpal Banerjee Utpal Banerjee's research interests in computer science are in the general area of parallel processing and he has published four books on loop transformations and dependence analysis, with a fifth one on instruction level parallelism on the way.

## Multigrain Parallelization and Power Reduction Video Lecture



Multigrain Parallelization and Power Reduction by Hironori Kasahara. Professor Kasahara has researched the OSCAR Automatic Parallelizing and Power Reducing Compiler and the OSCAR multicore architecture for more than 30 years, and has led four Japanese national projects on parallelizing compilers, multicores, and green computing.









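## Venue: Green Computing Systems Research and Development Center

Completed April 13, 2011; opened May 13, 2011

METI "FY2009 Subsidy for Industrial Technology R&D Facilities"

Advanced innovation center development program

### Goals

R&D on hardware, software, and application technologies for ultra-low-power, high-performance multicore/many-core processors\* that can be driven by solar cells and need no cooling fans

\*Next-generation multicore processors that integrate many processor cores on one chip

### Industry-Academia Collaboration

Hitachi, Fujitsu, Renesas, NEC, Toyota, Denso, Olympus, NSITEXE, Mitsubishi Electric, Oscar Technology, etc.

### Ripple Effects

### Ultra-Low-Power Many-Core

- Reduced CO<sub>2</sub> emissions
- Stronger international competitiveness in servers
- Higher added value for consumer electronics, automobiles, and other products that support Japan's industrial profits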



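### Green Computing: Environmentally Friendly Low-Power, High-Performance Computing

[Figure: application areas of the OSCAR multicore/server, OSCAR many-core, and accelerator platform (applications and OS on OSCAR), ranging from solar-powered devices and personal supercomputers to data centers.]

- Contribution to the environment, life, and the SDGs: carbon neutrality
- Data centers: 100 MW (requires a thermal power plant); 1 GW (requires a nuclear power plant)
- Traffic simulation and signal control (NTT Data, Hitachi)
- Disaster response: fire spread after an earthquake directly beneath the Tokyo metropolitan area, evacuation instructions for residents
- Medical: capsule endoscopes (Olympus); heavy-ion cancer treatment (Hitachi). Cancer Treatment, Carbon Ion Radiotherapy: 55 times speedup with 64 processors on an IBM Power 7 64-core SMP (Hitachi SR16000); Intel Xeon X5670 2.93GHz 12-core SMP (Hitachi HA8000)
- Automotive (Denso, Renesas, NEC): green engine control, deep learning for automated driving, ADAS, automatic parallelization of MATLAB/Simulink. Engine control by multicore with Denso: parallel processing of engine control on multicores had been very difficult, but Denso and Waseda achieved a 1.95 times speedup on a 2-core V850 multicore processor.
- Factory automation (Mitsubishi): highly reliable, low-cost software development
- HPC, AI, and big data: speedup and power reduction; speedup of car body design and deep learning (Hitachi)
- Smartphones and cameras: solar-powered operation and low power. Power Reduction of MPEG2 Decoding to 1/4 on 8 Core Homogeneous Multicore RP-2
- Contribution to people around the world: safe, secure, and convenient products and services (industry-government-academia collaboration, startups)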

## A Strategic Initiative of Computing: Systems and Applications (SISA): Integrating HPC, Big Data, AI and Beyond, Jan. 18-19, 2017

**Opening:** Prof. Gao, Prof. Kasahara, Waseda VP Shuji Hashimoto

### I. Architecture and Applications

Keynote: William J. Dally, NVIDIA and Stanford University, USA

- Depei Qian, BUAA, China
- Kimihiko Hirao, RIKEN, Japan
- G. W. Yang, Tsinghua Univ., China
- J. Sexton, IBM, USA

### II. System Software and Applications

Keynote: Rick Stevens, ANL, USA

- Mikhail Smelyanskiy, Intel, USA
- Fred Streitz, LLNL, USA
- R. Govind, IIS, India
- Hironori Kasahara, Waseda Univ.

### III. Extreme Scale and Beyond

Keynote: Paul Messina, ANL, USA

- Motoaki Saito, PEZY, Japan
- Eiji Ishida, MEXT, Japan
- Toshiyuki Shimizu, Fujitsu, Japan

### IV. Integration of HPC, Big Data, and AI

Keynote: Thomas Sterling, Indiana Univ., USA

- Masaru Kitsuregawa, NII and Univ. of Tokyo, Japan
- Thomas Schulthess, ETH, Switzerland
- Moriyuki Takamura/Toshiaki Kitamura, Oscar Tech, Japan







## **Automatic Power Reduction for MPEG2 Decode on Android Multicore**



**ODROID X2: ARM Cortex-A9, 4 cores** 

http://www.youtube.com/channel/UCS43lNYEIkC8i KIgFZYQBQ



- On 3 cores, automatic power reduction control cut power to 1/7 of that without power reduction control.
- 3 cores with the compiler's power reduction control used 1/3 of the power of ordinary 1-core execution.

## **Power Reduction on Intel Haswell** for Real-time Optical Flow

For HD 720p (1280x720) moving pictures, 15 fps (deadline 66.6 ms/frame)

On the same 3 cores, the compiler's power optimization reduced power to about 1/4 (from 41.6W to 9.6W).

Power with 3 cores under power control (9.6W) was about 1/3 of that with 1 core (29.3W).

## An Image of Static Schedule for Heterogeneous Multicore with Data Transfer Overlapping and Power Control



## Speedups & Power Reduction on RP-X Heterogeneous Multicore with 8 CPUs and 4 DRPs

Power Reduction in a real-time execution controlled

33 Times Speedup Using OSCAR Compiler and API on Renesas RP-X with 8 CPUs & 4 DRP Accelerators



## Automatic Power Reduction of OpenCV Face Detection on big.LITTLE ARM Processor



- ODROID-XU3 (Cortex-A7 and Cortex-A15)
  - Samsung Exynos 5422 Processor
    - 4x Cortex-A15 2.0GHz, 4x Cortex-A7 1.4GHz big.LITTLE Architecture
    - 2GB LPDDR3 RAM

Frequency can be changed by each cluster unit.

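## Solar-Powered Personal Supercomputer: New Accelerator Green Multicore

(AI, big data, self-driving cars, traffic control, cancer treatment, earthquakes, robots)

Cooperation with the OSCAR compiler, which offers world-leading performance and power reduction functions

<u>Shared-memory multicore system with co-located vector accelerators</u>

Performance: 8 TFLOPS, main memory: 8 TB

Power: 40W, **efficiency**: 200 GFLOPS/W

- A vector accelerator that can be attached to any processor without instruction set extensions
  - A low-power vector unit that starts up quickly, with a low-cost design
- Automatic vectorization, parallelization, and power reduction by the compiler
  - Frequency and supply voltage control
  - Fast barrier synchronization and local distributed memory to cut waste
  - Low memory cost through the use of local memory
- Anyone can use it without tuning, which enables low-cost, short-term software development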

## ISCA2025, June 21-25, 2025, Waseda University, Tokyo, Japan



## General Co-Chairs: Jean-Luc Gaudiot (Prof. UCI) Hironori Kasahara (Prof. Waseda)









## What Are the Most Cited ISCA Papers?

by David Patterson on Jun 15, 2023 https://www.sigarch.org/what-are-the-most-cited-isca-papers/

| Rank | Citations | Year | Title (★ means it won the ISCA Influential Paper Award) | First Author + HOF Authors | Type | Topic | Rank | Citations | Year | Title (★ means it won the ISCA Influential Paper Award) | First Author + HOF Authors | Type | Topic |
|----|------|------|----------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------|-------|---------------------------|----|------|------|--------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------|-------|-----------------------------|
| 1 | 5351 | 1995 | The SPLASH-2 programs: Characterization and methodological considerations | Stephen Woo, Anoop Gupta | Tool | Benchmark | 26 | 1201 | 1997 | The SGI Origin: A ccNUMA highly scalable server | James Laudon | Arch | Consistency/Coherence |
| 2 | 4214 | 2017 | In-datacenter performance analysis of a Tensor Processing Unit | Norm Jouppi, David Patterson | Arch | Machine Learning | 27 | 1177 | 2009 | A durable and energy efficient main memory using phase change memory technology | Ping Zhou | Tool | Benchmark |
| 3 | 3834 | 2000 | ★ Wattch: A framework for architectural-level power analysis and optimizations | David Brooks, Margaret Martonosi | Tool | Power | 28 | 1175 | 2014 | Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors | Yoongu Kim, Onur Mutlu, Chris Wilkerson | Micro | Security |
| 4 | 3386 | 1993 | ★ Transactional memory: Architectural support for lock-free data structures | Maurice Herlihy | Micro | Parallelism | 29 | 1166 | 2010 | Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU | Victor Lee | Tool | Simulator |
| 5 | 2690 | 2016 | EIE: Efficient inference engine on compressed deep neural network | Song Han, Bill Dally, Mark Horowitz | Arch | Machine Learning | 30 | 1104 | 2017 | SCNN: An accelerator for compressed-sparse convolutional neural networks | Angshuman Parashar, Joel Emer, Bill Dally, Steve Keckler | Arch | Machine Learning |
| 6 | 2620 | 2007 | ★ Power provisioning for a warehouse-sized computer | Xiaobo Fan, Luiz Barroso | Micro | Power | 31 | 1070 | 2015 | ShiDianNao: Shifting vision processing closer to the sensor | Zidong Du | Arch | Machine Learning |
| 7 | 2507 | 1992 | Active messages: a mechanism for integrated communication and computation | Thorsten von Eicken | Micro | Parallelism | 32 | 1051 | 1994 | ★ The Stanford FLASH multiprocessor | Jeffrey Kuskin, Kourosh Gharachorloo, Anoop Gupta, John Hennessy, Mark Horowitz | Arch | Parallelism |
| 8 | 2391 | 2011 | Dark silicon and the end of multicore scaling | Hadi Esmaeilzadeh, Doug Burger, Karthikeyan Sankaralingam | Micro | Parallelism | 33 | 1027 | 2004 | ★ Transactional memory coherence and consistency | Lance Hammond, Christos Kozyrakis, Kunle Olukotun | Micro | Consistency/Coherence |
| 9 | 2352 | 1995 | ★ Simultaneous multithreading: Maximizing on-chip parallelism | Dean Tullsen, Susan Eggers, Hank Levy | Micro | Parallelism | 34 | 1021 | 1992 | Lazy release consistency for software distributed shared memory | Pete Keleher | Micro | Consistency/Coherence |
| 10 | 2243 | 1990 | ★ Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers | Norm Jouppi | Micro | Cache | 35 | 1006 | 2001 | Cache decay: Exploiting generational behavior to reduce cache leakage power | Stefanos Kaxiras, Margaret Martonosi | Micro | Power |
| 11 | 1801 | 2009 | Architecting phase change memory as a scalable DRAM alternative | Benjamin Lee, Doug Burger, Engin Ipek, Onur Mutlu | Micro | NV RAM | 36 | 993 | 1990 | The directory-based cache coherence protocol for the DASH multiprocessor | Daniel Lenoski, Kourosh Gharachorloo, Anoop Gupta, John Hennessy | Micro | Consistency/Coherence |
| 12 | 1790 | 1990 | Memory consistency and event ordering in scalable shared-memory multiprocessors | Kourosh Gharachorloo, Anoop Gupta, John Hennessy | Micro | Consistency/Coherence | 37 | 991 | 2000 | Clock rate versus IPC: The end of the road for conventional microarchitectures | Vikas Agarwal, Doug Burger, Steve Keckler | Micro | Parallelism |
| 13 | 1769 | 2009 | Scalable high performance main memory system using<br>phase-change memory technology                                             | Moinuddin Qureshi                                                             | Micro | NV RAM                    | 38 | 980  | 1990 | Weak ordering—a new definition                                                                         | Sarita Adve, Mark Hill                                                         | Micro | Consistency/<br>Coherence   |
| 14 | 1659 | 2016 | ISAAC: A convolutional neural network accelerator with<br>in-situ analog arithmetic in crossbars                                 | Ali Shafioo, Rajeev<br>Balasubramonian, Naveen                                | Arch  | Machine                   | 39 | 957  | 2008 | Technology-driven, highly-scalable dragonfly topology                                                  | John Kim, Bill Dally                                                           | Micro | Interconnect                |
| 15 | 1643 | 2003 |                                                                                                                                  | Muralimanohar<br>Kevin Skandron                                               | Micro | Learning                  | 40 | 938  | 2007 | Adaptive insertion policies for high performance caching                                               | Moinuddin Qureshi, Joel<br>Emer, Yale Patt                                     | Micro | Cache                       |
| 16 |      | 2016 | Everiss: A spatial architecture for energy-efficient                                                                             | Yu-Hsin Chan, Joel Emer                                                       | Micro | Machine                   | 41 | 935  | 1973 | Banyan networks for partitioning multiprocessor systems                                                | Rodney Goke, Jack Lipovski                                                     | Micro | Interconnect                |
|    |      |      | dataflow for convolutional neural networks  Prime: A novel processing-in-memory architecture for                                 |                                                                               |       | Learning<br>Machine       | 42 | 911  | 2008 | 3D-Stacked Memory Architectures for Multi-core<br>Processors                                           | Gabriel Loh                                                                    | Micro | Cache                       |
| 17 |      |      | neural network computation in ReRAM-based main<br>memory                                                                         | Ping Chi, Yuan Xie                                                            | Arch  | Learning                  | 43 | 895  | 2010 | High performance cache replacement using re-reference interval prediction (RRIP)                       | Aamer Jaleel, Joel Emer                                                        | Micro | Cache                       |
| 18 | 1401 | 2014 | A reconfigurable fabric for accelerating large-scale datacenter services                                                         | Andrew Putnam, Hadi<br>Esmacilzadeh                                           | Micro | Interconnect              | 44 | 877  | 2015 | A scalable processing-in-memory accelerator for parallel graph processing                              | Junwhan Ahn, Omer Mutlu                                                        | Arch  | Parallelism                 |
| 19 | 1374 | 1992 | The turn model for adaptive routing                                                                                              | Christopher Glass                                                             | Micro | Interconnect              | 45 | 873  | 2000 | Transient fault detection via simultaneous multithreading                                              | Steven Reinhardt                                                               | Micro | Reliability                 |
| 20 | 1350 | 1995 | Multiscalar processors                                                                                                           | Guri Sohi, T. N. Vijaykumar                                                   | Micro | Parallelism               | 46 | 868  | 2008 | Corona: System implications of emerging nanophotonic technology                                        | Dana Vantrease, Norm<br>Jouppi                                                 | Micro | Interconnect                |
| 21 | 1302 | 2000 | Memory access scheduling                                                                                                         | Andrew Putnam, Bill Dally                                                     | Micro | Parallelism               | 47 | 863  | 2009 | An analytical model for a GPU architecture with<br>memory-level and thread-level parallelism awareness | Sunpyo Hong                                                                    | Tool  | Parallelism                 |
| 22 | 1284 | 1997 |                                                                                                                                  | Subbarao Palacharla, Norm<br>Jouppi, Jim Smith                                | Micro | Parallelism               | 48 | 859  | 2004 | Single-ISA heterogeneous multi-core architectures for<br>multithreaded workload performance            | Rakesh Kumar, Norm<br>Jouppi, Partha Ranganathan<br>& Dean Tullsen             | Micro | Parallelism                 |
| 23 | 1221 | 2002 | power                                                                                                                            | Sung Kim, Trevor Mudge                                                        | Micro | Power                     | 49 | 854  | 1974 | A Preliminary Architecture for a Basic Data Flow<br>Processor                                          | Jack Dennis                                                                    | Arch  | Parallelism                 |
| 24 | 1210 | 1996 | <ul> <li>Exploiting choice: Instruction fetch and issue on an<br/>implementable simultaneous multithreading processor</li> </ul> | Hank Levy Susan Eggers,<br>Joel Emer. Dean Tulisen                            | Micro | Parallelism               | 50 | 854  | 1983 | Very Long Instruction Word Architectures and the<br>FLI-512                                            | Josh Fisher                                                                    | Arch  | Parallelism                 |
| 25 | 1201 | 1997 | A Study of Branch Prediction Strategies                                                                                          | Jim Smith                                                                     | Micro | Parallelism               |    |      |      |                                                                                                        |                                                                                |       | 41                          |



# Future Multicore Products with Automatic Parallelizing Compiler



### **Next Generation Automobiles**

- Safer, more comfortable, energy efficient, environment friendly
- Cameras, radar, car-to-car communication, and internet information integrated with brake, steering, engine, and motor control

### **Smartphones**





- From daily recharging to less than once a week
- Solar-powered operation in emergencies
- Support for staying healthy

### **Advanced medical systems**





- Solar powered in emergencies
- No cooling fan, no dust: clean enough to use inside the operating room

### **Personal / Regional Supercomputers**



- Solar powered, with more than 100 times better power efficiency (FLOPS/W)
- Regional disaster simulators that save lives from tornadoes, localized heavy rain, and fires accompanying earthquakes