# **Automatic Parallelization by OSCAR Compiler for NEC SX-Aurora TSUBASA**

Hironori Kasahara, Ph.D., IEEE Fellow, IPSJ Fellow
Senior Executive Vice President, Waseda University
IEEE Computer Society President 2018
URL: http://www.kasahara.cs.waseda.ac.jp/ (Booth 857 ITBL)

#### Career

- 1980 BS, 1982 MS, 1985 Ph.D., Dept. EE, Waseda Univ.
- 1985 Visiting Scholar: U. of California, Berkeley
- 1986 Assistant Prof., 1988 Associate Prof.
- 1989-90 Research Scholar: U. of Illinois, Urbana-Champaign, Center for Supercomputing R&D
- 1997 Prof.
- 2004 Director, Advanced Multicore Research Institute
- 2017 Member: the Engineering Academy of Japan and the Science Council of Japan
- Nov. 2018 Senior Vice President, Waseda Univ.

#### Awards

- 1987 IFAC World Congress Young Author Prize
- 1997 IPSJ Sakai Special Research Award
- 2005 STARC Academia-Industry Research Award
- 2008 LSI of the Year Second Prize
- 2008 Intel AsiaAcademic Forum Best Research Award
- 2010 IEEE CS Golden Core Member Award
- 2014 Minister of Edu., Sci. & Tech. Research Prize
- 2015 IPSJ Fellow
- 2017 IEEE Fellow, Eta Kappa Nu
- 2019 IEEE Spirit of Computer Society Award

Reviewed Papers: 216, Invited Talks: 180, Granted Patents: 50 (Japan, US, GB, China), Articles in Newspapers, Web News, Media incl. TV, etc.

#### Committees in Societies and Government: 260

- IEEE Computer Society: President 2018, Executive Committee (2017-2019), BoG (2009-14), Strategic Planning Committee Chair 2018, Multicore STC Chair (2012-), Japan Chair (2005-07)
- IPSJ Chair: HG for Magazine. & J. Edit, Sig. on ARC.
- [METI/NEDO] Project Leaders: Multicore for Consumer Electronics, Advanced Parallelizing Compiler; Chair: Computer Strategy Committee
- [Cabinet Office] CSTP Supercomputer Strategic ICT PT, Japan Prize Selection Committees, etc.
- [MEXT] Info. Sci. & Tech. Committee, Supercomputers (Earth Simulator, HPCI Promo., Next Gen. Supercomputer K) Committees, etc.
## About WASEDA - Waseda University

## **IEEE Computer Society**

The first President from outside the USA and Canada in the 72-year history of IEEE CS

- 84,000+ members
- 480 chapters
- 168 countries
- 31 technical committees & councils

Bjarne Stroustrup (Morgan Stanley & Columbia Univ.): 2018 IEEE Computer Society Computer Pioneer Award. IEEE COMPSAC2018 Keynote & Award Ceremony, July 25, 2018, Rihga Royal Hotel Tokyo.

Dr. David E. Shaw

#### **IEEE CS Awards Ceremonies with CS President 2018**

- June BoG Award Dinner with CS Award Winners and their Families, Phoenix
- Technical Achievement Award, in COMPSAC, Tokyo
- Computer Pioneer Award to Bjarne Stroustrup (C++), in COMPSAC, Tokyo
- B. Ramakrishna Rau Award, in MICRO, Fukuoka
- Award Ceremony in SC (Supercomputing 2018, with 13 thousand participants), Dallas

## ACM/IEEE SC (SuperComputing) 19, Denver, Nov. 17-22, 2019

## **Multicores for Performance and Low Power**

Power consumption is one of the biggest problems for performance scaling, from smartphones to cloud servers and supercomputers ("K" consumes more than 10 MW).

IEEE ISSCC08: Paper No. 4.5, M. ITO, ... and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"

$\text{Power} \propto \text{Frequency} \times \text{Voltage}^2$ (with $\text{Voltage} \propto \text{Frequency}$)

If <u>Frequency</u> is reduced to <u>1/4</u> (e.g. 4 GHz → 1 GHz), Power is reduced to <u>1/64</u> and Performance falls to <u>1/4</u>.

<u>Multicores</u>: if <u>8 cores</u> running at the reduced frequency are integrated on a chip, Power is still only 1/8 (8 × 1/64) and Performance becomes 2 times (8 × 1/4).

#### Parallel Soft is important for scalable performance of multicore (LCPC2015)

- Simply adding more cores does not give speedup
- The development cost and period of parallel software are becoming a bottleneck in the development of embedded systems, e.g.
IoT, Automobile

Earthquake wave propagation simulation GMS, developed by the National Research Institute for Earth Science and Disaster Resilience (NIED)

[Figure legend: proposed method; original (Sun Studio)]

- The automatic parallelizing compiler available on the market gave no speedup on 64 cores against the execution time on 1 core
- With the same compiler, execution with 128 cores was slower than with 1 core (0.9 times speedup)
- The advanced OSCAR parallelizing compiler gave a 211 times speedup with 128 cores against the execution time with 1 core using the commercial compiler
- The OSCAR compiler gave a 2.1 times speedup on 1 core against the commercial compiler by global cache optimization

## **Power Reduction of MPEG2 Decoding to 1/4** on 8 Core Homogeneous Multicore RP-2 by OSCAR Parallelizing Compiler

MPEG2 Decoding with 8 CPU cores: Avg. Power 5.73 [W] → 1.52 [W] (73.5% Power Reduction)

## **OSCAR Parallelizing Compiler**

To improve effective performance, cost-performance and software productivity, and to reduce power

#### **Multigrain Parallelization** (LCPC1991, 2001, 04)

Coarse-grain parallelism among loops and subroutines (2000 on SMP) and near fine grain parallelism among statements (1992), in addition to loop parallelism

#### **Data Localization**

Automatic data management for distributed shared memory, cache and local memory (Local Memory 1995, 2016 on RP2, Cache 2001, 03). Software Coherent Control (2017)

#### **Data Transfer Overlapping** (2016 partially)

Data transfer overlapping using Data Transfer Controllers (DMAs)

#### **Power Reduction** (2005 for Multicore, 2011 Multi-processes, 2013 on ARM)

Reduction of consumed power by compiler-controlled DVFS and power gating with hardware support.

## Multicore Program Development Using OSCAR API V2.0

#### **Sequential Application Program in Fortran or C**

(Consumer Electronics, Automobiles, Medical, Scientific computation, etc.)
[Diagram labels: Hetero; Homogeneous; Manual parallelization / power reduction]

#### **Accelerator Compiler/ User**

Add "hint" directives before a loop or a function to specify that it is executable by the accelerator and how many clocks it takes

#### Waseda OSCAR **Parallelizing Compiler**

- **Coarse grain task** parallelization
- **Data Localization**
- **DMAC** data transfer
- Power reduction using **DVFS, Clock/ Power gating**

Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.

#### **OSCAR API for Homogeneous and/or Heterogeneous Multicores and manycores**

Directives for thread generation, memory, data transfer using DMA, power management

#### **Parallelized API F or C program**

- Proc0: Code with directives (Thread 0)
- Proc1: Code with directives (Thread 1)
- Accelerator 1 Code
- Accelerator 2 Code

#### **Generation of parallel machine codes using sequential compilers**

- Low Power Homogeneous Multicore Code Generation: API Analyzer, existing sequential compiler
- Low Power Heterogeneous Multicore Code Generation: API Analyzer (Available from Waseda), existing sequential compiler
- Server Code Generation: OpenMP Compiler

Targets:

- Homogeneous Multicores from Vendor A (SMP servers)
- Various multicores
- Heterogeneous Multicores from Vendor B
- Shared memory servers

**OSCAR:** Optimally Scheduled Advanced Multiprocessor
**API:** Application Program Interface

## **Automatic Parallelization Tool of MATLAB/Simulink:** OSCAR Tech "OSCARator"

https://www.oscartech.jp/en/

- OSCARator is a simulation accelerator for MATLAB/Simulink on multicore processors
- Based on the "OSCAR Compiler" automatic parallelization technology developed by Kasahara and Kimura Lab.,
Waseda University

## Speedup of Simulink Models by OSCARator on 4 cores Intel Core i5 Processor

https://www.oscartech.jp/en/

6.51 times speedup on 4 cores against 1 core MATLAB Accelerator Mode for VesselExtraction

Environment: Intel Core i5 7400T 2.4GHz (4 cores), 16GB (SODIMM 2400MHz), Windows 10 Pro (1903), MATLAB R2019a Update 5, MinGW GCC 6.3

#### RoadTracking

- from Computer Vision Toolbox
- https://jp.mathworks.com/help/vision/examples/color-based-road-tracking.html

#### VesselExtraction

- from MATLAB Central
- modified for Simulink Model
- https://www.mathworks.com/matlabcentral/fileexchange/24990-retinal-blood-vesselextraction

#### HybridVehicle

- Hybrid Vehicle Powertrain
- developed by Kusaka Lab., Waseda University
- http://www.f.waseda.jp/jin.kusaka/

(Compared with MATLAB Accelerator Mode Simulation)

## **Low-Power Optimization with OSCAR API**

## **Automatic Power Reduction on Intel Haswell**

H.264 decoder & Optical Flow (3 cores)

- Power for 3 cores was reduced to $1/3 \sim 1/4$ compared with execution without software power control
- Power for 3 cores was reduced to $2/5 \sim 1/3$ compared with ordinary 1-core execution

## **OSCAR Heterogeneous Multicore**

- **DTU**: Data Transfer Unit
- **LPM**: Local Program Memory
- **LDM**: Local Data Memory
- **DSM**: Distributed Shared Memory
- **CSM**: Centralized Shared Memory
- **FVR**: Frequency/Voltage Control Register

## An Image of Static Schedule for Heterogeneous Multicore with Data Transfer Overlapping and Power Control

## Speedup of NPB/CG by OSCAR Compiler on NEC SX-Aurora TSUBASA

A100-1, 8 cores, 10C VE

57 times speedup for 8-core parallelization by OSCAR Compiler & NEC Vectorization against NEC 1-core Vectorization

[Chart: speedup at 2, 4 and 8 cores; legend: OSCAR Parallelization + NEC Vectorization, NEC 1core Vectorization]

## Speedup of NPB/BT by OSCAR Compiler on NEC SX-Aurora TSUBASA

A100-1, 8 cores, 10C VE

## OSCAR Vector Multicore and Compiler for Embedded to Servers with OSCAR Technology

#### **Target:**

- Solar Powered
- Compiler power
reduction.
- Fully automatic parallelization and vectorization including local memory management and data transfer.

#### **Vector Accelerator**

#### **Features**

- Attachable to any CPU (Intel, ARM, IBM)
- Data driven initiation by sync flags

#### Function Units [tentative]

- Vector Function Unit
  - 8 double precision ops/clock
  - 64 characters ops/clock
  - Variable vector register length
  - Chaining LD/ST & Vector pipes
- Scalar Function Unit

#### Registers [tentative]

- Vector Register 256Bytes/entry, 32 entries
- Scalar Register 8Bytes/entry
- Floating Point Register 8Bytes/entry
- Mask Register 32Bytes/entry

## **Future Multicore Products with Automatic Parallelizing Compiler**

#### **Next Generation Automobiles**

- Safer, more comfortable, energy efficient, environment friendly
- Cameras, radar, car2car communication, internet information integrated brake, steering, engine, motor control

#### Smart phones

- From everyday recharging to less than once a week
- Solar powered operation in emergency condition
- Keep health

#### Advanced medical systems

- Cancer treatment, Drinkable inner camera
- Emergency solar powered
- No cooling fan, No dust, clean usable inside OP room

#### Personal / Regional Supercomputers

- Solar powered, with more than 100 times better power efficiency (FLOPS/W)
- Regional Disaster Simulators saving lives from tornadoes, localized heavy rain, fires with earthquakes
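
The power-efficiency goals above build on the scaling relation from the "Multicores for Performance and Low Power" slide (Power ∝ Frequency × Voltage², with Voltage ∝ Frequency). The following is a minimal Python sketch of that arithmetic, assuming the idealized cubic power model and perfectly linear speedup across cores. Both are simplifications, and the function names are illustrative only; they are not part of the OSCAR compiler or API.

```python
from fractions import Fraction


def relative_power(freq_scale, cores=1):
    """Power relative to one core running at full frequency.

    Assumption: Power ∝ Frequency * Voltage^2 and Voltage ∝ Frequency,
    so each core draws freq_scale**3 of the baseline power.
    """
    return cores * freq_scale ** 3


def relative_performance(freq_scale, cores=1):
    """Performance relative to one core running at full frequency.

    Assumption: ideal linear speedup across cores (optimistic).
    """
    return cores * freq_scale


quarter = Fraction(1, 4)  # e.g. 4 GHz -> 1 GHz

print(relative_power(quarter))                 # 1/64
print(relative_performance(quarter))           # 1/4
print(relative_power(quarter, cores=8))        # 1/8
print(relative_performance(quarter, cores=8))  # 2
```

Under these assumptions, eight cores at a quarter of the frequency deliver twice the performance at one eighth of the power, that is, 16 times the performance per watt. Real chips deviate from the cubic model (leakage power and minimum-voltage limits, for example), so this is a back-of-the-envelope check rather than a prediction.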