International Workshop on A Strategic Initiative of Computing: Systems and Applications (SISA): Integrating HPC, Big Data, AI and Beyond

# Integrated Development of Parallelizing and Power Reducing Compiler and Multicore Architecture for HPC to Embedded Applications

## Hironori Kasahara

- Professor, Dept. of Computer Science & Engineering
- Director, Advanced Multicore Processor Research Institute
- Waseda University (早稲田大学), Tokyo, Japan
- **IEEE Computer Society** President Elect 2017, President 2018
- URL: http://www.kasahara.cs.waseda.ac.jp/

Waseda Univ. Green Computing R&D Center, Jan. 18, 2017

## **Multicores for Performance and Low Power**

Power consumption is one of the biggest problems for performance scaling, from smartphones to cloud servers and supercomputers (the "K" computer: more than 10 MW).

IEEE ISSCC08: Paper No. 4.5, M. ITO, ... and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"

$$\text{Power} \propto \text{Frequency} \times \text{Voltage}^2 \quad (\text{Voltage} \propto \text{Frequency}) \;\Rightarrow\; \text{Power} \propto \text{Frequency}^3$$

- If <u>Frequency</u> is reduced to <u>1/4</u> (e.g. 4 GHz → 1 GHz), power is reduced to 1/64 and performance falls to 1/4.
- <u>Multicores</u>: if **8 cores** are integrated on a chip, **power** is still 1/8 and performance becomes 2 times.

With 128 cores, the OSCAR compiler gave a 100 times speedup against 1-core execution and a 211 times speedup against 1 core using the Sun (Oracle) Studio compiler.

## **Power Reduction of MPEG2 Decoding to 1/4** on 8 Core Homogeneous Multicore RP-2 by OSCAR Parallelizing Compiler

MPEG2 decoding with 8 CPU cores: average power falls from 5.73 W to 1.52 W, a 73.5% power reduction.

## Renesas-Hitachi-Waseda Low Power 8 core RP2 Developed in 2007 in METI/NEDO project

Reference: IEEE ISSCC08 Paper No. 4.5 (cited above).

### **Demo of NEDO Multicore for Real Time Consumer Electronics** at the Council for Science and Technology Policy (CSTP) on April 10, 2008

### 74th Council for Science and Technology Policy Meeting (April 10, 2008)

[Photos: scenes from the 74th Council for Science and Technology Policy meeting, (1) to (4)]

**CSTP Members**

- **Prime Minister:** Mr. Y. FUKUDA
- **Minister of State for Science, Technology and Innovation Policy:** Mr. F. KISHIDA
- **Chief Cabinet Secretary:** Mr. N. MACHIMURA
- **Minister of Internal Affairs and Communications:** Mr. H. MASUDA
- **Minister of Finance:** Mr. F. NUKAGA
- **Minister of Education, Culture, Sports, Science and Technology:** Mr. K. TOKAI
- **Minister of Economy, Trade and Industry:** Mr. A. AMARI

## Earliest Executable Condition Analysis for Coarse Grain Tasks (Macro-tasks)

## **OSCAR Parallelizing Compiler**

To improve effective performance, cost-performance and software productivity, and to reduce power.

### **Multigrain Parallelization**

Coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism.

### **Data Localization**

Automatic data management for distributed shared memory, cache and local memory.

### **Data Transfer Overlapping**

Data transfer overlapping using Data Transfer Controllers (DMAs).

### **Power Reduction**

Reduction of consumed power by compiler-controlled DVFS and power gating with hardware support.
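The frequency-scaling arithmetic on the "Multicores for Performance and Low Power" slide can be checked with a few lines of C. This is only a sketch under the slide's own idealizations: power proportional to the cube of frequency, and performance scaling linearly with both frequency and core count. The function names are illustrative and do not come from the OSCAR compiler.

```c
/* Relative power of n_cores cores, each clocked at freq_ratio times the
 * baseline frequency, assuming power is proportional to frequency cubed
 * (voltage scaled in proportion to frequency, as on the slide).
 * The baseline is one core at freq_ratio = 1.0. */
double relative_power(int n_cores, double freq_ratio)
{
    return n_cores * freq_ratio * freq_ratio * freq_ratio;
}

/* Relative performance, assuming ideal linear scaling with core count and
 * frequency. This is an idealization; real programs rarely scale perfectly. */
double relative_performance(int n_cores, double freq_ratio)
{
    return n_cores * freq_ratio;
}
```

With these definitions, `relative_power(1, 0.25)` is 1/64 and `relative_performance(1, 0.25)` is 1/4, while `relative_power(8, 0.25)` is 1/8 and `relative_performance(8, 0.25)` is 2, matching the figures quoted on the slide.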
## **Data Localization: Loop Aligned Decomposition**

- Loops are decomposed into LRs and CARs
  - LR (Localizable Region): data can be passed through LDM
  - CAR (Commonly Accessed Region): data transfers are required among processors

Figure panels: **Multi-dimension Decomposition** and **Single dimension Decomposition**. The single-dimension example decomposes three loops into data localization groups DLG0, DLG1 and DLG2:

| Loop body | Original range | DLG0 (LR) | CAR | DLG1 (LR) | CAR | DLG2 (LR) |
|---|---|---|---|---|---|---|
| `A(I)=2*I` | `DO I=1,101` | `DO I=1,33` | `DO I=34,35` | `DO I=36,66` | `DO I=67,68` | `DO I=69,101` |
| `B(I)=B(I-1)+A(I)+A(I+1)` | `DO I=1,100` | `DO I=1,33` | `DO I=34,34` | `DO I=35,66` | `DO I=67,67` | `DO I=68,100` |
| `C(I)=B(I)*B(I-1)` | `DO I=2,100` | `DO I=2,34` | | `DO I=35,67` | | `DO I=68,100` |

## **Data Localization**

## Local Memory Management Using Adjustable Blocks

- Decide a suitable block size for each application
  - different from fixed block sizes as in a cache
  - each block can be divided into smaller blocks whose sizes divide it evenly, to handle small arrays and scalar variables

Notation: Block<sub>Number</sub><sup>Level</sup>. Each row spans one block on local memory.

| Level | 1 block on local memory | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Level 0 | Block<sub>0</sub><sup>0</sup> | | | | | | | |
| Level 1 | Block<sub>0</sub><sup>1</sup> | | | | Block<sub>1</sub><sup>1</sup> | | | |
| Level 2 | Block<sub>0</sub><sup>2</sup> | | Block<sub>1</sub><sup>2</sup> | | Block<sub>2</sub><sup>2</sup> | | Block<sub>3</sub><sup>2</sup> | |
| Level 3 | B<sub>0</sub><sup>3</sup> | B<sub>1</sub><sup>3</sup> | B<sub>2</sub><sup>3</sup> | B<sub>3</sub><sup>3</sup> | B<sub>4</sub><sup>3</sup> | B<sub>5</sub><sup>3</sup> | B<sub>6</sub><sup>3</sup> | B<sub>7</sub><sup>3</sup> |

# Multi-dimensional Template Arrays for Improving Readability

- A mapping technique for arrays with varying dimensions
  - each block on LDM corresponds to multiple empty arrays with varying dimensions
  - these arrays have an additional dimension to store the corresponding block number
    - TA[Block#][] for single dimension
    - TA[Block#][][] for double dimension
    - TA[Block#][][][] for triple dimension
    - ...
- LDM is represented as a one-dimensional array
  - without Template Arrays, multidimensional arrays need complex index calculations
    - A[i][j][k] -> TA[offset + i' \* L + j' \* M + k']
  - Template Arrays provide readability
    - A[i][j][k] -> TA[Block#][i'][j'][k']

## **Block Replacement**

- Memory blocks are chosen for replacement with the schedule taken into account
  - Dead, live and reuse information of each block is used.
  - Unlike LRU, which uses past information, this method uses future information available from the static schedule.
- Block replacement priority
  1. Dead blocks (variables) that will not be accessed later.
  2. Live blocks that are accessed only by the other cores.
  3. The live block whose next access by the current core is the latest.
  4. A live block that will be accessed by the current core soon and whose data transfer overhead can be hidden by DMA overlapped transfer.

## Multicore Program Development Using OSCAR API V2.0

Development flow (figure), first part:

- **Sequential Application Program in Fortran or C** (consumer electronics, automobiles, medical, scientific computation, etc.)
- Figure labels: Homogeneous, Hetero, Manual parallelization / power reduction
- **Accelerator Compiler / User**: add "hint" directives before a loop or a function to specify that it is executable by the accelerator and in how many clocks
- **Waseda OSCAR Parallelizing Compiler**
  - Coarse grain task parallelization
  - Data localization
  - DMAC data transfer
  - Power reduction using DVFS, clock/power gating
- Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.
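The replacement priority listed on the Block Replacement slide can be sketched in C as follows. This is a minimal illustration, not the OSCAR compiler's implementation: the `BlockInfo` record, its field names, the `soon_window` threshold that separates "latest" from "soon", and the fallback class 5 are all assumptions introduced here. In the compiler, this information comes from the static schedule.

```c
#include <stddef.h>

/* Hypothetical per-block record; all field names are illustrative. */
typedef struct {
    int  is_dead;              /* 1 if the block is never accessed again           */
    int  used_by_current_core; /* 0 if only other cores will access it from now on */
    long next_use;             /* schedule step of the next access by this core    */
    int  dma_hideable;         /* 1 if a later reload can be overlapped by DMA     */
} BlockInfo;

/* Map a block to its replacement class; a smaller class is evicted first.
 * Classes 1 to 4 follow the slide. Class 5 (needed soon, reload cannot be
 * hidden) is an assumed last resort that the slide does not mention. */
static int replacement_class(const BlockInfo *b, long now, long soon_window)
{
    if (b->is_dead)                      return 1;
    if (!b->used_by_current_core)        return 2;
    if (b->next_use - now > soon_window) return 3;
    return b->dma_hideable ? 4 : 5;
}

/* Return the index of the block to evict, or -1 if there are no blocks.
 * Within classes 3 to 5 the block with the latest next use wins. */
int select_victim(const BlockInfo *blocks, size_t n, long now, long soon_window)
{
    int best = -1;
    int best_class = 0;
    for (size_t i = 0; i < n; i++) {
        int cls = replacement_class(&blocks[i], now, soon_window);
        int better = best < 0
            || cls < best_class
            || (cls == best_class && cls >= 3
                && blocks[i].next_use > blocks[best].next_use);
        if (better) {
            best = (int)i;
            best_class = cls;
        }
    }
    return best;
}
```

Because every decision depends only on future accesses, the same selection could be made once at compile time for each point in the schedule, which is the contrast with LRU that the slide draws.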
- **OSCAR API for Homogeneous and/or Heterogeneous Multicores and Manycores**: directives for thread generation, memory, data transfer using DMA, and power management
- Parallelized API Fortran or C program
  - Proc0: code with directives (Thread 0)
  - Proc1: code with directives (Thread 1)
  - Accelerator 1 code
  - Accelerator 2 code
- Generation of parallel machine codes using sequential compilers, for various multicores
  - Low power homogeneous multicore code generation: API analyzer and existing sequential compiler, for homogeneous multicores from Vendor A (SMP servers)
  - Low power heterogeneous multicore code generation: API analyzer (available from Waseda) and existing sequential compiler, for heterogeneous multicores from Vendor B
  - Server code generation: OpenMP compiler, for shared memory servers

OSCAR: Optimally Scheduled Advanced Multiprocessor. API: Application Program Interface.

# Cancer Treatment Carbon Ion Radiotherapy

(Previous best was 2.5 times speedup on 16 processors with hand optimization)

- 8.9 times speedup by 12 processors: Intel Xeon X5670 2.93 GHz 12 core SMP (Hitachi HA8000)
- 55 times speedup by 64 processors: IBM Power 7 64 core SMP (Hitachi SR16000)

## **Engine Control by multicore with Denso**

Although parallel processing of engine control on multicores has so far been very difficult, Denso and Waseda achieved a 1.95 times speedup on a 2-core V850 multicore processor.
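To compare the speedups quoted above across machines with different core counts, one common yardstick is parallel efficiency, defined as speedup divided by the number of processors. The slides do not report efficiency themselves; the helper below is a hypothetical illustration using that standard definition.

```c
/* Parallel efficiency: speedup divided by the number of processors used.
 * A value of 1.0 means perfect linear scaling. */
double parallel_efficiency(double speedup, int processors)
{
    return speedup / (double)processors;
}
```

Applied to the figures above, 55 times on 64 processors gives about 0.86, 8.9 times on 12 processors about 0.74, 1.95 times on 2 cores 0.975, and the earlier hand-optimized 2.5 times on 16 processors about 0.16.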
Figure caption: hard real-time automobile engine control by multicore.

## **OSCAR Compile Flow for Simulink Applications**

Generate C code using Embedded Coder:

```c
/* Model step function */
void VesselExtraction_step(void)
{
  int32_T i;
  real_T u0;

  for (i = 0; i < 16384; i++) {
    VesselExtraction_B.DataTypeConversion[i] = VesselExtraction_U.In1[i];
  }

  /* End of DataTypeConversion: '<S1>/Data Type Conversion' */

  /* Outputs for Atomic SubSystem: '<S1>/2Dfilter' */

  /* Constant: '<S1>/h1' */
  VesselExtraction_Dfilter(VesselExtraction_B.DataTypeConversion,
    VesselExtraction_P.h1_Value, &VesselExtraction_B.Dfilter,
    (P_Dfilter_VesselExtraction_T *)&VesselExtraction_P.Dfilter);

  /* End of Outputs for SubSystem: '<S1>/2Dfilter' */

  /* Outputs for Atomic SubSystem: '<S1>/2Dfilter1' */

  /* Constant: '<S1>/h2' */
  VesselExtraction_Dfilter(VesselExtraction_B.DataTypeConversion,
    VesselExtraction_P.h2_Value, &VesselExtraction_B.Dfilter1,
    (P_Dfilter_VesselExtraction_T *)&VesselExtraction_P.Dfilter1);

  /* The slide shows only this excerpt; the rest of the step function is omitted. */
}
```

Figure panels: Simulink model, C code.

# Speedups of MATLAB/Simulink Image Processing on Various 4core Multicores (Intel Xeon, ARM Cortex A15 and Renesas SH4A)

- Road Tracking, Image Compression: http://www.mathworks.co.jp/jp/help/vision/examples
- Buoy Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/44706-buoy-detection-using-simulink
- Color Edge Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/28114-fast-edges-of-a-color-image--actual-color--not-converting-to-grayscale-/
- Vessel Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/24990-retinal-blood-vessel-extraction/

## 110 Times Speedup against the Sequential Processing for GMS Earthquake Wave Propagation Simulation on Hitachi SR16000 (Power7 Based 128 Core Linux SMP)

## Automatic Parallelization of Still Image Encoding Using JPEG-XR for the Next Generation Cameras and Drinkable Inner Camera

# Parallel Processing of Face Detection on Manycore, Highend and PC Server

- The OSCAR compiler gives an 11.55 times speedup for 16 cores against 1 core on the SR16000 Power7 high-end server.

## **OSCAR Heterogeneous Multicore**

- **DTU**: Data Transfer Unit
- **LPM**: Local Program Memory
- **LDM**: Local Data Memory
- **DSM**: Distributed Shared Memory
- **CSM**: Centralized Shared Memory
- **FVR**: Frequency/Voltage Control Register

### An Image of Static Schedule for Heterogeneous Multicore with Data Transfer Overlapping and Power Control

# 33 Times Speedup Using OSCAR Compiler and OSCAR API on RP-X

## Power Reduction in a real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)

Figure labels: without power reduction; power reduction by OSCAR Compiler; 70% power reduction.

## **Low-Power Optimization with OSCAR API**

## Automatic Power Reduction for MPEG2 Decode on Android Multicore

ODROID X2, ARM Cortex-A9, 4 cores

http://www.youtube.com/channel/UCS43INYEIkC8i_KIgFZYQBQ

- On 3 cores, automatic power reduction control reduced power to 1/7 of the power without power reduction control.
- 3 cores with the compiler power reduction control reduced power to 1/3 of ordinary 1-core execution.
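Power savings of this kind in real-time workloads come from running no faster than the deadline requires. The following C sketch illustrates that general idea only: it picks the lowest clock frequency that still meets a per-frame deadline and estimates the power with the cubic model from the "Multicores for Performance and Low Power" slide. It is not the OSCAR compiler's algorithm, and the frequency table, cycle counts and function names are hypothetical.

```c
#include <stddef.h>

/* Return the index of the lowest frequency in freq_mhz (sorted ascending)
 * at which `cycles` clock cycles of work finish within deadline_ms,
 * or -1 if even the highest frequency misses the deadline. */
int lowest_feasible_freq(const double *freq_mhz, size_t n,
                         double cycles, double deadline_ms)
{
    for (size_t i = 0; i < n; i++) {
        /* 1 MHz corresponds to 1000 cycles per millisecond. */
        double time_ms = cycles / (freq_mhz[i] * 1000.0);
        if (time_ms <= deadline_ms) {
            return (int)i;
        }
    }
    return -1;
}

/* Power at freq_mhz relative to power at max_freq_mhz, assuming power is
 * proportional to frequency cubed (voltage scaled with frequency). */
double relative_power_at(double freq_mhz, double max_freq_mhz)
{
    double r = freq_mhz / max_freq_mhz;
    return r * r * r;
}
```

For example, with an assumed table of 200, 400, 800 and 1600 MHz and 20 million cycles of work per frame, 400 MHz finishes in 50 ms, which meets a 66.6 ms deadline (15 frames per second), and the cubic model puts its power at 1/64 of the 1600 MHz power. Real processors have a minimum operating voltage and static leakage, so measured savings are smaller than this idealized figure.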
## Power Reduction on Intel Haswell for Real-time Optical Flow

Intel CPU Core i7 4770K, for HD 720p (1280x720) moving pictures at 15 fps (deadline 66.6 ms/frame):

- Power was reduced to 1/4 (9.6 W) by the compiler power optimization on the same 3 cores (41.6 W).
- Power with 3 cores was reduced to 1/3 (9.6 W) against 1 core (29.3 W).

### Power Reduction of Face Recognition on Intel Haswell 3 cores by OSCAR Compiler - Reduced Power to 2/5 on Intel -

Kasahara & Kimura Lab, Waseda University, TOKYO, http://www.kasahara.cs.waseda.ac.jp

- OSCAR Compiler
- Intel Haswell
- Power Reduction

### Measuring Environment

- CPU: Intel Core i7 4770K
- No. of cores: 4
- Frequency: 3.5 GHz to 0.8 GHz
- Motherboard: ASUS H81M-A
- Measuring current from CPU power source

### Speedup and Power reduction on Intel Haswell 3 Cores

## **OSCAR Technology**

Started up on Feb. 28, 2013, licensing all the patents and the OSCAR compiler from Waseda Univ.

- CEO: Dr. T. Ono (Ex-CEO of First Section-listed Company, VP of National Univ., Invited Prof. of Waseda U.)
- Executives:
  - Mr. T. Ito (Visiting Prof. Tokyo Agricult. and Eng. U.)
  - Prof. K. Shirai (Ex-President of Waseda U., **Chairman of Japanese Open Univ.**)
  - CTO: Mr. M. Takamura (Ex-Fellow Fujitsu Lab., Fujitsu VPP500, 5000 & NWT Development Leader)
  - Mr. K. Ashida (Ex-VP Sumitomo Trading, Ashida Consult. CEO, a leader of Business World)
- Auditor: Dr. S. Matsuda (Prof. Emeritus Waseda U., **Ex-President Ventures and Entrepreneurs Society**)
- Advisors:
  - Dr. T. Sato (Patent Attorney, Ex-President of Patent Attorneys Assoc., Gov. IP Committee)
  - **Ms. K. Ishiguro** (Lawyer, Supreme Court Trainer)
  - Mr. A. Fukuda (Leader of Alumni Assoc.)
  - **Prof. K. Kimura** (Waseda Univ.)
  - Prof. H. Kasahara (Waseda Univ.)

Photo caption: Fujitsu VPP5000

## OSCAR Vector Multicore and Compiler for Embedded to Servers with OSCAR Technology

### **Target:**

- Solar powered with compiler power reduction.
- Fully automatic parallelization and vectorization including local memory management and data transfer.
### **Vector Accelerator**

### **Features**

- Attachable to any CPU (Intel, ARM, IBM)
- Data driven initiation by sync flags

### **Function Units [tentative]**

- Vector Function Unit
  - 8 double precision ops/clock
  - 64 character ops/clock
  - Variable vector register length
  - Chaining LD/ST & vector pipes
- Scalar Function Unit

### Registers [tentative]

- Vector Register: 256 bytes/entry, 32 entries
- Scalar Register: 8 bytes/entry
- Floating Point Register: 8 bytes/entry
- Mask Register: 32 bytes/entry

### **Summary**

- Waseda University Green Computing Systems R&D Center, supported by METI, has been researching low-power, high-performance Green Multicore hardware, software and applications with government and industry, including Hitachi, Fujitsu, NEC, Renesas, Denso, Toyota, Olympus and OSCAR Technology.
- The OSCAR Automatic Parallelizing and Power Reducing Compiler has achieved speedup and/or power reduction for scientific applications including "Earthquake Wave Propagation", medical applications including "Cancer Treatment Using Carbon Ion" and "Drinkable Inner Camera", and industry applications including "Automobile Engine Control", "Smartphone" and "Wireless Communication Base Band Processing", on various multicores from different vendors including Intel, ARM, IBM, AMD, Qualcomm, Freescale, Renesas and Fujitsu.
  - 110 times speedup for "Earthquake Wave Propagation Simulation" on 128 cores of IBM Power 7 against 1 core, 55 times speedup for "Carbon Ion Radiotherapy Cancer Treatment" on 64 cores of IBM Power7, 1.95 times for "Automobile Engine Control" on Renesas 2 cores using SH4A or V850, and 55 times for "JPEG-XR Encoding for Capsule Inner Cameras" on the Tilera 64-core Tile64 manycore.
  - In automatic power reduction, the power consumed by real-time multimedia applications such as human face detection, H.264, MPEG2 and optical flow was reduced to 1/2 or 1/3 using 3 cores of ARM Cortex A9 and Intel Haswell, and to 1/4 using 8 cores of Renesas SH4A, against ordinary single-core execution.
- The compiler will be available on the market from OSCAR Technology.

## Fujitsu VPP500/NWT: PE Unit