## **Future of Green Multicore Computing**

Hironori Kasahara

**IEEE Computer Society** President Elect 2017, President 2018

Professor, Dept. of Computer Science & Engineering

Director, Advanced Multicore Processor Research Institute

Waseda University, Tokyo, Japan

URL: http://www.kasahara.cs.waseda.ac.jp/

Waseda Univ. Green Computing R&D Center

## Hironori Kasahara Voted 2017 IEEE Computer Society President-Elect

LOS ALAMITOS, Calif., 30 September 2016 – Hironori Kasahara, a Professor of Computer Science at Waseda University in Tokyo, and Director of the Advanced Multicore Research Institute, has been voted IEEE Computer Society 2017 President-Elect. Kasahara is a former member of the IEEE-CS Board of Governors, and has served as chair of the IEEE-CS Multicore STC and CS Japan Chapter and as a board member of the IEEE Tokyo Section. Kasahara will serve as the 2018 IEEE CS President for a one-year term beginning 1 January 2018.

Kasahara garnered 3,278 votes, compared with 2,804 votes cast for Hausi A. Müller, a Professor of Computer Science and Associate Dean of Research, Faculty of Engineering at University of Victoria, Canada, and a member of the IEEE-CS Board of Governors.

The President oversees IEEE-CS programs and operations and is a nonvoting member of most IEEE-CS program boards and committees.

The 2016 election had a 12.69% turnout, with 6,357 ballots cast. The turnout was higher than in the 2015 election, which had a 12.68% turnout (6,239 ballots cast), and the 2014 election, which had a 12.66% turnout (6,728 ballots cast).
#### 2016 IEEE Computer Society Election Results

Posted 29 September 2016

#### Hironori Kasahara selected 2017 President-Elect (2018 President)

Hironori Kasahara has served as a chair or member of 225 society and government committees, including as a member of the CS Board of Governors; chair of the CS Multicore STC and CS Japan chapter; associate editor of IEEE Transactions on Computers; vice PC chair of the 1996 ENIAC 50th Anniversary International Conference on Supercomputing; general chair of LCPC; PC member of SC, PACT, PPoPP, and ASPLOS; board member of the IEEE Tokyo section; and member of the Earth Simulator committee.

He received a PhD in 1985 from Waseda University, Tokyo, joined its faculty in 1986, and has been a professor of computer science since 1997 and a director of the Advanced Multicore Research Institute since 2004. He was a visiting scholar at University of California, Berkeley, and the University of Illinois at Urbana-Champaign's Center for Supercomputing R&D.

Kasahara received the CS Golden Core Member Award, IFAC World Congress Young Author Prize, IPSJ Fellow and Sakai Special Research Award, and the Japanese Minister's Science and Technology Prize. He led Japanese national projects on parallelizing compilers and embedded multicores, has presented 210 papers and 132 invited talks, and has 27 patents. His research has appeared in 520 newspaper and Web articles.
#### **IEEE Computer Society**

- 60,000+ members
- Volunteer-led organization
- 200 technical conferences
- Industry-oriented "Rock Stars"
- 17 scholarly journals and 13 magazines
- Awards program
- Digital Library with more than 550,000 articles and papers
- 400 local and regional chapters
- 40 technical committees

## IEEE Computer Society BoG (Board of Governors) Feb. 1, 2017

#### Toward 2018

- Refining content and services to further improve the satisfaction of CS members;
- Considering an <u>incentive for volunteers</u> to further accelerate CS activities and promptly provide technical benefits for people around the globe. To express appreciation to volunteers: CS Point (Mileage) System: Annual & Life Time Honor, Premier Seating, Premier Registration, Distinguished Reviewer, etc.;
- Offering more <u>attractive services</u> for practitioners in <u>industry</u>;
- Providing the world's best educational content and historical treasures for future generations, which only the CS can create with our pioneering researchers (for example, the Multicore Compiler Video Series found at www.computer.org/web/education/multicore-video-series);
- Thinking about <u>sustainable membership fees</u> while considering the diversity of economic situations within the 10 regions;
- Cooperating with other IEEE societies and sister societies in a timely and efficient manner;
- Intelligibly introducing the latest computer-related technologies to younger generations, including children, so that they can realize their technological dreams.

### **Multicores for Performance and Low Power**

Power consumption is one of the biggest problems for performance scaling from smartphones to cloud servers and supercomputers (the "K" computer consumes more than 10 MW).

IEEE ISSCC08: Paper No. 4.5, M. Ito, ... and H.
Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"

Power ∝ Frequency × Voltage<sup>2</sup> and Voltage ∝ Frequency, so Power ∝ Frequency<sup>3</sup>.

If <u>Frequency</u> is reduced to <u>1/4</u> (e.g. 4 GHz → 1 GHz), Power is reduced to 1/64 and Performance falls to 1/4.

<u>Multicores</u>: if **8 cores** running at that reduced frequency are integrated on a chip, Power is still only 1/8 and Performance becomes 2 times.

With 128 cores, the OSCAR compiler gave us a 100 times speedup against 1 core execution and a 211 times speedup against 1 core using the Sun (Oracle) Studio compiler.

## **Power Reduction of MPEG2 Decoding to 1/4** on 8 Core Homogeneous Multicore RP-2 by OSCAR Parallelizing Compiler

MPEG2 decoding with 8 CPU cores: avg. power 5.73 [W] → 1.52 [W] (73.5% power reduction)

#### **Demo of NEDO Multicore for Real Time Consumer Electronics** at the Council for Science and Technology Policy (CSTP) on April 10, 2008

74th Council for Science and Technology Policy (April 10, 2008). Photos: scenes from the 74th Council for Science and Technology Policy (1) to (4).

**CSTP Members**

- **Prime Minister:** Mr. Y. FUKUDA
- **Minister of State for Science, Technology and Innovation Policy:** Mr. F. KISHIDA
- **Chief Cabinet Secretary:** Mr. N. MACHIMURA
- **Minister of Internal Affairs and Communications:** Mr. H. MASUDA
- **Minister of Finance:** Mr. F. NUKAGA
- **Minister of Education, Culture, Sports, Science and Technology:** Mr. K. TOKAI
- **Minister of Economy, Trade and Industry:** Mr. A. AMARI

## Green Computing Systems R&D Center

## **Waseda University**

## **Supported by METI (Mar. 2011 Completion)**

**R&D Target:** hardware, software and applications for super low-power manycores

- More than 64 cores
- Natural air cooling (no fan): cool, compact, clear, quiet
- Operational by solar panel

**Industry, Government, Academia:** Hitachi, Fujitsu, NEC, Renesas, Olympus, Toyota, Denso, Mitsubishi, Toshiba, OSCAR Technology, etc.

**Ripple Effect**

- Low CO<sub>2</sub> (carbon dioxide) emissions
- **Creation of value-added products**: automobiles, medical, IoT, servers

Location: beside the subway Waseda Station, near the Waseda Univ. main campus

## **OSCAR Parallelizing Compiler**

To improve effective performance, cost-performance and software productivity, and to reduce power

#### **Multigrain Parallelization**

Coarse-grain parallelism among loops and subroutines and near-fine-grain parallelism among statements, in addition to loop parallelism

#### **Data Localization**

Automatic data management for distributed shared memory, cache and local memory

#### **Data Transfer Overlapping**

Data transfer overlapping using Data Transfer Controllers (DMAs)

#### **Power Reduction**

Reduction of consumed power by compiler-controlled DVFS and power gating with hardware support

## Multicore Program Development Using OSCAR API V2.0

**Sequential Application Program in Fortran or C** (Consumer Electronics, Automobiles, Medical, Scientific computation, etc.)

- Homogeneous / Hetero
- Manual parallelization / power reduction

#### **Accelerator Compiler/User**

Add "hint" directives before a loop or a function to specify that it is executable by the accelerator and in how many clocks.

## Waseda OSCAR Parallelizing Compiler

- Coarse grain task parallelization
- Data Localization
- DMAC data transfer
- Power reduction using DVFS, Clock/Power gating

Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.
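The coarse-grain task parallelization listed above can be illustrated with a minimal sketch. It uses standard OpenMP `sections` as a stand-in and is not the OSCAR API or the OSCAR compiler's output; the functions `sum_elements`, `max_element` and `analyze` are hypothetical names chosen for this example. Two independent "macro-tasks" (a loop and a subroutine with no data dependence between them) are assigned to separate sections so that they may run on different cores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical macro-task 1: sum of all elements (an independent loop). */
static long sum_elements(const int *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

/* Hypothetical macro-task 2: largest element (does not depend on task 1). */
static int max_element(const int *data, size_t n) {
    assert(n > 0);
    int best = data[0];
    for (size_t i = 1; i < n; i++) {
        if (data[i] > best) {
            best = data[i];
        }
    }
    return best;
}

/* Runs both macro-tasks. When compiled with OpenMP enabled, each section
 * may be executed by a different thread; each section writes only its own
 * variable, so there is no data race. Without OpenMP the pragmas are
 * ignored and the tasks run one after the other with the same results. */
void analyze(const int *data, size_t n, long *sum_out, int *max_out) {
    long sum = 0;
    int max = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        sum = sum_elements(data, n);

        #pragma omp section
        max = max_element(data, n);
    }
    *sum_out = sum;
    *max_out = max;
}
```

To try it, call `analyze` from a small `main` and compile with an OpenMP-capable compiler (for example with the `-fopenmp` flag of GCC). A parallelizing compiler would have to prove the independence of the two tasks itself; here it is simply asserted by the programmer.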
**OSCAR API for Homogeneous and/or Heterogeneous Multicores and Manycores**: directives for thread generation, memory, data transfer using DMA, and power management

Parallelized API Fortran or C program:

- Proc0: code with directives (Thread 0)
- Proc1: code with directives (Thread 1)
- Accelerator 1 code
- Accelerator 2 code

Generation of parallel machine codes using sequential compilers:

- Low-power homogeneous multicore code generation: API analyzer, existing sequential compiler
- Low-power heterogeneous multicore code generation: API analyzer (available from Waseda), existing sequential compiler
- Server code generation: OpenMP compiler

Targets: homogeneous multicores from Vendor A, heterogeneous multicores from Vendor B, various multicores, shared memory servers (SMP servers)

OSCAR: Optimally Scheduled Advanced Multiprocessor

API: Application Program Interface

# Cancer Treatment Carbon Ion Radiotherapy

(Previous best was 2.5 times speedup on 16 processors with hand optimization)

- 8.9 times speedup by 12 processors: Intel Xeon X5670 2.93 GHz 12 core SMP (Hitachi HA8000)
- 55 times speedup by 64 processors: IBM Power 7 64 core SMP (Hitachi SR16000)

## **Engine Control by multicore with Denso**

Though parallel processing of engine control on multicores has so far been very difficult, Denso and Waseda succeeded in obtaining a 1.95 times speedup on a 2-core V850 multicore processor.
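The speedups quoted above are measured on different core counts, so one way to compare them is parallel efficiency, defined here as speedup divided by the number of cores. The helper below is a minimal sketch; the function name `parallel_efficiency` is introduced for this example only, and the numbers in the comments are the ones stated on the slides.

```c
#include <assert.h>

/* Parallel efficiency = speedup / number of cores.
 * A value of 1.0 would mean ideal linear scaling. */
double parallel_efficiency(double speedup, int cores) {
    assert(cores > 0);
    return speedup / (double)cores;
}

/* Slide figures:
 *   carbon ion radiotherapy, 12-core Xeon X5670:  8.9  / 12
 *   carbon ion radiotherapy, 64-core Power 7:     55.0 / 64
 *   engine control, 2-core V850:                  1.95 / 2
 */
```

With these inputs the function gives roughly 0.74, 0.86 and 0.975, which suggests that the 2-core engine control result is close to linear scaling even though its absolute speedup is the smallest.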
Hard real-time automobile engine control by multicore

## **OSCAR Compile Flow for Simulink Applications**

Simulink model → generate C code using Embedded Coder → C code (excerpt):

    /* Model step function */
    void VesselExtraction_step(void)
    {
      int32_T i;
      real_T u0;
      for (i = 0; i < 16384; i++) {
        VesselExtraction_B.DataTypeConversion[i] = VesselExtraction_U.In1[i];
      }
      /* End of DataTypeConversion: '<S1>/Data Type Conversion' */

      /* Outputs for Atomic SubSystem: '<S1>/2Dfilter' */
      /* Constant: '<S1>/h1' */
      VesselExtraction_Dfilter(VesselExtraction_B.DataTypeConversion,
        VesselExtraction_P.h1_Value, &VesselExtraction_B.Dfilter,
        (P_Dfilter_VesselExtraction_T *)&VesselExtraction_P.Dfilter);
      /* End of Outputs for SubSystem: '<S1>/2Dfilter' */

      /* Outputs for Atomic SubSystem: '<S1>/2Dfilter1' */
      /* Constant: '<S1>/h2' */
      VesselExtraction_Dfilter(VesselExtraction_B.DataTypeConversion,
        VesselExtraction_P.h2_Value, &VesselExtraction_B.Dfilter1,
        (P_Dfilter_VesselExtraction_T *)&VesselExtraction_P.Dfilter1);
      /* ... */
    }

# Speedups of MATLAB/Simulink Image Processing on Various 4core Multicores (Intel Xeon, ARM Cortex A15 and Renesas SH4A)

- Road Tracking, Image Compression: <a href="http://www.mathworks.co.jp/jp/help/vision/examples">http://www.mathworks.co.jp/jp/help/vision/examples</a>
- Buoy Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/44706-buoy-detection-using-simulink
- Color Edge Detection: <a href="http://www.mathworks.co.jp/matlabcentral/fileexchange/28114-fast-edges-of-a-color-image--actual-color--not-converting-to-grayscale-/">http://www.mathworks.co.jp/matlabcentral/fileexchange/28114-fast-edges-of-a-color-image--actual-color--not-converting-to-grayscale-/</a>
- Vessel Detection: <a href="http://www.mathworks.co.jp/matlabcentral/fileexchange/24990-retinal-blood-vessel-extraction/">http://www.mathworks.co.jp/matlabcentral/fileexchange/24990-retinal-blood-vessel-extraction/</a>

# Parallel Processing of Face Detection on Manycore, Highend and PC Server

- The OSCAR compiler gives us an 11.55 times speedup for 16 cores against 1 core on the SR16000 Power7 highend server.

#### An Image of Static Schedule for Heterogeneous Multicore with Data Transfer Overlapping and Power Control

# 33 Times Speedup Using OSCAR Compiler and OSCAR API on RP-X

## Power Reduction in a real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X (Optical Flow with a hand-tuned library)

Compared with execution without power reduction, power control by the OSCAR compiler achieved a 70% power reduction.

## **Low-Power Optimization with OSCAR API**

## Automatic Power Reduction for MPEG2 Decode on Android Multicore

ODROID X2, ARM Cortex-A9, 4 cores

<a href="http://www.youtube.com/channel/UCS43INYEIkC8i_KIgFZYQBQ">http://www.youtube.com/channel/UCS43INYEIkC8i_KIgFZYQBQ</a>

- On 3 cores, automatic power reduction control reduced power to 1/7 of the power consumed without power reduction control.
- 3 cores with the compiler power reduction control consumed 1/3 of the power of ordinary 1 core execution.

## Power Reduction on Intel Haswell for Real-time Optical Flow

Intel CPU Core i7 4770K, for HD 720p (1280x720) moving pictures at 15 fps (deadline 66.6 [ms/frame])

- Power was reduced to 1/4 (9.6 W) by the compiler power optimization on the same 3 cores (41.6 W without it).
- Power with 3 cores was reduced to 1/3 (9.6 W) against 1 core (29.3 W).

# Automatic Parallelization of JPEG-XR for Drinkable Inner Camera (Endo Capsule)

- 10 times more speedup needed after parallelization for 128 cores of Power 7.
- Less than 35 mW power consumption is required.

## OSCAR Vector Multicore and Compiler for Embedded to Servers with OSCAR Technology

#### **Target:**

- Solar powered with compiler power reduction.
- Fully automatic parallelization and vectorization including local memory management and data transfer.

## Fujitsu VPP500/NWT: PE Unit

## **OSCAR Technology**

Started up on Feb. 28, 2013: licensing all the patents and the OSCAR compiler from Waseda Univ.

CEO: Dr. T.
Ono (Ex-CEO of First Section-listed Company, VP of National Univ., Invited Prof. of Waseda U.)

- Executives: Mr. T. Ito (Visiting Prof., Tokyo Agricult. and Eng. U.); Prof. K. Shirai (Ex-President of Waseda U., Chairman of Japanese Open Univ.)
- CTO: Mr. M. Takamura (Ex-Fellow, Fujitsu Lab.; Fujitsu VPP500, 5000 & NWT Development Leader)
- Mr. K. Ashida (Ex-VP Sumitomo Trading, Ashida Consult. CEO, a leader of the business world)
- Auditor: Dr. S. Matsuda (Prof. Emeritus, Waseda U.; Ex-President, Ventures and Entrepreneurs Society)
- Advisors: Dr. T. Sato (Patent Attorney, Ex-President of Patent Attorneys Assoc., Gov. IP Committee); Ms. K. Ishiguro (Lawyer, Supreme Court Trainer); Mr. A. Fukuda (Leader of Alumni Assoc.); Prof. K. Kimura (Waseda Univ.); Prof. H. Kasahara (Waseda Univ.)

Photo: Fujitsu VPP5000

#### **Summary**

- Waseda University Green Computing Systems R&D Center, supported by METI, has been researching low-power, high-performance Green Multicore hardware, software and applications with government and industry, including Hitachi, Fujitsu, NEC, Renesas, Denso, Toyota, Olympus and OSCAR Technology.
- The OSCAR Automatic Parallelizing and Power Reducing Compiler has succeeded in speeding up and/or reducing the power of scientific applications including "Earthquake Wave Propagation", medical applications including "Cancer Treatment Using Carbon Ion" and "Drinkable Inner Camera", and industry applications including "Automobile Engine Control", "Smartphone" and "Wireless communication Base Band Processing" on various multicores from different vendors including Intel, ARM, IBM, AMD, Qualcomm, Freescale, Renesas and Fujitsu.
- Propagation Simulation" on 128 cores of IBM Power 7 against 1 core, 55 times speedup for "Carbon Ion Radiotherapy Cancer Treatment" on 64 cores of IBM Power7, 1.95 times for "Automobile Engine Control" on Renesas 2 cores using SH4A or V850, 55 times for "JPEG-XR Encoding for Capsule Inner Cameras" on 64 cores of the Tilera Tile64 manycore.
- The compiler will be available on the market from OSCAR Technology.
- In <u>automatic power reduction</u>, the <u>power consumed by real-time multimedia applications</u> such as human face detection, H.264, MPEG2 and optical flow was reduced to 1/2 or 1/3 using 3 cores of <u>ARM</u> Cortex-A9 and <u>Intel Haswell</u>, and to 1/4 using 8 cores of <u>Renesas</u> SH4A, compared with ordinary single core execution.
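
The power reductions summarized above are motivated by the scaling rule stated on the "Multicores for Performance and Low Power" slide. As a worked check of that slide's numbers, the derivation below assumes the idealized relations $P \propto f V^2$ and $V \propto f$, and a workload that scales perfectly across cores. These are simplifying assumptions of this sketch, not measured results. $P_0$ and $\mathrm{Perf}_0$ denote the power and performance of one core at the original frequency $f$.

$$
\begin{aligned}
&P \propto f V^2, \quad V \propto f \;\Longrightarrow\; P \propto f^3 \\
&\text{one core at } f/4: \quad P_1 = \left(\tfrac{1}{4}\right)^{3} P_0 = \tfrac{1}{64} P_0, \qquad \mathrm{Perf}_1 = \tfrac{1}{4}\,\mathrm{Perf}_0 \\
&\text{eight cores at } f/4: \quad P_8 = 8 \cdot \tfrac{1}{64} P_0 = \tfrac{1}{8} P_0, \qquad \mathrm{Perf}_8 = 8 \cdot \tfrac{1}{4}\,\mathrm{Perf}_0 = 2\,\mathrm{Perf}_0
\end{aligned}
$$

Under these assumptions, eight slow cores deliver twice the performance of one fast core at one eighth of the power, which matches the figures on the slide. Real chips deviate from the cubic rule because of leakage power and limits on voltage scaling.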