#### **High Performance Green Multicore Computing**

Hironori Kasahara, Ph.D., IEEE Fellow

**IEEE Computer Society President 2018**

Professor, Dept. of Computer Science & Engineering; Director, Advanced Multicore Processor Research Institute, Waseda University, Tokyo, Japan

URL: http://www.kasahara.cs.waseda.ac.jp/

- 1980 BS, 82 MS, 85 Ph.D., Dept. EE, Waseda Univ.
- 1985 Visiting Scholar: U. of California, Berkeley
- 1986 Assistant Prof., 1988 Associate Prof., 1997 Prof., Waseda Univ., now Dept. of Computer Sci. & Eng.
- 1989-90 Research Scholar: U. of Illinois, Urbana-Champaign, Center for Supercomputing R&D
- 2004 Director, Advanced Multicore Research Institute
- 2017 Member: the Engineering Academy of Japan and the Science Council of Japan
- 2005 STARC Academia-Industry Research Award
- 2008 LSI of the Year Second Prize
- 2008 Intel Asia Academic Forum Best Research Award
- 2010 IEEE CS Golden Core Member Award
- 2014 Minister of Edu., Sci. & Tech. Research Prize
- 2015 IPSJ Fellow
- 2017 IEEE Fellow, IEEE Eta Kappa Nu

Reviewed Papers: 214, Invited Talks: 145, Published Unexamined Patent Applications: 59 (Japan, US, GB, China; Granted Patents: 30), Articles in Newspapers, Web News, Media incl. TV etc.: 572

- IEEE Computer Society President 2018, BoG (2009-14), Multicore STC Chair (2012-), Japan Chair (2005-07)
- IPSJ Chair: HG for Mag. & J. Edit, Sig. on ARC.
- [METI/NEDO] Project Leaders: Multicore for Consumer Electronics, Advanced Parallelizing Compiler; Chair: Computer Strategy Committee
- [Cabinet Office] CSTP Supercomputer Strategic ICT PT, Japan Prize Selection Committees, etc.
- [MEXT] Info. Sci. & Tech. Committee, Supercomputers (Earth Simulator, HPCI Promo., Next Gen. Supercomputer K) Committees, etc.
## Hironori Kasahara Voted 2017 IEEE Computer Society President-Elect

LOS ALAMITOS, Calif., 30 September 2016 - Hironori Kasahara, a Professor of Computer Science at Waseda University in Tokyo and Director of the Advanced Multicore Research Institute, has been voted IEEE Computer Society 2017 President-Elect. Kasahara is a former member of the IEEE-CS Board of Governors, has served as chair of the IEEE-CS Multicore STC and CS Japan Chapter, and has been a board member of the IEEE Tokyo Section.

Kasahara will serve as the 2018 IEEE CS President for a one-year term beginning 1 January 2018. Kasahara garnered 3,278 votes, compared with 2,804 votes cast for Hausi A. Müller, a Professor of Computer Science and Associate Dean of Research, Faculty of Engineering at University of Victoria, Canada, and a member of the IEEE-CS Board of Governors.

The President oversees IEEE-CS programs and operations and is a nonvoting member of most IEEE-CS program boards and committees.

The 2016 election had a 12.69% turnout, with 6,357 ballots cast. The turnout was higher than the 2015 election with a 12.68% turnout (6,239 ballots cast) and the 2014 election with a 12.66% turnout (6,728 ballots cast).

#### 2016 IEEE Computer Society Election Results

Posted 29 September 2016

#### Hironori Kasahara selected 2017 President-Elect (2018 President)

Hironori Kasahara has served as a chair or member of 225 society and government committees, including member of the CS Board of Governors; chair of the CS Multicore STC and CS Japan chapter; associate editor of IEEE Transactions on Computers; vice PC chair of the 1996 ENIAC 50th Anniversary International Conference on Supercomputing; general chair of LCPC; PC member of SC, PACT, PPoPP, and ASPLOS; board member of the IEEE Tokyo section; and member of the Earth Simulator committee.
He received a PhD in 1985 from Waseda University, Tokyo, joined its faculty in 1986, and has been a professor of computer science since 1997 and a director of the Advanced Multicore Research Institute since 2004. He was a visiting scholar at University of California, Berkeley, and the University of Illinois at Urbana-Champaign's Center for Supercomputing R&D.

Kasahara received the CS Golden Core Member Award, IFAC World Congress Young Author Prize, IPSJ Fellow and Sakai Special Research Award, and the Japanese Minister's Science and Technology Prize. He led Japanese national projects on parallelizing compilers and embedded multicores, and has presented 210 papers, 132 invited talks, and 27 patents. His research has appeared in 520 newspaper and Web articles.

## IEEE Computer Society BoG (Board of Governors) Feb. 1, 2017

https://www.computer.org/web/cshistory/officers-2017

#### **Past IEEE Computer Society Presidents**

**Chairs of the IRE Professional Group on Electronic Computers**

- 1951-53 Morton M. Astrahan
- 1953-54 John H. Howard
- 1954-55 Harry Larson
- 1955-56 Jean H. Felker
- 1956-57 Jerre D. Noe
- 1957-58 Werner Buchholz
- 1958-59 Willis H. Ware
- 1959-60 Richard O. Endres
- 1960-62 Arnold A. Cohen
- 1962-64 Walter L. Anderson

**Chairs of the AIEE Committee on Large-Scale Computing Devices**

- 1946-49 Charles Concordia
- 1949-51 John Grist Brainerd
- 1951-53 Walter H. MacWilliams
- 1953-55 Frank J. Maginniss
- 1955-57 Edwin L. Harder
- 1957-59 Morris Rubinoff
- 1959-61 Ruben A. Imm
- 1961-63 Claude A. Kagan
- 1963-64 Gerhard L. Hollander

**Chairs & Presidents of the IEEE Computer Society**

- 1964-65 Keith Uncapher
- 1965-66 Richard I. Tanaka
- 1966-67 Samuel Levine
- 1968-69 Charles L. Hobbs
- 1970-71 Edward J. McCluskey
- 1972-73 Albert S. Hoagland
- 1974-75 Stephen S. Yau
- 1976 Dick B. Simmons
- 1977-78 Merlin G. Smith
- 1979-80 Tse-Yun Feng
- 1981 Richard E. Merwin
- 1982-83 Oscar N. Garcia
- 1984-85 Martha Sloan
- 1986-87 Roy L. Russo
- 1988 Edward A. Parrish
- 1989 Kenneth A. Anderson
- 1990 Helen M. Wood
- 1991 Duncan H. Lawrie
- 1992 Bruce D. Shriver
- 1993 James H. Aylor
- 1994 Laurel V. Kaleda
- 1995 Ronald G. Hoelzeman
- 1996 Mario R. Barbacci
- 1997 Barry W. Johnson
- 1998 Doris L. Carver
- 1999 Leonard L. Tripp
- 2000 Guylaine M. Pollock
- 2001 Benjamin W. Wah
- 2002 Willis K. King
- 2003 Stephen Diamond
- 2004 Carl K. Chang
- 2005 Gerald L. Engel
- 2006 Deborah M. Cooper
- 2007 Michael R. Williams
- 2008 Rangachar Kasturi
- 2009 Susan K. (Kathy) Land
- 2010 James D. Isaak
- 2011 Sorel Reisman
- 2012 John W. Walz
- 2013 David Alan Grier
- 2014 Dejan S. Milojicic
- 2015 Thomas M. Conte
- 2016 Roger U. Fujii
- 2017 Jean-Luc Gaudiot
- 2018 Hironori Kasahara

#### **IEEE Computer Society**

- 60,000+ members
- Volunteer-led organization
- 200 technical conferences
- Industry-oriented "Rock Stars"
- 17 scholarly journals and 13 magazines
- Awards program
- Digital Library with more than 550,000 articles and papers
- 400 local and regional chapters
- 40 technical committees

#### Toward 2018

- Refining content and services to further improve the satisfaction of CS members;
- Considering an <u>incentive for volunteers</u> to further accelerate CS activities and promptly provide technical benefits for people around the globe;
  - To express appreciation to volunteers: CS Point (Mileage) System: Annual & Life Time Honor, Premier Seating, Premier Registration, Distinguished Reviewer, etc.
- Offering more <u>attractive services</u> for practitioners in <u>industry</u>;
- Providing the world's best educational content and historical treasures for future generations, which only the CS can create with our pioneering researchers (for example, the Multicore Compiler Video Series found at www.computer.org/web/education/multicore-video-series);
- Thinking about <u>sustainable membership fees</u> while considering the diversity of economic situations within the 10 regions;
- Cooperating with other IEEE societies and sister societies in a timely and efficient manner;
- Intelligibly introducing the latest computer-related technologies to younger generations, including children, so that they can realize their technological dreams.

#### **Multicores for Performance and Low Power**

Power consumption is one of the biggest problems for performance scaling, from smartphones to cloud servers and supercomputers (the "K" supercomputer consumes more than 10 MW).

IEEE ISSCC08: Paper No. 4.5, M. ITO, ... and H. Kasahara, "An 8640 MIPS SoC with Independent Power-off Control of 8 CPUs and 8 RAMs by an Automatic Parallelizing Compiler"

Power ∝ Frequency × Voltage<sup>2</sup> (Voltage ∝ Frequency), so Power ∝ Frequency<sup>3</sup>.

If <u>Frequency</u> is reduced to <u>1/4</u> (e.g. 4 GHz → 1 GHz), Power is reduced to 1/64 and Performance falls to 1/4.

**Multicores:** If **8 cores** running at the reduced frequency are integrated on a chip, Power is still 1/8 (8 × 1/64) and Performance becomes 2 times (8 × 1/4).

## Parallel Software Is Important for Scalable Performance of Multicore (LCPC2015)

- Just adding more cores does not give speedup.
- The development cost and period of parallel software are becoming a bottleneck in the development of embedded systems, e.g. IoT and automobiles.

Earthquake wave propagation simulation GMS, developed by the National Research Institute for Earth Science and Disaster Resilience (NIED)

[Chart legend: original (Sun Studio), proposed method]

- A commercially available automatic parallelizing compiler gave no speedup on 64 cores over execution on 1 core.
- **Execution with 128 cores was slower than with 1 core (0.9 times speedup).**
- The advanced OSCAR parallelizing compiler gave 211 times speedup with 128 cores against the execution time on 1 core using the commercial compiler.
  - The OSCAR compiler gave 2.1 times speedup on 1 core against the commercial compiler through global cache optimization.

### **Power Reduction of MPEG2 Decoding to 1/4 on 8 Core Homogeneous Multicore RP-2 by OSCAR Parallelizing Compiler**

5.73 [W] → 1.52 [W]

#### **Demo of NEDO Multicore for Real Time Consumer Electronics at the Council for Science and Technology Policy (CSTP) on April 10, 2008**

74th Council for Science and Technology Policy meeting (April 10, 2008)

[Photos: scenes from the 74th Council for Science and Technology Policy meeting, (1) to (4)]

**CSTP Members**

- **Prime Minister:** Mr. Y. FUKUDA
- **Minister of State for Science, Technology and Innovation Policy:** Mr. F. KISHIDA
- **Chief Cabinet Secretary:** Mr. N. MACHIMURA
- **Minister of Internal Affairs and Communications:** Mr. H. MASUDA
- **Minister of Finance:** Mr. F. NUKAGA
- **Minister of Education, Culture, Sports, Science and Technology:** Mr. K. TOKAI
- **Minister of Economy, Trade and Industry:** Mr. A. AMARI

## Green Computing Systems R&D Center

### **Waseda University**

### **Supported by METI (Mar. 2011 Completion)**

**R&D Target:** Hardware, Software, Application for Super Low-Power Manycore

- More than 64 cores
- Natural air cooling (no fan): Cool, Compact, Clear, Quiet
- Operational by solar panel

**Industry, Government, Academia:** Hitachi, Fujitsu, NEC, Renesas, Olympus, Toyota, Denso, Mitsubishi, Toshiba, OSCAR Technology, etc.

**Ripple Effect**

- Low CO2 (carbon dioxide) emissions
- Creation of value-added products
- Automobiles, Medical, IoT, Servers

Location: beside Subway Waseda Station, near the Waseda Univ. main campus

## **OSCAR Parallelizing Compiler**

To improve effective performance, cost-performance and software productivity, and to reduce power

**Multigrain Parallelization** (LCPC1991, 2001, 04)

Coarse-grain parallelism among loops and subroutines (2000 on SMP) and near-fine-grain parallelism among statements (1992), in addition to loop parallelism

#### **Data Localization**

Automatic data management for distributed shared memory, cache and local memory (Local Memory 1995, 2016 on RP2; Cache 2001, 03)

Software Coherent Control (2017)

#### Data Transfer Overlapping (2016 partially)

Data transfer overlapping using Data Transfer Controllers (DMAs)

#### **Power Reduction**

(2005 for Multicore, 2011 Multi-processes, 2013 on ARM)

Reduction of consumed power by compiler-controlled DVFS and power gating with hardware support

#### Multicore Program Development Using OSCAR API V2.0

#### **Sequential Application Program in Fortran or C**

(Consumer Electronics, Automobiles, Medical, Scientific computation, etc.)
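As a rough illustration of coarse-grain parallelism among loops, the sketch below takes a small sequential C program with two independent loops (macro-tasks A and B) and one dependent loop (C), and marks A and B with the standard OpenMP `parallel sections` construct. The function names, the array bound `MAX_N`, and the workload are hypothetical, and this is hand-written code, not output of the OSCAR compiler or an example of the OSCAR API itself.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_N 1024  /* illustrative upper bound on problem size (assumption) */

/* Macro-task A: a loop that writes only to `a`. */
static void fill_squares(long *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = (long)(i * i);
}

/* Macro-task B: a loop that writes only to `b`; independent of A. */
static void fill_doubles(long *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        b[i] = (long)(2 * i);
}

/* Runs A and B as parallel sections, then the dependent macro-task C. */
long run_macro_tasks(size_t n)
{
    long a[MAX_N], b[MAX_N];
    assert(n <= MAX_N);

    /* If OpenMP is not enabled, the pragmas are ignored and the two
     * sections simply run one after the other. */
    #pragma omp parallel sections
    {
        #pragma omp section
        fill_squares(a, n);

        #pragma omp section
        fill_doubles(b, n);
    }

    /* Macro-task C reads both arrays, so it runs after the implicit
     * barrier that ends the sections construct. */
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] + b[i];
    return sum;
}
```

Calling `run_macro_tasks(100)` returns the same value with or without OpenMP enabled, because A and B touch disjoint arrays and C starts only after both have finished.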
[Flow diagram paths: Homogeneous; Hetero; Manual parallelization / power reduction]

#### **Accelerator Compiler / User**

Add "hint" directives before a loop or a function to specify that it is executable by the accelerator and in how many clocks.

#### Waseda OSCAR **Parallelizing Compiler**

- Coarse grain task parallelization
- **Data Localization**
- **DMAC** data transfer
- Power reduction using **DVFS, Clock/Power gating**

Hitachi, Renesas, NEC, Fujitsu, Toshiba, Denso, Olympus, Mitsubishi, Esol, Cats, Gaio, 3 univ.

**OSCAR API for Homogeneous and/or Heterogeneous Multicores and Manycores**

Directives for thread generation, memory, data transfer using DMA, and power management

**Parallelized API Fortran or C program**

- Proc0: code with directives (Thread 0)
- Proc1: code with directives (Thread 1)
- Accelerator 1 code
- Accelerator 2 code

**Generation of parallel machine codes using sequential compilers**

- Low Power Homogeneous Multicore Code Generation: API Analyzer, existing sequential compiler
- Low Power Heterogeneous Multicore Code Generation: API Analyzer (available from Waseda), existing sequential compiler
- Server Code Generation: OpenMP compiler

**Targets (various multicores)**

- Homogeneous multicores from Vendor A (SMP servers)
- Heterogeneous multicores from Vendor B
- Shared memory servers

**OSCAR:** Optimally Scheduled Advanced Multiprocessor. **API:** Application Program Interface.

# Speedup ratio for H.264 and Optical Flow on ARM Cortex-A9 Android 3 cores by OSCAR Automatic Parallelization

## **Automatic Power Reduction on ARM Cortex-A9 with Android**

http://www.youtube.com/channel/UCS43INYEIkC8i\_KIgFZYQBQ

ODROID X2: Samsung Exynos4412 Prime, ARM Cortex-A9 quad core, 1.7GHz~0.2GHz, used by Samsung's Galaxy S3

- Power on 3 cores was reduced to $1/5 \sim 1/7$ of that without software power control.
- Power on 3 cores was reduced to $1/2 \sim 1/3$ of that of ordinary 1-core execution.

## Automatic Power Reduction of OpenCV Face Detection on big.LITTLE ARM Processor

- Samsung Exynos 5422 Processor
- 4x Cortex-A15 2.0GHz, 4x Cortex-A7 1.4GHz big.LITTLE Architecture
- 2GB LPDDR3 RAM

#### **Automatic Power Reduction on Intel Haswell**

## H.264 decoder & Optical Flow (3 cores)

- Power on 3 cores was reduced to $1/3 \sim 1/4$ of that without software power control.
- Power on 3 cores was reduced to $2/5 \sim 1/3$ of that of ordinary 1-core execution.

## 110 Times Speedup against the Sequential Processing for GMS Earthquake Wave Propagation Simulation on Hitachi SR16000 (Power7 Based 128 Core Linux SMP) (LCPC2015)

### Performance on Multicore Server for Latest Cancer Treatment Using Heavy Particle (Proton, Carbon Ion)

327 times speedup on 144 cores

Hitachi 144-core SMP Blade Server BS500: **Xeon E7-8890 V3 (2.5GHz, 18 cores/chip)** x 8 chips

[Chart: speedup vs. number of PEs (1, 32, 64, 144), with the GCC sequential execution as baseline 1.00; plotted values 5.00, 109.20, 196.50 and 327.60. Annotations: National Institute of Radiological Sciences; facility cost: 12 billion yen]

- The original sequential execution time of 2948 sec (50 minutes) using GCC was reduced to 9 sec with 144 cores (327.6 times speedup).
- Reduction of treatment cost and reservation waiting period is expected.

## Cancer Treatment Carbon Ion Radiotherapy

(Previous best was 2.5 times speedup on 16 processors with hand optimization)

- 8.9 times speedup by 12 processors: Intel Xeon X5670 2.93GHz 12 core SMP (Hitachi HA8000)
- 55 times speedup by 64 processors: IBM Power 7 64 core SMP (Hitachi SR16000)

## **Engine Control by multicore with Denso**

Although parallel processing of engine control on multicores has so far been very difficult, Denso and Waseda achieved 1.95 times speedup on a 2-core V850 multicore processor.
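The speedup figures quoted on these slides are easier to compare when expressed as parallel efficiency (speedup divided by core count). The helpers below are a minimal sketch of that arithmetic; the function names are hypothetical, and the inputs used in the note that follows are the numbers reported on the slides.

```c
#include <assert.h>

/* Parallel efficiency: reported speedup divided by the number of cores. */
double parallel_efficiency(double speedup, int cores)
{
    assert(cores > 0);
    return speedup / cores;
}

/* Execution time implied by a sequential time and a reported speedup. */
double parallel_seconds(double sequential_seconds, double speedup)
{
    assert(speedup > 0.0);
    return sequential_seconds / speedup;
}
```

With the slide values, `parallel_efficiency(1.95, 2)` is about 0.98 for engine control, `parallel_efficiency(55.0, 64)` is about 0.86 and `parallel_efficiency(8.9, 12)` about 0.74 for carbon ion radiotherapy, and `parallel_seconds(2948.0, 327.6)` is about 9.0 s for the heavy-particle treatment code. `parallel_efficiency(327.6, 144)` comes out near 2.3; a value above 1 is possible there because the baseline is the original sequential execution compiled with GCC, so the figure likely also includes single-core improvements.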
- Hard real-time automobile engine control by multicore using local memories
- Millions of lines of C code consisting of conditional branches and basic blocks

### Macro Task Fusion for Static Task Scheduling

## 3.1 Restructuring: Inline Expansion

- Inline expansion is effective
  - To increase coarse grain parallelism
- Expands functions having inner parallelism

Improves coarse grain parallelism

[Figures: MTG before inline expansion; MTG after inline expansion]

### MTG of Crankshaft Program Using Inline Expansion

Not enough coarse grain parallelism yet!

### 3.2 Restructuring: Duplicating If-statements

- Duplicating if-statements is effective
  - To increase coarse grain parallelism
- Duplicates fused tasks having inner parallelism

### MTG of Crankshaft Program Using Inline Expansion and Duplicating If-statements

Successfully increased coarse grain parallelism

## **Evaluation of Crankshaft Program with Multi-core Processors**

- Attains 1.54 times speedup on RP-X
  - The program has no loops, only many conditional branches and small basic blocks, which makes it difficult to parallelize.
  - This result shows the possibility of using multi-core processors for engine control programs.

### **OSCAR Compile Flow for Simulink Applications**

Simulink model → generate C code using Embedded Coder → C code

    /* Model step function */
    void VesselExtraction_step(void)
    {
      int32_T i;
      real_T u0;
      for (i = 0; i < 16384; i++) {
        VesselExtraction_B.DataTypeConversion[i] = VesselExtraction_U.In1[i];
      }
      /* End of DataTypeConversion: '<S1>/Data Type Conversion' */

      /* Outputs for Atomic SubSystem: '<S1>/2Dfilter' */
      /* Constant: '<S1>/h1' */
      VesselExtraction_Dfilter(VesselExtraction_B.DataTypeConversion,
        VesselExtraction_P.h1_Value, &VesselExtraction_B.Dfilter,
        (P_Dfilter_VesselExtraction_T *)&VesselExtraction_P.Dfilter);
      /* End of Outputs for SubSystem: '<S1>/2Dfilter' */

      /* Outputs for Atomic SubSystem: '<S1>/2Dfilter1' */
      /* Constant: '<S1>/h2' */
      VesselExtraction_Dfilter(VesselExtraction_B.DataTypeConversion,
        VesselExtraction_P.h2_Value, &VesselExtraction_B.Dfilter1,
        (P_Dfilter_VesselExtraction_T *)&VesselExtraction_P.Dfilter1);
      /* ... the excerpt shown on the slide ends here ... */
    }

## Speedups of MATLAB/Simulink Image Processing on Various 4core Multicores

(Intel Xeon, ARM Cortex A15 and Renesas SH4A)

- Road Tracking, Image Compression: http://www.mathworks.co.jp/jp/help/vision/examples
- Buoy Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/44706-buoy-detection-using-simulink
- Color Edge Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/28114-fast-edges-of-a-color-image--actual-color--not-converting-to-grayscale-/
- Vessel Detection: http://www.mathworks.co.jp/matlabcentral/fileexchange/24990-retinal-blood-vessel-extraction/

## Parallel Processing of Face Detection on Manycore, Highend and PC Server

- The OSCAR compiler gives 11.55 times speedup for 16 cores against 1 core on the SR16000 Power7 high-end server.

## **OSCAR** Heterogeneous Multicore

- **DTU:** Data Transfer Unit
- **LPM:** Local Program Memory
- **LDM:** Local Data Memory
- **DSM:** Distributed Shared Memory
- **CSM:** Centralized Shared Memory
- **FVR:** Frequency/Voltage Control Register

#### An Image of Static Schedule for Heterogeneous Multicore with Data Transfer Overlapping and Power Control

## OSCAR API Ver. 2.0 for Homogeneous/Heterogeneous Multicores and Manycores (LCPC2009 Homo, 2010 Hetero)

#### List of Directives (22 directives)

- Parallel Execution API
  - parallel sections (\*)
  - flush (\*)
  - critical (\*)
  - execution
- Memory Mapping API
  - threadprivate (\*)
  - distributedshared
  - onchipshared
- Synchronization API
  - groupbarrier
- Data Transfer API
  - dma\_transfer
  - dma\_contiguous\_parameter
  - dma\_stride\_parameter
  - dma\_flag\_check
  - dma\_flag\_send
- Power Control API
  - fvcontrol
  - get\_fvstatus
- Timer API
  - get\_current\_time
- Accelerator
  - accelerator\_task\_entry
- Cache Control
  - cache\_writeback
  - cache\_selfinvalidate
  - complete\_memop
  - noncacheable
  - aligncache
- 2 hint directives for OSCAR compiler
  - accelerator\_task
  - oscar\_comment

(\* from OpenMP)

from V2.0

## 33 Times Speedup Using OSCAR Compiler and OSCAR API on RP-X

## Power Reduction in a real-time execution controlled by OSCAR Compiler and OSCAR API on RP-X

(Optical Flow with a hand-tuned library)

[Chart panels: Without Power Reduction; Power Reduction by OSCAR Compiler]

70% power reduction

#### **Low-Power Optimization with OSCAR API**

### 1987 OSCAR (Optimally Scheduled Advanced Multiprocessor)

### **Co-design of Compiler and Architecture**

Looking at various applications, design a parallelizing compiler and design a multiprocessor/multicore processor to support compiler optimization

### OSCAR (Optimally Scheduled Advanced Multiprocessor)

### **OSCAR Memory Space (Global Address Space)**

[Diagram panels: SYSTEM MEMORY SPACE; LOCAL MEMORY SPACE]

# Hierarchical Barrier Synchronization

- Specifying a hierarchical group barrier
  - #pragma oscar group\_barrier (C)
  - !$oscar group\_barrier (Fortran)

### NWT

### Fujitsu Vector Parallel Supercomputer with Crossbar to a Chip

Crossbar Network: VPP500: 256\*256; VPP5000: 1024\*1024

**VPP500:** GaAs/BiCMOS/ECL, water cooling, 10 ns, 1GB/PE in separate cabinet, 1.6 GFLOPS/PE, 32-bit address

**VPP5000:** 0.25 um CMOS, 30M Tr./chip, forced air cooling, 3.5 ns clock cycle, 8GB/PE on PE board,
9.1 GFLOPS/PE, 32/64-bit address

# VPP500/NWT

# 4 core multicore RP1 (2007), 8 core multicore RP2 (2008) and 15 core Heterogeneous multicore RPX (2010) developed in NEDO Projects with Hitachi and Renesas

| | RP-1 (ISSCC2007 #5.3) | RP-2 (ISSCC2008 #4.5) | **RP-X** (ISSCC2010 #5.3) |
|---|---|---|---|
| Process | 90nm, 8-layer, triple-Vth, CMOS | 90nm, 8-layer, triple-Vth, CMOS | 45nm, 8-layer, triple-Vth, CMOS |
| Die size | 97.6 mm<sup>2</sup> (9.88 x 9.88 mm) | 104.8 mm<sup>2</sup> (10.61 x 9.88 mm) | 153.8 mm<sup>2</sup> (12.4 x 12.4 mm) |
| Supply voltage | 1.0V (internal), 1.8/3.3V (I/O) | 1.0-1.4V (internal), 1.8/3.3V (I/O) | 1.0-1.2V (internal), 1.2-3.3V (I/O) |
| Performance | 600MHz, 4.32 GIPS, 16.8 GFLOPS | 600MHz, 8.64 GIPS, 33.6 GFLOPS | 648MHz, 13.7 GIPS, 115 GOPS, 36.2 GFLOPS |
| Power efficiency (32-bit equivalent) | 11.4 GOPS/W | 18.3 GOPS/W | 37.3 GOPS/W |

[Chip micrographs of RP-1, RP-2 and RP-X]

# 8 Core RP2 Chip Block Diagram

# Automatic Parallelization of JPEG-XR for Drinkable Inner Camera (Endo Capsule)

10 times more speedup is needed after parallelization for 128 cores of Power 7. Less than 35mW power consumption is required.

### OSCAR Vector Multicore and Compiler for Embedded to Servers with OSCAR Technology

#### **Target:**

- Solar powered
- Compiler power reduction
- Fully automatic parallelization and vectorization, including local memory management and data transfer
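Local memory management can be pictured with a small tiling sketch: two consecutive loops over a large array are executed tile by tile, so that each tile is brought into a small local buffer once and reused by both loops before the next tile is loaded. This is a simplified hand-written illustration under stated assumptions, not the OSCAR compiler's algorithm; `LOCAL_WORDS`, `load_tile` and `scaled_sum_localized` are hypothetical names, and `memcpy` merely stands in for a DMA transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_WORDS 256  /* hypothetical local data memory capacity, in doubles */

/* Stand-in for a DMA transfer from shared memory into local memory. */
static void load_tile(double *local, const double *global, size_t n)
{
    memcpy(local, global, n * sizeof *local);
}

/* Scales every element and sums the results. The two loops run tile by
 * tile, so each tile is transferred once and reused by both loops. */
double scaled_sum_localized(const double *data, size_t n, double scale)
{
    double local[LOCAL_WORDS];
    double total = 0.0;

    assert(data != NULL || n == 0);

    for (size_t base = 0; base < n; base += LOCAL_WORDS) {
        size_t len = (n - base < LOCAL_WORDS) ? n - base : LOCAL_WORDS;

        load_tile(local, data + base, len);

        for (size_t i = 0; i < len; i++)   /* loop 1: scale in local memory */
            local[i] *= scale;

        for (size_t i = 0; i < len; i++)   /* loop 2: reuse the same tile */
            total += local[i];
    }
    return total;
}
```

On hardware with a data transfer controller, the load of the next tile could in principle be issued while the current tile is being processed; the sketch keeps the transfers synchronous to stay portable.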
#### **Vector Accelerator**

#### **Features**

- Attachable to any CPU (Intel, ARM, IBM)
- Data driven initiation by sync flags

#### Function Units [tentative]

- Vector Function Unit
  - 8 double precision ops/clock
  - 64 character ops/clock
  - Variable vector register length
  - Chaining LD/ST & vector pipes
- Scalar Function Unit

#### Registers [tentative]

- Vector Register: 256 Bytes/entry, 32 entries
- Scalar Register: 8 Bytes/entry
- Floating Point Register: 8 Bytes/entry
- Mask Register: 32 Bytes/entry

# OSCAR Technology Corp.

Started up on Feb. 28, 2013: licensing all the patents and the OSCAR compiler from Waseda Univ.

Founder and CEO: Dr. T. Ono (Ex- CEO of First Section-listed Company, **Director of National U., Invited Prof. of Waseda U.)**

Executives:

- Dr. M. Ohashi: COO (Ex- OO of Ono Sokki)
- Mr. A. Nodomi: CTO (Ex- Spansion)
- Mr. N. Ito (Ex- Visiting Prof. Tokyo Agricult. and Tech. U.)
- Dr. K. Shirai (Ex- President of Waseda U., Ex- Chairman of Japanese Open U.)
- Mr. K. Ashida (Ex- VP Sumitomo Trading, Ashida Consult. CEO)
- Mr. S. Tsuchida (Co-Chief Investment Officer of Innovation Network Corp. of Japan)

Auditor:

- Mr. S. Honda (Ex- Senior VP and General Manager of MUFG)
- Dr. S. Matuda (Emeritus Prof. of Waseda U., Chairman of WERU INVESTME
- Mr. Y. Hirowatari (President of AGS Consulting)

Advisors:

- Prof. H. Kasahara (Waseda U.)
- Prof. K. Kimura (Waseda U.)
### **Future Multicore Products**

#### **Next Generation Automobiles**

- Safer, more comfortable, energy efficient, environmentally friendly
- Cameras, radar, car-to-car communication, and internet information integrated with brake, steering, engine, and motor control

#### **Smart phones**

- From everyday recharging to less than once a week
- Solar powered operation in emergency conditions
- Health keeping

#### **Advanced medical systems**

Cancer treatment, drinkable inner camera

- Emergency solar powered
- No cooling fan, no dust, clean and usable inside the operating room

#### **Personal / Regional Supercomputers**

Solar powered, with more than 100 times better power efficiency (FLOPS/W)

Regional disaster simulators saving lives from tornadoes, localized heavy rain, and fires caused by earthquakes

### **Summary**

- The Waseda University Green Computing Systems R&D Center, supported by METI, has been researching low-power, high-performance Green Multicore hardware, software and applications with industry including Hitachi, Fujitsu, NEC, Renesas, Denso, Toyota, Olympus and OSCAR Technology.
- The OSCAR Automatic Parallelizing and Power Reducing Compiler has achieved speedup and/or power reduction for scientific applications including "Earthquake Wave Propagation", medical applications including "Cancer Treatment Using Carbon Ion" and "Drinkable Inner Camera", and industry applications including "Automobile Engine Control", "Smartphone", and "Wireless Communication Base Band Processing" on various multicores from different vendors including Intel, ARM, IBM, AMD, Qualcomm, Freescale, Renesas and Fujitsu.
- In automatic parallelization: 110 times speedup for "Earthquake Wave Propagation Simulation" on 128 cores of IBM Power 7 against 1 core, 55 times speedup for "Carbon Ion Radiotherapy Cancer Treatment" on 64 cores of IBM Power7, 1.95 times for "Automobile Engine Control" on Renesas 2 cores using SH4A or V850, and 55 times for "JPEG-XR Encoding for Capsule Inner Cameras" on the Tilera 64-core Tile64 manycore.
  - The compiler will be available on the market from OSCAR Technology.
- In <u>automatic power reduction</u>, the power consumed by real-time multimedia applications such as <u>human face detection</u>, <u>H.264</u>, MPEG2 and optical flow was reduced to 1/2 or 1/3 using 3 cores of <u>ARM</u> Cortex A9 and <u>Intel Haswell</u>, and to 1/4 using 8 cores of <u>Renesas</u> SH4A, against ordinary single-core execution.
- Local memory management for automobiles and software coherent control have been patented and are already realized by the OSCAR compiler.
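The power reductions summarized above rest on the scaling rule from the "Multicores for Performance and Low Power" slide: Power ∝ Frequency × Voltage<sup>2</sup> with Voltage ∝ Frequency. The following is a minimal sketch that reproduces the slide's arithmetic; the function names are hypothetical, and the performance model assumes ideal parallel scaling, which real programs only approach.

```c
#include <assert.h>

/* Relative power of one core when its clock is scaled by freq_ratio,
 * using Power ∝ Frequency * Voltage^2 and Voltage ∝ Frequency. */
double relative_core_power(double freq_ratio)
{
    assert(freq_ratio > 0.0);
    double voltage_ratio = freq_ratio;  /* Voltage ∝ Frequency */
    return freq_ratio * voltage_ratio * voltage_ratio;
}

/* Relative power of a chip with `cores` identical cores at that clock. */
double relative_chip_power(int cores, double freq_ratio)
{
    assert(cores > 0);
    return cores * relative_core_power(freq_ratio);
}

/* Relative throughput, assuming (idealized) perfect parallel scaling. */
double relative_performance(int cores, double freq_ratio)
{
    assert(cores > 0 && freq_ratio > 0.0);
    return cores * freq_ratio;
}
```

For a clock reduced to 1/4, `relative_core_power(0.25)` gives 1/64, and for 8 such cores `relative_chip_power(8, 0.25)` gives 1/8 while `relative_performance(8, 0.25)` gives 2, matching the figures on the slide.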