Name

KIMURA, Keiji

Official Title

Professor

Affiliation

(School of Fundamental Science and Engineering)

Contact Information

Mail Address

Mail Address
kimura@apal.cs.waseda.ac.jp

Address / Phone Number / Fax Number

Address
3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555 Japan
Phone Number
+81-3-5286-3371
Fax Number
+81-3-3232-3594

URL

Web Page URL

http://www.apal.cs.waseda.ac.jp/

Grant-in-Aid for Scientific Research Researcher Number
50318771

Sub-affiliation

Sub-affiliation

Faculty of Science and Engineering (Graduate School of Fundamental Science and Engineering)

Affiliated Institutes

Advanced Multicore Processor Research Institute

Researcher 2009-2010

Advanced Chip Multiprocessor Research Institute

Researcher 2004-2008

IT Bio-Mining Research Institute

Research Institute Member 2010-2013

Advanced Multicore Processor Research Institute

Research Institute Member 2010-2014

Research Institute for Next-Generation Energy Storage Systems

Research Institute Member 2012-2014

Low-Power Optical Interconnection Research Institute

Research Institute Member 2015-

Waseda Research Institute for Science and Engineering (Research Institute for Science and Engineering)

Concurrent Researcher 2018-

Advanced Multicore Processor Research Institute

Research Institute Member 2014-2019

Advanced Multicore Processor Research Institute

Research Institute Member 2019-

Educational background・Degree

Educational background

-1996 Waseda University, Faculty of Science and Engineering, Department of Electronics

Degree

Doctor of Engineering (Coursework), Computer Systems

Career

1999-2002 Research Associate, Department of Electrical, Electronics and Computer Engineering, Waseda University
2002-2004 Visiting Assistant Professor, Advanced Research Institute for Science and Engineering, Waseda University
2004-2005 Assistant Professor, Department of Computer Science, Waseda University
2005-2012 Associate Professor, Department of Computer Science, Waseda University
2012- Professor, Department of Computer Science and Engineering, Waseda University

Academic Society Joined

Information Processing Society of Japan

The Institute of Electronics, Information and Communication Engineers

IEEE Computer Society

ACM

Officer Career (Outside the Campus)

2014- The 28th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Program Committee
2009-2013 XXVII--XXXII IEEE International Conference on Computer Design (ICCD), Program Committee (Computer System Design and Application Track)
2011- The 17th IEEE International Conference on Parallel and Distributed Systems (ICPADS), Program Committee (Multicore Computing and Parallel / Distributed Architecture)
2011-2014 The 24th--27th International Workshop on Languages and Compilers for Parallel Computing (LCPC), Program Committee, Program Chair (2012)
2011- Advanced Parallel Processing Technology Symposium (APPT), Program Committee
2010- IEEE International Symposium on Workload Characterization (IISWC-2010), Program Committee
2010- 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Program Committee (System Software Track)
2009- The 38th International Conference on Parallel Processing (ICPP), Program Committee (Programming Models, Languages and Compilers)
IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips VIII, XII--XVII), Program Committee
IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips IX--XI), Program Committee Vice Chair
The 15th International Symposium on High-Performance Computer Architecture (HPCA-15), Publicity Co-Chair
2006- IPSJ SACSIS, 2008--2013, Program Committee
2006-2008 IPSJ ComSys, Program Committee
2007- IPSJ DA Symposium, University Chair
2007- IPSJ Transactions on Advanced Computing Systems, Editor
2005-2009/03 IPSJ SACSIS, Financial Chair and Program Committee

Research Field

Keywords

Multiprocessor Architecture, Parallelizing Compiler

Research Interests Career

1998-2004 Multigrain Parallelizing Compiler Cooperative Chip Multiprocessor

Current Research Theme Keywords: Multiprocessor Architecture, Parallelizing Compiler

Paper

Multicore Technology for Low-Power Computing

木村啓二, 笠原博徳

Journal of IEICE 97(2), p.133-139, 2014/02-

OSCAR Compiler Controlled Multicore Power Reduction on Android Platform

Hideo Yamamoto, Tomohiro Hirano, Kohei Muto, Hiroki Mikami, Takashi Goto, Dominic Hillenbrand, Moriyuki Takamura, Keiji Kimura, and Hironori Kasahara

The 26th International Workshop on Languages and Compilers for Parallel Computing (LCPC2013) 2013/09-

Reconciling Application Power Control and Operating Systems for Optimal Power and Performance

Dominic Hillenbrand, Yuuki Furuyama, Akihiro Hayashi, Hiroki Mikami, Keiji Kimura and Hironori Kasahara

8th International Workshop on Reconfigurable Communication-centric Systems-on-Chip, ReCoSoC (ReCoSoC2013) 2013/07-

Automatic Parallelization of Hand Written Automotive Engine Control Codes Using OSCAR Compiler

Dan Umeda, Yohei Kanehagi, Hiroki Mikami, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

17th Workshop on Compilers for Parallel Computing (CPC2013) 2013/07-

OSCAR API v2.1: Extensions for an Advanced Accelerator Control Scheme to a Low-Power Multicore API

Keiji Kimura, Cecilia Gonzales-Alvarez, Akihiro Hayashi, Hiroki Mikami, Mamoru Shimaoka, Jun Shirako, Hironori Kasahara

17th Workshop on Compilers for Parallel Computing (CPC2013) 2013/07-

Automatic Parallelization, Performance Predictability and Power Control for Mobile-Applications

Dominic Hillenbrand, Akihiro Hayashi, Hideo Yamamoto, Keiji Kimura, Hironori Kasahara

COOL Chips XVI, IEEE Symposium on Low Power and High-Speed Chips 2013/04-

Parallelization of Automotive Engine Control Software On Embedded Multi-core Processor Using OSCAR Compiler

Yohei Kanehagi, Dan Umeda, Akihiro Hayashi, Keiji Kimura, Hironori Kasahara

COOL Chips XVI, IEEE Symposium on Low Power and High-Speed Chips 2013/04-

Automatic Design Exploration Framework for Multicores with Reconfigurable Accelerators

Cecilia Gonzalez-Alvarez, Haruku Ishikawa, Akihiro Hayashi, Daniel Jimenez-Gonzalez, Carlos Alvarez, Keiji Kimura, Hironori Kasahara

th Workshop on Reconfigurable Computing (WRC) 2013, held in conjunction with HiPEAC conference 2013 2013/01-

Enhancing the Performance of a Multiplayer Game by Using a Parallelizing Compiler

Yasir I Al-Dosary, Keiji Kimura, Hironori Kasahara, and Seinosuke Narita

17th International Conference on Computer Games: AI, Animation, Mobile, Educational & Serious Games 2012/07-

OSCAR Parallelizing Compiler and API for Real-time Low Power Heterogeneous Multicores

Akihiro Hayashi, Mamoru Shimaoka, Hiroki Mikami, Masayoshi Mase, Yasutaka Wada, Jun Shirako, Keiji Kimura, and Hironori Kasahara

16th Workshop on Compilers for Parallel Computing (CPC2012) 2012/01-

Software Development Framework and API for Heterogeneous Multicores

林明宏, 和田康孝, 渡辺岳志, 関口威, 間瀬正啓, 白子準, 木村啓二, 笠原博徳

IPSJ Transactions on Advanced Computing Systems (ACS36) 5(1), p.68-79, 2011/11-

A Parallelizing Compiler Cooperative Heterogeneous Multicore Processor Architecture

Yasutaka Wada, Akihiro Hayashi, Takeshi Masuura, Jun Shirako, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, and Hironori Kasahara

Lecture Notes in Computer Science 6760, p.215-233, 2011/11-

Evaluation of Power Consumption at Execution of Multiple Automatically Parallelized and Power Controlled Media Applications on the RP2 Low-power Multicore

Hiroki Mikami, Shumpei Kitaki, Masayoshi Mase, Akihiro Hayashi, Mamoru Shimaoka, Keiji Kimura, Masato Edahiro, and Hironori Kasahara

Proc. of The 23rd International Workshop on Languages and Compilers for Parallel Computing (LCPC2011) 2011/09-

Parallelizing Compiler Framework and API for Power Reduction and Software Productivity of Real-time Heterogeneous Multicores

A. Hayashi, Y. Wada, T. Watanabe, T. Sekiguchi, M. Mase, J. Shirako, K. Kimura, H. Kasahara

Lecture Notes in Computer Science 6548, p.184-198, 2011/02-

A 45-nm 37.3 GOPS/W Heterogeneous Multi-Core SOC with 16/32 Bit Instruction-Set General-Purpose Core

Osamu NISHII, Yoichi YUYAMA, Masayuki ITO, Yoshikazu KIYOSHIGE, Yusuke NITTA, Makoto ISHIKAWA, Tetsuya YAMADA, Junichi MIYAKOSHI, Yasutaka WADA, Keiji KIMURA, Hironori KASAHARA, and Hideo MAEJIMA

IEICE TRANSACTIONS on Electronics E94-C(4), p.663-669, 2011/04-

Parallelizing Compiler Framework and API for Power Reduction and Software Productivity of Real-time Heterogeneous Multicores

A. Hayashi, Y. Wada, T. Watanabe, T. Sekiguchi, M. Mase, J. Shirako, K. Kimura, H. Kasahara

Proc. of The 23rd International Workshop on Languages and Compilers for Parallel Computing (LCPC2010) 2010/10-

OSCAR API for Real-time Low-Power Multicores and Its Performance on Multicores and SMP Servers

Keiji Kimura, Masayoshi Mase, Hiroki Mikami, Takamichi Miyamoto, Jun Shirako and Hironori Kasahara

Lecture Notes in Computer Science 5898, p.188-202, 2010-

Parallelizable C and Its Performance on Low Power High Performance Multicore Processors

Masayoshi Mase, Yuto Onozaki, Keiji Kimura, Hironori Kasahara

Proc. of 15th Workshop on Compilers for Parallel Computing (CPC 2010) 2010/07-

A 45nm 37.3GOPS/W Heterogeneous Multi-Core SoC

Y. Yuyama, M. Ito, Y. Kiyoshige, Y. Nitta, S. Matsui, O. Nishii, A.Hasegawa, M. Ishikawa, T. Yamada, J. Miyakoshi, K. Terada, T. Nojiri, M. Satoh, H. Mizuno, K. Uchiyama, Y. Wada, K. Kimura, H. Kasahara, H.Maejima

IEEE INTERNATIONAL SOLID-STATE CIRCUITS CONFERENCE (ISSCC 2010) 2010/02-

Element-Sensitive Pointer Analysis for Automatic Parallelization

間瀬正啓, 村田雄太, 木村啓二, 笠原博徳

IPSJ Transactions on Programming (PRO) 3(2), p.36-47, 2010/03-

OSCAR API for Real-time Low-Power Multicores and Its Performance on Multicores and SMP Servers

Keiji Kimura, Masayoshi Mase, Hiroki Mikami, Takamichi Miyamoto, Jun Shirako and Hironori Kasahara

Proc. of The 22nd International Workshop on Languages and Compilers for Parallel Computing (LCPC2009) 2009/10-

Green Multicore-SoC Software-Execution Framework with Timely-Power-Gating Scheme

Masafumi Onouchi, Keisuke Toyama, Toru Nojiri, Makoto Sato, Masayoshi Mase, Jun Shirako, Mikiko Sato, Masashi Takada, Masayuki Ito, Hiroyuki Mizuno, Mitaro Namiki, Keiji Kimura, Hironori Kasahara

Proc. of 2009 International Conference on Parallel Processing, p.510-517, 2009/09-

Power Reduction Scheme by a Parallelizing Compiler Using the OSCAR API on Multicores

間瀬正啓, 中川亮, 大國直人, 白子準, 木村啓二, 笠原博徳

IPSJ Transactions on Advanced Computing Systems (ACS) 2(3), p.96-106, 2009/09-

Compiler-Directed Local Memory Management for Coarse-Grain Task Parallel Processing on Multicore Processors

中野啓史, 桃園拓, 間瀬正啓, 木村啓二, 笠原博徳

IPSJ Transactions on Advanced Computing Systems (ACS) 2(2), p.63-74, 2009/07-

Performance of OSCAR Multigrain Parallelizing Compiler on Multicore Processors

Hiroki Mikami, Jun Shirako, Masayoshi Mase, Takamichi Miyamoto, Hirofumi Nakano, Fumiyo Takano, Akihiro Hayashi, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

Proc. of 14th Workshop on Compilers for Parallel Computing(CPC 2009) 2009/01-

Parallelization with Automatic Parallelizing Compiler Generating Consumer Electronics Multicore API

Takamichi Miyamoto, Saori Asaka, Hiroki Mikami, Masayoshi Mase, Yasutaka Wada, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

Proc. of IEEE International Symposium on Advances in Parallel and Distributed Computing Techniques (APDCT-08) 2008/12-

Evaluation of Parallelization by an Automatic Parallelizing Compiler Generating a Multicore Parallelization API for Consumer Electronics

宮本孝道, 浅香沙織, 見神広紀, 間瀬正啓, 木村啓二, 笠原博徳

IPSJ Transactions on Advanced Computing Systems (ACS) 1(3), p.83-95, 2008/12-

Power Reduction Control for Multicores in OSCAR Multigrain Parallelizing Compiler

Jun Shirako, Keiji Kimura, Hironori Kasahara

Proc. of International SoC Design Conference (ISOCC 2008) 2008/11-

Parallelization of an MP3 Encoder Using Static Scheduling on a Heterogeneous Multicore

和田康孝, 林明宏, 益浦健, 白子準, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

IPSJ Transactions on Advanced Computing Systems 1(1), p.105-119, 2008/06-

Parallelizing Compiler Cooperative Heterogeneous Multicore

Yasutaka Wada, Akihiro Hayashi, Takeshi Masuura, Jun Shirako, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Proc. of Workshop on Software and Hardware Challenges of Manycore Platforms (SHCMP 2008) 2008/06-

An 8 CPU SoC with Independent Power-off Control of CPUs and Multicore Software Debug Function

Yutaka Yoshida, Masayuki Ito, Kiyoshi Hayase, Tomoichi Hayashi, Osamu Nishii, Toshihiro Hattori, Jun Sakiyama, Masashi Takada, Kunio Uchiyama, Jun Shirako, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

Proc. of IEEE Cool Chips XI: Symposium on Low-Power and High-Speed Chips 2008 2008/04-

Heterogeneous Multi-core Architecture that Enables 54x AAC-LC Stereo Encoding

Hiroaki Shikano, Masaki Ito, Takashi Todaka, Takanobu Tsunoda, Tomoyuki Kodama, Masafumi Onouchi, Kunio Uchiyama, Toshihiko Odaka, Tatsuya Kamei, Ei Nagahama, Manabu Kusaoke, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

IEEE Journal of Solid-State Circuits 43(4), p.902-910, 2008/04-

Power-Aware Compiler Controllable Chip Multiprocessor

Hiroaki Shikano, Jun Shirako, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

IEICE Transactions on Electronics E91-C(4), p.432-439, 2008/04-

A 600MHz SoC with Compiler Power-off Control of 8 CPUs and 8 Onchip-RAMs

Masayuki Ito, Toshihiro Hattori, Yutaka Yoshida, Kiyoshi Hayase, Tomoichi Hayashi, Osamu Nishii, Yoshihiko Yasu, Atsushi Hasegawa, Masashi Takada, Masaki Ito, Hiroyuki Mizuno, Kunio Uchiyama, Toshihiko Odaka, Jun Shirako, Masayoshi Mase, Keiji Kimura, Hironori Kasahara

Proc. of International Solid State Circuits Conference (ISSCC2008), p.90-91, 2008/02-

Software-Cooperative Power-Efficient Heterogeneous Multi-Core for Media Processing

Hiroaki Shikano, Masaki Ito, Kunio Uchiyama, Toshihiko Odaka, Akihiro Hayashi, Takeshi Masuura, Masayoshi Mase, Jun Shirako, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

Proc. of 13th Asia and South Pacific Design Automation Conference (ASP-DAC 2008), p.736-741, 2008/01-

Performance Evaluation of Compiler Controlled Power Saving Scheme

Jun Shirako, Munehiro Yoshida, Naoto Oshiyama, Yasutaka Wada, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science 4759, p.480-493, 2008/01-

Heterogeneous Multiprocessor on a Chip Which Enables 54x AAC-LC Stereo Encoding

Masaki Ito, Takashi Todaka, Takanobu Tsunoda, Hiroshi Tanaka, Tomoyuki Kodama, Hiroaki Shikano, Masafumi Onouchi, Kunio Uchiyama, Toshihiko Odaka, Tatsuya Kamei, Ei Nagahama, Manabu Kusaoke, Yusuke Nitta, Yasutaka Wada, Keiji Kimura, Hironori Kasahara

Proc. of 2007 Symposia on VLSI Technology and Circuits 2007/06-

Performance Evaluation of the OSCAR Heterogeneous Chip Multiprocessor Using an MP3 Encoder

鹿野裕明, 鈴木裕貴, 和田康孝, 白子準, 木村啓二, 笠原博徳

IPSJ Transactions on Advanced Computing Systems, Vol. 48, No. SIG8 (ACS18), p.141-152, 2007/05-

Compiler Control Power Saving Scheme for Multi Core Processors

Jun Shirako, Naoto Oshiyama, Yasutaka Wada, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science Vol. 4339, p.362-376, 2007/05-

A 4320MIPS Four-Processor Core SMP/AMP with Individually Managed Clock Frequency for Low Power Consumption

Y. Yoshida, T. Kamei, K. Hayase, S. Shibahara, O. Nishii, T. Hattori, A. Hasegawa, M. Takada, N. Irie, K. Uchiyama, T. Odaka, K. Takada, K. Kimura, H. Kasahara

2007 IEEE International Solid-State Circuits Conference (ISSCC2007), p.100-101, 2007/02-

Compiler-Controlled Power Reduction Scheme for Multicore Processors

白子準, 吉田宗弘, 押山直人, 和田康孝, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

IPSJ Transactions on Advanced Computing Systems, Vol. 47 (ACS15), 2006-

Performance Evaluation of Compiler Controlled Power Saving Scheme

Jun Shirako, Munehiro Yoshida, Naoto Oshiyama, Yasutaka Wada, Hirofumi Nakano, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Proc. of 20th ACM International Conference on Supercomputing Workshop on Advanced Low Power Systems (ALPS2006) 2006/07-

Performance Evaluation of Heterogeneous Chip Multi-Processor with MP3 Audio Encoder

Hiroaki Shikano, Yuki Suzuki, Yasutaka Wada, Jun Shirako, Keiji Kimura, Hironori Kasahara

Proc. of IEEE Symposium on Low-Power and High-Speed Chips (COOL Chips IX), p.349-363, 2006/05-

Parallelizing Compilation Scheme for Reduction of Power Consumption of Chip Multiprocessors

Jun Shirako, Naoto Oshiyama, Yasutaka Wada, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Proc. of 12th Workshop on Compilers for Parallel Computers (CPC 2006), 2006/01-

Microprocessors Going Multicore

笠原博徳, 木村啓二

IPSJ Magazine (Joho Shori) 47(1), p.10-16, 2006/01-

Programming on Multicores

木村啓二, 笠原博徳

IPSJ Magazine (Joho Shori) 47(1), p.17-23, 2006/01-

Compiler Control Power Saving Scheme for Multi Core Processors

Jun Shirako, Naoto Oshiyama, Yasutaka Wada, Hiroaki Shikano, Keiji Kimura, Hironori Kasahara

Proc. of The 18th International Workshop on Languages and Compilers for Parallel Computing (LCPC2005), 2005/10-

Parallel Processing of MPEG2 Encoding on a Chip Multiprocessor

小高剛, 中野啓文, 木村啓二, 笠原博徳

IPSJ Journal 46(9), p.2311-2325, 2005/09-

Performance of OSCAR Multigrain Parallelizing Compiler on SMP Servers

Kazuhisa Ishizaka, Takamichi Miyamoto, Jun Shirako, Motoki Obata, Keiji Kimura, Hironori Kasahara

Lecture Notes in Computer Science 3602, p.319, 2005-

Multigrain Parallel Processing on Compiler Cooperative Chip Multiprocessor

Keiji Kimura, Yasutaka Wada, Hirofumi Nakano, Takeshi Kodaka, Jun Shirako, Kazuhisa Ishizaka, Hironori Kasahara

Proc. of 9th Workshop on Interaction between Compilers and Computer Architectures (INTERACT-9), p.11-20, 2005/02-

Performance of OSCAR Multigrain Parallelizing Compiler on SMP Servers

Kazuhisa Ishizaka, Takamichi Miyamoto, Jun Shirako, Keiji Kimura, Hironori Kasahara

Proc. of 17th International Workshop on Languages and Compilers for Parallel Computing (LCPC2004) 2004/09-

Multigrain Parallel Processing on Compiler Cooperative OSCAR Chip Multiprocessor Architecture (co-authored)

Keiji Kimura, Yasutaka Wada, Hirofumi Nakano, Takeshi Kodaka, Jun Shirako, Kazuhisa Ishizaka, Hironori Kasahara

The IEICE Transactions on Electronics, Special Issue on High-Performance and Low-Power System LSIs and Related Technologies, E86-C(4), p.570-579, 2003/02-

Static Coarse Grain Task Scheduling with Cache Optimization Using OpenMP

Hirofumi Nakano, Kazuhisa Ishizaka, Motoki Obata, Keiji Kimura, Hironori Kasahara

International Journal of Parallel Programming 31(3), p.211-223, 2003/06-

Parallel Processing using Data Localization for MPEG2 Encoding on OSCAR Chip Multiprocessor

Takeshi Kodaka, Hirofumi Nakano, Keiji Kimura, Hironori Kasahara

Proc. of International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'04) 2004/01-

Memory Management for Data Localization on OSCAR Chip Multiprocessor

Hirofumi Nakano, Takeshi Kodaka, Keiji Kimura, Hironori Kasahara

Proc. of International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'04) 2004/01-

Multigrain Parallel Processing on OSCAR CMP

Keiji Kimura, Takeshi Kodaka, Motoki Obata, Hironori Kasahara

IEEE Computer Society Proc. of International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'03), p.56-65, 2003/01-

Performance of Multigrain Parallelization in Japanese Millennium Project IT21 Advanced Parallelizing Compiler

Hironori Kasahara, Motoki Obata, Kazuhisa Ishizaka, Keiji Kimura, Hiroki Kaminaga, Hirofumi Nakano, Kouhei Nagasawa, Akiko Murai, Hiroki Itagaki, Jun Shirako

Proc. of 10th International Workshop on Compilers for Parallel Computers (CPC), Amsterdam, Netherlands, 2003/01-

Multigrain Parallel Processing of JPEG Encoding on a Single-Chip Multiprocessor

小高剛, 内田貴之, 木村啓二, 笠原博徳

IPSJ Transactions on High Performance Computing Systems 43(SIG 6 (HPS5)), p.153-162, 2002-

Evaluation of Processor Cores for a Single-Chip Multiprocessor for Near-Fine-Grain Parallel Processing

木村啓二, 加藤孝幸, 笠原博徳

IPSJ Journal 42(4), p.692-703, 2001/04-

Static Coarse Grain Task Scheduling with Cache Optimization Using OpenMP

Hirofumi Nakano, Kazuhisa Ishizaka, Motoki Obata, Keiji Kimura, Hironori Kasahara

Springer Lecture Notes in Computer Science 2327, High Performance Computing (Proc. of ISHPC WOMPEI), p.479-489, 2002-

Multigrain Parallel Processing for JPEG Encoding on a Single Chip Multiprocessor

T. Kodaka, K. Kimura, H. Kasahara

IEEE Computer Society Proc. of International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'02), p.57-63, 2002/01-

Evaluation of Single Chip Multiprocessor Core Architecture with Near Fine Grain Parallel Processing

Keiji Kimura, Hironori Kasahara

Proc. of International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'01) 2001/01-

Near-Fine-Grain Parallel Processing on a Single-Chip Multiprocessor

木村啓二, 尾形航, 岡本雅巳, 笠原博徳

IPSJ Journal 40(5), p.1924-1934, 1999/05-

Near Fine Grain Parallel Processing Using Static Scheduling on Single Chip Multiprocessors

Keiji Kimura, Hironori Kasahara

IEEE Computer Society Proc. of International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'99), p.23-31, 1999/11-

Data-Localization among Doall and Sequential Loops in Coarse Grain Parallel Processing

Akimasa Yoshida, Yasushi Ujigawa, Motoki Obata, Keiji Kimura, Hironori Kasahara

Seventh Workshop on Compilers for Parallel Computers, Linkoping, Sweden, p.266-277, 1998/01-

OSCAR Multi-grain Architecture and Its Evaluation

Hironori Kasahara, Masami Okamoto, Akimasa Yoshida, Wataru Ogata, Keiji Kimura, Gantetsu Matsui, Hidenori Matsuzaki, Hiroki Honda

IEEE Computer Society Proc. International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems (IWIA'97), p.106-115, 1997/11-

Near Fine Grain Parallel Processing without Explicit Synchronization on a Multiprocessor System

Wataru Ogata, Akimasa Yoshida, Masami Okamoto, Keiji Kimura, Hironori Kasahara

Proc. of Sixth Workshop on Compilers for Parallel Computers (Aachen, Germany) 1996/12-

Automatic Parallelization for Multicores of Engine Control C Code Automatically Generated by Model-Based Design

梅田弾, 金羽木洋平, 見神広紀, 谷充弘(デンソー), 森裕司(デンソー), 木村啓二, 笠原博徳

Embedded Systems Symposium (ESS2013) 2013/10-

Parallel Processing of Multimedia Applications on TILEPro64 Using the OSCAR API for Embedded Multicores

岸本耀平, 見神広紀, 中野恵一, 林明宏, 木村啓二, 笠原博徳

Embedded Systems Symposium (ESS2012) 2012/10-

Automatic Parallelization of a Dose Calculation Engine for Heavy-Ion Radiotherapy

林明宏, 松本卓司, 見神広紀, 木村啓二, 山本啓二, 崎浩典, 高谷保行, 笠原博徳

HPCS2012 - Symposium on High Performance Computing and Computational Science 2012/01-

Power Reduction Scheme by a Parallelizing Compiler Using the OSCAR API on Multicores

中川亮, 間瀬正啓, 大國直人, 白子準, 木村啓二, 笠原博徳

Symposium on Advanced Computing Systems and Infrastructures (SACSIS2009), p.3-10, 2009/05-

Compiler Parallelization of Multimedia Processing on a Multicore for Consumer Electronics

宮本孝道, 浅香沙織, 見神広紀, 間瀬正啓, 木村啓二, 笠原博徳

SACSIS2008 - Symposium on Advanced Computing Systems and Infrastructures 2008/05-

Multigrain Parallelization of Restricted C Programs in the SMP Execution Mode of a Multicore for Consumer Electronics

間瀬正啓, 馬場大介, 長山晴美, 田野裕秋, 益浦健, 宮本孝道, 白子準, 中野啓史, 木村啓二, 笠原博徳

Embedded Systems Symposium 2007 2007/10-

Compiler-Controlled Power Reduction Scheme for Multicore Processors

白子準, 吉田宗広, 押山直人, 和田康孝, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

Symposium on Advanced Computing Systems and Infrastructures (SACSIS2006), (467), p.476, 2006/05-

Multigrain Parallel Processing of JPEG Encoding on a Single-Chip Multiprocessor (co-authored)

小高剛, 内田貴之, 木村啓二, 笠原博徳

IPSJ Joint Symposium on Parallel Processing (JSPP2002) 2002/05-

A Statistical Acceleration Method for a Parallelizing-Compiler-Cooperative Multicore Architecture Simulator

田口学豊, 木村啓二, 笠原博徳

IEICE Technical Report, ETNET2014, 2014/03-

A Latency Reduction Method by Signature Assignment on Multicores for Intrusion Detection Systems

山田正平, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2013-ARC-201, 2014/03-

An Automatic Parallelization Method for Small-Point FFTs on Multicores

古山祐樹, 見神広紀, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2013-ARC-201, 2014/03-

Parallelization of the Android 2D Drawing Library SKIA by the OSCAR Compiler Using Profile Information

後藤隆志, 武藤康平, 山本英雄, 平野智大, 見神広紀, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2013-ARC-207-12, 2013/12-

Automatic Power Control on Android-Based Multicores

平野智大, 武藤康平, 後藤隆志, 見神広紀, 山本英雄, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2013-ARC-206-23, 2013/08-

Evaluation of a Hardware Barrier Synchronization Mechanism Supporting Hierarchical Grouping Using the OSCAR API Standard Interpretation System

川島慧大, 金羽木洋平, 林明宏, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2013-ARC-206-16, 2013/08-

Enhancing the Performance of a Multiplayer Game by Using a Parallelizing Compiler

Yasir Al-Dosary, 古山祐樹, Dominic Hillenbrand, 木村啓二, 笠原博徳, 成田誠之助

IPSJ SIG Technical Report, 2013-OS-125, 2013/04-

Evaluation and Parallelization Attempts of Commercial Multicore Smart Devices

山本英雄, 後藤隆志, 平野智大, 武藤康平, 見神広紀, Hillenbrand Dominic, 林明宏, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2013-OS-124, 2013/02-

A Compiler-Cooperative Multicore Architecture Simulator with Switchable Simulation Accuracy

田口学豊, 阿部洋一, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2012-ARC-203-14, 2013/01-

Statistical Acceleration of Manycore Architecture Simulation for Parallelized Applications

阿部洋一, 田口学豊, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2012-ARC-203-13, 2013/01-

Parallel Processing of Automotive Engine Control Software on Multicores

金羽木洋平, 梅田弾, 見神広紀, 林明宏, 沢田光男(トヨタ自動車(株)), 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2013-ARC-203-2, 2013/01-

Opportunities and Challenges of Application-Power Control in the Age of Dark Silicon

Dominic Hillenbrand, Yuuki Furuyama, Akihiro Hayashi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara

IPSJ SIG Technical Report, 2012-ARC-202/HPC-137-26, 2012/12-

Automatic Parallelization of the Ground Motion Simulator GMS by the OSCAR Compiler

島岡護, 見神広紀, 林明宏, 和田康孝, 木村啓二, 森田秀和(株日立製作所), 内山邦男(株日立製作所), 笠原博徳

IPSJ SIG Technical Report, 2012-ARC-202/HPC-137-11, 2012/12-

Automatic parallelization with OSCAR API Analyzer: a cross-platform performance evaluation

Gonzalez-Alvarez Cecilia, 金羽木洋平, 竹本昂生, 岸本耀平, 武藤康平, 見神広紀, 林明宏, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2012-ARC-202/HPC-137-10, 2012/12-

Realization of a One-Watt Web Service Using the Low-Power Multicore RP-X

古山祐樹, 島岡護, 見神広紀, 林明宏, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2012-ARC-201-24, 2012/08-

Parallel Processing of a Basic Engine Control Software Model on Multicores

梅田弾, 金羽木洋平, 見神広紀, 林明宏, 谷充弘, 森裕司, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2012-ARC-201-22, 2012/08-

A Study on Accelerating Manycore Architecture Simulation for Parallelized Media Applications

阿部洋一, 石塚亮, 大胡亮太, 田口学豊, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2012-ARC-199-3, 2012/03-

Inlining Analysis of Exception Flow and Faster Method Dispatch in Automatic Parallelization of Java

田端啓一, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2012-ARC-199-9, 2012/03-

Definition of Parallelizable C Using JIS X 0180:2011 "Guidelines for Coding Conventions for Embedded Software"

木村啓二, 間瀬正啓, 笠原博徳

IPSJ SIG Technical Report, ETNET2012, 2012/03-

Automatic Parallelization of a Dose Calculation Engine for Particle Radiotherapy on SMP Servers

林明宏, 松本卓司, 見神広紀, 木村啓二, 山本啓二, 崎浩典, 高谷保行, 笠原博徳

IPSJ SIG Technical Report, 2011-ARC-189/HPC-132-2, 2011/11-

Evaluation of a Manycore Architecture Simulation Acceleration Method Exploiting the Structure of Scientific Computing Programs

石塚亮, 阿部洋一, 大胡亮太, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2011-ARC-196-14, 2011/07-

A Study on Parallelizing SPEC Benchmark Programs with CUDA

平勇樹, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2011-HPC-130-16, 2011/07-

A Compiler Technique for Hiding I/O Overhead in Media Applications

林明宏, 関口威, 間瀬正啓, 和田康孝, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2011-ARC-195-14, 2011/04-

Power Consumption Evaluation of Multiple Media Applications Running on the Low-Power Multicore RP2

見神広紀, 北基俊平, 佐藤崇文, 間瀬正啓, 木村啓二, 石坂一久, 酒井淳嗣, 枝廣正人, 笠原博徳

IPSJ SIG Technical Report, 2011-ARC-194-1, 2011/03-

Evaluation of Parallelizable C Programs Using the OSCAR API Standard Interpretation System

佐藤卓也, 見神広紀, 林明宏, 間瀬正啓, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2011-ARC-191-2, 2010/10-

An Automatic Parallelizing Compiler Framework for Heterogeneous Multicores for Consumer Electronics

林明宏, 和田康孝, 渡辺岳志, 関口威, 間瀬正啓, 木村啓二, 伊藤雅之, 長谷川淳, 佐藤真琴, 野尻徹, 内山邦男, 笠原博徳

IPSJ SIG Technical Report, 2010-ARC-190-7 (SWoPP2010), 2010/08-

Compiler Power Control Performance on the Heterogeneous Multicore RP-X for Consumer Electronics

和田康孝, 林明宏, 渡辺岳志, 関口威, 間瀬正啓, 白子準, 木村啓二, 伊藤雅之, 長谷川淳, 佐藤真琴, 野尻徹, 内山邦男, 笠原博徳

IPSJ SIG Technical Report, 2010-ARC-190-8 (SWoPP2010), 2010/08-

An Acceleration Method for a Manycore Architecture Simulator Focusing on Program Structure

石塚亮, 大友俊也, 大胡亮太, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2010-ARC-190-20 (SWoPP2010), 2010/08-

Performance of Automatically Parallelized Applications When Multiple Applications Run on an Embedded Multicore

宮本孝道, 間瀬正啓, 木村啓二, 石坂一久, 酒井淳嗣, 枝廣正人, 笠原博徳

IPSJ SIG Technical Report, 2010-ARC-188, 2010/03-

Hierarchical Parallel Processing of an H.264/AVC Encoder on a Multicore Processor

見神広紀, 宮本孝道, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2010-ARC-187, 2010/01-

Element-Sensitive Pointer Analysis for Automatic Parallelization

間瀬正啓, 村田雄太, 木村啓二, 笠原博徳

IPSJ 76th SIG Programming Meeting 2009/10-

Automatic Parallelization of Parallelizable C Programs on Multicores

間瀬正啓, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2009-ARC-174-15 (SWoPP2009), 2009/08-

Evaluation of Scheduling Algorithms Using the Standard Task Graph Set STG Ver. 3 Considering Skew in Parallelism and Task Execution Time

島岡護, 今泉和浩, 鷹野芙美代, 木村啓二, 笠原博徳

IPSJ 119th SIG Meeting on High Performance Computing 2009/02-

Simulation Evaluation of a Parallelizing-Compiler-Cooperative Heterogeneous Multicore Architecture Using Media Applications

神山輝壮, 和田康孝, 林明宏, 間瀬正啓, 中野啓史, 渡辺岳志, 木村啓二, 笠原博徳

IEICE Technical Report, ICD2008-140, 2009/01-

A Local Memory Management Method in a Compiler for Multicores

桃園拓, 中野啓史, 間瀬正啓, 木村啓二, 笠原博徳

IEICE Technical Report, ICD2008-141, 2009/01-

A Power Reduction Scheme Using the OSCAR API on Multicores

中川亮, 間瀬正啓, 白子準, 木村啓二, 笠原博徳

IEICE Technical Report, ICD2008-145, 2009/01-

Automatic Parallelization of Restricted C Programs Using Pointer Analysis

間瀬正啓, 馬場大介, 長山晴美, 村田雄太, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2007-ARC-178-14, 2008/05-

Evaluation of a Barrier Synchronization Mechanism Supporting Hierarchical Grouping

山田海斗, 間瀬正啓, 白子準, 木村啓二, 伊藤雅之, 服部俊洋, 水野弘之, 内山邦男, 笠原博徳

IPSJ SIG Technical Report, 2007-ARC-178-4, 2008/05-

Parallelization of Multimedia Processing on Multicore Processors

宮本孝道, 田村圭, 田野裕秋, 見神広紀, 浅香沙織, 間瀬正啓, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2007-ARC-175-15 (Design Gaia 2007), 2007/11-

A Study of a Heterogeneous Multicore Architecture Achieving 54x AAC Encoding

鹿野裕明, 伊藤雅樹, 戸高貴司, 津野田賢伸, 兒玉征之, 小野内雅文, 内山邦男, 小高俊彦, 亀井達也, 永濱衛, 草桶学, 新田祐介, 和田康孝, 木村啓二, 笠原博徳

IEICE Technical Report, ICD2007-71, Vol. 107(195), 2007/08-

A Hierarchical Coarse-Grain Task Static Scheduling Method on Heterogeneous Multicores

和田康孝, 林明宏, 伊能健人, 白子準, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2007-ARC-174-17 (SWoPP2007), 2007/08-

Compiler-Directed Low-Power Control on Heterogeneous Multicores

林明宏, 伊能健人, 中川亮, 松本繁, 山田海斗, 押山直人, 白子準, 和田康孝, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2007-ARC-174-18 (SWoPP2007), 2007/08-

Multigrain Parallel Processing in the SMP Execution Mode of a Multicore for Consumer Electronics

間瀬正啓, 馬場大介, 長山晴美, 田野裕秋, 益浦健, 深津幸二, 宮本孝道, 白子準, 中野啓史, 木村啓二, 亀井達也, 服部俊洋, 長谷川淳, 佐藤真琴, 伊藤雅樹, 内山邦男, 小高俊彦, 笠原博徳

IPSJ SIG Technical Report, 2007-ARC-173-05, 2007/05-

Development of a 4320 MIPS Four-Processor LSI Supporting SMP/AMP with Independently Controllable Clock Frequencies

早瀬清, 吉田裕, 亀井達也, 芝原真一, 西井修, 服部俊洋, 長谷川淳, 高田雅士, 入江直彦, 内山邦男, 小高俊彦, 高田究, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2007-ARC-173-06, 2007/05-

A Local Memory Management Method in a Multigrain Parallelizing Compiler

三浦剛, 田川友博, 村松裕介, 池見明紀, 中川正洋, 中野啓史, 白子準, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, 2007-ARC-172/HPC-109-11 (HOKKE2007), 2007/03-

Automatic Parallelization of Multimedia Applications on Multicores

宮本孝道, 浅香沙織, 鎌倉信仁, 山内宏真, 間瀬正啓, 白子準, 中野啓史, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2007-171-13, 2007/01-

Performance of the OSCAR Multigrain Automatic Parallelizing Compiler on SMP Servers and Embedded Multicores

白子準, 田川友博, 三浦剛, 宮本孝道, 中野啓史, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2006-170-02 (Design Gaia 2006), 2006/11-

Automatic Parallelization of Restricted C Programs in the OSCAR Compiler

間瀬正啓, 馬場大介, 長山晴美, 田野裕秋, 益浦健, 深津幸二, 宮本孝道, 白子準, 中野啓史, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2006-170-01 (Design Gaia 2006), 2006/11-

A Local Memory Management Method on the OSCAR Multicore

中野啓史, 仁藤拓実, 丸山貴紀, 中川正洋, 鈴木裕貴, 内藤陽介, 宮本孝道, 和田康孝, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2006-169-28, 2006/08-

A Data Transfer Overlapping Scheme for Coarse-Grain Task Parallel Processing on Multicore Processors

宮本孝道, 中川正洋, 浅野尚一郎, 内藤陽介, 仁藤拓実, 中野啓史, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC-2006-167, HPC-2006-105, 2006/02-

Performance Evaluation of a Heterogeneous Chip Multiprocessor Using an MP3 Encoder

鹿野裕明, 鈴木裕貴, 和田康孝, 白子準, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC-2006-166, 2006/01-

A Coarse-Grain Task Static Scheduling Method on a Heterogeneous Chip Multiprocessor

和田康孝, 押山直人, 鈴木裕貴, 内藤陽介, 白子準, 中野啓史, 鹿野裕明, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC-2006-166, 2006/01-

Data Localization on Multicore Processors

中野啓文, 浅野尚一郎, 内藤陽介, 仁藤拓実, 田川友博, 宮本孝道, 小高剛, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2005-165-10, 2005/12-

A Compiler-Controlled Power Reduction Scheme on Homogeneous Multicores

白子準, 押山直人, 和田康孝, 鹿野裕明, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2005-164-10 (SWoPP2005), 2005/08-

Electronic Circuit Simulation Using a Code Generation Method without Indirect Array Accesses

黒田亮, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2005-161-1 (SHINING2005), 2005/01-

Performance Evaluation of the OSCAR Multigrain Automatic Parallelizing Compiler on Shared-Memory Multiprocessor Servers

白子準, 宮本孝道, 石坂一久, 小幡元樹, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2005-161-5 (SHINING2005), 2005/01-

Parallel Processing of MPEG2 Encoding on the OSCAR Chip Multiprocessor

小高剛, 中野啓文, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2004-160-10, 2004/12-

Evaluation of Multigrain Parallelism on the OSCAR Chip Multiprocessor

和田康孝, 白子準, 石坂一久, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2004-159-11 (SWoPP2004), 2004/08-

Data Localization Using a Data Transfer Unit on the OSCAR Chip Multiprocessor

中野啓文, 内藤陽介, 鈴木貴久, 小高剛, 石坂一久, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2004-159-20 (SWoPP2004), 2004/08-

Parallel Processing of MPEG2 Encoding on the OSCAR Chip Multiprocessor (co-authored)

小高剛, 中野啓文, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2003-154-10 (SWoPP2003), 2003/08-

A Data Localization Method Using Static Scheduling on the OSCAR CMP (co-authored)

中野啓文, 内藤陽介, 鈴木貴久, 小高剛, 石坂一久, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2003-154-14 (SWoPP2003), 2003/08-

A Data Prefetching Method for Coarse-Grain Task Parallel Processing on SMP Machines

宮本孝道, 山口高弘, 飛田高雄, 石坂一久, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2003-155-06, 2003/11-

Parallel Processing of MPEG2 Encoding with Data Localization

小高剛, 中野啓文, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2004-156-3, 2004/02-

Data Localization by Coarse-Grain Task Parallel Processing on a Chip Multiprocessor (co-authored)

中野啓文, 小高剛, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2003-151-3 (SHINING2003), 2003/01-

Multigrain Parallel Processing on the OSCAR Chip Multiprocessor (co-authored)

木村啓二, 小高剛, 笠原博徳

IPSJ SIG Technical Report, ARC2002-150-7, 2002/11-

Motion Vector Search Processing on the OSCAR-Type Single-Chip Multiprocessor (co-authored)

小高剛, 鈴木貴久, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2002-150-6, 2002/11-

Analysis of Coarse-Grain Task Parallel Processing Overhead on SMP Machines (co-authored)

和田康孝, 中野啓文, 木村啓二, 小幡元樹, 笠原博徳

IPSJ SIG Technical Report, ARC2002-148-3, 2002/05-

Multigrain Parallel Processing on a Single-Chip Multiprocessor (co-authored)

内田貴之, 小高剛, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2002-146-3, 2002/02-

Multigrain Parallel Processing of a JPEG Encoding Program on the OSCAR-Type Single-Chip Multiprocessor (co-authored)

小高剛, 内田貴之, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2002-146-4, 2002/02-

Near-Fine-Grain Parallel Processing of Multimedia Applications on a Single-Chip Multiprocessor (co-authored)

小高剛, 宮下直久, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2001-144-11, 2001/11-

A Coarse-Grain Task Static Scheduling Method Considering Cache Optimization on Multiprocessor Systems (co-authored)

中野啓文, 石坂一久, 小幡元樹, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC2001-144-12, 2001/08-

Processor Core Configuration for a Single-Chip Multiprocessor for Near-Fine-Grain Parallel Processing (co-authored)

木村啓二, 内田貴之, 加藤孝幸, 笠原博徳

IPSJ SIG Technical Report, ARC139-16 (SWoPP2000), 2000-

Performance Evaluation of Near-Fine-Grain Parallel Processing on a Single-Chip Multiprocessor (co-authored)

木村啓二, 間中邦之, 尾形航, 岡本雅巳, 笠原博徳

IPSJ SIG Technical Report, ARC134-4, 1999/08-

A Memory Access Analyzer for a Multigrain Parallelizing Compiler (co-authored)

岩井啓輔, 小幡元樹, 木村啓二, 天野英晴, 笠原博徳

IEICE Technical Report, CPSY99-62, 1999/08-

A Multigrain Parallelizing Compiler and Its Architectural Support (co-authored)

笠原博徳, 尾形航, 木村啓二, 小幡元樹, 飛田高雄, 稲石大祐

IEICE Technical Report, ICD98-10, CPSY98-10, FTS98-10, 1998/04-

Multigrain Parallel Processing on a Single-Chip Multiprocessor (co-authored)

木村啓二, 尾形航, 岡本雅巳, 笠原博徳

IPSJ SIG Technical Report, ARC98-130-5, 1998/08-

Cache Usage Optimization Using Earliest Executable Condition Analysis (co-authored)

稲石大祐, 木村啓二, 藤本謙作, 尾形航, 笠原博徳

IPSJ SIG Technical Report, ARC98-130-6, 1998/08-

A Dynamic Scheduling Algorithm Considering Overlapping of Computation and Data Transfer (co-authored)

木村啓二, 橋本茂, 古郷誠, 尾形航, 笠原博徳

IEICE Technical Report, CPSY97-40, 1997-

A Multiprocessor System for Multigrain Parallel Processing (co-authored)

岩井啓輔, 藤原崇, 森村知弘, 天野英晴, 木村啓二, 尾形航, 笠原博徳

IEICE Technical Report, CPSY97-46, 1997/08-

Implementation of a Multiprocessor System Testbed Using FPGAs (co-authored)

尾形航, 山本泰平, 水尾学, 木村啓二, 笠原博徳

IPSJ SIG Technical Report, ARC128-14, HPC70-14, 1998/03-

IBM pSeries 690上でのOSCARマルチグレイン自動並列化コンパイラの精嚢評価

石坂一久, 白子準, 小幡元樹, 木村啓二, 笠原博徳

情報処理学会第66回全国大会 2004/03-

マルチプロセッサシステム上でのキャッシュ最適化を考慮した粗粒度タスクスタティックスケジューリング手法 (共著)

中野啓文, 石坂一久, 小幡元樹, 木村啓二, 笠原博徳

情報処理学会第62回全国大会 2001/03-

マルチメディアアプリケーションのシングルチップマルチプロセッサ上での近細粒度並列処理

小高剛, 木村啓二, 宮下直久, 笠原博徳

情報処理学会第62回全国大会 2001/03-

近細粒度並列処理に適したシングルチップマルチプロセッサのメモリアーキテクチャの評価

松元信介, 木村啓二, 笠原博徳

情報処理学会第62回全国大会 2001/03-

マルチグレイン並列処理用シングルチップマルチプロセッサにおけるデータ転送ユニットの検討

宮下直久, 木村啓二, 小高剛, 笠原博徳

情報処理学会第62回全国大会 2001/03-

Performance Evaluation of Near-Fine-Grain Parallel Processing on a Single Chip Multiprocessor

加藤考幸, 尾形航, 木村啓二, 内田貴之, 笠原博徳

IPSJ 60th National Convention, 2000/03

A Cache Optimization Scheme Using Earliest Executable Condition Analysis

稲石大祐, 木村啓二, 藤本謙作, 尾形航, 岡本雅巳, 笠原博徳

IPSJ 58th National Convention, 1999/03

A Single Chip Multiprocessor Architecture for Multigrain Parallel Processing

木村啓二, 尾形航, 岡本雅巳, 笠原博徳

IPSJ 56th National Convention, 1998/03

A Cache Optimization Scheme Using Macrotask Earliest Executable Condition Analysis

稲石大祐, 木村啓二, 尾形航, 岡本雅巳, 笠原博徳

IPSJ 56th National Convention, 1998/03

A 45-nm 37.3 GOPS/W Heterogeneous Multi-Core SOC with 16/32 Bit Instruction-Set General-Purpose Core

Nishii, Osamu;Yuyama, Yoichi;Ito, Masayuki;Kiyoshige, Yoshikazu;Nitta, Yusuke;Ishikawa, Makoto;Yamada, Tetsuya;Miyakoshi, Junichi;Wada, Yasutaka;Kimura, Keiji;Kasahara, Hironori;Maejima, Hideo

IEICE Transactions on Electronics, E94-C(4), p.663-669, 2011

ISSN:0916-8524

Multi Media Offload with Automatic Parallelization

ISHIZAKA KAZUHISA;SAKAI JUNJI;EDAHIRO MASATO;MIYAMOTO TAKAMICHI;MASE MASAYOSHI;KIMURA KEIJI;KASAHARA HIRONORI

2010(59), p.1-7, 2010/03

ISSN:09196072

Parallelizing Compiler Directed Software Coherence

MASE MASAYOSHI;KIMURA KEIJI;KASAHARA HIRONORI

2010(7), p.1-10, 2010/04

ISSN:09196072

A Compiler Framework for Heterogeneous Multicores for Consumer Electronics

HAYASHI AKIHIRO;WADA YASUTAKA;WATANABE TAKESHI;SEKIGUCHI TAKESHI;MASE MASAYOSHI;KIMURA KEIJI;ITO MASAYUKI;HASEGAWA ATSUSHI;SATO MAKOTO;NOJIRI TOHRU;UCHIYAMA KUNIO;KASAHARA HIRONORI

2010(7), p.1-9, 2010/07

ISSN:09196072

Performance of Power Reduction Scheme by a Compiler on Heterogeneous Multicore for Consumer Electronics "RP-X"

WADA YASUTAKA;HAYASHI AKIHIRO;WATANABE TAKESHI;SEKIGUCHI TAKESHI;MASE MASAYOSHI;SHIRAKO JUN;KIMURA KEIJI;ITO MASAYUKI;HASEGAWA ATSUSHI;SATO MAKOTO;NOJIRI TOHRU;UCHIYAMA KUNIO;KASAHARA HIRONORI

2010(8), p.1-10, 2010/07

ISSN:09196072

An Acceleration Technique of Many Core Architecture Simulator Considering Program Structure

ISHIZUKA RYO;OOTOMO TOSHIYA;DAIGO RYOTA;KIMURA KEIJI;KASAHARA HIRONORI

2010(20), p.1-7, 2010/07

ISSN:09196072

Evaluation of Parallelizable C Programs by the OSCAR API Standard Translator

SATO TAKUYA;MIKAMI HIROKI;HAYASHI AKIHIRO;MASE MASAYOSHI;KIMURA KEIJI;KASAHARA HIRONORI

2010(2), p.1-6, 2010/10

ISSN:09196072

Automatic Parallelization of Dose Calculation Engine for A Particle Therapy on SMP Servers

Akihiro Hayashi;Takuji Matsumoto;Hiroki Mikami;Keiji Kimura;Keiji Yamamoto;Hironori Saki;Yasuyuki Takatani;Hironori Kasahara

IPSJ SIG Notes, 2011(2), p.1-9, 2011/11

Outline: Particle therapy has attracted much attention over the years because it is highly effective against cancers while having little effect on normal cells. However, simulating the dose calculation before treatment takes a long time, so it is essential to improve the performance of the treatment simulation by using multicore processors. In this paper, we realize automatic parallelization of a dose calculation engine for particle therapy. We apply a technique that increases the parallelism of the calculation engine so that a parallelizing compiler can exploit loop-level parallelism. As a result, the proposed method attains speedups of up to 50.0x with 64 IBM Power 7 processors and 9.0x with 12 Intel Xeon processors.

A Definition of Parallelizable C by JISX0180:2011 "Framework of establishing coding guidelines for embedded system development"

KIMURA KEIJI;MASE MASAYOSHI;KASAHARA HIRONORI

IEICE Technical Report, Dependable Computing, 111(462), p.127-132, 2012/02

ISSN:0913-5685

Outline: JISX0180:2011, "Framework of establishing coding guidelines for embedded system development," was established to improve the quality of embedded systems. Parallelizable C has also been proposed to support the exploitation of parallelism by a parallelizing compiler. This paper proposes a definition of Parallelizable C based on JISX0180:2011, aiming to improve the productivity of embedded multicore developers who use parallelizing compilers. An evaluation was carried out on ordinary SMPs and a consumer-electronics multicore using programs rewritten according to the defined coding guideline. As a result, speedups of 5.54x on an IBM p5 550Q (8 cores), 2.42x on an Intel Core i7 960 (4 cores), and 2.79x on a Renesas/Hitachi/Waseda RP2 (4 cores) were achieved.

A Definition of Parallelizable C by JISX0180:2011 "Framework of establishing coding guidelines for embedded system development"

KIMURA KEIJI;MASE MASAYOSHI;KASAHARA HIRONORI

IEICE Technical Report, Computer Systems, 111(461), p.127-132, 2012/02

ISSN:09135685

Outline: JISX0180:2011, "Framework of establishing coding guidelines for embedded system development," was established to improve the quality of embedded systems. Parallelizable C has also been proposed to support the exploitation of parallelism by a parallelizing compiler. This paper proposes a definition of Parallelizable C based on JISX0180:2011, aiming to improve the productivity of embedded multicore developers who use parallelizing compilers. An evaluation was carried out on ordinary SMPs and a consumer-electronics multicore using programs rewritten according to the defined coding guideline. As a result, speedups of 5.54x on an IBM p5 550Q (8 cores), 2.42x on an Intel Core i7 960 (4 cores), and 2.79x on a Renesas/Hitachi/Waseda RP2 (4 cores) were achieved.

A Latency Reduction Technique for IDS by Allocating Decomposed Signature on Multi-core

Shohei Yamada;Hiroki Mikami;Keiji Kimura;Hironori Kasahara

IPSJ SIG Notes, 2014(2), p.1-8, 2014/02

Outline: Cyber attacks targeting companies and government organizations have been increasing and becoming more sophisticated. An Intrusion Detection System (IDS) is one of the efficient solutions to prevent those attacks: an IDS detects illegal network accesses in real time by monitoring the network and filtering suspicious IP packets, so high processing performance is required to handle a large number of IP packets in real time. To satisfy this requirement, this paper proposes a latency reduction technique for signature-based IDSs that allocates decomposed signatures across multicores. The proposed technique is implemented in Suricata, an open-source IDS, and evaluated with several data sets, including the DARPA Intrusion Detection Evaluation Data Set. The evaluation results show that the proposed technique with four cores achieves up to 3.22 times higher performance compared with two cores without signature decomposition.

Automatic Parallelization of Small Point FFT on Multicore Processor

Yuuki Furuyama;Hiroki Mikami;Keiji Kimura;Hironori Kasahara

IPSJ SIG Notes, 2014(3), p.1-8, 2014/02

Outline: The Fast Fourier Transform (FFT) is one of the most frequently used algorithms for computing the Discrete Fourier Transform (DFT) in many applications, including digital signal processing and image processing. Although small-size FFTs must be used in baseband signal processing such as LTE, it is difficult to use special hardware like DSPs for such small problems because of their relatively large data-transfer and control overhead. This paper proposes an automatic parallelization method that generates low-overhead parallelized programs for small-size FFTs suited to shared-memory multicore processors, applying cache optimization to avoid false sharing between cores. The proposed method has been implemented in the OSCAR automatic parallelizing compiler; small-point FFT programs from 32 points to 256 points were parallelized and evaluated on the RP2 multicore processor with 8 SH-4A cores, achieving 1.97 times speedup on 2 SH-4A cores and 3.9 times speedup on 4 SH-4A cores for a 256-point FFT program. In addition, the proposed approach was applied to the Fast Hadamard Transform (FHT), whose computation is similar to the FFT, yielding 1.91 times speedup on 2 cores and 3.32 times speedup on 4 cores. This shows the effectiveness of the proposed method and the ease of applying it to many kinds of programs.

Parallelization of Automobile Engine Control Software on Multicore Processor

KANEHAGI YOUHEI;UMEDA DAN;MIKAMI HIROKI;HAYASHI AKIHIRO;SAWADA MITSUO;KIMURA KEIJI;KASAHARA HIRONORI

IEICE Technical Report, ICD, 112(425), p.3-10, 2013/01

ISSN:0913-5685

Outline: The computational load in automobile control systems is increasing to achieve greater safety, comfort, and energy saving, so control processor cores need higher performance. However, improving the clock frequency of processor cores is difficult, and it is therefore important to use multicore processors. Using multicores for engine control raises problems of performance, development cost, and development period, because the software is difficult to parallelize. This paper proposes a parallelization method for automobile engine control software, which so far has run only on single-core processors, on a multicore processor. Concretely, the sequential program is restructured to extract more parallelism, for example by inlining functions and duplicating conditional branches, and the OSCAR compiler then performs automatic parallelization and generates a parallel C program. Using the proposed method, the engine control software, which is difficult to parallelize manually because it is very fine grained, is parallelized and gives a 1.71x speedup using 2 cores on the RP-X multicore, confirming that parallelization of automobile engine control software is effective.

An Acceleration Technique of Many-core Architecture Simulation with Parallelized Applications by a Statistical Technique

Abe Yoichi;Taguchi Gakuho;Kimura Keiji;Kasahara Hironori

IEICE Technical Report, ICD, 112(425), p.57-63, 2013/01

ISSN:0913-5685

Outline: This paper proposes an automatic technique for deciding the number of clusters and sampling points in a statistical acceleration technique for many-core architecture simulation. The technique first focuses on the structure of a benchmark program, especially its loops: the number of sampling points is derived by statistical methods from the iterations of a target loop, and if the variation in iteration cost is large, the iterations are grouped into clusters. This enables higher estimation accuracy with fewer sampling points. However, in our previous work the number of clusters had to be decided by hand; this paper proposes an automatic decision technique for the number of clusters using x-means. As a preliminary evaluation, the sequential execution costs of several benchmark programs are estimated. When the MPEG2 encoder program with SIF16, which causes large variation among iteration costs, is used, a 1.92% error is achieved with 14 of 450 iterations selected as sampling points by x-means.

A Parallelizing Compiler Cooperative Multicore Architecture Simulator with Changeover Mechanism of Simulation Modes

TAGUCHI GAKUHO;ABE YOUICHI;KIMURA KEIJI;KASAHARA HIRONORI

IEICE Technical Report, ICD, 112(425), p.65-71, 2013/01

ISSN:0913-5685

Outline: A parallelizing-compiler-cooperative multicore architecture simulation framework that reduces simulation time through a flexible simulation-mode changeover mechanism is proposed. The multicore architecture simulator in this framework has two modes: a functional, fast simulation mode and a cycle-accurate, slow simulation mode. By cooperating with a parallelizing compiler, the framework generates appropriate sampling points for the cycle-accurate mode and a runtime for mode changeover of the simulator, depending on the parallelized application. The proposed framework is evaluated with EQUAKE from SPEC2000; the results show that 50x to 500x speedup can be achieved within 1.6% error.

Multicore Technologies Realizing Low-power Computing

KIMURA Keiji;KASAHARA Hironori

The Journal of the Institute of Electronics, Information and Communication Engineers, 97(2), p.133-139, 2014/02

ISSN:09135693

Automatic Parallelization of Designed Engine Control C Codes by MATLAB/Simulink

Dan Umeda;Youhei Kanehagi;Hiroki Mikami;Akihiro Hayashi;Mitsuhiro Tani;Hiroshi Mori;Keiji Kimura;Hironori Kasahara

IPSJ Journal, 55(8), p.1817-1829, 2014/08

ISSN:03875806

Outline: Recently, greater safety, comfort, and environmental feasibility have been required of automobiles. Accordingly, control systems need higher microprocessor performance for the real-time software that realizes them. However, clock frequency improvements have been limited by power consumption, and the performance of single-core processors has reached its limits, so multicore processors will be used in automotive control systems. Model-based design with MATLAB and Simulink has been adopted for developing automotive systems because it shortens development time and improves reliability, but code auto-generated from MATLAB and Simulink has so far run only on single-core processors. This paper proposes a parallelization method, targeting multicore processors, for engine control C code generated from MATLAB and Simulink using Embedded Coder. The engine control C code, which consists of many conditional branches and arithmetic assignment statements and is difficult to parallelize, is parallelized automatically using the OSCAR automatic parallelizing compiler. As a result, performance improvements are attained on RP2 and V850E2R: a maximum 1.9x speedup on two cores and 3.76x speedup on four cores.

A Parallelizing Compiler Cooperative Acceleration Technique of Multicore Architecture Simulation using a Statistical Method

TAGUCHI Gakuho;KIMURA Keiji;KASAHARA Hironori

IEICE Technical Report, Dependable Computing, 113(498), p.289-294, 2014/03

ISSN:0913-5685

Outline: A parallelizing-compiler-cooperative acceleration technique for multicore architecture simulation is proposed in this paper. Profile data from a sequential execution of a target application on a real machine is decomposed into multiple clusters by x-means clustering, and sampling points for a detailed simulation mode are then calculated in each cluster. In addition, a parallelizing compiler generates a parallelized code from both the clustering information and the source code of the target application. The evaluation results show that, for a 16-core simulation, a 437x speedup is achieved with 0.04% error for equake, and a 28x speedup with 0.04% error for the mpeg2 encoder.

A Parallelizing Compiler Cooperative Acceleration Technique of Multicore Architecture Simulation using a Statistical Method

TAGUCHI Gakuho;KIMURA Keiji;KASAHARA Hironori

IEICE Technical Report, Computer Systems, 113(497), p.289-294, 2014/03

ISSN:0913-5685

Outline: A parallelizing-compiler-cooperative acceleration technique for multicore architecture simulation is proposed in this paper. Profile data from a sequential execution of a target application on a real machine is decomposed into multiple clusters by x-means clustering, and sampling points for a detailed simulation mode are then calculated in each cluster. In addition, a parallelizing compiler generates a parallelized code from both the clustering information and the source code of the target application. The evaluation results show that, for a 16-core simulation, a 437x speedup is achieved with 0.04% error for equake, and a 28x speedup with 0.04% error for the mpeg2 encoder.

A Latency Reduction Technique for IDS by Allocating Decomposed Signature on Multi-core

YAMADA SHOHEI;MIKAMI HIROKI;KIMURA KEIJI;KASAHARA HIRONORI

IEICE Technical Report, ICD, 113(474), p.7-14, 2014/02

ISSN:0913-5685

Outline: Cyber attacks targeting companies and government organizations have been increasing and becoming more sophisticated. An Intrusion Detection System (IDS) is one of the efficient solutions to prevent those attacks: an IDS detects illegal network accesses in real time by monitoring the network and filtering suspicious IP packets, so high processing performance is required to handle a large number of IP packets in real time. To satisfy this requirement, this paper proposes a latency reduction technique for signature-based IDSs that allocates decomposed signatures across multicores. The proposed technique is implemented in Suricata, an open-source IDS, and evaluated with several data sets, including the DARPA Intrusion Detection Evaluation Data Set. The evaluation results show that the proposed technique with four cores achieves up to 3.22 times higher performance compared with two cores without signature decomposition.

Automatic Parallelization of Small Point FFT on Multicore Processor

FURUYAMA YUUKI;MIKAMI HIROKI;KIMURA KEIJI;KASAHARA HIRONORI

IEICE Technical Report, ICD, 113(474), p.15-22, 2014/02

ISSN:0913-5685

Outline: The Fast Fourier Transform (FFT) is one of the most frequently used algorithms for computing the Discrete Fourier Transform (DFT) in many applications, including digital signal processing and image processing. Although small-size FFTs must be used in baseband signal processing such as LTE, it is difficult to use special hardware like DSPs for such small problems because of their relatively large data-transfer and control overhead. This paper proposes an automatic parallelization method that generates low-overhead parallelized programs for small-size FFTs suited to shared-memory multicore processors, applying cache optimization to avoid false sharing between cores. The proposed method has been implemented in the OSCAR automatic parallelizing compiler; small-point FFT programs from 32 points to 256 points were parallelized and evaluated on the RP2 multicore processor with 8 SH-4A cores, achieving 1.97 times speedup on 2 SH-4A cores and 3.9 times speedup on 4 SH-4A cores for a 256-point FFT program. In addition, the proposed approach was applied to the Fast Hadamard Transform (FHT), whose computation is similar to the FFT, yielding 1.91 times speedup on 2 cores and 3.32 times speedup on 4 cores. This shows the effectiveness of the proposed method and the ease of applying it to many kinds of programs.

Dynamic Scheduling Algorithm for Automatically Parallelized and Power Reduced Applications on Multicore Systems

2015(34), p.1-6, 2015/02

ISSN:09196072

Outline: This paper proposes a dynamic scheduling algorithm for multiple automatically parallelized and power-reduced applications on multicore smart devices, to achieve higher performance and lower power consumption within each application's deadline. The scheduling algorithm uses information such as time, power, deadline, and number of cores for each application, and is composed of three types of scheduling. Using media codec applications as a benchmark, the proposed scheduling achieved an 18.5% speedup and 28.8% power reduction compared to FIFO scheduling.

Android Video Processing System Combined with Automatically Parallelized and Power Optimized Code by OSCAR Compiler

Binh Bui Duc;Hirano Tomohiro;Mikami Hiroki;Yamamoto Hideo;Kimura Keiji;Kasahara Hironori

Journal of Information Processing, 24(3), p.504-511, 2016

ISSN:1882-6652

Outline: The emergence of multi-core processors in smart devices promises higher performance and low power consumption. Parallelizing applications enables us to improve their performance; however, simultaneously utilizing many cores would drastically drain the device's battery life. This paper shows a demonstration system of real-time video processing combined with power reduction controlled by the OSCAR automatic parallelization compiler on ODROID-X2, an open Android development platform based on Samsung Exynos4412 Prime with 4 ARM Cortex-A9 cores. We exploited the DVFS framework, core partitioning, a profiling technique, and the OSCAR parallelization and power control algorithm to reduce total power consumption in a real-time video application. The demonstration results show that power consumption can be cut by 42.8% for an MPEG-2 Decoder application and 59.8% for an Optical Flow application, using 3 cores in both applications.

Coarse grain task parallelization of earthquake simulator GMS using OSCAR compiler on various Cc-NUMA servers

Shimaoka, Mamoru; Wada, Yasutaka; Kimura, Keiji; Kasahara, Hironori

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9519, p.238-253, 2016/01

ISSN:03029743

Outline: This paper proposes coarse grain task parallelization for an earthquake simulation program, the Ground Motion Simulator (GMS), which uses the Finite Difference Method (FDM) to solve the wave equations in a 3-D heterogeneous structure, on various cc-NUMA servers using IBM, Intel, and Fujitsu multicore processors. The GMS has been developed by the National Research Institute for Earth Science and Disaster Prevention (NIED) in Japan. Earthquake wave propagation simulations are important numerical applications that can save lives through damage prediction for residential areas, and parallel processing with strong scaling is required to calculate them precisely and quickly. The proposed method uses the OSCAR compiler to exploit coarse grain task parallelism efficiently and obtain scalable speedups with strong scaling; the compiler can analyze data dependence and control dependence among coarse grain tasks such as subroutines, loops, and basic blocks. Moreover, locality optimizations considering the boundary calculations of the FDM and a new static scheduler that enables more efficient task scheduling on cc-NUMA servers are presented. The performance evaluation shows, against sequential execution in each case, 110x speedup using 128 cores on a POWER7-based 128-core cc-NUMA server (Hitachi SR16000 VM1), 37.2x using 64 cores on a Xeon E7-8830-based 64-core server (BS2000), 19.8x using 32 cores on a Xeon X7560-based 32-core server (HA8000/RS440), 99.3x using 128 cores on a SPARC64 VII-based 256-core server (Fujitsu M9000), and 9.42x using 12 cores on a POWER8-based 12-core server (Power System S812L).

Multigrain parallelization for model-based design applications using the OSCAR compiler

Umeda, Dan; Suzuki, Takahiro; Mikami, Hiroki; Kimura, Keiji; Kasahara, Hironori

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9519, p.125-139, 2016/01

ISSN:03029743

Outline: Model-based design is a very popular software development method for developing a wide variety of embedded applications such as automotive systems, aircraft systems, and medical systems. Model-based design tools like MATLAB/Simulink typically allow engineers to graphically build models consisting of connected blocks for the purpose of reducing development time. These tools also support automatic C code generation from models with a special tool such as Embedded Coder to map models onto various kinds of embedded CPUs. Since embedded systems require real-time processing, the use of multi-core CPUs poses more opportunities for accelerating program execution to satisfy the real-time constraints. While prior approaches exploit parallelism among blocks by inspecting MATLAB/Simulink models, this may lose an opportunity for fully exploiting parallelism of the whole program because models potentially have parallelism within a block. To unlock this limitation, this paper presents an automatic parallelization technique for auto-generated C code developed by MATLAB/Simulink with Embedded Coder. Specifically, this work (1) exploits multi-level parallelism, including inter-block and intra-block parallelism, by analyzing the auto-generated C code, and (2) performs static scheduling to reduce dynamic overheads as much as possible. Also, this paper proposes an automatic profiling framework for the auto-generated code to enhance static scheduling, which improves the performance of MATLAB/Simulink applications. Performance evaluation shows 4.21 times speedup with six processor cores on Intel Xeon X5670 and 3.38 times speedup with four processor cores on ARM Cortex-A15, compared with uniprocessor execution, for a road tracking application.

Annotatable systrace: An extended linux ftrace for tracing a parallelized program

Fukui, Daichi; Shimaoka, Mamoru; Mikami, Hiroki; Hillenbrand, Dominic; Yamamoto, Hideo; Kimura, Keiji; Kasahara, Hironori

SEPS 2015 - Proceedings of the 2nd International Workshop on Software Engineering for Parallel Systems, p.21-25, 2015/10

Outline: Investigating runtime behavior is one of the most important processes in performance tuning on a computer system. Profiling tools have been widely used to detect hot spots in a program. In addition, tracing tools produce valuable information, especially for parallelized programs, such as thread scheduling, barrier synchronizations, context switching, thread migration, and jitter caused by interrupts. Users can optimize the runtime system and hardware configuration, in addition to the program itself, by utilizing this information. However, existing tools provide information per process or per function; finer information, at task or loop granularity, is required to understand program behavior more precisely. This paper proposes a tracing tool, Annotatable Systrace, based on an extended Linux ftrace, to investigate the runtime execution behavior of a parallelized program. Annotatable Systrace can add arbitrary annotations to a trace of a target program. As an evaluation, the tool exploits traces from 183.equake, 179.art, and mpeg2enc on Intel Xeon X7560 and ARMv7. The evaluation shows that the tool enables us to observe load imbalance over the course of program execution, and that it can generate a trace with inserted annotations even on a 32-core machine. The overhead of one annotation is 1.07 us on Intel Xeon and 4.44 us on ARMv7.

Reducing parallelizing compilation time by removing redundant analysis

Han, Jixin; Fujino, Rina; Tamura, Ryota; Shimaoka, Mamoru; Mikami, Hiroki; Takamura, Moriyuki; Kamiya, Sachio; Suzuki, Kazuhiko; Miyajima, Takahiro; Kimura, Keiji; Kasahara, Hironori

SEPS 2016 - Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems, co-located with SPLASH 2016, p.1-9, 2016/10

Outline: Parallelizing compilers employing powerful compiler optimizations are essential tools for fully exploiting the performance of today's computer systems. These optimizations are supported by highly sophisticated program analysis techniques and aggressive program restructuring techniques. However, the compilation time of such powerful compilers grows ever larger for real commercial applications because of these strong program analyses. In this paper, we propose a compilation-time reduction technique for parallelizing compilers. The basic idea is based on the observation that parallelizing compilers apply multiple program analysis and restructuring passes to a source program, but not every analysis pass has to be applied to the whole source program; there is thus an opportunity to reduce compilation time by removing redundant program analysis. We describe techniques for removing redundant program analysis that consider the inter-procedural propagation of analysis-update information, and implement them in the OSCAR automatic multigrain parallelizing compiler. Evaluated on three proprietary large-scale programs, the proposed technique removes 37.7% of program analysis time on average for basic analyses, including def-use analysis and dependence calculation, and 51.7% for pointer analysis.

Architecture design for the environmental monitoring system over the winter season

Yamashita, Koichiro; Ao, Chen; Suzuki, Takahisa; Xu, Yi; Li, Hongchun; Tian, Jun; Kimura, Keiji; Kasahara, Hironori

MobiWac 2016 - Proceedings of the 14th ACM International Symposium on Mobility Management and Wireless Access, co-located with MSWiM 2016, p.27-34, 2016/11

Outline: © 2016 ACM. One source of big data is sensor networks for environmental monitoring, designed to detect deterioration of infrastructure, support erosion control, and so on. Typical targets are bridges, buildings, slopes, and embankments affected by natural disasters or aging. The basic requirement of such a monitoring system is to collect data over a long period from a large number of nodes installed across a wide area. However, few actual monitoring-system designs apply a wireless sensor network (WSN) using wireless communication and energy harvesting, because the system must simultaneously satisfy conditions from civil engineering (measurement locations and times), from network technology (communication quality and topology), and from electrical engineering (the balance between the weather environment and the power consumption that depends on these conditions). We propose a whole-WSN design methodology, especially for the electrical architecture, which is affected by network behavior and environmental disturbances. It is characterized by recursively resolving the mutual trade-off between a wireless simulation and a power-architecture simulation of the node devices, and it also allows redundancy in the design. In addition, we deployed an actual slope-monitoring WSN designed by the proposed method in a snow-covered area. A conventional comparable monitoring WSN with a 7 Ah Li battery worked only 129 days in a mild-climate area; in contrast, our proposed system, deployed in a heavy-snow area, has been working for more than 6 months (and is still working) on 3.2 Ah batteries. Finally, it contributed to civil engineering by succeeding in real-time observation of groundwater-level displacement during spring snowmelt.
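
The power side of this trade-off can be illustrated with a back-of-the-envelope node-lifetime estimate from a two-state (active/sleep) power model; all current and duty-cycle figures below are hypothetical, not measurements from the deployed system.

```python
def lifetime_days(battery_ah, active_ma, sleep_ma, duty_cycle):
    """Estimated node lifetime from battery capacity and a simple
    active/sleep power model: the kind of quantity the power-architecture
    simulation must balance against radio-topology choices."""
    avg_ma = duty_cycle * active_ma + (1 - duty_cycle) * sleep_ma
    return battery_ah * 1000 / avg_ma / 24  # Ah -> mAh -> hours -> days

# hypothetical node: 3.2 Ah battery, 50 mA while sensing/transmitting,
# 0.1 mA asleep, active 1% of the time
days = lifetime_days(3.2, 50, 0.1, 0.01)
```

The sleep current and duty cycle dominate the result, which is why the network topology (how often a node must wake to relay) feeds back into the electrical design.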

Accelerating Multicore Architecture Simulation Using Application Profile

Kimura, Keiji; Taguchi, Gakuho; Kasahara, Hironori

Proceedings - IEEE 10th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2016, p.177-184, 2016/12

Outline: © 2016 IEEE. Architecture simulators play an important role in exploring frontiers in the early stages of architecture design. However, simulator execution time increases with the number of cores. The sampling simulation technique, originally proposed for single-core processors, is a promising approach to reduce simulation time. The two main hurdles for multi/many-cores are preparing sampling points and thread skewing at functional-simulation time. This paper proposes a very simple, low-error sampling-based acceleration technique for multi/many-core simulators. For a parallelized application, an iteration of a large loop containing a parallelizable program part is defined as the sampling unit. We apply the X-means method to a profile of the collection of iterations obtained on a real machine to form clusters of those iterations, and select multiple iterations from these clusters as sampling points. We execute the simulation at the sampling points and calculate the total number of execution cycles. Results from a 16-core simulation show that the proposed technique gives a maximum speedup of 443x with 0.52% error, and 218x speedup with 1.50% error on average.
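
The sampling scheme can be sketched as: cluster the per-iteration profile vectors, simulate one representative per cluster in detail, and weight its cycle count by the cluster size. The sketch below uses plain k-means with a fixed k in place of the paper's X-means (which also chooses the number of clusters), and all names are illustrative.

```python
import random

def pick_sampling_points(profiles, k=2, rounds=10, seed=0):
    """Cluster per-iteration profile vectors (e.g. [instructions, misses])
    measured on a real machine; return {representative_index: weight}."""
    rng = random.Random(seed)
    centers = [profiles[i] for i in rng.sample(range(len(profiles)), k)]
    dist = lambda p, c: sum((a - b) ** 2 for a, b in zip(p, c))
    clusters = [[] for _ in range(k)]
    for _ in range(rounds):  # Lloyd's iterations
        clusters = [[] for _ in range(k)]
        for i, p in enumerate(profiles):
            clusters[min(range(k), key=lambda c: dist(p, centers[c]))].append(i)
        for c in range(k):
            if clusters[c]:
                centers[c] = [sum(profiles[i][d] for i in clusters[c]) / len(clusters[c])
                              for d in range(len(profiles[0]))]
    reps = {}
    for c in range(k):
        if clusters[c]:  # the member closest to the centroid stands for the cluster
            rep = min(clusters[c], key=lambda i: dist(profiles[i], centers[c]))
            reps[rep] = len(clusters[c])
    return reps

def estimate_total_cycles(profiles, simulate, k=2):
    """Run the detailed simulator only at the sampling points and scale up."""
    return sum(simulate(i) * w for i, w in pick_sampling_points(profiles, k).items())
```

The detailed simulator thus runs only k times instead of once per iteration, which is the source of the two-orders-of-magnitude speedups reported.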

2-Step Power Scheduling with Adaptive Control Interval for Network Intrusion Detection Systems on Multicores

Tuong, Lau Phi; Kimura, Keiji

Proceedings - IEEE 10th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip, MCSoC 2016, p.69-76, 2016/12

Outline: © 2016 IEEE. Network intrusion detection systems (NIDSs) are becoming important elements in embedded systems as well as in data centers, since embedded computers are increasingly exposed to the Internet. The power budget of these embedded systems is a critical issue in addition to performance. In this paper, we propose a technique to minimize NIDS power consumption called 2-step power scheduling with an adaptive control interval. We also propose a CPU-core controlling algorithm so that our scheduling can preserve performance for both the NIDS and other applications when they are multiplexed simultaneously on the same device, such as a home server or a mobile platform. We implement our 2-step algorithm in Suricata, a popular NIDS, together with a 1-step algorithm and a simple fixed-interval algorithm for evaluation. Experimental results show that our 2-step scheduling with the adaptive interval and with a fixed 30-millisecond interval achieves 75% power saving compared with the Ondemand governor and 87% compared with the Performance governor in Linux, respectively, without affecting processing capability on four ARM Cortex-A15 cores at a network traffic of 1,000 packets/second. In contrast, when the traffic reaches 17,000 packets/second, our 2-step scheduling as well as the Ondemand and Performance governors maintain the packet-processing capacity, while the fixed 30-millisecond interval processes only 50% of the packets with two and three cores, and about 80% on four cores.
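
The two power knobs can be sketched as follows: step 1 picks the fewest cores that can cover the observed packet rate at the top frequency, step 2 picks the lowest frequency on those cores that still keeps up, and the control interval shrinks when traffic swings. This is an illustrative reconstruction of the idea, not Suricata's or the paper's actual code; `cap_at` (per-core capacity at a given frequency) and the thresholds are assumptions.

```python
import math

def two_step_schedule(load_pps, freqs, cap_at, max_cores=4):
    """Step 1: minimum core count at the top frequency; step 2: lowest
    frequency whose aggregate capacity still covers the load.
    `freqs` is sorted ascending; `cap_at(f)` is packets/s one core handles."""
    cores = min(max_cores, max(1, math.ceil(load_pps / cap_at(freqs[-1]))))
    for f in freqs:
        if cores * cap_at(f) >= load_pps:
            return cores, f
    return cores, freqs[-1]  # saturated: stay at the top frequency

def next_interval_ms(cur_ms, load_pps, prev_pps, lo=10, hi=240):
    """Shorten the control interval when traffic swings, widen when steady."""
    if prev_pps and abs(load_pps - prev_pps) / prev_pps > 0.2:
        return max(lo, cur_ms // 2)
    return min(hi, cur_ms * 2)
```

Widening the interval under steady traffic is what lets the adaptive variant avoid the packet loss the fixed 30 ms interval suffers at high rates, while keeping its savings at low rates.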

Automatic local memory management for multicores having global address space

Yamamoto, Kouhei; Shirakawa, Tomoya; Oki, Yoshitake; Yoshida, Akimasa; Kimura, Keiji; Kasahara, Hironori

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 10136 LNCS, p.282-296, 2017/01

ISSN: 0302-9743

Outline: © Springer International Publishing AG 2017. Embedded multicore processors for hard real-time applications such as automobile engine control require the use of local memory on each processor core to meet real-time deadline constraints precisely, since cache memory cannot satisfy the deadline requirements because of cache misses. To utilize local memory, programmers or compilers must explicitly manage data movement and data replacement for local memory, considering its limited size. Such management, however, is extremely difficult and time-consuming for programmers. This paper proposes an automatic local memory management method for compilers through (i) multi-dimensional data-decomposition techniques to fit working sets onto limited-size local memory, (ii) suitable block management structures, called Adjustable Blocks, to create application-specific fixed-size data-transfer blocks, (iii) multi-dimensional templates to preserve the original multi-dimensional representations of the decomposed data mapped onto one-dimensional Adjustable Blocks, (iv) block replacement policies derived from liveness analysis of the decomposed data, and (v) code-size reduction schemes to generate shorter code. The proposed method is implemented in the OSCAR multigrain and multi-platform compiler and evaluated on the Renesas RP2 8-core embedded homogeneous multicore processor equipped with local and shared memory. Evaluations on 5 programs, including multimedia and scientific applications, show promising results. For instance, speedups on 8 cores against single-core execution using off-chip shared memory improve from 7.14 to 20.12 for an AAC encoder, from 1.97 to 7.59 for an MPEG2 encoder, from 5.73 to 7.38 for Tomcatv, and from 7.40 to 11.30 for Swim when using local memory with the proposed method. These evaluations indicate the usefulness and the validity of the proposed local memory management method on real embedded multicore processors.
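
The fitting constraint behind technique (i) can be illustrated with a minimal sketch: shrink a 2-D working set's block height until a few double-buffered blocks fit in the local memory. This illustrates only the size constraint; the actual Adjustable Block algorithm (templates, liveness-driven replacement) is far richer, and the function name and sizes below are hypothetical.

```python
def block_shape(rows, cols, lm_words, nbuf=4):
    """Halve the block's row count until `nbuf` blocks (kept resident to
    overlap data transfer with compute) fit in `lm_words` words of local
    memory; returns the chosen (block_rows, block_cols)."""
    block_rows = rows
    while block_rows > 1 and block_rows * cols * nbuf > lm_words:
        block_rows = (block_rows + 1) // 2  # halve, rounding up
    return block_rows, cols
```

Fixing all blocks to one size, as Adjustable Blocks do, keeps replacement simple and transfer granularity uniform, at the cost of conservative sizing for the largest working set.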

Multicore Cache Coherence Control by a Parallelizing Compiler

Kasahara, Hironori; Kimura, Keiji; Adhi, Boma A.; Hosokawa, Yuhei; Kishimoto, Yohei; Mase, Masayoshi

Proceedings - International Computer Software and Applications Conference 1, p.492-497, 2017/09

ISSN: 0730-3157

Outline: © 2017 IEEE. Recent developments in multicore technology have enabled processors with hundreds or thousands of cores. On such multicore processors, however, an efficient hardware cache-coherence scheme becomes very complex and expensive to develop. This paper proposes a parallelizing-compiler-directed software coherence scheme for shared-memory multicore systems without hardware cache-coherence control. The general idea of the proposed method is that an automatic parallelizing compiler analyzes the control and data dependences among coarse-grain tasks in the program. Based on the obtained information, it performs task parallelization, false-sharing detection, and data restructuring to prevent false sharing, and then inserts cache-control code to handle the stale-data problem. The proposed method is built on the OSCAR automatic parallelizing compiler and evaluated on the Renesas RP2 processor with 8 SH-4A cores. The hardware cache-coherence scheme on the RP2 processor is available only for up to 4 cores, and hardware coherence can be turned off completely for a non-coherent cache mode. Performance is evaluated using 10 benchmark programs from SPEC2000, SPEC2006, the NAS Parallel Benchmarks (NPB), and MediaBench II. The proposed method performs as well as or better than the hardware cache-coherence scheme. For example, 4 cores with the hardware coherence mechanism gave speedups over 1 core of 2.52 times for SPEC2000 'equake', 2.9 times for SPEC2006 'lbm', 3.34 times for NPB 'cg', and 3.17 times for the MediaBench II MPEG2 encoder. The proposed software cache-coherence control gave 2.63 times on 4 cores and 4.37 times on 8 cores for 'equake', 3.28 times on 4 cores and 4.76 times on 8 cores for 'lbm', and 3.71 times on 4 cores and 4.92 times on 8 cores for the MPEG2 encoder.
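
The stale-data problem that the compiler-inserted cache control solves can be shown with a toy model of write-back, non-coherent per-core caches. This is an illustrative sketch only, not the RP2's cache hardware or the OSCAR compiler's generated code.

```python
class Core:
    """One core's private write-back cache over shared memory, with no
    hardware coherence: a producer's update becomes visible only after an
    explicit writeback, and a consumer sees it only after an explicit
    self-invalidate, which is exactly what the compiler must insert."""

    def __init__(self, shared_mem):
        self.mem, self.cache = shared_mem, {}

    def read(self, addr):
        if addr not in self.cache:
            self.cache[addr] = self.mem[addr]   # fill on miss
        return self.cache[addr]

    def write(self, addr, value):
        self.cache[addr] = value                # write-back: memory untouched

    def writeback(self, addr):
        self.mem[addr] = self.cache[addr]       # inserted after the producer task

    def invalidate(self, addr):
        self.cache.pop(addr, None)              # inserted before the consumer task
```

Because the compiler knows the coarse-grain task dependences, it can place the writeback/invalidate pair only on the edges where data actually crosses cores, avoiding the blanket traffic of hardware coherence.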

Patent

Reference Number:25

Multiprocessor (Japan)

笠原 博徳, 木村 啓二

H11-363702, 2001-175619, 4784792

Reference Number:473

Multiprocessor system and multigrain parallelizing compiler (Japan, PCT, China, Europe, South Korea, Germany, United Kingdom, United States)

笠原 博徳, 木村 啓二, 白子 準, 伊藤 雅樹, 鹿野 裕明

2005-114842, 2006-293768, 4082706

Reference Number:513

Control method for a heterogeneous multiprocessor system and multigrain parallelizing compiler (Japan, United States)

笠原 博徳, 木村 啓二, 白子 準, 和田 康孝, 伊藤 雅樹, 鹿野 裕明

2006-157301, 2007-328415, 4936517

Reference Number:604

Memory management method, information processing apparatus, program creation method, and program (Japan, PCT, China, South Korea, United Kingdom, United States)

笠原 博徳, 木村 啓二, 中野 啓史, 仁藤 拓実, 丸山 貴紀, 三浦 剛, 田川 友博

2007-50269, 2008-217134, 5224498

Reference Number:626

Global compiler for heterogeneous multiprocessors (Japan, China, France, Europe, South Korea, United Kingdom, Germany, United States)

笠原 博徳, 木村 啓二, 鹿野 裕明

2006-157308, 2007-328416, 4784827

Reference Number:700

Processor and data transfer unit (Japan, United States)

笠原 博徳, 木村 啓二

2006-274879, 2008-97084, 4476267

Reference Number:761

Multiprocessor system and multigrain parallelizing compiler (Japan)

笠原 博徳, 木村 啓二, 白子 準, 伊藤 雅樹, 鹿野 裕明

2007-166280, 2007-305148

Reference Number:800

Multiprocessor system and synchronization method for multiprocessor systems (Japan, PCT, United States, China)

笠原 博徳, 木村 啓二

2008-15028, 2009-176116

Reference Number:855

Multiprocessor and multiprocessor system (Japan)

笠原 博徳, 木村 啓二

2008-090853, 2008-181558, 4784842

Reference Number:856

Multiprocessor (Japan)

笠原 博徳, 木村 啓二

2008-118007, 2008-217825, 4304347

Reference Number:993

Multiprocessor (Japan)

笠原 博徳, 木村 啓二

2009-159744, 2009-230764

Reference Number:1022

Method of generating code executable by a processor, storage-area management method, and code generation program (Japan, PCT, China, Europe, United Kingdom, Germany, United States)

笠原 博徳, 木村 啓二, 間瀬 正啓

2009-285586, 2011-128803, 5283128

Reference Number:1428

Multiprocessor system and synchronization method for multiprocessor systems (Japan)

笠原 博徳, 木村 啓二

2013-80143, 2013-137833

Reference Number:1443

Parallelizing compilation method, parallelizing compiler, parallelizing compilation apparatus, and in-vehicle apparatus (Japan, Germany, United States)

笠原 博徳, 木村 啓二, 林 明宏, 見神 広紀, 梅田 弾, 金羽木 洋平

2013-125607, 2015-1807, 6018022

Reference Number:1495

Parallelism extraction method and program creation method (Japan)

木村 啓二, 林 明宏, 笠原 博徳, 見神 広紀, 金羽木 洋平, 梅田 弾

2014-6009, 2014-160453, 6319880

Reference Number:1689

Multiprocessor system (Japan)

笠原 博徳, 木村 啓二

2015-77599, 2015-127982

Reference Number:1878

Multiprocessor system (Japan)

笠原 博徳, 木村 啓二

2016-233392, 2017-62843, 6335253

Reference Number:1974

Parallelizing compiler, parallelizing compilation apparatus, and parallel-program generation method (Japan)

笠原 博徳, 木村 啓二, 梅田 弾, 見神 広紀

2017-178110, 2018-109943

Reference Number:262-JP

Processor system and accelerator (Japan)

木村 啓二, 笠原 博徳

2013-541786, 6103647

Research Grants & Projects

Grant-in-aids for Scientific Research Adoption Situation

Research Classification:

Research on a heterogeneous multicore that enables flexible cooperation among CPUs, accelerators, and data transfer units on a chip

2015-2018

Allocation Class: ¥4,680,000

Research Classification:

A Study of Acceleration Technique for Many-core Architecture Simulation Considering Global Program Structure

2011-2014

Allocation Class: ¥4,290,000

Research Classification:

Real-Time Optimization Algorithms and Their Applications for Control of Large-Scale Nonlinear Spatiotemporal Patterns

2012-2016

Allocation Class: ¥13,910,000

On-campus Research System

Special Research Project

Research on memory optimization in software-cooperative chip multiprocessors

2004

Research Results Outline: This project first selected a multigrain parallelizing compiler and a chip-multiprocessor architecture platform as the base technologies for data-locality optimization and data-transfer optimization, and prepared the evaluation infrastructure. The compiler core was the OSCAR multigrain parallelizing compiler developed in the METI Millennium Project IT21 Advanced Parallelizing Compiler project. The target architecture was an OSCAR-type chip multiprocessor, in which processing elements (PEs), each comprising a simple processor core, local data memory, a 2-port distributed shared memory, and a data transfer unit, are connected by an inter-PE network. We additionally developed a back end (code generator) for this chip multiprocessor in the OSCAR compiler. As a first step toward data-locality and data-transfer optimization, we chose the Tomcatv and Swim programs from the SPECfp95 benchmarks as typical scientific computations. Tasks (the units of parallel processing) and data were scheduled onto the PEs considering both data locality and parallelism, and transfers between shared memory and each processor's local memories (local data memory and distributed shared memory) were handled by the data transfer unit operating asynchronously with the processor, thereby exploiting data locality and making data transfer efficient. Evaluation on 8 PEs showed speedups of 1.56x for Tomcatv and 1.38x for Swim over execution without data-locality optimization.

Research on accelerating multicore simulation using compiler analysis information and real-machine execution information

2009

Research Results Outline: In computer architecture research, software-based architecture simulation plays a major role in evaluating systems of various configurations. However, software simulators take thousands of times longer than real machines to execute a program, and such enormous evaluation times are a major obstacle to future many-core research and development. This project studies techniques for accelerating software simulation of multicore and many-core processors to overcome this problem. For parallel-architecture research in particular, methods that map the cores of the simulated multiprocessor onto the cores of the real multiprocessor running the simulator have been proposed, but the parallel-processing overhead on the real machine is large and no practical system has been realized. The distinguishing feature of this study is to accelerate multicore/many-core software simulation by using compiler analysis information, such as loop structure and parallelization information, together with real-machine execution information of the target application. With this information, we identify which program parts must be simulated in detail and which need not be. Using this additional information, which conventional acceleration techniques have not exploited, accurate performance figures can be obtained at minimal execution cost. This year we conducted preliminary experiments on the basic applicability of the method. Specifically, for two multicore architectures with up to 32 cores, we varied the iteration count of each benchmark's main loop and examined whether our estimation method could reproduce the performance at the original iteration count. As benchmarks we used tomcatv and swim from SPEC95 and an AAC encoding program commonly used for audio compression. For every combination of architecture, core count, and benchmark, the performance of the original several hundred iterations could be predicted from only a few iterations with an error of at most about 2%. We plan to extend the range of applicable applications and to automate the system.

Research on heterogeneous multicores in which CPUs and accelerators cooperate via flags

2014

Research Results Outline: This project develops techniques to reduce the overhead of accelerator control and data transfer on heterogeneous multicores equipped with accelerators. Specifically, we develop task-decomposition and scheduling methods that hide this overhead by executing the CPU, the data transfer unit (DTU), and the accelerator simultaneously, and we implement them in an automatic parallelizing compiler. This year, we first fixed the basic specification of the accelerator assumed by this study. We then developed a compiler module for the accelerator and an architecture simulator for it, thereby establishing the basic evaluation environment for this research.

Foreign Countries Research Activity

Research Project Title: Research on software and hardware organization considering new memory hierarchies

2017/08-2018/02

Affiliation: North Carolina State University (United States)

Lecture Course

Course Title | School | Year | Term
Computer Architecture A | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Architecture A | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Architecture A [S Grade] | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Architecture A [S Grade] | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Science and Engineering Laboratory A (2) | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Science and Engineering Laboratory A [S Grade] | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Science and Engineering Laboratory B | School of Fundamental Science and Engineering | 2019 | spring semester
Computer Science and Engineering Laboratory B [S Grade] | School of Fundamental Science and Engineering | 2019 | spring semester
Computer Architecture B | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Architecture | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Architecture B | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Architecture [S Grade] | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Architecture B [S Grade] | School of Fundamental Science and Engineering | 2019 | fall semester
Bachelor Thesis A | School of Fundamental Science and Engineering | 2019 | full year
Bachelor Thesis A [S Grade] | School of Fundamental Science and Engineering | 2019 | full year
Bachelor Thesis B | School of Fundamental Science and Engineering | 2019 | full year
Bachelor Thesis B [S Grade] | School of Fundamental Science and Engineering | 2019 | full year
Language Processors | School of Fundamental Science and Engineering | 2019 | spring semester
Language Processors | School of Fundamental Science and Engineering | 2019 | spring semester
Language Processors [S Grade] | School of Fundamental Science and Engineering | 2019 | spring semester
Project Research A | School of Fundamental Science and Engineering | 2019 | spring semester
Computer Science and Engineering Laboratory C | School of Fundamental Science and Engineering | 2019 | fall semester
Project Research B | School of Fundamental Science and Engineering | 2019 | fall semester
Advanced Processor Architecture Technology | School of Fundamental Science and Engineering | 2019 | spring semester
Advanced Processor Architecture Technology | School of Fundamental Science and Engineering | 2019 | spring semester
Advanced Processor Architecture Technology | School of Fundamental Science and Engineering | 2019 | spring semester
Advanced Processor Architecture Technology | School of Fundamental Science and Engineering | 2019 | spring semester
Advanced Processor Architecture Technology | School of Fundamental Science and Engineering | 2019 | spring semester
IoT System Design | School of Fundamental Science and Engineering | 2019 | spring semester
IoT System Design | School of Fundamental Science and Engineering | 2019 | spring semester
IoT System Design | School of Fundamental Science and Engineering | 2019 | spring semester
Communications and Computer Engineering Laboratory A | School of Fundamental Science and Engineering | 2019 | fall semester
Communications and Computer Engineering Laboratory A [S Grade] | School of Fundamental Science and Engineering | 2019 | fall semester
Communications and Computer Engineering Laboratory B | School of Fundamental Science and Engineering | 2019 | spring semester
Communications and Computer Engineering Laboratory B [S Grade] | School of Fundamental Science and Engineering | 2019 | spring semester
Bachelor Thesis A | School of Fundamental Science and Engineering | 2019 | full year
Bachelor Thesis A [S Grade] | School of Fundamental Science and Engineering | 2019 | full year
Bachelor Thesis B | School of Fundamental Science and Engineering | 2019 | full year
Bachelor Thesis B [S Grade] | School of Fundamental Science and Engineering | 2019 | full year
Project Research A | School of Fundamental Science and Engineering | 2019 | spring semester
Project Research B | School of Fundamental Science and Engineering | 2019 | fall semester
Research Project B | School of Fundamental Science and Engineering | 2019 | spring semester
Research Project B | School of Fundamental Science and Engineering | 2019 | spring semester
Research Project B | School of Fundamental Science and Engineering | 2019 | spring semester
Research Project B [S Grade] | School of Fundamental Science and Engineering | 2019 | spring semester
Research Project C | School of Fundamental Science and Engineering | 2019 | fall semester
Research Project C | School of Fundamental Science and Engineering | 2019 | fall semester
Research Project C [S Grade] | School of Fundamental Science and Engineering | 2019 | fall semester
Research Project C [S Grade] | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Systems | School of Fundamental Science and Engineering | 2019 | spring semester
Computer Systems | School of Fundamental Science and Engineering | 2019 | spring semester
Computer Systems | School of Fundamental Science and Engineering | 2019 | spring semester
Computer Systems | School of Fundamental Science and Engineering | 2019 | spring semester
Research Project A | School of Fundamental Science and Engineering | 2019 | fall semester
Research Project A | School of Fundamental Science and Engineering | 2019 | fall semester
Research Project D | School of Fundamental Science and Engineering | 2019 | spring semester
Research Project D | School of Fundamental Science and Engineering | 2019 | spring semester
Communications and Computer Engineering Laboratory | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Science and Communications Engineering Laboratory A | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Science and Engineering Laboratory | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Science and Engineering Laboratory | School of Fundamental Science and Engineering | 2019 | fall semester
Computer Science and Communications Engineering Laboratory A [S Grade] | School of Fundamental Science and Engineering | 2019 | fall semester
Introduction to Computers and Networks | School of Fundamental Science and Engineering | 2019 | spring semester
IoT System Design | Graduate School of Fundamental Science and Engineering | 2019 | spring semester
IoT System Design | Graduate School of Creative Science and Engineering | 2019 | spring semester
IoT System Design | Graduate School of Advanced Science and Engineering | 2019 | spring semester
Master's Thesis (Department of Computer Science and Communications Engineering) | Graduate School of Fundamental Science and Engineering | 2019 | full year
Research on Advanced Processor Architecture | Graduate School of Fundamental Science and Engineering | 2019 | full year
Research on Advanced Processor Architecture | Graduate School of Fundamental Science and Engineering | 2019 | full year
Advanced Processor Architecture | Graduate School of Fundamental Science and Engineering | 2019 | spring semester
Advanced Processor Architecture | Graduate School of Fundamental Science and Engineering | 2019 | spring semester
Special Laboratory A in Computer Science and Communications Engineering | Graduate School of Fundamental Science and Engineering | 2019 | spring semester
Special Laboratory A in Computer Science and Communications Engineering | Graduate School of Fundamental Science and Engineering | 2019 | spring semester
Special Laboratory B in Computer Science and Communications Engineering | Graduate School of Fundamental Science and Engineering | 2019 | fall semester
Special Laboratory B in Computer Science and Communications Engineering | Graduate School of Fundamental Science and Engineering | 2019 | fall semester
Seminar on Advanced Processor Architecture A | Graduate School of Fundamental Science and Engineering | 2019 | spring semester
Seminar on Advanced Processor Architecture A | Graduate School of Fundamental Science and Engineering | 2019 | spring semester
Seminar on Advanced Processor Architecture B | Graduate School of Fundamental Science and Engineering | 2019 | fall semester
Seminar on Advanced Processor Architecture B | Graduate School of Fundamental Science and Engineering | 2019 | fall semester
Seminar on Advanced Processor Architecture C | Graduate School of Fundamental Science and Engineering | 2019 | spring semester
Seminar on Advanced Processor Architecture C | Graduate School of Fundamental Science and Engineering | 2019 | spring semester
Seminar on Advanced Processor Architecture D | Graduate School of Fundamental Science and Engineering | 2019 | fall semester
Seminar on Advanced Processor Architecture D | Graduate School of Fundamental Science and Engineering | 2019 | fall semester
Master's Thesis (Department of Computer Science and Communications Engineering) | Graduate School of Fundamental Science and Engineering | 2019 | full year
Research on Advanced Processor Architecture | Graduate School of Fundamental Science and Engineering | 2019 | full year
Research on Advanced Processor Architecture | Graduate School of Fundamental Science and Engineering | 2019 | full year
Special Seminar A in Computer Science and Communications Engineering | Graduate School of Fundamental Science and Engineering | 2019 | spring semester
Special Seminar B in Computer Science and Communications Engineering | Graduate School of Fundamental Science and Engineering | 2019 | fall semester