B1: Adaptive Application-Specific Invasive Microarchitecture

Principal Investigators:

Prof. J. Henkel, Prof. J. Becker, Dr. L. Bauer

Scientific Researchers:

M. Damschen, T. Harbaum

Abstract

Project B1 investigates mechanisms that provide run-time adaptivity both in the microarchitecture (μArch) and through a run-time-reconfigurable fabric. The goals are to advance state-of-the-art reconfigurable processor concepts towards invasion and to exploit their benefits within the invasive computing project. We propose concepts and methods for invading the reconfigurable fabric and the μArch within the invasive core (i-Core). The focus is to (i) investigate run-time adaptivity at the μArch level (e.g. dynamic L1 cache size or branch prediction) and (ii) provide so-called Special Instructions (SIs, implemented by i-let-specific accelerators) on demand.

Figure: i-Core consisting of adaptive microarchitecture and reconfigurable fabric.

The figure above provides an overview of the i-Core architecture as it was developed in the first funding phase. The instruction-set architecture is extended by additional instructions that allow the application developer to adapt the microarchitecture to a specific i-let. The adaptive μArch mechanisms include adaptive branch prediction, adaptation of the pipeline length, and a dynamically parameterisable L1 cache. In addition to the SPARC V8 instruction set as implemented by the LEON3, the i-Core introduces SIs that are implemented by run-time-reconfigurable accelerators. The accelerators are loaded into reconfigurable containers, i.e. designated regions that support partial reconfiguration without disrupting the rest of the system. The reconfigurable containers are connected to an interconnect infrastructure that establishes communication among the accelerators and with the i-Core μArch, the tile-local memory, and the data cache.

Building on the developed i-Core architecture, in Phase II we will contribute to the common goals of dark-silicon management (by improving the i-Core efficiency, which allows Project B3 to grey cores spatially and temporally without compromising performance) while at the same time offering predictability improvements (by constraining the i-Core adaptivity).

To improve the efficiency, we want to investigate an approach for automatic online SI generation that is transparent to the application developer. These so-called Auto-SIs may not reach the performance of offline-optimised SIs, but they are beneficial when an i-let that does not contain any regular SIs is to execute on an i-Core, e.g. when an i-let is migrated from a LEON3 to an i-Core. We also plan to move from a homogeneous to a heterogeneous reconfigurable fabric, with reconfigurable containers divided into different classes that offer different amounts and types of resources. To support this, we want to develop a heterogeneity-aware run-time system that exploits the efficiency of the reconfigurable fabric and allows optimised i-let invasions.

In addition to the contributions related to pure efficiency/dark silicon, we also offer trade-offs between efficiency and predictability as follows. We want to add a concept for intra-tile multicore invasion of the reconfigurable fabric, which allows LEON3 cores in an i-Core tile (i.e. a tile that contains an i-Core) to issue SIs to the i-Core. These so-called Remote-SIs execute on the reconfigurable fabric of the i-Core along with SIs executed by the i-Core itself. This increases the efficiency of the remote cores; to sustain the predictability of i-Core-SIs, the interference between Remote-SIs and i-Core-SIs can be limited at run time. Based on the findings of Phase I, we want to introduce Dynamic Intra-Tile Cache Reallocation, i.e. a flexible parameterisation and allocation of L1 cache between the cores within a tile, to make more efficient use of the available on-chip memory resources. This also improves predictability, because each i-let can request the cache configuration it needs.

In addition to the architectural/run-time contributions, we want to investigate compiler support for Offline SI Generation. To simplify SI creation, which was done manually in Phase I, we plan to provide compiler support (together with Project C3) to automatically identify suitable SI candidates for offline SI development.
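The idea behind Dynamic Intra-Tile Cache Reallocation can be illustrated with a small host-side model. The following is only a sketch under stated assumptions, not the project's mechanism: it assumes that cache capacity is handed out in units of ways, that a tile has four cores, and that every core keeps at least one way. The names tile_cache, tile_cache_move, CORES_PER_TILE and MIN_WAYS are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CORES_PER_TILE 4  /* assumption: number of cores sharing the pool */
#define MIN_WAYS 1        /* assumption: every core keeps at least one way */

/* Hypothetical model: L1 cache ways currently assigned to each core. */
typedef struct {
    int ways[CORES_PER_TILE];
} tile_cache;

/* Total number of ways in the tile; reallocation must not change it. */
int tile_cache_total(const tile_cache *t)
{
    assert(t != NULL);
    int total = 0;
    for (int c = 0; c < CORES_PER_TILE; c++)
        total += t->ways[c];
    return total;
}

/* Move n ways from core `from` to core `to`.
 * Returns false and leaves the allocation unchanged if the request is
 * invalid or would leave the donor core with fewer than MIN_WAYS ways. */
bool tile_cache_move(tile_cache *t, int from, int to, int n)
{
    assert(t != NULL);
    if (from < 0 || from >= CORES_PER_TILE || to < 0 || to >= CORES_PER_TILE)
        return false;
    if (from == to || n <= 0)
        return false;
    if (t->ways[from] - n < MIN_WAYS)
        return false;
    t->ways[from] -= n;
    t->ways[to] += n;
    return true;
}
```

In this model, an i-let that needs a larger cache would ask for ways to be moved from a core that can spare them. The request is refused when the donor would drop below its guaranteed minimum, which is one simple way to keep the cache configuration of other i-lets predictable.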

Altogether, the above-mentioned goals and novel contributions for Phase II improve the overall invasive computing system by increasing the efficiency and predictability of the i-Core and the i-Core tile. We will also develop and provide detailed i-Core performance models, so that application developers (Projects D1 and D3), simulation (Project C2), characterisation and pattern development (Project A4), the agent system (Project C1), and dark-silicon management (Project B3) can simulate, estimate, and predict the performance of different i-Core features.

Synopsis

Building on the i-Core architecture developed in Phase I, the main focus of Project B1 in Phase II is on efficient fabric invasion, intra-tile multicore support, and enhanced usability. The main goal of efficient fabric invasion and intra-tile multicore support is to further increase the area efficiency of the i-Core, which directly addresses the challenge of dark silicon, a major topic of the second funding phase. Additionally, for all existing and new i-Core features that improve efficiency by exploiting adaptivity, we will analyse and develop trade-offs between efficiency and predictability (at the cost of limited adaptivity). That is, if an i-let demands predictability, not all i-Core features will be available to it; we will carefully analyse which features can still be used and provide models to improve the i-Core usability.

Approach

The i-Core can be invaded like a regular LEON3 core, but in addition the i-Core run-time system uses the invasive computing paradigm to manage its reconfigurable fabric. The following C-like code fragment shows how an H.264 video encoder i-let can be written to invade the reconfigurable fabric.

H264_encoder() {
  // Set i-Core microarchitecture parameters
  set_i_Core_parameter(pipeline_length=7,
                       branch_prediction=2_2_correlation);
  // Invade a share of the reconfigurable fabric
  fabric_claim=invade(resource=reconf_fabric,
                      performance=trade_off_curve);
  while (frame=videoInput.getNextFrame()) {
    // Kernel: Motion Estimation
    // Special Instructions:
    //   Sum of Absolute Differences (SAD),
    //   Sum of Absolute Transformed Differences (SATD)
    SI_implementations=invade(resource=fabric_claim,
      SI={SAD, trade_off_curve[SAD], prediction[SAD]},
      SI={SATD, trade_off_curve[SATD], prediction[SATD]});
    infect(SI_implementations);
    execute_motion_estimation(frame, ...);
    ...
    // Kernel: Encoding Engine
    SI_implementations=invade(resource=fabric_claim, ...);
    infect(SI_implementations);
    execute_encoding_engine(frame, ...);
    ...
  }
}

In line 6 of the listing, the i-let invades the reconfigurable fabric and receives a fabric claim, i.e. a subset of reconfigurable containers that the i-let can now use to load accelerators. The actual application consists of several kernels (Motion Estimation, etc.) that are executed sequentially for each frame. Before each kernel, the obtained fabric claim is further invaded by the SIs of the next kernel (lines 13 and 20). The SIs of a kernel compete for the reconfigurable fabric, and this invade triggers the i-Core run-time system to choose the SI implementations, i.e. it decides which accelerators shall be reconfigured. The actual reconfiguration of accelerators is triggered by the infect call (lines 16 and 21). Infection is asynchronous, i.e. the actual kernel, e.g. execute_motion_estimation(), can start while the accelerators are still being loaded. SIs that do not yet have all accelerators loaded are automatically emulated in software. Retreating from the fabric is implicit upon termination of the i-let.
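The asynchronous infection described above, with software emulation until all accelerators are loaded, can be modelled on an ordinary host. The following is a minimal sketch under stated assumptions and not the i-Core run-time system: si_state, si_accelerator_loaded, si_ready, si_sad and the two sad_* functions are hypothetical names, and the accelerated path is only a stand-in that reuses the software arithmetic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one Special Instruction (SI). */
typedef struct {
    int accels_required;  /* accelerators the selected SI implementation needs */
    int accels_loaded;    /* accelerators already reconfigured */
    unsigned hw_calls;    /* executions on the (modelled) accelerated path */
    unsigned sw_calls;    /* executions emulated in software */
} si_state;

/* Models the completion of one accelerator reconfiguration. */
void si_accelerator_loaded(si_state *si)
{
    assert(si != NULL);
    if (si->accels_loaded < si->accels_required)
        si->accels_loaded++;
}

/* An SI may use the fabric only once all of its accelerators are loaded. */
bool si_ready(const si_state *si)
{
    assert(si != NULL);
    return si->accels_loaded >= si->accels_required;
}

/* Software emulation: Sum of Absolute Differences (SAD) over n pixels. */
uint32_t sad_software(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int diff = (int)a[i] - (int)b[i];
        sum += (uint32_t)(diff < 0 ? -diff : diff);
    }
    return sum;
}

/* Stand-in for the accelerated path. On an i-Core this would be the SI
 * itself; here it reuses the software arithmetic so the sketch runs on
 * any host. */
uint32_t sad_accelerated(const uint8_t *a, const uint8_t *b, size_t n)
{
    return sad_software(a, b, n);
}

/* Dispatch: accelerated once ready, emulated in software until then. */
uint32_t si_sad(si_state *si, const uint8_t *a, const uint8_t *b, size_t n)
{
    if (si_ready(si)) {
        si->hw_calls++;
        return sad_accelerated(a, b, n);
    }
    si->sw_calls++;
    return sad_software(a, b, n);
}
```

A kernel that calls si_sad() right after infection gets correct results immediately. In this model, calls are counted as software emulation until si_accelerator_loaded() has been invoked once per required accelerator, and as accelerated afterwards, so the result of the SI does not depend on when reconfiguration finishes.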

A comprehensive summary of the major achievements of the first funding phase can be found by accessing Project B1 first phase website.

Publications

[1] Tanja Harbaum, Christoph Schade, Marvin Damschen, Carsten Tradowsky, Lars Bauer, Jörg Henkel, and Jürgen Becker. Auto-SI: An adaptive reconfigurable processor with run-time loop detection and acceleration. In 30th IEEE International System-on-Chip Conference (SOCC), pages 224–229, September 2017.
[2] Jörg Henkel. The triangle of power density, circuit degradation and reliability. Invited Keynote Speech, 30th IEEE International System-On-Chip Conference (SoCC 2017), Munich, Germany, September 7, 2017.
[3] Manuel Mohr and Carsten Tradowsky. Pegasus: Efficient data transfers for PGAS languages on non-cache-coherent many-cores. In Design, Automation and Test in Europe Conference & Exhibition (DATE), pages 1781–1786, March 30, 2017.
[4] Artjom Grudnitsky, Lars Bauer, and Jörg Henkel. Efficient partial online-synthesis of special instructions for reconfigurable processors. IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 25(2):594–607, February 2017.
[5] Marvin Damschen, Lars Bauer, and Jörg Henkel. Timing analysis of tasks on runtime reconfigurable processors. IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 25(1):294–307, January 2017.
[6] Alexander Pöppl, Marvin Damschen, Florian Schmaus, Andreas Fried, Manuel Mohr, Matthias Blankertz, Lars Bauer, Jörg Henkel, Wolfgang Schröder-Preikschat, and Michael Bader. Shallow water waves on a deep technology stack: Accelerating a finite volume tsunami model using reconfigurable hardware in invasive computing. In Euro-Par 2017: Proceedings of the 10th Workshop on UnConventional High Performance Computing (UCHPC 2017), Lecture Notes in Computer Science (LNCS). Springer, 2017.
[7] Carsten Tradowsky. Methoden zur applikationsspezifischen Effizienzsteigerung adaptiver Prozessorplattformen. Dissertation, Institut für Technik der Informationsverarbeitung (ITIV), Fakultät für Elektrotechnik und Informationstechnik, Karlsruher Institut für Technologie (KIT), December 20, 2016.
[8] Jürgen Teich. Invasive computing – editorial. it – Information Technology, 58(6):263–265, November 24, 2016.
[9] Stefan Wildermann, Michael Bader, Lars Bauer, Marvin Damschen, Dirk Gabriel, Michael Gerndt, Michael Glaß, Jörg Henkel, Johny Paul, Alexander Pöppl, Sascha Roloff, Tobias Schwarzer, Gregor Snelting, Walter Stechele, Jürgen Teich, Andreas Weichslgartner, and Andreas Zwinkau. Invasive computing for timing-predictable stream processing on MPSoCs. it – Information Technology, 58(6):267–280, September 30, 2016.
[10] Fazal Hameed, Lars Bauer, and Jörg Henkel. Architecting on-chip DRAM cache for simultaneous miss rate and latency reduction. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 35(4):651–664, April 2016.
[11] Carsten Tradowsky, Enrique Cordero, Christoph Orsinger, Malte Vesper, and Jürgen Becker. A Dynamic Cache Architecture for Efficient Memory Resource Allocation in Many-Core Systems. Springer International Publishing, Cham, 2016.
[12] Carsten Tradowsky, Enrique Cordero, Christoph Orsinger, Malte Vesper, and Jürgen Becker. Adaptive Cache Structures. Springer International Publishing, Cham, 2016.
[13] Carsten Tradowsky, Tanja Harbaum, Leonard Masing, and Jürgen Becker. A novel ADL-based approach to design adaptive application-specific processors. In Best of IEEE Computer Society Annual Symposium on VLSI (ISVLSI). 2016. Forthcoming.
[14] Artjom Grudnitsky. A Reconfigurable Processor for Heterogeneous Multi-Core Architectures. Dissertation, Chair for Embedded Systems (CES), Department of Computer Science, Karlsruhe Institute of Technology (KIT), Germany, December 21, 2015.
[15] Johny Paul, Walter Stechele, Benjamin Oechslein, Christoph Erhardt, Jens Schedel, Daniel Lohmann, Wolfgang Schröder-Preikschat, Manfred Kröhnert, Tamim Asfour, Éricles R. Sousa, Vahid Lari, Frank Hannig, Jürgen Teich, Artjom Grudnitsky, Lars Bauer, and Jörg Henkel. Resource-awareness on heterogeneous MPSoCs for image processing. Journal of Systems Architecture, 61(10):668–680, November 6, 2015.
[16] Lars Bauer, Artjom Grudnitsky, Marvin Damschen, Srinivas Rao Kerekare, and Jörg Henkel. Floating point acceleration for stream processing applications in dynamically reconfigurable processors. In IEEE Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia), October 2015. Invited Paper for the Special Session “Dynamics and Predictability in Stream Processing – A Contradiction?”.
[17] C. Diniz, M. Shafique, S. Bampi, and J. Henkel. A reconfigurable hardware architecture for fractional pixel interpolation in high efficiency video coding. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 34(2), February 2015.
[18] Fazal Hameed. DRAM aware Last-Level-Cache policies for Multi-core Systems. Dissertation, Chair for Embedded Systems (CES), Department of Computer Science, Karlsruhe Institute of Technology (KIT), Germany, February 6, 2015.
[19] Peter Figuli, Carsten Tradowsky, Jose Martinez, Harry Sidiropoulos, Kostas Siozios, Holger Stenschke, Dimitrios Soudris, and Jürgen Becker. A novel concept for adaptive signal processing on reconfigurable hardware. In Applied Reconfigurable Computing, volume 9040 of Lecture Notes in Computer Science, pages 311–320. Springer International Publishing, 2015.
[20] Artjom Grudnitsky, Lars Bauer, and Jörg Henkel. COREFAB: Concurrent reconfigurable fabric utilization in heterogeneous multi-core systems. In International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), October 2014.
[21] Martin Haaß, Lars Bauer, and Jörg Henkel. Automatic custom instruction identification in memory streaming algorithms. In International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), October 2014.
[22] Jörg Henkel, Lars Bauer, Artjom Grudnitsky, and Hongyan Zhang. Adaptive embedded computing with i-Core. In ACM SIGBED Review – Special Issue on the 6th Workshop on Adaptive and Reconfigurable Embedded Systems, volume 11, pages 20–21, October 2014. Extended Abstract for Keynote Talk.
[23] Fazal Hameed, Lars Bauer, and Jörg Henkel. Reducing latency in an SRAM/DRAM cache hierarchy via a novel tag-cache architecture. In IEEE/ACM Design Automation Conference (DAC), June 2014.
[24] Jörg Henkel. Adaptive embedded computing with i-Core. Keynote Talk, 6th Workshop on Adaptive and Reconfigurable Embedded Systems, CPSWeek (APRES), April 14, 2014.
[25] Carsten Tradowsky, Martin Schreiber, Malte Vesper, Ivan Domladovec, Maximilian Braun, Hans-Joachim Bungartz, and Jürgen Becker. Towards dynamic cache and bandwidth invasion. In Reconfigurable Computing: Architectures, Tools, and Applications, volume 8405 of Lecture Notes in Computer Science, pages 97–107. Springer International Publishing, April 2014.
[26] Artjom Grudnitsky, Lars Bauer, and Jörg Henkel. MORP: Makespan optimization for processors with an embedded reconfigurable fabric. In Proceedings of the 22nd ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), pages 127–136, February 2014.
[27] C. Tradowsky, T. Gädeke, T. Bruckschlögl, W. Stork, K.-D. Müller-Glaser, and J. Becker. Smartlocore: A concept for an adaptive power-aware localization processor. In Parallel, Distributed and Network-Based Processing (PDP), 2014 22nd Euromicro International Conference on, pages 478–481, February 2014.
[28] Muhammad Shafique, Lars Bauer, and Jörg Henkel. Adaptive energy management for dynamically reconfigurable processors. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 33(1):50–63, January 2014.
[29] Timo Stripf. Softwareframework für Prozessoren mit variablen Befehlssatzarchitekturen. Dissertation, Institut für Technik der Informationsverarbeitung (ITIV), Fakultät für Elektrotechnik und Informationstechnik, Karlsruher Institut für Technologie (KIT), December 11, 2013.
[30] Peter Figuli, Carsten Tradowsky, Nadine Gaertner, and Jürgen Becker. Visa: A highly efficient slot architecture enabling multi-objective ASIP cores. In International Symposium on System on Chip (SoC), pages 1–8, October 2013.
[31] Manuel Mohr, Artjom Grudnitsky, Tobias Modschiedler, Lars Bauer, Sebastian Hack, and Jörg Henkel. Hardware acceleration for programs in SSA form. In International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Montreal, Canada, October 2013.
[32] Fazal Hameed, Lars Bauer, and Jörg Henkel. Simultaneously optimizing DRAM cache hit latency and miss rate via novel set mapping policies. In International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), September 2013.
[33] Fazal Hameed, Lars Bauer, and Jörg Henkel. Reducing inter-core cache contention with an adaptive bank mapping policy in DRAM cache. In International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), September 2013.
[34] Carsten Tradowsky, Tanja Harbaum, Shaver Deyerle, and Jürgen Becker. Limbic: An adaptable architecture description language model for developing an application-specific image processor. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages 34–39, August 2013.
[35] Lars Braun. Methoden zur Erstellung eines laufzeitadaptiven und zweidimensional rekonfigurierbaren Systems. Dissertation, Institut für Technik der Informationsverarbeitung (ITIV), Fakultät für Elektrotechnik und Informationstechnik, Karlsruher Institut für Technologie (KIT), February 19, 2013.
[36] Carsten Tradowsky, Enrique Cordero, Thorsten Deuser, Michael Hübner, and Jürgen Becker. Determination of on-chip temperature gradients on reconfigurable hardware. In Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig), pages 1–8, December 2012.
[37] Michael Hübner, Diana Göhringer, Carsten Tradowsky, Jörg Henkel, and Jürgen Becker. Adaptive processor architecture. In International Conference on Embedded Computer Systems (SAMOS), pages 244–251, July 2012. Invited paper.
[38] Carsten Tradowsky, Florian Thoma, Michael Hübner, and Jürgen Becker. Lisparc: Using an architecture description language approach for modelling an adaptive processor microarchitecture. In 7th IEEE International Symposium on Industrial Embedded Systems (SIES'12), pages 279–282, June 2012. Best Work-in-Progress (WiP) Paper Award.
[39] Jörg Henkel. i-Core: Adaptive computing for multi-core architectures. Embedded System Design from MultiMedia to Cloud, Hong Kong, Invited Talk, May 18, 2012.
[40] Lars Bauer, Artjom Grudnitsky, Muhammad Shafique, and Jörg Henkel. PATS: a performance aware task scheduler for runtime reconfigurable processors. In 20th Annual International IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 208–215, May 2012.
[41] Carsten Tradowsky, Florian Thoma, Michael Hübner, and Jürgen Becker. On dynamic run-time processor pipeline reconfiguration. In IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), pages 419–424, May 2012.
[42] Artjom Grudnitsky, Lars Bauer, and Jörg Henkel. Partial online-synthesis for mixed-grained reconfigurable architectures. In Proceedings of Design, Automation and Test in Europe Conference (DATE), pages 1555–1560, March 2012.
[43] Peter Figuli, Michael Hübner, Romuald Girardey, F. Bapp, Thomas Bruckschlögl, Florian Thoma, Jörg Henkel, and Jürgen Becker. A heterogeneous SoC architecture with embedded virtual FPGA cores and runtime core fusion. In NASA/ESA 6th Conference on Adaptive Hardware and Systems (AHS), pages 96–103, 2012.
[44] Jörg Henkel, Andreas Herkersdorf, Lars Bauer, Thomas Wild, Michael Hübner, Ravi Kumar Pujari, Artjom Grudnitsky, Jan Heisswolf, Aurang Zaib, Benjamin Vogel, Vahid Lari, and Sebastian Kobbe. Invasive manycore architectures. In Proceedings of the 17th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 193–200, January 2012.
[45] Alexander Klimm. Computing Architectures for Security Applications on Reconfigurable Hardware in Embedded Systems. Dissertation, Institut für Technik der Informationsverarbeitung (ITIV), Fakultät für Elektrotechnik und Informationstechnik, Karlsruher Institut für Technologie (KIT), December 22, 2011.
[46] M. Hübner, C. Tradowsky, D. Göhringer, L. Braun, F. Thoma, J. Henkel, and J. Becker. Dynamic processor reconfiguration. In Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig), pages 123–128, November 2011.
[47] Jörg Henkel, Lars Bauer, Michael Hübner, and Artjom Grudnitsky. i-Core: A run-time adaptive processor for embedded multi-core systems. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), July 2011. Invited paper.
[48] Lars Bauer, Muhammad Shafique, and Jörg Henkel. Concepts, architectures, and run-time systems for efficient and adaptive reconfigurable processors. In NASA/ESA 6th Conference on Adaptive Hardware and Systems (AHS), pages 80–87, June 2011. Invited paper; Received the MaXentric Technologies AHS Best Paper Award.
[49] Michael Hübner, Peter Figuli, Romuald Girardey, Dimitrios Soudris, Kostas Siozios, and Jürgen Becker. A heterogeneous multicore system on chip with run-time reconfigurable virtual FPGA architecture. In Proceedings of the International Parallel and Distributed Processing Symposium Workshops (IPDPSW), May 2011.
[50] Jürgen Teich, Jörg Henkel, Andreas Herkersdorf, Doris Schmitt-Landsiedel, Wolfgang Schröder-Preikschat, and Gregor Snelting. Invasive computing: An overview. In Michael Hübner and Jürgen Becker, editors, Multiprocessor System-on-Chip – Hardware Design and Tool Integration, pages 241–268. Springer, Berlin, Heidelberg, 2011.
[51] Jürgen Teich. Invasive algorithms and architectures. it – Information Technology, 50(5):300–310, 2008.
[52] Diana Göhringer, Jonathan Obie, Michael Hübner, and Jürgen Becker. Impact of task distribution, processor configurations and dynamic clock frequency scaling on the power consumption of FPGA-based multiprocessors. In Proceedings of the 5th International Workshop on Reconfigurable Communication-centric Systems on Chip (ReCoSoC), pages 13–20. KIT Scientific Publishing.
[53] Michael Hübner, Diana Göhringer, J. Noguera, and Jürgen Becker. Fast dynamic and partial reconfiguration data path with low hardware overhead on Xilinx FPGAs. In Proceedings of the International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[54] Carsten Tradowsky, Peter Figuli, Erik Seidenspinner, Felix Held, and Jürgen Becker. A new approach to model-based development for audio signal processing. In 134th International AES Convention.
[55] Michael Hübner and Jürgen Becker, editors. Multiprocessor System-on-Chip: Hardware Design and Tool Integration. Springer.