B4: Hardware Monitoring System and Design Optimisation for Invasive Architectures

Principal Investigators:

Prof. U. Schlichtmann, Prof. D. Schmitt-Landsiedel

Scientific Researchers:

Q. Chen, E. Glocker, S. Karapetyan, Dr. D. Müller-Gritschneder, B. Li, Dr. C. Werner, C. Yilmaz

Abstract

The TCRC 89 will direct its primary attention during the second funding phase towards non-functional properties of invasive computing systems. Project B4 will contribute especially to the goals of predictability, fault tolerance, dark-silicon management, and energy-efficient computing. It will also support the overall goals of addressing security as well as design space exploration.

In the first funding phase Project B4 has developed concepts for monitoring invasive computing systems (both RISC and TCPA tiles). Specifically, concepts for monitoring power, temperature and ageing have been investigated. Communication interfaces between the monitors and higher levels and a control loop concept of invasive computing systems have been explored. For the essential monitoring concepts, a method has been developed to emulate them on an FPGA. The major challenge for FPGA emulation was that most monitors contain analogue circuits. With the achieved FPGA emulation, our concepts can be evaluated in the context of an entire invasive computing system even without an ASIC hardware implementation.

In the second funding phase we will utilise our results of the first funding phase and shift our focus from considering components of a monitoring system to the overall monitoring system itself.
One focus will be "fix it before it breaks". We intend to use monitor data to predict that a component is approaching a hardware failure (either a catastrophic failure or a parametric failure, such as no longer meeting frequency requirements or exceeding power limits). Flagging an impending component failure before it actually occurs gives the invasive computing system a chance to react, e.g. by shifting an application to a different processing element (PE), by lowering the required frequency, or by increasing the supply voltage. In particular, this helps to ensure that an application will not experience a hardware failure in a real ASIC.
We will also focus on the optimisation of the entire monitoring system. The types and quantities of required monitors will be analysed, as well as the achieved accuracy of the monitoring data. We will investigate how often monitors need to be active. The goal is to obtain an optimal trade-off between the monitoring data supplied to the invasive computing system and the resources (primarily chip area and power consumption) required to obtain these data. Special emphasis will again be put on power optimisation in order to alleviate concerns about dark silicon. We will also investigate whether the monitoring system itself can be a security concern (e.g. side-channel information leakage) when it communicates application-specific activity data, and how this can be avoided to support the security concepts developed within the TCRC 89.

Synopsis

In the second funding phase, the goal of Project B4 is to enable an invasive computing system to exploit the flexible resource allocation of invasive computing in order to adapt dynamically to changing environmental conditions (e.g. supply voltage, temperature, ageing) and to deal with manufacturing variations as well. Project B4 will also contribute to optimising the power consumption of invasive computing systems by adaptively choosing the minimum supply voltage sufficient for the system performance requirements, while also considering reliability concerns due to lifetime ageing. This will ensure that invasive computing systems can provide maximum performance and minimise ageing. Special emphasis will be placed on predicting potential failures of a system, so that corrective actions can be taken before a failure actually occurs, rather than on diagnosing failures after they have occurred.
With the resource-aware programming support in invasive computing systems, applications are now able to explore the system and make decisions for execution (e.g. number and selection of invaded cores) based on the system's current status, including physical hardware properties. This results in the need to characterise the current hardware status. Project B4 investigates the design and optimisation of a monitoring system and corresponding interfaces by simulation and emulation on the FPGA hardware prototype platform. Different monitor types (temperature, power consumption, in-situ delay, degradation, ageing monitors) and the resulting monitor data are necessary to characterise the current hardware status. This information is communicated with different levels of detail to upper hardware and software layers. The system is able to react considering the current hardware status, e.g. during resource allocation or to detect a critical hardware status. In turn, these actions influence the current hardware status. The complete closed loop between applications, operating/agent system and underlying hardware is shown in the figure below for the example of a RISC compute tile.
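The adaptive choice of a minimum sufficient supply voltage mentioned above could, for instance, be approximated by a simple step controller driven by an in-situ delay monitor. The following is only a minimal sketch under stated assumptions: the function name `next_vdd_mv`, the pre-error count as input, and all thresholds and voltage limits are hypothetical placeholders, not values or interfaces from Project B4.

```python
def next_vdd_mv(vdd_mv, pre_errors, raise_above=10, lower_below=1,
                step_mv=10, vmin_mv=700, vmax_mv=1100):
    """Return the supply voltage (mV) for the next control interval.

    pre_errors is the number of timing pre-errors (late-arriving signals)
    that a hypothetical in-situ delay monitor counted in the last interval.
    All default values are illustrative placeholders.
    """
    if pre_errors > raise_above:
        # Too little timing margin: increase the supply voltage.
        vdd_mv += step_mv
    elif pre_errors < lower_below:
        # Margin is larger than needed: save power by lowering the voltage.
        vdd_mv -= step_mv
    # Otherwise keep the voltage; always stay within the allowed range.
    return max(vmin_mv, min(vmax_mv, vdd_mv))
```

Calling the function once per control interval moves the voltage towards the lowest level at which pre-errors stay inside the tolerated band; ageing would gradually push that level upwards.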

Control loop in the LCPA

In this funding phase, the main focus of Project B4 shall be on system-level prediction and predictability (based on monitor data), on reliability, and on optimisation of the monitor system itself. The guiding principle for Project B4 will be to predict system errors and avoid them rather than to react to errors once they have already occurred. The best strategies to achieve this shall be investigated. In the first funding phase, the foundation for this has been established by researching individual monitors, developing an FPGA emulation and starting to investigate the whole control loop, including support from the programming language developers. We will build on these results and shift our focus from individual monitor system components to the overall monitoring system itself. Nevertheless, some research will still be performed on the level of individual monitor components, e.g. on reliability monitors.

Approach

We intend to achieve the goals mentioned above by employing the following methods:

Reliability:

Reliability is an essential consideration when designing integrated circuits, especially since failure mechanisms are becoming more critical due to continued scaling, leading to a decreased time to first failure. Implementing degradation and ageing monitors gives the invasive computing system the ability to cope with unreliable devices (e.g. by adjusting the resource allocation process), to perform error tests during core idle times, and to use these monitor data for dark-silicon management. Trade-offs will be investigated on how to react to potential errors. A further question is how monitoring data and predictions can be used to implement DMR or TMR strategies more efficiently.
On the one hand, we will continue our work from the first funding phase and research reliability monitors. Various concepts for detecting ageing (e.g. replica circuits, in-situ delay monitors, periodically monitoring potentially critical paths) will be evaluated concerning their suitability for invasive computing systems.
On the other hand, we will consider how to utilise the data obtained from the monitors. For example, trade-offs will be investigated on how to react to potential errors: either at the level of individual processing elements (PEs), e.g. by reducing the frequency or increasing Vdd, or at the overall system level, e.g. by shifting loads to less aged PEs together with the run-time support system. This will be done in cooperation with the other involved projects (mainly Project B3, Project B2, Project C1).
Another topic is to support efficient implementation of DMR or TMR strategies by utilising monitoring data. One idea is to use monitoring data to classify PEs according to their "health": how close are they to achieving their target performance on the one hand, and how close are they to failing on the other? If such a classification is available, different strategies, e.g. for TMR, can be evaluated: What are the respective advantages and drawbacks of selecting three PEs of similar health for a TMR strategy, or of selecting three PEs of significantly differing health status?
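The two TMR selection strategies contrasted above can be made concrete with a small sketch. It assumes that each PE has already been assigned a scalar health score; the `PEHealth` type, the score range and both selection functions are hypothetical illustrations and do not describe an existing interface of the invasive computing platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PEHealth:
    pe_id: int
    health: float  # assumed score in [0, 1]; 1.0 means far from failing


def pick_similar(pes, k=3):
    """Select k PEs whose health scores lie closest together."""
    if not 1 <= k <= len(pes):
        raise ValueError("need between 1 and len(pes) PEs")
    ranked = sorted(pes, key=lambda p: p.health)
    # Every group of k neighbours in the ranking is a candidate; keep the
    # group with the smallest difference between best and worst score.
    windows = [ranked[i:i + k] for i in range(len(ranked) - k + 1)]
    return min(windows, key=lambda w: w[-1].health - w[0].health)


def pick_diverse(pes, k=3):
    """Select k PEs spread evenly over the ranked health range."""
    if not 1 <= k <= len(pes):
        raise ValueError("need between 1 and len(pes) PEs")
    ranked = sorted(pes, key=lambda p: p.health)
    if k == 1:
        return [ranked[-1]]
    n = len(ranked)
    # Evenly spaced ranks from the least healthy to the healthiest PE.
    return [ranked[(i * (n - 1)) // (k - 1)] for i in range(k)]
```

Which of the two selections is preferable is exactly the open question raised above; the sketch only shows that both are cheap to compute once a health classification exists.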

Monitor-system optimisation:

While the monitoring system provides valuable information, it also consumes valuable resources (area, power). The sizing of the overall monitor system therefore needs to be done carefully together with the projects using the monitor data, mainly Project B2, Project B3 and Project C1. Various questions need to be addressed: How many and which types of monitors provide the best trade-off between the cost of the monitoring system and the resulting benefits? How often do monitors need to operate? Can or should this be restricted to times when a PE is idle? How can we deal with variations in the monitoring system itself? RISC tiles and TCPA tiles raise different questions: A processor in a RISC tile will typically have multiple monitors, and their number and type need to be optimised. In a fine-grained TCPA architecture, the target should rather be that multiple PEs share a monitor. We intend to build a parameterisable model of the overall monitor system which will allow us to investigate such trade-offs. This model was originally planned for the first funding phase, but had to be delayed due to the required focus on monitor emulation.
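As a rough illustration of what one building block of such a parameterisable model might look like, the sketch below computes the area and average power of one monitor type from the number of PEs, the degree of monitor sharing and the fraction of time the monitors are active. The names `MonitorSpec` and `monitor_cost` as well as any numbers passed in are hypothetical; the actual model of Project B4 is not described in this text.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class MonitorSpec:
    name: str
    area_um2: float         # area of one monitor instance (placeholder unit)
    active_power_uw: float  # power of one instance while it is sensing


def monitor_cost(spec, num_pes, pes_per_monitor, duty_cycle):
    """Return (instances, total area, average power) for one monitor type.

    pes_per_monitor = 1 models a dedicated monitor per PE; larger values
    model several PEs sharing one monitor, as suggested for TCPA tiles.
    duty_cycle is the fraction of time the monitors are active.
    """
    if pes_per_monitor < 1 or not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("invalid sharing factor or duty cycle")
    instances = ceil(num_pes / pes_per_monitor)
    total_area = instances * spec.area_um2
    avg_power = instances * spec.active_power_uw * duty_cycle
    return instances, total_area, avg_power
```

Sweeping `pes_per_monitor` and `duty_cycle` over a range of values would expose the cost side of the trade-off; the benefit side (accuracy and coverage of the monitor data) would have to come from the projects that consume the data.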
Project B4 shall also investigate how to adjust monitor operation, e.g. by adjusting the sensing frequency, when cores are shut down or operate at lower frequencies.
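One conceivable policy for adjusting the sensing frequency is sketched below: the sampling interval is stretched in proportion to the clock slowdown and set to a long fixed value while a core is shut down. This is an assumption made purely for illustration; the function name, the proportional scaling rule and all default values are hypothetical.

```python
def sensing_interval_ms(powered_on, f_core_mhz, f_nominal_mhz=1000.0,
                        base_ms=10.0, max_ms=500.0, idle_ms=1000.0):
    """Return the time between two monitor readings in milliseconds."""
    if not powered_on:
        # Core is shut down: sample only rarely.
        return idle_ms
    if f_core_mhz <= 0 or f_nominal_mhz <= 0:
        raise ValueError("frequencies must be positive")
    # A slower clock is assumed to change the hardware state more slowly,
    # so the interval grows by the same factor as the clock period.
    scaled = base_ms * f_nominal_mhz / f_core_mhz
    # Never sample faster than the base rate or slower than max_ms.
    return min(max(scaled, base_ms), max_ms)
```

With these placeholder defaults, a core running at half its nominal frequency would be sampled every 20 ms instead of every 10 ms.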

System prediction based on monitor data:

Monitor data can be used during resource allocation and to detect potentially critical system components: Errors or failures can be foreseen and prevented.
System prediction together with reliability monitoring will contribute greatly to achieving the goal of predictable reliability of an invasive computing system. Applications can be assured of getting the necessary resources to complete their job, almost regardless of the current status and environment of the computing hardware on which they are running. Implementation of reliability prediction based on monitor data will be done in close cooperation with the partners using monitor data (mainly Project B2, Project B3, Project C1).
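To illustrate how a parametric failure might be foreseen from monitor data, the following sketch fits a straight line to a series of delay measurements and extrapolates when the delay would reach a given limit. The linear trend is a deliberate simplification chosen for brevity (ageing mechanisms often do not progress linearly over time), and the function name and its inputs are hypothetical.

```python
def predict_crossing_time(times, delays, limit):
    """Estimate when a measured path delay reaches the given limit.

    times and delays are equally long sequences of monitor readings.
    A least-squares line is fitted and extrapolated. Returns None if the
    delay is not increasing, so no crossing is predicted.
    """
    n = len(times)
    if n < 2 or n != len(delays):
        raise ValueError("need at least two matching samples")
    t_mean = sum(times) / n
    d_mean = sum(delays) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    slope = sum((t - t_mean) * (d - d_mean)
                for t, d in zip(times, delays)) / sxx
    if slope <= 0:
        return None
    intercept = d_mean - slope * t_mean
    # Solve intercept + slope * t = limit for t.
    return (limit - intercept) / slope
```

A run-time system could compare the returned time against the expected remaining run time of an application and, if the limit would be reached earlier, migrate the application or adjust frequency and voltage in advance.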
The basic monitoring structures which we have investigated and developed during the first funding phase will be an essential foundation on which we build the advanced investigations planned for the second funding phase. It shall also be investigated whether and how data from the memory subsystem can be used for reliability monitoring, e.g. by monitoring the frequency of ECC operations.
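Monitoring the frequency of ECC operations could, for example, amount to counting corrections within a sliding time window. The class below is a minimal sketch under that assumption; `EccRateMonitor`, its parameters and the idea of a fixed event threshold are hypothetical and not taken from the project.

```python
from collections import deque


class EccRateMonitor:
    """Flags a memory region when ECC corrections become too frequent."""

    def __init__(self, window_s, max_events):
        self.window_s = window_s      # length of the observation window
        self.max_events = max_events  # tolerated corrections per window
        self._events = deque()        # timestamps, assumed non-decreasing

    def _expire(self, now_s):
        # Drop corrections that have fallen out of the window.
        while self._events and self._events[0] <= now_s - self.window_s:
            self._events.popleft()

    def record_correction(self, timestamp_s):
        """Register one corrected ECC error."""
        self._events.append(timestamp_s)
        self._expire(timestamp_s)

    def is_suspicious(self, now_s):
        """True if more than max_events corrections lie in the window."""
        self._expire(now_s)
        return len(self._events) > self.max_events
```

A rising correction rate would then be one more input to the reliability prediction, alongside the delay and temperature monitors.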
A particular challenge will be to predict catastrophic failures resulting, e.g., from a hard oxide breakdown (TDDB, time-dependent dielectric breakdown) or from complete electromigration resulting in an open wire. While parametric degradations (e.g. HCI or NBTI) usually develop in a rather predictable way, depending on operating conditions and usage profile, this is more difficult to evaluate for catastrophic failures. For example, electromigration will make a signal slower, but it is challenging to determine to which degree a slowing signal results from electromigration and to which degree from, e.g., HCI, NBTI or even a degradation in the supply voltage. Once changes in signal propagation delay can be linked to electromigration, the next question is which degree of delay degradation signals an approaching catastrophic failure.
For purposes of prediction, it has to be investigated whether it is beneficial to abstract monitor data, especially ageing data, to higher levels in order to define an ageing status of, e.g., a TCPA processing element, rather than operating only on the level of individual paths. Such models will be especially useful for software developers, as they offer more convenient measures of ageing status.
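One possible form of such an abstraction is to reduce the delays of all monitored paths of a PE to a single status label derived from the remaining timing margin of the slowest path. The sketch below assumes exactly that; the function name, the three labels and the margin thresholds are hypothetical choices for illustration.

```python
def pe_ageing_status(path_delays_ns, clock_period_ns,
                     warn_margin=0.10, critical_margin=0.03):
    """Summarise path-level delay data as one ageing status per PE.

    Returns (label, margin), where margin is the remaining timing slack of
    the slowest monitored path as a fraction of the clock period.
    """
    if not path_delays_ns or clock_period_ns <= 0:
        raise ValueError("need at least one path and a positive period")
    worst_delay = max(path_delays_ns)
    margin = (clock_period_ns - worst_delay) / clock_period_ns
    if margin <= critical_margin:
        return "critical", margin
    if margin <= warn_margin:
        return "aged", margin
    return "healthy", margin
```

Software would then only need to query the label (or the margin) of a PE instead of reasoning about individual paths.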

Application of monitor data, dark-silicon management, and ensuring security:

We will investigate whether and how data provided by the monitoring system can be used for additional purposes beyond their primary task of supporting the regular resource allocation process. One such additional task might be the use of monitor data for system optimisation such as dark-silicon management (together with Project B3). Potential monitor data types must be identified together with the required accuracy and sampling frequency. The monitor system itself will also be optimised to contribute to the common goal of reduced power consumption. We will also collaborate with Project C5 to ensure that the monitor system does not compromise the security of an invasive computing system.

Results 1st funding phase

A summary of the achievements of the 1st funding phase can be found here:
Results of the 1st funding phase

Publications

[1] E. Glocker, Q. Chen, U. Schlichtmann, and D. Schmitt-Landsiedel. Emulation of an ASIC power and temperature monitoring system (eTPMon) for FPGA prototyping. Microprocessors and Microsystems, 50:90–101, May 2017.
[2] Elisabeth Glocker. Thermisches Verhalten und emuliertes online Temperatur-Monitorsystem für das FPGA-Prototyping von Multiprozessor-Architekturen. Dissertation, Chair of Technical Electronics, Department of Electrical and Computer Engineering, Technical University of Munich, Germany, 2017.
[3] Grace Li Zhang, Bing Li, Jinglan Liu, Yiyu Shi, and Ulf Schlichtmann. Design-phase buffer allocation for post-silicon clock binning by iterative learning. In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2017. accepted for publication.
[4] Shushanik Karapetyan and Ulf Schlichtmann. 20nm FinFET-based SRAM cell: Impact of variability and design choices on performance characteristics. In Int. Conf. Synthesis, Modeling, Analysis and Simulation Methods and Applications to Circuit Design (SMACD), 2017.
[5] Santiago Pagani, Lars Bauer, Qingqing Chen, Elisabeth Glocker, Frank Hannig, Andreas Herkersdorf, Heba Khdr, Anuj Pathania, Ulf Schlichtmann, Doris Schmitt-Landsiedel, Mark Sagi, Éricles Sousa, Philipp Wagner, Volker Wenzel, Thomas Wild, and Jörg Henkel. Dark silicon management: An integrated and coordinated cross-layer approach. it – Information Technology, 58(6):297–307, September 16, 2016.
[6] Ulf Schlichtmann, Masanori Hashimoto, Iris Hui-Ru Jiang, and Bing Li. Reliability, adaptability and flexibility in timing: Buy a life insurance for your circuits. In IEEE/ACM Asia and South Pacific Design Automation Conference (ASP-DAC), pages 705–711. IEEE/ACM Press, January 2016.
[7] Grace Li Zhang, Bing Li, and Ulf Schlichtmann. EffiTest: Efficient delay test and statistical prediction for configuring post-silicon tunable buffers. In Proceedings of the 53rd Annual Design Automation Conference (DAC), pages 60:1–60:6. ACM, 2016.
[8] Bing Li and U. Schlichtmann. Statistical timing analysis and criticality computation for circuits with post-silicon clock tuning elements. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 34(11):1784–1797, November 2015.
[9] E. Glocker, Q. Chen, A.M. Zaidi, U. Schlichtmann, and D. Schmitt-Landsiedel. Emulation of an ASIC power and temperature monitor system for FPGA prototyping. In 10th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), pages 1–8, June 2015.
[10] Éricles R. Sousa, Frank Hannig, Jürgen Teich, Qingqing Chen, and Ulf Schlichtmann. Runtime adaptation of application execution under thermal and power constraints in massively parallel processor arrays. In Proceedings of the 18th International Workshop on Software and Compilers for Embedded Systems (SCOPES), pages 121–124. ACM, June 2015.
[11] Elisabeth Glocker, Qingqing Chen, Asheque M. Zaidi, Ulf Schlichtmann, and Doris Schmitt-Landsiedel. Emulated ASIC Power and Temperature Monitor System for FPGA Prototyping of an Invasive MPSoC Computing Architecture. In Proceedings of the First Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014), pages 14–15, May 2014.
[12] Elisabeth Glocker, Qingqing Chen, Asheque M. Zaidi, Ulf Schlichtmann, and Doris Schmitt-Landsiedel. Emulierung eines ASIC-Leistungsverbrauchs- und Temperaturmonitorsystems für FPGA-Prototyping eines ressourcengewahren Computersystems. In 16. Workshop Analogschaltungen, Wien, Österreich, 2014.
[13] E. Glocker, S. Boppu, Q. Chen, U. Schlichtmann, J. Teich, and D. Schmitt-Landsiedel. Temperature modeling and emulation of an ASIC temperature monitor system for Tightly-Coupled Processor Arrays (TCPAs). Advances in Radio Science, 12:103–109, 2014.
[14] Dominik Lorenz, Martin Barke, and Ulf Schlichtmann. Monitoring of aging in integrated circuits by identifying possible critical paths. Journal of Microelectronics Reliability, 54:1075–1082, 2014.
Abstract: Aging of integrated circuits can no longer be neglected in advanced process technologies. Especially the strong dependence of the delay degradation of digital circuits on the workload is still an unsolved problem. If the workload is not known exactly, only a worst-case design can guarantee that the circuit works correctly during the entire specified lifetime. We propose a method that enables a better-than-worst-case design. To assure that this design still works correctly during the specified lifetime, the circuit is monitored periodically and countermeasures are taken if the circuit degrades too much. Our main contribution is an algorithm to identify all paths that might become critical during the specified lifetime. These are called possible critical paths (PCPs). This is the first approach that also considers local process variations for finding the PCPs. Without considering process variations, it is not guaranteed that all possible critical paths are found. In addition, we could reduce the number of paths that have to be monitored by 2.7× compared to a state-of-the-art approach.

[15] Nasim Pour Aryan, A. Listl, L. Heiss, C. Yilmaz, G. Georgakos, and D. Schmitt-Landsiedel. From an analytic NBTI device model to reliability assessment of complex digital circuits. In International On-Line Testing Symposium (IOLTS), pages 19–24, 2014.
[16] Elisabeth Glocker, Srinivas Boppu, Qingqing Chen, Ulf Schlichtmann, Jürgen Teich, and Doris Schmitt-Landsiedel. Temperature modeling and emulation of an ASIC temperature monitor system for Tightly-Coupled Processor Arrays (TCPAs) on FPGA. In Kleinheubacher Tagung 2013, September 2013.
[17] Martin Barke, Veit B. Kleeberger, Christoph Werner, Doris Schmitt-Landsiedel, and Ulf Schlichtmann. Analysis of Aging Mitigation Techniques for Digital Circuits Considering Recovery Effects. In edaWorkshop, May 2013.
[18] Bing Li, Ning Chen, Yang Xu, and Ulf Schlichtmann. On timing model extraction and hierarchical statistical timing analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 32(3):367–380, March 2013.
[19] Elisabeth Glocker and Doris Schmitt-Landsiedel. Modeling of Temperature Scenarios in a Multicore Processor System. Advances in Radio Science (ARS), 11:219–225, 2013.
[20] Martin Wirnshofer. Variation-Aware Adaptive Voltage Scaling for Digital CMOS Circuits, volume 41. Springer Series in Advanced Microelectronics, 2013.
[21] Martin Wirnshofer. Variation-Aware Adaptive Voltage Scaling for Digital CMOS Circuits. Dissertation, Technische Universität München, München, 2013.
[22] Martin Wirnshofer, Nasim Pour Aryan, Leonhard Heiss, Doris Schmitt-Landsiedel, and Georg Georgakos. On-line supply voltage scaling based on in situ delay monitoring to adapt for PVTA variations. Journal of Circuits, Systems and Computers, 21(08), December 2012.
[23] Bing Li, Ning Chen, and Ulf Schlichtmann. Statistical timing analysis for latch-controlled circuits with reduced iterations and graph transformations. In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 1670–1683, November 2012.
[24] N. Chen, B. Li, and U. Schlichtmann. Iterative timing analysis based on nonlinear and interdependent flipflop modelling. IET Circuits, Devices & Systems, 6(5):330–337, September 2012.
[25] Dominik Lorenz, Martin Barke, and Ulf Schlichtmann. Efficiently analyzing the impact of aging effects on large integrated circuits. In Journal of Microelectronics Reliability, volume 52, pages 1546–1552, August 2012.
[26] Sani R. Nassif, Veit B. Kleeberger, and Ulf Schlichtmann. Goldilocks failures: not too soft, not too hard. In IEEE International Reliability Physics Symposium (IRPS), April 2012.
[27] Martin Wirnshofer, Leonhard Heiss, A. N. Kakade, Nasim Pour Aryan, Georg Georgakos, and Doris Schmitt-Landsiedel. Adaptive voltage scaling by in-situ delay monitoring for an image processing circuit. In IEEE 15th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS), pages 205–208, April 2012.
[28] Christoph Knoth, Hela Jedda, and Ulf Schlichtmann. Current source modeling for power and timing analysis at different supply voltages. In Proceedings of Design, Automation and Test in Europe Conference (DATE), pages 923–928, March 2012.
[29] Elisabeth Glocker and Doris Schmitt-Landsiedel. Modeling of Temperature Scenarios in a Multicore Processor System. In Kleinheubacher Tagung 2012, 2012.
[30] Nasim Pour Aryan, Leonhard Heiss, Doris Schmitt-Landsiedel, Georg Georgakos, and Martin Wirnshofer. Comparison of in-situ delay monitors for use in adaptive voltage scaling. Advances in Radio Science (ARS), 10:215–220, 2012.
[31] Shailesh More. Aging Degradation and Countermeasures in Deep-submicrometer Analog and Mixed Signal Integrated Circuits. Dissertation, Technische Universität München, München, 2012.
[32] Christoph Knoth. Accurate Waveform-based Timing Analysis with Systematic Current Source Models. Dissertation, Technische Universität München, München, 2012.
[33] Dominik Lorenz. Aging Analysis of Digital Integrated Circuits. Dissertation, Technische Universität München, München, 2012.
[34] Dominik Lorenz, Martin Barke, and Ulf Schlichtmann. Finding possible critical paths for on-line monitoring of aging in integrated circuits. Technical report, Technische Universität München, December 2011.
[35] Martin Wirnshofer, Leonhard Heiss, Georg Georgakos, and Doris Schmitt-Landsiedel. An energy-efficient supply voltage scheme using in-situ pre-error detection for on-the-fly adaptation to PVT variations. In International Symposium on Integrated Circuits (ISIC), pages 94–97, December 2011.
[36] Ning Chen, Bing Li, and Ulf Schlichtmann. Timing modeling of flipflops considering aging effects. In International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), volume 6951 of Lecture Notes in Computer Science (LNCS), pages 63–72, September 2011.
[37] Christoph Knoth, Carsten Uphoff, Sebastian Kiesel, and Ulf Schlichtmann. SWAT: Simulator for waveform-accurate timing including parameter variations and transistor aging. In International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), volume 6951 of Lecture Notes in Computer Science (LNCS), pages 193–203, September 2011.
[38] Veit B. Kleeberger and Ulf Schlichtmann. Reliability Analysis of Digital Circuits Considering Intrinsic Noise. In Asia Symposium on Quality Electronic Design (ASQED), July 2011.
[39] Veit B. Kleeberger, Martin Barke, Christoph Werner, Doris Schmitt-Landsiedel, and Ulf Schlichtmann. A compact model for NBTI degradation and recovery under use-profile variations and its application to aging analysis of digital integrated circuits. Microelectronics Reliability, 54(6–7):1083–1089, Jun 13, 2011.
[40] Nasim Pour Aryan, Leonhard Heiss, Doris Schmitt-Landsiedel, Georg Georgakos, and Martin Wirnshofer. Comparison of in-situ delay monitors for use in adaptive voltage scaling. In Kleinheubacher Tagung 2011, 2011.
[41] Jürgen Teich, Jörg Henkel, Andreas Herkersdorf, Doris Schmitt-Landsiedel, Wolfgang Schröder-Preikschat, and Gregor Snelting. Invasive computing: An overview. In Michael Hübner and Jürgen Becker, editors, Multiprocessor System-on-Chip – Hardware Design and Tool Integration, pages 241–268. Springer, Berlin, Heidelberg, 2011.
[42] Martin Wirnshofer, Leonhard Heiss, Georg Georgakos, and Doris Schmitt-Landsiedel. A variation-aware adaptive voltage scaling technique based on in-situ delay monitoring. In IEEE 14th International Symposium on Design and Diagnostics of Electronic Circuits & Systems, pages 261–266, 2011.
[43] Jürgen Teich. Invasive algorithms and architectures. it - Information Technology, 50(5):300–310, 2008.