Reliability Aware Design and Lifetime Management of Computing Platforms

Research output: Contribution to journal › Article › Scientific › peer-review


Abstract

Meeting reliability targets at viable cost in the nanometer landscape has become a significant challenge, one that must be addressed in a unitary manner from design time to run time. To this end, we propose a holistic reliability-aware design and lifetime management framework concerned (i) at design time, with providing a reliability-enhanced adaptive architecture fabric, and (ii) at run time, with observing and dynamically managing the fabric's wear-out profile such that user-defined Quality-of-Service requirements are fulfilled, and with maintaining a full-life reliability log to be used as auxiliary information during the design of the next IC generation. After introducing our framework and the general philosophy behind it, we delve into its key components. Specifically, we first introduce design-time transistor-level and circuit-level aging models, which provide the foundation for a 4-dimensional Design Space Exploration (DSE) meant to identify a reliability-optimized circuit realization compliant with area, power, and delay constraints. Subsequently, to enable the creation of a low-cost yet accurate fabric observation infrastructure, we propose a methodology to minimize the number of aging sensors to be deployed in a circuit and to identify their locations, and introduce a sensor design able to directly capture the circuit-level amalgamated effects of concomitant degradation mechanisms. Furthermore, to make the information collected from the sensors meaningful to the run-time management framework, we introduce a circuit-level model that can estimate the overall circuit aging and predict its End-of-Life based on imprecise sensor measurements, while taking degradation nonlinearities into account. Finally, to provide more DSE reliability enhancement options, we focus on the realization of reliable processing with unreliable components, and propose a methodology to obtain Error Correction Code protected data processing units with an output error rate smaller than the gate error rate of the fabrication technology.
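
As a rough illustration of what End-of-Life prediction from imprecise sensor measurements can look like, the sketch below fits a generic power-law degradation curve to noisy readings and extrapolates to a failure limit. It is not the circuit-level model proposed in the article: the power-law form, the function names `fit_power_law` and `predict_end_of_life`, the units, the sample readings, and the 0.10 limit are all assumptions chosen for the example.

```python
import math

def fit_power_law(times, shifts):
    """Least-squares fit of shift = a * t**n, done as a line fit in log-log space.

    Assumption: degradation follows a power law in time, a common generic
    aging model; the article's own model is not reproduced here.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(s) for s in shifts]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    n = sxy / sxx                       # slope = time exponent
    a = math.exp(mean_y - n * mean_x)   # intercept = prefactor
    return a, n

def predict_end_of_life(a, n, limit):
    """Time at which the fitted curve a * t**n reaches the failure limit."""
    return (limit / a) ** (1.0 / n)

# Hypothetical sensor readings: a true curve 0.01 * t**0.2 (arbitrary units)
# distorted by a few percent of multiplicative measurement error.
times = [1.0, 10.0, 100.0, 1000.0]
noise = [1.02, 0.97, 1.03, 0.99]
shifts = [0.01 * t ** 0.2 * e for t, e in zip(times, noise)]

a, n = fit_power_law(times, shifts)
eol = predict_end_of_life(a, n, limit=0.10)  # hypothetical failure limit
print(f"fitted exponent n = {n:.3f}, predicted end of life = {eol:.0f}")
```

Fitting in log-log space keeps the nonlinear time dependence while reducing the estimate to an ordinary line fit. Because the exponent is small, minor sensor error shifts the extrapolated End-of-Life considerably, which is why the treatment of measurement imprecision matters.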

Original language: English
Article number: 8093761
Pages (from-to): 602-615
Number of pages: 14
Journal: IEEE Transactions on Emerging Topics in Computing
Volume: 8
Issue number: 3
Publication status: Published - 2020

Keywords

  • aging assessment
  • aging sensors
  • end-of-life prediction
  • IC reliability
  • lifetime management
  • reliable computation
