E-SCOUT: Efficient-Spatial Clustering-based Outlier Detection through Telemetry
Main Authors: | , , , , , |
---|---|
Format: | Conference Proceeding |
Language: | English |
Subjects: | |
Online Access: | Request full text |
Summary: | Silicon lifecycle management (SLM) is needed to ensure silicon-product reliability and quality. Prior methods utilize off-chip solutions to identify malware, diagnose bugs, and characterize silicon health metrics. These methods do not explore hardware/software co-design for SLM. In this work, we present a new method called Efficient-Spatial Clustering-based OUtlier detection through Telemetry (E-SCOUT) to monitor a chip's status through performance counters/sensors. E-SCOUT includes a compute- and memory-efficient unsupervised 32-bit floating-point outlier detection mechanism implemented on-chip (E-SCOUT edge). In addition, it enhances on-chip outlier detection through unsupervised feature ranking based on the information entropy of the telemetry features. We also provide microarchitectural recommendations to enable a hardware/software co-design of E-SCOUT edge. The proposed solution includes an end-to-end outlier-informed diagnosis model with real telemetry data (E-SCOUT cloud). All telemetry data is collected through the model-specific register space using open-source Linux tools and Intel's performance counter monitor. We capture the chip telemetry signatures of the PAMPAR benchmark suite in the presence of outlier events such as security attacks (e.g., Rowhammer and SPECTRE) and voltage droops. E-SCOUT provides effective unsupervised on-chip outlier detection with high accuracy (over 0.9) and low area and power overhead (2.2% die area overhead and 1% idle power consumption). Outlier diagnosis can identify the chip's status with classification accuracy and F1-scores that exceed 0.8. |
---|---|
ISSN: | 2378-2250 |
DOI: | 10.1109/ITC51657.2024.00044 |
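
The summary mentions entropy-based feature ranking feeding an unsupervised outlier detector. This record does not give the algorithm, so the following is only a minimal sketch under stated assumptions: features are binned into equal-width histograms, ranked from highest to lowest Shannon entropy, and a single-centroid distance threshold stands in for the paper's spatial clustering. All function names and parameters (`bins`, `k`) are hypothetical, and the code uses Python's double-precision floats rather than the 32-bit arithmetic described for E-SCOUT edge.

```python
import math
from collections import Counter


def shannon_entropy(values, bins=8):
    """Shannon entropy (bits) of one feature after equal-width binning."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant counter carries no information
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def rank_features(samples, bins=8):
    """Return feature indices sorted from highest to lowest entropy.

    Assumption: higher entropy means a more informative feature; the
    ranking direction used by the paper is not stated in this record.
    """
    columns = list(zip(*samples))
    scores = [shannon_entropy(col, bins) for col in columns]
    return sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)


def fit_centroid_detector(baseline, features, k=3.0):
    """Fit a stand-in detector: one centroid plus a distance threshold.

    The threshold is mean + k * std of the baseline distances.
    """
    centroid = [sum(row[f] for row in baseline) / len(baseline) for f in features]
    dists = [math.dist([row[f] for f in features], centroid) for row in baseline]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return centroid, mean + k * std


def is_outlier(sample, features, centroid, threshold):
    """Flag a sample whose distance to the centroid exceeds the threshold."""
    return math.dist([sample[f] for f in features], centroid) > threshold
```

In use, one would rank the telemetry columns on outlier-free samples, keep the top few, fit the detector on those columns, and then call `is_outlier` on each new telemetry sample. Restricting the detector to a few selected features is one way such a ranking could reduce compute and memory cost.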