Fault-Tolerant Real-Time Scheduling for Edge AI in US Critical Infrastructure
DOI:
https://doi.org/10.71411/ef.2025.v1i4.1375

Keywords:
Edge AI, Critical Infrastructure, Real-Time Scheduling, Fault Tolerance, Graph Reinforcement Learning, Deep Reinforcement Learning, Cyber-Physical Systems

Abstract
The integration of Edge AI promises to enhance operational efficiency within US critical infrastructure, yet it simultaneously introduces significant challenges regarding reliability and real-time determinism. While existing scheduling methods are suitable for general IoT environments, they frequently fail to distinguish between safety-critical and non-critical tasks, potentially resulting in severe consequences following computational failures. To address this challenge, this paper proposes Safety-Critical Graph Reinforcement Learning (SC-GRL), a framework specifically designed for fault-tolerant real-time scheduling in highly dynamic and adversarial edge environments. Unlike traditional reactive methods that rely solely on task migration, SC-GRL integrates an active primary-backup mechanism and permissible model degradation into the action space of a Proximal Policy Optimization agent. By modeling the edge topology as a dynamic graph with embedded criticality attributes, the SC-GRL agent learns to optimize a composite objective that prioritizes deadline satisfaction for high-criticality tasks over mere resource utilization. Extensive simulations using real-world traces from the US energy and transportation sectors demonstrate that SC-GRL significantly reduces the deadline miss rate for critical tasks under heavy loads and random node failures compared to state-of-the-art graph-based baselines. These results indicate that building truly resilient public infrastructure requires embedding criticality awareness directly into deep reinforcement learning schedulers rather than treating all tasks uniformly.
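To make the composite objective concrete, the following is a minimal sketch of a criticality-weighted reward of the kind the abstract describes, in which a deadline miss on a safety-critical task dominates both non-critical outcomes and the utilization bonus. All names, weights, and the exact functional form are illustrative assumptions, not the paper's actual formulation.

```python
# Hypothetical sketch of a criticality-weighted composite reward.
# Weights (w_critical, w_noncritical, w_util) are illustrative only.

from dataclasses import dataclass


@dataclass
class Task:
    criticality: int      # e.g. 0 = non-critical, 1 = safety-critical
    deadline_met: bool    # did the task finish before its deadline?


def composite_reward(tasks, utilization,
                     w_critical=10.0, w_noncritical=1.0, w_util=0.5):
    """Weighted deadline hits minus misses, plus a small utilization
    bonus, so that critical-task misses dominate the learning signal."""
    reward = 0.0
    for t in tasks:
        weight = w_critical if t.criticality else w_noncritical
        reward += weight if t.deadline_met else -weight
    return reward + w_util * utilization


# One critical miss outweighs two non-critical hits plus 80% utilization:
r = composite_reward(
    [Task(1, False), Task(0, True), Task(0, True)], utilization=0.8)
# r = -10 + 1 + 1 + 0.4 = -7.6
```

Under such a reward, a policy gradient agent such as PPO is pushed to spend backup replicas and degraded-model slots on high-criticality tasks first, which matches the prioritization the abstract claims.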