The Use of RL in Computer Systems

In the last few years, several papers in systems research have used reinforcement learning for scheduling, resource allocation, congestion control, and similar control problems; examples include Remy, FIRM, and AutoThrottle. I have been skeptical of many (but not all) of these works, and a colleague recently asked me to justify my skepticism. This short note summarizes my response.

Why Reinforcement Learning

Systems research tends to use reinforcement learning to replace control loops. We are of course not alone in this, and prior work (e.g., Ben Recht's paper) has used the tools of control theory to analyze RL and its behavior. The appeal of RL is the hope that it can capture details about the system and deployment that a human-produced control loop cannot, either because producing such a specialized control loop is too complex or because the loop must work in a variety of settings. RL is thus intended to give us better control loops by learning the specific setting in which it is deployed.

However, we should remember two concerns:
a. Prior work (e.g., Ben's paper) shows that the lack of a model (within which empirical observations can be fit) reduces the quality of the resulting controller.
b. Even in the presence of a model, an RL algorithm's behavior is dictated by the observations it has previously made.

These play a critical role in my concerns.

My Critique

Control loops in scheduling, resource allocation, and congestion control have long been designed to ensure that the system behaves "reasonably" for a user regardless of the behavior of other users, assuming all users have limited power (e.g., they cannot get CPU cycles without going through the scheduler, they use a TCP-friendly congestion protocol, etc.). Of course, control loops can have bugs (leading to concerns about BBR's interactions with other congestion control loops, concerns about delay-convergent congestion control, and concerns about fairness with work stealing), but the fact that control loops are not tuned to specific deployments means that their behavior and properties hold regardless of workload, failures, or other changes.

My worry is that the same cannot be said of many reinforcement-learning-based systems, including the ones I cited at the beginning of this write-up. These systems come with no static guarantees about their behavior, nor do they come with runtime logic that limits the badness that can occur in deployment under unexpected workloads (e.g., by ensuring no starvation or by limiting the number of SLO violations). Furthermore, these systems focus on showing benefits (e.g., efficiency in meeting SLO targets) under workloads with reasonable variance, with little effort to examine what happens in the bad case.
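To make the missing runtime logic concrete, here is a minimal sketch of the kind of guard these systems usually omit: a watchdog that counts SLO violations over a sliding window and reverts to a conservative, hand-written policy once a threshold is crossed. This is my own illustration, not taken from any of the cited systems; the names (learned_policy, fallback_policy, record_outcome) and the thresholds are hypothetical.

```python
from collections import deque

class GuardedController:
    """Wraps a learned policy with a runtime badness limit (illustrative sketch).

    If more than `max_violations` SLO violations are observed over the last
    `window` decisions, control reverts to a conservative hand-written policy.
    """

    def __init__(self, learned_policy, fallback_policy, window=100, max_violations=5):
        self.learned_policy = learned_policy    # hypothetical RL policy: state -> action
        self.fallback_policy = fallback_policy  # hand-written baseline: state -> action
        self.max_violations = max_violations
        self.recent = deque(maxlen=window)      # 1 = SLO violated, 0 = SLO met

    def record_outcome(self, slo_violated):
        """Record whether the last decision led to an SLO violation."""
        self.recent.append(1 if slo_violated else 0)

    def decide(self, state):
        """Use the learned policy unless recent behavior has been too bad."""
        if sum(self.recent) > self.max_violations:
            return self.fallback_policy(state)  # guard-rail: revert to the baseline
        return self.learned_policy(state)
```

The point is not this particular mechanism, but that the worst-case behavior is bounded by something a human wrote down, not by whatever the learned policy happens to have seen.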

What would fix this?

My critique does not mean that I dislike all RL-based systems, merely that I do not think the complexity of applying RL to systems lies in figuring out how to make it work, but rather in designing guard-rails to limit the system's worst-case behavior. For example, an RL-based scheduler should guarantee that, no matter what the RL algorithm returns, no process is starved for cycles.
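As a deliberately simplified illustration of such a guard-rail, the sketch below takes whatever CPU-share allocation an RL scheduler proposes and clamps it so that every runnable process receives at least a minimum share, regardless of the policy's output. The function name and the minimum-share constant are my own assumptions, not from any published system.

```python
MIN_SHARE = 0.01  # assumed floor: every runnable process gets roughly 1% of CPU

def enforce_min_share(rl_allocation, min_share=MIN_SHARE):
    """Clamp an RL-proposed CPU allocation so no process is starved.

    `rl_allocation` maps process id -> proposed CPU share. Every share is
    raised to at least `min_share`, then the shares are renormalized to sum
    to 1, so the RL policy cannot starve a process no matter what it returns.
    """
    clamped = {pid: max(share, min_share) for pid, share in rl_allocation.items()}
    total = sum(clamped.values())
    return {pid: share / total for pid, share in clamped.items()}

# Example: the RL policy tries to give process 3 nothing at all.
proposed = {1: 0.7, 2: 0.3, 3: 0.0}
print(enforce_min_share(proposed))  # process 3 still receives a small share
```

A guard like this costs almost nothing at runtime and leaves the RL policy free to optimize within the bounds it defines.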

Unfortunately, there is a perception that program committees and others might not value guard-rails as much as great performance; after all, it is hard to produce a good graph showing guard-rails at work. But my critique is a reflection of my values and beliefs, rather than a requirement for our community.

Field: computer systems
Type: critiques
Date: 2024-04-13
Tags: RL, Reinforcement Learning, Systems, Critiques
Audience: Researchers