Concrete Problems in AI Safety is a 2016 paper by Amodei et al. [1], a collaboration between researchers from leading industry labs (Google Brain and OpenAI) and top academic institutions (Stanford University and UC Berkeley). Published during the emergence of advanced machine learning paradigms, such as deep reinforcement learning, it has since become a cornerstone of AI safety research. The paper outlines challenges in aligning increasingly intelligent systems with human intent that were unsolved but actionable at the time, and it has significantly influenced subsequent studies and industry best practices for safely deploying powerful models. In this blog post, I aim to explore the paper’s key ideas, highlight their significance to modern safety research from a retrospective perspective, and reflect on how it has shaped my academic journey.

Summary

The authors identify five major problem areas that may occur when training and deploying a powerful machine learning model. Each challenge is illustrated with reference to a recurring toy scenario: a cleaning robot of human-level intelligence tasked with keeping an office tidy. Note that having human-level intelligence does not imply human-like ethics or worldviews. The problems are as follows:

  • Avoiding Negative Side Effects: The model is indifferent to all factors not explicitly defined in its reward function, and may be willing to cause arbitrary damage to undefined aspects of the environment for a marginally larger score. For instance, the cleaning robot might not care about knocking over a vase if it is only rewarded for how clean a countertop is. How can we design systems that avoid collateral damage while achieving their goals without having to specify a reward for every possible factor in their surroundings?
  • Avoiding Reward Hacking: In an imperfect training environment or with a poorly defined reward function, the model may optimize for exploiting loopholes instead of properly learning the task. In the cleaning robot scenario, it might learn to hide trash under a rug if only instructed to tidy visible messes. How can we develop training processes and objectives that are more resilient to manipulation or gaming? (A toy sketch of this loophole appears just after this list.)
  • Scalable Oversight: Training powerful agents typically requires large amounts of human oversight to ensure behavior aligns with human values, which may become too expensive to evaluate as task and model complexity grow. The cleaning robot shouldn’t have to ask a human supervisor about every object in the office before learning that it should treat a lost cell phone differently from a lost candy wrapper. How can we strategically incorporate curated feedback to help models generalize on nuanced trends, even when supervision is limited?
  • Safe Exploration: Experimentation is essential to helping models improve, leveraging machine learning’s ability to uncover patterns beyond human intuition, but unrestricted exploration can have harmful consequences. Although the cleaning robot should be free to test whether different sweeping techniques improve its efficiency, pouring bleach over sensitive electronics to try to clean them should be explicitly disallowed. How can we define clear boundaries for unsafe exploratory actions without hindering the innovation that experimentation enables?
  • Robustness to Distributional Shift: Machine learning systems typically perform best when training and testing conditions match; however, models deployed in the real world are almost guaranteed to encounter novel environments and scenarios. For example, the cleaning robot should adapt to new office layouts and recognize that techniques for cleaning a hardwood floor may not be applicable when facing carpeted surfaces. How can we design models that behave safely and robustly under significant distributional shifts, without experiencing substantial performance drops or exhibiting erratic behavior?
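
To make the reward hacking loophole above concrete, here is a toy sketch of my own (not from the paper) in which the robot’s reward depends only on how clean the office looks. The actions, numbers, and names are invented purely for illustration; the point is simply that a greedy optimizer picks the cheapest action that satisfies the stated reward, not the intended one.

```python
# Toy illustration of reward hacking: the reward measures only *visible* mess,
# so hiding trash scores as well as disposing of it properly, at lower cost.
# All actions and numbers are hypothetical, invented for this sketch.

ACTIONS = {
    # action: (visible_mess_removed, effort_cost)
    "put_trash_in_bin": (1.0, 0.3),      # intended behavior: mess is actually gone
    "hide_trash_under_rug": (1.0, 0.1),  # loophole: mess only *looks* gone
    "do_nothing": (0.0, 0.0),
}

def reward(action: str) -> float:
    """Naive reward: visible cleanliness minus effort. Nothing penalizes hidden trash."""
    visible_mess_removed, effort_cost = ACTIONS[action]
    return visible_mess_removed - effort_cost

best_action = max(ACTIONS, key=reward)
print(best_action)  # -> "hide_trash_under_rug": the agent games the metric
```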

The paper further groups these problems into three broad categories based on where things go wrong: problems with specifying the objective function (negative side effects and reward hacking), objectives that are too expensive to evaluate frequently (scalable oversight), and undesirable behavior during the learning process itself (safe exploration and distributional shift). After outlining these main challenges, the authors analyze each in greater detail, providing more nuanced definitions, technically grounded examples, and consolidated reviews of existing research where available. The paper then outlines concrete approaches to addressing each problem, offering actionable directions for future work to explore.

The paper concludes with an overview of the communities and organizations that were actively contributing to AI safety research at the time, and it highlights other related issues in the field, such as security (designing AI systems to be resistant against attacks from malicious actors) and fairness (ensuring that AI does not discriminate).

Critical Analysis

One of the paper’s core strengths is its accessibility. By framing each problem through the recurring toy example of the cleaning robot, the authors present these abstract challenges in a way that is both intuitive and memorable, even for readers unfamiliar with machine learning. This narrative style also avoids overwhelming the reader with technical jargon, making the discussion approachable not just for safety researchers but also for AI practitioners and policymakers. In doing so, it effectively bridges the gap between theoretical safety research, practical system design, and broader questions of AI governance and policy.

Another major strength is the paper’s pragmatism. Instead of sounding overly alarmist or invoking speculative, science-fiction-esque fears about superintelligence, the authors treat AI safety as an engineering problem rooted in empirical evidence. Their arguments are logical extrapolations from failure modes observed in existing reinforcement learning systems, corroborated by citations of prior research. This approach lends credibility to the field by demonstrating that AI safety is not just a concern for hypothetical future systems, but a concrete design challenge with immediate implications.

However, in striving for accessibility through referencing the cleaning robot scenario, the paper inadvertently oversimplifies the scale and complexity of deploying machine learning systems in the real world. By forgoing technical depth to discuss a broader range of high-level problems, the paper overlooks the implications of these safety challenges in the high-stakes fields where they matter most. In domains like finance, healthcare, and autonomous driving, the consequences of misaligned AI behavior are far more severe and unpredictable than a robot knocking over a vase. While the toy example is pedagogically advantageous, it risks leading readers to underestimate the societal and technical intricacies that define real-world AI safety.

Furthermore, while the paper emphasizes concrete issues, it does not always provide equally concrete solutions. Many of the proposed approaches, particularly those addressing challenges in scalable oversight and safe exploration, are framed as open problems with limited guidance on practical implementation. While this is understandable for a position paper and does not detract from the authors’ arguments, it does mean that translating their insights into improved, safer AI systems requires significant additional research and experimentation.

Retrospection

Concrete Problems in AI Safety has had a sweeping and lasting influence on the field since its publication. According to Google Scholar, it has received 3,661 citations as of this writing, reflecting its outsized role in shaping subsequent academic inquiry. The rise of safety-oriented research initiatives at leading industry labs such as OpenAI, DeepMind, and Anthropic further corroborates its relevance.

Subsequent Research

One of the first implementations of its proposed directions came with Deep reinforcement learning from human preferences by Christiano et al. in 2017. Also co-authored by Amodei, the first author of Concrete Problems, this paper addresses scalable oversight in practice by training a reward model on human preference data and using it to guide reinforcement learning agents in tasks such as simulated robot locomotion [2]. By demonstrating that complex behaviors can be learned from relatively sparse human feedback (totaling about an hour of human time), the paper presents a proof of concept for oversight methods that eliminate the need for carefully designed reward functions, leveraging AI’s ability to predict human inclinations.
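
To make this mechanism concrete, below is a minimal sketch of the preference-based objective the paper describes: a reward model is trained so that the segment a human preferred receives the higher total predicted reward, via a Bradley-Terry-style softmax over the two segments. The RewardModel architecture, tensor shapes, and names here are my own illustrative assumptions, not the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Hypothetical per-step reward predictor (the paper uses task-specific networks)."""
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, timesteps, obs_dim) -> summed predicted reward per segment
        return self.net(segment).squeeze(-1).sum(dim=-1)

def preference_loss(model: RewardModel, seg_a, seg_b, prefer_a):
    """Cross-entropy over the softmax of the two segments' summed rewards:
    the human-preferred segment should receive the higher total reward."""
    logits = torch.stack([model(seg_a), model(seg_b)], dim=-1)  # (batch, 2)
    labels = (~prefer_a).long()  # 0 if segment A was preferred, 1 if segment B
    return F.cross_entropy(logits, labels)

# Example: 8 comparison pairs of 20-step segments with 12-dimensional observations.
model = RewardModel(obs_dim=12)
seg_a, seg_b = torch.randn(8, 20, 12), torch.randn(8, 20, 12)
prefer_a = torch.randint(0, 2, (8,), dtype=torch.bool)
loss = preference_loss(model, seg_a, seg_b, prefer_a)
loss.backward()  # compute gradients; an optimizer step would then update the reward model
```

The trained reward model then stands in for the missing hand-written reward function: the reinforcement learning agent optimizes its predictions rather than querying the human at every step.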

Two years later, researchers at OpenAI, again including Amodei, extended this trajectory to the training of GPT-2. Their work introduced what is now known as reinforcement learning from human feedback (RLHF), which applies the same principle: a human-preference-aligned reward model is trained and then used to fine-tune the language model’s outputs to adhere more closely to human judgment [3]. In retrospect, this continuity shows how the supervision challenges first highlighted in Concrete Problems seeded an entire line of research, growing from reinforcement learning agents in simulated environments to the techniques that define alignment in state-of-the-art LLMs deployed in the real world.
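
The heart of that recipe can be sketched in a few lines: the reward model’s score is combined with a KL penalty that keeps the fine-tuned policy close to the pre-trained reference model, and the result is maximized with a policy-gradient algorithm such as PPO. The function name, shapes, and beta value below are illustrative assumptions on my part, not OpenAI’s actual code.

```python
import torch

def rlhf_sequence_reward(rm_score: torch.Tensor,
                         logp_policy: torch.Tensor,
                         logp_ref: torch.Tensor,
                         beta: float = 0.02) -> torch.Tensor:
    """KL-penalized reward for RLHF-style fine-tuning.

    rm_score:    (batch,)          score from the human-preference reward model
    logp_policy: (batch, seq_len)  log-probs of sampled tokens under the tuned policy
    logp_ref:    (batch, seq_len)  log-probs of the same tokens under the frozen reference model
    """
    kl_estimate = (logp_policy - logp_ref).sum(dim=-1)  # per-sequence KL estimate
    return rm_score - beta * kl_estimate  # seek high reward without drifting far from the reference
```

Without the penalty term, the policy can over-optimize the learned reward and drift into degenerate outputs, which is itself an instance of the reward hacking problem the 2016 paper warned about.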

Emerging Problems in Modern Systems

However, as models have grown in scale, capability, and generality, especially with the advent of transformer architectures and LLMs, new safety challenges have emerged that complicate the problems the paper identified. The paper focuses on the paradigm of single-model reinforcement learning, especially the design and evaluation of reward functions. This framing reflected the state of the field in 2016, when deep reinforcement learning agents were the most promising direction for research. Yet many modern AI systems are trained via supervised or self-supervised approaches, or are deployed in multi-agent or embedded environments. In these settings, the same issues may manifest in fundamentally different forms, making it challenging to apply the paper’s insights directly without significant reinterpretation.

For example, scalable oversight becomes much harder to define in foundation models like LLMs. These models are typically pre-trained with self-supervised objectives, such as next-token prediction, on massive, unstructured datasets of internet text—a scenario vastly different from the idealized, controlled environments used in the paper’s examples. In such cases, there is no clear reward function during pre-training and no natural point of intervention for human supervision. As a result, scalable oversight shifts away from being a question of feedback efficiency to one of post hoc behavioral fine-tuning using techniques like RLHF or instruction tuning. However, these processes are imperfect proxies for genuine alignment and operate at a much higher level of abstraction, raising further concerns of inner misalignment and alignment faking. These developments were unforeseeable in 2016, before the self-supervised pre-training of transformer models had reshaped the landscape.

Similarly, distributional shift also applies differently in LLMs than originally envisioned by the authors. Instead of concerning changes between training and deployment environments, it now becomes a mismatch between training objectives and downstream usage. While LLMs are trained primarily to predict the next token in a sequence, in practice they are deployed in chat agents like ChatGPT to answer user queries in a goal-oriented and conversational way. When these queries concern specialized, high-stakes domains, such as healthcare or therapy, this distributional shift can produce confidently stated but dangerously misleading advice. A recent Stanford study by Moore et al. finds that LLM chatbots used for therapy not only enable harmful behavior, such as encouraging suicidal and delusional ideation, but also stigmatize patients suffering from conditions like alcohol dependence and schizophrenia [4]. Another case study by Eichenberger et al. describes a patient who developed bromide poisoning after following ChatGPT’s suggestion to replace dietary sodium chloride with sodium bromide [5]. These examples underscore how distributional shifts in LLM deployment are not only subtle and high-dimensional but also difficult to anticipate, complicating the task of evaluating robustness and raising critical questions about what safe generalization should look like in practice. It is worth noting that these failures illustrate risks that only became salient with the growing prevalence of LLMs, years after the paper’s publication.

Deeper Structural Challenges

More critically, precisely because of the paper’s influence, its call to focus on concrete, short-term problems arguably delayed research into the deeper structural issues that have since become central to understanding alignment. Notably, the paper assumes that reward functions are the primary interface for alignment (outer alignment) and that aligning the reward is equivalent to aligning the model’s intent. This assumption was natural within the reinforcement-learning paradigm that dominated at the time, but it now appears incomplete. Subsequent research has uncovered issues like inner misalignment (when a model’s learned objective differs from the reward function’s intended objective) and mesa-optimization (when a model learns an internal optimizer that develops its own goal, which may diverge from the original training goal). Many of the issues defining current alignment research have only come to light in recent years, long after the original paper’s publication.

For instance, Greenblatt et al. demonstrated in 2024 that powerful LLMs may engage in alignment faking: selectively complying with their training objective under human supervision while preserving ulterior motives when unmonitored [6]. Similarly, interpretability research by OpenAI in 2025 shows that penalizing LLMs for undesirable behaviors detected in their chain-of-thought can lead them to obfuscate their reasoning, masking their true intentions even as they continue to act on them [7]. These phenomena directly corroborate the theory of mesa-optimization, a term coined by Hubinger et al. in 2019 [8], and pose serious existential questions about the trustworthiness of increasingly intelligent modern systems. Even so, such risks lay far beyond what the field could have anticipated in 2016, when researchers had not yet built models capable of deceptive reasoning.

Retrospection Conclusion

Ultimately, Concrete Problems in AI Safety remains a landmark in the field, but its legacy is best understood as the beginning of a conversation rather than its conclusion. Published at a time when widespread AI use was still a distant, optimistic prospect, the paper could not have anticipated the rapid rise of LLMs, nor the sweeping societal impacts they now have on everyday life. As ever more compute is poured into training larger and more powerful models, their growing complexity has exposed new cracks in our understanding of alignment, leaving the paper’s original framing of these problems increasingly dated. Yet this does not reflect a shortcoming of the paper itself so much as the unprecedented speed of progress in AI capabilities. Nor does it diminish the paper’s contributions: its pragmatic focus helped inspire a well-defined new field of research and establish AI safety as an urgent scientific discipline. With the prospect of AGI and its existential risks no longer speculative but increasingly pressing, the challenge for aspiring safety researchers is not only to address the existing failure modes but also to anticipate the deeper structural issues of systems whose capabilities we do not yet fully understand.

Personal Reflection

I was first introduced to AI safety by YouTuber Robert Miles, whose explanation of this paper piqued my interest in technical alignment research and helped me realize both the immediate and the existential stakes of the field. His accessible style and humorous examples, much like the paper itself, made the ideas far less intimidating and encouraged me to explore them in more depth. What struck me most was that the paper introduced the problems as concrete research directions, ones that could be engaged with directly, even by a beginner to the field like me. This framing inspired me to think about how I might set up simplified environments or experiments of my own to witness and explore these problems firsthand.

However, nine years after its publication and more than a year after my first introduction to AI safety, I see the paper differently. Many of its original examples feel dated in the age of transformer-based LLMs, and I would argue that its pragmatic framing leaves open deeper structural issues that only later research began to expose. Yet this shift in perspective does not undermine the paper’s value to me. It was the first time I saw technical alignment as a well-defined, actionable research agenda.

I reflect on how this paper influenced my journey with an ancient Chinese proverb in mind: “师父领进门,修行在个人,” which translates to “Teachers open the door; you enter by yourself.” From today’s perspective, I can see that the paper was never a be-all and end-all solution to alignment, but rather the first of many steps revealing the extent of the challenge. Like a good teacher, the paper’s role was never to hand-hold me through every step of my academic journey, but to open the door to a field where I could begin carving my own research path. While this paper is no longer the cutting edge of safety research, it remains the spark that inspired me to pursue technical alignment and interpretability, areas I hope to one day contribute to. As LLMs continue to dominate the AI landscape and bring us closer to achieving AGI, I aim to gain a deeper understanding of how these systems reason, their limitations as proxies for actual intelligence, and how we can bridge the gap between inner and outer alignment. The next steps of exploration are no longer the paper’s to provide, but mine to take.

Conclusion

Concrete Problems in AI Safety has had a significant and lasting impact on the trajectory of AI safety research. It provided a well-defined framework, rooting the field in practical, approachable problem statements that seeded many early research agendas. However, in hindsight, its emphasis on pragmatism and short-term issues limited its ability to anticipate the deeper structural risks that have since become central to alignment, such as inner misalignment and mesa-optimization. For me, this paper was my entry point into the field, inspiring my interest in technical alignment research and sparking my curiosity to explore the open questions it left behind.

References


  1. D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete problems in AI safety,” 2016, arXiv preprint arXiv:1606.06565. Available: https://arxiv.org/abs/1606.06565 ↩︎

  2. P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” 2017, arXiv preprint arXiv:1706.03741. Available: https://arxiv.org/abs/1706.03741 ↩︎

  3. D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving, “Fine-tuning language models from human preferences,” 2020, arXiv preprint arXiv:1909.08593. Available: https://arxiv.org/abs/1909.08593 ↩︎

  4. J. Moore, D. Grabb, W. Agnew, K. Klyman, S. Chancellor, D. C. Ong, and N. Haber, “Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers,” 2025, arXiv preprint arXiv:2504.18412. Available: https://arxiv.org/abs/2504.18412 ↩︎

  5. A. Eichenberger, S. Thielke, and A. V. Buskirk, “A case of bromism influenced by use of artificial intelligence,” Annals of Internal Medicine: Clinical Cases, vol. 4, no. 8, Aug. 2025, doi: 10.7326/aimcc.2024.1260. Available: https://www.acpjournals.org/doi/10.7326/aimcc.2024.1260 ↩︎

  6. R. Greenblatt, C. Denison, B. Wright, F. Roger, M. MacDiarmid, S. Marks, J. Treutlein, T. Belonax, J. Chen, D. Duvenaud, A. Khan, J. Michael, S. Mindermann, E. Perez, L. Petrini, J. Uesato, J. Kaplan, B. Shlegeris, S. R. Bowman, and E. Hubinger, “Alignment faking in large language models,” 2024, arXiv preprint arXiv:2412.14093. Available: https://arxiv.org/abs/2412.14093 ↩︎

  7. L. Gao, Z. Dou, M. Y. Guan, A. Madry, W. Zaremba, J. Pachocki, and D. Farhi, “Monitoring reasoning models for misbehavior and the risks of promoting obfuscation,” 2025, arXiv preprint arXiv:2503.11926. Available: https://arxiv.org/abs/2503.11926 ↩︎

  8. E. Hubinger, C. van Merwijk, V. Mikulik, J. Skalse, and S. Garrabrant, “Risks from learned optimization in advanced machine learning systems,” 2019, arXiv preprint arXiv:1906.01820. Available: https://arxiv.org/abs/1906.01820 ↩︎