Our research investigates the dynamics of intelligence across multiple scales—from individual agents to complex, civilization-scale systems. We aim to develop the foundational understanding and conceptual frameworks necessary to navigate the systemic risks posed by advanced AI and ensure that the evolution of human-AI ecosystems remains aligned with human values.
Our research seeks to build towards the following goals:
A primary focus of our research addresses the systemic risks posed by the incremental advancement and proliferation of AI. While much AI safety discourse focuses on abrupt takeover scenarios, we investigate the dynamics of 'gradual disempowerment.' This describes a process where human influence over crucial societal systems—including the economy, culture, and governance—is irreversibly eroded, even without coordinated power-seeking by AI systems.
Our work aims to develop methods for measuring disempowerment and to explore technical and governance strategies for maintaining meaningful human agency.
We are moving beyond the study of individual AI agents toward understanding the complex dynamics of entire ecosystems of intelligence. This involves analyzing interactions between diverse AI systems (such as LLMs), human-AI teams, and existing human institutions.
We seek to build a systematic understanding of which properties of intelligent behaviour are universal, convergent, or local across these ecosystems, scales, and substrates. Drawing on traditions including evolutionary biology, cognitive science, statistical physics, economics, ecology, and cybernetics, we are developing a framework for "AI Sociology."
This approach investigates the emergent properties of these complex adaptive systems, the principles governing their collective behavior, and the factors that shape their trajectories. This understanding is essential for characterizing civilization-scale multi-agent dynamics, anticipating systemic failures, and designing protocols that ensure long-term safety.
Many systems have several levels of analysis at which their behaviour can be sensibly described: we can usefully model a company as a single agent, or as a collection of employees; a nation-state, or its political class. Over the past 100 or so years, a large body of mathematics has been developed (most of it under the name of Game Theory) to describe the relations between agents at the same level of analysis, but we lack robust frameworks for understanding interactions across different scales.
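For contrast, the established same-level formalism is easy to state. The sketch below gives the textbook definition of a normal-form game; the notation is standard and purely illustrative, not an ACS formalism.

```latex
% A normal-form game: a fixed set of agents N, each with an action set A_i
% and a payoff function u_i over joint action profiles.
\[
  G = \bigl( N,\ (A_i)_{i \in N},\ (u_i)_{i \in N} \bigr),
  \qquad
  u_i : \prod_{j \in N} A_j \to \mathbb{R}.
\]
% Every agent in N sits at the same level of analysis: nothing in G can
% express that some subset of agents jointly constitutes a further agent
% with beliefs and goals of its own. That cross-level structure is what a
% theory of hierarchical agency needs to capture.
```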
At ACS, we aim to develop a conceptual framework to reason about the relations between agents at different levels of analysis, i.e., between superagents and their subagents. We seek a formalism for understanding both upward and downward intentionality—a ‘theory of hierarchical agency’. As AI systems become integral components of these structures, this understanding is critical for analyzing how influence is distributed and ensuring that governance remains robust and aligned.
Our current research in this direction is inspired by ideas originating in the field of active inference, which we are extending to multi-agent settings in order to model complex systems, cooperation, and bounded group rationality.
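For reference, the core single-agent quantity in active inference is the variational free energy. The sketch below states it in standard notation; it is not itself our multi-agent formalism, which couples several such objectives across interacting agents (as in the free-energy equilibria work listed in the publications below).

```latex
% Variational free energy of a single agent with observations o, latent
% states s, generative model p(o, s), and approximate posterior q(s).
\[
  F[q] \;=\; \mathbb{E}_{q(s)}\bigl[\ln q(s) - \ln p(o, s)\bigr]
        \;=\; D_{\mathrm{KL}}\bigl[q(s)\,\|\,p(s \mid o)\bigr] - \ln p(o).
\]
% Minimising F over q drives q(s) toward the true posterior p(s | o) while
% bounding the surprise -ln p(o); multi-agent extensions couple several
% such objectives across interacting agents.
```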
‘Alignment’ and ‘safety’ are properties defined at the interface between a system and its environment, rather than properties intrinsic to a system in isolation. It therefore matters that we understand the structural and functional properties both of the system to be aligned and of what we are aligning it to. Yet there are several plausible candidates for the latter, ranging from individual humans to human groups. Moreover, humans are not accurately described by fixed or latent utility and belief functions, contrary to what the classical rational-agent model suggests. Overall, current theorising about the appropriate targets of alignment and their structural properties is inadequate for understanding the subtleties that arise when tackling the problem of AI alignment.
ACS works on developing AI alignment proposals that are based on a realistic understanding of how humans reason and value in practice, and that recognize the hierarchical relationships between these target systems. We see this as critical to solving the problem, not something that can be postponed or delegated to AI systems.
We seek to build systematic understanding of which properties of intelligent behaviour are universal, convergent or local across a wide range of systems, scales and substrates. This understanding forms the basis for asking the right sorts of questions about the risks, potentials and design imperatives of advanced AI systems.
To this end, we draw on a range of sophisticated thinking that has already been done in evolutionary biology, cognitive science, statistical physics, economics, ecology, cybernetics, and information theory. By integrating and building on these traditions, we aim to better understand the trajectory space for advanced AI systems.
Walter Laurito, Benjamin Davis, Peli Grietzer, Tomáš Gavenčiak, Ada Böhm, Jan Kulveit: AI-AI Bias: Large Language Models Favor Communications Generated by Large Language Models. July 2025, PNAS. Previous versions at arXiv, ICML 2024 HADMS workshop
Lewis Hammond et al. (incl. Tomáš Gavenčiak, Jan Kulveit): Multi-Agent Risks from Advanced AI. February 2025, arXiv, announcement, NeurIPS 2023 Multi-Agent Security workshop keynote
Jan Kulveit, Raymond Douglas, Nora Ammann, Deger Turan, David Krueger, David Duvenaud: Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development. January 2025, arXiv, website
Tomáš Gavenčiak, David Hyland, Lancelot Da Costa, Michael J. Wooldridge, Jan Kulveit: Path Divergence Objective: Boundedly-Rational Decision Making in Partially Observable Environments. December 2024, NeurIPS 2024 NeuroAI workshop (poster)
Lancelot Da Costa, Tomáš Gavenčiak, David Hyland, Mandana Samiei, Cristian Dragos-Manta, Candice Pattisapu, Adeel Razi, Karl Friston: Possible principles for aligned structure learning agents. October 2024, arXiv
David Hyland, Tomáš Gavenčiak, Lancelot Da Costa, Conor Heins, Vojtech Kovarik, Julian Gutierrez, Michael J. Wooldridge, Jan Kulveit: Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents. July 2024, ICML 2024 MHFAIA workshop (poster), and 5th International Workshop on Active Inference (best poster award)
Raymond Douglas, Andis Draguns, Tomáš Gavenčiak: Mitigating the Problem of Strong Priors in LMs with Context Extrapolation. January 2024, arXiv
Nora Ammann, Clem von Stengel: A Naturalised Account of Planning in Intelligent Systems. July 2023, Proceedings of ALIFE 2023
Hardik Rajpal, Clem von Stengel, Pedro A. M. Mediano, Fernando E. Rosas, Eduardo Viegas, Pablo A. Marquet, Henrik J. Jensen: Quantifying Hierarchical Selection. November 2023, arXiv
Nora Ammann: Value Malleability and its implication for AI alignment. December 2023, NeurIPS 2023 MP2 workshop
Jan Kulveit, Clem von Stengel, Roman Leventov: Predictive Minds: LLMs As Atypical Active Inference Agents. December 2023, NeurIPS 2023 SoLaR workshop, arXiv