A large language model framework for sample-free population synthesis


RESEARCH ARTICLE

Michael Jones*, Richard Dawson, Jon Mills

School of Engineering, Newcastle University, Newcastle, United Kingdom

OPEN ACCESS

Citation: Jones M, Dawson R, Mills J (2026) A large language model framework for sample-free population synthesis. PLoS One 21(6): e0341704. https://doi.org/10.1371/journal.pone.0341704

Editor: Mohammad Salah Hassan, A’Sharqiyah University, OMAN

Received: January 9, 2026
Accepted: May 1, 2026
Published: June 2, 2026

Peer Review History: PLOS recognizes the benefits of transparency in the peer review process; therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. The editorial history of this article is available here: https://doi.org/10.1371/journal.pone.0341704

Copyright: © 2026 Jones et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

Synthetic populations provide the demographic foundations for agent-based models in transport, public health, disaster management and other sectors, enabling credible representations of individual characteristics and behaviours. Many established synthesis methods rely on census microdata; however, such data are infrequently collected, privacy-restricted, and usually available only as small public-use samples at coarse geographic scales. This paper introduces a sample-free framework that uses a large language model (LLM) to generate complete, household-structured populations directly from aggregate demographic data. The framework is LLM agnostic and follows a multi-step process: objective definition, input preparation, LLM selection, and synthetic household generation.
No model fine-tuning is required, meaning that data requirements are low and the framework is easily accessible. Population synthesis is formulated as an iterative prompting process in which an LLM generates households guided by the discrepancies between synthetic and target distributions. The model draws on prior knowledge encoded during pre-training to propose plausible attribute combinations, resulting in both statistical alignment and structural feasibility. In a global evaluation covering 109 countries, the framework achieved very close alignment on simpler marginals such as gender (SRMSE: 0.003) and household size (SRMSE: 0.026), while more structurally complex attributes such as household composition (SRMSE: 0.062) and age (SRMSE: 0.128) were also reproduced with good accuracy. These results were supported by detailed case studies in Newcastle upon Tyne (UK) and Dar es Salaam (Tanzania). The principal contribution of the framework is to enable the construction of coherent household-structured populations when detailed microdata are unavailable, expanding the applicability of agent-based modelling in data-constrained settings.

Data availability statement: All data and code supporting this study are publicly available. Processed marginals, prompts, configuration files, and example outputs are archived in the Newcastle University data repository (https://doi.org/10.25405/data.ncl.31830205). The population generation library is available at https://github.com/MJones235/LLM-Population-Generator/releases/tag/v1.0.0 and data collection and processing scripts at https://github.com/MJones235/Synthetic-Population-Experiments/releases/tag/v1.0.0.

Funding: MJ was funded by the EPSRC Centre for Doctoral Training (CDT) in Geospatial Systems (ref EP/S023577/1). The funder did not play any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. URL: https://gtr.ukri.org/projects?ref=EP/S023577/1.

Competing interests: The authors have declared that no competing interests exist.

PLOS One | https://doi.org/10.1371/journal.pone.0341704 | June 2, 2026

1 Introduction

Urban simulations are frequently founded on agent-based models (ABMs), which represent people and households as individual agents and trace how their behaviours and interactions generate system-level outcomes. To initialise these models, researchers must first construct a population: a statistically consistent, digital representation of people, their demographic attributes, and household groupings. Synthetic populations are particularly valuable where detailed person-level records (microdata) are unavailable, incomplete, or protected for privacy reasons. They provide the demographic foundations that allow ABMs to capture heterogeneity – for instance, differences in age, mobility, or caring responsibilities – and to explore how these differences shape collective outcomes [1]. This approach has been used in many domains, including urban mobility [2–4], public health [5–9] and disaster management [10]. Across these applications, the credibility of the results depends directly on how well the synthetic population reflects the target environment.

Two primary types of data are used to construct synthetic populations: aggregate data, which provide marginal or joint distributions of demographic variables, and microdata, which contain individual-level records [11]. National censuses are the most comprehensive provider of both, but they are collected infrequently, usually on a decadal cycle, as recommended by the UN [12]. Moreover, access to person-level records is often restricted for privacy protection: only small public-use samples of 1–5% of the population are usually available, and these are restricted to coarse geographic units [13].
Where available, auxiliary surveys such as Demographic Health Surveys or Multiple Indicator Cluster Surveys can supply aggregate data, yet these sources differ in coverage, definitions and spatial resolution [14,15]. The resulting inconsistencies make it difficult to reconcile information across datasets and scales. Consequently, the central challenge is to construct household-consistent synthetic populations when the underlying data are incomplete, contradictory, or missing at fine geographic levels.

In population synthesis, the goal is to create a population whose joint distribution of demographic and household characteristics reflects the real world as closely as possible. Methods must balance three core demands: statistical fidelity (the ability to reproduce observed aggregates); feasibility (the avoidance of implausible household structures); and diversity (full coverage of possible values, including rare combinations) [16]. Early approaches focused on adjusting microdata samples to match known marginal totals, while more recent models employ probabilistic or generative techniques to infer unseen combinations of attributes. Existing methods require a data-rich environment with extensive microdata and cross-tabulations to be effective. (...truncated)
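The statistical-fidelity demand above, and the SRMSE values quoted in the abstract, can be illustrated with a small sketch. The exact SRMSE formula used in the paper is not given in this excerpt, so the code assumes one common definition: the root mean squared error across category shares divided by the mean target share. The function names, the household-size categories and the feedback text format are all hypothetical and do not reflect the API of the authors' library; the sketch only shows how a gap between synthetic and target marginals might be measured and summarised for use in a subsequent prompt.

```python
from collections import Counter
from math import sqrt


def proportions(values, categories):
    """Share of each category among the observed values."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    return {c: counts[c] / len(values) for c in categories}


def srmse(synthetic, target):
    """Assumed definition: RMSE across bins divided by the mean target share."""
    bins = list(target)
    mse = sum((synthetic.get(b, 0.0) - target[b]) ** 2 for b in bins) / len(bins)
    mean_target = sum(target.values()) / len(bins)
    return sqrt(mse) / mean_target


def discrepancy_feedback(synthetic, target):
    """One line per category; a positive gap means over-representation."""
    lines = []
    for b in target:
        current = synthetic.get(b, 0.0)
        gap = current - target[b]
        lines.append(
            f"{b}: target {target[b]:.1%}, current {current:.1%}, gap {gap:+.1%}"
        )
    return "\n".join(lines)


# Hypothetical target marginal and household sizes generated so far.
target_size = {"1": 0.30, "2": 0.35, "3": 0.20, "4+": 0.15}
generated = ["1", "2", "2", "3", "1", "2", "4+", "2", "1", "2"]

current = proportions(generated, target_size)
print(f"SRMSE: {srmse(current, target_size):.3f}")
print(discrepancy_feedback(current, target_size))
```

In an iterative scheme of the kind the abstract describes, a summary such as the one returned by `discrepancy_feedback` could be appended to the next prompt so that under-represented categories are favoured in the next batch of households. How the authors actually encode this feedback is not specified in this excerpt.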