A large language model framework for sample-free population synthesis
RESEARCH ARTICLE
Michael Jones*, Richard Dawson, Jon Mills
School of Engineering, Newcastle University, Newcastle, United Kingdom
OPEN ACCESS
Citation: Jones M, Dawson R, Mills J (2026) A large language model framework for sample-free population synthesis. PLoS One 21(6): e0341704. https://doi.org/10.1371/journal.pone.0341704
Editor: Mohammad Salah Hassan, A’Sharqiyah University, OMAN
Received: January 9, 2026
Accepted: May 1, 2026
Published: June 2, 2026
Peer Review History: PLOS recognizes the benefits of transparency in the peer review process; therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. The editorial history of this article is available here: https://doi.org/10.1371/journal.pone.0341704
Copyright: © 2026 Jones et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract
Synthetic populations provide the demographic foundations for agent-based models
in transport, public health, disaster management and other sectors, enabling credible
representations of individual characteristics and behaviours. Many established synthesis methods rely on census microdata; however, such data are infrequently collected, privacy-restricted, and usually available only as small public-use samples at
coarse geographic scales. This paper introduces a sample-free framework that uses
a large language model (LLM) to generate complete, household-structured populations directly from aggregate demographic data. The framework is LLM-agnostic and
follows a multi-step process: objective definition, input preparation, LLM selection,
and synthetic household generation. No model fine-tuning is required, meaning that
data requirements are low and the framework is easily accessible. Population synthesis is formulated as an iterative prompting process in which an LLM generates households guided by the discrepancies between synthetic and target distributions. The
model draws on prior knowledge encoded during pre-training to propose plausible
attribute combinations, resulting in both statistical alignment and structural feasibility.
In a global evaluation covering 109 countries, the framework achieved very close
alignment on simpler marginals such as gender (SRMSE: 0.003) and household
size (SRMSE: 0.026), while more structurally complex attributes such as household
composition (SRMSE: 0.062) and age (SRMSE: 0.128) were also reproduced with
good accuracy. These results were supported by detailed case studies in Newcastle upon Tyne (UK) and Dar es Salaam (Tanzania). The principal contribution of the
framework is to enable the construction of coherent household-structured populations
when detailed microdata are unavailable, expanding the applicability of agent-based
modelling in data-constrained settings.
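The discrepancy-guided loop and the SRMSE scores quoted above can be illustrated with a minimal sketch. It is not the authors' implementation: the SRMSE formula used here (root mean squared error across categories, divided by the mean target frequency) is one commonly used definition and is an assumption, since this section does not state the formula. The names `srmse`, `discrepancy_feedback`, `synthesise` and the `propose` callable (a stand-in for an LLM call) are hypothetical, and households are reduced to a single attribute for brevity.

```python
import math


def srmse(target, synthetic):
    """SRMSE between two relative-frequency distributions.

    Assumed definition: RMSE over categories / mean target frequency.
    """
    n = len(target)
    mse = sum((synthetic.get(c, 0.0) - p) ** 2 for c, p in target.items()) / n
    mean_target = sum(target.values()) / n
    return math.sqrt(mse) / mean_target


def discrepancy_feedback(target, counts, tol=0.005):
    """Describe, per category, how the current synthetic share compares to target."""
    total = sum(counts.values())
    lines = []
    for category, p in target.items():
        q = counts.get(category, 0) / total if total else 0.0
        gap = q - p
        if abs(gap) < tol:
            status = "on target"
        else:
            status = "over" if gap > 0 else "under"
        lines.append(f"{category}: target {p:.1%}, current {q:.1%} ({status})")
    return "\n".join(lines)


def synthesise(target, propose, n_households, batch=10, max_rounds=1000):
    """Iteratively request batches, feeding back the current discrepancies.

    `propose(feedback, batch)` is a hypothetical stand-in for an LLM call; it
    must return a list of category labels (one per proposed household).
    """
    counts = {category: 0 for category in target}
    for _ in range(max_rounds):
        if sum(counts.values()) >= n_households:
            break
        feedback = discrepancy_feedback(target, counts)
        for label in propose(feedback, batch):
            if sum(counts.values()) >= n_households:
                break
            if label in counts:  # discard labels outside the target categories
                counts[label] += 1
    return counts
```

In a real pipeline the feedback string would be embedded in the prompt for the next batch, and each proposed household would carry several attributes (size, composition, member ages and genders) rather than a single label.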
PLOS One | https://doi.org/10.1371/journal.pone.0341704 June 2, 2026
Data availability statement: All data and code supporting this study are publicly available. Processed marginals, prompts, configuration files, and example outputs are archived in the Newcastle University data repository (https://doi.org/10.25405/data.ncl.31830205). The population generation library is available at https://github.com/MJones235/LLM-Population-Generator/releases/tag/v1.0.0 and data collection and processing scripts at https://github.com/MJones235/Synthetic-Population-Experiments/releases/tag/v1.0.0.
Funding: MJ was funded by the EPSRC Centre for Doctoral Training (CDT) in Geospatial Systems (ref EP/S023577/1). The funder did not play any role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. URL: https://gtr.ukri.org/projects?ref=EP/S023577/1.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Urban simulations are frequently founded on agent-based models (ABMs),
which represent people and households as individual agents and trace how their
behaviours and interactions generate system-level outcomes. To initialise these
models, researchers must first construct a population: a statistically consistent, digital
representation of people, their demographic attributes, and household groupings.
Synthetic populations are particularly valuable where detailed person-level records
(microdata) are unavailable, incomplete, or protected for privacy reasons. They
provide the demographic foundations that allow ABMs to capture heterogeneity – for
instance, differences in age, mobility, or caring responsibilities – and to explore how
these differences shape collective outcomes [1]. This approach has been used in
many domains, including urban mobility [2–4], public health [5–9] and disaster management [10]. Across these applications, the credibility of the results depends directly
on how well the synthetic population reflects the target environment.
Two primary types of data are used to construct synthetic populations: aggregate
data, which provide marginal or joint distributions of demographic variables, and
microdata, which contain individual-level records [11]. National censuses are the
most comprehensive provider of both, but they are collected infrequently, usually on
a decadal cycle, as recommended by the UN [12]. Moreover, access to person-level
records is often restricted for privacy protection: only small public-use samples of
1–5% of the population are usually available, and these are restricted to coarse geographic units [13]. Where available, auxiliary surveys such as Demographic and Health
Surveys or Multiple Indicator Cluster Surveys can supply aggregate data, yet these
sources differ in coverage, definitions and spatial resolution [14,15]. The resulting
inconsistencies make it difficult to reconcile information across datasets and scales.
Consequently, the central challenge is to construct household-consistent synthetic
populations when the underlying data are incomplete, contradictory, or missing at fine
geographic levels.
In population synthesis, the goal is to create a population whose joint distribution
of demographic and household characteristics reflects the real world as closely as
possible. Methods must balance three core demands: statistical fidelity (the ability to
reproduce observed aggregates); feasibility (the avoidance of implausible household
structures); and diversity (full coverage of possible values, including rare combinations) [16]. Early approaches focused on adjusting microdata samples to match
known marginal totals, while more recent models employ probabilistic or generative
techniques to infer unseen combinations of attributes.
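As an illustration of adjusting a sample to match known marginal totals, the sketch below uses iterative proportional fitting (IPF), a widely used technique of this kind. The paragraph above does not name a specific method, so IPF is an illustrative choice here. The function name `ipf` and the seed table and targets in the usage note are made up for the example.

```python
def ipf(seed, row_targets, col_targets, iterations=100, tol=1e-9):
    """Rescale a 2-D seed table until its row and column sums match the targets.

    `seed` is a list of rows (e.g. a cross-tabulation from a microdata sample);
    `row_targets` and `col_targets` are the known marginal totals and are
    assumed to share the same grand total.
    """
    table = [list(row) for row in seed]  # work on a copy
    for _ in range(iterations):
        # Step 1: scale each row so that it sums to its target.
        for i, row in enumerate(table):
            row_sum = sum(row)
            if row_sum > 0:
                table[i] = [v * row_targets[i] / row_sum for v in row]
        # Step 2: scale each column so that it sums to its target.
        for j, col_target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            if col_sum > 0:
                for row in table:
                    row[j] *= col_target / col_sum
        # Columns now match exactly; stop once the rows also match.
        row_error = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if row_error < tol:
            break
    return table
```

For example, `ipf([[1, 2], [3, 4]], [40, 60], [30, 70])` rescales a small seed table so that its rows sum to 40 and 60 and its columns to 30 and 70. Cells that are zero in the seed stay zero, which is one reason sample-based fitting struggles to produce attribute combinations absent from the sample.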
Existing methods require a data-rich environment with extensive microdata and
cross-tabulations to be effective. (...truncated)