On bootstrapping, data overfitting and crocodiles: an additional comment to McPherron et al. (2022)
Archaeological and Anthropological Sciences (2025) 17:62
https://doi.org/10.1007/s12520-025-02183-w
BRIEF REPORT
Manuel Domínguez-Rodrigo (1, 2, 3) · Enrique Baquedano (1)
Received: 2 September 2024 / Accepted: 4 February 2025 / Published online: 18 February 2025
© The Author(s) 2025
Abstract
Quaternary hominin-carnivore interactions are best reconstructed taphonomically through the use of bone surface modifications (BSM). This study examines redundancy in an experimental dataset of potentially similar BSM created by crocodile
tooth-marking, sedimentary trampling and stone tool cut marking (Domínguez-Rodrigo and Baquedano in Sci Rep 8:5786,
2018). The original analysis of this experimental set, which aimed to classify the three types of BSM confidently, was criticized
by some authors (McPherron et al. in J Hum Evol 164:103071, 2022), who suggested that the analysis was flawed by potential overfitting caused by the improper use of bootstrapping. A subsequent response to that critique (Abellán
et al. in Geobios Memoire Special. 72–73, 12–21, 2022) showed that there was no difference in the results between using
the raw data and the bootstrapped data. It was argued that structural covariance and redundancy of the categorical dataset
were responsible for the highly accurate models; however, this was never empirically demonstrated. Here, we show how
the original experimental dataset is saturated with redundancy. Our analysis revealed that, out of 633 cases, only 116 were
unique (18.3%) in the complete dataset, 45 unique cases (7.1%) in the intrinsic variable dataset, and just four unique cases
(0.63%) in the three-variable dataset (accounting for most of the sample variance). Redundancy, therefore, ranged from
81.7% to over 99%. Machine learning analysis using Random Forest (RF) and C5.0 algorithms on the datasets demonstrated high accuracy with the raw data (90-98%). Proper bootstrapping yielded nearly identical accuracy (88-98%), while
improper bootstrapping slightly reduced accuracy (86-98%) and introduced some degree of underfitting. This underscores
that the potential biasing effects of bootstrapping differ between numerical and categorical datasets, especially in those
with low dimensionality and low cardinality, in situations of feature interdependence and covariance. A complementary
approach, consisting of an iterative data partitioning method through train-test resampling, reproduced the results derived
from the bootstrapped samples. An understanding of these methodological processes is essential for the adequate application of these experimental models to the fossil record.
Keywords Taphonomy · Machine learning · Bootstrapping · Redundancy · Bone surface modifications
1 Institute of Evolution in Africa (IDEA), University of Alcalá de Henares, Madrid, Spain
2 Area of Prehistory, Department of History and Philosophy, University of Alcalá de Henares, Alcalá de Henares, Spain
3 Department of Anthropology, Rice University, 6100 Main St., Houston, TX 77005-1827, USA
Introduction
The presence of V-shaped microstriated linear marks on fossil bones from the African Plio-Pleistocene has been proposed as the earliest evidence of stone tool use. However,
some researchers have suggested alternative interpretations,
arguing that these marks may resemble those caused by
trampling or crocodile tooth marks, making them difficult to
distinguish from tool-imparted cut marks (Njau and Blumenschine 2006; McPherron et al. 2010; Domínguez-Rodrigo et
al. 2010; Sahle et al. 2017; Njau and Gilbert 2016). These
opposing arguments are based on diverse methodological
approaches to BSM. In order to address a potential methodological equifinality, an experiment combining different
types of cut marks, trampling marks and crocodile tooth
marks, aiming specifically to differentiate cut marks made
by humans using stone tools from crocodile bite marks and
trampling, was analyzed using machine learning (ML) techniques (Domínguez-Rodrigo and Baquedano 2018). These
BSM were derived from controlled experimental conditions
and their analysis involved the multivariate use of 16 variables representing different microscopic features. The data
were bootstrapped 10,000 times to create a significantly
large dataset for training and testing the ML models. Several ML algorithms were subsequently trained and tested,
including Neural Networks (NN), Support Vector Machines
(SVM), Random Forests (RF), and K-Nearest Neighbors
(KNN). Models were evaluated using 10-fold cross-validation, ensuring robust assessment by partitioning the data
into ten equal folds and testing performance across different
combinations. The study found that combining multivariate
taphonomy with ML methods enabled accurate classification of BSM, providing a tool for reliable behavioral and
archaeological analyses.
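The order of operations described above (bootstrap first, then 10-fold cross-validation) can be illustrated with a minimal sketch. It is not the code used in the original study: the 60-case toy dataset, the three categorical variables and the helper names `ten_folds` and `shared_share` are hypothetical and serve only to show two things. First, when the sample is bootstrapped before the folds are made, copies of the same original case end up in both the training and the test folds. Second, in a categorical dataset with few levels per variable, identical patterns can be shared between training and test folds even when only the raw data are used.

```python
import random

rng = random.Random(42)

# Hypothetical toy dataset (an assumption, not the experimental BSM data):
# 60 cases described by three categorical variables with 3, 2 and 3 levels.
cases = [(rng.randrange(3), rng.randrange(2), rng.randrange(3)) for _ in range(60)]
ids = list(range(len(cases)))


def ten_folds(items, rng, k=10):
    """Shuffle the items and deal them into k folds of near-equal size."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def shared_share(test, train, key):
    """Share of test items whose key also occurs among the training items."""
    train_keys = {key(i) for i in train}
    return sum(key(i) in train_keys for i in test) / len(test)


# 1) Bootstrap first (resample case ids with replacement), then make the folds.
boot_ids = [rng.choice(ids) for _ in range(10_000)]
boot_folds = ten_folds(boot_ids, rng)
test, train = boot_folds[0], [i for f in boot_folds[1:] for i in f]
# Share of test rows that are copies of a case also present in training.
same_case = shared_share(test, train, key=lambda i: i)

# 2) Raw data only: folds over the 60 original cases, no resampling.
folds = ten_folds(ids, rng)
test, train = folds[0], [i for f in folds[1:] for i in f]
# Share of test cases whose exact variable pattern also occurs in training.
same_pattern = shared_share(test, train, key=lambda i: cases[i])

# Proportion of distinct variable patterns in the raw toy dataset.
unique_share = len(set(cases)) / len(cases)

print(f"bootstrap-then-split, test rows with a copy in training: {same_case:.2f}")
print(f"raw split, test cases with an identical pattern in training: {same_pattern:.2f}")
print(f"unique patterns in the raw data: {unique_share:.2f}")
```

In this toy setting at most 18 distinct patterns are possible (3 × 2 × 3), so the raw data are already highly redundant before any resampling is applied. A real analysis would replace the overlap counts with the accuracy of a classifier trained and tested on each partition.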
This experimental study was critically scrutinized after
its publication. Over the past two years, it has been argued
that the ML analysis of these experimental samples is not
reliable because it may have been overfitted (Calder et al.
2022; McPherron et al. 2022). A similar concern was raised
about another experimental dataset on bone breakage (Yezzi-Woodley et al. 2024). The concerns were reasonable, because
they stemmed from the use of bootstrapping prior to the
split of the training and testing sets, but they were not supported by the data.
Although the different simulations in those critical works
(Calder et al. 2022; McPherron et al. 2022; Yezzi-Woodley
et al. 2024) did not realistically replicate the experimental
sample used by Domínguez-Rodrigo and Baquedano (2018)
and Moclán et al. (2019), because of the non-independent
relationship of the variables used in the neo-taphonomic
samples (among other reasons), we addressed all those concerns in our more recent works (Abellán et al. 2022; Moclán
and Domínguez-Rodrigo 2023) and showed that:
a) There was no difference in the accuracy of the algorithms when using the raw data and the bootstrapped
samples (Abellán et al. 2022; Moclán and Domínguez-Rodrigo 2023). In fact, not only was no model
overfitted as a result of bootstrapping, but some
of the algorithms showed higher accuracy when using
only the raw data. This rules out that overfitting played
any role in the accuracy of the models.
b) It was argued that this “unexpected” outcome was the
result of a combination of factors, including the use of
a categorical dataset and low cardinality (i.e., low numbers of levels) per variable, and limited dimensionality
within a patterned data structure (Abellán et al. 2022;
Moclán and Domínguez-Rodrigo 2023). This commonly results in the occurrence of multiple identical
cases within the same dataset. This phenomenon underscores that the overfitting potential of nume (...truncated)