On bootstrapping, data overfitting and crocodiles: an additional comment to McPherron et al. (2022)

Archaeological and Anthropological Sciences (2025) 17:62
https://doi.org/10.1007/s12520-025-02183-w

BRIEF REPORT

Manuel Domínguez-Rodrigo (1, 2, 3) · Enrique Baquedano (1)

Received: 2 September 2024 / Accepted: 4 February 2025 / Published online: 18 February 2025
© The Author(s) 2025

Abstract

Quaternary hominin-carnivore interactions are best reconstructed taphonomically through bone surface modifications (BSM). This study examines redundancy in an experimental dataset of potentially similar BSM created by crocodile tooth-marking, sedimentary trampling and stone tool cut marking (Domínguez-Rodrigo and Baquedano in Sci Rep 8:5786, 2018). The original analysis of this experimental set, which aimed to classify the three types of BSM confidently, was criticized by some authors (McPherron et al. in J Hum Evol 164:103071, 2022), who implied that the analysis was flawed by potential overfitting caused by the improper use of bootstrapping. A subsequent response to that critique (Abellán et al. in Geobios Memoire Special. 72–73, 12–21, 2022) showed that the results did not differ between the raw data and the bootstrapped data. That response argued that structural covariance and redundancy in the categorical dataset were responsible for the highly accurate models; however, this was never demonstrated empirically. Here, we show that the original experimental dataset is saturated with redundancy. Out of 633 cases, only 116 were unique (18.3%) in the complete dataset, 45 (7.1%) in the intrinsic-variable dataset, and just four (0.63%) in the three-variable dataset (the three variables accounting for most of the sample variance). Redundancy therefore ranged from 81.7% to over 99%. Machine learning analysis of these datasets using Random Forest (RF) and C5.0 algorithms achieved high accuracy with the raw data (90–98%).
Proper bootstrapping yielded nearly identical accuracy (88–98%), while improper bootstrapping slightly reduced accuracy (86–98%) and introduced some degree of underfitting. This underscores that the potential biasing effects of bootstrapping differ between numerical and categorical datasets, especially in those with low dimensionality and low cardinality and in situations of feature interdependence and covariance. A complementary approach, an iterative data-partitioning method based on train-test resampling, reproduced the results derived from the bootstrapped samples. Understanding these methodological processes is essential for applying these experimental models adequately to the fossil record.

Keywords Taphonomy · Machine learning · Bootstrapping · Redundancy · Bone surface modifications

1 Institute of Evolution in Africa (IDEA), University of Alcalá de Henares, Madrid, Spain
2 Area of Prehistory, Department of History and Philosophy, University of Alcalá de Henares, Alcalá de Henares, Spain
3 Department of Anthropology, Rice University, 6100 Main St., Houston, TX 77005-1827, USA

Introduction

The presence of V-shaped microstriated linear marks on fossil bones from the African Plio-Pleistocene has been proposed as the earliest evidence of stone tool use. However, some researchers have suggested alternative interpretations, arguing that these marks may resemble those caused by trampling or by crocodile teeth, which makes them difficult to distinguish from tool-imparted cut marks (Njau and Blumenschine 2006; McPherron et al. 2010; Domínguez-Rodrigo et al. 2010; Sahle et al. 2017; Njau and Gilbert 2016). These opposing arguments are based on diverse methodological approaches to BSM.
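
As a worked illustration of the redundancy figures reported in the abstract, the short Python sketch below counts distinct cases in a categorical table and re-derives the reported percentages from the stated counts (116, 45 and 4 out of 633). It assumes that a "unique" case means a distinct combination of variable levels. The function names and the toy rows are hypothetical and are not taken from the original analysis or dataset.

```python
from collections import Counter


def redundancy_summary(rows):
    """Summarise duplication among categorical cases.

    Returns (n_total, n_distinct, share_redundant), where
    share_redundant = 1 - n_distinct / n_total.
    Assumption: a "unique" case is a distinct combination of levels.
    """
    cases = [tuple(row) for row in rows]  # tuples are hashable, so they can be counted
    if not cases:
        raise ValueError("rows must contain at least one case")
    n_total = len(cases)
    n_distinct = len(Counter(cases))
    return n_total, n_distinct, 1 - n_distinct / n_total


def percent_distinct(n_distinct, n_total):
    """Percentage of cases that are distinct."""
    return 100 * n_distinct / n_total


# Hypothetical toy rows (NOT the experimental data): three categorical
# variables with few levels each, so identical cases recur often.
toy_rows = (
    [("V-shaped", "present", "straight")] * 5
    + [("U-shaped", "absent", "curved")] * 3
    + [("V-shaped", "absent", "straight")] * 2
)

total, distinct, redundant = redundancy_summary(toy_rows)
print(f"toy data: {distinct} distinct of {total} cases, redundancy {redundant:.0%}")

# Re-derive the percentages from the counts stated in the abstract.
N_CASES = 633
for label, n_unique in [("complete", 116), ("intrinsic", 45), ("three-variable", 4)]:
    pct = percent_distinct(n_unique, N_CASES)
    print(f"{label}: {pct:.2f}% distinct, {100 - pct:.2f}% redundant")
```

With so few distinct combinations, identical cases would be expected on both sides of almost any train-test partition of such a table, whether or not the data are bootstrapped first.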
To address this potential methodological equifinality, an experiment combining different types of cut marks, trampling marks and crocodile tooth marks, aimed specifically at differentiating cut marks made by humans using stone tools from crocodile bite marks and trampling, was analyzed using machine learning (ML) techniques (Domínguez-Rodrigo and Baquedano 2018). These BSM were produced under controlled experimental conditions, and their analysis involved the multivariate use of 16 variables representing different microscopic features. The data were bootstrapped 10,000 times to create a considerably larger dataset for training and testing the ML models. Several ML algorithms were subsequently trained and tested, including Neural Networks (NN), Support Vector Machines (SVM), Random Forests (RF), and K-Nearest Neighbors (KNN). Models were evaluated using 10-fold cross-validation, which partitions the data into ten equal folds and tests performance across different combinations of them. The study found that combining multivariate taphonomy with ML methods enabled accurate classification of BSM, providing a tool for reliable behavioral and archaeological analyses.

This experimental study was critically scrutinized after its publication. Over the past two years, it has been argued that the ML analysis of these experimental samples is unreliable because it may have been overfitted (Calder et al. 2022; McPherron et al. 2022). A similar concern was raised about another experimental dataset, on bone breakage (Yezzi-Woodley et al. 2024). The concerns were reasonable in principle, because they stemmed from the use of bootstrapping before the split into training and testing sets, but they were not borne out by the data. Although the different simulations in those critical works (Calder et al. 2022; McPherron et al. 2022; Yezzi-Woodley et al.
2024) did not realistically replicate the experimental samples used by Domínguez-Rodrigo and Baquedano (2018) and Moclán et al. (2019), because of the non-independent relationship of the variables used in the neo-taphonomic samples (among other reasons), we addressed all those concerns in our more recent works (Abellán et al. 2022; Moclán and Domínguez-Rodrigo 2023) and showed that:

a) There was no difference in the accuracy of the algorithms between the raw data and the bootstrapped samples (Abellán et al. 2022; Moclán and Domínguez-Rodrigo 2023). In fact, not only was no model overfit as a result of bootstrapping, but some of the algorithms showed higher accuracy when using only the raw data. This rules out any role for overfitting in the accuracy of the models.

b) It was argued that this “unexpected” outcome resulted from a combination of factors, including the use of a categorical dataset, low cardinality (i.e., a low number of levels) per variable, and limited dimensionality within a patterned data structure (Abellán et al. 2022; Moclán and Domínguez-Rodrigo 2023). This combination commonly produces multiple identical cases within the same dataset. This phenomenon underscores that the overfitting potential of nume (...truncated)