Optimizing data collection for public health decisions: a data mining approach

BMC Public Health, Jun 2014

Background: Collecting data can be cumbersome and expensive. Lack of relevant, accurate and timely data for research to inform policy may negatively impact public health. The aim of this study was to test whether the careful removal of items from two community nutrition surveys, guided by a data mining technique called feature selection, can (a) identify a reduced dataset while (b) not damaging the signal inside that data.

Methods: The Nutrition Environment Measures Surveys for stores (NEMS-S) and restaurants (NEMS-R) were completed on 885 retail food outlets in two counties in West Virginia between May and November of 2011. A reduced dataset was identified for each outlet type using feature selection. Coefficients from linear regression modeling were used to weight items in the reduced datasets. Weighted item values were summed with the error term to compute reduced item survey scores. Scores produced by the full survey were compared to the reduced item scores using a Wilcoxon rank-sum test.

Results: Feature selection identified 9 store and 16 restaurant survey items as significant predictors of the score produced from the full survey. The linear regression models built from the reduced feature sets had R² values of 92% and 94% for restaurant and grocery store data, respectively.

Conclusions: While there are many potentially important variables in any domain, the most useful set may be only a small subset. The use of feature selection in the initial phase of data collection to identify the most influential variables may be a useful tool to greatly reduce the amount of data needed, thereby reducing cost.
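The scoring and comparison steps named in the Methods can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the authors' implementation: the six-item survey, the 200 outlets, and the choice of the first three items as the "reduced" set are all synthetic, the helper names (`solve`, `fit_least_squares`, `rank_sum_test`) are hypothetical, and the fitted intercept stands in for the constant added to the weighted items, since the abstract does not say how its "error term" was handled. The rank-sum test uses the usual normal approximation with a tie correction.

```python
import math
import random


def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_least_squares(rows, y):
    """Return [intercept, w1, ..., wk] from the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, row):
    """Weighted sum of item values plus the intercept."""
    return coefs[0] + sum(w * v for w, v in zip(coefs[1:], row))


def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def rank_sum_test(x, y):
    """Wilcoxon rank-sum test (normal approximation); returns (z, two-sided p)."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n = len(pooled)
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # average rank shared by tied values
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    mean_w = n1 * (n + 1) / 2
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (w - mean_w) / math.sqrt(var_w)
    return z, math.erfc(abs(z) / math.sqrt(2))


# Synthetic survey (assumption): three wide-range items and three binary items.
random.seed(0)
outlets = [
    [random.randint(0, 5) for _ in range(3)] + [random.randint(0, 1) for _ in range(3)]
    for _ in range(200)
]
full_scores = [float(sum(items)) for items in outlets]
reduced_items = [items[:3] for items in outlets]  # assumed result of feature selection

coefs = fit_least_squares(reduced_items, full_scores)
reduced_scores = [predict(coefs, row) for row in reduced_items]
r2 = r_squared(full_scores, reduced_scores)
z, p_value = rank_sum_test(full_scores, reduced_scores)
print(f"R^2 = {r2:.2f}, rank-sum z = {z:.2f}, p = {p_value:.2f}")
```

A high R² together with a rank-sum result that shows no detectable shift between the two score distributions is the kind of evidence the abstract describes for retaining the reduced item set.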



Susan N Partington (0, 1), Vasil Papakroni (2), Tim Menzies (2)

0 Regional Research Institute, West Virginia University, 886 Chestnut Ridge Road, 5th Floor, P.O. Box 6825, Morgantown, WV 26506-6825, USA
1 Division of Animal and Nutritional Sciences, West Virginia University, Morgantown, WV, USA
2 Lane Department of Computer Sciences and Electrical Engineering, West Virginia University, Morgantown, WV, USA

Keywords: Community survey methods; Data mining; Data collection; Ecological and environmental concepts; Nutrition

Introduction

Ideally, public health policy should be informed by research, assessments and surveillance [1]. These activities rely on the availability of current and accurate data collected at both the individual and community levels [2]. The cost of conducting health research has recently become an important consideration due to decreases in available funding. In the United States, federal funding for biomedical research as a percent of total health care expenditures decreased from 11% to 2% from 1980 to 2010 [3]. This paper explores one approach for reducing research costs by reducing the number of survey items on two community nutrition assessment instruments. In principle, the approach described here is quite general and could be applied to reducing the amount of data needed to assess outcomes across a wide variety of health research questions.

Background

Collection of primary data is one of the most expensive and time-consuming aspects of any research study [4]. To ensure data integrity, the collection process must be consistently monitored. After collection, information from paper forms requires double entry by hand or machine scanning followed by manual confirmation of scanner accuracy. Electronic collection of data, either in person or over the internet, requires the purchase or development of software to collect the data and, if deployed over the internet, web-based tools and the resources to host them [5]. In all cases, data cleaning and validation are required [6]. The resources needed increase in proportion to the amount of data to be collected and managed. Further, even a rigorously monitored data gathering process is error prone. Transcription errors, recording errors, data entry errors, and errors resulting from equipment malfunction all have the potential to distort findings and compromise results [7]. Minimizing the amount of data needed to produce an accurate assessment minimizes research costs as well as the risk of errors.

Data mining

Data mining techniques employ algorithms, or learners, that can build prediction models. Such algorithms include linear regression, decision tree learners, Bayes classifiers, random forests and support vector machines, among others [8]. Within these learners, there is often a feature selection algorithm that identifies elements within a dataset that are useful in the prediction model. There are many feature selection algorithms, including stepwise regression, principal component analysis [9] and information gain [8]. Feature selection studies have found that ranking of singleton variables (as in stepwise regression) does not work as well as exploring the rankings of combinations of variables. That is, if every variable were ranked only by their independent associa (...truncated)
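The difference between ranking single variables and ranking combinations can be seen in a toy case. The sketch below is an illustration only and does not reproduce the feature selection algorithm used in the paper: the three binary items `a`, `b`, `c` and the target are hypothetical, and each candidate subset is scored by the accuracy of a simple majority-vote lookup table. The target is 1 exactly when `a` and `b` differ, so neither item is informative alone, yet the pair predicts the target perfectly.

```python
from collections import Counter, defaultdict
from itertools import combinations, product

# Hypothetical toy data: every combination of three binary survey items.
NAMES = ("a", "b", "c")
ROWS = list(product((0, 1), repeat=3))
TARGET = [a ^ b for a, b, c in ROWS]  # 1 only when a and b differ; c is irrelevant


def subset_accuracy(columns):
    """Accuracy of predicting the majority target for each value pattern of `columns`."""
    groups = defaultdict(Counter)
    for row, label in zip(ROWS, TARGET):
        key = tuple(row[i] for i in columns)
        groups[key][label] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(ROWS)


# Ranking singleton variables: every item looks equally useless.
single_scores = {NAMES[i]: subset_accuracy((i,)) for i in range(3)}

# Ranking combinations: the pair (a, b) stands out.
pair_scores = {
    (NAMES[i], NAMES[j]): subset_accuracy((i, j))
    for i, j in combinations(range(3), 2)
}
best_pair = max(pair_scores, key=pair_scores.get)

print(single_scores)  # {'a': 0.5, 'b': 0.5, 'c': 0.5}
print(best_pair, pair_scores[best_pair])  # ('a', 'b') 1.0
```

Exhaustive scoring of every subset is feasible only for a handful of variables; with survey instruments of realistic size, a search heuristic over subsets would be needed instead.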


This is a preview of a remote PDF: http://www.biomedcentral.com/content/pdf/1471-2458-14-593.pdf

Susan N Partington, Vasil Papakroni, Tim Menzies. Optimizing data collection for public health decisions: a data mining approach. BMC Public Health, 2014, 14:593. DOI: 10.1186/1471-2458-14-593