Air pollution prediction with machine learning: a case study of Indian cities

International Journal of Environmental Science and Technology
https://doi.org/10.1007/s13762-022-04241-5

ORIGINAL PAPER

K. Kumar¹ · B. P. Pande²

Received: 18 December 2021 / Revised: 17 February 2022 / Accepted: 19 April 2022
© The Author(s) under exclusive licence to Iranian Society of Environmentalists (IRSEN) and Science and Research Branch, Islamic Azad University 2022

Abstract
The survival of mankind cannot be imagined without air. Consistent developments in almost all realms of modern human society have adversely affected the health of the air. Daily industrial, transport, and domestic activities are stirring hazardous pollutants into our environment. Monitoring and predicting air quality have become essential in this era, especially in developing countries like India. In contrast to the traditional methods, prediction technologies based on machine learning techniques have proved to be the most efficient tools to study such modern hazards. The present work investigates six years of air pollution data from 23 Indian cities for air quality analysis and prediction. The dataset is preprocessed and key features are selected through correlation analysis. An exploratory data analysis is carried out to develop insights into various hidden patterns in the dataset, and the pollutants directly affecting the air quality index are identified. A significant fall in almost all pollutants is observed in the pandemic year, 2020. The data imbalance problem is solved with a resampling technique, and five machine learning models are employed to predict air quality. The results of these models are compared with the standard metrics. The Gaussian Naive Bayes model achieves the highest accuracy while the Support Vector Machine model exhibits the lowest accuracy. The performances of these models are evaluated and compared through established performance parameters.
The XGBoost model performed the best among the models and achieved the highest linearity between the predicted and actual data.

Keywords Air quality index · Machine learning · Indian air quality data · Correlation-based feature selection · Exploratory data analysis · Box plot · Synthetic minority oversampling technique

Editorial responsibility: M. Abbaspour.

* B. P. Pande
1 Sikh National College, Qadian, Guru Nanak Dev University, Amritsar, Punjab, India
2 Department of Computer Applications, LSM, Government PG College, Pithoragarh, Uttarakhand, India

Introduction

Energy consumption and its consequences are inevitable in modern age human activities. The anthropogenic sources of air pollution include emissions from industrial plants; automobiles; planes; burning of straw, coal, and kerosene; aerosol cans, etc. Various dangerous pollutants like CO, CO2, Particulate Matter (PM), NO2, SO2, O3, NH3, Pb, etc. are being released into our environment every day. Chemicals and particles constituting air pollution affect the health of humans, animals, and even plants. Air pollution can cause a multitude of serious diseases in humans, from bronchitis to heart disease, from pneumonia to lung cancer, etc. Poor air conditions lead to other contemporary environmental issues like global warming, acid rain, reduced visibility, smog, aerosol formation, climate change, and premature deaths. Scientists have realized that air pollution bears the potential to affect historical monuments adversely (Rogers 2019). Vehicle emissions, atmospheric releases of power plants and factories, agriculture exhausts, etc. are responsible for increased greenhouse gases. The greenhouse gases adversely affect climate conditions and consequently, the growth of plants (Fahad et al. 2021a). Emissions of inorganic carbons and greenhouse gases also affect plant-soil interactions (Fahad et al. 2021b).
Climatic fluctuations affect not only humans and animals; agricultural factors and productivity are also greatly influenced (Sönmez et al. 2021). Economic losses are an allied consequence too.

The Air Quality Index (AQI), an assessment parameter, is directly related to public health. A higher level of AQI indicates more dangerous exposure for the human population. Therefore, the urge to predict the AQI in advance motivated scientists to monitor and model air quality. Monitoring and predicting AQI, especially in urban areas, has become a vital and challenging task with increasing motor and industrial developments. Most air quality studies and research works target the developed countries, although the concentration of the most deadly pollutants, like PM2.5, is found to be several times higher in developing countries (Rybarczyk and Zalakeviciute 2021). Few researchers have undertaken the study of air quality prediction for Indian cities. After going through the available literature, a strong need was felt to fill this gap by attempting analysis and prediction of AQI for India.

Various models have been exercised in the literature to predict AQI, like statistical, deterministic, physical, and Machine Learning (ML) models. The traditional techniques based on probability and statistics are very complex and less efficient. The ML-based AQI prediction models have proved to be more reliable and consistent. Advanced technologies and sensors have made data collection easy and precise. Accurate and reliable predictions from such huge environmental data require rigorous analysis, which only ML algorithms can deal with efficiently. Al-Jamimi et al. (2018) thoroughly discussed the importance of supervised ML algorithms for applied environment protection issues.
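The statement that a higher AQI means more dangerous exposure can be made concrete with a small sketch that maps an AQI value to a category label. The band boundaries below are an assumption: they follow India's National Air Quality Index as published by the CPCB and are not listed in the text above. The helper name `aqi_bucket` is hypothetical and only for illustration.

```python
from bisect import bisect_left

# Inclusive upper bounds of the AQI bands and their labels.
# Assumption: these follow India's National Air Quality Index (CPCB);
# the text above does not list them.
AQI_UPPER_BOUNDS = [50, 100, 200, 300, 400]
AQI_LABELS = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]


def aqi_bucket(aqi: float) -> str:
    """Return the category label for a non-negative AQI value (hypothetical helper)."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    # bisect_left finds the first band whose upper bound is >= aqi;
    # values above 400 fall past the end of the list and map to "Severe".
    return AQI_LABELS[bisect_left(AQI_UPPER_BOUNDS, aqi)]


if __name__ == "__main__":
    for value in (35, 100, 101, 250, 450):
        print(value, aqi_bucket(value))
```

A lookup of this kind turns the continuous index into discrete classes, which is one way an AQI prediction task can be posed as classification rather than regression.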
The present work investigates six years of air pollution data of the Indian cities and analyzes twelve air pollutants and AQI. The dataset is preprocessed and cleaned first, then methods of data visualization are applied to develop better insights and to investigate hidden patterns and trends. This work combines the correlation coefficient with ML models, an approach exercised by very few scholars in the literature (Alade et al. 2019a). The data imbalance is identified and addressed with a resampling technique. Five popular ML models are exercised in conjunction with this resampling technique. Their performances are then compared through standard metrics. These metrics are utilized by many scholars of the realm (see Table 1) and by some other authors of ML applications, such as Ayturan et al. (2020), Alade et al. (2019b), Al-Jamimi et al. (2019), and Al-Jamimi and Saleh (2019).

Section 2 presents the literature survey with a comparative analysis of the literary works in the realm of air quality prediction with ML. Section 3 describes the dataset being studied, preprocessing, and feature selection techniques applied. Section 4 deals with observing hidden patterns in the dataset through data visualisation (...truncated)