Air pollution prediction with machine learning: a case study of Indian cities
International Journal of Environmental Science and Technology
https://doi.org/10.1007/s13762-022-04241-5
ORIGINAL PAPER
K. Kumar¹ · B. P. Pande²
Received: 18 December 2021 / Revised: 17 February 2022 / Accepted: 19 April 2022
© The Author(s) under exclusive licence to Iranian Society of Environmentalists (IRSEN) and Science and Research Branch, Islamic Azad University 2022
Abstract
The survival of mankind cannot be imagined without air. Continuous development in almost every realm of modern human
society has adversely affected air quality. Daily industrial, transport, and domestic activities release hazardous pollutants into our environment. Monitoring and predicting air quality has become essential in this era, especially
in developing countries like India. In contrast to traditional methods, prediction technologies based on machine
learning techniques have proved to be the most efficient tools for studying such modern hazards. The present work investigates
six years of air pollution data from 23 Indian cities for air quality analysis and prediction. The dataset is preprocessed
and key features are selected through correlation analysis. An exploratory data analysis is carried out to develop insights
into hidden patterns in the dataset, and the pollutants directly affecting the air quality index are identified. A significant
fall in almost all pollutants is observed in the pandemic year, 2020. The data imbalance problem is addressed with a resampling
technique, and five machine learning models are employed to predict air quality. The results of these models are compared
using standard metrics. The Gaussian Naive Bayes model achieves the highest accuracy while the Support Vector Machine
model exhibits the lowest accuracy. The performances of these models are also evaluated and compared through established
performance parameters. The XGBoost model performs best among the models and attains the highest linearity
between the predicted and actual data.
Keywords Air quality index · Machine learning · Indian air quality data · Correlation-based feature selection · Exploratory
data analysis · Box plot · Synthetic minority oversampling technique
Introduction
Energy consumption and its consequences are inevitable in
modern human activities. The anthropogenic sources of
air pollution include emissions from industrial plants, automobiles, and planes; the burning of straw, coal, and kerosene; aerosol cans; and so on. Various dangerous pollutants such as CO, CO2,
Particulate Matter (PM), NO2, SO2, O3, NH3, and Pb are
being released into our environment every day. Chemicals
and particles constituting air pollution affect the health of
Editorial responsibility: M. Abbaspour.
* B. P. Pande
¹ Sikh National College, Qadian, Guru Nanak Dev University, Amritsar, Punjab, India
² Department of Computer Applications, LSM, Government PG College, Pithoragarh, Uttarakhand, India
humans, animals, and even plants. Air pollution can cause a
multitude of serious diseases in humans, from bronchitis to
heart disease and from pneumonia to lung cancer. Poor air
quality contributes to other contemporary environmental issues
such as global warming, acid rain, reduced visibility, smog, aerosol formation, climate change, and premature deaths. Scientists have realized that air pollution can also
damage historical monuments (Rogers 2019). Vehicle emissions, atmospheric releases from power plants and factories, agricultural exhausts, etc. are responsible for increased
greenhouse gases. Greenhouse gases adversely affect
climate conditions and, consequently, the growth of plants
(Fahad et al. 2021a). Emissions of inorganic carbons and
greenhouse gases also affect plant-soil interactions (Fahad
et al. 2021b). Climatic fluctuations affect not only humans
and animals but also agricultural factors and productivity
(Sönmez et al. 2021). Economic losses are
an allied consequence. The Air Quality Index (AQI), an
assessment parameter, is directly related to public health. A
higher AQI indicates more dangerous exposure for
the human population. The need to predict the AQI
in advance has therefore motivated scientists to monitor and model
air quality. Monitoring and predicting the AQI, especially in
urban areas, has become a vital and challenging task with
increasing motor traffic and industrial development. Most
air quality studies and research works target
developed countries, although the concentration of the
deadliest pollutants, such as PM2.5, is found to be several times higher in
developing countries (Rybarczyk and Zalakeviciute 2021).
Few researchers have undertaken the study of air
quality prediction for Indian cities. After going through the
available literature, a strong need was felt to fill this
gap by attempting analysis and prediction of AQI for India.
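To make the link between AQI values and exposure levels concrete, the sketch below maps an AQI value to a descriptive category. It assumes the six bands of India's National Air Quality Index (Good, Satisfactory, Moderate, Poor, Very Poor, Severe); the section above does not state these breakpoints, and the function name `aqi_category` is purely illustrative.

```python
from bisect import bisect_left

# Assumption: inclusive upper bounds of the National AQI bands used in India
# (0-50, 51-100, 101-200, 201-300, 301-400, above 400).
UPPER_BOUNDS = [50, 100, 200, 300, 400]
CATEGORIES = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]


def aqi_category(aqi):
    """Return the descriptive category for a non-negative AQI value."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    # bisect_left finds the first band whose upper bound is >= aqi;
    # values above the last bound fall through to "Severe".
    return CATEGORIES[bisect_left(UPPER_BOUNDS, aqi)]


if __name__ == "__main__":
    for value in (42, 135, 410):
        print(value, aqi_category(value))
```

A lookup of this kind turns a continuous AQI into the class label that a classification model would be trained to predict.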
Various models have been used in the literature to
predict AQI, including statistical, deterministic, physical, and
Machine Learning (ML) models. The traditional techniques
based on probability and statistics are very complex and
less efficient. ML-based AQI prediction models have
proved to be more reliable and consistent. Advanced
technologies and sensors have made data collection easy and precise. Accurate and reliable prediction from such large
volumes of environmental data requires rigorous analysis, which only
ML algorithms can handle efficiently. Al-Jamimi et al.
(2018) thoroughly discussed the importance of supervised
ML algorithms for applied environmental protection issues.
The present work investigates six years of air pollution data
from Indian cities and analyzes twelve air pollutants and the
AQI. The dataset is first preprocessed and cleaned; data
visualization methods are then applied to develop better
insights and to investigate hidden patterns and trends. This
work combines correlation-based feature selection with ML
models, an approach that very few scholars have used in
the literature (Alade et al. 2019a). The data imbalance is
identified and addressed with a resampling technique. Five
popular ML models are applied in conjunction with this resampling technique. Their performances are then compared
through standard metrics. These metrics are used by
many scholars in the field (see Table 1) and by other
authors of ML applications such as Ayturan et al. (2020), Alade
et al. (2019b), Al-Jamimi et al. (2019), and Al-Jamimi and
Saleh (2019).
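The two preparatory steps outlined above, correlation-based feature selection and resampling of the minority class, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the 0.5 threshold, the toy values, and the function names are hypothetical, and the oversampler is a simplified SMOTE-style interpolation that uses only the single nearest neighbour rather than k neighbours.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.5):
    """Keep features whose |r| with the target is at least the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]


def oversample_minority(samples, n_new, seed=0):
    """Create n_new synthetic points between minority samples.

    Simplified SMOTE-style sketch: each new point lies on the segment
    joining a random minority sample and its nearest minority neighbour.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        neighbour = min((s for s in samples if s is not base),
                        key=lambda s: math.dist(base, s))
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b)
                               for b, n in zip(base, neighbour)))
    return synthetic


# Toy data (hypothetical values, for illustration only).
aqi = [60.0, 120.0, 180.0, 240.0, 300.0]
pollutants = {
    "PM2.5": [20.0, 45.0, 70.0, 95.0, 120.0],  # rises linearly with AQI
    "Toluene": [3.0, 1.0, 4.0, 1.0, 3.0],      # no linear relation to AQI
}
minority = [(1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]

if __name__ == "__main__":
    print(select_features(pollutants, aqi))
    print(oversample_minority(minority, 4))
```

In practice a library implementation would normally replace both helpers; the point here is only that feature selection looks at each pollutant's linear association with the AQI, while resampling adds synthetic minority-class rows before the models are trained.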
Section 2 presents the literature survey with a comparative analysis of existing work on air quality
prediction with ML. Section 3 describes the dataset under
study, its preprocessing, and the feature selection techniques
applied. Section 4 deals with observing hidden patterns in
the dataset through data visualisation (...truncated)