In this data analysis project, I work with Air Quality Index (AQI) data for India. I use several statistical tools to analyze the data, including exploratory data analysis and techniques for inference and modelling.
To understand the relationships between the different parameters measured as part of India’s AQI data.
To find out whether the nationwide COVID-19 lockdowns, social distancing, the closure of industries and factories, and the suspension of private and public transport had a significant impact on India’s AQI.
Let’s take a look at the Data
Year | Month | Day | City | Specie | count | min | max | median | variance |
---|---|---|---|---|---|---|---|---|---|
2014 | 12 | 29 | Delhi | pm25 | 24 | 296.0 | 460.0 | 394.0 | 27226.40 |
2014 | 12 | 29 | Hyderabad | pm25 | 13 | 159.0 | 162.0 | 161.0 | 8.59 |
2014 | 12 | 29 | Delhi | pm10 | 82 | 79.0 | 999.0 | 218.0 | 634717.00 |
2014 | 12 | 29 | Delhi | o3 | 79 | 0.1 | 87.4 | 3.2 | 2324.38 |
2014 | 12 | 29 | Delhi | so2 | 91 | 0.3 | 21.2 | 4.2 | 231.83 |
2014 | 12 | 29 | Delhi | pm25 | 83 | 139.0 | 747.0 | 307.0 | 215149.00 |
Year | Month | Day | City | Specie | count | min | max | median | variance |
---|---|---|---|---|---|---|---|---|---|
2021 | 6 | 24 | Kolkata | o3 | 48 | 2.9 | 105.7 | 8.4 | 4611.99 |
2021 | 6 | 24 | Kolkata | pm25 | 48 | 45.0 | 104.0 | 63.0 | 1398.61 |
2021 | 6 | 24 | Kolkata | pressure | 56 | 996.9 | 1007.5 | 999.3 | 67.94 |
2021 | 6 | 24 | Kolkata | wind-speed | 56 | 0.1 | 4.2 | 1.1 | 10.87 |
2021 | 6 | 24 | Kolkata | dew | 37 | 28.0 | 28.0 | 28.0 | 0.00 |
2021 | 6 | 24 | Kolkata | co | 48 | 1.0 | 5.2 | 2.3 | 16.41 |
This Dataset contains 263890 rows and 10 columns.
The data span 2014 to June 2021, with daily observations for every month of the last 3 years (2021 only up to June).
The data come from monitoring stations located in and around 22 cities. The cities include:
State | City | Number of Stations |
---|---|---|
Andhra_Pradesh | Visakhapatnam | 1 |
Arunachal_Pradesh | Visakhapatnam | 1 |
Bihar | Patna | 6 |
Chandigarh | Chandigarh | 1 |
Delhi | Delhi | 40 |
Kerala | Thiruvananthapuram | 2 |
Kerala | Thrissur | 1 |
Madhya_Pradesh | Bhopal | 1 |
Maharashtra | Mumbai | 21 |
Maharashtra | Nagpur | 1 |
Maharashtra | Nashik | 1 |
Meghalaya | Shillong | 1 |
Rajasthan | Jaipur | 3 |
Tamil_Nadu | Chennai | 8 |
Telangana | Hyderabad | 6 |
Uttar_Pradesh | Lucknow | 6 |
Uttar_Pradesh | Muzaffarnagar | 1 |
West_Bengal | Kolkata | 7 |
Parameters | Description | Units |
---|---|---|
pm25 | Particle pollution/particulate matter (particles less than or equal to 2.5 micrometers in diameter) | micrograms/cubic meter |
pm10 | Particle pollution/particulate matter (particles less than or equal to 10 micrometers in diameter) | micrograms/cubic meter |
o3 | Ground-level ozone | micrograms/cubic meter |
so2 | Sulphur dioxide | micrograms/cubic meter |
no2 | Nitrogen dioxide | micrograms/cubic meter |
co | Carbon Monoxide | milligrams/cubic meter |
temperature | Temperature | Celsius |
pressure | Air Pressure | hPa |
wind-gust | Wind Gust/Force | kmph |
humidity | Relative Humidity | No Units |
wind-speed | Wind Speed | kmph |
dew | Dew Point | Celsius |
precipitation | Precipitation | millimeters |
AQI is a comparable and communicable way of summarizing the parameters measured in the air. It is calculated only when data for at least 3 of the first 6 parameters (the pollutants) are available, one of which must be pm10 or pm25. When this condition holds, the AQI is the maximum of the recorded values of those parameters.
It also helps in identifying faulty standards and inadequate monitoring programmes.
AQI helps in analysing the change in air quality (improvement or degradation).
It allows air quality conditions at different locations/cities to be compared.
It can be easily interpreted by anyone, without knowing about background details.
AQI Values | Level of Health Concern |
---|---|
0-50 | Good |
51-100 | Moderate |
101-150 | Unhealthy for sensitive groups |
151-200 | Unhealthy |
201-300 | Very Unhealthy |
301-500 | Hazardous |
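The availability rule and the category table above can be sketched in a few lines of Python. This is only an illustration, not the code used in this project: the function names `overall_aqi` and `health_concern` are hypothetical, and the sketch assumes the recorded values are already on the AQI scale, so that taking their maximum is meaningful.

```python
from typing import Mapping, Optional

# The six pollutant species listed first in the parameter table.
POLLUTANTS = ("pm25", "pm10", "o3", "so2", "no2", "co")

# Upper bound of each band in the AQI table.
AQI_BANDS = (
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for sensitive groups"),
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
    (500, "Hazardous"),
)


def overall_aqi(readings: Mapping[str, float]) -> Optional[float]:
    """Return the largest pollutant value, or None if the availability rule fails."""
    available = {
        name: value
        for name, value in readings.items()
        if name in POLLUTANTS and value is not None
    }
    has_particulates = "pm25" in available or "pm10" in available
    if len(available) < 3 or not has_particulates:
        return None
    return max(available.values())


def health_concern(aqi: float) -> str:
    """Map an AQI value to its level of health concern."""
    for upper, label in AQI_BANDS:
        if aqi <= upper:
            return label
    return "Beyond index"  # hypothetical label for values above 500


# Median values from one of the Delhi rows in the sample data.
delhi = {"pm25": 307.0, "pm10": 218.0, "o3": 3.2, "so2": 4.2}
delhi_aqi = overall_aqi(delhi)
delhi_level = health_concern(delhi_aqi)
```

With these inputs the particulate reading dominates, which matches the later observation that particulate matter is the major contributor to AQI.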
In the further analysis, we will refer to pm25, pm10, o3, so2, no2 and co as pollutants and to the remaining weather parameters as non-pollutants.
Here we look at how many observations each city records per year.
Delhi appears to be the most heavily monitored of the cities, while the other cities have an almost equal number of observations per year. Our data have many missing values for 2018 and the years before it; hence we will only consider the data for 2019, 2020 and 2021.
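A count like this can be produced with a group-by. The sketch below uses pandas on a handful of made-up rows laid out like the dataset (the `Year`, `City`, `Specie` and `count` columns come from the table at the top; the values and the variable names are illustrative assumptions, not the project's R code).

```python
import pandas as pd

# Toy rows in the same layout as the dataset (values are illustrative only).
df = pd.DataFrame(
    {
        "Year": [2019, 2019, 2019, 2020, 2020, 2018],
        "City": ["Delhi", "Delhi", "Kolkata", "Delhi", "Kolkata", "Delhi"],
        "Specie": ["pm25", "pm10", "pm25", "pm25", "pm25", "pm25"],
        "count": [83, 82, 48, 80, 48, 24],
    }
)

# Number of recorded rows for each city in each year (0 where a city has none).
obs_per_year = df.groupby(["City", "Year"]).size().unstack(fill_value=0)

# Keep only the years used in the rest of the analysis.
recent = df[df["Year"].isin([2019, 2020, 2021])]
```

Filling absent city/year combinations with 0 makes gaps in monitoring visible in the same table instead of silently dropping them.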
To understand why Delhi is so heavily monitored, we look at the average median AQI levels of pollutants in each city for every year.
Indeed, Delhi’s AQI levels of pollutants are clearly higher than those of all other cities. This makes Delhi the main focus of all further analysis. Every city, owing to its location in a different part of the country and its different development status, population and local weather conditions, has different measured values of AQI, so it makes sense to look at the cities separately. Whenever we do a city-wise analysis, we will look at these as our top 4 cities: Delhi, Muzaffarnagar, Kolkata and Mumbai.
Here we will take a visual tour of the Data through Exploratory Data Analysis.
The first observation is that the median and min AQI levels of all 4 cities dropped in 2020, while those of 2021 are similar to 2019. But the max AQI levels increased in 2020, and even more in 2021, even though we had so many restrictions in 2020!
In the first half of 2020, the min and median AQI levels dropped, whereas in the second half they either increased or remained the same as in 2019. The values observed in 2021 are much the same as those of 2019, if not higher. Max AQI levels kept rising across the years irrespective of the part of the year. The spread (variance) of the respective measures looks the same across the years.
Here we see an interesting effect of the lockdowns. All the pollutants decreased considerably in the first half of 2020, whereas in the second half they bounced back to match or even exceed the levels of 2019! The levels of 2021 and 2019 are very similar.
Indeed, the AQI levels dropped in the first half of 2020 compared to 2019, with the exception of Mumbai, but rose above the 2019 levels in the latter half. The AQI levels of 2021 are higher than those of both previous years. I have fit a quadratic model here, which seems to model the situation well, though we shouldn’t focus on the model fit for 2021, as we only have data for the first 6 months.
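For readers unfamiliar with quadratic fits, here is a minimal Python sketch of the idea. It assumes the day of the year is the predictor, which the text above does not state, and it uses synthetic data rather than the real AQI series.

```python
import numpy as np

# Synthetic daily AQI with a dip in the middle of the year (illustrative only).
rng = np.random.default_rng(0)
day = np.arange(1, 366)
true_curve = 0.004 * (day - 180) ** 2 + 90
aqi = true_curve + rng.normal(0, 10, size=day.size)

# Least-squares fit of aqi = a * day^2 + b * day + c.
a, b, c = np.polyfit(day, aqi, deg=2)
fitted = np.polyval([a, b, c], day)

# Day at which the fitted parabola reaches its minimum.
vertex_day = -b / (2 * a)
```

A positive leading coefficient `a` corresponds to the pattern described above: levels that fall towards the middle of the year and rise again in the latter half.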
Now we will look at the AQI levels observed above in terms of the levels of each pollutant.
Except for ground-level ozone and sulphur dioxide, most pollutant levels decreased during the first half of 2020 and increased in the second half when compared to 2019. This increase can be attributed to the increased stubble burning that took place in the latter half of the year. Surprisingly, particulate matter rose to much higher levels in Mumbai even though the city saw a massive lockdown and both transportation and industrial work were suspended.
We see a strong linear relationship among the pollutants across the past 3 years in our top cities.
We see a strong linear relationship among the weather parameters across the past 3 years. One important observation hints at why the pm10 and pm25 levels were higher in Mumbai: Mumbai experienced really high wind speeds in 2020 and 2021, so the particulate matter content may have increased because the wind moved a lot of dust.
This suggests a positive linear relationship between temperature and humidity on the one hand and all the pollutant levels on the other. Wind speed is indeed positively correlated with pm10 and pm25, which might be the cause of the higher AQI in Mumbai.
The wind speeds and humidity were higher in 2020, whereas the pressure and temperatures were lower.
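Because the dataset stores one row per parameter (the `Specie` column), pairwise relationships like the ones above require reshaping first. The following pandas sketch shows one way to do it on made-up values; the `Day` index and the numbers are assumptions for illustration only.

```python
import pandas as pd

# Long-format toy data: one row per day and parameter (values are made up).
long_df = pd.DataFrame(
    {
        "Day": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
        "Specie": ["pm25", "pm10", "wind-speed"] * 4,
        "median": [60.0, 80.0, 1.0, 90.0, 120.0, 2.0,
                   75.0, 100.0, 1.2, 120.0, 160.0, 3.0],
    }
)

# One row per day, one column per parameter.
wide = long_df.pivot(index="Day", columns="Specie", values="median")

# Pearson correlation between every pair of parameters.
corr = wide.corr(method="pearson")
```

Each entry of `corr` measures only the strength of a linear association, so a high value between wind speed and particulate matter is a hint about a possible cause, not proof of one.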
To perform any kind of analysis, we need to model the data in order to get an idea of the expected value, variance and other properties of the data. Modelling also helps us get rid of the noise in the data we collect.
For each pollutant, after looking at the histograms of the data, I tried fitting the Gamma, Lognormal, Normal, Weibull and Exponential distributions by the method of maximum likelihood estimation. The best of these 5 models was chosen based on the chi-square test for goodness of fit.
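The procedure above can be sketched with SciPy as follows. This is an illustrative Python version, not the R code used for the project: the sample is synthetic, the location parameter of the positive distributions is fixed at 0 as a simplifying assumption, and the candidates are ranked by the raw chi-square statistic over ten equal-probability bins rather than by a formal p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.gamma(shape=4.0, scale=20.0, size=2000)  # synthetic "pm25" values

# Candidate families; floc=0 pins the location of the positive distributions.
candidates = {
    "gamma": (stats.gamma, {"floc": 0}),
    "lognormal": (stats.lognorm, {"floc": 0}),
    "normal": (stats.norm, {}),
    "weibull": (stats.weibull_min, {"floc": 0}),
    "exponential": (stats.expon, {"floc": 0}),
}

# Equal-probability bins taken from the empirical quantiles.
edges = np.quantile(sample, np.linspace(0, 1, 11))
observed, _ = np.histogram(sample, bins=edges)


def chi_square_stat(dist, params):
    """Pearson chi-square statistic of the binned sample against a fitted model."""
    cdf = dist.cdf(edges, *params)
    cdf[0], cdf[-1] = 0.0, 1.0  # let the outer bins absorb the tails
    expected = sample.size * np.diff(cdf)
    return float(np.sum((observed - expected) ** 2 / expected))


results = {}
for name, (dist, fixed) in candidates.items():
    params = dist.fit(sample, **fixed)  # maximum likelihood estimates
    results[name] = chi_square_stat(dist, params)

best = min(results, key=results.get)
```

A smaller statistic means the fitted model reproduces the binned counts more closely. A stricter comparison would also account for the different number of estimated parameters in each family.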
Looking at the moments of the fitted distributions, we see that pm25, one of the most important factors in the AQI calculation, was much lower during the first 6 months of 2020 than in 2019 in all the cities except Mumbai, whereas the levels rose back to much higher values in 2021.
Similar fitted probability models suggest that almost all the pollutant levels were lower in 2020 than in 2019, while the values of 2019 and 2021 are very similar, if those of 2021 are not higher. The relevant plots and figures for the other pollutants are attached at the end for interested readers.
There were also some cases where outliers degraded the fit of the model; in those cases I fit the model by minimizing the Hellinger distance instead. One such case is given below.
It is clear that the Hellinger fit is much better at fitting the part of the distribution where most of the observations lie.
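To illustrate why a minimum Hellinger distance fit is less sensitive to outliers, here is a small Python sketch on synthetic data. It is one possible construction and not necessarily the one used in this project: it compares a histogram density estimate with a gamma density and minimizes the squared Hellinger distance, 1 minus the integral of the square root of the product of the two densities, with a general-purpose optimizer.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(2)
# Synthetic sample: a gamma bulk plus a few extreme outliers (illustrative only).
bulk = rng.gamma(shape=5.0, scale=10.0, size=500)
sample = np.concatenate([bulk, [900.0, 950.0, 999.0]])

# Nonparametric density estimate from a histogram on a fixed grid.
edges = np.linspace(0.0, 1000.0, 201)
width = edges[1] - edges[0]
centers = 0.5 * (edges[:-1] + edges[1:])
hist_density, _ = np.histogram(sample, bins=edges, density=True)


def squared_hellinger(log_params):
    """Squared Hellinger distance between the histogram and a gamma density."""
    shape, scale = np.exp(log_params)  # optimise on the log scale to stay positive
    model_density = stats.gamma.pdf(centers, a=shape, scale=scale)
    affinity = np.sum(np.sqrt(hist_density * model_density)) * width
    return 1.0 - affinity


start = np.log([2.0, 20.0])
fit = optimize.minimize(squared_hellinger, start, method="Nelder-Mead")
hd_shape, hd_scale = np.exp(fit.x)

# Maximum likelihood fit for comparison; the outliers pull its mean upward.
mle_shape, _, mle_scale = stats.gamma.fit(sample, floc=0)
```

Because the distance works with square roots of densities, a handful of extreme values contributes very little to it, so the fitted curve follows the bulk of the data.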
Here we will test some of our beliefs about the data, which will help us get a better understanding of the pollutant and AQI levels.
Hence, it is clearly visible that all cities except Delhi were equally monitored in 2019 and 2020, whereas Delhi was more heavily monitored.
So, to test the null hypothesis that the mean pollutant level of 2020 is equal to that of 2019 against the alternative that the mean pollutant level dropped in 2020, we perform a two-sample t-test for equality of means. After looking at the boxplots, the variances for each year look more or less the same, so I assumed that both samples have the same variance.
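In Python, a one-sided pooled-variance test of this kind could look like the sketch below. The two samples are synthetic stand-ins for the yearly pollutant levels, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
levels_2019 = rng.normal(loc=120.0, scale=30.0, size=180)  # synthetic values
levels_2020 = rng.normal(loc=100.0, scale=30.0, size=180)  # synthetic values

# H0: mean(2020) == mean(2019)   vs   H1: mean(2020) < mean(2019)
# equal_var=True gives the pooled-variance (Student) version of the test.
result = stats.ttest_ind(
    levels_2020, levels_2019, equal_var=True, alternative="less"
)
reject_at_1pct = result.pvalue < 0.01
```

If the boxplots had shown clearly different spreads, `equal_var=False` (Welch's test) would be the safer choice.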
So, it is true that 2021’s pollutant levels are higher than those of 2020; to be precise, they are at least 7.5 points higher with 99% confidence when comparing the first 6 months. But it is surprising that 2020’s pollutant levels did not drop significantly compared to 2019, despite the lockdowns and all the other measures that favoured a decrease in air pollutant levels.
Now we look at the first half and the latter half of the years separately, because the nationwide lockdown took place during the first half of 2020. We test the same hypothesis as above, but now on the split data.
Indeed, our guess was right here. The pollutant levels decreased during the first half of 2020 by at least 6.4 points with 99% confidence, and we reject our null hypothesis.
Looking at the second half of 2020 and testing the null hypothesis that the mean pollutant levels of 2019 and 2020 are equal against the alternative that 2020’s level was higher reveals that 2020’s level was indeed higher in the second half.
Considering the First Half of 2019, 2020 and 2021:
Here we perform two-sample t-tests and find 99% confidence intervals for the difference of two means, assuming equal variances. The first test has the null hypothesis that the mean pollutant levels of 2020 and 2019 are the same against the alternative that the mean pollutant level of 2020 is lower than that of 2019. Similarly, we test whether 2021’s mean level is higher than 2020’s. Finally, we test whether the means of 2021 and 2019 are equal, which is a two-sided test.
The first row of each table corresponds to the first test, second row to second test and third row to third test from the tests mentioned above.
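The confidence intervals behind statements such as "at least 7.5 points higher with 99% confidence" can be computed from the pooled variance. The helper below is a hypothetical Python sketch (the name `pooled_diff_interval` and the tiny example samples are mine); it returns either a two-sided interval or a one-sided lower bound for the difference of means.

```python
import numpy as np
from scipy import stats


def pooled_diff_interval(x, y, confidence=0.99, one_sided=False):
    """Interval for mean(x) - mean(y), assuming both samples share a variance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n, m = x.size, y.size
    dof = n + m - 2
    pooled_var = ((n - 1) * x.var(ddof=1) + (m - 1) * y.var(ddof=1)) / dof
    std_err = np.sqrt(pooled_var * (1.0 / n + 1.0 / m))
    diff = x.mean() - y.mean()
    if one_sided:
        # Lower confidence bound: the difference is at least this large.
        t_crit = stats.t.ppf(confidence, dof)
        return diff - t_crit * std_err, np.inf
    t_crit = stats.t.ppf(0.5 + confidence / 2.0, dof)
    return diff - t_crit * std_err, diff + t_crit * std_err


# Tiny made-up samples, just to show the call.
later = [10.0, 12.0, 14.0, 16.0]
earlier = [5.0, 7.0, 9.0, 11.0]
low, high = pooled_diff_interval(later, earlier)
lower_bound, _ = pooled_diff_interval(later, earlier, one_sided=True)
```

The one-sided bound is the natural companion of the one-sided tests, while the two-sided interval matches the test of equality between 2021 and 2019.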
Indeed! For most of the pollutants our guess seems to hold: 2020’s pollutant levels dropped compared to 2019 and 2021, whereas the means of 2021 and 2019 are similar.
A similar test on the second half of 2020 and 2019, where we test the null against the alternative that 2020’s mean pollutant level is higher than that of 2019, reveals the following.
Almost all the pollutant levels either went up or remained the same as in 2019. A possible cause is the 40% increase in stubble burning during the latter half of 2020 that various reports suggest.
In this section we analyze the data and provide a quantitative way to summarize the trends and patterns. We mostly use concepts and techniques from Multiple Linear Regression (MLR).
Here we fit an MLR to the strong linear relationships we saw in the visual overview section. Because of the difficulty in visualization, residuals will play an important role in deciding model fit.
We will look at the linear relationship between each pollutant level and the weather parameters, and at the linear relationship between each pollutant and the other pollutants. We will also look at the standard errors of the estimates, their confidence intervals, and how much of the variation in the pollutant levels the weather parameters explain.
Some assumptions made during modelling are:
The weather parameters are independent of the Level of pollutants.
The weather parameters are similarly distributed across 2019, 2020 and 2021 [Except for Mumbai where we see an increased wind-speed in the later two years]
In most of the models which we will try to fit we will look at the first half and the second half of the year separately.
I present here the models for Kolkata and Delhi only; models for the other cities are attached at the end for interested readers. We fit an MLR model with each pollutant as the response and the weather parameters as the predictors. Similarly, to capture the strong relationship among the pollutants themselves, we fit a separate MLR model with only the other pollutants’ levels as predictors. There are two reasons for splitting the year into 2 parts. First, as can be seen in the visual overview section, the weather parameter levels in the first half of the year are very different from those in the second half. Second, we have data for 2021 only for the first 6 months, so fitting a model for the first half of the year lets us compare it with 2021.
Interestingly, in the first half of the year the weather parameters explain the pollutant levels much better than in the second half. The strength of the relationship between a pollutant’s level and the other pollutants’ levels remains almost the same during both halves of the year. Temperature, humidity and dew have a highly significant effect in explaining the pollutant levels throughout the year, which may be due to the intricate physical relationships that exist between these quantities, e.g. the concentration of a gas being directly proportional to temperature.
The confidence intervals are given under each coefficient and tell us how significant its presence in the model is. Any coefficient whose confidence interval includes 0 can be thought of as less important in explaining the response.
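The quantities discussed above (coefficients, standard errors, confidence intervals and the share of variation explained) all follow from an ordinary least squares fit. The Python sketch below computes them on synthetic data; the predictors, the coefficients used to generate the response, and the 99% level are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 200
# Synthetic weather predictors (illustrative only).
temperature = rng.normal(25.0, 5.0, n)
humidity = rng.normal(60.0, 10.0, n)
wind_speed = rng.normal(2.0, 0.5, n)
# Synthetic response: depends on temperature and humidity, not on wind speed.
pm25 = 20.0 + 3.0 * temperature + 1.0 * humidity + rng.normal(0.0, 10.0, n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), temperature, humidity, wind_speed])
beta, *_ = np.linalg.lstsq(X, pm25, rcond=None)

# Residual variance and standard errors of the coefficients.
residuals = pm25 - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# 99% two-sided confidence intervals for each coefficient.
t_crit = stats.t.ppf(0.995, dof)
ci_low, ci_high = beta - t_crit * std_err, beta + t_crit * std_err

# Share of the variation in the response explained by the predictors.
r_squared = 1.0 - residuals @ residuals / np.sum((pm25 - pm25.mean()) ** 2)
```

In this synthetic setup the intervals for temperature and humidity lie well away from 0, while the wind speed interval will usually contain 0, which is the pattern the paragraph above describes for a less important predictor.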
Indeed, the residuals appear to be normally distributed with mean around 0. Therefore our model doesn’t seem to have an issue in assuming that the relationship between the pollutant levels and the weather parameters is linear.
Interestingly, our model fitted with the 2019 weather parameters performs less well on 2020’s data than on 2021’s. The histogram-density plots of the residuals also show that for 2020 our model predicts pollutant levels higher than the values actually observed, which suggests that 2020’s pollutant levels were lower during the first 6 months.
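The logic of applying a model fitted on one year to another year can be sketched as follows. The data are synthetic, with a deliberate downward shift built into the second year to mimic a lockdown effect, so the numbers illustrate the reasoning and not the actual results.

```python
import numpy as np

rng = np.random.default_rng(5)


def make_year(n, shift):
    """Synthetic year: pm25 depends linearly on temperature plus a level shift."""
    temperature = rng.normal(25.0, 5.0, n)
    pm25 = 30.0 + 3.0 * temperature + shift + rng.normal(0.0, 8.0, n)
    return np.column_stack([np.ones(n), temperature]), pm25


X_2019, y_2019 = make_year(180, shift=0.0)
X_2020, y_2020 = make_year(180, shift=-15.0)  # assumed lockdown drop

# Fit on 2019 only, then predict 2020 from its weather.
beta_2019, *_ = np.linalg.lstsq(X_2019, y_2019, rcond=None)
residuals_2019 = y_2019 - X_2019 @ beta_2019
residuals_2020 = y_2020 - X_2020 @ beta_2019

# Residual = observed - predicted, so a negative mean means over-prediction.
mean_residual_2020 = residuals_2020.mean()
```

In-sample residuals average to 0 by construction, so a clearly negative mean residual on the other year indicates that pollutant levels were lower than the weather alone would predict.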
Kolkata saw more or less stable weather conditions throughout the first half of the year, and hence not much of the variability of the pollutants was explained by the weather parameters. Delhi’s dependence on the weather parameters appears stronger.
In 2020, during the lockdown in the first half of the year, there was a drop in the pollutant levels and AQI levels across most of the major cities; our study included Kolkata, Delhi, Muzaffarnagar and Mumbai. The particulate matter levels in Mumbai, however, were much higher, which can be attributed to the impact of Cyclone Nisarga.
Despite several restrictions on travel, transportation and industrial work, which are major contributors to pollutant levels, we saw increased pollutant levels in the second half of 2020 that surpassed the levels of 2019, which may be due to the 40% increase in stubble burning towards the end of the year.
Coastal Regions tend to have lower AQI Levels and Pollutant Levels mostly because of the influence of winds and more Cyclonic Conditions.
The major contributor to AQI levels is particulate matter, so reducing its levels will greatly help improve the overall air quality.
Introduction of electric vehicles can greatly combat air pollution by reducing the levels of NO2, SO2 and CO, which are released by diesel and petrol engines.
Controlled stubble burning can stabilize air quality throughout the latter half of the year.
The Dataset was collected from Air Quality Open Data Platform.
As per various news reports and observations from SAFAR, there was a 40% increase in stubble burning in 2020.
All relevant R code, files, datasets, etc. are available on my account.