INTRODUCTION

In this data analysis project, I work with Air Quality Index (AQI) data for India. I use several statistical tools to analyze the data, including exploratory data analysis and techniques for inference and modelling.

MOTIVE

  • To understand the relationships between the different parameters measured as part of India’s AQI data.

  • To find out whether the COVID-19 nation-wide lockdowns, social distancing, closure of industries and factories, and suspension of movement via private vehicles and public transport had a significant impact on India’s AQI.

UNDERSTANDING THE DATA

Let’s take a look at the Data

First few rows of the Air Quality Index Data
Year Month Day City Specie count min max median variance
2014 12 29 Delhi pm25 24 296.0 460.0 394.0 27226.40
2014 12 29 Hyderabad pm25 13 159.0 162.0 161.0 8.59
2014 12 29 Delhi pm10 82 79.0 999.0 218.0 634717.00
2014 12 29 Delhi o3 79 0.1 87.4 3.2 2324.38
2014 12 29 Delhi so2 91 0.3 21.2 4.2 231.83
2014 12 29 Delhi pm25 83 139.0 747.0 307.0 215149.00
Last few rows of the Air Quality Index Data
Year Month Day City Specie count min max median variance
2021 6 24 Kolkata o3 48 2.9 105.7 8.4 4611.99
2021 6 24 Kolkata pm25 48 45.0 104.0 63.0 1398.61
2021 6 24 Kolkata pressure 56 996.9 1007.5 999.3 67.94
2021 6 24 Kolkata wind-speed 56 0.1 4.2 1.1 10.87
2021 6 24 Kolkata dew 37 28.0 28.0 28.0 0.00
2021 6 24 Kolkata co 48 1.0 5.2 2.3 16.41
  • This Dataset contains 263890 rows and 10 columns.

  • The data span 2014 to June 2021. For the last 3 years (2019 to 2021), observations are available for every day of each month covered.

  • The data come from stations located in or near 22 cities. The cities include:

City Stations
State City Number of Stations
Andhra_Pradesh Visakhapatnam 1
Arunachal_Pradesh Visakhapatnam 1
Bihar Patna 6
Chandigarh Chandigarh 1
Delhi Delhi 40
Kerala Thiruvananthapuram 2
Kerala Thrissur 1
Madhya_Pradesh Bhopal 1
Maharashtra Mumbai 21
Maharashtra Nagpur 1
Maharashtra Nashik 1
Meghalaya Shillong 1
Rajasthan Jaipur 3
Tamil_Nadu Chennai 8
Telangana Hyderabad 6
Uttar_Pradesh Lucknow 6
Uttar_Pradesh Muzaffarnagar 1
West_Bengal Kolkata 7
  • The parameters measured at the different stations are listed in the Specie column. They are:
Specie Description
Parameters Description Units
pm25 Particle pollution/particulate matter(particles less than or equal to 2.5 micrometers in diameter) micrograms/cubic meter
pm10 Particle pollution/particulate matter(particles less than or equal to 10 micrometers in diameter) micrograms/cubic meter
o3 Ground-level ozone micrograms/cubic meter
so2 Sulphur dioxide micrograms/cubic meter
no2 Nitrogen dioxide micrograms/cubic meter
co Carbon Monoxide milligrams/cubic meter
temperature Temperature Celsius
pressure Air Pressure hPa
wind-gust Wind Gust/Force kmph
humidity Relative Humidity No Units
wind-speed Wind Speed kmph
dew Dew Point Celsius
precipitation Precipitation millimeters

AQI is a comparable and easily communicated way of summarizing the parameters measured in the air. It is calculated only when data are available for at least 3 of the top 6 parameters, one of which must be pm10 or pm25. When that condition is satisfied, the AQI is the maximum of the recorded parameter values. The AQI is useful in several ways:

  • It helps in identifying faulty standards and inadequate monitoring programmes.

  • It helps in analysing the change in air quality (improvement or degradation).

  • It allows air quality conditions at different locations/cities to be compared.

  • It can be easily interpreted by anyone, without knowing about background details.

Significance of the AQI Values
AQI Values Level of Health Concern
0-50 Good
51-100 Moderate
101-150 Unhealthy for sensitive groups
151-200 Unhealthy
201-300 Very Unhealthy
301-500 Hazardous
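
The availability rule and the table above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "top 6 parameters" are taken to be the six pollutants listed first in the Specie table, the inputs are already on the AQI scale (the conversion from raw concentrations is not shown), and the function names `overall_aqi` and `health_concern` are hypothetical.

```python
# Assumption: the "top 6 parameters" are the six pollutants in the Specie table.
POLLUTANTS = ("pm25", "pm10", "o3", "so2", "no2", "co")

# Upper bound of each AQI band and its level of health concern (from the table).
AQI_LEVELS = [
    (50, "Good"),
    (100, "Moderate"),
    (150, "Unhealthy for sensitive groups"),
    (200, "Unhealthy"),
    (300, "Very Unhealthy"),
    (500, "Hazardous"),
]


def overall_aqi(values):
    """Return the max pollutant value, or None if the availability rule fails.

    `values` maps a parameter name to its value on the AQI scale; missing
    readings are either absent from the dict or set to None.
    """
    available = {
        name: value
        for name, value in values.items()
        if name in POLLUTANTS and value is not None
    }
    has_particulates = "pm25" in available or "pm10" in available
    if len(available) < 3 or not has_particulates:
        return None
    return max(available.values())


def health_concern(aqi):
    """Map an AQI value to its band; None if it falls outside the 0-500 table."""
    if aqi < 0:
        return None
    for upper, label in AQI_LEVELS:
        if aqi <= upper:
            return label
    return None


# Medians taken from three of the sample rows shown earlier (Delhi, 2014-12-29).
reading = {"pm25": 307.0, "o3": 3.2, "so2": 4.2}
aqi = overall_aqi(reading)
print(aqi, health_concern(aqi))
```

Weather parameters such as pressure or dew are ignored by `overall_aqi`, which matches the later split into pollutants and non-pollutants.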

In the further analysis we refer to pm25, pm10, o3, so2, no2 and co as pollutants and to the remaining weather parameters as non-pollutants.

TOP CITY STATIONS

Here we look at how many observations each city records per year.

Delhi appears to be the most heavily monitored of all the cities. All the other cities have an almost equal number of observations per year. The data have many missing values for 2018 and the earlier years; hence we only consider the data for 2019, 2020 and 2021.
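
A minimal sketch of this counting and filtering step, using only the Python standard library. The rows are a handful copied from the head and tail of the data shown earlier, reduced to five columns; the function names are hypothetical, and the real analysis would run over all rows of the dataset.

```python
from collections import Counter

# (Year, Month, Day, City, Specie) for a few of the sample rows shown earlier.
ROWS = [
    (2014, 12, 29, "Delhi", "pm25"),
    (2014, 12, 29, "Hyderabad", "pm25"),
    (2014, 12, 29, "Delhi", "pm10"),
    (2021, 6, 24, "Kolkata", "o3"),
    (2021, 6, 24, "Kolkata", "pm25"),
]


def observations_per_city_year(rows):
    """Count the rows recorded for each (city, year) pair."""
    return Counter((city, year) for year, _month, _day, city, _specie in rows)


def keep_years(rows, years=(2019, 2020, 2021)):
    """Keep only the rows whose year is in `years`."""
    return [row for row in rows if row[0] in years]


counts = observations_per_city_year(ROWS)
recent = keep_years(ROWS)
print(counts.most_common())
print(len(recent), "rows kept for 2019-2021")
```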

To speculate on why Delhi is so heavily monitored, we look at the average median AQI levels of the pollutants in each city for every year.

Indeed, Delhi’s AQI levels of pollutants are clearly higher than those of all the other cities. This makes Delhi the main focus of all the further analysis. Every city has different measured AQI values, owing to its location in the country, development status, population and local weather conditions, so it makes sense to look at the cities separately. Whenever we do city-wise analysis, we look at these as our top 4 cities: Delhi, Muzaffarnagar, Kolkata and Mumbai.

VISUAL OVERVIEW

Here we will take a visual tour of the Data through Exploratory Data Analysis.

The first observation is that the median and min AQI levels of all 4 cities dropped in 2020, while those of 2021 are similar to 2019. The max AQI levels, however, increased in 2020 and even more in 2021, even though we had so many restrictions in 2020!

In the first half of 2020 the min and median AQI levels dropped, whereas in the second half they either increased or remained the same as in 2019. The observed values for 2021 are much the same as those of 2019, if not higher. Max AQI levels kept rising across the years irrespective of the half of the year. The spread (variance) of the respective measures looks the same across the years.

Here we see an interesting effect of the lockdowns. All the pollutants decreased considerably in the first half of 2020, whereas in the second half they rose again to match or even exceed the levels of 2019. The levels of 2021 and 2019 are very similar.
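
The half-year comparison behind these observations amounts to grouping the daily medians by year, half of the year, city and pollutant, and averaging them. A small sketch, assuming rows of the form (Year, Month, City, Specie, median); the values are taken from the sample rows shown earlier and the helper names are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def half_of_year(month):
    """Label January-June as 'H1' and July-December as 'H2'."""
    return "H1" if month <= 6 else "H2"


def mean_median_by_half(rows):
    """Average the daily medians per (year, half, city, specie) group."""
    groups = defaultdict(list)
    for year, month, city, specie, median in rows:
        groups[(year, half_of_year(month), city, specie)].append(median)
    return {key: fmean(values) for key, values in groups.items()}


# (Year, Month, City, Specie, median) from the sample rows shown earlier.
rows = [
    (2014, 12, "Delhi", "pm25", 394.0),
    (2014, 12, "Delhi", "pm25", 307.0),
    (2014, 12, "Hyderabad", "pm25", 161.0),
    (2021, 6, "Kolkata", "pm25", 63.0),
]
summary = mean_median_by_half(rows)
for key, value in sorted(summary.items()):
    print(key, value)
```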

STATISTICAL INFERENCE

ESTIMATION

DISTRIBUTIONS & PROBABILITY MODELS

To perform any kind of analysis we need to model the data we have in order to get an idea about the expected value, variance and other properties related to the data. It also helps us get rid of the noise we get from the data we collect.

  • Here we look at how the different pollutants are distributed, how the models differ by year and location, which probability model fits the data, and what the parameter estimates of the fitted model are, along with its mean and variance.

For each pollutant I tried fitting Gamma, Lognormal, Normal, Weibull and Exponential distributions by maximum likelihood estimation, after looking at the histograms of the data. The best of these 5 models was chosen using the Chi-square test for goodness of fit.
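
The fit-then-compare procedure can be sketched as follows. This is not the code used for the analysis: to stay within the Python standard library it covers only the three candidates whose maximum likelihood estimates have closed forms (Normal, Lognormal, Exponential), whereas Gamma and Weibull need numerical optimization. The sample is simulated, and comparing raw Chi-square statistics across models ignores the small difference in degrees of freedom between one- and two-parameter models.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def fit_normal(xs):
    """MLE for a Normal: sample mean and population standard deviation."""
    dist = NormalDist(fmean(xs), pstdev(xs))
    return dist.cdf


def fit_lognormal(xs):
    """MLE for a Lognormal: Normal MLE on the logged data (needs xs > 0)."""
    logs = [math.log(x) for x in xs]
    dist = NormalDist(fmean(logs), pstdev(logs))
    return lambda x: dist.cdf(math.log(x))


def fit_exponential(xs):
    """MLE for an Exponential: the scale is the sample mean."""
    mean = fmean(xs)
    return lambda x: 1.0 - math.exp(-x / mean)


def chi_square_stat(xs, cdf, bins=10):
    """Chi-square goodness-of-fit statistic with equal-probability bins.

    Under the fitted model, cdf(x) is uniform on [0, 1], so each of the
    `bins` equal-width slices of [0, 1] expects len(xs) / bins observations.
    """
    observed = [0] * bins
    for x in xs:
        observed[min(int(cdf(x) * bins), bins - 1)] += 1
    expected = len(xs) / bins
    return sum((o - expected) ** 2 / expected for o in observed)


def best_model(xs, bins=10):
    """Fit all candidates and return the name with the smallest statistic."""
    fits = {
        "normal": fit_normal(xs),
        "lognormal": fit_lognormal(xs),
        "exponential": fit_exponential(xs),
    }
    stats = {name: chi_square_stat(xs, cdf, bins) for name, cdf in fits.items()}
    return min(stats, key=stats.get), stats


# Simulated, right-skewed data standing in for one pollutant's daily medians.
random.seed(0)
sample = [random.lognormvariate(4.0, 0.6) for _ in range(2000)]
name, stats = best_model(sample)
print(name, {k: round(v, 1) for k, v in stats.items()})
```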

Looking at the moments of the fitted distributions, we see that pm25, one of the most important factors in the AQI calculation, was much lower during the first 6 months of 2020 than in 2019 in all the cities except Mumbai, whereas the levels rose back to much higher values in 2021.

Similar fitted probability models suggest that almost all pollutant levels were lower in 2020 than in 2019, while the values of 2019 and 2021 are very similar, with those of 2021 if anything slightly higher. The relevant plots and figures for the other pollutants are attached at the end.

There were also some cases where outliers degraded the fit of the model. In those cases I fitted the model by minimizing the Hellinger distance instead. One such case is given below:

It is clear that the Hellinger fit does much better in the part of the distribution where most of the observations lie.
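
To illustrate why a minimum Hellinger distance fit is less sensitive to outliers than maximum likelihood, here is a small sketch on simulated data. It is a simplification and not the procedure used above: it fits a one-parameter Exponential model by grid search, compares binned proportions rather than smoothed densities, and the bin edges, candidate grid and injected outliers are all illustrative assumptions.

```python
import bisect
import math
import random
from statistics import fmean


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same bins."""
    overlap = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - overlap))


def binned_proportions(xs, edges):
    """Share of data in each [edges[i], edges[i+1]) bin plus an open tail bin."""
    counts = [0] * len(edges)
    for x in xs:
        i = bisect.bisect_right(edges, x) - 1
        counts[max(i, 0)] += 1
    return [c / len(xs) for c in counts]


def exponential_bin_probs(mean, edges):
    """Probability of each bin (and the tail) under an Exponential with this mean."""
    cdf = [1.0 - math.exp(-e / mean) for e in edges]
    probs = [hi - lo for lo, hi in zip(cdf, cdf[1:])]
    probs.append(1.0 - cdf[-1])
    return probs


def min_hellinger_mean(xs, edges, candidates):
    """Grid search for the Exponential mean closest to the data in Hellinger distance."""
    observed = binned_proportions(xs, edges)
    return min(
        candidates,
        key=lambda m: hellinger(observed, exponential_bin_probs(m, edges)),
    )


# Simulated data: Exponential with mean 10, contaminated by ten large outliers.
random.seed(1)
sample = [random.expovariate(1 / 10) for _ in range(500)] + [1000.0] * 10
edges = [5.0 * i for i in range(11)]            # 0, 5, ..., 50
candidates = [0.5 * i for i in range(2, 101)]   # 1.0, 1.5, ..., 50.0

mle_mean = fmean(sample)                        # MLE of the Exponential mean
hellinger_mean = min_hellinger_mean(sample, edges, candidates)
print(round(mle_mean, 2), hellinger_mean)
```

The outliers pull the maximum likelihood estimate far above the mean of the bulk of the data, while the Hellinger estimate stays close to it, because all ten outliers together only add a little mass to the single tail bin.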

HYPOTHESIS TESTING

Here we test some of our beliefs about the data, which will help us better understand the pollutant and AQI levels.

  • In the introduction we saw that the number of observations recorded by the different cities looked similar, except for Delhi, whose count is obviously much higher and need not be tested. We test the null hypothesis that the numbers of observations recorded by all the other cities are equal against all possible alternatives, using a Chi-square test for equality of proportions.

Hence, all cities except Delhi appear to be equally monitored in 2019 and 2020, whereas Delhi is monitored more heavily.
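
The statistic behind such a test is simple to write down. In the sketch below the city names and counts are placeholders, not the observed numbers from this dataset, and the decision uses standard 5% critical values of the Chi-square distribution for a few small degrees of freedom.

```python
# Upper 5% critical values of the Chi-square distribution, keyed by degrees of freedom.
CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_equal_counts(counts):
    """Chi-square statistic and degrees of freedom for H0: all counts are equal."""
    expected = sum(counts) / len(counts)
    statistic = sum((c - expected) ** 2 / expected for c in counts)
    return statistic, len(counts) - 1


# Placeholder counts for four hypothetical cities (NOT values from the dataset).
placeholder_counts = {"City A": 90, "City B": 110, "City C": 100, "City D": 100}
statistic, df = chi_square_equal_counts(list(placeholder_counts.values()))
reject = statistic > CRITICAL_5PCT[df]
print(statistic, df, "reject H0" if reject else "fail to reject H0")
```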

  • Now the question arises whether the mean level of pollutants in 2020 is less than that of 2019. Along with it we also find out whether the mean level of pollutants in 2021 is more than that of 2020 (considering the data of the first 6 months of 2020 and 2021).

To test the null hypothesis that the mean pollutant level of 2020 equals that of 2019 against the alternative that the mean pollutant level dropped in 2020, we perform a two-sample t-test for equality of means. Judging from the boxplots, the variances for each year look more or less the same, so I took the variances of the two samples to be equal.
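
A sketch of the pooled-variance two-sample t-test described above, written with the Python standard library. Since the standard library has no Student t distribution, the p-value and the one-sided confidence bound use the Normal approximation, which is an assumption that is reasonable only for large samples such as the ones here; the function names are hypothetical.

```python
import math
from statistics import NormalDist, fmean, variance


def pooled_t_test(a, b):
    """Return (mean(a) - mean(b), standard error, t statistic, degrees of freedom)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    diff = fmean(a) - fmean(b)
    return diff, se, diff / se, df


def p_value_less(t):
    """One-sided p-value for H1: mean(a) < mean(b), Normal approximation."""
    return NormalDist().cdf(t)


def lower_confidence_bound(diff, se, confidence=0.99):
    """One-sided lower bound for the mean difference, Normal approximation."""
    return diff - NormalDist().inv_cdf(confidence) * se


# Tiny made-up samples, only to show the call pattern.
levels_2020 = [1.0, 2.0, 3.0, 4.0, 5.0]
levels_2019 = [2.0, 3.0, 4.0, 5.0, 6.0]
diff, se, t, df = pooled_t_test(levels_2020, levels_2019)
print(diff, se, t, df, round(p_value_less(t), 3))
```

A statement such as "at least 7.5 points higher with 99% confidence" corresponds to `lower_confidence_bound` applied to the difference of the two yearly means.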

So it is true that 2021’s pollutant levels are higher than those of 2020: to be precise, at least 7.5 points higher with 99% confidence when comparing the first 6 months. But it is surprising that 2020’s pollutant levels did not drop significantly compared to 2019, despite the lockdowns and all the other measures that favoured a decrease in air pollutant levels.

Now we look at the first half and the second half of the years separately, because the nation-wide lockdown took place during the first half of 2020. We test the same hypotheses as above, but now on the split data.

Indeed our guess was right. The pollutant levels decreased during the first half of 2020 by at least 6.4 points with 99% confidence, and we reject the null hypothesis.

An interesting look at the second half of 2020: testing the null hypothesis that the mean pollutant levels of 2019 and 2020 are equal against the alternative that 2020’s was higher reveals that 2020’s level was indeed higher in the second half.

  • Now we turn to finding out how the mean AQI levels of each pollutant differ.

Considering the First Half of 2019, 2020 and 2021:

Here we perform two-sample t-tests and find 99% confidence intervals for the difference of two means, assuming equal variances. The first test has the null hypothesis that the mean pollutant levels of 2020 and 2019 are the same, against the alternative that the 2020 mean is lower than that of 2019. Similarly, the second tests whether 2021’s mean level is higher than 2020’s. The third tests whether the means of 2021 and 2019 are equal, which is a two-sided test.

The first row of each table corresponds to the first test, second row to second test and third row to third test from the tests mentioned above.

Indeed! For most of the pollutants our guess seems to hold: 2020’s pollutant levels dropped compared to 2019 and 2021, whereas the means of 2021 and 2019 are similar.

A similar test on the second half of 2020 and 2019, where we test the null against the alternative that 2020’s mean pollutant level is higher than that of 2019, reveals that almost all pollutant levels have either gone up or remained the same as in 2019. A possible cause is a 40% increase in stubble burning during the second half of 2020, as various reports suggest.

MODELING

In this section we analyze the data and provide a quantitative way to summarize the trends and patterns. We mostly use concepts and techniques from Multiple Linear Regression (MLR).

REGRESSION

Here we fit an MLR to the strong linear relationship we saw in the Visual Overview section. Because the fit is difficult to visualize, residuals play an important role in judging model fit.

  • We look at the linear relationship between each pollutant level and the weather parameters, and at the linear relationship between each pollutant and the other pollutants. We also look at the standard errors of the estimates, their confidence intervals, and how much of the variation in the pollutant levels the weather parameters explain.

  • Some assumptions made during modelling are:

    • The weather parameters are independent of the Level of pollutants.

    • The weather parameters are similarly distributed across 2019, 2020 and 2021 [Except for Mumbai where we see an increased wind-speed in the later two years]

In most of the models which we will try to fit we will look at the first half and the second half of the year separately.

  1. Model Based on 2019

I present here the models for Kolkata and Delhi only; the models for the other cities are attached at the end. We fit an MLR model with each pollutant as the response and the weather parameters as the predictors. Similarly, to examine the strong relationships among the pollutants themselves, we fit a separate MLR model for each pollutant using only the other pollutants’ levels as predictors. There are two reasons for splitting the year into 2 parts. First, as seen in the Visual Overview section, the weather parameter levels in the first half of the year are very different from those in the second half. Second, we have data for 2021 only for the first 6 months, so fitting a model on the first half of the year lets us compare it with 2021.
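
For readers who want to see what fitting such a model involves, here is a minimal ordinary least squares sketch in plain Python. It solves the normal equations by Gaussian elimination and reports the coefficients and R-squared; it is not the software used for the reported models, it does not compute standard errors or confidence intervals, and the toy data (two predictors standing in for weather parameters) are made up.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(X, y):
    """Return ([intercept, slopes...], R-squared) for a linear model of y on X."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


# Toy data: the response is exactly 1 + 2*x1 + 3*x2, so the fit should recover it.
X = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
y = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in X]
beta, r_squared = fit_ols(X, y)
print([round(b, 6) for b in beta], round(r_squared, 6))
```

In the setting above, each row of `X` would hold one day’s weather parameters (or the other pollutants’ levels) and `y` the corresponding pollutant level, fitted separately for each half of the year.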

Interestingly, in the first half of the year the weather parameters explain the pollutant levels much better than in the second half. The strength of the relationship between each pollutant’s level and the other pollutants’ levels remains almost the same during both halves of the year. Temperature, humidity and dew have a highly significant effect in explaining the pollutant levels throughout the year, which may be due to the intricate physical relationships that exist between these quantities, e.g. the dependence of gas concentration on temperature.

The confidence intervals are given under each coefficient and tell us how significant its presence in the model is. Any coefficient whose confidence interval includes 0 can be thought of as less important in explaining the response.

  • Next we take a look at how the residuals of the above-mentioned models are spread.

Indeed, the residuals appear to be normally distributed with mean around 0. Therefore assuming a linear relationship between pollutant levels and weather parameters does not seem to be a problem for our model.

  • Since we have assumed that the weather parameters remain similar in all the years, it is interesting to see how well the model fitted on 2019’s weather parameters performs when used on the data for the first 6 months of 2020 and of 2021.

Interestingly, our model fitted on 2019’s weather parameters performs less well on 2020’s data than on 2021’s. The histogram-density plots of the residuals also show that the model predicts higher pollutant levels for 2020 than the values actually observed, which suggests that 2020’s pollutant levels were lower during the first 6 months.
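
The check described here, applying coefficients fitted on one year to another year’s weather data and inspecting the residuals, can be sketched as follows. The coefficients and data points are hypothetical placeholders; residuals are defined as observed minus predicted, so a negative mean residual means the model over-predicts.

```python
from statistics import fmean


def predict(beta, row):
    """Linear prediction: intercept beta[0] plus slopes times predictor values."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


def residuals(beta, X, y):
    """Observed minus predicted for every observation."""
    return [actual - predict(beta, row) for row, actual in zip(X, y)]


# Hypothetical coefficients "fitted on 2019" and made-up "2020" observations.
beta_2019 = [1.0, 2.0]
weather_2020 = [(1.0,), (2.0,), (3.0,)]
pollutant_2020 = [2.0, 4.0, 6.0]

res = residuals(beta_2019, weather_2020, pollutant_2020)
mean_residual = fmean(res)
print(res, mean_residual)
```

Here every residual is negative, which is the pattern one would expect if pollutant levels in the new year were lower than the old model implies for the same weather.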

  2. Model Based on 2020. We take a visual look instead of tables here, along with the estimated distributions.

Kolkata saw more or less stable weather conditions throughout the first half of the year, and hence not much of the variability of the pollutants was explained by the weather parameters. Delhi’s dependence on weather parameters appears stronger.

CONCLUSION

SUGGESTIONS

BIBLIOGRAPHY