Chapter 7 Results
In this chapter we examine the structure of the data, both visually and through descriptive statistics, and draw inferences from the results. We look for hidden patterns, trends and dependence among the variables, examining the overall, domain-wise and item-wise (within each domain) data and their correlations. Our goal, as before, is to extract as much information as possible from the response data about the pro-environmental attitude of college students in India.
7.1 Visualization
I avoid presenting tables in the main analysis, since raw numbers convey little on their own unless they are coupled with plots and remarks; where needed, the tables are placed in the Appendix. I have tried to present all the results directly in plots and to avoid cumbersome detail. Every plot is accompanied by the necessary remarks and the insight to be drawn from it.
- Let’s look at the response proportions for each question (we refer to the questions as items). A very large proportion of respondents chose Agree or Strongly Agree for all the items in every domain, which hints that our sample is inclined towards a pro-environmental attitude. We will see more on this later in the study through several other plots and analyses.
Before looking into anything else, we check the average pro-environmental score of the students. Judging by the 95% confidence interval, the overall performance is very good, standing above 80%.
We now look at the correlations between the variables in our data, i.e. between the items and between the domains, and draw the necessary insight from them. Since all the questions in the questionnaire revolve around the same topic of environmental attitude, we expect the items and the domains to be strongly correlated among themselves. We check the validity of this expectation in the analysis below.
- We first look at the correlation between the responses to all the items. There is a high positive correlation between most of the items across all the domains. This means that people who agree with one item are likely to agree with most of the other items, and vice versa. The negatively correlated pairs (the darker cells) are very few, and their strength is very low: mostly between -0.1 and 0, with the strongest at about -0.25 (only one or two pairs).
- We now look at the pairwise correlation between the domain scores of individuals, split by gender to see how each gender’s domain scores correlate. Since all the domains are closely related to an individual’s environmental attitude, it is natural to expect that a person who scores high in any one domain is likely to have a higher pro-environmental attitude overall (see Section 7.3, Modeling) and hence high scores in the other domains too. Accordingly, we see that all the domain scores are highly correlated and that the correlation strengths are similar for both genders. The fitted regression lines also have positive slopes, which means that a higher score in one domain is likely to imply higher scores in the other domains and hence a higher overall pro-environmental attitude (this idea is explored further later in the study). The disparity in a few of the lines is due solely to the limited sample. Hence, our sample agrees with what we expected. Purple denotes Female and yellow denotes Male.
- Having seen the correlation between the net scores of the domains, we now take a closer look at the items within each domain and how they correlate among themselves. We got a rough idea of this from the first correlation plot of all the items. The plots reveal that all the items in each domain are highly positively correlated with each other. No negative correlation is observed between items of the same domain, which is consistent with what is expected.
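As a minimal sketch of how pairwise correlations like those in the plots above can be computed, the snippet below builds a Pearson correlation matrix from coded Likert responses (1 = Strongly Disagree, 5 = Strongly Agree). The choice of the Pearson coefficient, the item names and the response values are assumptions made for illustration only; the survey data and the exact coefficient used for the plots are not reproduced here.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise correlations for a dict {item name: list of responses}."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical responses of six students to three made-up items.
responses = {
    "ITEM_A": [5, 4, 4, 3, 5, 2],
    "ITEM_B": [5, 5, 4, 3, 4, 2],
    "ITEM_C": [2, 3, 3, 4, 1, 5],
}
corr = correlation_matrix(responses)
for name, row in corr.items():
    print(name, [round(value, 2) for value in row.values()])
```

In this toy data ITEM_A and ITEM_B move together while ITEM_C moves against them, so the matrix contains both a positive and a negative off-diagonal entry, much like the light and dark cells of a correlation heatmap.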
With this we have an overall idea of the correlations among items and domains, as well as of the response proportions in the data we collected. We now move to the next section to draw inferences from the data. We will nonetheless continue to visualize the data wherever possible and necessary.
7.2 Inference
In this section, we conduct several hypothesis tests to look for statistically significant results, i.e. to see whether the limited sample we have provides enough evidence that the population we are targeting possesses a pro-environmental attitude.
We will also estimate statistical measures such as the mean, give confidence intervals and fit probability density functions.
In the following analysis, a higher score implies a stronger inclination towards Pro-Environmental Attitude and a lower score implies a weaker one.
- We have already seen in the previous section that our sample is inclined towards a pro-environmental attitude, i.e. a large proportion of people agree with the pro-environmental items. Although the mean and median say a good deal about the average behavior of the population, a well-fitting probability density function describes the behavior of the entire population and the overall shape of response histogram we can expect. Among the probability density functions we considered, the Weibull distribution fits all the scores best (the best fit is decided by the Kolmogorov-Smirnov goodness-of-fit test). Unlike the normal distribution, which is symmetric, the Weibull distribution can accommodate the skewness seen in the data. The fact that the Weibull fits better than the normal is evidence that the underlying population distribution is skewed, with its mass concentrated at higher scores, i.e. our population shows a tendency towards a high pro-environmental attitude. Importantly, the observed skewness of all the domain net scores is negative: the long tail lies on the low-score side while most of the data sits at higher scores, and hence at a higher pro-environmental attitude. We also observe high mean and median scores. The QQ-plot shown beside the data histogram agrees with our fit.
- We now look at the mean response score for each item in each domain, together with its confidence interval. As expected from the analysis above and from the overall pro-environmental inclination of the sample, the mean score for each item is more than 3 (about 4 or above, i.e. about 80% or more of the maximum score of 5). The standard error is low; it determines the 95% confidence interval, which is simply [mean - 1.96*se, mean + 1.96*se]. Looking at the 95% C.I. for the scores, we can therefore be confident that the mean for each item is well above 3, i.e. on average the population is inclined towards a pro-environmental attitude.
- We see similar results for the net score in each domain. The average net score in each domain is only 5 points below the maximum score obtainable in that domain. The narrow 95% confidence interval strengthens our belief that the sample average is very good, standing above 75% of the maximum score. Hence, in every domain we see high levels of awareness and hence a high pro-environmental attitude.
- We now check whether males and females differ in their pro-environmental attitude. We frame the null hypothesis that males and females do not differ on average in pro-environmental attitude and test it against a two-sided alternative. We carried out a two-sample t-test and were not able to reject the null hypothesis. Hence, we conclude that there is not enough statistical evidence that males and females differ in pro-environmental attitude, and, as we have been doing from the start, we will not differentiate between male and female scores. We also test the null hypothesis that people living in joint families and in nuclear families have the same pro-environmental attitude against the two-sided alternative. Again, we find that there is not enough statistical evidence that the two groups differ. Finally, the average pro-environmental scores of most of the domains are nearly the same: conducting one-factor ANOVA tests for equality of means, we find that there is not enough statistical evidence to reject the null hypothesis that the domain means are equal. I believe that these tests are limited by the sample size, and that with access to a bigger sample a single one-factor ANOVA across all the domain scores would show the means to be the same.
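The interval [mean - 1.96*se, mean + 1.96*se] and the two-sample comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the scores and the two groups are made up, and the Welch (unequal-variance) form of the t statistic is our choice for the sketch, since the text does not say which variant of the two-sample t-test was run.

```python
import math
import statistics

def mean_ci95(scores):
    """Return (mean, lower, upper) for the normal-approximation 95% interval."""
    mean = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(len(scores))  # standard error of the mean
    return mean, mean - 1.96 * se, mean + 1.96 * se

def welch_t(sample_a, sample_b):
    """Welch two-sample t statistic (does not assume equal variances)."""
    var_a = statistics.variance(sample_a) / len(sample_a)
    var_b = statistics.variance(sample_b) / len(sample_b)
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(var_a + var_b)

# Hypothetical item scores on the 1-5 scale for two illustrative groups.
group_a = [4, 5, 4, 3, 5, 4, 4, 5]
group_b = [4, 4, 5, 3, 4, 5, 4, 4]

mean_a, low_a, high_a = mean_ci95(group_a)
t_stat = welch_t(group_a, group_b)
print(f"mean = {mean_a:.2f}, 95% C.I. = [{low_a:.2f}, {high_a:.2f}]")
print(f"t statistic = {t_stat:.2f}")
```

A t statistic close to zero, as in this toy data, is what a failure to reject the null hypothesis of equal means looks like; turning it into a p-value additionally requires the t distribution with the appropriate degrees of freedom, which is left out of this sketch.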
With this we conclude the inference section of our study. We will use the insights gained from this section and the previous section to guide our data analysis further in the next sections.
7.3 Modeling
In this section, we study the dependence of each domain score on the other domain scores, of the item scores on the domain scores, and of the domain scores on the total score. This will help us identify which domain scores contribute most to a higher pro-environmental score. This matters because those are the domains we need to improve in order to achieve better total scores. We will not fit complicated black-box models, as our goal is not to predict the pro-environmental attitude of a student but to draw inferences from the models. Linear models are well suited to that purpose.
- First, we look at the dependence of each domain score on the other domain scores. This tells us how a person whose scores are known in all the domains except one is expected to perform in the left-out domain. From the coefficient forest plot below, we see that most of the coefficients in the significant linear models have high positive values; the one exception is insignificant, with a high p-value, as its 95% C.I. (the horizontal line across each estimate) shows. Thus, if a person has high scores in all the other domains, it is expected, and observed, that they will have a high score in the remaining domain. We also observe that the other domain scores explain the variability of the RES, REC, CON and ER domain scores very well (the model R-squared values are above 60%). This gives us the important information that if we foster education among students in the domains they are less aware of, we can expect their pro-environmental scores to improve in the other domains as well. Special focus on improving awareness of SS, ES, SA and PC is therefore needed, because improving awareness of RES, REC, CON and ER will not by itself improve pro-environmental attitude in these four domains (whose models have low R-squared).
- Let’s look at the dependence of the item scores on the domain score. We will look at it in two ways to better understand what is going on. Below, we fit a linear regression of each item score on the domain score, to understand what item scores are expected at a given total pro-environmental score in that domain. We observe that an increase in any item score goes with an increase in the overall score in that domain (from the item correlation graphs we know that a high score on one item implies that the other scores are likely to be high too). We observe an issue with three items, REC2, PC1 and SA1: even when the domain score is 100% (which means all item scores for that individual must be 5), the fitted regression line predicts an item score of 4 or slightly less. This means that even students with a high pro-environmental attitude find it difficult to agree with these items. On inspecting them, we see a need for education and awareness on upcycling old products into new and useful items (REC2), on the cost and issues of safe disposal of waste products (SA1), and on managing waste in an individual’s surroundings (PC1). We also get an idea of which items are generally agreed with by most people (even those with lower pro-environmental scores): a vertical line drawn in any of the plots gives the score of each item predicted at that level of domain awareness. All the items in SS and ES have the same predicted scores, whereas the highest-scored items in the other domains are REC1 in REC, SA4 in SA, PC4 in PC, ER1 in ER, RES4 in RES and CON3 in CON.
So, most students at all levels of domain awareness (domain pro-environmental score) agree that recycling is useful, that proper precautions should be taken for the safe disposal of waste and toxic chemicals, that environmental conservation is everyone’s responsibility, that we can save water by closing the tap while brushing, and that conservation can help create a greener planet for the future.
- Given the above insight, it is natural to ask how we can improve the overall domain score by imparting education specific to each item. We answer this by looking at the dependence of the total score in each domain on the item scores that compose it. As expected, the slopes are positive, since increasing an item score trivially increases the domain score. One important thing to notice is that agreeing completely with one item, i.e. scoring full marks on it, does not imply that the overall domain score is near full: the predicted domain score maxes out at about 90% for almost all the domains. This reinforces the point that broad knowledge of all the items that make up a domain is needed to raise domain-level awareness to near 100%. The slopes of REC2, SA1, PC1, PC2, PC3, PC6 and SS1 are lower than those of the rest of the items, which implies that, at our sample’s current level of environmental education, improving the response to these items would yield only a small increase in the domain score. Thus, we need to focus on imparting education regarding upcycling, the cost of waste disposal, keeping our surroundings free of waste, and the belief that individual actions can improve the climate.
- Finally, we repeat the analysis carried out above for items, this time for domain-level awareness and the overall pro-environmental score. We break down the predicted composition of the total pro-environmental attitude (overall score) in terms of the domain scores. We see that the largest contribution to students’ overall pro-environmental attitude comes from awareness in the environmental reductionism domain. Apart from environmental reductionism, all the other domain scores max out at around 90% when the total score is 100% (i.e. when all the domain scores should be near 100%). Hence, we need awareness programmes focusing on all the other domains: recycling, environmental safety, perceived control, social support, environmental sensitivity, reuse and conservation. Students scored lower in social support, environmental safety and perceived control than in the other domains. The R-squared plots also confirm that the students’ overall pro-environmental attitude is mainly explained by RES, REC, CON and ER, and that we need to work more on the domains of PC, SS, ES and SA.
- To see how the total score depends on the domain scores, the slope is the main quantity to analyse, and increasing it is a goal of our environmental education. Once again, the predicted total score maxes out at about 90% even when an individual domain score is 100%, which implies that further awareness is required in all the domains. The slopes of SS and ES are low among students, so education is very important for increasing the slope of the total score on these domains.
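To make the reading of slopes and of the "maxes out at about 90%" remark concrete, here is a minimal sketch of a one-predictor least-squares fit. The percentages are invented for illustration and the function name `fit_line` is ours; the models in this section were fitted to the survey data, which is not reproduced here.

```python
def fit_line(x, y):
    """Ordinary least squares fit y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return intercept, slope, 1 - ss_res / ss_tot

# Hypothetical percentages: one domain score and the overall score of six students.
domain_pct = [60, 70, 75, 80, 90, 100]
total_pct = [68, 72, 78, 80, 86, 90]

intercept, slope, r_squared = fit_line(domain_pct, total_pct)
predicted_total_at_full_domain = intercept + slope * 100
print(f"slope = {slope:.2f}, R-squared = {r_squared:.2f}")
print(f"predicted total at 100% domain score = {predicted_total_at_full_domain:.1f}%")
```

With a slope below 1, the fitted line predicts a total below 100% even at a full domain score, which is the pattern described in the bullets above; a steeper slope would push that prediction closer to 100%.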
This section provided a lot of insight into the areas that need more stress in environmental education, and we also identified some gaps in the students’ awareness. Now, we move on to clustering the items and domains.
7.4 Clustering
As per our questionnaire survey, the Pro-Environmental Attitude Score consists of 8 domain scores. Since the questionnaire captures the opinions or attitudes of the students, we might expect some students to hold similar opinions on certain items or domains and others to hold different ones. Based on these opinions, the items will also form clusters that capture similar trends in what the students believe. We are therefore interested in the clusters of similar items from all the domains, as obtained from the observations, and in what makes up the overall pro-environmental attitude of the high scorers (those more strongly inclined towards pro-environmental ideas). Which domains students lack awareness of, and how the items group together to convey similar ideas, are among the questions we focus on in this section of the study.
- We first look at correlation-distance based clustering of the items. The dendrogram is interpreted as follows: items that fuse higher up are more dissimilar (based on correlation) than items that fuse lower down. This reveals the underlying similarity of the items very well and also sheds light on the items that do not go well with the rest. For example, as observed before, SA1, PC2 and SS1 do not go well with almost the entire questionnaire and hence fuse with it very high up in the dendrogram. PC2 and SS1 are very similar to each other (PC2 is "I often feel single person cannot change the environment" and SS1 is "I believe that an individual alone cannot help in conserving environment without social support"). We see the same for PC4 and ER5; REC2 and RES1; CON2 and RES5; PC1 and ES2; and so on. Hence, many items repeat the same idea in a slightly different manner, to check whether the students give similar answers to similar questions. Students have very similar ideas about recycling (REC) and reuse (RES), as seen from the fact that these items tend to cluster together. Many items of SA fall in the same cluster, and the same goes for SS and for REC.
- Next, we look at the clustering of items together with the clustering of students, based on the Jaccard distance (which best separated the students by overall pro-environmental score, from low scorers to high scorers, and also separated the items from those on which the students performed worst to those on which they performed best). Looking at the heatmap, we see that most of the students stayed neutral on, or even disagreed or strongly disagreed with, the blue-clustered items in the dendrogram below; even the students who are highly inclined towards a pro-environmental attitude chose merely to agree rather than strongly agree with these items. This means there is an immense need for environmental education on the topics of the blue-clustered items: good practices and the importance of, and ways to, conserve natural resources (even in everyday life); maintaining the environment and managing waste in one’s surroundings; the importance of the individual in stopping climate change and protecting the environment; methods and ideas for reusing old or waste products; and, finally, drawing on social support and bringing more people together to protect and nurture the environment. The students who are less inclined towards a pro-environmental attitude need proper education in all the domains, as they disagreed with or stayed neutral on most of the well-known good practices of environmental protection (as can be seen from the right side of the heatmap).
- We now turn our attention to domain-level analysis. Clustering based on correlation distance, as before, reveals which domains are similar to one another. If a domain is not similar to the other domains, education aimed specifically at that domain is important. We see that social support and environmental sensitivity are different from the mainstream domains of RES, CON, REC and ER. SA and PC are also less similar to these four domains.
- We see a reinforcement of the same fact as before, i.e. students, even those highly inclined towards a pro-environmental attitude, did not score very well in the domains of SS, ES, PC, CON and RES. Hence, more focus needs to be given to building social support, reuse and conservation in the context of environmental education.
This section was very useful in understanding the underlying clustered behavior of the items and domains. It helped us further pinpoint the domains in which a lack of awareness led to a weaker pro-environmental attitude.
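The idea behind correlation-distance clustering can be sketched as below: each pair of items gets the distance 1 - r, and the closest clusters are fused first, so the fusion height in a dendrogram grows with dissimilarity. This is a simplified illustration, not the procedure behind the dendrograms above: the average-linkage rule, the item names and the responses are all assumptions, since the linkage method is not stated in the text.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_linkage(columns):
    """Agglomerative clustering on the distance 1 - r.

    Returns the merges in order as (cluster_a, cluster_b, fusion_distance).
    """
    names = list(columns)
    dist = {(a, b): 1 - pearson(columns[a], columns[b]) for a in names for b in names}
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [dist[a, b] for a in clusters[i] for b in clusters[j]]
                d = sum(pairs) / len(pairs)  # average distance between the two clusters
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical responses of six students to four made-up items.
responses = {
    "ITEM_A": [5, 4, 4, 3, 5, 2],
    "ITEM_B": [5, 5, 4, 3, 4, 2],
    "ITEM_C": [2, 3, 3, 4, 1, 5],
    "ITEM_D": [1, 3, 3, 5, 2, 5],
}
merges = average_linkage(responses)
for left, right, height in merges:
    print(left, "+", right, f"at distance {height:.2f}")
```

In this toy data the two pairs of similar items fuse early at small distances, and the two resulting clusters fuse last at a distance above 1, which corresponds to items that are negatively correlated on average.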
7.4.1 PCA
With so many items rated on a 5-point scale, exploring the directions in the data that reasonably summarize the overall response may give us an idea of which items and domains mainly determine the pro-environmental attitude of the students.
- First, we look at the domain-level PCA, considering the overall spread of the data and the directions of Strongly Agree, Agree, Neutral, Disagree and Strongly Disagree. Students strongly agree on REC, ER and SA, stay mostly neutral on RES, disagree the most on PC and strongly disagree the most on ES. The overall strength of inclination in each domain can be read from the distribution of the data points below, plotted in the plane spanned by the two leading principal components.
- We finish with the item-level PCA, which gives the directions of maximum variance of the responses. We see that the students vary in their Strongly Agree attitude towards items like ER1, ES5, REC1, etc. Strongly Disagree explains very little variance, mainly because very few students gave that response. Disagree and Neutral point in the same direction, as expected, and Strongly Disagree varies in a similar direction, with the most variance near SS2, PC6, RES1, etc. The projection of the 5-dimensional items onto the two-dimensional plane spanned by the two leading principal components shows that most of the items’ responses lie on a diagonal ellipse, with items like REC2, RES1 and PC1 positioned far away from the mass of the data cloud. The previous analysis showed that REC2, RES1 and PC1 did indeed receive worse responses than the other items. REC7 is not really an outlier, since items REC3 and REC6 lie near it. SA4, PC4, REC1, ER3 and ER1 are the most strongly agreed items, and they appear to cluster together.
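As a rough sketch of what "the leading principal direction" means, the snippet below finds the first principal component of a small data set by power iteration on the covariance matrix. Everything specific in it is an assumption for illustration: the item names, the response proportions, and the reading of "5-dimensional items" as one row per item holding the share of each of the five response categories. A full PCA routine would also return the second component used for the two-dimensional plots.

```python
import math

def first_principal_component(rows, iterations=200):
    """Leading principal direction of `rows` (one observation per row).

    Uses power iteration on the sample covariance matrix and returns
    (unit direction vector, share of total variance it explains).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Non-uniform start vector: for proportions that sum to 1, an all-ones
    # vector would be mapped to zero by the covariance matrix.
    vec = [float(j + 1) for j in range(p)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(vec[a] * sum(cov[a][b] * vec[b] for b in range(p))
                     for a in range(p))
    total_variance = sum(cov[a][a] for a in range(p))
    return vec, eigenvalue / total_variance

# Hypothetical share of responses per item, in the order
# Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree.
item_profiles = {
    "ITEM_A": [0.02, 0.03, 0.10, 0.40, 0.45],
    "ITEM_B": [0.01, 0.02, 0.07, 0.35, 0.55],
    "ITEM_C": [0.05, 0.10, 0.25, 0.40, 0.20],
    "ITEM_D": [0.03, 0.05, 0.12, 0.45, 0.35],
}
loadings, explained = first_principal_component(list(item_profiles.values()))
print([round(v, 2) for v in loadings], f"explains {explained:.0%} of the variance")
```

The loadings show how strongly each response category contributes to the main direction of variation between items; projecting each item's centered profile onto this vector gives its coordinate along the first axis of a plot like the ones described above.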
7.5 Reliability
In this section, we conduct a reliability analysis of the survey response data. Reliability analysis allows us to study the properties of measurement scales and of the items that compose them. The procedure calculates a number of commonly used measures of scale reliability and also provides information about the relationships between individual items in the scale.
Hence, we obtain an overall index of the repeatability or internal consistency of the scale as a whole, and we can identify problem items that should be excluded from the scale.
- We start with the SA domain and then move through the domains in the order of the questionnaire, obtaining the reliability indices using Cronbach’s α. We detect a problem with the SA1 item relative to the rest of the items in this domain. The r.drop value for SA1 is very low, which implies that this item does not correlate well with the other items in the domain. The raw_alpha also increases substantially if SA1 is dropped. SA1 does not fit reliably in the SA domain, which might be a reason why it scored low among the students in the previous section.
- Let’s look at the REC domain. The reliability score is very good (i.e. raw_alpha above 0.7). REC2 does not seem to correlate well with the other items in this domain. This might be because the students are not very aware of upcycling or do not practise it. Hence, education and awareness regarding upcycling need to be imparted.
- The reliability of the PC domain’s response is very bad. In addition, none of the items except PC4 and PC5 correlates well with the overall scale.
- The reliability of the SS Domain’s response is good. The first item SS1 doesn’t seem to correlate very well with the overall scale.
- The reliability of the ER Domain’s response is very good. All the items correlate with the overall score very well.
- The ES Domain’s response reliability is good, though we might achieve better r.drop scores by re-framing the items. The items fit the overall scale well.
- The RES Domain’s response has good reliability. All the items fit the overall scale well.
- The CON Domain’s response is reliable. All the items fit well in the overall scale.
- Finally, we look at the overall reliability of the entire questionnaire for measuring pro-environmental attitude. It turns out to have excellent reliability. Most of the items fit the scale very well when the goal is the mainstream one of assessing pro-environmental attitude as a whole. When each domain’s part of the survey is considered separately, however, a few items in a few domains do not fit well for reliably measuring the domain-level environmental attitude. As the domains’ item statistics show, the items that do not fit well with the other items are ES1, ES3, SA1 and SS1, which concern, respectively, the preference for working in well-illuminated, airy areas; crowded places making one feel uneasy; the safe disposal of waste being costly; and the belief that one individual cannot help in conserving the environment without social support.
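For readers unfamiliar with the indices quoted above, the sketch below computes Cronbach's α and an item-rest correlation for a small made-up domain. It reflects our understanding of what raw_alpha and r.drop measure (α from the item and total-score variances, and the correlation of an item with the sum of the remaining items); the function names, item names and responses are hypothetical, and this is not the software output reported in this section.

```python
import math
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns (one list of responses per item)."""
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]
    item_variance = sum(statistics.variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance / statistics.variance(totals))

def item_rest_correlation(items, index):
    """Pearson correlation of one item with the sum of the remaining items."""
    item = items[index]
    rest = [sum(responses) - responses[index] for responses in zip(*items)]
    mean_i, mean_r = statistics.mean(item), statistics.mean(rest)
    cov = sum((a - mean_i) * (b - mean_r) for a, b in zip(item, rest))
    spread = math.sqrt(sum((a - mean_i) ** 2 for a in item)
                       * sum((b - mean_r) ** 2 for b in rest))
    return cov / spread

# Hypothetical responses of six students to a four-item domain; Q4 is made to fit poorly.
domain = {
    "Q1": [5, 4, 4, 3, 5, 2],
    "Q2": [5, 5, 4, 3, 4, 2],
    "Q3": [4, 5, 4, 2, 5, 3],
    "Q4": [3, 4, 2, 4, 3, 4],
}
columns = list(domain.values())
alpha_all = cronbach_alpha(columns)
alpha_without_q4 = cronbach_alpha(columns[:3])
rest_corr = [item_rest_correlation(columns, i) for i in range(len(columns))]
print(f"alpha = {alpha_all:.2f}, alpha without Q4 = {alpha_without_q4:.2f}")
print("item-rest correlations:", [round(r, 2) for r in rest_corr])
```

In this toy domain the poorly fitting item has a low (here negative) item-rest correlation and α rises when it is dropped, which is the same pattern reported above for items such as SA1.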
This ends the reliability analysis of our response data. The reliability of all the domains is good except for one, PC, and the reliability of the overall survey as a measure of pro-environmental attitude is excellent.
Hence, we have covered almost all aspects of the response data and have explored its breadth and depth. A summary of the findings, and the suggestions drawn from the above analysis, is presented in the Conclusion section of my study.