The pursuit of happiness has fascinated humanity throughout the ages. Ancient philosophers and modern-day researchers have sought to uncover its source; however, a secret formula for happiness has yet to be found. In this project, we join those who have embarked on this quest, exploring the factors that influence happiness and striving to build a predictive model for it. Join us on this captivating journey as we dive into modeling happiness and its determinants. Happiness is also important to talk about because it contributes to several Sustainable Development Goals (SDGs), namely Goals 1, 2, 3, 4, 5, 6, 10, and 16, highlighting its role in creating a sustainable and fulfilling world.
In line with this, the aim of this project is to answer the question: "What factors influence the level of happiness in countries?" To answer it, we applied different methods and algorithms to find the one that best predicts happiness (given by the feature "Happiness Score").
- Maria Velikikh (@mivelikikh)
- Emilija Vukasinovic (@emavuk)
- Paula Ramirez Ortega (@Pramirezortega)
In this project, the 2016_world_metrics.csv (37.3 KB) dataset is used.
To see the full project, please refer to the notebook full_project.ipynb.
To watch the video of us explaining the most important parts of the project, please refer to this link.
We began by examining the dataset to understand its characteristics. Our dataset world_metrics contains health and life-expectancy data as well as ecological footprint, human freedom scores, and happiness scores for 137 countries in 2016. Overall, it includes 30 features: one is the country name, another is the "Happiness Score" (our target variable), and the remaining 28 are the predictors.
We explored our data, checking for outliers, correlations between the variables and the happiness score, and more. Computing the correlations helped us identify features with a high correlation (>
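The correlation-based feature selection described above can be sketched as follows. This is a minimal illustration on a toy DataFrame: in the project the data comes from 2016_world_metrics.csv, and apart from "Happiness Score" the column names and the 0.6 cutoff here are invented for the example.

```python
import pandas as pd

# Toy stand-in for world_metrics; only "Happiness Score" is a real
# column name from the project, the predictors below are invented.
df = pd.DataFrame({
    "Happiness Score": [7.5, 6.8, 5.1, 4.2, 3.9],
    "Life Expectancy": [81.0, 79.5, 72.3, 65.1, 60.4],
    "Freedom Index":   [8.1, 7.9, 6.0, 5.2, 4.8],
    "Footprint":       [5.5, 4.9, 3.1, 2.0, 1.7],
})

# Correlation of each predictor with the target, strongest first.
corr = (
    df.corr(numeric_only=True)["Happiness Score"]
      .drop("Happiness Score")
      .sort_values(key=abs, ascending=False)
)

THRESHOLD = 0.6  # hypothetical cutoff; the project's actual value differs
subset_features = corr[corr.abs() > THRESHOLD].index.tolist()
```

Features surviving the cutoff would then form the `world_metrics_subset` sample used later in the comparison.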
After exploring the data extensively, we applied a comprehensive strategy to analyze how the performance of our models could be influenced by various input parameters. Our objective was to investigate the following aspects:
- The difference between data samples:
  - the full dataset `world_metrics` (considering the effect of each original feature);
  - the subset `world_metrics_subset` (containing only the features most correlated with the target).
- The effect of artificially constructed features:
  - does the use of `PolynomialFeatures()` in the preprocessing step improve model performance?
- The effect of the scaling technique:
  - `MinMaxScaler()` vs. `StandardScaler()`.
- The effect of dimensionality reduction with Principal Component Analysis (PCA):
  - do we need all the features from the dataset, or only a few?
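The combinations above can be expressed compactly as a scikit-learn pipeline. The sketch below is illustrative, not the project's actual code: the polynomial degree, the number of PCA components, and the Ridge model are placeholder choices, and the data is random stand-in input.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge

# One pipeline covers all four questions: polynomial features on/off,
# MinMaxScaler vs. StandardScaler, and how many PCA components to keep.
pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),    # swapped for MinMaxScaler() in other runs
    ("pca", PCA(n_components=5)),   # hypothetical number of components
    ("model", Ridge(alpha=1.0)),    # any of the seven regressors fits here
])

# Random stand-in for the 28 predictors and the happiness target.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=50)

pipe.fit(X, y)
r2 = pipe.score(X, y)
```

Switching a step on or off (e.g. dropping `"poly"`) then amounts to editing one line, which makes the grid of scenarios easy to enumerate.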
Through this approach, we aimed to assess how the prediction models' performance evolved. In each scenario, we evaluated the following models:
- ElasticNet (to see which we should prefer: Ridge or Lasso)
- Ridge
- Lasso
- kNN
- Decision Tree
- Random Forest
- Support Vector Regression
To predict, we used a set of custom functions placed in the file functions.py. These functions work together in a workflow that performs a grid search on a regressor, collects the results, and extracts the best models based on different scoring functions. The code output includes both the model settings and the calculated metrics. The model settings describe the chosen algorithm, hyperparameters, and preprocessing steps used. The calculated metrics consist of the mean MSE and its standard deviation.
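The core step of that workflow, a grid search scored by MSE, can be sketched as follows. This is a hedged illustration, not the contents of functions.py: the SVR model, the parameter grid, and the random data are all placeholder choices.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Stand-in data in place of the world_metrics features and happiness scores.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.2, size=60)

pipe = Pipeline([("scale", StandardScaler()), ("svr", SVR())])
grid = {"svr__C": [0.1, 1, 10], "svr__kernel": ["rbf", "linear"]}

# Negated MSE, because scikit-learn scorers are maximized.
search = GridSearchCV(pipe, grid, scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)

# Mirror the report's outputs: best settings plus mean MSE and its std.
best_idx = search.best_index_
mean_mse = -search.cv_results_["mean_test_score"][best_idx]
std_mse = search.cv_results_["std_test_score"][best_idx]
```

The `mean_mse`/`std_mse` pair extracted here corresponds to the "mean MSE" and "std" columns of the comparison table below.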
This table compares all the models in terms of the obtained mean MSE. We use the following abbreviations to keep the table compact:
- F = Full Sample Approach
- S = Subset Sample Approach
sample | model name | mean MSE | std | sample | model name | mean MSE | std
---|---|---|---|---|---|---|---
F | SVR | 0.3355 | 0.0841 | S | SVR | 0.2842 | 0.0424
F | Random Forest | 0.3465 | 0.0761 | S | Random Forest | 0.2961 | 0.0831
F | Ridge | 0.3512 | 0.0804 | S | Ridge | 0.3499 | 0.0565
F | ElasticNet | 0.3519 | 0.0806 | S | ElasticNet | 0.3510 | 0.0576
F | Lasso | 0.3577 | 0.0945 | S | Lasso | 0.3561 | 0.0522
F | kNN | 0.3619 | 0.1106 | S | kNN | 0.3073 | 0.0727
F | Decision Tree | 0.5657 | 0.1843 | S | Decision Tree | 0.4734 | 0.1301
In particular, we found the following:
Full Sample Approach. Among the regression models, SVR (Support Vector Regression) has the lowest mean MSE (0.3355), followed closely by Random Forest and Ridge. ElasticNet, Lasso, and kNN have slightly higher mean MSE values, and Decision Tree performs the worst with a mean MSE of 0.5657.
Subset Approach. SVR still has the lowest mean MSE of 0.2842, followed by Random Forest and kNN. Ridge, ElasticNet, and Lasso perform similarly, while Decision Tree has the highest mean MSE of 0.4734.
Overall, SVR consistently performs well on both the full dataset and the subset, achieving the lowest mean MSE in both cases. This indicates that SVR is the most accurate model for predicting happiness scores from the given features, demonstrating good generalization ability and robustness across the two feature sets.
In conclusion, our investigation did not uncover a definitive formula for happiness. However, we gained valuable insights into the factors associated with happiness and their alignment with the Sustainable Development Goals. It is important to acknowledge that the parameters used in our analysis offer a generalized perspective on happiness and may not fully capture its individual and multifaceted nature. Nevertheless, these insights have fueled our determination to continue our quest for understanding happiness and contribute to a happier world.
Looking ahead, we have several plans for future research. We intend to create subsets of factors based on our own definition of happiness, explore how happiness is depicted in cartoons, and investigate cultural perspectives on happiness. Additionally, we aim to expand our dataset by including currently missing countries, enabling us to gain a more comprehensive global view and examine regional variations. By pursuing these avenues and incorporating additional data, we seek to enhance the comprehensiveness of our findings and uncover new insights into the complex nature of happiness.