Data Preprocessing Steps and Inspiration
Future Possibilities of the Project
The objective of this project is to analyze Walmart sales data to extract meaningful insights and develop predictive models to forecast weekly sales for each store. This analysis aims to help Walmart improve inventory management, strategic decision-making, and overall operational efficiency.
Walmart Sales Data: The primary dataset used for this analysis is the Walmart.csv file, containing detailed information about weekly sales across multiple Walmart stores.
The dataset, named "Walmart.csv", comprises 6,435 rows and 8 columns, each offering valuable insights into the weekly sales dynamics at Walmart across 45 stores. The dataset Walmart.csv contains various columns including:
- Store: Store number (categorical variable ranging from 1 to 45)
- Date: The week of sales (temporal dimension spanning from February 5, 2010, to October 26, 2012)
- Weekly_Sales: Sales figure for the given store in a particular week
- Holiday_Flag: A binary indicator (0 or 1) discerning whether a given week includes a holiday
- Temperature: The temperature on the day of the sale
- Fuel_Price: The regional fuel cost
- CPI: Consumer Price Index, indicating the average change in prices paid by consumers
- Unemployment: The unemployment rate in the region
-
Python: Data Cleaning and Analysis
-
Jupyter Notebook: For interactive data analysis and visualization
Libraries
Below are the links for details and commands (if required) to install the necessary Python packages:
Below are the links for details and commands (if required) to install the necessary Python packages:
- pandas: Go to Pandas Installation or use command:
pip install pandas
- numpy: Go to NumPy Installation or use command:
pip install numpy
- matplotlib: Go to Matplotlib Installation or use command:
pip install matplotlib
- seaborn: Go to Seaborn Installation or use command:
pip install seaborn
- scikit-learn: Go to Scikit-Learn Installation or use command:
pip install scikit-learn
- statsmodels: Go to Statsmodels Installation or use command:
pip install statsmodels
- pmdarima: Go to Pmdarima Installation or use command:
pip install pmdarima
- fbprophet: Go to Prophet Installation or use command:
pip install fbprophet
- tbats: Go to TBATS Installation or use command:
pip install tbats
EDA involved exploring the Walmart sales data to answer key questions, such as:
- What is the overall trend of weekly sales?
- How do these trends vary by store, region, and other factors?
- What is the impact of holidays on weekly sales?
- How do external factors like temperature, fuel price, CPI, and unemployment affect sales?
Handling Missing Values: There are no missing values found in the dataset. Removing Duplicates: There are no duplicate values found in the dataset. Addressing Outliers: Outliers are not being addressed since we are considering the actual weekly sales for time series forecasting.
-
Converting Data Types: The data type for the column ‘Date’ is changed to ‘datetime’ from ‘object’. From this date, we have created new columns by obtaining the year, quarter, month, week, day of week, and day of month.
Data Gaps: Before doing the EDA, we observed that there is a gap in the data for January 2010 and for November, December 2012. The absence of data for these three months can impact our ability to perform accurate yearly, quarterly, and monthly comparisons. The distribution of data is thus affected. It is essential to consider this data gap while conducting analyses that involve these specific time periods.
Distibution of Data:
- Across Years
- Across Quarters
- Across Holidays/Non Holidays)
- Distribution of Data - Pie Charts
Top 5 Performing Stores
Worst 5 Performing Stores
Store No. 20 has the highest sales, whereas store No. 33 has the lowest sales.
Total Yearly Sales
Total Monthly Sales
As there is a gap in the data for January 2010 and for November, December 2012, we would average it out to show for these months which month has the highest sales. After doing the necessary adjustment, we can see that December is the best performing month and February is the worst performing month.
Top 5 months with Highest and Lowest Sales
Total Holiday/Non-Holiday Sales
If we make the adjustment by dividing the sales with the actual number of working days and holidays, we can see the daily sales on a holiday is higher.
Average Daily Sales on a Holiday / a Non Holiday
Impact of Unemployment on Weekly Sales
The data indicates a noticeable decline in spending coinciding with the initiation of unemployment. Typically, an elevated unemployment index corresponds to a reduction in sales. However, in our dataset, the correlation between the unemployment rate index and weekly sales is relatively low, measuring at -0.106.
Impact of Temperature on Weekly Sales
The observed correlation of -0.063 between temperature and sales in Walmart suggests a weak negative relationship. Several factors could contribute to this low correlation:
- Seasonal Variations
- Diverse Product Range
- Regional Variations
- Consumer Behavior
- Multifactorial Influence
Impact of CPI on Weekly Sales
The observed low correlation of -0.072 between the Consumer Price Index (CPI) and sales indicates a weak relationship.
Seasonal Trend of Weekly Sales
Seasonal Trend in Weekly Sales: Sales are the highest in December, which can be attributed to several factors:
- Holiday Shopping Season
- Special Promotions and Discounts
- Winter Weather and Seasonal Products
- Year-End Clearance Sales
- Increased Consumer Spending
- Marketing and Advertising Campaigns
- Extended Store Hours
Trend Component, Seasonal Component, Residual Component of Weekly Sales
- ARIMA (AutoRegressive Integrated Moving Average)
- Captures linear trends and seasonality.
- Suitable for stationary data.
- SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous factors)
- Extends ARIMA by incorporating external factors.
- Suitable for data influenced by external variables like holidays.
- AutoARIMA
- Automates the selection of the optimal ARIMA model.
- User-friendly and efficient.
- Prophet
- Developed by Facebook for handling seasonality, holidays, and special events.
- Flexible and easy to use.
- TBATS (Trigonometric Seasonal Decomposition of Time Series)
- Handles multiple seasonalities and complex patterns.
- Robust in capturing diverse seasonal patterns.
-
Stationarity Assumption
Definition: The statistical properties of the time series data, such as mean and variance, do not change over time. Rationale: Many time series forecasting models, including ARIMA, perform better on stationary data. Ensuring or achieving stationarity enhances model effectiveness.
-
Linearity Assumption
Definition: The relationships between variables, including past and future values in the time series, can be adequately represented using linear models. Rationale: Models like ARIMA and SARIMAX are designed based on linear relationships. Assuming linearity simplifies the modeling process.
-
Independence Assumption
Definition: Each observation in the time series is assumed to be independent of others. Rationale: Time series models often assume independence to prevent past observations from unduly influencing future ones. Violating this assumption can lead to biased model performance.
-
Identifiability Assumption
Definition: The parameters of the chosen forecasting model can be uniquely determined from the available data. Rationale: Ensuring that the parameters are identifiable is crucial for accurate estimation in models like ARIMA and SARIMAX. This supports the reliability of the model's parameter estimates.
- RMSE (Root Mean Squared Error): Measures the average magnitude of the errors between predicted and observed values.
- MAE (Mean Absolute Error): Calculates the average absolute differences between predicted and observed values.
- MAPE (Mean Absolute Percentage Error): Expresses the average percentage difference between predicted and observed values.
Considering Store 24
-
The TBATS model achieved the best performance metrics (lowest MAPE, RMSE, and MAE) for Store No. 24.
-
The analysis highlighted significant trends and seasonal patterns in weekly sales.
-
The impact of external factors like unemployment, temperature, and CPI on sales was explored.
- Implement targeted interventions during peak sales periods and holidays to maximize revenue.
- Continue monitoring external factors to understand their potential impact on sales.
- Utilize the forecasting model to plan inventory management and resource allocation more effectively.
- Data Quality: Some data points may be inaccurate due to underreporting or delays in reporting.
- Model Limitations: The models used may not capture all complexities of sales patterns and may need continuous updating.
- External Factors: Other factors not included in the analysis, such as social behavior and political decisions, can significantly impact sales.
- Advanced Predictive Modeling
Investigate advanced forecasting models such as: a) NBEATS (Neural Basis Expansion Analysis for Time Series) b) NHITS (Neural Hierarchical Time Series) c) PatchTST (Patch-level Temporal Super-Resolution Network for Time Series) d) VARMAX (Vector Autoregressive Moving-Average with exogenous variables) e) VAR (Vector Autoregression) f) KATS (Kit for Automated Time Series analysis)
These models offer potential for enhanced accuracy in sales forecasting.
- Store-Specific Analysis
Conduct comprehensive analyses for each of the 45 Walmart stores to uncover unique patterns and optimize forecasting models tailored to individual store characteristics. This approach can help identify specific trends and factors influencing sales at each location.
- External Factors Integration Incorporate Additional Factors: Consider integrating additional external factors into the forecasting models, such as:
- Economic Indicators: GDP, inflation rates, etc.
- Social Events: Festivals, public holidays, etc.
- Regional Factors: Local economic conditions, demographic changes, etc.
Incorporating these factors can provide a more comprehensive and accurate forecasting approach, capturing the broader context influencing sales.