import pandas as pd
import numpy as np
fp = "../data/daily_ice_cream_sales.csv"
df = pd.read_csv(fp)
df["Date"] = pd.to_datetime(df["Date"])
df["log_daily_sales"] = np.log(df["ice_cream_purchases"])

Analysis Notebook
This notebook analyzes daily ice cream sales in metro areas of the USA. It profiles the yearly subsets of the data and plots the autocorrelation of daily sales and the maximum daily sales per week.

df["Date"].max()

df["Date"].min()

import matplotlib.pyplot as plt

Extract yearly data subsets

df_2020 = df[df.Date.dt.year == 2020]
df_2021 = df[df.Date.dt.year == 2021]
df_2022 = df[df.Date.dt.year == 2022]
df_2023 = df[df.Date.dt.year == 2023]

%matplotlib inline
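
The four filters above repeat one pattern, so the yearly subsets could also be built in a single pass. The sketch below is one possible alternative, not the notebook's own method. It runs on a small synthetic frame (random values standing in for the CSV, which is an assumption) that reuses the Date and ice_cream_purchases column names; the names demo and by_year are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the daily sales CSV (assumption: the real file has
# one row per day with "Date" and "ice_cream_purchases" columns).
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", "2023-12-31", freq="D")
demo = pd.DataFrame({
    "Date": dates,
    "ice_cream_purchases": rng.integers(50, 500, size=len(dates)),
})

# Group rows by calendar year and keep each group as its own DataFrame.
by_year = {year: group for year, group in demo.groupby(demo["Date"].dt.year)}

# Profile each yearly subset with summary statistics.
for year, group in by_year.items():
    print(year, len(group))
    print(group["ice_cream_purchases"].describe())
```

A dictionary keyed by year avoids one variable per year and extends to new years without code changes; by_year[2020] plays the role of df_2020.
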

import statsmodels.api as sm

df_2020["log_daily_sales"].plot()

Plot Autocorrelation
The autocorrelation (ACF) and partial autocorrelation (PACF) plots of log daily sales are shown below.

sm.graphics.tsa.plot_acf(df_2020["log_daily_sales"], lags=40)

sm.graphics.tsa.plot_pacf(df_2020["log_daily_sales"], lags=40, method="ywm")

sm.graphics.tsa.plot_pacf(df["log_daily_sales"], lags=40, method="ywm")

Observation
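The plots suggest that daily sales are not autocorrelated: they look like independent draws from a fixed distribution, that is, a white noise process. This is consistent with synthetically generated data. An augmented Dickey-Fuller test with a constant-only regression (regression="c", the statsmodels default) also suggests the same: it tests for a unit root, and rejecting it is consistent with a stationary series such as white noise.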
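
The white-noise reading can be made concrete with a rough numerical check: for an uncorrelated series of length n, the sample autocorrelations at nonzero lags should mostly fall inside the approximate 95% band of ±1.96/√n. The sketch below illustrates this on simulated data only (a white-noise series and, for contrast, an AR(1) series), not on the sales data. sample_acf is a hypothetical helper written for this example, not a statsmodels function, and the series length is an assumption.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelations r_0..r_nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    acf = [1.0]
    for k in range(1, nlags + 1):
        # Lag-k autocovariance divided by the lag-0 autocovariance.
        acf.append(np.dot(xc[k:], xc[:-k]) / denom)
    return np.array(acf)

rng = np.random.default_rng(42)
n = 1461  # assumption: roughly four years of daily observations

# Simulated white noise: independent standard normal draws.
white = rng.normal(size=n)

# Simulated AR(1) series with coefficient 0.8, for contrast.
shocks = rng.normal(size=n)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.8 * ar1[t - 1] + shocks[t]

band = 1.96 / np.sqrt(n)  # approximate 95% band under the white-noise null
acf_white = sample_acf(white, 40)
acf_ar1 = sample_acf(ar1, 40)

# Count the nonzero lags whose autocorrelation falls outside the band.
outside_white = int(np.sum(np.abs(acf_white[1:]) > band))
outside_ar1 = int(np.sum(np.abs(acf_ar1[1:]) > band))
print(f"band: +/-{band:.4f}")
print(f"white noise lags outside band: {outside_white} of 40")
print(f"AR(1) lags outside band: {outside_ar1} of 40")
```

Under white noise about 5% of lags are expected outside the band by chance, so a handful of exceedances out of 40 is not evidence of correlation; a persistent pattern across the early lags, as in the AR(1) case, is.
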
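
from statsmodels.tsa.stattools import adfuller

# Augmented Dickey-Fuller test; regression="c" (constant only) is the default.
result = adfuller(df["log_daily_sales"])
print('ADF Statistic: %f' % result[0])  # test statistic
print('p-value: %f' % result[1])  # a small p-value rejects the unit-root null

df_2020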

df["log_daily_sales"] = np.log(df["ice_cream_purchases"])

# Yearly totals. "Y" is the year-end alias; pandas 2.2+ prefers "YE".
df_yearly_ice_cream_sales = df.set_index("Date").resample("Y").sum()

df_yearly_ice_cream_sales

df["ice_cream_purchases"].plot()

# Load the mean weekly sales series.
fpmeanweekly = "../data/mean_weekly_ice_cream_sales.csv"
dfwmics = pd.read_csv(fpmeanweekly)

dfwmics["ice_cream_purchases"].plot()

# Number the weeks 1..N in file order.
dfwmics["weekno"] = range(1, dfwmics.shape[0] + 1)

dfwmics
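
The weekly file is read from disk, and the notebook does not show how it was produced. As a hedged sketch, weekly mean and weekly maximum daily sales could be derived from a daily series with pandas resampling. The toy values, the Sunday-ending weeks implied by the "W" alias, and the names daily and weekly are assumptions for illustration, not the author's pipeline.

```python
import pandas as pd

# Toy daily series: 28 days starting Wednesday 2020-01-01, with values 1..28.
dates = pd.date_range("2020-01-01", periods=28, freq="D")
daily = pd.Series(range(1, 29), index=dates, name="ice_cream_purchases")

# "W" bins end on Sunday; compute the mean and the maximum within each week.
weekly = daily.resample("W").agg(["mean", "max"])

# Number the weeks 1..N, mirroring the weekno column used above.
weekly["weekno"] = range(1, weekly.shape[0] + 1)
print(weekly)
```

The first and last bins cover partial weeks here (five days and two days), which is worth keeping in mind when comparing weekly means at the edges of a series.
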