Analysis Notebook

This notebook analyzes daily ice cream sales in metro areas in the USA. Subsets corresponding to individual years are also profiled, and plots of the autocorrelation and of maximum daily sales per week are provided.

import pandas as pd
import numpy as np
fp = "../data/daily_ice_cream_sales.csv"
df = pd.read_csv(fp)
df["Date"] = pd.to_datetime(df["Date"])
df["log_daily_sales"] = np.log(df["ice_cream_purchases"])

df["Date"].max()

df["Date"].min()

import matplotlib.pyplot as plt

Extract yearly data subsets

df_2020 = df[df.Date.dt.year == 2020]
df_2021 = df[df.Date.dt.year == 2021]
df_2022 = df[df.Date.dt.year == 2022]
df_2023 = df[df.Date.dt.year == 2023]

%matplotlib inline
import statsmodels.api as sm

df_2020["log_daily_sales"].plot()

Plot Autocorrelation

The autocorrelation and partial autocorrelation plots are shown below.

sm.graphics.tsa.plot_acf(df_2020["log_daily_sales"], lags=40)

sm.graphics.tsa.plot_pacf(df_2020["log_daily_sales"], lags=40, method="ywm")

sm.graphics.tsa.plot_pacf(df["log_daily_sales"], lags=40, method="ywm")
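The plots above can also be read numerically. The sketch below computes the sample autocorrelation by hand and counts how many lags fall inside the approximate 95% band of ±1.96/sqrt(n) that would be expected for white noise. The helper `sample_acf` and the synthetic series are assumptions made for illustration; they are not part of the notebook's data or of statsmodels.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation of x at lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)]
    )

# Synthetic stand-in for the series (an assumption, not the notebook's data).
rng = np.random.default_rng(0)
white_noise = rng.normal(size=1000)

acf = sample_acf(white_noise, nlags=40)
band = 1.96 / np.sqrt(len(white_noise))  # approximate 95% band under white noise
share_inside = np.mean(np.abs(acf[1:]) <= band)
print(f"share of lags 1-40 inside the band: {share_inside:.2f}")
```

For a white noise series, most lags should fall inside the band; a series with real serial dependence would typically show many lags outside it. The same check could be applied to `df["log_daily_sales"]` in place of the synthetic series.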

Observation

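The daily sales show no significant autocorrelation; they look like independent draws from a fixed distribution, that is, a white noise process. This is consistent with synthetically generated data. An augmented Dickey-Fuller test with a constant and no trend term (regression="c", the default) points the same way: a small p-value rejects the unit-root null, which is what a stationary white noise series would produce.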

from statsmodels.tsa.stattools import adfuller
# Defaults: regression="c" (constant only) and autolag="AIC"
result = adfuller(df["log_daily_sales"])
print(f"ADF Statistic: {result[0]:f}")
print(f"p-value: {result[1]:f}")
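To make the constant-only regression concrete, here is a minimal sketch of the plain Dickey-Fuller regression written out in NumPy. It is a simplification and not what `adfuller` does internally: `adfuller` also adds lagged differences (chosen by AIC by default) and reports a p-value. The helper `dickey_fuller_tau`, the synthetic series, and the use of the asymptotic 5% critical value of about -2.86 for the constant-only case are all assumptions made for illustration.

```python
import numpy as np

def dickey_fuller_tau(y):
    """t-statistic on rho in  dy_t = alpha + rho * y_{t-1} + e_t.

    Hypothetical helper: plain Dickey-Fuller, no lagged differences.
    """
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    # Regressors: a constant and the lagged level.
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[1] / np.sqrt(cov[1, 1])

# Synthetic series for comparison (assumptions, not the notebook's data).
rng = np.random.default_rng(0)
white_noise = rng.normal(size=1000)
random_walk = np.cumsum(rng.normal(size=1000))

CRIT_5PCT = -2.86  # asymptotic 5% critical value, constant-only case
tau_wn = dickey_fuller_tau(white_noise)
tau_rw = dickey_fuller_tau(random_walk)
print(f"white noise tau: {tau_wn:.2f}, random walk tau: {tau_rw:.2f}")
print("white noise rejects unit root at 5%:", tau_wn < CRIT_5PCT)
```

A statistic far below the critical value rejects the unit-root null. Note that this only supports stationarity; it does not by itself establish that the observations are independent.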

df_2020



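# Sum raw purchases only; a yearly sum of the log column is not meaningful.
# "Y" is the year-end alias; newer pandas versions prefer "YE".
df_yearly_ice_cream_sales = df.set_index("Date")[["ice_cream_purchases"]].resample("Y").sum()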

df_yearly_ice_cream_sales

df["ice_cream_purchases"].plot()

fpmeanweekly = "../data/mean_weekly_ice_cream_sales.csv"
dfwmics = pd.read_csv(fpmeanweekly)

dfwmics["ice_cream_purchases"].plot()

dfwmics["weekno"] = range(1, dfwmics.shape[0] + 1)

dfwmics
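The weekly file above is loaded precomputed. As a rough sketch of how weekly mean and maximum daily sales could be derived from a daily table with pandas, the example below resamples a small synthetic frame. The data are made up, and it is an assumption that the weekly CSV was produced in a similar way; only the column names `Date` and `ice_cream_purchases` are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic daily data (an assumption): 28 days starting on a Monday.
dates = pd.date_range("2020-01-06", periods=28, freq="D")
daily = pd.DataFrame({"Date": dates, "ice_cream_purchases": np.arange(1, 29)})

weekly = (
    daily.set_index("Date")["ice_cream_purchases"]
    .resample("W")           # weekly bins ending on Sunday
    .agg(["mean", "max"])    # mean and maximum daily sales per week
)
weekly["weekno"] = range(1, len(weekly) + 1)
print(weekly)
```

With the notebook's data, `daily` would be replaced by `df`, and `weekly["max"].plot()` would give a plot of maximum daily sales per week.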