Data Preparation Notebook

The data for this notebook is from Kaggle. The data looks like it is synthetic, the characteristics observed in the dataset also seem to suggest that. The objectives for this notebook are to:

  1. Understand the attributes in the dataset, thier purpose and range of values
  2. Understand how transactions are represented
  3. Extract transactions involving an ice cream purchase
  4. Aggregate the ice cream transactions at a daily cadence
  5. Extract the maximum daily sales of ice cream for each week of the analysis period
import pandas as pd
fp = "../data/Retail_Transactions_Dataset.csv"
df = pd.read_csv(fp)
df.head()

Profile the categorical columns

df.columns
category_cols = ['Payment_Method', 'City', 'Store_Type',\
       'Discount_Applied', 'Customer_Category', 'Season', 'Promotion']
df[category_cols] = df[category_cols].astype('category')
df["Date"] = pd.to_datetime(df["Date"])
df.dtypes
df["Customer_Category"].value_counts()
df["City"].value_counts()

Observation

The counts for each of the metro areas are very similar, the counts for each of the customer categories are very similar, so this dataset was probably synthetically generated.

df["Product"] = df["Product"].apply(eval)
purchase_summ = {}
for index, row in df["Product"].items():
    for p in row:
        if p in purchase_summ:
            purchase_summ[p] += 1
        else:
            purchase_summ[p] = 1
        

Extract the ice cream transactions

def is_ice_cream(row):
    for p in row:
        if p == "Ice Cream":
            return True
    return False
df["is_ice_cream"] = df["Product"].apply(is_ice_cream)
df_ice_cream_trans = df[df["is_ice_cream"]].reset_index()
req_cols = ["Date"]
df_ice_cream_trans = df_ice_cream_trans[req_cols]
df_ice_cream_trans["ice_cream_purchases"] = 1
df_daily_ice_cream_sales = df_ice_cream_trans.set_index("Date").resample("D").sum()
df_weekly_max_ice_cream_sales = df_daily_ice_cream_sales.assign(year=df_daily_ice_cream_sales.index.year,\
                                week = df_daily_ice_cream_sales.index.isocalendar().week).groupby(['year','week']).max()
df_weekly_mean_ice_cream_sales = df_daily_ice_cream_sales.assign(year=df_daily_ice_cream_sales.index.year,\
                                week = df_daily_ice_cream_sales.index.isocalendar().week).groupby(['year','week']).mean()
df_weekly_max_ice_cream_sales = pd.DataFrame(df_weekly_max_ice_cream_sales.to_records()) 
df_weekly_max_ice_cream_sales
df_weekly_mean_ice_cream_sales

Write the extracted files for further analysis

fpdaily = "../data/daily_ice_cream_sales.csv"
fpmaxweekly = "../data/max_weekly_ice_cream_sales.csv"
fpmeanweekly = "../data/mean_weekly_ice_cream_sales.csv"
df_daily_ice_cream_sales.to_csv(fpdaily, index=True)
df_weekly_max_ice_cream_sales.to_csv(fpmaxweekly, index=True)
df_weekly_mean_ice_cream_sales.to_csv(fpmeanweekly, index=True)