import warnings
warnings.filterwarnings('ignore')
import pandas as pd

# Load the monthly coffee price series, keep the date and price columns,
# and parse the date strings into datetimes
fp = "../data/cp_data.csv"
df = pd.read_csv(fp)
cols = ["date", "cents_per_lb"]
df = df[cols]
df["date"] = pd.to_datetime(df["date"])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 422 entries, 0 to 421
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          422 non-null    datetime64[ns]
 1   cents_per_lb  422 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 6.7 KB
df
           date  cents_per_lb
0    1990-01-01        75.830
1    1990-02-01        84.010
2    1990-03-01        93.960
3    1990-04-01        93.730
4    1990-05-01        92.020
..          ...           ...
417  2024-10-01       276.777
418  2024-11-01       304.953
419  2024-12-01       344.119
420  2025-01-01       353.933
421  2025-02-01       409.516

[422 rows x 2 columns]

# Plot the raw price series using plotly.express
import plotly.express as px
fig = px.line(df, x='date', y='cents_per_lb')
fig.show()
  1. Coffee prices have cycles, as is evident from the chart above.
  2. Each hill-like pattern in the graph is a cycle; from visual inspection we can count 6 to 7 cycles in coffee prices between 1990 and the present.
  3. Visual exploration only gives a rough estimate of how many cycles to expect; a rigorous statistical method is needed to determine both the number of cycles and the end points of each one.
  4. The statistical methodology applied here assumes that prices within each regime are Gaussian. It applies the PELT change point detection method developed by Killick et al., available in the ruptures Python package; see the ruptures documentation for the Gaussian cost function implementation, and the sketch of the penalized objective just below this list.
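
For reference, PELT chooses the segmentation that minimizes a penalized total cost. With a Gaussian segment cost the objective is, roughly,

\[ \min_{0 = t_0 < t_1 < \dots < t_K = n} \ \sum_{k=1}^{K} c\left(y_{t_{k-1}+1 : t_k}\right) + \beta K \]

where \(c(\cdot)\) is the negative Gaussian log-likelihood of a segment and \(\beta\) is the penalty supplied as pen in the code below.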
import ruptures as rpt

# PELT with a Gaussian (normal) segment cost; min_size caps how short a regime can be
algo = rpt.Pelt(custom_cost=rpt.costs.CostNormal(), min_size=10).fit(df["cents_per_lb"].values)
my_bkps = algo.predict(pen=200)
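
A quick sanity check: predict() returns the segment end indices, and the final entry is always the series length, so the length of the list equals the number of segments rather than the number of interior change points.

print(my_bkps[-1], len(my_bkps))  # last entry is len(df) == 422; list length = segment count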

We conduct an experiment to determine how the number of change points varies with the regularization penalty. Killick suggests that the penalty is ideally \(k \log(n)\), where \(n\) is the number of data points (422 here) and \(k\) is a small constant. Running the experiment below and visually inspecting the resulting number of cycles, a penalty value of about 55 seems right for this dataset.

import numpy as np

# Sweep the penalty and record the breakpoints found at each value.
# Fit once; only the penalty changes between calls to predict().
pens = np.arange(2, 62, 2)
pen_res = {}

algo = rpt.Pelt(custom_cost=rpt.costs.CostNormal(), min_size=12).fit(df["cents_per_lb"].values)
for p in pens:
    pen_res[p] = algo.predict(pen=p)

    
# len(pen_res[p]) counts segments, since predict's list always ends with n
num_bks = {int(p): len(pen_res[p]) for p in pens}
df_bkps = pd.DataFrame.from_dict(num_bks, orient='index').reset_index()
df_bkps.columns = ["penalty", "num_breaks"]
fig = px.line(df_bkps, x='penalty', y='num_breaks', markers=True)
fig.show()
np.log2(df.shape[0])  # log2(n) for the k log(n) penalty heuristic
np.float64(8.721099188707186)

Killick’s paper recommends a penalty of \(k \log(n)\), where \(k\) is a small constant and \(n\) is the number of data points. The plot above shows the penalty versus the number of break points detected. A penalty of about \(55\), corresponding to \(k \approx 6\) with the base-2 logarithm computed above, is consistent with the six to seven cycles we observe visually in the time series plot.
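
As a quick arithmetic check of that reading, the implied \(k\) at the chosen penalty is:

k = 55 / np.log2(df.shape[0])  # implied k: 55 / 8.72, about 6.3
round(k, 2)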

# Re-fit and segment at the chosen penalty
if 'pen_res' in globals():
    del pen_res
algo = rpt.Pelt(custom_cost=rpt.costs.CostNormal(), min_size=12).fit(df["cents_per_lb"].values)
my_bkps = algo.predict(pen=55)
len(my_bkps)
8
my_bkps
[50, 100, 125, 180, 235, 305, 375, 422]
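
Each index in my_bkps is the exclusive end of a segment, so the row just before it is the last month of that regime. A small sketch to read the boundaries off the calendar:

# last month of each detected regime
df["date"].iloc[[b - 1 for b in my_bkps]].dt.strftime("%Y-%m").tolist()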
# Label each row with its regime; ruptures breakpoints are exclusive segment ends
start = 0
for i, bkp in enumerate(my_bkps):
    df.loc[start : bkp - 1, "regime"] = "R-" + str(i + 1)
    start = bkp
df
           date  cents_per_lb regime
0    1990-01-01        75.830    R-1
1    1990-02-01        84.010    R-1
2    1990-03-01        93.960    R-1
3    1990-04-01        93.730    R-1
4    1990-05-01        92.020    R-1
..          ...           ...    ...
417  2024-10-01       276.777    R-8
418  2024-11-01       304.953    R-8
419  2024-12-01       344.119    R-8
420  2025-01-01       353.933    R-8
421  2025-02-01       409.516    R-8

[422 rows x 3 columns]

A line plot marking each regime with a separate color:

fig = px.line(df, x='date', y='cents_per_lb', color='regime', markers=True)
fig.show()
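
The regime boundaries can also be overlaid as dashed vertical lines; this sketch assumes a recent plotly version where add_vline accepts datetime x values:

for b in my_bkps[:-1]:  # skip the final index, which is just the series end
    fig.add_vline(x=df["date"].iloc[b], line_dash="dash", line_color="gray")
fig.show()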

A violin plot confirms that each regime corresponds to a single density (not too multimodal) and summarizes the price range within each regime.

fig = px.violin(df, y="cents_per_lb", x="regime", color="regime", box=True, points="all",
          hover_data=df.columns)
fig.show()

Summarize each regime, then write the prepared data file to disk.

df.groupby("regime").agg(
    size=pd.NamedAgg(column="regime", aggfunc="size"),
    min_price=pd.NamedAgg(column="cents_per_lb", aggfunc="min"),
    max_price=pd.NamedAgg(column="cents_per_lb", aggfunc="max"),
    mean_price=pd.NamedAgg(column="cents_per_lb", aggfunc="mean"),
    std_dev_price=pd.NamedAgg(column="cents_per_lb", aggfunc="std"),
).round(2)
        size  min_price  max_price  mean_price  std_dev_price
regime
R-1       50      50.83      94.92       76.94          12.83
R-2       50      87.02     265.61      156.20          37.78
R-3       25      84.52     138.94      108.44          12.72
R-4       55      54.36     104.39       68.45           9.94
R-5       55      99.49     157.29      124.97          14.43
R-6       70     122.02     300.48      193.37          46.80
R-7       70     120.55     184.12      147.23          15.16
R-8       47     168.65     409.52      241.02          47.27
fp = "../data/regimed_coffee_prices.csv"
df.to_csv(fp, index=False, header=True)
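
A quick round trip verifies the file on disk; note that read_csv needs parse_dates to restore the datetime dtype:

check = pd.read_csv(fp, parse_dates=["date"])
check.shape, check["date"].dtype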