import warnings
warnings.filterwarnings('ignore')
import pandas as pd
fp = "../data/cp_data.csv"
df = pd.read_csv(fp)
cols = ["date", "cents_per_lb"]
df = df[cols]
df["date"] = pd.to_datetime(df["date"])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 422 entries, 0 to 421
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          422 non-null    datetime64[ns]
 1   cents_per_lb  422 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 6.7 KB
df
|  | date | cents_per_lb |
|---|---|---|
| 0 | 1990-01-01 | 75.830 | 
| 1 | 1990-02-01 | 84.010 | 
| 2 | 1990-03-01 | 93.960 | 
| 3 | 1990-04-01 | 93.730 | 
| 4 | 1990-05-01 | 92.020 | 
| ... | ... | ... | 
| 417 | 2024-10-01 | 276.777 | 
| 418 | 2024-11-01 | 304.953 | 
| 419 | 2024-12-01 | 344.119 | 
| 420 | 2025-01-01 | 353.933 | 
| 421 | 2025-02-01 | 409.516 | 
422 rows × 2 columns
# Using plotly.express
import plotly.express as px
fig = px.line(df, x='date', y="cents_per_lb")
fig.show()
- Coffee prices have cycles, as is evident from the chart above.
- Each hill-like pattern in the graph is a cycle. From visual inspection we can count 6 to 7 cycles in coffee prices between 1990 and the present.
- Visual exploration gives only a rough estimate of how many cycles to expect; a rigorous statistical method is needed to determine the end points (and hence the number) of the cycles.
- The statistical methodology applied here assumes that prices follow a Gaussian process. It applies the PELT change point detection method developed by Killick et al., available in the ruptures Python package. Please see this page for the Gaussian cost function implementation.
 
import ruptures as rpt
# Alternative built-in cost models include "l1", "l2", and "rbf"; a Gaussian cost is used here.
algo = rpt.Pelt(custom_cost=rpt.costs.CostNormal(), min_size=10).fit(df["cents_per_lb"].values)
my_bkps = algo.predict(pen=200)
Conduct an experiment to determine how the number of change points varies with the regularization penalty parameter. Killick suggests that the penalty is ideally \(k \log(n)\), where \(n\) is the number of data points (422 here) and \(k\) is a small constant. By running the experiment below and visually inspecting the number of cycles, a penalty value of 55 seems about right for this dataset.
import numpy as np
pen = np.arange(2,62,2)
pen_res = {}
for p in pen:
    algo = rpt.Pelt(custom_cost=rpt.costs.CostNormal(), min_size=12).fit(df["cents_per_lb"].values)
    my_bkps = algo.predict(pen=p)
    pen_res[p] = my_bkps
num_bks = {int(p): len(pen_res[p]) for p in pen}
df_bkps = pd.DataFrame.from_dict(num_bks, orient='index').reset_index()
df_bkps.columns = ["penalty", "num_breaks"]
fig = px.line(df_bkps, x='penalty', y='num_breaks', markers=True)
fig.show()
np.log2(df.shape[0])
np.float64(8.721099188707186)
Killick’s paper recommends \(k \log(n)\) for the penalty, where \(k\) is a small constant and \(n\) is the number of data points. A plot of penalty versus the number of break points detected is shown above. A value of \(k\) of about 6, combined with \(\log_2(422) \approx 8.7\), gives a penalty in the low fifties, consistent with the 6 to 7 cycles we observe visually in the time series plot.
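The \(k \log(n)\) arithmetic can be checked directly; the choice of logarithm base and of \(k = 6\) below are assumptions for illustration, chosen to match the penalty used in this notebook.

```python
import numpy as np

n = 422
print(np.log(n))       # natural log, about 6.0
print(np.log2(n))      # base-2 log, about 8.7
# With k = 6 and a base-2 log, the penalty lands near the 55 used below
print(6 * np.log2(n))  # about 52
```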
if 'pen_res' in globals():
    del pen_res
algo = rpt.Pelt(custom_cost=rpt.costs.CostNormal(), min_size=12).fit(df["cents_per_lb"].values)
my_bkps = algo.predict(pen=55)
len(my_bkps)
8
my_bkps
[50, 100, 125, 180, 235, 305, 375, 422]
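The breakpoints returned by ruptures are end indices: each value marks the first index after a segment, and the final value equals the series length. A small sketch of how they map to half-open segments (the breakpoint list is copied from the output above):

```python
bkps = [50, 100, 125, 180, 235, 305, 375, 422]

# Pair each breakpoint with the previous one to get [start, end) segments
segments = list(zip([0] + bkps[:-1], bkps))
print(segments[0])    # (0, 50) -> rows 0..49 form regime R-1
print(len(segments))  # 8 regimes
```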
start = 0
for i, bkp in enumerate(my_bkps):
    # each breakpoint is the index just past the end of its segment
    df.loc[start : bkp - 1, "regime"] = "R-" + str(i + 1)
    start = bkp
df
|  | date | cents_per_lb | regime |
|---|---|---|---|
| 0 | 1990-01-01 | 75.830 | R-1 | 
| 1 | 1990-02-01 | 84.010 | R-1 | 
| 2 | 1990-03-01 | 93.960 | R-1 | 
| 3 | 1990-04-01 | 93.730 | R-1 | 
| 4 | 1990-05-01 | 92.020 | R-1 | 
| ... | ... | ... | ... | 
| 417 | 2024-10-01 | 276.777 | R-8 | 
| 418 | 2024-11-01 | 304.953 | R-8 | 
| 419 | 2024-12-01 | 344.119 | R-8 | 
| 420 | 2025-01-01 | 353.933 | R-8 | 
| 421 | 2025-02-01 | 409.516 | R-8 | 
422 rows × 3 columns
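The regime labels can also be assigned without an explicit loop using `pd.cut` with half-open bins; a toy sketch (the data and breakpoints here are made up, in the same end-index style as the ruptures output):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"price": np.arange(8.0)})
bkps = [3, 6, 8]  # toy end indices

labels = [f"R-{i + 1}" for i in range(len(bkps))]
# Half-open bins [0, 3), [3, 6), [6, 8), one label per segment
toy["regime"] = pd.cut(toy.index, bins=[0] + bkps, right=False, labels=labels)
print(toy["regime"].tolist())
# -> ['R-1', 'R-1', 'R-1', 'R-2', 'R-2', 'R-2', 'R-3', 'R-3']
```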
A line plot that marks each regime with a separate color
fig = px.line(df, x='date', y='cents_per_lb', color='regime', markers=True)
fig.show()
A violin plot confirms that each regime corresponds to a single density (not strongly multimodal) and summarizes the price range within each regime.
fig = px.violin(df, y="cents_per_lb", x="regime", color="regime", box=True, points="all",
          hover_data=df.columns)
fig.show()
Summarize the size and price statistics for each regime.
df.groupby("regime").agg( size = pd.NamedAgg(column="regime", aggfunc="size"),\
                         min_price = pd.NamedAgg(column="cents_per_lb", aggfunc="min"),\
                        max_price = pd.NamedAgg(column="cents_per_lb", aggfunc="max"),\
                        mean_price = pd.NamedAgg(column="cents_per_lb", aggfunc="mean"),\
                        std_dev_price = pd.NamedAgg(column="cents_per_lb", aggfunc="std")).round(2)
| regime | size | min_price | max_price | mean_price | std_dev_price |
|---|---|---|---|---|---|
| R-1 | 50 | 50.83 | 94.92 | 76.94 | 12.83 | 
| R-2 | 50 | 87.02 | 265.61 | 156.20 | 37.78 | 
| R-3 | 25 | 84.52 | 138.94 | 108.44 | 12.72 | 
| R-4 | 55 | 54.36 | 104.39 | 68.45 | 9.94 | 
| R-5 | 55 | 99.49 | 157.29 | 124.97 | 14.43 | 
| R-6 | 70 | 122.02 | 300.48 | 193.37 | 46.80 | 
| R-7 | 70 | 120.55 | 184.12 | 147.23 | 15.16 | 
| R-8 | 47 | 168.65 | 409.52 | 241.02 | 47.27 | 
Write the prepared data file to disk.
fp = "../data/regimed_coffee_prices.csv"
df.to_csv(fp, index=False, header=True)