```python
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

fp = "../data/cp_data.csv"
df = pd.read_csv(fp)
cols = ["date", "cents_per_lb"]
df = df[cols]
df["date"] = pd.to_datetime(df["date"])
df.info()
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 422 entries, 0 to 421
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   date          422 non-null    datetime64[ns]
 1   cents_per_lb  422 non-null    float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 6.7 KB
```
```python
df
```

|     | date       | cents_per_lb |
|-----|------------|--------------|
| 0   | 1990-01-01 | 75.830       |
| 1   | 1990-02-01 | 84.010       |
| 2   | 1990-03-01 | 93.960       |
| 3   | 1990-04-01 | 93.730       |
| 4   | 1990-05-01 | 92.020       |
| ... | ...        | ...          |
| 417 | 2024-10-01 | 276.777      |
| 418 | 2024-11-01 | 304.953      |
| 419 | 2024-12-01 | 344.119      |
| 420 | 2025-01-01 | 353.933      |
| 421 | 2025-02-01 | 409.516      |

422 rows × 2 columns
```python
# Using plotly.express
import plotly.express as px

fig = px.line(df, x='date', y="cents_per_lb")
fig.show()
```
- Coffee prices show clear cycles, as is evident from the chart above.
- Each hill-like pattern in the graph is a cycle. From visual inspection we can count 6 to 7 cycles in coffee prices between 1990 and the present.
- Rigorous statistical methodology is needed to determine the endpoints of each cycle. Visual exploration only provides a rough estimate of how many cycles to expect; the statistical model determines their number and locations.
- The methodology applied here assumes that prices follow a Gaussian process. It uses the PELT change point detection method developed by Killick et al., as implemented in the `ruptures` Python package. Please see this page for the Gaussian process cost function implementation.
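To build intuition for why this cost function finds level shifts, here is a minimal 1-D sketch of a Gaussian segment cost (my own illustrative version; `ruptures`' `CostNormal` handles the general multivariate case via the log-determinant of the segment covariance):

```python
import numpy as np

def normal_cost(segment):
    # illustrative 1-D Gaussian segment cost: segment length times the
    # log of the empirical variance; a homogeneous segment is cheap,
    # while a segment straddling a level shift is expensive
    segment = np.asarray(segment, dtype=float)
    return len(segment) * np.log(segment.var())

rng = np.random.default_rng(0)
flat = 100.0 + rng.normal(0, 1, 50)   # one stable price level
shifted = np.r_[flat, flat + 80.0]    # same noise, but with a level shift
print(normal_cost(flat), normal_cost(shifted))
```

Splitting the shifted series at the jump lowers the total cost, which is exactly what PELT exploits when it searches over candidate breakpoints.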
```python
import ruptures as rpt

#model = "l1" # "l2", "rbf"
algo = rpt.Pelt(custom_cost=rpt.costs.CostNormal(), min_size=10).fit(df["cents_per_lb"].values)
my_bkps = algo.predict(pen=200)
```
Conduct an experiment to determine how the number of change points varies with the regularization penalty parameter. Killick et al. suggest that the penalty is ideally \(k \log(n)\), where \(n\) is the number of data points (422 here) and \(k\) is a small constant. By running the experiment below and visually inspecting the number of cycles, a penalty value of about 55 seems right for this dataset. Edit: a Gaussian process is now used to model the prices (via the `CostNormal` cost).
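As a quick side computation (my own, using a base-2 logarithm), the \(k \log(n)\) heuristic yields the following candidate penalties for small \(k\):

```python
import numpy as np

n = 422  # number of monthly observations
# candidate penalties under the k*log(n) heuristic, for small k
for k in range(1, 9):
    print(f"k={k}: penalty ~ {k * np.log2(n):.1f}")
```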
```python
import numpy as np

pen = np.arange(2, 62, 2)
pen_res = {}

model = "rbf"  # "l2", "rbf" (unused here; the custom Gaussian cost is passed instead)
for p in pen:
    algo = rpt.Pelt(custom_cost=rpt.costs.CostNormal(), min_size=12).fit(df["cents_per_lb"].values)
    my_bkps = algo.predict(pen=p)
    pen_res[p] = my_bkps
```
```python
num_bks = {int(p): len(pen_res[p]) for p in pen}
df_bkps = pd.DataFrame.from_dict(num_bks, orient='index').reset_index()
df_bkps.columns = ["penalty", "num_breaks"]
fig = px.line(df_bkps, x='penalty', y='num_breaks', markers=True)
fig.show()
```
```python
np.log2(df.shape[0])
```

```
np.float64(8.721099188707186)
```
Killick’s paper recommends a penalty of \(k \log(n)\), where \(k\) is a small constant and \(n\) is the number of data points. The plot above shows the penalty versus the number of breakpoints detected. A value of \(k \approx 6\) (giving a penalty of roughly \(6 \log_2 422 \approx 55\)) is consistent with what we observe visually in the plot of the time series.
```python
if 'pen_res' in globals():
    del pen_res

algo = rpt.Pelt(custom_cost=rpt.costs.CostNormal(), min_size=12).fit(df["cents_per_lb"].values)
my_bkps = algo.predict(pen=55)
```
```python
len(my_bkps)
```

```
8
```

```python
my_bkps
```

```
[50, 100, 125, 180, 235, 305, 375, 422]
```
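Since the breakpoints are row indices, it can be handy to see which calendar months they correspond to. A small sketch (the monthly index is reconstructed here from the date range shown in the table above):

```python
import pandas as pd

# reconstruct the monthly index (Jan 1990 through Feb 2025)
dates = pd.date_range("1990-01-01", periods=422, freq="MS")
bkps = [50, 100, 125, 180, 235, 305, 375, 422]
# last month of each detected regime
print([str(dates[b - 1].date()) for b in bkps])
```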
```python
for i in range(len(my_bkps)):
    if i == 0:
        df.loc[: my_bkps[i], "regime"] = "R-" + str(i + 1)
    elif (i > 0) and (i <= (len(my_bkps) - 1)):
        df.loc[my_bkps[i - 1] : my_bkps[i], "regime"] = "R-" + str(i + 1)
    else:
        print("last breakpoint, do nothing!")
```
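The labeling loop can also be written without explicit branching. A sketch using `np.searchsorted` on the breakpoint list (producing the same R-1 through R-8 labels, assuming the breakpoints shown earlier):

```python
import numpy as np
import pandas as pd

bkps = [50, 100, 125, 180, 235, 305, 375, 422]  # segment end indices from PELT
n = bkps[-1]
# for each row index, find which segment it falls into (0-based), then label it
seg = np.searchsorted(bkps, np.arange(n), side="right")
regime = pd.Series(["R-" + str(s + 1) for s in seg])
print(regime.value_counts().sort_index())
```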
```python
df
```

|     | date       | cents_per_lb | regime |
|-----|------------|--------------|--------|
| 0   | 1990-01-01 | 75.830       | R-1    |
| 1   | 1990-02-01 | 84.010       | R-1    |
| 2   | 1990-03-01 | 93.960       | R-1    |
| 3   | 1990-04-01 | 93.730       | R-1    |
| 4   | 1990-05-01 | 92.020       | R-1    |
| ... | ...        | ...          | ...    |
| 417 | 2024-10-01 | 276.777      | R-8    |
| 418 | 2024-11-01 | 304.953      | R-8    |
| 419 | 2024-12-01 | 344.119      | R-8    |
| 420 | 2025-01-01 | 353.933      | R-8    |
| 421 | 2025-02-01 | 409.516      | R-8    |

422 rows × 3 columns
A line plot that marks each regime with a separate color.
```python
fig = px.line(df, x='date', y='cents_per_lb', color='regime', markers=True)
fig.show()
```
A violin plot to confirm that each regime corresponds to a single density (not too multimodal) and to summarize the range of prices within each regime.
```python
fig = px.violin(df, y="cents_per_lb", x="regime", color="regime", box=True, points="all",
                hover_data=df.columns)
fig.show()
```
Write the prepared datafile to disk.
"regime").agg( size = pd.NamedAgg(column="regime", aggfunc="size"),\
df.groupby(= pd.NamedAgg(column="cents_per_lb", aggfunc="min"),\
min_price = pd.NamedAgg(column="cents_per_lb", aggfunc="max"),\
max_price = pd.NamedAgg(column="cents_per_lb", aggfunc="mean"),\
mean_price = pd.NamedAgg(column="cents_per_lb", aggfunc="std")).round(2) std_dev_price
| regime | size | min_price | max_price | mean_price | std_dev_price |
|--------|------|-----------|-----------|------------|---------------|
| R-1    | 50   | 50.83     | 94.92     | 76.94      | 12.83         |
| R-2    | 50   | 87.02     | 265.61    | 156.20     | 37.78         |
| R-3    | 25   | 84.52     | 138.94    | 108.44     | 12.72         |
| R-4    | 55   | 54.36     | 104.39    | 68.45      | 9.94          |
| R-5    | 55   | 99.49     | 157.29    | 124.97     | 14.43         |
| R-6    | 70   | 122.02    | 300.48    | 193.37     | 46.80         |
| R-7    | 70   | 120.55    | 184.12    | 147.23     | 15.16         |
| R-8    | 47   | 168.65    | 409.52    | 241.02     | 47.27         |
Write the prepared data file to disk.

```python
fp = "../data/regimed_coffee_prices.csv"
df.to_csv(fp, index=False, header=True)
```