Predicting coffee quality
My friend Javier has an interest in coffee and machine learning, and is working toward his PhD in the latter. We decided to collaborate on a project where we use data about the coffee bean and the roast profile to predict the coffee quality score assigned by a professional cupper (coffee judge).
I used the fast.ai library built on PyTorch, and used the Jupyter notebook created as an example of structured data in the fast.ai online course as a template.
I’d originally created a fully connected network manually in Python and I was eager to leverage the greater speed and advanced techniques showcased in fast.ai. Not much data was initially available for this project, only 70 labelled records of roasting data / cupping scores. Javier is in the ongoing process of collecting more labelled data in association with Ross House in Auburn, AL, and we now have 140 records. We’ve discussed the possibility of augmenting the labelled with unlabelled data if such a repository can be found.
The notebook produced is below. I plan to update it as more data becomes available.
A deep NN using embeddings for predicting cupping evaluations.¶
%matplotlib inline
%reload_ext autoreload
%autoreload 2
from oldfastai.structured import *
from oldfastai.column_data import *
import pandas as pd
PATH='data/coffee/'
Create datasets¶
We first read the .csv file containing the data into a pandas dataframe.
data = pd.read_csv(f"{PATH}jl_coffee_quality_roastingParams_10262018.csv")
data.head()
len(data)
The SampleCode is an identifier tied to each profile and is not useful for prediction or training.
data = data.drop("SampleCode", axis=1)
In the case we wanted to split our data into a test and training set we would divide it into separate dataframes here. As it stands we just alias our dataframe.
train = data
train.head()
len(train)
Format data¶
We now want to convert data into a format suitable for our neural network. We will treat the country of origin as a category variable (meaning each country will be associated with a trainable feature vector) and all other fields will be treated as continuous variables, which will be normalized.
cat_vars = ['COUNTRY']
contin_vars = ['INITIALTEMP', 'MINTEMP', 'MINSECS', 'MAXTEMP', 'MAXSECS', 'FINALTEMP', 'SECONDS', 'SCORE']
n = len(train); n
We identify the dependant variable, score.
dep = 'SCORE'
We identify the category and continuous variables as such in the dataframe.
for v in cat_vars: train[v] = train[v].astype('category').cat.as_ordered()
for v in contin_vars:
train[v] = train[v].fillna(0).astype('float32')
samp_size = n
We can now process our data, which gives us a new dataframe, an array for the dependant variable, a dictionary identifying where fields were missing from our data, and a mapper which tracks the details of the normalization process, so it can later be reversed. We will also work with our dependant variable in log scale.
df, y, nas, mapper = proc_df(train, dep, do_scale=True)
# yl = (y - .8*min(y)) / (max(y)-.8*min(y))
# y_range = (.8*min(y),1)
yl = np.log(y)
By looking in the processed dataframe, we see that the data is normalized.
df.head(2)
We now split our data into a training and validation set.
train_ratio = 0.75
# train_ratio = 0.9
train_size = int(samp_size * train_ratio); train_size
val_idx = list(range(train_size, len(df)))
Assemble dataloader and model¶
We're ready to put together our models.
Root-mean-squared percent error is the metric we've used.
def inv_y(a): return np.exp(to_np(a))
def exp_rmspe(y_pred, targ):
targ = inv_y(targ)
pct_var = (targ - inv_y(y_pred))/targ
return math.sqrt((pct_var**2).mean())
def exp_mspe(y_pred, targ):
targ = inv_y(targ)
pct_var = (targ - inv_y(y_pred))/targ
return (pct_var**2).mean()
max_log_y = np.max(yl)
min_y = np.min(yl)+np.log(.8)
max_y = np.log(100)
y_range = (min_y, max_y)
We can create a ModelData object directly from out data frame.
bs = 7 # batch size
# md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=1, test_df=df_test)
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=bs)
cat_sz = [(c, len(train[c].cat.categories)+1) for c in cat_vars]
cat_sz
For our category variable, we will create an embedding matrix to capture the feature vectors (see Cheng Guo, Felix Berkhahn here). We use the heuristic that the feature vector will be half the dimension of the number of categories or 50, whichever is smaller.
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]
emb_szs
We'll specify some of the details of our network architecture (explained shortly).
lin_layers,drops = [30,20,10],[0.01,0.01,0.1]
# lin_layers,drops = [30,15],[0.2,0.2]
# lin_layers,drops = [30,15],[0.0,0.0]
# emb_drop = 0.2
emb_drop = 0.05
Now we create a learner, which encapsulates the dataset and dataloader in our ModelData object, and creates a model based on our needs.
The details of the model architecture are shown below. Rectangles with rounded edges indicate the presence of trainable parameters. Red connecting arrows indicate where dropout is applied during training. Not indicated in the diagram is the fact that the output of the sigmoid is scaled and shifted to be in the range $[\log(.8\times \min\{\mbox{scores} \}),\log(100)]$.

The following table gives the dimension details and dropout ratios of the network.
| Layer | Dimension in | Dimension out | Dropout ratio |
|---|---|---|---|
| Category Input | - | 13 | - |
| Entity Embedding | 13 | 7 | 0.05 |
| Continuous Input | - | 7 | - |
| Batch Norm (Continuous Input) | 7 | 7 | - |
| Concatenate | 7+7 | 14 | - |
| Linear, ReLU, BN | 14 | 30 | 0.01 |
| Linear, ReLU, BN | 30 | 20 | 0.01 |
| Linear, ReLU, BN | 20 | 10 | 0.1 |
| Linear, Sigmoid | 20 | 1 | - |
| Output | 1 | - | - |
nn.functional.sigmoid = torch.sigmoid # Added to override deprecated function
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
emb_drop, 1, lin_layers, drops, y_range=y_range)
We use the learning rate finder to find an appropriate learning rate, according to the technique described by Leslie Smith.
m.lr_find2(start_lr = 1e-5)
m.sched.plot(n_skip_end=40)
Train model¶
# m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1, [30,15], [0.001,0.01], y_range=y_range)
# m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.2, 1, [30,15], [0.2,0.2], y_range=y_range)
We now train the model using the Adam optimizer (see Diederik P. Kingma and Jimmy Ba). The learning rate and momentum factor are dynamically adjusted according to the procedure described by Leslie smith.
lr = 5e-3
m.fit(lr, 1, metrics=[exp_rmspe], use_clr_beta=(20,10,15),cycle_len=100)
Examining the results¶
Here we look at the validation data scores and predicted scores side by side.
preds = m.predict()
preds.shape, m.data.val_ds.y.shape
# y_scale = yl*()
n = len(preds)
per_col = 10
cols = n//per_col+1 if n%per_col>0 else n//per_col
table_head = "<table><tr>" + "".join(["<th>Predicted value</th><th>Actual value</th>" for i in range(cols)]) + "</tr>"
table_end = "</table>"
table_body = ""
for i,o in enumerate(np.stack([np.exp(preds),np.exp(m.data.val_ds.y)],1)):
if i % (cols) == 0:
table_body += "<tr>"
row_ended = False
table_body += f"<th>{o[0][0]:.2f}</th><th>{o[1][0]:.2f}</th>"
if i % (cols) == cols-1:
table_body += "</tr>"
row_ended = True
if not row_ended:
table_body += "</tr>"
table = table_head + table_body + table_end
| Predicted value | Actual value | Predicted value | Actual value | Predicted value | Actual value | Predicted value | Actual value |
|---|---|---|---|---|---|---|---|
| 83.04 | 83.00 | 83.47 | 83.00 | 83.47 | 83.00 | 83.21 | 83.00 |
| 83.11 | 83.00 | 83.07 | 83.00 | 83.07 | 83.00 | 83.26 | 85.00 |
| 83.34 | 85.00 | 83.27 | 85.00 | 83.27 | 85.00 | 83.27 | 85.00 |
| 83.14 | 85.00 | 83.14 | 85.00 | 83.17 | 85.00 | 83.18 | 85.00 |
| 83.16 | 85.00 | 83.00 | 85.00 | 83.04 | 85.00 | 82.98 | 85.00 |
| 83.01 | 85.00 | 83.29 | 85.00 | 83.54 | 85.00 | 83.41 | 85.00 |
| 83.37 | 85.00 | 83.65 | 85.00 | 83.55 | 85.00 | 83.55 | 85.00 |
| 83.30 | 85.00 | 83.37 | 85.00 | 83.37 | 85.00 | 83.37 | 85.00 |
| 84.48 | 85.00 | 83.29 | 85.00 | 83.28 | 85.00 |