Predicting coffee quality

My friend Javier has an interest in coffee and machine learning, and is working toward his PhD in the latter.  We decided to collaborate on a project where we use data about the coffee bean and the roast profile to predict the coffee quality score assigned by a professional cupper (coffee judge).

I used the fast.ai library built on PyTorch, and used the Jupyter notebook created as an example of structured data in the fast.ai online course as a template.

I’d originally created a fully connected network manually in Python and I was eager to leverage the greater speed and advanced techniques showcased in fast.ai.  Not much data was initially available for this project, only 70 labelled records of roasting data / cupping scores.  Javier is in the ongoing process of collecting more labelled data in association with Ross House in Auburn, AL, and we now have 140 records.  We’ve discussed the possibility of augmenting the labelled with unlabelled data if such a repository can be found.

The notebook produced is below.  I plan to update it as more data becomes available.

coffee

A deep NN using embeddings for predicting cupping evaluations.

In [61]:
%matplotlib inline
%reload_ext autoreload
%autoreload 2
In [62]:
from oldfastai.structured import *
from oldfastai.column_data import *

import pandas as pd

PATH='data/coffee/'

Create datasets

We first read the .csv file containing the data into a pandas dataframe.

In [63]:
data = pd.read_csv(f"{PATH}jl_coffee_quality_roastingParams_10262018.csv")
data.head()
Out[63]:
SampleCode COUNTRY INITIALTEMP MINTEMP MINSECS MAXTEMP MAXSECS FINALTEMP SECONDS SCORE
0 CO14 COLOMBIA 383.0 176.0 28 406.0 546 406 547 89.00
1 ET01 ETHIOPIA 367.0 167.0 29 396.0 616 394 622 86.79
2 CO09 COLOMBIA 399.0 178.0 32 401.0 592 401 604 86.46
3 HO01 HONDURAS 376.0 172.0 31 392.0 640 392 650 86.07
4 ES04 EL-SALVADOR 381.0 181.0 33 410.0 675 288 687 85.71
In [64]:
len(data)
Out[64]:
140

The SampleCode is an identifier tied to each profile and is not useful for prediction or training.

In [65]:
data = data.drop("SampleCode", axis=1)

In the case we wanted to split our data into a test and training set we would divide it into separate dataframes here. As it stands we just alias our dataframe.

In [66]:
train = data
In [67]:
train.head()
Out[67]:
COUNTRY INITIALTEMP MINTEMP MINSECS MAXTEMP MAXSECS FINALTEMP SECONDS SCORE
0 COLOMBIA 383.0 176.0 28 406.0 546 406 547 89.00
1 ETHIOPIA 367.0 167.0 29 396.0 616 394 622 86.79
2 COLOMBIA 399.0 178.0 32 401.0 592 401 604 86.46
3 HONDURAS 376.0 172.0 31 392.0 640 392 650 86.07
4 EL-SALVADOR 381.0 181.0 33 410.0 675 288 687 85.71
In [68]:
len(train)
Out[68]:
140

Format data

We now want to convert data into a format suitable for our neural network. We will treat the country of origin as a category variable (meaning each country will be associated with a trainable feature vector) and all other fields will be treated as continuous variables, which will be normalized.

In [69]:
cat_vars = ['COUNTRY']

contin_vars = ['INITIALTEMP', 'MINTEMP', 'MINSECS', 'MAXTEMP', 'MAXSECS', 'FINALTEMP', 'SECONDS', 'SCORE']

n = len(train); n
Out[69]:
140

We identify the dependant variable, score.

In [70]:
dep = 'SCORE'

We identify the category and continuous variables as such in the dataframe.

In [71]:
for v in cat_vars: train[v] = train[v].astype('category').cat.as_ordered()
In [72]:
for v in contin_vars:
    train[v] = train[v].fillna(0).astype('float32')
In [73]:
samp_size = n

We can now process our data, which gives us a new dataframe, an array for the dependant variable, a dictionary identifying where fields were missing from our data, and a mapper which tracks the details of the normalization process, so it can later be reversed. We will also work with our dependant variable in log scale.

In [300]:
df, y, nas, mapper = proc_df(train, dep, do_scale=True)
# yl = (y - .8*min(y)) / (max(y)-.8*min(y))
# y_range = (.8*min(y),1)
yl = np.log(y)

By looking in the processed dataframe, we see that the data is normalized.

In [75]:
df.head(2)
Out[75]:
COUNTRY INITIALTEMP MINTEMP MINSECS MAXTEMP MAXSECS FINALTEMP SECONDS
0 1 -0.397415 0.244077 -1.022799 -0.036889 0.121682 0.20889 -1.482574
1 6 -1.011863 -0.296506 -0.989107 -0.988043 0.378405 -0.35441 -0.574187

We now split our data into a training and validation set.

In [76]:
train_ratio = 0.75
# train_ratio = 0.9
train_size = int(samp_size * train_ratio); train_size
val_idx = list(range(train_size, len(df)))

Assemble dataloader and model

We're ready to put together our models.

Root-mean-squared percent error is the metric we've used.

In [301]:
def inv_y(a): return np.exp(to_np(a))

def exp_rmspe(y_pred, targ):
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred))/targ
    return math.sqrt((pct_var**2).mean())

def exp_mspe(y_pred, targ):
    targ = inv_y(targ)
    pct_var = (targ - inv_y(y_pred))/targ
    return (pct_var**2).mean()

max_log_y = np.max(yl)
min_y = np.min(yl)+np.log(.8)
max_y = np.log(100)
y_range = (min_y, max_y)

We can create a ModelData object directly from out data frame.

In [302]:
bs = 7 # batch size
# md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=1, test_df=df_test)
md = ColumnarModelData.from_data_frame(PATH, val_idx, df, yl.astype(np.float32), cat_flds=cat_vars, bs=bs)
In [277]:
cat_sz = [(c, len(train[c].cat.categories)+1) for c in cat_vars]
In [278]:
cat_sz
Out[278]:
[('COUNTRY', 13)]

For our category variable, we will create an embedding matrix to capture the feature vectors (see Cheng Guo, Felix Berkhahn here). We use the heuristic that the feature vector will be half the dimension of the number of categories or 50, whichever is smaller.

In [279]:
emb_szs = [(c, min(50, (c+1)//2)) for _,c in cat_sz]
In [280]:
emb_szs
Out[280]:
[(13, 7)]

We'll specify some of the details of our network architecture (explained shortly).

In [602]:
lin_layers,drops = [30,20,10],[0.01,0.01,0.1]
# lin_layers,drops = [30,15],[0.2,0.2]
# lin_layers,drops = [30,15],[0.0,0.0]
# emb_drop = 0.2
emb_drop = 0.05

Now we create a learner, which encapsulates the dataset and dataloader in our ModelData object, and creates a model based on our needs.

The details of the model architecture are shown below. Rectangles with rounded edges indicate the presence of trainable parameters. Red connecting arrows indicate where dropout is applied during training. Not indicated in the diagram is the fact that the output of the sigmoid is scaled and shifted to be in the range $[\log(.8\times \min\{\mbox{scores} \}),\log(100)]$.

nn flowchart

The following table gives the dimension details and dropout ratios of the network.

Layer Dimension in Dimension out Dropout ratio
Category Input - 13 -
Entity Embedding 13 7 0.05
Continuous Input - 7 -
Batch Norm (Continuous Input) 7 7 -
Concatenate 7+7 14 -
Linear, ReLU, BN 14 30 0.01
Linear, ReLU, BN 30 20 0.01
Linear, ReLU, BN 20 10 0.1
Linear, Sigmoid 20 1 -
Output 1 - -
In [612]:
nn.functional.sigmoid = torch.sigmoid # Added to override deprecated function
m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars),
                   emb_drop, 1, lin_layers, drops, y_range=y_range)

We use the learning rate finder to find an appropriate learning rate, according to the technique described by Leslie Smith.

In [604]:
m.lr_find2(start_lr = 1e-5)
  0%|          | 0/15 [00:00<?, ?it/s]
                                      
In [605]:
m.sched.plot(n_skip_end=40)

Train model

In [505]:
# m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.04, 1, [30,15], [0.001,0.01], y_range=y_range)
# m = md.get_learner(emb_szs, len(df.columns)-len(cat_vars), 0.2, 1, [30,15], [0.2,0.2], y_range=y_range)

We now train the model using the Adam optimizer (see Diederik P. Kingma and Jimmy Ba). The learning rate and momentum factor are dynamically adjusted according to the procedure described by Leslie smith.

In [613]:
lr = 5e-3
m.fit(lr, 1, metrics=[exp_rmspe], use_clr_beta=(20,10,15),cycle_len=100)
epoch      trn_loss   val_loss   exp_rmspe       
    0      0.004058   0.002908   0.052212  
    1      0.003366   0.001592   0.038811        
    2      0.002626   0.000653   0.024939        
    3      0.002171   0.000192   0.013667        
    4      0.001859   0.000141   0.011542        
    5      0.001513   0.000142   0.011752        
    6      0.001223   0.00016    0.012575        
    7      0.001042   0.000159   0.012473        
    8      0.000954   0.000166   0.012781        
    9      0.000805   0.000198   0.013802        
    10     0.000748   0.000192   0.013431        
    11     0.000654   0.000184   0.013118        
    12     0.000617   0.000172   0.012669        
    13     0.0006     0.000218   0.014292        
    14     0.000539   0.000299   0.016685        
    15     0.000519   0.000246   0.015264        
    16     0.000488   0.000265   0.015836        
    17     0.000458   0.000259   0.015722        
    18     0.000412   0.000293   0.016701        
    19     0.000368   0.000244   0.015311        
    20     0.000328   0.000243   0.015277        
    21     0.000297   0.000241   0.015298        
    22     0.000285   0.000335   0.01798         
    23     0.000262   0.000221   0.0145          
    24     0.000251   0.000283   0.016498        
    25     0.000248   0.000264   0.015963        
    26     0.000247   0.000339   0.018133        
    27     0.000237   0.000281   0.016462        
    28     0.000225   0.000276   0.016333        
    29     0.000212   0.000253   0.015645        
    30     0.000198   0.000331   0.017949        
    31     0.000209   0.000247   0.015508        
    32     0.000203   0.000353   0.018536        
    33     0.000203   0.000283   0.016578        
    34     0.000204   0.000255   0.01568         
    35     0.000204   0.000323   0.017694        
    36     0.000201   0.000308   0.017273        
    37     0.000194   0.000375   0.019055        
    38     0.000185   0.000334   0.017976        
    39     0.00018    0.000269   0.015901        
    40     0.000187   0.00028    0.016203        
    41     0.000193   0.000265   0.015902        
    42     0.000197   0.000313   0.017382        
    43     0.000197   0.000399   0.019622        
    44     0.000191   0.000216   0.014434        
    45     0.000183   0.000251   0.015572        
    46     0.000183   0.000271   0.01621         
    47     0.000175   0.000281   0.016513        
    48     0.000173   0.000276   0.0163          
    49     0.00017    0.000252   0.015536        
    50     0.000162   0.00039    0.019403        
    51     0.000155   0.000283   0.016464        
    52     0.000152   0.000312   0.017325        
    53     0.000149   0.000291   0.016766        
    54     0.000144   0.000281   0.016422        
    55     0.000141   0.00026    0.015789        
    56     0.000143   0.000265   0.015925        
    57     0.000145   0.000255   0.015695        
    58     0.00014    0.000294   0.016846        
    59     0.000135   0.000327   0.017739        
    60     0.000142   0.000299   0.016952        
    61     0.000142   0.000309   0.017258        
    62     0.000136   0.000295   0.01686         
    63     0.00014    0.000316   0.017472        
    64     0.000138   0.000304   0.017127        
    65     0.000142   0.000317   0.017479        
    66     0.000141   0.000328   0.017765        
    67     0.00014    0.000282   0.016474        
    68     0.000135   0.000289   0.01664         
    69     0.000128   0.000338   0.018018        
    70     0.000131   0.000329   0.017793        
    71     0.00013    0.000306   0.017145        
    72     0.000128   0.000297   0.016886        
    73     0.000124   0.000304   0.017084        
    74     0.000134   0.000305   0.017135        
    75     0.000126   0.000364   0.018715        
    76     0.000122   0.000378   0.01907         
    77     0.000119   0.000351   0.018376        
    78     0.000114   0.000348   0.018288        
    79     0.000116   0.000372   0.01893         
    80     0.000114   0.000362   0.018669        
    81     0.00012    0.000333   0.01789         
    82     0.000125   0.000332   0.017853        
    83     0.000122   0.000331   0.01784         
    84     0.000119   0.000318   0.017478        
    85     0.000116   0.000309   0.017235        
    86     0.000114   0.000312   0.017296        
    87     0.00012    0.000314   0.017369        
    88     0.000118   0.000329   0.017765        
    89     0.000114   0.000336   0.017965        
    90     0.000111   0.000333   0.017881        
    91     0.000109   0.000331   0.017834        
    92     0.000115   0.000331   0.01783         
    93     0.000114   0.000333   0.017887        
    94     0.000121   0.000329   0.017769        
    95     0.000119   0.00033    0.017811        
    96     0.000114   0.00033    0.017804        
    97     0.000109   0.000329   0.017785        
    98     0.000107   0.000329   0.01778         
    99     0.000107   0.000329   0.017777        

Out[613]:
[0.0003290936117991805, 0.01777729396488587]

Examining the results

Here we look at the validation data scores and predicted scores side by side.

In [614]:
preds = m.predict()
In [597]:
preds.shape, m.data.val_ds.y.shape
Out[597]:
((35, 1), (35, 1))
In [615]:
# y_scale = yl*()
n = len(preds)
per_col = 10
cols = n//per_col+1 if n%per_col>0 else n//per_col
table_head = "<table><tr>" + "".join(["<th>Predicted value</th><th>Actual value</th>" for i in range(cols)]) + "</tr>"
table_end = "</table>"
table_body = ""
for i,o in enumerate(np.stack([np.exp(preds),np.exp(m.data.val_ds.y)],1)):
    if i % (cols) == 0:
        table_body += "<tr>"
        row_ended = False
    table_body += f"<th>{o[0][0]:.2f}</th><th>{o[1][0]:.2f}</th>"
    if i % (cols) == cols-1:
        table_body += "</tr>"
        row_ended = True
if not row_ended:
    table_body += "</tr>"

table = table_head + table_body + table_end
Predicted valueActual valuePredicted valueActual valuePredicted valueActual valuePredicted valueActual value
83.0483.0083.4783.0083.4783.0083.2183.00
83.1183.0083.0783.0083.0783.0083.2685.00
83.3485.0083.2785.0083.2785.0083.2785.00
83.1485.0083.1485.0083.1785.0083.1885.00
83.1685.0083.0085.0083.0485.0082.9885.00
83.0185.0083.2985.0083.5485.0083.4185.00
83.3785.0083.6585.0083.5585.0083.5585.00
83.3085.0083.3785.0083.3785.0083.3785.00
84.4885.0083.2985.0083.2885.00