Coverage vs. Accuracy: Striking a Balance in Data Science
The art of getting quick gains with agile model production

Cover image by ChatGPT

This post was written together with and inspired by Yuval Cohen


Introduction

Every day, numerous data science projects are discarded due to insufficient prediction accuracy. It's a regrettable outcome, considering that these models are often exceptionally well suited to some subsets of the dataset.

Data scientists often try to improve their models by using more complex models and by throwing more and more data at the problem. But often there is a much simpler and more productive approach: instead of trying to make all of our predictions better all at once, we can start by making good predictions for the easy parts of the data, and only then work on the harder parts.

This approach can greatly improve our ability to solve real-world problems: we start with quick gains on the easy problems and only then focus our effort on the harder ones.


Thinking Agile

Agile production means focusing on the easy data first and, only after it has been properly modelled, moving on to the more complicated tasks. This allows a workflow that is iterative, value-driven, and collaborative.

It allows for quicker results, adaptability to changing circumstances, and continuous improvement, which are core ideas of agile production.

  1. Iterative and incremental approach: work in short, iterative cycles. Start by achieving high accuracy for the easy problems and then move on to the harder parts.
  2. Focus on delivering value: work on the problem with the highest marginal value for your time.
  3. Flexibility and adaptability: Allow yourself to adapt to changing circumstances. For example, a client might need you to focus on a certain subset of the data – once you’ve solved that small problem, the circumstances have changed and you might need to work on something completely different. Breaking the problem into small parts allows you to adapt to the changing circumstances.
  4. Feedback and continuous improvement: By breaking up a problem you allow yourself to be in constant and continuous improvement, rather than waiting for big improvements in large chunks.
  5. Collaboration: Breaking the problem into small pieces promotes parallelization of the work and collaboration between team members, rather than putting all of the work on one person.

Breaking down the complexity

In real-world datasets, complexity is the rule rather than the exception. Consider a medical diagnosis task, where subtle variations in symptoms can make the difference between life-threatening conditions and minor ailments. Achieving high accuracy in such scenarios can be challenging, if not impossible, due to the inherent noise and nuances in the data.

This is where the idea of Coverage comes into play. Coverage refers to the portion of the data that a model successfully predicts or classifies with high confidence or high precision. Instead of striving for high accuracy across the entire dataset, researchers can choose to focus on a subset of the data where prediction is relatively straightforward. By doing so, they can achieve high accuracy on this subset while acknowledging the existence of a more challenging, uncovered portion.
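
To make this concrete, using the notation of the code later in this post, for a chosen confidence threshold τ:

Coverage(τ) = (number of predictions with confidence above τ) / (total number of predictions)
Accuracy on covered data(τ) = (number of correct predictions among those) / (number of predictions with confidence above τ)

Raising τ typically lowers the coverage and raises the accuracy on the covered data; this is exactly the trade-off explored in the example below.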

For instance, consider a trained model with a 60% accuracy rate on a test dataset. If we could identify and keep only the predictions we are very sure about (although we still have to decide what "very sure" means), we might end up with a model that covers fewer cases, say around 60% of them, but with significantly improved accuracy, perhaps reaching 85%. In other words, roughly 0.60 × 0.85 ≈ 51% of all test cases would be handled correctly and confidently, and the remaining cases could be routed elsewhere.

I don’t know any product manager who would say no in such a situation. Especially if there is no model in production, and this is the first model.


The two-step model

We want to divide our data into two distinct subsets: the covered and the uncovered. The covered data is the part of the data where the initial model achieves high accuracy and confidence; the uncovered data is the part where our model does not give confident predictions and does not achieve high accuracy.

In the first step, a model is trained on the data. Once we identify a subset of data where the model achieves high accuracy, we deploy that model and let it run on that subset – the covered data.

In the second step, we move our focus to the uncovered data. We try to develop a better model for this data by collecting more data, using more advanced algorithms, feature engineering, and incorporating domain-specific knowledge to find patterns in the data.

At this step, the first thing you should do is look at the errors by eye; you will often spot clear patterns this way before reaching for any fancy tricks.
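
To make the two-step idea concrete, here is a minimal sketch of a prediction router (my own illustration, not code from the original workflow): confident cases are answered by the deployed first-step model, and everything else is delegated to a fallback such as a second model or a human reviewer. The fallback function and the threshold value are hypothetical placeholders.

import numpy as np

def two_step_predict(model, fallback_predict, X, threshold=0.75):
    # Step 1: the deployed model handles the "covered" data, i.e. the samples it is confident about
    probs = model.predict_proba(X)
    confidence = probs.max(axis=1)
    covered = confidence > threshold

    predictions = np.empty(len(X), dtype=object)
    predictions[covered] = model.classes_[probs[covered].argmax(axis=1)]

    # Step 2: the "uncovered" data goes to a fallback (a second model, a human reviewer, ...)
    if (~covered).any():
        predictions[~covered] = fallback_predict(X[~covered])
    return predictions, covered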


An example

This example shows how the concept of an agile workflow can create great value. It is a very simple example, meant to visualize the concept; real-life examples will be far less obvious, but the idea is just as relevant.

Let’s look at this two-dimensional data that I simulated from three equally sized classes.

import numpy as np

num_samples_A = 500
num_samples_B = 500
num_samples_C = 500

# Class A
mean_A = [3, 2]
cov_A = [[0.1, 0], [0, 0.1]]  # Low variance
class_A = np.random.multivariate_normal(mean_A, cov_A, num_samples_A)

# Class B
mean_B = [0, 0]
cov_B = [[1, 0.5], [0.5, 1]]  # Larger variance with some overlap with class C
class_B = np.random.multivariate_normal(mean_B, cov_B, num_samples_B)

# Class C
mean_C = [0, 1]
cov_C = [[2, 0.5], [0.5, 2]]  # Larger variance with some overlap with class B
class_C = np.random.multivariate_normal(mean_C, cov_C, num_samples_C)
Two-dimensional data from three classes

Now let's try to fit a machine learning classifier to this data. It looks like an SVM classifier with a Gaussian ('rbf') kernel might do the trick:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Creating DataFrame
data = np.concatenate([class_A, class_B, class_C])
labels = np.concatenate([np.zeros(num_samples_A), np.ones(num_samples_B), np.ones(num_samples_C) * 2])
df = pd.DataFrame(data, columns=['x', 'y'])
df['label'] = labels.astype(int)

# Splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df[['x', 'y']], df['label'], test_size=0.2, random_state=42)

# Training SVM model with RBF kernel
svm_rbf = SVC(kernel='rbf', probability=True)
svm_rbf.fit(X_train, y_train)

# Predict probabilities for each class
svm_rbf_probs = svm_rbf.predict_proba(X_test)

# Get predicted classes and corresponding confidences
svm_rbf_predictions = [(X_test.iloc[i]['x'], X_test.iloc[i]['y'], true_class, np.argmax(probs), np.max(probs)) for i, (true_class, probs) in enumerate(zip(y_test, svm_rbf_probs))]

svm_predictions_df = pd.DataFrame(svm_rbf_predictions).rename(columns={0:'x',1:'y' ,2: 'true_class', 3: 'predicted_class', 4: 'confidence'})

How does this model perform on our data?

accuracy = (svm_predictions_df['true_class'] == svm_predictions_df['predicted_class']).mean()*100
print(f'Accuracy = {round(accuracy,2)}%')

Accuracy = 75.33%

75% accuracy is disappointing, but does this mean that the model is useless?

Now we want to look at the most confident predictions and see how the model performs on them. How do we define the most confident predictions? We can try different confidence (predict_proba) thresholds, see what coverage and accuracy we get for each threshold, and then decide which threshold meets our business needs.

thresholds = [.5, .55, .6, .65, .7, .75, .8, .85, .9]
results = []

for threshold in thresholds:
    svm_df_covered = svm_predictions_df.loc[svm_predictions_df['confidence'] > threshold]
    coverage = len(svm_df_covered) / len(svm_predictions_df) * 100
    accuracy_covered = (svm_df_covered['true_class'] == svm_df_covered['predicted_class']).mean() * 100

    results.append({'Threshold': threshold, 'Coverage (%)': round(coverage,2), 'Accuracy on covered data (%)': round(accuracy_covered,2)})

results_df = pd.DataFrame(results)
print(results_df)

And we get

Coverage and accuracy by threshold table

Or if we want a more detailed look we can create a plot of the coverage and accuracy by threshold:
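
The plotting code is not shown in the post; here is one minimal way to produce such a plot with matplotlib, using the results_df computed above:

import matplotlib.pyplot as plt

plt.plot(results_df['Threshold'], results_df['Coverage (%)'], marker='o', label='Coverage (%)')
plt.plot(results_df['Threshold'], results_df['Accuracy on covered data (%)'], marker='o', label='Accuracy on covered data (%)')
plt.xlabel('Confidence threshold')
plt.ylabel('%')
plt.legend()
plt.show()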

Accuracy and coverage as a function of the threshold

We can now select the threshold that fits our business logic. For example, if our company's policy is to guarantee at least 90% accuracy, we can choose a threshold of 0.75 and get 90% accuracy on 62% of the data. This is a huge improvement over throwing out the model, especially if we don't have any model in production!

Now that our model is happily working in production on over 60% of the data, we can shift our focus to the rest of the data. We can collect more data, do more feature engineering, try more complex models, or get help from a domain expert.


Balancing act

The two-step model lets us aim for accuracy while acknowledging that it is perfectly fine to start with high accuracy on only a subset of the data. It is counterproductive to insist that a model achieve high accuracy on all the data before deploying it to production.

The agile approach presented in this post aims for efficient resource allocation. Instead of spending computational resources on getting high accuracy across the entire dataset, focus your resources where the marginal gain is highest.


Conclusion

In data science, we strive for high accuracy. However, in the reality of messy data, we need a clever approach that uses our resources in the best way. Agile model production teaches us to focus on the parts of the data where our model works best, deploy the model for those subsets, and only then start working on a new model for the more complicated parts. This strategy will help you make the best use of your resources when facing real data science problems.

Think production, Think Agile.

Estimating Individualized Treatment Rules Using Outcome Weighted Learning
A non-parametric approach for fitting personalized treatments to patients

In many diseases, different patients will react differently to different treatments. A drug that is beneficial for some patients may not work for other patients with different characteristics. Therefore, healthcare can significantly improve by treating patients based on their characteristics, rather than treating all patients with the same treatment.

In this article, I will try to show you how we can train a machine-learning model to learn the optimal personalized treatment.

This article is about personalized healthcare, but the results can be used in any field. For example, different people react differently to different ads on social media; when there are multiple ads for the same product, how do you choose which ad to show to which viewers?

This method is useful whenever you have to assign a treatment but can give only one treatment to each individual in the sample, so you have no way of knowing how that individual would have responded to the other treatments.


Let’s formalize the problem

An experiment was performed to compare two (or more) treatments, which we'll label T = 1, 2, … Every patient is represented by a vector of covariates X. Every patient i with covariate vector Xᵢ who was given treatment Tᵢ has a recorded response to the treatment, Rᵢ.

For example, let’s assume that you want to test 3 different drugs for diabetes, we’ll name these drugs "1", "2", "3".

We have a patient named Esther. She is 64 years old, was diagnosed with diabetes 8 years ago, weighs 65 kilos, and is 1.54 meters tall. Esther received drug "1" and her blood sugar dropped by 10 points after she was given the new drug.

In our example, the data point we have on Esther is X = {Female, 64 years old, 8 years since diagnosis, 65 kg, 1.54 meters}, T = "1", R = 10.

In this setting, we would like to learn an optimal decision rule D(x), that assigns a treatment "1", "2", or "3" to every patient to optimize the outcome for that patient.


The old way of solving this problem was to model the outcome as a function of the data and the treatment, and to denote the predicted outcome by f(X,T). Once we have a model, we can create a decision rule D(x): we compute f(X,1), f(X,2), and f(X,3), and give the patient the drug that maximizes their expected outcome.
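
As a rough sketch of this outcome-modelling approach (my own illustration; the choice of regressor here is arbitrary and not from the original text), we fit a regressor on the covariates plus the treatment label, score every candidate treatment for each patient, and pick the best one:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_outcome_model(X, T, R):
    # Model the outcome R as a function of the covariates X and the treatment T
    XT = np.column_stack([X, T])
    return RandomForestRegressor().fit(XT, R)

def outcome_model_rule(model, X, treatments=(1, 2, 3)):
    # D(x): predict f(X, t) for every candidate treatment t and give each patient the argmax
    scores = np.column_stack(
        [model.predict(np.column_stack([X, np.full(len(X), t)])) for t in treatments]
    )
    return np.array(treatments)[scores.argmax(axis=1)]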

This solution can work when we have a fairly good understanding of the underlying model that generated the data. In that case, all we need is some fine-tuning to find the best parameters for our problem.

However, if the model is bad then our results will be bad, regardless of the amount of data at hand.

Can we come up with a decision rule that is not parametric and does not assume any prior knowledge of the relationship between the data and the treatment result?

The answer is yes, we can use machine learning to find a decision rule that does not make any assumptions about the relationship between the response and the treatment!


Solving with a non-parametric approach using Outcome Weighted Learning

The way to solve this problem is to solve a classification problem where the labels are the treatments given in the experiment and every data point i is weighted by Rᵢ/π(Tᵢ|Xᵢ), where π(Tᵢ|Xᵢ) is the propensity of receiving treatment Tᵢ given covariates Xᵢ, which can be estimated from the data.

This makes sense because we try to imitate the experiment's results, but only where they worked best. We divide by the propensities to correct for the bias caused by unequal treatment-assignment probabilities. If you have studied some reinforcement learning, this whole process should look familiar.
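
For the curious, here is the standard argument from the outcome weighted learning literature, which I am only sketching and which assumes the responses R are nonnegative (they can always be shifted upward if not). The value of a decision rule D is

V(D) = E[ I(T = D(X)) · R / π(T|X) ],

the expected response we would see if everyone were treated according to D. Since E[ I(T = D(X)) · R/π(T|X) ] + E[ I(T ≠ D(X)) · R/π(T|X) ] = E[ R/π(T|X) ] does not depend on D, maximizing V(D) is the same as minimizing the weighted misclassification error E[ I(T ≠ D(X)) · R/π(T|X) ]. That is exactly a classification problem with labels T and sample weights R/π(T|X).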

Here is an example of an OWL classifier using SVM. Feel free to use any classifier you like.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import svm

def owl_classifier(X_train, T, R, kernel, gamma):
    n = len(T)
    pi = np.zeros(n)  # Initialize pi as a vector of zeros
    # probs is an n x len(unique(T)) matrix that gives every patient's probability of receiving each treatment
    probs = LogisticRegression().fit(X_train, T).predict_proba(X_train)
    for idx, t in enumerate(np.unique(T)):
        # Every data point is assigned the probability of getting the treatment that it actually got, given the covariates
        pi += probs[:, idx] * (T == t)
    clf = svm.SVC(kernel=kernel, gamma=gamma)  # Initialize an SVM classifier; the hyperparameters should be tuned by cross-validation
    clf.fit(X_train, T, sample_weight=R / pi)  # Fit the classifier with the treatments as labels and R/pi as sample weights
    return clf

Simulation to test the OWL method

We can test the OWL method on simulated data: we design the reward function so that we know the optimal treatment for every patient, then train the OWL classifier and check how well it recovers that optimal rule.

For example:

I created 50 features that are all sampled from a U([-1,1]) distribution. I gave the patients one of three treatments {1,2,3} at random, uniformly.

The response is sampled from a normal distribution with mean μ = (X₁ + X₂)·I(T=1) + (X₁ − X₂)·I(T=2) + (X₂ − X₁)·I(T=3). (In the code below the three treatments are labelled 0, 1, and 2, and the standard deviation of the response is set to 0.1.)

# This code block creates the data for the simulation
import numpy as np

n_train = 500 # I purposely chose a small training set to simulate a medical trial
n_col = 50 # This is the number of features
n_test = 1000
X_train = np.random.uniform(low = -1, high = 1, size = (n_train, n_col))
T = np.random.randint(3, size = n_train) # Treatments given at random uniformly
R_mean = (X_train[:,0]+X_train[:,1])*(T==0) + (X_train[:,0]-X_train[:,1])*(T==1) + (X_train[:,1]-X_train[:,0])*(T==2)
R = np.random.normal(loc = R_mean, scale = .1) # The standard deviation can be tweaked
X_test = np.random.uniform(low = -1 , high = 1, size = (n_test, n_col))

# The optimal classifier can be deduced from the design of R
optimal_classifier = (1-(X_test[:,0] >0)*(X_test[:,1]>0))*((X_test[:,0] > X_test[:,1]) + 2*(X_test[:,1] > X_test[:,0]))

It is not hard to see that the optimal treatment regime is to give treatment 1 if both X₁ and X₂ are positive. If they are both negative, give treatment 2 if X₂<X₁ and give treatment 3 if X₁<X₂. If X₁ is positive and X₂ is negative, give treatment 2. If X₂ is positive and X₁ is negative, give treatment 3.

Or we can show this with an image. These are the different ranges of the optimal treatment, shown for ranges of X₁, X₂:

Optimal treatment ranges for combinations of X₁, X₂

I sampled 500 data points with 50 features and the reward function that I described above. I fit an OWL classifier with a Gaussian (‘rbf’) kernel and got the following classifications, which I visualized for values of X₁, X₂:

# Code for the plot
import seaborn as sns
import matplotlib.pyplot as plt

kernel = 'rbf'
gamma = 1/X_train.shape[1]
# gamma is a hyperparameter that has to be found by cross validation but this is a good place to start
D = owl_classifier(X_train, T, R, kernel, gamma)
prediction = D.predict(X_test)
sns.scatterplot(x = X_test[:,0], y = X_test[:,1], hue = prediction)
plt.show()
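
As an extra sanity check (this is not in the original post), we can also measure how often the OWL predictions agree with the known optimal rule computed earlier:

agreement = (prediction == optimal_classifier).mean() * 100
print(f'Agreement with the optimal treatment rule = {round(agreement, 2)}%')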

In case you missed what happened here: The data was composed of 2 features that affected the response and 48 features of noise. The model managed to learn the effect of the two important features without us modeling this relationship in any way!

This is just one simple example. I made the reward function depend on X₁ and X₂ so that it is easy to understand and visualize, but feel free to try other examples and different classifiers.


Conclusion

Outcome-weighted learning can be used to learn an optimal treatment in cases where we only see one treatment per patient in the training data, without having to model the response as a function of the features and the treatment.

There is some math, which I left out of this article, that justifies this whole process; I did not just make it up off the top of my head.

Future research on this topic should include:

  1. Exploitation vs. exploration: Even after we learned a treatment rule, it’s still beneficial to sometimes explore options that are considered not optimal according to our model. The model can be wrong.
  2. Sequential treatment: when there is a sequence of treatments, each of which changes the state of the patient, the solution for the whole sequence should be found via dynamic programming.
  3. Design: in this article, I assumed the treatments were assigned according to a pre-specified rule. Perhaps we can find a design that improves the learning process.
