Data Science: From School to Work, Part II

How to write clean Python code

In my previous article, I highlighted the importance of effective project management in Python development. Now, let’s shift our focus to the code itself and explore how to write clean, maintainable code — an essential practice in professional and collaborative environments. 

  • Readability & Maintainability: Well-structured code is easier to read, understand, and modify. Other developers — or even your future self — can quickly grasp the logic without struggling to decipher messy code.
  • Debugging & Troubleshooting: Organized code with clear variable names and structured functions makes it easier to identify and fix bugs efficiently.
  • Scalability & Reusability: Modular, well-organized code can be reused across different projects, allowing for seamless scaling without disrupting existing functionality.

So, as you work on your next Python project, remember: 

Half of good code is Clean Code.


Introduction

Python is one of the most popular and versatile programming languages, appreciated for its simplicity, readability and large community. Whether for web development, data analysis, artificial intelligence or task automation, Python offers powerful and flexible tools suited to a wide range of fields.

However, the efficiency and maintainability of a Python project depend heavily on the practices of its developers. Poor code structure, a lack of conventions, or missing documentation can quickly turn a promising project into a puzzle that is costly to maintain and extend. It is precisely this point that makes the difference between student code and professional code.

This article is intended to present the most important best practices for writing high-quality Python code. By following these recommendations, developers can create scripts and applications that are not only functional, but also readable, performant and easily maintainable by third parties.

Adopting these best practices right from the start of a project not only ensures better collaboration within teams, but also prepares your code to evolve with future needs. Whether you’re a beginner or an experienced developer, this guide is designed to support you in all your Python developments.


The code structuration

Good code structuring in Python is essential. There are two main project layouts: flat layout and src layout.

The flat layout places the source code directly in the project root without an additional folder. This approach simplifies the structure and is well-suited for small scripts, quick prototypes, and projects that do not require complex packaging. However, it may lead to unintended import issues when running tests or scripts.

📂 my_project/
├── 📂 my_project/                  # Directly in the root
│   ├── 🐍 __init__.py
│   ├── 🐍 main.py                   # Main entry point (if needed)
│   ├── 🐍 module1.py             # Example module
│   └── 🐍 utils.py
├── 📂 tests/                            # Unit tests
│   ├── 🐍 test_module1.py
│   ├── 🐍 test_utils.py
│   └── ...
├── 📄 .gitignore                      # Git ignored files
├── 📄 pyproject.toml              # Project configuration (Poetry, setuptools)
├── 📄 uv.lock                         # UV file
├── 📄 README.md               # Main project documentation
├── 📄 LICENSE                     # Project license
├── 📄 Makefile                       # Automates common tasks
├── 📄 Dockerfile                    # To create Docker image
├── 📂 .github/                        # GitHub Actions workflows (CI/CD)
│   ├── 📂 actions/               
│   └── 📂 workflows/

On the other hand, the src layout (src is the contraction of source) organizes the source code inside a dedicated src/ directory, preventing accidental imports from the working directory and ensuring a clear separation between source files and other project components like tests or configuration files. This layout is ideal for large projects, libraries, and production-ready applications as it enforces proper package installation and avoids import conflicts.

📂 my-project/
├── 📂 src/                              # Main source code
│   ├── 📂 my_project/            # Main package
│   │   ├── 🐍 __init__.py        # Makes the folder a package
│   │   ├── 🐍 main.py             # Main entry point (if needed)
│   │   ├── 🐍 module1.py       # Example module
│   │   └── ...
│   │   ├── 📂 utils/                  # Utility functions
│   │   │   ├── 🐍 __init__.py     
│   │   │   ├── 🐍 data_utils.py  # data functions
│   │   │   ├── 🐍 io_utils.py      # Input/output functions
│   │   │   └── ...
├── 📂 tests/                             # Unit tests
│   ├── 🐍 test_module1.py     
│   ├── 🐍 test_module2.py     
│   ├── 🐍 conftest.py              # Pytest configurations
│   └── ...
├── 📂 docs/                            # Documentation
│   ├── 📄 index.md                
│   ├── 📄 architecture.md         
│   ├── 📄 installation.md         
│   └── ...                     
├── 📂 notebooks/                   # Jupyter Notebooks for exploration
│   ├── 📄 exploration.ipynb       
│   └── ...                     
├── 📂 scripts/                         # Standalone scripts (ETL, data processing)
│   ├── 🐍 run_pipeline.py         
│   ├── 🐍 clean_data.py           
│   └── ...                     
├── 📂 data/                            # Raw or processed data (if applicable)
│   ├── 📂 raw/                    
│   ├── 📂 processed/
│   └── ....                                 
├── 📄 .gitignore                      # Git ignored files
├── 📄 pyproject.toml              # Project configuration (Poetry, setuptools)
├── 📄 uv.lock                         # UV file
├── 📄 README.md               # Main project documentation
├── 🐍 setup.py                       # Installation script (if applicable)
├── 📄 LICENSE                     # Project license
├── 📄 Makefile                       # Automates common tasks
├── 📄 Dockerfile                    # To create Docker image
├── 📂 .github/                        # GitHub Actions workflows (CI/CD)
│   ├── 📂 actions/               
│   └── 📂 workflows/

Choosing between these layouts depends on the project’s complexity and long-term goals. For production-quality code, the src/ layout is often recommended, whereas the flat layout works well for simple or short-lived projects.

You can imagine different templates that are better adapted to your use case. It is important that you maintain the modularity of your project. Do not hesitate to create subdirectories and to group together scripts with similar functionalities and separate those with different uses. A good code structure ensures readability, maintainability, scalability and reusability and helps to identify and correct errors efficiently.

Cookiecutter is an open-source tool for generating preconfigured project structures from templates. It is particularly useful for ensuring consistency and organization across projects, especially in Python, by applying good practices from the outset. Both the flat layout and the src layout can be initiated using the uv tool, as shown below.
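For reference, here is a minimal sketch of how each layout can be created from the command line; the exact flags depend on your uv and Cookiecutter versions, so check uv init --help before relying on them:

# Flat, application-style layout (uv's default)
uv init my_project

# src layout, suited to libraries and production packages
uv init --lib my_project

# Generate a project from a Cookiecutter template, run as a uv tool
uvx cookiecutter <template-url-or-path>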


The SOLID principles

SOLID programming is an essential approach to software development based on five basic principles for improving code quality, maintainability and scalability. These principles provide a clear framework for developing robust, flexible systems. By following the SOLID principles, you reduce the risk of complex dependencies, make testing easier and ensure that applications can evolve more easily in the face of change. Whether you are working on a small project or a large-scale application, mastering SOLID is an important step towards adopting object-oriented programming best practices.

S — Single Responsibility Principle (SRP)

The Single Responsibility Principle means that a class or function should handle only one thing, so that it has only one reason to change. This makes the code more maintainable and easier to read. A class or function with multiple responsibilities is difficult to understand and is often a source of errors.

Example:

# Violates SRP
class MLPipeline:
    def __init__(self, df: pd.DataFrame, target_column: str):
        self.df = df
        self.target_column = target_column
        self.scaler = StandardScaler()
        self.model = RandomForestClassifier()
   
    def preprocess_data(self):
        self.df.fillna(self.df.mean(), inplace=True)  # Handle missing values
        X = self.df.drop(columns=[self.target_column])
        y = self.df[self.target_column]
        X_scaled = self.scaler.fit_transform(X)  # Feature scaling
        return X_scaled, y
        
    def train_model(self):
        X, y = self.preprocess_data()  # Data preprocessing inside model training
        self.model.fit(X, y)
        print("Model training complete.")

Here, the MLPipeline class has two responsibilities: preprocessing the data and training the model.

# Follows SRP
class DataPreprocessor:
    def __init__(self):
        self.scaler = StandardScaler()
        
    def preprocess(self, df: pd.DataFrame, target_column: str):
        df = df.copy()
        df.fillna(df.mean(), inplace=True)  # Handle missing values
        X = df.drop(columns=[target_column])
        y = df[target_column]
        X_scaled = self.scaler.fit_transform(X)  # Feature scaling
        return X_scaled, y


class ModelTrainer:
    def __init__(self, model):
        self.model = model
        
    def train(self, X, y):
        self.model.fit(X, y)
        print("Model training complete.")

O — Open/Closed Principle (OCP)

The Open/Closed Principle means that a class or function must be open to extension, but closed to modification. This makes it possible to add functionality without the risk of breaking existing code.

It is not easy to develop with this principle in mind, but a good indicator for the main developer is to see more and more additions (+) and fewer and fewer removals (-) in the merge requests during project development.
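To illustrate, here is a minimal sketch in the spirit of the previous examples (the class names are mine, not from the original article): new behaviour is added by writing a new subclass, while the existing code stays untouched.

# Follows OCP: add new scalers by extension, without modifying existing code
from abc import ABC, abstractmethod


class Scaler(ABC):
    @abstractmethod
    def transform(self, values: list[float]) -> list[float]:
        ...


class MinMaxScaler(Scaler):
    def transform(self, values: list[float]) -> list[float]:
        low, high = min(values), max(values)
        return [(v - low) / (high - low) for v in values]


class ZScoreScaler(Scaler):
    def transform(self, values: list[float]) -> list[float]:
        mean = sum(values) / len(values)
        std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
        return [(v - mean) / std for v in values]


def preprocess(values: list[float], scaler: Scaler) -> list[float]:
    # This function is closed to modification: it works with any new Scaler subclass.
    return scaler.transform(values)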

L — Liskov Substitution Principle (LSP)

The Liskov Substitution Principle states that a subclass must be able to replace its parent class without changing the behavior of the program, ensuring that the subclass meets the expectations defined by the base class. It limits the risk of unexpected errors.

Example :

# Violates LSP
class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


class Square(Rectangle):
    def __init__(self, side):
        super().__init__(side, side)
# Changing the width of a square independently of its height violates the invariant of a square.

To respect the LSP, it is better to avoid this hierarchy and use independent classes:

class Shape:
    def area(self):
        raise NotImplementedError


class Rectangle(Shape):
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side

I — Interface Segregation Principle (ISP)

The Interface Segregation Principle states that it is better to build several small, specific classes than one large class with methods that cannot be used in certain cases. This reduces unnecessary dependencies.

Example:

# Violates ISP
class Animal:
    def fly(self):
        raise NotImplementedError

    def swim(self):
        raise NotImplementedError

It is better to split the class Animal into several classes:

# Follows ISP
class CanFly:
    def fly(self):
        raise NotImplementedError


class CanSwim:
    def swim(self):
        raise NotImplementedError


class Bird(CanFly):
    def fly(self):
        print("Flying")


class Fish(CanSwim):
    def swim(self):
        print("Swimming")

D — Dependency Inversion Principle (DIP)

The Dependency Inversion Principle means that a class should depend on an abstraction rather than on a concrete class. This reduces coupling between classes and makes the code more modular.

Example:

# Violates DIP
class Database:
    def connect(self):
        print("Connecting to database")


class UserService:
    def __init__(self):
        self.db = Database()

    def get_users(self):
        self.db.connect()
        print("Getting users")

Here, the attribute db of UserService depends on the class Database. To respect the DIP, db has to depend on an abstract class.

# Follows DIP
class DatabaseInterface:
    def connect(self):
        raise NotImplementedError


class MySQLDatabase(DatabaseInterface):
    def connect(self):
        print("Connecting to MySQL database")


class UserService:
    def __init__(self, db: DatabaseInterface):
        self.db = db

    def get_users(self):
        self.db.connect()
        print("Getting users")


# We can easily change the used database.
db = MySQLDatabase()
service = UserService(db)
service.get_users()

PEP standards

PEPs (Python Enhancement Proposals) are technical and informative documents that describe new features, language improvements or guidelines for the Python community. Among them, PEP 8, which defines style conventions for Python code, plays a fundamental role in promoting readability and consistency in projects.

Adopting the PEP standards, especially PEP 8, not only ensures that the code is understandable to other developers, but also that it conforms to the standards set by the community. This facilitates collaboration, code reviews and long-term maintenance.

In this article, I present the most important aspects of the PEP standards, including:

  • Style Conventions (PEP 8): Indentations, variable names and import organization.
  • Best practices for documenting code (PEP 257).
  • Recommendations for writing typed, maintainable code (PEP 484 and PEP 563).

Understanding and applying these standards is essential to take full advantage of the Python ecosystem and contribute to professional quality projects.


PEP 8

PEP 8 defines coding conventions that standardize Python code, and a lot of documentation about it already exists. I will not cover every recommendation in this post, only the ones I consider essential when reviewing code.

Naming conventions

Variable, function and module names should be written in lower case, with underscores separating words. This typographical convention is called snake_case.


my_variable
my_new_function()
my_module

Constants are written in capital letters and defined at the beginning of the script (after the imports):


LIGHT_SPEED
MY_CONSTANT

Finally, class and exception names use the CamelCase format (a capital letter at the beginning of each word). Exception names must end with Error.


MyGreatClass
MyGreatError

Remember to give your variables names that make sense! Don’t use variable names like v1, v2, func1, i, toto…

Single-character variable names are permitted for loops and indexes:

my_list = [1, 3, 5, 7, 9, 11]
for i in range(len(my_list)):
    print(my_list[i])

A more “pythonic” way of writing, to be preferred to the previous example, gets rid of the i index:

my_list = [1, 3, 5, 7, 9, 11]
for element in my_list:
    print(element)

Spaces management

It is recommended to surround operators (+, -, *, /, //, %, ==, !=, >, not, in, and, or, …) with a space before AND after:

# recommended code:
my_variable = 3 + 7
my_text = "mouse"
my_text == my_variable

# not recommended code:
my_variable=3+7
my_text="mouse"
my_text== my_variable

Do not add more than one space around an operator. On the other hand, there are no spaces just inside square brackets, braces or parentheses:

# recommended code:
my_list[1]
my_dict{"key"}
my_function(argument)

# not recommended code:
my_list[ 1 ]
my_dict[ "key" ]
my_function( argument )

A space is recommended after the characters “:” and “,”, but not before:

# recommended code:
my_list = [1, 2, 3]
my_dict = {"key1": "value1", "key2": "value2"}
my_function(argument1, argument2)

# not recommended code:
my_list = [1 , 2 , 3]
my_dict = {"key1":"value1", "key2":"value2"}
my_function(argument1 , argument2)

However, when slicing lists, we do not put spaces around the ":":

my_list = [1, 3, 5, 7, 9, 1]

# recommended code:
my_list[1:3]
my_list[1:4:2]
my_list[::2]

# not recommended code:
my_list[1 : 3]
my_list[1: 4:2 ]
my_list[ : :2]

Line length

For the sake of readability, PEP 8 recommends keeping lines of code no longer than 79 characters. However, this rule can be broken in certain circumstances; for example, on a Dash project it may be complicated to respect this recommendation.

The \ character can be used to cut lines that are too long.

For example:

my_variable = 3
if my_variable > 1 and my_variable < 10 \
    and my_variable % 2 == 1 and my_variable % 3 == 0:
    print(f"My variable is equal to {my_variable }")

Within parentheses, you can break the line without using the \ character. This can be useful for listing the arguments of a function or method when defining or calling it:

def my_function(argument_1, argument_2,
                argument_3, argument_4):
    return argument_1 + argument_2

It is also possible to create multi-line lists or dictionaries by skipping a line after a comma:

my_list = [1, 2, 3,
          4, 5, 6,
          7, 8, 9]
my_dict = {"key1": 13,
          "key2": 42,
          "key2": -10}

Blank lines

In a script, blank lines are useful for visually separating different parts of the code. It is recommended to leave two blank lines before the definition of a function or class, and to leave a single blank line before the definition of a method (in a class). You can also leave a blank line in the body of a function to separate the logical sections of the function, but this should be used sparingly.
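A small illustrative example (the names are arbitrary):

def top_level_function():
    return 42


class MyGreatClass:
    """Example class illustrating blank line conventions."""

    def first_method(self):
        return "first"

    def second_method(self):
        # A single blank line separates the two methods above.
        return "second"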

Comments

Comments always begin with the # symbol followed by a space. They give clear explanations of the purpose of the code and must be kept in sync with it: if the code is modified, the comments must be updated too (if applicable). They are placed at the same indentation level as the code they describe. Comments are complete sentences, starting with a capital letter (unless the first word is a variable, which is written without a capital letter) and ending with a period. I strongly recommend writing comments in English, and it is important to be consistent between the language used for comments and the language used to name variables. Finally, inline comments that follow the code on the same line should be avoided wherever possible; when used, they should be separated from the code by at least two spaces.

Tool to help you

Ruff is a linter (code analysis tool) and formatter for Python code, written in Rust. It combines the advantages of the flake8 linter with the black and isort formatters, while being faster.

Ruff has an extension on the VS Code editor.

To check your code, you can type:

ruff check my_module.py

To automatically apply the fixes Ruff knows how to make, add the --fix flag (ruff check --fix my_module.py). To format the code, type:

ruff format my_module.py

PEP 20

PEP 20: The Zen of Python is a set of 19 aphorisms written in poetic form. They describe a philosophy of coding more than actual guidelines.

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one– and preferably only one –obvious way to do it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea — let’s do more of those!

PEP 257

The aim of PEP 257 is to standardize the use of docstrings.

What is a docstring?

A docstring is a string that appears as the first statement after the definition of a function, class or method. A docstring becomes the value of the __doc__ special attribute of that object.

def my_function():
    """This is a doctring."""
    pass

And we have:

>>> my_function.__doc__
'This is a docstring.'

We always write a docstring between triple double quotes """.

Docstring on a line

Used for simple functions or methods, it must fit on a single line, with no blank line before or after it. The closing quotes are on the same line as the opening quotes.

def add(a, b):
    """Return the sum of a and b."""
    return a + b

A single-line docstring MUST NOT simply restate the function/method signature. Do not do:

def my_function(a, b):
    """ my_function(a, b) -> list"""

Docstring on several lines

The first line should be a summary of the object being documented. It is followed by an empty line, then by more detailed explanations or clarifications of the arguments.

def divide(a, b):
    """Divide a by b.

    Returns the result of the division. Raises a ValueError if b equals 0.
    """
    if b == 0:
        raise ValueError("Only Chuck Norris can divide by 0")
    return a / b

Complete Docstring

A complete docstring is made up of several parts (in this case, based on the numpydoc standard).

  1. Short description: Summarizes the main functionality.
  2. Parameters: Describes the arguments with their type, name and role.
  3. Returns: Specifies the type and role of the returned value.
  4. Raises: Documents exceptions raised by the function.
  5. Notes (optional): Provides additional explanations.
  6. Examples (optional): Contains illustrated usage examples with expected results or exceptions.

def calculate_mean(numbers: list[float]) -> float:
    """
    Calculate the mean of a list of numbers.

    Parameters
    ----------
    numbers : list of float
        A list of numerical values for which the mean is to be calculated.

    Returns
    -------
    float
        The mean of the input numbers.

    Raises
    ------
    ValueError
        If the input list is empty.

    Notes
    -----
    The mean is calculated as the sum of all elements divided by the number of elements.

    Examples
    --------
    Calculate the mean of a list of numbers:
    >>> calculate_mean([1.0, 2.0, 3.0, 4.0])
    2.5"""

Tool to help you

VS Code's autoDocstring extension lets you automatically create a docstring template.

PEP 484

In some programming languages, typing is mandatory when declaring a variable. In Python, typing is optional, but strongly recommended. PEP 484 introduces a typing system for Python, annotating the types of variables, function arguments and return values. This PEP provides a basis for improving code readability, facilitating static analysis and reducing errors.

What is typing?

Typing consists of explicitly declaring the type (float, string, etc.) of a variable. The typing module provides standard tools for defining generic types, such as Sequence, List, Union, Any, etc.

To type a function, we use ":" for the arguments and "->" for the return type.

Here is a list of untyped functions:

def show_message(message):
    print(f"Message : {message}")

def addition(a, b):
    return a + b

def is_even(n):
    return n % 2 == 0

def list_square(numbers):
    return [x**2 for x in numbers]

def reverse_dictionary(d):
    return {v: k for k, v in d.items()}

def add_element(ensemble, element):
    ensemble.add(element)
    return ensemble

Now here’s how they should look:

from typing import List, Tuple, Dict, Set, Any

def show_message(message: str) -> None:
    print(f"Message : {message}")

def addition(a: int, b: int) -> int:
    return a + b

def is_even(n: int) -> bool:
    return n % 2 == 0

def list_square(numbers: List[int]) -> List[int]:
    return [x**2 for x in numbers]

def reverse_dictionary(d: Dict[str, int]) -> Dict[int, str]:
    return {v: k for k, v in d.items()}

def add_element(ensemble: Set[int], element: int) -> Set[int]:
    ensemble.add(element)
    return ensemble

Tool to help you

The mypy type checker (also available as a VS Code extension) automatically checks whether the use of a variable corresponds to the declared type. For example, for the following function:

def my_function(x: float) -> float:
    return x.mean()

The editor will point out that a float has no “mean” attribute.

Image from author

The benefit is twofold: you’ll know whether the declared type is the right one and whether the use of this variable corresponds to its type.

In the above example, x must be of a type that has a mean() method (e.g. np.ndarray).
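For instance, a version that satisfies the type checker could declare a NumPy array explicitly; this is only one possible choice of type:

import numpy as np


def my_function(x: np.ndarray) -> float:
    return float(x.mean())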


Conclusion

In this article, we have looked at the most important principles for creating clean Python production code. A solid architecture, adherence to SOLID principles, and compliance with PEP recommendations (at least the four discussed here) are essential for ensuring code quality. The desire for beautiful code is not (just) coquetry. It standardizes development practices and makes teamwork and maintenance much easier. There’s nothing more frustrating than spending hours (or even days) reverse-engineering a program, deciphering poorly written code before you’re finally able to fix the bugs. By applying these best practices, you ensure that your code remains clear, scalable, and easy for any developer to work with in the future.


References

1. src layout vs flat layout

2. SOLID principles

3. Python Enhancement Proposals index

Data Scientist: From School to Work, Part I

Nowadays, data science projects do not end with the proof of concept; every project has the goal of being used in production. It is important, therefore, to deliver high-quality code. I have been working as a data scientist for more than ten years, and I have noticed that juniors are usually weak in software development, which is understandable: to be a data scientist you need to master math, statistics, algorithmics and development, and have some knowledge of operations. In this series of articles, I would like to share some tips and good practices for managing a professional data science project in Python. From Python to Docker, with a detour via Git, I will present the tools I use every day.


The other day, a colleague told me how he had to reinstall Linux because of an incorrect manipulation with Python. He had restored an old project that he wanted to customize. As a result of installing and uninstalling packages and changing versions, his Linux-based Python environment was no longer functional: an incident that could easily have been avoided by setting up a virtual environment. But it shows how important it is to manage these environments. Fortunately, there is now an excellent tool for this: uv.
The origin of these two letters is not clear. According to Zanie Blue (one of the creators):

“We considered a ton of names — it’s really hard to pick a name without collisions this day so every name was a balance of tradeoffs. uv was given to us on PyPI, is Astral-themed (i.e. ultraviolet or universal), and is short and easy to type.”

Now, let’s go into a little more detail about this wonderful tool.


Introduction

UV is a modern, minimalist manager for Python projects and packages. Developed entirely in Rust, it has been designed to simplify dependency management, virtual environment creation and project organization. UV limits common Python project problems such as dependency conflicts and environment management issues, and it aims to offer a smoother, more intuitive experience than traditional tools such as the pip + virtualenv combo or the Conda manager. It is claimed to be 10 to 100 times faster than traditional package managers.

Whether for small personal projects or developing Python applications for production, UV is a robust and efficient solution for package management. 


Starting with UV

Installation

To install UV on Windows, I recommend using this command in a shell:

winget install --id=astral-sh.uv  -e

And if you are on Mac or Linux, use the standalone installer documented by Astral (check the official documentation if it has changed):

curl -LsSf https://astral.sh/uv/install.sh | sh

To verify correct installation, simply type the following command into a terminal:

uv --version

Creation of a new Python project

Using UV you can create a new project by specifying the version of Python. To start a new project, simply type into a terminal:

uv init --python x.xx project_name

where x.xx must be replaced by the desired version (e.g. 3.12). If you do not have the specified Python version installed, UV will take care of it and download the correct version to start the project.

This command creates and automatically initializes a Git repository named project_name. It contains several files:

  • A .gitignore file. It lists the elements of the repository to be ignored by Git versioning (it is basic and should be rewritten for a project ready to deploy).
  • A .python-version file. It indicates the Python version used in the project.
  • The README.md file. Its purpose is to describe the project and explain how to use it.
  • A hello.py file.
  • The pyproject.toml file. This file contains all the information about the tools used to build the project.
  • The uv.lock file. It is used to create the virtual environment when you use uv to run the script (it can be compared to the requirements.txt file).

Package installation

To install new packages in this new environment, you have to use:

uv add package_name

When the add command is used for the first time, UV creates a new virtual environment in the current working directory and installs the specified dependencies. A .venv/ directory appears. On subsequent runs, UV will use the existing virtual environment and install or update only the new packages requested. In addition, UV has a powerful dependency resolver. When executing the add command, UV analyzes the entire dependency graph to find a compatible set of package versions that meet all requirements (package version and Python version). Finally, UV updates the pyproject.toml and uv.lock files after each add command.

To uninstall a package, type the command:

uv remove package_name

It is very important to clean unused packages from your environment and to keep the dependency file as minimal as possible: if a package is not used or is no longer used, it must be deleted.

Run a Python script

Now, your repository is initialized, your packages are installed and your code is ready to be tested. You can activate the created virtual environment as usual, but it is more efficient to use the uv run command:

uv run hello.py

Using the run command guarantees that the script will be executed in the virtual environment of the project.


Manage the Python versions

It is common to have to juggle different Python versions. As mentioned in the introduction, you may be working on an old project that requires an old Python version, and it is often too difficult to update it. To list the Python versions that are installed or available through UV, type:

uv python list

At any time, it is possible to change the Python version of your project. To do that, you have to modify the line requires-python in the pyproject.toml file.

For instance: requires-python = ">=3.9"

Then you have to synchronize your environment using the command:

uv sync

The command first checks existing Python installations. If the requested version is not found, UV downloads and installs it. UV also creates a new virtual environment in the project directory, replacing the old one.

But the new environment does not have the required package. Thus, after a sync command, you have to type:

uv pip install -e .

Switch from virtualenv to uv

If you have a Python project set up with pip and virtualenv and you wish to switch to UV, nothing could be simpler. If there is no requirements file, activate your virtual environment and export the installed packages with their versions:

pip freeze > requirements.txt

Then, you have to init the project with UV and install the dependencies:

uv init .
uv pip install -r requirements.txt
Correspondence table between pip + virtualenv and UV, image by author.

Use the tools

UV also offers the possibility of using tools via the uv tool command. Tools are Python packages that provide command-line interfaces, such as ruff, pytest, mypy, etc. To install a tool, type the command line:

uv tool install tool_name

But, a tool can be used without having been installed:

uv tool run tool_name

For convenience, an alias was created: uvx, which is equivalent to uv tool run. So, to run a tool, just type:

uvx tool_name

Conclusion

UV is a powerful and efficient Python package manager designed to provide fast dependency resolution and installation. It significantly outperforms traditional tools like pip or conda, making it an excellent choice to manage your Python projects.

Whether you’re working on small scripts or large projects, I recommend you get into the habit of using UV. And believe me, trying it out means adopting it.


References

1 — UV documentation: https://docs.astral.sh/uv/

2 — UV GitHub repository: https://github.com/astral-sh/uv

3 — A great datacamp article: https://www.datacamp.com/tutorial/python-uv

Do You Know the Logical Analysis of Data Methodology (LAD)?

Let's take a quick look at it

Image from geralt from pixabay.

I recently discovered an area of data analysis that began in 1986 with the work of Peter L. Hammer called Logical Analysis of Data (LAD). When I asked around my circle, no one had heard of it. So I decided to write a short introduction about this original methodology of data analysis. This article is based on the Chapter 3 of [1] and the article [2].

LAD is a binary, interpretable classification method based on a mixture of optimization, Boolean functions and combinatorial theory. But rest assured that no prerequisites are needed to understand the basics of this theory. It is a competitive classification method for analysing datasets consisting of binary observations, and it offers a clear explanation through its concept of patterns.

LAD extracts a large collection of patterns from the datasets, some of which are characteristic of the observations with a positive classification, while the others are characteristic of the observations with a negative classification. The collection is then filtered to extract a smaller, non-redundant collection of patterns. This reduction allows providing comprehensible explanations for each classification.


Introduction

To understand what the LAD is, let me give a common example. A doctor wants to find out which foods cause headaches in his patient. For this purpose, his patient records his diet for one week, which is shown in the following table.

Diet record. Table from author.

A quick analysis leads the doctor to two conclusions. First, he notes that the patient never ate food n°2 without food n°1 on the days when he did not have a headache, but did eat it on some days when he did have a headache. And he notes the same pattern with food n°4 without food n°6. Hence, he concludes that his patient’s headache can be explained by using these two patterns.

Without knowing it, the doctor performed a LAD on the diet records dataset. In fact, during his analysis, he had to answer the three following questions: (1) How to extract a short list of features (i.e., food items) sufficient to classify the data (i.e., explain the occurrence of headaches)? (2) How to detect patterns (i.e., combinations of food items) causing headaches? (3) How to build theories (i.e., collection of patterns) explaining every observation? These questions summarize the LAD methodology.


Notations and definitions.

Logical Analysis of Data is based on the notion of partially defined Boolean functions (pdBf) and on the concept of patterns.

We set B = {0, 1}. The set Bⁿ, usually named the Boolean hypercube of dimension n, is composed of all possible binary sequences of length n. We define a partial order on Bⁿ as follows: a =(a₁, …, aₙ) ≤ b=(b₁, …, bₙ) if and only if aᵢ ≤ bᵢ for each i=1, …, n.

A subset S ⊆ Bⁿ is called a subcube of Bⁿ if and only if |S| = 2ᵏ for some k ≤ n and there are n-k components in which all sequences of S coincide. For example, S = {(0,0,0), (0,0,1), (0,1,0), (0,1,1)} is a subcube of B³ with k = 2.

What is a partially defined Boolean function?

A Boolean function, f, is a mapping from Bⁿ to B. There are 2^(2ⁿ) possible Boolean functions. For a Boolean function f, we set T = T(f) = {a ∈ Bⁿ such that f(a)=1} and F = F(f) = {a ∈ Bⁿ such that f(a)=0} as the sets of the true points and the false points of f, respectively. A partially defined Boolean function (pdBf) is a Boolean function such that T and F are disjoint, but some elements of Bⁿ belong neither to T nor to F. We can rewrite the table of the diet record as a pdBf as follows:

A partially defined Boolean function. Table from author.

In this table, points 1, 5 and 7 belong to T (true points) and points 2, 3, 4 and 6 belong to F (false points). But the sequence (1,1,1,1,1,1,1,1) belongs neither to T nor to F. It could be interesting to be able to predict it.

What is a pattern?

To define patterns, I must first introduce the notion of term. We define the complement of x ∈ B as x⁻ = 1-x. A term is a product of elements of B and their complements, and the degree of a term is the number of elements in it, called literals. For example, let t be a term of degree 3 such that t = x⁻₁x₂x₃; then t(0,0,1) = (1-0)×0×1 = 0. It is important to notice that t(a) is defined for any a ∈ Bⁿ even if the degree of t is less than n: just ignore the values which are not in the term. If t = x₂x⁻₃ then t(0,0,1) = 0×(1-1) = 0. If t(a) = 1, we say that t covers the point a.

Patterns are the central elements of LAD. A term t is called a positive (resp. negative) pattern of a pdBf f if and only if: (1) t(X) = 0 for all X ∈ F(f) (resp. X ∈ T(f)), and (2) t(X) = 1 for at least one X ∈ T(f) (resp. X ∈ F(f)). For instance, the term x⁻₁x₂ equals 1 if and only if x₁ = 0 and x₂ = 1. So, in the previous example, this term only covers points 1 and 7, which are in T(f); it is therefore a positive pattern of f. One can notice that x₄x⁻₆ is also a positive pattern of f. These two patterns are the ones found in the doctor's analysis. Conversely, terms covering only points of F(f) are called negative patterns of f.

To compare patterns, we need to introduce some reasonable criteria of suitability: simplicity, selectivity and evidence. These criteria define a simple partial preorder on patterns, called a preference. Let P be a pattern and f a pdBf; then we have:

Simplicity: The simplicity of P is evaluated by considering the set of literals of P. It is the σ preference, and we define P₂ ≤_σ P₁ if and only if the set of literals of P₁ is a subset of that of P₂.

Selectivity: The selectivity of P is evaluated by considering the subcube of P (i.e., the subset of points of Bⁿ covered by P). It is the Σ preference, and we define P₂ ≤_Σ P₁ if and only if the subcube of P₁ is a subset of the subcube of P₂.

Evidence: The evidence of P is evaluated by considering the set of true points of T(f) covered by P. It is the ϵ preference, and we define P₂ ≤_ϵ P₁ if and only if the set of true points covered by P₁ is a subset of those covered by P₂.

We can also combine preferences using intersection ∩ and lexicographic refinement |. Let λ and γ be two preferences, then we have:

  • P₁ ≤_(λ ∩ γ) P₂ if and only if P₁ ≤_λ P₂ and P₁ ≤_γ P₂.
  • P₁ ≤_(λ | γ) P₂ if and only if either P₁ <_λ P₂, or P₁ ≈_λ P₂ and P₁ ≤_γ P₂.

Let λ be a preference. A pattern P is called a Pareto-optimal pattern if and only if there is no pattern P' ≠ P such that P ≤_λ P'. Unfortunately, the concept of optimal pattern does not have a unique definition. The table below summarizes the properties of patterns that are Pareto-optimal with respect to the preferences and combinations of preferences.

Types of Pareto-optimality. Table from author.

In the headache example, the term x₂x₅x₈ is a positive pattern of f (it covers points 5 and 7). But it is not a Minterm. Indeed, x₂x₅ is also a positive pattern and the subcube of x₂x₅x₈ is included in that of x₂x₅. Here, x₂x₅ is a Minterm positive pattern.

Considering the positive pattern x₂x₅x⁻₆x⁻₇x₈, we can say that there is no i ∈ {1,3,4} such that x₂x₅x⁻₆x⁻₇x₈xᵢ is a pattern. Thus, x₂x₅x⁻₆x⁻₇x₈ is a Prime positive pattern.

Moreover, x₂x₅ and x₂x₅x⁻₆x⁻₇x₈ are Strong positive patterns, because no other pattern covers a strictly larger set of true points of the example. Hence, x₂x₅ is a Spanned positive pattern (Minterm and Strong) and x₂x₅x⁻₆x⁻₇x₈ is a Strong Prime positive pattern.


Methodology of LAD

Now we are equipped to go through the LAD methodology. The main objective of LAD is to find an extension of a pdBf (as defined previously). But there exist many ways to extend this function, and the difficulty is to find a "good one", called a theory. To do so, LAD proceeds in three main steps: (1) turning the dataset into a pdBf, which is called the binarization process; (2) detecting suitable patterns, as the doctor did in the introduction; (3) forming a theory, i.e. extracting the best extension of the pdBf given all the detected patterns.

Binarization.

The first step is the binarization of the data. Many real data sets contain numeric and nominal data. An easy way to binarize nominal data is one-hot encoding, which is fast and converts the data as expected. The binarization of numeric data is a bit more complicated. A common method is to choose critical values, or cutpoints, and use indicator features. Either the nature of the feature suggests the choice of cutpoints (such as BMI in medicine), or the choice is more complicated and still open to improvement. Unfortunately, the binarization process increases the dimension of the dataset by adding many features. Hence, it is important to reduce the dimension with a feature selection step.
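As an illustration, here is a minimal sketch of this binarization step with pandas; the function name, the column arguments and the number of cutpoints are arbitrary choices of mine, not part of the original LAD papers.

import pandas as pd


def binarize(df: pd.DataFrame, numeric_cols: list[str], nominal_cols: list[str],
             n_cutpoints: int = 3) -> pd.DataFrame:
    """Turn a dataset with numeric and nominal features into binary features."""
    # One-hot encoding of the nominal features
    binary = pd.get_dummies(df[nominal_cols], dtype=int)
    for col in numeric_cols:
        # Use empirical quantiles as cutpoints and create one indicator per cutpoint
        quantiles = [k / (n_cutpoints + 1) for k in range(1, n_cutpoints + 1)]
        for cutpoint in df[col].quantile(quantiles):
            binary[f"{col}>{cutpoint:.3g}"] = (df[col] > cutpoint).astype(int)
    return binary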

Detecting patterns.

As you can imagine, this part of the process is still an open area of research, and the choice of algorithm to generate patterns depends on what kind of patterns you want to find. Here, I will present an algorithm, introduced in [4], to extract positive prime patterns. For the negative patterns the algorithm is similar.

The idea behind the original pseudocode is simple: test terms from degree 1 to degree D and keep only those that are positive patterns. Negative patterns are not considered, and terms that cover both positive and negative points are kept as candidates for the next degree.
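Since the original pseudocode is not reproduced here, the following sketch gives the general idea in Python; the encoding of terms as dictionaries and the function name are my own choices, and the bookkeeping is simplified compared with the algorithm of [4].

import numpy as np
from itertools import product


def positive_patterns(X: np.ndarray, y: np.ndarray, max_degree: int) -> list[dict]:
    """Enumerate terms of increasing degree and keep the positive patterns.

    A term is encoded as a dict {feature index: required binary value}.
    X is a binary matrix (observations x features) and y a binary target.
    """
    patterns = []
    candidates = [dict()]  # the empty term of degree 0 covers every observation
    for _ in range(max_degree):
        new_candidates = []
        for term in candidates:
            start = max(term) + 1 if term else 0  # extend with higher-indexed features only
            for j, value in product(range(start, X.shape[1]), (0, 1)):
                extended = {**term, j: value}
                # Skip terms that already contain a previously found pattern (not prime)
                if any(all(extended.get(k) == v for k, v in p.items()) for p in patterns):
                    continue
                covered = np.all([X[:, k] == v for k, v in extended.items()], axis=0)
                covers_positive = covered[y == 1].any()
                covers_negative = covered[y == 0].any()
                if covers_positive and not covers_negative:
                    patterns.append(extended)          # positive pattern: stop extending it
                elif covers_positive and covers_negative:
                    new_candidates.append(extended)    # mixed coverage: candidate for next degree
        candidates = new_candidates
    return patterns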

There are several algorithms for generating different types of suitable patterns. I refer you to [2] and [3] if you want a survey of them.

Forming theory.

At this stage, we have a lot of patterns, certainly too many to create an interpretable model, and interpretability is precisely one of the advantages of LAD. So, we need to select a set of suitable patterns from the set of generated patterns. The selected set should have a reasonable number of elements, but also enough elements to make a good distinction between the positive and negative observations. This selection process can be expressed as an optimization problem.

For each positive pattern, pₖ, we assign a binary variable yₖ such that yₖ=1 if and only if pₖ is included in the final model. And for each positive observation tᵢ we define a binary vector (tᵢ¹, tᵢ², …, tᵢᵖ), where p is the number of positive patterns and tᵢᵏ=1 if and only if pₖ covers tᵢ.
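The corresponding program was shown as an image in the original post; a plausible reconstruction consistent with the description above (and with the variations discussed next) is the following set-covering problem:

\[
\min_{y \in \{0,1\}^{p}} \; \sum_{k=1}^{p} y_k
\quad \text{subject to} \quad
\sum_{k=1}^{p} t_i^{k}\, y_k \;\ge\; 1 \quad \text{for every positive observation } t_i .
\]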

This optimization problem can easily be adapted to your needs at the price of small variations. The value 1 on the right-hand side of the constraints can be increased to ensure that at least two (or more) patterns support each decision. We can also add a weight wₖ to each pattern in the sum ∑ wₖyₖ. For example, these weights can be indexed on the degree of each pattern to promote patterns with few literals, which are more interpretable for humans.


Conclusion

Logical analysis of data is a very interesting classification method that is still an active research area. LAD is based on the concept of patterns, which are a tool to classify new observations and provide an understandable explanation for each classification.

LAD has numerous applications in medicine, such as breast cancer prognosis, ovarian cancer detection, analysis of survival data and others (see [2] for an overview). And I think it can easily be applied in many other fields as well.

Unfortunately, LAD seems to suffer from a lack of visibility, and I hope this short article will spark your interest. It is an elegant way to generate an interpretable classification model, different from the usual tree-based and rule-based algorithms. An article with Python applications will follow.


References

[1] I. Chikalov, V. Lozin, I. Lozina, M. Moshkov, H.S. Nguyen, A. Skowron, and B. Zielosko. Three approaches to data analysis: Test theory, rough sets and logical analysis of data (2012) Vol. 41. Springer Science & Business Media.

[2] G. Alexe, S. Alexe, T.O. Bonates, A. Kogan. Logical analysis of data – the vision of Peter L. Hammer. (2007) Annals of Mathematics and Artificial Intelligence 49, no. 1: 265–312.

[3] P.L. Hammer, A. Kogan, B. Simeone, and S. Szedmák. Pareto-optimal patterns in logical analysis of data. (2004) Discrete Applied Mathematics 144, no. 1–2: 79–102.

[4] E. Mayoraz. C++ tools for Logical Analysis of Data. (1995) RUTCOR Research Report: 1–95.


About Us

Advestis is a European Contract Research Organization (CRO) with a deep understanding and practice of statistics, and interpretable machine learning techniques. The expertise of Advestis covers the modeling of complex systems and predictive analysis for temporal phenomena.

LinkedIn: https://www.linkedin.com/company/advestis/

Four interpretable algorithms that you should use in 2022

The new year has begun, and it is time for good resolutions. One of them could be to make decision-making processes more interpretable. To help you do this, I present four interpretable rule-based algorithms. These four algorithms share the use of ensembles of decision trees as rule generators (like Random Forest, AdaBoost, Gradient Boosting, etc.). In other words, each of these interpretable algorithms starts by fitting a black-box model and then generates an interpretable rule ensemble model from it.

Even though they are all claimed to be interpretable, they were developed with different ideas of interpretability in mind. As you may know, this concept is difficult to pose mathematically. Therefore, each set of authors designed their interpretable algorithm with their own definition of interpretability.

To avoid going into detail, I assume here that the data are composed of d quantitative features and that the rules are binary variables taking the value 1 if the observation lies within a hyperrectangle of the feature space ℝᵈ. In other words, a rule rₙ is defined as the indicator function of such a hyperrectangle.
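The defining formula itself was displayed as an image in the original post; a plausible reconstruction from the description below is:

\[
r_n(x) \;=\; \prod_{j=1}^{d} \mathbf{1}\{\, x[j] \in I_{j,n} \,\}
\]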

where x[j] is the value of the observation x for the j-th feature and Iⱼ,ₙ is an interval of ℝ. The rule and the interval are indexed by n, which is the number of observations in the sample Dₙ. The length of a rule is the number of intervals Iⱼ,ₙ ≠ ℝ in its definition; it may be different from d. For instance, the condition "If feature_1 in [1, 3] AND feature_3 in [0, 2]" defines a rule of length 2 that equals 1 in the hyperrectangle [1,3] × ℝ × [0,2] × ℝᵈ⁻³, and it can be written in the same product form as above.


1. RuleFit (2008).

The first algorithm is the most popular and also the oldest. RuleFit was introduced in [1]. It generates a sparse linear model that contains selected interaction effects in the form of decision rules.

Motivation: The idea of this algorithm came from a double observation: the main disadvantage of rule-based models is the difficulty of capturing linear dependencies; on the other hand, linear models do not capture interactions among features. Therefore, the idea of RuleFit is to combine these two types of algorithms by creating a sparse linear model and adding interaction effects in the form of decision rules.

"With ensemble learning, there is no requirement that the basis functions must be generated from the same parametric base learner."

The algorithm consists of two steps. The first is rule generation and the second is rule selection.

Rule generation: The authors have proposed to use ensemble tree algorithms such as Random Forests, AdaBoost and Gradient Boosting. Then, all nodes (interior and terminal) of each tree are extracted as a rule in the form described above.

Rule selection: These extracted rules, along with the original input features, are then fed into an L1-regularised linear model, also called a Lasso, which estimates the impact of each rule and each variable on the output target while shrinking many of these impacts to zero. In order to give each linear term the same a priori influence as a typical rule, the authors suggest normalizing the original input features.
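To make the two steps concrete, here is a minimal sketch with scikit-learn; it is not the reference implementation (see the packages below), and the helper names, the choice of ensemble, the random data and the Lasso penalty are arbitrary.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso


def extract_rules(tree):
    """Extract every node of a fitted decision tree as a rule (list of conditions)."""
    rules = []

    def recurse(node, conditions):
        if conditions:
            rules.append(list(conditions))
        if tree.children_left[node] != -1:  # internal node
            feature, threshold = tree.feature[node], tree.threshold[node]
            recurse(tree.children_left[node], conditions + [(feature, "<=", threshold)])
            recurse(tree.children_right[node], conditions + [(feature, ">", threshold)])

    recurse(0, [])
    return rules


def rule_features(X, rules):
    """Binary matrix whose column k is 1 when an observation satisfies rule k."""
    Z = np.ones((X.shape[0], len(rules)))
    for k, rule in enumerate(rules):
        for feature, op, threshold in rule:
            Z[:, k] *= (X[:, feature] <= threshold) if op == "<=" else (X[:, feature] > threshold)
    return Z


# Step 1: rule generation with a tree ensemble
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)
ensemble = GradientBoostingRegressor(n_estimators=20, max_depth=3).fit(X, y)
rules = [r for est in ensemble.estimators_.ravel() for r in extract_rules(est.tree_)]

# Step 2: sparse selection with the Lasso, on the original features plus the rule features
# (the paper also suggests normalizing the original features, omitted here for brevity)
Z = np.hstack([X, rule_features(X, rules)])
lasso = Lasso(alpha=0.01).fit(Z, y)
selected = [rules[k] for k, w in enumerate(lasso.coef_[X.shape[1]:]) if w != 0]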

Interpretability: The Lasso penalty is known to be selective. So, using this regularization, the authors expected many coefficients to be exactly zero, and thus to generate an interpretable model.

"For purposes of interpretation, it is desirable that the ensemble be comprised of "simple" rules each defined by a small number of variables."

Moreover, the authors proposed many formulas to calculate the importance of any predictor in a RuleFit model and to study the interaction effects. All these metrics are here to help the statistician understand the generated model.

Remarks: In practice, RuleFit is hard to interpret. Since the prediction implied by an activated rule comes from the Lasso fit, a rule can cover positive examples but still have a negative prediction. Moreover, the generated model sometimes has too many rules with non-zero weights, hence becoming uninterpretable for humans.

Packages: RuleFit is implemented in Python in the following GitHub projects: imodels and rulefit. RuleFit has also several R packages.


2. Node harvest (2010).

This algorithm has been presented in [2]. The algorithm considers all nodes and leaves of a Random Forest as rules and solves a linear quadratic problem to fit a weight for each node. Hence, the estimator is a convex combination of the nodes.

Motivation: The idea behind the node harvest algorithm (NH) is to select the best partition of the feature space given a set Q of q rules. The minimization problem can be formally written as follows.
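The program was shown as an image in the original article; based on the description that follows, a plausible reconstruction (not necessarily the author's exact notation) is:

\[
\min_{w \in \{0,1\}^{q}} \; \sum_{i=1}^{n} \Big( Y_i - \sum_{k=1}^{q} w_k\, p_k\, r_k(X_i) \Big)^{2}
\quad \text{subject to} \quad
\sum_{k=1}^{q} w_k\, r_k(x) = 1 \;\; \text{for all } x \in \mathbb{R}^{d} ,
\]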

where wₖ is the weight of the rule rₖ and pₖ is the prediction of the rule rₖ (usually the empirical conditional expectation of Y given X ∈ rₖ).

Unfortunately, this equation is very hard to solve, especially because the constraint wₖ ∈ {0,1} for k ∈ {1, …, q} does not correspond to a convex feasible region. In this paper [2], the author proposes a way to adapt this optimization problem, making it solvable.

"The main idea of NH is that it becomes computationally feasible to solve the optimal empirical partitioning problem if the weights are only constrained to be nonnegative."

Rule generation: In the paper, the author recommends using a Random Forest algorithm and fitting each tree with a subsample of the data of size n/10 instead of the usual bootstrap samples to speed up the computation and increase the diversity of the rule set. Then all nodes of each tree satisfying a given condition on the maximal interaction order (the rule’s length) and a given condition on the minimal size (number of observations covered by the rule) are added to the set of rules Q. Finally, if the number of rules in Q is lower than a chosen number q, the process is repeated.

Rule selection: To solve the optimization problem, the author made two changes to the original formulation. The first was to replace the condition of partitioning the feature space with a condition of empirical partitioning, meaning that each observation must be covered by exactly one selected rule. The second was to relax the constraint on the vector of weights w by allowing them to take values in the interval [0, 1] instead of the binary set {0, 1}. Thanks to these changes, the new optimal empirical partitioning problem can be solved with a quadratic program solver.

Interpretability: Interpretability is ensured by the choice of the maximal interaction order, which must be less than or equal to 3; indeed, it is very hard to understand a rule involving more than 3 features. Moreover, even if the number of rules in Q is very large, the solution of the optimization problem should put a weight of 0 on the majority of the rules.

"the vast majority of nodes [..] will receive a zero weight, without the sparsity being enforced explicitly other than through the constraint of empirical paritioning."

In my experience, however, the number of selected rules is still too high to create an interpretable model, even though the model remains fairly accurate.

Remarks: In the paper, the author proposes a dimensionality reduction to solve the optimization problem faster whenever the size of Q is larger than the number of observations. He also presents default values for the parameters of NH that work well in practice. Finally, he suggests a regularization to improve interpretability.

Packages: Node harvest has been implemented in R in the package nodeHarvest.


3. SIRUS (2021).

The SIRUS algorithm was introduced in [3] and extended in [4]. It uses a modified Random Forest to generate a large number of rules, which are then selected when they occur with a frequency greater than a tuning parameter p₀. To make sure that the same rules do occur repeatedly across trees, the features are discretized.

Motivation: The main strength of this algorithm is its stability: for two independent samples from the same distribution, the sets of selected rules are almost the same. Indeed, for the authors, stability is one of the key characteristics of interpretability.

Rule generation: SIRUS uses a slightly modified Random Forest. First, the features are discretized into q bins using the empirical q-quantiles of the marginal distributions. The discretization is fundamental for stability, and accuracy is preserved as long as q is not too small. To enforce interpretability, the depth of the trees is limited to 2, whereas the usual Random Forest algorithm recommends deep trees. This means that each tree of the SIRUS rule generation process has at most six nodes (not counting the root node). Finally, each tree is grown on a subsample of aₙ observations, where aₙ is a parameter of SIRUS.
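As an illustration of the discretization step only (a minimal sketch, not the sirus implementation; q and the feature matrix X are assumed given):

```python
import numpy as np

def discretize(X, q=10):
    """Map each feature value to its empirical quantile bin (a rough sketch)."""
    X = np.asarray(X, dtype=float)
    X_binned = np.empty_like(X, dtype=int)
    for j in range(X.shape[1]):
        # Inner q-quantiles of the marginal distribution of feature j.
        cuts = np.quantile(X[:, j], np.linspace(0, 1, q + 1)[1:-1])
        X_binned[:, j] = np.digitize(X[:, j], np.unique(cuts))
    return X_binned
```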

Rule selection: SIRUS generates many trees (typically 10,000) and selects the rules that are shared by a large fraction of them. To identify these rules, SIRUS computes the empirical probability that a tree of the forest contains a particular path, and then selects all the rules with an empirical probability greater than or equal to a chosen parameter p₀ ∈ (0, 1). Unfortunately, this method generates a lot of redundancy in the list of selected rules, and the selection process needs a post-treatment: if a rule r is a linear combination of rules whose paths have a higher frequency in the forest, then r is removed from the set of selected rules.
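The frequency-based selection itself is simple to sketch (assuming each extracted rule can be represented by a hashable path description; the linear-combination post-treatment is omitted):

```python
from collections import Counter

def select_frequent_paths(paths_per_tree, p0):
    """Keep the paths that appear in at least a fraction p0 of the trees (a sketch)."""
    n_trees = len(paths_per_tree)
    counts = Counter(path for tree_paths in paths_per_tree for path in set(tree_paths))
    return [path for path, c in counts.items() if c / n_trees >= p0]
```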

Interpretability: The post-treatment in the selection process ensures that only the rules with the shortest lengths are kept. Indeed, a rule can only be a linear combination of rules of smaller length.

"In our case, it happens that the parameter p₀ alone is enough to control sparsity"

Remarks: The discretization process is mandatory in SIRUS. First, it increases the stability of the algorithm. Second, it enables the creation of decision trees with common rules: if a feature is continuous, the probability that a decision tree algorithm cuts at exactly the same value for the same rule in two independent samples is zero. The authors also proved the consistency of the estimation of the path probabilities and the asymptotic convergence to perfect stability (i.e., the set of selected rules is exactly the same for two independent samples drawn from the same distribution). Finally, the authors give some techniques for parameter tuning.

Packages: SIRUS has been implemented in R in the package sirus.


4. CoveringAlgorithm (2021).

This algorithm was presented in [5]. It extracts a sparse rule set by considering all nodes and leaves of tree ensembles, selected according to their statistical properties to form a "quasi-covering" (meaning that the covering is asymptotic). The covering is then turned into a partition using the so-called partitioning trick to create a consistent estimator of the regression function. I have already discussed the partitioning trick in a previous article, "How to deal with overlapping rules?".

Motivation: The idea is to generate an interpretable, consistent rule-based estimator of the regression function. In the literature, consistent estimators are built from rules of length d, and for a large d the rules become uninterpretable. To get around this limitation, the authors introduced the concepts of significant and insignificant rules. Significant rules can be seen as rules on which the target variable behaves significantly differently from its average behaviour, and insignificant rules as rules on which the target variable has a very low variance. The authors also removed the condition of having a covering: the set of selected rules does not have to cover the feature space; this restriction is replaced by the more flexible notion of quasi-coverage. The CoveringAlgorithm is an illustration of these concepts.

Rule generation: As in RuleFit, ensemble tree algorithms such as Random Forests, AdaBoost and Gradient Boosting can be used as rule generators. Then, all nodes (internal and terminal) of each tree are extracted as rules, and only rules with a length lower than or equal to a specified parameter value (usually 3) are kept.

Rule selection: The selection process consists of two steps. First, the algorithm extracts significant rules, ordering them by decreasing coverage, and tries to cover the empirical feature space with these rules while excluding those that overlap too much. Second, if the coverage is not complete, the algorithm tries to expand it by adding insignificant rules, ordered by increasing variance.
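A rough sketch of this two-step greedy selection (an illustration of the idea, not the CoveringAlgorithm package; rules are assumed to expose an activation vector, a significance flag, a coverage and a variance):

```python
import numpy as np

def select_rules(rules, n_obs, max_overlap=0.5):
    """Greedy quasi-covering selection (a sketch). Each rule has .activation (bool array
    of length n_obs), .significant (bool), .coverage (float) and .variance (float)."""
    covered = np.zeros(n_obs, dtype=bool)
    selected = []
    significant = sorted([r for r in rules if r.significant], key=lambda r: -r.coverage)
    insignificant = sorted([r for r in rules if not r.significant], key=lambda r: r.variance)
    for r in significant + insignificant:
        overlap = (covered & r.activation).sum() / max(r.activation.sum(), 1)
        if overlap <= max_overlap:          # skip rules that overlap too much
            selected.append(r)
            covered |= r.activation
        if covered.all():                   # empirical feature space is covered
            break
    return selected
```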

Interpretability: Interpretability is ensured by the small number of short rules in the set of selected rules. Indeed, it is easier to understand a decision if it is based on two or three short rules. In addition, the notion of significant and insignificant rules helps users identify areas of interest: insignificant rules indicate areas where not much is happening, in contrast to significant rules, which highlight areas of interest.

Remarks: The main result of this paper is a proof of the consistency of an interpretable estimator of the regression function constructed from a set of significant and insignificant rules. As explained above, this algorithm is just an example of how to use significant and insignificant rules in a predictive algorithm. As a matter of fact, it does not produce a consistent estimator, because the rule generator does not fully satisfy some of the required conditions. This algorithm is currently the subject of active research.

Packages: CoveringAlgorithm has been developed in Python in the GitHub package CoveringAlgorithm.


Conclusion

It is interesting to see that each algorithm was developed with a different goal: accuracy for RuleFit and Node Harvest, stability for SIRUS and simplicity for CoveringAlgorithm. Building on this previous research, I am working on a quantitative method to measure interpretability based on this triptych: predictability, stability and simplicity. A first approach dedicated to tree-based and rule-based algorithms was published in the MDPI AI Journal (and a TDS version has been posted here). I am currently working on an extension of this approach that is algorithm-independent, so as to be able to compare the interpretability of any predictive algorithm.


About Us

Advestis is a European Contract Research Organization (CRO) with a deep understanding and practice of statistics, and interpretable machine learning techniques. The expertise of Advestis covers the modeling of complex systems and predictive analysis for temporal phenomena.

LinkedIn: https://www.linkedin.com/company/advestis/


References

[1] Friedman, J.H. and Popescu, B.E., 2008. Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3), pp.916–954.

[2] Meinshausen, N., 2010. Node harvest. The Annals of Applied Statistics, pp.2049–2072.

[3] Bénard, C., Biau, G., Veiga, S. and Scornet, E., 2021. SIRUS: Stable and interpretable rule set for classification. Electronic Journal of Statistics, 15(1), pp.427–505.

[4] Bénard, C., Biau, G., Veiga, S. and Scornet, E., 2021, March. Interpretable random forests via rule extraction. In International Conference on Artificial Intelligence and Statistics, pp. 937–945. PMLR.

[5] Margot, V., Baudry, J.P., Guilloux, F. and Wintenberger, O., 2021. Consistent regression using data-dependent coverings. Electronic Journal of Statistics, 15(1), pp.1743–1782.

The post Four interpretable algorithms that you should use in 2022 appeared first on Towards Data Science.

A Brief Overview of Methods to Explain AI (XAI) https://towardsdatascience.com/a-brief-overview-of-methods-to-explain-ai-xai-fe0d2a7b05d6/ Fri, 26 Nov 2021 19:30:18 +0000 https://towardsdatascience.com/a-brief-overview-of-methods-to-explain-ai-xai-fe0d2a7b05d6/ How to design an interpretable machine learning process


Image of kiquebg from Pixabay.

I know this topic has been discussed many times. But I recently gave some talks on interpretability (for SCAI and France Innovation) and thought it would be good to include some of my work in this article. The importance of explainability for the decision-making process in Machine Learning no longer needs to be proved. Users are demanding more explanations, and although there are no uniform and strict definitions of interpretability and explainability, the number of scientific papers on explainable artificial intelligence (XAI) is growing exponentially.

As you may know, there are two ways to design an interpretable machine learning process. Either you design an intrinsically interpretable predictive model, for example with rule-based algorithms, or you use a black-box model and add a surrogate model to explain it. The second way is called post-hoc interpretability. There are two types of post-hoc interpretable models: global models to describe the average behaviour of your black-box models, or local models to explain individual predictions. Nowadays, there are several tools for creating a post-hoc interpretable model, most of them are model-agnostic, i.e. they can be used independently of the algorithm used.

I will present the most common of them. This article is based on Christoph Molnar's reference book: Interpretable Machine Learning. To illustrate the methods, I use the usual Boston Housing dataset. The target is the median price of owner-occupied homes in $1000's.


Partial Dependence Plot (PDP).

PDP is a global, model-agnostic interpretation method. This method shows the contribution of individual features to the predicted value of your black-box model. It can be applied to numerical and categorical variables. First, we choose a feature and its grid values (spanning the range of the chosen feature). Then, the values of this feature are replaced by each grid value in turn and the predictions are averaged, so that each value of the grid yields a point corresponding to an average of the predictions. Finally, the curve is drawn. The main limitation is that humans are not able to understand a graph with more than three dimensions; therefore, we cannot analyse more than two features in a partial dependence plot.

Example of PDP applied on a Random Forest trained on Housing Boston data for the features RM and LSTAT. Image from the author.

In these graphs, we can see the average impact of the average number of rooms per dwelling (RM) and the percentage of the lower class in the population (LSTAT) on the median price. For example, we can deduce that the lower the number of rooms, the lower the median price (this seems coherent).

The function plot_partial_dependence is already implemented in the inspection module of the package scikit-learn.
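A minimal usage sketch (assuming an older scikit-learn version where load_boston and plot_partial_dependence are still available; in recent versions they are replaced by fetch_openml and PartialDependenceDisplay.from_estimator):

```python
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import plot_partial_dependence
import matplotlib.pyplot as plt

X, y = load_boston(return_X_y=True)
feature_names = load_boston().feature_names.tolist()

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Partial dependence of the prediction on RM and LSTAT.
plot_partial_dependence(model, X, features=["RM", "LSTAT"], feature_names=feature_names)
plt.show()
```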


Accumulated Local Effects (ALE).

ALE is also a global, model-agnostic interpretation method. It is an alternative to PDP, which is subject to bias when variables are highly correlated. For example, the variable RM, which indicates the number of rooms, is highly correlated with the area of the house, so RM=7.5 would be an unrealistic individual for a very small area. The idea behind ALE is to consider instances with a similar value of the chosen variable, rather than substituting values for all instances. If you simply average the predictions in each window, you get an M-plot. Unfortunately, M-plots represent the combined effect of all correlated features. To get a better understanding, I quote the example from Interpretable Machine Learning:

"Suppose that the living area has no effect on the predicted value of a house, only the number of rooms has. The M-Plot would still show that the size of the living area increases the predicted value, since the number of rooms increases with the living area."

ALE instead computes and accumulates the differences in predictions over small windows of the feature (defined, for example, by empirical quantiles), rather than averaging the predictions themselves. In the following example, I plot the ALE for the features RM and LSTAT with 10 windows.

Example of ALE applied on a Random Forest trained on Housing Boston data for the features RM and LSTAT. Image from the author.

The vertical lines represent the windows under consideration. The empirical quantiles are chosen so that 10% of the individuals lie in each window. Unfortunately, there is no general rule for setting the number of windows, which strongly influences the interpretation.

For the illustration, I have used the open-source package ALEPython available on Github.
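If you prefer to see what ALE does without a dedicated package, a rough first-order ALE can be computed by hand (a simplified sketch that ignores refinements such as exact paired differences and the usual centring conventions; it assumes the data is held in a pandas DataFrame and that predict accepts one):

```python
import numpy as np

def first_order_ale(predict, X, feature, n_windows=10):
    """Rough first-order ALE for one numeric feature of a pandas DataFrame X."""
    x = X[feature].to_numpy()
    # Window edges from empirical quantiles, so each window holds ~the same number of rows.
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_windows + 1)))
    idx = np.clip(np.digitize(x, edges[1:-1]), 0, len(edges) - 2)
    effects = np.zeros(len(edges) - 1)
    for b in range(len(edges) - 1):
        rows = X[idx == b]
        if len(rows) == 0:
            continue
        lo, hi = rows.copy(), rows.copy()
        lo[feature], hi[feature] = edges[b], edges[b + 1]
        # Local effect: how much the prediction moves when the feature crosses the window.
        effects[b] = np.mean(predict(hi) - predict(lo))
    ale = np.cumsum(effects)
    return edges, ale - ale.mean()        # centre the accumulated effects
```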


Individual Conditional Expectation (ICE).

ICE is a local, model-agnostic interpretation method. The idea is the same as for PDP, but instead of plotting the average contribution, we plot the contribution for each individual. Of course, the main limitation is the same as for PDP. Moreover, if you have too many individuals, the plot may become unreadable.

Example of ICE applied on a Random Forest trained on Housing Boston data for the features RM and LSTAT. Image from the author.

Here we see the effects of the average number of rooms per dwelling (RM) and the percentage of lower class in the population (LSTAT) on the median price for each of the 506 observations. Again, we can see that the lower the number of rooms, the lower the median price. However, there are 5 individuals for whom RM shows an opposite behaviour. These 5 individuals should be carefully examined as they could indicate an error in the database.

The function plot_partial_dependence is already implemented in the inspection module of the package scikit-learn.
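For ICE curves specifically, the same function can be used with its kind argument (available in scikit-learn 0.24+; a sketch reusing the imports and objects from the PDP example above):

```python
# ICE curves for RM and LSTAT: one line per observation instead of the average.
plot_partial_dependence(
    model, X, features=["RM", "LSTAT"], feature_names=feature_names, kind="individual"
)
plt.show()
```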


Local Interpretable Model-agnostic Explanations (LIME).

LIME, as its name suggests, is a local, model-agnostic interpretation method. The idea is simple: from a new observation, LIME generates a new dataset consisting of perturbed samples and the corresponding predictions of the underlying model. Then, an interpretable model is fitted on this new dataset, weighted by the proximity of the sampled observations to the observation of interest.

Example of LIME applied on a Random Forest trained on Housing Boston data. The chosen observation is the 42nd and the surrogate local model is a RIDGE regression. Image from the author.

In this visual output of LIME, the prediction for the 42nd observation is explained. In the created dataset, the predicted values range from 8.72 to 47.73. The predicted value for the observation of interest is 24.99. We can see that the value of LSTAT has a positive influence on the prediction, which confirms the previous conclusions from PDP and ICE.
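A minimal sketch with the lime package (reusing X, model and feature_names from the PDP sketch above; lime's default surrogate is a ridge regression, and observation 42 is the one discussed here):

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(X, feature_names=feature_names, mode="regression")

# Explain the prediction of the 42nd observation with the 5 most influential features.
explanation = explainer.explain_instance(X[42], model.predict, num_features=5)
print(explanation.as_list())
```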


SHapley Additive exPlanations (SHAP).

SHAP is a local, model-agnostic interpretation method based on the Shapley value. The Shapley value comes from game theory, and there are several articles on Towards Data Science and elsewhere that discuss it. I just want to recall here that, in this context, the game is collaborative, the task is the prediction, the gain is the difference between the prediction and a baseline prediction (usually the average of the observations), and the players are the features. The Shapley value is then used to distribute the shift of the prediction from the baseline among the features, so that each feature's realization moves the prediction up or down. The idea behind SHAP is to represent the Shapley value explanation as an additive feature attribution method. Hence, it becomes a linear model whose intercept is the baseline prediction. The following graphical representation of SHAP applied to an XGBoost model illustrates these explanations.

Example of SHAP applied on a XGBoost regressor trained on Housing Boston data. The chosen observation is the 42nd. Image from the author.

In this figure, we see the influence of each variable on the 42nd prediction. Here, the baseline prediction is 22.533. Then the variable INDUS=6.91 shifts the prediction by -0.1, the variable B=383.37 shifts the prediction by +0.19, and so on. We see that the largest shifts come from the variables LSTAT and RM, which are the most important features of this dataset.
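A minimal sketch with the shap package, using the classic TreeExplainer API (reusing X, y and feature_names from the PDP sketch above; the XGBoost regressor xgb_model is fitted here for the illustration):

```python
import shap
import xgboost

xgb_model = xgboost.XGBRegressor(n_estimators=100).fit(X, y)

explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X)          # one additive attribution per feature

# Force plot for the 42nd observation: baseline + feature contributions = prediction.
# In a notebook, call shap.initjs() first so the plot renders.
shap.force_plot(explainer.expected_value, shap_values[42], X[42], feature_names=feature_names)
```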


Conclusion

Local models are used more often than global models. If you want a global description of your model, it is best to use a predictive algorithm that is interpretable in itself. These are not the most accurate algorithms, but they do allow for an overall description of the generated model. If you want a very accurate predictive model, you usually want to be able to explain each prediction individually rather than just the overall model. For example, an algorithm for a self-driving car should be able to explain each of its predictions in case of an accident.

In order to be concise, I have omitted feature importance, feature interactions, second-order ALE, KernelSHAP and other methods. I have only given a brief overview to show you what is available today to interpret your black-box models. If you want to learn more about this topic, I recommend the book of Christoph Molnar: Interpretable Machine Learning.


About Us

Advestis is a European Contract Research Organization (CRO) with a deep understanding and practice of statistics, and interpretable machine learning techniques. The expertise of Advestis covers the modeling of complex systems and predictive analysis for temporal phenomena.

LinkedIn: https://www.linkedin.com/company/advestis/

The post A Brief Overview of Methods to Explain AI (XAI) appeared first on Towards Data Science.

An original method to combine regression estimators in Python https://towardsdatascience.com/an-original-method-to-combine-regression-estimators-in-python-b9247141263/ Wed, 27 Oct 2021 13:10:04 +0000 https://towardsdatascience.com/an-original-method-to-combine-regression-estimators-in-python-b9247141263/ COBRA: A nonlinear combined regression strategy


Nowadays, data scientists have many accurate machine learning algorithms at their disposal. But choosing the best model is a complicated task, and ensemble learning has proved its efficiency in practice. In my previous posts, "How to choose the best model?" and "How to deal with overlapping rules?", I presented the experts' aggregation theory, a theory that should be used more often for ensemble learning instead of simple averaging. Here, I want to focus on the COBRA method, presented in [1]. This method has a very different and original approach to the combination of estimators. For simplicity, I use the terms estimators, predictive models and experts interchangeably. Indeed, in a regression setting, an estimator of the regression function can be used as a predictive model, or as an expert that makes a prediction for each new observation.

First, I recall the main setup of the experts' aggregation theory. We have a set of K experts that give us, at each time t, a prediction for the value of the target yₜ. The idea is to aggregate the predictions of the K experts to produce an aggregated prediction ŷₜ.


COBRA (COmBined Regression Alternative).

An intuitive explanation.

Usually, in the experts’ aggregation theory, we use a convex combination of the experts’ predictions to make ŷ. But COBRA has a very different approach, based on an idea similar to the k-nearest neighbours algorithm. At each time t we have a new observation xₜ, and we compute the K experts’ predictions {p₁(xₜ), p₂(xₜ), …, pₖ(xₜ)}. Then, the idea is to average the realizations of y, not used to generate the experts, whose associated predictions lie in the same neighbourhood (in the Euclidean sense) as {p₁(xₜ), p₂(xₜ), …, pₖ(xₜ)}. The step of searching for realizations in these neighbourhoods is called the consensus step. The following example illustrates the concept.

COBRA aggregation of two predictors. Image from the author.

In this example, we have one feature x ∈ R represented on the x-axis. The realizations of y are marked as black circles. We have two experts: the first gives the red predictions and the second the green predictions. For the new observation x = 11, we have the predictions p¹ₜ and p²ₜ. For each prediction, a neighbourhood is formed, symbolized by the coloured dashed lines. Then, all realizations whose expert predictions all fall within these neighbourhoods, marked as the blue circles, are averaged to compute ŷₜ (the blue rhombus).
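A minimal sketch of this consensus step in plain numpy (an illustration of the idea, not the pycobra implementation; the experts are assumed to be fitted models exposing a predict method):

```python
import numpy as np

def cobra_predict(x_new, experts, X_m, y_m, eps):
    """COBRA consensus step (a rough sketch, not the pycobra implementation).

    experts: fitted models (trained on D_l) exposing a .predict method.
    X_m, y_m: the held-out sample D_m used for the consensus step.
    eps: the smoothing parameter epsilon_m.
    """
    x_new = np.asarray(x_new).reshape(1, -1)
    p_new = np.array([e.predict(x_new)[0] for e in experts])      # K predictions for x_new
    p_m = np.column_stack([e.predict(X_m) for e in experts])      # shape (m, K)
    # Keep the points of D_m whose K predictions all fall within eps of p_new.
    in_consensus = np.all(np.abs(p_m - p_new) <= eps, axis=1)
    if not in_consensus.any():
        return y_m.mean()          # fallback when no observation reaches consensus
    return y_m[in_consensus].mean()
```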

Mathematical explanations.

Formally, the COBRA estimator is constructed as follows. Let Dₙ be a sample of n independent and identically distributed observations of the pair of random variables (X, Y). The sample is divided into two independent samples, Dₗ and Dₘ. Then, Dₗ is used to generate a set of experts {p₁, p₂, …, pₖ} and Dₘ is used for the calculation of ŷₜ, the combined predicted value for a new observation xₜ. We have the following formulas
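Up to notational details in [1], the aggregated prediction can be written as

$$\hat{y}_t \;=\; \sum_{i \,:\, (X_i, Y_i) \in D_m} W_i(x_t)\, Y_i,$$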

where the random weights Wᵢ take the form
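that is, roughly, a training point of Dₘ counts only if all K experts place it within ϵₘ of their prediction for xₜ:

$$W_i(x_t) \;=\; \frac{\mathbb{1}\big\{\,|p_k(x_t) - p_k(X_i)| \le \epsilon_m \ \text{for all } k \le K\,\big\}}{\sum_{j \,:\, (X_j, Y_j) \in D_m} \mathbb{1}\big\{\,|p_k(x_t) - p_k(X_j)| \le \epsilon_m \ \text{for all } k \le K\,\big\}}.$$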

The ϵₘ is the smoothing parameter. The larger ϵₘ, the more tolerant the process. Conversely, if ϵₘ is too small, very few observations reach the consensus and many are discarded. Therefore, its calibration is a crucial step. To overcome this difficulty, the authors propose a data-dependent calibration in the third section of [1].

These expressions show that one of the main differences between COBRA and other common aggregation methods is that COBRA is nonlinear with respect to the experts {p₁, p₂, …, pₖ}. Moreover, from a theoretical point of view, COBRA satisfies an oracle bound, which shows that the cumulative loss of the aggregated predictor is upper bounded by the smallest cumulative loss over the group of experts, up to a residual term that decays towards zero.


Pycobra library

Pycobra is an open-source Python library that was introduced in [2]. This library is more than just an implementation of COBRA aggregation (even if the simple fact of implementing an algorithm called COBRA in a language called Python would have been enough). It also includes the EWA algorithm (Exponentially Weighted Aggregation) detailed in [3], and a version of COBRA for the classification setting, ClassifierCobra, inspired by [4]. The package also includes some visualization tools to gauge the performance of the experts. Moreover, a Diagnostics class allows comparing different combinations of the constituent experts, different data splits and other basic parameters, which makes parameter analysis easier. Finally, the library is available on GitHub here.


Conclusion

COBRA is an original nonlinear aggregation method for ensemble learning with theoretical guarantees. The main authors of the first paper are still working on improved versions, as shown by the recent paper introducing a kernel version [5]. What's more, this algorithm is available in an open-source Python library, so there is no excuse not to try it out in your next data science project or Kaggle challenge.


About Us

Advestis is a European Contract Research Organization (CRO) with a deep understanding and practice of statistics, and interpretable machine learning techniques. The expertise of Advestis covers the modeling of complex systems and predictive analysis for temporal phenomena.

LinkedIn: https://www.linkedin.com/company/advestis/


References

[1] G.Biau, A.Fischer, B.Guedj & J.D.Malley COBRA: A combined regression strategy. Journal of Multivariate Analysis 146 (2016): 18–28.

[2] B.Guedj, and B.Srinivasa Desikan Pycobra: A python toolbox for ensemble learning and visualisation. Journal of Machine Learning Research 18.190 (2018): 1–5.

[3] A. S.Dalalyan and A.B.Tsybakov Aggregation by exponential weighting and sharp oracle inequalities. International Conference on Computational Learning Theory (2007): 97–111.

[4] M.Mojirsheibani Combining classifiers via discretization. Journal of the American Statistical Association 94.446 (1999): 600–609.

[5] B.Guedj and B.S.Desikan. Kernel-Based Ensemble Learning in Python. Information 11, no. 2 (2020): 63.

The post An original method to combine regression estimators in Python appeared first on Towards Data Science.

How to deal with overlapping rules? https://towardsdatascience.com/how-to-deal-with-overlapping-rules-16bc0446af66/ Wed, 17 Feb 2021 15:18:23 +0000 https://towardsdatascience.com/how-to-deal-with-overlapping-rules-16bc0446af66/ Especially when the rules' prediction are contradictory.



When one wants to build an interpretable model, one usually hesitates between the two main families of intrinsically interpretable algorithms: Tree-based algorithms and rule-based algorithms. One of the main differences between them is that tree-based algorithms generate trees with disjoint rules, which has shortcomings such as the replicated subtree problem [1], while rule-based algorithms generate sets of rules and allow overlapping rules which may be a problem. Indeed, how can you make a decision when two (or more) rules are activated simultaneously? This is even more of a problem when the predictions of the rules are contradictory.

In this short article, I present two methods for obtaining a single prediction from a set of rules, less trivial than an average of predictions, but preserving the interpretability of the prediction.


Using expert aggregation methods

If you are not familiar with the theory of aggregation of experts, I refer you to my previous article "How to choose the best model?". The idea is simple: consider each rule as a specialized expert and aggregate their predictions by a convex combination. The weight associated with a rule changes after each prediction: it increases if the expert's prediction is good and decreases otherwise. Thus, the weight can be interpreted as the confidence we have in each expert at time t.

Unfortunately, some rules may not be activated, meaning that the new observation does not fulfill their conditions. They are called "sleeping experts". The question is: how can I assess the confidence of a rule that is not activated? There are two common methods. The first is to say "I won't make a decision without information": the weight of a sleeping rule remains the same until it becomes active again. The second is to ask "Is it good that this rule is sleeping?": the new weight of a sleeping rule is evaluated by considering the aggregated prediction instead of the rule's own prediction. If the aggregated prediction is good, it means the rule was right to sleep; otherwise, it means the rule should have been active. Both methods are questionable, but it is proven that they provide similar results.


Using the partitioning trick

The partitioning trick is a very different, more "statistical" idea. It turns the covering formed by the set of rules into a partition of disjoint cells (as shown in the figure below). Hence, the prediction is the mean of the observations in the cell, and the interpretation is a conjunction of rules.

Image from the author.

One may object that the construction of this partition is very time-consuming, and that is correct. To get around this problem, there is the partitioning trick. The key observation is that it is not necessary to fully describe the partition to compute the prediction value; it suffices to identify the unique cell of the partition that contains the new observation x. By creating binary vectors, whose value is 1 if the new observation x fulfills the condition of the rule and 0 otherwise, the identification of the cell containing x is a simple sequence of vectorial operations. Thus, the complexity of computing a prediction is O(nR), where n is the number of points in the training set and R is the number of rules. The figure below illustrates this process (I refer to [2] for more details).

Image from the author.
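In code, the cell identification boils down to a few boolean vector operations (a minimal sketch of the idea in [2], assuming each rule is a callable condition and that its activations on the training set have been precomputed):

```python
import numpy as np

def partition_predict(x_new, rules, activations, y_train):
    """Prediction via the partitioning trick (a rough sketch of the idea in [2]).

    rules: list of callables r(x) -> bool, the rule conditions.
    activations: boolean array of shape (R, n); activations[k, i] is True if rule k covers X_i.
    """
    cell = np.ones(activations.shape[1], dtype=bool)
    for k, rule in enumerate(rules):
        if rule(x_new):
            cell &= activations[k]        # x_new satisfies rule k: keep covered points
        else:
            cell &= ~activations[k]       # x_new does not: keep uncovered points
    if not cell.any():
        return y_train.mean()             # empty cell: no comparable past observation
    return y_train[cell].mean()
```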

The main problem with this method is that we have no control over the size of the cells. In other words, if the cell containing the new observation is too small, you may make a decision based on too few past observations. This problem occurs if you have little data in your training set, and it can be interpreted as an unknown situation.


Conclusion

I have presented two methods that allow making a decision even if your process is based on overlapping rules. These two methods have pros and cons. The first one, based on the theory of expert aggregation, works well in practice and adds a confidence score to the rules. The second, based on the partitioning trick, is better in theory. As mentioned in [3], it is a good way to generate an interpretable, consistent estimator of the regression function.


References

[1] G.Bagallo and D.Haussler, Boolean feature discovery in empirical learning (1990), In Machine Learning, 5(1):71–99. Springer.

[2] V.Margot, J-P. Baudry, F. Guilloux and O. Wintenberger, Rule Induction Partitioning Estimator (2018), In International Conference on Machine Learning and Data Mining in Pattern Recognition 288–301. Springer.

[3] V.Margot, J-P. Baudry, F. Guilloux and O. Wintenberger, Consistent Regression using Data-Dependent Coverings (2021), In Electronic Journal of Statistics. The Institute of Mathematical Statistics and the Bernoulli Society.


About Us

Advestis is a European Contract Research Organization (CRO) with a deep understanding and practice of statistics, and interpretable machine learning techniques. The expertise of Advestis covers the modeling of complex systems and predictive analysis for temporal phenomena.

LinkedIn: https://www.linkedin.com/company/advestis/

The post How to deal with overlapping rules? appeared first on Towards Data Science.

How to choose the best model? https://towardsdatascience.com/how-to-choose-the-best-model-cf74bf8015d8/ Fri, 18 Dec 2020 09:49:26 +0000 https://towardsdatascience.com/how-to-choose-the-best-model-cf74bf8015d8/ Do not choose, use them all!


Image of quimono from Pixabay.

With this advertising title, I would like to draw your attention to a common problem in Machine Learning (ML): the choice of a good model. I will not describe all the statistical methods that have been developed for model selection; for the most curious readers I suggest reading [3]. In this article, I would like to talk about lesser-known methods: the theory of aggregation of experts.

When I talk to other data scientists, most of them have never heard of it. Therefore, I decided to write this article to briefly describe these methods. It is based on the book "Prediction, Learning and Games" [1], nicknamed "The Red Bible", and on the introduction of the thesis of Pierre Gaillard [2].


The main idea behind the aggregation of experts

Initially, the theory of aggregation of experts was developed to make sequential predictions of a variable y at different times t. For this purpose, we assume that at each time t we have a group of K experts who make predictions. The idea is to aggregate these predictions sequentially to produce an aggregated prediction ŷ.

One of the most important points is that it is not necessary to know the procedure by which these predictions are made. They could be generated by machine learning algorithms or be expert advice or even predictions from your fortune-teller, whatever.

To create ŷ, we perform a convex combination (i.e. a weighted sum where the sum of the weights equals 1) of the experts’ predictions. Once the true value of y is known, we adjust the weights of the individual experts according to the difference between their prediction and the true value of y. The worse the prediction, the lower the weight. Thus, the more an expert is wrong, the less important it will be in the aggregated prediction. And so on until you decide to renew your expert group.

Depending on what you compare your aggregated model to, you will use different strategies to update the experts' weights. In fact, there are different goals in the theory of aggregation of experts: to make an aggregated model competitive with the best expert in your group (the model selection (MS) problem), with the best convex combination of your experts (the convex combination (CC) problem), or even with the best sequential convex combination of your experts.

In the following, I need to introduce some mathematical formalism to present the main aggregation strategy: exponentially weighted aggregation. So if you are allergic to mathematical formulas, go directly to the conclusion and share this article with your team of data scientists.


Sequential prediction with expert advices

To evaluate the performance of a prediction, we need a measure. In the theory of aggregation of experts it is called the loss function, denoted ℓ. It is a positive function taking two arguments: the first is a prediction and the second is a realization. This function is assumed to be convex in its first argument; for example, the square loss ℓ(x, y) := (x-y)² is convex. The sequential prediction with expert advice can be described by the following steps:
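Roughly, the protocol is the following: at each time t = 1, 2, …, T, (1) each expert k provides a prediction pₖ,ₜ; (2) the statistician forms the aggregated prediction ŷₜ as a convex combination of the pₖ,ₜ; (3) the true value yₜ is revealed; (4) every expert suffers the loss ℓ(pₖ,ₜ, yₜ) and the weights are updated accordingly.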

The main objective of a statistician is to minimize the cumulated loss of the aggregated prediction. It is defined by
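In symbols, writing pₖ,ₜ for the prediction of expert k at time t, the cumulated loss of the aggregated prediction is

$$\hat{L}_T \;=\; \sum_{t=1}^{T} \ell(\hat{y}_t, y_t).$$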

We can decompose the formula of the cumulated loss as follows
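namely, up to the exact notation of [1],

$$\sum_{t=1}^{T} \ell(\hat{y}_t, y_t) \;=\; \underbrace{\min_{k \le K} \sum_{t=1}^{T} \ell(p_{k,t}, y_t)}_{\text{loss of the best expert}} \;+\; \underbrace{\sum_{t=1}^{T} \ell(\hat{y}_t, y_t) \,-\, \min_{k \le K} \sum_{t=1}^{T} \ell(p_{k,t}, y_t)}_{\text{cumulated regret } R_T}.$$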

Thus, we obtain the classical decomposition in ML. The right term is named cumulated regret. It reflects the regret of not knowing the best expert in advance. So, to minimize the cumulated loss, we have to control the cumulated regret.

The exponentially weighted aggregation strategy (presented below) guarantees an average cumulated regret of the order of
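that is, for a convex loss taking values in [a, b] and with the optimally tuned learning rate given further below,

$$\frac{R_T}{T} \;\le\; (b-a)\sqrt{\frac{\ln K}{2T}}.$$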

We can prove that this bound is optimal, but is not the purpose of this article. I refer the most curious among you to Chapter 2.2 of [1].


Exponentially weighted aggregation strategy

As mentioned in the introduction the aggregated prediction is a convex combination of the experts’ predictions. For each date t we have
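that is, in symbols,

$$\hat{y}_t \;=\; \sum_{k=1}^{K} w_{k,t}\, p_{k,t}, \qquad \text{with } w_{k,t} \ge 0 \text{ and } \sum_{k=1}^{K} w_{k,t} = 1.$$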

And the weights are sequentially computed following the formula
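namely, denoting by L₍ₖ,ₜ₋₁₎ the cumulated loss of expert k up to time t−1,

$$w_{k,t} \;=\; \frac{\exp\!\big(-\eta\, L_{k,t-1}\big)}{\sum_{j=1}^{K} \exp\!\big(-\eta\, L_{j,t-1}\big)}, \qquad L_{k,t-1} = \sum_{s=1}^{t-1} \ell(p_{k,s}, y_s),$$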

where η > 0 is called the learning rate. To calibrate this parameter we use the fact that if the loss is convex on its first argument and is valued in [a, b] then the cumulated regret of this strategy is uniformly bounded by
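that is, up to the exact constants in [1],

$$R_T \;\le\; \frac{\ln K}{\eta} \;+\; \frac{\eta\, T\, (b-a)^2}{8}.$$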

Hence, this bound is minimal for
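which gives

$$\eta \;=\; \frac{1}{b-a}\sqrt{\frac{8 \ln K}{T}}, \qquad \text{yielding } R_T \le (b-a)\sqrt{\frac{T \ln K}{2}}.$$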

You could tell me that in practice we do not know the parameters T, a and b, and that is true. However, the expert aggregation theory has been developed to be used in practice, so there are always techniques to get around this kind of difficulty. If you want to read more about the calibration of η, I refer you to Chapter 2.3 of [1].
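To fix ideas, here is a minimal sketch of the exponentially weighted aggregation loop (an illustration only; it assumes the experts' predictions are already available as a matrix and uses the square loss):

```python
import numpy as np

def ewa(expert_preds, y, eta, loss=lambda p, y: (p - y) ** 2):
    """Exponentially weighted aggregation (a minimal sketch, square loss by default).

    expert_preds: array of shape (T, K), prediction of each expert at each time t.
    y: array of shape (T,), the observed values.
    """
    T, K = expert_preds.shape
    cum_loss = np.zeros(K)                         # cumulated loss of each expert
    aggregated = np.empty(T)
    for t in range(T):
        w = np.exp(-eta * cum_loss)
        w /= w.sum()                               # convex weights summing to 1
        aggregated[t] = w @ expert_preds[t]        # aggregated prediction at time t
        cum_loss += loss(expert_preds[t], y[t])    # update once y_t is revealed
    return aggregated
```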


Conclusion

The theory of aggregation of experts is an amazing theory, and it is driven by practice. It allows the aggregation of multiple forecasting models to get the most out of each one. For example, EDF (France’s main electricity generation and distribution company) uses more than forty models to predict the electricity consumption of the French people. One of the main advantages is that by aggregating experts, very different models can be used simultaneously without having to worry about how they were created. Therefore, very general models can be combined with very specialized models. It’s the best of both worlds.

Right now, the theory of aggregation of experts is not very well known. But I think it’s a very interesting approach today, where the number of algorithms is multiplying and the selection depends more and more on time-changing parameters. The best model of today may not be the best model of tomorrow and vice versa.

Therefore, from now on, do not choose, use them all!


About Us

Advestis is a European Contract Research Organization (CRO) with a deep understanding and practice of statistics, and interpretable machine learning techniques. The expertise of Advestis covers the modeling of complex systems and predictive analysis for temporal phenomena.

LinkedIn: https://www.linkedin.com/company/advestis/


References

[1] N.Cesa-Bianchi and G.Lugosi, Prediction, Learning, and Games (2006), Cambridge University Press.

[2] P.Gaillard, Contributions to online robust aggregation: work on the approximation error and on probabilistic forecasting. Applications to forecasting for energy markets (2015), Université Paris-Sud 11.

[3] P.Massart, Concentration Inequalities and Model Selection (2007), Vol. 6, Berlin: Springer.

The post How to choose the best model? appeared first on Towards Data Science.

How to measure interpretability? https://towardsdatascience.com/how-to-measure-interpretability-d93237b23cd3/ Fri, 20 Nov 2020 16:57:23 +0000 https://towardsdatascience.com/how-to-measure-interpretability-d93237b23cd3/ Today almost everyone uses, consciously or unconsciously, artificial intelligence algorithms. And more and more people are beginning to...


Image of geralt from Pixabay.

Today almost everyone uses, consciously or unconsciously, artificial intelligence algorithms, and more and more people are beginning to question how they work. In September 2020, Twitter apologized for its 'racist' image-cropping algorithm. This image-cropping anomaly was found by Colin Madland, a PhD candidate who wanted to tweet about a similar anomaly in the software Zoom. Of course, this is a minor issue, but a sentence such as "The algorithm said to turn left for a reason, but we do not know what the reason is" cannot be an answer after a self-driving car accident. This is where the question of interpretability arises.

If you work in the fields of Data Science, machine learning or artificial intelligence, you have probably heard about interpretability (if not, I recommend reading the book [5]). I am not an exception to the rule, and I was given the mission of designing an interpretable predictive algorithm for the management of financial assets. So, I started to do some research. After a few readings, I found several algorithms presented as interpretable, but I also realized that everyone agrees that there is not yet a definition of interpretability and that this notion is difficult to define. So, without any quantitative measure, how can we be sure that an algorithm is more interpretable than another? To answer this question, I decided to create a measure of interpretability based on the triptych predictivity, stability and simplicity, as proposed in [6].


Introduction

Usually, two main ways can be distinguished for the production of interpretable predictive models.

The first one relies on the use of a so-called post-hoc interpretable model. In this case, one uses an uninterpretable Machine Learning algorithm to create predictive models, and then one tries to interpret the generated model.

The other way is to use an intrinsically interpretable algorithm to directly generate an interpretable model. Usually, when one wants to design an intrinsically interpretable algorithm, one bases it on rules. A rule is an "If … Then …" statement, easily understandable by a human. There exist two families of intrinsically interpretable algorithms: tree-based algorithms generating trees and rule-based algorithms generating sets of rules. But any tree can be represented as a set of rules.

In this article I focus on the interpretability of algorithms based on rules. I describe how to evaluate the predictivity, stability and simplicity of a set of algorithms and then how to combine them to obtain an interpretability measure and identify the most interpretable.

How to evaluate predictivity ?

The predictivity is a positive number between 0 and 1 which evaluates the accuracy of the predictive model. Accuracy ensures trust in the generated model, and accuracy measures are a well-studied notion in Machine Learning.

The accuracy function should be chosen according to the type of setting. For instance, one uses the mean squared error for a regression problem and the 0–1 error function for a binary classification setting. Using an error term in the predictivity function creates two constraints: a) the term has to be normalized to make it independent of the range of the predicted variable, and b) it must take its values between 0 and 1, with 1 being the highest level of accuracy.
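As an illustration only (not necessarily the exact definition used in [4]), one choice that satisfies both constraints in a regression setting is

$$\mathrm{Pred} \;=\; \max\!\left(0,\; 1 - \frac{\widehat{\mathrm{MSE}}}{\widehat{\mathrm{Var}}(Y)}\right),$$

i.e. a clipped R², which equals 1 for a perfect model and 0 for a model no better than predicting the mean.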

How to evaluate the stability?

The stability quantifies the noise sensitivity of an algorithm and evaluates its robustness. To measure the stability of a tree-based or rule-based algorithm, Bénard et al. state in [1] that

"A rule learning algorithm is stable if two independent estimations based on two independent samples result in two similar lists of rules."

Unfortunately, for continuous variables, the probability that a tree cut occurs at exactly the same value for a given rule, evaluated on two independent samples, is zero. For this reason, pure stability appears too penalizing in this case.

In order to avoid this bias, I translate the rules into a discretized version of the feature space. I discretize each feature into q bins using the q-quantile ranges. Then, for each rule, the bounds of its conditions are replaced by their corresponding bins. Usually, q=10 is a good choice.

Finally, using the so-called Sørensen–Dice coefficient on the two sets of rules generated by the same algorithm on two independent samples, I obtain a stability value between 0 and 1, where 1 means that the two sets of rules are identical.
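For two discretized rule sets R₁ and R₂, this coefficient reads

$$\mathrm{Stab} \;=\; \frac{2\,|R_1 \cap R_2|}{|R_1| + |R_2|}.$$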

How to evaluate the simplicity?

Simplicity can be construed as the capacity to audit the prediction easily. A simple model makes it easy to check qualitative criteria such as ethics and morals. To measure the simplicity of a model generated by a tree-based or rule-based algorithm, I use the interpretability index defined in [2], which is the sum of the lengths of the rules of the generated model. It should not be confused with the interpretability I define in this article.

To have a measure between 0 and 1, I evaluate the simplicity of an algorithm relative to a set of algorithms: its simplicity is the ratio of the minimum of the interpretability indexes of the considered algorithms to the interpretability index of this algorithm. Thus, it is better to have a small set of short rules than the opposite.
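In other words, for an algorithm A within a set of candidate algorithms 𝒜 with interpretability index Int(·),

$$\mathrm{Simp}(A) \;=\; \frac{\min_{A' \in \mathcal{A}} \mathrm{Int}(A')}{\mathrm{Int}(A)}.$$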

How to evaluate interpretability?

Finally, considering a set of tree-based and rule-based algorithms, I am able to point out the most interpretable one by using a convex combination of the predictivity, the stability and the simplicity. The coefficients of the combination can be chosen according to your desiderata. For instance, if you are trying to describe a phenomenon, simplicity and stability are more important than predictivity (provided the latter remains acceptable).
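Formally, the resulting score is a convex combination of the three components,

$$\mathcal{I} \;=\; \alpha_1\,\mathrm{Pred} + \alpha_2\,\mathrm{Stab} + \alpha_3\,\mathrm{Simp}, \qquad \alpha_i \ge 0,\ \ \alpha_1+\alpha_2+\alpha_3 = 1.$$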

In [2], the authors define interpretability as following:

"In the context of ML systems, we define interpretability as the ability to explain or to present in understandable terms to a human."

I state that an algorithm with high predictivity, stability and simplicity is interpretable in the sense of the previous definition. Indeed, high predictivity ensures trust, a high stability score ensures robustness, and high simplicity ensures that the generated model can easily be understood by humans because it relies on a limited number of short rules.


Conclusions

In this article, I present a quantitative criterion to compare the interpretability of tree-based algorithms and rule-based algorithms. This measure is based on the triptych predictivity, stability and simplicity. This concept of interpretability has been designed to be fair and rigorous, and it can be adapted to the various desiderata of the statistician by choosing appropriate coefficients in the convex combination.

The methodology described provides a fair measure for ranking the interpretability of a set of algorithms. In fact, it integrates the main goals of interpretability: an algorithm designed to be accurate, stable or simple should keep this property whatever the dataset.

However, according to the definition of simplicity, 100 rules of length 1 have the same simplicity as a single rule of length 100, which is debatable. Moreover, the stability measure is purely syntactic and rather restrictive. Indeed, if some features are duplicated, two rules may have different syntactic conditions and yet be otherwise identical in terms of their activations. One way of relaxing this stability criterion could be to compare the rules based on their activation sets (i.e., by looking at the observations where their conditions are met simultaneously).

If you want more details about this measure of interpretability, I refer you to the working paper [4].


Thanks to Ygor Rebouças Serpa for his remarks and comments.


References

[1] C.Bénard and G.Biau and S.da Veiga and E.Scornet, SIRUS: Stable and Interpretable RUle Set (2020), ArXiv

[2] F.Doshi-Velez and B.Kim, Towards A Rigorous Science of Interpretable Machine Learning (2017), ArXiv

[3] V.Margot and J.P.Baudry and F.Guilloux and O.Wintenberger, Consistent Regression using Data-Dependent Coverings (2020), ArXiv

[4] V.Margot and G.Luta, A rigorous method to compare interpretability (2020), ArXiv

[5] C.Molnar, Interpretable Machine Learning (2020), Lulu.com

[6] B.Yu and K.Kumbier, Veridical data science (2020), Proceedings of the National Academy of Sciences


About Us

Advestis is a European Contract Research Organization (CRO) with a deep understanding and practice of statistics, and interpretable machine learning techniques. The expertise of Advestis covers the modeling of complex systems and predictive analysis for temporal phenomena.

LinkedIn: https://www.linkedin.com/company/advestis/

The post How to measure interpretability? appeared first on Towards Data Science.
