
Do You Know the Logical Analysis of Data Methodology (LAD)?

Let's take a quick look at it

Image by geralt from Pixabay.

I recently discovered an area of data analysis called Logical Analysis of Data (LAD), which began in 1986 with the work of Peter L. Hammer. When I asked around my circle, no one had heard of it. So I decided to write a short introduction to this original data analysis methodology. This article is based on Chapter 3 of [1] and on the article [2].

LAD is an interpretable binary classification method based on a mixture of optimization, Boolean functions, and combinatorial theory. But rest assured: no prerequisites are needed to understand the basics of this theory. It is a competitive classification method for analysing datasets of binary observations, and it offers clear explanations through its concept of patterns.

LAD extracts a large collection of patterns from the dataset, some of which are characteristic of the observations with a positive classification, while the others are characteristic of the observations with a negative classification. The collection is then filtered to extract a smaller, non-redundant collection of patterns. This reduction makes it possible to provide a comprehensible explanation for each classification.


Introduction

To understand what LAD is, let me give a simple example. A doctor wants to find out which foods cause headaches in his patient. For this purpose, the patient records his diet for one week, as shown in the following table.

Diet record. Table from author.

A quick analysis leads the doctor to two conclusions. First, he notes that on the days without a headache the patient never ate food n°2 without food n°1, while he did so on some of the days with a headache. He notes the same behaviour for food n°4 without food n°6. Hence, he concludes that his patient’s headaches can be explained by these two patterns.

Without knowing it, the doctor performed a LAD on the diet records dataset. In fact, during his analysis, he had to answer the following three questions: (1) How to extract a short list of features (i.e., food items) sufficient to classify the data (i.e., explain the occurrence of headaches)? (2) How to detect patterns (i.e., combinations of food items) causing headaches? (3) How to build theories (i.e., collections of patterns) explaining every observation? These questions summarize the LAD methodology.


Notations and definitions

Logical Analysis of Data is based on the notion of partially defined Boolean functions (pdBf) and on the concept of patterns.

We set B = {0, 1}. The set Bⁿ, usually called the Boolean hypercube of dimension n, consists of all possible binary sequences of length n. We define a partial order on Bⁿ as follows: a = (a₁, …, aₙ) ≤ b = (b₁, …, bₙ) if and only if aᵢ ≤ bᵢ for each i = 1, …, n.

A subset S ⊆ Bⁿ is called a subcube of Bⁿ if and only if |S| = 2ᵏ for some k ≤ n and there are n-k components in which all sequences of S coincide. For example, S = {(0,0,0), (0,0,1), (0,1,0), (0,1,1)} is a subcube of B³ with k = 2: all four sequences coincide in their first component.
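To make these definitions concrete, here is a minimal Python sketch (the helper names `hypercube` and `leq` are mine, not standard LAD code) that enumerates Bⁿ and tests the partial order:

```python
from itertools import product

def hypercube(n):
    """All 2^n binary sequences of length n, i.e. the Boolean hypercube B^n."""
    return list(product((0, 1), repeat=n))

def leq(a, b):
    """Partial order on B^n: a <= b iff a_i <= b_i for every component i."""
    return all(ai <= bi for ai, bi in zip(a, b))

B3 = hypercube(3)                    # the 8 points of B^3
print(leq((0, 0, 1), (0, 1, 1)))     # True: componentwise dominated
print(leq((1, 0, 0), (0, 1, 1)))     # False: incomparable points

# The subcube from the text: all points of B^3 whose first component is 0.
S = [p for p in B3 if p[0] == 0]     # 4 = 2^2 points, so k = 2
print(S)
```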

What is a partially defined Boolean function?

A Boolean function, f, is a mapping from Bⁿ to B. There are 2^(2ⁿ) possible Boolean functions. For a Boolean function f, we set T = T(f) = {a ∈ Bⁿ such that f(a)=1} and F = F(f) = {a ∈ Bⁿ such that f(a)=0}, the sets of the true points and the false points of f respectively. A partially defined Boolean function (pdBf) is given by a pair of disjoint sets T and F: the function is only known on T ∪ F, and some elements of Bⁿ belong neither to T nor to F. We can rewrite the diet record table as a pdBf as follows:

A partially defined Boolean function. Table from author.

In this table, points 1, 5 and 7 belong to T (true points) and points 2, 3, 4 and 6 belong to F (false points). But the sequence (1,1,1,1,1,1,1,1) belongs neither to T nor to F; it would be interesting to be able to predict its value.

What is a pattern?

To define patterns, I must introduce the notion of term. We define the complement of x ∈ B as x⁻ = 1-x. A term is a product of variables and/or their complements, and the degree of a term is the number of elements in it, called literals. For example, let t be the term of degree 3 t = x⁻₁x₂x₃; then t(0,0,1) = (1-0)×0×1 = 0. It is important to notice that t(a) is defined for any a ∈ Bⁿ even if the degree of t is less than n: just ignore the variables that do not appear in the term. If t = x₂x⁻₃, then t(0,0,1) = 0×(1-1) = 0. If t(a) = 1, we say that t covers the point a.
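As a quick illustration, a term can be represented as a mapping from variable indices to required values (1 for the literal xᵢ, 0 for its complement); this encoding is my own choice for the sketch, not standard library code:

```python
def eval_term(term, point):
    """Evaluate a term on a binary point.

    `term` maps a 1-based variable index i to the required value:
    1 for the literal x_i, 0 for its complement. Variables absent
    from the term are simply ignored, as in the text.
    """
    return int(all(point[i - 1] == v for i, v in term.items()))

t1 = {1: 0, 2: 1, 3: 1}           # the term xated_1 x2 x3 from the text
print(eval_term(t1, (0, 0, 1)))   # 0, because the literal x2 is not satisfied

t2 = {2: 1, 3: 0}                 # the term x2 xated_3
print(eval_term(t2, (0, 0, 1)))   # 0, because neither literal is satisfied
```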

Patterns are the central elements of LAD. A term t is called a positive (negative) pattern of a pdBf f if and only if (1) t(X) = 0 for all X ∈ F(f) (respectively X ∈ T(f)), and (2) t(X) = 1 for at least one X ∈ T(f) (respectively X ∈ F(f)). For instance, the term x⁻₁x₂ equals 1 if and only if x₁ = 0 and x₂ = 1. In the previous example, this term covers only points 1 and 7, which are in T(f); it is therefore a positive pattern of f. One can notice that x₄x⁻₆ is also a positive pattern of f. These are the two patterns found in the doctor’s analysis. Symmetrically, terms covering only points of F(f) are called negative patterns of f.
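This definition translates directly into code. Since the diet table is only shown as an image, the sets T and F below are a small made-up pdBf, not the article’s data:

```python
def covers(term, point):
    """A term covers a point when every literal of the term is satisfied."""
    return all(point[i - 1] == v for i, v in term.items())

def is_positive_pattern(term, T, F):
    """Condition (1): covers no false point; condition (2): covers a true point."""
    return (not any(covers(term, x) for x in F)
            and any(covers(term, x) for x in T))

# Illustrative pdBf on B^3 (toy data, not the diet table).
T = {(0, 1, 1), (0, 1, 0)}   # true points
F = {(1, 0, 0), (1, 1, 0)}   # false points

print(is_positive_pattern({1: 0, 2: 1}, T, F))   # True: covers only true points
print(is_positive_pattern({2: 1}, T, F))         # False: also covers (1, 1, 0) in F
```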

To compare patterns, we need to introduce some reasonable criteria of suitability: Simplicity, Selectivity, and Evidence. Each criterion defines a simple partial preorder on patterns, called a preference. Let P be a pattern and f a pdBf; then we have:

Simplicity: The simplicity of P is evaluated considering the set of literals of P. It is the σ preference, and we define P₂ ≤_σ P₁ if and only if the set of literals of P₁ is a subset of that of P₂.

Selectivity: The selectivity of P is evaluated considering the subcube of P (i.e., the subset of points of Bⁿ covered by P). It is the Σ preference, and we define P₂ ≤_Σ P₁ if and only if the subcube of P₁ is a subset of the subcube of P₂.

Evidence: The evidence of P is evaluated considering the set of true points of f covered by P. It is the ϵ preference, and we define P₂ ≤_ϵ P₁ if and only if the set of the true points covered by P₂ is a subset of those covered by P₁ (a pattern covering more true points is preferred).

We can also combine preferences using intersection ∩ and lexicographic refinement |. Let λ and γ be two preferences, then we have:

  • P₁ ≤_(λ ∩ γ) P₂ if and only if P₁ ≤_λ P₂ and P₁ ≤_γ P₂.
  • P₁ ≤_(λ | γ) P₂ if and only if either P₁ <_λ P₂, or P₁ ≈_λ P₂ and P₁ ≤_γ P₂.
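In code, the three basic preferences all reduce to subset tests. Here is a minimal sketch (using the dict encoding of terms from above; the function names are mine), where each function returns True when P₁ is weakly preferred to P₂:

```python
from itertools import product

def subcube(term, n):
    """Points of B^n covered by a term (brute force, fine for small n)."""
    return {p for p in product((0, 1), repeat=n)
            if all(p[i - 1] == v for i, v in term.items())}

def covered_true(term, T):
    """True points of the pdBf covered by a term."""
    return {x for x in T if all(x[i - 1] == v for i, v in term.items())}

def simpler(p1, p2):
    """Sigma preference: P1 preferred when its literal set is a subset of P2's."""
    return set(p1.items()) <= set(p2.items())

def more_selective(p1, p2, n):
    """Capital-sigma preference: P1 preferred when its subcube is inside P2's."""
    return subcube(p1, n) <= subcube(p2, n)

def more_evidence(p1, p2, T):
    """Epsilon preference: P1 preferred when it covers at least P2's true points."""
    return covered_true(p2, T) <= covered_true(p1, T)
```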

Let λ be a preference. A pattern P is called a Pareto-optimal pattern if and only if there is no pattern P’ ≠ P such that P ≤_λ P’. Unfortunately, the concept of optimal pattern does not have a unique definition. The table below summarizes the properties of patterns that are Pareto-optimal with respect to the preferences and combinations of preferences.

Types of Pareto-optimality. Table from author.

In the headache example, the term x₂x₅x₈ is a positive pattern of f (it covers points 5 and 7). But it is not a Minterm. Indeed, x₂x₅ is also a positive pattern and the subcube of x₂x₅x₈ is included in that of x₂x₅. Here, x₂x₅ is a Minterm positive pattern.

Considering the positive pattern x₂x₅x⁻₆x⁻₇x₈, one can check that there is no i ∈ {1,3,4} such that x₂x₅x⁻₆x⁻₇x₈xᵢ is a pattern. Thus, x₂x₅x⁻₆x⁻₇x₈ is a Prime positive pattern.

Moreover, x₂x₅ and x₂x₅x⁻₆x⁻₇x₈ are Strong positive patterns: no pattern covers a strictly larger set of true points of the example. Hence, x₂x₅ is a Spanned positive pattern (Minterm and Strong) and x₂x₅x⁻₆x⁻₇x₈ is a Strong Prime positive pattern.


Methodology of LAD

Now we are equipped to go through the methodology of LAD. The main objective of LAD is to find an extension of a pdBf (as defined previously). But there exist many ways to extend such a function; the difficulty is to find a "good" one, called a theory. To do so, LAD proceeds in three main steps: (1) turning the dataset into a pdBf, which is called the binarization process; (2) detecting suitable patterns, as the doctor did in the introduction; (3) forming a theory, i.e., extracting the best extension of the pdBf with respect to all the detected patterns.

Binarization.

The first step is the binarization of the data. Many real datasets contain numeric and nominal data. An easy way to binarize nominal data is one-hot encoding, which is fast and converts the data as expected. The binarization of numeric data is a bit more complicated. A common method is to choose critical values, or cutpoints, and use indicator features. Either the nature of the feature suggests the choice of cutpoints (such as BMI in medicine), or the choice is harder and remains open to improvement. Unfortunately, the binarization process increases the dimension of the dataset by adding many features, so it is important to reduce the dimension with a feature selection step.
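Both techniques fit in a few lines of pandas. The toy DataFrame, its column names, and the BMI cutpoints below are illustrative choices of mine, not from the article:

```python
import pandas as pd

# Toy dataset (columns and values are made up for the example).
df = pd.DataFrame({
    "food_type": ["dairy", "meat", "dairy", "fish"],   # nominal feature
    "bmi":       [22.5,    31.0,   27.8,    24.1],     # numeric feature
})

# Nominal data: one-hot encoding.
binary = pd.get_dummies(df["food_type"], prefix="food").astype(int)

# Numeric data: one indicator feature per cutpoint (25 and 30 are the usual
# BMI thresholds for overweight and obesity).
for cut in (25, 30):
    binary[f"bmi>={cut}"] = (df["bmi"] >= cut).astype(int)

print(binary)
```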

Detecting patterns.

As you can imagine, this part of the process is still an open area of research, and the choice of the algorithm used to generate patterns depends on what kind of patterns you want to find. Here, I present an algorithm, introduced in [4], to extract positive prime patterns; the algorithm for negative patterns is similar.

The idea is simple: test terms from degree 1 up to degree D and keep those that are positive patterns. Terms covering only negative points are discarded, while terms covering both positive and negative points are kept as candidates for the next degree.
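As a rough illustration of this bottom-up idea, here is a self-contained Python sketch. It is my own simplified rendering, not the exact algorithm of [4]; in particular, it does not prune the output down to prime patterns only:

```python
def positive_patterns(T, F, n, D):
    """Bottom-up search for positive patterns up to degree D.

    A candidate term is a dict {variable index: required value}.
    Terms covering only true points are recorded as patterns; terms
    covering both true and false points are extended at the next
    degree; all other terms are dropped.
    """
    covers = lambda t, x: all(x[i - 1] == v for i, v in t.items())
    candidates, patterns = [{}], []          # start from the empty degree-0 term
    for _ in range(D):
        nxt = []
        for t in candidates:
            start = max(t) + 1 if t else 1   # extend with new variables only
            for i in range(start, n + 1):
                for v in (0, 1):             # literal x_i or its complement
                    u = {**t, i: v}
                    hits_T = any(covers(u, x) for x in T)
                    hits_F = any(covers(u, x) for x in F)
                    if hits_T and not hits_F:
                        patterns.append(u)   # a positive pattern
                    elif hits_T and hits_F:
                        nxt.append(u)        # ambiguous: extend next round
        candidates = nxt
    return patterns

T = {(0, 1, 1), (0, 1, 0)}   # toy true points
F = {(1, 0, 0), (1, 1, 0)}   # toy false points
print(positive_patterns(T, F, n=3, D=2))
# [{1: 0}, {3: 1}, {2: 1, 3: 1}]
```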

There are several algorithms for generating different types of suitable patterns. I refer you to [2] and [3] if you want a survey of them.

Forming theory.

At this stage, we have a lot of patterns: certainly too many to build the interpretable model that is one of the advantages of LAD. So, we need to select a suitable subset of the generated patterns. The selected set should have a reasonably small number of elements, but enough of them to distinguish well between the positive and negative observations. This selection process can be expressed as an optimization problem.

For each positive pattern, pₖ, we assign a binary variable yₖ such that yₖ=1 if and only if pₖ is included in the final model. And for each positive observation tᵢ we define a binary vector (tᵢ¹, tᵢ², …, tᵢᵖ), where p is the number of positive patterns and tᵢᵏ=1 if and only if pₖ covers tᵢ.
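With this notation, the basic selection problem can be written as a classical set-covering program:

    minimize   y₁ + y₂ + … + yₚ
    subject to tᵢ¹y₁ + tᵢ²y₂ + … + tᵢᵖyₚ ≥ 1 for every positive observation tᵢ,
               yₖ ∈ {0, 1} for k = 1, …, p.

In words: select as few patterns as possible while ensuring that every positive observation is covered by at least one selected pattern. A symmetric program is solved for the negative patterns.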

This optimization problem can easily be adapted to your needs with small variations. The value 1 on the right-hand side of the constraints can be increased to require at least two (or more) covering patterns per decision. We can also attach a weight wₖ to each pattern and minimize ∑ wₖyₖ; for example, the weights can depend on the degree of each pattern, to promote patterns with few literals, which are easier for humans to interpret.
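To make this concrete, here is a minimal sketch solving the basic program with the open-source ILP solver PuLP (pip install pulp). The coverage matrix is made up for the example; in practice it comes from the pattern generation step:

```python
import pulp

# cover[i][k] = 1 iff positive pattern k covers positive observation i
# (made-up values for the sake of the example).
cover = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
p = len(cover[0])

prob = pulp.LpProblem("pattern_selection", pulp.LpMinimize)
y = [pulp.LpVariable(f"y{k}", cat="Binary") for k in range(p)]
prob += pulp.lpSum(y)                                         # as few patterns as possible
for row in cover:
    prob += pulp.lpSum(row[k] * y[k] for k in range(p)) >= 1  # every observation covered

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([k for k in range(p) if y[k].value() == 1])   # e.g. [2]: one pattern covers everything
```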


Conclusion

Logical Analysis of Data is a very interesting classification method that is still an active research area. LAD is based on the concept of patterns, which both classify new observations and provide an understandable explanation for each classification.

LAD has numerous applications in medicine, such as breast cancer prognosis, ovarian cancer detection, and the analysis of survival data, among others (see [2] for an overview). And I think it can easily be applied in many other fields as well.

Unfortunately, LAD seems to suffer from a lack of visibility, and I hope this short article will spark your interest. It is an elegant way to generate an interpretable classification model, different from the usual tree-based and rule-based algorithms. An article with Python applications will follow.


References

[1] I. Chikalov, V. Lozin, I. Lozina, M. Moshkov, H.S. Nguyen, A. Skowron, and B. Zielosko. Three approaches to data analysis: Test theory, rough sets and logical analysis of data (2012) Vol. 41. Springer Science & Business Media.

[2] G. Alexe, S. Alexe, T.O. Bonates, and A. Kogan. Logical analysis of data: the vision of Peter L. Hammer. (2007) Annals of Mathematics and Artificial Intelligence 49, no. 1: 265–312.

[3] P.L. Hammer, A. Kogan, B. Simeone, and S. Szedmák. Pareto-optimal patterns in logical analysis of data. (2004) Discrete Applied Mathematics 144, no. 1–2: 79–102.

[4] E. Mayoraz. C++ tools for Logical Analysis of Data. (1995) RUTCOR Research Report: 1–95.


About Us

Advestis is a European Contract Research Organization (CRO) with a deep understanding and practice of statistics and interpretable machine learning techniques. The expertise of Advestis covers the modeling of complex systems and predictive analysis for temporal phenomena.

LinkedIn: https://www.linkedin.com/company/advestis/

