MIL-OSI Russia: Scientists present new method for working with unbalanced data

Translation. Region: Russian Federation –

Source: State University Higher School of Economics –

Specialists from the Faculty of Computer Science at HSE University and Sber’s Artificial Intelligence Laboratory have developed a geometric data-augmentation method called Simplicial SMOTE. Tests on a variety of datasets show that it significantly improves the quality of AI models. The method is especially useful in situations where rare cases matter most, for example, in fraud detection or in the diagnosis of rare diseases. The research results are available in the open archive arXiv.org and will be presented at the International Conference on Knowledge Discovery and Data Mining (KDD) in Toronto in summer 2025.

The problem of imbalanced data is becoming increasingly important in various fields, including banking and medicine. Traditional methods, such as random duplication or global resampling, often produce low-quality synthetic samples or model the rare class poorly.

The new method proposed by scientists from the Higher School of Economics and Sberbank, Simplicial SMOTE (Synthetic Minority Oversampling Technique), addresses these problems: it models complex topological data structures more accurately and improves the quality of classifiers on imbalanced datasets.

The method creates new examples of the rare class using information from several nearby examples (a simplex), rather than from just two close points, as in the original SMOTE and its well-known variants. This captures the structure of the data better and improves the performance of AI models. The method helps train artificial intelligence on imbalanced data, that is, in situations where one class has many examples (for example, normal transactions) but the other has few (for example, fraud).
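The article does not reproduce the authors' exact algorithm, but the core idea of interpolating inside a simplex rather than along a segment can be illustrated with a minimal NumPy sketch. The function name `simplex_oversample` and the specific sampling scheme (Dirichlet-distributed barycentric weights over a point and its k nearest minority neighbours) are illustrative assumptions, not the published method:

```python
import numpy as np

def simplex_oversample(X_min, n_new, k=3, rng=None):
    """Illustrative simplex-based oversampling (NOT the authors' exact method).

    For each synthetic point: pick a random minority sample, take its k
    nearest minority neighbours as simplex vertices, and draw a random
    convex combination (barycentric weights from a Dirichlet distribution).
    Classic SMOTE is the special case of interpolating between just 2 points.
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    new_points = []
    for _ in range(n_new):
        i = rng.integers(n)
        nbrs = np.argsort(d[i])[:k]                  # k nearest neighbours
        verts = np.vstack([X_min[i], X_min[nbrs]])   # simplex vertices
        w = rng.dirichlet(np.ones(len(verts)))       # barycentric weights
        new_points.append(w @ verts)                 # point inside the simplex
    return np.array(new_points)
```

Because each synthetic point is a convex combination of real minority samples, it always lies inside the convex hull of its simplex vertices.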

The researchers have experimentally demonstrated, on a large number of test datasets, that the proposed approach significantly improves the quality metrics (F1 score, Matthews correlation coefficient) over both the basic SMOTE and its modifications. In particular, an improvement was recorded for gradient boosting, a classifier widely used in practice.

“Our method is especially effective in tasks where imbalanced data is common and the rare class is the more important one. Banks can use Simplicial SMOTE to better detect fraud, and medical centers to diagnose rare diseases,” comments one of the authors of the article, Andrey Savchenko, a leading researcher at the Laboratory of Theoretical Foundations of Artificial Intelligence Models, Institute of Artificial Intelligence and Digital Sciences, Faculty of Computer Science, National Research University Higher School of Economics.

The new method can be integrated into existing oversampling algorithms (Borderline-SMOTE, Safe-level-SMOTE, and ADASYN), increasing their accuracy without significantly increasing computational complexity. The researchers believe that the developed approach can contribute to the development of more accurate and reliable machine learning models and, therefore, to improved analytics.

Please note: This information is raw content directly from the source of the information. It is exactly what the source states and does not reflect the position of MIL-OSI or its clients.

MIL OSI Russia News