
Title:
A resistance outlier sampling algorithm for imbalanced data prediction
Source:
Intelligent Data Analysis. 26:583-598
Publisher Information:
SAGE Publications, 2022.
Publication Year:
2022
Document Type:
Journal Article
ISSN:
1571-4128
1088-467X
DOI:
10.3233/ida-211519

A resistance outlier sampling algorithm for imbalanced data prediction 

Classification of imbalanced data is an important challenge in current research. Sampling is an important way to address imbalanced data classification, but some traditional sampling algorithms are susceptible to outliers. Therefore, an iF-ADASYN sampling algorithm is proposed in this paper. First, based on the ADASYN algorithm, we introduce the isolation Forest algorithm to overcome ADASYN's vulnerability to outliers. Then, an anomaly-index calculation method that can accurately delete outliers from the minority data is presented. Experimental results on four UCI public imbalanced datasets show that the algorithm can effectively improve the accuracy of the minority class and increase stability. On the real thrombosis dataset, the AUC value of the iF-ADASYN algorithm is higher than that of the SMOTE and ADASYN algorithms, and the recognition rate of patients with thrombosis increased by 20%. The iF-ADASYN algorithm is more resistant to outliers than the original ADASYN algorithm, and it also improves the accuracy of the minority-class decision boundary region division.

Keywords: Imbalanced data; sampling algorithm; outlier; thrombosis risk prediction

1. Introduction

The imbalanced dataset problem refers to datasets in which the classes occur with very different frequencies, which makes classification difficult. Such imbalance arises in almost every research area of data science. For example, when predicting the distribution of animal species, only a small part of the world is suitable habitat, while the rest is dominated by human activity; and compared with the full scope of patent publications, only a few inventions become widely recognized. In this paper, the number of patients with thrombosis in the real dataset is very small, which causes a serious data imbalance. However, the mortality of these patients is very high and misclassification has serious consequences, so we pay particular attention to the minority class of patients with thrombosis.

In the orthopedic postoperative thrombosis problem, each patient is labeled as "positive class" or "negative class", representing thrombosis patients and non-thrombosis patients respectively. The number of negative-class samples in the actual dataset is much larger than the number of positive-class samples [[1]]. Most machine learning algorithms are designed to minimize the overall classification error rate, so the final classification results tend toward the negative class [[2]]. However, in practical problems, especially in disease detection, it is costly to misidentify patients as healthy, so the class imbalance has a great impact on the accuracy of predicting patients with thrombosis. Thrombosis after orthopedic surgery is a highly insidious condition with a high rate of sudden death [[3]]. If thrombosis [[4]] is not treated in time, the consequences are serious: a thrombus that detaches into the pulmonary circulation can cause acute pulmonary embolism and even induce respiratory failure. With the rapid development of machine learning technology, it is therefore valuable to predict the risk of postoperative thrombosis with appropriate algorithms, which can help prevent thrombus formation and reduce the risk to patients.

He et al. [[5]] proposed the adaptive synthetic (ADASYN) sampling approach for learning from imbalanced datasets. The method adaptively synthesizes minority samples according to the distribution of the minority class, and thus overcomes the limitations of the SMOTE algorithm when synthesizing samples. He et al. applied ADASYN to imbalanced data, and the recognition sensitivity of heart sound anomalies was improved by 58.6–84.4%. The algorithm reduces the learning bias introduced by the original imbalanced data distribution and adaptively shifts the decision boundary toward samples that are difficult to classify. Its key characteristic is that it determines the number of samples to be synthesized according to a weight, rather than synthesizing the same number of samples for every minority sample as SMOTE does, and it therefore performs better than SMOTE. The ADASYN algorithm mainly samples minority-class points with high weight: the more majority-class samples surround a minority point, the higher its weight. However, some high-weight minority points are abnormal data. If a large amount of abnormal data is used to generate new samples, the classifier cannot correctly predict the classification boundary, resulting in classification errors; ADASYN is therefore easily affected by outliers.

Mandhare et al. [[6]] studied and compared the popular outlier detection algorithms, namely clustering-based, distance-based, and density-based outlier detection; depending on their parameters, they can be applied to different specific fields. Cai et al. [[5]] proposed an outlier detection method based on two-stage minimum weighted rare pattern mining, called MWRPM-Outlier, and the experimental results show that it performs very well in anomaly detection.

Existing sampling algorithms address an imbalanced dataset only at the data level, by synthesizing or deleting samples, and do not take into account the abnormal data present in real data. Most data contain errors introduced during collection or measurement, which produce abnormal points. When there are outliers in the minority data, traditional sampling algorithms are insensitive to them. If these outliers are ignored, the samples synthesized by the sampling algorithm will follow an incorrect distribution, the synthesized data will not be conducive to classification, and the classification performance of the machine learning algorithm will be significantly reduced.

In this paper, an outlier-resistant sampling algorithm, iF-ADASYN, based on the isolation Forest (iForest) [[7]], is proposed. To address the vulnerability of the ADASYN algorithm to outliers, the iForest algorithm is used to improve ADASYN, and machine learning algorithms are then used to classify the resampled data. In the proposed method, the iF-ADASYN algorithm handles the imbalanced dataset, eliminates the influence of outliers on the sampling algorithm, and samples the minority-class data to improve the data distribution and increase the accuracy of the algorithm. The main contributions of this paper are as follows:

1) Based on the ADASYN algorithm, we propose the iF-ADASYN sampling algorithm, which introduces the iForest algorithm to overcome ADASYN's susceptibility to outliers.

2) An anomaly-index calculation method is proposed; by evaluating the outlier degree of minority-class samples in the iForest, it can accurately delete outliers from the sample data.

3) We use four UCI public datasets for comparative experiments and apply the iF-ADASYN algorithm to a real thrombosis dataset, completing the prediction of thrombosis with better classification results. This assists medical work, provides reference results for thrombosis prediction, and reduces the harm of thrombosis after orthopedic surgery.

2. Related works

At present, research on processing imbalanced datasets focuses on two aspects: sampling methods and classification algorithms. Sampling methods try to balance positive and negative classes by modifying the distribution of the dataset itself, and can be divided into under-sampling, over-sampling, and hybrid sampling. Undersampling randomly removes samples from the majority class to balance the data, and the removed samples are simply discarded [[8]]; it therefore obtains an equal number of samples per class and makes classifier training faster. Examples are the condensed nearest neighbor rule [[9]] and one-sided selection [[10]]. These methods reduce the number of negative-class samples to balance the data, but may lose important information contained in the data. Wu et al. [[12]] extracted the support vectors that play a key role in classification according to class overlap and proposed an undersampling method based on class overlap, which effectively overcomes the tendency of undersampling to lose important sample information. Galar et al. [[13]] proposed combining random undersampling with Boosting algorithms to deal with imbalanced data distributions.

The over-sampling method changes the distribution of the dataset by adding positive samples. The Synthetic Minority Over-sampling Technique (SMOTE) is an over-sampling method proposed by Chawla et al. [[13]]. It improves on random over-sampling: for each positive sample it randomly selects several neighboring samples and then randomly picks points on the lines between the sample and those neighbors to generate non-repetitive positive samples. The design of SMOTE means that it does not adapt to the data distribution of an imbalanced dataset, which easily causes over-generalization problems [[15]]. Zheng et al. [[16]] proposed SNOCC, a technique that overcomes this limitation of SMOTE; it creates new samples while ensuring that the generated samples can find new nearest neighbors, and it outperformed SMOTE and other methods in experiments. Nekooeimehr et al. [[17]] proposed an adaptive semi-unsupervised weighted over-sampling (A-SUWO) method, which clusters the minority class, determines the number of synthetic samples, and assigns weights to generate samples. Douza et al. [[18]] proposed a simple and effective over-sampling method based on K-means and SMOTE, which avoids generating noise samples and effectively overcomes between-class and within-class imbalance. Kunakorntum et al. [[19]] then proposed SyMProD, an over-sampling method based on probability distributions; it standardizes the data with Z-scores, removes noise data, and then selects minority-class samples according to the probability distribution of the samples. This method avoids noise generation and reduces the possibility of class overlap and over-fitting.

The hybrid sampling method applies two resampling techniques together to achieve a balanced dataset. Qian et al. [[20]] and Charte et al. [[21]] proposed hybrid sampling techniques. Wang et al. [[22]] designed a sampling algorithm (UCO) based on random undersampling, K-means, and SMOTE to predict heart attacks from stroke-patient data. The UCO algorithm reduces the imbalance ratio of the dataset and generates nearly balanced data, and five classifiers were used to predict heart attacks, with the RandomForest classifier achieving high classification accuracy.

Because traditional sampling methods do not consider the influence of outliers when processing imbalanced data, the classification results cannot be improved further. Therefore, this paper proposes an anomaly-index calculation method that evaluates the outlier degree of minority-class samples through the isolation Forest and accurately deletes the outliers from the sample data, improving the final classification results.

3. Theoretical basis

3.1 ADASYN sampling algorithm

The ADASYN algorithm is an adaptive method for synthesizing minority data: it adaptively synthesizes minority samples according to the distribution of the minority class. Following the distribution of sample weights, fewer samples are synthesized in regions that are easy to classify and more samples in regions that are difficult to classify. The key of the algorithm is the weight <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>r</mi><mi>i</mi></msub></math> , which determines how many samples ADASYN synthesizes for each minority-class sample.

<math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>D</mi><mo>=</mo><mrow><mo stretchy="false">{</mo><mrow><mo>(</mo><msub><mi>x</mi><mn>1</mn></msub><mo>,</mo><msub><mi>y</mi><mn>1</mn></msub><mo>)</mo></mrow><mo>,</mo><mrow><mo>(</mo><msub><mi>x</mi><mn>2</mn></msub><mo>,</mo><msub><mi>y</mi><mn>2</mn></msub><mo>)</mo></mrow><mo>,</mo><mrow><mo>(</mo><msub><mi>x</mi><mn>3</mn></msub><mo>,</mo><msub><mi>y</mi><mn>3</mn></msub><mo>)</mo></mrow><mo>,</mo><mi mathvariant="normal">...</mi><mo>,</mo><mrow><mo>(</mo><msub><mi>x</mi><mi>N</mi></msub><mo>,</mo><msub><mi>y</mi><mi>N</mi></msub><mo>)</mo></mrow><mo stretchy="false">}</mo></mrow></mrow></math> , Where <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>x</mi><mi>N</mi></msub></math> is a sample of N-dimensional feature space <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>X</mi></math> , <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mi>y</mi><mi>N</mi></msub><mo>&#8712;</mo><mi>Y</mi><mo>=</mo><mrow><mo>{</mo><mn>1</mn><mo>,</mo><mn>0</mn><mo>}</mo></mrow></mrow></math> is a class label, <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mi>y</mi><mi>N</mi></msub><mo>=</mo><mi /></mrow></math> 1 is minority data, <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mi>y</mi><mi>N</mi></msub><mo>=</mo><mi /></mrow></math> 0 is majority data. The minority sample is <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>m</mi><mi>s</mi></msub></math> , and the majority sample is <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>m</mi><mi>l</mi></msub></math> . Therefore, <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mi>m</mi><mi>s</mi></msub><mo>&#10877;</mo><msub><mi>m</mi><mi>l</mi></msub></mrow></math> and <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mrow><msub><mi>m</mi><mi>s</mi></msub><mo>+</mo><msub><mi>m</mi><mi>l</mi></msub></mrow><mo>=</mo><mi>m</mi></mrow></math> , <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>m</mi></math> is the total number of data.

For each sample <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>x</mi><mi>i</mi></msub></math> belonging to the minority class, the Euclidean distance method is used to calculate its <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>k</mi></math> nearest neighbors, <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi mathvariant="normal">&#916;</mi><mi>i</mi></msub></math> is the number of samples belonging to the majority class in <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>k</mi></math> nearest neighbors:

(1) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mrow><msub><mi>r</mi><mi>i</mi></msub><mo>=</mo><mstyle displaystyle="true"><mfrac><msub><mi mathvariant="normal">&#916;</mi><mi>i</mi></msub><mi>k</mi></mfrac></mstyle></mrow><mo>,</mo><mrow><msub><mi>r</mi><mi>i</mi></msub><mo>&#8712;</mo><mrow><mo stretchy="false">[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy="false">]</mo></mrow></mrow></mrow></math>

where <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>i</mi><mo>=</mo><mi /></mrow></math> 1, 2, <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi mathvariant="normal">...</mi></math> , <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>N</mi></math> , <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>r</mi><mi>i</mi></msub></math> is the weight of minority class sample <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>i</mi></math> , <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>r</mi><mi>i</mi></msub></math> is normalized, <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mover accent="true"><mi>r</mi><mo stretchy="false">^</mo></mover><mi>i</mi></msub><mo>&#8712;</mo><mrow><mo stretchy="false">[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy="false">]</mo></mrow></mrow></math>

(2) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mover accent="true"><mi>r</mi><mo stretchy="false">^</mo></mover><mi>i</mi></msub><mo>=</mo><mstyle displaystyle="true"><mfrac><msub><mi>r</mi><mi>i</mi></msub><mrow><munderover><mo largeop="true" movablelimits="false" symmetric="true">&#8721;</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><msub><mi>m</mi><mi>s</mi></msub></munderover><msub><mi>r</mi><mi>i</mi></msub></mrow></mfrac></mstyle></mrow></math>

The number of synthetic samples <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>g</mi><mi>i</mi></msub></math> is calculated for each minority class of samples, where <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>G</mi></math> is the total number of synthetic samples.

(3) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mi>g</mi><mi>i</mi></msub><mo>=</mo><mrow><msub><mover accent="true"><mi>r</mi><mo stretchy="false">^</mo></mover><mi>i</mi></msub><mo>&#215;</mo><mi>G</mi></mrow></mrow></math>

From the three formulas above, if the proportion of majority-class samples among the <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>k</mi></math> nearest neighbors of a minority-class sample is higher, its weight <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>r</mi><mi>i</mi></msub></math> is larger, and correspondingly the samples synthesized around that minority point account for a larger share of the total number of synthesized samples <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>G</mi></math> . When there are no outliers in the imbalanced data, the ADASYN algorithm generates more new samples in the class-boundary area, where the minority and majority classes are difficult to distinguish; this enlarges the decision region of the minority class and allows the classifier to better separate the two classes of imbalanced data. If, however, there are minority-class outliers in the middle of the majority-class area, their weight <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>r</mi><mi>i</mi></msub></math> becomes very high, and the ADASYN algorithm will generate many synthetic samples around them. This confuses the minority-class boundary and reduces the accuracy of the classifier.
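
As an illustration of Eqs. (1)–(3), the following minimal sketch computes the ADASYN weights and the per-sample synthesis quota. It assumes NumPy arrays X (features) and y (labels, with 1 for the minority class) and scikit-learn's NearestNeighbors; the function name and parameters are illustrative, not the authors' code.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_weights(X, y, k=5, beta=1.0):
    # X: feature matrix, y: labels with 1 = minority class, 0 = majority class
    X_min = X[y == 1]
    m_s, m_l = len(X_min), int(np.sum(y == 0))
    # k nearest neighbours of each minority sample, searched over the whole dataset
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    idx = idx[:, 1:]                        # drop the query point itself
    delta = np.sum(y[idx] == 0, axis=1)     # majority-class neighbours per sample
    r = delta / k                           # Eq. (1)
    r_hat = r / r.sum()                     # Eq. (2): normalised weights
    G = (m_l - m_s) * beta                  # total number of samples to synthesize
    g = np.rint(r_hat * G).astype(int)      # Eq. (3): per-sample quota
    return r, r_hat, g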

3.2 Isolation Forest algorithm

The iForest algorithm is an unsupervised anomaly detection method suitable for continuous data. It was first proposed by Professor Zhi-Hua Zhou [[23]] of Nanjing University in 2008, and an improved version was proposed in 2012 [[23]]. The algorithm isolates samples using a binary-search-tree structure called an isolation tree (iTree). Because outliers are few and lie far from most samples, they are isolated earlier than normal data; that is, outliers end up closer to the root node of an iTree, while normal values end up farther from the root.

Figure 1. Illustration of a single tree (a) and an iForest (b).

Given a dataset of dimension N, the algorithm selects a random sub-sample to construct a binary tree. A branch of the tree is created by selecting a random feature <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>x</mi><mi>i</mi></msub></math> of the data, where <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>i</mi><mo>&#8712;</mo><mrow><mo>{</mo><mn>1</mn><mo>,</mo><mn>2</mn><mo>,</mo><mi mathvariant="normal">...</mi><mo>,</mo><mi>N</mi><mo>}</mo></mrow></mrow></math> (a single variable). Then a random value <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>v</mi></math> between the minimum and maximum of that dimension is selected. If the value of a given data point in that dimension is less than <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>v</mi></math> , the point is sent to the left branch; otherwise it is sent to the right branch. In this way, the data at the current node of the tree are divided in two. The branching process is executed recursively until a single point is isolated or the predetermined depth limit is reached. The process then starts again with a new random sub-sample to construct another random tree; after a large number of trees (i.e., the forest) have been constructed, training is complete. In the scoring step, a new data point traverses all the trees, and its anomaly score is predicted from the depth it reaches in each tree, as shown in Fig. 1. Figure 1a shows a tree after training: the red dotted line represents the trajectory of a single abnormal point along the tree, while the blue dotted line represents the trajectory of a normal point. The abnormal point is quickly isolated, whereas the normal point travels down to the maximum depth. Figure 1b shows a complete forest of 60 trees: each straight line from the center represents a tree, and the outer circle represents the maximum depth limit. The red line traces where a single abnormal point stops along each tree, and the blue line traces a normal point; the radius of the blue line is much larger than that of the red line. Based on this idea, the iForest can separate abnormal points from normal points.

An iForest is composed of multiple iTrees; the iTree construction steps are as follows:

1) Put the corresponding dataset into the root node of iTree.

2) We randomly take an attribute q, and randomly take a partition value <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>p</mi></math> (between the maximum and minimum values of attribute q) in this attribute.

3) In attribute q, samples with values less than p are placed on the left node of the current node, and the remaining samples are placed on the right node.

4) Perform operations 2) and 3) recursively on the child nodes, constructing new child nodes until the stopping conditions are satisfied.
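
For concreteness, the following is a compact sketch of steps 1)–4) as a recursive iTree builder; the dictionary-based node layout and the max_depth default are illustrative assumptions, not the authors' implementation.

import numpy as np

def build_itree(X, depth=0, max_depth=8):
    n = len(X)
    if n <= 1 or depth >= max_depth:       # a single point is isolated, or the depth limit is hit
        return {"size": n}                 # external (leaf) node
    q = np.random.randint(X.shape[1])      # step 2): random attribute q
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:                           # a constant attribute cannot be split
        return {"size": n}
    p = np.random.uniform(lo, hi)          # step 2): random partition value p
    left = X[X[:, q] < p]                  # step 3): values less than p go to the left node
    right = X[X[:, q] >= p]                # the remaining samples go to the right node
    return {"q": q, "p": p,                # step 4): recurse on both children
            "left": build_itree(left, depth + 1, max_depth),
            "right": build_itree(right, depth + 1, max_depth)}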

4. iF-ADASYN algorithm

To address the vulnerability of the ADASYN algorithm to outliers, iF-ADASYN, based on the iForest, is proposed. The method uses the iForest algorithm to find and remove outliers, and then uses the adaptive synthetic sampling algorithm to oversample the minority-class samples. Figure 2 shows the flow chart of the iF-ADASYN algorithm. In the raw dataset, the minority-class samples are identified first and an iForest is built over all minority-class samples; at the same time, the weight of each minority-class sample is calculated. For the anomaly index returned by the iForest, a value close to 1 indicates that the average path length of the test sample is very small, i.e., it is an outlier; a value below 0.5 indicates that the average path length in the iForest is greater than the average path length of a binary search tree, i.e., the sample is normal; and when the values cluster around 0.5, the data contain no obvious outliers. The minority-class data whose weight is greater than 0.8 are fed into the trained iForest; the samples judged by the iForest to be outliers are excluded from sampling, while the remaining high-weight minority data are sampled. Compared with the ADASYN algorithm, iF-ADASYN therefore has, in theory, a strong ability to resist outliers. Compared with the commonly used SMOTE algorithm, the iF-ADASYN sampling algorithm synthesizes data according to the weight distribution of the minority class, which reduces the risk of over-fitting the classification model. The iF-ADASYN algorithm does not apply the isolation Forest model to every minority-class sample, but only to the minority data with high weight, which avoids consuming excessive resources.

Figure 2. iF-ADASYN algorithm flow chart.

4.1 Determination of high weight outliers

The ADASYN algorithm calculates the weight according to the number of majority-class samples among the nearest neighbors of each minority-class sample, so the minority-class data in the dataset can be divided into three parts. As shown in Fig. 3, assume an imbalanced dataset of 1000 samples with an imbalance ratio of 8:2; for convenience, the samples are divided into two classes, represented by two colors. According to the distribution of the sample classes, the red lines in the figure divide the minority-class data into three parts: (a) the noise sample set, which lies inside the majority-class samples and is difficult to classify; (b) the classification-boundary sample set, located at the boundary between majority and minority samples, which is difficult to distinguish; and (c) the safe sample set, surrounded by minority-class samples, which has little effect on the classifier. For the iF-ADASYN algorithm, the classification-boundary sample set is the data we sample; synthesizing these data enables the algorithm to find the classification boundary effectively. The iF-ADASYN algorithm calculates the weight of each minority-class sample, and the minority points in the noise sample set are the high-weight outliers; these data are judged by the iForest together with their weights.

Figure 3. Minority data distribution.
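
As a small illustration of this partition, the sketch below splits minority samples into the three sets of Fig. 3 using their ADASYN weights; the low threshold is an illustrative assumption, and only the high-weight cut-off of 0.8 is taken from the text.

import numpy as np

def partition_minority(r, high=0.8, low=0.3):
    # r: ADASYN weights of the minority samples, as in Eq. (1)
    noise = np.where(r >= high)[0]                  # set (a): outlier candidates, checked by the iForest
    boundary = np.where((r > low) & (r < high))[0]  # set (b): classification-boundary samples
    safe = np.where(r <= low)[0]                    # set (c): safe samples, surrounded by minority data
    return noise, boundary, safe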

After the iForest is trained, the minority class data with high weight can be tested. The data <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math> is segmented on the isolation tree along the corresponding conditions to reach the leaf node and the path length <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>h</mi><mo>&#8290;</mo><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow></math> of the process is recorded, that is, the number of edges passed from the root node to the leaf node. Finally, the anomaly index of each test data is calculated according to the path length, and the formula is:

(4) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mrow><mi>s</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo>,</mo><mi>&#968;</mi><mo stretchy="false">)</mo></mrow></mrow><mo>=</mo><msup><mn>2</mn><mrow><mo>-</mo><mfrac><mrow><mi>E</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mrow><mi>h</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow><mo stretchy="false">)</mo></mrow></mrow><mrow><mi>c</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mi>&#968;</mi><mo stretchy="false">)</mo></mrow></mrow></mfrac></mrow></msup></mrow></math>

Where <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>E</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mrow><mi>h</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow><mo stretchy="false">)</mo></mrow></mrow></math> is the average path length of test data <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math> in T isolation trees and <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>c</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mi>&#968;</mi><mo stretchy="false">)</mo></mrow></mrow></math> is the average path length of binary trees. Due to the similarity between isolation trees and binary trees, the algorithm uses it to normalize <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>h</mi><mo>&#8290;</mo><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mrow></mrow></math> for each isolation tree.

(5) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mrow><mi>c</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mi>&#968;</mi><mo stretchy="false">)</mo></mrow></mrow><mo>=</mo><mrow><mrow><mn>2</mn><mo>&#8290;</mo><mi>H</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mrow><mi>&#968;</mi><mo>-</mo><mn>1</mn></mrow><mo stretchy="false">)</mo></mrow></mrow><mo>-</mo><mstyle displaystyle="true"><mfrac><mrow><mn>2</mn><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mrow><mi>&#968;</mi><mo>-</mo><mn>1</mn></mrow><mo stretchy="false">)</mo></mrow></mrow><mi>&#968;</mi></mfrac></mstyle></mrow></mrow></math>

where <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>H</mi><mo>&#8290;</mo><mrow><mo>(</mo><mi>i</mi><mo>)</mo></mrow></mrow></math> is the harmonic number, which can be estimated as <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mrow><mi>H</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mi>i</mi><mo stretchy="false">)</mo></mrow></mrow><mo>=</mo><mrow><mrow><mtext>ln</mtext><mo>&#8290;</mo><mi>i</mi></mrow><mo>+</mo><mn>0.5772156649</mn></mrow></mrow></math> (0.5772156649 being the Euler constant). Assuming that the tree depth is <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mtext>log</mtext><mn>2</mn></msub><mo>&#8290;</mo><mrow><mo>(</mo><mi>&#968;</mi><mo>)</mo></mrow></mrow></math> and a leaf node contains d data points, the actual segmentation path length of those d points is:

(6) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mrow><mi>h</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo stretchy="false">)</mo></mrow></mrow><mo>=</mo><mrow><mrow><msub><mi>log</mi><mn>2</mn></msub><mo>&#8289;</mo><mrow><mo stretchy="false">(</mo><mi>&#968;</mi><mo stretchy="false">)</mo></mrow></mrow><mo>+</mo><mrow><mi>c</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mi>d</mi><mo stretchy="false">)</mo></mrow></mrow></mrow></mrow></math>

If the anomaly index s of high-weight minority data is greater than 0.8, it indicates that the average path length of the test data is small, which is an anomaly point.
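
Eqs. (4)–(6) can be written as two small functions; the sketch below assumes E_h is the average path length of a test point over the isolation trees and psi is the sub-sample size used to build each tree, with names chosen only for illustration.

import numpy as np

EULER_GAMMA = 0.5772156649

def c_factor(psi):
    # Eq. (5): average path length of a binary search tree built over psi points
    if psi <= 1:
        return 0.0
    return 2.0 * (np.log(psi - 1) + EULER_GAMMA) - 2.0 * (psi - 1) / psi

def anomaly_index(E_h, psi):
    # Eq. (4): values close to 1 indicate outliers, values well below 0.5 indicate normal points
    return 2.0 ** (-E_h / c_factor(psi))

# Example: an average path length of 3 in trees built from 256-point sub-samples
# gives an anomaly index of about 0.82, above the 0.8 cut-off, so the point is flagged.
print(anomaly_index(E_h=3.0, psi=256))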

4.2 iF-ADASYN algorithm process

The imbalanced dataset is <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>X</mi></math> , the number of minority class samples is <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>m</mi><mi>s</mi></msub></math> , and the number of majority class samples is <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>m</mi><mi>l</mi></msub></math> . The minority sample set is <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mi>D</mi><mi>s</mi></msub><mo>=</mo><mrow><mo stretchy="false">{</mo><msub><mi>x</mi><mn>1</mn></msub><mo>,</mo><msub><mi>x</mi><mn>2</mn></msub><mo>,</mo><msub><mi>x</mi><mn>3</mn></msub><mo>,</mo><mi mathvariant="normal">...</mi><mo>,</mo><msub><mi>x</mi><mi>N</mi></msub><mo stretchy="false">}</mo></mrow></mrow></math> .

  Step 1: Randomly extract <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>&#968;</mi></math> sample points from <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>D</mi><mi>s</mi></msub></math> to form a subset, and put it into the root node of the isolation tree (iTree);

  Step 2: Randomly select a feature q from the d feature dimensions of the data, and randomly generate a cut point p from the data of the current feature;

(7) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mo>{</mo><mtable rowspacing="0pt"><mtr><mtd columnalign="left"><mrow><mi>p</mi><mo>&#62;</mo><mrow><mi>min</mi><mo>&#8289;</mo><mrow><mo stretchy="false">(</mo><msub><mi>x</mi><mrow><mi>i</mi><mo>&#8290;</mo><mi>j</mi></mrow></msub><mo>,</mo><mrow><mi>j</mi><mo>=</mo><mi>q</mi></mrow><mo>,</mo><mrow><msub><mi>x</mi><mrow><mi>i</mi><mo>&#8290;</mo><mi>j</mi></mrow></msub><mo>&#8712;</mo><msup><mi>D</mi><mo>&#8242;</mo></msup></mrow><mo stretchy="false">)</mo></mrow></mrow></mrow></mtd></mtr><mtr><mtd columnalign="left"><mrow><mi>p</mi><mo>&#60;</mo><mrow><mi>max</mi><mo>&#8289;</mo><mrow><mo stretchy="false">(</mo><msub><mi>x</mi><mrow><mi>i</mi><mo>&#8290;</mo><mi>j</mi></mrow></msub><mo>,</mo><mrow><mi>j</mi><mo>=</mo><mi>q</mi></mrow><mo>,</mo><mrow><msub><mi>x</mi><mrow><mi>i</mi><mo>&#8290;</mo><mi>j</mi></mrow></msub><mo>&#8712;</mo><msup><mi>D</mi><mo>&#8242;</mo></msup></mrow><mo stretchy="false">)</mo></mrow></mrow></mrow></mtd></mtr></mtable><mi /></mrow></math>

  Step 3: Cutting point p generates a hyperplane that divides the current data space into two subspaces: a sample point with dimension less than <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>p</mi></math> is placed in the left child node, and a sample point with a dimension greater than or equal to <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>p</mi></math> is placed in the right child node;

  Step 4: Loop step 2 and step 3 until all leaf nodes have only one sample point or isolation tree (iTree) has reached the specified height e;

  Step 5: Loop Step 1 to Step 4 until t isolation trees (iTrees) have been built, forming an isolation Forest;

  Step 6: For each sample <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>x</mi><mi>i</mi></msub></math> belonging to the minority class, the Euclidean distance is used to find its <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>k</mi></math> nearest neighbors in the n-dimensional space, and <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi mathvariant="normal">&#916;</mi><mi>i</mi></msub></math> is the number of samples belonging to the majority class among the <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>k</mi></math> nearest neighbors:

(8) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mrow><msub><mi>r</mi><mi>i</mi></msub><mo>=</mo><mstyle displaystyle="true"><mfrac><msub><mi mathvariant="normal">&#916;</mi><mi>i</mi></msub><mi>k</mi></mfrac></mstyle></mrow><mo>,</mo><mrow><mi>r</mi><mo>&#8712;</mo><mrow><mo stretchy="false">[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy="false">]</mo></mrow></mrow></mrow></math>

Where <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>i</mi><mo>=</mo><mrow><mn>1</mn><mo>,</mo><mn>2</mn><mo>,</mo><mi mathvariant="normal">...</mi><mo>,</mo><mi>N</mi></mrow></mrow></math> ;

  Step 7: if <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>r</mi><mi>i</mi></msub></math> is greater than the preset value <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>L</mi></math> (set above 0.8), <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>x</mi><mi>i</mi></msub></math> is taken into the iForest generated in step 5 to detect the abnormal score <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>s</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo>,</mo><mi>&#968;</mi><mo stretchy="false">)</mo></mrow></mrow></math> ;

(9) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mrow><mi>s</mi><mo>&#8290;</mo><mrow><mo>(</mo><mi>x</mi><mo>,</mo><mi>&#968;</mi><mo>)</mo></mrow></mrow><mo>=</mo><msup><mn>2</mn><mrow><mo>-</mo><mfrac><mrow><mi>E</mi><mrow><mo stretchy="false">(</mo><mi>h</mi><mrow><mo stretchy="false">(</mo><mi>X</mi><mo stretchy="false">)</mo></mrow></mrow></mrow><mrow><mi>c</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mi>&#968;</mi><mo stretchy="false">)</mo></mrow></mrow></mfrac></mrow></msup></mrow></math>

If <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>s</mi><mo>&#8290;</mo><mrow><mo stretchy="false">(</mo><mi>x</mi><mo>,</mo><mi>&#968;</mi><mo stretchy="false">)</mo></mrow></mrow></math> is greater than 0.8, the sample is judged to be an abnormal point and <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mi>m</mi><mi>s</mi></msub><mo>=</mo><mrow><msub><mi>m</mi><mi>s</mi></msub><mo>-</mo><mn>1</mn></mrow></mrow></math> ; otherwise, it is judged to be a normal point. Return to Step 6 until all minority-class samples have been evaluated;

  Step 8: Normalize the weight <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>r</mi><mi>i</mi></msub></math> of each remaining minority-class sample:

(10) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mover accent="true"><mi>r</mi><mo stretchy="false">^</mo></mover><mi>i</mi></msub><mo>=</mo><mstyle displaystyle="true"><mfrac><msub><mi>r</mi><mi>i</mi></msub><mrow><munderover><mo largeop="true" movablelimits="false" symmetric="true">&#8721;</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><msub><mi>m</mi><mi>s</mi></msub></munderover><msub><mi>r</mi><mi>i</mi></msub></mrow></mfrac></mstyle></mrow></math>

  Step 9: Calculate the imbalance ratio between the classes:

(11) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mrow><mi>d</mi><mo>=</mo><mstyle displaystyle="true"><mfrac><msub><mi>m</mi><mi>s</mi></msub><msub><mi>m</mi><mi>l</mi></msub></mfrac></mstyle></mrow><mo>,</mo><mrow><mi>d</mi><mo>&#8712;</mo><mrow><mo stretchy="false">(</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy="false">]</mo></mrow></mrow></mrow></math>

Step 10: Calculate the total number of minority samples to be synthesized:

(12) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mrow><mi>G</mi><mo>=</mo><mrow><mrow><mo stretchy="false">(</mo><mrow><msub><mi>m</mi><mi>l</mi></msub><mo>-</mo><msub><mi>m</mi><mi>s</mi></msub></mrow><mo stretchy="false">)</mo></mrow><mo>&#215;</mo><mi>b</mi></mrow></mrow><mo>,</mo><mrow><mi>b</mi><mo>&#8712;</mo><mrow><mo>[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo>]</mo></mrow></mrow></mrow></math>

where <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>b</mi></math> denotes the desired degree of balance between the minority and majority classes after synthesis; when <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>b</mi><mo>=</mo><mn>1</mn></mrow></math> , <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>G</mi></math> equals the difference between the majority-class and minority-class counts, so the two classes are balanced after the synthetic data are generated;

Step 11: Calculate the number of synthetic samples <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>g</mi><mi>i</mi></msub></math> for each minority sample, where <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>G</mi></math> is the total number of synthetic samples.

(13) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mi>g</mi><mi>i</mi></msub><mo>=</mo><mrow><msub><mover accent="true"><mi>r</mi><mo stretchy="false">^</mo></mover><mi>i</mi></msub><mo>&#215;</mo><mi>G</mi></mrow></mrow></math>

Step 12: For each minority-class sample <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>x</mi><mi>i</mi></msub></math> to be synthesized from, randomly select a minority-class sample <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>x</mi><mrow><mi>z</mi><mo>&#8290;</mo><mi>i</mi></mrow></msub></math> from its <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>k</mi></math> nearest neighbors, and synthesize the sample <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>s</mi><mi>j</mi></msub></math> according to the following equation:

(14) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><msub><mi>s</mi><mi>j</mi></msub><mo>=</mo><mrow><msub><mi>x</mi><mi>i</mi></msub><mo>+</mo><mrow><mrow><mo stretchy="false">(</mo><mrow><msub><mi>x</mi><mrow><mi>z</mi><mo>&#8290;</mo><mi>i</mi></mrow></msub><mo>-</mo><msub><mi>x</mi><mi>i</mi></msub></mrow><mo stretchy="false">)</mo></mrow><mo>&#215;</mo><mi>&#955;</mi></mrow></mrow></mrow></math>

where <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mi>&#955;</mi></math> is a random number, <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi>&#955;</mi><mo>&#8712;</mo><mrow><mo stretchy="false">[</mo><mn>0</mn><mo>,</mo><mn>1</mn><mo stretchy="false">]</mo></mrow></mrow></math> . After the outliers have been removed from the dataset, this method is used to generate the synthetic samples; the experimental results show that over-fitting can be avoided.
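
Putting Steps 1–12 together, the following condensed sketch substitutes scikit-learn's IsolationForest for the hand-built iTrees (its score_samples method returns the negative of the anomaly score in Eq. (9)); the thresholds L = 0.8 and s > 0.8 follow the text, while the helper names and the rest of the structure are illustrative assumptions rather than the authors' implementation.

import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import NearestNeighbors

def if_adasyn(X, y, k=5, L=0.8, s_cut=0.8, b=1.0, random_state=0):
    rng = np.random.default_rng(random_state)
    X_min, X_maj = X[y == 1], X[y == 0]

    # Steps 1-5: build an isolation forest on the minority class only
    iforest = IsolationForest(random_state=random_state).fit(X_min)

    # Step 6: ADASYN weight r_i from the k nearest neighbours in the full dataset
    nn_all = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn_all.kneighbors(X_min)
    r = np.sum(y[idx[:, 1:]] == 0, axis=1) / k

    # Step 7: screen high-weight samples by their anomaly score;
    # sklearn's score_samples returns the negative of s in Eq. (9)
    s = -iforest.score_samples(X_min)
    keep = ~((r > L) & (s > s_cut))
    X_keep, r_keep = X_min[keep], r[keep]
    if len(X_keep) < 2 or r_keep.sum() == 0:
        return X, y

    # Steps 8-11: normalised weights and per-sample synthesis quota
    r_hat = r_keep / r_keep.sum()
    G = (len(X_maj) - len(X_keep)) * b
    g = np.rint(r_hat * G).astype(int)

    # Step 12: interpolate towards minority-class neighbours of each kept sample
    nn_min = NearestNeighbors(n_neighbors=min(k + 1, len(X_keep))).fit(X_keep)
    _, min_idx = nn_min.kneighbors(X_keep)
    synthetic = []
    for i, n_new in enumerate(g):
        for _ in range(n_new):
            j = rng.choice(min_idx[i, 1:])
            lam = rng.random()
            synthetic.append(X_keep[i] + (X_keep[j] - X_keep[i]) * lam)
    if not synthetic:
        return X, y
    X_new = np.vstack([X] + synthetic)
    y_new = np.concatenate([y, np.ones(len(synthetic), dtype=int)])
    return X_new, y_new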

5. Experiment and discussion

5.1 Experimental environment

The experimental environment is PyCharm under Anaconda3, the programming language is Python 3.8, and the operating system is 64-bit Microsoft Windows 10. The Python libraries used include sklearn, pandas, numpy, and matplotlib.

5.2 Dataset and Performance metrics

In order to verify the prediction performance of the iF-ADASYN sampling algorithm on different datasets, four UCI public imbalanced datasets are used, as shown in Table 1, which lists the source of each dataset, its name, the number of samples, the number of features, and the imbalance ratio. The proposed algorithm is also applied to a real orthopedic postoperative thrombosis dataset, where it improves the recognition rate of patients with thrombosis and can serve as a clinical reference.

The thrombosis dataset comes from the Department of Orthopedics of the Chinese PLA General Hospital (301 Hospital), and the data are authentic. Because the records were entered during the patients' hospital stays, errors and missing records are inevitable, so this paper establishes preprocessing rules for the dataset under the guidance of physicians. After preprocessing, the data cover a total of 15,856 patients, including 15,328 patients without thrombosis and 528 patients with thrombosis, which constitutes a serious imbalance problem.

Table 1 Imbalanced dataset information

<table><thead><tr><th valign="top" align="left">Source</th><th valign="top" align="left">Dataset</th><th valign="top" align="left"># samples</th><th valign="top" align="left"># features</th><th valign="top" align="left"># minority</th><th valign="top" align="left"># majority</th><th valign="top" align="left">IR</th></tr></thead><tbody><tr><td valign="top" align="left">UCI</td><td valign="top" align="left">Balance</td><td valign="top" align="left">625</td><td valign="top" align="left">4</td><td valign="top" align="left">49</td><td valign="top" align="left">576</td><td valign="top" align="left">11.76</td></tr><tr><td /><td valign="top" align="left">Glass</td><td valign="top" align="left">214</td><td valign="top" align="left">9</td><td valign="top" align="left">70</td><td valign="top" align="left">144</td><td valign="top" align="left">2.06</td></tr><tr><td /><td valign="top" align="left">Heart</td><td valign="top" align="left">269</td><td valign="top" align="left">13</td><td valign="top" align="left">119</td><td valign="top" align="left">150</td><td valign="top" align="left">1.26</td></tr><tr><td /><td valign="top" align="left">Pima</td><td valign="top" align="left">768</td><td valign="top" align="left">8</td><td valign="top" align="left">268</td><td valign="top" align="left">500</td><td valign="top" align="left">1.87</td></tr><tr><td valign="top" align="left">Real dataset</td><td valign="top" align="left">Thrombus data</td><td valign="top" align="left">15856</td><td valign="top" align="left">343</td><td valign="top" align="left">528</td><td valign="top" align="left">15328</td><td valign="top" align="left">29.10</td></tr></tbody></table>

Generally, the higher the accuracy, the better the classification model. However, with imbalanced data a classification model can reach high accuracy while still being misleading. For example, if the class ratio is 1:99, a classifier that assigns every sample to the majority class achieves 99% accuracy, but this is meaningless for evaluating the classification of imbalanced data, so plain accuracy is not a suitable metric. The classification performance metrics for imbalanced data are mainly based on the confusion matrix, shown in Table 2. For a binary classification problem, the confusion matrix gives four values: (1) the number of positive samples correctly classified, True Positives (TP); (2) the number of negative samples correctly classified, True Negatives (TN); (3) the number of negative samples misclassified as positive, False Positives (FP); and (4) the number of positive samples misclassified as negative, False Negatives (FN). TP, TN, FP, and FN together cover all possible outcomes of binary classification.

Table 2 Confusion matrix of performance evaluation

<table><thead><tr><th /><th valign="top" align="left">Predicted positive class</th><th valign="top" align="left">Predicted negative class</th></tr></thead><tbody><tr><th valign="top" align="left">Actual positive class</th><td valign="top" align="left">TP</td><td valign="top" align="left">FN</td></tr><tr><th valign="top" align="left">Actual negative class</th><td valign="top" align="left">FP</td><td valign="top" align="left">TN</td></tr></tbody></table>

Accuracy measures the proportion of correct predictions among all predictions generated by the classification model:

(15) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mtext>&#119860;&#119888;&#119888;&#119906;&#119903;&#119886;&#119899;&#119888;&#119910;</mtext><mo>=</mo><mstyle displaystyle="true"><mfrac><mrow><mtext>&#119879;&#119875;</mtext><mo>+</mo><mtext>&#119879;&#119873;</mtext></mrow><mrow><mtext>&#119879;&#119875;</mtext><mo>+</mo><mtext>&#119879;&#119873;</mtext><mo>+</mo><mtext>&#119865;&#119875;</mtext><mo>+</mo><mtext>&#119865;&#119873;</mtext></mrow></mfrac></mstyle></mrow></math>

When evaluating imbalanced data classification, the classification accuracy of the minority class is the focus of attention. The usual performance metrics are Precision and Recall. Precision is the probability that a sample predicted to be positive is actually positive, and Recall is the probability that an actual positive sample is predicted to be positive. Precision and Recall are calculated as follows:

(16) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mrow><mi>Pr</mi><mo>&#8289;</mo><mtext>&#119890;&#119888;&#119894;&#119904;&#119894;&#119900;&#119899;</mtext></mrow><mo>=</mo><mstyle displaystyle="true"><mfrac><mtext>&#119879;&#119875;</mtext><mrow><mtext>&#119879;&#119875;</mtext><mo>+</mo><mtext>&#119865;&#119875;</mtext></mrow></mfrac></mstyle></mrow></math> (17) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mtext mathvariant="italic">Re call</mtext><mo>=</mo><mstyle displaystyle="true"><mfrac><mtext>&#119879;&#119875;</mtext><mrow><mtext>&#119879;&#119875;</mtext><mo>+</mo><mtext>&#119865;&#119873;</mtext></mrow></mfrac></mstyle></mrow></math>

Because this is real medical data, we pay more attention to the patients who are correctly classified, so we regard the True Negative Rate (TNR), which represents the proportion of the minority class that is correctly classified in the imbalanced data, as the more important metric.

(18) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mtext>&#119879;&#119873;&#119877;</mtext><mo>=</mo><mstyle displaystyle="true"><mfrac><mtext>&#119879;&#119873;</mtext><mrow><mtext>&#119879;&#119873;</mtext><mo>+</mo><mtext>&#119865;&#119875;</mtext></mrow></mfrac></mstyle></mrow></math>

The higher the Precision and Recall, the better the classification effect. However, Precision and Recall are in tension and usually cannot both reach high values at the same time, so the F1-score is used to consider the two jointly [[25]]. The larger the F1-score, the better the classifier solves the imbalanced classification problem. The G_mean value is the geometric mean of the minority-class accuracy and the majority-class accuracy, and it effectively measures the overall classification accuracy on imbalanced data. The calculation formulas are as follows:

(19) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mtext mathvariant="italic">F1</mtext><mo>=</mo><mstyle displaystyle="true"><mfrac><mrow><mn>2</mn><mo>&#215;</mo><mrow><mi>Pr</mi><mo>&#8289;</mo><mrow><mtext>&#119890;&#119888;&#119894;&#119904;&#119894;&#119900;&#119899;</mtext><mo>&#215;</mo><mtext>&#119877;&#119890;&#119888;&#119886;&#119897;&#119897;</mtext></mrow></mrow></mrow><mrow><mrow><mi>Pr</mi><mo>&#8289;</mo><mtext>&#119890;&#119888;&#119894;&#119904;&#119894;&#119900;&#119899;</mtext></mrow><mo>+</mo><mtext>&#119877;&#119890;&#119888;&#119886;&#119897;&#119897;</mtext></mrow></mfrac></mstyle></mrow></math> <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mtext mathvariant="italic">G&#95;mean</mtext><mo>=</mo><msqrt><mrow><mstyle displaystyle="true"><mfrac><mtext>&#119879;&#119875;</mtext><mrow><mtext>&#119879;&#119875;</mtext><mo>+</mo><mtext>&#119865;&#119873;</mtext></mrow></mfrac></mstyle><mo>&#215;</mo><mstyle displaystyle="true"><mfrac><mtext>&#119879;&#119873;</mtext><mrow><mtext>&#119879;&#119873;</mtext><mo>+</mo><mtext>&#119865;&#119875;</mtext></mrow></mfrac></mstyle></mrow></msqrt></mrow></math> (20) <math display="inline" xmlns="http://www.w3.org/1998/Math/MathML"><mrow><mi /><mo lspace="22.5pt">=</mo><msqrt><mrow><mtext>&#119877;&#119890;&#119888;&#119886;&#119897;&#119897;</mtext><mo>&#215;</mo><mtext>&#119879;&#119873;&#119877;</mtext></mrow></msqrt></mrow></math>

Another common comprehensive indicator is the area under the receiver operating characteristic (ROC) curve, the AUC. The AUC quantitatively summarizes the ROC curve and has a maximum value of 1; the larger the AUC, the better the classification effect. Unlike the other metrics, the AUC does not depend on a specific decision threshold of the classifier.
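
For reference, the metrics in Eqs. (15)–(20) and the AUC can be computed as in the short sketch below, assuming ground-truth labels, hard predictions, and probability scores from any classifier; the function name is illustrative.

import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def imbalance_metrics(y_true, y_pred, y_score):
    # for binary labels {0, 1}, ravel() returns tn, fp, fn, tp in this order
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    precision = tp / (tp + fp)                            # Eq. (16)
    recall = tp / (tp + fn)                               # Eq. (17)
    tnr = tn / (tn + fp)                                  # Eq. (18)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (19)
    g_mean = np.sqrt(recall * tnr)                        # Eq. (20)
    auc = roc_auc_score(y_true, y_score)                  # threshold-independent
    return {"Precision": precision, "Recall": recall, "TNR": tnr,
            "F1": f1, "G_mean": g_mean, "AUC": auc}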

5.3 Experimental design and results analysis

In order to verify the effect of the proposed algorithm, four public UCI imbalanced datasets are used for the experiments. The proposed algorithm is compared with the classical ADASYN sampling method and with the A-SUWO [[17]], K-SMOTE [[18]], and SyMProD [[19]] sampling methods on two classifiers, RandomForest and LogisticRegression. AUC, F1_score, and G_mean are used to evaluate the experimental results, as shown in Tables 3 and 4.

Table 3 Performance evaluations using RandomForest algorithm

<table><thead><tr><th valign="top" align="left">Dataset</th><th valign="top" align="left">Metrics</th><th valign="top" align="left">Raw data</th><th valign="top" align="left">ADASYN</th><th valign="top" align="left">A-SUWO</th><th valign="top" align="left">K-SMOTE</th><th valign="top" align="left">SyMProD</th><th valign="top" align="left">iF-ADASYN</th></tr></thead><tbody><tr><th valign="top" align="left">Balance</th><th valign="top" align="left">AUC</th><td valign="top" align="left">0.5010</td><td valign="top" align="left">0.3761</td><td valign="top" align="left">0.3467</td><td valign="top" align="left">0.3959</td><td valign="top" align="left">0.4056</td><td valign="top" align="left"><bold>0.6187</bold></td></tr><tr><th /><th valign="top" align="left">F1&#95;score</th><td valign="top" align="left">0.0000</td><td valign="top" align="left">0.2963</td><td valign="top" align="left">0.2845</td><td valign="top" align="left">0.2356</td><td valign="top" align="left">0.2243</td><td valign="top" align="left"><bold>0.3963</bold></td></tr><tr><th /><th valign="top" align="left">G&#95;mean</th><td valign="top" align="left">0.0000</td><td valign="top" align="left">0.4289</td><td valign="top" align="left">0.4235</td><td valign="top" align="left">0.2243</td><td valign="top" align="left">0.4198</td><td valign="top" align="left"><bold>0.4888</bold></td></tr><tr><th valign="top" align="left">Glass</th><th valign="top" align="left">AUC</th><td valign="top" align="left">0.8904</td><td valign="top" align="left">0.9233</td><td valign="top" align="left">0.9190</td><td valign="top" align="left">0.9326</td><td valign="top" align="left">0.9372</td><td valign="top" align="left"><bold>0.9686</bold></td></tr><tr><th /><th valign="top" align="left">F1&#95;score</th><td valign="top" align="left">0.7271</td><td valign="top" align="left">0.7407</td><td valign="top" align="left">0.7442</td><td valign="top" align="left">0.7912</td><td valign="top" align="left">0.8058</td><td valign="top" align="left"><bold>0.9524</bold></td></tr><tr><th /><th valign="top" align="left">G&#95;mean</th><td valign="top" align="left">0.7902</td><td valign="top" align="left">0.9049</td><td valign="top" align="left">0.8125</td><td valign="top" align="left">0.8545</td><td valign="top" align="left">0.8664</td><td valign="top" align="left"><bold>0.9535</bold></td></tr><tr><th valign="top" align="left">Heart</th><th valign="top" align="left">AUC</th><td valign="top" align="left">0.6916</td><td valign="top" align="left">0.9107</td><td valign="top" align="left">0.9063</td><td valign="top" align="left">0.9097</td><td valign="top" align="left">0.9098</td><td valign="top" align="left"><bold>0.9464</bold></td></tr><tr><th /><th valign="top" align="left">F1&#95;score</th><td valign="top" align="left">0.4166</td><td valign="top" align="left">0.8189</td><td valign="top" align="left">0.8128</td><td valign="top" align="left">0.8176</td><td valign="top" align="left">0.8167</td><td valign="top" align="left"><bold>0.8693</bold></td></tr><tr><th /><th valign="top" align="left">G&#95;mean</th><td valign="top" align="left">0.4905</td><td valign="top" align="left">0.8146</td><td valign="top" align="left">0.8121</td><td valign="top" align="left">0.8174</td><td valign="top" align="left">0.8118</td><td valign="top" align="left"><bold>0.8707</bold></td></tr><tr><th valign="top" align="left">Pima</th><th valign="top" align="left">AUC</th><td valign="top" align="left">0.8073</td><td valign="top" align="left">0.8208</td><td valign="top" align="left">0.8271</td><td 
valign="top" align="left">0.8255</td><td valign="top" align="left"><bold>0.8275</bold></td><td valign="top" align="left">0.8107</td></tr><tr><th /><th valign="top" align="left">F1&#95;score</th><td valign="top" align="left">0.6800</td><td valign="top" align="left">0.6814</td><td valign="top" align="left">0.6833</td><td valign="top" align="left">0.6817</td><td valign="top" align="left">0.6859</td><td valign="top" align="left"><bold>0.7463</bold></td></tr><tr><th /><th valign="top" align="left">G&#95;mean</th><td valign="top" align="left">0.6885</td><td valign="top" align="left">0.7472</td><td valign="top" align="left">0.7472</td><td valign="top" align="left">0.7471</td><td valign="top" align="left"><bold>0.7506</bold></td><td valign="top" align="left">0.7463</td></tr></tbody></table>

Graph: Figure 4. ROC curve comparison.

As shown in Table 3, the AUC, F1_score, and G_mean values of the proposed sampling algorithm are higher than those of the other algorithms on three of the four datasets when RandomForest is used as the classifier. On the Pima dataset only the F1_score is higher, while the other two metrics are lower. Because the Pima dataset contains relatively few outliers, the iF-ADASYN algorithm does not markedly improve the classification results there, while it does add computational cost.
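For reference, these three metrics can be computed with scikit-learn and imbalanced-learn as follows (a minimal sketch with toy labels; the values are purely illustrative and not taken from the experiments):

    from sklearn.metrics import roc_auc_score, f1_score
    from imblearn.metrics import geometric_mean_score

    # Toy labels and predictions, for illustration only.
    y_true  = [0, 0, 0, 0, 1, 1]
    y_pred  = [0, 0, 0, 1, 1, 0]
    y_score = [0.1, 0.2, 0.3, 0.6, 0.9, 0.4]   # probability of the positive class

    print(roc_auc_score(y_true, y_score))        # AUC
    print(f1_score(y_true, y_pred))              # F1_score of the positive class
    print(geometric_mean_score(y_true, y_pred))  # G_mean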

Table 4 Performance evaluations using the LogisticRegression algorithm

Dataset   Metric    Raw data  ADASYN   A-SUWO   K-SMOTE  SyMProD  iF-ADASYN
Balance   AUC       0.4483    0.5566   0.5342   0.5159   0.5806*  0.4358
          F1_score  0.0000    0.2821   0.2760   0.2847   0.3576*  0.3517
          G_mean    0.0000    0.5030   0.4940   0.5059   0.5637*  0.3850
Glass     AUC       0.8904    0.8208   0.8233   0.8252   0.8150   0.9545*
          F1_score  0.7271    0.6814   0.6580   0.6741   0.6666   0.8871*
          G_mean    0.7902    0.7472   0.7063   0.7402   0.7331   0.8903*
Heart     AUC       0.8615    0.9016   0.8997   0.9019   0.9026   0.9258*
          F1_score  0.7806    0.8154   0.8095   0.8093   0.8160*  0.8145
          G_mean    0.7907    0.8169   0.8096   0.8110   0.8185*  0.8145
Pima      AUC       0.8410    0.8216   0.8292   0.8272   0.8290   0.8211
          F1_score  0.6948    0.6779   0.6829   0.6859   0.6872   0.7219*
          G_mean    0.7126    0.7398   0.7459   0.7503   0.7514*  0.7219

(* best value in each row)

As shown in Table 4, the AUC, F1_score, and G_mean values of the proposed sampling algorithm are higher than those of the other algorithms on the Glass dataset when LogisticRegression is used as the classifier, but its performance is worse on the other three datasets. These results indicate that, after the imbalanced data are sampled, the ensemble learning algorithm RandomForest achieves better classification performance than plain LogisticRegression.

The proposed iF-ADASYN algorithm is then applied to a real orthopedic postoperative thrombosis dataset to demonstrate its feasibility in practice. Four versions of the data are compared: the original dataset and the datasets obtained by sampling with the SMOTE, ADASYN, and iF-ADASYN algorithms. Each version is classified with five machine learning algorithms. To find suitable hyperparameters quickly, we use GridSearchCV, which automates the search over the candidate parameter combinations and reduces the time needed to tune each model.
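For illustration, a minimal sketch of this tuning step with scikit-learn, using a RandomForest classifier on synthetic imbalanced data; the parameter grid and scoring choice are illustrative assumptions, not the exact settings used in the experiments:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Synthetic imbalanced data standing in for the thrombosis dataset.
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

    # Illustrative grid; Table 5 reports the values actually selected.
    param_grid = {"n_estimators": [100, 200, 500], "min_samples_split": [2, 4, 8]}

    search = GridSearchCV(RandomForestClassifier(random_state=42),
                          param_grid, scoring="roc_auc", cv=5)
    search.fit(X, y)
    print(search.best_params_)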

Table 5 lists the optimal hyperparameters of each model. The iF-ADASYN algorithm uses the same parameter settings as the SMOTE and ADASYN algorithms: the number of nearest neighbors is set to k = 5, and the expected imbalance degree is 0.7, i.e., the ratio of the minority class (after synthesis) to the majority class is 0.7. The results are shown in Table 6.
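As a rough illustration of these settings, the imbalanced-learn implementation of ADASYN accepts both parameters directly (a sketch on synthetic data, not the authors' code):

    from sklearn.datasets import make_classification
    from imblearn.over_sampling import ADASYN

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

    # k = 5 nearest neighbors and a target minority/majority ratio of 0.7,
    # matching the settings reported above (illustrative only).
    sampler = ADASYN(sampling_strategy=0.7, n_neighbors=5, random_state=42)
    X_res, y_res = sampler.fit_resample(X, y)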

Table 5 Parameter table of each model

Model                      Hyperparameters
LogisticRegression (LR)    penalty = 'l2', solver = 'liblinear', max_iter = 1000
DecisionTree (DT)          criterion = 'gini', max_depth = 20
RandomForest (RF)          n_estimators = 200, min_samples_split = 8
AdaBoost (Ada)             n_estimators = 500, learning_rate = 0.7
GBDT                       n_estimators = 500, learning_rate = 0.9
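In scikit-learn terms, the configurations in Table 5 correspond roughly to the following estimators (a sketch; defaults are assumed for any parameter not listed in the table):

    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                                  GradientBoostingClassifier)

    # One estimator per row of Table 5, keyed by the abbreviation used in the text.
    models = {
        "LR":   LogisticRegression(penalty="l2", solver="liblinear", max_iter=1000),
        "DT":   DecisionTreeClassifier(criterion="gini", max_depth=20),
        "RF":   RandomForestClassifier(n_estimators=200, min_samples_split=8),
        "Ada":  AdaBoostClassifier(n_estimators=500, learning_rate=0.7),
        "GBDT": GradientBoostingClassifier(n_estimators=500, learning_rate=0.9),
    }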

Table 6 Classification results of thrombosis dataset

Model  Sampling    Accuracy  Precision  Recall  F1-score  AUC    TNR    Confusion matrix
DT     Raw data    0.946     0.63       0.61    0.62      0.63   0.29   [2971 93] [77 31]
       SMOTE       0.966     0.75       0.75    0.75      0.77   0.52   [3011 53] [52 56]
       ADASYN      0.811     0.77       0.71    0.74      0.77   0.57   [2984 80] [46 62]
       iF-ADASYN   0.972*    0.81*      0.79*   0.80*     0.81*  0.62*  [3018 46] [40 68]
LR     Raw data    0.971*    0.63       0.86*   0.69*     0.85*  0.27   [3054 10] [79 29]
       SMOTE       0.864     0.73       0.56    0.58      0.83   0.58   [2679 385] [45 63]
       ADASYN      0.863     0.72       0.56    0.57      0.83   0.56   [2678 386] [47 61]
       iF-ADASYN   0.890     0.75*      0.58    0.61      0.82   0.58*  [2766 298] [45 63]
RF     Raw data    0.971     0.58       0.94*   0.63      0.76   0.17   [3062 2] [90 18]
       SMOTE       0.967     0.66       0.76    0.70      0.89   0.33   [3034 30] [72 36]
       ADASYN      0.968     0.68       0.77    0.72      0.90   0.37   [3033 31] [68 40]
       iF-ADASYN   0.973*    0.69*      0.84    0.74*     0.93*  0.38*  [3047 17] [67 41]
Ada    Raw data    0.969*    0.63       0.86*   0.68      0.83   0.25   [3046 18] [81 27]
       SMOTE       0.958     0.78       0.70    0.74      0.93   0.59   [2976 88] [44 64]
       ADASYN      0.951     0.78       0.68    0.72      0.92   0.60   [2952 112] [43 65]
       iF-ADASYN   0.964     0.90*      0.74    0.82*     0.97*  0.83*  [2969 95] [18 90]
GBDT   Raw data    0.976*    0.67       0.94*   0.74      0.90   0.33   [3060 4] [72 36]
       SMOTE       0.964     0.78       0.73    0.75      0.94   0.57   [2997 67] [46 62]
       ADASYN      0.957     0.81       0.70    0.74      0.95   0.66   [2964 100] [31 71]
       iF-ADASYN   0.975     0.93*      0.80    0.85*     0.98*  0.86*  [3002 62] [15 93]

(* best value within each classifier block. Confusion matrices: [true majority-class row] [true minority-class row].)

Graph: Figure 5. Comparison of TNR value.

Graph: Figure 6. Comparison of F1-score value.

The experimental results in Table 6 show that every algorithm achieves high accuracy on the original data, but the recognition rate of the minority class (TNR) is still low. In medicine and other high-risk fields, the TNR value is the more important measure. The comparative experiments show that both the SMOTE and ADASYN algorithms have a positive impact on the classification of imbalanced data. The confusion matrices in Table 6 show that the sampling algorithms expand the decision boundary of the minority class and increase the number of correctly recognized minority samples. Compared with the classification results on the original data, the SMOTE and ADASYN sampling algorithms also misclassify some majority class samples as minority class samples. Among them, the ADASYN algorithm is easily affected by outliers and misclassifies many majority class samples.
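Given the layout of the confusion matrices in Table 6 (second row: true minority class, an interpretation consistent with the reported values, e.g. 31/108 ≈ 0.29 for DecisionTree on the raw data), the TNR can be recovered as the fraction of true minority samples that are recognized. A minimal sketch with toy labels, assuming the minority class is encoded as label 1:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Toy labels; class 1 stands in for the minority (thrombosis) class,
    # so it occupies the second row of the confusion matrix, as in Table 6.
    y_true = np.array([0] * 10 + [1] * 4)
    y_pred = np.array([0] * 9 + [1] + [1, 1, 0, 0])

    cm = confusion_matrix(y_true, y_pred)
    tnr = cm[1, 1] / cm[1].sum()   # correctly recognized minority / all minority samples
    print(cm, tnr)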

Figure 4 gives the ROC curves and AUC values of the five classifiers on the original data and the sampled data. Apart from DecisionTree and LogisticRegression, the three classifiers RandomForest, AdaBoost, and GBDT all show an improved AUC on the imbalanced data after iF-ADASYN sampling. At the same time, the AUC of the iF-ADASYN algorithm is higher than that of the ADASYN and SMOTE algorithms, reaching 0.98 with the best-performing GBDT classifier. Therefore, our sampling algorithm works well with ensemble learning classifiers.

Figure 5 compares the TNR values predicted on the thrombosis dataset, which indicate the accuracy of the predictions for patients with thrombosis. The higher the TNR, the higher the recognition rate of the minority class. As Fig. 5 shows, the closer the TNR value of an algorithm is to the standard line, the better the algorithm performs relative to the others; a value close to or inside the minimum standard line indicates a poor TNR. With the GBDT classifier used in this paper, the TNR value is closest to the standard line. Compared with the other classifiers and sampling algorithms, the iF-ADASYN algorithm achieves the highest minority class recognition rate on GBDT.

Figure 6 compares the F1_score values predicted on the thrombosis dataset; the F1_score is a comprehensive metric that balances precision and recall. As the bar chart in Fig. 6 shows, the GBDT and AdaBoost algorithms perform better on the imbalanced thrombosis dataset than simple models such as DecisionTree and LogisticRegression. With the other evaluation metrics being roughly the same, the experiment also shows that iF-ADASYN achieves the highest recognition rate of minority samples among the three sampling algorithms.

The experimental results show that, on the orthopedic postoperative patient dataset, the AUC value of the iF-ADASYN sampling algorithm is higher than that of the commonly used sampling algorithms SMOTE and ADASYN, and the recognition rate of patients with thrombosis is increased by 20%. Compared with the ADASYN algorithm, the iF-ADASYN sampling algorithm is more resistant to interference from outliers and improves the accuracy of the minority class decision boundary region division.
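To make the overall workflow concrete, the following sketch approximates the idea with off-the-shelf components: an IsolationForest removes anomalous minority samples before ADASYN oversampling. This is only an approximation of iF-ADASYN under assumed parameters (contamination rate, sampling ratio); the paper's own anomaly-index calculation is not reproduced here.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import IsolationForest
    from imblearn.over_sampling import ADASYN

    def if_adasyn_like(X, y, minority_label=1, contamination=0.05):
        """Rough approximation of the iF-ADASYN idea: drop minority samples that an
        isolation forest flags as outliers, then oversample the rest with ADASYN."""
        minority_mask = (y == minority_label)
        iso = IsolationForest(contamination=contamination, random_state=42)
        inlier = iso.fit_predict(X[minority_mask]) == 1   # +1 inlier, -1 outlier

        keep = ~minority_mask                               # keep all majority samples ...
        keep[np.where(minority_mask)[0][inlier]] = True     # ... plus minority inliers
        X_clean, y_clean = X[keep], y[keep]

        sampler = ADASYN(sampling_strategy=0.7, n_neighbors=5, random_state=42)
        return sampler.fit_resample(X_clean, y_clean)

    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
    X_res, y_res = if_adasyn_like(X, y)
    print(np.bincount(y), np.bincount(y_res))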

6. Conclusion

In this paper, we propose an iF-ADASYN sampling algorithm for predicting the risk of thrombosis after orthopedic surgery from imbalanced data. By introducing the isolation Forest, the algorithm overcomes the vulnerability of ADASYN to outliers. The analysis of the experimental results shows that the iF-ADASYN algorithm generates more minority samples in the regions where the two classes are hard to distinguish, so the classifier can find the decision boundary better and improve the recognition rate of the minority class. The results on four UCI datasets show that the improved algorithm can effectively increase the TNR value of the minority class, and on the real thrombosis dataset of patients from the Chinese PLA General Hospital (301 Hospital) it can predict whether patients will develop thrombosis. The experimental results show that the iF-ADASYN algorithm correctly predicts noticeably more minority class samples than the SMOTE and ADASYN algorithms and falsely predicts fewer of them as the majority class; the AUC and TNR values also confirm its effectiveness. However, the iF-ADASYN algorithm has its shortcomings: the confusion matrices show that its classification of the majority class is not as accurate as that of the traditional sampling algorithms, which is the focus of further research and improvement.

Acknowledgments

This work is supported by the National Key Research and Development Program (No. 2019YFC0121502), the Special Fund for Shaanxi Provincial Key Laboratory of Network Data and Intelligent Processing, and the National Natural Science Foundation of China (No. 62001380).

References

1 N. Rout, D. Mishra and M.K. Mallick, Handling Imbalanced Data: A Survey, International Proceedings on Advances in Soft Computing, Intelligent Systems and Applications, 2018.

2 Z. Ye, Y. Wen and B. Lu, Review of the research on imbalanced classification, Journal of Intelligent Systems. 4 (2) (2009), 148-156.

3 T. Chen, Y. He, Y.Y. Mi, L. Zou et al., Application review and obstacle analysis of evidence for prevention and management of deep venous thrombosis in patients after spinal orthopedic surgery, Journal of Nurses Further Studies. 34 (24) (2019), 2238-2243.

4 Y.X. Wu, Prevention and treatment of deep venous thrombosis of lower limbs in patients after orthopedic trauma surgery, Clinical Research and Practice. 002 (020) (2017), 57-58.

5 H. He, B. Yang, E.A. Garcia et al., ADASYN: Adaptive synthetic sampling approach for imbalanced learning, 2008 IEEE International Joint Conference on Neural Networks, 2008.

6 S. Cai, R. Sun, S. Hao et al., An efficient outlier detection approach on weighted data stream based on minimal rare pattern mining, China Communications. 16 (3) (2019), 83-99.

7 H.C. Mandhare and S.R. Idate, A comparative study of cluster based outlier detection, distance based outlier detection and density based outlier detection techniques, IEEE, 2018, pp. 931-935.

8 M. Ektefa, S. Memar, F. Sidi et al., Intrusion detection using data mining techniques, International Conference on Information Retrieval & Knowledge Management, IEEE, 2010.

9 H.M. Nguyen, E.W. Cooper and K. Kamei, A comparative study on sampling techniques for handling class imbalance in streaming data, Joint International Conference on Soft Computing & Intelligent Systems, IEEE, 2012.

10 H.A. Fayed and A.F. Atiya, A novel template reduction approach for the k-nearest neighbor method, IEEE Transactions on Neural Networks. 20 (5) (2009), 890-896.

11 G.E. Batista, R.C. Prati and M.C. Monard, A study of the behavior of several methods for balancing machine learning training data, ACM SIGKDD Explorations Newsletter. 6 (1) (2004), 20-29.

12 Y.Y. Wu and L.Y. Shen, Unbalanced fuzzy multi-class support vector machine based on class overlap undersampling, Journal of University of Chinese Academy of Sciences. 35 (4) (2018), 536-543.

13 M. Galar et al., EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling, Pattern Recognition. 46 (5) (2013), 3460-3471.

14 N.V. Chawla, K.W. Bowyer, L.O. Hall et al., SMOTE: Synthetic Minority Over-sampling Technique, Journal of Artificial Intelligence Research. 16 (1) (2002), 321-357.

15 C. Bellinger, C. Drummond and N. Japkowicz, Manifold-based synthetic oversampling with manifold conformance estimation, Machine Learning. 107 (3) (2018), 605-637.

16 Z.Y. Zheng, Y.P. Cai and Y. Li, Oversampling method for imbalanced classification, Computing and Informatics. 34 (5) (2016), 1017-1037.

17 I. Nekooeimehr and S.K. Lai-Yuen, Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets, Expert Systems with Applications. 46 (5) (2016), 405-416.

18 D. Georgios, B. Fernando and L. Felix, Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE, Information Sciences. 465 (2018), 1-20.

19 I. Kunakorntum, W. Hinthong and P. Phunchongharn, A Synthetic Minority Based on Probabilistic Distribution (SyMProD) oversampling for imbalanced datasets, IEEE Access, 2020, pp. 1.

20 Y. Qian, Y.C. Liang, M. Li et al., A resampling ensemble algorithm for classification of imbalance problems, Neurocomputing. 143 (2014), 57-67.

21 F. Charte, A.J. Rivera, M.J. Del Jesus et al., Addressing imbalance in multilabel classification: Measures and random resampling algorithms, Neurocomputing. 163 (2015), 3-16.

22 M. Wang, X. Yao and Y. Chen, An imbalanced-data processing algorithm for the prediction of heart attack in stroke patients, IEEE Access, 2021, pp. 1.

23 F.T. Liu, K.M. Ting and Z.H. Zhou, Isolation Forest, 2008 Eighth IEEE International Conference on Data Mining, 2008.

24 F.T. Liu, K.M. Ting and Z.H. Zhou, Isolation-based anomaly detection, ACM Transactions on Knowledge Discovery from Data. 6 (1) (2012), 1-39.

25 G.J. Wei, L.J. Ye, S. University et al., Research and advancement of classification method of imbalanced data sets, Computer Science. 35 (4) (2008), 10-13.

By Xiaoying Pan; Rong Jia; Jiahao Huang and Hao Wang
