Transformer fault diagnosis method based on SMOTE and NGO-GBDT
www.nature.com/scientificreports
Li‑zhong Wang 1, Jian‑fei Chi 1, Ye‑qiang Ding 1, Hai‑yan Yao 2, Qiang Guo 2 & Hai‑qi Yang 3*
To improve the accuracy of transformer fault diagnosis and to reduce the effect of unbalanced samples,
which leave the model insufficiently trained and lower its identification accuracy,
this paper proposes a transformer fault diagnosis method based on SMOTE and NGO-GBDT. First,
the Synthetic Minority Over-sampling Technique (SMOTE) is used to expand the minority samples.
Second, the non-coding ratio method is used to construct multi-dimensional feature parameters,
and a Light Gradient Boosting Machine (LightGBM) feature optimization strategy is introduced
to screen the optimal feature subset. Finally, the Northern Goshawk Optimization (NGO) algorithm is
used to optimize the parameters of a Gradient Boosting Decision Tree (GBDT), which then performs the
transformer fault diagnosis. The results show that the proposed method reduces the misjudgment
of minority samples. Compared with other ensemble models, the proposed method has high fault
identification accuracy, a low misjudgment rate and stable performance.
Keywords Fault diagnosis, Transformers, Oversampling, LightGBM feature selection, GBDT, Northern
goshawk optimization algorithm
Power transformers are key equipment in the transmission and transformation system, and their operating status
directly affects the stability of the power system. When a transformer malfunctions and the fault is not diagnosed
accurately and promptly, significant economic losses follow. Improving the accuracy of transformer fault
diagnosis has therefore long been an active research topic.
As transformer insulation ages, H2, CH4, C2H6, C2H4, C2H2, CO2, and other gases
are produced and dissolve into the insulating oil. The present condition of the transformer may be inferred from
the concentration and composition of these dissolved gases1. The predominant analytical techniques
for assessing the transformer’s condition include the IEC three-ratio method2, Rogers’ four-ratio
method3, the Duval Pentagon4, and Doernenburg’s ratio method5, among others. In Ref.6, a fuzzy logic approach was
proposed to overcome the shortcomings of traditional IEC methods and enhance diagnostic accuracy. In Ref.7, a
fuzzy logic-based transformer fault diagnosis model employing the Rogers four-ratio method was developed from
dissolved gas data; it was shown to remedy deficiencies of conventional fault diagnosis methods and thereby
improve diagnostic accuracy. However, this method lacks comprehensive coding and its diagnostic thresholds are too
rigidly defined, so it fails to capture the intricate nature of transformer faults, which compromises
diagnostic accuracy8. In Ref.9, the ratio coding method and raw gas data are used to construct 24-dimensional
features, which improves the model’s ability to distinguish between different faults and makes it more versatile.
Ref.10 proposes a PSO-RF diagnostic model that extracts transformer fault characteristic information without
using coding ratios, thereby improving the model’s fault diagnosis capability. However, existing research
rarely considers the dimensionality explosion problem when constructing feature parameters. A fault
diagnosis model generally improves as the sample size increases, but an increase in feature dimension leads
to an exponential increase in the amount of calculation and to more redundant information. Therefore, it
is necessary to remove redundant information to improve model operating efficiency and diagnostic accuracy.
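
To make the idea of uncoded ratio features concrete, the following is a minimal sketch, not the authors' feature set. It computes the three gas ratios of the IEC three-ratio method (C2H2/C2H4, CH4/H2, C2H4/C2H6) as raw values instead of mapping them to codes. The function name ratio_features, the eps guard against division by zero, and the sample concentrations are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Numerator/denominator pairs of the IEC three-ratio method.
IEC_RATIOS: List[Tuple[str, str]] = [
    ("C2H2", "C2H4"),
    ("CH4", "H2"),
    ("C2H4", "C2H6"),
]


def ratio_features(gases: Dict[str, float], eps: float = 1e-9) -> List[float]:
    """Return raw (uncoded) gas ratios for one dissolved-gas sample.

    `gases` maps a gas name to its concentration. `eps` is an assumed
    small constant that avoids division by zero when a gas is absent.
    """
    return [gases[num] / (gases[den] + eps) for num, den in IEC_RATIOS]


# Illustrative concentrations only, not measured data.
sample = {"H2": 100.0, "CH4": 50.0, "C2H6": 20.0, "C2H4": 40.0, "C2H2": 10.0}
features = ratio_features(sample)
```

Keeping the ratios as continuous values avoids the rigid thresholds of coded schemes; adding further ratios raises the feature dimension, which is why a later selection step is needed.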
As artificial intelligence technology advances, machine learning applications in transformer fault diagnosis
have gained momentum. Support Vector Machine11–13, Convolutional Neural Network (CNN)14,15, Self-Organizing
Mapping Neural Network (SOM)16, Gated Recurrent Unit (GRU)17,18, Cloud Model (CM)19, Adaptive
Boosting (AdaBoost)20, Gradient Boosting Decision Tree (GBDT)21 and other models have demonstrated remarkable
success in classification. However, the fault diagnosis models mentioned above were all constructed
on the assumption of a relatively large dataset. In practical operation, transformers
rarely fail, and the frequencies of different fault types vary significantly. This makes it difficult
to collect the large samples needed to meet precision requirements. Therefore, when addressing practical
transformer fault diagnosis, the issue of sample imbalance needs immediate attention.

1State Grid Zhejiang Power Co., Ltd, Hangzhou Linping Power Supply Company, Hangzhou 311199,
China. 2Hangzhou Electric Power Equipment Manufacturing Co., Ltd, Yuhang Qunli Complete Sets Electricity
Manufacturing Branch Electric, Hangzhou 311000, China. 3School of Mechanical Engineering, Northeast Electric
Power University, Jilin 132012, China. *email:

Scientific Reports | (2024) 14:7179 | https://doi.org/10.1038/s41598-024-57509-w
Building a transformer fault diagnosis model depends on abundant data. In practical
operation, transformer faults are rare and the frequencies of the different fault types differ widely,
which makes it challenging to assemble datasets of the required size.
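
One common response to this kind of imbalance is to oversample the rare fault classes. The following is a minimal sketch of the interpolation idea behind SMOTE, written with the Python standard library only; it is not the authors' implementation and not a library API. The function name smote_sketch, the neighbour count k, the fixed seed, and the toy minority samples are all illustrative assumptions.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def smote_sketch(minority: Sequence[Vector], n_new: int,
                 k: int = 3, seed: int = 0) -> List[Vector]:
    """Create n_new synthetic samples from a minority class.

    Each synthetic sample lies on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic: List[Vector] = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy two-dimensional minority class, for illustration only.
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=5)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies, which is the property that distinguishes this approach from simply duplicating samples.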
Research on imbalanced datasets mainly focuses on classifier design and data preprocessing techniques.
Data-level processing reconstructs the dataset to better reflect its inherent characteristics, thereby
addressing issues that arise from imbalanced class frequencies. Undersampling22 selects a subset of
the most representative samples from the majority classes to mitigate class imbalance. However, this
approach may discard crucial information about the majority classes, ultimately impairing
classifier performance. Oversampling artificially increases the number of minority-class samples to achieve
data balance. This can be done through techniques such as Synthetic Minority Oversampling Technique (SMOTE)23,24,
SVM SMOTE25, Borderline-SMOTE26, Adaptive Synthetic Sampling (ADASYN)27, Generative Adversarial
Network (GAN)28, and others. Common approaches at the classification algorithm level include cost-sensitive
learning29 and ensemble learning30. In Ref.31, cost-sensitive classifiers are used to address class disparities
and improve fault categorization accuracy. The Auxiliary Generation Mutual Countermeasure Network (AGMAN) was
proposed in Ref.32 to enhance the accuracy of small-sample, class-imbalanced fault diagnosis. In Ref.33, MeanRadius-SMOTE (...truncated)