PO133

How do missing data affect your data-driven model? Monitoring fatigue loads in wind turbines with SCADA data

Luis Vera-Tudela, Martin Kraft, Martin Kühn
ForWind – Carl von Ossietzky University of Oldenburg, Oldenburg, Lower Saxony, Germany

Abstract

Nowadays, the condition of wind turbines is monitored with SCADA data and purpose-specific sensors. If predictive maintenance is understood as a data-driven decision-making process, then it should aim to balance low cost, prediction accuracy and actionable results. While developing a low-cost, high-accuracy monitoring technique, we previously demonstrated that fatigue loads on wind turbines can be monitored based solely on available SCADA data, and that the resulting estimates can be used to assess lifetime consumption. However, when working towards actionable results, we found the process limited by the complete-case assumption, a widely used technique that removes incomplete records beforehand. Since datasets are almost never complete, due to sensor malfunctions, failed records, etc., this may lead to biased results. To overcome this limitation we investigated one year of blade loads and SCADA records from the offshore wind farm EnBW Baltic 1. We assessed the impact of various missing data mechanisms, which define the randomness of lost data, and evaluated three traditional replacement procedures, which make different levels of assumptions about data patterns. Our results indicated that hot-deck imputation with a K-Nearest Neighbour (K-NN) algorithm is a robust approach for dealing with missing data problems. Finally, although it uses a simple traditional approach, our results improve decision-making relative to a data-driven model based on the complete-case assumption.

Method

One year of measurements from one wind turbine at the offshore wind farm EnBW Baltic 1 is investigated. First, SCADA data are summarized with 10-min statistics and blade loads are transformed to damage equivalent loads. Afterwards, a baseline neural network model maps the relationship between SCADA data and blade loads. Then, datasets with missing data at different levels of randomness are introduced: missing completely at random, missing at random and missing not at random. They are used to understand and quantify the impact of randomness. Finally, three traditional replacement procedures are assessed: mean substitution, regression imputation and hot-deck imputation.
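The three missing data mechanisms can be illustrated with a minimal sketch. It is not the procedure used in the study: the wind speed and load values are synthetic, and the function names, the linear loss probabilities and the 30 % loss fraction are assumptions made only for illustration. It removes load records under each mechanism and compares the complete-case mean with the mean of the full dataset.

```python
import random
import statistics

def scaled(values):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def missing_mask(mechanism, wind, load, frac, rng):
    """Return a list of booleans, True where the load record is lost."""
    if mechanism == "MCAR":
        drivers = [0.5] * len(load)   # same chance for every record
    elif mechanism == "MAR":
        drivers = scaled(wind)        # chance depends on an observed variable
    elif mechanism == "MNAR":
        drivers = scaled(load)        # chance depends on the lost value itself
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    # Loss probability grows linearly from 0 to 2 * frac with the driver.
    return [rng.random() < 2.0 * frac * d for d in drivers]

# Synthetic stand-in data (hypothetical, not EnBW Baltic 1 measurements).
rng = random.Random(0)
wind = [rng.uniform(4.0, 25.0) for _ in range(2000)]
load = [10.0 * w + rng.gauss(0.0, 20.0) for w in wind]

full_mean = statistics.mean(load)
complete_case_mean = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    mask = missing_mask(mechanism, wind, load, frac=0.3, rng=rng)
    kept = [y for y, lost in zip(load, mask) if not lost]
    complete_case_mean[mechanism] = statistics.mean(kept)
    # Difference between the complete-case mean and the full-data mean.
    print(mechanism, round(complete_case_mean[mechanism] - full_mean, 2))
```

In this toy setting, deleting incomplete records leaves the mean load roughly unchanged only when records are lost completely at random. In the other two cases high loads are lost more often, so the complete-case mean is shifted downwards.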

Results

The missing data mechanism determines which data-handling technique is appropriate. Thus, it needs to be evaluated before constructing a model. Deletion techniques require data to be missing completely at random; otherwise the resulting models are biased. Simple replacement approaches, like mean substitution and regression imputation, may be considered only in limited cases. Finally, hot-deck imputation via a K-NN algorithm is a robust way to replace missing data, since it does not need to assume any pattern in the data.
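As a rough illustration of hot-deck imputation with K-NN, the sketch below fills a missing value with the average of the same variable in the k most similar complete records (the donors). The abstract does not state the distance metric, the value of k or the data layout, so the Euclidean distance, k = 2, the function names and the toy records are all assumptions. The sketch also assumes that the remaining columns are complete and on comparable scales.

```python
import math

def distance(a, b, skip):
    """Euclidean distance between two records, ignoring column `skip`."""
    return math.sqrt(sum((a[j] - b[j]) ** 2
                         for j in range(len(a)) if j != skip))

def knn_hot_deck(records, target, k=2):
    """Fill None entries in column `target` with the mean of k nearest donors."""
    donors = [r for r in records if r[target] is not None]
    filled = []
    for r in records:
        row = list(r)
        if row[target] is None:
            nearest = sorted(donors, key=lambda d: distance(d, r, target))[:k]
            row[target] = sum(d[target] for d in nearest) / len(nearest)
        filled.append(row)
    return filled

def mean_substitution(records, target):
    """Value used by mean substitution: the mean of all observed entries."""
    observed = [r[target] for r in records if r[target] is not None]
    return sum(observed) / len(observed)

# Toy records (hypothetical): two SCADA-like inputs and one load value.
records = [
    [5.0, 10.0, 100.0],
    [6.0, 11.0, 120.0],
    [7.0, 12.0, 140.0],
    [20.0, 30.0, 400.0],
    [6.1, 11.1, None],   # load is missing for this record
]

filled = knn_hot_deck(records, target=2, k=2)
mean_value = mean_substitution(records, target=2)
print(filled[4][2], mean_value)
```

Here the K-NN donors are the two records closest to the incomplete one, so the filled value stays close to loads observed under similar conditions, whereas mean substitution is pulled towards the distant high-load record.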

Conclusions

A review of the wind energy literature dealing with data-driven models shows that many authors do not explicitly state how missing data are handled. It therefore appears that listwise deletion is the most common approach, which assumes that data are missing completely at random. Even if this assumption is correct, deletion eliminates data that can be of interest. Our results indicate that simple traditional replacement methods outperform listwise deletion. Thus, this step should be included during model development. Furthermore, more advanced methods, like multiple imputation and maximum likelihood, are expected to further improve the presented results.

Objectives

The main quantifiable objective is to raise awareness of the impact that missing data might have on data-driven models used for decision-making. The second is to present a practical approach which, even though it is built on simple and traditional statistical techniques, can already help the modeller to address the main issue. Finally, the qualitative goal is to inspire the community to address the same question that motivated this research: How do missing data affect your data-driven model?