Impact of measurement errors on machine learning models in steel production
Publications: Thesis / degree theses and habilitation treatises › Master's thesis
2024.
RIS (suitable for import to EndNote)
TY - THES
T1 - Impact of measurement errors on machine learning models in steel production
AU - Rechberger, Andreas
N1 - no embargo
PY - 2024
Y1 - 2024
N2 - The use of machine learning (ML) models to describe material properties has become an important research direction in materials science, and relevant, useful datasets are increasingly available. However, the systematic treatment of uncertainties associated with measurement errors remains a challenging topic. This thesis investigates the impact of measurement errors on the prediction accuracy of ML algorithms for the mechanical properties of steels. In the first part, this topic is addressed using artificially generated linear and non-linear datasets. It is shown that errors in the target variable y, although affecting prediction accuracy more strongly than errors in the features X, average out if y is normally distributed. Errors in the features X, on the other hand, bias the learned correlation relative to the true one, systematically offsetting the prediction from the true value. In the second part, the thesis aims to provide a deeper understanding of the nature of the prediction error via the bias-variance decomposition. It analyses how a separate determination of the measurement error, combined with an investigation of learning curves, can be used to quantify the bias or variance of an ML model. This is demonstrated on a martensite start temperature (Ms) dataset and an r-value dataset. For the Ms dataset, it is found that for underparameterized models such as linear regression, the training and validation errors converge well above the measurement error, which is indicative of bias. For overparameterized models such as XGBoost or Random Forest, the training error is smaller than the measurement error, while the validation error is considerably higher; these models therefore exhibit variance and would benefit from more data points. The validation-set performance of XGBoost, Random Forest, and Gaussian Process regression is comparable. For the r-value dataset, XGBoost shows behaviour similar to that on the Ms dataset.
Finally, an analysis of models from the literature that predict the mechanical properties obtained from tensile testing reveals that their validation error is close to the measurement error. The bias and variance of these models are therefore very small, and the prediction accuracy is much higher than the validation error (or the associated coefficient of determination R2) suggests.
KW - machine learning
KW - measurement errors
KW - prediction accuracy
KW - steels
KW - linear datasets
KW - non-linear datasets
KW - target variable
KW - feature errors
KW - bias-variance decomposition
KW - learning curves
KW - martensite starting temperature
KW - linear regression
KW - XGBoost
KW - Random Forest
KW - Gaussian Process regression
KW - overparametrization
KW - validation error
KW - training error
KW - tensile testing
KW - coefficient of determination
KW - root mean square error
KW - RMSE
KW - materials science
KW - bias
KW - variance
KW - r-value
U2 - 10.34901/mul.pub.2025.010
DO - 10.34901/mul.pub.2025.010
M3 - Master's Thesis
ER -
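The abstract's first finding, that noise in the target y averages out while noise in the features X systematically biases the fitted correlation, is the classic errors-in-variables (attenuation) effect. The following minimal sketch is not taken from the thesis; it illustrates the effect with ordinary least squares on synthetic data, where all names and parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_slope = 2.0
x = rng.normal(0.0, 1.0, n)      # true feature, unit variance
y = true_slope * x               # noise-free target

# Case 1: Gaussian noise on the target y only -> slope stays unbiased
y_noisy = y + rng.normal(0.0, 1.0, n)
slope_y = np.polyfit(x, y_noisy, 1)[0]

# Case 2: Gaussian noise on the feature X only -> attenuation bias;
# the OLS slope shrinks by var(x) / (var(x) + var(noise)) = 1/2 here
x_noisy = x + rng.normal(0.0, 1.0, n)
slope_x = np.polyfit(x_noisy, y, 1)[0]

print(slope_y)  # close to the true slope 2.0
print(slope_x)  # close to 1.0, i.e. biased toward zero
```

With equal unit variances for feature and noise, the expected attenuation factor is exactly 1/2, matching the abstract's point that feature errors offset predictions systematically rather than averaging out.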
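The abstract's bias diagnosis, that for an underparameterized model the training and validation errors converge well above the independently known measurement error, can be sketched as follows. This is an illustrative toy example under assumed values (a straight line fitted to quadratic data with known noise level sigma), not the thesis's actual datasets or models.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.1                                # known measurement error on y
x = rng.uniform(-1.0, 1.0, 2000)
y = x**2 + rng.normal(0.0, sigma, 2000)    # non-linear ground truth

x_tr, x_va = x[:1000], x[1000:]
y_tr, y_va = y[:1000], y[1000:]

# Underparameterized model: a straight line cannot capture x**2
coef = np.polyfit(x_tr, y_tr, 1)
rmse_tr = np.sqrt(np.mean((np.polyval(coef, x_tr) - y_tr) ** 2))
rmse_va = np.sqrt(np.mean((np.polyval(coef, x_va) - y_va) ** 2))

# Training and validation RMSE converge to a common plateau well
# above sigma -> the excess over sigma signals model bias
print(rmse_tr, rmse_va, sigma)
```

In the overparameterized regime described for XGBoost and Random Forest the picture reverses: the training error drops below sigma while the validation error stays higher, and the gap (variance) shrinks as more data is added.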