Impact of measurement errors on machine learning models in steel production

Publications: Thesis / Final theses and habilitation treatises › Master's thesis


Impact of measurement errors on machine learning models in steel production. / Rechberger, Andreas.
2024.


BibTeX

@mastersthesis{21b594f499d24960963d91f954c93ebc,
title = "Impact of measurement errors on machine learning models in steel production",
abstract = "The use of machine learning (ML) models to describe material properties has become an important research direction in materials science, and relevant datasets are becoming increasingly available. However, the systematic treatment of uncertainties associated with measurement errors remains a challenging topic. This thesis investigates the impact of measurement errors on the prediction accuracy of ML algorithms for the mechanical properties of steels. In the first part, this topic is addressed using artificially generated linear and non-linear datasets. It is shown that errors in the target variable y, although affecting prediction accuracy more strongly than errors in the features X, average out if y is normally distributed. Errors in the features X, on the other hand, introduce a bias relative to the true correlation that systematically offsets the prediction from the true value. In the second part, the thesis provides a deeper understanding of the nature of the prediction error via the bias-variance decomposition. It analyses how a separate determination of the measurement error, combined with an investigation of learning curves, can be used to quantify the bias or variance of an ML model. This is demonstrated on a martensite start temperature (Ms) dataset and an r-value dataset. For the Ms dataset, it is found that for underparametrized models such as linear regression, the training and validation errors converge well above the measurement error, which is indicative of bias. For overparametrized models such as XGBoost or Random Forest, the training error is smaller than the measurement error, while the validation error is sizably higher; these models therefore exhibit variance and would benefit from more data points. The validation-set performance of XGBoost, Random Forest, and Gaussian Process regression is comparable. For the r-value dataset, XGBoost shows behavior similar to that on the Ms dataset. Finally, an analysis of models from the literature that predict the mechanical properties obtained from tensile testing reveals that their validation error is close to the measurement error. The bias and variance of these models are therefore very small, and the prediction accuracy is considerably higher than the validation error (or the associated coefficient of determination R^2) suggests.",
keywords = "machine learning, measurement errors, prediction accuracy, steels, linear datasets, non-linear datasets, target variable, feature errors, bias-variance decomposition, learning curves, martensite starting temperature, linear regression, XGBoost, Random Forest, Gaussian Process regression, overparametrization, validation error, training error, tensile testing, coefficient of determination, root mean square error, RMSE, Machine Learning, Materialwissenschaft, Messfehler, Vorhersagegenauigkeit, Stahl, lineare Datens{\"a}tze, nicht-lineare Datens{\"a}tze, Bias-Variance Decomposition, Bias, Varianz, Lernkurven, Martensit-Starttemperatur, r-Wert, lineare Regression, XGBoost, Random Forest, Gaussian Process Regression, Validierungsfehler, Trainingsfehler, Bestimmtheitsma{\ss}",
author = "Andreas Rechberger",
note = "no embargo",
year = "2024",
doi = "10.34901/mul.pub.2025.010",
language = "English",
school = "Montanuniversitaet Leoben (000)",

}
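The first-part finding, that measurement noise in the target y averages out while noise in the features X systematically biases the fit, can be illustrated with a minimal NumPy sketch. The data, slope, and noise level below are illustrative assumptions, not the datasets used in the thesis:

```python
import numpy as np

# Synthetic illustration: noise in the target y leaves the fitted slope
# unbiased, while noise in the features X attenuates it (classical
# errors-in-variables behavior, also known as regression dilution).
rng = np.random.default_rng(0)
n = 100_000
true_slope = 2.0

x = rng.normal(size=n)   # noise-free feature, Var(x) = 1
y = true_slope * x       # noise-free target

def ols_slope(u, v):
    # slope of the least-squares line v ~ u
    return np.cov(u, v, bias=True)[0, 1] / np.var(u)

# Case 1: measurement error only in y -> slope estimate stays near 2.0
slope_y_noise = ols_slope(x, y + rng.normal(scale=1.0, size=n))

# Case 2: measurement error only in X -> slope shrinks toward
# true_slope * Var(x) / (Var(x) + 1) = 1.0
slope_x_noise = ols_slope(x + rng.normal(scale=1.0, size=n), y)

print(f"y-noise slope: {slope_y_noise:.2f}, X-noise slope: {slope_x_noise:.2f}")
```

With enough samples, the y-noise slope recovers the true value while the X-noise slope is halved, matching the qualitative picture in the abstract.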

RIS (suitable for import into EndNote)

TY - THES

T1 - Impact of measurement errors on machine learning models in steel production

AU - Rechberger, Andreas

N1 - no embargo

PY - 2024

Y1 - 2024

N2 - The use of machine learning (ML) models to describe material properties has become an important research direction in materials science, and relevant datasets are becoming increasingly available. However, the systematic treatment of uncertainties associated with measurement errors remains a challenging topic. This thesis investigates the impact of measurement errors on the prediction accuracy of ML algorithms for the mechanical properties of steels. In the first part, this topic is addressed using artificially generated linear and non-linear datasets. It is shown that errors in the target variable y, although affecting prediction accuracy more strongly than errors in the features X, average out if y is normally distributed. Errors in the features X, on the other hand, introduce a bias relative to the true correlation that systematically offsets the prediction from the true value. In the second part, the thesis provides a deeper understanding of the nature of the prediction error via the bias-variance decomposition. It analyses how a separate determination of the measurement error, combined with an investigation of learning curves, can be used to quantify the bias or variance of an ML model. This is demonstrated on a martensite start temperature (Ms) dataset and an r-value dataset. For the Ms dataset, it is found that for underparametrized models such as linear regression, the training and validation errors converge well above the measurement error, which is indicative of bias. For overparametrized models such as XGBoost or Random Forest, the training error is smaller than the measurement error, while the validation error is sizably higher; these models therefore exhibit variance and would benefit from more data points. The validation-set performance of XGBoost, Random Forest, and Gaussian Process regression is comparable. For the r-value dataset, XGBoost shows behavior similar to that on the Ms dataset. Finally, an analysis of models from the literature that predict the mechanical properties obtained from tensile testing reveals that their validation error is close to the measurement error. The bias and variance of these models are therefore very small, and the prediction accuracy is considerably higher than the validation error (or the associated coefficient of determination R^2) suggests.

AB - The use of machine learning (ML) models to describe material properties has become an important research direction in materials science, and relevant datasets are becoming increasingly available. However, the systematic treatment of uncertainties associated with measurement errors remains a challenging topic. This thesis investigates the impact of measurement errors on the prediction accuracy of ML algorithms for the mechanical properties of steels. In the first part, this topic is addressed using artificially generated linear and non-linear datasets. It is shown that errors in the target variable y, although affecting prediction accuracy more strongly than errors in the features X, average out if y is normally distributed. Errors in the features X, on the other hand, introduce a bias relative to the true correlation that systematically offsets the prediction from the true value. In the second part, the thesis provides a deeper understanding of the nature of the prediction error via the bias-variance decomposition. It analyses how a separate determination of the measurement error, combined with an investigation of learning curves, can be used to quantify the bias or variance of an ML model. This is demonstrated on a martensite start temperature (Ms) dataset and an r-value dataset. For the Ms dataset, it is found that for underparametrized models such as linear regression, the training and validation errors converge well above the measurement error, which is indicative of bias. For overparametrized models such as XGBoost or Random Forest, the training error is smaller than the measurement error, while the validation error is sizably higher; these models therefore exhibit variance and would benefit from more data points. The validation-set performance of XGBoost, Random Forest, and Gaussian Process regression is comparable. For the r-value dataset, XGBoost shows behavior similar to that on the Ms dataset. Finally, an analysis of models from the literature that predict the mechanical properties obtained from tensile testing reveals that their validation error is close to the measurement error. The bias and variance of these models are therefore very small, and the prediction accuracy is considerably higher than the validation error (or the associated coefficient of determination R^2) suggests.

KW - machine learning

KW - measurement errors

KW - prediction accuracy

KW - steels

KW - linear datasets

KW - non-linear datasets

KW - target variable

KW - feature errors

KW - bias-variance decomposition

KW - learning curves

KW - martensite starting temperature

KW - linear regression

KW - XGBoost

KW - Random Forest

KW - Gaussian Process regression

KW - overparametrization

KW - validation error

KW - training error

KW - tensile testing

KW - coefficient of determination

KW - root mean square error

KW - RMSE

KW - Machine Learning

KW - Materialwissenschaft

KW - Messfehler

KW - Vorhersagegenauigkeit

KW - Stahl

KW - lineare Datensätze

KW - nicht-lineare Datensätze

KW - Bias-Variance Decomposition

KW - Bias

KW - Varianz

KW - Lernkurven

KW - Martensit-Starttemperatur

KW - r-Wert

KW - lineare Regression

KW - XGBoost

KW - Random Forest

KW - Gaussian Process Regression

KW - Validierungsfehler

KW - Trainingsfehler

KW - Bestimmtheitsmaß

U2 - 10.34901/mul.pub.2025.010

DO - 10.34901/mul.pub.2025.010

M3 - Master's Thesis

ER -
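The second-part diagnostic described in the abstract, comparing training and validation errors against a separately known measurement error to attribute the gap to bias or variance, can be sketched on synthetic data. The target function, model capacities, and noise level below are illustrative assumptions, not the thesis's Ms or r-value data:

```python
import numpy as np
from numpy.polynomial import Chebyshev

# Sketch of the bias/variance diagnostic: fit models of different capacity
# to data with a *known* measurement error, then compare training and
# validation RMSE against that error.
rng = np.random.default_rng(1)
noise = 0.3   # known std of the measurement error in y
n = 60

def sample():
    x = rng.uniform(-1, 1, size=n)
    return x, np.sin(3 * x) + rng.normal(scale=noise, size=n)

x_tr, y_tr = sample()   # training set
x_va, y_va = sample()   # validation set

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

def fit_rmse(degree):
    # Chebyshev polynomial fit of given degree;
    # returns (training RMSE, validation RMSE)
    p = Chebyshev.fit(x_tr, y_tr, degree)
    return rmse(p(x_tr), y_tr), rmse(p(x_va), y_va)

# Underparametrized model: both errors converge well above the
# measurement error -> the gap to the noise level indicates bias.
train_lin, val_lin = fit_rmse(1)

# Overparametrized model: training error drops below the measurement
# error while validation error stays higher -> variance; more data helps.
train_big, val_big = fit_rmse(30)

print(f"linear:  train {train_lin:.2f}, val {val_lin:.2f}, noise {noise}")
print(f"flexible: train {train_big:.2f}, val {val_big:.2f}, noise {noise}")
```

The comparison mirrors the abstract's reading of the Ms dataset: the low-capacity fit plateaus above the known noise (bias), while the high-capacity fit interpolates below it yet validates worse (variance).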