Symbolic Data Representation of Multi-Variate Machine Measurement Data to Identify Quasi-Linguistic Patterns with Machine Learning
Research output: Thesis › Master's Thesis
2023.
RIS (suitable for import to EndNote)
TY - THES
T1 - Symbolic Data Representation of Multi-Variate Machine Measurement Data to Identify Quasi-Linguistic Patterns with Machine Learning
AU - Nuser, Philip
N1 - no embargo
PY - 2023
Y1 - 2023
AB - This thesis is an exploratory study of unsupervised anomaly detection in multi-variate time-series machine data using methods that originate from natural language processing. The foundation is tokenization of the time-series data, i.e., converting the numeric machine data into symbolic data similar to text. Tokenization is realized by discretizing the data and assigning a unique token to each discrete value. The resulting symbolic sequences are then inspected for anomalies with two different approaches. The first is based on word n-gram language models. An n-gram is a sequence of n words. The counts of the n-grams in the data set are computed and weighted with term frequency-inverse document frequency (TF-IDF), a measure of the importance of a word to a document, to derive an anomaly score for each sequence. The second approach uses a machine learning model, more specifically a masked language model with a transformer architecture at its core. Random tokens in the input sequences are masked, and the transformer is trained to recreate the numeric sequences. When the model outputs a diverging numeric sequence for an input sequence that was not used for training, anomalies are expected in that sequence. Both anomaly detection methods were implemented and successfully applied to an unlabeled data set originating from instrumented machinery used for ground improvement of building foundations. The results indicate that both approaches work in principle and strongly justify continued work in this area, especially on the machine learning model.
KW - anomaly detection
KW - multi-variate
KW - time-series
KW - machine data
KW - machine learning
KW - artificial intelligence
KW - unsupervised learning
KW - linguistic methods
KW - neural network
KW - transformer
KW - attention
KW - self-attention
KW - large language model
KW - tokenization
KW - discretization
KW - statistics
KW - masked language model
KW - BERT
KW - symbolic data
KW - n-gram
KW - bag-of-words
KW - bag-of-ngrams
KW - term frequency-inverse document frequency
KW - TF-IDF
U2 - 10.34901/mul.pub.2024.015
DO - 10.34901/mul.pub.2024.015
M3 - Master's Thesis
ER -
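
The first approach described in the abstract (discretize, assign tokens, count n-grams, weight with TF-IDF) can be illustrated with a minimal sketch. It is not the thesis implementation: the equal-width binning, the letter alphabet, the univariate input, the function names (`tokenize`, `ngrams`, `tfidf_scores`), and the choice to sum TF-IDF weights into one score per sequence are all assumptions made for illustration only.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # hypothetical token set, one symbol per bin


def tokenize(values, lo, hi, n_bins):
    """Discretize numeric values into equal-width bins and map each bin to a symbol.

    Assumption: equal-width bins over [lo, hi]; values outside are clipped.
    """
    if not 0 < n_bins <= len(ALPHABET):
        raise ValueError("n_bins must be between 1 and len(ALPHABET)")
    width = (hi - lo) / n_bins
    tokens = []
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), n_bins - 1)  # clip to the valid bin range
        tokens.append(ALPHABET[idx])
    return tokens


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_scores(sequences, n=2):
    """Score each token sequence by the sum of the TF-IDF weights of its n-grams.

    Assumption: each sequence is treated as one 'document'; a higher score
    means the sequence contains n-grams that are rare across the data set.
    """
    docs = [Counter(ngrams(seq, n)) for seq in sequences]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in docs:
        doc_freq.update(counts.keys())  # number of documents containing each n-gram
    scores = []
    for counts in docs:
        total = sum(counts.values())
        if total == 0:
            scores.append(0.0)  # sequence shorter than n: no n-grams to score
            continue
        score = sum(
            (c / total) * math.log(n_docs / doc_freq[gram])
            for gram, c in counts.items()
        )
        scores.append(score)
    return scores


# Made-up example data: three similar sequences and one with a spike.
series = [
    [0.1, 0.3, 0.1, 0.3],
    [0.1, 0.3, 0.1, 0.3],
    [0.1, 0.3, 0.1, 0.3],
    [0.1, 0.3, 0.9, 0.3],
]
token_sequences = [tokenize(s, lo=0.0, hi=1.0, n_bins=4) for s in series]
scores = tfidf_scores(token_sequences, n=2)
most_anomalous = scores.index(max(scores))
```

In this toy data set the sequence containing the spike produces bigrams that occur in no other sequence, so it receives the highest score. A multi-variate version would additionally need a rule for combining several channels into one token, which the abstract does not specify.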