Symbolic Data Representation of Multi-Variate Machine Measurement Data to Identify Quasi-Linguistic Patterns with Machine Learning

Research output: Thesis › Master's Thesis

BibTeX

@mastersthesis{719bebfb8304468ab69e907dbe979591,
title = "Symbolic Data Representation of Multi-Variate Machine Measurement Data to Identify Quasi-Linguistic Patterns with Machine Learning",
abstract = "This thesis is an exploratory work of unsupervised anomaly detection in multi-variate time-series machine data with methods originating from natural language processing. The foundation is laid by tokenizing the time-series data, i.e., converting the numeric machine data into symbolic data similar to textual data. The process of tokenization is realized by discretizing the data and assigning unique tokens to the discrete values. The symbolic sequences obtained are then inspected for anomalies with two different approaches. The first method is based on word n-gram language models. A n-gram is a sequence of words of length n. The counts of those n-grams in the data set are computed and weighed with a measure for the importance of a word to a document, the term frequency-inverse document frequency measure, to derive anomaly scores for each sequence. The second approach presented utilizes a machine learning model, more specific a masked language model with a transformer architecture at its core. Random tokens in the input sequences get masked and the transformer is trained to recreate the numeric sequences. When an input sequence that has not been used for training outputs a diverging numeric sequence, anomalies in this sequence are expected. Both anomaly detection methods were programmed and successfully applied to an unlabeled data set originating from instrumented machinery used for ground improvement of building foundations. The results indicate that both approaches are principally functional and strongly justify continued work in this area, especially on the machine learning model.",
keywords = "Anomalieerkennung, multivariat, Zeitreihe, Maschinendaten, maschinelles Lernen, k{\"u}nstliche Intelligenz, unbeaufsichtigtes Lernen, linguistische Methoden, neuronales Netzwerk, Transformer, Attention, Self-Attention, gro{\ss}es Sprachmodell, Tokenisierung, Diskretisierung, Statistik, maskiertes Sprachmodell, BERT, symbolische Daten, n-Gramm, Bag-of-words, Bag-of-ngrams, term frequency-inverse document frequency, TF-IDF, anomaly detection, multi-variate, time-series, machine data, machine learning, artificial intelligence, unsupervised learning, linguistic methods, neural network, transformer, attention, self-attention, large language model, tokenization, discretization, statistics, masked language model, BERT, symbolic data, n-gram, bag-of-words, bag-of-ngrams, term frequency-inverse document frequency, TF-IDF",
author = "Philip Nuser",
note = "no embargo",
year = "2023",
doi = "10.34901/mul.pub.2024.015",
language = "English",
school = "Montanuniversitaet Leoben (000)",

}
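
The abstract above describes turning the numeric multi-variate measurements into symbolic sequences by discretizing each channel and assigning a unique token to every discrete value. The record does not state which discretization scheme was used, so the following is a minimal sketch in Python assuming per-channel quantile binning; the channel names, bin count, and token format are illustrative assumptions rather than the thesis' actual implementation.

import numpy as np

def tokenize_channels(data: np.ndarray, channel_names: list[str], n_bins: int = 10) -> list[list[str]]:
    """Discretize each channel of a multi-variate time series (shape: [timesteps, channels])
    into quantile bins and map every (channel, bin) pair to a unique symbolic token."""
    # Pre-compute quantile bin edges per channel so every channel gets equally populated bins.
    edges = [np.quantile(data[:, c], np.linspace(0, 1, n_bins + 1)[1:-1]) for c in range(data.shape[1])]
    tokens_per_step = []
    for row in data:
        step_tokens = []
        for c, value in enumerate(row):
            bin_idx = int(np.searchsorted(edges[c], value))        # index of the bin the value falls into
            step_tokens.append(f"{channel_names[c]}_{bin_idx}")    # e.g. "pressure_7"
        tokens_per_step.append(step_tokens)
    return tokens_per_step

# Illustrative usage with random stand-in data for three hypothetical sensor channels.
measurements = np.random.default_rng(0).normal(size=(500, 3))
symbolic = tokenize_channels(measurements, ["depth", "pressure", "flow"])
print(symbolic[0])   # e.g. ['depth_4', 'pressure_7', 'flow_2']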

RIS (suitable for import to EndNote)

TY - THES

T1 - Symbolic Data Representation of Multi-Variate Machine Measurement Data to Identify Quasi-Linguistic Patterns with Machine Learning

AU - Nuser, Philip

N1 - no embargo

PY - 2023

Y1 - 2023

N2 - This thesis is an exploratory work on unsupervised anomaly detection in multi-variate time-series machine data using methods originating from natural language processing. The foundation is laid by tokenizing the time-series data, i.e., converting the numeric machine data into symbolic data similar to textual data. Tokenization is realized by discretizing the data and assigning a unique token to each discrete value. The resulting symbolic sequences are then inspected for anomalies with two different approaches. The first method is based on word n-gram language models; an n-gram is a sequence of n consecutive words. The counts of these n-grams in the data set are computed and weighted with a measure of how important a word is to a document, the term frequency-inverse document frequency measure, to derive an anomaly score for each sequence. The second approach utilizes a machine learning model, more specifically a masked language model with a transformer architecture at its core. Random tokens in the input sequences are masked and the transformer is trained to recreate the numeric sequences. When a sequence that was not used for training yields a diverging numeric output, anomalies in that sequence are suspected. Both anomaly detection methods were implemented and successfully applied to an unlabeled data set originating from instrumented machinery used for ground improvement of building foundations. The results indicate that both approaches are functional in principle and strongly justify continued work in this area, especially on the machine learning model.

AB - This thesis is an exploratory work on unsupervised anomaly detection in multi-variate time-series machine data using methods originating from natural language processing. The foundation is laid by tokenizing the time-series data, i.e., converting the numeric machine data into symbolic data similar to textual data. Tokenization is realized by discretizing the data and assigning a unique token to each discrete value. The resulting symbolic sequences are then inspected for anomalies with two different approaches. The first method is based on word n-gram language models; an n-gram is a sequence of n consecutive words. The counts of these n-grams in the data set are computed and weighted with a measure of how important a word is to a document, the term frequency-inverse document frequency measure, to derive an anomaly score for each sequence. The second approach utilizes a machine learning model, more specifically a masked language model with a transformer architecture at its core. Random tokens in the input sequences are masked and the transformer is trained to recreate the numeric sequences. When a sequence that was not used for training yields a diverging numeric output, anomalies in that sequence are suspected. Both anomaly detection methods were implemented and successfully applied to an unlabeled data set originating from instrumented machinery used for ground improvement of building foundations. The results indicate that both approaches are functional in principle and strongly justify continued work in this area, especially on the machine learning model.

KW - Anomalieerkennung

KW - multivariat

KW - Zeitreihe

KW - Maschinendaten

KW - maschinelles Lernen

KW - künstliche Intelligenz

KW - unbeaufsichtigtes Lernen

KW - linguistische Methoden

KW - neuronales Netzwerk

KW - Transformer

KW - Attention

KW - Self-Attention

KW - großes Sprachmodell

KW - Tokenisierung

KW - Diskretisierung

KW - Statistik

KW - maskiertes Sprachmodell

KW - BERT

KW - symbolische Daten

KW - n-Gramm

KW - Bag-of-words

KW - Bag-of-ngrams

KW - term frequency-inverse document frequency

KW - TF-IDF

KW - anomaly detection

KW - multi-variate

KW - time-series

KW - machine data

KW - machine learning

KW - artificial intelligence

KW - unsupervised learning

KW - linguistic methods

KW - neural network

KW - transformer

KW - attention

KW - self-attention

KW - large language model

KW - tokenization

KW - discretization

KW - statistics

KW - masked language model

KW - BERT

KW - symbolic data

KW - n-gram

KW - bag-of-words

KW - bag-of-ngrams

KW - term frequency-inverse document frequency

KW - TF-IDF

U2 - 10.34901/mul.pub.2024.015

DO - 10.34901/mul.pub.2024.015

M3 - Master's Thesis

ER -
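
As a sketch of the first approach described in the abstract (n-gram counts weighted with the term frequency-inverse document frequency measure to score each sequence), the snippet below uses scikit-learn's CountVectorizer on a few made-up symbolic sequences. Scoring a sequence by the count-weighted mean IDF of its n-grams is an assumption made for illustration; the record does not give the thesis' exact scoring rule.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Each "document" is one symbolic sequence with its tokens joined by spaces.
sequences = [
    "depth_4 pressure_7 flow_2 depth_4 pressure_7 flow_2",
    "depth_4 pressure_7 flow_2 depth_5 pressure_7 flow_2",
    "depth_9 pressure_0 flow_9 depth_9 pressure_1 flow_9",  # deviating sequence
]

# Count word n-grams (here uni- to tri-grams) over whitespace-separated tokens.
vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, 3), token_pattern=r"\S+")
counts = vectorizer.fit_transform(sequences)               # shape: (n_sequences, n_ngrams)

# Inverse document frequency: n-grams that occur in few sequences get a high weight.
doc_freq = np.asarray((counts > 0).sum(axis=0)).ravel()
idf = np.log(len(sequences) / doc_freq)

# Assumed scoring rule for this sketch: the count-weighted mean IDF of a sequence's
# n-grams, so sequences dominated by rare n-grams receive high anomaly scores.
weighted = counts.multiply(idf)                            # term frequency x inverse document frequency
scores = np.asarray(weighted.sum(axis=1)).ravel() / np.asarray(counts.sum(axis=1)).ravel()
for seq_id, score in enumerate(scores):
    print(f"sequence {seq_id}: anomaly score {score:.3f}")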
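
The second approach in the abstract masks random tokens and trains a transformer to reconstruct the sequences, flagging unseen sequences whose reconstructions diverge. Below is a compact stand-in in PyTorch: it predicts the masked symbolic token ids directly rather than the numeric sequences mentioned in the abstract, and the vocabulary size, model dimensions, masking rate, and random stand-in data are all illustrative assumptions, not the thesis' actual model or data.

import torch
import torch.nn as nn

class MaskedSequenceModel(nn.Module):
    """A small BERT-style encoder that predicts the original token at masked positions."""
    def __init__(self, vocab_size: int, d_model: int = 64, n_heads: int = 4,
                 n_layers: int = 2, max_len: int = 128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size + 1, d_model)      # extra index serves as [MASK]
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        hidden = self.encoder(self.token_emb(token_ids) + self.pos_emb(positions))
        return self.head(hidden)                                    # logits over the token vocabulary

def mask_tokens(batch: torch.Tensor, mask_id: int, p: float = 0.15):
    """Randomly replace a fraction p of the tokens with the [MASK] id."""
    mask = torch.rand_like(batch, dtype=torch.float) < p
    masked = batch.clone()
    masked[mask] = mask_id
    return masked, mask

# Hypothetical setup: token ids as produced by the tokenization step, vocabulary of 300 symbols.
vocab_size, mask_id = 300, 300
model = MaskedSequenceModel(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

train_batch = torch.randint(0, vocab_size, (8, 64))                 # stand-in training sequences
for _ in range(3):                                                  # abbreviated training loop
    masked, mask = mask_tokens(train_batch, mask_id)
    logits = model(masked)
    loss = loss_fn(logits[mask], train_batch[mask])                 # reconstruct only the masked positions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Anomaly scoring sketch: sequences the trained model reconstructs poorly are suspicious.
model.eval()
with torch.no_grad():
    unseen = torch.randint(0, vocab_size, (1, 64))                  # stand-in sequence not used for training
    masked, mask = mask_tokens(unseen, mask_id)
    score = loss_fn(model(masked)[mask], unseen[mask]).item()
print(f"reconstruction loss as anomaly score: {score:.3f}")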