The importance of data quality and traceability in data mining. Applications of robust methods for multivariate data analysis. A case-study conducting the herring industry

Stina Frosch

Research output: Book/Report, Ph.D. thesis, Research


Abstract

The general aim of the thesis was to develop a documentation system and to improve the basis for decision-making in quality and production control within the herring processing industry. Furthermore, the possibilities of utilizing multivariate data analysis were investigated using data from catch to final product throughout the production chain. When vast amounts of data are generated, as in herring processing, some samples turn out to deviate from the majority of samples; these are designated outliers. By their nature, outliers can impair models based on traditional multivariate methods that use least squares estimation. For that reason, the possible advantages and drawbacks of employing robust multivariate methods as an alternative to the traditional methods were investigated.
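The sensitivity of least squares estimation to outliers can be illustrated with the simplest possible case, a location estimate. The sketch below is not taken from the thesis: the readings are invented, and the mean (which minimizes the sum of squared deviations) is compared with the median (which minimizes the sum of absolute deviations).

```python
from statistics import mean, median

# Hypothetical fat-content readings (percent) for one batch; values are invented.
clean = [14.8, 15.1, 15.0, 14.9, 15.2]
with_outlier = clean + [45.0]  # one deviating sample added


def shift(estimator):
    """Absolute change in the location estimate caused by the single outlier."""
    return abs(estimator(with_outlier) - estimator(clean))


mean_shift = shift(mean)      # least squares location estimate
median_shift = shift(median)  # least absolute deviation location estimate

print(f"mean shift: {mean_shift:.2f}, median shift: {median_shift:.2f}")
```

In this toy case the single deviating sample moves the mean by several percentage points while the median barely changes, which is the behaviour that motivates robust alternatives.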

The first part of the exploratory work was carried out as a case study, exploiting the multiplicity of empirical and biological data intended for quality determination in one of the leading businesses within the herring industry in Denmark. The work started with the construction of a database to store all registered information, which was then extended so that data such as measured weights were transmitted and imported into the database automatically. Where automatic transmission was not possible, data were entered into the database manually as soon as they were generated. The preliminary screening of the data demonstrated that traceability could be confirmed from the vessel to the finished marinated herring product, with the smallest unit of traceability being a batch of topped product. This finding revealed that it was possible, at any time, to trace any given product back to the vessel that originally caught the fish and to extract all data connected to that specific product. Unfortunately, a large part of the registrations lacked variability and suffered from uncertainties caused by the lack of traceability and/or by doubts about how the analyses had actually been registered. This, in combination with missing relevant information, meant that the data in their present form were neither relevant nor representative for any further multivariate data analysis. For that reason, it was not possible to identify relations between, for instance, the quality characteristics of the raw material and the yield, and thereby improve the basis for decision-making in quality and production control within the herring processing industry.
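The kind of trace-back described above can be sketched with a small relational schema. Everything in the example is hypothetical: the table names, column names, and values are invented for illustration and do not describe the database built in the thesis.

```python
import sqlite3

# In-memory database with an invented three-table schema:
# landing (one catch from one vessel) -> batch (smallest traceable unit) -> measurement.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE landing (landing_id INTEGER PRIMARY KEY, vessel TEXT);
CREATE TABLE batch (batch_id INTEGER PRIMARY KEY,
                    landing_id INTEGER REFERENCES landing(landing_id));
CREATE TABLE measurement (batch_id INTEGER REFERENCES batch(batch_id),
                          name TEXT, value REAL);
""")
conn.execute("INSERT INTO landing VALUES (1, 'Vessel A')")
conn.execute("INSERT INTO batch VALUES (10, 1)")
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?)",
    [(10, "fat_pct", 15.2), (10, "weight_kg", 480.0)],
)


def trace(batch_id):
    """Return the vessel for a batch and all measurements registered on it."""
    row = conn.execute(
        "SELECT l.vessel FROM batch b "
        "JOIN landing l ON l.landing_id = b.landing_id "
        "WHERE b.batch_id = ?",
        (batch_id,),
    ).fetchone()
    measurements = dict(
        conn.execute(
            "SELECT name, value FROM measurement WHERE batch_id = ?", (batch_id,)
        ).fetchall()
    )
    return (row[0] if row else None), measurements


vessel, data = trace(10)
print(vessel, data)
```

The key design point is that every registration carries the identifier of the batch it belongs to; a registration without such a key cannot be linked back to a vessel or forward to a product.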

Although the data had to be discarded for the purposes of multivariate data analysis, they proved useful in showing which information needed to be improved or added in order to be of value to the business. Examples include the registration of belly bursting and waste, along with the implementation of an on-line determination of fat content at the single-fish level and subsequent sorting of the raw material based on this fat determination. Additionally, a quality evaluation system for the marinated herring would improve the significance of the data.

Gas chromatograms of fatty acid methyl esters (GC-FAME) and of volatile lipid oxidation products (GC-ATD) from fish lipid extracts were analysed by multivariate data analysis (principal component analysis, PCA). Peak alignment was necessary in order to include all sampled points of the chromatograms in the data set. The ability of robust algorithms to deal with outlier problems, including both sample-wise and element-wise outliers, was investigated, together with the advantages and drawbacks of two robust PCA methods, robust PCA (ROBPCA) and robust singular value decomposition (RSVD), when analysing these GC data. The results showed that robust PCA is advantageous compared to traditional PCA when analysing the entire profile of chromatographic data in cases of sub-optimally aligned data. It was also demonstrated how the choice of robust PCA method, sample-wise (ROBPCA) or element-wise (RSVD), depended on the type of outliers present in the data set.

The potential of removing Rayleigh and Raman scatter from fluorescence data (excitation-emission landscapes) by employing robust PARAFAC was investigated. A PARAFAC algorithm was made robust by substituting least absolute error (LAE) estimation for least squares estimation. The conclusion was that LAE PARAFAC cannot be considered a reliable method for handling scatter, because of the systematic nature of scattering. However, by taking advantage of the systematic nature of the scatter, an automatic method based on robust techniques for identification of scatter in fluorescence data was developed. This method can handle both Raman and 1st and 2nd order Rayleigh scatter, and does not require any a priori visual inspection of the data before modelling.
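The difference between the least squares and least absolute error criteria can be shown on a one-parameter model, y ≈ b·x. This is only a sketch of the estimation criterion, not of PARAFAC or of the algorithm used in the thesis, and the data are invented. For this model the LAE solution is a weighted median of the ratios y/x with weights |x|, since Σ|y − b·x| = Σ|x|·|y/x − b|. The example contains one isolated spike; it does not reproduce the systematic scatter for which the abstract reports LAE PARAFAC to be unreliable.

```python
def fit_least_squares(x, y):
    """Scale factor b minimizing the sum of squared residuals (y - b*x)**2."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


def fit_least_absolute_error(x, y):
    """Scale factor b minimizing the sum of |y - b*x| (weighted median of y/x)."""
    pairs = sorted((yi / xi, abs(xi)) for xi, yi in zip(x, y) if xi != 0)
    half = sum(w for _, w in pairs) / 2.0
    cumulative = 0.0
    for ratio, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return ratio
    raise ValueError("need at least one nonzero x value")


# Invented data: y = 2*x everywhere except one spike at the third point.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 60.0, 8.0, 10.0]

b_ls = fit_least_squares(x, y)
b_lae = fit_least_absolute_error(x, y)
print(f"least squares b = {b_ls:.3f}, least absolute error b = {b_lae:.3f}")
```

With an isolated spike the LAE estimate stays at the slope of the majority of points while the least squares estimate is pulled towards the spike. When the deviating elements are numerous and structured, as scatter bands are, the majority argument behind the weighted median no longer holds.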

The investigation of robust calibration methods for predicting the fat content of fish from NIR measurements, in a data set with no extreme outliers present, showed that the advantages of employing robust methods for prediction were negligible. A slightly better prediction was obtained with robust SIMPLS (RSIMPLS) than with classical PLSR, but further investigation is needed to test the performance on an independent test set. Given the drawbacks of the robust methods, especially the lower statistical efficiency and the time-consuming computations, the advantages of robust methods seem to be eliminated when the data set contains no obvious outliers.
Original language: English
Place of Publication: Kgs. Lyngby
Publisher: Danish Institute for Fisheries Research, Department of Seafood Research & The Technical University of Denmark
Number of pages: 84
Publication status: Published - May 2006
