Methods and tools for the statistical data analysis of large datasets collected from bio-based manufacturing processes

Research output: Book/Report › Ph.D. thesis


In bio-manufacturing, biological systems are harnessed to produce useful organic materials for use in, for example, the food, medicine or agricultural industries. The most common mode of production in this sector is the batch process. In a batch process, a reactor vessel is filled with raw materials such as bacterial culture, water and sugar. It is then subjected to controlled conditions for a finite duration, during which its contents undergo transformation, and finally the end-product is harvested from the reactor. Typically, a variety of sensors measure conditions in the reactor throughout each batch, such as temperature, pressure and concentration. With advances in sensor technology and computational power, the volume of data collected in this way is ever increasing. The goal of the thesis is to contribute new techniques for utilising this data to improve process understanding and product quality. The existing literature on statistical monitoring and quality prediction for batch processes is reviewed, highlighting the challenges presented by batch process data. These include its three-dimensional structure (conventionally represented as I batches × J variables × K time-points), comprising highly multivariate, cross-correlated, autocorrelated and non-stationary variable trajectories for each batch. An aspect of the data which leads to a number of contributions in the thesis is the variation in the time dimension often present in batch processes: comparable events occur at different times in different batches, so that the shapes and features in the resulting variable trajectories are not synchronised. In addition, the overall duration of different batches in a process may vary, leading to different numbers of observations and complicating the application of standard bi-linear or tri-linear methods. Dynamic time warping (DTW) has previously been applied to synchronise batch process data and address these issues.
The DTW algorithm identifies an optimal warping function, which stretches and compresses each batch in order to synchronise the variable trajectories. The warping function obtained for each batch may be interpreted as the progress signature of the batch. Using a case study of a bacterial culture batch process from Chr. Hansen, the advantages of including local constraints in the DTW algorithm, so that the warping function is a more realistic representation of batch progress, are demonstrated, and a method for selecting the local constraint is presented. In another case study using data from Chr. Hansen, a novel method is developed for predicting the harvest time of a batch at an early stage, whilst the batch is in progress. The method utilises lasso regression to select the variables that are important for making the prediction, and combines the prediction with the progress information contained in the warping function from online alignment with DTW. Early harvest time prediction can contribute to the scheduling of downstream resources. In a third real industrial case study, lasso regression is again utilised to obtain quality predictions for batches of pectin produced by CP Kelco. The approach is contrasted with partial least squares models, and comparable estimated prediction error is obtained using lasso regression, in addition to a more parsimonious and interpretable model. Finally, the ability of DTW to quantify similarity between time series is exploited to develop a method for monitoring batch processes online to detect if a fault occurs. This method is based on the nearest neighbour principle, comparing an ongoing batch to its k nearest neighbours in a database of successful batches, according to the DTW distance. If the distance to the k nearest neighbours increases too quickly, an alarm is signalled to indicate that a fault has occurred.
The method is demonstrated using a simulated dataset, representing batch production of penicillin, which contains a wide variety of fault types, magnitudes and onset times. The performance of the novel method is contrasted with a benchmark principal component analysis (PCA) based approach, and it is shown to have a higher detection rate and faster detection speed when there is clustering of batches in the reference dataset.
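
As an illustration of the DTW ideas summarised in the abstract, the following is a minimal sketch and not the implementation used in the thesis. It assumes univariate trajectories, an absolute-difference local cost and the basic unconstrained step pattern, whereas the thesis argues for local constraints. The function names `dtw` and `knn_distance` are hypothetical. The sketch compares complete sequences only; online alignment of a partially observed batch and the alarm rule on the rate of increase of the distance are not reproduced here.

```python
import math


def dtw(a, b):
    """Return (distance, warping_path) for two univariate sequences.

    Assumptions (not from the thesis): absolute difference as local cost
    and the basic step pattern (diagonal, vertical, horizontal) with no
    local slope constraints.
    """
    n, m = len(a), len(b)
    # cost[i][j]: minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])

    # Backtrack from the end point to recover the warping path,
    # stored as pairs of zero-based indices (index in a, index in b).
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path


def knn_distance(batch, reference_batches, k=3):
    """Mean DTW distance from a batch to its k nearest reference batches."""
    dists = sorted(dtw(batch, ref)[0] for ref in reference_batches)
    return sum(dists[:k]) / k


if __name__ == "__main__":
    # Toy trajectories: the second is a time-stretched copy of the first.
    distance, path = dtw([1, 2, 3], [1, 2, 2, 3])
    print(distance, path)
    print(knn_distance([1, 2, 3], [[1, 2, 3], [1, 2, 2, 3], [5, 6, 7]], k=2))
```

The warping path returned by `dtw` plays the role of the progress signature described above: each pair maps a time-point in one batch to the matching time-point in the other. In a monitoring setting, `knn_distance` would be re-evaluated as new observations arrive and its growth tracked over time.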
Original language: English
Publisher: DTU Compute
Number of pages: 152
Publication status: Published - 2018
Series: DTU Compute PHD-2018
