On November 30th, 2017, Kristjan Korjus defended his thesis "Analyzing EEG data and improving data partitioning for machine learning algorithms" ("EEG andmete analüüs ja andmepartitsioonide arendamine masinõppe algoritmidele").
Supervisor:
Prof. Raul Vicente Zafra (Institute of Computer Science, UT)

Opponents:
PhD Davit Bzhalava (Karolinska Institute, Department of Laboratory Medicine)
PhD Ricardo Vigario (Dept of Physics, Faculty of Sciences and Technology, University of Nova)
The thesis proposes a novel, more efficient data-handling method for machine learning. In classical statistics, models are relatively simple, and together with some assumptions about the data it is possible to say whether a given result is statistically significant. Machine learning algorithms, on the other hand, can have hundreds of millions of model weights. Such models can fit any data with 100% accuracy, which changes the rules of the game. This issue is solved by evaluating models on a separate test set: some data points are withheld from the model-fitting phase, and once the best model has been found, its quality is evaluated on that test set. This method works well, but it has the drawback that some of the precious data is spent on testing the model rather than on training it.

Researchers have come up with many solutions to improve the efficiency of data usage. One of the main methods, nested cross-validation, uses data very efficiently, but it makes model parameters very difficult to interpret. In this thesis, we introduce a novel approach to data partitioning that we term "cross-validation and cross-testing". First, cross-validation is used on part of the data to determine and lock the model. Then the model is tested on a separate test set in a novel way: on each testing cycle, part of the test data is also used in a model-training phase. This gives an improved system for using machine learning algorithms when we need to interpret the model parameters but not the model weights. For example, it makes it possible to state that the data has a linear relationship rather than a quadratic one, or that the best neural network has five hidden layers.
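The two-step partitioning described above can be sketched in code. The following is a minimal illustration on synthetic data, not the thesis's exact procedure: the candidate model families (constant vs. linear), the fold counts, and the noisy linear dataset are all assumptions made for this example.

```python
import random

rng = random.Random(0)

def k_folds(indices, k):
    """Shuffle a list of indices and split it into k roughly equal folds."""
    idx = list(indices)
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def take(seq, idx):
    return [seq[i] for i in idx]

def fit_constant(xs, ys):
    """Degree-0 model family: always predict the mean of the training targets."""
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    """Degree-1 least-squares fit, y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cv_error(fit, xs, ys, k=5):
    """Average validation error of a model family over k cross-validation folds."""
    folds = k_folds(range(len(xs)), k)
    errs = []
    for i in range(k):
        tr = [j for m in range(k) if m != i for j in folds[m]]
        model = fit(take(xs, tr), take(ys, tr))
        errs.append(mse(model, take(xs, folds[i]), take(ys, folds[i])))
    return sum(errs) / k

# Synthetic stand-in data: a noisy linear relationship (an assumption of
# this sketch, not data from the thesis).
xs = [i / 10 for i in range(120)]
ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]
perm = k_folds(range(120), 1)[0]      # one shuffled "fold" = a permutation
train_idx, test_idx = perm[:80], perm[80:]
xtr, ytr = take(xs, train_idx), take(ys, train_idx)

# Step 1: plain cross-validation on the training part selects and "locks"
# the model family (constant vs. linear) -- the interpretable choice.
candidates = {"constant": fit_constant, "linear": fit_linear}
best_name = min(candidates, key=lambda name: cv_error(candidates[name], xtr, ytr))
fit = candidates[best_name]

# Step 2: "cross-testing" -- partition the test set into folds; in each
# cycle the locked family is re-fitted on the training data plus all test
# folds except one, and scored on the held-out fold, so test points also
# contribute to training without ever scoring their own predictions.
folds = k_folds(test_idx, 4)
errs = []
for i in range(4):
    extra = [j for m in range(4) if m != i for j in folds[m]]
    model = fit(xtr + take(xs, extra), ytr + take(ys, extra))
    errs.append(mse(model, take(xs, folds[i]), take(ys, folds[i])))
test_mse = sum(errs) / len(errs)
print(f"selected family: {best_name}, cross-tested MSE: {test_mse:.3f}")
```

Note what is reported: the *family* chosen in step 1 (e.g. "the relationship is linear"), which stays fixed, while the fitted weights differ in each cross-testing cycle — exactly the setting where parameters, but not weights, need to be interpretable.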