Abstract
This thesis presents two projects, both rooted in uncertainty, in machine learning and online news.
The first project concerns the prediction of reliability and bias in American news articles, popularly known as fake news detection. There are three associated papers on the topic. The first paper presents a collected dataset containing 700,000+ news articles from 194 sources, together with detailed labelling of the sources by multiple, independent authorities. The second paper analyses copying patterns between news sources, concluding that copying is widespread and that the copying patterns reveal communities of sources publishing similar or even identical content. The final paper presents a large robustness study of a known reliability and bias detection system: the system is tested on unseen sources, tested for performance degradation over time, and tested against three types of attacks aimed at it.
The second project aims at simplifying common problems with labels in classification. We propose a framework called decoupling, which uses probabilistic methods to handle:
- Semi-supervised learning: only some of the training data have labels
- Positive-unlabelled learning: only one of two classes has labels
- Multi-positive-unlabelled learning: all classes but one have labels
- Noisy-label learning: labels are known to contain errors
The framework can also handle combinations of the above. We derive the approximations needed for optimizing labels in the framework and show empirically that it can assist in solving the problems above. The work is currently available only as a preprint, but we expect to publish it soon.
We conclude the decoupling project by presenting a new and interesting classification task, which we have not seen elsewhere and which we call degenerate classification. We show a simple case in which decoupling can be used to encode the assumptions needed to learn six classes using only four labels.
Original language: English
Publisher: Technical University of Denmark
Number of pages: 112
Publication status: Published - 2020