Not all instances in a data set are equally beneficial for inferring a model of the data. Some instances (such as outliers) are detrimental to inferring a model of the data. Several machine learning techniques treat instances in a data set differently during training, such as curriculum learning, filtering, and boosting. However, an automated method for determining how beneficial an instance is for inferring a model of the data does not exist. In this paper, we present an automated method that orders the instances in a data set by complexity based on their likelihood of being misclassified (instance hardness). The underlying assumption of this method is that instances with a high likelihood of being misclassified represent more complex concepts in a data set. Ordering the instances in a data set allows a learning algorithm to focus on the most beneficial instances and ignore the detrimental ones. We compare ordering the instances in a data set using curriculum learning, filtering, and boosting. We find that ordering the instances significantly increases classification accuracy and that filtering has the largest impact on classification accuracy.

Many real-world problems require multi-label classification, in which each training instance is associated with a set of labels. There are many existing learning algorithms for multi-label classification; however, these algorithms assume implicit negativity, where missing labels in the training data are automatically assumed to be negative. Additionally, many of the existing algorithms do not handle incremental learning, in which new labels could be encountered later in the learning process. A novel multi-label adaptation of the backpropagation algorithm is proposed that does not assume implicit negativity. In addition, this algorithm can, using a naive Bayesian approach, infer missing labels in the training data. This algorithm can also be trained incrementally, as it dynamically considers new labels. This solution is compared with existing multi-label algorithms using data sets from multiple domains, and the performance is measured with standard multi-label evaluation metrics. It is shown that our algorithm improves classification performance for all metrics by an overall average of 7.4% when at least 40% of the labels are missing from the training data, and improves by 18.4% when at least 90% of the labels are missing.

Many data mining and data analysis techniques operate on dense matrices or complete tables of data. Real-world data sets, however, often contain unknown values. Even many classification algorithms that are designed to operate with missing values still exhibit deteriorated accuracy. One approach to handling missing values is to fill in (impute) the missing values. In this paper, we present a technique for unsupervised learning called Unsupervised Backpropagation (UBP), which trains a multi-layer perceptron to fit to the manifold sampled by a set of observed point-vectors. We evaluate UBP with the task of imputing missing values in data sets, and show that UBP is able to predict missing values with significantly lower sum-squared error than other collaborative filtering and imputation techniques. We also demonstrate with 24 data sets and 9 supervised learning algorithms that classification accuracy is usually higher when randomly-withheld values are imputed using UBP rather than with other methods.

Most data complexity studies have focused on characterizing the complexity of the entire data set and do not provide information about individual instances. Knowing which instances are misclassified, and understanding why they are misclassified and how they contribute to data set complexity, can improve the learning process and could guide the future development of learning algorithms and data analysis methods. The goal of this paper is to better understand the data used in machine learning problems by identifying and analyzing the instances that are frequently misclassified by learning algorithms that have shown utility to date and are commonly used in practice. We identify instances that are hard to classify correctly (instance hardness) by classifying over 190,000 instances from 64 data sets with 9 learning algorithms. We then use a set of hardness measures to understand why some instances are harder to classify correctly than others. We find that class overlap is a principal contributor to instance hardness. We seek to integrate this information into the training process to alleviate the effects of class overlap and present ways that instance hardness can be used to improve learning.
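The instance-hardness idea running through these abstracts can be made concrete with a small sketch. Below is a minimal, stdlib-only illustration of one simple hardness measure, k-Disagreeing Neighbors (kDN), which scores each instance by the fraction of its k nearest neighbors that carry a different class label; the function name, toy data, and choice of k = 3 are illustrative assumptions, not values taken from the papers above.

```python
# Minimal sketch of one instance-hardness measure: k-Disagreeing Neighbors (kDN).
# kDN approximates an instance's likelihood of being misclassified as the
# fraction of its k nearest neighbors that have a different class label.
import math

def kdn_hardness(X, y, k=3):
    """Return a hardness score in [0, 1] for each instance in X."""
    scores = []
    for i, xi in enumerate(X):
        # Distances from instance i to every other instance.
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i)
        neighbors = [j for _, j in dists[:k]]
        disagree = sum(1 for j in neighbors if y[j] != y[i])
        scores.append(disagree / k)
    return scores

# Toy data: two well-separated clusters plus one mislabeled point (an outlier).
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.5, 0.5)]
y = [0, 0, 0, 1, 1, 1, 1]  # the last label disagrees with its neighborhood

hardness = kdn_hardness(X, y, k=3)
# Ordering by ascending hardness puts easy, representative instances first,
# as in a curriculum; filtering would instead drop the hardest instances.
order = sorted(range(len(X)), key=lambda i: hardness[i])
```

On this toy data the mislabeled point receives the maximum hardness score, so a curriculum built from `order` would present it last, while a filtering approach would simply discard it.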