Google AI researchers present TracIn, a simple approach to estimate the influence of training data


In a paper published at NeurIPS 2020, Google AI researchers have proposed TracIn, a simple and scalable approach to estimate the influence of training data. The quality of a machine learning model's training data can significantly influence its performance. Influence, the extent to which a given training example affects the model and its predictive performance, is a useful measure of data quality. Although a few methods have been proposed recently to quantify influence, their use in products has been limited due to the resources required to run them at scale or the additional burdens they place on training. TracIn, on the other hand, traces the training process to capture how predictions change as individual training examples are visited.

Deep learning models are usually trained using the stochastic gradient descent (SGD) algorithm or a variant of it. The algorithm works by performing multiple passes over the data and, on each pass, making changes to the model parameters that locally reduce the loss. TracIn efficiently finds mislabeled examples and outliers in various datasets and helps explain predictions in terms of training examples by assigning an influence score to each training example.
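The SGD procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions, not code from the paper: the linear model, the squared loss, the toy dataset, the choice to save one checkpoint per pass, and the names `grad_squared_loss`, `mean_loss`, and `sgd_with_checkpoints` are all hypothetical.

```python
import random

def grad_squared_loss(w, x, y):
    """Gradient of 0.5 * (w.x - y)**2 with respect to the weights w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

def mean_loss(w, data):
    """Average squared loss of weights w over the dataset."""
    return sum(0.5 * (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
               for x, y in data) / len(data)

def sgd_with_checkpoints(data, dim, lr=0.1, epochs=5, seed=0):
    """Run plain SGD, saving a copy of the weights after every pass (assumed schedule)."""
    rng = random.Random(seed)
    w = [0.0] * dim
    checkpoints = []
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)  # visit the examples in a random order on each pass
        for i in order:
            x, y = data[i]
            grad = grad_squared_loss(w, x, y)
            # Step against the gradient: a local reduction of this example's loss.
            w = [wi - lr * gi for wi, gi in zip(w, grad)]
        checkpoints.append(list(w))
    return checkpoints

# Hypothetical toy data, consistent with the weights [2.0, -1.0].
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
checkpoints = sgd_with_checkpoints(data, dim=2)
initial_loss = mean_loss([0.0, 0.0], data)
final_loss = mean_loss(checkpoints[-1], data)
```

The saved `checkpoints` list is the kind of record of the training process that a checkpoint-based influence method can later revisit.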

The researchers describe two types of relevant training examples: those that reduce the loss, called proponents, and those that increase the loss, called opponents. Computing influence directly is difficult because the test examples are unknown at training time and the learning algorithm visits multiple points at once. The TracIn method overcomes these limitations by using the checkpoints saved by the learning algorithm as a sketch of the training process and by working with the loss gradients of individual points. TracIn then reduces to the dot product of the loss gradients of the test and training examples, weighted by the learning rate and summed over the checkpoints. Alternatively, if the test example has no label, the influence on the prediction score can be examined instead.
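The checkpoint-based score described above can be illustrated with a minimal sketch. It assumes a linear model with squared loss so that the gradient has a closed form; the checkpoint weights, learning rates, examples, and the function name `tracin_score` are hypothetical values chosen for illustration, not taken from the paper.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

def grad_squared_loss(w, x, y):
    """Gradient of 0.5 * (w.x - y)**2 with respect to the weights w."""
    err = dot(w, x) - y
    return [err * xi for xi in x]

def tracin_score(checkpoints, lrs, train_example, test_example):
    """Sum over checkpoints of lr * (train loss gradient . test loss gradient)."""
    score = 0.0
    for w, lr in zip(checkpoints, lrs):
        g_train = grad_squared_loss(w, *train_example)
        g_test = grad_squared_loss(w, *test_example)
        score += lr * dot(g_train, g_test)
    return score

# Hypothetical checkpoints and per-checkpoint learning rates.
checkpoints = [[0.0, 0.0], [1.0, -0.5]]
lrs = [0.1, 0.1]

test_example = ([1.0, 0.0], 2.0)
agreeing = ([1.0, 0.0], 2.0)      # same input and label as the test example
conflicting = ([1.0, 0.0], -2.0)  # same input, contradictory label

proponent_score = tracin_score(checkpoints, lrs, agreeing, test_example)
opponent_score = tracin_score(checkpoints, lrs, conflicting, test_example)
```

In this toy setup the agreeing example receives a positive score (it acts as a proponent) and the conflicting example a negative one (an opponent).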

The researchers demonstrate the utility of TracIn by computing the loss gradient vectors for the training data and for a test example in a specific classification task, then leveraging a standard k-nearest neighbor library to retrieve the top proponents and opponents. The breakdown of a test example's loss into training example influences provided by TracIn suggests that the loss of any neural model trained with gradient descent can be viewed as a sum of similarities in gradient space. TracIn can therefore be used as a similarity function within a clustering algorithm.
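The retrieval step can be sketched as a brute-force search over gradient dot products. This is a simplification under stated assumptions: it uses a single checkpoint, replaces the nearest neighbor library with Python's `heapq`, and the gradient vectors and the function name `top_proponents_and_opponents` are hypothetical.

```python
import heapq

def top_proponents_and_opponents(train_grads, test_grad, k=2):
    """Return indices of the k most positive and k most negative dot-product scores."""
    scores = [
        (sum(a * b for a, b in zip(grad, test_grad)), index)
        for index, grad in enumerate(train_grads)
    ]
    # Proponents: largest positive similarity to the test gradient.
    proponents = [i for s, i in heapq.nlargest(k, scores) if s > 0]
    # Opponents: most negative similarity to the test gradient.
    opponents = [i for s, i in heapq.nsmallest(k, scores) if s < 0]
    return proponents, opponents

# Hypothetical loss gradients for four training examples and one test example.
train_grads = [[1.0, 0.0], [-2.0, 0.5], [2.0, 0.5], [0.0, -1.0]]
test_grad = [1.0, 1.0]

proponents, opponents = top_proponents_and_opponents(train_grads, test_grad)
```

For large training sets, an approximate nearest neighbor index over the gradient vectors would likely replace this linear scan.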

TracIn can also be used to identify outliers, which exhibit strong self-influence, that is, the influence of a training example on its own prediction.
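Self-influence can be sketched by scoring each training example against itself, which turns the gradient dot product into a squared gradient norm. The linear model, squared loss, checkpoint weights, toy data, and the name `self_influence` below are assumptions made for illustration.

```python
def self_influence(checkpoints, lrs, x, y):
    """Sum over checkpoints of lr * squared norm of the example's own loss gradient."""
    total = 0.0
    for w, lr in zip(checkpoints, lrs):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        grad = [err * xi for xi in x]  # gradient of 0.5 * (w.x - y)**2
        total += lr * sum(g * g for g in grad)
    return total

# Hypothetical checkpoints moving toward the weights [2.0, -1.0].
checkpoints = [[1.0, -0.5], [1.8, -0.9]]
lrs = [0.1, 0.1]

# The last example contradicts the first one, as a mislabeled point would.
data = [
    ([1.0, 0.0], 2.0),
    ([0.0, 1.0], -1.0),
    ([1.0, 1.0], 1.0),
    ([1.0, 0.0], -2.0),
]

# Rank training examples from highest to lowest self-influence.
ranked = sorted(
    range(len(data)),
    key=lambda i: self_influence(checkpoints, lrs, *data[i]),
    reverse=True,
)
```

In this toy data the contradictory example ranks first, which is the pattern that makes a self-influence ranking a plausible way to surface candidates for manual label review.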

TracIn is task-independent and can be applied to a variety of models. Its only requirement is that the model be trained using SGD. TracIn is a relatively easy-to-implement, scalable method for calculating the influence of training examples on individual predictions and for finding rare and mislabeled training examples. Code examples are available through the GitHub link in the paper.



James G. Williams