Detecting errors in de novo metagenomic next-generation sequencing (NGS) data is a difficult task. Even after clustering, the datasets retain sequences from multiple microbial species, and these contribute a considerable amount of natural variation that conceals the errors. Standard denoising algorithms developed for genomics cannot be applied directly, because they produce a high rate of false positives in regions with natural variation. At the same time, rare natural variants are themselves a subject of study and must be distinguished from errors.
This work uses machine learning to filter out some of the false positives produced by other error discovery algorithms. A neural network and a random forest were trained to identify errors in the datasets with an accuracy of over 99%. Although this accuracy is still insufficient for the direct discovery of rare errors, it is demonstrated that the trained models provide a good filter that reduces the number of incorrectly identified errors without increasing the number of false negatives.
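The filtering step described above can be illustrated with a minimal sketch, assuming a hypothetical upstream detector that has already flagged candidate positions. The `Candidate` fields, the feature names, and `toy_error_probability` are assumptions made for illustration only; the hand-written rule merely stands in for a trained neural network or random forest and is not the model or feature set used in this work.

```python
from typing import Callable, List, NamedTuple


class Candidate(NamedTuple):
    """A position flagged as a possible error by an upstream detector (hypothetical)."""
    position: int
    frequency: float     # assumed feature: fraction of reads carrying the variant
    mean_quality: float  # assumed feature: mean base quality at the position


def toy_error_probability(c: Candidate) -> float:
    """Stand-in for a trained classifier: a fixed rule, not a learned model."""
    if c.frequency < 0.02 and c.mean_quality < 25.0:
        return 0.9
    return 0.1


def filter_candidates(
    candidates: List[Candidate],
    error_probability: Callable[[Candidate], float],
    threshold: float = 0.5,
) -> List[Candidate]:
    """Keep only the candidates that the classifier also scores as errors."""
    return [c for c in candidates if error_probability(c) >= threshold]


candidates = [
    Candidate(101, 0.010, 18.0),  # rare and low quality: plausibly an error
    Candidate(245, 0.300, 36.0),  # frequent and high quality: plausibly natural variation
    Candidate(377, 0.015, 35.0),  # rare but high quality: plausibly a rare natural variant
]

kept = filter_candidates(candidates, toy_error_probability)
print([c.position for c in kept])  # prints [101]
```

In practice the scoring function would be replaced by the prediction of a trained model, and the threshold would be chosen so that discarding false positives does not discard true errors.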
Key words: metagenomics, error detection, weighted frequency, artificial neural networks, random forests
Topic: BIOLOGY