Improving taxonomic inference from ancient environmental metagenomes by masking microbial-like regions in reference genomes.

Oskolkov N, Jin C, Clinton SL, Guinet B, Wijnands F, Johnson E, Kutschera VE, Kinsella CM, Heintzman PD, van der Valk T

GigaScience 14 [2025-01-06; online 2025-10-03]

Ancient environmental DNA is increasingly vital for reconstructing past ecosystems, particularly when paleontological and archaeological tissue remains are absent. Detecting ancient plant and animal DNA in environmental samples relies on using extensive eukaryotic reference genome databases for profiling metagenomics data. However, many eukaryotic genomes contain regions with high sequence similarity to microbial DNA, which can lead to the misclassification of bacterial and archaeal reads as eukaryotic. This issue is especially problematic in ancient eDNA datasets, where plant and animal DNA is typically present at very low abundance. In this study, we present a method for identifying bacterial- and archaeal-like sequences in eukaryotic genomes and apply it to nearly 3,000 reference genomes from NCBI RefSeq and GenBank (vertebrates, invertebrates, plants) as well as the 1,323 PhyloNorway plant genome assemblies from herbarium material from northern high-latitude regions. We find that microbial-like regions are widespread across eukaryotic genomes and provide a comprehensive resource of their genomic coordinates and taxonomic annotations. This resource enables the masking of microbial-like regions during profiling analyses, thereby improving the reliability of ancient environmental metagenomic datasets for downstream analyses.
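
The masking step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the microbial-like regions are available as 0-based, half-open (start, end) intervals per contig, as in BED files; this record does not state the resource's actual file format. Hard-masking with "N" is one common approach and is not confirmed here as the authors' procedure. The function name `mask_regions`, the toy sequence, and the coordinates are hypothetical.

```python
def mask_regions(sequence, regions, mask_char="N"):
    """Replace every base inside the given intervals with mask_char.

    `regions` holds (start, end) pairs, assumed 0-based and half-open
    (BED-style). Intervals are clamped to the sequence bounds.
    """
    masked = list(sequence)
    for start, end in regions:
        start = max(start, 0)
        end = min(end, len(masked))
        for i in range(start, end):
            masked[i] = mask_char
    return "".join(masked)


# Toy data for illustration only; not taken from the published resource.
contig = "ACGTACGTACGT"
microbial_like = [(2, 5), (8, 10)]
masked_contig = mask_regions(contig, microbial_like)
print(masked_contig)
```

Reads that would have aligned to the masked positions no longer find a match in the eukaryotic reference, which is the effect the abstract describes for reducing misclassification of bacterial and archaeal reads.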

Bioinformatics (NBIS) [Collaborative]

Bioinformatics Long-term Support WABI [Collaborative]

Bioinformatics Support, Infrastructure and Training [Collaborative]

PubMed 41041810

DOI 10.1093/gigascience/giaf108

Crossref 10.1093/gigascience/giaf108

pmc: PMC12491943
pii: 8271467

