Calling Structural Variants with Confidence from Short-Read Data in Wild Bird Populations.

David G, Bertolotti A, Layer R, Scofield D, Hayward A, Baril T, Burnett HA, Gudmunds E, Jensen H, Husby A

Genome Biol Evol 16 (4) - [2024-04-02; online 2024-03-15]

Comprehensive characterization of structural variation in natural populations has only become feasible in the last decade. To investigate the population genomic nature of structural variation, reproducible and high-confidence structural variation callsets are first required. We created a population-scale reference of the genome-wide landscape of structural variation across 33 Nordic house sparrows (Passer domesticus). To produce a consensus callset across all samples using short-read data, we compared heuristic-based quality filtering and visual curation (Samplot/PlotCritic and Samplot-ML) approaches. We demonstrate that curation of structural variants is important for reducing putative false positives and that the time invested in this step outweighs the potential costs of analyzing short-read-discovered structural variation data sets that include many potential false positives. We find that even a lenient manual curation strategy (e.g., applied by a single curator) can reduce the proportion of putative false positives by up to 80%, thus enriching the proportion of high-confidence variants. Crucially, in applying a lenient manual curation strategy with a single curator, nearly all (>99%) variants rejected as putative false positives were also classified as such by a more stringent curation strategy using three additional curators. Furthermore, variants rejected by manual curation failed to reflect the expected population structure from SNPs, whereas variants passing curation did. Combining heuristic-based quality filtering with rapid manual curation of structural variants in short-read data can therefore become a time- and cost-effective first step for functional and population genomic studies requiring high-confidence structural variation callsets.

Bioinformatics Support for Computational Resources [Service]

Bioinformatics Support, Infrastructure and Training [Service]

PubMed 38489588

DOI 10.1093/gbe/evae049

Crossref 10.1093/gbe/evae049

pmc: PMC11018544
pii: 7630036

