Ferreira PdL, Batista R, Andermann T, Groppo M, Bacon CD, Antonelli A
Mol. Phylogenet. Evol. 169 (-) 107432 [2022-04-00; online 2022-02-04]
Target sequence capture has emerged as a powerful method to sequence hundreds or thousands of genomic regions in a cost- and time-efficient manner. In most cases, however, targeted regions lack full sequence information for certain samples, due to taxonomic, laboratory, or stochastic factors. Loci lacking molecular data for a large number of samples are commonly excluded from downstream analyses, even though they may still contain valuable information. On the other hand, including data-poor loci may bias phylogenetic analyses. Here we use a target sequence capture dataset of an ecologically and taxonomically diverse group of spiny sunflowers (Asteraceae, or Compositae: Barnadesioideae) to test how the inclusion or exclusion of such data-poor loci affects phylogenetic inference. We investigate the sensitivity of concatenation and coalescent approaches to missing data with matrices of varying taxonomic completeness by filtering loci with different proportions of missing samples prior to data analysis. We find that missing data affect both the topology and branch support of the resulting phylogenies. The matrix containing all loci yielded the overall highest node support values, independently of the amount of missing nucleotides. These results provide empirical support for earlier suggestions based on single genes and data simulations that taxa with large amounts of missing data should not be readily dismissed as they can provide essential information for phylogenomic reconstruction.
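The locus-filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy alignments, the thresholds (0.0, 0.5, 1.0), and the rule that a sample counts as missing when it is absent or consists only of gap or ambiguity characters are all assumptions made for illustration.

```python
# Hypothetical sketch: build locus sets of varying taxonomic completeness
# by dropping loci whose fraction of missing samples exceeds a threshold.

MISSING_CHARS = set("-?Nn")  # assumed symbols for missing data


def is_missing(seq):
    """A sample is treated as missing if it has no sequence or only gap/ambiguity symbols."""
    return not seq or all(ch in MISSING_CHARS for ch in seq)


def missing_fraction(alignment, samples):
    """Fraction of the expected samples that are missing from one locus alignment."""
    missing = sum(1 for s in samples if is_missing(alignment.get(s, "")))
    return missing / len(samples)


def filter_loci(loci, samples, max_missing):
    """Keep loci whose fraction of missing samples is at most max_missing."""
    return {
        name: aln
        for name, aln in loci.items()
        if missing_fraction(aln, samples) <= max_missing
    }


# Toy data (invented): locus name -> {sample name -> aligned sequence}
samples = ["sp1", "sp2", "sp3", "sp4"]
loci = {
    "locusA": {"sp1": "ACGT", "sp2": "ACGA", "sp3": "ACGT", "sp4": "ACTT"},
    "locusB": {"sp1": "ACGT", "sp2": "----", "sp3": "ACGT"},  # sp2 all gaps, sp4 absent
    "locusC": {"sp1": "ACGT"},                                # three samples absent
}

# One locus set per illustrative threshold; 1.0 keeps every locus.
matrices = {t: sorted(filter_loci(loci, samples, t)) for t in (0.0, 0.5, 1.0)}
for threshold, kept in matrices.items():
    print(threshold, kept)
```

Each resulting locus set could then be passed separately to a concatenation or coalescent analysis, so that topology and support values can be compared across levels of completeness.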
NGI Stockholm (Genomics Applications) [Service]
NGI Stockholm (Genomics Production) [Service]
National Genomics Infrastructure [Service]
PubMed 35131421
DOI 10.1016/j.ympev.2022.107432
Crossref 10.1016/j.ympev.2022.107432
pii: S1055-7903(22)00045-8