README for AnophelesData2020.tgz
2 October 2023
Tomas Flouri and Ziheng Yang

This archive includes the coding and noncoding alignments for the autosomes analyzed in table 1 of Flouri et al. (2020) and table 1 of Flouri et al. (2023). The numbers of loci for the chromosomal arms are listed in those tables, but note that table 1 in Flouri et al. (2020) had a typo: the number of loci for 2L1+2 noncoding should be 6463 instead of 6434.

The X-chromosome alignments are not included in this archive. They are available in the other archive, AnophelesData2018.tgz, from Thawornwattana et al. (2018).

The data were originally published and analyzed by Fontaine et al. (2015), and recompiled and analyzed by Thawornwattana et al. (2018). The original alignments had 13 sequences, with 2 sequences for each of the six ingroup species (A. gambiae, A. coluzzii, A. arabiensis, A. melas, A. merus, and A. quadriannulatus), plus one sequence for the outgroup (A. christyi). In Flouri et al. (2020, 2023), we did not use the outgroup, so in this archive the outgroup sequence has been removed and every locus has 12 sequences, 2 from each of the six ingroup species. The AnophelesData2018 archive contains alignments of 13 sequences including the outgroup. The data here are mostly the “realigned” sequences from the 2018 archive, although there may be minor discrepancies in the alignments.

We note that the genomic data published by Fontaine et al. (2015) are “haploid consensus sequences”, with heterozygotes resolved into the majority nucleotide. This practice effectively resolves the genotypic phase at multiple heterozygous sites at random, potentially generating chimeric sequences that do not exist in nature (Andermann et al. 2019, Huang et al. 2022). This pseudo-haploidization may be problematic for analytical methods that are based on genealogical trees; for a dramatic example, see figure 6 in Huang et al. (2022). Both the 2018 and 2020 datasets suffer from this problem.
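
As an illustration only, the short Python sketch below shows how picking the majority nucleotide at each site can produce a chimeric sequence. It is not the procedure used by Fontaine et al. (2015) or in our analyses; the two haplotypes, the per-site read depths, and the function names are all invented for this example.

```python
from collections import Counter

# Two hypothetical true haplotypes of one diploid individual.
# They differ at sites 1 and 3 (0-based), so those sites are heterozygous.
hap_a = "ACGTA"
hap_b = "ATGCA"

# Hypothetical numbers of reads sampled from each haplotype at each site.
depth_a = [5, 3, 5, 1, 5]
depth_b = [5, 2, 5, 4, 5]


def pileup(hap_a, hap_b, depth_a, depth_b):
    """Return, for each site, the string of bases seen in the reads."""
    return [a * na + b * nb
            for a, b, na, nb in zip(hap_a, hap_b, depth_a, depth_b)]


def majority_consensus(piles):
    """Pick the most frequent base at each site, ignoring phase."""
    return "".join(Counter(bases).most_common(1)[0][0] for bases in piles)


consensus = majority_consensus(pileup(hap_a, hap_b, depth_a, depth_b))

# Site 1 takes its base from hap_a and site 3 from hap_b, so the
# consensus matches neither of the two real haplotypes.
is_chimeric = consensus not in (hap_a, hap_b)
print(consensus, is_chimeric)
```

In this toy case the consensus combines an allele from one haplotype with an allele from the other, which is the kind of sequence that does not exist in nature.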

References

Andermann T, Fernandes AM, Olsson U, Topel M, Pfeil B, Oxelman B, Aleixo A, Faircloth BC, Antonelli A. 2019. Allele phasing greatly improves the phylogenetic utility of ultraconserved elements. Syst Biol 68:32-46.
Flouri T, Jiao X, Rannala B, Yang Z. 2020. A Bayesian implementation of the multispecies coalescent model with introgression for phylogenomic analysis. Mol Biol Evol 37:1211-1223.
Flouri T, Jiao X, Huang J, Rannala B, Yang Z. 2023. Efficient Bayesian inference under the multispecies coalescent with migration. Proc Nat Acad Sci USA: in press.
Fontaine MC, Pease JB, Steele A, Waterhouse RM, Neafsey DE, Sharakhov IV, Jiang X, Hall AB, Catteruccia F, Kakani E et al. 2015. Extensive introgression in a malaria vector species complex revealed by phylogenomics. Science 347:1258524.
Huang J, Bennett J, Flouri T, Yang Z. 2022. Phase resolution of heterozygous sites in diploid genomes is important to phylogenomic analysis under the multispecies coalescent model. Syst Biol 71:334-352.
Thawornwattana Y, Dalquen DA, Yang Z. 2018. Coalescent analysis of phylogenomic data confidently resolves the species relationships in the Anopheles gambiae species complex. Mol Biol Evol 35:2512-2527.