Abstract
Copy number variation provides the raw material for gene family expansion and diversification, which is an important evolutionary force. Moreover, copy number variants (CNVs) can influence gene transcriptional and translational levels and have been associated with complex disease susceptibility. Therefore, natural selection may have affected at least some of the more than one thousand CNVs thus far discovered among the genomes of phenotypically normal humans. While identifying and understanding particular instances of natural selection may shed light on important aspects of human evolutionary history, our ability to analyze CNVs in traditional population genetic frameworks has been limited. However, progress has been made by adapting some of these frameworks for use with copy number data. Moving forward, these efforts will be aided by studies of the population genetics of copy number variation in non-human organisms, and by more direct comparisons of within-species copy number variation and between-species copy number fixation.
Introduction
Human genetic diversity comprises single nucleotide polymorphisms (SNPs), small insertion and deletion variants, short tandem repeat polymorphisms, retrotransposable element insertion variants (e.g., Alu elements), inversion variants, and copy number variants (CNVs). CNVs are larger-scale insertions and deletions that range from several kilobases (kb) to several megabases (Mb) in size (Feuk et al., 2006). While this component of genetic diversity has long been considered important by the scientific community (e.g., Ottolenghi et al., 1974; Awdeh and Alper, 1980; Trask et al., 1998; Buckland, 2003), recent advances in microarray and other genome-scale technologies have facilitated the discovery of more than 1,000 human CNVs (Redon et al., 2006; Sebat, 2007), leading to intensified interest and excitement. This review focuses on the potential evolutionary significance of copy number variation in the human genome, including results from previous studies and potential directions for future research.
The functional and evolutionary potential of copy number variation
Gene-containing CNVs may influence mRNA and protein expression levels (e.g., Aldred et al., 2005; Gonzalez et al., 2005; McCarroll et al., 2006; Stranger et al., 2007). Therefore, CNVs have the potential to affect downstream phenotypes and, ultimately, reproductive fitness (Kondrashov and Kondrashov, 2006; Hurles et al., 2008). However, one cannot simply assume a direct relationship between gene copy number and expression level. In part, this uncertainty may be attributable to the location of a gene-containing CNV with respect to that of the gene regulatory machinery (Cooper et al., 2007). For example, each duplicated segment of the starch-digesting amylase gene AMY1 contains the regulatory sequences necessary for salivary-specific expression (Groot et al., 1990; Ting et al., 1992), and there is a significant positive correlation between AMY1 copy number and amylase protein levels in saliva (Bank et al., 1992; Perry et al., 2007). In contrast, there is a single regulatory ‘locus control region’ upstream of the red (OPN1LW) and green (OPN1MW) opsin visual pigment genes on the X chromosome (Wang et al., 1992). Although mutations in the copy-number variable OPN1MW gene may result in color blindness (Nathans et al., 1986a, b; Wolf et al., 1999), only the copy nearest the locus control region is expressed to an appreciable extent, such that a male with a disrupted first gene but intact subsequent genes will have color blindness (Hayashi et al., 1999).
CNVs are reportedly associated with susceptibility to systemic autoimmune diseases (Fanciulli et al., 2007; Yang et al., 2007), psoriasis (Hollox et al., 2008), and HIV infection and progression to AIDS (Gonzalez et al., 2005), among other complex diseases (Fellermann et al., 2006; Le Marechal et al., 2006). While these results provide further evidence of the potential functional relevance of CNVs in general, disease can be a powerful evolutionary force in its own right and could therefore affect patterns of CNV diversity via natural selection. For example, deletions of the hemoglobin genes HBA1, HBA2, HBB, or HBD result in thalassemia (Ottolenghi et al., 1974, 1976; Taylor et al., 1974; Orkin et al., 1979). Although homozygous deletion is typically fatal (Weatherall and Clegg, 1981), individuals heterozygous for these deletions receive protection against malaria infection, and thalassemia frequency is strongly correlated with malaria prevalence, even down to very local levels (Flint et al., 1986; Hill et al., 1988; Allen et al., 1997). This is a classic example of balancing selection in humans.
Higher copy numbers of the immunoregulatory and inflammatory cytokine CCL3L1 gene are associated with lower risks of HIV infection and progression to AIDS (Gonzalez et al., 2005). Interestingly, average CCL3L1 copy number in Africans is nearly two times greater than in non-Africans: 5.95 versus 2.99 copies, respectively (Gonzalez et al., 2005). In a subsequent genome-wide study, the level of population differentiation at this locus was found to be extraordinary compared to that of other CNVs (Redon et al., 2006), suggesting that natural selection may have influenced CCL3L1 copy number in humans. Because AIDS has only recently been a human disease, it is unlikely to have driven patterns of CCL3L1 copy number in our species. However, other diseases for which susceptibility may also be associated with CCL3L1 copy number may have had such an effect. Therefore, we stand to benefit from experiments that interrogate the detailed functional effects of different CCL3L1 copy number genotypes (e.g., Dolan et al., 2007), which may lead to further medical and evolutionary insights. For example, Mamtani et al. (2008) recently reported an association between CCL3L1 copy number and susceptibility to systemic lupus erythematosus, providing evidence that this CNV may affect diverse, multi-systemic pathways that could have been subject to dynamic evolutionary pressures during human evolution.
Population genetic analyses of copy number variation
Due to current technological limitations in CNV ascertainment, and diversity in and uncertainty over CNV architectures, we face considerable challenges in obtaining reliable genotypes for CNVs and using traditional population genetic analyses to understand their evolutionary significance (Conrad and Hurles, 2007; Kidd et al., 2007; McCarroll and Altshuler, 2007; Perry et al., 2008a). Despite these challenges, there have been some successful modifications of population genetic frameworks for use with CNV data. For example, rather than considering allele frequencies, Redon et al. (2006) analyzed directly the relative intensity log2 ratios for clones from their whole-genome array-based comparative genomic hybridization platform to highlight CNVs with relatively high levels of between-population differentiation. These CNVs, which include the CCL3L1 CNV discussed above, are excellent candidates for further functional and evolutionary analyses.
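One way to quantify between-population differentiation directly from log2 ratios, without first calling genotypes, is a variance-partitioning statistic of the form V_ST = (V_T - V_S) / V_T, where V_T is the variance of log2 ratios across all individuals and V_S is the sample-size-weighted mean of the within-population variances. The following is a minimal sketch under that assumption, not necessarily the exact computation used in any particular study; the function name, population labels, and log2 ratio values are all hypothetical.

```python
from statistics import pvariance


def vst(log2_by_population):
    """Return (V_T - V_S) / V_T for one array clone or CNV region.

    log2_by_population maps a population label to a list of log2 ratios,
    one per individual. Population (not sample) variances are used here,
    an assumption that keeps the statistic between 0 and 1.
    """
    pooled = [x for values in log2_by_population.values() for x in values]
    v_total = pvariance(pooled)
    if v_total == 0:
        return 0.0  # no variation at all, so no differentiation
    # Within-population variance, weighted by the number of individuals.
    v_within = sum(
        len(values) * pvariance(values)
        for values in log2_by_population.values()
    ) / len(pooled)
    return (v_total - v_within) / v_total


# Hypothetical log2 ratios for two clones in two populations.
differentiated = {
    "pop_A": [0.42, 0.55, 0.38, 0.61],
    "pop_B": [-0.10, 0.02, -0.05, 0.08],
}
undifferentiated = {
    "pop_A": [0.10, -0.12, 0.05, -0.03],
    "pop_B": [0.08, -0.09, 0.02, -0.01],
}

print(f"differentiated clone:   {vst(differentiated):.3f}")
print(f"undifferentiated clone: {vst(undifferentiated):.3f}")
```

Values near 1 indicate that most of the variance in log2 ratios lies between populations, whereas values near 0 indicate little differentiation. Ranking clones by such a statistic is one way to shortlist candidate loci for follow-up.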
In another study, we discovered that mean AMY1 copy number is higher in populations with high-starch diets compared to populations with traditionally low-starch diets (Perry et al., 2007). In a subset of these populations, the level of differentiation at the AMY1 locus is unusual compared to that for other genome-wide CNVs, suggesting that positive or directional selection may have favored higher AMY1 copy numbers in at least some high-starch populations (Perry et al., 2007). Combined with findings from previous population genetic analyses of alleles responsible for lactase persistence (Bersaglieri et al., 2004; Tishkoff et al., 2007), this result demonstrates the importance of diet – and particularly the transition to agriculture – in human evolution.
Nozawa and colleagues (2007) compared gene- and pseudogene-containing CNVs to examine the evolutionary significance of olfactory receptor gene copy number variation. Previous studies have consistently shown that human CNVs are significantly enriched for genes with sensory perception (including olfactory receptors) and defense response functions (e.g. Cooper et al., 2007). While this enrichment could be interpreted as evidence of positive selection for variation at the copy number level of genes (Nguyen et al., 2006), it is also consistent with relatively stronger functional constraint on copy numbers of genes with other functions. Using CNV data from Redon et al. (2006), Nozawa et al. (2007) compared the proportion of functional olfactory receptor genes that are copy number variable to that for olfactory receptor pseudogenes, which are expected to reflect neutral patterns of diversity. A similar number of genes and pseudogenes were copy number variable, which is consistent with neutral evolution (e.g., genetic drift) on the copy numbers of functional olfactory receptors (Nozawa et al., 2007).
It will be interesting to revisit the Nozawa et al. (2007) olfactory receptor analysis once advances in CNV technologies deliver improved breakpoint resolution and more accurate genotype estimates. Better breakpoint resolution would increase certainty about which specific genes are copy number variable, and more accurate genotypes would make it possible to consider the full frequency distributions. In addition, comparing the human results to those from other species in which olfactory receptors may have been subject to different evolutionary pressures, including rodents, canines, and even other primates (Gilad et al., 2003, 2004), will be particularly informative. Finally, such comparisons among human populations with different ecological histories may also prove enlightening.
In general, there are not large samples of copy-number-variable pseudogenes in the human genome for non-olfactory functional categories (Redon et al., 2006), which may preclude widespread application of the Nozawa et al. (2007) test. As an alternative neutral proxy, one could analyze a set of intergenic regions carefully matched (e.g., for repetitive element densities and recombination rates) to the functional genes of interest. Of course, not all intergenic region CNVs are likely to be impervious to natural selection; for example, Stranger et al. (2007) identified six CNVs that were significantly correlated with the mRNA expression levels of genes >1 Mb distant. However, intergenic CNVs are still likely to better reflect neutrality than gene-containing CNVs, and thus would provide a suitable database for initial comparisons in a population genetics framework.
Patterns of copy number variation in non-human species
Widespread copy number variation has now been described in the genomes of chimpanzees, rhesus macaques, mice, rats, the fruitfly Drosophila melanogaster, and even the malaria parasite Plasmodium falciparum (Li et al., 2004; Perry et al., 2006; Cutler et al., 2007; Dopman and Hartl, 2007; Egan et al., 2007; Graubert et al., 2007; Anderson et al., 2008; Emerson et al., 2008; Guryev et al., 2008; Lee et al., 2008; Mok et al., 2008; She et al., 2008). Characterizing CNVs in non-human genomes not only helps us to understand better the evolutionary histories of these species (e.g., Nair et al., 2007), but also will enhance our knowledge of the functional and evolutionary significance of human CNVs.
Specifically, CNVs occur in orthologous regions of different primate genomes considerably more often than would be expected by chance, likely a result of shared genomic architectures that facilitate recurrent CNV genesis (Perry et al., 2006; Lee et al., 2008). Even in rats, 113 CNVs were discovered that occur in regions orthologous to human CNVs (Guryev et al., 2008). Especially in model organisms, these loci represent excellent opportunities to examine the potential functional significance of human CNVs. Moreover, between-species comparisons of the detailed phenotypic effects of orthologous-region CNVs will contribute to our understanding of the functional importance of CNV fine-scale architecture (e.g. specific breakpoints) and genetic background (e.g. nucleotide sequence variation).
Analyses of the patterns of copy number variation in non-human genomes are also expected to aid in the development of CNV-tailored population genetic analyses. In this respect, the relatively high level of neutral genetic diversity in Drosophila (Aquadro et al., 2001) makes this model organism an ideal candidate for CNV evolutionary analyses. In an initial study comparing the Drosophila melanogaster reference sequence strain to five wild-type strains, Dopman and Hartl (2007) identified an average of 436 CNVs per strain. These CNVs were then analyzed in the context of the detailed knowledge of functional elements in the Drosophila genome to test intriguing hypotheses concerning the biological and evolutionary significance of copy number variation. For example, genes with tissue-specific rather than widespread expression are significantly more likely to be copy number variable in Drosophila melanogaster, and these tissue-specific genes are particularly enriched for midgut and male accessory gland expression, including genes involved in digestion, defense response, insecticide detoxification, and sperm competition (Dopman and Hartl, 2007). Detailed population genetic analyses focused on these CNVs may provide important insights into Drosophila evolutionary history and would contribute to our general understanding of the potential functional and evolutionary significance of copy number variation. In a more recent Drosophila melanogaster study, Emerson et al. (2008) identified four high-frequency duplications that contain one or more genes with toxin response/insecticide detoxification functions. These particular CNVs may have been affected by positive selection and are therefore excellent candidates for further interrogation.
The relationship between copy number variation and fixation
Inter-specific copy number differences (CNDs) are common among primate genomes (Locke et al., 2003; Fortna et al., 2004; Newman et al., 2005; Goidts et al., 2006; Wilson et al., 2006; Dumas et al., 2007), and may have been involved in the evolution of species- and lineage-specific phenotypes (Kehrer-Sawatzki and Cooper, 2007). In fact, with respect to base pair content, CNDs may account for a greater proportion of total human-chimpanzee genome divergence than single nucleotide substitutions (Cheng et al., 2005; Chimpanzee Sequencing and Analysis Consortium, 2005). Although the relationship between CNVs and CNDs is relatively complex, since segmental duplications are prone to subsequent CNV genesis via non-allelic homologous recombination mechanisms (e.g., Cooper et al., 2007), we can still advance our understanding of the evolutionary significance of both CNVs and CNDs by analyzing them in concert.
The McDonald-Kreitman test (McDonald and Kreitman, 1991) compares ratios of fixation to polymorphism for functional and putatively neutral sites. An excess of functional fixation suggests that some differences may have been fixed by positive selection. Zhang (2007) adapted this test to compare CND:CNV ratios for intact olfactory receptor genes and pseudogenes (CNDs were based on comparison of the human and chimpanzee reference genome sequences; CNV data were from Redon et al. (2006)). Although there is a slight excess of intact gene fixation (16:116 for intact genes; 11:143 for pseudogenes), the ratios were not significantly different; therefore, the null hypothesis of neutrality could not be rejected (Zhang, 2007).
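The comparison of two fixation:polymorphism ratios amounts to a test of independence on a 2x2 contingency table. As an illustration only, the sketch below applies a two-sided Fisher's exact test to the counts quoted above; Fisher's exact test is one common choice for such tables, and we do not assume it is the test used in the original analysis. The function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are site classes (e.g., intact genes, pseudogenes); columns are
    fixed differences (CNDs) and polymorphisms (CNVs).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # conditional on the observed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table that is no more likely than
    # the observed one (small tolerance for floating-point ties).
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# CND:CNV counts quoted in the text: 16:116 (intact), 11:143 (pseudogenes).
p_value = fisher_exact_two_sided(16, 116, 11, 143)
print(f"two-sided p = {p_value:.3f}")
```

A p-value above the conventional 0.05 threshold is consistent with the conclusion that neutrality cannot be rejected for these counts.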
Recently, we extended this framework to consider CNDs and CNVs that encompass genes with different functional categories, based on Gene Ontology classifications (Perry et al., 2008b). In our study, CNVs were identified in 30 human and 30 chimpanzee individuals, using a whole-genome array-based comparative genomic hybridization platform. To identify fixed CNDs, we used the same platform to compare one human and one chimpanzee individual, and then filtered out the gains and losses that overlapped human or chimpanzee CNVs. We compared the CND:human CNV ratio for each functional category to that for intergenic regions. Relative to intergenic regions (18:52) and compared to the ratios for other functional categories, the cell proliferation (6:8) and inflammatory response (5:7) categories have an excess of CNDs (Perry et al., 2008b). Although these results are also not statistically significant, the cell proliferation and inflammatory response CNDs are intriguing candidates for future studies aiming to characterize the genetic basis of adaptive phenotypic differences between humans and chimpanzees.
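To make the size of these excesses concrete, each category's CND:CNV ratio can be expressed as an odds ratio relative to the intergenic baseline. The short sketch below does this for the counts quoted above; the function and variable names are hypothetical, and the odds ratio is offered only as a descriptive summary, not as the statistic used in the original study.

```python
def cnd_odds_ratio(cnd, cnv, baseline_cnd, baseline_cnv):
    """Odds of fixation (CND) versus polymorphism (CNV) for one category,
    divided by the same odds for a putatively neutral baseline."""
    return (cnd / cnv) / (baseline_cnd / baseline_cnv)


# Counts quoted in the text (CND, CNV).
intergenic = (18, 52)
categories = {
    "cell proliferation": (6, 8),
    "inflammatory response": (5, 7),
}

odds_ratios = {
    name: cnd_odds_ratio(cnd, cnv, *intergenic)
    for name, (cnd, cnv) in categories.items()
}
for name, value in odds_ratios.items():
    print(f"{name}: odds ratio vs. intergenic = {value:.2f}")
```

An odds ratio above 1 indicates proportionally more fixed differences than expected from the intergenic baseline. With counts this small, however, the uncertainty around each estimate is large, which is in keeping with the lack of statistical significance noted above.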
Comparisons such as those discussed above will help us better understand the relationship between CNDs and CNVs and can identify among-taxa variation in evolutionary pressures on copy number. These analyses will become more powerful with technological advances, especially improved breakpoint estimation and more precise knowledge of the functional elements contained within each variant. However, it will be more difficult to alleviate an ascertainment bias inherent in human and non-human primate comparisons. Specifically, it is challenging to reliably identify deletions of unique sequence that were fixed in the human lineage. For example, our study (Perry et al., 2008b) was based on a human-specific platform; these sequences (fixed human-specific deletions) would not have been represented on the microarray. One could construct a multi-species microarray platform (e.g., Gilad et al., 2005) or identify CNDs based on genome sequence comparisons (e.g., Cheng et al., 2005; Chimpanzee Sequencing and Analysis Consortium, 2005), but to reliably identify human-specific deletions, both of these approaches would require high-quality finished genome sequences for all species of interest. Once these issues are circumvented, the annotation and characterization of functional elements contained within the regions that are deleted in humans may require unconventional approaches, but these results will be particularly interesting and could provide important insights into our evolutionary history.
Acknowledgments
G.H.P. is supported by NIH fellowship F32GM085998.