FAQs

FAQs about “Genome-wide association analyses of risk tolerance and risky behaviors in over 1 million individuals identify hundreds of loci and shared genetic influences”


 

This document provides information about the study:

 

Karlsson Linnér et al. 2019. “Genome-wide association analyses of risk tolerance and risky behaviors in over 1 million individuals identify hundreds of loci and shared genetic influences.” Nature Genetics.

The document was prepared by several of the study’s coauthors and draws from and builds on the FAQs for earlier SSGAC papers. It has the following sections:

          1. Background

          2. Study design and results

          3. Social and ethical implications of the study

          4. Appendices

For clarifications or additional questions, please contact Jonathan P. Beauchamp (jonathan.pierre.beauchamp@gmail.com).

Quick Links

1.1.  Who conducted this study? What was the group’s overarching goal?

1.2.   The current study focuses on a variable called "general risk tolerance." What is general risk tolerance?

1.3.  What was already known about the genetics of risk tolerance prior to this study?

2.1.  What did you do in this paper? How was the study designed?

2.2.  What did you find in the GWAS?

2.3.  Are the SNPs associated with higher risk tolerance in your study also associated with other phenotypes?

2.4.  How much of a particular person's risk tolerance can be predicted from the results of this paper?

2.5.  What do your results tell us about human biology and brain development?

2.6.  How do your results relate to previous research on the genetics of risk tolerance?

3.1.  Did you find “the gene for” (or "the genes for") risk tolerance?

3.2.  Does this study show that an individual's level of risk tolerance is determined and fixed at conception?

3.3.  Can you use the results in this paper to meaningfully predict a particular person's risk tolerance?

3.4.  Can environmental factors modify the effects of the specific SNPs you identified?

3.5.  What policy lessons or practical advice do you draw from this study?

3.6.  Could this kind of research lead to discrimination against, or stigmatization of, people with specific genetic variants? If so, why conduct this research?

Appendix 1:  Quality control measures

Appendix 2:  Additional reading and references

1. Background

1.1.  Who conducted this study? What was the group's overarching goal?

The authors are members of the Social Science Genetic Association Consortium (SSGAC). The SSGAC is a multi-institutional, multi-disciplinary, international research group that aims to identify statistically robust links between genetic variants (for instance, base-pairs of DNA that vary across people) and phenotypes of interest to social scientists. A “phenotype” refers to anything that may be influenced by DNA, such as disease risk or physical characteristics. The phenotypes of interest to social scientists include behaviors, preferences, personality traits, and socioeconomic outcomes.

 

The SSGAC was formed in 2011 to overcome a specific set of scientific challenges. As is now well understood (Chabris et al. 2015), most phenotypes—including virtually all social-science phenotypes—are influenced by hundreds or thousands of genetic variants. Although in combination their collective effects can be sizeable, almost every one of these genetic variants has an extremely small effect on its own. To reliably identify these individual variants, therefore, scientists must study large samples; typically, hundreds of thousands of individuals are required. One approach to obtaining a large enough sample is for many research groups to pool analyses of their data into a single, large study. This approach has borne considerable fruit when used by medical geneticists interested in a range of diseases and conditions (Visscher et al. 2017a). Most of these advances would not have been possible without large research collaborations between multiple research groups interested in similar questions. The SSGAC was formed in an attempt by social scientists to adopt this research model.

 

The SSGAC is organized as a working group of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), a successful medical consortium. It was founded by three social scientists—Daniel Benjamin (University of Southern California), David Cesarini (New York University), and Philipp Koellinger (Vrije Universiteit Amsterdam)—who believed that genetic data could have a substantial positive impact on research in the social sciences, and that social-science genetics could make important contributions to medical research. The Advisory Board for the SSGAC is composed of prominent researchers representing various disciplines: Dalton Conley (Sociology, New York University), George Davey Smith (Epidemiology, University of Bristol), Tõnu Esko (Molecular Genetics, Broad Institute and Estonian Genome Center), Albert Hofman (Epidemiology, Harvard University), Robert Krueger (Psychology, University of Minnesota), David Laibson (Economics, Harvard University), James Lee (Psychology, University of Minnesota), Sarah Medland (Statistical Genetics, QIMR Berghofer Medical Research Institute), Michelle Meyer (Bioethics, Geisinger Health System), and Peter Visscher (Statistical Genetics, University of Queensland).

 

The SSGAC is committed to the principles of reproducibility and transparency. Prior to conducting genetic association studies, power calculations are carried out to determine the necessary sample size for the analysis (assuming realistically small effect sizes associated with individual genetic variants). These, together with an analysis plan, are often preregistered on the Open Science Framework (OSF) [The analysis plan for this study can be downloaded here: https://osf.io/cjx9m/]. Major SSGAC publications are usually accompanied by a FAQ document (such as this one). The FAQ document is written to communicate to journalists and the public what was found and what can and cannot be concluded from the research findings.

The SSGAC’s first major project was a genome-wide association study (GWAS) of educational attainment published in Science (Rietveld et al. 2013b). The study is summarized in a FAQ posted on the SSGAC website (https://www.thessgac.org/faqs). The study was followed by two related studies, using successively much larger samples, published in Nature (Okbay et al. 2016b) and Nature Genetics (Lee et al. 2018). Subsequent SSGAC papers have studied subjective well-being, depressive symptoms, the personality trait neuroticism, cognitive performance, and reproductive behavior. These papers have been published in Nature Genetics (Barban et al. 2016, Okbay et al. 2016a), Proceedings of the National Academy of Sciences (Rietveld et al. 2013a, 2014b), and Psychological Science (Chabris et al. 2012, Rietveld et al. 2014a), among other journals. The present study is the SSGAC’s first study that focuses on the genetics of general risk tolerance.

1.2.  The current study focuses on a variable called “general risk tolerance.” What is general risk tolerance?

Risk pervades many aspects of human life and is a central concept in the study of decision-making and behavior. Somewhat surprisingly, then, there is no universally agreed-upon definition of “risk.” For our purposes, we define “risk” as the degree of variability in possible outcomes, and “risk tolerance” as a person’s willingness to choose options that entail more risk, typically to have the chance of obtaining a more rewarding outcome. For example, an engineer with a high degree of risk tolerance would be more willing to quit her job at a stable, large corporation and join a risky start-up. An individual with a high degree of risk tolerance may also be more likely to drive faster than the speed limit on a highway, thus incurring a higher risk of having an accident or a traffic ticket in order to save time.

 

An individual’s risk tolerance typically varies across domains of behavior. For instance, an individual may be willing to take relatively more risks in the career and financial domains, but not in the health and leisure domains. Nonetheless, individuals with greater risk tolerance in one domain are statistically more likely to exhibit greater risk tolerance in other domains as well. For this reason, survey-based measures of general risk tolerance—defined as a person’s general willingness to take risks—have been used as all-around predictors of risky behaviors such as portfolio allocation, occupational choice, smoking, drinking alcohol, and starting one’s own business (Beauchamp et al. 2017, Dohmen et al. 2011, Falk et al. 2015). In our study, we analyze a measure of general risk tolerance based on responses to questions such as: “Would you describe yourself as someone who takes risks? Yes / No.” The exact phrasing and number of response categories varied across the study cohorts, but all questions asked subjects about their overall or general attitudes toward risk.

1.3.  What was already known about the genetics of risk tolerance prior to this study?

Researchers have found that identical twins (who share all of their genes) tend to be more similar to one another in terms of their risk tolerance than fraternal twins (who share, on average, only half of their genes), which suggests that genetic factors influence risk tolerance. With some assumptions, it is possible to translate the greater similarity of identical twins into an estimate of the “heritability” of risk tolerance. The heritability of risk tolerance is the percentage of the variation in risk tolerance among individuals that can be accounted for statistically by genetic differences, given current environmental conditions. Estimates from twin studies suggest that risk tolerance is moderately heritable (~30%) (Beauchamp et al. 2017, Cesarini et al. 2009, Harden et al. 2017). We note, however, that such estimates are based on several assumptions and vary across studies, in part because different studies use different measures of risk tolerance as well as different assumptions and methods.
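
As a rough illustration of how twin similarity can be translated into a heritability estimate, the sketch below uses Falconer's formula, a classical approximation that doubles the difference between the identical-twin and fraternal-twin correlations. This is an assumption made for illustration: the text above does not say which method the cited twin studies used, and the twin correlations in the example are hypothetical values, not results from any study.

```python
def falconer_heritability(r_mz: float, r_dz: float) -> float:
    """Falconer's approximation: h^2 = 2 * (r_MZ - r_DZ).

    r_mz: phenotypic correlation between identical (monozygotic) twins
    r_dz: phenotypic correlation between fraternal (dizygotic) twins
    """
    return 2.0 * (r_mz - r_dz)


# Hypothetical twin correlations, chosen only for illustration.
h2 = falconer_heritability(r_mz=0.30, r_dz=0.15)
print(f"{h2:.2f}")  # 0.30
```

The formula relies on the kind of assumptions mentioned above (for example, that identical and fraternal twins share their environments to the same degree), which is one reason estimates vary across studies.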

As we further discuss in FAQ 2.2, the current study also estimated the “SNP heritability” of risk tolerance, which is the percentage of the variation in risk tolerance among individuals that can be accounted for statistically by “common SNPs” (a type of genetic variant; see FAQ 2.1 for details), given current environmental conditions. Our estimates suggest that common SNPs account for only ~5% to 9% of the variation in risk tolerance across individuals. Importantly, while these heritability estimates all suggest that genetic factors influence risk tolerance, we emphasize that this does not imply that risk tolerance is pre-determined at birth or that genetic factors act independently of the environment, as we discuss below in FAQs 3.2 and 3.4.

Risk tolerance has been one of the most studied phenotypes in social science genetics. To date, however, nearly all published studies attempting to discover the genetic variants associated with risk tolerance have been “candidate-gene studies” conducted in relatively small samples, ranging from a few hundred to a few thousand individuals. A candidate-gene study tests the associations between a phenotype of interest and a few selected genetic variants that are hypothesized to be associated with the phenotype. Though there is nothing wrong in principle with such studies, we now know that the sample sizes of the candidate-gene studies for risk tolerance and other behavioral traits were probably too small to robustly identify genetic variants (Chabris et al. 2012, Hewitt 2012). [As mentioned above, it is now well established that the bulk of the genetic variation in the vast majority of behavioral phenotypes is attributable to a large number of genetic variants, each having a very small effect (Chabris et al. 2015). For that reason, large samples are needed to detect individual genetic variants.] Indeed, as we explain in FAQ 2.6, we used our own results to assess the evidence in favor of the main biological pathways and genetic variants which previous candidate-gene studies had hypothesized or reported to relate to risk tolerance. Although our sample was several orders of magnitude larger than the samples used in the candidate-gene studies, we found no evidence that these biological pathways and genetic variants are associated with risk tolerance.

To the best of our knowledge, prior to our study there had only been two studies with samples that were large enough to provide sufficient statistical power to robustly detect genetic variants with small effect sizes (Day et al. 2016, Strawbridge et al. 2018). From these studies, only two genetic variants associated with risk tolerance had been identified.

In summary, when our study was initiated, despite much interest, little was known about which genetic variants are related to risk tolerance.

 


2. Study design and results

2.1.  What did you do in this paper? How was the study designed?

We performed the largest-to-date genome-wide association study (GWAS) of risk tolerance. In a GWAS, scientists look across the human genome for genetic variants that are associated with a phenotype of interest. If a genetic variant is associated, then individuals who have a certain “allele” (i.e., a certain version of that variant) are more likely than those with a different allele to exhibit a phenotype (in this case, higher general risk tolerance).
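
To make the idea of a single-SNP association test concrete, here is a minimal sketch that regresses a simulated phenotype on the number of copies of an allele (0, 1, or 2) that each simulated individual carries. Everything in it is an illustrative assumption: the data are simulated, the names (`snp_association`, `allele_freq`, `true_beta`) are hypothetical, and the simulated effect is far larger than the effects reported in the study. A real GWAS repeats a test like this for millions of SNPs and includes control variables, which are omitted here.

```python
import math
import random


def snp_association(genotypes, phenotype):
    """Simple linear regression of phenotype on allele count.

    Returns (beta_hat, standard_error, z_statistic).
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotype))
    beta_hat = sxy / sxx
    intercept = mean_y - beta_hat * mean_g
    rss = sum((y - intercept - beta_hat * g) ** 2 for g, y in zip(genotypes, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta_hat, se, beta_hat / se


# Simulated data (assumption): 20,000 individuals, one SNP, exaggerated effect.
rng = random.Random(0)
n, allele_freq, true_beta = 20_000, 0.3, 0.1
# Each individual carries two copies of the SNP; count copies of the allele.
genotypes = [(rng.random() < allele_freq) + (rng.random() < allele_freq) for _ in range(n)]
phenotype = [true_beta * g + rng.gauss(0.0, 1.0) for g in genotypes]

beta_hat, se, z = snp_association(genotypes, phenotype)
print(f"beta = {beta_hat:.3f}, se = {se:.3f}, z = {z:.1f}")
```

With effects as small as those in the study, the standard error has to shrink considerably before an association can be detected, which is why very large samples are needed.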

We chose a GWAS study design because it has been a successful research strategy for identifying genetic variants associated with many traits and diseases, including body height (Wood et al. 2014), BMI (Locke et al. 2015), Alzheimer’s disease (Lambert et al. 2013), and schizophrenia (Ripke et al. 2014). GWAS have also recently been used to identify genetic variants associated with a variety of health-relevant social science outcomes, such as the number of children a person has (Barban et al. 2016), happiness (Okbay et al. 2016a, Turley et al. 2018), and educational attainment (Okbay et al. 2016b, Rietveld et al. 2013b). Furthermore, scientists who have attempted to replicate reported GWAS associations in independent samples of sufficiently large size have typically been successful (Visscher et al. 2017b), thereby indicating that GWAS associations are robust findings. 


In our GWAS of general risk tolerance, we tested ~9.3M single nucleotide polymorphisms (SNPs) from across the human genome for association with general risk tolerance. SNPs are the most common type of genetic variant in the genome and are the genetic variants that are captured by the genetic data used in our study and most other modern genome-wide association studies. (There are other types of genetic variants, which we did not analyze.) Some SNPs have alleles that are relatively common in the population and are called “common SNPs,” while other SNPs have one allele that is rare in the population; our GWAS analyzed both common SNPs and some rare SNPs.


As mentioned above, genetic variants associated with social-science phenotypes tend to have very small individual effects on the phenotypes. Therefore, in order to have sufficient statistical power to discover SNPs associated with risk tolerance, we pooled the results from analyses of two very large datasets, the UK Biobank (n = 431,126 individuals) and a dataset of research participants from 23andMe (n = 508,782 individuals), thereby yielding a “discovery” sample of 939,908 individuals. We replicated the findings from this discovery sample in a “replication” sample composed of ten smaller datasets and totaling 35,445 individuals. In all of these samples, to avoid the statistical confounding that arises from studying ethnically diverse populations, we restricted our GWAS to individuals of European ancestries. (For a somewhat technical explanation, see Appendix 1.)
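
The sketch below illustrates one common way of pooling per-dataset results for a single SNP: a fixed-effect, inverse-variance-weighted meta-analysis. This weighting scheme is an assumption chosen for illustration, since the text above does not specify how the results were combined, and the per-cohort estimates are hypothetical. The only figures taken from the text are the two sample sizes.

```python
import math


def inverse_variance_meta(estimates):
    """Combine (beta, standard_error) pairs with inverse-variance weights."""
    weights = [1.0 / se ** 2 for _, se in estimates]
    pooled_beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled_beta, pooled_se


# Sample sizes reported in the text.
discovery_n = 431_126 + 508_782
print(discovery_n)  # 939908

# Hypothetical per-cohort estimates for one SNP: (beta, standard error).
cohort_a = (0.010, 0.004)
cohort_b = (0.006, 0.004)
beta, se = inverse_variance_meta([cohort_a, cohort_b])
print(f"pooled beta = {beta:.4f}, pooled se = {se:.4f}")
```

The pooled standard error is smaller than either cohort's own standard error, which is the source of the gain in statistical power from combining datasets.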


We used the results of our GWAS of general risk tolerance for a wide range of additional analyses. For example, to examine the extent to which SNPs that are associated with risk tolerance also tend to be associated with other phenotypes, we estimated “genetic correlations” between risk tolerance and a wide range of phenotypes (see FAQ 2.3). In addition, in several samples of genotyped individuals, we used individuals’ SNP data and the results of our GWAS to construct “polygenic scores” that partially predict individuals’ risk tolerance based on their SNP data (see FAQ 2.4). We also performed a suite of bioinformatics analyses to get insight into the biology of risk tolerance (see FAQs 2.5 and 2.6).
 

In addition to our GWAS of general risk tolerance, we conducted six supplementary GWAS, of six phenotypes related to risk tolerance and risk-taking behaviors. We conducted a GWAS of “adventurousness,” defined as the self-reported tendency to be adventurous vs. cautious. We also conducted GWAS of four risky behaviors that each plausibly capture risk taking in a different domain of behavior: “automobile speeding propensity” (the tendency to drive faster than the speed limit), “drinks per week” (the average number of alcoholic drinks consumed per week), “ever smoker” (whether one has smoked more than once or twice), and “number of sexual partners” (the lifetime number of sexual partners). Finally, we conducted a GWAS of the first principal component of the four risky behaviors. (The first principal component is a variable that captures the common variation across the four risky behaviors and can be interpreted as capturing the general tendency to take risks across domains.) Section 1.2 of our article’s Supplementary Information provides more detail on the definitions of these phenotypes. The analyses of the six supplementary phenotypes were performed in samples ranging from ~315,000 to ~557,000 individuals. These samples were smaller because of more limited data availability for these phenotypes.
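
For readers unfamiliar with principal components, the following sketch computes a first principal component for four simulated variables that share a common underlying factor. It is illustrative only: the data are simulated, the variables are standardized before the analysis (an assumption, not a description of the study's procedure), and the function name and the use of power iteration are choices made to keep the example dependency-free.

```python
import math
import random


def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for the first PC of standardized columns."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) for j in range(k)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(k)] for r in rows]
    # Correlation matrix of the standardized columns.
    corr = [[sum(row[a] * row[b] for row in z) / (n - 1) for b in range(k)] for a in range(k)]
    # Power iteration converges to the leading eigenvector of the matrix.
    v = [1.0] * k
    for _ in range(iterations):
        v = [sum(corr[a][b] * v[b] for b in range(k)) for a in range(k)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    scores = [sum(zi * vi for zi, vi in zip(row, v)) for row in z]
    return v, scores


# Simulated data (assumption): four behaviors driven partly by one shared factor.
rng = random.Random(1)
n = 2_000
factor = [rng.gauss(0.0, 1.0) for _ in range(n)]
rows = [[0.6 * f + rng.gauss(0.0, 1.0) for _ in range(4)] for f in factor]

loadings, scores = first_principal_component(rows)
print([round(l, 2) for l in loadings])
```

In this simulation all four loadings have the same sign and similar size, reflecting that each variable contributes to the shared component; each individual's score is a single number summarizing the four variables.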
 

2.2.  What did you find in the GWAS?

Our main GWAS identified 124 SNPs associated with general risk tolerance in our discovery sample. The 124 SNPs are located in 99 “loci” (a locus is a small region of the genome). As expected, the estimated individual effects of the 124 SNPs are all very small: none of the SNPs explain more than 0.02% of the variation in general risk tolerance across individuals. 
 

We verified that the 124 SNPs identified in our discovery sample also tend to be associated with general risk tolerance in our replication sample. Because the replication sample was not large enough to provide adequate statistical power to replicate the associations of each of the 124 SNPs individually, we performed a “holistic” replication analysis. This analysis compares the overall agreement in estimates for the 124 SNPs across the discovery and the replication GWAS. This holistic replication was successful, indicating that it is highly unlikely that the results from our discovery sample were driven by chance alone.

 
We also estimated the “SNP heritability” of risk tolerance. The SNP heritability of a phenotype is the share of the variation in the phenotype that is statistically accounted for by common SNPs, given current environmental conditions (see FAQ 1.3). We used several methods to obtain our estimates. With all methods, we used a set of common SNPs—that is, SNPs that have alleles that are relatively common in the population—to estimate the heritability. Because the different methods make different assumptions and because we applied the different methods to slightly different data, the methods yielded different heritability estimates. Our estimates suggest that common SNPs account for ~5% to 9% of the variation in risk tolerance across individuals. (The true heritability of risk tolerance is likely to be somewhat higher, since other genetic variants, such as rare SNPs and structural genetic variants, are likely to also contribute to variation in risk tolerance.) 

 

Our six supplementary GWAS (of the phenotypes related to risk tolerance and risk-taking behaviors) identified a total of 741 associations between a specific SNP and one of the phenotypes. Because of the lack of suitable replication samples, we did not perform replication analyses for the GWAS of these six phenotypes.

2.3.  Are the SNPs associated with higher risk tolerance in your study also associated with other phenotypes?

Yes. Of the 124 SNPs we identified as associated with general risk tolerance, we found that 72 are also associated with one or more of the six supplementary phenotypes related to risk tolerance and risk-taking behaviors [Equivalently, as we write in the abstract of the paper, of the 99 loci referred to above and that contain the 124 SNPs associated with general risk tolerance, 46 also contain one or more SNPs associated with at least one of the six supplementary phenotypes.]. We also identified several regions of the genome that stood out as being associated with general risk tolerance and with all or most of the six supplementary phenotypes. We verified that the effects of the SNPs in these regions are concordant, such that SNPs associated with higher general risk tolerance are also associated with more risky behavior. This suggests that these regions represent shared genetic influences on risk tolerance and risky behaviors (rather than just being genomic hot spots containing SNPs associated with many different phenotypes).


In addition, we estimated the “genetic correlation” between general risk tolerance and various other phenotypes. The genetic correlation between two phenotypes is a measure of the extent to which the SNPs that affect one phenotype also tend to affect the other phenotype. We found that general risk tolerance is moderately to highly genetically correlated with a range of risky behaviors. General risk tolerance is genetically correlated with the six supplementary phenotypes (which capture various types of risky behavior), with estimates of the genetic correlations ranging from 0.25 to 0.83. General risk tolerance is also moderately to highly genetically correlated with a number of additional risky behaviors, including cannabis use and self-employment. Importantly, the genetic correlations are in the expected direction, with higher risk tolerance being associated with riskier behavior. Moreover, our estimates of the genetic correlations between general risk tolerance and the supplementary risky behaviors are substantially higher than the corresponding phenotypic correlations [Although measurement error partly accounts for the lower phenotypic correlations, the genetic correlations remain considerably higher even after adjustment of the phenotypic correlations for measurement error.], implying that general risk tolerance is more strongly associated with these risky behaviors at the genetic level than at the non-genetic (environmental) level. The relatively high genetic correlations between general risk tolerance and risky behaviors suggest the existence of a genetically-influenced “general factor of risk tolerance” that captures a general tendency to take risk across domains of behavior.


We also found that risk tolerance is moderately genetically correlated with several personality and neuropsychiatric phenotypes. Of note, the estimated genetic correlations with the personality traits extraversion (r_g = 0.51) ["r_g" denotes a genetic correlation estimate.], neuroticism (r_g = –0.42), and openness to experience (r_g = 0.33) are highly statistically significant and are substantially larger in magnitude than previously reported phenotypic correlations, pointing to shared genetic influences among general risk tolerance and these personality traits. We also found statistically significant and positive genetic correlations between general risk tolerance and the neuropsychiatric phenotypes ADHD, bipolar disorder, and schizophrenia.

2.4.  How much of a particular person’s risk tolerance can be predicted from the results of this paper?

Although each individual SNP has a very small effect, the GWAS estimates of the SNPs’ (very small) effects can be combined to create a “polygenic score,” an index that takes into account the effects of many SNPs from across the genome. Because a polygenic score aggregates the information from many SNPs, it can predict far more of the variation in risk tolerance among individuals than any single SNP. We found that polygenic scores constructed using the results of our GWAS of general risk tolerance explain up to ~1.6% of the variation across individuals in general risk tolerance. While 1.6% is far larger than the amount of variation explained by individual SNPs (less than 0.02%, as noted above), it is small in absolute terms. As we explain in FAQ 3.3, such polygenic scores cannot be used to meaningfully predict a particular person’s risk tolerance.
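
In its simplest form, a polygenic score is a weighted sum: for each SNP, the number of copies of an allele a person carries is multiplied by that allele's estimated effect from the GWAS, and the products are added up. The sketch below shows this arithmetic. The SNP names, weights, and genotypes are hypothetical, and the function is an illustration rather than the procedure used in the study; real pipelines typically also adjust the weights for correlations between nearby SNPs.

```python
def polygenic_score(allele_counts, weights):
    """Weighted sum of allele counts over the SNPs that have a GWAS weight."""
    return sum(weights[snp] * count for snp, count in allele_counts.items() if snp in weights)


# Hypothetical GWAS effect estimates (per copy of the allele).
gwas_weights = {"snp_a": 0.012, "snp_b": -0.008, "snp_c": 0.005}

# Hypothetical individual: number of copies (0, 1, or 2) carried at each SNP.
person = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

score = polygenic_score(person, gwas_weights)
print(f"{score:.3f}")  # 0.016
```

Because the score sums many tiny contributions, it carries more information than any single SNP, but as the text explains, it still accounts for only a small share of the variation across individuals.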


The predictive power of the polygenic scores is so small partly because our estimates of the SNPs’ effect sizes are relatively imprecise. As the available sample sizes for GWAS get larger, estimates of the SNPs’ effect sizes will become more precise, and the scores’ explanatory power will rise; in theory, if environmental conditions remain the same, it should be possible one day to construct a polygenic score whose explanatory power is close to the heritability of risk tolerance. For example, a score constructed using the set of common SNPs we used to estimate the ~5% to 9% SNP heritability of risk tolerance (see FAQ 2.2) may ultimately explain ~5% to 9% of the variation in risk tolerance across individuals.

Although the polygenic scores we constructed have too little explanatory power to usefully predict any individual’s risk tolerance, they have sufficient explanatory power to be useful in social science studies, which focus on average or aggregated behavior in the population (not individual outcomes). Indeed, with 80% statistical power (the conventional threshold for adequate power), the effect of our polygenic scores can be detected in a study with 500 individuals. Therefore, the polygenic scores provided by our study can be useful in social science studies that have at least 500 participants and in which the participants’ genomes have been measured. (Several datasets commonly used in social science research meet these criteria.)
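
The figure of roughly 500 individuals can be checked with a back-of-the-envelope power calculation. The sketch below uses a standard large-sample approximation, in which the required sample size is the squared sum of two normal quantiles divided by the share of variance explained. The two-sided 5% significance level and the use of 1.6% as the variance explained are assumptions made here for illustration; the text does not state the exact inputs behind the 500 figure.

```python
from statistics import NormalDist


def required_sample_size(r_squared, alpha=0.05, power=0.80):
    """Approximate n needed to detect a predictor explaining r_squared of variance.

    Uses n ~ (z_{1 - alpha/2} + z_{power})^2 / r_squared, a large-sample
    approximation for a two-sided test of a single regression coefficient.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) ** 2 / r_squared


n_needed = required_sample_size(0.016)
print(round(n_needed))
```

Under these assumptions the calculation gives a sample size a little under 500, in line with the figure quoted above.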

2.5.  What do your results tell us about human biology and brain development?

To gain insights into the biological mechanisms through which genetic variation influences general risk tolerance, we conducted a suite of bioinformatics analyses. Our bioinformatics analyses point to the involvement of the neurotransmitters glutamate and GABA, which were heretofore not generally believed to play a role in risk tolerance. Glutamate is the most abundant neurotransmitter in the body and plays an excitatory role (i.e., when one neuron secretes it onto another, the second neuron is more likely in turn to transmit its own signal). GABA, by contrast, is the main inhibitory transmitter. To our knowledge, with the exception of a recent study (Lee et al. 2018) prioritizing a much larger number of pathways, no published large-scale GWAS of cognition, personality, or neuropsychiatric phenotypes has pointed to clear roles both for glutamate and GABA. Our results suggest that the balance between excitatory and inhibitory neurotransmission may contribute to variation in general risk tolerance across individuals.


Perhaps unsurprisingly, our bioinformatics analyses point to a role for the brain and the central nervous system in modulating risk tolerance. Specifically, our analyses point to the involvement of some brain regions that have previously been identified in neuroscientific studies on decision-making, including the prefrontal cortex, basal ganglia, and midbrain.

2.6.  How do your results relate to previous research on the genetics of risk tolerance?

As mentioned above in FAQ 1.3, risk tolerance has been one of the most studied phenotypes in social science genetics. However, almost all previous studies have been “candidate-gene studies” conducted in relatively small samples, whose limitations are now appreciated. 


We used the results of our GWAS to revisit this previous research. We reviewed the literature that aimed to link risk tolerance to biological pathways, and identified five main biological pathways that have been previously hypothesized to relate to risk tolerance: the steroid hormone cortisol, the monoamine neurotransmitters dopamine and serotonin, and the steroid sex hormones estrogen and testosterone. We then tested whether these five biological pathways relate to risk tolerance.


To understand how we tested these five biological pathways, it is helpful to first define what a gene is. A “gene” is a sequence of DNA in the genome that codes for a molecule that has a biological function. The human genome has roughly 20,000 to 25,000 genes; although genes comprise only about 1% to 2% of the human genome, they have important biological functions. Genes, like other parts of the genome, can contain SNPs.


To test the five biological pathways for association with risk tolerance, thus, we first used external databases created by other researchers to identify the genes that are involved, or are likely to be involved, in each of these five pathways. Then, we conducted various bioinformatics analyses that used the results of our GWAS and tested the hypothesis that SNPs located in the genes involved in each of the five pathways tend to be more strongly associated with general risk tolerance than other SNPs. We found no evidence in support of that hypothesis, suggesting that the five pathways are not particularly important contributors to individual variation in risk tolerance. 


We also used our GWAS results to examine whether SNPs located within (or highly correlated with) 15 specific genes, which previous candidate-gene studies had tested for association with risk tolerance, are indeed associated with risk tolerance. Our sample was several orders of magnitude larger than the samples used in the previous candidate-gene studies (as mentioned above in FAQ 1.3, these studies were conducted in relatively small samples). Despite this, we found no evidence that these 15 genes are associated with risk tolerance, and failed to replicate the main associations the previous candidate-gene studies had reported. Our results are consistent with other studies that have found that small-sample candidate-gene studies have a poor replication record (Chabris et al. 2012, Hewitt 2012). 


We also note that our discovery GWAS replicated the associations between general risk tolerance and the two SNPs that had previously been found to be associated with general risk tolerance in the two previous studies with large samples (Day et al. 2016, Strawbridge et al. 2018; see FAQ 1.3). This is not surprising, however, since those two studies analyzed data from the UK Biobank, and the UK Biobank is one of the two large datasets we included in our discovery GWAS.


In summary, instead of pointing to the main genetic variants and biological pathways that had previously been hypothesized to relate to risk tolerance, our analyses identified 124 SNPs associated with risk tolerance (see FAQ 2.2), and point to the involvement of the neurotransmitters glutamate and GABA and of several brain regions (see FAQ 2.5).

3. Social and ethical implications of the study

3.1.  Did you find “the gene for” (or “the genes for”) risk tolerance?

No. We did find several genes containing SNPs associated with general risk tolerance, but that does not mean that these genes determine general risk tolerance. (As mentioned in FAQ 2.6, a gene is a sequence of DNA in the genome that codes for a molecule that has a biological function; genes, like other parts of the genome, can contain SNPs.) The genetic factors we identified are involved in a long chain of biological processes that exert an influence on human behavior, and those processes are intricately entwined with the environment.


In summary, our findings conform with the expectation that variation in risk tolerance across individuals is influenced by at least thousands, if not millions, of genetic variants (Chabris et al. 2015).

3.2.  Does this study show that an individual’s level of risk tolerance is determined and fixed at conception?

No. A large share of the variation in risk tolerance among individuals is determined by environmental factors, and environmental factors may also interact with genetic factors. As mentioned in FAQ 1.3, twin studies have found that part of the variation in risk tolerance across individuals is statistically accounted for by genetic factors. But even if all of the variation in risk tolerance at a certain point in time were accounted for by genetic factors (which is definitely not the case), this would not rule out the possibility of past or future environmental influences on risk tolerance. For instance, even if poor eyesight were perfectly heritable and hence completely determined by genetic factors (it is not), the invention of eye glasses, contact lenses, and laser surgery would all drastically improve a person’s poor genetic outlook for clear vision. On the flip side, environmental trauma (e.g., a poke to the eye) could drastically worsen another individual’s genetic outlook for clear vision. The lesson of eyesight as a phenotype is that heritability of a phenotype—even 100% heritability—does not imply biological determinism: environmental factors can still in principle influence the phenotype. And again, risk tolerance is far from being perfectly heritable.

3.3.  Can you use the results in this paper to meaningfully predict a particular person’s risk tolerance?

No, the results cannot be used to meaningfully predict either a particular person’s general risk tolerance or their likelihood of taking any particular risk or engaging in any particular sort of risky behavior. As mentioned in FAQ 2.4, we used the results of our GWAS of general risk tolerance to construct polygenic scores that can explain up to ~1.6% of the variation across individuals in general risk tolerance. That means that ~98.4% of the variation in general risk tolerance is explained by factors other than the polygenic scores.
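
The arithmetic behind this answer can be made concrete. The short Python sketch below takes the ~1.6% figure from the text and, assuming a simple linear prediction on a standardized scale (a simplifying assumption; the function names are purely illustrative), computes the implied correlation between score and trait and the typical prediction error that remains once the score is known.

```python
import math

def implied_correlation(r_squared):
    """Correlation between score and trait implied by a given R^2."""
    return math.sqrt(r_squared)

def residual_sd(r_squared):
    """SD of the prediction error, in trait standard deviations,
    for a linear predictor that explains r_squared of the variance."""
    return math.sqrt(1.0 - r_squared)

r2 = 0.016  # ~1.6% of variation explained, as stated in the text
print(f"unexplained share:   {1.0 - r2:.3f}")
print(f"implied correlation: {implied_correlation(r2):.3f}")
print(f"residual SD:         {residual_sd(r2):.3f}")
```

Under these assumptions, knowing a person's polygenic score shrinks the typical prediction error by less than 1%, which is why the score is not informative about any one individual.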


As we also mentioned in FAQ 2.4, we expect that future, larger GWAS will allow the construction of polygenic scores with higher predictive power. However, the predictive power of such scores would still pale in comparison to some other scientific predictors. For example, professional weather forecasts correctly predict about 95% of the variation in day-to-day temperatures. Weather forecasters are therefore vastly more accurate forecasters than social science geneticists will ever be.


We also note that, while the polygenic scores we constructed can’t usefully predict any individual’s risk tolerance, they can be useful in social science studies, which focus on aggregated behavior in the population.

 

3.4.  Can environmental factors modify the effects of the specific SNPs you identified?

It is a plausible hypothesis that environmental factors are both moderators and mediators of genetic influences on risk tolerance. For example, it is conceivable that some SNPs have alleles (as mentioned above, an allele is a certain version of a genetic variant) that tend to make individuals relatively less risk tolerant, but only when the individuals are exposed to certain environments (e.g., when they experience a traumatic episode). (Such environmental factors would be said to “moderate” the influence of those SNPs.) It is also conceivable that some SNPs affect risk tolerance indirectly, by influencing individuals’ preferences for certain environments (e.g., by influencing their preferences for socializing with quiet, cautious friends), which may in turn affect their risk tolerance. (Such environments would be said to “mediate” the influence of those SNPs.)


We did not perform any statistical tests of “gene-environment interactions” in our study. (Gene-environment interactions refer to the moderation of genetic influences by environmental factors.) One promising approach for future studies that seek to identify gene-environment interactions will be to use our GWAS results to construct polygenic scores of general risk tolerance, and then test whether environmental or demographic variables moderate the association between the polygenic scores and an outcome of interest. 
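
As a rough illustration of the moderation test described above, the following Python sketch simulates data in which the association between a polygenic score and an outcome is stronger for individuals exposed to some environment, and then compares the score-outcome slope across the two groups. Everything here is hypothetical: the data are simulated, the effect sizes are invented, and the helper names are ours. A real analysis would use a regression with an interaction term and appropriate controls.

```python
import random

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate(n, base_effect, extra_effect_in_env, seed=0):
    """Simulated data only: the effect sizes are hypothetical."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        pgs = rng.gauss(0.0, 1.0)      # standardized polygenic score
        env = rng.random() < 0.5       # binary environmental exposure
        slope = base_effect + (extra_effect_in_env if env else 0.0)
        outcome = slope * pgs + rng.gauss(0.0, 1.0)
        rows.append((pgs, env, outcome))
    return rows

def slope_by_environment(rows):
    """Score-outcome slope among unexposed (False) and exposed (True)."""
    slopes = {}
    for flag in (False, True):
        sub = [(p, o) for p, e, o in rows if e == flag]
        slopes[flag] = ols_slope([p for p, _ in sub], [o for _, o in sub])
    return slopes

rows = simulate(n=20000, base_effect=0.1, extra_effect_in_env=0.2)
slopes = slope_by_environment(rows)
print("slope, unexposed:", round(slopes[False], 3))
print("slope, exposed:  ", round(slopes[True], 3))
print("difference:      ", round(slopes[True] - slopes[False], 3))
```

A difference in slopes that is reliably different from zero would be evidence that the environment moderates the association with the score.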


To facilitate such research, we have made the summary results of our GWAS publicly available on the SSGAC’s website (www.thessgac.org); interested researchers who have access to datasets with genotypic data can download these results and use them to construct polygenic scores.

 

3.5.  What policy lessons or practical advice do you draw from this study?

None whatsoever. Any practical response—individual or policy-level—to this or similar research would be extremely premature. In this respect, our study is no different from genome-wide association studies (GWAS) of complex medical outcomes. In medical GWAS research, it is well understood that identifying genetic variants that affect disease risk is merely a first step toward understanding the underlying biology of that disease. It is not sufficient to assess risk for any specific individual. It is not appropriate to base policies and practices on such assessments.

3.6.  Could this kind of research lead to discrimination against, or stigmatization of, people with specific genetic variants? If so, why conduct this research?

Unfortunately, like a great deal of research—including, for instance, research identifying genetic variants associated with increased cancer risk—the results can be misunderstood and could be misapplied, including by being used to discriminate against individuals with specific genetic variants (e.g., in insurance markets). Nevertheless, for a variety of reasons, we do not think that the best response to the possibility that useful knowledge might be misused is to refrain from producing the knowledge.


First, even if we believed that some knowledge (and specifically knowledge about genetic influences on risk-taking behavior) should be forbidden, that goal is unattainable. Behavioral genetics research, including studies of the relationships between genes and a variety of social-science phenotypes, including risk tolerance, is already being conducted by many scientists and other individuals around the world and will continue to be conducted. Not all of this work involves the use of appropriate scientific methods or the transparent communication of results. In this context, researchers who are committed to developing, implementing, and spreading best practices for conducting and communicating potentially controversial research, including behavioral genetics research, arguably have an ethical responsibility to participate in the development and dissemination of this body of knowledge—rather than abstain from it because of its sensitive nature. 


An important theme in our earlier work has been to point out that most existing studies in social-science genetics that report genetic associations with behavioral traits have serious methodological limitations, fail to replicate, and are likely to have false-positive findings (Beauchamp et al. 2011, Benjamin et al. 2012, Chabris et al. 2012, 2015). This same point was made in an editorial in Behavior Genetics (the leading journal for the genetics of behavioral traits), which stated that “it now seems likely that many of the published [behavior genetics] findings of the last decade are wrong or misleading and have not contributed to real advances in knowledge” (Hewitt 2012). Consistent with this, the current study was unable to replicate the results of previous candidate-gene studies of risk tolerance (see FAQ 2.6). One of the most important reasons why earlier work has generated unreliable results is that the sample sizes were far too small, given that the true effects of individual genetic variants on behavioral traits are tiny.

 
Second, one should not assume that behavioral genetics research carries only the potential to increase stigmatization. For instance, behavioral phenotypes such as general risk tolerance are often assumed to be fully and equally within the control of every individual. That view of these behaviors likely contributes to a lack of sympathy for those who exhibit a self-destructive level of risk-taking and, perhaps, suboptimal support for programs that attempt to reduce such behavior. Our purpose here is not to advocate for or against any particular policy for addressing risk behaviors; rather, we mean only to point out that a finding that genes do have some influence can reduce, rather than increase, stigma of those who exhibit risk-tolerant or even risk-seeking behavior. 


Third, behavioral genetics research has the potential to yield other benefits, especially as sample sizes continue to increase. Foregoing this research necessarily entails foregoing these and any other possible benefits, some of which will likely be the result of serendipity rather than being foreseeable. For instance, identifying variants associated with risk tolerance may lead to insights regarding the underlying biological pathways. To take an example from medicine, genetic variants in the LMTK2 (lemur tyrosine kinase 2) gene have small effects on an individual’s predisposition to prostate cancer. Nonetheless, knowing that this gene is involved can point scientists toward studying what the gene does, which may end up teaching us something critical about the pathology of prostate cancer. The effect from modifying a biological pathway, e.g., with a pharmaceutical, is potentially much larger than the effect of the gene itself. Moreover, although we are not quite there yet, when many genetic variants taken together capture ~10% of the variation across individuals in risk tolerance, this amount of predictive power (while still too low to be relevant for individual predictions) will be useful for controlling for genetic factors when studying the effect of a policy or program on an outcome that is also affected by risk tolerance. For example, when studying a policy intervention that aims to reduce the use of illicit substances that present health risks, controlling for as many factors as possible, including genetic factors associated with risk taking, can help generate more precise estimates of the effectiveness of the policy.
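
To give a sense of the precision gain mentioned above, the Python sketch below uses a standard large-sample approximation for a randomized study: if a control variable accounts for a share R² of the variance in the outcome and is unrelated to treatment assignment, the standard error of the estimated treatment effect shrinks by roughly a factor of sqrt(1 − R²). The 10% value is a hypothetical share of outcome variance, used only for illustration; the function names are ours.

```python
import math

def standard_error_ratio(r_squared):
    """Approximate ratio of the treatment-effect standard error with the
    control variable to the standard error without it."""
    return math.sqrt(1.0 - r_squared)

def equivalent_sample_size_gain(r_squared):
    """Factor by which the sample would need to grow to reach the same
    precision without the control variable."""
    return 1.0 / (1.0 - r_squared)

r2 = 0.10  # hypothetical share of OUTCOME variance captured by the score
print("standard error ratio:      ", round(standard_error_ratio(r2), 3))
print("equivalent sample increase:", round(equivalent_sample_size_gain(r2), 3))
```

Under this approximation, a score capturing 10% of the outcome variance is worth about as much as an 11% larger sample.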


In sum, the potential benefits of this research, when conducted responsibly, seem reasonable in relation to the risks, especially considering that this research is already being conducted, sometimes with lesser attention to both scientific rigor and thoughtful science communication. We thus agree with the U.K. Nuffield Council on Bioethics, which concluded in a report (Nuffield Council on Bioethics 2002, p. 114) that “research in behavioural genetics has the potential to advance our understanding of human behaviour and that the research can therefore be justified,” but that “researchers and those who report research have a duty to communicate findings in a responsible manner.” In our view, responsible behavioral genetics research includes sound methodology and analysis of data; a commitment to publish all results, including any negative results; and transparent, complete reporting of methodology and findings in publications, presentations, and communications with the media and the public, including particular vigilance regarding what the results do—and do not—show (hence, this FAQ document).

4. Appendices

Appendix 1:  Quality control measures

There are many potential pitfalls that can lead to spurious results in genome-wide association studies (GWAS). We took many precautions to guard against these pitfalls.


One potential source of spurious results is incomplete “quality control (QC)” of the genetic data. To avoid this problem, we used state-of-the-art QC protocols from medical genetics research (Winkler et al. 2014). We supplemented these protocols with a more recent protocol from Okbay et al. (2016a) and with additional, more stringent QC filters that we developed and applied.


Another potential source of spurious results is a confound known as “population stratification” (e.g., Hamer & Sirota 2000). To illustrate, suppose we were conducting a GWAS of height. People from Northern Europe are on average taller than people from Southern Europe, and there are also small differences in how often certain genetic variants occur in Northern and Southern Europe. If we combine samples of Northern and Southern Europeans and perform a GWAS that ignores the regions the individuals come from, then we would find genetic associations for these variants. However, those associations would simply reflect the fact that the variants are correlated with a population (Northern or Southern Europe) and may actually have nothing to do with height.


In our study we were extremely careful to avoid population stratification as much as possible. At the outset, we restricted the study to individuals of European ancestries, since population stratification problems are more severe when including individuals of different ancestries in the same sample. As is standard in GWAS of medical outcomes, we controlled for “principal components” of the genetic data in the analysis; these principal components capture the small genetic differences across populations, so controlling for them largely removes the spurious associations arising solely from these small differences. 
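
The height illustration and the effect of controlling for ancestry can be sketched with a small simulation. In the Python code below, a variant has no effect on height by construction, but it is more common in the taller of two simulated populations. The pooled analysis finds a spurious association; removing each population's mean first makes it disappear. Note the assumptions: the allele frequencies and height difference are invented, and a known population label stands in for the principal components that a real GWAS would estimate from genome-wide data.

```python
import random

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_population(n, allele_freq, mean_height, rng):
    """Genotypes (0, 1, or 2 copies of an allele) and standardized heights.
    The variant has no effect on height by construction."""
    people = []
    for _ in range(n):
        genotype = (rng.random() < allele_freq) + (rng.random() < allele_freq)
        height = rng.gauss(mean_height, 1.0)
        people.append((genotype, height))
    return people

def demean(people):
    """Subtract the group's mean genotype and mean height."""
    n = len(people)
    mg = sum(g for g, _ in people) / n
    mh = sum(h for _, h in people) / n
    return [(g - mg, h - mh) for g, h in people]

rng = random.Random(0)
north = simulate_population(5000, allele_freq=0.6, mean_height=1.0, rng=rng)
south = simulate_population(5000, allele_freq=0.2, mean_height=0.0, rng=rng)

pooled = north + south
naive = ols_slope([g for g, _ in pooled], [h for _, h in pooled])

adjusted_data = demean(north) + demean(south)
adjusted = ols_slope([g for g, _ in adjusted_data],
                     [h for _, h in adjusted_data])

print("association ignoring population:", round(naive, 3))
print("association after adjustment:   ", round(adjusted, 3))
```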


After taking these steps to minimize population stratification, we conducted several analyses to assess how much population stratification still remained in our data. First, we analyzed data on 17,684 sibling pairs from the Swedish Twin Registry and the UK Biobank. The key idea underlying our test was to examine if differences in genetic variants across siblings are associated with differences in the siblings’ risk tolerance. If so, then these associations cannot be the result of population stratification. The reason is that full siblings (from the same two biological parents) share their ancestry entirely, and therefore differences in their genetic variants cannot be due to being from different population groups. Unfortunately, because our sample of siblings is much smaller than our discovery GWAS sample (939,908 individuals), our estimates of the effects of the genetic variants within the sibling pairs are much less precise than those in the GWAS. However, we can test whether the GWAS results are entirely due to population stratification, because if they were, then the sibling estimates would not line up with the GWAS estimates at all. In fact, we found that the within-family estimates are more similar to the GWAS estimates in both sign and magnitude than would be expected by chance. These results imply that our GWAS results are not entirely due to population stratification. A second analysis, known as an “LD score regression intercept” analysis (Bulik-Sullivan et al. 2015), indicated that there is some, but not much, population stratification in our GWAS results.
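
The logic of comparing within-family and GWAS estimates can be illustrated with a simple sign test. The Python sketch below counts how many variants have estimates with the same sign in both analyses and asks how likely that many matches would be if signs agreed only by chance (probability 1/2 each). The effect estimates are hypothetical numbers chosen for illustration, not values from the study, and the sketch covers only the sign comparison, not the comparison of magnitudes.

```python
from math import comb

def sign_concordance(gwas_effects, sibling_effects):
    """Number of variants whose two estimates share the same sign."""
    return sum((a > 0) == (b > 0)
               for a, b in zip(gwas_effects, sibling_effects))

def one_sided_binomial_p(successes, trials):
    """P(at least `successes` matches in `trials`) if each sign matched
    by chance with probability 1/2."""
    return sum(comb(trials, k)
               for k in range(successes, trials + 1)) / 2 ** trials

# Hypothetical effect estimates for ten variants (not values from the study).
gwas = [0.021, -0.018, 0.015, 0.030, -0.025, 0.012, -0.020, 0.017, 0.022, -0.014]
sibs = [0.010, -0.030, 0.002, 0.041, 0.008, 0.020, -0.005, -0.011, 0.015, -0.022]

matches = sign_concordance(gwas, sibs)
print(matches, "of", len(gwas), "signs agree")
print("one-sided p-value:", round(one_sided_binomial_p(matches, len(gwas)), 4))
```

With only ten variants, even eight matching signs is weak evidence; with many more variants, the same share of matches becomes very unlikely under chance alone.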

Appendix 2:  Additional reading and references

  1. Barban N, Jansen R, de Vlaming R, Vaez A, Mandemakers JJ, et al. 2016. Genome-wide analysis identifies 12 loci influencing human reproductive behavior. Nat. Genet. 48(12):1462–72

  2. Beauchamp JP, Cesarini D, Johannesson M. 2017. The psychometric and empirical properties of measures of risk preferences. J. Risk Uncertain. 54(3):203–37

  3. Beauchamp JP, Cesarini D, Johannesson M, van der Loos MJHM, Koellinger PD, et al. 2011. Molecular genetics and economics. J. Econ. Perspect. 25(4):57–82

  4. Benjamin DJ, Cesarini D, Chabris CF, Glaeser EL, Laibson DI, et al. 2012. The promises and pitfalls of genoeconomics. Annu. Rev. Econom. 4(1):627–62

  5. Bulik-Sullivan BK, Loh P-R, Finucane HK, Ripke S, Yang J, et al. 2015. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47(3):291–95

  6. Cesarini D, Dawes CT, Johannesson M, Lichtenstein P, Wallace B. 2009. Genetic variation in preferences for giving and risk taking. Q. J. Econ. 124(2):809–42

  7. Chabris CF, Hebert BM, Benjamin DJ, Beauchamp JP, Cesarini D, et al. 2012. Most reported genetic associations with general intelligence are probably false positives. Psychol. Sci. 23(11):1314–23

  8. Chabris CF, Lee JJ, Cesarini D, Benjamin DJ, Laibson DI. 2015. The fourth law of behavior genetics. Curr. Dir. Psychol. Sci. 24(4):304–12

  9. Day FR, Helgason H, Chasman DI, Rose LM, Loh P-R, et al. 2016. Physical and neurobehavioral determinants of reproductive onset and success. Nat. Genet. 48(6):617–23

  10. Dohmen T, Falk A, Huffman D, Sunde U, Schupp J, Wagner GG. 2011. Individual risk attitudes: Measurement, determinants, and behavioral consequences. J. Eur. Econ. Assoc. 9(3):522–50

  11. Falk A, Dohmen T, Falk A, Huffman D. 2015. The nature and predictive power of preferences: Global evidence. IZA Discussion Papers.

  12. Hamer DH, Sirota L. 2000. Beware the chopsticks gene. Mol. Psychiatry. 5(1):11–13

  13. Harden KP, Kretsch N, Mann FD, Herzhoff K, Tackett JL, et al. 2017. Beyond dual systems: A genetically-informed, latent factor model of behavioral and self-report measures related to adolescent risk-taking. Dev. Cogn. Neurosci. 25:221–34

  14. Hewitt JK. 2012. Editorial policy on candidate gene association and candidate gene-by-environment interaction studies of complex traits. Behav. Genet. 42(1):1–2

  15. Lambert J-C, Ibrahim-Verbaas CA, Harold D, Naj AC, Sims R, et al. 2013. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease. Nat. Genet. 45(12):1452–58

  16. Lee J, Wedow R, Okbay A, Kong E, Maghzian O, et al. 2018. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat. Genet. 50:1112–21

  17. Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, et al. 2015. Genetic studies of body mass index yield new insights for obesity biology. Nature. 518(7538):197–206

  18. Nuffield Council on Bioethics. 2002. Genetics and human behaviour: the ethical context. Nuffield Council on Bioethics [http://nuffieldbioethics.org/wp-content/uploads/2014/07/Genetics-and-human-behaviour.pdf], London

  19. Okbay A, Baselmans BML, De Neve J-E, Turley P, Nivard MG, et al. 2016a. Genetic variants associated with subjective well-being, depressive symptoms, and neuroticism identified through genome-wide analyses. Nat. Genet. 48(6):624–33

  20. Okbay A, Beauchamp JP, Fontana MA, Lee JJ, Pers TH, et al. 2016b. Genome-wide association study identifies 74 loci associated with educational attainment. Nature. 533:539–42

  21. Rietveld CA, Cesarini D, Benjamin DJ, Koellinger PD, De Neve J-E, et al. 2013a. Molecular genetics and subjective well-being. Proc. Natl. Acad. Sci. 110(24):9692–97

  22. Rietveld CA, Conley DC, Eriksson N, Esko T, Medland SE, et al. 2014a. Replicability and robustness of GWAS for behavioral traits. Psychol. Sci. 25(11):1975–86

  23. Rietveld CA, Esko TT, Davies G, Pers TH, Turley PA, et al. 2014b. Common genetic variants associated with cognitive performance identified using the proxy-phenotype method. Proc. Natl. Acad. Sci. U. S. A. 111(38):13790–94

  24. Rietveld CA, Medland SE, Derringer J, Yang J, Esko T, et al. 2013b. GWAS of 126,559 individuals identifies genetic variants associated with educational attainment. Science. 340(6139):1467–71

  25. Ripke S, Neale BM, Corvin A, Walters JTR, Farh K-H, et al. 2014. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 511(7510):421–27

  26. Strawbridge RJ, Ward J, Cullen B, Tunbridge EM, Hartz S, et al. 2018. Genome-wide analysis of self-reported risk-taking behaviour and cross-disorder genetic correlations in the UK Biobank cohort. Transl. Psychiatry. 8(1):1–11

  27. Turley P, Walters RK, Maghzian O, Okbay A, Lee JJ, et al. 2018. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet. 50(2):229–37

  28. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, et al. 2017a. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 101(1):5–22

  29. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, et al. 2017b. 10 years of GWAS discovery: Biology, function, and translation. Am. J. Hum. Genet. 101(1):5–22

  30. Winkler TW, Day FR, Croteau-Chonka DC, Wood AR, Locke AE, et al. 2014. Quality control and conduct of genome-wide association meta-analyses. Nat. Protoc. 9(5):1192–1212

  31. Wood AR, Esko T, Yang J, Vedantam S, Pers TH, et al. 2014. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet. 46(11):1173–8


FAQs about “Gene discovery and polygenic prediction from a 1.1-million-person GWAS of educational attainment”

As sample sizes for GWAS continue to grow, it will likely be possible to construct a polygenic score for educational attainment whose predictive power comes closer to 20% of the variance in educational attainment across individuals (Rietveld et al. 2013). Even this level of predictive power would pale in comparison to some other scientific predictors. For example, professional weather forecasts correctly predict about 95% of the variation in day-to-day temperatures. Weather forecasters are therefore vastly more accurate forecasters than social science geneticists will ever be.

Note: The results of SSGAC studies have sometimes been used in other projects to predict individual traits. We recognize that returning individual genomic “results” can be a fun way to engage people in research and other projects and to stoke their interest in, and educate them about, genomics. But it is important that participants/users understand that these individual results are not meaningful predictions and should be regarded essentially as entertainment. Failure to make this point clear risks sowing confusion and undermining trust in genetics research.

3.5.  Can your polygenic score be used for research studies in non-European-ancestry populations?

Only in a limited way. As a practical matter, it is possible to calculate a polygenic score for any individual for whom genome-wide data is available, but the polygenic score will be much less “predictive” (see FAQ 1.4) in non-European-ancestry populations.

Our study was conducted only using samples of individuals of European ancestries (see Appendix 1). The set of SNPs that are associated with educational attainment in people of European ancestries is unlikely to overlap perfectly with the set of SNPs associated with EA in people of non-European ancestries. And even if a given SNP is associated in both ancestry groups, the effect size—in other words, the strength of the association—will almost surely differ. This is primarily because linkage disequilibrium (LD) patterns (i.e., the correlation structure of the genome) vary by ancestry. This means that some variant may be associated with educational attainment because the variant is in LD (i.e., correlated) with a variant elsewhere in the genome that causally affects education (see FAQ 1.3). If the strength of the correlation is greater in one ancestry group than in another, then the size of the association will be larger in that ancestry group. Moreover, even if LD patterns were similar in each ancestry group, the association may differ in different groups because environmental conditions differ (see FAQ 2.5). The fact that there are differences across ancestry groups in the set of associated SNPs and their effect sizes has two important implications.
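
The role of LD can be shown with a small worked example. In the simplest case, where a measured SNP is correlated with a single causal variant and both genotypes are standardized, the expected association at the measured SNP equals the causal effect multiplied by the LD correlation. The Python sketch below applies this to two ancestry groups; the effect size and correlations are hypothetical, and the function name is ours.

```python
def tag_snp_association(causal_effect, ld_correlation):
    """Expected association at a non-causal SNP that is correlated with a
    single causal variant (both genotypes standardized)."""
    return ld_correlation * causal_effect

causal_effect = 0.02  # hypothetical effect, in trait SD per genotype SD
for label, r in [("ancestry group A", 0.9), ("ancestry group B", 0.3)]:
    print(label, round(tag_snp_association(causal_effect, r), 4))
```

The same causal effect therefore shows up as an association three times larger in the group where the LD correlation is 0.9 than in the group where it is 0.3.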

First, it means that polygenic scores of individuals from different ancestry groups cannot be meaningfully compared. A recent paper (Martin et al. 2017) illustrated this point in the context of polygenic scores for predicting height; in the sample analyzed in that paper, polygenic scores for height for individuals of European ancestries are on average larger than those of South Asian ancestries which in turn are larger than those of African ancestries. In actuality, however, populations of African ancestries represented by the sample have similar height to populations of European ancestries, and both African and European populations tend to be taller than South Asian populations.

Second, while polygenic scores can be used to predict differences across individuals within a sample of people of non-European ancestries, the amount of predictive power will be much smaller than in a sample of people of European ancestries. Such an attenuation of predictive power has been repeatedly found in prior work (Domingue et al. 2015; Vassos et al. 2017; Domingue et al. 2017; Belsky et al. 2013). Unfortunately, this attenuation means that for non-European-ancestry populations, many of the benefits of having a polygenic score available will have to wait until large GWAS are conducted using samples from these populations. (Currently, most large genotyped samples are of European ancestries.)

For a more extensive, excellent discussion of these and related issues, see Graham Coop’s blog post “Polygenic scores and tea drinking”: https://gcbias.org/2018/03/14/polygenic-scores-and-tea-drinking/.

For more on population stratification, see FAQs 1.3 & 2.4 and Appendix 1.

3.6.  What policy lessons do you draw from this study?

None whatsoever. Any practical response—individual or policy-level—to this or similar research would be extremely premature and unsupported by the science. Much more research is still needed to understand why the genetic variants we identified are associated with educational attainment. In this respect, our study is no different from GWAS of complex medical outcomes. In medical GWAS research, it is well understood that identifying genetic variants that “predict” (see FAQ 1.4) disease risk is merely a first step toward understanding the underlying biology. It is not sufficient to assess risk for any specific individual. It is not appropriate to base policies and practices on such assessments. However, the results of our study may be useful to social scientists (e.g., by allowing them to construct polygenic scores that can be used as control variables in randomized controlled trials or in studies of gene-by-environment interactions, see FAQ 1.6).

3.7.  Could this kind of research lead to discrimination against, or stigmatization of, people with the relevant genetic variants? If so, why conduct this research?

Unfortunately, like a great deal of research—including, for instance, research identifying genomic variants associated with increased cancer risk—the results can be misunderstood and misapplied. This includes being used to discriminate against those with the variants in question (e.g., in insurance markets). Nevertheless, for a variety of reasons, in this instance, we do not think that the best response to the possibility that useful knowledge might be misused is to refrain from producing the knowledge. Here, we briefly discuss some of the broad potential benefits of this research. We then describe what we take to be our ethical obligation as researchers conducting this work.

First, one benefit of conducting social science genetics research in ever larger samples is that doing so allows us to correct the scientific record. An important theme in our earlier work has been to point out that most existing studies in social-science genetics that report genetic associations with behavioral traits have serious methodological limitations, fail to replicate, and are likely to be false-positive findings (Benjamin et al. 2012; Chabris et al. 2012; Chabris et al. 2015). This same point was made in an editorial in Behavior Genetics (the leading journal for the genetics of behavioral traits), which stated that “it now seems likely that many of the published [behavior genetics] findings of the last decade are wrong or misleading and have not contributed to real advances in knowledge” (Hewitt 2012). One of the most important reasons why earlier work has generated unreliable results is that the sample sizes were far too small, given that the true effects of individual genetic variants on behavioral traits are tiny. Pre-existing claims of genetic associations with complex social-science outcomes have reported widely varying effect sizes, many of them purporting to “predict” (see FAQ 1.4) ten to one hundred times as much of the variation across individuals as did the genetic variants we found in this study and in our other studies.

Second, behavioral genetics research also has the potential to correct the social record and thereby to help combat discrimination and stigmatization. For instance, at various times and places throughout human history (unfortunately, including the present day), girls and women have been discouraged or even prevented from pursuing as much education as their male counterparts. There are of course many reasons why that argument has been made and sometimes prevailed, but to the extent that it is rooted in a belief in genetically-based variance differences between males and females, our study’s analysis of the X chromosome finds no such evidence (see FAQ 2.6). Similarly, overestimating the role of genetics can be damaging, and the present work can help debunk this myth, too. Of the 20% of the variance in educational attainment that is related to the additive effects of common genetic variants, we have found that the relationship to educational attainment depends importantly on environmental factors (see FAQ 2.5). By clarifying the limits of deterministic views of complex traits, recent behavioral genetics research, if communicated responsibly, could make appeals to genetic justifications for discrimination and stigmatization less persuasive to the public in the future.

Third, behavioral genetics research has the potential to yield many other benefits, especially as sample sizes continue to increase—as briefly summarized in FAQ 1.6. Foregoing this research necessarily entails foregoing these and any other possible benefits, some of which will likely be the result of serendipity rather than being foreseeable. For instance, as explained in FAQ 2.9, because educational attainment is measured in far larger genotyped samples than brain function, large-scale GWAS of educational attainment have provided better insights into brain function than GWASs to date that directly examine brain function, since the latter have necessarily been conducted in much smaller samples.

In sum, we agree with the U.K. Nuffield Council on Bioethics, which concluded in a report (Nuffield Council on Bioethics 2002, p.114) that “research in behavioural genetics has the potential to advance our understanding of human behaviour and that the research can therefore be justified,” but that “researchers and those who report research have a duty to communicate findings in a responsible manner.” In our view, responsible behavioral genetics research includes sound methodology and analysis of data; a commitment to publish all results, including any negative results; and transparent, complete reporting of methodology and findings in publications, presentations, and communications with the media and the public, including particular vigilance regarding what the results do—and do not—show (hence, this FAQ document).

4. Appendices

Appendix 1:  Quality Control Measures

There are many potential pitfalls that can lead to spurious results in genome-wide association studies (GWAS). We took many precautions to guard against these pitfalls.

One potential source of spurious results is incomplete “quality control (QC)” of the genetic data. To avoid this problem, we used state-of-the-art QC protocols from medical genetics research (Winkler et al. 2014). We supplemented these protocols by developing and applying additional, more stringent QC filters.

Another potential source of spurious results is a confound known as “population stratification.” To give a well-known illustration, suppose we were conducting a GWAS on the use of chopsticks (Lander & Schork 1994). People of Asian ancestries are far more likely to use chopsticks than people of European ancestries. If we combined samples of Chinese and European ancestries and performed a GWAS that ignores ancestry, then we would find that any genetic variant that is more common in one of the two ancestry groups is associated with chopstick use. However, those associations would simply reflect the fact that allele frequencies vary across ancestry groups.

In our study we were extremely careful to correct for population stratification as much as possible. At the outset, we restricted the study to individuals of European ancestries. As is standard in GWAS, we also controlled for “principal components” of the genetic data in the analysis; these principal components capture the small genetic differences across ancestry groups within European populations, so controlling for them largely removes the spurious associations arising solely from these small differences.
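The logic of the chopsticks example can be illustrated with a small simulation. The sketch below is purely illustrative and is not the analysis code used in the study: the two groups, the allele frequencies (0.2 and 0.8), and the function names are hypothetical, and a known group label stands in for the principal components that a real GWAS would estimate from the genetic data.

```python
import numpy as np

def simulate_stratification(n_per_group=5000, seed=0):
    """Simulate a variant with NO effect on the outcome in two ancestry groups."""
    rng = np.random.default_rng(seed)
    group = np.repeat([0, 1], n_per_group)
    # Hypothetical allele frequencies that differ across the two groups.
    freq = np.where(group == 0, 0.2, 0.8)
    genotype = rng.binomial(2, freq)  # allele count: 0, 1, or 2
    # The outcome depends only on group membership, never on the genotype.
    outcome = 1.0 * group + rng.normal(0.0, 1.0, size=group.size)
    return group, genotype, outcome

def slope(y, x, covariates=None):
    """Least-squares coefficient on x, optionally controlling for covariates."""
    cols = [np.ones_like(x, dtype=float), x.astype(float)]
    if covariates is not None:
        cols.append(covariates.astype(float))
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

group, genotype, outcome = simulate_stratification()
naive = slope(outcome, genotype)                        # ignores ancestry
adjusted = slope(outcome, genotype, covariates=group)   # controls for ancestry

print(f"naive association:    {naive:.3f}")
print(f"adjusted association: {adjusted:.3f}")
```

In this toy setting the naive association is clearly positive even though the variant has no effect, and it shrinks to approximately zero once group membership is included as a control.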

After taking these steps to minimize bias stemming from population stratification, we conducted a number of analyses to assess how much population stratification still remained in our data. The results of these tests indicate that there is some, but not much.

For one such analysis, we used a subset of the individuals in our data, ~22,000 sibling pairs (from five of the datasets that contributed to our study). The key idea underlying our tests is to examine if differences in genetic variants across siblings are associated with differences in the siblings’ educational attainment. If so, then these associations cannot easily be attributed to bias in the estimates of the original studies, which compared individuals from different families. When comparing individuals who have different parents, genetic differences across individuals may be confounded with environmental differences associated with the parents’ genetic variants (including the parents’ ancestries, as discussed above). By contrast, full siblings share the same genetic parents, and genetic differences between siblings are random. Unfortunately, because our sample of siblings (~44,000 individuals) is much smaller than our overall GWAS sample (~1.1 million individuals), our estimates of the effects of the genetic variants within the sibling pairs are much noisier than in the GWAS. However, we can test whether the GWAS results are entirely due to population stratification, because if they were, then the sibling estimates would not line up with the GWAS estimates. In fact, we find that the within-family estimates are more similar to the GWAS estimates in both sign and magnitude than would be expected by chance. These results imply that our GWAS results are not solely due to population stratification. However, we also found that within-family estimates are substantially smaller than the GWAS estimates, as we discuss in FAQ 2.4.
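A toy simulation can show why comparing siblings removes confounding that operates between families. This is a simplified sketch under stated assumptions, not the study's method: the direct effect (0.05), the family-level allele frequencies, and the way the family environment is tied to them are all hypothetical, and sibling genotypes are drawn independently given a family frequency rather than by modeling inheritance from two parents.

```python
import numpy as np

rng = np.random.default_rng(1)
n_families = 20_000
direct_effect = 0.05  # hypothetical direct effect per allele

# Hypothetical family-level allele frequency, standing in for parental genotypes.
family_freq = rng.uniform(0.2, 0.8, size=n_families)
# The family environment is correlated with the parents' genetics (the confound).
family_env = family_freq + rng.normal(0.0, 1.0, size=n_families)

# Two siblings per family; genotypes are drawn independently given the family.
g1 = rng.binomial(2, family_freq)
g2 = rng.binomial(2, family_freq)
y1 = direct_effect * g1 + family_env + rng.normal(0.0, 1.0, size=n_families)
y2 = direct_effect * g2 + family_env + rng.normal(0.0, 1.0, size=n_families)

# Population (between-family) estimate: pool all individuals.
g_all = np.concatenate([g1, g2]).astype(float)
y_all = np.concatenate([y1, y2])
population_estimate = np.cov(g_all, y_all)[0, 1] / np.var(g_all, ddof=1)

# Within-family estimate: regress sibling outcome differences on
# sibling genotype differences; the shared family environment cancels out.
dg = (g1 - g2).astype(float)
dy = y1 - y2
within_estimate = np.sum(dg * dy) / np.sum(dg * dg)

print(f"population estimate:    {population_estimate:.3f}")
print(f"within-family estimate: {within_estimate:.3f}")
```

Under these assumptions the population estimate is inflated by the family environment, while the within-family estimate stays close to the simulated direct effect, though it is noisier because it uses only differences between siblings.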


As another analysis to assess how much population stratification still remained in our data after our efforts to minimize it, we applied a state-of-the-art method from statistical genetics called LD Score regression (Bulik-Sullivan et al. 2015). The results of this analysis indicated that the biases in our results due to population stratification are small.

Appendix 2:  Additional reading and references

  1. Amos, C.I. et al., 2008. Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1. Nature Genetics, 40, pp.616–622.

  2. Anderson, E.L. et al., 2017. The causal effect of educational attainment on Alzheimer’s disease: A two-sample Mendelian randomization study. bioRxiv [https://doi.org/10.1101/127993].

  3. Bansal, V. et al., 2017. Genetics of educational attainment aid in identifying biological subcategories of schizophrenia. bioRxiv [https://doi.org/10.1101/114405].

  4. Barban, N. et al., 2016. Genome-wide analysis identifies 12 loci influencing human reproductive behavior. Nature Genetics, 48(12), pp.1462–1472.

  5. Barcellos, S.H., Carvalho, L.S. & Turley, P., 2018. Education can Reduce Health Disparities Related to Genetic Risk of Obesity: Evidence from a British Reform. bioRxiv [https://doi.org/10.1101/260463].

  6. Belsky, D.W. et al., 2013. Development and evaluation of a genetic risk score for obesity. Biodemography and Social Biology, 59(1), pp.85–100.

  7. Belsky, D.W. et al., 2016. The Genetics of Success. Psychological Science, 27(7), pp.957–972.

  8. Benjamin, D.J. et al., 2012. The Promises and Pitfalls of Genoeconomics. Annual Review of Economics, 4(1), pp.627–662.

  9. Branigan, A.R. et al., 2013. Variation in the Heritability of Educational Attainment: An International Meta-Analysis. Social Forces, 92(1), pp.109–140.

  10. Bulik-Sullivan, B.K. et al., 2015. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature Genetics, 47(3), pp.291–295.

  11. Cesarini, D. & Visscher, P.M., 2017. Genetics and educational attainment. npj Science of Learning, 2(1), p.4.

  12. Chabris, C.F. et al., 2012. Most reported genetic associations with general intelligence are probably false positives. Psychological Science, 23(11), pp.1314–1323.

  13. Chabris, C.F. et al., 2015. The fourth law of behavior genetics. Current Directions in Psychological Science, 24(4), pp.304–312.

  14. Cutler, D.M. & Lleras-Muney, A., 2008. Education and Health: Evaluating Theories and Evidence. In J. House et al., eds. Making Americans Healthier: Social and Economic Policy as Health Policy. New York: Russell Sage Foundation.

  15. Davies, N.M. et al., 2018. The causal effects of education on health outcomes in the UK Biobank. Nature Human Behaviour.

  16. Domingue, B.W. et al., 2017. Mortality selection in a genetic sample and implications for association studies. International Journal of Epidemiology, 46(4), pp.1285–1294.

  17. Domingue, B.W. et al., 2015. Polygenic Influence on Educational Attainment: New evidence from The National Longitudinal Study of Adolescent to Adult Health. AERA Open, 1(3), pp.1–13.

  18. Editors, N., 2013. Dangerous work. Nature, 502(7469), pp.5–6.

  19. Goldberger, A.S., 1979. Heritability. Economica, 46(184), pp.327–347.

  20. Heath, A.C. et al., 1985. Education policy and the heritability of educational attainment. Nature, 314(6013), pp.734–736.

  21. Hewitt, J.K., 2012. Editorial policy on candidate gene association and candidate gene-by-environment interaction studies of complex traits. Behavior Genetics, 42(1), pp.1–2.

  22. Hung, R.J. et al., 2008. A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit genes on 15q25. Nature.

  23. Jencks, C., 1980. Heredity, environment, and public policy reconsidered. American Sociological Review, 45(5), pp.723–736.

  24. van Kippersluis, H. & Rietveld, C.A., 2017. Pleiotropy-robust Mendelian randomization. International Journal of Epidemiology, pp.1–10.

  25. Koellinger, P.D. & Harden, K.P., 2018. Using nature to understand nurture: Genetic associations show how parenting matters for children’s education. Science, 359(6374), pp.386–387.

  26. Kong, A. et al., 2018. The nature of nurture: Effects of parental genotypes. Science, 359(6374), pp.424–428.

  27. Lambert, J.-C. et al., 2013. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease. Nature Genetics, 45(12), pp.1452–1458.

  28. Lander, E.S. & Schork, N.J., 1994. Genetic dissection of complex traits. Science, 265, pp.2037–48.

  29. Linnér, R.K. et al., 2017. An epigenome-wide association study meta-analysis of educational attainment. Nature Publishing Group.

  30. Locke, A.E. et al., 2015. Genetic studies of body mass index yield new insights for obesity biology. Nature, 518(7538), pp.197–206.

  31. Marioni, R.E. et al., 2016. Genetic variants linked to education predict longevity. Proceedings of the National Academy of Sciences, 113(47), pp.13366–13371.

  32. Martin, A.R. et al., 2017. Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. American Journal of Human Genetics, 100(4), pp.635–649.

  33. Nuffield Council on Bioethics, 2002. Genetics and human behaviour: the ethical context, London: Nuffield Council on Bioethics [http://nuffieldbioethics.org/wp-content/uploads/2014/07/Genetics-and-human-behaviour.pdf].

  34. Okbay, A., Baselmans, B.M.L., et al., 2016. Genetic variants associated with subjective well-being, depressive symptoms, and neuroticism identified through genome-wide analyses. Nature Genetics, 48(6), pp.624–633.

  35. Okbay, A., Beauchamp, J.P., et al., 2016. Genome-wide association study identifies 74 loci associated with educational attainment. Nature, 533(7604), pp.539–542.

  36. Parens, E. & Appelbaum, P.S., 2015. An introduction to thinking about trustworthy research into the genetics of intelligence. Hastings Center Report, 45(S1), pp.S2–S8.

  37. Pickrell, J.K. et al., 2016. Detection and interpretation of shared genetic influences on 42 human traits. Nature Genetics, 48(7), pp.709–717.

  38. Rietveld, C.A. et al., 2013. GWAS of 126,559 individuals identifies genetic variants associated with educational attainment. Science, 340(6139), pp.1467–1471.

  39. Ripke, S. et al., 2014. Biological insights from 108 schizophrenia-associated genetic loci. Nature, 511(7510), pp.421–427.

  40. Ross, C.E. & Wu, C., 1995. The links between education and health. American Sociological Review, 60(5), pp.719–745.

  41. Sacerdote, B., 2007. How Large are the Effects from Changes in Family Environment? A Study of Korean American Adoptees. The Quarterly Journal of Economics, 122(1), pp.119–157.

  42. Sacerdote, B., 2011. Nature and Nurture Effects On Children’s Outcomes: What Have We Learned From Studies of Twins And Adoptees? In J. Benhabib, A. Bisin, & M. O. Jackson, eds. Handbook of Social Economics. Elsevier/North-Holland, pp. 1–29.

  43. Savage, J.E. et al., 2017. GWAS meta-analysis (N=279,930) identifies new genes and functional links to intelligence. bioRxiv [https://doi.org/10.1101/184853].

  44. Schmitz, L.L. & Conley, D., 2017. The effect of Vietnam-era conscription and genetic potential for educational attainment on schooling outcomes. Economics of Education Review, 61, pp.85–97.

  45. Silventoinen, K. et al., 2004. Heritability of body height and educational attainment in an international context: comparison of adult twins in Minnesota and Finland. American Journal of Human Biology, 16(5), pp.544–555.

  46. Sniekers, S. et al., 2017. Genome-wide association meta-analysis of 78,308 individuals identifies new loci and genes influencing human intelligence. Nature Genetics, 49(7), pp.1107–1112.

  47. Thorgeirsson, T.E. et al., 2008. A variant associated with nicotine dependence, lung cancer and peripheral arterial disease. Nature, 452(7187), pp.638–642.

  48. Tillmann, T. et al., 2017. Education and coronary heart disease: Mendelian randomisation study. BMJ (Online).

  49. Trampush, J.W. et al., 2017. GWAS meta-analysis reveals novel loci and genetic correlates for general cognitive function: A report from the COGENT consortium. Molecular Psychiatry, 22(3), pp.336–345.

  50. Turkheimer, E., 2000. Three laws of behavior genetics and what they mean. Current Directions in Psychological Science, 9(5), pp.160–164.

  51. Turley, P. et al., 2018. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nature Genetics, 50(2), pp.229–237.

  52. Vassos, E. et al., 2017. An Examination of Polygenic Score Risk Prediction in Individuals With First-Episode Psychosis. Biological Psychiatry, 81(6), pp.470–477.

  53. Visscher, P.M. et al., 2017. 10 Years of GWAS Discovery: Biology, Function, and Translation. The American Journal of Human Genetics, 101(1), pp.5–22.

  54. Warrier, V. et al., 2016. Genetic overlap between educational attainment, schizophrenia and autism. bioRxiv [https://doi.org/10.1101/093575].

  55. Winkler, T.W. et al., 2014. Quality control and conduct of genome-wide association meta-analyses. Nature Protocols, 9(5), pp.1192–1212.

  56. Wood, A.R. et al., 2014. Defining the role of common variation in the genomic and biological architecture of adult human height. Nature Genetics, 46(11), pp.1173–1186.

Use the quick link menu to jump to a specific question, or scroll down to read all FAQs for this publication. 

 

This document provides information about the study:

 

Lee et al. (2018) “Gene discovery and polygenic prediction from a 1.1-million-person GWAS of educational attainment.” Nature Genetics, in press.

The document was prepared by several of the study’s coauthors and draws from and builds on the FAQs for earlier SSGAC papers. It has the following sections:

          1. Background

          2. Study design and results

          3. Social and ethical implications of the study

          4. Appendices

 

For clarifications or additional questions, please contact Daniel Benjamin (djbenjam@usc.edu).

 

Quick Links

1.1.  Who conducted this study? What are the group’s overarching goals?

1.2.   The current study focuses on an outcome called “educational attainment.” What is educational attainment?

1.3.  What is a GWAS? Are the genetic variants identified in a GWAS “causal”?

1.4.  In what sense do the genetic variants identified in a GWAS “predict” the outcome of interest? What do you mean by “effect size”?

1.5.  What is a polygenic score?

1.6.  Why conduct a GWAS of educational attainment?

1.7.  What was already known about genetic associations with educational attainment prior to this study?

2.1.  What did you do in this paper? How was the study designed? Why was the study designed in this way?

2.2.  What did you find in the GWAS of educational attainment?

2.3.  How predictive is the polygenic score developed in this study?

2.4.  What did you find in the analysis of siblings?

2.5.  What did you find in the analysis of environmental heterogeneity?

2.6.  What did you find in the analysis of the X chromosome?

2.7.  What did you find in the analysis of cognitive performance and math abilities?

2.8.  Are the genetic variants associated with higher educational attainment in your study also associated with other outcomes?

2.9.  What do your results tell us about human biology and brain development?

3.1.  Did you find “the gene for” educational attainment?

3.2.  Well, then, did you find “the genes for” educational attainment?

3.3.  Does this study show that an individual’s level of educational attainment is determined, or fixed, at conception?

3.4.  Can the polygenic score from this paper be used to accurately predict a particular person’s educational attainment?

3.5.  Can your polygenic score be used for research studies in non-European-ancestry populations?

3.6.  What policy lessons do you draw from this study?

3.7.  Could this kind of research lead to discrimination against, or stigmatization of, people with the relevant genetic variants? If so, why conduct this research?

Appendix 1:  Quality control measures

Appendix 2:  Additional reading and references

1. Background

1.1.  Who conducted this study? What are the group’s overarching goals?

 

The authors of the study are members of the Social Science Genetic Association Consortium (SSGAC). The SSGAC is a multi-institutional, international research group that aims to identify statistically robust links between genetic variants and social-science-relevant traits. These include traits such as behavior, preferences, and personality that are traditionally studied by social and behavioral scientists (e.g., economists, psychologists, sociologists) but are often also of interest to health and other researchers.

The SSGAC was formed in 2011 to overcome a specific set of scientific challenges. Most traits and behaviors are associated with thousands of genetic variants. Although their collective effect can be substantial (see FAQs 1.5 & 2.3), we now know that almost every one of these genetic variants has an extremely weak effect on its own. To identify specific variants with such small effects, scientists must study at least hundreds of thousands of people (to separate weak signals from noise). One promising strategy for doing this is for many investigators to pool their data into one large study. This approach has borne considerable fruit when used by medical geneticists interested in a range of diseases and conditions (Visscher et al. 2017). Most of these advances would not have been possible without large research collaborations between multiple research groups interested in similar questions. The SSGAC was formed in an attempt by social scientists to adopt this research model.

The SSGAC is organized as a working group of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), a successful medical consortium. It was founded by three social scientists—Daniel Benjamin (University of Southern California), David Cesarini (New York University), and Philipp Koellinger (Vrije Universiteit Amsterdam)—who believe that studying genetic variants associated with social scientific outcomes can have substantial positive impacts across many research fields. This includes research that aims to better understand the effects of the environment (e.g., research on policy interventions, including the effects of different school environments) and interactions between genetic and environmental effects. The potential benefits also span a diverse set of research questions in the biomedical sciences, such as why and how educational attainment is linked to longevity and better overall health outcomes.

To conduct such research, the SSGAC implements genome-wide association studies (GWAS, see FAQ 1.3) of social-scientific outcomes. For example, to conduct a GWAS of educational attainment, every participating cohort estimates the (within-cohort) statistical association between educational attainment and a single-nucleotide polymorphism (SNP) in the genomes of the individuals in the cohort. A SNP is a base pair of the genome where there is common variation in the human population (see FAQ 1.3). This statistical analysis is repeated for each SNP in the genome, and the cohort uploads the results. The cohort-level results do not contain individual-level data, only summary statistics about these within-cohort statistical associations. The SSGAC then combines these cohort results to produce the overall GWAS results. By using existing datasets and combining cohort-level results, we can study the genetics of ~1.1 million people at very low cost. The SSGAC publicly shares the overall, aggregated results at www.thessgac.org/data so that other scientists can build on this work. These publicly available data have already catalyzed many research projects and analyses across the social and biomedical sciences (see FAQ 1.6 for examples).
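To make the idea of combining cohort-level summary statistics concrete, here is a minimal sketch for a single SNP. It assumes a fixed-effects, inverse-variance-weighted meta-analysis, which is a common choice in GWAS consortia but is an assumption here rather than a description of the exact procedure used in the study. The function name and the three cohort estimates are hypothetical.

```python
import math

def meta_analyze(cohort_results):
    """Combine per-cohort (beta, standard_error) pairs for one SNP.

    Each cohort is weighted by the inverse of its squared standard error,
    so more precise (typically larger) cohorts count for more.
    """
    weights = [1.0 / se ** 2 for _, se in cohort_results]
    total_weight = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, cohort_results)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    return beta, se, beta / se  # pooled estimate, its standard error, z-statistic

# Hypothetical summary statistics from three cohorts: (beta, standard error).
cohorts = [(0.020, 0.010), (0.010, 0.005), (0.030, 0.020)]
pooled_beta, pooled_se, z = meta_analyze(cohorts)
print(f"pooled beta = {pooled_beta:.4f}, se = {pooled_se:.4f}, z = {z:.2f}")
```

Note that only the summary statistics enter the calculation; no individual-level data are needed, and the pooled standard error is smaller than that of any single cohort.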

The Advisory Board for the SSGAC is composed of prominent researchers representing various disciplines: Dalton Conley (Sociology, Princeton University), George Davey Smith (Epidemiology, University of Bristol), Tõnu Esko (Molecular Biology and Human Genetics, University of Tartu and Estonian Genome Center), Albert Hofman (Epidemiology, Harvard University), Robert Krueger (Psychology, University of Minnesota), David Laibson (Economics, Harvard University), James Lee (Psychology, University of Minnesota), Sarah Medland (Genetic Epidemiology, QIMR Berghofer Medical Research Institute), Michelle Meyer (Bioethics and Law, Geisinger Health System), and Peter Visscher (Statistical Genetics, University of Queensland).

The SSGAC is committed to the principles of reproducibility and transparency. Prior to conducting genetic association studies, power calculations are carried out to determine the necessary sample size for the analysis (assuming realistically small effect sizes associated with individual genetic variants). Whenever possible, we pre-register our analyses at OSF (formerly Open Science Framework). Major SSGAC publications are usually accompanied by a FAQ document (such as this one). The FAQ document is written to communicate what was found less tersely and technically than in the paper, as well as what can and cannot be concluded from the research findings more broadly. FAQ documents produced for SSGAC publications are available at https://www.thessgac.org/faqs.

In addition to educational attainment, SSGAC-affiliated papers have studied subjective well-being, reproductive behavior, and risk tolerance. The SSGAC website contains an up-to-date list of our major publications, which have been published in journals such as Science, Nature, Nature Genetics, Proceedings of the National Academy of Sciences, Psychological Science, and Molecular Psychiatry.

1.2.  The current study focuses on an outcome called “educational attainment.” What is educational attainment?

Educational attainment is the amount of formal education a person completes (measured as the number of years of education completed by the people in our sample, all of whom are at least 30 years old). Although educational attainment is most strongly influenced by social and other environmental factors (see FAQ 1.7), it is also influenced by thousands of genes. People vary considerably in how much education they complete. Education is recognized throughout the social and biomedical sciences as an important “predictor” (see FAQ 1.4) of many other life outcomes, such as income, occupation, health, and longevity (Ross & Wu 1995; Cutler & Lleras-Muney 2008). Educational attainment is also among the relatively few social-scientific traits for which it is feasible to conduct a large-scale genome-wide study, because educational attainment is frequently measured by a variety of cohorts, including medical cohorts, due to its robust association with health. A large-scale study is necessary (but not sufficient) to generate scientific findings that are reproducible.

1.3.  What is a GWAS? Are the genetic variants identified in a GWAS “causal”?

In a genome-wide association study (GWAS), scientists look at genetic variants measured across the entire human genome to see whether any of them are, on average, associated with higher or lower levels of some outcome. Commonly, and in our studies, such analyses focus on the most common genetic variants, so-called single-nucleotide polymorphisms (SNPs). SNPs are sites in the genome where single DNA base pairs commonly differ across individuals. SNPs usually have two different possible base pairs, or alleles. Although there are tens of millions of sites where SNPs are located in the human genome, GWASs typically investigate only SNPs that can be measured (or imputed) with a high level of accuracy. These days, such procedures usually yield millions of SNPs that together capture most common genetic variation across people.

GWAS has been a successful research strategy for identifying genetic variants associated with many traits and diseases, including body height (Wood et al. 2014), BMI (Locke et al. 2015), Alzheimer’s disease (Lambert et al. 2013), and schizophrenia (Ripke et al. 2014). It has also recently been used to identify genetic variants associated with a variety of health-relevant social science outcomes, such as the number of children a person has (Barban et al. 2016), happiness (Okbay, Baselmans, et al. 2016; Turley et al. 2018), and educational attainment (Rietveld et al. 2013; Okbay, Beauchamp, et al. 2016).

GWAS identifies genetic variants that are associated with the outcome, but an observed association with a specific variant need not imply that the variant causes the outcome, for a variety of reasons. First, genetic variants are often highly correlated with other, nearby variants on the same chromosome. As a result, when one or more variants in a region causally influence an outcome (in that particular environment), many non-causal variants in that region may also be identified as associated with the outcome. When GWAS results are analyzed, researchers will often tend to emphasize results for the genetic variant in a region that showed the strongest evidence of association. This variant need not be the causal variant. In fact, the causal genetic variant may not have even been measured directly. For example, GWAS that focus on common SNPs would not be able to identify rare or structural genetic variants (e.g., deletions or insertions of an entire genetic region) that are causal, but they may identify SNPs that are correlated with these unobserved variants.

Second, the frequencies of many genetic variants vary systematically across environments. If those environmental factors are not accounted for in the association analyses, some of the associations found may be spurious. To use a well-known example (Lander & Schork 1994), any genetic variants common in people of Asian ancestries will be associated statistically with chopstick use, but these variants would not cause chopstick use; rather, these genetic variants and the outcome of chopstick use are both distributed unevenly among people with different ancestries. This is the problem of “population stratification” discussed in Appendix 1. GWAS researchers have a number of strategies for addressing the challenges posed by population stratification (see FAQs 2.4 & 3.5 and Appendix 1).

Even in studies such as ours that attempt to address and correct for heterogeneity in genetic ancestry, allele frequencies may nonetheless vary systematically with environmental factors. For example, a genetic variant that is associated with improved educational outcomes in the parental generation may have downstream effects on parental income and other factors known to influence children’s educational outcomes (such as neighborhood characteristics). This same genetic variant is likely to be inherited by the children of these parents, creating a correlation between the presence of the genetic variant in a child’s genome and the extent to which the child was reared in an environment with specific characteristics. A recent study of Icelandic families showed that the parental allele that is not passed on to the parent’s offspring is still associated with the child’s educational attainment, suggesting that GWAS results for educational attainment partly represent these intergenerational pathways (Kong et al. 2018). Our sibling analyses yield results that are consistent with this conclusion (see FAQ 2.4).

Third, variants’ effects on an outcome may be indirect, so a variant that may be “causal” in one environment may have a diminished effect or no effect at all in other environments. For example, the nicotinic acetylcholine receptor gene cluster on chromosome 15 is associated with lung cancer (Thorgeirsson et al. 2008; Amos et al. 2008; Hung et al. 2008). From this observation alone we cannot conclude that these genetic variants cause lung cancer through some direct biological mechanism. In fact, it is likely that these genetic variants increase lung cancer risk through their effects on smoking behavior. In a tobacco-free environment, it is plausible that many of the associations would be substantially weaker and perhaps disappear altogether. Thus, even if we have credible evidence that a specific association is not spurious, it is entirely possible that the genetic variant in question influences the outcome through channels that we, in common parlance, would label environmental (e.g., smoking). Nearly forty years ago, the sociologist Christopher Jencks criticized the widespread tendency to mistakenly treat environmental and genetic sources of variation as mutually exclusive (see also Turkheimer 2000). As the example of smoking illustrates, it is often overly simplistic to assume that “genetic explanations of behavior are likely to be exclusively physical explanations while environmental explanations are likely to be social” (Jencks 1980, p.723).

In general, GWAS is just one step in a longer, often complex process of identifying causal pathways, but the results of a large-scale GWAS are a useful tool for that purpose and often lead to novel and important insights (Visscher et al. 2017). In other words, GWAS results provide important signals as to where scientists should invest future in-depth research to understand why the association exists.

1.4.  In what sense do the genetic variants identified in a GWAS “predict” the outcome of interest? What do you mean by “effect size”?

When we and other scientists say that genetic variants (and other variables, such as demographics) “predict” certain outcomes, our use of the word differs in several important ways from how “predict” is used in standard language (e.g., outside of social science research papers). First, we do not mean that the presence of a genetic variant guarantees an outcome with 100% probability, or even with a high degree of likelihood. Rather, we mean that the variant is, on average across people, statistically associated with an outcome. In other words, on average, people with the genetic variant have a higher likelihood of the outcome compared to people without the genetic variant. A genetic variant is said to be statistically “predictive” of an outcome even if the presence of the genetic variant only very weakly increases the likelihood of that outcome—as is the case, for instance, with every SNP that we identify that is associated with educational attainment.

Second, in standard language, “prediction” usually refers to the future. In contrast, when scientists say that genetic variants “predict” an outcome, they mean that they expect to see the association in new data. “New data” means data that have not been analyzed yet, regardless of whether those data will be collected in the future or have already been collected.

Finally, in standard language, a “prediction” is often an unconditional guess about what will happen. Instead of meaning it unconditionally, scientists mean that they expect to see an association in new data under certain conditions, for example, that the environment for the new data is the same as the environment in which the variants were found in the previously studied data to be associated with the outcome. In the example given in FAQ 1.3, in which a genetic variant is associated with lung cancer due to its effect on smoking, we would not expect the genetic variant to be as strongly predictive of lung cancer in an environment where cigarettes are absent.

We use the term “effect size” as a concise way to refer to the magnitude of the predicted difference in the outcome resulting from having one allele of a genetic variant as opposed to the other possible allele (for example, see FAQ 2.2). The use of the word “effect” is not intended to imply that we believe it is generally appropriate to use the strength of the association between a variant and educational attainment as a measure of the variant’s causal effect on educational attainment (see FAQ 1.3).

1.5.  What is a polygenic score?

The results of a GWAS can be used to create a “polygenic score,” an index composed of many genetic variants from across the genome. Because a polygenic score aggregates the information from many genetic variants, it can “predict” (see FAQ 1.4) far more of the variation among individuals for the GWAS outcome than any single genetic variant. Often, the polygenic scores with the most predictive power are those created using all the (millions of) genetic variants studied in a GWAS. The larger the GWAS sample size, the greater the predictive power (in other, independent samples) of a polygenic score constructed from the GWAS results. More precisely, the GWAS results are used to create a formula for how to construct a polygenic score. Using this formula, a polygenic score can then be constructed for any individual with genome-wide data. Indeed, some of the value of a GWAS is that the polygenic score it produces can be used in subsequent studies conducted in other samples.
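In its simplest form, the "formula" for a polygenic score is a weighted sum: each person's allele count at each variant is multiplied by that variant's GWAS effect-size estimate, and the products are added up. The sketch below illustrates only this basic arithmetic; the weights and genotypes are made up, the function name is hypothetical, and real applications typically adjust the weights (for example, to account for correlation between nearby variants) before summing.

```python
import numpy as np

def polygenic_score(genotypes, weights):
    """Weighted sum of allele counts.

    genotypes: array of shape (n_people, n_variants) with allele counts 0, 1, or 2.
    weights:   array of shape (n_variants,) with per-allele effect-size estimates.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return genotypes @ weights

# Hypothetical GWAS effect-size estimates for three variants.
weights = np.array([0.02, -0.01, 0.015])

# Hypothetical allele counts for three people at those variants.
genotypes = np.array([
    [0, 1, 2],
    [2, 0, 1],
    [1, 1, 1],
])

scores = polygenic_score(genotypes, weights)
print(scores)
```

Because the weights come from the GWAS and the genotypes from the new sample, the same weights can be applied to any individual with genome-wide data, which is why a score can be reused in studies of other samples.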

1.6.  Why conduct a GWAS of educational attainment?

We are motivated to conduct this research because we believe it can be fruitful for the social sciences and health research. In addition to the specific findings of our paper, which are discussed in Section 2 of these FAQs, the results of a GWAS of educational attainment also provide inputs for other research. For example, results from our earlier GWAS of educational attainment (Rietveld et al. 2013; Okbay, Beauchamp, et al. 2016) conducted in much smaller sample sizes (see also FAQ 1.7) have been used to:

  • examine the genetic overlap between educational attainment and ADHD, schizophrenia, Alzheimer’s disease, intellectual disability, cognitive decline in the elderly, brain morphology, and longevity (Pickrell et al. 2016; Warrier et al. 2016; Anderson et al. 2017; Marioni et al. 2016);

  • help us better identify possible genetic subtypes of schizophrenia (Bansal et al. 2017);

  • explore why educational attainment appears to be protective against coronary artery disease (Tillmann et al. 2017) and obesity (van Kippersluis & Rietveld 2017);

  • control for genetic influences in order to generate more credible estimates of how changes in school policy influence health outcomes (Davies et al. 2018);

  • study why specific genetic variants predict educational attainment. For example, it appears that some genetic effects on educational attainment operate through associations with cognitive performance and traits such as self-control (Belsky et al. 2016), which in turn affect educational attainment;

  • study how the effects of genes on education differ across environmental contexts (Schmitz & Conley 2017; Barcellos et al. 2018); and

  • develop new statistical tools that may advance our understanding of how parenting and other features of a child’s rearing environment influence his or her developmental outcomes (Kong et al. 2018; Koellinger & Harden 2018).

These are just some examples of follow-up studies that previous GWASs of educational attainment have already enabled. By making the results of our analyses publicly available at https://www.thessgac.org/data, we hope to facilitate further valuable work by other researchers.

1.7.  What was already known about genetic associations with educational attainment prior to this study?

Educational attainment is strongly influenced by social and other environmental factors. For example, holding all other influences equal, those who live in communities where education (at least beyond a certain level) is relatively expensive are less likely to obtain a high level of educational attainment. Even when education is free or heavily subsidized, full-time education constitutes an opportunity cost that not everyone is equally able to bear: some individuals, due to a variety of family or economic circumstances, will face more pressure than others to leave school and enter the labor force. More generally, educational outcomes are strongly influenced by environmental factors such as social norms, early-life educational experiences, and economic opportunity.

A variety of findings—from twin, family, and GWAS studies—suggest that in affluent countries, genetic factors account for some of the differences across people in their educational attainment (Branigan et al. 2013; Heath et al. 1985; Silventoinen et al. 2004). Studies have found repeatedly that identical twins raised in the same home are substantially more similar to each other in their educational attainment than fraternal twins (or other full siblings) reared together. Full siblings reared together are, in turn, more similar than half siblings reared together who, in turn, are more similar than genetically unrelated siblings (e.g., siblings who are conventionally unrelated, typically because at least one of them is adopted) reared together (Cesarini & Visscher 2017; Sacerdote 2011; Sacerdote 2007). The studies have also provided strong evidence that so-called common environment (the environmental factors shared by siblings raised in the same household) can have long-lasting effects on educational outcomes. In Sweden, the educational outcomes of adopted (i.e., genetically unrelated) brothers reared in the same households are about as similar as the educational outcomes of full siblings reared in separate homes (Cesarini & Visscher 2017). A study of Korean-American adoptees finds that adoptees assigned to households where both parents had college degrees were 16 percentage points more likely to attend college than children assigned to families in which neither parent completed college (Sacerdote 2007).

Research (like the current study) using molecular genetic data, that is, data that measures each person’s DNA and can be used to identify differences between people at the molecular level, has similarly found that common SNPs jointly predict up to 20% of the variation in educational attainment across individuals (Rietveld et al. 2013). This predictive power may derive from many different types of mechanisms. For example, genetic variation may affect neural functions such as memory. Genetic variation may improve sleep quality (making it easier to subsequently stay awake in boring lectures). Genetic variation can affect personality traits, such as the willingness to listen politely to and follow the instructions of teachers (who aren’t always right but nevertheless dictate grades and other outcomes). There may also be even more convoluted pathways. For example, genetic variation can affect one’s sociability, which might draw someone into or drive someone out of the particular social environments that exist in higher education.

In prior GWAS studies, researchers have observed that some genetic variants are associated with educational attainment. In the SSGAC’s first major publication (Rietveld et al. 2013), we conducted a GWAS in a sample of roughly 100,000 people and identified three genetic variants that were statistically associated with educational attainment. In 2016, the SSGAC conducted another GWAS of educational attainment, this time in a sample of around 300,000 people (Okbay, Beauchamp, et al. 2016). We found that 74 genetic variants were associated with educational attainment. These included the three genetic variants identified in our earlier study (Rietveld et al. 2013). Both of these studies involved, at the time they were conducted, the largest sample sizes ever studied for genetic associations with a social science outcome.

There were three key takeaways from the SSGAC’s prior work:

     1. A GWAS approach can identify specific genetic variants statistically associated with behavioral variables if the study is conducted in large enough samples (at least 100,000 people).

     2. Genetic variants that are associated with a behavioral variable such as educational attainment are each likely to have less predictive power (i.e., a smaller effect size) than are genetic variants that are associated with a biomedical or other physical outcome (Chabris et al. 2015). For example, of the hundreds of genetic variants found to be associated with height to date (Wood et al., 2014), the genetic variant with the strongest association predicts 0.4% of the variation across individuals in height, whereas the genetic variant with the strongest association with educational attainment identified to date predicts less than one tenth (<0.04%) as much of the variation in educational attainment (Okbay, Beauchamp, et al. 2016). (The genetic variants that have not yet been identified will very likely explain less variance than those that are currently known, since statistical power is greatest for those that explain the most variance.)

     3. In the samples studied, at least 20% of the variation in educational attainment is predicted by genetic variation (Rietveld et al. 2013), implying that the genetic associations with educational attainment result from the cumulative effects of at least thousands (probably millions) of different genetic variants, not just a few.

These findings from twin, family, and GWAS studies imply that individuals who carry an allele associated with greater educational attainment will on average complete slightly more formal education than other (similarly environmentally situated) individuals who carry a different allele of the same genetic variant. Put in population terms, these findings imply that people with particular alleles will tend on average to complete more formal education, while people who carry other alleles will tend on average to complete less formal education. It is important to emphasize that these associations represent average tendencies in a population. Many individuals with high polygenic scores for educational attainment will not get a college degree, and vice-versa. This makes polygenic scores for educational attainment poor predictors of individual outcomes (see FAQ 3.4), but increasingly useful tools in social science research (see FAQ 2.3).

2. Study Design and Results

2.1.  What did you do in this paper? How was the study designed? Why was the study designed in this way?

We conducted a GWAS (see FAQ 1.3) of educational attainment (see FAQ 1.2) in a sample of over 1.1 million people. The sample size we used in the current study is much larger than that used in previous GWAS of educational attainment (see FAQ 1.7). By constructing a current sample of over 1.1 million, we expected to estimate genetic effects with much greater accuracy than previous studies (with smaller samples) and, thus, to learn much more about the specific genetic variants that are associated with educational attainment.

To construct such a large sample, we combined information from our previous GWAS of roughly 300,000 research participants from 64 datasets (which we refer to as “cohorts”) (Okbay, Beauchamp, et al. 2016) with data that have recently become available from seven additional cohorts. These seven new cohorts include the UK Biobank and the personal genomics company 23andMe, both of which have surveyed and genotyped hundreds of thousands of research participants.

Our study was limited to only the most common type of genetic variant: single-nucleotide polymorphisms (SNPs, see FAQ 1.3). Unlike most other studies, which have analyzed only the autosomes (the non-sex chromosomes), our study also included SNPs on the X chromosome (see FAQ 2.6). In total, our analyses included approximately 10 million SNPs. And, as in other GWASs, our analyses included only individuals of primarily European genetic ancestry. This restriction is needed in order to reduce statistical confounds that otherwise arise from studying populations with diverse genetic ancestries (see the discussion of population stratification in Appendix 1; see also FAQs 1.3, 2.4 & 3.5).

In the remainder of the paper, we used the findings from the GWAS for a range of additional analyses that explored (among other things):

  • the extent to which siblings with different alleles end up with different amounts of formal schooling (see FAQ 2.4);

  • which environmental conditions affect the size of the association between genetic variants and educational attainment (see FAQ 2.5);

  • the genetic overlap between educational attainment and other outcomes, such as cognitive performance (constituting the largest GWAS of cognitive performance to date) and self-reported math ability (see FAQ 2.7);

  • which other outcomes are also correlated with genetic variants that are associated with educational attainment (see FAQ 2.8); and

  • the biological functions of the genetic variants identified (see FAQ 2.9).

2.2.  What did you find in the GWAS of educational attainment?

In our sample of roughly 1.1 million people, we found 1,271 genetic variants that were associated with educational attainment (using the standard statistical threshold in GWAS, which adjusts for multiple hypothesis testing). This is a substantial increase from the 74 variants identified in our last GWAS of around 300,000 individuals (Okbay, Beauchamp, et al. 2016), confirming the importance of large sample size for identifying specific genetic variants associated with behavioral traits.

The current study further confirmed the finding from our earlier work that the effects of individual genetic variants on educational attainment are extremely small. The average effect size across the 1,271 genetic variants was just 1.8 weeks of schooling per allele; even the SNPs with the strongest associations only predicted around 3 weeks of additional schooling per allele. Taken together, these 1,271 SNPs accounted for just 3.9% of the variation across individuals in years of education completed.

Here is another way to think about this result. Imagine that we used the results for these 1,271 genetic variants (not the ~1 million SNPs across the entire genome that we discuss in FAQ 2.3) to predict the educational attainment for a new group of people (separate from our discovery sample). We could then compare each individual’s predicted educational attainment to their actual educational attainment. If we did so, our results suggest that we would find that the predictions and actual outcomes correlate only very modestly (at about r = 0.20). That, in turn, means that if someone were predicted to complete an above-average number of years of schooling (i.e., to be in the top half of educational attainment), that person would have about a 58% chance of actually being in the top half of educational attainment. Fifty-eight percent is better than chance (i.e., 50%), suggesting that a prediction based on these 1,271 SNPs has more power to predict educational attainment than a coin flip, but only a bit more power. By contrast, a prediction based on a polygenic score that combines the ~1 million SNPs that we studied (see FAQs 1.5 & 2.3) has more predictive power: r = 0.33, corresponding to 11% of the variation across individuals.
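The link between the correlations and the percentages quoted above is that the share of variation predicted equals the square of the correlation. The short sketch below only illustrates that arithmetic; the function name is ours, and the small gap between 4% and the reported 3.9% reflects rounding of r to "about 0.20".

```python
def variance_explained(r):
    """Share of variation predicted by a predictor that correlates r with the outcome."""
    return r ** 2


for label, r in [("1,271 SNPs", 0.20), ("~1 million SNPs", 0.33)]:
    share = 100 * variance_explained(r)
    print(f"{label}: r = {r:.2f} corresponds to about {share:.1f}% of the variation")
```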

The contrast between the 3.9% of the variation predicted by the 1,271 SNPs and the 20% known to be explained by common SNPs (see FAQ 1.7) implies that there are many other SNPs that have not yet been identified. Even larger sample sizes will be needed to identify them.

It is also important to keep in mind that educational attainment is a complex phenomenon, and our study focuses on only a tiny piece of the bigger picture. In this paper, we only examine one type of genetic variant (SNPs). Further, we conduct only preliminary analyses of how the effects of genetic variants on educational attainment differ depending on environmental conditions (see FAQ 2.5). These other genetic effects, environmental effects, and their interactions are important topics of active research and of future work by the SSGAC. Such work includes further studies of associations between educational attainment and epigenetic marks (Linnér et al. 2017).

2.3.  How predictive is the polygenic score developed in this study?

As discussed in FAQ 1.5, we can create an index using the GWAS results from ~1 million genetic variants. Such an index is called a “polygenic score.”

The polygenic score we constructed “predicts” (see FAQ 1.4) around 11% of the variation in education across individuals (when tested in independent data that was not included in the GWAS). This ~1 million SNP polygenic score predicts much more of the variation than does the genetic predictor described in FAQ 2.2, which was based on only 1,271 SNPs. Including all ~1 million SNPs tends to add predictive power because the threshold for significance/inclusion that is used to identify the 1,271 SNPs is very conservative (i.e., many of the other ~1 million SNPs are also associated with educational attainment but are not identified by our study, and on net, it turns out empirically that more signal than noise is added by including them). This study’s polygenic score has much more predictive power than polygenic scores constructed from our earlier two GWAS of educational attainment, because both of those studies had much smaller sample sizes (~100,000 and ~300,000 individuals, respectively, compared with ~1.1 million individuals in the current study).

Individuals with high polygenic scores have, on average, higher levels of education than those with lower polygenic scores. In the present study, we found that in a U.S. sample of young adults (the National Longitudinal Study of Adolescent to Adult Health), 12% of those with the lowest 20% of polygenic scores graduated from college, compared with 57% of those with the highest 20% of polygenic scores. These results show both that polygenic scores have some predictive power but also that polygenic scores do not determine or pin down individual outcomes: even when polygenic scores are based on GWAS of many more people and therefore have even greater predictive power than ours, there will always be many people whose polygenic scores “predict” lower educational attainment who in fact attain relatively high amounts of education and vice-versa.

As we discuss in FAQ 3.4, an individual’s polygenic score for education (even a polygenic score based on ~1 million SNPs) is still not a very accurate prediction of that individual’s actual level of education attained. However, polygenic scores are useful for scientific studies (including social science, health research, etc.). Such studies are concerned with aggregate population trends and averages rather than with individual outcomes. In particular, because the polygenic score predicts 11% of the variation across individuals, studies of its association with other variables can be well powered in sample sizes as small as 75 individuals (but not as small as 1 individual!).
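To give a feel for why a predictor of this strength can be useful in small samples but not for a single person, the sketch below approximates the statistical power to detect a correlation using the Fisher z approximation. This is an illustrative calculation of our own, not necessarily the one behind the figure of 75 quoted above; it assumes a two-sided test at the 5% level and a correlation equal to the square root of 0.11.

```python
from math import atanh, sqrt
from statistics import NormalDist


def approx_power(r, n, alpha=0.05):
    """Approximate power to detect a true correlation r in a sample of n people.

    Uses the Fisher z approximation and ignores the negligible chance of
    rejecting in the wrong direction.
    """
    z_effect = atanh(r) * sqrt(n - 3)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(z_effect - z_crit)


r = sqrt(0.11)  # correlation implied by predicting 11% of the variation
for n in (10, 75, 300):
    print(n, round(approx_power(r, n), 2))
```

Under these assumptions, power is low with a handful of people and rises above the conventional 80% benchmark at around 75 people.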

Through this lens, the fact that the current study’s polygenic score for educational attainment predicts 11% of the variation across individuals in education attained is quite meaningful and rivals or exceeds the predictive power of other variables commonly used in research—none of which, taken alone, predicts a large amount of variation in a behavioral outcome. For example, using our sample in order to maximize the comparability with the polygenic score, we estimated that household income predicts ~7% of variation in educational attainment and mother’s education predicts ~15%. Thus, our score has approached the predictive power of important demographic variables and can be used in similar ways (e.g., to control for genetics as an additional confound when evaluating the effects of environmental differences or interventions).

With a relatively high level of predictive power, the polygenic score we constructed enables other research that is of value to social scientists and health researchers. Such studies are already being conducted with the (much less powerful) polygenic scores from earlier GWAS of educational attainment (see FAQ 1.6). Our new results will enable many additional applications, such as studies that use the polygenic score in relatively small samples that contain rich health and behavioral data that is expensive to collect (e.g., a randomized controlled trial that studies the effects of subsidizing higher education and uses the polygenic score as a control variable).

2.4.  What did you find in the analysis of siblings?

In a sample of ~44,000 siblings (~22,000 pairs), we examined the genetic variants identified in our GWAS. Specifically, we tested whether having more alleles of particular genetic variants than one’s sibling is associated with having greater educational attainment than that sibling. One purpose of this analysis was to assess to what extent GWAS results are biased by factors such as unaccounted-for “population stratification” (see Appendix 1 and FAQs 1.3 & 3.5). We found strong evidence that the genetic variants identified in the GWAS are associated with educational attainment in our sibling analysis.

However, we also found that the associations with educational attainment were substantially smaller in the sibling analysis than in the GWAS (when we conducted an analogous study of height, we did not observe any quantitative discrepancies between the GWAS and the within-family estimates after a technical correction for assortative mating). We examined a number of possible explanations for the difference. Ultimately, we believe that the GWAS estimates are larger because they partly reflect the kinds of intergenerational mechanisms discussed in FAQ 1.3 and studied by Kong et al. (2018): an individual with a genetic variant associated with greater educational attainment is more likely to have a parent with that variant. Such a parent is likely to have attributes and behaviors (such as higher income or a greater likelihood of reading to a child) that contribute to increasing the child’s educational attainment. These intergenerational mechanisms are not measured in the sibling analysis (since the siblings share the same parents).
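The logic of this comparison can be sketched with a toy simulation. Everything in it is an assumption made for illustration (the effect sizes, the way family background is modeled, and the number of families); it is not the paper's analysis and does not reproduce its estimates. Each person's outcome depends on their own score and on a family environment that is correlated with the family's genetics. A regression across all individuals picks up both channels, whereas a regression of sibling differences removes whatever siblings share.

```python
import random

random.seed(0)

B_DIRECT = 0.2      # assumed effect of a person's own score on the outcome
GAMMA = 0.1         # assumed effect of family environment tied to family genetics
N_FAMILIES = 20000  # assumed number of sibling pairs


def slope(xs, ys):
    """Ordinary least squares slope of ys on xs."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


scores, outcomes, diff_scores, diff_outcomes = [], [], [], []
for _ in range(N_FAMILIES):
    family = random.gauss(0, 1)  # genetic component shared within the family
    pair = []
    for _ in range(2):
        score = (family + random.gauss(0, 1)) / 2 ** 0.5
        outcome = B_DIRECT * score + GAMMA * family + random.gauss(0, 1)
        pair.append((score, outcome))
        scores.append(score)
        outcomes.append(outcome)
    diff_scores.append(pair[0][0] - pair[1][0])
    diff_outcomes.append(pair[0][1] - pair[1][1])

population_slope = slope(scores, outcomes)         # includes the family channel
sibling_slope = slope(diff_scores, diff_outcomes)  # family channel differenced out
print(round(population_slope, 3), round(sibling_slope, 3))
```

In this toy setup the population estimate is larger than the sibling estimate, and the sibling estimate is close to the assumed direct effect, which mirrors the qualitative pattern described above.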

If our conjecture about the source of the discrepancy is correct, it reinforces the importance of interpreting genetic associations with caution. Behavior geneticists sometimes criticize social scientists for failing to consider a role for genetic factors when interpreting correlations between relatives (e.g., parent-child correlations in educational attainment). We believe this criticism has merit (see FAQ 1.7 for a summary of the evidence) but it goes both ways. Since the variants identified in our GWAS also show evidence of association in our sibling analyses, we can be confident that our main results are not fully explained by factors that siblings share, such as parental genotypes and many features of rearing environment. But since the associations are weaker in the sibling analyses, it is plausible that some of the predictive power – perhaps a quarter – of our GWAS-identified variants arises because the variants are correlated with environmental factors that siblings share. For this and other reasons, we believe it is misleading to use phrases such as “innate ability” or “genetic endowments” to describe what is measured by polygenic scores based on our GWAS estimates.

2.5.  What did you find in the analysis of environmental heterogeneity?

We expect the associations of particular genetic variants with educational attainment to depend on environmental context (such as a country’s school system and the quality of an individual’s schools). That is partly because the meaning of a specific educational qualification varies across time and place. It is also because genetic variants don’t affect educational attainment directly. Instead, they are likely to operate through a myriad of complex pathways. For example, they may affect psychological characteristics such as cognitive abilities and personality traits that ultimately influence educational attainment (see FAQ 1.7). To take one example, one would expect genetic variants associated with educational attainment to play a smaller role in countries whose laws make education compulsory for a relatively long period of time, because this environmental factor (education laws) constrains the range of outcomes to begin with. The genetic variants we have identified as associated with educational attainment in the current environments in which they were studied would play a lesser role still in a zombie apocalypse where many schools have been overrun by walkers (if any schools at all remain, different genetic variants might be associated with educational attainment—say, those associated with muscle twitch speed or immunity to the zombie virus). Finally, because genes influence educational attainment through other traits and behaviors, different pedagogies might make some of these traits more important to educational attainment than others, which would in turn likely modify the effect sizes—and even identities—of genetic variants associated with educational attainment.

In this study, we found some evidence that the effects of genetic variation on educational attainment differed across the 71 cohorts that contributed data. Characteristics such as cognitive abilities and personality traits are likely to matter differently in different places and time periods since educational systems also vary, and the 71 cohorts contributing data come from 15 different countries and enroll people born in a wide range of years. Documenting heterogeneity across cohorts in the associations of individual SNPs is a contribution of this paper because much previous work did not have sufficient statistical power to do so. Although most researchers expected such heterogeneity, the sample size of our study made it possible to measure the existence of these differences. We performed an exploratory analysis to investigate which observable environmental factors predicted differences in genetic effects across cohorts, but we were not sufficiently well powered to identify robust results. As GWAS sample sizes continue to grow, researchers will be able to understand in greater detail how environments shape genetic effects. This is one example of how adequately-powered GWAS can help establish the limits and nuances of genetic explanations of behaviors (see FAQ 3.7).

2.6.  What did you find in the analysis of the X chromosome?

In contrast to most previous GWAS (including the previous GWAS of educational attainment), this study also examined variants on the X chromosome. In addition to the 1,271 variants identified on the autosomes (the non-sex chromosomes), we identified 10 variants associated with educational attainment on the X chromosome. Part of the reason we found so few variants on the X chromosome is that we had X chromosome data only for a smaller sample (~700,000 individuals, compared with 1.1 million for the autosomes). But even adjusting for sample size, we found fewer variants on the X chromosome than on other chromosomes of similar length. Moreover, the variants on the X chromosome as a whole “predicted” (see FAQ 1.4) less of the variation in educational attainment than the variants on other chromosomes of similar size. These results are of interest for human geneticists, as they are some of the first GWAS evidence about the effects of SNPs on the X chromosome (on any outcome, not just educational attainment).

Finally, in separate GWAS of men and women, we found that variants on the X chromosome predict similar amounts of variation in educational attainment in men and in women. Some researchers had hypothesized that genetic influences on the X chromosome are an important source of differences in the variance in cognitive performance across men and women. While there were compelling scientific reasons to view such claims skeptically even prior to the publication of our study, our results provide further evidence against the hypothesis.

2.7.  What did you find in the analysis of cognitive performance and math abilities?

In supplementary analyses, we estimated GWAS of cognitive performance (as measured by scores on cognitive tests), self-reported math ability, and self-reported highest math class completed. Each of these GWAS was estimated in a substantially smaller sample than the GWAS of educational attainment, which contained information from roughly 1.1 million individuals. This difference in sample size reflects the fact that education is simple and standard to collect in large surveys, while cognitive performance, for example, is assessed less often because it requires respondents to answer time-consuming questions.

Still, with a sample size of around 250,000, our GWAS of cognitive performance is the largest published to date. A previous GWAS of cognitive performance was based on a sample of roughly 35,000 individuals (Trampush et al. 2017). We combined the results of that study with data from over 200,000 UK Biobank respondents who completed a test of verbal and numerical reasoning. Our GWAS identified 225 genetic variants associated with cognitive performance (using a standard threshold for genome-wide significance). A polygenic score constructed from all the genetic variants “predicts” (see FAQ 1.4) 7-10% of the variation in cognitive performance across individuals. A study of cognitive performance based on an even larger sample than ours (e.g., Savage et al. 2017) is presently being conducted under the auspices of the Psychiatric Genomics Consortium (PGC). The PGC study, in turn, is a follow-up to a previously published GWAS (Sniekers et al. 2017).

Self-reported math ability and highest math class completed have not been studied with GWAS before. Our GWAS of self-reported math ability used data from around 550,000 research participants of the personal genomics company 23andMe, and our GWAS of highest math class used data from over 400,000 research participants. We identified 618 and 365 genetic variants associated with self-reported math ability and highest math class completed, respectively.

We also found that many of the genetic variants that affect educational attainment also affect cognitive performance and math abilities. Exploiting this overlap in genetic effects, we applied a recently developed method, called Multi-Trait Analysis of GWAS (MTAG) (Turley et al. 2018). By doing so, we leveraged information from our large GWAS of educational attainment to identify additional genetic variants associated with cognitive performance, self-reported math ability, and highest math class completed. For all three outcomes, more genetic variants were identified after incorporating information about genetic correlates with educational attainment.

2.8.  Are the genetic variants associated with higher educational attainment in your study also associated with other outcomes?

Yes. We found that genetic variants associated with increased educational attainment are negatively associated with grade retention (i.e., having to repeat a grade) and positively associated with grade point average (GPA), cognitive performance, and self-reported math ability. This suggests that genetic variants “predict” (see FAQ 1.4) educational attainment at least in part through their correlation with cognitive development and academic performance.

In our previous GWAS of educational attainment, we also found that the genetic variants that predict educational attainment overlap with those that predict health outcomes, including Alzheimer’s disease, bipolar disorder, and schizophrenia (Okbay, Beauchamp, et al. 2016). Using the results of our previous GWAS, other researchers have identified genetic overlap between educational attainment and other outcomes, including ADHD, intellectual disability, cognitive decline in the elderly, brain morphology, and longevity (Pickrell et al. 2016; Warrier et al. 2016; Anderson et al. 2017; Marioni et al. 2016). Future research is needed to understand why the genetic variants linked to education overlap with those associated with these other traits.

2.9.  What do your results tell us about human biology and brain development?

We can draw inferences about biological pathways using computational methods that examine whether genes known to be involved in particular biological systems are especially likely to be associated with educational attainment.

In our earlier GWAS of educational attainment (Okbay, Beauchamp, et al. 2016), we found that the identified genes tended to be strongly active in the brain, especially prenatally, and were especially likely to be involved in neural development. The additional genes identified in the current study are also strongly active in the brain and involved in neural development. However, these additional genes are active both pre- and post-natally, at virtually all stages of brain development. Moreover, many of the newly identified genes are involved in neuron-to-neuron communication in the brain.

It is not surprising that genes may influence educational attainment in part because of their effects on brain development and communication within the brain. Cognitive abilities and personality traits (such as conscientiousness and resilience) that matter for school performance may be partially reflected in how the brain is organized. It is perhaps more surprising that our study of educational attainment generates a biological picture of brain development that is clearer than those generated by previous GWAS that focused directly on brain structures. We believe that the greater clarity of the biological picture we observe is due to the relatively large sample size of our study, which afforded us greater statistical power than previous GWAS. Since it will remain much easier to measure educational attainment than to conduct brain scans in large samples of individuals, we believe that GWAS of educational attainment will continue to play a useful role in understanding the biology of brain development, which is one of the benefits of this research (see FAQ 3.7).

3. Ethical and social implications of the study

3.1.  Did you find “the gene for” educational attainment?

No.

We did not find “the gene for” educational attainment or anything else. We identified many genetic variants that are associated with educational attainment. Although it was once believed that scientists would discover numerous one-to-one associations between genes and outcomes, we have known for a number of years that the vast majority of human traits and other outcomes are complex and are influenced by many (thousands or even millions of) genes, each of which alone tends to have a small influence on the relevant outcome.

3.2.  Well, then, did you find “the genes for” educational attainment?

Although we did find several genes that are associated with educational attainment, we believe that characterizing these as “genes for educational attainment” is still likely to mislead, for many reasons.

First, most of the variation in people’s educational attainment is accounted for by social and other environmental factors, not by additive genetic effects (see FAQ 1.7). “Genes for educational attainment” might be read to imply, incorrectly, that genes are the strongest predictor of variation in educational attainment.

Second, the genetic variants that are associated with educational attainment are also associated with many other things (only some of which we identify in this study, see FAQ 2.8). These variants are no more “for” educational attainment than for the other outcomes with which they are associated.

Third, the “predictive” power (see FAQ 1.4) of each individual genetic variant that we identify is very small. Our results show that genetic associations with educational attainment reflect thousands, or even millions, of genetic variants, each of which has a tiny effect size. Each variant is therefore weakly associated with, rather than a strong influence on, educational attainment. “Genes for educational attainment” might misleadingly imply the latter.

Fourth, environmental factors can increase or decrease the impact of specific genetic variants. Put differently, even if a genetic variant is associated with higher or lower levels of educational attainment on average, it may have a much larger or smaller effect depending on environmental conditions. Indeed, in the current paper and elsewhere, we report exploratory analyses that provide evidence of such gene-environment interactions (see FAQ 2.5). Educational attainment couldn’t even exist as a meaningful object of measurement if we didn’t have schools, and having schools introduces societal mechanisms that influence who goes to them. Accordingly, genetic associations with educational attainment necessarily will be mediated by societal systems and therefore genetic variation should often be expected to interact with environmental factors when it influences social phenomena, such as educational attainment. “Genes for educational attainment” suggests a stability in the relationship between these genes and the outcome of educational attainment that does not exist.

Finally, genes do not affect educational attainment directly (see FAQ 2.5). As described in FAQ 2.9, the genes identified as associated with educational attainment tend to be especially active in the brain and involved in neural development and neuron-to-neuron communication. The “predictive” power (see FAQ 1.4) of genes on educational attainment may therefore be the result of a long process starting with brain development, followed by the emergence of particular psychological traits (e.g., cognitive abilities and personality). These traits may then lead to behavioral tendencies as well as experiences and treatment by parents, peers, and teachers. All of these factors may additionally interact with the environment in which a person lives. Eventually these traits, behaviors, and experiences may influence (but not completely determine) educational attainment.

3.3.  Does this study show that an individual’s level of educational attainment is determined, or fixed, at conception?

No. 

Social and other environmental factors account for most variation in educational attainment. But even if it were true that genetic factors accounted for all of the differences among individuals in educational attainment, it would still not follow that an individual’s number of years of formal schooling is “determined” at conception. There are at least three reasons for this:

First, some genetic effects may operate through environmental channels (Jencks 1980). As an illustrative example, suppose—hypothetically—that the genetic variants we identified help students to memorize and, as a result, to become better at taking tests that rely on memorization. In this example, changes to the intermediate environmental channels—the type of tests administered in schools—could have drastic effects on individuals’ educational attainment, even though individuals’ genetic variants would not have changed. A genetic association with educational attainment might not be found at all if schools did not use tests that rely on memorization. More generally, the genetic associations that we found might not apply as strongly if the education system were organized differently than it is at present (see also FAQ 1.3).

Second, even if the genetic associations with educational attainment operated entirely through non-environmental mechanisms that are difficult to modify (such as direct influences on the formation of neurons in the brain and the biochemical interactions among them), there could still exist powerful environmental interventions that could change the genetic relationships. In a famous example suggested by the economist Arthur Goldberger, even if all variation in unaided eyesight were due to genes, there would still be enormous benefits from introducing eyeglasses (Goldberger 1979). Similarly, policies such as a required minimum number of years of education and dedicated resources for individuals with learning disabilities can increase educational attainment in the entire population and/or reduce differences among individuals.

Third, even if the genetic effects on educational attainment were not influenced by changes in the environment, those environmental changes themselves could still have a major impact on the educational attainment of the population as a whole. For example, if young children were given more nutritious diets, then everyone’s school performance might improve, and college graduation rates might increase. By analogy, 80%-90% of the variation across individuals in height is due to genetic factors. Yet the current generation of people is much taller than past generations due to changes in the environment such as improved nutrition.
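The height analogy can be illustrated with a small simulation. This is only a sketch under stated assumptions: the trait is measured in standard-deviation units, 80% of its variance is assigned to a genetic component, and the size of the environmental improvement shared by everyone in the later generation (1.0 unit) is a hypothetical value chosen for illustration. The function name is ours and does not come from the paper.

```python
import random
import statistics


def simulate_generation(env_shift, n=20000, genetic_share=0.8, seed=0):
    """Simulate trait = genetic part + individual environment + a shift shared by everyone.

    Returns the average trait value and the share of trait variance
    that is due to the genetic part within this generation.
    """
    rng = random.Random(seed)
    genetic = [rng.gauss(0.0, genetic_share ** 0.5) for _ in range(n)]
    environment = [rng.gauss(0.0, (1.0 - genetic_share) ** 0.5) for _ in range(n)]
    trait = [g + e + env_shift for g, e in zip(genetic, environment)]
    share = statistics.pvariance(genetic) / statistics.pvariance(trait)
    return statistics.mean(trait), share


# Hypothetical values: no shared improvement vs. a 1.0-unit improvement for everyone.
old_mean, old_share = simulate_generation(env_shift=0.0, seed=0)
new_mean, new_share = simulate_generation(env_shift=1.0, seed=1)
print(f"earlier generation: mean={old_mean:.2f}, genetic share={old_share:.2f}")
print(f"later generation:   mean={new_mean:.2f}, genetic share={new_share:.2f}")
```

In this toy setting the genetic share of the variation stays close to 80% in both generations, while the average rises by roughly the size of the shared environmental change. A high genetic share of variation within a generation therefore says nothing about how much the average can move between generations.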

3.4.  Can the polygenic score from this paper be used to accurately predict a particular person’s educational attainment?

No. While the “predictive” power (see FAQ 1.4) of our polygenic score is substantial—it predicts 11% of variation in educational attainment across individuals—and useful for some purposes (see FAQ 1.6), it is important to keep in mind that the score fails to predict the vast majority (89%) of variation in years of education across individuals. Many of those with low polygenic scores go on to achieve high levels of education, and a large proportion of those with high polygenic scores do not complete college.

Thus, an important message of this paper and our earlier papers is that DNA does not “determine” an individual’s level of education, for multiple reasons: First, it is estimated that, at least in the environments in which we have been measuring it, the additive effects of common genetic variants will only ever predict about 20% of the variance in educational attainment across individuals. Second, today’s polygenic score is only able to predict a little more than half of that 20% (11 percentage points). Third, since genetic variants matter more or less depending on environmental context (see FAQ 2.5), a polygenic score might be less (or more) predictive for individuals in some environments than for individuals in others. Finally, polygenic predictions only hold for as long as the environment in which they were developed remains substantially the same: if the laws or pedagogy underlying a population’s educational system change substantially, then so, too, might the predictive power of the polygenic score. Just as eyeglasses allow those genetically predisposed to poor vision to have nearly perfect vision, innovations in education (say, an innovation that makes education irresistibly engaging, thus mitigating the risk to those with genetic variants associated with lower ability to pay attention or maintain self-control) might result in those with lower polygenic scores now achieving just as much education, on average, as those with higher polygenic scores (see also FAQs 3.2 and 3.3).
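One way to see why a score that predicts 11% (or even 20%) of the variation is a poor guide to any one person's outcome is to ask how much uncertainty remains after using the score. The sketch below assumes a simple linear predictor, for which the typical size of the prediction error, relative to the spread of outcomes in the population, is the square root of one minus the share of variation predicted. The function name is ours and is purely illustrative.

```python
import math


def remaining_spread(share_predicted):
    """Typical prediction error as a fraction of the population spread (SD).

    Assumes a linear predictor: residual SD / total SD = sqrt(1 - R^2).
    """
    return math.sqrt(1.0 - share_predicted)


for share in (0.11, 0.20):
    print(f"share predicted = {share:.0%}: "
          f"remaining spread = {remaining_spread(share):.1%} of the original")
```

Under these assumptions, a score predicting 11% of the variation leaves about 94% of the original spread in outcomes unexplained, and a score predicting 20% would still leave about 89%.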


FAQs about "Genome-wide association study identifies 74 loci associated with educational attainment"

 


Use the quick link menu to jump to a specific question, or scroll down to read all FAQs for this publication. 

 

This description of the Nature paper “Genome-wide association study identifies 74 loci associated with educational attainment” includes the following information:

 

          1. Background: authorship, goals, definition of “educational attainment,” previous research

          2. Study design and results: genes, variants, and biology linked to educational attainment

          3. Social implications of the study: potential use in medical research and in policy

          4. Appendices: quality-control measures, further reading and references

 

The document was prepared by several co-authors of the paper and Advisory Board members of the Social Science Genetic Association Consortium. For clarifications or additional questions, please contact Daniel Benjamin (djbenjam@usc.edu).

 

Quick Links

Who conducted this study? What was the group's overarching goal?

 

The authors are members of the Social Science Genetic Association Consortium (SSGAC), a multi-institutional research group that aims to draw statistically rigorous links between genetic variants (for instance, base-pairs of DNA that vary across people) and social science variables such as behavior, preferences, and personality. The SSGAC was formed in 2011 to overcome a specific set of scientific challenges. First, most traits and behaviors are influenced by hundreds or thousands of genetic variants, and almost all of these genetic variants have extremely weak effects on their own (though, when combined, their collective effects can be meaningful). Second, to rigorously identify such variants, scientists must study hundreds of thousands of people, and therefore a promising strategy is for many investigators to pool their data into one large study. This approach has borne considerable fruit in medical genomics; recent successful studies of the genetics of autism (Gaugler et al., 2014), schizophrenia (Ripke et al., 2014), and many other diseases and conditions would not have been possible without large consortia in which members shared their data. The SSGAC is an attempt to recapitulate this research model for understanding genetic associations with non-medical traits.

The SSGAC is organized as a working group of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), a successful medical consortium. It was founded by three social scientists—Daniel Benjamin (University of Southern California), David Cesarini (New York University), and Philipp Koellinger (Vrije Universiteit Amsterdam)—who believe that genetic data could have a substantial positive impact on research in the social sciences. The Advisory Board for the SSGAC is composed of prominent researchers representing various disciplines: Dalton Conley (Sociology, New York University), George Davey Smith (Epidemiology, University of Bristol), Tõnu Esko (Molecular Genetics, Broad Institute and Estonian Genome Center), Albert Hofman (Epidemiology, Harvard), Robert Krueger (Psychology, University of Minnesota), David Laibson (Economics, Harvard), Sarah Medland (Statistical Genetics, QIMR Berghofer Medical Research Institute), Michelle Meyer (Bioethics, Clarkson University and Icahn School of Medicine at Mount Sinai), and Peter Visscher (Statistical Genetics, University of Queensland).

The SSGAC is committed to the principles of reproducibility and transparency. Prior to conducting genetic association studies, power calculations are carried out to determine the necessary sample size for the analysis (assuming realistically small effect sizes associated with individual genetic variants). These, together with an analysis plan, are posted on the Open Science Framework’s preregistration website. Major publications are usually accompanied by a FAQ document (such as this one). The FAQ document is written to communicate to the public what was found and what can and cannot be concluded from the research findings.

The SSGAC's first major project was a genome-wide association study (GWAS) on educational attainment—similar to this study but using a much smaller sample—published in Science (Rietveld et al., 2013). The study is summarized in a FAQ posted on the SSGAC website below. Subsequent papers have been published in Proceedings of the National Academy of Sciences and Psychological Science, among other journals.

The current study focuses on a variable called “educational attainment.” What is educational attainment?

Educational attainment is the amount of formal education a person completes. People vary considerably in how much education they complete. Education is recognized throughout the social and medical sciences as an important predictor of many other life outcomes, such as income, occupation, and health (e.g., Ross and Wu, 1995; Cutler and Lleras-Muney, 2008).

What was already known about the genetics of educational attainment prior to this study?

 

Decades of twin and family studies have found that one reason people differ in educational attainment is that they differ genetically. For example, several studies have found that identical twins are more similar to each other in their educational attainment than fraternal twins are to each other (Taubman, 1976; Branigan, McCallum, and Freese, 2013). Nonetheless, educational attainment is also strongly influenced by social and other environmental factors.

Recent research using molecular genetic data—in other words, using data that measures each person’s DNA and identifies variation at the molecular level across people—has similarly found that genetic factors play a role, accounting for at least 20% of variation in educational attainment (e.g., Rietveld et al., 2013).

These findings imply that there are genetic variants associated statistically with more educational attainment (people who carry these variants will tend on average to complete more formal education) and genetic variants associated statistically with less educational attainment (people who carry these variants will tend on average to complete less formal education). It is important to emphasize that these associations represent average tendencies in the population—not pre-determined outcomes for each person. It is likely that many genetic variants matter more or less depending on environmental context (such as a country’s school system and the quality of an individual’s school).

In the SSGAC’s first major publication (Rietveld et al., 2013), we conducted a genome-wide association study (GWAS) in a sample of roughly 100,000 individuals and identified three genetic variants statistically associated with educational attainment. In the same paper and subsequent work (Rietveld et al., 2014a), we verified that the associations with those variants were replicated in separate samples of individuals (25,000 and 35,000 people, respectively). There were two key takeaways from this work:

 

(1) A GWAS approach can identify specific genetic variants statistically associated with social-science variables if the study is conducted in large enough samples (at least one hundred thousand people).

 

(2) A specific genetic variant that is associated with a social science variable is likely to have much smaller predictive power for that trait than a specific genetic variant that is associated with a bio-medical outcome (Chabris et al., 2015). For example, the known genetic variant with the largest effect on height predicts 0.4% of the variation in height across individuals in the sample, whereas the three variants identified by Rietveld et al. (2013) each predict only 0.02% of the variation in educational attainment in the sample.

 

What did you do in this paper? How was the study designed?

 

The central contribution of the paper is a genome-wide association study (GWAS) of about 300,000 people (based on combined results from 64 separate analyses conducted in cohorts of participants from 15 different countries). This is by far the largest sample size ever studied for genetic associations with any social science outcome. We included only individuals of European descent to reduce statistical confounds that otherwise arise from studying ethnically diverse populations (see the discussion of population stratification in Appendix 1). For each person in our data, we analyzed approximately 9 million genetic variants called single nucleotide polymorphisms, or SNPs. (A genetic variant is a way in which the genomes of people can differ. SNPs are the most common type of genetic variant, but they are not the only type.)

 

We subsequently gained access to a large, independent sample of roughly 110,000 individuals (data from the U.K. Biobank). We used this new dataset to replicate the genetic associations that we initially reported.

 

In the remainder of the paper, we used the findings from the GWAS for a range of additional analyses that explored (among other things) the biological pathways associated with genetic variants of interest, and the genetic overlap between educational attainment and other outcomes such as Alzheimer’s disease.

 

What did you find in the GWAS?

 

In our “discovery sample” (of roughly 300,000 people), we found 74 SNPs associated with educational attainment. These include the 3 genetic variants identified in our earlier study. In our “replication sample” (of roughly 110,000 additional people from the U.K. Biobank), these findings held up extremely well. For example, in our replication sample, 72 of the 74 SNPs were associated with educational attainment with the same sign as in the discovery sample (i.e., SNPs that were associated with higher educational attainment in the discovery sample were also associated with higher educational attainment in the replication sample, and likewise for SNPs associated with lower educational attainment).
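To give a sense of how unlikely 72 matching signs out of 74 would be if the discovery results were pure noise, the sketch below computes a simple one-sided binomial probability. It assumes that, under pure chance, each SNP's sign in the replication sample would match with probability one half, independently of the other SNPs. This is an illustrative calculation, not necessarily the replication test reported in the paper, and the function name is ours.

```python
from math import comb


def sign_match_probability(n_snps, n_matching):
    """Chance of at least n_matching sign matches out of n_snps if each match were a coin flip."""
    favorable = sum(comb(n_snps, k) for k in range(n_matching, n_snps + 1))
    return favorable / 2 ** n_snps


p = sign_match_probability(74, 72)
print(f"probability of 72 or more matches out of 74 by chance: {p:.2e}")
```

Under these assumptions the probability is smaller than one in a billion billion, which is why the sign agreement is described as holding up extremely well.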

As a group, the 74 SNPs explain 0.43% of the variation in educational attainment across individuals in the sample. Individually, each of the 74 SNPs had an extremely small influence on educational attainment. The variant with the strongest association explained only 0.035% of the variation in educational attainment. Put another way, the difference between people with 0 and 2 copies of this genetic variant predicts (on average) about 9 extra weeks of schooling.
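The link between a share of variation explained and a difference in weeks of schooling can be sketched as follows. Under a standard additive model, a SNP with allele frequency p and per-copy effect b (in standard-deviation units) explains a share 2p(1-p)b² of the variation. The allele frequency (0.5) and the standard deviation of years of schooling (3.6 years) used below are illustrative assumptions of ours, not values taken from the paper, and the function name is hypothetical.

```python
import math


def weeks_between_0_and_2_copies(share_explained, allele_freq, sd_years, weeks_per_year=52):
    """Predicted schooling difference (in weeks) between carriers of 0 and 2 copies.

    Uses share_explained = 2 * p * (1 - p) * b**2, where b is the
    per-copy effect in standard-deviation units (additive model).
    """
    per_copy_sd = math.sqrt(share_explained / (2 * allele_freq * (1 - allele_freq)))
    return 2 * per_copy_sd * sd_years * weeks_per_year


# 0.035% of variation explained; allele frequency and SD are ASSUMED for illustration.
weeks = weeks_between_0_and_2_copies(0.00035, allele_freq=0.5, sd_years=3.6)
print(f"difference between 0 and 2 copies: about {weeks:.1f} weeks")
```

With these assumed inputs the result is on the order of ten weeks, the same ballpark as the figure quoted above. Different allele frequencies or a different spread in years of schooling would change the number.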

How do we reconcile our finding that the predictive power of each individual SNP association is extremely small and the finding from much previous work that at least 20% of the overall variation in educational attainment is associated with genetic factors? The two findings taken together imply that the genetic associations with educational attainment result from the cumulative effects of at least thousands (probably millions) of different genetic variants, not just a few.

 

This is not a surprise: educational attainment is a complex phenomenon, and our study focuses on only a tiny piece of the puzzle. In this paper, we only examine one type of genetic variant (SNPs), we consider only one of many forms of genetic difference among individuals, and we conduct only preliminary and exploratory analyses of how the effects of genetic variants differ depending on environmental conditions. There are substantial additional sources of molecular genetic variation that remain to be discovered. These other genetic effects, environmental effects, and their interactions are important topics of active research, and of future work by the SSGAC.
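The reasoning that at least thousands of variants must be involved can be checked with back-of-the-envelope arithmetic. The sketch below assumes, for simplicity, that the shares of variation explained by different variants simply add up. It uses only figures quoted in this FAQ: 20% in total, 0.035% for the strongest variant, and 0.43% for the 74 identified SNPs together. The function name is ours.

```python
import math


def variants_needed(total_share, share_per_variant):
    """Smallest number of equally strong variants whose shares add up to total_share."""
    return math.ceil(total_share / share_per_variant)


total = 0.20                   # at least 20% of variation linked to genetic factors
strongest = 0.00035            # the strongest single SNP explains 0.035%
average_top_74 = 0.0043 / 74   # the 74 SNPs together explain 0.43%

print("if every variant were as strong as the strongest one:",
      variants_needed(total, strongest))
print("if every variant were as strong as the average of the 74:",
      variants_needed(total, average_top_74))
```

Even in the unrealistic case where every variant is as strong as the strongest one found, more than 500 variants would be needed. At the average strength of the 74 identified SNPs, more than 3,000 would be needed. Since the variants not yet identified are likely to be weaker still, the true number is plausibly much larger.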

Can you use the results in this paper to meaningfully predict a particular person's educational attainment?

No.

Each individual genetic variant has a very small effect. It is true that many genetic variants combined together into an index can explain much more of the variation across individuals. Such an index is called a “polygenic score.” However, when we construct a polygenic score using all ~9 million SNPs in our data, we still find that on average the polygenic score explains only 3.2% of the variation across individuals. It will likely be possible to construct a polygenic score whose explanatory power is closer to 20% as the available sample sizes for GWAS get larger.
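In its simplest form, a polygenic score is a weighted sum: for each SNP, the number of copies a person carries (0, 1, or 2) is multiplied by a weight derived from the GWAS, and the products are added up. The sketch below shows this basic idea with three hypothetical SNPs and made-up weights. Real analyses use millions of SNPs and typically involve additional steps that are not shown here.

```python
def polygenic_score(allele_counts, weights):
    """Weighted sum of allele counts: sum over SNPs of (copies carried) * (GWAS weight)."""
    if len(allele_counts) != len(weights):
        raise ValueError("need exactly one weight per SNP")
    return sum(count * weight for count, weight in zip(allele_counts, weights))


# Hypothetical weights for three SNPs (made up for illustration only).
weights = [0.02, -0.01, 0.015]

# Copies of the counted allele carried by two hypothetical people.
person_a = [2, 1, 0]
person_b = [0, 2, 1]

print("score for person A:", round(polygenic_score(person_a, weights), 3))
print("score for person B:", round(polygenic_score(person_b, weights), 3))
```

A higher score only means that a person carries more of the variants that are, on average, associated with more education. As the rest of this answer explains, it is not an accurate prediction for that person.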

Our existing polygenic score is not an accurate predictor of any individual’s educational attainment. Even a (currently non-existent) polygenic score that could account for 20% of variation in educational attainment would pale in comparison to other scientific predictors. For comparison, professional weather forecasts correctly predict about 95% of the variation in day-to-day temperatures. Weather forecasters are vastly more accurate forecasters than social science geneticists will ever be.

 

Yet a polygenic score based on our study, which captures 3.2% of the variation in educational attainment, is predictive enough to be useful in social science studies, which focus on average or aggregated behavior in the population (not individual outcomes). Indeed, with 80% statistical power (the conventional threshold for adequate power), the effect of our polygenic score can be detected in a study with 250 individuals (notably, roughly three orders of magnitude smaller than the sample sizes we needed to be able to construct the score). Therefore, the polygenic score provided by our study can be useful in social science studies that have at least 250 participants and in which the participants’ genomes have been measured.
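The sample size quoted above can be roughly reproduced with a textbook power calculation. The sketch below is our own approximation, not necessarily the calculation used in the paper. It assumes a two-sided test at the 5% significance level (the significance level is not stated above), treats the score's effect as a correlation equal to the square root of the share of variation captured, and uses the standard Fisher z approximation.

```python
import math
from statistics import NormalDist


def sample_size_to_detect(share_explained, alpha=0.05, power=0.80):
    """Approximate number of participants needed to detect a predictor.

    Fisher z approximation for a correlation r = sqrt(share_explained),
    two-sided test at significance level alpha.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = standard_normal.inv_cdf(power)
    fisher_z = math.atanh(math.sqrt(share_explained))
    return math.ceil(((z_alpha + z_power) / fisher_z) ** 2 + 3)


print("participants needed for a score capturing 3.2%:",
      sample_size_to_detect(0.032))
```

Under these assumptions the answer comes out slightly below 250, in line with the figure given above. A more predictive score would need fewer participants, and a stricter significance level would need more.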

 

Are the variants associated with higher educational attainment in your study also associated with other outcomes?

 

In the data we analyzed for this paper, we find that on average SNPs associated with higher educational attainment are also associated with increased cognitive performance and intracranial volume, increased risk of bipolar disorder, decreased risk of Alzheimer’s disease, and lower neuroticism. These results highlight the potential relevance of our results for medical research, but future research is needed to shed light on the reasons why genetic variants are shared in common across these traits.

 

What do your results tell us about human biology and brain development?

 

We can draw inferences about biological pathways using computational methods that examine whether genes known to be involved in particular biological systems are especially likely to be associated with educational attainment.

 

We found that genes identified by our analyses tend to be strongly active in the brain, especially prenatally, and are especially likely to be involved in neural development. Moreover, the specific SNPs we identify tend to be in regions of the genome believed to be involved in regulation of gene activity in the fetal brain.

 

It is not surprising that genes may influence educational attainment in part because of their effects on brain development. Cognitive abilities and personality traits (such as conscientiousness and resilience) that matter for school performance may be partially reflected in how the brain is organized. It is perhaps more surprising that our study of educational attainment generates a biological picture of brain development that is clearer than those generated by previous GWAS that focused directly on brain structures. We believe that the relative clarity of the biological picture we observe is due to the large sample size of our study, which afforded us greater statistical power than previous GWAS. Since it will remain much easier to measure educational attainment than to conduct brain scans in large samples of individuals, we believe that GWAS of educational attainment will continue to play a useful role in understanding the biology of brain development.

 

Did you find “the genes for” educational attainment?

 

No. We did not find “the genes for” educational attainment—or for anything else. Characterizing the results this way is misleading for many reasons. First, educational attainment is primarily determined by environmental factors, not genes. Second, the explanatory power of each individual genetic variant that we identify is extremely small. Our results show that genetic associations with educational attainment are comprised of thousands, or even millions, of genetic variants, each of which has a tiny effect size. Third, environmental factors are likely to increase or decrease the impact of specific genetic variants. Indeed, in the current paper we report exploratory analyses that provide suggestive evidence of such gene-environment interactions. Finally, genes do not affect educational attainment directly. Rather, genes that are associated with educational attainment might influence many different biological factors that in turn affect psychological characteristics that finally influence educational attainment.

 

Does this study show that an individual’s level of educational attainment is determined at conception?

 

No. Even if it were true that genetic factors accounted for all of the differences among individuals in educational attainment (which they certainly do not), it would still not follow that an individual’s number of years of formal schooling is “determined” at conception. There are at least three reasons for this:

 

First, some genetic effects may operate through environmental channels. As an illustrative example, suppose—hypothetically—that the genetic variants we identified help students to memorize and, as a result, to become better at taking tests that rely on memorization. In this example, changes to the intermediate environmental channels—the type of tests administered in schools—could have drastic effects on individuals’ educational attainment, even though their genetic variants would not have changed. A genetic association with educational attainment might not be found at all if schools did not use tests that rely on memorization. More generally, the genetic associations that we found might not apply as strongly if the education system were organized differently than it is at present.

 

Second, even if the genetic associations with educational attainment operate entirely through non-environmental mechanisms that are difficult to modify (such as direct influences on the formation of neurons in the brain and the chemical interactions among them), there could still exist powerful environmental interventions that could change the genetic relationships. In a famous example suggested by the economist Arthur Goldberger, even if all variation in unaided eyesight were due to genes, there would still be enormous benefits from introducing eyeglasses. Similarly, policies such as a required minimum number of years of education and help for individuals with learning disabilities can increase educational attainment in the entire population and/or reduce differences among individuals.

Third, even if the genetic effects on educational attainment were not influenced by changes in the environment, those environmental changes themselves could still have a major impact on the educational attainment of the population as a whole. For example, if young children were given more nutritious diets, then everyone’s school performance might improve, and college graduation rates might increase. By analogy, 80%-90% of the variation across individuals in height is due to genetic factors. Yet the current generation of people is much taller than past generations, entirely due to changes in the environment such as improved nutrition.

 

Can environmental factors modify the effects of the specific genetic variants you identified?

 

We believe the answer is yes, and we report some exploratory analyses of this question in the paper. We examined a sample of Swedish individuals born between 1929 and 1958. During the 1950s and 1960s, when many of these individuals were in school, Sweden (like many other European countries) introduced a comprehensive new schooling system that extended mandatory schooling from seven to nine years, eliminated the lower level in secondary school, and postponed ability tracking from around age 10 until age 16. Another set of reforms sought to increase equality of outcomes and opportunity by increasing the availability of high schools, colleges, and universities.

We find that the association between educational attainment and our polygenic score (an index of the genetic variants in our data) is only about half as large among Swedish individuals born in the late 1950s compared with those born in the early 1930s. This finding is consistent with the possibility that the Swedish reforms reduced the effects of genetic variants in generating differences in educational attainment. While the analyses we report are exploratory, we believe that one contribution of our paper is to pave the way for more in-depth studies of such gene-environment interactions.

 

What policy lessons or practical advice do you draw from this study?

 

None whatsoever. Any practical response—individual or policy-level—to this or similar research would be extremely premature. In this respect, our study is no different from genome-wide association studies (GWAS) of complex medical outcomes. In medical GWAS research, it is well understood that identifying genetic variants that affect disease risk is merely a first step toward understanding the underlying biology of that disease. It is not sufficient to assess risk for any specific individual. It is not appropriate to base policies and practices on such assessments.

 

Do your findings have implications for health? Could they be used to advance medical research?

 

There is a well-known relationship between educational attainment and health outcomes, and this connection has been one motivation for our research. Indeed, some of the genetic variants we identify may be associated with educational attainment because they affect the health of people who carry them (which, in turn, could impact the amount of education a person receives). Our analyses of genetic overlap suggest that some of the same genes that matter for educational attainment also matter for Alzheimer’s disease, bipolar disorder, and schizophrenia. In previous work (Rietveld et al., 2014b), we found that an index of genetic variants associated with educational attainment had some predictive power for dementia in older individuals, and several groups of medical researchers have used the genetic variants and polygenic score identified by our earlier study on educational attainment (Rietveld et al., 2013) to study other health conditions including dyslexia and psychiatric disorders. By making the results of our analyses publicly available at www.thessgac.org, we hope to facilitate such research.

 

Could this kind of research lead to discrimination against, or stigmatization of, people with the relevant genetic variants?

 

There is always a risk that research will be misinterpreted or misused. In the case of behavioral genetic research, one risk is that findings may be misinterpreted (whether willfully or not) and misused to stigmatize or discriminate. One response to this risk is to abstain from conducting behavioral genetics research. In this case, however, we do not think that the best response to the possibility that useful knowledge might be misused is to refrain from producing that knowledge. Indeed, there are at least two major ethical problems with abstention.

First, behavioral genetics research, including studies of the relationships between genes and a variety of social and cognitive traits, is already being conducted and will continue to be conducted. Not all of this work involves appropriate scientific methods or transparent communication of results. In this context, researchers who are committed to developing, implementing, and spreading best practices for conducting and communicating potentially controversial research, including behavioral genetics research, arguably have an ethical responsibility to participate in the development of this body of knowledge—rather than abstain from it and hope for the best. In essence, we believe that we have an ethical duty to set the record straight.

 

For instance, an important theme in our earlier work has been to point out that most existing studies in social-science genetics that report genetic associations with behavioral traits have serious methodological limitations, fail to replicate, and are likely to be false-positive findings (Benjamin et al., 2012; Chabris et al., 2012; Chabris et al., 2015). This same point was made in an editorial in Behavior Genetics (the leading journal for the genetics of behavioral traits), which stated that “it now seems likely that many of the published [behavior genetics] findings of the last decade are wrong or misleading and have not contributed to real advances in knowledge” (Hewitt, 2012). One of the most important reasons why existing work has generated unreliable results is that its sample sizes were far too small, given that the true effects of individual genetic markers on behavioral traits are tiny.

 

In our view, responsible behavioral genetics research includes sound methodology and analysis of data; a commitment to publish all results, including any negative results; and transparent, complete reporting of methodology and findings in publications, presentations, and communications with the media and the public, including particular vigilance regarding what the results do—and do not—show (hence, this FAQ document).

 

Second, one should not assume that behavioral genetics research carries only the potential to increase stigmatization. One benefit of recent behavioral genetics research is that it has clarified the limits of deterministic views of complex traits by establishing upper bounds for the amount of variation among individuals attributable to common genetic variants—thus perhaps making discrimination and stigmatization less likely in the future. Pre-existing claims of genetic associations with complex social-science outcomes have reported widely varying effect sizes, many of them purporting to explain ten to one hundred times as much of the variation across individuals as did the genetic variants we have found in this study and in our other studies.

 

The bottom line is that individual genetic variants have very little explanatory power for educational attainment, and even composite indexes of millions of genetic variants have too little explanatory power to usefully predict any individual’s educational attainment.

 

Appendix 1: Quality control measures

 

There are many potential pitfalls that can lead to spurious results in genome-wide association studies (GWAS). We took many precautions to guard against these pitfalls.

 

One potential source of spurious results is incomplete “quality control” (QC) of the genetic data. To avoid this problem, we used state-of-the-art QC protocols from medical genetics research (Winkler et al., 2014). We supplemented these protocols by developing and applying additional, more stringent QC filters.

 

Another potential source of spurious results is a confound known as “population stratification” (e.g., Hamer and Sirota, 2000). To illustrate, suppose we were conducting a GWAS on height. People from Northern Europe are on average taller than people from Southern Europe, and there are also small differences in how often certain genetic variants occur in Northern and Southern Europe. If we combine samples of Northern and Southern Europeans and perform a GWAS that ignores the regions the individuals come from, then we would find genetic associations for these variants. However, those associations would simply reflect the fact that the variants are correlated with a population (Northern or Southern Europe) and may actually have nothing to do with height.
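
The height example can be made concrete with a small simulation. The sketch below is purely illustrative and is not part of the study's analysis: the allele frequencies, average heights, and sample sizes are made-up assumptions, and the simulated variant has no effect on height at all. Pooling the two populations nevertheless tends to produce a clear correlation between the variant and height, while the correlation within each population stays close to zero.

```python
import random
import statistics


def correlation(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_population(rng, n, allele_freq, mean_height_cm):
    """Draw genotypes (0, 1 or 2 copies of a variant) and heights for n people.

    Height is drawn independently of genotype, so the variant has no effect.
    """
    genotypes = [(rng.random() < allele_freq) + (rng.random() < allele_freq)
                 for _ in range(n)]
    heights = [rng.gauss(mean_height_cm, 7.0) for _ in range(n)]
    return genotypes, heights


rng = random.Random(0)  # fixed seed so the illustration is repeatable

# All numbers below are hypothetical and chosen only for illustration.
north_g, north_h = simulate_population(rng, 5000, allele_freq=0.6, mean_height_cm=178.0)
south_g, south_h = simulate_population(rng, 5000, allele_freq=0.4, mean_height_cm=172.0)

pooled_r = correlation(north_g + south_g, north_h + south_h)
north_r = correlation(north_g, north_h)
south_r = correlation(south_g, south_h)

print(f"pooled sample: r = {pooled_r:+.3f}")
print(f"North only:    r = {north_r:+.3f}")
print(f"South only:    r = {south_r:+.3f}")
```

The pooled correlation arises only because the variant is more common in the taller population, which is exactly the kind of spurious association that the precautions described next are meant to remove.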

 

In our study we were extremely careful to avoid population stratification as much as possible. At the outset, we restricted the study to individuals of European descent, since population stratification problems are more severe when including European-descent and non-European-descent individuals in the same sample. As is standard in GWAS on medical outcomes, we controlled for “principal components” of the genetic data in the analysis; these principal components capture the small genetic differences across populations, so controlling for them largely removes the spurious associations arising solely from these small differences. To eliminate even weak population stratification effects, we controlled for a larger number of principal components than is standard in GWAS on medical outcomes (ten rather than four).

 

After taking these steps to minimize population stratification, we conducted a number of analyses to assess how much population stratification still remained in our data. The results of these tests indicate that there is some, but not much.

 

We conducted additional tests to confirm that our GWAS results are not driven by this remaining population stratification. To do so, we used a subset of the individuals in our data, 5,506 sibling pairs (from five of the datasets that contributed to our study). The key idea underlying our tests is to examine if differences in genetic variants across siblings are associated with differences in the siblings’ educational attainment. If so, then these associations cannot be the result of population stratification. The reason is that full siblings (from the same genetic parents) share their ancestry entirely, and therefore differences in their genetic variants cannot be due to being from different population groups (genetic differences between siblings are random). Unfortunately, because our sample of siblings (~11,000 individuals) is much smaller than our overall GWAS sample (~300,000 individuals), our estimates of the effects of the genetic variants within the sibling pairs are much noisier than in the GWAS. However, we can test whether the GWAS results are entirely due to population stratification, because if they were, then the sibling estimates would not line up with the GWAS estimates. In fact, we find that the within-family estimates are more similar to the GWAS estimates in both sign and magnitude than would be expected by chance. These results imply that our GWAS results are not solely due to population stratification.

 

Appendix 2: Additional reading and references

 

  1.    Benjamin DJ et al. (2012). The promises and pitfalls of genoeconomics. Annu Rev Econom 4: 627-662.

 

  2.    Branigan AR, McCallum KJ, Freese J (2013). Variation in the heritability of educational attainment: An international

         meta-analysis. Northwestern University Institute for Policy Research Working Paper. 13-09.

 

  3.    Chabris C et al. (2012). Most Reported Genetic Associations with General Intelligence Are Probably False Positives.

         Psychol Sci 23(11): 1314-1323.

 

  4.    Chabris C et al. (2015). The Fourth Law of Behavior Genetics. Curr Dir Psychol Sci 24(4): 304-312.

 

  5.    Cutler DM, Lleras-Muney A (2008). Education and Health: Evaluating Theories and Evidence. In House J, Schoeni R,

         Kaplan G, Pollack H (Eds.), Making Americans Healthier: Social and Economic Policy as Health Policy (Russell Sage

         Foundation, New York).

 

  6.    Editorial (2013). Dangerous Work. Nature 502(7469): 5-6.

 

  7.    Gaugler T et al. (2014). Most genetic risk for autism resides with common variation. Nat Genet 46(8): 881-885.

 

  8.    Hewitt J (2012). Editorial Policy on Candidate Gene Association and Candidate Gene-by-Environment Interaction

         Studies of Complex Traits. Behav Genet 42(1): 1-2.

 

  9.    Hamer DH, Sirota L (2000). Beware the chopsticks gene. Mol Psychiatry 5(1): 11–13.

 

  10.  Nuffield Council on Bioethics (2002). Genetics and Human Behavior: the ethical context (Nuffield Council on Bioethics:

          London).

 

  11.  Parens E, Appelbaum PS (2015). An Introduction to Thinking about Trustworthy Research into the Genetics of

         Intelligence. Hastings Center Report 45(5): S2-S8.

 

  12.  Rietveld CA et al. (2013). GWAS of 126,559 individuals identifies genetic variants associated with educational

         attainment. Science 340(6139): 1467–1471.

 

  13.  Rietveld CA et al. (2014a). Replicability and Robustness of GWAS for Behavioral Traits. Psychol Sci 25(11): 1975-1986.

 

  14.  Rietveld CA et al. (2014b). Common genetic variants associated with cognitive performance identified using

         proxy-phenotype method. Proc Natl Acad Sci USA 111(38): 13790–13794.

 

  15.  Ripke S et al. (2014). Biological insights from 108 schizophrenia-associated genetic loci. Nature 511(7510): 421-427.

 

  16.  Ross CE, Wu C (1995). The links between education and health. Am Sociol Rev 60(5): 719-745.

 

  17.  Taubman P (1976). Earnings, education, genetics, and environment. J Hum Resour 11(4): 447-461.

 

  18.  Winkler TW et al. (2014). Quality control and conduct of genome-wide association meta-analyses. Nat Protoc 9(5):

          1192–212.

 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

FAQs about "Genetic variants associated with subjective well-being, depressive symptoms and neuroticism identified through genome-wide analyses"

 

Use the quick link menu to jump to a specific question, or scroll down to read all FAQs for this publication. 

 

This document was prepared by several of the co-authors of the paper and Advisory Board members of the Social Science Genetic Association Consortium. For clarifications or additional questions, please contact Daniel Benjamin (djbenjam@usc.edu).

 

Quick Links

 

What is the Social Science Genetic Association Consortium (SSGAC)?

 

The SSGAC is a research infrastructure designed to stimulate dialogue and cooperation among medical researchers, geneticists, and social scientists. The SSGAC facilitates collaborative research that seeks to identify associations between specific genetic variants (small segments of DNA that differ across people) and social science variables, such as behavior, preferences, personality, well-being, and mental health. One major impetus for the formation of the SSGAC was the growing recognition that with respect to most human traits, even though the joint effects of many thousands of genetic variants can be substantial, any individual genetic variant has a very weak effect. Consequently, very large samples are required to accurately measure the effect of each particular variant. A decade ago, medical researchers began responding to a similar recognition—that most effects of individual genetic variants on complex diseases are very small—by forming research consortia in which groups collaborate by pooling results from many datasets. These efforts have borne considerable fruit, including recent findings on the genetics of autism (Gaugler et al., 2014), schizophrenia (Ripke et al., 2014), and many other diseases and conditions (Visscher et al., 2012). The SSGAC is an attempt to encourage analogous pooling among social-science geneticists. It is organized as a working group of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), a successful medical consortium.

 

The SSGAC was founded by three social scientists (Daniel Benjamin, David Cesarini, and Philipp Koellinger) who believe that genetic data could have a substantial positive impact on research in the social sciences, yet are troubled by how some work in social-science genetics is conducted and communicated. The Advisory Board for the SSGAC is composed of prominent researchers representing various disciplines: Dalton Conley (Sociology, New York University), George Davey Smith (Epidemiology, University of Bristol), Tõnu Esko (Molecular Genetics, Broad Institute and Estonian Genome Center), Albert Hofman (Epidemiology, Harvard), Robert Krueger (Psychology, University of Minnesota), David Laibson (Economics, Harvard), Sarah Medland (Statistical Genetics, QIMR Berghofer Medical Research Institute), Michelle Meyer (Bioethics, Clarkson University and Icahn School of Medicine at Mount Sinai), and Peter Visscher (Statistical Genetics, University of Queensland).

 

The SSGAC is committed to the principles of reproducibility and transparency. Prior to conducting genetic association studies, power calculations are carried out to determine the necessary sample size for the analysis (assuming realistically small effect sizes associated with individual genetic variants). These, together with an analysis plan, are posted on the Open Science Framework’s preregistration website. In many cases, publications are accompanied by a FAQ document (such as this one). The FAQ document is written to communicate to the public what was and was not found and what can and cannot be concluded from the research findings. 

 

The first major project of the SSGAC was a large-scale genome-wide association study (GWAS) on educational attainment, whose results were published in Science (Rietveld et al., 2013). The paper was accompanied by a FAQ document posted on the SSGAC website here and below. Subsequent work of the SSGAC has been published in (or is in press at) Nature, Proceedings of the National Academy of Sciences, Psychological Science, and other journals.

 

What is “subjective well-being”?

 

In a nutshell, subjective well-being is the term that social scientists use to describe human psychological well-being, which is usually self-assessed. More precisely, subjective well-being is a catch-all category that includes many specific ways of measuring psychological well-being. One facet of subjective well-being is positive affect, which refers to the emotions a person is experiencing at a particular moment in time. Typical survey questions to measure positive affect include “During the past week, I was happy” and “How would you rate your emotional well-being at present?” Another facet of subjective well-being is life satisfaction, which refers to a longer-term, higher-level evaluation of one’s life. A typical survey question would be “How satisfied are you with your life as a whole?” Positive affect and life satisfaction are different from each other but are highly correlated nevertheless.

 

In our study, we combined different survey measures of positive affect and life satisfaction. This strategy allowed us to assemble much larger samples than prior work and to maximize statistical power to discover genetic associations.

 

A drawback of our research strategy is that mixing different measures of subjective well-being makes any discovered associations more difficult to interpret. For that reason, research isolating specific, high-quality measures of the various facets of subjective well-being (as well as depressive symptoms and neuroticism) is an important next step. Our results will facilitate such work because the genetic variants that we identify can be used as candidates for follow-up studies conducted in smaller samples with fine-grained measures of subjective well-being.

 

What do you mean by “depressive symptoms” and “neuroticism”?

 

The variable we call “depressive symptoms” is closely related to depression. Depression is a psychiatric condition characterized by feelings of sadness, anxiety, low energy, bodily aches and pains, pessimism, and other symptoms. Researchers often study depression by administering questionnaires to ask subjects if they are experiencing the symptoms of depression. The researchers then divide the survey respondents into two groups: those who are depressed and those who are not.

 

Instead of dividing respondents into two groups (i.e., binary categorization into a depressed group and a non-depressed group), we created a single continuous scale/spectrum that we call “depressive symptoms.” All respondents are placed somewhere on this continuous scale, depending on their survey responses. The scale is constructed so that respondents who have more depressive symptoms have a higher scale value. We decided to study depressive symptoms rather than the binary categories “depressed/non-depressed” because using a continuous scale gives us greater statistical power. Binary categorization throws away information that has statistical value, like symptom variation within each of the binary categories.
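
The loss of information from binary categorization can be illustrated with a short simulation. This is a sketch under stated assumptions rather than a description of our data: the predictor, the strength of its relationship with symptoms (deliberately far larger than any real single-variant effect), and the 15% cutoff used to label respondents "depressed" are all hypothetical. The same predictor is typically less strongly correlated with the binary label than with the continuous scale, which is what lower statistical power means in practice.

```python
import random
import statistics


def correlation(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)  # fixed seed so the illustration is repeatable
n = 20000
true_r = 0.3  # hypothetical correlation between predictor and symptom scale

predictor = [rng.gauss(0.0, 1.0) for _ in range(n)]
symptoms = [true_r * x + (1 - true_r ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
            for x in predictor]

# Hypothetical cutoff: the 15% of respondents with the most symptoms
# are labeled "depressed" (1); everyone else is labeled 0.
cutoff = sorted(symptoms)[int(0.85 * n)]
depressed = [1 if s >= cutoff else 0 for s in symptoms]

continuous_r = correlation(predictor, symptoms)
binary_r = correlation(predictor, depressed)

print(f"correlation with continuous scale: {continuous_r:.3f}")
print(f"correlation with binary label:     {binary_r:.3f}")
```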

 

Neuroticism is a personality trait characterized by easily experiencing negative emotions such as anxiety and fear. Like other personality traits, it is usually measured by questionnaires that ask people to report about their own personality and behaviors. Here too we constructed a continuous scale that represents the degree of neuroticism.

 

As in other genetic studies of depression and neuroticism, our analysis combined data from different studies that used different surveys to measure these traits.

 

What was already known about the genetics of subjective well-being, depression, and neuroticism prior to this study?

 

Twin and family studies have found that genetic differences across individuals can lead to differences in subjective well-being, depression, and neuroticism. Such studies have also found that these three traits share some of the same genetic factors.

 

Although genetic factors in general are known to play a role in these traits, few specific genetic variants have been identified. Our study is the first genome-wide association study (GWAS) of subjective well-being. There have been a few genome-wide association studies of depression (Cai et al., 2015; de Moor et al., 2015; Ripke et al., 2013) and neuroticism (de Moor et al., 2015), but these have found fewer genetic variants, probably because the sample sizes in these studies were relatively small. Concurrently with our study, a GWAS of neuroticism using a subset of our sample reports similar findings to our neuroticism findings (Smith et al., in press).

 

What did you do in this particular study?

 

Our primary analysis is a genome-wide association study (GWAS) of subjective well-being based on a sample of 298,420 individuals. We were able to obtain this sample size by combining results from separate analyses conducted in 59 different cohorts of individuals. This analysis is one of the largest genome-wide association studies ever conducted for a behavioral trait.

 

In our study, we also conducted genome-wide association studies of depressive symptoms in a sample of 161,460 individuals and neuroticism in a sample of 170,911 individuals. For these analyses, we combined results from previously published papers (de Moor et al., 2015; Ripke et al., 2013) with new analyses of additional data.

 

We subsequently partnered with a large, ongoing study of depression (Hyde et al., under review) in a sample of roughly 368,890 individuals who are customers of the personal genomics company 23andMe. We used this new dataset to replicate the genetic associations that we reported for depressive symptoms and neuroticism.

 

In our analyses, we examined approximately 2.5 million genetic variants called single nucleotide polymorphisms, or SNPs. SNPs are the smallest and most common type of genetic variant (ways in which the genomes of people can differ), but they are not the only type of genetic variant. Another type of genetic variant is an inversion polymorphism. An inversion polymorphism is a large segment of the genome that is reversed end to end, or inverted, in some people. In our SNP data, we can sometimes statistically detect the presence of an inversion polymorphism. In some of our analyses, we examined inversion polymorphisms in addition to SNPs. Inversion polymorphisms are especially interesting because they tend to have larger effects than SNPs, and far fewer inversion polymorphisms than SNPs have been identified as associated with human traits.

 

The results of these genetic association analyses are the core scientific contribution of our paper. We conducted several additional analyses to shed some light on possible biological mechanisms underlying our findings and to explore the genetic correlations between the three phenotypes we studied and various health outcomes.

 

What did you find?

 

In our GWAS of subjective well-being (in our sample of roughly 300,000 individuals), we identified three SNPs.

 

In our GWAS of depressive symptoms (in our sample of roughly 180,000 individuals), we identified two SNPs.

 

In our GWAS of neuroticism (in our sample of roughly 170,000 individuals), we identified nine SNPs and two inversion polymorphisms.

 

In our joint analyses of the three traits, we identified two additional SNPs associated with neuroticism and two associated with both depressive symptoms and neuroticism. We also found that most of the genetic variants associated with depressive symptoms and/or neuroticism are also associated with subjective well-being, and vice-versa.

 

In our replication sample from an ongoing study of depression (of roughly 370,000 additional individuals), both of the SNPs that we found to be associated with depressive symptoms replicated. We also found that the eleven genetic variants that we found to be associated with neuroticism showed up strongly in the depression replication sample. (We did not study the SNPs that we found to be associated with subjective well-being in the replication sample because some of the individuals in the replication sample were also in the sample for the GWAS of subjective well-being. This sample overlap would have biased the analysis.)

 

The estimated effect sizes of the genetic variants are small. For subjective well-being, each SNP we identified explains only 0.01% of the variation across individuals. Each of the SNPs associated with depressive symptoms and neuroticism accounts for only 0.02% to 0.04% of the variation of these outcomes in the population. Since an inversion polymorphism affects much more of the genome than a SNP, we expected that the inversion polymorphisms we identified would have a larger effect size. We were able to estimate the effect size of one of the inversion polymorphisms that we found to be associated with neuroticism. The inversion polymorphism does in fact have a larger effect size (roughly 0.06% of the variation in neuroticism), but this effect size is still small. By way of comparison, the largest effect sizes that have been found for SNPs associated with height and BMI are 0.4% and 0.3% of the variation, respectively, an order of magnitude larger than those we found for the behavioral traits we study.

 

Our finding that individual genetic variants have very weak associations with these outcomes confirms that very large samples (such as the hundreds of thousands of individuals that we studied) are necessary to accurately detect genetic variants associated with them. Accurately identifying more genetic variants would require larger samples and/or more accurate outcome measures than we had available. Our results support the view that many more genetic variants associated with depression (and the other traits we study) will be identified when the available sample sizes become even larger (Hyman, 2014).
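
A back-of-the-envelope calculation shows why effect sizes this small call for such large samples. The sketch below is an approximation under stated assumptions, not the power calculation used for the study: it assumes the conventional genome-wide significance threshold of 5e-8 and a target of 80% power (neither figure is taken from this document), and it uses a normal approximation to the association test. The function name `required_sample_size` is our own for illustration.

```python
from statistics import NormalDist


def required_sample_size(variance_explained, alpha=5e-8, power=0.80):
    """Approximate N needed to detect one variant explaining a given share of variance.

    Normal approximation: the test statistic has mean sqrt(N * r2 / (1 - r2)),
    and the variant is detected when the statistic exceeds the two-sided
    critical value for alpha. Both defaults are assumed conventions.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = std_normal.inv_cdf(power)          # margin needed for target power
    r2 = variance_explained
    return (z_alpha + z_power) ** 2 * (1 - r2) / r2


# Shares of variance quoted in the text, written as fractions.
examples = [
    ("0.01% (subjective well-being SNPs)", 0.0001),
    ("0.04% (upper end for depressive symptoms and neuroticism SNPs)", 0.0004),
    ("0.4% (largest height SNP)", 0.004),
]
for label, r2 in examples:
    print(f"{label}: about {required_sample_size(r2):,.0f} individuals")
```

Under these assumptions the required sample size is inversely proportional to the share of variance explained, so a variant explaining 0.01% of the variation needs roughly forty times as many individuals as one explaining 0.4%.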

 

There is no contradiction between our finding that the effect sizes of individual genetic variants are small and the findings from previous work that a substantial share of the variation across individuals in subjective well-being, depression, and neuroticism can be attributed to genetic factors (e.g., some studies estimate roughly 40%). These findings taken together imply that the genetic influences on these traits result from the cumulative effects of at least thousands (probably millions) of different genetic variants, not just a few.

 

How do we know that the GWAS results are not spurious?

 

There are many potential pitfalls that can lead to spurious results in genome-wide association studies (GWAS) such as ours. We took many precautions to guard against these pitfalls.

 

One potential source of spurious results is incomplete “quality control” (QC) of the genetic data. To avoid this problem, we used state-of-the-art QC protocols from medical genetics research (Winkler et al., 2014).

 

Another potential source of spurious results is a confound known as “population stratification” (Hamer and Sirota, 2000). To illustrate, suppose we were conducting a GWAS on height. People from Northern Europe are on average taller than people from Southern Europe, and there are also small differences in how often certain genetic variants occur in Northern and Southern Europe. If we combine samples of Northern and Southern Europeans and perform a GWAS ignoring the origins of the individuals, then we would find genetic associations for these variants. However, those associations would simply reflect the fact that the variants are correlated with a population (Northern or Southern Europe) and may actually have nothing to do with height.

 

In our study we employed multiple strategies that reduce the impact of population stratification. At the outset, we restricted the study to individuals of European descent, since population stratification problems are more severe when including European-descent and non-European-descent individuals in the same sample. As is standard in GWAS on medical outcomes, we controlled for “principal components” of the genetic data in the analysis; these principal components capture the small genetic differences across populations, so controlling for them largely removes the spurious associations arising solely from these small differences.

 

After taking these steps to minimize population stratification, we conducted a number of analyses to assess how much population stratification still remained in our data. The results of these tests indicate that there is very little stratification in our estimates. 

 

We conducted additional tests to confirm that our GWAS results for subjective well-being are not driven by this remaining population stratification. To do so, we used a subset of the individuals in our data, 4,869 sibling pairs (from three of the datasets that contributed to our study). The key idea underlying our tests is to examine if differences in genetic variants across siblings are associated with differences in the siblings’ subjective well-being. If so, then these associations cannot be the result of population stratification. The reason is that full siblings (from the same genetic parents) share their ancestry entirely, and therefore differences in their genetic variants cannot be due to being from different population groups (in fact, genetic differences between siblings are random). Unfortunately, because our sample of siblings (~9,000 individuals) is much smaller than our overall GWAS sample (~300,000 individuals), our estimates of the effects of the genetic variants within the sibling pairs are much noisier than in the GWAS. However, we can test whether the GWAS results are entirely due to population stratification, because if they were, then the sibling estimates would not line up at all with the GWAS estimates. In fact, we find that the within-family estimates are more similar to the GWAS estimates in both sign and magnitude than would be expected by chance. These results imply that our GWAS results are not solely due to population stratification.
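
The logic of the sibling comparison can be sketched with simulated families. This is an illustration under made-up assumptions, not the test we ran: the two populations, their allele frequencies, the population difference in the trait, and the true effect of the variant are all hypothetical. Because each child inherits one randomly chosen allele from each parent, differences between siblings are unrelated to which population the family comes from, so the within-family estimate should land near the true effect while the pooled estimate is inflated by stratification.

```python
import random


def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x


rng = random.Random(2)  # fixed seed so the illustration is repeatable
TRUE_EFFECT = 0.1       # hypothetical effect of one copy of the variant
N_FAMILIES = 20000      # hypothetical number of sibling pairs

genotypes, traits = [], []            # one entry per individual
genotype_diffs, trait_diffs = [], []  # one entry per sibling pair

for family in range(N_FAMILIES):
    # Two hypothetical populations that differ in allele frequency
    # and in the average level of the trait.
    in_population_a = family % 2 == 0
    allele_freq = 0.6 if in_population_a else 0.4
    population_shift = 1.0 if in_population_a else 0.0

    # Each parent carries two alleles (1 = has the variant, 0 = does not).
    mother = [int(rng.random() < allele_freq) for _ in range(2)]
    father = [int(rng.random() < allele_freq) for _ in range(2)]
    family_environment = rng.gauss(0.0, 0.5)  # shared by both siblings

    sibling_genotypes, sibling_traits = [], []
    for _ in range(2):
        # Each child inherits one randomly chosen allele from each parent.
        genotype = rng.choice(mother) + rng.choice(father)
        trait = (TRUE_EFFECT * genotype + population_shift
                 + family_environment + rng.gauss(0.0, 1.0))
        sibling_genotypes.append(genotype)
        sibling_traits.append(trait)

    genotypes.extend(sibling_genotypes)
    traits.extend(sibling_traits)
    genotype_diffs.append(sibling_genotypes[0] - sibling_genotypes[1])
    trait_diffs.append(sibling_traits[0] - sibling_traits[1])

pooled_estimate = ols_slope(genotypes, traits)
within_family_estimate = ols_slope(genotype_diffs, trait_diffs)

print(f"true effect:            {TRUE_EFFECT:.3f}")
print(f"pooled estimate:        {pooled_estimate:.3f}")
print(f"within-family estimate: {within_family_estimate:.3f}")
```

With far fewer sibling pairs than in this simulation, the within-family estimate becomes much noisier, which is why the comparison with the GWAS estimates is made across many variants at once rather than variant by variant.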

 

The results of a number of the other analyses in the paper provide additional reassurance that our GWAS results are not spurious. For example, the findings from our analyses of genetic overlap between the three traits we focus on and other outcomes are similar to findings from prior studies that examined some of the same outcomes that we do.

 

What did you find in additional analyses?

 

Our genetic-association results served as a starting point for several additional analyses:

 

(i) Identifying the extent of genetic overlap between subjective well-being, depressive symptoms, and neuroticism. Because we have data on ~9 million SNPs, we can estimate the extent of genetic overlap between these traits far more precisely than prior studies that used the similarity between twins and family members (rather than using direct measurement of genetic data). Specifically, we can estimate the extent of genetic overlap by examining how strongly the SNPs associated with one of the traits are associated with the other traits. We find that the three traits are strongly genetically overlapping, with pairwise genetic correlations of roughly 0.8 in magnitude.

 

(ii) Identifying the extent of genetic overlap between our three traits—subjective well-being, depressive symptoms, and neuroticism—and other outcomes. Using similar methods, we can also estimate the extent of genetic overlap between our three traits and other outcomes that have been studied in GWAS with large samples. We examined five physical health outcomes that are known or believed to be risk factors for poor health: body mass index, ever-smoker status, coronary artery disease, fasting glucose, and triglycerides. We also examined five neuropsychiatric outcomes: Alzheimer’s disease, anxiety disorders, autism spectrum disorder, bipolar disorder, and schizophrenia.

 

We find rather weak genetic overlap with all five of the physical health phenotypes, as well as with Alzheimer’s disease and autism spectrum disorder. We find moderate genetic overlap with schizophrenia and bipolar disorder and strong genetic overlap with anxiety disorders. In fact, the genetic correlations between our three traits and anxiety disorders are similar in magnitude to the genetic correlations of our three traits with each other. This finding suggests that future studies of the genetics of anxiety disorders may benefit from analyzing anxiety disorders jointly with the three traits on which we focus.

 

(iii) Investigating biological pathways. We can draw inferences about biological pathways using methods (from bioinformatics) that synthesize the patterns of association from many SNPs across the genome. In general, these methods examine whether genes known to be involved in particular biological systems are especially likely to be associated with our three traits.

 

Using such a method, we find that across our three traits, genetic variants regulating gene expression in the central nervous system and adrenal/pancreas tissues are strongly enriched for association. The cause of the adrenal/pancreas enrichment is unclear, but we note that the adrenal glands produce several hormones, including cortisol, epinephrine, and norepinephrine, known to play important roles in the bodily regulation of mood and stress. More speculatively, some of our biological analyses aimed to pinpoint specific genes that are promising candidates for further investigation in relation to the traits we study. One of these genes is DRD2, which encodes the D2 subtype of the dopamine receptor, a target for antipsychotic drugs that is also known to play a key role in neural reward pathways. Another such gene is MAPT, which has previously been reported to be involved in neurodegenerative disorders, including Parkinson’s disease and progressive supranuclear palsy, a rare disease whose symptoms include depression and apathy.

 

What policy lessons do you draw from this study?

 

None. Any practical response—individual or policy-level—to this or similar research would be extremely premature. In this respect, our study is no different from genome-wide association studies (GWAS) of complex medical outcomes. In medical GWAS research, it is well understood that known genetic variants are not yet predictive enough of complex diseases to have significant value for assessing the risk to any given individual. Our current paper shows that most genetic effects on the outcomes we studied are even smaller and more diffuse than the genetic associations estimated with typical medical phenotypes.

 

Did you find “the genes” for subjective well-being, depression, and neuroticism?

 

No. We did not find “the genes” for the outcomes we studied. Characterizing the results this way would be misleading for several reasons.

 

First, subjective well-being, depression, and neuroticism are primarily determined by environmental factors.

 

Second, the explanatory power of each individual genetic variant that we identify is extremely small. Our results show that the genetic influences on the outcomes we study are spread across thousands, or even millions, of genetic variants, each of which matters a little bit.

 

Third, environmental factors are likely to amplify or attenuate the impact of specific genetic variants (and may affect which genetic variants are associated with well-being, depression, and neuroticism).

 

Does this study show that an individual’s level of subjective well-being (or neuroticism, or risk of being depressed) is determined at birth?

 

No. This is probably one of the most common misconceptions about genetics research. Even if it were true—and it is certainly not—that genetic factors accounted for all of the differences among individuals in subjective well-being, it would still not follow that an individual’s subjective well-being is “determined” at birth (or, more accurately, at conception). There are at least three reasons for this:

 

First, some genetic effects may operate through environmental channels. As an illustrative example, suppose that the genetic variants we identified influence how extraverted, or outgoing, an individual is. Furthermore, suppose that being more extraverted helps a person to make more friends, which in turn makes the person happier. In this example, changes to the intermediate environmental channel—number of friends—could have drastic effects on the outcome of happiness. Indeed, the genetic association might not be found at all in environments in which a person’s number of friends is less strongly related to extraversion, such as in a close-knit community where everyone knows each other.

 

Second, even if the genetic effects on well-being operated entirely through non-environmental mechanisms that are difficult to modify (such as direct influences on the neurotransmitters that operate in the brain’s reward pathways), there could still exist powerful environmental interventions that, if implemented, would change the genetic relationships. In a famous example suggested by the economist Arthur Goldberger, even if all the variation in unaided eyesight were due to genes, there could still be enormous benefits from introducing eyeglasses. Indeed, the environmental intervention of eyeglasses often counteracts 100% of the effect of genes on eyesight (Goldberger, 1979). Similarly, policies that aim to reduce differences in subjective well-being (e.g., through redistribution that makes society more egalitarian, thereby reducing differences in happiness that result from income inequality) may counteract the effects of genetic predisposition on subjective well-being.

 

Third, even if the genetic effects on subjective well-being were not altered by changes in the environment, those environmental changes themselves could still have a major impact on the subjective well-being of the population as a whole. For example, if economic progress enabled people to work fewer hours, then everyone might have more leisure, and the population as a whole might become happier. By analogy, 80%–90% of the variation across individuals in height is due to genetic factors. Yet the current generation of people is much taller than past generations—all due to changes in the environment (such as improved nutrition).

 

 

References

 

  1.    Cai, N. et al. Sparse whole-genome sequencing identifies two loci for major depressive disorder. Nature 523, 588–591 (2015).

 

  2.    De Moor, V. D. B. et al. Genome-wide association study identifies novel locus for neuroticism and shows polygenic association with Major Depressive Disorder. JAMA Psychiatry 72, 642–650 (2015).

 

  3.    Gaugler, T. et al. Most genetic risk for autism resides with common variation. Nat. Genet. 46, 881–885 (2014).

 

  4.    Goldberger, A. S. Heritability. Economica 46, 327–347 (1979).

 

  5.    Hamer, D. H., & Sirota, L. Beware the chopsticks gene. Mol. Psychiatry 5, 11–13 (2000).

 

  6.    Hyde, C. L. et al. Common genetic variants associated with major depressive disorder among individuals of European descent. Nat. Genet. Under review.

 

  7.    Hyman, S. Mental health: Depression needs large human-genetics studies. Nature 515, 189–191 (2014).

 

  8.    Rietveld, C. A. et al. GWAS of 126,559 individuals identifies genetic variants associated with educational attainment. Science 340, 1467–1471 (2013).

 

  9.    Ripke, S. et al. Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427 (2014).

 

  10.  Smith, D. J. et al. Genome-wide analysis of over 106,000 individuals identifies 9 neuroticism-associated loci. Molecular Psychiatry. In press.

 

  11.  Visscher, P. M., Brown, M. A., McCarthy, M. I. & Yang, J. Five years of GWAS discovery. Am. J. Hum. Genet. 90, 7–24 (2012).

 

  12.  Winkler, T. W. et al. Quality control and conduct of genome-wide association meta-analyses. Nat. Protoc. 9, 1192–1212 (2014).

 

 

 

 
 
 
 
 
 
 
 
 
 
 
 


FAQs about “GWAS of 126,559 individuals identifies genetic variants associated with educational attainment”

Use the quick link menu to jump to a specific question, or scroll down to read all FAQs for this publication.

 

For clarifications or additional questions, please contact: Daniel Benjamin (djbenjam@usc.edu), David Cesarini (dac12@nyu.edu), Philipp Koellinger (p.d.koellinger@vu.nl), or Peter Visscher (peter.visscher@uq.edu.au).

 

Quick Links

 

This study is the initial project of the Social Science Genetic Association Consortium (SSGAC). What is the SSGAC?

 

The SSGAC is a research infrastructure designed to stimulate dialogue and cooperation between medical researchers and social scientists. The SSGAC facilitates collaborative research that seeks to identify associations between specific genetic markers (segments of DNA) and behavioral traits, such as preferences, personality and social-science outcomes. One major impetus for the formation of the SSGAC was the growing recognition that most effects of individual genetic markers on behavioral traits are very small, and that, consequently, very large samples are required to accurately detect them. Several years ago medical researchers responded to a similar recognition—that most effects of individual genetic markers on complex diseases are very small—by forming research consortia in which groups collaborate by pooling results across many datasets. The SSGAC is an attempt to encourage analogous pooling among social-science geneticists and is organized under the auspices of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), a successful medical consortium. The SSGAC was founded by three social scientists (Daniel Benjamin, David Cesarini, and Philipp Koellinger) who are excited about the potentially transformative impact that genetic data could have on the social sciences, yet troubled by how current approaches are not bearing fruit. The Advisory Board for the SSGAC is composed of prominent researchers representing various disciplines: Dalton Conley (Sociology, New York University), George Davey-Smith (Epidemiology, University of Bristol), Albert Hofman (Epidemiology, Erasmus University), Robert Krueger (Psychology, University of Minnesota), David Laibson (Economics, Harvard), Sarah Medland (Statistical Genetics, Queensland Institute of Medical Research), Michelle Meyer (Bioethics, Harvard Law School), and Peter Visscher (Statistical Genetics, University of Queensland).

 

Why do you say current approaches are not bearing fruit?

 

An important theme in our earlier work has been to point out that most existing studies in social science genetics that report genetic associations with behavioral traits have serious methodological limitations. The extent of the problems with existing study designs is increasingly recognized by the research community. Indeed, a leading journal for the genetics of behavioral traits recently issued an editorial statement that includes the following passage (Hewitt, 2012):

 

“The literature on candidate gene associations is full of reports that have not stood up to rigorous replication. This is the case both for straightforward main effects and for candidate gene-by-environment interactions (Duncan and Keller 2011). As a result, the psychiatric and behavior genetics literature has become confusing and it now seems likely that many of the published findings of the last decade are wrong or misleading and have not contributed to real advances in knowledge. The reasons for this are complex, but include the likelihood that effect sizes of individual polymorphisms are small, that studies have therefore been underpowered, and that multiple hypotheses and methods of analysis have been explored; these conditions will result in an unacceptably high proportion of false findings (Ioannidis 2005).”

 

Despite the growing awareness of these problems in the research community, studies in social science genetics in which researchers report large genetic associations with behavioral traits continue to be published (and often receive uncritical media attention).

 

The evidence is now accumulating that many of these original studies fail to replicate (Benjamin et al. 2012; Chabris et al. 2012). In our view, one of the most important reasons why existing work has generated unreliable results is that their sample sizes were far too small, given that the true effects of individual genetic markers on behavioral traits are tiny.

 

Why is it important to know that effect sizes for behavioral traits are small?

 

The effect size, or strength of the relationship between an individual genetic variant and a behavioral trait, determines which research strategies will succeed and which will fail. Virtually all existing studies in social-science genetics use sample sizes in the range of 100 to 2,000. The tiny effect sizes for genetic variants identified in our study suggest that studies seeking to identify genetic influences on behavioral traits should include at least tens of thousands of research participants in order to generate accurate results.

 

If effect sizes are so small, why bother studying them?

 

Despite the tiny effect sizes, identifying genetic variants related to behavioral traits could be useful to social scientists for a number of reasons. Here we give two examples.

 

First, even if a genetic variant has a very small effect, identifying it may lead to insights regarding the underlying biological pathways. To take an example from medicine, genetic variants in the LMTK2 (lemur tyrosine kinase 2) gene have small effects on an individual’s predisposition to prostate cancer. Nonetheless, knowing that this gene is involved can point scientists toward studying what the gene does, which may end up teaching us something critical about the pathology of prostate cancer.

 

Second, even if an individual genetic variant has a very small effect, many genetic variants taken together may have more predictive power. In one of our analyses, we estimate that when it becomes possible to analyze data from 1 million or more individuals—which is still several years away—many genetic variants taken together will be able to capture 15% of the variation across individuals in educational attainment. This amount of predictive power is still too low to be relevant for predicting any one person’s educational attainment, but it would be useful for controlling for genetic factors when studying the effect of an education-promoting policy. When social scientists study expensive policy interventions, such as providing preschool to disadvantaged children, controlling for as many factors as possible can help generate more accurate estimates of the effectiveness of the policy.
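
The precision gain from controlling for a polygenic score can be sketched with a small calculation. This is a minimal illustration, not part of the study's analysis. It assumes a randomized experiment in which the score is unrelated to who receives the policy and explains a share R² of the variation in the outcome. Under that assumption, controlling for the score scales the residual variance by (1 − R²), which is equivalent to multiplying the sample size by 1/(1 − R²). The function names are hypothetical, and the 15% figure is the projection mentioned above.

```python
import math


def standard_error_ratio(r2: float) -> float:
    """Standard error of the estimated policy effect with the covariate,
    relative to the standard error without it (assumes randomization)."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must be in [0, 1)")
    return math.sqrt(1.0 - r2)


def effective_sample_multiplier(r2: float) -> float:
    """Factor by which the sample would have to grow to obtain the same
    precision without controlling for the covariate."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must be in [0, 1)")
    return 1.0 / (1.0 - r2)


if __name__ == "__main__":
    r2 = 0.15  # projected share of variation captured by the score
    print("standard error ratio:", standard_error_ratio(r2))
    print("effective sample multiplier:", effective_sample_multiplier(r2))
```

For an expensive intervention, a gain of this kind means the same precision could be reached with fewer participants, which is the sense in which the score would be useful as a control variable.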

 

What did you do in this particular study on educational attainment?

 

We conducted what is called a genome-wide association study (GWAS) by combining data from 54 cohorts with a total of about 125,000 individuals—a sample size about 10 times larger than any previous study of any social-scientific outcome. We study approximately 2 million genetic variants called single nucleotide polymorphisms, or SNPs. SNPs are the smallest and most common type of genetic variants (ways in which the genomes of people can differ), but they are not the only kind of genetic variation.

 

To create a harmonized measure of educational attainment across cohorts, we coded study-specific measures using the International Standard Classification of Education (ISCED) scale. This yielded two measurements of educational attainment: (1) a quantitative variable defined as an individual’s years of schooling (EduYears); and (2) a binary variable for whether or not an individual had completed college (College). We sought to only include individuals in the study who were likely to have completed their formal schooling.

 

We conducted the study in two stages, followed by a combined analysis.

 

In the “discovery phase,” we tested each of the ~2,000,000 SNPs for association with educational attainment using a sample of approximately 100,000 subjects from 42 cohorts.

 

In the “replication phase,” we sought to replicate our findings using an additional (approximately) 25,000 subjects from 12 additional cohorts that became available after the discovery phase was completed.

 

In the “combined phase,” we conducted the same GWAS analysis with the combined data from both the discovery and replication phases.

 

Finally, using the results of the combined meta-analyses of the discovery and replication cohorts, we conducted a number of additional analyses designed to thoroughly investigate the genetic variants we identified as most strongly associated with educational attainment.

 

What did you find?

 

In the discovery phase, we found three genetic variants significantly associated with educational attainment—one associated with years of education, and two with college completion.

 

In the replication phase, we found that all three significant variants were also associated with educational attainment in the 12 independent cohorts (and that the strength of the relationship between these genetic variants and educational attainment was similar between the discovery and replication phases). This replication is important because it represents a separate test of whether these three variants were truly associated with education. If they were not—if they were just false positives, or chance findings due to statistical noise in the discovery phase—it is very unlikely that all three associations would replicate in an independent sample.
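
The logic of the replication test can be illustrated with simple arithmetic. The sketch below is an illustration only: it assumes that each false positive would independently pass a replication test with probability alpha, and the value 0.05 is a hypothetical threshold chosen for the example, not the criterion used in the study.

```python
def prob_all_replicate_by_chance(n_snps: int, alpha: float) -> float:
    """Probability that n_snps independent false positives all pass a
    replication test that each passes by chance with probability alpha."""
    if n_snps < 1:
        raise ValueError("n_snps must be at least 1")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha ** n_snps


if __name__ == "__main__":
    # Three discovery-phase SNPs, hypothetical replication threshold of 0.05.
    print(prob_all_replicate_by_chance(3, 0.05))
```

Because the probabilities multiply, the chance that several spurious findings all replicate falls quickly as the number of findings grows.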

 

The observed effect sizes of these SNPs are small. For example, the size of the effect of the SNP identified in the analysis of EduYears suggests that the difference between people with 0 and 2 copies of the variant predicting more education is about 2 months of education. One way to put the figure into perspective is to compare our findings to those in the medical and anthropometric literature. It is well known that the SNP with the largest effect on human height explains about 0.4% of the variation (that is, the extent to which individuals differ) in height. By contrast, the SNP we identified explains about 0.02% of the variation in years of education, only one-twentieth of the effect size found for height.
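
To see what these effect sizes imply for study design, the sketch below uses a standard large-sample approximation: the number of participants needed to detect a SNP that explains a share R² of the variation is roughly (z_alpha + z_power)² / R². This is an illustration, not the power analysis from the paper. The significance threshold of 5e-8 (a conventional genome-wide threshold) and the 80% power target are assumptions made for the example, and the function name is hypothetical. The 0.4% and 0.02% figures come from the text above.

```python
import math
from statistics import NormalDist


def required_sample_size(r2: float, alpha: float = 5e-8, power: float = 0.80) -> int:
    """Approximate sample size needed to detect a variant explaining a
    share r2 of trait variance with a two-sided test at level alpha."""
    if not 0.0 < r2 < 1.0:
        raise ValueError("r2 must be in (0, 1)")
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1.0 - alpha / 2.0)  # significance cutoff
    z_power = standard_normal.inv_cdf(power)              # power requirement
    return math.ceil((z_alpha + z_power) ** 2 / r2)


if __name__ == "__main__":
    n_height = required_sample_size(0.004)      # largest-effect height SNP
    n_education = required_sample_size(0.0002)  # education SNP from this study
    print("height SNP:", n_height)
    print("education SNP:", n_education)
    print("ratio:", n_education / n_height)
```

In this approximation the required sample size is inversely proportional to the variance explained, so an effect one-twentieth as large calls for a sample about twenty times as large.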

 

In the combined phase, we found seven additional significant SNPs (three for college completion and four for years of education). Of these seven, three are physically very close in the genome to the replicated SNPs and probably represent the same underlying genetic effect. The remaining four are in other places in the genome and warrant replication attempts in further research.

 

We next used all the genetic data, including the significant and non-significant SNPs, to create a “polygenic score” that represents our best prediction of each person’s educational attainment based on their SNPs. We found that using all the genetic data in this way, we were able to explain a little over 2% of the variation between people in how many years of education they attained.
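
As a rough illustration of the idea, a polygenic score can be thought of as a weighted sum of a person's allele counts (0, 1, or 2 copies at each SNP), with weights taken from the GWAS estimates. The sketch below is not the procedure used in the paper. The weights, genotypes, and outcomes are made-up values, and the function names are hypothetical. The share of variation explained is computed as the squared correlation between the score and the outcome.

```python
from typing import Sequence


def polygenic_score(allele_counts: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted sum of allele counts (0, 1, or 2 copies per SNP)."""
    if len(allele_counts) != len(weights):
        raise ValueError("allele_counts and weights must have the same length")
    return sum(count * weight for count, weight in zip(allele_counts, weights))


def variance_explained(scores: Sequence[float], outcomes: Sequence[float]) -> float:
    """Squared Pearson correlation between scores and outcomes."""
    n = len(scores)
    if n != len(outcomes) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_s = sum(scores) / n
    mean_y = sum(outcomes) / n
    cov = sum((s - mean_s) * (y - mean_y) for s, y in zip(scores, outcomes))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_y = sum((y - mean_y) ** 2 for y in outcomes)
    if var_s == 0 or var_y == 0:
        raise ValueError("scores and outcomes must both vary")
    return cov * cov / (var_s * var_y)


if __name__ == "__main__":
    weights = [0.1, 0.3, -0.2]  # hypothetical per-SNP weights
    genotypes = [[2, 1, 0], [0, 0, 2], [1, 2, 1], [2, 2, 0]]  # made-up people
    years_of_schooling = [16.0, 12.0, 14.0, 18.0]  # made-up outcomes
    scores = [polygenic_score(person, weights) for person in genotypes]
    print("scores:", scores)
    print("share of variation explained:", variance_explained(scores, years_of_schooling))
```

With real data, the weights would be estimated in one set of cohorts and the score evaluated in a separate sample, so that the share of variation explained is not overstated.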

 

To explore one of the many possible channels through which the SNPs may be impacting educational attainment, we examined whether the same polygenic score predicted cognitive function in a sample of Swedish military conscripts. (This analysis was done with just one of the cohorts, from Sweden, because that was the only one for which we had cognitive test data from all the participants.) We found that the score explained about 3% of the variance in cognitive function.

 

What does this study reveal about the biological pathways potentially affecting educational attainment?

 

Before explaining what we believe we have learned, it is important to understand three key limitations to inferring underlying biological influences from the current study.

 

First, educational attainment, like every complex behavior, is influenced by a large number of both genetic and environmental factors, and this study focuses on only a tiny piece of the puzzle. This study examines common SNPs, only one of many forms of genetic differences between individuals, and does not include specific measures of environmental factors. Therefore, it is likely that there are additional sources of genetic variation that remain to be discovered, as well as effects of SNPs that may differ based on environmental conditions (such as access to formal schooling). These other genetic effects, environmental effects, and interactions between genetic and environmental factors, are not discoverable from the current approach.

 

Second, the current study is limited to identifying regions of the genome—sometimes very large ones that include many different genes—that are statistically correlated with educational attainment; we cannot say for sure which specific variants or genes, if any, actually cause differences in educational attainment, or the mechanisms through which they may act. Even if we knew the true “causal” variants, their direct effects would likely be on health, cognitive function, or personality, which only have a downstream (i.e., eventual and indirect) impact on an individual’s likelihood or ability to remain in a formal schooling environment.

 

Third, only the existing scientific literature and bioinformatics tools have been used to prioritize plausible biological candidates in the identified genomic regions. Thus, the accuracy of our conclusions is limited by current knowledge of genetic functions. As understanding of the underlying biological function of different genes develops over time, so will our ability to interpret findings arising from studies such as this one that find statistical associations between genetic variants and human behavior.

 

For these reasons, strong conclusions about underlying biological mechanisms would be premature. We view our findings as providing a limited number of testable hypotheses for future research, providing a starting point for future studies to investigate in more detail a substantially narrowed field of likely genetic influences on educational attainment.

 

With those caveats in mind, we believe that our findings regarding biological pathways are strongly suggestive. Some identified genomic regions have previously been shown to affect cognitive functions or long-term memory in model organisms, or are predicted by bioinformatics analyses to influence neuron-related pathways (like cell differentiation, neurotransmitter signaling, and regulation of transmission of nerve impulses). In several cases, genes located in different genomic regions identified as associated with educational attainment appear to function as part of the same biological pathways. This convergence of evidence suggests that our findings are consistent with each other, which increases our confidence that the results may be accurate (rather than spurious findings due to chance). Nonetheless, additional research, including experimental and molecular methods, will be required to investigate which, if any, of these candidate biological pathways have real, causal influences on behaviors related to educational outcomes.

 

What are the contributions of your study?

 

In addition to the most direct contribution—identifying several SNPs associated with educational attainment—we believe that our study also makes a number of other contributions. Here we list two.

 

First, it provides a methodological template that social-science genetics scholars could follow in future work. Genetic-association studies of behavioral traits have so far focused mostly on outcomes such as cognitive function and personality and have so far failed to document many associations that replicate consistently. The GWAS conducted to date have not found any genetic variants that are reliably associated with these phenotypes. One common view is that the appropriate response to the null findings in such studies is for researchers to gather better measures of the phenotypes and their facets in more environmentally homogenous samples—for example, by giving more in-depth personality tests to people who are all members of an isolated population. Our findings of replicable genetic associations for educational attainment demonstrate the feasibility of a complementary approach: identify an outcome variable that is measured with less precision, and therefore theoretically less directly connected to underlying genetic influences, but is available in a much larger GWAS sample and can be measured at a fraction of the cost. In our study, this corresponds to measuring educational attainment, which requires just a couple of survey questions, rather than giving a lengthy battery of cognitive function tests, which may require much more time or even expert interviewers to interact with each participant. The SNPs we identify could be used in follow-up research that directly studies precursors to educational attainment, such as cognitive function or personality traits such as perseverance.

 

Second, our study provides new evidence about what effect sizes can be expected for associations between individual genetic variants and complex behavioral traits. These effect-size estimates—which are roughly one-twentieth as large as those found for human height, a complex physical trait—will be useful for determining what sample sizes should be used in social-science genetics research. The estimates are also useful for assessing how much to trust existing reports of genetic associations from smaller samples.

 

These and other contributions—such as our exploration of potential biological pathways that might underlie the associations we observe—are likely to trigger follow-up research in various disciplines (such as the social sciences, epidemiology, and biology).

 

What policy lessons do you draw from this study?

 

None. Any practical response (genetic or environmental, individual or policy-level) to this or similar research would be extremely premature. In this respect, our study is no different from most GWAS of complex medical outcomes. In medical GWAS research, it is well understood that the known genetic variants are not yet predictive enough of complex diseases to be useful for assessing the risk to any given individual. Our paper shows that most genetic effects on behavioral traits are probably even smaller and more diffuse.

 

Did you find “the gene” for educational attainment?

 

No. We did not find “the gene” for educational attainment, cognitive function—or anything else. Educational attainment, like most complex behaviors and outcomes, is influenced by myriad genes, each with effects that are likely to be tiny (as well as a huge host of environmental factors).

 

Does this study show that an individual’s level of education is determined at birth?

 

No, and this is probably one of the most common misconceptions about genetic research. Even if it were true—and it is certainly not—that genetic factors accounted for all of the variation across individuals in educational attainment, it would not follow that an individual’s education is “determined” at birth. There are two distinct reasons for this.

 

First, some genetic effects may operate through environmental channels. For example, consider body mass index (BMI). Genetic factors may impact a person’s BMI indirectly through genetic influences on food preferences, which in turn impact caloric intake and thus BMI. In this case, changes to the intermediate environmental channels can have drastic effects on the outcome. For example, lack of access to certain foods (or higher taxes on those foods) could cause substantial differences between an individual’s genetically-predicted “propensity” and actual observed outcome in BMI. Similarly, genetic variants that are associated with educational attainment under current environmental conditions may no longer be associated if environmental conditions or policies change.

 

Second, even if the genetic effects operated entirely through non-environmental mechanisms that are difficult to modify, there could still exist powerful environmental interventions that do not contribute to differences across individuals in the current population. In a famous example attributable to the economist Arthur Goldberger, even if all the variation in eyesight were due to genes, there could still be enormous benefits from introducing eyeglasses. Similarly, policies such as a required minimum number of years of education and help for individuals with learning disabilities can have a drastic impact on educational outcomes for individuals who otherwise may be less likely to participate in formal schooling.

 

We found a handful of SNPs associated with educational attainment, each of which we estimate to have only a small effect on that outcome. Even if we had found that a single gene has a very large effect on educational attainment, that finding would be perfectly consistent with the possibility that environmental factors or interventions can modify or even cancel out this influence; traits that have a genetic component, even a large genetic component, are not necessarily immutable. For instance, the metabolic disorder phenylketonuria (PKU) is caused by gene mutations that prevent the carrier from metabolizing phenylalanine, an amino acid. Without environmental intervention, PKU leads to intellectual disability and other serious medical problems. But through early genetic detection and an environmental intervention (namely, maintaining a special diet free of phenylalanine, monitoring of protein levels, and daily medication), individuals with PKU can lead lives with normal cognitive function and life expectancy.

 

How are your findings relevant for health?

 

There is a well-known relationship between educational attainment and health outcomes, and this connection was one important motivation for our study. Indeed, some of the genetic variants we identify may be associated with education because of their effects on health (which, in turn, could impact education). Some of the genes identified in our analyses have been previously implicated in studies of common diseases such as inflammatory bowel disease and rheumatoid arthritis. Our analyses also identified genomic regions that have been linked to brain and central nervous system development in non-human animals, including mice and zebrafish. This suggests that the regions we identify as associated with educational attainment are promising candidates for future exploration, as they may turn out to play a role in both health and cognition.

 

The “polygenic score” that we estimate—which aggregates the effects of many individual genetic variants—may also prove useful for health researchers. Indeed, several groups of medical researchers have already expressed interest in examining the predictive utility of the score for the health conditions that they study (including dyslexia and psychiatric disorders). By making the results of our analyses publicly available on www.ssgac.org, we hope that other researchers will make use of our findings to develop and test hypotheses for how genetic factors may jointly influence educational attainment and related health and cognitive outcomes.

 

Could this kind of research lead to discrimination against, or stigmatization of, people with the relevant genotypes, and doesn’t that make the research ethically suspect?

 

It is always possible that research results may be used inappropriately by others, either willfully or due to misinterpretation. Those of us working in this area have an obligation to be vigilant about our use of language that may be susceptible to misinterpretation, and to communicate clearly not only potential benefits of the research that motivate us to do the work, but also its limitations. For a variety of reasons, in general we do not think that the best response to the possibility that useful knowledge might be misused is to refrain from producing the knowledge.

 

Behavioral genetics research, including studies of the relationships between genes and a variety of cognitive traits, is already being conducted and will continue to be conducted. Common misunderstandings of the results of genetic research are well documented, yet careless discussions of results and publication bias remain problems within the research community across many disciplines and fields. In this context, responsible researchers who are committed to developing, implementing, and spreading best practices for conducting and communicating potentially controversial research, including behavioral genetics research, should participate in the development of this body of knowledge, rather than abstaining from it and hoping for the best.

 

The results of behavioral genetics research can be used in positive, as well as negative, ways. One of the benefits of properly conducted behavioral genetics research is that it has made clear the limits of deterministic views of complex traits by establishing accurate upper bounds for effect sizes and prediction accuracy—thus perhaps making discrimination and stigmatization less likely in the future. Existing claims of genetic associations with complex social-science traits have reported widely varying effect sizes—most of them purporting to explain more than one hundred times as much variance as did the genetic variants we found in this study. Our evidence, from a much larger sample, suggests instead that individual SNPs associated with educational attainment have about one-twentieth as large effects as do SNPs for complex physical traits that are also influenced by many separate genes, such as height.

 

More ambitiously and longer-term—but quite plausibly—learning more about the genetic influences on educational attainment (or cognition) may lead to effective environmental interventions, as it did in the case of PKU (see “Does this study show that an individual’s level of education is determined at birth?”). As noted above (see “If effect sizes are so small, why bother studying them?”), in order to study effective policies to reduce gaps in educational attainment or similar disparities, it is helpful to account for any genetic effects on those outcomes. In addition, identifying genetic variants that contribute to differences in educational attainment may lead to insights regarding the biological pathways underlying that outcome, and may provide a firm foundation for research on interactions between genetic and environmental influences. 

 

In order to realize these benefits, however, behavioral genetics research must be carefully and responsibly conducted and communicated. Responsible behavioral genetics research, in our view, includes sound methodology and analysis of data; a commitment to publish all results, including any negative results; and transparent, complete reporting of methodology and findings in publications, presentations, and communications with the media and the public, including particular vigilance regarding what the results do—and do not—show (hence, these FAQs).

 

In summary, we agree with the Nuffield Council on Bioethics, which concluded in a 2002 report on behavioral genetics research, including research on cognitive function, that “research in behavioural genetics has the potential to advance our understanding of human behaviour and that the research can therefore be justified,” and that “researchers and those who report research have a duty to communicate findings in a responsible manner.”

 

References

 

Benjamin, Daniel J., David Cesarini, Christopher F. Chabris, Edward L. Glaeser, David I. Laibson, Vilmundur Guðnason, Tamara B. Harris, Lenore J. Launer, Shaun Purcell, Albert Vernon Smith, Magnus Johannesson, Patrik K.E. Magnusson, Jonathan P. Beauchamp, Nicholas A. Christakis, Craig S. Atwood, Benjamin Hebert, Jeremy Freese, Robert M. Hauser, Taissa S. Hauser, Alexander Grankvist, Christina M. Hultman, and Paul Lichtenstein (2012). “The Promises and Pitfalls of Genoeconomics.” Annual Review of Economics, 4, 627-662.

 

Chabris, Christopher F., Benjamin M. Hebert, Daniel J. Benjamin, Jonathan P. Beauchamp, David Cesarini, Matthijs J.H.M. van der Loos, Magnus Johannesson, Patrik K.E. Magnusson, Paul Lichtenstein, Craig S. Atwood, Jeremy Freese, Taissa S. Hauser, Robert M. Hauser, Nicholas A. Christakis, and David Laibson (2012). “Most Published Genetic Associations with General Intelligence Are Probably False Positives.” Psychological Science. doi:10.1177/0956797611435528

 
 
 
 
 
 
 
 
 
 
 
 
 


FAQs about “Common Genetic Variants Associated with Cognitive Performance Identified Using Proxy-Phenotype Method”

 


Use the quick link menu to jump to a specific question, or scroll down to read all FAQs for this publication. 

 

This document was prepared by several of the co-authors of the paper and board members of the Social Science Genetic Association Consortium. For clarifications or additional questions, please contact: Daniel Benjamin (djbenjam@usc.edu).

 

Quick Links

 

This study is the outcome of a collaboration between the Social Science Genetic Association Consortium (SSGAC) and the Childhood Intelligence Consortium (CHIC). What is the SSGAC, and what is the CHIC?

 

The SSGAC is a research infrastructure designed to stimulate dialogue and cooperation between medical researchers and social scientists. The SSGAC facilitates collaborative research that seeks to identify associations between specific genetic markers (segments of DNA) and social science variables, such as behavior, preferences, and personality. One major impetus for the formation of the SSGAC was the growing recognition that most effects of individual genetic markers on behavioral traits are extremely small, and that, consequently, very large samples are required to accurately measure them. Several years ago medical researchers responded to a similar recognition—that most effects of individual genetic markers on complex diseases are very small—by forming research consortia in which groups collaborate by pooling results across many datasets. These efforts have borne considerable fruit, including recent findings on the genetics of autism (Gaugler et al., 2014) and schizophrenia (Ripke et al., 2014). The SSGAC is an attempt to encourage analogous pooling among social-science geneticists and is organized under the auspices of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), a successful medical consortium. The SSGAC was founded by three social scientists (Daniel Benjamin, David Cesarini, and Philipp Koellinger) who are excited about the potentially transformative impact that genetic data could have on the social sciences. 
The Advisory Board for the SSGAC is composed of prominent researchers representing various disciplines: Dalton Conley (Sociology, New York University), George Davey Smith (Epidemiology, University of Bristol), Tõnu Esko (Molecular Genetics, Broad Institute and Estonian Genome Center), Albert Hofman (Epidemiology, Erasmus University), Robert Krueger (Psychology, University of Minnesota), David Laibson (Economics, Harvard), Sarah Medland (Statistical Genetics, Queensland Institute of Medical Research), Michelle Meyer (Bioethics, Union Graduate College and Icahn School of Medicine at Mount Sinai), and Peter Visscher (Statistical Genetics, University of Queensland). The pilot project of the SSGAC was a large-scale GWAS on educational attainment whose results were published in Science (Rietveld et al., 2013).

 

CHIC was established to combine the efforts of several research centers to elucidate the genetic and environmental bases of differences in cognitive performance among children. The results of the first project of CHIC were published in Molecular Psychiatry (Benyamin et al., 2014) and are based on data from six discovery (N = 12,441) and three replication (N = 5,548) cohorts with a total sample size of 17,989 children of European ancestry for whom genome-wide single nucleotide polymorphism (SNP) genotypes and cognitive performance scores were available. Prior to the current study published in Proceedings of the National Academy of Sciences, the CHIC initiative was the largest study that aimed to identify genetic variants associated with cognitive performance.

 

The current project is a collaboration between the SSGAC and the CHIC. We combined our complementary resources and expertise, making the current project the largest effort to investigate genetic associations with cognitive performance to date.

 

What is “cognitive performance”?

 

We use the term cognitive performance to refer to the fact that, even though different types of cognitive tests sometimes differ from each other in important ways, people who perform well on one type of cognitive test also tend on average to perform well on other types of cognitive tests. To measure cognitive performance, it is standard practice to summarize how well participants did on a range of cognitive tests with a statistical technique called “principal component analysis.” This technique transforms the observed performance on many tasks into a single score that accounts for as much of the variability in the data as possible. This score is our measure of cognitive performance.
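
As a rough illustration of the idea described above, the following sketch extracts a first principal component from simulated test scores. Everything in it is an assumption made for illustration: the data are made up, the function names are hypothetical, and power iteration on the correlation matrix stands in for whatever statistical software the participating cohorts actually used.

```python
import math
import random


def standardize(column):
    """Rescale one test's scores to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / sd for x in column]


def first_principal_component(scores, iterations=500):
    """Return (loadings, per-person score) for the first principal component.

    scores: list of rows, one row per person, one column per cognitive test.
    Power iteration on the correlation matrix is used here as a simple
    stand-in for a library PCA routine.
    """
    n_people, n_tests = len(scores), len(scores[0])
    cols = [standardize([row[j] for row in scores]) for j in range(n_tests)]
    # Correlation matrix between the tests.
    corr = [[sum(a * b for a, b in zip(cols[i], cols[j])) / (n_people - 1)
             for j in range(n_tests)] for i in range(n_tests)]
    v = [1.0 / math.sqrt(n_tests)] * n_tests
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(n_tests))
             for i in range(n_tests)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each person's single summary score: a weighted sum of standardized tests.
    component = [sum(v[j] * cols[j][p] for j in range(n_tests))
                 for p in range(n_people)]
    return v, component


# Simulated (made-up) data: 200 people, 4 tests that share one common factor.
rng = random.Random(0)
people = []
for _ in range(200):
    common = rng.gauss(0, 1)
    people.append([common + rng.gauss(0, 1) for _ in range(4)])

loadings, performance = first_principal_component(people)
```

In this simulation every test loads positively on the component, which mirrors the observation that people who do well on one test tend to do well on the others.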

 

We use the term “cognitive performance” rather than the more traditional terms “cognitive ability” or “general intelligence” because it does not prejudge the sources or causes of differences among people, which could be explained by genetic factors, environmental factors, their interactions, or all of these. Performance (as opposed to ability or intelligence) is a neutral term intended to capture only the fact that there are stable, systematic differences in how people perform on cognitive tests, without making any claims about whether they are (a) innate or acquired and (b) fixed or malleable. As the distinctness of these two sets of questions suggests, it is entirely possible that traits that are significantly innate, because they are under genetic influence, are also quite malleable. For example, eyesight is very strongly influenced by genes, but through the environmental interventions of eyeglasses, contact lenses, and surgery, that trait is also highly malleable.

 

Furthermore, although “cognitive performance” is a key construct for the social sciences and medicine, it is not the only aspect of cognition that matters. There are many other cognitive skills, some of which may be independent of “cognitive performance” as our study measures it, that have considerable importance in different settings and for different outcomes. For example, in some fields creativity and social skills may contribute to success to an equal or even greater extent than cognitive performance.

 

What was already known about the genetics of cognitive performance prior to this study?

 

Decades of twin and family studies suggest that cognitive performance is influenced by genes (1). For example, these studies have consistently found that identical twins are far more similar to each other in their cognitive performance than fraternal twins are (2). 

 

However, despite considerable interest and effort, research to date has largely failed to identify specific common genetic variants (DNA sequence variations that occur commonly within a population as opposed to “rare variants”) associated with cognitive performance in the normal range. “Candidate gene studies,” which investigate a small set of genes that are believed to have well-understood functions and potential relevance for cognition, have led to many published association results that have not consistently replicated (3). Other research approaches that investigated a large set of common genetic variants using “genome-wide association studies (GWAS)” on cognitive performance have not found replicable results either (4, 5), with the exception of variants in the gene APOE, which significantly predict cognitive decline in older individuals (6–8).

 

What did you do in this particular study on cognitive performance?

 

Our current paper builds on the insights gained by previous research efforts which suggested that the effect sizes of any common genetic variant on cognitive performance are likely to be so small that they cannot be detected using existing data and research approaches (4, 5). In contrast to these earlier efforts, we applied an alternative, two-stage research strategy, which we call the proxy-phenotype method. In the first stage, we conducted a genome-wide association study on a “proxy phenotype” (in this case, educational attainment) that is correlated with the phenotype of interest (in this case, cognitive performance) and is well measured in a much larger sample of respondents. This first stage identified a relatively small set of SNPs (a type of common genetic variant) that are associated with educational attainment. In the second stage, these SNPs serve as empirically plausible candidates that we tested in independent samples for association with cognitive performance. This two-stage approach has more statistical power than previously employed research methods, allowing us to identify robust results that are likely to replicate. (For further details on the proxy-phenotype method, see “How is your approach—the “proxy-phenotype method”—different from prior work?” below.)

 

What did you find?

 

First, we found three common genetic variants that are robustly associated with cognitive performance. These associations remained statistically significant even when we took account of the fact that we tested many more SNPs than these three. The effect of these variants was extremely small, amounting to approximately 0.3 points per copy of each variant on the familiar IQ scale (which has a mean of 100 and standard deviation of 15). According to our analysis of these three genetic variants, the largest possible difference between two individuals would occur if the individuals were identical except that one of them possessed two copies of each of the three variants associated with greater cognitive performance and the other possessed no copies of any of the three variants. In this extreme (and rare) case, the former individual would be expected to score about 1.8 points higher on an IQ test. At least 98% of people will not possess the full six copies of the “positive” variants; indeed, the typical person would possess three, and, as a result of these three SNPs, others would differ from that typical person by no more than 0.9 IQ points in either direction.
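
The arithmetic behind these figures can be sketched as follows. The sketch assumes, as a simplification, that each copy of each of the three variants adds the same approximate 0.3 IQ points and that the effects simply add up; the function and constant names are hypothetical.

```python
PER_COPY_EFFECT_IQ = 0.3  # approximate effect per copy, taken from the text
N_VARIANTS = 3            # variants robustly associated with cognitive performance


def expected_iq_shift(copies_per_variant):
    """Expected IQ-point shift relative to a person carrying zero copies.

    copies_per_variant: one entry (0, 1, or 2) per variant.
    Assumes equal, purely additive effects (a simplification).
    """
    return PER_COPY_EFFECT_IQ * sum(copies_per_variant)


# Largest possible gap: six copies versus none (about 1.8 points).
max_gap = expected_iq_shift([2] * N_VARIANTS) - expected_iq_shift([0] * N_VARIANTS)

# A typical person carries three copies in total (about 0.9 points above zero
# copies), so nobody can be more than about 0.9 points from that typical person.
typical = expected_iq_shift([1] * N_VARIANTS)
max_distance_from_typical = max(max_gap - typical, typical)
```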

 

Second, in an independent sample of older Americans (N = 8,652), we found that a polygenic score derived from 60 of the education-associated SNPs from stage 1 is associated with memory and absence of dementia, which illustrates the potential health relevance of our findings. (A polygenic score is an index that combines many different SNPs to create an overall measure that predicts variation in some behavior or trait of interest. In this case, we found that when considered as a group, the education-associated SNPs were also able to predict memory and cognitive health.)
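
To make the notion of a polygenic score concrete, here is a minimal sketch of one common way to build such an index: a weighted sum of the number of copies of each variant a person carries. The SNP names, weights, and genotype below are invented for illustration, and the exact weighting used in the paper is not described in this FAQ.

```python
def polygenic_score(genotype, weights):
    """Weighted sum of allele counts.

    genotype: dict mapping SNP name to the number of copies carried (0, 1, or 2).
    weights:  dict mapping SNP name to a per-copy weight, for example an
              effect estimate from a first-stage GWAS (an assumption here).
    """
    return sum(weights[snp] * genotype[snp] for snp in weights)


# Hypothetical weights and one hypothetical person's genotype.
example_weights = {"snp_a": 0.02, "snp_b": -0.01, "snp_c": 0.015}
example_genotype = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

score = polygenic_score(example_genotype, example_weights)
```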

 

Third, we conducted a range of bioinformatics analyses to gain further insights into the biological mechanisms of cognitive performance that are implicated by our genetic association results. We stress that the accuracy of our conclusions is limited by current knowledge of the neurological functions of genes. As understanding of the underlying biological function of different genes develops over time, so will our ability to interpret findings arising from studies such as this one. Thus, strong conclusions about underlying biological mechanisms would be premature. With this caveat in mind, we nevertheless believe that our findings regarding biological pathways are intriguing. The convergence of all of these analyses suggests that four specific genes (KCNMA1, NRXN1, POU2F3, SCRT1) are involved in differences between people in cognitive performance. All four of these genes happen to be associated with a single neurotransmitter pathway involved in synaptic plasticity, which is the brain’s main mechanism for learning and memory.

 

Cognitive performance is a complex phenomenon that is influenced by a large number of both genetic and environmental factors, and our study focuses on only a tiny piece of the puzzle. We only examine a small set of common genetic variants, we consider only one of many forms of genetic difference among individuals, and we do not include specific measures of environmental factors. There are additional sources of genetic variation that remain to be discovered. It is also likely that the effects of variants (both those we found and those remaining to be discovered) will differ based on environmental conditions (for example, a variant associated with greater cognitive performance may only have that effect in the context of schooling of sufficient quality). These other genetic effects, environmental effects, and interactions between genetic and environmental factors are important topics of active research.

 

How is your approach—the “proxy-phenotype method”—different from prior work?

 

Prior work has typically relied on one of two research strategies. The first is the candidate-gene study, in which researchers test a small number of genetic variants for association with the phenotype of interest, typically based on hypotheses derived from the known biological functions of the candidate genes. The candidate-gene associations that have been reported with cognitive performance (9), however, fail to replicate when larger samples are used (3). This suggests that the candidate-gene approach is susceptible to false positive results that can mislead researchers and misdirect research efforts. The second strategy is the genome-wide association study (GWAS), in which researchers test millions of single-nucleotide polymorphisms (SNPs) for association with the phenotype, without regard for whether these SNPs might be biologically related to the phenotype, and then apply an extremely stringent threshold for determining statistical significance in order to reduce the likelihood of false positives. For physical and medical phenotypes, GWAS methods have identified many novel associations that do in fact replicate (10). GWAS methods on cognitive performance, however, have not yet identified any genome-wide-significant associations (4, 5).

 

The proxy-phenotype method that we applied in our study is an alternative that combines features of both of these previous approaches. In the first stage, we conduct a GWAS on a “proxy phenotype” to identify a relatively small set of 69 SNPs that are associated with the proxy phenotype. In the second stage, these 69 SNPs serve as candidates that are tested in independent samples for association with the phenotype of interest, at a significance threshold corrected for the number of proxy-associated SNPs. Hence, no additional data collection was required for our study (we leveraged previous GWAS efforts on educational attainment and cognitive performance).
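
The second stage described above can be sketched as a simple filter. This is an illustration under stated assumptions, not the paper's analysis code: it assumes a Bonferroni-style correction at an overall level of 0.05, and the SNP names and p-values are invented.

```python
def proxy_phenotype_hits(stage2_pvalues, alpha=0.05):
    """Return (threshold, hits) for the second stage.

    stage2_pvalues: dict mapping each candidate SNP from stage 1 to its
                    p-value for association with the phenotype of interest.
    The threshold is alpha divided by the number of candidates tested
    (Bonferroni correction, assumed here for illustration).
    """
    threshold = alpha / len(stage2_pvalues)
    hits = sorted(snp for snp, p in stage2_pvalues.items() if p < threshold)
    return threshold, hits


# Hypothetical stage-2 results for 69 candidate SNPs.
pvalues = {f"candidate_{i}": 0.5 for i in range(65)}
pvalues["near_miss"] = 1e-3
pvalues.update({"hit_1": 1e-4, "hit_2": 3e-4, "hit_3": 6e-4})

threshold, hits = proxy_phenotype_hits(pvalues)
```

With 69 candidates the threshold is 0.05 / 69, or roughly 0.0007. That is far less stringent than the conventional genome-wide threshold of 5e-8, which is one source of the method's additional statistical power.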

 

As far as we are aware, prior work (with the exception of our own previous study on educational attainment (11)) has not laid out the statistical framework and conducted calculations to evaluate whether and when identifying candidates via a proxy phenotype gives a study more statistical power (i.e., makes a study more likely to find true positive results) than would a direct GWAS-style analysis of the phenotype of interest. We do this in the online supplement to the present paper. Conducting such calculations is crucial for understanding why our analysis in this paper succeeds in identifying novel associations with cognitive performance even though the previous GWAS papers have not.

 

Although related ideas have been discussed before (12), the proxy-phenotype method has not generally been applied as an alternative to the GWAS and candidate-gene approaches. We think its application could prove fruitful when the main phenotype of interest is not available (yet) in large enough samples for GWAS, but a correlated phenotype is available in large samples. For example, in addition to educational attainment, GWAS results from the widely-available phenotypes of height and body-mass index (BMI) could provide proxies for discovering genes associated with conditions that are correlated with those phenotypes, thereby speeding progress on significant behavioral and medical conditions.

 

We caution, however, that the proxy-phenotype method (like theory-based candidate-SNP approaches) could generate an unacceptably high rate of false positives if it were applied without high statistical power (i.e., large enough samples) and if results were reported selectively (i.e., publishing successful discoveries of association with the ultimate phenotype and not failures). To avoid this, we propose a set of “best practices” that proxy-phenotype studies should follow: researchers should (a) conduct power calculations in advance to justify the use of the proxy-phenotype method for a particular phenotype of interest and report these calculations in their papers; (b) circulate a data analysis plan to all participating cohorts prior to conducting any analysis and register the plan in a public repository; (c) commit to publishing all findings from the study, including null results; and (d) conduct Bayesian calculations that provide an estimate of how likely the results are to be false positives.
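
One simple form that a Bayesian calculation of the kind mentioned in (d) can take is an application of Bayes' rule to a single significant result. The sketch below is illustrative only: the function name and all input values are hypothetical, and the calculations in the paper's supplement may differ.

```python
def prob_true_given_significant(prior, power, alpha):
    """Posterior probability that a significant association is real.

    prior: probability, before seeing the data, that the SNP is truly associated.
    power: probability of reaching significance if the association is real.
    alpha: probability of reaching significance if it is not (the threshold).
    """
    true_positive = power * prior
    false_positive = alpha * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


# Hypothetical inputs: a modest prior, moderate power, and a threshold
# corrected for 69 candidate SNPs.
posterior = prob_true_given_significant(prior=0.05, power=0.6, alpha=0.05 / 69)
prob_false_positive = 1.0 - posterior
```

The formula shows why power matters: holding the threshold fixed, lower power or a lower prior both raise the probability that a significant result is a false positive.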

 

Why is it important to know that genetic effect sizes for traits like cognitive function are extremely small?

 

The effect size of an individual genetic variant on a behavioral trait determines which research strategies will succeed and which will fail. The tiny effect sizes for genetic variants identified in our study suggest that studies seeking to identify genetic influences on behavioral traits should include at least tens of thousands of research participants in order to generate accurate results. The effect sizes here are so tiny that, given the size of our sample, we could only identify them with the two-stage approach we employed.

 

The small effects of individual genetic variants also imply that the genetic influences on cognitive performance result from the cumulative effects of hundreds or thousands of different genetic variants, not just a few. When a variant has a small effect size, therefore, its presence or absence has practically no relevance at all for predicting any particular individual’s overall cognitive performance.

 

If effect sizes are so small, why bother studying them?

 

Despite their tiny effect sizes, identifying genetic variants related to behavioral traits could be useful for a number of reasons. Here are two examples:

 

First, even if a genetic variant has a very small effect, identifying it may lead to insights regarding the biological pathways involved in the phenotype. For example, one very successful cholesterol-lowering drug works by targeting a biological mechanism that was identified in a GWAS on circulating lipid levels, even though the effect sizes in the GWAS were small (13, 14). It is possible that studies on the genetics of cognitive performance may eventually shed light on biological mechanisms underlying memory loss and dementia.

 

Second, even if an individual genetic variant has a very small effect, many genetic variants taken together may explain a large share of the variance in a trait. For example, estimates from previous research suggest that, if extremely large samples were available to conduct genome-wide association studies on cognitive performance, as much as 29% of the differences among people in cognitive performance might be captured by all currently known common genetic variants, considered as a group (15). This amount of stat