FAQs

FAQs about “Polygenic prediction within and between families from a 3-million-person GWAS of educational attainment”

Use the quick link menu to jump to a specific question, or scroll down to read all FAQs for this publication.

 

This document provides information about the study: 

Okbay et al. (2022). “Polygenic prediction within and between families from a 3-million-person GWAS of educational attainment.” Nature Genetics.

The document was written by Daniel Benjamin, David Laibson, Michelle N. Meyer, and Patrick Turley. It draws from and builds on the FAQs for earlier SSGAC papers. It has the following sections: 

  1. Background 

  2. Study design and results

  3. Social and ethical implications of the study 

  4. Appendices

 

For clarifications or additional questions, please contact Daniel Benjamin (daniel.benjamin@gmail.com).

Quick Links

1     Background

1.1       Who conducted this study? What are the group’s overarching goals?

1.2       The current study focuses on an outcome called “educational attainment.” What is educational attainment?

1.3       What is a GWAS?

1.4       Are the SNPs identified in a GWAS “causal” (i.e., would a change in the SNPs someone has, if everything else stayed the same, cause a person’s life to change)?

1.5       In what sense do the SNPs identified in a GWAS “predict” the outcome of interest? What do you mean by “effect size”?

1.6       What is a polygenic index?

1.7       Why conduct a GWAS of educational attainment?

1.8       What was already known about the relationships between genes and educational attainment prior to this study?

2     Study design and results

2.1       What did you do in this paper? How was the study designed? Why was the study designed in this way?

2.2       What are common pitfalls in GWASs? What precautions did you take against them?

2.3       What did you find in the main GWAS of educational attainment?

2.4       How predictive is the polygenic index developed in this study?

2.5       Why is the polygenic index less predictive in samples of African genetic ancestries than in samples of European genetic ancestries?

2.6       What did you find in the analysis of disease risk?

2.7       What did you find in the family-based analysis?

2.8       What did you find in the analysis of assortative mating?

2.9       What did you find in the analysis of the X chromosome?

2.10     What did you do in the “dominance GWAS” of educational attainment? What did you find?

3          Ethical and social implications of the study

3.1       Did you find “the gene for” educational attainment?

3.2       Well, then, did you find “the genes for” educational attainment?

3.3       Does this study show that an individual’s level of educational attainment (or any other outcome) is determined, or fixed, at conception? Do genes determine the choices we make and who we become?

3.4       Can the polygenic index from this paper be used to accurately predict a particular person’s educational attainment?

3.5       Can your polygenic index be used for research studies in diverse genetic ancestry populations?

3.6       Should practitioners (e.g., in education or other domains) use the results of this study to make decisions?

3.7       Could this kind of research lead to discrimination against, or stigmatization of, people with the relevant genetic variants? What has been done to help avert the potential harms of this research?

Additional reading and references

1       Background
1.1       Who conducted this study? What are the group’s overarching goals?

The authors of the study are members of the Social Science Genetic Association Consortium (SSGAC). The SSGAC is a multi-institutional, international research group that aims to identify statistically robust links between genetics and social-science-relevant outcomes. These outcomes include behavior, preferences, and personality. They are traditionally studied by social and behavioral scientists (e.g., economists, psychologists, sociologists) but are often also of interest to health and other researchers.

The SSGAC was formed in 2011 to overcome a specific set of scientific challenges. Most social-scientific outcomes are associated with thousands of genetic differences called single-nucleotide polymorphisms (SNPs, pronounced “snips”). A SNP is a place in the genome where people differ genetically from each other (see FAQ 1.3). Although the collective predictive power of thousands of SNPs can be meaningful (see FAQs 1.6 & 2.4), we now know that almost every one of these SNPs, taken on its own, has an extremely weak association with any particular social-scientific outcome. To identify specific SNPs with such weak associations, scientists must study at least hundreds of thousands of people (to separate weak signals from noise, and thereby avoid finding false positives). One promising strategy for doing this is for many investigators to pool their data into one large study. This approach has borne considerable fruit when used by medical geneticists interested in a range of medical conditions (Visscher et al., 2017). Most of these advances would not have been possible without large collaborations among research groups interested in similar questions. The SSGAC was formed by social scientists in an attempt to adopt this research model.

The SSGAC is organized as a working group of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), a successful medical consortium. (In genetics research, “cohort” is a term that means “dataset.”) The SSGAC was founded by three social scientists—Daniel Benjamin (University of California, Los Angeles), David Cesarini (New York University), and Philipp Koellinger (Vrije Universiteit Amsterdam)—who believe that studying SNPs associated with social-scientific outcomes can have substantial positive impacts across many research fields (see FAQ 1.7). 

The Advisory Board for the SSGAC is composed of prominent researchers representing various disciplines: Dalton Conley (Sociology, Princeton University), George Davey Smith (Epidemiology, University of Bristol), Tõnu Esko (Molecular Biology and Human Genetics, University of Tartu and Estonian Genome Center), Albert Hofman (Epidemiology, Harvard University), Robert Krueger (Psychology, University of Minnesota), David Laibson (Economics, Harvard University), James Lee (Psychology, University of Minnesota), Sarah Medland (Genetic Epidemiology, QIMR Berghofer Medical Research Institute), Michelle Meyer (Bioethics and Law, Geisinger Health System), and Peter Visscher (Statistical Genetics, University of Queensland).

The SSGAC is committed to the principles of reproducibility and transparency. Prior to conducting studies, power calculations are carried out to determine the necessary sample size for the analysis (assuming realistically small effect sizes associated with individual genetic variants). Whenever possible, we pre-register our analyses at OSF (formerly Open Science Framework). Major SSGAC publications are usually accompanied by a FAQ document (such as this one). The FAQ document is written to communicate what was found less tersely and technically than in the paper, as well as to emphasize what can and cannot be concluded from the research findings more broadly and how they should and shouldn’t be used. FAQ documents produced for SSGAC publications are available on the SSGAC website.

In addition to educational attainment, SSGAC-affiliated papers have studied subjective well-being, reproductive behavior, risk tolerance, and dietary intake. The SSGAC website contains a list of our research publications, including papers in Science, Nature, Nature Genetics, Nature Human Behaviour, Proceedings of the National Academy of Sciences, Psychological Science, and Molecular Psychiatry.

1.2       The current study focuses on an outcome called “educational attainment.” What is educational attainment?

Educational attainment is the number of years of formal education a person has completed, starting with kindergarten or its equivalent. The vast majority of people in our sample are at least age 30; almost all of the people that we study have completed their formal education. Although educational attainment is most strongly influenced by social and other environmental factors (see FAQ 1.8), it is also influenced by thousands of genes. People vary considerably in how much education they complete. Education is recognized throughout the social and biomedical sciences as an important “predictor” (see FAQ 1.5) of many other life outcomes, such as income, occupation, health, and longevity (Ross and Wu, 1995; Cutler and Lleras-Muney, 2010). Educational attainment has also been among the relatively few social-scientific outcomes for which it is feasible to conduct a large-scale genome-wide study, because educational attainment is frequently measured in cohorts, including medical cohorts, due to its robust association with health. The current study is also based on a large sample of research participants of the personal genomics company 23andMe, which asks participants a survey question about educational attainment. A large-scale study is necessary (but not sufficient) to generate scientific findings that are reproducible. 

1.3       What is a GWAS?

In a genome-wide association study (GWAS, pronounced JEE-wahs), scientists look across the entire human genome at genetic differences among people to see whether any of these differences are, on average, associated statistically with higher or lower levels of some outcome—for instance, more or less cancer, height, or risk tolerance. Typically, and in our studies, such analyses focus on places in the human genome where people commonly differ: so-called single-nucleotide polymorphisms (SNPs). At a given SNP located on a particular copy of a chromosome, each of us has one of the four genetic base pairs (A-T, T-A, C-G, or G-C), which is called an “allele.” We inherit one of each chromosome from our biological father and one from our biological mother, so at each SNP, we inherit one allele from each biological parent and hence have two alleles in total. In some cases, we inherit the same allele from each parent, and in other cases, we inherit one allele from one parent and a different allele from the other parent. In a GWAS, researchers look to see whether particular alleles are associated statistically with having more or less of some outcome. 
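As a rough illustration of what “associated statistically” means for a single SNP, the sketch below regresses an outcome on the number of copies (0, 1, or 2) of one allele that each person carries. Everything in it is an assumption made for illustration: the function name `snp_association`, the six made-up participants, and the deliberately exaggerated difference between them. An actual GWAS repeats this kind of calculation for millions of SNPs in very large samples and includes control variables.

```python
from statistics import mean


def snp_association(allele_counts, outcomes):
    """Return the least-squares slope of the outcome on the allele count."""
    x_bar = mean(allele_counts)
    y_bar = mean(outcomes)
    sxx = sum((x - x_bar) ** 2 for x in allele_counts)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(allele_counts, outcomes))
    return sxy / sxx


# Hypothetical toy data: copies of one allele and years of schooling.
allele_counts = [0, 0, 1, 1, 2, 2]
years_of_schooling = [12.0, 13.0, 13.0, 14.0, 14.0, 15.0]

slope = snp_association(allele_counts, years_of_schooling)
print(f"Estimated difference per allele copy: {slope:.2f} years")
```

The slope is only an average difference between groups of people who carry different numbers of copies of the allele. It says nothing about why the difference exists.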

Although there are tens of millions of sites where SNPs are located in the human genome, GWASs typically investigate only SNPs that can be measured (or imputed) with a high level of accuracy. These days, such procedures usually yield millions of SNPs that together capture most common genetic variation across people.

When the SSGAC conducts a GWAS, every participating cohort uploads the (within-cohort) statistical associations between the outcome—for example, educational attainment—and each SNP that was measured in the genomes of the individuals in the cohort. The cohort-level results do not contain individual-level data—just summary statistics about these within-cohort statistical associations. The SSGAC then combines these cohort results to produce the overall GWAS results. By using existing datasets and combining cohort-level results, we can study the genetics of ~3 million people at very low cost. The SSGAC publicly shares overall, aggregated results (subject to some Terms of Service; see FAQ 3.7) so that other scientists can build on this work. These publicly available data have already catalyzed many research projects and analyses across the social and biomedical sciences (see FAQ 1.7 for examples). 
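This FAQ does not say which statistical method is used to combine the cohort-level results. One widely used approach for pooling summary statistics is fixed-effect inverse-variance weighting, sketched below as an assumption rather than as a description of the SSGAC's actual pipeline. The function name `inverse_variance_meta` and the two cohort estimates are hypothetical.

```python
def inverse_variance_meta(estimates, std_errors):
    """Pool per-cohort estimates for one SNP, weighting each by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = (1.0 / total_weight) ** 0.5
    return pooled, pooled_se


# Hypothetical summary statistics for a single SNP from two cohorts.
cohort_estimates = [0.02, 0.04]
cohort_std_errors = [0.01, 0.01]

pooled, pooled_se = inverse_variance_meta(cohort_estimates, cohort_std_errors)
print(f"Pooled estimate: {pooled:.3f} (standard error {pooled_se:.4f})")
```

Note that the function needs only each cohort's estimate and standard error, not individual-level data, which mirrors the point made above about summary statistics. The pooled standard error is smaller than either cohort's own standard error, which is why combining cohorts helps to separate weak signals from noise.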

GWASs have been a successful research strategy for identifying genetic variants associated with many outcomes and diseases, including body height (Wood et al., 2014), BMI (Locke et al., 2015), Alzheimer’s disease (Lambert et al., 2013), and schizophrenia (Ripke et al., 2014). They have also recently been used to identify genetic variants associated with a variety of health-relevant social-science outcomes, such as the number of children a person has (Barban et al., 2016), happiness (Okbay, Baselmans, et al., 2016; Turley et al., 2018), and educational attainment (Rietveld et al., 2013; Okbay, Beauchamp, et al., 2016; Lee et al., 2018).

1.4       Are the SNPs identified in a GWAS “causal” (i.e., would a change in the SNPs someone has, if everything else stayed the same, cause a person’s life to change)?

GWASs identify alleles that are associated with the outcome and cannot distinguish whether the associations are causal or not. While such an association can arise if a SNP causally influences the outcome, it is not necessarily the case that all associations between SNPs and outcomes are causal. Here are several non-causal reasons why a SNP may be associated statistically (i.e., correlated) with an outcome. First, SNPs are often highly correlated with other, nearby SNPs on the same chromosome. As a result, when one or more SNPs in a region causally influence an outcome (in that particular environment), many non-causal SNPs in that region may also be identified as statistically associated with the outcome. When GWAS results are analyzed, researchers typically report results for the SNP in a region that shows the strongest evidence of association. Even if there is a causal SNP, GWASs may not identify that particular SNP. In fact, the causal SNP may not have even been included among the SNPs that were originally measured directly for the study. For example, GWASs that focus on common SNPs would not be able to identify rare or structural genetic differences between people (e.g., deletions or insertions of an entire genetic region) that are causal, but GWASs may identify SNPs that are correlated with these unobserved sources of genetic variation.

Second, at a particular SNP, the frequencies of different alleles might vary systematically across environments. If those environmental factors are not accounted for in the association analyses, some of the associations found may be spurious (that is, the result of coincidence or of a third factor). Consider the well-known example of a GWAS of chopstick use (Lander and Schork, 1994; Hamer and Sirota, 2000). Because alleles are, by chance, more and less common in different populations, some alleles are more common in people with Asian genetic ancestries. At the same time, for cultural reasons, practices like chopstick use are often more common in some populations than in others. Both alleles and social outcomes like chopstick use, then, are distributed unevenly among people with different genetic ancestries. As a result, a “chopstick GWAS” would almost certainly find some alleles that are associated statistically with chopstick use, but these associations would be coincidental, and the alleles would not cause chopstick use. This is the problem of “population stratification bias” discussed in FAQ 2.2. GWAS researchers have a number of strategies for addressing the challenges posed by population stratification bias (see FAQs 2.7 & 3.5).
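The chopstick example can be made concrete with a small, entirely made-up data set. In the sketch below, the two groups, their allele counts, and the helper `slope` are all hypothetical. Within each group the allele has no association with chopstick use, yet pooling the groups produces one, because both the allele and the practice are more common in the second group.

```python
from statistics import mean


def slope(x, y):
    """Least-squares slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    return sxy / sxx


# Hypothetical group 1: the allele is rarer and nobody uses chopsticks.
alleles_1, chopsticks_1 = [0, 0, 1, 1], [0, 0, 0, 0]
# Hypothetical group 2: the allele is more common and everybody uses chopsticks.
alleles_2, chopsticks_2 = [1, 1, 2, 2], [1, 1, 1, 1]

within_1 = slope(alleles_1, chopsticks_1)
within_2 = slope(alleles_2, chopsticks_2)
pooled = slope(alleles_1 + alleles_2, chopsticks_1 + chopsticks_2)

print(f"Within group 1: {within_1:.2f}")
print(f"Within group 2: {within_2:.2f}")
print(f"Pooled sample:  {pooled:.2f}")
```

The pooled association is created entirely by group membership, which is the kind of spurious result that the strategies for addressing population stratification are meant to prevent.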

Even in studies such as ours that attempt to account for diversity in genetic ancestry, allele frequencies may nonetheless vary systematically with social practices and other environmental factors even within a group of people of similar genetic ancestry. For example, an allele that is associated with improved educational outcomes in the parental generation may have downstream effects on parental income and other factors known to influence children’s educational opportunities and outcomes (such as neighborhood characteristics). This same allele is likely to be inherited by the children of these parents, creating a correlation between the presence of the allele in a child’s genome and the extent to which the child was reared in a specific kind of environment. A recent study of Icelandic families showed that the parental allele that is not passed on to the child is still associated with the child’s educational attainment, suggesting that GWAS results for educational attainment partly represent these intergenerational pathways (Kong et al., 2018). Our family-based analyses yield results that are consistent with this conclusion (see FAQ 2.7). 

There are also cases where a SNP may indeed be causal, but not in the way that some people may think when they hear that genes “cause” an outcome. In these cases, SNPs’ effects on an outcome may be indirect, so a SNP that may be “causal” in one environment may have a diminished effect or no effect at all in other environments. For example, the nicotinic acetylcholine receptor gene cluster on chromosome 15 is associated with lung cancer (Amos et al., 2008; Hung et al., 2008; Thorgeirsson et al., 2008). From this observation alone we cannot conclude that these genetic variants cause lung cancer through some direct biological mechanism. In fact, it is likely that one version of this gene, which is part of the nicotinic acetylcholine receptor gene cluster that affects nicotine metabolism, increases lung cancer risk through effects on smoking behavior. In a tobacco-free environment, it is plausible that many of the associations would be substantially weaker and perhaps disappear altogether. Thus, even if we have credible evidence that a specific association is not spurious, it is entirely possible that the SNP in question influences the outcome through channels that most people would call environmental (e.g., smoking). More than forty years ago, the sociologist Christopher Jencks criticized the widespread tendency to mistakenly treat environmental and genetic sources of variation as mutually exclusive (see also Turkheimer, 2000). As the example of smoking illustrates, and as Jencks (1980) explains, it is often overly simplistic to assume that “genetic explanations of behavior are likely to be exclusively physical explanations while environmental explanations are likely to be social” (p. 723).

In general, a GWAS is just one step in a longer, often complex process of identifying causal pathways, but the results of a large-scale GWAS are a useful tool for that purpose and often lead to novel and important insights (Visscher et al., 2017). In other words, GWAS results provide important signals as to where scientists should invest future in-depth research to understand why the association exists (see also FAQ 3.6).

1.5       In what sense do the SNPs identified in a GWAS “predict” the outcome of interest? What do you mean by “effect size”?

When we and other scientists say that SNPs—and other variables, such as demographics or environmental factors—“predict” certain outcomes, we mean that people with particular alleles will tend—with some degree of likelihood, and only on average—to complete in the future or to have already completed more formal education, while people who carry other alleles will tend—again with some degree of likelihood, and only on average—to complete less formal education (see FAQ 1.8).

Our use of “predict” in this sense differs in several important ways from how “predict” is sometimes used in standard language (e.g., outside of social science research papers). First, we do not mean that the presence of an allele guarantees an outcome with certainty, or even with a high degree of likelihood. Rather, we mean that the SNP is, on average across people, associated statistically with an outcome. In other words, on average, people with one allele at that SNP have a higher likelihood of the outcome compared to people with the other allele. Scientists describe a SNP as statistically “predictive” of an outcome even if it has only a weak association with the outcome—as is the case, for instance, with every SNP that we identify that is associated with educational attainment.

Second, in standard language, “prediction” usually refers to the future. In contrast, when scientists say that SNPs “predict” an outcome, they mean that they expect to see the association in other data. “Other data” means data that aren’t part of the current study—regardless of whether those data will be collected in the future or have already been collected. In other words, in social science, it makes perfect sense to ask how well a SNP predicts an outcome that has already occurred, like how many years of education were attained by older adults.

 

Finally, in standard language, a “prediction” is often an unconditional guess about what will happen. Instead of meaning it unconditionally, scientists mean that they expect to see an association in other data collected—but only if those data will be or were collected in an environment that is approximately the same as the environment in which the original data were collected. In the example given in FAQ 1.4, in which a SNP is associated with lung cancer due to its effect on smoking, we might not expect the SNP to be predictive of lung cancer in an environment where cigarettes and other smoked tobacco products are absent.

“Effect size” is a scientific term that refers to the magnitude of the predicted difference in the outcome resulting from having one allele of a SNP as opposed to the other possible allele (see FAQ 2.3). For example, the average SNP identified in the current study is associated with only 1.4 more weeks of school on average. (Note that the association might average out to 1.4 weeks of school if, for example, one of the alleles is associated with an additional year of school for 3.5% of people and no additional school for 96.5% of people.) The use of the word “effect” is not intended to imply that the strength of the association between a SNP and educational attainment is necessarily a measure of the SNP’s causal effect on educational attainment (see FAQ 1.4).
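The parenthetical example above can be checked with a line of arithmetic. The sketch assumes that a school year lasts about 40 weeks; that figure is not stated in this FAQ, but it is the value under which the 3.5% example averages out to 1.4 weeks.

```python
# Assumption (not stated in the FAQ): one school year is about 40 weeks long.
WEEKS_PER_SCHOOL_YEAR = 40

share_with_extra_year = 0.035   # 3.5% of people: one additional year of school
share_with_no_change = 0.965    # 96.5% of people: no additional school

average_extra_weeks = (
    share_with_extra_year * WEEKS_PER_SCHOOL_YEAR
    + share_with_no_change * 0
)
print(f"Average difference: {average_extra_weeks:.1f} weeks of school")
```

The same 1.4-week average could equally arise from a small difference shared by everyone, which is why an effect size describes a group average rather than any one person's outcome.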

1.6       What is a polygenic index?

The results of a GWAS can be used to create a “polygenic index,” which sums up the net “effects” (see FAQ 1.5) of many SNPs from across an individual’s genome on the GWAS outcome. (We prefer the term “polygenic index” over the more common terms “polygenic score” and “polygenic risk score,” because the words “score” and “risk” can convey a value judgment where none is intended.) Because a polygenic index aggregates the information from many SNPs, it can “predict” (see FAQ 1.5) far more of the variation among individuals for the GWAS outcome than any single SNP. (Note, however, that even polygenic indexes are not good predictors of outcomes for one person; see FAQ 2.4.) Often, the polygenic indexes with the most predictive power are those created using all the (millions of) SNPs studied in a GWAS. The larger the GWAS sample size, the greater the predictive power (in other, independent samples) of a polygenic index constructed from the GWAS results. More precisely, the GWAS results are used to create a formula for constructing a polygenic index based on the effects on the outcome of having each allele. Using this formula, a polygenic index can then be constructed for any individual for whom there is genetic data that includes the SNPs that were used to construct the index. Indeed, some of the value of a GWAS is that the polygenic index it produces can be used in subsequent studies conducted in other samples. 
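The “formula” described above is, at its simplest, a weighted sum: each SNP's estimated effect multiplied by the number of copies of the relevant allele that a person carries. The sketch below is a minimal illustration under that assumption. The SNP names, the weights, and the function `polygenic_index` are hypothetical, and the construction of a real index typically involves adjustments to the weights that are not shown here.

```python
def polygenic_index(allele_counts, weights):
    """Sum weight * allele count over the SNPs that have a GWAS weight."""
    return sum(
        weights[snp] * count
        for snp, count in allele_counts.items()
        if snp in weights
    )


# Hypothetical per-allele weights taken from GWAS results.
gwas_weights = {"snp_a": 0.02, "snp_b": -0.01, "snp_c": 0.03}

# Hypothetical person: copies (0, 1, or 2) of the counted allele at each SNP.
person = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

index = polygenic_index(person, gwas_weights)
print(f"Polygenic index: {index:.2f}")
```

Because the weights come from the GWAS and the allele counts come from the individual, the same weights can be applied to anyone whose genetic data include those SNPs.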

We do not refer to the association between a polygenic index and an outcome as “causal.” That is because the polygenic index is composed of many SNPs, and while some of these may be causal, some (or, in principle, all) may not be causal (see FAQ 1.4). Some of the analyses in our paper are designed to quantify how much of the predictive power of the polygenic index for educational attainment is due to causal effects of SNPs (see FAQ 2.1).

 
1.7       Why conduct a GWAS of educational attainment?

We are motivated to conduct this research because we believe it can be fruitful for the social sciences and health research. In addition to the specific findings of our paper, which are discussed in Section 2 of these FAQs, the results of a GWAS of educational attainment also provide inputs for other research. In our view, some of the most valuable uses of the results will be to improve our ability to study the effects of environments. Because this may be counterintuitive, we will give a few examples of what we mean.

One example is using a polygenic index to control for, or hold constant, genetic influences when studying the effect of an environment. Doing so can be important when the study is correlational, but it can be valuable even in a study where the environmental variable is randomly assigned. Suppose researchers are studying the effect of an educational intervention, such as providing free preschool to economically disadvantaged children, on subsequent school achievement. Because a year of preschool is expensive, the sample sizes of such studies have been small (e.g., Weikart and Perry Preschool Project, 1967). In a study where a year of free preschool is randomly given to half of the children participating in the study, control variables are not needed in order to get an estimate of the effect of preschool on average—but control variables, such as gender, age, and parental socioeconomic status, are typically included in order to make the estimate more precise (by removing some of the background “noise” that makes it harder to detect the effect of the intervention). In effect, using the polygenic index as an additional control variable can allow the researchers to learn more from the same-size sample. The value of using the polygenic index as a control depends on how much predictive power it has over the set of other control variables used. In our first GWAS of educational attainment (Rietveld et al., 2013, Supplementary Materials section 8), we conducted calculations to quantify these gains. For the purposes of illustration, suppose control variables other than the polygenic index capture 10% of the variation in the outcome, and the polygenic index captures an additional 12% of the variation. Then to attain any given level of precision for the estimate of the effect of preschool, including the polygenic index as a control variable reduces the required sample size for the study by 13%. 
Relative to the cost of providing preschool to additional research participants—for instance, estimated to be $19,208.61 (in 2010 dollars) per child for the two-year Perry preschool study (Heckman et al., 2010)—genotyping participants can be highly cost effective. (Genotyping currently costs roughly $30/person, with this cost falling quickly over time.) There are currently only a few examples of polygenic indexes used in this way (e.g., Davies et al. 2018), because polygenic indexes have only recently attained enough predictive power to usefully serve as control variables. We anticipate that this type of application will become widespread in future social-scientific studies.
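The 13% figure in the preschool illustration can be reproduced under one simple assumption: that the sample size required for a given level of precision is proportional to the share of variation in the outcome left unexplained by the control variables. This is a sketch of that reasoning, not a reproduction of the calculations in Rietveld et al. (2013), and the function name is hypothetical.

```python
def sample_size_reduction(r2_other_controls, r2_added_by_index):
    """Fractional drop in required sample size from adding the index as a control.

    Assumes the required sample size is proportional to the unexplained
    share of variation in the outcome.
    """
    unexplained_before = 1.0 - r2_other_controls
    unexplained_after = 1.0 - r2_other_controls - r2_added_by_index
    return 1.0 - unexplained_after / unexplained_before


# Values from the illustration: other controls capture 10% of the variation,
# and the polygenic index captures an additional 12%.
reduction = sample_size_reduction(0.10, 0.12)
print(f"Required sample size falls by about {reduction:.0%}")
```

With these inputs the unexplained share falls from 90% to 78%, and 78/90 is roughly 0.87, so the study needs about 13% fewer participants for the same precision.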

Another example is using the results of a GWAS of educational attainment to study how parenting and other features of a child’s rearing environment influence his or her developmental outcomes. This idea was pioneered in a paper by Kong et al. (2018), who studied SNPs identified in one of our earlier GWASs of educational attainment (Okbay, Beauchamp, et al., 2016). Kong et al. showed that the alleles of the mother and father that are not transmitted to a child are nevertheless related to the child’s outcomes, including the child’s educational attainment. Because the child did not inherit these alleles, their association with the child’s outcomes cannot be due to genetic influences on the child. Instead, their association with the child’s outcomes must be due to their effects on the parents, which in turn affect the environment in which the child is reared. While Kong et al. did not pin down the specific pathways that account for these associations, there are many interesting possibilities that can be explored in future work; for instance, parents with these alleles are more likely to attend school longer and earn higher incomes, which may enable them to provide educational advantages to their children. Kong et al.’s methodology can also be used to address other questions, such as whether the non-transmitted alleles of the mother or of the father are more strongly associated with the child’s outcomes.

Much more briefly, here are some other ways in which results from our earlier GWASs of educational attainment (Rietveld et al., 2013; Okbay, Beauchamp, et al., 2016; Lee et al., 2018), which were conducted in much smaller samples (see also FAQ 1.8), have been used. Researchers have used them to:

  • examine the genetic overlap between educational attainment and ADHD, schizophrenia, Alzheimer’s disease, intellectual disability, cognitive decline in the elderly, brain morphology, and longevity (Pickrell et al., 2016; Marioni et al., 2016; Warrier et al., 2016; Anderson et al., 2017);

  • better identify possible genetic subtypes of schizophrenia (Bansal et al., 2017);

  • provide insights into the genetics of brain development and function (Lee et al., 2018);

  • explore why educational attainment appears to be protective against coronary artery disease (Tillmann et al., 2017) and obesity (van Kippersluis and Rietveld, 2018);

  • study why specific SNPs predict educational attainment. For example, it appears that some genetic effects on educational attainment operate through associations with cognitive performance and outcomes such as self-control (Belsky et al., 2016), which in turn affect educational attainment;

  • study how the effects of genes on education differ across environmental contexts (Schmitz and Conley, 2017; Barcellos, Carvalho and Turley, 2018; Cheesman et al., 2020); and

  • determine the limits of genetic influences and debunk cultural myths about group differences (e.g., between men and women, see FAQs 2.9 & 3.7) (Houmark, Ronda and Rosholm, 2020).

By making the results of our analyses publicly available at https://www.thessgac.org/data, we hope to facilitate this and other valuable work by other researchers.

1.8       What was already known about the relationships between genes and educational attainment prior to this study?

Educational attainment is strongly influenced by social and other environmental factors. For example, holding all other influences equal, those who live in communities where education (at least beyond a certain level) is relatively expensive are less likely to obtain a high level of educational attainment. Even when education is free or heavily subsidized, full-time education implies an opportunity cost that not everyone is equally able to bear: some individuals, due to a variety of family or economic circumstances, will face more pressure than others to leave school and enter the work force. More generally, educational outcomes are strongly influenced by environmental factors such as social norms, early-life educational experiences, economic opportunity, and many forms of bias and discrimination that make it harder for some people to succeed or stay in school.

A variety of findings, from twin studies, family studies, and GWASs, suggest that genetic factors predict some of the differences across people in their educational attainment (Heath et al., 1985; Silventoinen et al., 2004; Branigan et al., 2013). Studies have found repeatedly that identical twins raised in the same home are substantially more similar to each other in their educational attainment than fraternal twins (or other full siblings) reared together. Full siblings reared together are, in turn, more similar than half siblings reared together who, in turn, are more similar than genetically unrelated siblings reared together (e.g., siblings who are conventionally unrelated, typically because at least one of them is adopted) (Sacerdote, 2007, 2011; Cesarini and Visscher, 2017). The studies have also provided strong evidence that the so-called “common environment” (the environmental factors shared by siblings raised in the same household) can have long-lasting effects on educational outcomes. In Sweden, the educational outcomes of adopted (i.e., genetically unrelated) brothers reared in the same households are about as similar as the educational outcomes of full siblings reared in separate homes (Cesarini and Visscher, 2017). A study of Korean-American adoptees finds that adoptees assigned to households where both parents had college degrees were 16 percentage points more likely to attend college than children assigned to families in which neither parent completed college (Sacerdote, 2007).

Research (like the current study) using molecular genetic data—data that measures each person’s DNA and can be used to identify differences among people at the molecular level—has similarly estimated that SNPs may jointly predict up to 20% of the variation in educational attainment across individuals (Rietveld et al., 2013). Prior GWASs have begun to identify some of those SNPs. In the SSGAC’s first major publication (Rietveld et al., 2013), we conducted a GWAS in a sample of roughly 100,000 people and identified three SNPs that were statistically associated with educational attainment. In 2016, the SSGAC published another GWAS of educational attainment, this time in a sample of around 300,000 people (Okbay, Beauchamp, et al., 2016). We found that 74 SNPs were associated with educational attainment. These included the three SNPs identified in our earlier study (Rietveld et al., 2013). In 2018, the SSGAC published its most recent GWAS of educational attainment in a sample of roughly 1.1 million people (Lee et al., 2018). We found 1,271 SNPs associated with educational attainment, and earlier findings continued to replicate well. All three of these studies involved, at the time they were conducted, the largest sample sizes ever studied for genetic associations with a social-science outcome.

Researchers don’t yet know why these SNPs are associated with differences in educational attainment. Their predictive power may derive from many different types of mechanisms, some of which would be quite indirect. For example, genetic variation may affect neural functions such as memory. Genetic variation may improve sleep quality (making it easier to subsequently stay awake in boring lectures). Genetic variation can affect personality traits, such as the willingness to listen politely to and follow the instructions of teachers (who aren’t always right but nevertheless dictate grades and other outcomes). There may also be even more convoluted pathways. For example, genetic variation can affect one’s sociability, which might draw someone into or drive someone out of the particular social environments that exist in higher education.

 

There were three key takeaways from the SSGAC’s prior work: 

(1)   A GWAS approach can identify specific SNPs statistically associated with socio-behavioral outcomes if the study is conducted in large enough samples (at least 100,000 people).

(2)   SNPs that are associated with a socio-behavioral outcome such as educational attainment are each likely to have less predictive power than are SNPs that are associated with a biomedical or other physical outcome (Chabris et al., 2015). For example, of the hundreds of SNPs found to be associated with height to date (Wood et al., 2014; Yengo, Sidorenko, et al., 2018), the SNP with the strongest association predicts 0.4% of the variation across individuals in height, whereas the SNP with the strongest association with educational attainment identified to date predicts less than one tenth (<0.04%) as much of the variation in educational attainment (Lee et al., 2018). (The SNPs that have not yet been identified will very likely explain even less variance than those that are currently known, since statistical power is greatest for those that explain the most variance; in other words, the largest effect-size SNPs are likely to have been the first ones to have been identified in earlier GWASs.)

(3)   In the samples studied, at least 20% of the variation in educational attainment can in principle be predicted by genetic differences (Rietveld et al., 2013), implying that the genetic associations with educational attainment result from the cumulative effects of at least thousands (and probably millions) of SNPs, not just a few.

These findings from twin studies, family studies, and GWASs imply that individuals who carry an allele associated with greater educational attainment will on average complete slightly more formal education than other (similarly environmentally situated) individuals who carry a different allele of the same SNP. Put in population terms, these findings imply that people with particular alleles will tend on average to complete more formal education, while people who carry other alleles will tend on average to complete less formal education. It is important to emphasize that these associations represent average tendencies in a population. Women are, on average, shorter than men. But you likely know many tall women and many short men. Similarly, many individuals with high polygenic indexes for educational attainment will not get a college degree, and vice-versa (see FAQ 3.4). It is also important to recall (from FAQ 1.5) that these average tendencies of alleles on educational outcomes may reflect indirect genetic influences on education that operate through environmental channels, such that a polygenic index that is moderately predictive in one environment may become less predictive or not at all predictive in a very different environment. Polygenic indexes for educational attainment are poor predictors of individual outcomes and sensitive to environments, but increasingly useful tools in social science research (see FAQ 2.4).

2 Study design and results
2.1 What did you do in this paper? How was the study designed? Why was the study designed in this way?

We conducted a GWAS (see FAQ 1.3) of educational attainment (see FAQ 1.2) in a sample of over 3 million people. The sample size we used in the current study is much larger than that used in previous GWASs of educational attainment (see FAQ 1.8). By constructing a sample of over 3 million, we expected to estimate genetic effects with much greater accuracy than previous studies (with smaller samples). As a result, we expected to identify many more specific SNPs that are associated with educational attainment and to build a more accurate polygenic index.

To construct such a large sample, we started with the data analyzed in our most recent paper (Lee et al., 2018): a GWAS of roughly 300,000 research participants from 69 datasets; a GWAS of roughly 440,000 research participants from the UK Biobank, a large-scale biomedical database and research resource; and a GWAS of roughly 365,000 research participants from the personal genomics company 23andMe. We then replaced the earlier 23andMe sample with an updated GWAS based on roughly 2.3 million 23andMe research participants. This new data increased the combined sample size from about 1.1 million participants to about 3 million participants. All of these datasets have surveyed and genotyped their research participants.

Our study was limited to only the most common type of genetic difference: SNPs (see FAQ 1.3). Like our most recent previous study (Lee et al., 2018) but unlike most other studies, which have analyzed only the autosomes (the non-sex chromosomes), our study also included SNPs on the X chromosome (see FAQ 2.9). Also unlike most other studies (including our own previous work), which have studied only the additive (i.e., linear) effects of SNPs, our study also studied their dominance (i.e., non-linear) effects (see FAQ 2.10). In total, our analyses included approximately 10 million SNPs.

As in other GWASs, our analyses included only individuals of primarily European genetic ancestries. (We say genetic ancestries because we are not talking about who someone identifies as their ancestors but, rather, the similarity between someone’s genome and the genome of a “reference sample” for a population from prior genetic studies. And throughout these FAQs we refer to European and African genetic ancestries, plural, because there is tremendous genetic diversity within each continent, especially Africa.) Such individuals are identified in different ways in different cohorts that participated in our study (depending on, for example, the demographic composition of the country where the individuals live). In all cohorts, though, statistical summaries of the allele frequencies and allele correlation patterns in people’s genomes (see FAQ 2.2), called principal components, are used as part of the procedure. In particular, individuals are only identified as having European genetic ancestries if their principal components are sufficiently similar to those of reference individuals recruited in prior genetic studies, whose ancestors over several generations were all born in a European country. The restriction to European genetic ancestries is needed in order to reduce statistical confounds that otherwise arise from studying populations that include people with different genetic ancestries (see the discussion of population stratification bias in FAQ 2.2; see also FAQs 1.4, 2.7 & 3.5).

In the remainder of the paper, we used the findings from the GWAS for a range of additional analyses that explored (among other things):

  • the predictive power of the polygenic index for educational attainment (FAQ 2.4), as well as cognitive performance and high school academic achievement;

  • why the polygenic index is less predictive in individuals of African genetic ancestries than in individuals of European genetic ancestries (FAQ 2.5);

  • the predictive power of the polygenic index for the risk of various diseases (FAQ 2.6);

  • the extent to which the polygenic index’s predictive power for educational attainment and other outcomes is due to its correlation with environmental factors rather than to genes, per se (see FAQ 2.7); 

  • assortative mating based on educational attainment (see FAQ 2.8);

  • the effects of SNPs on the X chromosome on educational attainment (FAQ 2.9); and

  • the magnitude of “dominance effects” for the effects of SNPs on educational attainment (see FAQ 2.10). 

2.2       What are common pitfalls in GWASs? What precautions did you take against them?

There are many potential pitfalls that can lead to spurious results in genome-wide association studies (GWASs). We took many precautions to guard against these pitfalls.

One potential source of spurious results is incomplete “quality control” (QC) of the genetic data. To avoid this problem, we use QC protocols from medical genetics research (Winkler et al., 2014). We supplement these protocols by developing and applying additional, more stringent QC filters.

Another potential source of spurious results is a confound known as “population stratification bias.” (We discuss a well-known illustration of this confound—a hypothetical GWAS of chopstick use—in FAQ 1.3.) In our study we correct for population stratification bias as much as possible. At the outset, we restrict the study to individuals of European genetic ancestries. As is standard in GWASs, we also control for “principal components” of the genetic data in the analysis; these principal components capture the small genetic differences across genetic ancestry groups within European populations, so controlling for them largely removes the spurious associations arising solely from these small differences.
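The logic of controlling for ancestry can be illustrated with a toy calculation. The sketch below is not the study's pipeline: the eight data points are made up, and a simple 0/1 group label stands in for a genetic principal component. It only shows how an association that arises purely from group differences disappears once the group variable is controlled for (here by removing the group's linear contribution from both the genotype and the outcome before regressing one on the other).

```python
# Toy illustration only: made-up data, with a 0/1 group label standing in
# for a genetic principal component. Not the study's actual pipeline.

def mean(values):
    return sum(values) / len(values)

def ols_slope(x, y):
    """Slope from a simple least-squares regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residualize(values, covariate):
    """Remove the part of `values` that is linearly predicted by `covariate`."""
    slope = ols_slope(covariate, values)
    mv, mc = mean(values), mean(covariate)
    return [v - (mv + slope * (c - mc)) for v, c in zip(values, covariate)]

# Eight hypothetical people in two ancestry groups (0 and 1).
group = [0, 0, 0, 0, 1, 1, 1, 1]
# Allele counts: the allele happens to be more common in group 1.
genotype = [0, 0, 0, 1, 1, 2, 2, 2]
# Outcome: higher in group 1, but unrelated to genotype within each group.
outcome = [10, 11, 12, 11, 14, 13, 14, 15]

# Regression that ignores group membership.
naive_slope = ols_slope(genotype, outcome)

# Regression after controlling for group membership.
adjusted_slope = ols_slope(residualize(genotype, group),
                           residualize(outcome, group))

print(f"naive slope:    {naive_slope:.2f}")
print(f"adjusted slope: {adjusted_slope:.2f}")
```

In this constructed example the naive regression suggests a sizable "effect" of the allele, while the adjusted regression finds none, because within each group the allele is unrelated to the outcome.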

After taking these steps to minimize bias stemming from population stratification, we conduct a standard analysis, called LD Score regression (Bulik-Sullivan et al., 2015), to assess how much population stratification bias still remains in our data. The results of this analysis indicate that the biases in our results due to population stratification are small.

The “direct effect” of the polygenic index from our within-family analysis, described in FAQ 2.7, is immune to any remaining population stratification bias. Population stratification bias can only arise when individuals are from different families with different genetic ancestries. By controlling for the polygenic indexes of an individual’s parents, we are also controlling for any differences in genetic ancestry across individuals.

2.3       What did you find in the main GWAS of educational attainment?

In our sample of roughly 3 million people, we found 3,952 SNPs that were associated with educational attainment (using the standard statistical threshold in GWAS, which adjusts for multiple hypothesis testing). This is a substantial increase from the 1,271 SNPs identified in our last GWAS of around 1 million individuals (Lee et al., 2018), further confirming the importance of large sample size for identifying SNPs associated with socio-behavioral outcomes. 

The current study further confirmed the finding from our earlier work that the effects of individual SNPs on educational attainment are each extremely small. The median effect size across the 3,952 SNPs was just 1.4 weeks of schooling per allele; even the SNPs with the strongest associations only predicted around 3.5 weeks of additional schooling per allele. Taken together, these 3,952 SNPs accounted for roughly 8% of the variation across individuals in years of education completed.

Here is another way to think about this result. We could use the results for these 3,952 SNPs (not the ~1 million SNPs across the entire genome that we discuss in FAQ 2.4) to predict the educational attainment for a new group of people (separate from our discovery sample) whose educational attainment is unknown to us. We could then compare each individual’s predicted educational attainment to their actual educational attainment. If we did so, our results would show that if someone were predicted to complete an above average number of years of schooling (i.e., to be in the top half of educational attainment), that person would have about a 59% chance of actually being in the top half of educational attainment. 59% is better than the 50% odds of making a correct prediction that you would have if you used a coin flip to predict whether someone is in the top or bottom half of educational attainment, but only a bit better. By contrast, a prediction based on a polygenic index that combines the complete set of ~1 million SNPs that we studied (see FAQs 1.6 & 2.4) has more predictive power: about 13% of the variation across individuals. (Even this amount of predictive power still corresponds to having only a 62% chance of correctly guessing whether someone is in the top or bottom half of educational attainment.)
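The link between "percent of variation predicted" and "chance of a correct top-half guess" can be sketched with a standard result for two jointly normal variables with correlation r: the probability that one is above its median, given that the other is, equals 1/2 + arcsin(r)/π. The FAQ does not say how the 59% and 62% figures were calculated, so the bivariate-normal assumption and the function name below are ours; the sketch simply shows that this textbook formula gives numbers of the same size.

```python
import math

def chance_top_half_given_predicted_top_half(r_squared):
    """Probability of being in the top half of the outcome, given that the
    prediction is in its top half, assuming prediction and outcome are
    jointly normal with squared correlation `r_squared`."""
    r = math.sqrt(r_squared)
    return 0.5 + math.asin(r) / math.pi

# 8%: the 3,952 SNPs alone; 13%: the ~1 million SNP polygenic index.
for r2 in (0.08, 0.13):
    p = chance_top_half_given_predicted_top_half(r2)
    print(f"variation predicted = {r2:.0%} -> chance of correct call = {p:.1%}")
```

Under this assumption, predicting 8% of the variation corresponds to roughly a 59% chance, and predicting 13% corresponds to roughly a 62% chance, in line with the figures quoted above.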

The contrast between the 8% of the variation predicted by the 3,952 SNPs and the 20% estimated to be explained by common SNPs (see FAQ 1.8) implies that there are many other SNPs that have not yet been identified. Even larger sample sizes will be needed to identify them. 

It is also important to keep in mind that educational attainment is a complex outcome, and our study focuses on only a tiny piece of the bigger picture. In this paper, we only examine one type of genetic difference (SNPs). Other genetic effects, environmental effects, and their interactions are important topics of active research and of future work by the SSGAC. Such work includes studies of associations between educational attainment and epigenetic marks, i.e., other molecules that attach to a person’s DNA over the course of their lifetime and tell their genes to switch “on” or “off” (Linnér et al., 2017).

2.4       How predictive is the polygenic index developed in this study?

As discussed in FAQ 1.6, we can create a polygenic index using the GWAS results from ~1 million SNPs. The polygenic index we construct “predicts” (see FAQ 1.5) around 13% of the variation in education across individuals of European genetic ancestries (when tested in independent data that was not included in the GWAS). This ~1 million SNP polygenic index predicts much more of the variation than does the genetic predictor described in FAQ 2.3, which was based on only 3,952 SNPs. Including all ~1 million SNPs tends to add predictive power because the threshold for significance/inclusion that is used to identify the 3,952 SNPs is very conservative (i.e., many of the other ~1 million SNPs are also associated with educational attainment but are not identified by our study, and on net, it turns out empirically that more signal than noise is added by including them). This study’s polygenic index has much more predictive power than polygenic indexes constructed from our earlier three GWASs of educational attainment, because all of those studies had much smaller sample sizes (~100,000, ~300,000, and ~1.1 million individuals, respectively, compared with ~3 million individuals in the current study).
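At its core, a polygenic index is a weighted sum: for each SNP, the number of copies of a given allele a person carries (0, 1, or 2) is multiplied by a weight derived from the GWAS, and the products are added up. The sketch below illustrates only that arithmetic. The three SNPs, their weights, and the two people are hypothetical, and the sketch omits steps a real index would involve (for example, adjusting the weights for correlations between SNPs and standardizing the final index).

```python
def polygenic_index(allele_counts, weights):
    """Weighted sum of allele counts (0, 1, or 2 per SNP).

    `weights` holds one (positive or negative) GWAS-derived weight per SNP.
    """
    if len(allele_counts) != len(weights):
        raise ValueError("need exactly one weight per SNP")
    return sum(w * g for w, g in zip(weights, allele_counts))

# Hypothetical weights for three SNPs (a real index uses ~1 million).
weights = [0.02, -0.01, 0.015]

# Hypothetical allele counts for two people at those three SNPs.
person_a = [2, 0, 1]
person_b = [0, 2, 1]

pgi_a = polygenic_index(person_a, weights)
pgi_b = polygenic_index(person_b, weights)
print(f"person A: {pgi_a:+.3f}")
print(f"person B: {pgi_b:+.3f}")
```

Because each weight is tiny, no single SNP moves the index much; differences between people emerge only from the accumulation across very many SNPs.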

Individuals with high polygenic indexes have, on average, higher levels of education than those with lower polygenic indexes. In the present study, we found that among the individuals of European genetic ancestries from a U.S. sample of young adults (the National Longitudinal Study of Adolescent to Adult Health), 7% of those with the lowest 10% of polygenic indexes graduated from college, compared with 71% of those with the highest 10% of polygenic indexes. These results show both that polygenic indexes have some predictive power but also that polygenic indexes do not at all pin down individual outcomes: even when polygenic indexes are based on a GWAS of many more people and therefore have even greater predictive power than ours, there will always be many people whose polygenic indexes “predict” lower educational attainment who in fact attain relatively high amounts of education and vice-versa.

As we discuss further in FAQ 3.4, an individual’s polygenic index for education (even a polygenic index based on ~1 million SNPs) is still not a very accurate prediction of that individual’s actual level of education attained. We emphasize that point using Figure 2c in the paper, reproduced here:

 

 

 

In the figure on the left, each point is an individual of European genetic ancestries from the U.S. sample of younger adults mentioned above (the National Longitudinal Study of Adolescent to Adult Health). In the figure on the right, each point is an individual of European genetic ancestries from a U.S. sample of older adults (the Health and Retirement Study). In both figures, the x-axis is the individual’s polygenic index, and the y-axis is the individual’s actual number of years of formal schooling (after converting the level of education to an internationally standardized scale and adjusting for age and sex). The points are jittered slightly from their actual values in order to ensure that points do not lie directly on top of each other. While the figures show that there is a relationship between the polygenic index and the actual amount of an individual’s education, they also show that people with the same polygenic index value—points with the same x-axis value—vary a great deal in how much education they have. For instance, in both samples, among those with a polygenic index that is one standard deviation below average, individuals range in their actual educational attainment from about 7 years of formal education to about 22 years.

 

Despite the fact that polygenic indexes are not useful for predicting a particular individual’s educational attainment, they are useful for scientific studies (including social science, health research, etc.). Such studies are concerned with aggregate population trends and averages rather than with individual outcomes. In particular, because the polygenic index predicts about 13% of the variation across individuals, studies of its association with other variables can be well powered in sample sizes as small as 61 individuals (but not as small as 1 individual!).
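A sample size of this order can be reproduced with a common large-sample rule of thumb for detecting an association that accounts for a share R² of the variation: N ≈ (z₁ + z₂)² / R², where z₁ and z₂ are the normal quantiles for the significance level and the desired power. The FAQ does not state which calculation or which settings were used; the 5% two-sided significance level and 80% power below are conventional choices that we assume for illustration.

```python
import math
from statistics import NormalDist

def min_sample_size(r_squared, alpha=0.05, power=0.80):
    """Approximate smallest N at which an association explaining `r_squared`
    of the variation is detected with the given power in a two-sided test.
    Rule-of-thumb approximation; alpha and power are assumed defaults."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil((z_alpha + z_power) ** 2 / r_squared)

n_needed = min_sample_size(0.13)
print(f"approximate sample size needed: {n_needed}")
```

With these assumed settings the approximation gives 61 individuals for R² = 13%, matching the figure quoted above. A predictor explaining only 1% of the variation would need a sample roughly 13 times larger.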

 

Through this lens, the fact that the current study’s polygenic index for educational attainment predicts 13% of the variation across individuals in education attained is quite meaningful and rivals the predictive power of other variables commonly used in research—none of which, taken alone, predicts a large amount of variation in a socio-behavioral outcome. For example, in our prior work (Lee et al., 2018) we estimated that household income predicts ~7% of variation in educational attainment and mother’s education predicts ~15%. Thus, our index has approached the predictive power of important demographic variables and can be used in similar ways (e.g., to control for genetics as an additional confound when evaluating the effects of environmental differences or interventions). 


With a relatively high level of population-level predictive power, the polygenic index we constructed enables other research that is of value to social scientists and health researchers. Such studies are already being conducted with the (less powerful) polygenic indexes from our earlier GWASs of educational attainment (see FAQ 1.7). Our new results will enable many additional applications, such as studies that use the polygenic index in relatively small samples that contain rich health and socio-behavioral data that is expensive to collect (e.g., a randomized controlled trial that studies the effects of subsidizing higher education and uses the polygenic index as a control variable).

A major caveat to all of this is that polygenic indexes developed from GWASs of particular genetic ancestry populations are known to be less predictive when applied to people of any other genetic ancestry (for reasons we discuss in FAQ 2.5). (As noted above in FAQ 2.1, we say genetic ancestries because we are not talking about who someone identifies as their ancestors but, rather, the similarity between someone’s genome and the genome of a “reference sample” for a population from prior genetic studies. And throughout these FAQs we refer to European and African genetic ancestries, plural, because there is tremendous genetic diversity within each continent, especially Africa.) For example, studying polygenic indexes for various health outcomes derived from GWAS participants of European genetic ancestries, Martin et al. (2017) and Duncan et al. (2019) found that, on average across the polygenic indexes, the predictive power for the outcomes was roughly 20-30% as large in samples of African genetic ancestries as it was in samples of European genetic ancestries.

Our educational attainment polygenic index, like most other polygenic indexes, was developed with participants of European genetic ancestries (because currently most genotyped people are of European genetic ancestries, and very large numbers of people are needed to create meaningful polygenic indexes). We illustrate and quantify the attenuation in predictive power for individuals of African genetic ancestries, populations for which previous work has found that the attenuation is especially large. Specifically, we examined the predictive power of our educational attainment polygenic index in the samples of African genetic ancestries of our two prediction datasets, HRS and Add Health. The polygenic index explained 12.0% and 15.8% of the variance among the participants of European genetic ancestries in the HRS and in Add Health, respectively. By contrast, the polygenic index explained far less in the participants of African genetic ancestries: 1.3% and 2.3%, respectively. Thus, when applied to those of African genetic ancestries, the polygenic index has only 10-15% of the predictive power it has with those of European genetic ancestries. (In our previous GWAS of educational attainment, we also tested the predictive power of the polygenic index in a sample of HRS participants with African genetic ancestries (who may or may not have additional genetic ancestries). We similarly found that the earlier polygenic index had predictive power only 11% as large as the predictive power in the European genetic ancestries sample.) Thus, our results suggest that the drop-off in predictive power for the polygenic index for educational attainment is especially large, relative to polygenic indexes for other outcomes.
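The "10-15%" range follows directly from the four percentages quoted above. As a quick check, the short calculation below divides the variance explained in each African-genetic-ancestries sample by the variance explained in the corresponding European-genetic-ancestries sample; the variable names are ours.

```python
# Variance explained by the polygenic index (percent), as quoted in the text.
variance_explained = {
    "HRS": {"European": 12.0, "African": 1.3},
    "Add Health": {"European": 15.8, "African": 2.3},
}

# Predictive power in the African-genetic-ancestries sample relative to the
# European-genetic-ancestries sample from the same dataset.
relative_power = {
    dataset: values["African"] / values["European"]
    for dataset, values in variance_explained.items()
}

for dataset, ratio in relative_power.items():
    print(f"{dataset}: {ratio:.1%} of the European-ancestries predictive power")
```

The two ratios come out at roughly 11% (HRS) and 15% (Add Health).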

Unfortunately, this attenuation of predictive power means that for most populations, many of the benefits of a polygenic index will be postponed until large GWASs are conducted using samples from these populations. Currently, most large, genotyped samples are of European genetic ancestries. We prioritize GWASs of samples of other genetic ancestries, but cannot implement this analysis until large enough samples of these populations have been genotyped and are made available to the research community.

2.5       Why is the polygenic index less predictive in samples of African genetic ancestries than in samples of European genetic ancestries?

As noted above in FAQ 2.4, we expect attenuated predictive power when applying a polygenic index developed with participants from any particular genetic ancestry populations to people of any other genetic ancestry, and we illustrated this attenuation in samples of participants of African genetic ancestries. We conducted additional analysis in order to shed some light on why the polygenic index is less predictive in participants of African genetic ancestries than in samples of European genetic ancestries. We study the main potential reasons, which fall into two categories.

The first category is primarily about environmental factors. In this category, there are two main explanations of the reduced predictive power in samples of African genetic ancestries. First, genetic factors as a whole might simply matter relatively less for predicting educational attainment in samples of African genetic ancestries because environmental factors matter relatively more. In that case, the polygenic index—which captures some of these genetic factors—would similarly predict less well in the samples of African genetic ancestries. Second, there are gene-environment interactions (see FAQ 3.2). Since the samples of African genetic ancestries face different environments on average than do the samples of European genetic ancestries—for instance, racist expectations for classroom performance, poorer access to educational resources, and other average socio-economic circumstances that can affect the ability to succeed or remain in school—the associations between SNPs and educational attainment could be different in those of African genetic ancestries than in those of European genetic ancestries. If so, the SNP weights that produce a predictive polygenic index in populations of European genetic ancestries will turn out to be suboptimal weights for prediction in African-genetic-ancestry populations.

 

The second category can be thought of as purely genetic reasons. In this category, there are also two main explanations of the reduced predictive power in samples of African genetic ancestries (to use our example). First, purely by chance, particular alleles are more or less common in populations with different genetic ancestries. Much of the predictive power of the polygenic index comes from the (positive or negative) weights it puts on alleles that are relatively common in populations of European genetic ancestries. Because many of these alleles are not as common in populations of African genetic ancestries, the polygenic index will have less predictive power in those populations. Second, populations also differ from each other in their linkage disequilibrium (LD) patterns, i.e., their correlation structure across SNPs (see FAQs 2.2 & 3.5). A given SNP may be associated with educational attainment because the SNP is in LD (i.e., correlated) with a SNP elsewhere in the genome that causally affects education (see FAQ 1.6). If the strength of the correlation is greater in one genetic ancestry group than in another, then the size of the association will be larger in that genetic ancestry group. The fact that there are differences across genetic ancestry groups in the set of associated SNPs and their effect sizes means that the weights for constructing polygenic indexes in individuals of European genetic ancestries (FAQ 1.4) would be the “wrong” weights for individuals of other genetic ancestries.

The first category of explanations is difficult for us to directly assess given the data we have, but in our paper, we directly evaluated the second category of explanations and used those results to indirectly assess the first category. Geneticists have conducted in-depth studies of the genetic differences across genetic ancestry groups in allele frequencies and LD. This makes it possible for us to assess how much the polygenic index’s predictive power would be expected to be reduced in samples of African genetic ancestries based on these factors alone. The paper that developed the methodology for this analysis applied it in the UK Biobank dataset and studied height, BMI, HDL and LDL cholesterol, triglycerides, asthma, type 2 diabetes, and hypertension (Wang et al., 2020). We also used the UK Biobank in order to enable us to compare educational attainment to these other phenotypes.

We find that, based on the second category of explanations, we would expect the predictive power for educational attainment to be 35% as large in samples of African genetic ancestries as in samples of European genetic ancestries. This is much larger than the 10-15% we actually find in our U.S.-based samples (see FAQ 2.4). We conclude that the first category of explanations (the environmental factors) is therefore likely to be important. Moreover, the discrepancy between the actual predictive power in samples of African genetic ancestries and the predictive power expected based on the second category of explanations is larger for educational attainment than for the phenotypes studied by Wang et al. (2020), suggesting that the environmental factors are more important for educational attainment.

For a number of reasons, this analysis of ours is only suggestive. One reason is that the UK Biobank sample of African genetic ancestries likely includes a higher fraction of immigrants to the UK than does the sample of European genetic ancestries, and individuals who completed some or all of their schooling outside the UK education system are less comparable. However, we believe our analysis points toward one direction for future work to understand why the polygenic index is less predictive in people of different genetic ancestries.

2.6       What did you find in the analysis of disease risk?

In addition to studying how accurately the polygenic index predicts educational attainment, we also examined how accurately it could predict some common diseases. Prior work, including our own, has found that the SNPs that predict educational attainment overlap with those that predict health outcomes, including Alzheimer’s disease, bipolar disorder, ADHD, schizophrenia, coronary artery disease, and longevity (Okbay, Beauchamp, et al., 2016; Pickrell et al., 2016; Marioni et al., 2016; Warrier et al., 2016; Anderson et al., 2017; Tillmann et al., 2017). However, the polygenic index for educational attainment has not been used as a predictor of such outcomes.

 

We studied ten common diseases, including asthma, arthritis, migraine, depression, and several related to heart disease (such as Type 2 diabetes and heart attack). We chose diseases that themselves have been the focus of large-scale GWASs. Thus, we could compare the predictive accuracy of our polygenic index for educational attainment with disease-specific polygenic indexes from those GWASs. For these analyses, we used a sample of roughly 440,000 individuals from the UK Biobank (the number of individuals with each disease varied depending on the disease).

The main result from these analyses is that, on average across the diseases, predicting disease risk using both the polygenic index for educational attainment and the disease-specific polygenic index increases predictive accuracy by roughly 50%, relative to using only the disease-specific polygenic index. On average, a disease-specific polygenic index predicts roughly 1.2% of the variation across individuals, whereas a disease-specific polygenic index together with the polygenic index for educational attainment jointly predict roughly 1.8% of the variation. This finding points to the potential value of the polygenic index for educational attainment for medical and epidemiological research. However, we highlight that the actual amounts of predictive power are small, much smaller than the roughly 13% for predicting educational attainment itself (see FAQ 2.4). We also note that genes are estimated to have a stronger influence (relative to environmental influences) on many complex diseases than they do on educational attainment; the primary reason that the educational attainment polygenic index has much greater predictive power than these disease polygenic indexes is that the GWASs that created the disease polygenic indexes are to date much smaller than our educational attainment GWAS.

2.7       What did you find in the family-based analyses?

Our family-based analyses involve looking at how predictive the polygenic index for educational attainment is once we control for the educational attainment polygenic indexes of the individual’s parents. Doing this allows us to better understand some of the sources of the predictive power of the polygenic index. There are three categories of sources of predictive power, listed here along with conventional names for these sources used in the literature:

  • Direct genetic effects: Some SNPs (that are either included in the polygenic index or correlated with SNPs that are included) may have an effect on characteristics of an individual, such as cognitive skills and personality, that in turn may influence educational attainment. These effects may be mediated by environmental factors (e.g., a child who likes reading will be more likely to pursue that interest in school if the child lives in a society where reading is valued in school).

  • Gene-environment correlation: The polygenic index is correlated with environmental factors that affect educational attainment. For example, a person’s polygenic index is correlated with the polygenic indexes of that person’s biological parents. Rearing parents’ polygenic indexes affect the environment in which the person grows up. For example, if the parents are more educated, they are likely to earn higher incomes and live in a neighborhood with well-funded schools (where local funding matters), which may provide educational advantages to their child. (Another source of gene-environment correlation is “population stratification,” in which certain genetic variants are more common in certain genetic ancestries, e.g., English versus Scottish. This can generate “population stratification bias” if having those genetic ancestries is also associated with cultural influences that affect educational attainment. However, population stratification bias should be largely reduced by the “quality control” procedures of our GWAS; see FAQ 2.2.)

  • Assortative mating: Having a higher polygenic index is correlated with having other SNPs that are also associated with greater educational attainment. Assortative mating on educational attainment refers to the fact that, on average, there is a tendency for people to marry and have children with other people who have a similar amount of education (see FAQ 2.8). Consequently, people who inherit SNPs associated with higher educational attainment from one of their parents are also more likely than average to inherit SNPs associated with higher educational attainment from their other parent. Thus, the SNPs included in the polygenic index are correlated with SNPs not included in the polygenic index in such a way that magnifies the index’s predictive power.

The key idea of the family-based analyses is to study the predictive power of the polygenic index controlling for the polygenic indexes of an individual’s parents. This allows us to control for the second and third sources of a polygenic index’s predictive power from the bullet list above and isolate the first, which is commonly referred to as “direct genetic effects.” (This terminology is used to distinguish effects of SNPs on one’s own outcome, the “direct effects” of a SNP, from effects on someone else’s outcome, which are called “indirect effects.” An example of indirect effects appears in the second bullet above: parents’ SNPs affecting the educational attainment of someone else, namely their child.) When controlling for the polygenic indexes of a person’s parents, the association between the person’s polygenic index and that person’s outcomes captures only direct genetic effects. We therefore call it the “direct effect” of the polygenic index.

While we have noted that the predictive power of the polygenic index as a whole does not necessarily reflect causal effects (see FAQ 1.6)—and indeed, the second and third categories above generate predictive power that is correlational but not causal—the component of the polygenic index’s predictive power that is due to “direct effects” does reflect the causal effects of some SNPs. Because of that, when we identify how much of the predictive power is due to direct effects, we interpret it as telling us how much is due to causal effects. However, we cannot infer that it is the SNPs included in the polygenic index that have those causal effects; the “direct effects” might be due to correlation between measured non-causal SNPs and unmeasured causal SNPs.  

Controlling for the polygenic indexes of a person’s parents requires having genetic information on the person’s parents. Such data is not available in most of the samples available to us, but we have it or can construct it in some of the samples. The Generation Scotland sample contains data on ~3,500 trios: individuals and both their parents. Two other samples—the UK Biobank and the Swedish Twin Registry (where we use only the fraternal twins)—contain large numbers of siblings, ~53,000 individuals in total. From the sibling data, we use a recently developed method (Young et al., 2020) to impute (i.e., statistically partially reconstruct) parental genetic data. These are the samples we use for our family-based analyses.

In our analyses, we compare the direct effect of the polygenic index (i.e., controlling for the polygenic indexes of an individual’s parents) with the “population effect,” which is the term we use for the association between the polygenic index and an outcome when we do not control for the polygenic indexes of an individual’s parents. In contrast to the direct effect of the polygenic index, which captures only the source of predictive power in the first bullet point above, the population effect captures all three sources of predictive power in the bullet list above. Our analyses estimate the ratio of the direct effect of a polygenic index to its population effect. This ratio tells us what fraction of a polygenic index’s association with some outcome is due to direct genetic effects.

When the educational attainment polygenic index is used to predict educational attainment itself, we estimate that this ratio is 0.556. That is, we estimate that 56% of the association between the polygenic index and educational attainment is due to direct genetic effects, and the remainder—44%—is due to the other sources of predictive power representing the second and third bullets above.
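To make the distinction between the population effect and the direct effect concrete, here is a minimal simulation sketch in Python. It is not the method used in the paper. It assumes random mating and includes only one of the non-direct sources (parents' indexes shaping the rearing environment), and every number in it (the effect sizes 0.3 and 0.2, the sample size, the noise levels) is made up for illustration.

```python
import random

def cov(xs, ys):
    """Sample covariance of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def simulate(n=20000, direct=0.3, indirect=0.2, seed=1):
    """Return (population_effect, direct_effect) estimated from simulated families.

    All parameter values are hypothetical; they are not estimates from the paper.
    """
    rng = random.Random(seed)
    child, parents, outcome = [], [], []
    for _ in range(n):
        mother, father = rng.gauss(0, 1), rng.gauss(0, 1)   # parental indexes
        # A child's index is the parental average plus random segregation noise.
        c = (mother + father) / 2 + rng.gauss(0, 0.5 ** 0.5)
        # The outcome depends on the child's own index (direct effect) and on
        # the parents' indexes through the rearing environment (indirect effect).
        y = direct * c + indirect * (mother + father) + rng.gauss(0, 1)
        child.append(c)
        parents.append(mother + father)
        outcome.append(y)

    # Population effect: regress the outcome on the child's index alone.
    population_effect = cov(child, outcome) / cov(child, child)

    # Direct effect: regress on the child's index while controlling for the
    # parents' indexes (two-predictor least squares, solved in closed form).
    s_cc, s_pp, s_cp = cov(child, child), cov(parents, parents), cov(child, parents)
    s_cy, s_py = cov(child, outcome), cov(parents, outcome)
    direct_effect = (s_pp * s_cy - s_cp * s_py) / (s_cc * s_pp - s_cp ** 2)
    return population_effect, direct_effect

population_effect, direct_effect = simulate()
print(f"population effect: {population_effect:.2f}")
print(f"direct effect:     {direct_effect:.2f}")
print(f"ratio:             {direct_effect / population_effect:.2f}")
```

In this toy setup the regression without parental controls picks up both the direct and the indirect contribution, so the population effect is larger than the direct effect, and the ratio of the two is the kind of quantity reported above.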

Next, we sought to determine this ratio when using the educational attainment polygenic index to predict other outcomes, such as the ten diseases we analyzed in the paper (see FAQ 2.6). However, we cannot study this same set of diseases in our family-based analyses because our trio and sibling samples either do not contain data on the diseases or (in the case of the UK Biobank siblings) do not contain a sufficient number of individuals that have one of the diseases. Instead, to estimate the fraction of the educational attainment polygenic index’s association with complex diseases that is due to direct genetic effects, we study a set of 22 health, cognitive, and socioeconomic outcomes. These include several biomarkers related to disease risk, such as BMI, blood pressure, and cholesterol. The set of outcomes also includes height, cognitive performance, smoking, alcohol use, income, and depression. For each of these 22 outcomes, we estimate both the direct effect and the population effect of the polygenic index for educational attainment.

On average across the 22 outcomes, we estimate that the ratio of direct to population effects is 0.588. This is very similar to the ratio when the outcome is educational attainment, and the conclusion is correspondingly similar: we estimate that 59% of the association between the polygenic index and these other outcomes is due to direct genetic effects, and the remaining roughly 41% is due to the other sources of predictive power.

In summary, our family-based analyses find that a substantial part of the predictive power of the polygenic index is due to direct effects, and a substantial part is not. This is true both when using the educational attainment polygenic index to predict educational attainment itself, and when using it to predict other outcomes.

The finding that much of the predictive power is due to direct effects is important for at least three reasons. First, it shows that biases in the GWAS, such as unaccounted-for “population stratification bias” (see FAQs 1.4 & 2.2), are not entirely responsible for the predictive power that we find. This had been shown previously for predicting educational attainment but not for using the educational attainment polygenic index to predict other outcomes. 

Second, the finding is a preliminary step toward unpacking the reasons why the genetic influences on educational attainment also matter for other outcomes. One possibility is that SNPs influence an outcome that in turn separately influences both education and health. For example, it could be that genetic influences on conscientiousness partially affect both how much education a person gets and also health-promoting behaviors that reduce disease risk. Another possibility is that SNPs influence educational attainment and there is something about formal schooling that, in turn, causes people to engage in more health-promoting behaviors. While our paper does not distinguish between these and other possibilities, our results are informative about the overall strength of the relationship between genetic influences on educational attainment and on certain other outcomes.

Third, the finding that much of the polygenic index’s predictive power is not due to direct effects—either for educational attainment or for the disease-related biomarkers and outcomes we investigated—is also important. It reinforces the importance of interpreting genetic associations with caution. Our finding implies that a substantial part of the predictive power of the polygenic index is due to some mix of assortative mating and gene-environment correlation. For this and other reasons, we believe it is misleading to use phrases such as “innate ability” or “genetic endowments” to describe what is measured by polygenic indexes based on our GWAS estimates. These phrases incorrectly imply that the polygenic index is entirely capturing direct effects, and they further ignore the potentially important role that environmental factors play in mediating direct effects.  

2.8       What did you find in the analysis of assortative mating?

Assortative mating refers to the idea that people tend to have children with people who are similar to themselves in particular ways. Assortative mating is a research topic in the social sciences and also in the field of genomics. 

Much prior research has found that there is assortative mating on educational attainment: i.e., the available data reveals a tendency for people to have children with people whose educational attainment is similar to their own (e.g., Mare, 1991). For example, in the UK Biobank, we estimate that the correlation between the educational attainments of mates (i.e., biological mothers and fathers) is roughly 0.4. That is a moderately sized correlation. In recent decades in Western countries, educational attainment has been one of the outcomes with the strongest assortative mating. In our analysis of assortative mating, we use height as an outcome to compare with educational attainment because it is another outcome for which assortative mating is relatively strong. In the UK Biobank, we estimate that the correlation between mates’ heights is roughly 0.3.

In our paper, we study assortative mating using polygenic indexes, which has also been done in some prior research (e.g., Conley et al., 2016; Hugh-Jones et al., 2016; Robinson et al., 2017; Yengo, Robinson, et al., 2018). In each of the datasets we use, we identify mate pairs based on the genetic data: we find pairs of individuals who share a child. We identify 894 mate pairs in the UK Biobank and 2,964 mate pairs in Generation Scotland. Averaging across these data sources, we find that the mate correlation in the polygenic index for educational attainment is roughly 0.17. We again compare with height, this time using a polygenic index for height constructed using the largest published height GWAS that was not based on the datasets we study (Wood et al., 2014). We find that the mate correlation in the polygenic index for height is roughly 0.10.

What is new in our paper is that we use these correlations to test a model of assortative mating that is often assumed in the genetics literature, called the “phenotypic assortment model.” When applied to educational attainment, this model states that the mate-pair correlation in the polygenic index for educational attainment is entirely due to the mate-pair correlation in educational attainment. Given the predictive power of the polygenic index, which we estimated (see FAQ 2.4), this model makes a precise prediction about how the mate-pair correlation in an outcome should be related to the mate-pair correlation in the polygenic index for that outcome.

For height, that prediction comes close to what we observe. That is, for height, it does appear that the mate-pair correlation in the polygenic index for height is entirely due to people tending to marry others with similar height. However, for educational attainment, the prediction of the phenotypic assortment model is far off: the mate-pair correlation in the polygenic index for educational attainment is too high. Thus, for educational attainment, our results provide strong evidence against the phenotypic assortment model. Instead, our results imply that people are marrying other people who are similar to them based on some factor or factors other than educational attainment (perhaps in addition to educational attainment itself) that are correlated with the polygenic index for educational attainment.
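The logic of the test can be sketched with back-of-the-envelope arithmetic. Under a simple linear version of the phenotypic assortment model (an assumption of this sketch), the mate-pair correlation in the polygenic index equals the mate-pair correlation in the outcome multiplied by the share of outcome variance that the index predicts. The Python snippet below applies that rule to the rounded figures quoted in this FAQ for educational attainment: a mate correlation of roughly 0.4, predictive power of roughly 13%, and an observed index correlation of roughly 0.17. The function name is ours, and the paper's own calculation may differ in its details.

```python
def predicted_pgi_mate_correlation(phenotype_mate_corr, pgi_r2):
    """Mate-pair correlation in the polygenic index implied by pure phenotypic assortment.

    If mates resemble each other genetically only because they match on the
    outcome itself, the index correlation equals the outcome correlation scaled
    by the share of outcome variance the index predicts (its R^2).
    """
    return phenotype_mate_corr * pgi_r2

# Rounded figures quoted in this FAQ for educational attainment.
predicted = predicted_pgi_mate_correlation(phenotype_mate_corr=0.4, pgi_r2=0.13)
observed = 0.17

print(f"predicted under phenotypic assortment: {predicted:.3f}")
print(f"observed: {observed:.2f}")
```

With these rounded inputs the predicted correlation is several times smaller than the observed one, which is the sense in which the observed correlation is "too high" for the model.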

We conduct additional analyses to shed light on what these other factors might be. One possible factor is genetic ancestry, which in our data might reflect, for example, people being more likely to marry others from the same city or region of the UK. Another possible factor is cognitive performance. However, we find that assortative mating on these two factors, even when added to the effect of assortative mating on educational attainment itself, is not sufficient to fully account for the mate-pair correlation in the polygenic index for educational attainment. While our results raise the question of what else explains the high mate-pair correlation in the polygenic index, we cannot fully answer the question with the data we have.

In addition to helping us better understand assortative mating, our results also relate to a common theme across several of our analyses (see FAQs 2.4, 2.6 & 2.7): helping us better understand the sources of the polygenic index’s predictive power. Specifically, we draw two conclusions about the polygenic index’s predictive power from our analysis of assortative mating. First, there are factors besides educational attainment on which people assortatively mate that contribute to the mate-pair correlation in the polygenic index for educational attainment—and these factors in turn likely contribute to the predictive power of the polygenic index for a range of outcomes. Suppose one of these factors is the region where a person grew up. In order to be a factor that contributes to the mate-pair correlation, the region where a person grew up must be associated with the polygenic index. Moreover, the region where a person grew up is likely to be associated with many other things that relate to educational, socioeconomic, and health outcomes, such as quality of local schools, local economic opportunities, air quality, and so on. If the region where a person grew up is correlated with both the polygenic index for educational attainment and these various outcomes, then it is one component of the gene-environment correlation that helps explain the polygenic index’s predictive power (see FAQ 2.4). Thus, the results of our assortative mating analysis provide evidence that there is substantial gene-environment correlation that likely contributes to the polygenic index’s predictive power.

Second, if people assortatively mate on factors that are correlated with the educational attainment polygenic index—as our results imply that they do—then this increases the variation of the polygenic index in the population and thereby magnifies its predictive power. To continue the example from above, imagine an extreme scenario of exact assortative mating on where a person grew up. That is, everyone has a mother and a father who are from the same region. In this scenario, there will be more people with very high and very low educational attainment polygenic indexes compared to a scenario where people marry at random across regions. That is because people from regions with high average polygenic indexes are marrying other people from the same region and having offspring that are relatively more likely to also have a high polygenic index. Similarly, people from regions with low polygenic indexes are marrying other people from their same region and having offspring that are relatively more likely to also have a low polygenic index. Thus, relative to the scenario of people marrying at random across regions, there is more variation of the polygenic index across people in the scenario with assortative mating. Consequently, variation in the polygenic index across people will be associated with (and hence “statistically predict”) more of the variation in educational attainment across people.
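The extreme scenario described above can be illustrated with a small simulation. The Python sketch below is a toy model, not an analysis from the paper: it assumes two equally sized regions whose average polygenic index differs (the values +1 and -1 are deliberately exaggerated and entirely hypothetical), and it compares the spread of offspring indexes when parents are matched at random with the spread when both parents always come from the same region.

```python
import random
import statistics

def offspring_index_variance(assortative, n=20000, seed=2):
    """Variance of offspring polygenic indexes in a toy two-region population.

    Hypothetical setup: two equally sized regions whose average index is +1 and
    -1, with a standard deviation of 1 within each region.
    """
    rng = random.Random(seed)
    region_means = (1.0, -1.0)
    offspring = []
    for _ in range(n):
        mother_region = rng.choice(region_means)
        # Under exact assortative mating the father comes from the mother's
        # region; under random mating his region is drawn independently.
        father_region = mother_region if assortative else rng.choice(region_means)
        mother = rng.gauss(mother_region, 1)
        father = rng.gauss(father_region, 1)
        # Offspring index: parental average plus random segregation noise.
        offspring.append((mother + father) / 2 + rng.gauss(0, 0.5 ** 0.5))
    return statistics.variance(offspring)

random_mating = offspring_index_variance(assortative=False)
regional_mating = offspring_index_variance(assortative=True)
print(f"random mating:   {random_mating:.2f}")
print(f"regional mating: {regional_mating:.2f}")
```

In this toy population the offspring variance is larger under regional mating, because matching parents by region stops the regional differences from averaging out within couples.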

2.9       What did you find in the analysis of the X chromosome?

Like our most recent GWAS of educational attainment (Lee et al., 2018)—but unlike most GWASs—this study also examined genetic variants on the X chromosome. In addition to the 3,952 variants identified on the autosomes (the non-sex chromosomes), we identified 57 variants associated with educational attainment on the X chromosome.

The results of our analysis of the X chromosome in this study are fully consistent with the results from the previous GWAS of educational attainment (but our confidence in these results is even greater in the current study because of our larger sample size). For example, as in the previous study, we found fewer SNPs associated with educational attainment on the X chromosome than on other chromosomes of similar length. Also as in the previous study, in separate GWASs of men and women, we found that, in aggregate, SNPs on the X chromosome predict similar amounts of variation in educational attainment in men and in women. Some researchers had hypothesized that genetic influences on the X chromosome are an important source of differences in the variance in cognitive performance across men and women. While there were compelling scientific reasons to view such claims skeptically even prior to the publication of our earlier study, the results of both of our studies provide further evidence against the hypothesis.

2.10     What did you do in the “dominance GWAS” of educational attainment? What did you find?

In addition to our standard GWAS of educational attainment on the autosomes (i.e., the non-sex chromosomes) described in FAQ 1.3 above, we also conducted a “dominance GWAS” of educational attainment on the autosomes. As in a standard GWAS, in a dominance GWAS we test each SNP for its association with educational attainment. The only difference is that, unlike in a standard GWAS, in a dominance GWAS, we allow for the possibility that each SNP has a non-linear relationship with educational attainment. A linear relationship is one where an increase in one variable is associated with a correspondingly sized increase or decrease in another variable. Specifically, suppose (as is typical) there are three possible combinations of alleles at a given SNP: let’s call them AA, AB, and BB. And let’s assume that the B allele is associated with greater educational attainment. A standard GWAS assumes the “additive model,” according to which the effect of going from zero to one B allele (i.e., AA to AB) is equal to the effect of going from one to two B alleles (i.e., AB to BB). In contrast, a dominance GWAS separately estimates each of these two effects and thereby allows us to test whether or not they are equal. 

(In this context, the term “dominance” originally comes from the idea of “dominant” and “recessive” alleles. In the classical usage, often called “complete dominance,” an organism has a trait—for example, a pea is smooth rather than wrinkled—if there are any B alleles. In that case, AB and BB would both yield the dominant trait of a smooth pea, and only AA would yield the recessive trait of a wrinkled pea. A dominance GWAS allows for this possibility: a large effect of going from AA to AB but no effect of going from AB to BB. It also allows for other, more common possibilities. For example, in “incomplete dominance,” the effect of going from AA to AB is larger than the effect of going from AB to BB, but both effects are non-zero. Another possibility is “overdominance,” where an organism with the AB combination of alleles has more of the trait than an organism with AA or BB.)
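The patterns described above can be expressed with simple arithmetic on the average outcome in each genotype group. The Python sketch below shows only that per-SNP logic. An actual dominance GWAS estimates these quantities by regression across many individuals, and the group averages used here are hypothetical numbers in arbitrary units.

```python
def allele_step_effects(mean_aa, mean_ab, mean_bb):
    """Split genotype-group averages into per-step, additive, and dominance terms."""
    first_b = mean_ab - mean_aa      # effect of going from AA to AB
    second_b = mean_bb - mean_ab     # effect of going from AB to BB
    additive = (mean_bb - mean_aa) / 2             # average effect per B allele
    dominance = mean_ab - (mean_aa + mean_bb) / 2  # AB's departure from the midpoint
    return {"first_b": first_b, "second_b": second_b,
            "additive": additive, "dominance": dominance}

# Hypothetical group averages (arbitrary units), one scenario per pattern.
scenarios = {
    "additive":             (0.0, 1.0, 2.0),
    "complete dominance":   (0.0, 1.0, 1.0),
    "incomplete dominance": (0.0, 1.0, 1.5),
    "overdominance":        (0.0, 2.0, 1.0),
}
for name, means in scenarios.items():
    print(name, allele_step_effects(*means))
```

Under the additive model the two steps are equal and the dominance term is zero. In the other three scenarios the steps differ, so the dominance term is non-zero, and that difference is what a dominance GWAS tests for.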

To many researchers, it seems natural to expect that relationships between SNPs and outcomes like educational attainment would be non-linear—that is, that there might be less of an effect, or no effect at all, of going from one B allele to two B alleles. Indeed, there is a long tradition in behavior genetics research (much of which compares outcomes across identical and fraternal twins) of assuming that non-linear relationships between SNPs and socio-behavioral outcomes account for a non-trivial fraction of the variation in such outcomes across individuals (e.g. Jinks and Eaves, 1974).

Perhaps surprisingly, there is an equally long tradition in a field of research called quantitative genetics showing that, theoretically, for outcomes like educational attainment that are influenced by many SNPs, deviations from linear relationships are likely to be small and account only for a small fraction of the variation in outcomes across individuals (e.g., Hill, Goddard and Visscher, 2008). There are a variety of theoretical arguments, which stem from both biological and statistical considerations. Based on other kinds of studies that are not dominance GWASs, for many outcomes (but not educational attainment), there is also evidence that dominance deviations account for only a small fraction of the variation across individuals (e.g., Pazokitoroudi et al., 2020; Hivert et al., 2021).

Thus, researchers are divided about whether the relationships between SNPs and socio-behavioral outcomes are likely to be largely linear or to have significant dominance effects. Partly because the dominance GWAS needed to answer that question is more complex than a standard GWAS, ours is one of the first large-scale dominance GWASs conducted for any outcome. We conducted our dominance GWAS in a sample of individuals from 23andMe and the UK Biobank. It was a slightly smaller sample than our standard GWAS, but still a very large sample: ~2.6 million individuals.

Our results strongly support the view of those researchers who believe that, with respect to educational attainment, deviations from linear relationships are likely to be small and account only for a small fraction of the variation in outcomes across individuals. We cannot identify any SNPs with such a non-linear relationship to educational attainment, despite our very large sample size. While some non-linear relationships between SNPs and educational attainment probably exist, our results indicate that the non-linear effects of such SNPs must be at least an order of magnitude smaller than the (already very small) linear effects of SNPs on educational attainment. 

Even though we cannot identify any dominance effects of specific SNPs, we can use the aggregate results of our dominance GWAS to estimate how much of the variation in educational attainment is explained by the dominance effects that do exist among SNPs in our analysis. Our results suggest that, taken altogether, dominance effects of the SNPs included in our GWAS account for only roughly 0.02% (two hundredths of one percent) of the variation in educational attainment across individuals.

We note two important qualifications about how our findings should be interpreted. First, our results leave open the possibility that there are rare SNPs that have large non-linear relationships with educational attainment. The data included in our GWAS, as in almost all GWASs, are common SNPs (see FAQ 2.1). These common SNPs capture most of the information about common ways in which people vary genetically (e.g., at least 1% of the population has a different genotype than the remainder of the population). However, these common SNPs do not capture information about rare SNPs (e.g., over 99% of the population has the same genotype, but a small percentage of people differ). Our results are therefore silent about whether these rare SNPs may have substantial non-linear relationships with educational attainment.

Second, although the dominance effects of SNPs included in our GWAS account for only a tiny fraction of the variation across individuals, the combined effect of dominance across many SNPs on a particular individual can nonetheless be substantial. In particular, when two close relatives have offspring, the offspring will have an unusually large number of AA and BB SNPs and an unusually small number of AB SNPs (because the parents are more likely than unrelated individuals to both have the same allele, either A or B). While there is a lot of variation across SNPs and outcomes, on average, having the same two alleles at a SNP, i.e., having AA or BB rather than AB, is known to be harmful to an organism. When an individual’s recent genetic ancestors are closely genetically related, there can be a noticeably harmful effect on certain outcomes due to the unusually large number of AA and BB genetic variants. Using our dominance GWAS results, we estimate that the offspring of first cousins will have, on average, roughly one month less formal schooling than the offspring of unrelated individuals.
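The shift from AB toward AA and BB genotypes can be made concrete with a standard population-genetics formula, in which the expected share of AB genotypes shrinks in proportion to the inbreeding coefficient F. The Python sketch below uses the textbook value F = 1/16 for the offspring of first cousins, which is not stated in this FAQ, together with a hypothetical allele frequency of 0.5. It illustrates only the change in genotype frequencies and does not reproduce the schooling estimate quoted above.

```python
def genotype_frequencies(freq_b, inbreeding=0.0):
    """Expected AA, AB, BB frequencies for a B-allele frequency and inbreeding coefficient F."""
    p, q = freq_b, 1 - freq_b
    het = 2 * p * q * (1 - inbreeding)          # AB becomes rarer as F rises
    return {"AA": q * q + p * q * inbreeding,
            "AB": het,
            "BB": p * p + p * q * inbreeding}

unrelated = genotype_frequencies(0.5)
first_cousins = genotype_frequencies(0.5, inbreeding=1 / 16)
print(unrelated)
print(first_cousins)
```

At any single SNP the shift is small, but it applies in the same direction across a very large number of SNPs, which is why the combined effect on an individual can be noticeable.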

3          Ethical and social implications of the study
3.1       Did you find “the gene for” educational attainment?

No.

We did not find “the gene for” educational attainment or anything else. We identified many SNPs that are associated with educational attainment. Although it was once believed that scientists would discover a few strong associations between genes and outcomes, we have known for a number of years that the vast majority of human outcomes are complex and influenced by many thousands of genes, each of which alone tends to have a small influence on the relevant outcome.

Furthermore, many complex outcomes are also influenced by parts of the genome that are not genes at all but instead serve to regulate genes (e.g., sequences of DNA that influence when a gene is turned on or off). Genes typically contain many SNPs (often dozens or hundreds, and in some cases thousands), and there are even more SNPs outside of genes than inside genes. Complex outcomes are often influenced by millions of SNPs.

3.2       Well, then, did you find “the genes for” educational attainment?

Although we did find many SNPs that are associated with educational attainment, we believe that characterizing these as “genes for educational attainment” is still likely to mislead, for many reasons. 

First, most of the variation in people’s educational attainment is accounted for by social and other environmental factors, not by additive genetic effects (see FAQ 1.8). “Genes for educational attainment” might be read to imply, incorrectly, that genes are the strongest predictor of variation in educational attainment.

Second, the SNPs that are associated with educational attainment are also associated with many other things (only some of which we identify in this study; see FAQs 2.6 & 2.7). These SNPs are no more “for” educational attainment than for the other outcomes with which they are associated.

Third, the “predictive” power (see FAQ 1.5) of each individual SNP that we identify is very small. Our results show that genetic associations with educational attainment involve thousands, or even millions, of SNPs, each of which has a tiny effect size. Each SNP is therefore weakly associated with, rather than a strong influence on, educational attainment. “Genes for educational attainment” might misleadingly imply the latter.  

Fourth, environmental factors can increase or decrease the impact of specific SNPs. Put differently, even if a SNP is associated with higher or lower levels of educational attainment on average, it may have a much larger or smaller effect depending on environmental conditions. Indeed, in our most recent previous large-scale GWAS of educational attainment (Lee et al., 2018), we report exploratory analyses that provide evidence of such gene-environment interactions. Educational attainment couldn’t even exist as a meaningful object of measurement if we didn’t have schools, and having schools introduces societal mechanisms that influence who spends the most years attending them. Accordingly, genetic associations with educational attainment necessarily will be mediated by societal systems and therefore genetic variation should often be expected to interact with environmental factors when it influences social phenomena, such as educational attainment. “Genes for educational attainment” suggests a stability in the relationship between these genes and the outcome of educational attainment that does not exist.

Finally, genes do not affect educational attainment directly (see FAQ 1.4), although we don’t know exactly why the SNPs we identify are associated with educational attainment. We found in our most recent previous large-scale GWAS of educational attainment (Lee et al., 2018) that the genes identified as associated with educational attainment tend to be especially active in the brain and involved in neural development and neuron-to-neuron communication. The “predictive” power (see FAQ 1.5) of genes on educational attainment might therefore partly depend on a long process starting with brain development, followed by the emergence of particular psychological outcomes (e.g., cognitive performance and personality). These outcomes might then lead to behavioral tendencies as well as experiences and treatment by parents, peers, and teachers. All of these factors may additionally interact with the environment in which a person lives. Eventually these outcomes, behaviors, and experiences might influence (but not completely determine) educational attainment. Much more research is needed to explore these and other possible explanations for the relationship between SNPs and educational attainment.

3.3       Does this study show that an individual’s level of educational attainment (or any other outcome) is determined, or fixed, at conception? Do genes determine the choices we make and who we become?

No and no.

Genes and genetic variation do not determine our choices or who we become. If they did, identical twins would make all of the same decisions, have the same interests, etc. Years of twin studies have shown that, while identical twins tend to be more similar than fraternal twins—including with respect to the years of formal schooling they complete—they are nevertheless different (see FAQ 1.8). This implies that environmental factors also play a large role in our outcomes. In the case of educational attainment, social and other environmental factors account for most variation among people.

But even if it were true that genetic factors accounted for all of the differences among individuals in educational attainment, it would still not follow that an individual’s number of years of formal schooling is “determined” at conception. There are at least three reasons for this.

First, some genetic effects operate through environmental channels (Jencks, 1980). When this is the case, SNPs that are associated with an outcome in one setting might not be associated with it in another setting. As an illustrative example, suppose—hypothetically—that the SNPs we identified are associated with educational attainment because they help students to memorize and, as a result, to become better at taking tests that rely on memorization (in fact, we do not know why the SNPs we identified are associated with educational attainment; see FAQ 3.2). In this example, changes to the intermediate environmental channels—the type of tests administered in schools—could have drastic effects on individuals’ educational attainment, even though individuals’ DNA would not have changed. A genetic association with educational attainment might not be found at all if schools did not use tests that rely on memorization. More generally, the genetic associations that we found might not apply as strongly if the education system were organized differently than it is at present (see also FAQ 1.4).

Second, even if the genetic associations with educational attainment operated entirely through non-environmental mechanisms that are difficult to modify (such as direct influences on the formation of neurons in the brain and the biochemical interactions among them), there could still exist powerful environmental interventions that could cancel out what would have been the effect of SNPs. Consider a famous example suggested by the economist Arthur Goldberger. Genes influence eyesight at least partly through biological mechanisms that themselves are hard to change. Yet even if all variation in unaided eyesight were due to genes, there would still be enormous benefits from introducing eyeglasses, which can erase the contribution of genes to that outcome (Goldberger, 1979). Conversely, someone genetically predisposed towards being taller than average might end up being shorter than average if they lacked adequate nutrition during childhood. In the context of educational attainment, policies such as a required minimum number of years of education and dedicated resources for individuals with learning disabilities can increase educational attainment in the entire population and/or reduce differences among individuals—all without, of course, changing anyone’s DNA. 

 

Third, even if the genetic effects on educational attainment were not influenced by changes in the environment, those environmental changes themselves could still have a major impact on the educational attainment of the population as a whole. For example, if young children were given more nutritious diets, then everyone’s school performance might improve, and college graduation rates might increase. By analogy, 80%-90% of the variation across individuals in height is due to genetic factors. Yet the current generation of people is much taller than past generations due to changes in the environment such as improved nutrition.
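The height analogy can be made concrete with a toy simulation. The sketch below is illustrative only: the average height, the standard deviations, and the 8 cm shift are hypothetical numbers chosen so that genetic differences account for roughly 80% of the variation, and the model assumes that genetic and environmental contributions simply add up. It shows that an environmental improvement shared by everyone raises the average while leaving the share of variation associated with genetic differences unchanged.

```python
import random
import statistics


def variance_share(genetic, outcome):
    """Squared correlation between genetic values and the outcome."""
    mean_g = statistics.fmean(genetic)
    mean_o = statistics.fmean(outcome)
    cov = sum((g - mean_g) * (o - mean_o) for g, o in zip(genetic, outcome))
    var_g = sum((g - mean_g) ** 2 for g in genetic)
    var_o = sum((o - mean_o) ** 2 for o in outcome)
    return cov * cov / (var_g * var_o)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20_000

# Hypothetical inputs: genetic sd 6 cm, environmental sd 3 cm,
# so the genetic share of variance is about 36 / (36 + 9) = 0.8.
genetic = [rng.gauss(0, 6) for _ in range(n)]
environment = [rng.gauss(0, 3) for _ in range(n)]

height_before = [165 + g + e for g, e in zip(genetic, environment)]
# A uniform environmental improvement (for example, better nutrition).
height_after = [h + 8 for h in height_before]

share_before = variance_share(genetic, height_before)
share_after = variance_share(genetic, height_after)
mean_before = statistics.fmean(height_before)
mean_after = statistics.fmean(height_after)

print(f"mean height: {mean_before:.1f} cm -> {mean_after:.1f} cm")
print(f"genetic share of variance: {share_before:.3f} -> {share_after:.3f}")
```

The share of variation is a statement about differences among people, so it cannot say how much an outcome would change if the environment changed for everyone.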

3.4       Can the polygenic index from this paper be used to accurately predict a particular person’s educational attainment?

No. While the “predictive” power (see FAQ 1.5) of our polygenic index is substantial—it predicts 13% of variation in educational attainment across individuals with European genetic ancestries—and useful for some purposes (see FAQ 1.7), it is important to keep in mind that the score fails to predict the vast majority (87%) of variation in years of education across individuals. Many of those with low polygenic indexes go on to achieve high levels of education, and a large proportion of those with high polygenic indexes do not complete college.

Thus, an important message of this paper and our earlier papers is that DNA does not “determine” an individual’s level of education, for multiple reasons: First, it is estimated that, at least in the environments in which we have been measuring it, the additive effects of common SNPs will only ever predict about 20% of the variance in educational attainment across individuals. Second, today’s polygenic index is only able to predict about two-thirds of that 20% (i.e., 13 percentage points). Third, since SNPs matter more or less depending on environmental context (see FAQ 2.7), a polygenic index might be less (or more) predictive for individuals in some environments than for individuals in others. Fourth, polygenic indexes are most predictive when the individuals used to make the index have the same genetic ancestries as the people whose outcomes you would like to predict. For example, because the research in this paper is almost entirely based on individuals with European genetic ancestries, the polygenic index predicts only about 2% of the variance in individuals with African genetic ancestries (see FAQ 3.5). Finally, polygenic predictions only hold for as long as the environment in which they were developed remains substantially the same: if the laws or pedagogy underlying a population’s educational system change substantially, then so, too, might the optimized polygenic index. Just as eyeglasses allow those genetically predisposed to poor vision to have nearly perfect vision, innovations in education (say, an innovation that makes education irresistibly engaging, thus mitigating the risk to those with SNPs associated with lower propensity to pay attention or avoid distraction) might result in those with lower polygenic indexes actually achieving just as much education, on average, as those with higher polygenic indexes (see also FAQs 3.2 and 3.3).

As sample sizes for GWASs continue to grow, it will likely be possible to construct a polygenic index for educational attainment whose predictive power comes closer to 20% of the variance in educational attainment across individuals (Rietveld et al., 2013). Even this level of predictive power would pale in comparison to some other scientific predictors. For example, professional weather forecasts correctly predict about 95% of the variation in day-to-day temperatures. 

The results of SSGAC studies have sometimes been used by online platforms, including some companies, to predict individual outcomes. We recognize that returning individual genomic “results” can be a fun way to engage people in research and other projects and to feed or stoke their interest in genomics. But it is important that participants/users understand that these individual results are not meaningful predictions and should be regarded essentially as entertainment. Failure to make this point clear risks sowing confusion and undermining trust in genetics research.

3.5       Can your polygenic index be used for research studies in diverse genetic ancestry populations?

Only in a limited way. As a practical matter, it is possible to calculate a polygenic index for any individual for whom genome-wide data is available, but the polygenic index will be most “predictive” (see FAQ 1.5) in populations of European genetic ancestries.

Our study was conducted only using samples of individuals of European genetic ancestries (see Appendix 1). The set of SNPs that are associated with educational attainment in people of European genetic ancestries is unlikely to overlap perfectly with the set of SNPs associated with educational attainment in people of other genetic ancestries. And even if a given SNP is associated in both genetic ancestry groups, the effect size—in other words, the strength of the association—will likely differ. This is partly because linkage disequilibrium (LD) patterns (i.e., the correlation structure of the genome) vary by genetic ancestry. This matters because a variant may be associated with educational attainment only because it is in LD (i.e., correlated) with a variant elsewhere in the genome that causally affects education (see FAQ 1.4). If the strength of that correlation is greater in one genetic ancestry group than in another, then the size of the association will be larger in that genetic ancestry group. Moreover, even if LD patterns were similar in each genetic ancestry group, the association may differ in different groups because environmental conditions differ (see FAQs 1.4, 3.2 & 3.3). The fact that there are differences across genetic ancestry groups in the set of associated SNPs and their effect sizes has two important implications.

First, it means that polygenic indexes of individuals from different genetic ancestry groups cannot be meaningfully compared. A recent paper (Martin et al., 2017) illustrated this point in the context of polygenic indexes for predicting height; in the sample analyzed in that paper, polygenic indexes for height predict that individuals of European genetic ancestries would be taller than those of South Asian genetic ancestries, who in turn would be taller than those of African genetic ancestries. In actuality, however, populations of African genetic ancestries represented by the sample have similar height to populations of European genetic ancestries, and populations of both African and European genetic ancestries tend to be taller than populations of South Asian genetic ancestries.

Second, while polygenic indexes can be used to predict differences across individuals within a sample of people of non-European genetic ancestries, the amount of predictive power will be much smaller than in a sample of people of European genetic ancestries. Such an attenuation of predictive power has been repeatedly found in prior work (Belsky et al., 2013; Domingue et al., 2015, 2017; Vassos et al., 2017). Unfortunately, this attenuation means that for non-European genetic-ancestry populations, many of the benefits of having a polygenic index available will have to wait until large GWASs are conducted using samples from these populations. (Currently, most large genotyped samples are of European genetic ancestries.)
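The role of LD described earlier in this answer can be illustrated with a small simulation. This is a sketch under simplifying assumptions that are ours, not the paper's: genotypes are treated as standardized continuous values rather than allele counts, the causal effect (0.5) and the two LD correlations (0.9 and 0.3) are hypothetical, and the function name `tag_snp_association` is made up for this example. It shows that a measured "tag" SNP with no effect of its own picks up an association roughly equal to the causal effect multiplied by its correlation with the causal variant, so the same causal effect yields different association sizes in groups with different LD.

```python
import random


def slope(x, y):
    """Ordinary least-squares slope from regressing y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var


def tag_snp_association(ld_r, causal_effect, n, seed):
    """Estimated association of a non-causal tag SNP with the outcome.

    The tag SNP is correlated with the causal variant at ld_r and has no
    effect of its own. Standardized, continuous genotypes are a simplification.
    """
    rng = random.Random(seed)
    tag, outcome = [], []
    for _ in range(n):
        causal = rng.gauss(0, 1)
        # Build a variable with unit variance and correlation ld_r with causal.
        tag.append(ld_r * causal + (1 - ld_r ** 2) ** 0.5 * rng.gauss(0, 1))
        # Only the causal variant (plus noise) affects the outcome.
        outcome.append(causal_effect * causal + rng.gauss(0, 1))
    return slope(tag, outcome)


# Same causal effect, different LD between tag and causal variant.
high_ld = tag_snp_association(ld_r=0.9, causal_effect=0.5, n=50_000, seed=1)
low_ld = tag_snp_association(ld_r=0.3, causal_effect=0.5, n=50_000, seed=2)

print(f"association with strong LD (r = 0.9): {high_ld:.3f}")
print(f"association with weak LD   (r = 0.3): {low_ld:.3f}")
```

Weights estimated in the strong-LD group would overstate the tag SNP's association in the weak-LD group, which is one reason a polygenic index loses predictive power across ancestry groups.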

For a more extensive, excellent discussion of these and related issues, see Graham Coop’s blog post “Polygenic scores and tea drinking”: https://gcbias.org/2018/03/14/polygenic-scores-and-tea-drinking/.

For more on population stratification bias, see FAQs 1.4 & 2.2.

3.6       Should practitioners (e.g., in education or other domains) use the results of this study to make decisions?

No. Doing so would be extremely premature and unsupported by the science. As explained in FAQ 3.4, our polygenic index is only weakly “predictive” (see FAQ 1.5) of educational and health outcomes for individuals. Guessing whether a person has above- or below-average years of education using their polygenic index would only be slightly better than a coin flip: that will be unacceptable in most practice contexts. Nor can our results immediately be used to develop an intervention (say, to improve graduation rates by changing pedagogy) because we don’t know why the SNPs we identified are associated with educational attainment; much more research is needed to investigate that before any such interventions would be evidence-based. 

In this respect, our study is no different from GWASs of complex medical outcomes. There, as here, GWAS associations alone are not actionable for decisions being made by practitioners. They are only an important first step in basic science research that might someday be useful in helping practitioners make decisions. GWAS can help identify SNPs associated with an outcome of interest. Subsequent studies of those SNPs would then be needed to confirm their relationship to the outcome. 

When the outcome in question is socio-behavioral rather than clinical, there are additional questions about whether polygenic indexes might stigmatize, whether there are sufficient legal and other protections to prevent discrimination on the basis of polygenic indexes, and whether the expected benefits of using polygenic indexes in a particular practice setting would justify these risks. Addressing these questions would involve a great deal of multidisciplinary empirical and normative research.  

Although the results of our study are not immediately useful in practice, they are useful to social scientists (e.g., by allowing them to construct polygenic indexes that can be used as control variables in randomized controlled trials or in studies of gene-by-environment interactions; see FAQ 1.7).
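The coin-flip comparison in this answer can be checked with a short calculation. The sketch below rests on an assumption of ours that the paper does not state: the polygenic index and years of education are treated as jointly normal, in which case the chance that both fall on the same side of their averages is 1/2 + arcsin(r)/π, where r is the square root of the variance explained. Under that assumption, a polygenic index that predicts 13% of the variation would guess above- versus below-average correctly roughly 62% of the time, compared with 50% for a coin flip.

```python
import math


def share_guessed_correctly(r_squared):
    """Chance that index and outcome fall on the same side of their averages.

    Assumes the two are jointly normal (an illustrative assumption); uses
    the standard result P(same side) = 1/2 + arcsin(r) / pi.
    """
    r = math.sqrt(r_squared)
    return 0.5 + math.asin(r) / math.pi


# 0.13 is the variance explained reported in this FAQ; 0.20 is the
# approximate upper limit for common SNPs mentioned in FAQ 3.4.
for r_squared in (0.0, 0.13, 0.20, 1.0):
    share = share_guessed_correctly(r_squared)
    print(f"variance explained {r_squared:.2f} -> correct guesses {share:.3f}")
```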

3.7       Could this kind of research lead to discrimination against, or stigmatization of, people with the relevant genetic variants? What has been done to help avert the potential harms of this research?

Unfortunately, like much research, the results of our study and of research that builds on it could be misunderstood, misapplied in ways that are inconsistent with the science, and applied in ways that are unethical.

Genetics research in particular has a long history of being used to harm people, especially on the basis of racist and classist inferences. Indeed, the term “eugenics” was coined in the late 1800s by one of the most prominent early researchers of heredity, Francis Galton. In the first half of the 20th century, many prominent scientists, politicians, clergymen, and other influential individuals across the political spectrum were active proponents of the belief that socioeconomic disparities in society were primarily or exclusively caused by genetic factors and that existing social disparities simply reflected the natural order and were both inevitable and justified. These ideas, and their active development and endorsement by many in the scientific community, laid the groundwork for 20th century forced sterilizations, anti-miscegenation laws, eugenics-based immigration restrictions, and genocide. Today, racist individuals and groups continue to misinterpret and misuse the results of genetics research to give unjustified support for their agenda.

Acknowledging the harm done to certain groups in the name of science reminds us of the importance of careful communication of the implications of scientific research and the need for intense vigilance to ensure that disadvantaged groups are not further harmed by this and related work. Nevertheless, for a variety of reasons, in this instance, we do not think that the best response to the possibility that useful knowledge might be misused is to refrain from producing the knowledge. Here, we briefly discuss some of the broad potential benefits of this research. We then describe what we take to be our ethical obligation as researchers conducting this work. 

First, one benefit of conducting social-science genetics research in ever larger samples is that doing so allows us to correct the scientific record. An important theme in our earlier work has been to point out that most existing studies in social-science genetics that report genetic associations with behavioral outcomes have serious methodological limitations, fail to replicate, and are likely to be false-positive findings (Benjamin et al., 2012; Chabris et al., 2012, 2015). This same point was made in an editorial in Behavior Genetics (the leading journal for the genetics of behavioral outcomes), which stated that “it now seems likely that many of the published [behavior genetics] findings of the last decade are wrong or misleading and have not contributed to real advances in knowledge” (Hewitt, 2012). One of the most important reasons why earlier work generated unreliable results is that the sample sizes were far too small, given that the true effects of individual genetic variants on behavioral outcomes are tiny. Pre-existing claims of genetic associations with complex social-science outcomes have reported widely varying effect sizes, many of them purporting to “predict” (see FAQ 1.5) ten to one hundred times as much of the variation across individuals as did the genetic variants we found in this study and in our other studies. 

Second, behavioral genetics research also has the potential to correct the social record and thereby to help combat discrimination and stigmatization. For instance, at various times and places throughout human history (unfortunately, including the present day), girls and women have been discouraged or even prevented from pursuing as much education as their male counterparts. There are of course many reasons why that argument has been made and sometimes prevailed, but to the extent that it is rooted in a belief in genetically-based differences between males and females, our current study’s (and our previous study’s) analysis of the X chromosome finds no such evidence (see FAQ 2.9). Similarly, overestimating the role of genetics can be damaging, and the present work can help debunk this myth, too. Of the 20% of the variance in educational attainment that is related to the additive effects of common SNPs, we (Lee et al., 2018) and others have found that the relationship to educational attainment depends importantly on environmental factors. By clarifying the limits of deterministic views of complex outcomes, recent behavioral genetics research—if communicated responsibly—could make appeals to genetic justifications for discrimination and stigmatization less persuasive to the public in the future.

Third, behavioral genetics research has the potential to yield many other benefits, especially as sample sizes continue to increase—as briefly summarized in FAQ 1.7. Forgoing this research necessarily entails forgoing these and any other possible benefits, some of which will likely be the result of serendipity rather than being foreseeable. For instance, because educational attainment is measured in far larger genotyped samples than brain function, large-scale GWASs of educational attainment have provided better insights into brain function than GWASs to date that directly examine brain function, since the latter have necessarily been conducted in much smaller samples.

In sum, we agree with the U.K. Nuffield Council on Bioethics, which concluded in a report (Nuffield Council on Bioethics, 2002, p. 114) that “research in behavioural genetics has the potential to advance our understanding of human behaviour and that the research can therefore be justified,” but that “researchers and those who report research have a duty to communicate findings in a responsible manner.” In our view, responsible behavioral genetics research includes sound methodology and analysis of data; a commitment to publish all results, including any negative results; and transparent, complete reporting of methodology and findings in publications, presentations, and communications with the media and the public, including particular vigilance regarding what the results do—and do not—show and how they should—and should not—be used (hence, this FAQ document). In addition, we are developing a Terms of Use for researchers who would like to use our results in their own research. Researchers will agree to “have read and understand the principles articulated by the American Society of Human Genetics (ASHG) position statement: ‘ASHG Denounces Attempts to Link Genetics and Racial Supremacy.’ (See also International Genetic Epidemiological Society Statement on Racism and Genetic Epidemiology.).” Data-users also will acknowledge “I understand that comparisons of genetically predicted phenotype levels across ancestral groups are usually scientifically confounded due to the effects of linkage disequilibrium, gene-environment correlation, gene-environment interactions, and other methodological problems” (see FAQ 3.5).

Additional reading and references

Amos, C. I. et al. (2008) ‘Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1’, Nature Genetics, 40, pp. 616–622. doi: 10.1038/ng.109.

Anderson, E. L. et al. (2017) ‘The causal effect of educational attainment on Alzheimer’s disease: A two-sample Mendelian randomization study’, bioRxiv [https://doi.org/10.1101/127993].

Bansal, V. et al. (2017) ‘Genetics of educational attainment aid in identifying biological subcategories of schizophrenia’, bioRxiv [https://doi.org/10.1101/114405].

Barban, N. et al. (2016) ‘Genome-wide analysis identifies 12 loci influencing human reproductive behavior’, Nature Genetics, 48(12), pp. 1462–1472. doi: 10.1038/ng.3698.

Barcellos, S. H., Carvalho, L. S. and Turley, P. (2018) ‘Education can Reduce Health Disparities Related to Genetic Risk of Obesity: Evidence from a British Reform’, bioRxiv [https://doi.org/10.1101/260463]. doi: 10.1101/260463.

Belsky, D. W. et al. (2013) ‘Development and evaluation of a genetic risk score for obesity’, Biodemography and Social Biology, 59(1), pp. 85–100. doi: 10.1080/19485565.2013.774628.

Belsky, D. W. et al. (2016) ‘The Genetics of Success’, Psychological Science, 27(7), pp. 957–972. doi: 10.1177/0956797616643070.

Benjamin, D. J. et al. (2012) ‘The Promises and Pitfalls of Genoeconomics’, Annual Review of Economics, 4(1), pp. 627–662. doi: 10.1146/annurev-economics-080511-110939.

Branigan, A. R. et al. (2013) ‘Variation in the Heritability of Educational Attainment: An International Meta-Analysis’, Social Forces, 92(1), pp. 109–140. doi: 10.1093/sf/sot076.

Bulik-Sullivan, B. K. et al. (2015) ‘LD Score regression distinguishes confounding from polygenicity in genome-wide association studies.’, Nature Genetics, 47(3), pp. 291–295. doi: 10.1038/ng.3211.

Cesarini, D. and Visscher, P. M. (2017) ‘Genetics and educational attainment’, npj Science of Learning, 2(1), p. 4. doi: 10.1038/s41539-017-0005-6.

Chabris, C. F. et al. (2012) ‘Most reported genetic associations with general intelligence are probably false positives’, Psychological Science, 23(11), pp. 1314–1323. doi: 10.1177/0956797611435528.

Chabris, C. F. et al. (2015) ‘The Fourth Law of Behavior Genetics’, Current Directions in Psychological Science, 24(4), pp. 304–312. doi: 10.1177/0963721415580430.

Cheesman, R. et al. (2020) ‘Comparison of Adopted and Nonadopted Individuals Reveals Gene–Environment Interplay for Education in the UK Biobank’, Psychological Science, 31(5), pp. 582–591. doi: 10.1177/0956797620904450.

Conley, D. et al. (2016) ‘Assortative mating and differential fertility by phenotype and genotype across the 20th century’, Proceedings of the National Academy of Sciences, 113(24), pp. 6647–6652.

Cutler, D. M. and Lleras-Muney, A. (2010) ‘Education and Health: Evaluating Theories and Evidence’, in House, J. et al. (eds) Making Americans Healthier: Social and Economic Policy as Health Policy. New York: Russell Sage Foundation, pp. 29–60.

Davies, N. M. et al. (2018) ‘The causal effects of education on health outcomes in the UK Biobank’, Nature Human Behaviour. doi: 10.1038/s41562-017-0279-y.

Domingue, B. W. et al. (2015) ‘Polygenic Influence on Educational Attainment: New evidence from The National Longitudinal Study of Adolescent to Adult Health’, AERA Open, 1(3), pp. 1–13. doi: 10.1177/2332858415599972.

Domingue, B. W. et al. (2017) ‘Mortality selection in a genetic sample and implications for association studies’, International Journal of Epidemiology, 46(4), pp. 1285–1294. doi: 10.1093/ije/dyx041.

Duncan, L. et al. (2019) ‘Analysis of polygenic risk score usage and performance in diverse human populations’, Nature Communications, 10(1), pp. 1–9. doi: 10.1038/s41467-019-11112-0.

Goldberger, A. S. (1979) ‘Heritability’, Economica, 46(184), pp. 327–347. Available at: http://www.jstor.org/stable/2553675.

Heath, A. C. et al. (1985) ‘Education policy and the heritability of educational attainment’, Nature, 314(6013), pp. 734–736. doi: 10.1038/314734a0.

Heckman, J. J. et al. (2010) ‘The rate of return to the HighScope Perry Preschool Program’, Journal of Public Economics, 94(1–2), pp. 114–128. doi: 10.1016/j.jpubeco.2009.11.001.

Hewitt, J. K. (2012) ‘Editorial policy on candidate gene association and candidate gene-by-environment interaction studies of complex traits.’, Behavior Genetics, 42(1), pp. 1–2. doi: 10.1007/s10519-011-9504-z.

Hill, W. G., Goddard, M. E. and Visscher, P. M. (2008) ‘Data and theory point to mainly additive genetic variance for complex traits.’, PLoS Genetics. Edited by T. F. C. Mackay, 4(2), p. e1000008. doi: 10.1371/journal.pgen.1000008.

Hivert, V. et al. (2021) ‘Estimation of non-additive genetic variance in human complex traits from a large sample of unrelated individuals’, American Journal of Human Genetics, 108(5), pp. 786–798. doi: 10.1016/J.AJHG.2021.02.014.

Houmark, M. A., Ronda, V. and Rosholm, M. (2020) The Nurture of Nature and the Nature of Nurture: How Genes and Investments Interact in the Formation of Skills. Bonn: Institute of Labor Economics (IZA). Available at: http://hdl.handle.net/10419/227307.

Hugh-Jones, D. et al. (2016) ‘Assortative mating on educational attainment leads to genetic spousal resemblance for polygenic scores’, Intelligence, 59, pp. 103–108. doi: 10.1016/j.intell.2016.08.005.

Hung, R. J. et al. (2008) ‘A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit genes on 15q25’, Nature. doi: 10.1038/nature06885.

Jencks, C. (1980) ‘Heredity, environment, and public policy reconsidered’, American Sociological Review, 45(5), pp. 723–736. Available at: http://www.jstor.org/stable/2094892.

Jinks, J. and Eaves, L. J. (1974) ‘IQ and Inequality’, Nature, 248(5446), pp. 287–289. Available at: https://doi.org/10.1038/248287a0.

van Kippersluis, H. and Rietveld, C. A. (2018) ‘Pleiotropy-robust Mendelian randomization’, International Journal of Epidemiology, 47(4), pp. 1279–1288. doi: 10.1093/ije/dyx002.

Kong, A. et al. (2018) ‘The nature of nurture: Effects of parental genotypes’, Science, 359(6374), pp. 424–428. doi: 10.1126/science.aan6877.

Lambert, J.-C. et al. (2013) ‘Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease’, Nature Genetics, 45(12), pp. 1452–1458. doi: 10.1038/ng.2802.

Lander, E. S. and Schork, N. J. (1994) ‘Genetic dissection of complex traits’, Science, 265, pp. 2037–48.

Lee, J. J. et al. (2018) ‘Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals’, Nature Genetics, 50(8), pp. 1112–1121. doi: 10.1038/s41588-018-0147-3.

Linnér, R. K. et al. (2017) ‘An epigenome-wide association study meta-analysis of educational attainment’, Molecular Psychiatry. doi: 10.1038/mp.2017.210.

Locke, A. E. et al. (2015) ‘Genetic studies of body mass index yield new insights for obesity biology’, Nature, 518(7538), pp. 197–206. doi: 10.1038/nature14177.

Mare, R. D. (1991) ‘Five decades of educational assortative mating’, American sociological review, pp. 15–32.

Marioni, Riccardo E. et al. (2016) ‘Genetic variants linked to education predict longevity’, Proceedings of the National Academy of Sciences, 113(47), pp. 13366–13371. doi: 10.1073/pnas.1605334113.

Marioni, Riccardo E et al. (2016) ‘The epigenetic clock and telomere length are independently associated with chronological age and mortality’, International journal of epidemiology, pp. 1–9. doi: 10.1093/ije/dyw041.

Martin, A. R. et al. (2017) ‘Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations’, American Journal of Human Genetics, 100(4), pp. 635–649. doi: 10.1016/j.ajhg.2017.03.004.

Nature Editors (2013) ‘Dangerous work’, Nature, 502(7469), pp. 5–6. doi: 10.1038/502005b.

Nuffield Council on Bioethics (2002) Genetics and human behaviour: the ethical context. London: Nuffield Council on Bioethics [http://nuffieldbioethics.org/wp-content/uploads/2014/07/Genetics-and-human-behaviour.pdf].

Okbay, A., Baselmans, B. M. L., et al. (2016) ‘Genetic variants associated with subjective well-being, depressive symptoms, and neuroticism identified through genome-wide analyses’, Nature Genetics, 48(6), pp. 624–633. doi: 10.1038/ng.3552.

Okbay, A., Beauchamp, J. P., et al. (2016) ‘Genome-wide association study identifies 74 loci associated with educational attainment’, Nature, 533(7604), pp. 539–542. doi: 10.1038/nature17671.

Parens, E. and Appelbaum, P. S. (2015) ‘An introduction to thinking about trustworthy research into the genetics of intelligence’, Hastings Center Report, 45(S1), pp. S2–S8. doi: 10.1002/hast.491.

Pazokitoroudi, A. et al. (2020) Quantifying the contribution of dominance effects to complex trait variation in biobank-scale data, bioRxiv. doi: 10.1101/2020.11.10.376897.

Pickrell, J. K. et al. (2016) ‘Detection and interpretation of shared genetic influences on 42 human traits’, Nature Genetics, 48(7), pp. 709–717. doi: 10.1038/ng.3570.

Rietveld, C. A. et al. (2013) ‘GWAS of 126,559 individuals identifies genetic variants associated with educational attainment’, Science, 340(6139), pp. 1467–1471. doi: 10.1126/science.1235488.

Ripke, S. et al. (2014) ‘Biological insights from 108 schizophrenia-associated genetic loci’, Nature, 511(7510), pp. 421–427. doi: 10.1038/nature13595.

Robinson, M. R. et al. (2017) ‘Genetic evidence of assortative mating in humans’, Nature Human Behaviour. doi: 10.1038/s41562-016-0016.

Ross, C. E. and Wu, C. (1995) ‘The links between education and health’, American Sociological Review, 60(5), pp. 719–745.

Sacerdote, B. (2007) ‘How Large are the Effects from Changes in Family Environment? A Study of Korean American Adoptees’, The Quarterly Journal of Economics, 122(1), pp. 119–157. doi: 10.1162/qjec.122.1.119.

Sacerdote, B. (2011) ‘Nature and Nurture Effects On Children’s Outcomes: What Have We Learned From Studies of Twins And Adoptees?’, in Benhabib, J., Bisin, A., and Jackson, M. O. (eds) Handbook of Social Economics. Elsevier/North-Holland, pp. 1–29.

Schmitz, L. L. and Conley, D. (2017) ‘The effect of Vietnam-era conscription and genetic potential for educational attainment on schooling outcomes’, Economics of Education Review, 61, pp. 85–97. doi: 10.1016/j.econedurev.2017.10.001.

Silventoinen, K. et al. (2004) ‘Heritability of body height and educational attainment in an international context: comparison of adult twins in Minnesota and Finland’, American Journal of Human Biology, 16(5), pp. 544–555.

Thorgeirsson, T. E. et al. (2008) ‘A variant associated with nicotine dependence, lung cancer and peripheral arterial disease’, Nature, 452(7187), pp. 638–642. doi: 10.1038/nature06846.

Tillmann, T. et al. (2017) ‘Education and coronary heart disease: Mendelian randomisation study’, BMJ (Online), 358, p. j3542. doi: 10.1136/bmj.j3542.

Turkheimer, E. (2000) ‘Three laws of behavior genetics and what they mean’, Current Directions in Psychological Science, 9(5), pp. 160–164.

Turley, P. et al. (2018) ‘Multi-trait analysis of genome-wide association summary statistics using MTAG’, Nature Genetics, 50(2), pp. 229–237. doi: 10.1101/118810.

Vassos, E. et al. (2017) ‘An Examination of Polygenic Score Risk Prediction in Individuals With First-Episode Psychosis’, Biological Psychiatry, 81(6), pp. 470–477. doi: 10.1016/j.biopsych.2016.06.028.

Visscher, P. M. et al. (2017) ‘10 Years of GWAS Discovery: Biology, Function, and Translation’, The American Journal of Human Genetics, 101(1), pp. 5–22. doi: 10.1016/j.ajhg.2017.06.005.

Wang, Y. et al. (2020) ‘Theoretical and empirical quantification of the accuracy of polygenic scores in ancestry divergent populations’, Nature Communications, 11(1), pp. 1–9. doi: 10.1038/s41467-020-17719-y.

Warrier, V. et al. (2016) ‘Genetic overlap between educational attainment, schizophrenia and autism’, bioRxiv [https://doi.org/10.1101/093575]. doi: 10.1101/093575.

Weikart, D. P. and Perry Preschool Project (1967) Preschool intervention; a preliminary report of the Perry Preschool Project. Ann Arbor: Campus Publishers.

Winkler, T. W. et al. (2014) ‘Quality control and conduct of genome-wide association meta-analyses.’, Nature Protocols, 9(5), pp. 1192–1212.

Wood, A. R. et al. (2014) ‘Defining the role of common variation in the genomic and biological architecture of adult human height’, Nature Genetics, 46(11), pp. 1173–1186. doi: 10.1038/ng.3097.

Yengo, L., Robinson, M. R., et al. (2018) ‘Imprint of Assortative Mating on the Human Genome’, Nature Human Behaviour, 2(12), pp. 948–954. doi: 10.1038/s41562-018-0476-3.

Yengo, L., Sidorenko, J., et al. (2018) ‘Meta-analysis of genome-wide association studies for height and body mass index in ∼700000 individuals of European ancestry’, Human Molecular Genetics, 27(20), pp. 3641–3649. doi: 10.1093/hmg/ddy271.

Young, A. I. et al. (2020) ‘Mendelian imputation of parental genotypes for genome-wide estimation of direct and indirect genetic effects’, bioRxiv, p. 2020.07.02.185199. doi: 10.1101/2020.07.02.185199.

 
 
 
 
 
 
 
 
 
 

FAQs about "Resource Profile and User Guide of the Polygenic Index Repository"

Use the quick link menu to jump to a specific question, or scroll down to read all FAQs for this publication. 

This document provides information about the study:

 

Becker et al. (2021). “Resource Profile and User Guide of the Polygenic Index Repository.” Nature Human Behaviour.

 

The document was prepared by Daniel Benjamin, David Laibson, Michelle N. Meyer, and Patrick Turley. It draws from and builds on the FAQs for earlier SSGAC papers. It has the following sections:

 

  1. Background

  2. Study design and results

  3. Social and ethical implications of the study

  4. Appendices

 

For clarifications or additional questions, please contact Daniel Benjamin (daniel.benjamin@gmail.com).

Quick Links

1.1.     Who conducted this study? What are the group’s overarching goals?

1.2.     What is a polygenic index (PGI)? Why this terminology?

1.3.     How is a polygenic index constructed?

1.4.     How might polygenic indexes be useful?

1.5.     Does a polygenic index “cause” the outcome of interest?

1.6.     In what sense does a polygenic index “predict” the outcome of interest?

1.7.     What polygenic indexes were available to researchers prior to this project?

1.8.     How do different polygenic indexes for the same outcome differ? How comparable are results across studies that use different polygenic indexes for the same outcome?

1.9.     Why create the Polygenic Index Repository?

2.1.     What outcomes are included in the Polygenic Index Repository? How did you choose the outcomes?

2.2.     How did you create these polygenic indexes?

2.3.     How predictive are the polygenic indexes in the Repository?

2.4.     What is the “measurement-error-corrected estimator”? How will it and the Repository improve comparability of results across future studies? 

2.5.     What is in the User Guide that accompanies the Repository?

2.6.     Who can access the Repository polygenic indexes, and how?

2.7.     How will the Repository be updated?

3.1.     Do GWAS or the polygenic indexes they produce identify the gene—or genes—“for” a particular outcome?

3.2.     Do polygenic indexes show that these outcomes are determined, or fixed, at conception?

3.3.     Can the polygenic indexes from the Repository be used to accurately predict a particular person’s outcomes? 

3.4.     Can the polygenic indexes accurately be used for research studies in non-European-ancestry populations?

3.5.     Would it be appropriate to use the Repository social and behavioral polygenic indexes in policy or practice?

3.6.     Could research on polygenic indexes lead to discrimination against, or stigmatization of, people with higher or lower polygenic indexes for certain outcomes? If so, why facilitate the spread of polygenic indexes?

3.7.     What have you done to mitigate the risks of research using Repository polygenic indexes?

4.0.       References

1. Background

1.1.  Who conducted this study? What are the group’s overarching goals?

The authors of the study are researchers affiliated with the Social Science Genetic Association Consortium (SSGAC) as well as data providers (i.e., individuals who act as stewards for datasets and provide other researchers with access to these data for research purposes). The SSGAC is a multi-institutional, international research group that aims to identify statistically robust associations between variation in DNA and variation in social-science-relevant outcomes. 

We study the most common sources of genetic variation—single-nucleotide polymorphisms (SNPs). SNPs are sites in the genome where single DNA base pairs commonly differ across individuals. Each SNP usually has two different possible base pairs, which are called alleles. Although there are tens of millions of sites where SNPs are located in the human genome, our work (like most genetic research today that aims to link variation in DNA to variation in disease and other outcomes) investigates only SNPs that can be easily measured with a high level of accuracy. These days, we can easily and accurately measure millions of SNPs, which together capture most of the common genetic variation across people.

The social-science-relevant outcomes that we analyze include differences across people in behavior, preferences, and personality that are traditionally studied by social and behavioral scientists (e.g., anthropologists, economists, political scientists, psychologists, and sociologists). These traits are often also of interest to health and other researchers.

The SSGAC was formed in 2011 to address a specific set of scientific challenges. Most outcomes and behaviors are weakly associated with a very large number of SNPs. Although their collective effect can be meaningful (see FAQs 1.2 and 2.3), we now know that almost every one of these SNPs has an extremely weak association on its own. To identify specific SNPs with such small effects, scientists must study at least hundreds of thousands of people (to separate weak signals from sampling noise). One promising strategy for doing this is for many investigators to pool their data into one large study. This approach has borne considerable fruit when used by medical geneticists interested in a range of medical conditions (Visscher et al., 2017). Most of these advances would not have been possible without large research collaborations between multiple research groups interested in similar questions. The SSGAC was formed in an attempt by social scientists to adopt this research model.

The SSGAC is organized as a working group of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), a successful medical consortium. (In genetics research, “cohort” is a term that means “dataset.”) The SSGAC was founded by three social scientists—Daniel Benjamin (University of California – Los Angeles), David Cesarini (New York University), and Philipp Koellinger (University of Wisconsin and Vrije Universiteit Amsterdam)—who believe that studying SNPs associated with social scientific outcomes can have substantial positive impacts across many research fields. This includes research that aims to better understand the effects of the environment (e.g., research on policy interventions) and interactions between genetic and environmental effects. The potential benefits also span a diverse set of research questions in the biomedical sciences, such as why and how educational attainment is linked to longevity and better overall health outcomes.

To conduct such research, the SSGAC implements genome-wide association studies (GWAS) of social-scientific outcomes. For example, to conduct a GWAS of educational attainment (e.g., Lee et al., 2018), every participating cohort calculates the cross-sectional (i.e., within-cohort) correlation between educational attainment and DNA-base-pair variation at a single location on the genome: a SNP. As discussed above, a SNP is a base pair of the genome where there is common variation in the human population. This statistical analysis is repeated for each SNP on the genome. The cohort-level results do not contain individual-level data, just summary statistics about these within-cohort statistical associations. The SSGAC then combines these cohort results to produce the overall GWAS results. By using existing datasets and combining cohort results, we can study the genetics of large numbers of individuals (for example, ~1.1 million people in Lee et al. (2018)) at very low cost. The SSGAC publicly shares overall, aggregated results (subject to some Terms of Service; see FAQ 3.7) so that other scientists can build on this work. These publicly available data have already catalyzed many research projects and analyses across the social and biomedical sciences. Among the most useful products of these GWASs for other research are the polygenic indexes that are based on GWAS associations. Polygenic indexes are variables that aggregate the predictive power of many SNPs for predicting the outcome of the GWAS (see FAQ 1.2), and they are the focus of the current paper.
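The workflow described above (a per-SNP association computed within each cohort, followed by a combination of cohort-level summary statistics) can be sketched roughly as follows. This is a minimal illustration on simulated data, not the SSGAC pipeline: the function names, the simulated cohorts, and the choice of an inverse-variance-weighted average to combine cohorts are all assumptions made for the example, and real analyses include many additional controls and quality-control steps.

```python
import numpy as np

def cohort_summary_stats(genotypes, outcome):
    """Per-SNP regression slope and standard error within one cohort.

    genotypes: (n_people, n_snps) array of allele counts (0, 1, or 2).
    outcome:   (n_people,) array of outcome values.
    Only these summary statistics, not individual-level data, leave the cohort.
    """
    g = genotypes - genotypes.mean(axis=0)
    y = outcome - outcome.mean()
    n = len(y)
    ssx = (g ** 2).sum(axis=0)
    beta = g.T @ y / ssx                      # one slope per SNP
    resid_var = ((y[:, None] - g * beta) ** 2).sum(axis=0) / (n - 2)
    se = np.sqrt(resid_var / ssx)
    return beta, se

def combine_cohorts(stats):
    """Inverse-variance-weighted average of cohort-level estimates (assumed method)."""
    betas = np.array([b for b, _ in stats])
    weights = np.array([1.0 / s ** 2 for _, s in stats])
    beta_meta = (weights * betas).sum(axis=0) / weights.sum(axis=0)
    se_meta = np.sqrt(1.0 / weights.sum(axis=0))
    return beta_meta, se_meta

# Simulated example: three hypothetical cohorts, three SNPs with tiny effects.
rng = np.random.default_rng(0)
true_effect = np.array([0.05, 0.0, -0.03])    # hypothetical values
stats = []
for n in (4000, 6000, 10000):
    geno = rng.binomial(2, 0.3, size=(n, 3)).astype(float)
    outcome = geno @ true_effect + rng.normal(size=n)
    stats.append(cohort_summary_stats(geno, outcome))

beta_meta, se_meta = combine_cohorts(stats)
print("combined estimates:", beta_meta)
print("combined standard errors:", se_meta)
```

The combined standard errors are smaller than those of any single cohort, which is why pooling results across cohorts helps separate weak signals from sampling noise.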

The Advisory Board for the SSGAC is composed of prominent researchers representing various disciplines: Dalton Conley (Sociology, Princeton University), George Davey Smith (Epidemiology, University of Bristol), Tõnu Esko (Molecular Biology and Human Genetics, University of Tartu and Estonian Genome Center), Albert Hofman (Epidemiology, Harvard University), Robert Krueger (Psychology, University of Minnesota), David Laibson (Economics, Harvard University), James Lee (Psychology, University of Minnesota), Sarah Medland (Genetic Epidemiology, QIMR Berghofer Medical Research Institute), Michelle Meyer (Bioethics and Law, Geisinger Health System), and Peter Visscher (Statistical Genetics, University of Queensland).

The SSGAC is committed to the principles of reproducibility and transparency. Major SSGAC publications are usually accompanied by a FAQ document (such as this one). The FAQ document is written to communicate what was found less tersely and technically than in the paper, as well as what can and cannot be concluded from the research findings more broadly. FAQ documents produced for SSGAC publications are available on the SSGAC website.

To date, SSGAC-affiliated papers have studied educational attainment, cognitive performance, subjective well-being, reproductive behavior, risk tolerance, and dietary intake. The SSGAC website contains a list of our major publications, which have been published in journals such as Science, Nature, Nature Genetics, Proceedings of the National Academy of Sciences, Psychological Science, and Molecular Psychiatry.

1.2. What is a polygenic index (PGI)? Why this terminology?

A polygenic index (we use the acronym PGI throughout the paper) is an index composed of a large number of SNPs from across the genome. Each polygenic index is associated with a particular outcome (for details, see FAQ 1.3). Because a polygenic index aggregates the information from many SNPs, it can “predict” (see FAQ 1.6) far more of the variation among individuals than any single SNP. (Note that even polygenic indexes are not good predictors of outcomes for one person; see FAQ 3.3). Often, the polygenic indexes with the most predictive power are those created using all the (millions of) SNPs measured in a SNP array. A SNP array is the currently standard way of measuring common genetic differences across individuals. A SNP array does not measure the entire genetic sequence of each individual, but it does measure most of the places on the genome where individuals differ.

Our terminology of polygenic index is currently non-standard, but most of the authors of the paper prefer it to current terms and hope that this paper, and the Polygenic Index Repository introduced in this paper, make polygenic index a standard term. The traditional terms include polygenic risk score and polygenic score. The word risk makes little sense when the polygenic index is for a non-disease outcome (such as height). The word score was intended to echo statistical nomenclature but can instead convey an unintended value judgment or valence (i.e., “a higher score must be better”). The word index is at least as accurate statistically and does not convey a value judgment.

1.3. How is a polygenic index constructed?

A polygenic index is constructed in three steps. First, a genome-wide association study (GWAS) is conducted, looking at SNPs measured across the entire human genome to see which of them are associated with higher or lower levels of some outcome. As explained above, SNPs are sites in the genome where single DNA base pairs commonly differ across individuals. SNPs usually have two different possible base pairs, or alleles. Although there are tens of millions of sites where SNPs are located in the human genome, GWASs typically investigate only SNPs that can be easily measured (or imputed) with a high level of accuracy. These days, we can easily and accurately measure millions of SNPs, which together capture most of the common genetic variation across people. For each of these millions of SNPs, the GWAS generates an “effect size” corresponding to the (typically minuscule) magnitude of the association between that SNP and the outcome. (We use the term “effect size” because it is a common scientific shorthand for “magnitude of association,” but we emphasize that use of the term is not intended to imply that the SNP, or polygenic index, causes the outcome; see FAQ 1.5.)

Second, the effect sizes are used to determine the “weight” each SNP will get in the polygenic index. The simplest scheme is to weight each SNP by its effect size as estimated in the GWAS. This simple weighting scheme has one main problem: because SNPs tend to be correlated with nearby SNPs on the genome (a phenomenon called linkage disequilibrium), if one SNP is associated with the outcome, nearby SNPs will also be associated with the outcome. State-of-the-art approaches to determining the weights for a polygenic index are designed to address this problem. We use a common approach called LDpred (Vilhjálmsson et al., 2015). Using the results of a GWAS, LDpred generates a weight for each SNP. These weights are not equal to the SNPs’ effect sizes as estimated in the GWAS, mostly because the weights take into account each SNP’s correlation with other SNPs. (Even though LDpred addresses the issue of linkage disequilibrium, it does so only for the purpose of generating weights for optimal prediction. LDpred will not necessarily assign more weight to the SNP whose association with the outcome is responsible for nearby SNPs’ associations with the outcome. Thus, LDpred is a tool to address the issue of linkage disequilibrium for the purpose of prediction—which is the purpose of a polygenic index—but not for the purpose of unbiased estimation of SNPs’ effect sizes. See FAQ 1.5.) 

Third, the set of weights for the SNPs is used in a formula for calculating a polygenic index for any particular individual. The formula is a weighted sum of alleles at each SNP (using the weights from the second step). The formula is used to calculate a numerical value of the polygenic index for each individual in some dataset (that was not included in the GWAS).
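The weighted sum in the third step can be sketched in a few lines. This is a minimal illustration, assuming the genotypes are already coded as allele counts (0, 1, or 2) and that a weight per SNP is already available from the second step; the function names and the toy numbers are hypothetical and are not the Repository's code or weights.

```python
import numpy as np

def polygenic_index(genotypes, weights):
    """Weighted sum of allele counts, one value per individual.

    genotypes: (n_people, n_snps) array of allele counts (0, 1, or 2).
    weights:   (n_snps,) array of per-SNP weights from the second step.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if genotypes.ndim != 2 or genotypes.shape[1] != weights.shape[0]:
        raise ValueError("genotypes must be (n_people, n_snps) and match weights")
    return genotypes @ weights

def standardize(values):
    """Rescale to mean 0 and standard deviation 1 within the dataset."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Toy data: three people, three SNPs (hypothetical values).
genotypes = np.array([[0, 1, 2],
                      [2, 1, 0],
                      [1, 1, 1]])
weights = np.array([0.02, -0.01, 0.03])

pgi = polygenic_index(genotypes, weights)
pgi_standardized = standardize(pgi)
print(pgi, pgi_standardized)
```

Standardizing within the dataset reflects the common practice of reporting results per standard deviation of the polygenic index; a real index sums over very many SNPs rather than three.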

The sample used for the GWAS in the first step is the training sample for the polygenic index. The larger the GWAS sample size, the greater the predictive power of the polygenic index constructed in the third step. However, for each outcome there is a maximum level of predictive power, which a polygenic index can approach as the sample size grows but can never exceed.
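One way to picture this ceiling is a widely used textbook approximation for the expected predictive power of a polygenic index as a function of GWAS sample size. The approximation is not taken from the paper or this FAQ, and the parameter values below (`h2_snp`, the share of variance that measured SNPs could explain at most, and `m_eff`, an effective number of independent SNPs) are hypothetical numbers chosen only to show the shape of the curve.

```python
def expected_r2(n, h2_snp, m_eff):
    """Approximate expected R^2 of a polygenic index.

    n:      GWAS (training) sample size.
    h2_snp: upper bound on predictive power (assumed value).
    m_eff:  effective number of independent SNPs (assumed value).
    As n grows, the result approaches h2_snp from below but never exceeds it.
    """
    return h2_snp / (1.0 + m_eff / (n * h2_snp))

# Hypothetical parameters, for illustration only.
for n in (100_000, 1_000_000, 10_000_000):
    print(n, round(expected_r2(n, h2_snp=0.15, m_eff=60_000), 3))
```

Under these assumed values, each tenfold increase in sample size moves the expected predictive power closer to the ceiling, with diminishing gains.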

1.4. How might polygenic indexes be useful?

A polygenic index for an outcome provides one measure of the genetic influence on that outcome that can be used in research in a variety of ways. For example, polygenic indexes have been used to:

  • partially control for genetic influences in order to generate less noisy estimates of how changes in school policy influence health outcomes (Davies et al., 2018);

  • examine how the effect of school policy on health outcomes depends in part on genetic influences (Barcellos, Carvalho and Turley, 2018a);

  • study why SNPs predict educational attainment – for example, it appears that some genetic effects on educational attainment operate through associations with cognitive function and traits such as self-control (Belsky et al., 2016), which in turn affect educational attainment;

  • investigate how genetic influences on educational attainment differ across environmental contexts (Schmitz and Conley, 2017; Barcellos, Carvalho and Turley, 2018b); 

  • investigate how genetic influences on BMI vary over the lifecycle (Khera et al., 2019);

  • infer the degree of assortative mating (Robinson et al., 2017; Yengo et al., 2018);

  • trace recent migration patterns (Domingue et al., 2018; Abdellaoui et al., 2019);

  • examine whether polygenic indexes for disease risk are sufficiently predictive to be incorporated into clinical practice for preventative medicine (Khera et al., 2018); and

  • develop new statistical tools that may advance our understanding of how parenting and other features of a child’s rearing environment influence his or her developmental outcomes (Koellinger and Harden, 2018; Kong et al., 2018).

 

The idea of using GWAS results to create a polygenic index was initially proposed in 2007 (Wray, Goddard and Visscher, 2007), and the first polygenic index was created in 2009 in a GWAS of schizophrenia and bipolar disorder (Purcell et al., 2009). Since then, polygenic indexes have become a significant part of research that builds on genetics in the medical and social sciences. For example, in the current paper we analyze presentations at the annual meeting of the Behavior Genetics Association. We report that the fraction of presentations that used polygenic indexes increased from 0% in 2009 to 20% in 2019. The list above represents a few illustrative examples of research that uses polygenic indexes.

As discussed in FAQ 1.9 below, one goal of this paper, and the Polygenic Index Repository it introduces, is to facilitate further work using polygenic indexes by making a much wider range of more predictive polygenic indexes available to researchers.

1.5. Does a polygenic index “cause” the outcome of interest?

Polygenic indexes available today, including those we construct in this paper, should not be interpreted as a measure of causal mechanisms.

The genome-wide association studies (GWASs) used as the training data for the polygenic indexes (see FAQ 1.3) identify SNPs that are associated with the outcome, but an observed empirical correlation with a specific SNP need not imply that the SNP causes the outcome, for a variety of reasons. First, SNPs are often highly correlated with other, nearby SNPs on the same chromosome. As a result, when one or more SNPs in a region causally influence an outcome (in that particular environment), many non-causal SNPs in that region may also be identified as associated with the outcome (in FAQ 1.3, see the parenthetical “Even though LDpred…” for why LDpred does not solve this problem for the purpose of identifying the causal SNP). In fact, the causal SNP may not have even been measured directly. For example, GWAS that focus on common SNPs would not be able to identify rare or structural types of genetic variation (e.g., deletions or insertions of an entire genetic region) that are causal, but they may identify SNPs that are correlated with these unobserved variants. For these and other reasons, polygenic indexes are likely to be composed of a mix of causal and non-causal SNPs, and the weights used in the formula for constructing the polygenic index (see FAQ 1.3) should not be interpreted as estimates of the causal effects of the SNPs. As a very rough estimate, for social and behavioral outcomes, no more than about one-third of the predictive power of a polygenic index (i.e., the percentage of the variance in the outcome among individuals that the polygenic index explains) is explained by causal genetic effects (Howe et al., 2021). For instance, the most predictive polygenic index for educational attainment currently available explains about 12% of the variance between people, but at most about one-third of that (roughly 4% of the variance) reflects causal genetic effects. (These causal SNPs may be among the SNPs included in the polygenic index or may be physically close to, and therefore correlated with, SNPs that are included.) In contrast, for anthropometric outcomes such as height, it is possible that nearly all of the predictive power of a polygenic index is explained by causal SNPs.

Second, at a particular SNP the frequency of different alleles might vary systematically across environments. If those environmental factors are not accounted for in the association analyses, some of the measured SNP associations with social-science outcomes may be spurious. To use a well-known example (Lander and Schork, 1994), any genetic variants common in people of Asian ancestries will be associated statistically with more frequent than average chopstick use, but these variants would not cause greater chopstick use; rather, these genetic variants and the outcome of chopstick use are both distributed unevenly among people with different ancestries. This is called the problem of “population stratification.” The GWAS underlying the polygenic indexes in this paper employ standard strategies to try to minimize this problem, but the issues raised by population stratification cannot be ruled out entirely. As a result, the polygenic indexes likely reflect population stratification to some extent. In the User Guide that accompanies the Polygenic Index Repository (reproduced in the Supplementary Methods of the paper), we discuss this problem in more detail and discuss strategies for addressing the population stratification in the polygenic indexes.
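The logic of the chopsticks example can be reproduced in a small simulation. The sketch below is an illustration under strong simplifying assumptions: two hypothetical groups with different allele frequencies and different average outcomes, a SNP with no causal effect at all, and a known group label used as a control variable. In real data group membership is not observed this cleanly, and the strategies actually used in the underlying GWAS are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group = 5000

# Two hypothetical groups: the allele is rare in one and common in the other.
group = np.repeat([0, 1], n_per_group)
snp = np.concatenate([
    rng.binomial(2, 0.2, n_per_group),
    rng.binomial(2, 0.8, n_per_group),
]).astype(float)

# The outcome differs between groups for purely environmental reasons;
# the SNP itself has no effect on it.
outcome = 1.0 * group + rng.normal(size=2 * n_per_group)

def snp_slope(y, x, covariate=None):
    """Least-squares slope of y on x, optionally controlling for one covariate."""
    columns = [np.ones_like(x), x]
    if covariate is not None:
        columns.append(covariate)
    design = np.column_stack(columns)
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients[1]

naive = snp_slope(outcome, snp)
adjusted = snp_slope(outcome, snp, covariate=group.astype(float))
print("slope ignoring group:", round(naive, 3))
print("slope controlling for group:", round(adjusted, 3))
```

Ignoring group membership produces a sizable association for a SNP that does nothing, while controlling for it brings the estimate close to zero. Imperfect controls would leave some of the spurious association in place.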

Even in GWAS (such as those we rely on or conduct ourselves) that attempt to address and correct for heterogeneity in genetic ancestry, allele frequencies may nonetheless vary systematically with environmental factors even within a group of people of similar genetic ancestry. For example, a SNP that is associated with improved educational outcomes in the parental generation may have downstream effects on parental income and other factors known to influence children’s educational outcomes (such as neighborhood characteristics). This same SNP is likely to be inherited by the children of these parents, creating a correlation between the presence of the SNP in a child’s genome and the extent to which the child was reared in an environment with specific characteristics. A recent study of Icelandic families showed that a parental allele associated with higher educational attainment of the parent that is not passed on to the parent’s offspring is still associated with the child’s educational attainment, suggesting that GWAS results for educational attainment partly represent these intergenerational environmental pathways (Kong et al., 2018).

Third, a SNP’s effects on an outcome may be indirect, so a SNP that may be “causal” in one environment may have a diminished effect or no effect at all in other environments. For example, variation in a particular SNP on chromosome 15 is associated with lung cancer (Amos et al., 2008; Hung et al., 2008; Thorgeirsson et al., 2008). From this observation alone we cannot conclude that variation in this SNP can cause lung cancer through some direct biological mechanism. In fact, it is likely that variation in this SNP, which is part of the nicotinic acetylcholine receptor gene cluster that affects nicotine metabolism, increases lung cancer risk through effects on smoking behavior. In a tobacco-free environment, it is plausible that this association with lung cancer would be substantially weaker and perhaps disappear altogether. Thus, even if we have credible evidence that a specific association is not spurious, it is entirely possible that the SNP in question influences the outcome through channels that we, in common parlance, would label environmental (e.g., smoking). Nearly forty years ago, the sociologist Christopher Jencks criticized the widespread tendency to mistakenly treat environmental and genetic sources of variation as mutually exclusive (see also Turkheimer, 2000). As the example of smoking illustrates, it is often overly simplistic to assume that “genetic explanations of behavior are likely to be exclusively physical explanations while environmental explanations are likely to be social” (Jencks, 1980, p723).

1.6. In what sense does a polygenic index “predict” the outcome of interest?

When we and other scientists say that polygenic indexes (and other variables, such as demographics or other environmental factors) “predict” certain outcomes, our use of the word differs in several important ways from how “predict” is used in standard language (e.g., outside of social science research papers). First, we do not mean that the polygenic index guarantees an outcome with 100% probability, or even with a high degree of likelihood. Rather, we mean that the polygenic index is, on average across people, statistically associated with an outcome. In other words, on average, people with a higher numerical value of the polygenic index have a higher likelihood of the outcome compared to people with a lower numerical value. A polygenic index is said to be statistically “predictive” of an outcome even if the polygenic index has only a weak association with the outcome—as is the case, for instance, with almost all of the polygenic indexes in this paper. In such cases, the polygenic index is only weakly predictive of the outcome.

Second, in standard language, “prediction” usually refers to the future. In contrast, when scientists say that a polygenic index “predicts” an outcome, they mean that they expect to see the association in new data. “New data” means data that haven’t been analyzed yet—regardless of whether those data will be collected in the future or have already been collected. In other words, in social science, it makes perfect sense to ask how well a polygenic index predicts outcomes that have already occurred, like how many years of education were attained by older adults.

Finally, in standard language, a “prediction” is often an unconditional guess about what will happen. Instead of meaning it unconditionally, scientists mean that they expect to see an association in new data under certain conditions, for example, that the environment for the new data is the same as the environment in which the GWAS that underlies the polygenic index (see FAQ 1.3) was conducted. In the example given in FAQ 1.5, in which a SNP is associated with lung cancer due to an effect on smoking, we would not expect the SNP to be as strongly predictive of lung cancer, or predictive at all, in an environment where tobacco-based products are hard to obtain or absent entirely.

1.7. What polygenic indexes were available to researchers prior to this project?

Prior to this project, only a few datasets had constructed polygenic indexes that researchers could download and use. Notable examples of data providers that did make polygenic indexes directly available to researchers (all of which recognized early on the value of doing so) are the Health and Retirement Study, the Wisconsin Longitudinal Study, and the National Longitudinal Study of Adolescent to Adult Health. The UK Biobank does not construct polygenic indexes for its users, but it provides a mechanism by which researchers who use the data and construct polygenic indexes can “return” them to the UK Biobank for use by other researchers. Through this mechanism, polygenic indexes constructed from several GWASs have been made available for researchers to download from the UK Biobank.

To study polygenic indexes in other datasets or for other outcomes, prior to this paper, researchers would need to construct the polygenic indexes themselves, following the steps described in FAQ 1.3. For the first step, most researchers would need to rely on publicly available GWAS results, which include less data and are therefore less predictive than some polygenic indexes in published work that rely on non-public GWAS results (see FAQ 2.3). Recently, to make it easier for researchers to construct polygenic indexes themselves, the Polygenic Score Catalog (Lambert et al., 2020) collected together weights for a range of polygenic indexes (also based on publicly available GWAS results).

As we discuss in more detail in FAQ 2.1, for the Polygenic Index Repository, we constructed a large number of polygenic indexes in each of 11 datasets (including the four mentioned above) and have made the polygenic indexes directly available for researchers to download. The polygenic indexes are often based on more data than is publicly available, and the polygenic indexes are constructed according to a uniform methodology across both outcomes and datasets. For examples of Repository polygenic indexes that were previously not available at all or that were less accurate (i.e., predictive), see FAQ 2.3.

1.8. How do different polygenic indexes for the same outcome differ? How comparable are results across studies that use different polygenic indexes for the same outcome?

There are several reasons why polygenic indexes for the same outcome can differ from each other. As described in FAQ 1.3, there are three steps to creating a polygenic index, and differences can arise at each of these steps. For example, in the first step, researchers could base the polygenic index on different GWASs of the same outcome. Different GWASs may be based on samples of people who live under different environmental conditions, may have different measures of the outcome, and/or may have measured different SNPs. As another example, in the second step, researchers could use a different method of determining polygenic-index weights from the results of a GWAS. For these and other reasons, it has been common for different studies to use different polygenic indexes, even when the polygenic indexes are for the same outcome and are being studied in the same dataset.

The results are typically difficult to compare across such studies for three main reasons:

  1. If the polygenic indexes are constructed using different methods, then even though they are both measuring genetic influences on the outcome, the precise definition of these “genetic influences” may differ (see FAQs 3.1 and 3.2).

  2. The units for measuring the strength of associations between the polygenic index and other variables generally differ across studies. Researchers usually report results in terms of standard deviations (a statistical unit) of the polygenic index, but if the polygenic index in one study is a more powerful predictor than that in the other study, then one standard deviation of one polygenic index means something different than one standard deviation of the other.

  3. If one of the polygenic indexes is a more powerful predictor than the other, then they differ in their signal-to-noise ratio for capturing genetic influences on the outcome. Whenever an explanatory variable is measured with noise, results based on that variable will be distorted, sometimes in unanticipated ways. Since the signal-to-noise ratio differs across the polygenic indexes, results based on them are distorted differentially, further making the results difficult to compare.
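The second and third points can be illustrated with a generic errors-in-variables simulation. In the sketch below, two hypothetical polygenic indexes measure the same underlying genetic component with different amounts of noise, and the same outcome is regressed on each index after standardizing it. All names and numbers are assumptions made for the illustration; this is not the measurement-error-corrected estimator discussed in FAQ 2.4.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000

# Hypothetical underlying genetic component and an outcome that depends on it.
genetic_component = rng.normal(size=n)
outcome = 0.5 * genetic_component + rng.normal(size=n)

def noisy_standardized_index(noise_sd):
    """An index equal to the genetic component plus noise, rescaled to SD 1."""
    raw = genetic_component + rng.normal(scale=noise_sd, size=n)
    return (raw - raw.mean()) / raw.std()

def slope_per_sd(y, x):
    """Change in y associated with a one-standard-deviation change in x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

more_predictive = noisy_standardized_index(noise_sd=1.0)   # less noise
less_predictive = noisy_standardized_index(noise_sd=3.0)   # more noise

slope_more = slope_per_sd(outcome, more_predictive)
slope_less = slope_per_sd(outcome, less_predictive)
print("per-SD association, more predictive index:", round(slope_more, 3))
print("per-SD association, less predictive index:", round(slope_less, 3))
```

Both indexes track the same genetic component, yet the per-standard-deviation association is noticeably smaller for the noisier index, so the two sets of results cannot be compared directly.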

 

1.9. Why create the Polygenic Index Repository?

In brief, the Polygenic Index Repository introduced in this paper has three main goals: (i) to make polygenic indexes for a large number of outcomes more accessible to a wider range of researchers from many fields and disciplines, including early career researchers, researchers without access to the data and/or training required to create the most state-of-the-art polygenic indexes, and researchers who wish to probe the limitations of polygenic indexes; (ii) to increase the use of polygenic indexes that are more accurate (i.e., predictive) than polygenic indexes researchers could construct from publicly available GWAS results; and (iii) to facilitate the comparability of results across studies that use these polygenic indexes.

In more detail, the Polygenic Index Repository addresses several practical obstacles that researchers interested in using polygenic indexes must often confront, including:

  1. Constructing a polygenic index from genotype data requires special expertise. Even for researchers with that expertise, it can be a time-consuming process.

  2. It is generally desirable to generate polygenic-index weights from the GWAS with the largest sample size because the predictive accuracy of a polygenic index is expected to be largest in that case. However, there are administrative hurdles for accessing some GWAS results, such as those from 23andMe. In practice, researchers often end up constructing polygenic indexes using only publicly available GWAS results. Such polygenic indexes tend to have less predictive power.

  3. Publicly available GWAS results are sometimes based on a sample that includes the dataset (or close relatives of dataset members) in which the researcher wants to analyze the polygenic index. Such “sample overlap” spuriously inflates the predictive power of the polygenic index, which can lead to highly misleading results.

  4. Because different researchers construct polygenic indexes in different ways, it is hard to compare and interpret results from different studies (see FAQ 1.8).

As we explain in the paper:

 

We overcome #1 by constructing the [polygenic indexes] ourselves and releasing them to the data providers, who in turn will make them available to researchers. This simultaneously addresses #2 because we use all the data available to us that may not be easily available to other researchers or to the data providers, including genome-wide summary statistics from 23andMe. Using these genome-wide summary statistics from 23andMe is what primarily distinguishes our Repository from existing efforts by data providers to construct PGIs and make them available…It also distinguishes our Repository from efforts to make publicly available [polygenic index] weights directly available for download (although we also do that, for weights constructed without 23andMe data). To deal with #3, for each [outcome] and each dataset, we construct a [polygenic index] from GWAS summary statistics that excludes that dataset. We overcome #4 by using a uniform methodology across the [outcomes].

In addition to providing polygenic indexes constructed using a uniform methodology (which deals with problem #1 listed in FAQ 1.8), we aim to improve comparability of results based on polygenic indexes in another way (which deals with problems #2 and #3 listed in FAQ 1.8): we derive a “measurement-error-corrected estimator” and provide software for calculating it. This estimator deals with the fact that polygenic indexes can differ from each other in their signal-to-noise ratios. It estimates what the results of an analysis would be if the polygenic index had no noise. It thereby avoids the distortions in results that arise from having a noisy measure. Because it puts results about the polygenic index in the units of the “noiseless” polygenic index, the results from polygenic indexes with different signal-to-noise ratios are expressed in the same units. For more details, see FAQ 2.4.

2. Study Design and Results

2.1. What outcomes are included in the Polygenic Index Repository? How did you choose the outcomes?

We constructed polygenic indexes for 47 outcomes in 11 datasets, using a consistent methodology. The outcomes (listed in Table 1 in the paper) can be categorized into five groups:

  • anthropometric (height and body mass index);

  • cognition and education (including number of years of formal schooling and performance on cognitive tests);

  • fertility and sexual development (including number of children separately for men and women, and age at first menses);

  • health and health behaviors (the largest category, which includes self-rated overall health, several alcohol and smoking-related behaviors, and depressive symptoms); and

  • personality and well-being (the next largest category, which includes self-rated risk tolerance, subjective well-being, and adventurousness).

 

The set of 47 outcomes we studied was selected from a larger set of 53 outcomes; we did not create polygenic indexes for the 6 outcomes for which statistical calculations indicated that, based on the GWAS results we had available, a polygenic index was predicted to explain less than 1% of the variation across individuals. Although the specific threshold of 1% is somewhat arbitrary (but see further discussion in FAQ 2.3 below), polygenic indexes with low predictive power are less useful and more likely to generate misleading results (such as false positives) if used.

2.2. How did you create these polygenic indexes?

In order to construct the polygenic indexes, we combined GWAS results from three sources. First, for the 34 outcomes where we could find previously published GWAS, we obtained the publicly available results. Second, we collaborated with the personal genomics company 23andMe. 23andMe contributes to academic research by analyzing the data of customers who consent to participate in research. For this paper, 23andMe provided GWAS results for 37 outcomes, 9 of which had not previously been published. Third, for 25 outcomes, we conducted a GWAS ourselves in the UK Biobank, a large-scale biomedical database accessible to researchers. When more than one of these sources of GWAS results was available for an outcome, we combined the GWAS results together using a statistical method called meta-analysis. In some cases, we constructed “multi-trait polygenic indexes” using GWAS results for multiple outcomes (Turley et al., 2018); these polygenic indexes are often more predictive than a standard “single-trait polygenic index” constructed from GWAS results from a single outcome (FAQ 1.3), but the results from analyzing multi-trait polygenic indexes are sometimes more difficult to interpret (FAQ 2.5).
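As an illustration of how GWAS results from several sources can be combined, here is a minimal sketch of a fixed-effects, inverse-variance-weighted meta-analysis for a single SNP. The weighting scheme is an assumption (this FAQ does not say which meta-analysis method was used), and the function names, cohort labels, and numbers are hypothetical. The second function mirrors the Repository's practice of excluding the target dataset from the GWAS results used for that dataset's polygenic index (see FAQ 1.9).

```python
from math import sqrt


def ivw_meta(estimates):
    """Combine (beta, standard_error) pairs for one SNP from independent GWAS.

    Assumes a fixed-effects, inverse-variance-weighted scheme; this is a
    common choice, not necessarily the method used for the Repository.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return beta, sqrt(1.0 / total)


def leave_one_out(estimates_by_cohort, target):
    """Meta-analyze every cohort except the one where the index will be used."""
    kept = [est for name, est in estimates_by_cohort.items() if name != target]
    return ivw_meta(kept)


# Hypothetical per-cohort estimates for one SNP: (beta, standard error).
by_cohort = {"A": (0.02, 0.01), "B": (0.04, 0.01), "C": (0.10, 0.01)}
beta_for_c, se_for_c = leave_one_out(by_cohort, "C")
```

Dropping cohort C before meta-analysis means the weights later applied to cohort C were not estimated from cohort C's own data, which is what avoids the "sample overlap" problem.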

2.3. How predictive are the polygenic indexes in the Repository?

To assess the predictive power of the polygenic indexes, we used data from 5 of the 11 participating datasets (those for which we had access to both the outcome and genotype data we needed to construct the polygenic indexes). In each of these 5 datasets, we calculated the predictive power of every polygenic index for which the dataset contained data on the relevant outcome (see FAQ 2.1).

The predictive power of the polygenic indexes varies substantially across the outcomes and validation datasets. The polygenic index for height has the greatest predictive power. It predicts 26% to 34% of the variation across individuals, depending on the validation dataset. Next is the polygenic index for body mass index (BMI), whose predictive power ranges from 13% to 15% in our validation datasets. Several outcomes—cognitive performance, age at first menses, and educational attainment—have a polygenic index with predictive power in the range of 6% to 12%. Among the least predictive are the polygenic indexes for satisfaction with family and satisfaction with friendships, whose predictive powers in our validation datasets range from 0.3% to 0.7% (they were included because their predictive power was statistically expected to exceed 1%; see FAQ 2.1). The predictive powers for the other polygenic indexes in the Repository lie somewhere between 1% and 6%.

Although the share of variation explained by these polygenic indexes is small-to-modest, they can nevertheless be useful in research. For instance, the environmental factors studied in economics research typically have predictive power smaller than 5%, often 1% or smaller. Among the strongest predictors of educational attainment is family socioeconomic status, which has predictive power of roughly 15%. In a standard categorization used in psychology (Cohen, 1992; percentages here are squared r values), predictive power less than 9% is “small” while predictive power greater than 25% (rarely attained in psychological research) is “large.” We caution, however, that these comparisons of the effect sizes of polygenic indexes and environmental influences aren’t apples-to-apples, because researchers usually study the effect of one particular environmental factor, out of many, on an outcome, whereas a polygenic index summarizes the predictive power of SNPs across the genome. As discussed further in FAQ 3.3, for social and behavioral outcomes, the sum of all environmental (i.e., non-genetic) influences substantially outweighs the sum of all genetic influences that a polygenic index aims to capture.

As we discuss in FAQ 3.3, an individual’s polygenic indexes (even for height) do not very accurately predict that individual’s outcomes. However, polygenic indexes are useful for scientific studies (including social science, health research, etc.). Such studies are concerned with aggregate population trends and averages rather than with individual outcomes. For example, for a polygenic index that predicts 1% of the variation across individuals, studies of its association with other variables can be well powered in sample sizes as small as 785 individuals; 10 out of the 11 datasets participating in the Repository have sample sizes larger than that.
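The sample-size figure above can be reproduced with a standard power calculation. The sketch below assumes 80% power, a two-sided test at the 5% significance level, and a normal approximation; these assumptions are ours, since this FAQ does not state them. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(r_squared, alpha=0.05, power=0.80):
    """Approximate N needed to detect a predictor explaining r_squared of the variance.

    Uses the normal approximation N = (z_{1-alpha/2} + z_{power})^2 / R^2.
    The defaults for alpha and power are assumptions, not taken from the FAQ.
    """
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1 - alpha / 2)
    z_power = normal.inv_cdf(power)
    return ceil((z_alpha + z_power) ** 2 / r_squared)


n_needed = required_sample_size(0.01)  # 785 under the stated assumptions
```

Under the same assumptions, a polygenic index that predicts 5% of the variation would need only about one-fifth as many individuals, which is why more predictive indexes are useful even in small datasets.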

A major goal of the Polygenic Index Repository is to enable other research that is valuable to social scientists and health researchers. Such studies are already being conducted with some polygenic indexes (see FAQ 1.9). For some outcomes, the polygenic indexes in the Repository are more predictive than those that were previously possible to construct; examples include having asthma/eczema/rhinitis, number of cigarettes smoked per day, having migraines, nearsightedness, self-reported physical activity, self-rated overall health, extraversion (i.e., being outgoing), and subjective well-being (i.e., self-reported happiness or life satisfaction). For other outcomes, polygenic indexes were not available prior to this paper because there had been no large published GWASs for those outcomes; examples include childhood reading, self-rated math ability, self-reported narcissism, and several allergies, including pollen allergy.

2.4. What is the “measurement-error-corrected estimator”? How will it and the Repository improve comparability of results across future studies?

To understand this tool, it’s helpful to imagine the theoretically ideal polygenic index that could result from an infinitely large GWAS. In the paper, we call the predictor that would result from this ideal GWAS the “additive SNP factor.” The actual polygenic indexes that exist in the world are “noisy” measures of, and therefore only proxies for, this additive SNP factor. The signal-to-noise ratio of a polygenic index—i.e., the extent to which it reflects the additive SNP factor—is determined by the sample size of the GWAS from which the polygenic index is constructed (a larger GWAS leads to less noise and therefore a higher signal-to-noise ratio). The fact that the polygenic index is noisy distorts the results of most analyses that use the polygenic index (relative to what the results would be with the ideal predictor). These distortions can lead researchers to reach incorrect conclusions. For example, in an analysis of how genes and environments interact in influencing some outcome, the noise in the polygenic index will usually cause a researcher to underestimate how strongly genes and environments interact.

Moreover, as discussed in FAQ 1.8, there are many reasons why two polygenic indexes for the same outcome could differ from each other, including differences in the GWAS that the polygenic index is based on and different methods for constructing the polygenic index. Many of these differences among GWASs produce differences in the signal-to-noise ratios of their resulting polygenic indexes. Two studies using polygenic indexes with different signal-to-noise ratios will, in turn, have results that are distorted to differing degrees, reducing comparability of results across studies that use the polygenic indexes.

The “measurement-error-corrected estimator” we derive in the paper enables researchers to conduct analyses without the distortion that comes from the noise. It works because we (often) have a good estimate of how much noise a given polygenic index has. We can use that information to calculate what the results of an analysis would have been if the polygenic index had no noise. The estimator improves comparability of results across papers because it avoids the distortions in results that arise from having a noisy polygenic index. Rather than being distorted to different degrees, two studies using polygenic indexes with different signal-to-noise ratios that use our estimator will both have undistorted results. We have made available the software for this estimator. We will maintain and provide user support for this software.

Moreover, across all the polygenic indexes and across all the datasets participating in the Repository, we constructed the polygenic indexes in a uniform way. To the extent that future studies use the polygenic indexes from the Repository, their results will therefore be more comparable.
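The distortion and its correction can be illustrated with a small simulation. This is a simplified sketch of the general idea (classical measurement error in a standardized predictor), not the estimator or software from the paper. All names and parameter values are hypothetical, and the noise variance is treated as known.

```python
import random
from statistics import fmean, pstdev


def ols_slope(x, y):
    """Slope from a simple regression of y on x."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def standardize(values):
    """Rescale to mean 0 and standard deviation 1."""
    m, s = fmean(values), pstdev(values)
    return [(v - m) / s for v in values]


def simulate_attenuation(n=20000, beta=0.30, noise_var=1.0, seed=1):
    """Return (naive, corrected) effect estimates per standard deviation.

    g is the noiseless "additive SNP factor"; the polygenic index is g plus
    noise, standardized as researchers would usually do before analysis.
    """
    rng = random.Random(seed)
    g = [rng.gauss(0, 1) for _ in range(n)]
    pgi = standardize([gi + rng.gauss(0, noise_var ** 0.5) for gi in g])
    y = [beta * gi + rng.gauss(0, 1) for gi in g]
    naive = ols_slope(pgi, y)
    # Share of the index's variance that is signal (assumed known here).
    reliability = 1.0 / (1.0 + noise_var)
    corrected = naive / reliability ** 0.5
    return naive, corrected


naive, corrected = simulate_attenuation()
```

With these settings the naive estimate is shrunk toward zero (to roughly 0.21), while the corrected estimate is close to the true value of 0.30. Two indexes with different noise levels would give different naive estimates but similar corrected ones, which is the sense in which the correction puts results in common units.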

2.5. What is in the User Guide that accompanies the Repository?

Along with the polygenic indexes, we have distributed to the participating datasets a User Guide. Data providers will distribute this User Guide to researchers as part of the Repository. The User Guide contains technical details about the construction of the polygenic indexes, as well as details about data and software availability. It also describes a set of key interpretational considerations that researchers should keep in mind when analyzing polygenic indexes. These include when to use a single-trait versus multi-trait polygenic index (see FAQ 2.2) and reasons why associations between a polygenic index and an outcome generally cannot be interpreted as causal (see FAQ 1.5). Finally, the User Guide contains a discussion of six “interpretational considerations” that we urge researchers who use polygenic indexes to consider as part of the responsible conduct and communication of their research (see FAQ 3.7).

2.6. Who can access the Repository polygenic indexes, and how?

Researchers can access the Repository polygenic indexes through the data access procedures for each of the datasets participating in the Repository. These are summarized in the Supplementary Note of the paper. Typically, data providers require researchers to submit a brief description of the planned research and to sign a Data Use Agreement. The Data Use Agreement usually requires researchers to agree to protect the confidentiality of individuals in the dataset and, to that end, to analyze the data on computers that satisfy certain security protocols.

We provided the polygenic indexes we created to the 11 datasets participating in the Repository, so that the data providers can distribute them to users of the datasets. We designed the Repository this way for three reasons (corresponding to problems #1, #2, and #3 in FAQ 1.9; problem #4 is addressed by using a consistent methodology for constructing the polygenic indexes). First, because we are making available the polygenic indexes (rather than the GWAS results from which they are constructed), researchers do not need to spend time constructing the polygenic indexes from GWAS results.

Second, for many outcomes, the polygenic indexes we construct are based on more data than are in the largest previously published GWAS. Because the Repository polygenic indexes for those outcomes are based on more data, they are more accurate (i.e., predictive) than polygenic indexes that could be constructed based only on publicly available GWAS results. Third, we tailored the polygenic indexes we constructed to each of the 11 datasets. Specifically, we ensured that for a given dataset, its polygenic indexes were not based on GWAS results that included that dataset (which would have led to “sample overlap” that would make it problematic to use the polygenic index with that dataset).

2.7. How will the Repository be updated?

We plan to update the Repository regularly as new GWAS are published or new data become available in which we can conduct our own GWAS. The updates will increase the predictive power of polygenic indexes already in the Repository, as well as expand the set of outcomes for which polygenic indexes are available. We also expect to include additional datasets whose stewards want to participate in the Repository and make their data broadly available to the research community.

3.   Ethical and social implications of the study

3.1.  Do GWAS or the polygenic indexes they produce identify the gene—or genes—“for” a particular outcome?

No. GWAS of complex outcomes identify many SNPs that are associated with an outcome like height or educational attainment. Although it was once believed that scientists would discover numerous strong one-to-one associations between specific genes and outcomes, we have known for a number of years that the vast majority of human traits and other outcomes are complex and are influenced by thousands of genes, each of which alone tends to have a small influence on the relevant outcome.

Furthermore, many complex outcomes are also influenced by parts of the genome that are not genes at all but instead serve to regulate genes (e.g., influencing when a gene is turned on or off). Genes typically contain many SNPs (often dozens or hundreds, in some cases thousands), and there are even more SNPs outside of genes than inside genes. Complex outcomes are often influenced by millions of SNPs.

Although the GWAS that produced the polygenic indexes included in the Repository did find several SNPs that are associated with particular outcomes, we believe that characterizing these as “genes for X”—or, more accurately—“SNPs for X” (e.g., educational attainment, height) is still likely to mislead, for many reasons, and we urge researchers and reporters to avoid this usage.

As an example, consider the outcome of educational attainment. First, most of the variation in people’s educational attainment is accounted for by social and other environmental factors, not by additive genetic effects (See FAQ 3.3). “Genes for educational attainment” might be read to imply, incorrectly, that genes are the strongest predictor of variation in educational attainment.

Second, the SNPs that are associated with educational attainment are also associated with many other things. These SNPs are no more “for” educational attainment than for the other outcomes with which they are associated.

Third, the “predictive” power (see FAQ 1.6) of each individual SNP that we identify is very small. Our previous work (Lee et al., 2018) has shown that genetic associations with educational attainment are comprised of thousands, or even millions, of SNPs, each of which has a tiny effect size. Each SNP is therefore weakly associated with, rather than a strong influence on, educational attainment. “Genes for educational attainment” might misleadingly imply the latter.

Fourth, environmental factors can increase or decrease the impact of specific SNPs (see FAQ 3.3). Put differently, even if a SNP is associated with higher or lower levels of educational attainment on average, it may have a much larger or smaller effect depending on environmental conditions. Indeed, in our most recent GWAS of educational attainment (Lee et al., 2018) and elsewhere, we report exploratory analyses that provide evidence of such gene-environment interactions. Educational attainment couldn’t even exist as a meaningful object of measurement if we didn’t have schools, and having schools introduces societal mechanisms that influence who goes to them. Accordingly, genetic associations with educational attainment necessarily will be mediated by societal systems and therefore genetic variation should often be expected to interact with environmental factors when it influences social phenomena, such as educational attainment. “Genes for educational attainment” suggests a stability in the relationship between these genes and the outcome of educational attainment that does not exist.

Finally, SNPs do not affect educational attainment directly. As described in our previous work (Lee et al., 2018), the genes identified as associated with educational attainment tend to be especially active in the brain and involved in neural development and neuron-to-neuron communication. The “predictive” power (see FAQ 1.6) of SNPs on educational attainment may therefore be the result of a long process starting with brain development, followed by the emergence of particular psychological traits (e.g., cognitive abilities and personality). These traits may then lead to behavioral tendencies as well as experiences and treatment by parents, peers, and teachers. All of these factors may additionally interact with the environment in which a person lives. Eventually these traits, behaviors, and experiences may influence (but not completely determine) educational attainment.

3.2. Do polygenic indexes show that these outcomes are determined, or fixed, at conception?

Absolutely not. Social and other environmental factors account for most variation in most of the outcomes for which the Repository contains polygenic indexes. But even if it were true that genetic factors accounted for all of the differences among individuals in an outcome, it would still not follow that an individual’s outcome is “determined” at conception. There are at least three reasons for this.

First, some genetic effects may operate through environmental channels (Jencks, 1980). Again, consider educational attainment as an example. Suppose, hypothetically, that some of the SNPs in the index help students to memorize and, as a result, to become better at taking tests that rely on memorization. In this example, changes to the intermediate environmental channels (here, the type of tests administered in schools) could have large effects on individuals’ educational attainment, even though individuals’ genomes would not have changed. Certain SNPs might not be associated with educational attainment at all if schools did not use tests that rely on memorization. More generally, the polygenic index for educational attainment in the Repository might be less predictive if the education system were organized differently than it is at present (see also FAQ 3.3).

Second, even if the genetic associations with educational attainment operated entirely through non-environmental mechanisms that are difficult to modify (such as direct influences on the formation of neurons in the brain and the biochemical interactions among them), there could still exist powerful environmental interventions that could change the genetic relationships. In a famous example suggested by the economist Arthur Goldberger, even if all variation in unaided eyesight were due to genes, there would still be enormous benefits from introducing eyeglasses (Goldberger, 1979). Similarly, policies such as a required minimum number of years of education and dedicated resources for individuals with learning disabilities can increase educational attainment in the entire population and/or reduce differences among individuals.

Third, even if the genetic effects on an outcome were not influenced by changes in the environment, those environmental changes themselves could still have a major impact on the outcome in the population as a whole. For example, if young children were given more nutritious diets, then everyone’s school performance might improve, and college graduation rates might increase. Or consider the outcome of height: 80%-90% of the variation across individuals in height is due to genetic factors. Yet the current generation of people is much taller than past generations due to changes in the environment such as improved nutrition.

3.3. Can the polygenic indexes from the Repository be used to accurately predict a particular person’s outcomes?

No. While the “predictive” power (see FAQ 1.6) of our polygenic indexes makes most of them useful in research for some purposes (see FAQ 2.3), these polygenic indexes fail to predict the majority of variation across individuals. Even for height, the outcome for which our polygenic index has the greatest predictive power (26% to 34%, depending on the validation dataset), the index fails to predict roughly 70% of the variation.

Indeed, an important message of a number of our earlier papers is that DNA does not “determine” an individual’s behavioral and social outcomes, for at least four reasons.

First, in the environments in which the outcomes have been measured, the additive effects of SNPs are estimated to account for only a minority of the variation across individuals in the outcomes we study, even if arbitrarily large samples were used to construct polygenic indexes. For example, we estimate that the theoretical upper bound for additive effects of SNPs would account for 46% of the variation in height, 24% in body mass index, 20% in age at first menses, and less than 10% for most of the social/behavioral outcomes we study. So even a hypothetical polygenic index that perfectly reflects the additive SNP factor (see FAQ 2.4) could explain only a minority of the variation across individuals.

Second, today’s polygenic indexes are not perfect; they capture only part of that already limited share of the variation.

Third, since SNPs matter more or less depending on environmental context (see FAQ 3.2), a polygenic index might be less (or more) predictive for individuals in some environments than for individuals in others.

Finally, and similarly, polygenic predictions only hold for as long as the environment in which they were developed remains substantially the same.

To illustrate these final two reasons, consider the example of educational attainment (for which we have included a polygenic index in the Repository and on which we have done previous research): if the pedagogy underlying the educational system in which the GWAS that produced the polygenic index was conducted is substantially different than the pedagogy of the different population to which that polygenic index is being applied, the polygenic index may be less (or, conceivably, more) predictive in this second population (for an example, see FAQ 3.2). The same is true if the polygenic index is applied to the same population, but at a later time when the pedagogy has changed substantially. Just as eyeglasses allow those genetically predisposed to poor vision to have nearly perfect vision, innovations in education (say, an innovation that makes education irresistibly engaging, thus mitigating the risk to those with SNPs associated with lower ability to pay attention or maintain self-control) might result in those with lower polygenic indexes now achieving just as much education, on average, as those with higher polygenic indexes.

As sample sizes for GWAS continue to grow, it will likely be possible to construct polygenic indexes for many outcomes whose predictive power comes closer to the total amount of variation that is theoretically predictable from additive effects of common SNPs for those outcomes (the upper bounds given above). Even these levels of predictive power would pale in comparison to some other scientific predictors. For example, professional weather forecasts correctly predict about 95% of the variation in day-to-day temperatures. Weather forecasters are therefore vastly more accurate forecasters than social science geneticists will ever be.

Note: Polygenic indexes created by GWASs are increasingly used by commercial and research direct-to-consumer platforms to predict individual outcomes. We recognize that returning individual genomic “results” can be a fun way to engage people in research and other projects and has at least the theoretical potential to stoke their interest in, and educate them about, genomics and how genes and environments interact. But it is important that participants/users understand that, at present, most of these individual results, including those for all social and behavioral outcomes, are not meaningful predictions (in the sense that they generally have very little predictive power at the individual level). Failure to make this point clear risks sowing confusion and undermining trust in genetics research.

3.4. Can the polygenic indexes accurately be used for research studies in non-European-ancestry populations?

No. We constructed polygenic indexes only for individuals classified as “European ancestry.” (The precise definition of “European ancestry” differs in different datasets, but it usually means that a person’s pattern of genetic variation across the genome is statistically close to the average pattern from a “reference sample” for some European country. The reference samples used by geneticists are based on samples of people who live in the European country today and whose recent ancestors also lived in that country.) Therefore, the Polygenic Index Repository only includes polygenic indexes for these individuals.

The main reason we only constructed polygenic indexes for these individuals is that the polygenic indexes are likely to be much less predictive—and hence much less useful—in a sample of people of non-European ancestries. That is because our original GWAS data was obtained from samples of people of European ancestries, and GWAS results have been found to have only limited portability across ancestries (Belsky et al., 2013; Domingue et al., 2015, 2017; Martin et al., 2017; Vassos et al., 2017). There are a number of reasons for the limited portability. For one thing, the set of SNPs that are associated with an outcome in people of European ancestries is unlikely to overlap closely with the set of SNPs associated with the outcome in people of non-European ancestries. And even if a given SNP is associated in both ancestry groups, the effect size—in other words, the strength of the association—will almost surely differ. This is primarily because linkage disequilibrium (LD) patterns (i.e., the correlation structure of the genome) vary by ancestry. A SNP may be associated with the outcome only because it is in LD (i.e., correlated) with a SNP elsewhere in the genome that causally affects the outcome (see FAQ 1.5). If the strength of that correlation is greater in one ancestry group than in another, then the size of the association will be larger in that ancestry group. Moreover, even if LD patterns were similar in each ancestry group, the association may differ in different groups because environmental conditions differ (see FAQ 1.6). The fact that there are differences across ancestry groups in the set of associated SNPs and their effect sizes means that the weights for constructing polygenic indexes in European-ancestry individuals (FAQ 1.3) would be the “wrong” weights for non-European-ancestry individuals. For a more extensive, excellent discussion of these and related issues, see Graham Coop’s blog post, “Polygenic scores and tea drinking.”

Unfortunately, this attenuation of predictive power means that for non-European-ancestry populations, many of the benefits of having a polygenic index available will have to wait until large GWASs are conducted using samples from these populations. (Currently, most large genotyped samples are of European ancestries.) We intend that future versions of the Polygenic Index Repository will include polygenic indexes for non-European-ancestry populations, once it becomes possible to produce polygenic indexes with adequate predictive power. We believe that the relative scarcity of polygenic indexes that can be used for research that focuses on non-European ancestry groups is a disparity that should be rapidly eliminated by prioritizing GWASs that focus on non-European populations.
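The LD argument in this answer can be made concrete with a small worked example. It assumes the simplest possible case: standardized genotypes and a single causal SNP that a measured "tag" SNP is correlated with. The function name and all numbers are hypothetical and chosen only for illustration.

```python
def marginal_tag_effect(causal_effect, ld_correlation):
    """Expected per-standard-deviation association at a tag SNP.

    Assumes standardized genotypes and one causal SNP: the association
    observed at the tag SNP is the causal effect scaled by the LD
    correlation between the tag SNP and the causal SNP.
    """
    return ld_correlation * causal_effect


causal_effect = 0.10  # hypothetical effect of the causal SNP, identical in both groups

# Hypothetical LD correlations between the tag SNP and the causal SNP.
effect_in_discovery_group = marginal_tag_effect(causal_effect, 0.9)
effect_in_other_group = marginal_tag_effect(causal_effect, 0.4)

# A weight estimated in the discovery group overstates the tag SNP's
# association in the other group by this factor.
overstatement = effect_in_discovery_group / effect_in_other_group
```

Even though the causal effect is the same in both groups in this example, the weight learned for the tag SNP in one group is more than twice as large as the association in the other, which is one sense in which the weights are "wrong" when carried across ancestries.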

3.5. Would it be appropriate to use the Repository social and behavioral polygenic indexes in policy or practice?

No. We reiterate that polygenic indexes are poor predictors of social and behavioral outcomes (see FAQs 2.3 and 3.3). Their incremental predictive power over and above other, non-genetic predictors that are already used is even smaller than a polygenic index’s predictive power on its own. Moreover, the predictive power of the polygenic indexes for social and behavioral outcomes depends on the environment in which the GWAS participants live (FAQ 3.3). Thus, enshrining polygenic indexes in policy risks basing policy (which can be difficult to change) on weak predictions that could become even weaker or nonexistent as the environment changes. Furthermore, the polygenic indexes can operate through environmental channels (FAQ 3.2). Allocating resources based on polygenic indexes could therefore exacerbate inequalities that were originally due to environmental disparities (a similar risk to that of other biased algorithms that bake in pre-existing discrimination). Using polygenic indexes in order to prioritize giving resources to individuals who are already advantaged would further limit the opportunities of individuals who are disadvantaged, which would be ethically inappropriate. Finally, even if polygenic indexes were used to offer additional resources to disadvantaged individuals, any small potential benefits of using such weak individual predictors would almost certainly be offset by the risk of stigmatization and by the fact that this technology is currently only accessible to people of European ancestries (FAQ 3.4). For all these reasons, we are deeply skeptical that the Repository social and behavioral polygenic indexes have any appropriate role to play in policy now or in the foreseeable future.

3.6. Could research on polygenic indexes lead to discrimination against, or stigmatization of, people with higher or lower polygenic indexes for certain outcomes? If so, why facilitate the spread of polygenic indexes?

Unfortunately, like a great deal of research—including, for instance, research identifying genomic variation associated with increased cancer risk—the results can be misunderstood and misapplied. This includes being used to discriminate against those with higher or lower polygenic indexes for certain outcomes (e.g., in insurance markets). Nevertheless, for a variety of reasons, in this instance, we do not think that the best response to the possibility that useful knowledge could be misused is to refrain from producing the knowledge. Moreover, many researchers already have access to and use polygenic indexes; against this background, the Repository helps ensure that a much wider array of researchers have the same opportunity to access and probe these research tools, and also that the polygenic indexes themselves will be more accurate. Here, we briefly discuss some of the broad potential benefits of this research. We then describe what we see as our ethical duty as researchers conducting this work.

First, one benefit of conducting social-science genetics research in ever larger samples is that doing so allows us to correct the scientific record. An important theme in our earlier work has been to point out that most existing studies in social-science genetics that report genetic associations with behavioral outcomes have serious methodological limitations, fail to replicate, and are likely to be false-positive findings (Benjamin et al., 2012; Chabris et al., 2012, 2015). This same point was made in an editorial in Behavior Genetics (the leading journal for the genetics of behavioral outcomes), which stated that “it now seems likely that many of the published [behavior genetics] findings of the last decade are wrong or misleading and have not contributed to real advances in knowledge” (Hewitt, 2012). One of the most important reasons why earlier work has generated unreliable results is that the sample sizes were far too small, given that the true effects of individual SNPs on behavioral outcomes are tiny. Pre-existing claims of genetic associations with complex social-science outcomes have reported widely varying effect sizes, many of them purporting to “predict” as much of the variation across individuals as do the polygenic indexes we construct in this paper that aggregate the effects of millions of SNPs.

Second, behavioral genetics research also has the potential to correct the social record and thereby to help combat discrimination and stigmatization. For instance, overestimating the role of genetics can be damaging, and the present work can help debunk the myth of genetic determinism. By quantifying how various outcomes are predicted by genetic data, we show that for all of the outcomes we study, the genetic data can explain a very small fraction of the variation across individuals (see FAQ 2.3). By clarifying the limits of deterministic views of complex outcomes, recent behavioral genetics research—if communicated responsibly—could make appeals to genetic justifications for discrimination and stigmatization less persuasive to the public in the future.

Third, behavioral genetics research has the potential to yield many other benefits, especially as sample sizes continue to increase, as briefly summarized in FAQ 1.9. Forgoing this research necessarily entails forgoing these and any other possible benefits, some of which will likely be the result of serendipity. Indeed, very few of the uses of polygenic indexes were anticipated when they were first proposed (Wray, Goddard and Visscher, 2007).

In sum, we agree with the U.K. Nuffield Council on Bioethics, which concluded in a report (Nuffield Council on Bioethics, 2002, p. 114) that “research in behavioural genetics has the potential to advance our understanding of human behaviour and that the research can therefore be justified,” but that “researchers and those who report research have a duty to communicate findings in a responsible manner” (see FAQ 3.7).

3.7. What have you done to mitigate the risks of research using Repository polygenic indexes?

In our view, the responsible behavioral genetics research called for by the Nuffield Council on Bioethics (see FAQ 3.6) includes sound methodology and analysis of data (e.g., only conducting analyses that are adequately powered and, when feasible, preregistering power calculations and planned analyses); a commitment to publish all results, including any negative results; and transparent, complete reporting of methodology and findings in publications, presentations, and communications with the media and the public. A critical aspect of the latter is particular vigilance regarding what research results do—and do not—show, and how polygenic indexes can—and cannot—be appropriately used. In an effort to reduce the risk that its results might be misinterpreted by readers, misreported by the media, or misused, the SSGAC has developed and publicly posted FAQs like this document with every major paper it has published since its first paper in 2013.

 

In addition, the SSGAC will require researchers who download the SNP weights for constructing polygenic indexes to agree to Terms of Service. Among the many terms that we require researchers to agree to, we highlight two here:

I understand that comparisons of genetically predicted phenotype levels across ancestral groups are usually scientifically confounded due to the effects of linkage disequilibrium, gene-environment correlation, gene-environment interactions, and other methodological problems. I have read and understand the principles articulated by the American Society of Human Genetics (ASHG) position statement: “ASHG Denounces Attempts to Link Genetics and Racial Supremacy.” (See also International Genetic Epidemiological Society Statement on Racism and Genetic Epidemiology.) 

 

I have read and understand the principles articulated by the ASHG with respect to “Advancing Diverse Participation in Research with Special Consideration for Vulnerable Populations” (https://www.cell.com/ajhg/fulltext/S0002-9297(20)30279-2). In particular, I understand the principles articulated in the final two sections of this statement, “In the Conduct of Research with Vulnerable Populations, Researchers Must Address Concerns that Participation May Lead to Group Harm” and “The Benefits of Research Participation Are Profound, Yet the Potential Danger that Unethical Application of Genetics Might Stigmatize, Discriminate against, or Persecute Vulnerable Populations Persists.” 

These Terms of Service stem from the observation that SNP associations are not necessarily causal (see FAQ 1.5) and depend on the environment of the individuals included in the GWAS (see FAQ 1.6). Different ancestry groups arise in the population because they became partially separated from each other many generations ago, for example, due to geographic factors or social forces. When two groups are geographically or socially separated, they also face different environments, which not only may have direct effects on certain outcomes (such as disease risk) but may also change the strength of the association between the outcomes and certain SNPs. Therefore, when individuals from two ancestry groups have different average outcomes, it is extremely difficult to identify whether the difference is due to average genetic differences between the groups or to the different environments faced by the groups. For this reason, it is usually scientifically invalid to make general statements about ancestry group differences based on SNP associations identified in a GWAS. (Also see FAQ 3.2.) The Terms of Service also require users to securely store the data and to immediately report any breach of the Terms.

 

Finally, we have developed and provided to participating data providers a User Guide to be distributed to researchers who use Repository polygenic indexes (see FAQ 2.5). We will also provide the User Guide to researchers who download the SNP weights. One section of the User Guide discusses six “interpretational considerations” that are likely to arise when conducting research with polygenic indexes and which we urge researchers to seriously consider as a critical part of responsibly conducting and communicating their research. One recurring ethical concern about genetic research is the tendency for its predictive power to become exaggerated in the media and in the public’s minds, at the expense of a more nuanced understanding of how genes and environment interact, the importance of environmental influences, and the ability of interventions to improve outcomes. Many of the interpretational considerations we discuss in the User Guide involve how to anticipate and address potential confounds and how to navigate complex questions about causality and ensure responsible communication of causality.

For instance, the User Guide cautions researchers to appreciate and communicate that associations between a polygenic index and an outcome may operate through environmental (rather than biological) mechanisms (see FAQs 3.2 and 3.3).

4. References

Abdellaoui, A. et al. (2019). Genetic correlates of social stratification in Great Britain. Nature Human Behaviour, 3 (12), 1332–1342. Available from https://doi.org/10.1038/s41562-019-0757-5.

Amos, C.I. et al. (2008). Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1. Nature Genetics, 40, 616–622. Available from https://doi.org/10.1038/ng.109.

 

Barcellos, S.H., Carvalho, L.S. and Turley, P. (2018a). Education can reduce health differences related to genetic risk of obesity. Proceedings of the National Academy of Sciences, 115 (42), E9765. Available from https://doi.org/10.1073/pnas.1802909115.

 

Barcellos, S.H., Carvalho, L.S. and Turley, P. (2018b). Education can Reduce Health Disparities Related to Genetic Risk of Obesity: Evidence from a British Reform. bioRxiv. Available from https://doi.org/10.1101/260463.

Belsky, D.W. et al. (2013). Development and evaluation of a genetic risk score for obesity. Biodemography and Social Biology, 59 (1), 85–100. Available from https://doi.org/10.1080/19485565.2013.774628.

Belsky, D.W. et al. (2016). The Genetics of Success. Psychological Science, 27 (7), 957–972. Available from https://doi.org/10.1177/0956797616643070.

Benjamin, D.J. et al. (2012). The Promises and Pitfalls of Genoeconomics. Annual Review of Economics, 4 (1), 627–662. Available from https://doi.org/10.1146/annurev-economics-080511-110939.

Chabris, C.F. et al. (2012). Most reported genetic associations with general intelligence are probably false positives. Psychological Science, 23 (11), 1314–1323. Available from https://doi.org/10.1177/0956797611435528.

Chabris, C.F. et al. (2015). The Fourth Law of Behavior Genetics. Current Directions in Psychological Science, 24 (4), 304–312. Available from https://doi.org/10.1177/0963721415580430.

Cohen, J. (1992). Statistical Power Analysis. Current Directions in Psychological Science, 1 (3), 98–101. Available from https://doi.org/10.1111/1467-8721.ep10768783.

Davies, N.M. et al. (2018). The causal effects of education on health outcomes in the UK Biobank. Nature Human Behaviour. Available from https://doi.org/10.1038/s41562-017-0279-y.

Domingue, B.W. et al. (2015). Polygenic Influence on Educational Attainment: New evidence from The National Longitudinal Study of Adolescent to Adult Health. AERA Open, 1 (3), 1–13. Available from https://doi.org/10.1177/2332858415599972.

Domingue, B.W. et al. (2017). Mortality selection in a genetic sample and implications for association studies. International Journal of Epidemiology, 46 (4), 1285–1294. Available from https://doi.org/10.1093/ije/dyx041.

Domingue, B.W. et al. (2018). Geographic Clustering of Polygenic Scores at Different Stages of the Life Course. RSF: The Russell Sage Foundation Journal of the Social Sciences, 4 (4), 137–149. Available from https://doi.org/10.7758/RSF.2018.4.4.08.

Goldberger, A.S.A. (1979). Heritability. Economica, 46 (184), 327–347.

Hewitt, J.K. (2012). Editorial policy on candidate gene association and candidate gene-by-environment interaction studies of complex traits. Behavior Genetics, 42 (1), 1–2. Available from https://doi.org/10.1007/s10519-011-9504-z.

Howe, L.J. et al. (2021). Within-sibship GWAS improve estimates of direct genetic effects. bioRxiv, 2021.03.05.433935. Available from https://doi.org/10.1101/2021.03.05.433935.

Hung, R.J. et al. (2008). A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit genes on 15q25. Nature. Available from https://doi.org/10.1038/nature06885.

Jencks, C. (1980). Heredity, environment, and public policy reconsidered. American Sociological Review, 45 (5), 723–736.

Khera, A. V. et al. (2018). Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nature Genetics, 50 (9), 1219–1224. Available from https://doi.org/10.1038/s41588-018-0183-z.

Khera, A. V. et al. (2019). Polygenic Prediction of Weight and Obesity Trajectories from Birth to Adulthood. Cell, 177 (3), 587-596.e9. Available from https://doi.org/10.1016/j.cell.2019.03.028.

Koellinger, P.D. and Harden, K.P. (2018). Using nature to understand nurture: Genetic associations show how parenting matters for children’s education. Science, 359 (6374), 386–387. Available from https://doi.org/10.1126/science.aar6429.

Kong, A. et al. (2018). The nature of nurture: Effects of parental genotypes. Science, 359 (6374), 424–428. Available from https://doi.org/10.1126/science.aan6877.

Lambert, S.A. et al. (2020). The Polygenic Score Catalog: an open database for reproducibility and systematic evaluation. medRxiv, 2020.05.20.20108217. Available from https://doi.org/10.1101/2020.05.20.20108217.

Lander, E.S. and Schork, N.J. (1994). Genetic dissection of complex traits. Science, 265, 2037–48.

Lee, J.J. et al. (2018). Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nature Genetics, 50 (8), 1112–1121. Available from https://doi.org/10.1038/s41588-018-0147-3.

Martin, A.R. et al. (2017). Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. American Journal of Human Genetics, 100 (4), 635–649. Available from https://doi.org/10.1016/j.ajhg.2017.03.004.

Nuffield Council on Bioethics. (2002). Genetics and human behaviour: the ethical context. London: Nuffield Council on Bioethics [http://nuffieldbioethics.org/wp-content/uploads/2014/07/Genetics-and-human-behaviour.pdf].

Purcell, S.M. et al. (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460 (7256), 748–752. Available from https://doi.org/10.1038/nature08185.

Robinson, M.R. et al. (2017). Genetic evidence of assortative mating in humans. Nature Human Behaviour. Available from https://doi.org/10.1038/s41562-016-0016.

Schmitz, L.L. and Conley, D. (2017). The effect of Vietnam-era conscription and genetic potential for educational attainment on schooling outcomes. Economics of Education Review, 61, 85–97. Available from https://doi.org/10.1016/j.econedurev.2017.10.001.

Thorgeirsson, T.E. et al. (2008). A variant associated with nicotine dependence, lung cancer and peripheral arterial disease. Nature, 452 (7187), 638–642. Available from https://doi.org/10.1038/nature06846.

Turkheimer, E. (2000). Three laws of behavior genetics and what they mean. Current Directions in Psychological Science, 9 (5), 160–164.

Turley, P. et al. (2018). Multi-trait analysis of genome-wide association summary statistics using MTAG. Nature Genetics, 50 (2), 229–237. Available from https://doi.org/10.1101/118810.

Vassos, E. et al. (2017). An Examination of Polygenic Score Risk Prediction in Individuals With First-Episode Psychosis. Biological Psychiatry, 81 (6), 470–477. Available from https://doi.org/10.1016/j.biopsych.2016.06.028.

Vilhjálmsson, B.J. et al. (2015). Modeling linkage disequilibrium increases accuracy of polygenic risk scores. The American Journal of Human Genetics, 97 (4), 576–592.

Visscher, P.M. et al. (2017). 10 Years of GWAS Discovery: Biology, Function, and Translation. American Journal of Human Genetics, 101 (1), 5–22. Available from https://doi.org/10.1016/j.ajhg.2017.06.005.

Wray, N.R., Goddard, M.E. and Visscher, P.M. (2007). Prediction of individual genetic risk to disease from genome-wide association studies. Genome research, 17 (10), 1520–1528. Available from https://doi.org/10.1101/gr.6665407.

Yengo, L. et al. (2018). Imprint of Assortative Mating on the Human Genome. Nature Human Behaviour, 2 (12), 948–954. Available from https://doi.org/10.1038/s41562-018-0476-3.

FAQs about “Genome-wide association analyses of risk tolerance and risky behaviors in over 1 million individuals identify hundreds of loci and shared genetic influences”

Use the quick link menu to jump to a specific question, or scroll down to read all FAQs for this publication. 

 

This document provides information about the study:

 

Karlsson Linnér et al. 2019. “Genome-wide association analyses of risk tolerance and risky behaviors in over 1 million individuals identify hundreds of loci and shared genetic influences.” Nature Genetics.

The document was prepared by Jonathan P. Beauchamp, Daniel J. Benjamin, Richard Karlsson Linnér, Philipp D. Koellinger, and Michelle N. Meyer. It draws from and builds on the FAQs for earlier SSGAC papers. It has the following sections:

          1. Background

          2. Study design and results

          3. Social and ethical implications of the study

          4. Appendices

For clarifications or additional questions, please contact Jonathan P. Beauchamp (jonathan.pierre.beauchamp@gmail.com).

Quick Links

1.1.  Who conducted this study? What was the group’s overarching goal?

1.2.   The current study focuses on a variable called "general risk tolerance." What is general risk tolerance?

1.3.  What was already known about the genetics of risk tolerance prior to this study?

2.1.  What did you do in this paper? How was the study designed?

2.2.  What did you find in the GWAS?

2.3.  Are the SNPs associated with higher risk tolerance in your study also associated with other phenotypes?

2.4.  How much of a particular person's risk tolerance can be predicted from the results of this paper?

2.5.  What do your results tell us about human biology and brain development?

2.6.  How do your results relate to previous research on the genetics of risk tolerance?

3.1.  Did you find “the gene for” (or "the genes for") risk tolerance?

3.2.  Does this study show that an individual's level of risk tolerance is determined and fixed at conception?

3.3.  Can you use the results in this paper to meaningfully predict a particular person's risk tolerance?

3.4.  Can environmental factors modify the effects of the specific SNPs you identified?

3.5.  What policy lessons or practical advice do you draw from this study?

3.6.  Could this kind of research lead to discrimination against, or stigmatization of, people with specific genetic variants? If so, why conduct this research?

Appendix 1:  Quality control measures

Appendix 2:  Additional reading and references

1. Background

1.1.  Who conducted this study? What was the group's overarching goal?

The authors are members of the Social Science Genetic Association Consortium (SSGAC). The SSGAC is a multi-institutional, multi-disciplinary, international research group that aims to identify statistically robust links between genetic variants (for instance, base-pairs of DNA that vary across people) and phenotypes of interest to social scientists. A “phenotype” refers to anything that may be influenced by DNA, such as disease risk or physical characteristics. The phenotypes of interest to social scientists include behaviors, preferences, personality traits, and socioeconomic outcomes.

 

The SSGAC was formed in 2011 to overcome a specific set of scientific challenges. As is now well understood (Chabris et al. 2015), most phenotypes—including virtually all social-science phenotypes—are influenced by hundreds or thousands of genetic variants. Although in combination their collective effects can be sizeable, almost every one of these genetic variants has an extremely small effect on its own. To reliably identify these individual variants, therefore, scientists must study large samples; typically, hundreds of thousands of individuals are required. One approach to obtaining a large enough sample is for many research groups to pool analyses of their data into a single, large study. This approach has borne considerable fruit when used by medical geneticists interested in a range of diseases and conditions (Visscher et al. 2017a). Most of these advances would not have been possible without large research collaborations between multiple research groups interested in similar questions. The SSGAC was formed in an attempt by social scientists to adopt this research model.

 

The SSGAC is organized as a working group of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), a successful medical consortium. It was founded by three social scientists—Daniel Benjamin (University of Southern California), David Cesarini (New York University), and Philipp Koellinger (Vrije Universiteit Amsterdam)—who believed that genetic data could have a substantial positive impact on research in the social sciences, and that social-science genetics could make important contributions to medical research. The Advisory Board for the SSGAC is composed of prominent researchers representing various disciplines: Dalton Conley (Sociology, New York University), George Davey Smith (Epidemiology, University of Bristol), Tõnu Esko (Molecular Genetics, Broad Institute and Estonian Genome Center), Albert Hofman (Epidemiology, Harvard University), Robert Krueger (Psychology, University of Minnesota), David Laibson (Economics, Harvard University), James Lee (Psychology, University of Minnesota), Sarah Medland (Statistical Genetics, QIMR Berghofer Medical Research Institute), Michelle Meyer (Bioethics, Geisinger Health System), and Peter Visscher (Statistical Genetics, University of Queensland).

 

The SSGAC is committed to the principles of reproducibility and transparency. Prior to conducting genetic association studies, power calculations are carried out to determine the necessary sample size for the analysis (assuming realistically small effect sizes associated with individual genetic variants). These, together with an analysis plan, are often preregistered on the Open Science Framework (OSF) [The analysis plan for this study can be downloaded here: https://osf.io/cjx9m/]. Major SSGAC publications are usually accompanied by a FAQ document (such as this one). The FAQ document is written to communicate to journalists and the public what was found and what can and cannot be concluded from the research findings.
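
To give a sense of what such a power calculation involves, here is a minimal Python sketch. It is not the SSGAC's preregistered calculation. It uses a standard two-sided z-test approximation, and the significance threshold (5e-8, the conventional genome-wide level), the 80% power target, and the example effect sizes are all assumptions chosen for illustration.

```python
from statistics import NormalDist


def required_sample_size(r2, alpha=5e-8, power=0.80):
    """Approximate sample size needed to detect a single SNP that
    explains a fraction r2 of the variance in a phenotype.

    Uses the normal approximation N ~ (z_alpha + z_power)^2 * (1 - r2) / r2.
    alpha and power are illustrative defaults, not values from the study.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided threshold
    z_power = std_normal.inv_cdf(power)
    return (z_alpha + z_power) ** 2 * (1 - r2) / r2


# Hypothetical effect sizes: share of phenotypic variance explained by one SNP.
for r2 in (0.01, 0.001, 0.0002):
    n = required_sample_size(r2)
    print(f"R^2 = {r2:.4%}: about {n:,.0f} individuals")
```

Under these assumptions, a SNP explaining 0.02% of the variance would call for a sample on the order of a couple hundred thousand individuals, which is in line with the sample sizes described above.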

The SSGAC’s first major project was a genome-wide association study (GWAS) of educational attainment published in Science (Rietveld et al. 2013b). The study is summarized in a FAQ posted on the SSGAC website (https://www.thessgac.org/faqs). The study was followed by two related studies, using successively much larger samples, published in Nature (Okbay et al. 2016b) and Nature Genetics (Lee et al. 2018). Subsequent SSGAC papers have studied subjective well-being, depressive symptoms, the personality trait neuroticism, cognitive performance, and reproductive behavior. These papers have been published in Nature Genetics (Barban et al. 2016, Okbay et al. 2016a), Proceedings of the National Academy of Sciences (Rietveld et al. 2013a, 2014b), and Psychological Science (Chabris et al. 2012, Rietveld et al. 2014a), among other journals. The present study is the SSGAC’s first study that focuses on the genetics of general risk tolerance.

1.2.  The current study focuses on a variable called “general risk tolerance.” What is general risk tolerance?

Risk pervades many aspects of human life and is a central concept in the study of decision-making and behavior. Somewhat surprisingly, then, there is no universally agreed-upon definition of “risk.” For our purposes, we define “risk” as the degree of variability in possible outcomes, and “risk tolerance” as a person’s willingness to choose options that entail more risk, typically to have the chance of obtaining a more rewarding outcome. For example, an engineer with a high degree of risk tolerance would be more willing to quit her job at a stable, large corporation and join a risky start-up. An individual with a high degree of risk tolerance may also be more likely to drive faster than the speed limit on a highway, thus incurring a higher risk of having an accident or a traffic ticket in order to save time.

 

An individual’s risk tolerance typically varies across domains of behavior. For instance, an individual may be willing to take relatively more risks in the career and financial domains, but not in the health and leisure domains. Nonetheless, individuals with greater risk tolerance in one domain are statistically more likely to exhibit greater risk tolerance in other domains as well. For this reason, survey-based measures of general risk tolerance—defined as a person’s general willingness to take risks—have been used as all-around predictors of risky behaviors such as portfolio allocation, occupational choice, smoking, drinking alcohol, and starting one’s own business (Beauchamp et al. 2017, Dohmen et al. 2011, Falk et al. 2015). In our study, we analyze a measure of general risk tolerance based on responses to questions such as: “Would you describe yourself as someone who takes risks? Yes / No.” The exact phrasing and number of response categories varied across the study cohorts, but all questions asked subjects about their overall or general attitudes toward risk.

1.3.  What was already known about the genetics of risk tolerance prior to this study?

Researchers have found that identical twins (who share all of their genes) tend to be more similar to one another in terms of their risk tolerance than fraternal twins (who share, on average, only half of their genes), which suggests that genetic factors influence risk tolerance. With some assumptions, it is possible to translate the greater similarity of identical twins into an estimate of the “heritability” of risk tolerance. The heritability of risk tolerance is the percentage of the variation in risk tolerance among individuals that can be accounted for statistically by genetic differences, given current environmental conditions. Estimates from twin studies suggest that risk tolerance is moderately heritable (~30%) (Beauchamp et al. 2017, Cesarini et al. 2009, Harden et al. 2017). We note, however, that such estimates are based on several assumptions and vary across studies, in part because different studies use different measures of risk tolerance as well as different assumptions and methods.
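
One classic textbook way to make this translation is Falconer's formula, which doubles the difference between the identical-twin and fraternal-twin correlations. The Python sketch below is only an illustration: the cited twin studies may use other, more elaborate models, the formula relies on assumptions such as equally shared environments for both twin types and purely additive genetic effects, and the correlations passed in are hypothetical numbers.

```python
def falconer_heritability(r_mz, r_dz):
    """Falconer's estimate of heritability: h2 = 2 * (r_MZ - r_DZ).

    r_mz: phenotypic correlation between identical (monozygotic) twins
    r_dz: phenotypic correlation between fraternal (dizygotic) twins
    """
    return 2 * (r_mz - r_dz)


# Hypothetical twin correlations, chosen only for illustration.
h2 = falconer_heritability(r_mz=0.30, r_dz=0.15)
print(f"Estimated heritability: {h2:.0%}")
```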

As we further discuss in FAQ 2.2, the current study also estimated the “SNP heritability” of risk tolerance, which is the percentage of the variation in risk tolerance among individuals that can be accounted for statistically by “common SNPs” (a type of genetic variant; see FAQ 2.1 for details), given current environmental conditions. Our estimate suggests that common SNPs account for only ~5% to 9% of the variation in risk tolerance across individuals. Importantly, while these heritability estimates all suggest that genetic factors influence risk tolerance, we emphasize that this does not imply that risk tolerance is pre-determined at birth or that genetic factors act independently of the environment, as we discuss below in FAQs 3.2 and 3.4.

Risk tolerance has been one of the most studied phenotypes in social science genetics. To date, however, nearly all published studies attempting to discover the genetic variants associated with risk tolerance have been “candidate-gene studies” conducted in relatively small samples, ranging from a few hundred to a few thousand individuals. A candidate-gene study tests the associations between a phenotype of interest and a few selected genetic variants that are hypothesized to be associated with the phenotype. Though there is nothing wrong in principle with such studies, we now know that the sample sizes of the candidate-gene studies for risk tolerance and other behavioral traits were probably too small to robustly identify genetic variants (Chabris et al. 2012, Hewitt 2012). As mentioned above, it is now well established that the bulk of the genetic variation in the vast majority of behavioral phenotypes is attributable to a large number of genetic variants, each having a very small effect (Chabris et al. 2015); for that reason, large samples are needed to detect individual genetic variants. Indeed, as we explain in FAQ 2.6, we used our own results to assess the evidence in favor of the main biological pathways and genetic variants which previous candidate-gene studies had hypothesized or reported to relate to risk tolerance. Although our sample was several orders of magnitude larger than the samples used in the candidate-gene studies, we found no evidence that these biological pathways and genetic variants are associated with risk tolerance.

To the best of our knowledge, prior to our study there had only been two studies with samples that were large enough to provide sufficient statistical power to robustly detect genetic variants with small effect sizes (Day et al. 2016, Strawbridge et al. 2018). From these studies, only two genetic variants associated with risk tolerance had been identified.

In summary, when our study was initiated, despite much interest, little was known about which genetic variants are related to risk tolerance.

Back to Top


2. Study design and results

2.1.  What did you do in this paper? How was the study designed?

We performed the largest-to-date genome-wide association study (GWAS) of risk tolerance. In a GWAS, scientists look across the human genome for genetic variants that are associated with a phenotype of interest. If a genetic variant is associated, then individuals who have a certain “allele” (i.e., a certain version of that variant) are more likely than those with a different allele to exhibit a phenotype (in this case, higher general risk tolerance).
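
As a rough illustration of what "testing a genetic variant for association" means (a minimal sketch, not the analysis pipeline used in the study), the snippet below regresses a phenotype on the number of copies of one allele (0, 1, or 2) that each person carries at a single SNP. The function name `allele_effect` and the toy data are hypothetical, and the effect in the toy data is deliberately exaggerated so that it is visible in ten observations.

```python
def allele_effect(allele_counts, phenotype):
    """Slope from a simple linear regression of phenotype on allele count."""
    n = len(allele_counts)
    mean_g = sum(allele_counts) / n
    mean_y = sum(phenotype) / n
    # Covariance and variance sums (the 1/n factors cancel in the ratio).
    cov = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(allele_counts, phenotype))
    var = sum((g - mean_g) ** 2 for g in allele_counts)
    return cov / var


# Hypothetical toy data: allele counts at one SNP and a risk-tolerance score.
genotypes = [0, 0, 1, 1, 1, 2, 2, 0, 1, 2]
risk_tolerance = [2.0, 1.5, 2.5, 2.0, 3.0, 3.5, 3.0, 1.0, 2.5, 4.0]

slope = allele_effect(genotypes, risk_tolerance)
print(f"estimated per-allele effect: {slope:.3f}")
```

A GWAS repeats a test of this kind for millions of SNPs, with control variables and a very strict significance threshold to account for the number of tests.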

We chose a GWAS study design because it has been a successful research strategy for identifying genetic variants associated with many traits and diseases, including body height (Wood et al. 2014), BMI (Locke et al. 2015), Alzheimer’s disease (Lambert et al. 2013), and schizophrenia (Ripke et al. 2014). GWAS have also recently been used to identify genetic variants associated with a variety of health-relevant social science outcomes, such as the number of children a person has (Barban et al. 2016), happiness (Okbay et al. 2016a, Turley et al. 2018), and educational attainment (Okbay et al. 2016b, Rietveld et al. 2013b). Furthermore, scientists who have attempted to replicate reported GWAS associations in independent samples of sufficiently large size have typically been successful (Visscher et al. 2017b), thereby indicating that GWAS associations are robust findings. 


In our GWAS of general risk tolerance, we tested ~9.3M single nucleotide polymorphisms (SNPs) from across the human genome for association with general risk tolerance. SNPs are the most common type of genetic variant in the genome and are the genetic variants that are captured by the genetic data used in our study and most other modern genome-wide association studies. (There are other types of genetic variants, which we did not analyze.) Some SNPs have alleles that are relatively common in the population and are called “common SNPs,” while other SNPs have one allele that is rare in the population; our GWAS analyzed both common SNPs and some rare SNPs.


As mentioned above, genetic variants associated with social-science phenotypes tend to have very small individual effects on the phenotypes. Therefore, in order to have sufficient statistical power to discover SNPs associated with risk tolerance, we pooled the results from analyses of two very large datasets, the UK Biobank (n = 431,126 individuals) and a dataset of research participants from 23andMe (n = 508,782 individuals), thereby yielding a “discovery” sample of 939,908 individuals. We replicated the findings from this discovery sample in a “replication” sample composed of ten smaller datasets and totaling 35,445 individuals. In all of these samples, to avoid the statistical confounding that can arise when a study pools individuals of different ancestries, we restricted our GWAS to individuals of European ancestries. (For a somewhat technical explanation, see Appendix 1.)


We used the results of our GWAS of general risk tolerance for a wide range of additional analyses. For example, to examine the extent to which SNPs that are associated with risk tolerance also tend to be associated with other phenotypes, we estimated “genetic correlations” between risk tolerance and a wide range of phenotypes (see FAQ 2.3). In addition, in several samples of genotyped individuals, we used individuals’ SNP data and the results of our GWAS to construct “polygenic scores” that partially predict individuals’ risk tolerance based on their SNP data (see FAQ 2.4). We also performed a suite of bioinformatics analyses to get insight into the biology of risk tolerance (see FAQs 2.5 and 2.6).
 

In addition to our GWAS of general risk tolerance, we conducted six supplementary GWAS of phenotypes related to risk tolerance and risk-taking behaviors. We conducted a GWAS of “adventurousness,” defined as the self-reported tendency to be adventurous vs. cautious. We also conducted GWAS of four risky behaviors that each plausibly capture risk taking in a different domain of behavior: “automobile speeding propensity” (the tendency to drive faster than the speed limit), “drinks per week” (the average number of alcoholic drinks consumed per week), “ever smoker” (whether one has smoked more than once or twice), and “number of sexual partners” (the lifetime number of sexual partners). Finally, we conducted a GWAS of the first principal component of the four risky behaviors. (The first principal component is a variable that captures the common variation across the four risky behaviors and can be interpreted as capturing the general tendency to take risks across domains.) Section 1.2 of our article’s Supplementary Information provides more detail on the definitions of these phenotypes. The analyses of the six supplementary phenotypes were performed in samples ranging from ~315,000 to ~557,000 individuals. These samples were smaller because of more limited data availability for these phenotypes.
 

2.2.  What did you find in the GWAS?

Our main GWAS identified 124 SNPs associated with general risk tolerance in our discovery sample. The 124 SNPs are located in 99 “loci” (a locus is a small region of the genome). As expected, the estimated individual effects of the 124 SNPs are all very small: none of the SNPs explain more than 0.02% of the variation in general risk tolerance across individuals. 
 

We verified that the 124 SNPs identified in our discovery sample also tend to be associated with general risk tolerance in our replication sample. Because the replication sample was not large enough to provide adequate statistical power to replicate the associations of each of the 124 SNPs individually, we performed a “holistic” replication analysis. This analysis compares the overall agreement in estimates for the 124 SNPs across the discovery and the replication GWAS. This holistic replication was successful, indicating that it is highly unlikely that the results from our discovery sample were driven by chance alone.
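
One simple way to think about this kind of overall agreement is a sign-concordance check: count how many SNPs have an estimated effect in the same direction in both samples, and ask how likely that count would be if directions matched by chance alone. The sketch below illustrates that idea only. It is not necessarily the replication analysis used in the study, the function name is hypothetical, and the count of 90 concordant SNPs is an invented number.

```python
from math import comb


def sign_concordance_pvalue(n_snps, n_same_sign):
    """One-sided probability of at least n_same_sign matches among n_snps
    if each SNP's sign matched by chance alone (probability 0.5)."""
    tail = sum(comb(n_snps, k) for k in range(n_same_sign, n_snps + 1))
    return tail / 2 ** n_snps


# 124 SNPs as in the discovery GWAS; 90 concordant signs is hypothetical.
p = sign_concordance_pvalue(124, 90)
print(f"p-value under chance alone: {p:.2e}")
```

A very small value means that the observed agreement would be hard to explain if the discovery results were noise.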

 
We also estimated the “SNP heritability” of risk tolerance. The SNP heritability of a phenotype is the share of the variation in the phenotype that is statistically accounted for by common SNPs, given current environmental conditions (see FAQ 1.3). We used several methods to obtain our estimates. With all methods, we used a set of common SNPs—that is, SNPs that have alleles that are relatively common in the population—to estimate the heritability. Because the different methods make different assumptions and because we applied the different methods to slightly different data, the methods yielded different heritability estimates. Our estimates suggest that common SNPs account for ~5% to 9% of the variation in risk tolerance across individuals. (The true heritability of risk tolerance is likely to be somewhat higher, since other genetic variants, such as rare SNPs and structural genetic variants, are likely to also contribute to variation in risk tolerance.) 

 

Our six supplementary GWAS (of the phenotypes related to risk tolerance and risk-taking behaviors) identified a total of 741 associations between a specific SNP and one of the phenotypes. Because of the lack of suitable replication samples, we did not perform replication analyses for the GWAS of these six phenotypes.

2.3.  Are the SNPs associated with higher risk tolerance in your study also associated with other phenotypes?

Yes. Of the 124 SNPs we identified as associated with general risk tolerance, we found that 72 are also associated with one or more of the six supplementary phenotypes related to risk tolerance and risk-taking behaviors [Equivalently, as we write in the abstract of the paper, of the 99 loci referred to above and that contain the 124 SNPs associated with general risk tolerance, 46 also contain one or more SNPs associated with at least one of the six supplementary phenotypes.]. We also identified several regions of the genome that stood out as being associated with general risk tolerance and with all or most of the six supplementary phenotypes. We verified that the effects of the SNPs in these regions are concordant, such that SNPs associated with higher general risk tolerance are also associated with more risky behavior. This suggests that these regions represent shared genetic influences on risk tolerance and risky behaviors (rather than just being genomic hot spots containing SNPs associated with many different phenotypes).


In addition, we estimated the “genetic correlation” between general risk tolerance and various other phenotypes. The genetic correlation between two phenotypes is a measure of the extent to which the SNPs that affect one phenotype also tend to affect the other phenotype. We found that general risk tolerance is moderately to highly genetically correlated with a range of risky behaviors. General risk tolerance is genetically correlated with the six supplementary phenotypes (which capture various types of risky behavior), with estimates of the genetic correlations ranging from 0.25 to 0.83. General risk tolerance is also moderately to highly genetically correlated with a number of additional risky behaviors, including cannabis use and self-employment. Importantly, the genetic correlations are in the expected direction, with higher risk tolerance being associated with riskier behavior. Moreover, our estimates of the genetic correlations between general risk tolerance and the supplementary risky behaviors are substantially higher than the corresponding phenotypic correlations [Although measurement error partly accounts for the lower phenotypic correlations, the genetic correlations remain considerably higher even after adjustment of the phenotypic correlations for measurement error.], implying that general risk tolerance is more strongly associated with these risky behaviors at the genetic level than at the non-genetic (environmental) level. The relatively high genetic correlations between general risk tolerance and risky behaviors suggest the existence of a genetically-influenced “general factor of risk tolerance” that captures a general tendency to take risk across domains of behavior.


We also found that risk tolerance is moderately genetically correlated with several personality and neuropsychiatric phenotypes. Of note, the estimated genetic correlations with the personality traits extraversion (r_g = 0.51)["r_g" denotes a genetic correlation estimate.], neuroticism (r_g = –0.42), and openness to experience (r_g = 0.33) are highly statistically significant and are substantially larger in magnitude than previously reported phenotypic correlations, pointing to shared genetic influences among general risk tolerance and these personality traits. We also found statistically significant and positive genetic correlations between general risk tolerance and the neuropsychiatric phenotypes ADHD, bipolar disorder, and schizophrenia.

2.4.  How much of a particular person’s risk tolerance can be predicted from the results of this paper?

Although each individual SNP has a very small effect, the GWAS estimates of the SNPs’ (very small) effects can be combined to create a “polygenic score,” an index that takes into account the effects of many SNPs from across the genome. Because a polygenic score aggregates the information from many SNPs, it can predict far more of the variation in risk tolerance among individuals than any single SNP. We found that polygenic scores constructed using the results of our GWAS of general risk tolerance explain up to ~1.6% of the variation across individuals in general risk tolerance. While 1.6% is far larger than the amount of variation explained by individual SNPs (less than 0.02%, as noted above), it is small in absolute terms. As we explain in FAQ 3.3, such polygenic scores cannot be used to meaningfully predict a particular person’s risk tolerance.
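
In its simplest form, a polygenic score is a weighted sum: for each SNP, the number of copies of an allele a person carries (0, 1, or 2) is multiplied by the GWAS estimate of that allele's effect, and the products are added up. The sketch below shows only this basic arithmetic. The function name, the five effect estimates, and the two genotypes are hypothetical, and scores used in practice combine far more SNPs and typically adjust the weights for correlations between nearby SNPs.

```python
def polygenic_score(allele_counts, effect_estimates):
    """Weighted sum of allele counts, with GWAS effect estimates as weights."""
    if len(allele_counts) != len(effect_estimates):
        raise ValueError("need exactly one effect estimate per SNP")
    return sum(g * b for g, b in zip(allele_counts, effect_estimates))


# Hypothetical effect estimates for five SNPs.
weights = [0.010, -0.020, 0.015, 0.005, -0.010]

# Hypothetical allele counts for two people at those five SNPs.
person_a = [2, 0, 1, 1, 0]
person_b = [0, 2, 0, 1, 2]

score_a = polygenic_score(person_a, weights)
score_b = polygenic_score(person_b, weights)
print(f"person A: {score_a:+.3f}, person B: {score_b:+.3f}")
```

The scores only have meaning relative to one another within a sample; a single person's score says very little on its own.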


The predictive power of the polygenic scores is so small partly because our estimates of the SNPs’ effect sizes are relatively imprecise. As the available sample sizes for GWAS get larger, estimates of the SNPs’ effect sizes will become more precise, and the scores’ explanatory power will rise; in theory, if environmental conditions remain the same, it should be possible one day to construct a polygenic score whose explanatory power is close to the heritability of risk tolerance. For example, a score constructed using the set of common SNPs we used to estimate the ~5% to 9% SNP heritability of risk tolerance (see FAQ 2.2) may ultimately explain ~5% to 9% of the variation in risk tolerance across individuals.


Although the polygenic scores we constructed have too little explanatory power to usefully predict any individual’s risk tolerance, they have sufficient explanatory power to be useful in social science studies, which focus on average or aggregated behavior in the population (not individual outcomes). Indeed, with 80% statistical power (the conventional threshold for adequate power), the effect of our polygenic scores can be detected in a study with 500 individuals. Therefore, the polygenic scores provided by our study can be useful in social science studies that have at least 500 participants and in which the participants’ genomes have been measured. (Several datasets commonly used in social science research meet these criteria.)
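
The relationship between explanatory power and required sample size can be sketched with a standard textbook approximation. The code below is an illustration under stated assumptions, not the calculation reported in the paper: it treats the score's ~1.6% explained variation as a squared correlation, assumes a two-sided test at the 5% significance level (an assumption; the text above states only the 80% power), and uses the Fisher z approximation. The function name is hypothetical.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist


def sample_size_for_r2(r2, power=0.80, alpha=0.05):
    """Approximate sample size needed to detect a correlation whose square
    is r2, using the Fisher z approximation and a two-sided test."""
    r = sqrt(r2)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)


n_needed = sample_size_for_r2(0.016)
print(f"approximate sample size needed: {n_needed}")
```

Under these assumptions the answer comes out a little below 500, in line with the figure given above. A score with less explanatory power would require a larger sample.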

2.5.  What do your results tell us about human biology and brain development?

To gain insights into the biological mechanisms through which genetic variation influences general risk tolerance, we conducted a suite of bioinformatics analyses. Our bioinformatics analyses point to the involvement of the neurotransmitters glutamate and GABA, which were heretofore not generally believed to play a role in risk tolerance. Glutamate is the most abundant neurotransmitter in the body and plays an excitatory role (i.e., when one neuron secretes it onto another, the second neuron is more likely in turn to transmit its own signal). GABA, by contrast, is the main inhibitory transmitter. To our knowledge, with the exception of a recent study (Lee et al. 2018) prioritizing a much larger number of pathways, no published large-scale GWAS of cognition, personality, or neuropsychiatric phenotypes has pointed to clear roles both for glutamate and GABA. Our results suggest that the balance between excitatory and inhibitory neurotransmission may contribute to variation in general risk tolerance across individuals.


Perhaps unsurprisingly, our bioinformatics analyses point to a role for the brain and the central nervous system in modulating risk tolerance. Specifically, our analyses point to the involvement of some brain regions that have previously been identified in neuroscientific studies on decision-making, including the prefrontal cortex, basal ganglia, and midbrain.

2.6.  How do your results relate to previous research on the genetics of risk tolerance?

As mentioned above in FAQ 1.3, risk tolerance has been one of the most studied phenotypes in social science genetics. However, almost all previous studies have been “candidate-gene studies” conducted in relatively small samples, whose limitations are now appreciated. 


We used the results of our GWAS to revisit this previous research. We reviewed the literature that aimed to link risk tolerance to biological pathways, and identified five main biological pathways that have been previously hypothesized to relate to risk tolerance: the steroid hormone cortisol, the monoamine neurotransmitters dopamine and serotonin, and the steroid sex hormones estrogen and testosterone. We then tested whether these five biological pathways relate to risk tolerance.


To understand how we tested these five biological pathways, it is helpful to first define what a gene is. A “gene” is a sequence of DNA in the genome that codes for a molecule that has a biological function. The human genome has roughly 20,000 to 25,000 genes; although genes comprise only about 1% to 2% of the human genome, they have important biological functions. Genes, like other parts of the genome, can contain SNPs.


Thus, to test the five biological pathways for association with risk tolerance, we first used external databases created by other researchers to identify the genes that are involved, or are likely to be involved, in each of these five pathways. Then, we conducted various bioinformatics analyses that used the results of our GWAS and tested the hypothesis that SNPs located in the genes involved in each of the five pathways tend to be more strongly associated with general risk tolerance than other SNPs. We found no evidence in support of that hypothesis, suggesting that the five pathways are not particularly important contributors to individual variation in risk tolerance.


We also used our GWAS results to examine whether SNPs located within (or highly correlated with) 15 specific genes, which previous candidate-gene studies had tested for association with risk tolerance, are indeed associated with risk tolerance. Our sample was several orders of magnitude larger than the samples used in the previous candidate-gene studies (as mentioned above in FAQ 1.3, these studies were conducted in relatively small samples). Despite this, we found no evidence that these 15 genes are associated with risk tolerance, and failed to replicate the main associations the previous candidate-gene studies had reported. Our results are consistent with other studies that have found that small-sample candidate-gene studies have a poor replication record (Chabris et al. 2012, Hewitt 2012). 


We also note that our discovery GWAS replicated the associations between general risk tolerance and the two SNPs that had previously been found to be associated with general risk tolerance in the two previous studies with large samples (Day et al. 2016, Strawbridge et al. 2018; see FAQ 1.3). This is not surprising, however, since those two studies analyzed data from the UK Biobank, and the UK Biobank is one of the two large datasets we included in our discovery GWAS.


In summary, instead of pointing to the main genetic variants and biological pathways that had previously been hypothesized to relate to risk tolerance, our analyses identified 124 SNPs associated with risk tolerance (see FAQ 2.2), and point to the involvement of the neurotransmitters glutamate and GABA and of several brain regions (see FAQ 2.5).

3. Social and ethical implications of the study

3.1.  Did you find “the gene for” (or “the genes for”) risk tolerance?

No. We did find several genes [As mentioned in FAQ 2.6, a gene is a sequence of DNA in the genome that codes for a molecule that has a biological function; genes, like other parts of the genome, can contain SNPs.] containing SNPs associated with general risk tolerance, but that does not mean that these genes determine general risk tolerance. The genetic factors we identified are involved in a long chain of biological processes that exert an influence on human behavior, and those processes are intricately entwined with the environment. 


In summary, our findings conform with the expectation that variation in risk tolerance across individuals is influenced by at least thousands, if not millions, of genetic variants (Chabris et al. 2015).

3.2.  Does this study show that an individual’s level of risk tolerance is determined and fixed at conception?

No. A large share of the variation in risk tolerance among individuals is determined by environmental factors, and environmental factors may also interact with genetic factors. As mentioned in FAQ 1.3, twin studies have found that part of the variation in risk tolerance across individuals is statistically accounted for by genetic factors. But even if all of the variation in risk tolerance at a certain point in time were accounted for by genetic factors (which is definitely not the case), this would not rule out the possibility of past or future environmental influences on risk tolerance. For instance, even if poor eyesight were perfectly heritable and hence completely determined by genetic factors (it is not), the invention of eye glasses, contact lenses, and laser surgery would all drastically improve a person’s poor genetic outlook for clear vision. On the flip side, environmental trauma (e.g., a poke to the eye) could drastically worsen another individual’s genetic outlook for clear vision. The lesson of eyesight as a phenotype is that heritability of a phenotype—even 100% heritability—does not imply biological determinism: environmental factors can still in principle influence the phenotype. And again, risk tolerance is far from being perfectly heritable.

3.3.  Can you use the results in this paper to meaningfully predict a particular person’s risk tolerance?

No, the results cannot be used to meaningfully predict either a particular person’s general risk tolerance or their likelihood of taking any particular risk or engaging in any particular sort of risky behavior. As mentioned in FAQ 2.4, we used the results of our GWAS of general risk tolerance to construct polygenic scores that can explain up to ~1.6% of the variation across individuals in general risk tolerance. That means that ~98.4% of the variation in general risk tolerance is explained by factors other than the polygenic scores.


As we also mentioned in FAQ 2.4, we expect that future, larger GWAS will allow the construction of polygenic scores with higher predictive power. However, the predictive power of such scores would still pale in comparison to some other scientific predictors. For example, professional weather forecasts correctly predict about 95% of the variation in day-to-day temperatures. Weather forecasters are therefore vastly more accurate forecasters than social science geneticists will ever be.


We also note that, while the polygenic scores we constructed can’t usefully predict any individual’s risk tolerance, they can be useful in social science studies, which focus on aggregated behavior in the population.

 

3.4.  Can environmental factors modify the effects of the specific SNPs you identified?

It is a plausible hypothesis that environmental factors are both moderators and mediators of genetic influences on risk tolerance. For example, it is conceivable that some SNPs have alleles [As mentioned above, an allele is a certain version of a genetic variant.] that tend to make individuals relatively less risk tolerant, but only when the individuals are exposed to certain environments (e.g., when they experience a traumatic episode). (Such environmental factors would be said to “moderate” the influence of those SNPs.) It is also conceivable that some SNPs affect risk tolerance indirectly, by influencing individuals’ preferences for certain environments (e.g., by influencing their preferences for socializing with quiet, cautious friends), which may in turn affect their risk tolerance. (Such environments would be said to “mediate” the influence of those SNPs.)


We did not perform any statistical tests of “gene-environment interactions” in our study. (Gene-environment interactions refer to the moderation of genetic influences by environmental factors.) One promising approach for future studies that seek to identify gene-environment interactions will be to use our GWAS results to construct polygenic scores of general risk tolerance, and then test whether environmental or demographic variables moderate the association between the polygenic scores and an outcome of interest. 


To facilitate such research, we have made the summary results of our GWAS publicly available on the SSGAC’s website (www.thessgac.org); interested researchers who have access to datasets with genotypic data can download these results and use them to construct polygenic scores.

 

3.5.  What policy lessons or practical advice do you draw from this study?

None whatsoever. Any practical response—individual or policy-level—to this or similar research would be extremely premature. In this respect, our study is no different from genome-wide association studies (GWAS) of complex medical outcomes. In medical GWAS research, it is well understood that identifying genetic variants that affect disease risk is merely a first step toward understanding the underlying biology of that disease. It is not sufficient to assess risk for any specific individual. It is not appropriate to base policies and practices on such assessments.

3.6.  Could this kind of research lead to discrimination against, or stigmatization of, people with specific genetic variants? If so, why conduct this research?

Unfortunately, like a great deal of research—including, for instance, research identifying genetic variants associated with increased cancer risk—the results can be misunderstood and could be misapplied, including by being used to discriminate against individuals with specific genetic variants (e.g., in insurance markets). Nevertheless, for a variety of reasons, we do not think that the best response to the possibility that useful knowledge might be misused is to refrain from producing the knowledge.


First, even if we believed that some knowledge (and specifically knowledge about genetic influences on risk-taking behavior) should be forbidden, that goal is unattainable. Behavioral genetics research, including studies of the relationships between genes and a variety of social-science phenotypes, including risk tolerance, is already being conducted by many scientists and other individuals around the world and will continue to be conducted. Not all of this work involves the use of appropriate scientific methods or the transparent communication of results. In this context, researchers who are committed to developing, implementing, and spreading best practices for conducting and communicating potentially controversial research, including behavioral genetics research, arguably have an ethical responsibility to participate in the development and dissemination of this body of knowledge—rather than abstain from it because of its sensitive nature. 


An important theme in our earlier work has been to point out that most existing studies in social-science genetics that report genetic associations with behavioral traits have serious methodological limitations, fail to replicate, and are likely to have false-positive findings (Beauchamp et al. 2011, Benjamin et al. 2012, Chabris et al. 2012, 2015). This same point was made in an editorial in Behavior Genetics (the leading journal for the genetics of behavioral traits), which stated that “it now seems likely that many of the published [behavior genetics] findings of the last decade are wrong or misleading and have not contributed to real advances in knowledge” (Hewitt 2012). Consistent with this, the current study was unable to replicate the results of previous candidate-gene studies of risk tolerance (see FAQ 2.6). One of the most important reasons why earlier work has generated unreliable results is that the sample sizes were far too small, given that the true effects of individual genetic variants on behavioral traits are tiny.

 
Second, one should not assume that behavioral genetics research carries only the potential to increase stigmatization. For instance, behavioral phenotypes such as general risk tolerance are often assumed to be fully and equally within the control of every individual. That view of these behaviors likely contributes to a lack of sympathy for those who exhibit a self-destructive level of risk-taking and, perhaps, suboptimal support for programs that attempt to reduce such behavior. Our purpose here is not to advocate for or against any particular policy for addressing risk behaviors; rather, we mean only to point out that a finding that genes do have some influence can reduce, rather than increase, stigma of those who exhibit risk-tolerant or even risk-seeking behavior. 


Third, behavioral genetics research has the potential to yield other benefits, especially as sample sizes continue to increase. Forgoing this research necessarily entails forgoing these and any other possible benefits, some of which will likely be the result of serendipity rather than being foreseeable. For instance, identifying variants associated with risk tolerance may lead to insights regarding the underlying biological pathways. To take an example from medicine, genetic variants in the LMTK2 (lemur tyrosine kinase 2) gene have small effects on an individual’s predisposition to prostate cancer. Nonetheless, knowing that this gene is involved can point scientists toward studying what the gene does, which may end up teaching us something critical about the pathology of prostate cancer. The effect from modifying a biological pathway, e.g., with a pharmaceutical, is potentially much larger than the effect of the gene itself. Moreover, although we are not quite there yet, when many genetic variants taken together capture ~10% of the variation across individuals in risk tolerance, this amount of predictive power (while still too low to be relevant for individual predictions) will be useful for controlling for genetic factors when studying the effect of a policy or program on an outcome that is also affected by risk tolerance. For example, when studying a policy intervention that aims to reduce the use of illicit substances that present health risks, controlling for as many factors as possible, including genetic factors associated with risk taking, can help generate more precise estimates of the effectiveness of the policy.


In sum, the potential benefits of this research, when conducted responsibly, seem reasonable in relation to the risks, especially considering that this research is already being conducted, sometimes with lesser attention to both scientific rigor and thoughtful science communication. We thus agree with the U.K. Nuffield Council on Bioethics, which concluded in a report (Nuffield Council on Bioethics 2002, p. 114) that “research in behavioural genetics has the potential to advance our understanding of human behaviour and that the research can therefore be justified,” but that “researchers and those who report research have a duty to communicate findings in a responsible manner.” In our view, responsible behavioral genetics research includes sound methodology and analysis of data; a commitment to publish all results, including any negative results; and transparent, complete reporting of methodology and findings in publications, presentations, and communications with the media and the public, including particular vigilance regarding what the results do—and do not—show (hence, this FAQ document).

4. Appendices

Appendix 1:  Quality control measures

There are many potential pitfalls that can lead to spurious results in genome-wide association studies (GWAS). We took many precautions to guard against these pitfalls.


One potential source of spurious results is incomplete “quality control (QC)” of the genetic data. To avoid this problem, we used state-of-the-art QC protocols from medical genetics research (Winkler et al. 2014). We supplemented these protocols with a more recent protocol from Okbay et al. (2016a), as well as by developing and applying additional, more stringent QC filters.


Another potential source of spurious results is a confound known as “population stratification” (e.g., Hamer & Sirota 2000). To illustrate, suppose we were conducting a GWAS of height. People from Northern Europe are on average taller than people from Southern Europe, and there are also small differences in how often certain genetic variants occur in Northern and Southern Europe. If we combine samples of Northern and Southern Europeans and perform a GWAS that ignores the regions the individuals come from, then we would find genetic associations for these variants. However, those associations would simply reflect the fact that the variants are correlated with a population (Northern or Southern Europe) and may actually have nothing to do with height.
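
The height example can be reproduced in a small simulation. This is an illustrative sketch with made-up numbers, not an analysis from the study: the allele frequencies (0.6 and 0.4), the one-unit difference in average height, the sample sizes, and all function names are assumptions. By construction, the variant has no effect on height within either population.

```python
import random


def slope(x, y):
    """Slope from a simple linear regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    return cov / var


def simulate_population(n, allele_freq, mean_height, rng):
    """Allele counts (0, 1, or 2) and heights drawn independently of each other."""
    genotypes = [(rng.random() < allele_freq) + (rng.random() < allele_freq)
                 for _ in range(n)]
    heights = [rng.gauss(mean_height, 1.0) for _ in range(n)]
    return genotypes, heights


rng = random.Random(0)  # fixed seed so the run is reproducible
g_north, h_north = simulate_population(10_000, 0.6, 1.0, rng)
g_south, h_south = simulate_population(10_000, 0.4, 0.0, rng)

within_north = slope(g_north, h_north)
within_south = slope(g_south, h_south)
pooled = slope(g_north + g_south, h_north + h_south)

print(f"within North: {within_north:+.3f}")
print(f"within South: {within_south:+.3f}")
print(f"pooled, ignoring region: {pooled:+.3f}")
```

Within each population the estimated effect is close to zero, while the pooled analysis that ignores region shows a clearly positive association that reflects only the population difference.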


In our study we were extremely careful to avoid population stratification as much as possible. At the outset, we restricted the study to individuals of European ancestries, since population stratification problems are more severe when including individuals of different ancestries in the same sample. As is standard in GWAS of medical outcomes, we controlled for “principal components” of the genetic data in the analysis; these principal components capture the small genetic differences across populations, so controlling for them largely removes the spurious associations arising solely from these small differences. 


After taking these steps to minimize population stratification, we conducted several analyses to assess how much population stratification still remained in our data. First, we analyzed data on 17,684 sibling pairs from the Swedish Twin Registry and the UK Biobank. The key idea underlying our test was to examine whether differences in genetic variants across siblings are associated with differences in the siblings’ risk tolerance. If so, then these associations cannot be the result of population stratification. The reason is that full siblings (from the same two biological parents) share their ancestry entirely, and therefore differences in their genetic variants cannot be due to being from different population groups. Unfortunately, because our sample of siblings is much smaller than our discovery GWAS sample (939,908 individuals), our estimates of the effects of the genetic variants within the sibling pairs are much less precise than those in the GWAS. However, we can test whether the GWAS results are entirely due to population stratification, because if they were, then the sibling estimates would not line up with the GWAS estimates at all. In fact, we found that the within-family estimates are more similar to the GWAS estimates in both sign and magnitude than would be expected by chance. These results imply that our GWAS results are not entirely due to population stratification. A second analysis, known as an “LD score regression intercept” analysis (Bulik-Sullivan et al. 2015), indicated that there is some, but not much, population stratification in our GWAS results.
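As a rough illustration of the sign comparison described above, the sketch below counts how often sibling-based and GWAS estimates point in the same direction. It then asks how surprising that count would be if the signs agreed only by chance (probability one half per variant). The effect estimates are hypothetical numbers chosen for illustration, and the simple one-sided sign test is an assumed simplification, not necessarily the test used in the study.

```python
from math import comb

def count_same_sign(gwas_estimates, sibling_estimates):
    """Count variants whose two estimates have the same sign."""
    return sum(1 for g, s in zip(gwas_estimates, sibling_estimates) if g * s > 0)

def chance_probability(n_same, n_total):
    """One-sided binomial probability of at least n_same matches out of
    n_total when each match has probability 1/2."""
    return sum(comb(n_total, k) for k in range(n_same, n_total + 1)) / 2 ** n_total

# Hypothetical effect estimates for ten variants (not real data).
gwas = [0.021, -0.015, 0.018, 0.012, -0.020, 0.016, -0.011, 0.014, 0.019, -0.013]
siblings = [0.030, -0.002, 0.011, -0.008, -0.025, 0.004, -0.017, 0.009, 0.022, -0.006]

n_same = count_same_sign(gwas, siblings)
p_value = chance_probability(n_same, len(gwas))
print(f"{n_same} of {len(gwas)} signs agree; chance probability = {p_value:.4f}")
```

If the GWAS estimates reflected nothing but population stratification, the two sets of signs would be expected to agree only about half the time.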

Appendix 2:  Additional reading and references

  1. Barban N, Jansen R, de Vlaming R, Vaez A, Mandemakers JJ, et al. 2016. Genome-wide analysis identifies 12 loci influencing human reproductive behavior. Nat. Genet. 48(12):1462–72

  2. Beauchamp JP, Cesarini D, Johannesson M. 2017. The psychometric and empirical properties of measures of risk preferences. J. Risk Uncertain. 54(3):203–37

  3. Beauchamp JP, Cesarini D, Johannesson M, van der Loos MJHM, Koellinger PD, et al. 2011. Molecular genetics and economics. J. Econ. Perspect. 25(4):57–82

  4. Benjamin DJ, Cesarini D, Chabris CF, Glaeser EL, Laibson DI, et al. 2012. The promises and pitfalls of genoeconomics. Annu. Rev. Econom. 4(1):627–62

  5. Bulik-Sullivan BK, Loh P-R, Finucane HK, Ripke S, Yang J, et al. 2015. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47(3):291–95

  6. Cesarini D, Dawes CT, Johannesson M, Lichtenstein P, Wallace B. 2009. Genetic variation in preferences for giving and risk taking. Q. J. Econ. 124(2):809–42

  7. Chabris CF, Hebert BM, Benjamin DJ, Beauchamp JP, Cesarini D, et al. 2012. Most reported genetic associations with general intelligence are probably false positives. Psychol. Sci. 23(11):1314–23

  8. Chabris CF, Lee JJ, Cesarini D, Benjamin DJ, Laibson DI. 2015. The fourth law of behavior genetics. Curr. Dir. Psychol. Sci. 24(4):304–12

  9. Day FR, Helgason H, Chasman DI, Rose LM, Loh P-R, et al. 2016. Physical and neurobehavioral determinants of reproductive onset and success. Nat. Genet. 48(6):617–23

  10. Dohmen T, Falk A, Huffman D, Sunde U, Schupp J, Wagner GG. 2011. Individual risk attitudes: Measurement, determinants, and behavioral consequences. J. Eur. Econ. Assoc. 9(3):522–50

  11. Falk A, Becker A, Dohmen T, Huffman D, Sunde U. 2015. The nature and predictive power of preferences: Global evidence. IZA Discussion Papers.

  12. Hamer DH, Sirota L. 2000. Beware the chopsticks gene. Mol. Psychiatry. 5(1):11–13

  13. Harden KP, Kretsch N, Mann FD, Herzhoff K, Tackett JL, et al. 2017. Beyond dual systems: A genetically-informed, latent factor model of behavioral and self-report measures related to adolescent risk-taking. Dev. Cogn. Neurosci. 25:221–34

  14. Hewitt JK. 2012. Editorial policy on candidate gene association and candidate gene-by-environment interaction studies of complex traits. Behav. Genet. 42(1):1–2

  15. Lambert J-C, Ibrahim-Verbaas CA, Harold D, Naj AC, Sims R, et al. 2013. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease. Nat. Genet. 45(12):1452–58

  16. Lee J, Wedow R, Okbay A, Kong E, Maghzian O, et al. 2018. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat. Genet. 50:1112–21

  17. Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, et al. 2015. Genetic studies of body mass index yield new insights for obesity biology. Nature. 518(7538):197–206

  18. Nuffield Council on Bioethics. 2002. Genetics and human behaviour: the ethical context. Nuffield Council on Bioethics [http://nuffieldbioethics.org/wp-content/uploads/2014/07/Genetics-and-human-behaviour.pdf], London

  19. Okbay A, Baselmans BML, De Neve J-E, Turley P, Nivard MG, et al. 2016a. Genetic variants associated with subjective well-being, depressive symptoms, and neuroticism identified through genome-wide analyses. Nat. Genet. 48(6):624–33

  20. Okbay A, Beauchamp JP, Fontana MA, Lee JJ, Pers TH, et al. 2016b. Genome-wide association study identifies 74 loci associated with educational attainment. Nature. 533:539–42

  21. Rietveld CA, Cesarini D, Benjamin DJ, Koellinger PD, De Neve J-E, et al. 2013a. Molecular genetics and subjective well-being. Proc. Natl. Acad. Sci. 110(24):9692–97

  22. Rietveld CA, Conley DC, Eriksson N, Esko T, Medland SE, et al. 2014a. Replicability and robustness of GWAS for behavioral traits. Psychol. Sci. 25(11):1975–86

  23. Rietveld CA, Esko TT, Davies G, Pers TH, Turley PA, et al. 2014b. Common genetic variants associated with cognitive performance identified using the proxy-phenotype method. Proc. Natl. Acad. Sci. U. S. A. 111(38):13790–94

  24. Rietveld CA, Medland SE, Derringer J, Yang J, Esko T, et al. 2013b. GWAS of 126,559 individuals identifies genetic variants associated with educational attainment. Science. 340(6139):1467–71

  25. Ripke S, Neale BM, Corvin A, Walters JTR, Farh K-H, et al. 2014. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 511(7510):421–27

  26. Strawbridge RJ, Ward J, Cullen B, Tunbridge EM, Hartz S, et al. 2018. Genome-wide analysis of self-reported risk-taking behaviour and cross-disorder genetic correlations in the UK Biobank cohort. Transl. Psychiatry. 8(1):1–11

  27. Turley P, Walters RK, Maghzian O, Okbay A, Lee JJ, et al. 2018. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet. 50(2):229–37

  28. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, et al. 2017a. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 101(1):5–22

  29. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, et al. 2017b. 10 years of GWAS discovery: Biology, function, and translation. Am. J. Hum. Genet. 101(1):5–22

  30. Winkler TW, Day FR, Croteau-Chonka DC, Wood AR, Locke AE, et al. 2014. Quality control and conduct of genome-wide association meta-analyses. Nat. Protoc. 9(5):1192–1212

  31. Wood AR, Esko T, Yang J, Vedantam S, Pers TH, et al. 2014. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet. 46(11):1173–86


FAQs about “Gene discovery and polygenic prediction from a 1.1-million-person GWAS of educational attainment”

Use the quick link menu to jump to a specific question, or scroll down to read all FAQs for this publication. 

 

This document provides information about the study:

 

Lee et al. (2018). “Gene discovery and polygenic prediction from a 1.1-million-person GWAS of educational attainment.” Nature Genetics.

The document was prepared by Daniel J. Benjamin, David Cesarini, Christopher F. Chabris, Philipp D. Koellinger, David Laibson, Michelle N. Meyer, Aysu Okbay, Patrick Turley, Peter M. Visscher, and Meghan Zacher. It draws from and builds on the FAQs for earlier SSGAC papers. It has the following sections:

          1. Background

          2. Study design and results

          3. Social and ethical implications of the study

          4. Appendices

 

For clarifications or additional questions, please contact Daniel Benjamin (daniel.benjamin@gmail.com).

 

Quick Links

1.1.  Who conducted this study? What are the group’s overarching goals?

1.2.   The current study focuses on an outcome called “educational attainment.” What is educational attainment?

1.3.  What is a GWAS? Are the genetic variants identified in a GWAS “causal”?

1.4.  In what sense do the genetic variants identified in a GWAS “predict” the outcome of interest? What do you mean by “effect size”?

1.5.  What is a polygenic score?

1.6.  Why conduct a GWAS of educational attainment?

1.7.  What was already known about genetic associations with educational attainment prior to this study?

2.1.  What did you do in this paper? How was the study designed? Why was the study designed in this way?

2.2.  What did you find in the GWAS of educational attainment?

2.3.  How predictive is the polygenic score developed in this study?

2.4.  What did you find in the analysis of siblings?

2.5.  What did you find in the analysis of environmental heterogeneity?

2.6.  What did you find in the analysis of the X chromosome?

2.7.  What did you find in the analysis of cognitive performance and math abilities?

2.8.  Are the genetic variants associated with higher educational attainment in your study also associated with other outcomes?

2.9.  What do your results tell us about human biology and brain development?

3.1.  Did you find “the gene for” educational attainment?

3.2.  Well, then, did you find “the genes for” educational attainment?

3.3.  Does this study show that an individual’s level of educational attainment is determined, or fixed, at conception?

3.4.  Can the polygenic score from this paper be used to accurately predict a particular person’s educational attainment?

3.5.  Can your polygenic score be used for research studies in non-European-ancestry populations?

3.6.  What policy lessons do you draw from this study?

3.7.  Could this kind of research lead to discrimination against, or stigmatization of, people with the relevant genetic variants? If so, why conduct this research?

Appendix 1:  Quality control measures

Appendix 2:  Additional reading and references

1. Background

1.1.  Who conducted this study? What are the group’s overarching goals?

 

The authors of the study are members of the Social Science Genetic Association Consortium (SSGAC). The SSGAC is a multi-institutional, international research group that aims to identify statistically robust links between genetic variants and social-science-relevant traits. These include traits such as behavior, preferences, and personality that are traditionally studied by social and behavioral scientists (e.g., economists, psychologists, sociologists) but are often also of interest to health and other researchers.

The SSGAC was formed in 2011 to overcome a specific set of scientific challenges. Most traits and behaviors are associated with thousands of genetic variants. Although their collective effect can be substantial (see FAQs 1.5 & 2.3), we now know that almost every one of these genetic variants has an extremely weak effect on its own. To identify specific variants with such small effects, scientists must study at least hundreds of thousands of people (to separate weak signals from noise). One promising strategy for doing this is for many investigators to pool their data into one large study. This approach has borne considerable fruit when used by medical geneticists interested in a range of diseases and conditions (Visscher et al. 2017). Most of these advances would not have been possible without large research collaborations between multiple research groups interested in similar questions. The SSGAC was formed in an attempt by social scientists to adopt this research model.

The SSGAC is organized as a working group of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), a successful medical consortium. It was founded by three social scientists—Daniel Benjamin (University of Southern California), David Cesarini (New York University), and Philipp Koellinger (Vrije Universiteit Amsterdam)—who believe that studying genetic variants associated with social scientific outcomes can have substantial positive impacts across many research fields. This includes research that aims to better understand the effects of the environment (e.g., research on policy interventions, including the effects of different school environments) and interactions between genetic and environmental effects. The potential benefits also span a diverse set of research questions in the biomedical sciences, such as why and how educational attainment is linked to longevity and better overall health outcomes.

To conduct such research, the SSGAC implements genome-wide association studies (GWAS, see FAQ 1.3) of social-scientific outcomes. For example, to conduct a GWAS of educational attainment, every participating cohort uploads the (within-cohort) statistical association between educational attainment and a single-nucleotide polymorphism (SNP) in the genomes of the individuals in the cohort. A SNP is a base pair of the genome where there is common variation in the human population (see FAQ 1.3). This statistical analysis is repeated for each SNP in the genome. The cohort-level results do not contain individual-level data, only summary statistics about these within-cohort statistical associations. The SSGAC then combines these cohort results to produce the overall GWAS results. By using existing datasets and combining cohort-level results, we can study the genetics of ~1.1 million people at very low cost. The SSGAC publicly shares the overall, aggregated results at www.thessgac.org/data so that other scientists can build on this work. These publicly available data have already catalyzed many research projects and analyses across the social and biomedical sciences (see FAQ 1.6 for examples).
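The pooling step can be pictured with a small sketch. This FAQ does not spell out the combination formula. The code below assumes fixed-effect inverse-variance weighting, a common way of combining per-cohort estimates, and uses hypothetical cohort names and numbers.

```python
from math import sqrt

def combine_cohorts(cohort_results):
    """Pool per-cohort (estimate, standard_error) pairs for one SNP.

    Assumes fixed-effect inverse-variance weighting: more precise
    cohorts (smaller standard errors) receive larger weights.
    """
    weights = [1.0 / se ** 2 for _, se in cohort_results]
    total_weight = sum(weights)
    pooled_estimate = sum(
        w * est for w, (est, _) in zip(weights, cohort_results)
    ) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    return pooled_estimate, pooled_se

# Hypothetical summary statistics for a single SNP from two cohorts:
# (estimated association, standard error). No individual-level data is needed.
summary_stats = {"cohort_a": (0.02, 0.01), "cohort_b": (0.04, 0.02)}

pooled_estimate, pooled_se = combine_cohorts(list(summary_stats.values()))
print(f"pooled estimate = {pooled_estimate:.4f}, standard error = {pooled_se:.4f}")
```

The pooled standard error is smaller than that of either cohort alone, which is why combining many cohorts makes it possible to detect very weak associations.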

The Advisory Board for the SSGAC is composed of prominent researchers representing various disciplines: Dalton Conley (Sociology, Princeton University), George Davey Smith (Epidemiology, University of Bristol), Tõnu Esko (Molecular Biology and Human Genetics, University of Tartu and Estonian Genome Center), Albert Hofman (Epidemiology, Harvard University), Robert Krueger (Psychology, University of Minnesota), David Laibson (Economics, Harvard University), James Lee (Psychology, University of Minnesota), Sarah Medland (Genetic Epidemiology, QIMR Berghofer Medical Research Institute), Michelle Meyer (Bioethics and Law, Geisinger Health System), and Peter Visscher (Statistical Genetics, University of Queensland).

The SSGAC is committed to the principles of reproducibility and transparency. Prior to conducting genetic association studies, power calculations are carried out to determine the necessary sample size for the analysis (assuming realistically small effect sizes associated with individual genetic variants). Whenever possible, we pre-register our analyses at OSF (formerly Open Science Framework). Major SSGAC publications are usually accompanied by a FAQ document (such as this one). The FAQ document is written to communicate what was found less tersely and technically than in the paper, as well as what can and cannot be concluded from the research findings more broadly. FAQ documents produced for SSGAC publications are available at https://www.thessgac.org/faqs.

In addition to educational attainment, SSGAC-affiliated papers have studied subjective well-being, reproductive behavior, and risk tolerance. The SSGAC website contains an up-to-date list of our major publications, which have been published in journals such as Science, Nature, Nature Genetics, Proceedings of the National Academy of Sciences, Psychological Science, and Molecular Psychiatry.

1.2.  The current study focuses on an outcome called “educational attainment.” What is educational attainment?

Educational attainment is the amount of formal education a person completes (measured as the number of years of education completed for people in our sample, all of whom are at least 30 years old). Although educational attainment is most strongly influenced by social and other environmental factors (see FAQ 1.7), it is also influenced by thousands of genes. People vary considerably in how much education they complete. Education is recognized throughout the social and biomedical sciences as an important “predictor” (see FAQ 1.4) of many other life outcomes, such as income, occupation, health, and longevity (Ross & Wu 1995; Cutler & Lleras-Muney 2008). Educational attainment is also among the relatively few social-scientific traits for which it is feasible to conduct a large-scale genome-wide study, because educational attainment is frequently measured by a variety of cohorts, including medical cohorts, due to its robust association with health. A large-scale study is necessary (but not sufficient) to generate scientific findings that are reproducible.

1.3.  What is a GWAS? Are the genetic variants identified in a GWAS “causal”?

In a genome-wide association study (GWAS), scientists look at genetic variants measured across the entire human genome to see whether any of them are, on average, associated with higher or lower levels of some outcome. Commonly, and in our studies, such analyses focus on the most common genetic variants, so-called single-nucleotide polymorphisms (SNPs). SNPs are sites in the genome where single DNA base pairs commonly differ across individuals. SNPs usually have two different possible base pairs, or alleles. Although there are tens of millions of sites where SNPs are located in the human genome, GWASs typically investigate only SNPs that can be measured (or imputed) with a high level of accuracy. These days, such procedures usually yield millions of SNPs that together capture most common genetic variation across people.
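To make the idea of testing one SNP at a time concrete, here is a minimal sketch. It runs a simple regression of the outcome on the allele count (0, 1, or 2) separately for each SNP. The six individuals, the three SNP names, and all values are hypothetical, and the sketch omits the control variables (such as principal components) that a real GWAS would include.

```python
from math import sqrt

def association(allele_counts, outcome):
    """Regress the outcome on allele count for one SNP.
    Returns the estimated slope and its standard error."""
    n = len(outcome)
    mean_x = sum(allele_counts) / n
    mean_y = sum(outcome) / n
    sxx = sum((x - mean_x) ** 2 for x in allele_counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(allele_counts, outcome))
    syy = sum((y - mean_y) ** 2 for y in outcome)
    slope = sxy / sxx
    residual_variance = (syy - slope * sxy) / (n - 2)
    return slope, sqrt(residual_variance / sxx)

# Hypothetical data: years of education for six people and their
# allele counts at three SNPs (not real data).
years_of_education = [12, 13, 16, 11, 15, 17]
snps = {
    "snp_1": [0, 1, 2, 0, 1, 2],
    "snp_2": [0, 0, 1, 1, 2, 2],
    "snp_3": [2, 0, 1, 1, 0, 2],
}

# The same analysis is repeated for every SNP.
results = {name: association(counts, years_of_education) for name, counts in snps.items()}
for name, (slope, se) in results.items():
    print(f"{name}: estimate = {slope:.2f}, standard error = {se:.2f}")
```

With only six people the estimates are far too noisy to be meaningful, which is the reason real studies need very large samples.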

GWAS has been a successful research strategy for identifying genetic variants associated with many traits and diseases, including body height (Wood et al. 2014), BMI (Locke et al. 2015), Alzheimer’s disease (Lambert et al. 2013), and schizophrenia (Ripke et al. 2014). It has also recently been used to identify genetic variants associated with a variety of health-relevant social science outcomes, such as the number of children a person has (Barban et al. 2016), happiness (Okbay, Baselmans, et al. 2016; Turley et al. 2018), and educational attainment (Rietveld et al. 2013; Okbay, Beauchamp, et al. 2016).

GWAS identifies genetic variants that are associated with the outcome, but an observed association with a specific variant need not imply that the variant causes the outcome, for a variety of reasons. First, genetic variants are often highly correlated with other, nearby variants on the same chromosome. As a result, when one or more variants in a region causally influence an outcome (in that particular environment), many non-causal variants in that region may also be identified as associated with the outcome. When GWAS results are analyzed, researchers will often tend to emphasize results for the genetic variant in a region that showed the strongest evidence of association. This variant need not be the causal variant. In fact, the causal genetic variant may not have even been measured directly. For example, GWAS that focus on common SNPs would not be able to identify rare or structural genetic variants (e.g., deletions or insertions of an entire genetic region) that are causal, but they may identify SNPs that are correlated with these unobserved variants.

Second, the frequencies of many genetic variants vary systematically across environments. If those environmental factors are not accounted for in the association analyses, some of the associations found may be spurious. To use a well-known example (Lander & Schork 1994), any genetic variants common in people of Asian ancestries will be associated statistically with chopstick use, but these variants would not cause chopstick use; rather, these genetic variants and the outcome of chopstick use are both distributed unevenly among people with different ancestries. This is the problem of “population stratification” discussed in Appendix 1. GWAS researchers have a number of strategies for addressing the challenges posed by population stratification (see FAQs 2.4 & 3.5 and Appendix 1).

Even in studies such as ours that attempt to address and correct for heterogeneity in genetic ancestry, allele frequencies may nonetheless vary systematically with environmental factors. For example, a genetic variant that is associated with improved educational outcomes in the parental generation may have downstream effects on parental income and other factors known to influence children’s educational outcomes (such as neighborhood characteristics). This same genetic variant is likely to be inherited by the children of these parents, creating a correlation between the presence of the genetic variant in a child’s genome and the extent to which the child was reared in an environment with specific characteristics. A recent study of Icelandic families showed that the parental allele that is not passed on to the parent’s offspring is still associated with the child’s educational attainment, suggesting that GWAS results for educational attainment partly represent these intergenerational pathways (Kong et al. 2018). Our sibling analyses yield results that are consistent with this conclusion (see FAQ 2.4).

Third, variants’ effects on an outcome may be indirect, so a variant that may be “causal” in one environment may have a diminished effect or no effect at all in other environments. For example, the nicotinic acetylcholine receptor gene cluster on chromosome 15 is associated with lung cancer (Thorgeirsson et al. 2008; Amos et al. 2008; Hung et al. 2008). From this observation alone we cannot conclude that these genetic variants cause lung cancer through some direct biological mechanism. In fact, it is likely that these genetic variants increase lung cancer risk through their effects on smoking behavior. In a tobacco-free environment, it is plausible that many of the associations would be substantially weaker and perhaps disappear altogether. Thus, even if we have credible evidence that a specific association is not spurious, it is entirely possible that the genetic variant in question influences the outcome through channels that we, in common parlance, would label environmental (e.g., smoking). Nearly forty years ago, the sociologist Christopher Jencks criticized the widespread tendency to mistakenly treat environmental and genetic sources of variation as mutually exclusive (see also Turkheimer 2000). As the example of smoking illustrates, it is often overly simplistic to assume that “genetic explanations of behavior are likely to be exclusively physical explanations while environmental explanations are likely to be social” (Jencks 1980, p.723).

In general, GWAS is just one step in a longer, often complex process of identifying causal pathways, but the results of a large-scale GWAS are a useful tool for that purpose and often lead to novel and important insights (Visscher et al. 2017). In other words, GWAS results provide important signals as to where scientists should invest future in-depth research to understand why the association exists.

1.4.  In what sense do the genetic variants identified in a GWAS “predict” the outcome of interest? What do you mean by “effect size”?

When we and other scientists say that genetic variants (and other variables, such as demographics) “predict” certain outcomes, our use of the word differs in several important ways from how “predict” is used in standard language (e.g., outside of social science research papers). First, we do not mean that the presence of a genetic variant guarantees an outcome with 100% probability, or even with a high degree of likelihood. Rather, we mean that the variant is, on average across people, statistically associated with an outcome. In other words, on average, people with the genetic variant have a higher likelihood of the outcome compared to people without the genetic variant. A genetic variant is said to be statistically “predictive” of an outcome even if the presence of the genetic variant only very weakly increases the likelihood of that outcome—as is the case, for instance, with every SNP that we identify that is associated with educational attainment.

Second, in standard language, “prediction” usually refers to the future. In contrast, when scientists say that genetic variants “predict” an outcome, they mean that they expect to see the association in new data. “New data” means data that have not been analyzed yet, regardless of whether those data will be collected in the future or have already been collected.

Finally, in standard language, a “prediction” is often an unconditional guess about what will happen. Instead of meaning it unconditionally, scientists mean that they expect to see an association in new data under certain conditions, for example, that the environment for the new data is the same as the environment in which the variants were found in the previously studied data to be associated with the outcome. In the example given in FAQ 1.3, in which a genetic variant is associated with lung cancer due to its effect on smoking, we would not expect the genetic variant to be as strongly predictive of lung cancer in an environment where cigarettes are absent.

We use the term “effect size” as a concise way to refer to the magnitude of the predicted difference in the outcome resulting from having one allele of a genetic variant as opposed to the other possible allele (for example, see FAQ 2.2). The use of the word “effect” is not intended to imply that we believe it is generally appropriate to use the strength of the association between a variant and educational attainment as a measure of the variant’s causal effect on educational attainment (see FAQ 1.3).

1.5.  What is a polygenic score?

The results of a GWAS can be used to create a “polygenic score,” an index composed of many genetic variants from across the genome. Because a polygenic score aggregates the information from many genetic variants, it can “predict” (see FAQ 1.4) far more of the variation among individuals for the GWAS outcome than any single genetic variant. Often, the polygenic scores with the most predictive power are those created using all the (millions of) genetic variants studied in a GWAS. The larger the GWAS sample size, the greater the predictive power (in other, independent samples) of a polygenic score constructed from the GWAS results. More precisely, the GWAS results are used to create a formula for how to construct a polygenic score. Using this formula, a polygenic score can then be constructed for any individual with genome-wide data. Indeed, some of the value of a GWAS is that the polygenic score it produces can be used in subsequent studies conducted in other samples.
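One simple way to picture the “formula” is as a weighted sum: each variant's allele count is multiplied by a weight taken from the GWAS results, and the products are added up. The sketch below assumes this basic form. The SNP names, weights, and genotypes are hypothetical, and real scores typically use millions of variants and adjusted weights.

```python
def polygenic_score(weights, allele_counts):
    """Weighted sum of allele counts. Variants missing from a person's
    genotype data are skipped."""
    return sum(
        weight * allele_counts[snp]
        for snp, weight in weights.items()
        if snp in allele_counts
    )

# Hypothetical per-variant weights derived from GWAS results.
weights = {"snp_1": 0.020, "snp_2": -0.010, "snp_3": 0.015}

# Hypothetical allele counts (0, 1, or 2) for two individuals.
person_a = {"snp_1": 2, "snp_2": 1, "snp_3": 0}
person_b = {"snp_1": 0, "snp_2": 2, "snp_3": 1}

score_a = polygenic_score(weights, person_a)
score_b = polygenic_score(weights, person_b)
print(f"person A: {score_a:.3f}, person B: {score_b:.3f}")
```

Because the weights are fixed once the GWAS is done, the same function can be applied to any individual with genome-wide data, including people in other samples.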

1.6.  Why conduct a GWAS of educational attainment?

We are motivated to conduct this research because we believe it can be fruitful for the social sciences and health research. In addition to the specific findings of our paper, which are discussed in Section 2 of these FAQs, the results of a GWAS of educational attainment also provide inputs for other research. For example, results from our earlier GWAS of educational attainment (Rietveld et al. 2013; Okbay, Beauchamp, et al. 2016) conducted in much smaller sample sizes (see also FAQ 1.7) have been used to:

  • examine the genetic overlap between educational attainment and ADHD, schizophrenia, Alzheimer’s disease, intellectual disability, cognitive decline in the elderly, brain morphology, and longevity (Pickrell et al. 2016; Warrier et al. 2016; Anderson et al. 2017; Marioni et al. 2016);

  • help us better identify possible genetic subtypes of schizophrenia (Bansal et al. 2017);

  • explore why educational attainment appears to be protective against coronary artery disease (Tillmann et al. 2017) and obesity (van Kippersluis & Rietveld 2017);

  • control for genetic influences in order to generate more credible estimates of how changes in school policy influence health outcomes (Davies et al. 2018);

  • study why specific genetic variants predict educational attainment. For example, it appears that some genetic effects on educational attainment operate through associations with cognitive performance and traits such as self-control (Belsky et al. 2016), which in turn affect educational attainment;

  • study how the effects of genes on education differ across environmental contexts (Schmitz & Conley 2017; Barcellos et al. 2018); and

  • develop new statistical tools that may advance our understanding of how parenting and other features of a child’s rearing environment influence his or her developmental outcomes (Kong et al. 2018; Koellinger & Harden 2018).

These are just some examples of follow-up studies that previous GWASs of educational attainment have already enabled. By making the results of our analyses publicly available at https://www.thessgac.org/data, we hope to facilitate further valuable work by other researchers.

1.7.  What was already known about genetic associations with educational attainment prior to this study?

Educational attainment is strongly influenced by social and other environmental factors. For example, holding all other influences equal, those who live in communities where education (at least beyond a certain level) is relatively expensive are less likely to obtain a high level of educational attainment. Even when education is free or heavily subsidized, full-time education constitutes an opportunity cost that not everyone is equally able to bear: some individuals, due to a variety of family or economic circumstances, will face more pressure than others to leave school and enter the labor force. More generally, educational outcomes are strongly influenced by environmental factors such as social norms, early-life educational experiences, and economic opportunity.

A variety of findings—from twin, family, and GWAS studies—suggest that in affluent countries, genetic factors account for some of the differences across people in their educational attainment (Branigan et al. 2013; Heath et al. 1985; Silventoinen et al. 2004). Studies have found repeatedly that identical twins raised in the same home are substantially more similar to each other in their educational attainment than fraternal twins (or other full siblings) reared together. Full siblings reared together are, in turn, more similar than half siblings reared together who, in turn, are more similar than genetically unrelated siblings (e.g., siblings who are conventionally unrelated, typically because at least one of them is adopted) reared together (Cesarini & Visscher 2017; Sacerdote 2011; Sacerdote 2007). The studies have also provided strong evidence that so-called common environment (the environmental factors shared by siblings raised in the same household) can have long-lasting effects on educational outcomes. In Sweden, the educational outcomes of adopted (i.e., genetically unrelated) brothers reared in the same households are about as similar as the educational outcomes of full siblings reared in separate homes (Cesarini & Visscher 2017). A study of Korean-American adoptees finds that adoptees assigned to households where both parents had college degrees were 16 percentage points more likely to attend college than children assigned to families in which neither parent completed college (Sacerdote 2007).

Research (like the current study) using molecular genetic data (data that measures each person’s DNA and can be used to identify differences between people at the molecular level) has similarly found that common SNPs jointly predict up to 20% of the variation across individuals in educational attainment (Rietveld et al. 2013). This predictive power may derive from many different types of mechanisms. For example, genetic variation may affect neural functions such as memory. Genetic variation may improve sleep quality (making it easier to subsequently stay awake in boring lectures). Genetic variation can affect personality traits, such as the willingness to listen politely to and follow the instructions of teachers (who aren’t always right but nevertheless dictate grades and other outcomes). There may also be even more convoluted pathways. For example, genetic variation can affect one’s sociability, which might draw someone into or drive someone out of the particular social environments that exist in higher education.

In prior GWAS studies, researchers have observed that some genetic variants are associated with educational attainment. In the SSGAC’s first major publication (Rietveld et al. 2013), we conducted a GWAS in a sample of roughly 100,000 people and identified three genetic variants that were statistically associated with educational attainment. In 2016, the SSGAC conducted another GWAS of educational attainment, this time in a sample of around 300,000 people (Okbay, Beauchamp, et al. 2016). We found that 74 genetic variants were associated with educational attainment. These included the three genetic variants identified in our earlier study (Rietveld et al. 2013). Both of these studies involved, at the time they were conducted, the largest sample sizes ever studied for genetic associations with a social science outcome.

There were three key takeaways from the SSGAC’s prior work:

     1. A GWAS approach can identify specific genetic variants statistically associated with behavioral variables if the study is conducted in large enough samples (at least 100,000 people).

     2. Genetic variants that are associated with a behavioral variable such as educational attainment are each likely to have less predictive power (i.e., a smaller effect size) than are genetic variants that are associated with a biomedical or other physical outcome (Chabris et al. 2015). For example, of the hundreds of genetic variants found to be associated with height to date (Wood et al. 2014), the genetic variant with the strongest association predicts 0.4% of the variation across individuals in height, whereas the genetic variant with the strongest association with educational attainment identified to date predicts less than one tenth as much (<0.04%) of the variation in educational attainment (Okbay, Beauchamp, et al. 2016). (The genetic variants that have not yet been identified will very likely explain less variance than those that are currently known, since statistical power is greatest for those that explain the most variance.)

     3. In the samples studied, at least 20% of the variation in educational attainment is predicted by genetic variation (Rietveld et al. 2013), implying that the genetic associations with educational attainment result from the cumulative effects of at least thousands (probably millions) of different genetic variants, not just a few.

These findings from twin, family, and GWAS studies imply that individuals who carry an allele associated with greater educational attainment will on average complete slightly more formal education than other (similarly environmentally situated) individuals who carry a different allele of the same genetic variant. Put in population terms, these findings imply that people with particular alleles will tend on average to complete more formal education, while people who carry other alleles will tend on average to complete less formal education. It is important to emphasize that these associations represent average tendencies in a population. Many individuals with high polygenic scores for educational attainment will not get a college degree, and many individuals with low polygenic scores will. This makes polygenic scores for educational attainment poor predictors of individual outcomes (see FAQ 3.4), but increasingly useful tools in social science research (see FAQ 2.3).

2. Study Design and Results

2.1.  What did you do in this paper? How was the study designed? Why was the study designed in this way?

We conducted a GWAS (see FAQ 1.3) of educational attainment (see FAQ 1.2) in a sample of over 1.1 million people. This sample is much larger than those used in previous GWAS of educational attainment (see FAQ 1.7). With over 1.1 million participants, we expected to estimate genetic effects with much greater accuracy than previous, smaller studies and, thus, to learn much more about the specific genetic variants that are associated with educational attainment.

To construct such a large sample, we combined information from our previous GWAS of roughly 300,000 research participants from 64 datasets (which we refer to as “cohorts”) (Okbay, Beauchamp, et al. 2016) with data that have recently become available from seven additional cohorts. These seven new cohorts include the UK Biobank and the personal genomics company 23andMe, both of which have surveyed and genotyped hundreds of thousands of research participants.

Our study was limited to only the most common type of genetic variant: single-nucleotide polymorphisms (SNPs, see FAQ 1.3). Unlike most other studies, which have analyzed only the autosomes (the non-sex chromosomes), our study also included SNPs on the X chromosome (see FAQ 2.6). In total, our analyses included approximately 10 million SNPs. And, as in other GWASs, our analyses included only individuals of primarily European genetic ancestry. This restriction is needed in order to reduce statistical confounds that otherwise arise from studying populations with diverse genetic ancestries (see the discussion of population stratification in Appendix 1; see also FAQs 1.3, 2.4 & 3.5).

In the remainder of the paper, we used the findings from the GWAS for a range of additional analyses that explored (among other things):

  • the extent to which siblings with different alleles end up with different amounts of formal schooling (see FAQ 2.4);

  • which environmental conditions affect the size of the association between genetic variants and educational attainment (see FAQ 2.5);

  • the genetic overlap between educational attainment and other outcomes, such as cognitive performance (constituting the largest GWAS of cognitive performance to date) and self-reported math ability (see FAQ 2.7);

  • which other outcomes are also correlated with genetic variants that are associated with educational attainment (see FAQ 2.8); and

  • the biological functions of the genetic variants identified (see FAQ 2.9).

2.2.  What did you find in the GWAS of educational attainment?

In our sample of roughly 1.1 million people, we found 1,271 genetic variants that were associated with educational attainment (using the standard statistical threshold in GWAS, which adjusts for multiple hypothesis testing). This is a substantial increase from the 74 variants identified in our last GWAS of around 300,000 individuals (Okbay, Beauchamp, et al. 2016), confirming the importance of large sample size for identifying specific genetic variants associated with behavioral traits.
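For readers unfamiliar with the threshold mentioned above: the conventional genome-wide significance level in GWAS is p < 5 × 10^-8, which is roughly what a 5% error rate becomes after a Bonferroni correction across about one million independent tests. The short Python sketch below only illustrates that arithmetic. The figure of one million independent tests is a common rule of thumb assumed here for illustration, not a number taken from this study, and the function name is hypothetical.

```python
import math

# Assumed for illustration: a 5% family-wise error rate and roughly one
# million independent tests (a common rule of thumb, not a study figure).
FAMILY_WISE_ALPHA = 0.05
ASSUMED_INDEPENDENT_TESTS = 1_000_000


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test p-value cutoff that keeps the overall error rate at alpha."""
    return alpha / n_tests


threshold = bonferroni_threshold(FAMILY_WISE_ALPHA, ASSUMED_INDEPENDENT_TESTS)
print(f"per-SNP significance threshold: {threshold:.0e}")
```

A cutoff this strict is why only a small fraction of truly associated variants are individually identified at any given sample size.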

The current study further confirmed the finding from our earlier work that the effects of individual genetic variants on educational attainment are extremely small. The average effect size across the 1,271 genetic variants was just 1.8 weeks of schooling per allele; even the SNPs with the strongest associations only predicted around 3 weeks of additional schooling per allele. Taken together, these 1,271 SNPs accounted for just 3.9% of the variation across individuals in years of education completed.

Here is another way to think about this result. Imagine that we used the results for these 1,271 genetic variants (not the ~1 million SNPs across the entire genome that we discuss in FAQ 2.3) to predict the educational attainment of a new group of people (separate from our discovery sample). We could then compare each individual’s predicted educational attainment to their actual educational attainment. If we did so, our results suggest that we would find that the predictions and actual outcomes correlate only very modestly (at about r = 0.20). That, in turn, means that if someone were predicted to complete an above average number of years of schooling (i.e., to be in the top half of educational attainment), that person would have about a 58% chance of actually being in the top half of educational attainment. Fifty-eight percent is better than chance (i.e., 50%), suggesting that a prediction based on these 1,271 SNPs has more power to predict educational attainment than a coin flip, but only a bit more power. By contrast, a prediction based on a polygenic score that combines the ~1 million SNPs that we studied (see FAQs 1.5 & 2.3) has more predictive power: r = 0.33, corresponding to 11% of the variation across individuals.
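The two ways of reporting predictive power used in this answer, a correlation (r) and a percentage of variation predicted, are linked by squaring: the share of variation predicted is r squared. The Python sketch below simply checks that the figures quoted above are consistent with each other. The function names are hypothetical and chosen for illustration.

```python
import math


def variance_predicted(r: float) -> float:
    """Share of variation predicted, given the correlation r between
    predicted and actual outcomes."""
    return r ** 2


def correlation_from_variance(share: float) -> float:
    """Correlation implied by a given share of variation predicted."""
    return math.sqrt(share)


# 1,271 SNPs: 3.9% of the variation should imply r of about 0.20.
r_lead_snps = correlation_from_variance(0.039)

# ~1 million SNP polygenic score: r = 0.33 should imply about 11%.
share_polygenic_score = variance_predicted(0.33)

print(f"r implied by 3.9% of variation: {r_lead_snps:.2f}")
print(f"variation implied by r = 0.33: {share_polygenic_score:.1%}")
```

Because the relationship is quadratic, a correlation that sounds moderate corresponds to a much smaller share of variation predicted.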

The contrast between the 3.9% of the variation predicted by the 1,271 SNPs and the 20% known to be explained by common SNPs (see FAQ 1.7) implies that there are many other SNPs that have not yet been identified. Even larger sample sizes will be needed to identify them.

It is also important to keep in mind that educational attainment is a complex phenomenon, and our study focuses on only a tiny piece of the bigger picture. In this paper, we only examine one type of genetic variant (SNPs). Further, we conduct only preliminary analyses of how the effects of genetic variants on educational attainment differ depending on environmental conditions (see FAQ 2.5). These other genetic effects, environmental effects, and their interactions are important topics of active research and of future work by the SSGAC. Such work includes further studies of associations between educational attainment and epigenetic marks (Linnér et al. 2017).

2.3.  How predictive is the polygenic score developed in this study?

As discussed in FAQ 1.5, we can create an index using the GWAS results from ~1 million genetic variants. Such an index is called a “polygenic score.”

The polygenic score we constructed “predicts” (see FAQ 1.4) around 11% of the variation in education across individuals (when tested in independent data that was not included in the GWAS). This ~1 million SNP polygenic score predicts much more of the variation than does the genetic predictor described in FAQ 2.2, which was based on only 1,271 SNPs. Including all ~1 million SNPs tends to add predictive power because the threshold for significance/inclusion that is used to identify the 1,271 SNPs is very conservative. Many of the other ~1 million SNPs are also associated with educational attainment but are not individually identified by our study, and it turns out empirically that, on net, including them adds more signal than noise. This study’s polygenic score has much more predictive power than polygenic scores constructed from our earlier two GWAS of educational attainment, because both of those studies had much smaller sample sizes (~100,000 and ~300,000 individuals, respectively, compared with ~1.1 million individuals in the current study).
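In its simplest form, an index of this kind is a weighted sum: for each person, the number of copies (0, 1, or 2) of a given allele at each SNP is multiplied by that SNP's estimated weight from the GWAS, and the products are added up. The Python sketch below illustrates only that basic idea. Everything in it is hypothetical: the genotypes and weights are randomly generated, the sizes are tiny, and the weights used in an actual study are typically adjusted further (for example, to account for correlations between nearby SNPs) rather than taken directly as shown here.

```python
import random

random.seed(0)  # fixed seed so the made-up data are reproducible

N_PEOPLE = 5  # hypothetical; real samples contain many thousands of people
N_SNPS = 8    # hypothetical; the score described above uses ~1 million SNPs

# Made-up allele counts: each person carries 0, 1, or 2 copies per SNP.
genotypes = [
    [random.randint(0, 2) for _ in range(N_SNPS)] for _ in range(N_PEOPLE)
]

# Made-up per-allele weights standing in for GWAS effect estimates.
weights = [random.gauss(0.0, 0.02) for _ in range(N_SNPS)]


def polygenic_score(allele_counts, snp_weights):
    """Weighted sum of allele counts across SNPs for one person."""
    return sum(count * w for count, w in zip(allele_counts, snp_weights))


scores = [polygenic_score(person, weights) for person in genotypes]

for i, score in enumerate(scores):
    print(f"person {i}: polygenic score = {score:+.4f}")
```

Each individual weight is tiny, which is why a useful score only emerges after summing over a very large number of SNPs.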

Individuals with high polygenic scores have, on average, high