FAQs about "Resource Profile and User Guide of the Polygenic Index Repository"

Use the quick link menu to jump to a specific question, or scroll down to read all FAQs for this publication. 

This document provides information about the study:

 

Becker et al. (2021). “Resource Profile and User Guide of the Polygenic Index Repository.” Nature Human Behaviour.

 

The document was prepared by Daniel Benjamin, David Laibson, Michelle N. Meyer, and Patrick Turley. It draws from and builds on the FAQs for earlier SSGAC papers. It has the following sections:

 

  1. Background

  2. Study design and results

  3. Social and ethical implications of the study

  4. Appendices

 

For clarifications or additional questions, please contact Daniel Benjamin (daniel.benjamin@gmail.com).

Quick Links

1.1.     Who conducted this study? What are the group’s overarching goals?

1.2.     What is a polygenic index (PGI)? Why this terminology?

1.3.     How is a polygenic index constructed?

1.4.     How might polygenic indexes be useful?

1.5.     Does a polygenic index “cause” the outcome of interest?

1.6.     In what sense does a polygenic index “predict” the outcome of interest?

1.7.     What polygenic indexes were available to researchers prior to this project?

1.8.     How do different polygenic indexes for the same outcome differ? How comparable are results across studies that use different polygenic indexes for the same outcome?

1.9.     Why create the Polygenic Index Repository?

2.1.     What outcomes are included in the Polygenic Index Repository? How did you choose the outcomes?

2.2.     How did you create these polygenic indexes?

2.3.     How predictive are the polygenic indexes in the Repository?

2.4.     What is the “measurement-error-corrected estimator”? How will it and the Repository improve comparability of results across future studies? 

2.5.     What is in the User Guide that accompanies the Repository?

2.6.     Who can access the Repository polygenic indexes, and how?

2.7.     How will the Repository be updated?

3.1.     Do GWAS or the polygenic indexes they produce identify the gene—or genes—“for” a particular outcome?

3.2.     Do polygenic indexes show that these outcomes are determined, or fixed, at conception?

3.3.     Can the polygenic indexes from the Repository be used to accurately predict a particular person’s outcomes? 

3.4.     Can the polygenic indexes accurately be used for research studies in non-European-ancestry populations?

3.5.     Would it be appropriate to use the Repository social and behavioral polygenic indexes in policy or practice?

3.6.     Could research on polygenic indexes lead to discrimination against, or stigmatization of, people with higher or lower polygenic indexes for certain outcomes? If so, why facilitate the spread of polygenic indexes?

3.7.     What have you done to mitigate the risks of research using Repository polygenic indexes?

4.0.       References

1. Background

1.1.  Who conducted this study? What are the group’s overarching goals?

The authors of the study are researchers affiliated with the Social Science Genetic Association Consortium (SSGAC) as well as data providers (i.e., individuals who act as stewards for datasets and provide other researchers with access to these data for research purposes). The SSGAC is a multi-institutional, international research group that aims to identify statistically robust associations between variation in DNA and variation in social-science-relevant outcomes. 

We study the most common sources of genetic variation—single-nucleotide polymorphisms (SNPs). SNPs are sites in the genome where single DNA base pairs commonly differ across individuals. Each SNP usually has two different possible base pairs, which are called alleles. Although there are tens of millions of sites where SNPs are located in the human genome, our work (like most genetic research today that aims to link variation in DNA to variation in disease and other outcomes) investigates only SNPs that can be easily measured with a high level of accuracy. These days, we can easily and accurately measure millions of SNPs, which together capture most of the common genetic variation across people.

The social-science-relevant outcomes that we analyze include differences across people in behavior, preferences, and personality that are traditionally studied by social and behavioral scientists (e.g., anthropologists, economists, political scientists, psychologists, and sociologists). These traits are often also of interest to health and other researchers.

The SSGAC was formed in 2011 to address a specific set of scientific challenges. Most outcomes and behaviors are weakly associated with a very large number of SNPs. Although their collective effect can be meaningful (see FAQs 1.2 and 2.3), we now know that almost every one of these SNPs has an extremely weak association on its own. To identify specific SNPs with such small effects, scientists must study at least hundreds of thousands of people (to separate weak signals from sampling noise). One promising strategy for doing this is for many investigators to pool their data into one large study. This approach has borne considerable fruit when used by medical geneticists interested in a range of medical conditions (Visscher et al., 2017). Most of these advances would not have been possible without large research collaborations between multiple research groups interested in similar questions. The SSGAC was formed in an attempt by social scientists to adopt this research model.

The SSGAC is organized as a working group of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), a successful medical consortium. (In genetics research, “cohort” is a term that means “dataset.”) The SSGAC was founded by three social scientists—Daniel Benjamin (University of California – Los Angeles), David Cesarini (New York University), and Philipp Koellinger (University of Wisconsin and Vrije Universiteit Amsterdam)—who believe that studying SNPs associated with social scientific outcomes can have substantial positive impacts across many research fields. This includes research that aims to better understand the effects of the environment (e.g., research on policy interventions) and interactions between genetic and environmental effects. The potential benefits also span a diverse set of research questions in the biomedical sciences, such as why and how educational attainment is linked to longevity and better overall health outcomes.

To conduct such research, the SSGAC implements genome-wide association studies (GWAS) of social-scientific outcomes. For example, to conduct a GWAS of educational attainment (e.g., Lee et al., 2018), every participating cohort calculates the cross-sectional (i.e., within-cohort) correlation between educational attainment and DNA-base-pair variation at a single location on the genome: a SNP. As discussed above, a SNP is a site in the genome where there is common variation in the human population. This statistical analysis is repeated for each SNP on the genome. The cohort-level results do not contain individual-level data, only summary statistics about these within-cohort statistical associations. The SSGAC then combines these cohort results to produce the overall GWAS results. By using existing datasets and combining cohort results, we can study the genetics of large numbers of individuals (for example, ~1.1 million people in Lee et al. (2018)) at very low cost. The SSGAC publicly shares overall, aggregated results (subject to some Terms of Service; see FAQ 3.7) so that other scientists can build on this work. These publicly available data have already catalyzed many research projects and analyses across the social and biomedical sciences. Among the most useful products of these GWASs for other research are the polygenic indexes that are based on GWAS associations. Polygenic indexes are variables that aggregate the predictive power of many SNPs for predicting the outcome of the GWAS (see FAQ 1.2), and they are the focus of the current paper.
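As a rough illustration of the per-SNP analysis described above, the following Python sketch computes one correlation per SNP within a single toy cohort. The data, the function names (`pearson_correlation`, `per_snp_associations`), and the use of a plain Pearson correlation are assumptions made for illustration only; they are not the SSGAC's actual analysis pipeline, which relies on dedicated software and more elaborate statistical models.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def per_snp_associations(genotypes, outcome):
    """Return one correlation per SNP.

    genotypes[i][j] is the allele count (0, 1, or 2) of person i at SNP j.
    """
    n_snps = len(genotypes[0])
    return [
        pearson_correlation([person[j] for person in genotypes], outcome)
        for j in range(n_snps)
    ]

# Hypothetical toy cohort: 6 people, 3 SNPs, outcome = years of schooling.
genotypes = [
    [0, 2, 1],
    [1, 2, 0],
    [2, 1, 1],
    [0, 0, 2],
    [1, 1, 0],
    [2, 0, 2],
]
years_of_schooling = [12, 13, 16, 11, 14, 15]

# The list below is a cohort-level summary: one number per SNP,
# with no individual-level data in it.
summary_statistics = per_snp_associations(genotypes, years_of_schooling)
for j, r in enumerate(summary_statistics):
    print(f"SNP {j}: r = {r:+.3f}")
```

Only `summary_statistics` would leave the cohort in this sketch; the individual rows of `genotypes` and `years_of_schooling` stay with the data provider.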

The Advisory Board for the SSGAC is composed of prominent researchers representing various disciplines: Dalton Conley (Sociology, Princeton University), George Davey Smith (Epidemiology, University of Bristol), Tõnu Esko (Molecular Biology and Human Genetics, University of Tartu and Estonian Genome Center), Albert Hofman (Epidemiology, Harvard University), Robert Krueger (Psychology, University of Minnesota), David Laibson (Economics, Harvard University), James Lee (Psychology, University of Minnesota), Sarah Medland (Genetic Epidemiology, QIMR Berghofer Medical Research Institute), Michelle Meyer (Bioethics and Law, Geisinger Health System), and Peter Visscher (Statistical Genetics, University of Queensland).

The SSGAC is committed to the principles of reproducibility and transparency. Major SSGAC publications are usually accompanied by a FAQ document (such as this one). The FAQ document is written to communicate what was found less tersely and technically than in the paper, as well as what can and cannot be concluded from the research findings more broadly. FAQ documents produced for SSGAC publications are available on the SSGAC website.

To date, SSGAC-affiliated papers have studied educational attainment, cognitive performance, subjective well-being, reproductive behavior, risk tolerance, and dietary intake. The SSGAC website contains a list of our major publications, which have been published in journals such as Science, Nature, Nature Genetics, Proceedings of the National Academy of Sciences, Psychological Science, and Molecular Psychiatry.

1.2. What is a polygenic index (PGI)? Why this terminology?

A polygenic index (we use the acronym PGI throughout the paper) is an index composed of a large number of SNPs from across the genome. Each polygenic index is associated with a particular outcome (for details, see FAQ 1.3). Because a polygenic index aggregates the information from many SNPs, it can “predict” (see FAQ 1.6) far more of the variation among individuals than any single SNP. (Note that even polygenic indexes are not good predictors of outcomes for one person; see FAQ 3.3). Often, the polygenic indexes with the most predictive power are those created using all the (millions of) SNPs measured in a SNP array. A SNP array is the currently standard way of measuring common genetic differences across individuals. SNP array data do not measure the entire genetic sequence of each individual, but they do measure most of the places on the genome where individuals differ.

Our terminology of polygenic index is currently non-standard, but most of the authors of the paper prefer it to current terms and hope that this paper, and the Polygenic Index Repository introduced in this paper, make polygenic index a standard term. The traditional terms include polygenic risk score and polygenic score. The word risk makes little sense when the polygenic index is for a non-disease outcome (such as height). The word score was intended to echo statistical nomenclature but can instead convey an unintended value judgment or valence (i.e., “a higher score must be better”). The word index is at least as accurate statistically and does not convey a value judgment.

1.3. How is a polygenic index constructed?

A polygenic index is constructed in three steps. First, a genome-wide association study (GWAS) is conducted, looking at SNPs measured across the entire human genome to see which of them are associated with higher or lower levels of some outcome. As explained above, SNPs are sites in the genome where single DNA base pairs commonly differ across individuals. SNPs usually have two different possible base pairs, or alleles. Although there are tens of millions of sites where SNPs are located in the human genome, GWASs typically investigate only SNPs that can be easily measured (or imputed) with a high level of accuracy. These days, we can easily and accurately measure millions of SNPs, which together capture most of the common genetic variation across people. For each of these millions of SNPs, the GWAS generates an “effect size” corresponding to the (typically minuscule) magnitude of the association between that SNP and the outcome. (We use the term “effect size” because it is a common scientific shorthand for “magnitude of association,” but we emphasize that use of the term is not intended to imply that the SNP, or polygenic index, causes the outcome; see FAQ 1.5.)

Second, the effect sizes are used to determine the “weight” each SNP will get in the polygenic index. The simplest scheme is to weight each SNP by its effect size as estimated in the GWAS. This simple weighting scheme has one main problem: because SNPs tend to be correlated with nearby SNPs on the genome (a phenomenon called linkage disequilibrium), if one SNP is associated with the outcome, nearby SNPs will also be associated with the outcome. State-of-the-art approaches to determining the weights for a polygenic index are designed to address this problem. We use a common approach called LDpred (Vilhjálmsson et al., 2015). Using the results of a GWAS, LDpred generates a weight for each SNP. These weights are not equal to the SNPs’ effect sizes as estimated in the GWAS, mostly because the weights take into account each SNP’s correlation with other SNPs. (Even though LDpred addresses the issue of linkage disequilibrium, it does so only for the purpose of generating weights for optimal prediction. LDpred will not necessarily assign more weight to the SNP whose association with the outcome is responsible for nearby SNPs’ associations with the outcome. Thus, LDpred is a tool to address the issue of linkage disequilibrium for the purpose of prediction—which is the purpose of a polygenic index—but not for the purpose of unbiased estimation of SNPs’ effect sizes. See FAQ 1.5.) 

Third, the set of weights for the SNPs is used in a formula for calculating a polygenic index for any particular individual. The formula is a weighted sum of alleles at each SNP (using the weights from the second step). The formula is used to calculate a numerical value of the polygenic index for each individual in some dataset (that was not included in the GWAS).
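The weighted sum in this third step can be sketched in a few lines of Python. The weights, the allele counts, and the names `polygenic_index`, `person_a`, and `person_b` are all hypothetical; real weights come from the second step (for example, from LDpred) and cover far more SNPs than the four used here.

```python
def polygenic_index(allele_counts, weights):
    """Weighted sum of allele counts across SNPs for one individual."""
    if len(allele_counts) != len(weights):
        raise ValueError("need exactly one weight per SNP")
    return sum(w * x for w, x in zip(weights, allele_counts))

# Hypothetical weights for 4 SNPs (illustrative values only).
weights = [0.02, -0.01, 0.005, 0.03]

# Hypothetical individuals from a dataset that was not part of the GWAS.
# Each entry is the allele count (0, 1, or 2) at the corresponding SNP.
individuals = {
    "person_a": [0, 1, 2, 1],
    "person_b": [2, 2, 0, 0],
}

indexes = {name: polygenic_index(x, weights) for name, x in individuals.items()}
for name, value in indexes.items():
    print(f"{name}: PGI = {value:.3f}")
```

Note that a negative weight simply means that carrying more copies of the counted allele is associated with a lower value of the outcome; it carries no value judgment.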

The sample used for the GWAS in the first step is the training sample for the polygenic index. The larger the GWAS sample size, the greater the predictive power of the polygenic index constructed in the third step. However, for each outcome there is a maximum level of predictive power, which a polygenic index can approach as the sample size grows but can never exceed.

1.4. How might polygenic indexes be useful?

A polygenic index for an outcome provides one measure of the genetic influence on that outcome that can be used in research in a variety of ways. For example, polygenic indexes have been used to:

  • partially control for genetic influences in order to generate less noisy estimates of how changes in school policy influence health outcomes (Davies et al., 2018);

  • examine how the effect of school policy on health outcomes depends in part on genetic influences (Barcellos, Carvalho and Turley, 2018a);

  • study why SNPs predict educational attainment – for example, it appears that some genetic effects on educational attainment operate through associations with cognitive function and traits such as self-control (Belsky et al., 2016), which in turn affect educational attainment;

  • investigate how genetic influences on educational attainment differ across environmental contexts (Schmitz and Conley, 2017; Barcellos, Carvalho and Turley, 2018b); 

  • investigate how genetic influences on BMI vary over the lifecycle (Khera et al., 2019);

  • infer the degree of assortative mating (Robinson et al., 2017; Yengo et al., 2018);

  • trace recent migration patterns (Domingue et al., 2018; Abdellaoui et al., 2019);

  • examine whether polygenic indexes for disease risk are sufficiently predictive to be incorporated into clinical practice for preventative medicine (Khera et al., 2018); and

  • develop new statistical tools that may advance our understanding of how parenting and other features of a child’s rearing environment influence his or her developmental outcomes (Koellinger and Harden, 2018; Kong et al., 2018).

 

The idea of using GWAS results to create a polygenic index was initially proposed in 2007 (Wray, Goddard and Visscher, 2007), and the first polygenic index was created in 2009 in a GWAS of schizophrenia and bipolar disorder (Purcell et al., 2009). Since then, polygenic indexes have become a significant part of research that builds on genetics in the medical and social sciences. For example, in the current paper we analyze presentations at the annual meeting of the Behavior Genetics Association. We report that the fraction of presentations that used polygenic indexes increased from 0% in 2009 to 20% in 2019. The list above represents a few illustrative examples of research that uses polygenic indexes.

As discussed in FAQ 1.9 below, one goal of this paper, and the Polygenic Index Repository it introduces, is to facilitate further work using polygenic indexes by making a much wider range of more predictive polygenic indexes available to researchers.

1.5. Does a polygenic index “cause” the outcome of interest?

Polygenic indexes available today, including those we construct in this paper, should not be interpreted as a measure of causal mechanisms.

The genome-wide association studies (GWASs) used as the training data for the polygenic indexes (see FAQ 1.3) identify SNPs that are associated with the outcome, but an observed empirical correlation with a specific SNP need not imply that the SNP causes the outcome, for a variety of reasons. First, SNPs are often highly correlated with other, nearby SNPs on the same chromosome. As a result, when one or more SNPs in a region causally influence an outcome (in that particular environment), many non-causal SNPs in that region may also be identified as associated with the outcome (in FAQ 1.3, see the parenthetical “Even though LDpred…” for why LDpred does not solve this problem for the purpose of identifying the causal SNP). In fact, the causal SNP may not have even been measured directly. For example, GWAS that focus on common SNPs would not be able to identify rare or structural types of genetic variation (e.g., deletions or insertions of an entire genetic region) that are causal, but they may identify SNPs that are correlated with these unobserved variants. For these and other reasons, polygenic indexes are likely to be composed of a mix of causal and non-causal SNPs, and the weights used in the formula for constructing the polygenic index (see FAQ 1.3) should not be interpreted as estimates of the causal effects of the SNPs. As a very rough estimate, for social and behavioral outcomes, no more than about one-third of the predictive power of a polygenic index (i.e., the percentage of the variance in the outcome among individuals that the polygenic index explains) is explained by causal genetic effects (Howe et al., 2021). For instance, the most predictive polygenic index for educational attainment currently available explains about 12% of the variance between people, but only one-third of that (about 4%) is causal. (These causal SNPs may be among the SNPs included in the polygenic index or may be physically close to, and therefore correlated with, SNPs that are included.) In contrast, for anthropometric outcomes such as height, it is possible that nearly all of the predictive power of a polygenic index is explained by causal SNPs.

Second, at a particular SNP the frequency of different alleles might vary systematically across environments. If those environmental factors are not accounted for in the association analyses, some of the measured SNP associations with social-science outcomes may be spurious. To use a well-known example of this idea (Lander and Schork, 1994), any genetic variants common in people of Asian ancestries will be associated statistically with more frequent than average chopstick use, but these variants would not cause greater chopstick use; rather, these genetic variants and the outcome of chopstick use are both distributed unevenly among people with different ancestries. This is called the problem of “population stratification.” The GWAS underlying the polygenic indexes in this paper employ standard strategies to try to minimize this problem, but the issues raised by population stratification cannot be ruled out entirely. As a result, the polygenic indexes likely reflect population stratification to some extent. In the User Guide that accompanies the Polygenic Index Repository (reproduced in the Supplementary Methods of the paper), we discuss this problem in more detail and describe strategies for addressing population stratification in the polygenic indexes.
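The chopstick example can be made concrete with a small, fully made-up dataset. In the Python sketch below, two hypothetical ancestry groups differ both in how common an allele is and in how common chopstick use is, but within each group the allele is unrelated to chopstick use. All counts and names (`make_group`, `group_a`, `group_b`) are invented for illustration and are not taken from Lander and Schork (1994) or from the paper.

```python
def slope(xs, ys):
    """Least-squares slope from regressing ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

def make_group(spec):
    """Build a group from (allele_count, n_users, n_non_users) triples."""
    xs, ys = [], []
    for allele_count, n_users, n_non_users in spec:
        xs += [allele_count] * (n_users + n_non_users)
        ys += [1] * n_users + [0] * n_non_users
    return xs, ys

# Group A: allele is common, 75% use chopsticks at every allele count.
group_a = make_group([(2, 6, 2), (1, 3, 1), (0, 3, 1)])
# Group B: allele is rare, 25% use chopsticks at every allele count.
group_b = make_group([(0, 2, 6), (1, 1, 3), (2, 1, 3)])

slope_a = slope(*group_a)
slope_b = slope(*group_b)
# Pooling the groups without accounting for ancestry.
pooled_slope = slope(group_a[0] + group_b[0], group_a[1] + group_b[1])

print(f"within group A: {slope_a:+.3f}")
print(f"within group B: {slope_b:+.3f}")
print(f"pooled:         {pooled_slope:+.3f}")
```

In this toy setup the allele shows no association with chopstick use within either group, yet a positive association appears once the groups are pooled. Analyzing each group separately is only a loose stand-in for the ancestry adjustments used in practice; it is not the specific method used in the paper.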

Even in GWAS (such as those we rely on or conduct ourselves) that attempt to address and correct for heterogeneity in genetic ancestry, allele frequencies may nonetheless vary systematically with environmental factors even within a group of people of similar genetic ancestry. For example, a SNP that is associated with improved educational outcomes in the parental generation may have downstream effects on parental income and other factors known to influence children’s educational outcomes (such as neighborhood characteristics). This same SNP is likely to be inherited by the children of these parents, creating a correlation between the presence of the SNP in a child’s genome and the extent to which the child was reared in an environment with specific characteristics. A recent study of Icelandic families showed that a parental allele associated with higher educational attainment of the parent that is not passed on to the parent’s offspring is still associated with the child’s educational attainment, suggesting that GWAS results for educational attainment partly represent these intergenerational environmental pathways (Kong et al., 2018).

Third, a SNP’s effects on an outcome may be indirect, so a SNP that may be “causal” in one environment may have a diminished effect or no effect at all in other environments. For example, variation in a particular SNP on chromosome 15 is associated with lung cancer (Amos et al., 2008; Hung et al., 2008; Thorgeirsson et al., 2008). From this observation alone we cannot conclude that variation in this SNP can cause lung cancer through some direct biological mechanism. In fact, it is likely that variation in this SNP, which is part of the nicotinic acetylcholine receptor gene cluster that affects nicotine metabolism, increases lung cancer risk through effects on smoking behavior. In a tobacco-free environment, it is plausible that this association with lung cancer would be substantially weaker and perhaps disappear altogether. Thus, even if we have credible evidence that a specific association is not spurious, it is entirely possible that the SNP in question influences the outcome through channels that we, in common parlance, would label environmental (e.g., smoking). Nearly forty years ago, the sociologist Christopher Jencks criticized the widespread tendency to mistakenly treat environmental and genetic sources of variation as mutually exclusive (see also Turkheimer, 2000). As the example of smoking illustrates, it is often overly simplistic to assume that “genetic explanations of behavior are likely to be exclusively physical explanations while environmental explanations are likely to be social” (Jencks, 1980, p723).

1.6. In what sense does a polygenic index “predict” the outcome of interest?

When we and other scientists say that polygenic indexes (and other variables, such as demographics or other environmental factors) “predict” certain outcomes, our use of the word differs in several important ways from how “predict” is used in standard language (e.g., outside of social science research papers). First, we do not mean that the polygenic index guarantees an outcome with 100% probability, or even with a high degree of likelihood. Rather, we mean that the polygenic index is, on average across people, statistically associated with an outcome. In other words, on average, people with a higher numerical value of the polygenic index have a higher likelihood of the outcome compared to people with a lower numerical value. A polygenic index is said to be statistically “predictive” of an outcome even if the polygenic index has only a weak association with the outcome—as is the case, for instance, with almost all of the polygenic indexes in this paper. In such cases, the polygenic index is only weakly predictive of the outcome.

Second, in standard language, “prediction” usually refers to the future. In contrast, when scientists say that a polygenic index “predicts” an outcome, they mean that they expect to see the association in new data. “New data” means data that haven’t been analyzed yet—regardless of whether those data will be collected in the future or have already been collected. In other words, in social science, it makes perfect sense to ask how well a polygenic index predicts outcomes that have already occurred, like how many years of education were attained by older adults.

Finally, in standard language, a “prediction” is often an unconditional guess about what will happen. Instead of meaning it unconditionally, scientists mean that they expect to see an association in new data under certain conditions, for example, that the environment for the new data is the same as the environment in which the GWAS that underlies the polygenic index (see FAQ 1.3) was conducted. In the example given in FAQ 1.5, in which a SNP is associated with lung cancer due to an effect on smoking, we would not expect the SNP to be as strongly predictive of lung cancer, or predictive at all, in an environment where tobacco-based products are hard to obtain or absent entirely.

1.7. What polygenic indexes were available to researchers prior to this project?

Prior to this project, only a few datasets had constructed polygenic indexes that researchers could download and use. Notable examples of data providers that did make polygenic indexes directly available to researchers, all of which recognized early on the value of doing so, are the Health and Retirement Study, the Wisconsin Longitudinal Study, and the National Longitudinal Study of Adolescent to Adult Health. The UK Biobank does not construct polygenic indexes for its users, but it provides a mechanism by which researchers who use the data and construct polygenic indexes can “return” them to the UK Biobank for use by other researchers. Through this mechanism, polygenic indexes constructed from several GWASs have been made available for researchers to download from the UK Biobank.

To study polygenic indexes in other datasets or for other outcomes, prior to this paper, researchers would need to construct the polygenic indexes themselves, following the steps described in FAQ 1.3. For the first step, most researchers would need to rely on publicly available GWAS results, which include less data and are therefore less predictive than some polygenic indexes in published work that rely on non-public GWAS results (see FAQ 2.3). Recently, to make it easier for researchers to construct polygenic indexes themselves, the Polygenic Score Catalog (Lambert et al., 2020) collected together weights for a range of polygenic indexes (also based on publicly available GWAS results).

As we discuss in more detail in FAQ 2.1, for the Polygenic Index Repository, we constructed a large number of polygenic indexes in each of 11 datasets (including the four mentioned above) and have made the polygenic indexes directly available for researchers to download. The polygenic indexes are often based on more data than is publicly available, and the polygenic indexes are constructed according to a uniform methodology across both outcomes and datasets. For examples of Repository polygenic indexes that were previously not available at all or that were less accurate (i.e., predictive), see FAQ 2.3.

1.8. How do different polygenic indexes for the same outcome differ? How comparable are results across studies that use different polygenic indexes for the same outcome?

There are several reasons why polygenic indexes for the same outcome can differ from each other. As described in FAQ 1.3, there are three steps to creating a polygenic index, and differences can arise at each of these steps. For example, in the first step, researchers could base the polygenic index on different GWASs of the same outcome. Different GWASs may be based on samples who live under different environmental conditions, may have different measures of the outcome, and/or may have measured different SNPs. As another example, in the second step, researchers could use a different method of determining polygenic-index weights from the results of a GWAS. For these and other reasons, it has been common for different studies to use different polygenic indexes, even when the polygenic indexes are for the same outcome and are being studied in the same dataset.

The results are typically difficult to compare across such studies for three main reasons:

  1. If the polygenic indexes are constructed using different methods, then even though they are both measuring genetic influences on the outcome, the precise definition of these “genetic influences” may differ (see FAQs 3.1 and 3.2).

  2. The units for measuring the strength of associations between the polygenic index and other variables generally differ across studies. Researchers usually report results in terms of standard deviations (a statistical unit) of the polygenic index, but if the polygenic index in one study is a more powerful predictor than that in the other study, then one standard deviation of one polygenic index means something different than one standard deviation of the other.

  3. If one of the polygenic indexes is a more powerful predictor than the other, then they differ in their signal-to-noise ratio for capturing genetic influences on the outcome. Whenever an explanatory variable is measured with noise, results based on that variable will be distorted, sometimes in unanticipated ways. Since the signal-to-noise ratio differs across the polygenic indexes, results based on them are distorted differentially, further making the results difficult to compare.
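The second and third reasons can be illustrated with a small simulation. In the Python sketch below, two hypothetical polygenic indexes measure the same underlying genetic factor with different amounts of noise, and each is standardized before the outcome is regressed on it. Everything here is an assumption made for illustration: the simulated data, the noise levels, the value of `true_effect`, and the simple "factor plus noise" model of a polygenic index.

```python
import random
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    m, s = fmean(values), pstdev(values)
    return [(v - m) / s for v in values]

def slope(xs, ys):
    """Least-squares slope from regressing ys on xs."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

random.seed(0)
n = 20_000
true_effect = 0.5  # effect of one unit of the (unobserved) genetic factor

genetic_factor = [random.gauss(0, 1) for _ in range(n)]
outcome = [true_effect * g + random.gauss(0, 1) for g in genetic_factor]

def noisy_index(noise_sd):
    """A standardized index equal to the genetic factor plus noise."""
    return standardize([g + random.gauss(0, noise_sd) for g in genetic_factor])

pgi_less_noisy = noisy_index(1.0)  # more powerful predictor
pgi_more_noisy = noisy_index(3.0)  # less powerful predictor

coef_less_noisy = slope(pgi_less_noisy, outcome)
coef_more_noisy = slope(pgi_more_noisy, outcome)

print(f"per-SD coefficient, less noisy index: {coef_less_noisy:.3f}")
print(f"per-SD coefficient, more noisy index: {coef_more_noisy:.3f}")
```

Both coefficients are expressed "per standard deviation of the polygenic index," yet in this toy setup the noisier index yields a smaller coefficient, and both fall short of `true_effect`. Two studies reporting these numbers would appear to disagree even though the underlying relationship is identical.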

 

1.9. Why create the Polygenic Index Repository?

In brief, the Polygenic Index Repository introduced in this paper has three main goals: (i) to make polygenic indexes for a large number of outcomes more accessible to a wider range of researchers from many fields and disciplines, including early career researchers, researchers without access to the data and/or training required to create the most state-of-the-art polygenic indexes, and researchers who wish to probe the limitations of polygenic indexes; (ii) to increase the use of polygenic indexes that are more accurate (i.e., predictive) than polygenic indexes researchers could construct from publicly available GWAS results; and (iii) to facilitate the comparability of results across studies that use these polygenic indexes.

In more detail, the Polygenic Index Repository addresses several practical obstacles that researchers interested in using polygenic indexes must often confront, including:

  1. Constructing a polygenic index from genotype data requires special expertise. Even for researchers with that expertise, it can be a time-consuming process.

  2. It is generally desirable to generate polygenic-index weights from the GWAS with the largest sample size because the predictive accuracy of a polygenic index is expected to be largest in that case. However, there are administrative hurdles for accessing some GWAS results, such as those from 23andMe. In practice, researchers often end up constructing polygenic indexes using only publicly available GWAS results. Such polygenic indexes tend to have less predictive power.

  3. Publicly available GWAS results are sometimes based on a sample that includes the dataset (or close relatives of dataset members) in which the researcher wants to analyze the polygenic index. Such “sample overlap” spuriously inflates the predictive power of the polygenic index, which can lead to highly misleading results.

  4. Because different researchers construct polygenic indexes in different ways, it is hard to compare and interpret results from different studies (see FAQ 1.8).

As we explain in the paper:

 

We overcome #1 by constructing the [polygenic indexes] ourselves and releasing them to the data providers, who in turn will make them available to researchers. This simultaneously addresses #2 because we use all the data available to us that may not be easily available to other researchers or to the data providers, including genome-wide summary statistics from 23andMe. Using these genome-wide summary statistics from 23andMe is what primarily distinguishes our Repository from existing efforts by data providers to construct PGIs and make them available…It also distinguishes our Repository from efforts to make publicly available [polygenic index] weights directly available for download (although we also do that, for weights constructed without 23andMe data). To deal with #3, for each [outcome] and each dataset, we construct a [polygenic index] from GWAS summary statistics that excludes that dataset. We overcome #4 by using a uniform methodology across the [outcomes].

In addition to providing polygenic indexes constructed using a uniform methodology (which deals with problem #1 listed in FAQ 1.8), we aim to improve comparability of results based on polygenic indexes in another way (which deals with problems #2 and #3 listed in FAQ 1.8): we derive a “measurement-error-corrected estimator” and provide software for calculating it. This estimator deals with the fact that polygenic indexes can differ from each other in their signal-to-noise ratios. It estimates what the results of an analysis would be if the polygenic index had no noise. It thereby avoids the distortions in results that arise from having a noisy measure. Because it puts results about the polygenic index in the units of the “noiseless” polygenic index, the results from polygenic indexes with different signal-to-noise ratios are expressed in the same units. For more details, see FAQ 2.4.

2. Study Design and Results

2.1. What outcomes are included in the Polygenic Index Repository? How did you choose the outcomes?

We constructed polygenic indexes for 47 outcomes in 11 datasets, using a consistent methodology. The outcomes (listed in Table 1 in the paper) can be categorized into five groups:

  • anthropometric (height and body mass index);

  • cognition and education (including number of years of formal schooling and performance on cognitive tests);

  • fertility and sexual development (including number of children separately for men and women, and age at first menses);

  • health and health behaviors (the largest category, which includes self-rated overall health, several alcohol and smoking-related behaviors, and depressive symptoms); and

  • personality and well-being (the next largest category, which includes self-rated risk tolerance, subjective well-being, and adventurousness).

 

The set of 47 outcomes we studied was selected from a larger set of 53 outcomes; we did not create polygenic indexes for the 6 outcomes for which statistical calculations indicated that, based on the GWAS results we had available, a polygenic index was predicted to explain less than 1% of the variation across individuals. Although the specific threshold of 1% is somewhat arbitrary (but see further discussion in FAQ 2.3 below), polygenic indexes with low predictive power are less useful and more likely to generate misleading results (such as false positives) if used.

2.2. How did you create these polygenic indexes?

In order to construct the polygenic indexes, we combined GWAS results from three sources. First, for the 34 outcomes where we could find previously published GWAS, we obtained the publicly available results. Second, we collaborated with the personal genomics company 23andMe. 23andMe contributes to academic research by analyzing the data of customers who consent to participate in research. For this paper, 23andMe provided GWAS results for 37 outcomes, 9 of which had not previously been published. Third, for 25 outcomes, we conducted a GWAS ourselves in the UK Biobank, a large-scale biomedical database accessible to researchers. When more than one of these sources of GWAS results was available for an outcome, we combined the GWAS results together using a statistical method called meta-analysis. In some cases, we constructed “multi-trait polygenic indexes” using GWAS results for multiple outcomes (Turley et al., 2018); these polygenic indexes are often more predictive than a standard “single-trait polygenic index” constructed from GWAS results from a single outcome (FAQ 1.3), but the results from analyzing multi-trait polygenic indexes are sometimes more difficult to interpret (FAQ 2.5).
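As a rough illustration of what "meta-analysis" means here, the sketch below combines estimates of a single SNP's effect from several sources using fixed-effect inverse-variance weighting, one common scheme. This FAQ does not say which meta-analysis scheme was used for the Repository, so the weighting scheme, the function name, and the example numbers are all assumptions made for illustration.

```python
import math

def inverse_variance_meta(betas, standard_errors):
    """Combine estimates of one SNP's effect from several GWAS sources.

    Fixed-effect inverse-variance weighting (an assumed, commonly used
    scheme; not necessarily the one used for the Repository).
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")
    # More precise estimates (smaller standard errors) get larger weights.
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    combined_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    combined_se = math.sqrt(1.0 / total_weight)
    return combined_beta, combined_se

# Hypothetical effect estimates for a single SNP from three sources.
beta, se = inverse_variance_meta([0.020, 0.015, 0.025], [0.010, 0.005, 0.020])
print(f"combined effect = {beta:.4f}, standard error = {se:.4f}")
```

Under this scheme the combined standard error is always smaller than the smallest input standard error, which is one way to see why pooling sources tends to yield more predictive polygenic-index weights.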

2.3. How predictive are the polygenic indexes in the Repository?

To assess the predictive power of the polygenic indexes, we used data from 5 of the 11 participating datasets (those for which we had access to both the outcome and genotype data we needed to construct the polygenic indexes). In each of these 5 datasets, we calculated the predictive power of every polygenic index for which the dataset contained data on the relevant outcome (see FAQ 2.1).

The predictive power of the polygenic indexes varies substantially across the outcomes and validation datasets. The polygenic index for height has the greatest predictive power. It predicts 26% to 34% of the variation across individuals, depending on the validation dataset. Next is the polygenic index for body mass index (BMI), whose predictive power ranges from 13% to 15% in our validation datasets. Several outcomes—cognitive performance, age at first menses, and educational attainment—have a polygenic index with predictive power in the range of 6% to 12%. Among the least predictive are the polygenic indexes for satisfaction with family and satisfaction with friendships, whose predictive powers in our validation datasets range from 0.3% to 0.7% (they were included because their predictive power was statistically expected to exceed 1%; see FAQ 2.1). The predictive powers for the other polygenic indexes in the Repository lie somewhere between 1% and 6%.

Although the effects explained by these polygenic indexes are small-to-modest, they can nevertheless be useful in research. For instance, the environmental factors studied in economics research typically have predictive power smaller than 5%, often 1% or smaller. Among the strongest predictors of educational attainment is family socioeconomic status, which has predictive power of roughly 15%. In a standard categorization used in psychology (Cohen, 1992; percentages here are squared r values), predictive power less than 9% is “small” while predictive power greater than 25% (rarely attained in psychological research) is “large.” We caution, however, that these comparisons of the effect sizes of polygenic indexes and environmental influences are not apples-to-apples, because researchers usually study the effect of one particular environmental factor (or a specific set of factors) on an outcome, whereas a polygenic index summarizes the predictive power of SNPs across the genome. As discussed further in FAQ 3.3, for social and behavioral outcomes, the sum of all environmental (i.e., non-genetic) influences substantially outweighs the sum of all genetic influences that a polygenic index aims to capture.

As we discuss in FAQ 3.3, an individual’s polygenic indexes (even for height) do not very accurately predict that individual’s outcomes. However, polygenic indexes are useful for scientific studies (including social science, health research, etc.). Such studies are concerned with aggregate population trends and averages rather than with individual outcomes. For example, for a polygenic index that predicts 1% of the variation across individuals, studies of its association with other variables can be well powered in sample sizes as small as 785 individuals; 10 out of the 11 datasets participating in the Repository have sample sizes larger than that.
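The figure of 785 can be reproduced with a standard back-of-the-envelope power calculation. The sketch below assumes 80% power, a two-sided 5% significance level, and the large-sample approximation n ≈ (z_{1-α/2} + z_{power})² / R². This FAQ does not state these settings, so they are assumptions; under them, the calculation gives 785.

```python
import math
from statistics import NormalDist

def approx_required_sample_size(r_squared, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect an association that
    explains a fraction `r_squared` of the variation in an outcome.

    Assumed settings (not stated in the FAQ): two-sided test at level
    `alpha`, target `power`, and the large-sample approximation
    n ~ (z_{1-alpha/2} + z_{power})^2 / r_squared.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil((z_alpha + z_power) ** 2 / r_squared)

print(approx_required_sample_size(0.01))  # 785 under the assumed settings
```

Because the required sample size scales with 1 / R², an index that predicts 0.5% of the variation would need roughly twice as many individuals under the same assumptions.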

A major goal of the Polygenic Index Repository is to enable other research that is valuable to social scientists and health researchers. Such studies are already being conducted with some polygenic indexes (see FAQ 1.9). For some outcomes, the polygenic indexes in the Repository are more predictive than those that were previously possible to construct; examples include having asthma/eczema/rhinitis, number of cigarettes smoked per day, having migraines, nearsightedness, self-reported physical activity, self-rated overall health, extraversion (i.e., being outgoing), and subjective well-being (i.e., self-reported happiness or life satisfaction). For other outcomes, polygenic indexes were not available prior to this paper because there had been no large published GWASs for those outcomes; examples include childhood reading, self-rated math ability, self-reported narcissism, and several allergies (including pollen allergy).

2.4. What is the “measurement-error-corrected estimator”? How will it and the Repository improve comparability of results across future studies?

To understand this tool, it’s helpful to imagine the theoretically ideal polygenic index that could result from an infinitely large GWAS. In the paper, we call the predictor that would result from this ideal GWAS the “additive SNP factor.” The actual polygenic indexes that exist in the world are “noisy” measures of, and therefore only proxies for, this additive SNP factor. The signal-to-noise ratio of a polygenic index—i.e., the extent to which it reflects the additive SNP factor—is determined by the sample size of the GWAS from which the polygenic index is constructed (a larger GWAS leads to less noise and therefore a higher signal-to-noise ratio). The fact that the polygenic index is noisy distorts the results of most analyses that use the polygenic index (relative to what the results would be with the ideal predictor). These distortions can lead researchers to reach incorrect conclusions. For example, in an analysis of how genes and environments interact in influencing some outcome, the noise in the polygenic index will usually cause a researcher to underestimate how strongly genes and environments interact.

Moreover, as discussed in FAQ 1.8, there are many reasons why two polygenic indexes for the same outcome could differ from each other, including differences in the GWAS that the polygenic index is based on and different methods for constructing the polygenic index. Many of these differences among GWASs produce differences in the signal-to-noise ratios of their resulting polygenic indexes. Two studies using polygenic indexes with different signal-to-noise ratios will, in turn, have results that are distorted to differing degrees, reducing comparability of results across studies that use the polygenic indexes.

The “measurement-error-corrected estimator” we derive in the paper enables researchers to conduct analyses without the distortion that comes from the noise. It works because we (often) have a good estimate of how much noise a given polygenic index has. We can use that information to calculate what the results of an analysis would have been if the polygenic index had no noise. The estimator improves comparability of results across papers because it avoids the distortions in results that arise from having a noisy polygenic index. Rather than being distorted to different degrees, two studies using polygenic indexes with different signal-to-noise ratios that use our estimator will both have undistorted results. We have made available the software for this estimator. We will maintain and provide user support for this software.
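The logic of correcting for noise can be illustrated with a small simulation of classical measurement error. This is a simplified sketch, not the estimator derived in the paper or its software: it assumes a single regression, a polygenic index equal to the additive SNP factor plus independent noise, and a signal-to-noise level that is known exactly. All function names and parameter values are hypothetical.

```python
import random
import statistics

def slope(x, y):
    """Ordinary least-squares slope from regressing y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x

def standardize(values):
    """Rescale to mean 0 and standard deviation 1."""
    mean, sd = statistics.fmean(values), statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def simulate(n=20000, true_effect=0.30, noise_sd=1.0, seed=0):
    rng = random.Random(seed)
    # Hypothetical noiseless predictor (the "additive SNP factor"), variance 1.
    factor = [rng.gauss(0, 1) for _ in range(n)]
    # A realistic polygenic index: the factor plus estimation noise.
    pgi = standardize([g + rng.gauss(0, noise_sd) for g in factor])
    outcome = [true_effect * g + rng.gauss(0, 1) for g in factor]
    # Share of the index's variance that is signal (assumed known here).
    reliability = 1.0 / (1.0 + noise_sd ** 2)
    naive = slope(pgi, outcome)              # attenuated toward zero
    corrected = naive / reliability ** 0.5   # per SD of the noiseless factor
    return naive, corrected

naive, corrected = simulate()
print(f"naive = {naive:.3f}, corrected = {corrected:.3f}, true = 0.300")
```

With `noise_sd = 1.0`, half of the index's variance is noise, so the naive slope is shrunk by a factor of about 0.71 (the square root of 0.5). Dividing by that factor recovers, up to sampling error, the effect per standard deviation of the noiseless predictor, which is the common unit that makes results comparable across indexes with different signal-to-noise ratios.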

Moreover, across all the polygenic indexes and across all the datasets participating in the Repository, we constructed the polygenic indexes in a uniform way. To the extent that future studies use the polygenic indexes from the Repository, their results will therefore be more comparable.

2.5. What is in the User Guide that accompanies the Repository?

Along with the polygenic indexes, we have distributed a User Guide to the participating datasets. Data providers will distribute this User Guide to researchers as part of the Repository. The User Guide contains technical details about the construction of the polygenic indexes, as well as details about data and software availability. It also describes a set of key interpretational considerations that researchers should keep in mind when analyzing polygenic indexes. These include when to use a single-trait versus multi-trait polygenic index (see FAQ 2.2) and reasons why associations between a polygenic index and an outcome generally cannot be interpreted as causal (see FAQ 1.5). Finally, the User Guide contains a discussion of six “interpretational considerations” that we urge researchers who use polygenic indexes to consider as part of the responsible conduct and communication of their research (see FAQ 3.7).

2.6. Who can access the Repository polygenic indexes, and how?

Researchers can access the Repository polygenic indexes through the data access procedures for each of the datasets participating in the Repository. These are summarized in the Supplementary Note of the paper. Typically, data providers require researchers to submit a brief description of the planned research and to sign a Data Use Agreement. The Data Use Agreement usually requires researchers to agree to protect the confidentiality of individuals in the dataset and, to that end, to analyze the data on computers that satisfy certain security protocols.

We provided the polygenic indexes we created to the 11 datasets participating in the Repository, so that the data providers can distribute them to users of the datasets. We designed the Repository this way for three reasons (corresponding to problems #1, #2, and #3 in FAQ 1.9; problem #4 is addressed by using a consistent methodology for constructing the polygenic indexes). First, because we are making available the polygenic indexes (rather than the GWAS results from which they are constructed), researchers do not need to spend time constructing the polygenic indexes from GWAS results.

Second, for many outcomes, the polygenic indexes we construct are based on more data than are in the largest previously published GWAS. Because the Repository polygenic indexes for those outcomes are based on more data, they are more accurate (i.e., predictive) than polygenic indexes that could be constructed based only on publicly available GWAS results. Third, we tailored the polygenic indexes we constructed to each of the 11 datasets. Specifically, we ensured that for a given dataset, its polygenic indexes were not based on GWAS results that included that dataset (which would have led to “sample overlap” that would make it problematic to use the polygenic index with that dataset).

2.7. How will the Repository be updated?

We plan to update the Repository regularly as new GWAS are published or new data become available in which we can conduct our own GWAS. The updates will increase the predictive power of polygenic indexes already in the Repository, as well as expand the set of outcomes for which polygenic indexes are available. We also expect to include additional datasets whose stewards want to participate in the Repository and make their data broadly available to the research community.


3. Ethical and social implications of the study

3.1. Do GWAS or the polygenic indexes they produce identify the gene—or genes—“for” a particular outcome?

No. GWAS of complex outcomes identify many SNPs that are associated with an outcome like height or educational attainment. Although it was once believed that scientists would discover numerous strong one-to-one associations between specific genes and outcomes, we have known for a number of years that the vast majority of human traits and other outcomes are complex and are influenced by thousands of genes, each of which alone tends to have a small influence on the relevant outcome.

Furthermore, many complex outcomes are also influenced by parts of the genome that are not genes at all but instead serve to regulate genes (e.g., influencing when a gene is turned on or off). Genes typically contain many SNPs (often dozens or hundreds, in some cases thousands), and there are even more SNPs outside of genes than inside genes. Complex outcomes are often influenced by millions of SNPs.

Although the GWAS that produced the polygenic indexes included in the Repository did find several SNPs that are associated with particular outcomes, we believe that characterizing these as “genes for X”—or, more accurately—“SNPs for X” (e.g., educational attainment, height) is still likely to mislead, for many reasons, and we urge researchers and reporters to avoid this usage.

As an example, consider the outcome of educational attainment. First, most of the variation in people’s educational attainment is accounted for by social and other environmental factors, not by additive genetic effects (see FAQ 3.3). “Genes for educational attainment” might be read to imply, incorrectly, that genes are the strongest predictor of variation in educational attainment.

Second, the SNPs that are associated with educational attainment are also associated with many other things. These SNPs are no more “for” educational attainment than for the other outcomes with which they are associated.

Third, the “predictive” power (see FAQ 1.6) of each individual SNP that we identify is very small. Our previous work (Lee et al., 2018) has shown that genetic associations with educational attainment involve thousands, or even millions, of SNPs, each of which has a tiny effect size. Each SNP is therefore weakly associated with, rather than a strong influence on, educational attainment. “Genes for educational attainment” might misleadingly imply the latter.

Fourth, environmental factors can increase or decrease the impact of specific SNPs (see FAQ 3.3). Put differently, even if a SNP is associated with higher or lower levels of educational attainment on average, it may have a much larger or smaller effect depending on environmental conditions. Indeed, in our most recent GWAS of educational attainment (Lee et al., 2018) and elsewhere, we report exploratory analyses that provide evidence of such gene-environment interactions. Educational attainment couldn’t even exist as a meaningful object of measurement if we didn’t have schools, and having schools introduces societal mechanisms that influence who goes to them. Accordingly, genetic associations with educational attainment necessarily will be mediated by societal systems and therefore genetic variation should often be expected to interact with environmental factors when it influences social phenomena, such as educational attainment. “Genes for educational attainment” suggests a stability in the relationship between these genes and the outcome of educational attainment that does not exist.

Finally, SNPs do not affect educational attainment directly. As described in our previous work (Lee et al., 2018), the genes identified as associated with educational attainment tend to be especially active in the brain and involved in neural development and neuron-to-neuron communication. The “predictive” power (see FAQ 1.6) of SNPs on educational attainment may therefore be the result of a long process starting with brain development, followed by the emergence of particular psychological traits (e.g., cognitive abilities and personality). These traits may then lead to behavioral tendencies as well as experiences and treatment by parents, peers, and teachers. All of these factors may additionally interact with the environment in which a person lives. Eventually these traits, behaviors, and experiences may influence (but not completely determine) educational attainment.

3.2. Do polygenic indexes show that these outcomes are determined, or fixed, at conception?

Absolutely not. Social and other environmental factors account for most variation in most of the outcomes for which the Repository contains polygenic indexes. But even if it were true that genetic factors accounted for all of the differences among individuals in an outcome, it would still not follow that an individual’s outcome is “determined” at conception. There are at least three reasons for this.

First, some genetic effects may operate through environmental channels (Jencks, 1980). Again, consider educational attainment as an example. Suppose, hypothetically, that some of the SNPs in the index help students to memorize and, as a result, to become better at taking tests that rely on memorization. In this example, changes to the intermediate environmental channels (the type of tests administered in schools) could have large effects on individuals’ educational attainment, even though individuals’ genomes would not have changed. Certain SNPs might not be associated with educational attainment at all if schools did not use tests that rely on memorization. More generally, the polygenic index for educational attainment in the Repository might be less predictive if the education system were organized differently than it is at present (see also FAQ 3.3).

Second, even if the genetic associations with educational attainment operated entirely through non-environmental mechanisms that are difficult to modify (such as direct influences on the formation of neurons in the brain and the biochemical interactions among them), there could still exist powerful environmental interventions that could change the genetic relationships. In a famous example suggested by the economist Arthur Goldberger, even if all variation in unaided eyesight were due to genes, there would still be enormous benefits from introducing eyeglasses (Goldberger, 1979). Similarly, policies such as a required minimum number of years of education and dedicated resources for individuals with learning disabilities can increase educational attainment in the entire population and/or reduce differences among individuals.

Third, even if the genetic effects on an outcome were not influenced by changes in the environment, those environmental changes themselves could still have a major impact on the outcome in the population as a whole. For example, if young children were given more nutritious diets, then everyone’s school performance might improve, and college graduation rates might increase. Or consider the outcome of height: 80%-90% of the variation across individuals in height is due to genetic factors. Yet the current generation of people is much taller than past generations due to changes in the environment such as improved nutrition.

3.3. Can the polygenic indexes from the Repository be used to accurately predict a particular person’s outcomes?

No. While the “predictive” power (see FAQ 1.6) of our polygenic indexes makes most of them useful in research for some purposes (see FAQ 2.3), these polygenic indexes fail to predict the majority of variation across individuals. Even for height, the outcome for which our polygenic index has the greatest predictive power (26% to 34%, depending on the validation dataset; see FAQ 2.3), the index fails to predict roughly 70% of the variation.

Indeed, an important message of a number of our earlier papers is that DNA does not “determine” an individual’s behavioral and social outcomes, for at least four reasons.

First, in the environments in which the outcomes have been measured, other studies have estimated that the additive effects of SNPs will only ever account (even with arbitrarily large samples used to construct polygenic indexes) for a minority of the variation across individuals in the outcomes we study. For example, we estimate that the theoretical upper bound for additive effects of SNPs would account for 46% of the variation in height, 24% in body mass index, 20% in age at first menses, and less than 10% for most of the social/behavioral outcomes we study. So even a hypothetical polygenic index that perfectly reflects the additive SNP factor (see FAQ 2.4) could only explain a small fraction of the variation across individuals.

Second, today’s polygenic indexes are not perfect; they are only able to predict a fraction of that already small fraction of cross-sectional predictive power.

Third, since SNPs matter more or less depending on environmental context (see FAQ 3.2), a polygenic index might be less (or more) predictive for individuals in some environments than for individuals in others.

Finally, and similarly, polygenic predictions only hold for as long as the environment in which they were developed remains substantially the same.

To illustrate these final two reasons, consider the example of educational attainment (for which we have included a polygenic index in the Repository and on which we have done previous research): if the pedagogy underlying the educational system in which the GWAS that produced the polygenic index was conducted is substantially different than the pedagogy of the different population to which that polygenic index is being applied, the polygenic index may be less (or, conceivably, more) predictive in this second population (for an example, see FAQ 3.2). The same is true if the polygenic index is applied to the same population, but at a later time when the pedagogy has changed substantially. Just as eyeglasses allow those genetically predisposed to poor vision to have nearly perfect vision, innovations in education (say, an innovation that makes education irresistibly engaging, thus mitigating the risk to those with SNPs associated with lower ability to pay attention or maintain self-control) might result in those with lower polygenic indexes now achieving just as much education, on average, as those with higher polygenic indexes.

As sample sizes for GWAS continue to grow, it will likely be possible to construct polygenic indexes for many outcomes whose predictive power comes closer to the total amount of variation that is theoretically predictable from additive effects of common SNPs for those outcomes (the upper bounds given above). Even these levels of predictive power would pale in comparison to some other scientific predictors. For example, professional weather forecasts correctly predict about 95% of the variation in day-to-day temperatures. Weather forecasters are therefore vastly more accurate forecasters than social science geneticists will ever be.

Note: Polygenic indexes created by GWASs are increasingly used by commercial and research direct-to-consumer platforms to predict individual outcomes. We recognize that returning individual genomic “results” can be a fun way to engage people in research and other projects and has at least the theoretical potential to stoke their interest in, and educate them about, genomics and how genes and environments interact. But it is important that participants/users understand that, at present, most of these individual results, including all social and behavioral outcomes, are not meaningful predictions (in the sense that they generally have very little predictive power at the individual level). Failure to make this point clear risks sowing confusion and undermining trust in genetics research.

3.4. Can the polygenic indexes accurately be used for research studies in non-European-ancestry populations?

No. We constructed polygenic indexes only for individuals classified as “European ancestry.” (The precise definition of “European ancestry” differs in different datasets, but it usually means that a person’s pattern of genetic variation across the genome is statistically close to the average pattern from a “reference sample” for some European country. The reference samples used by geneticists are based on samples of people who live in the European country today and whose recent ancestors also lived in that country.) Therefore, the Polygenic Index Repository only includes polygenic indexes for these individuals.

The main reason we only constructed polygenic indexes for these individuals is that the polygenic indexes are likely to be much less predictive (and hence much less useful) in a sample of people of non-European ancestries. That is because our original GWAS data was obtained from samples of people of European ancestries, and GWAS results have been found to have only limited portability across ancestries (Belsky et al., 2013; Domingue et al., 2015, 2017; Martin et al., 2017; Vassos et al., 2017). There are a number of reasons for the limited portability. For one thing, the set of SNPs that are associated with an outcome in people of European ancestries is unlikely to overlap closely with the set of SNPs associated with the outcome in people of non-European ancestries. And even if a given SNP is associated in both ancestry groups, the effect size (in other words, the strength of the association) will almost surely differ. This is primarily because linkage disequilibrium (LD) patterns (i.e., the correlation structure of the genome) vary by ancestry. A SNP may be associated with the outcome because it is in LD (i.e., correlated) with a SNP elsewhere in the genome that causally affects the outcome (see FAQ 1.5). If the strength of the correlation is greater in one ancestry group than in another, then the size of the association will be larger in that ancestry group. Moreover, even if LD patterns were similar in each ancestry group, the association may differ in different groups because environmental conditions differ (see FAQ 1.6). The fact that there are differences across ancestry groups in the set of associated SNPs and their effect sizes means that the weights for constructing polygenic indexes in European-ancestry individuals (FAQ 1.3) would be the “wrong” weights for non-European-ancestry individuals. For a more extensive, excellent discussion of these and related issues, see Graham Coop’s blog post, “Polygenic scores and tea drinking.”
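The role of LD can be made concrete with a little arithmetic. For standardized genotypes, if a measured SNP has no effect of its own and is correlated at level r with a single causal SNP of effect β, the association observed at the measured SNP is approximately r × β. The sketch below applies this to two hypothetical ancestry groups; the correlation values and the effect size are made-up numbers for illustration only.

```python
def tag_snp_association(causal_effect, ld_correlation):
    """Expected association at a non-causal SNP that is correlated
    (in LD) with one causal SNP, assuming standardized genotypes."""
    return ld_correlation * causal_effect

causal_effect = 0.05  # hypothetical effect of the causal SNP, same in both groups
for group, r in [("group A", 0.9), ("group B", 0.3)]:
    print(group, round(tag_snp_association(causal_effect, r), 4))
```

In this made-up example, weights estimated in group A would overstate this SNP's association in group B by a factor of three, even though the underlying causal effect is identical in both groups.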

Unfortunately, this attenuation of predictive power means that for non-European-ancestry populations, many of the benefits of having a polygenic index available will have to wait until large GWASs are conducted using samples from these populations. (Currently, most large genotyped samples are of European ancestries.) We intend that future versions of the Polygenic Index Repository will include polygenic indexes for non-European-ancestry populations, once it becomes possible to produce polygenic indexes with adequate predictive power. We believe that the relative scarcity of polygenic indexes that can be used for research that focuses on non-European ancestry groups is a disparity that should be rapidly eliminated by prioritizing GWASs that focus on non-European populations.

3.5. Would it be appropriate to use the Repository social and behavioral polygenic indexes in policy or practice?

No. We reiterate that polygenic indexes are poor predictors of social and behavioral outcomes (see FAQs 2.3 and 3.3). Their incremental predictive power over and above other, non-genetic predictors that are already used is even smaller than a polygenic index’s predictive power on its own. Moreover, the predictive power of the polygenic indexes for social and behavioral outcomes depends on the environment in which the GWAS participants live (FAQ 3.3). Thus, enshrining polygenic indexes in policy risks basing policy (which can be difficult to change) on weak predictions that could become even weaker or nonexistent as the environment changes. Furthermore, the polygenic indexes can operate through environmental channels (FAQ 3.2). Allocating resources based on polygenic indexes could therefore exacerbate inequalities that were originally due to environmental disparities (a similar risk to that of other biased algorithms that bake in pre-existing discrimination). Using polygenic indexes in order to prioritize giving resources to individuals who are already advantaged would further limit the opportunities of individuals who are disadvantaged, which would be ethically inappropriate. Finally, even if polygenic indexes were used to offer additional resources to disadvantaged individuals, any small potential benefits of using such weak individual predictors would almost certainly be offset by the risk of stigmatization and by the fact that this technology is currently only accessible to people of European ancestries (FAQ 3.4). For all these reasons, we are deeply skeptical that the Repository social and behavioral polygenic indexes have any appropriate role to play in policy now or in the foreseeable future.

3.6. Could research on polygenic indexes lead to discrimination against, or stigmatization of, people with higher or lower polygenic indexes for certain outcomes? If so, why facilitate the spread of polygenic indexes?

Unfortunately, like a great deal of research—including, for instance, research identifying genomic variation associated with increased cancer risk—the results can be misunderstood and misapplied. This includes being used to discriminate against those with higher or lower polygenic indexes for certain outcomes (e.g., in insurance markets). Nevertheless, for a variety of reasons, in this instance, we do not think that the best response to the possibility that useful knowledge could be misused is to refrain from producing the knowledge. Moreover, many researchers already have access to and use polygenic indexes; against this background, the Repository helps ensure that a much wider array of researchers have the same opportunity to access and probe these research tools, and also that the polygenic indexes themselves will be more accurate. Here, we briefly discuss some of the broad potential benefits of this research. We then describe what we see as our ethical duty as researchers conducting this work.

First, one benefit of conducting social-science genetics research in ever larger samples is that doing so allows us to correct the scientific record. An important theme in our earlier work has been to point out that most existing studies in social-science genetics that report genetic associations with behavioral outcomes have serious methodological limitations, fail to replicate, and are likely to be false-positive findings (Benjamin et al., 2012; Chabris et al., 2012, 2015). This same point was made in an editorial in Behavior Genetics (the leading journal for the genetics of behavioral outcomes), which stated that “it now seems likely that many of the published [behavior genetics] findings of the last decade are wrong or misleading and have not contributed to real advances in knowledge” (Hewitt, 2012). One of the most important reasons why earlier work has generated unreliable results is that the sample sizes were far too small, given that the true effects of individual SNPs on behavioral outcomes are tiny. Pre-existing claims of genetic associations with complex social-science outcomes have reported widely varying effect sizes, many of them purporting to “predict” as much of the variation across individuals as do the polygenic indexes we construct in this paper that aggregate the effects of millions of SNPs.

Second, behavioral genetics research also has the potential to correct the social record and thereby to help combat discrimination and stigmatization. For instance, overestimating the role of genetics can be damaging, and the present work can help debunk the myth of genetic determinism. By quantifying how various outcomes are predicted by genetic data, we show that for all of the outcomes we study, the genetic data can explain a very small fraction of the variation across individuals (see FAQ 2.3). By clarifying the limits of deterministic views of complex outcomes, recent behavioral genetics research—if communicated responsibly—could make appeals to genetic justifications for discrimination and stigmatization less persuasive to the public in the future.

Third, behavioral genetics research has the potential to yield many other benefits, especially as sample sizes continue to increase; these are briefly summarized in FAQ 1.9. Forgoing this research necessarily entails forgoing these and any other possible benefits, some of which will likely be the result of serendipity. Indeed, very few of the uses of polygenic indexes were anticipated when they were first proposed (Wray, Goddard and Visscher, 2007).

In sum, we agree with the U.K. Nuffield Council on Bioethics, which concluded in a report (Nuffield Council on Bioethics, 2002, p. 114) that “research in behavioural genetics has the potential to advance our understanding of human behaviour and that the research can therefore be justified,” but that “researchers and those who report research have a duty to communicate findings in a responsible manner” (see FAQ 3.7).

3.7. What have you done to mitigate the risks of research using Repository polygenic indexes?

In our view, the responsible behavioral genetics research called for by the Nuffield Council on Bioethics (see FAQ 3.6) includes sound methodology and analysis of data (e.g., only conducting analyses that are adequately powered and, when feasible, preregistering power calculations and planned analyses); a commitment to publish all results, including any negative results; and transparent, complete reporting of methodology and findings in publications, presentations, and communications with the media and the public. A critical aspect of the latter is particular vigilance regarding what research results do—and do not—show, and how polygenic indexes can—and cannot—be appropriately used. In an effort to reduce the risk that its results might be misinterpreted by readers, misreported by the media, or misused, the SSGAC has developed and publicly posted FAQs like this document with every major paper it has published since its first paper in 2013.

 

In addition, the SSGAC will require researchers who download the SNP weights for constructing polygenic indexes to agree to Terms of Service. Among the many terms that we require researchers to agree to, we highlight two here:

I agree to conduct research that strictly adheres to the principles articulated by the American Society of Human Genetics (ASHG) position statement: “ASHG Denounces Attempts to Link Genetics and Racial Supremacy.” (See also International Genetic Epidemiological Society Statement on Racism and Genetic Epidemiology.) In particular, I will not use these data to make comparisons across ancestral groups. Such comparisons could animate biological conceptualizations of racial superiority. In addition, such comparisons are usually scientifically confounded due to the effects of linkage disequilibrium, gene-environment correlation, gene-environment interactions, and other methodological problems.

I have read the principles articulated by the ASHG with respect to “Advancing Diverse Participation in Research with Special Consideration for Vulnerable Populations”. I agree to adhere to the principles articulated in the final two sections of this statement, “In the Conduct of Research with Vulnerable Populations, Researchers Must Address Concerns that Participation May Lead to Group Harm” and “The Benefits of Research Participation Are Profound, Yet the Potential Danger that Unethical Application of Genetics Might Stigmatize, Discriminate against, or Persecute Vulnerable Populations Persists.” 

These Terms of Service stem from the observation that SNP associations are not necessarily causal (see FAQ 1.5) and depend on the environment of the individuals included in the GWAS (see FAQ 1.6). Different ancestry groups arise in the population because they became partially separated from each other many generations ago, for example, due to geographic factors or social forces. When two groups are geographically or socially separated, they also face different environments, which not only may have direct effects on certain outcomes (such as disease risk) but may also change the strength of the association between the outcomes and certain SNPs. Therefore, when individuals from two ancestry groups have different average outcomes, it is extremely difficult to identify whether the difference is due to average genetic differences between the groups or to the different environments faced by the groups. For this reason, it is scientifically invalid to make general statements about ancestry group differences based on SNP associations identified in a GWAS. (Also see FAQ 3.2.) The Terms of Service also require users to securely store the data and to immediately report any breach of the Terms.

 

Finally, we have developed and provided to participating data providers a User Guide to be distributed to researchers who use Repository polygenic indexes (see FAQ 2.5). We will also provide the User Guide to researchers who download the SNP weights. One section of the User Guide discusses six “interpretational considerations” that are likely to arise when conducting research with polygenic indexes and which we urge researchers to seriously consider as a critical part of responsibly conducting and communicating their research. One recurring ethical concern about genetic research is the tendency for its predictive power to become exaggerated in the media and in the public’s minds, at the expense of a more nuanced understanding of how genes and environment interact, the importance of environmental influences, and the ability of interventions to improve outcomes. Many of the interpretational considerations we discuss in the User Guide involve how to anticipate and address potential confounds and how to navigate complex questions about causality and ensure responsible communication of causality.

For instance, the User Guide cautions researchers to appreciate and communicate that associations between a polygenic index and an outcome may operate through environmental (rather than biological) mechanisms (see FAQs 3.2 and 3.3).

4. References

Abdellaoui, A. et al. (2019). Genetic correlates of social stratification in Great Britain. Nature Human Behaviour, 3 (12), 1332–1342. Available from https://doi.org/10.1038/s41562-019-0757-5.

Amos, C.I. et al. (2008). Genome-wide association scan of tag SNPs identifies a susceptibility locus for lung cancer at 15q25.1. Nature Genetics, 40, 616–622. Available from https://doi.org/10.1038/ng.109.

 

Barcellos, S.H., Carvalho, L.S. and Turley, P. (2018a). Education can reduce health differences related to genetic risk of obesity. Proceedings of the National Academy of Sciences, 115 (42), E9765. Available from https://doi.org/10.1073/pnas.1802909115.

Barcellos, S.H., Carvalho, L.S. and Turley, P. (2018b). Education can Reduce Health Disparities Related to Genetic Risk of Obesity: Evidence from a British Reform. bioRxiv. Available from https://doi.org/10.1101/260463.

Belsky, D.W. et al. (2013). Development and evaluation of a genetic risk score for obesity. Biodemography and Social Biology, 59 (1), 85–100. Available from https://doi.org/10.1080/19485565.2013.774628.

Belsky, D.W. et al. (2016). The Genetics of Success. Psychological Science, 27 (7), 957–972. Available from https://doi.org/10.1177/0956797616643070.

Benjamin, D.J. et al. (2012). The Promises and Pitfalls of Genoeconomics. Annual Review of Economics, 4 (1), 627–662. Available from https://doi.org/10.1146/annurev-economics-080511-110939.

Chabris, C.F. et al. (2012). Most reported genetic associations with general intelligence are probably false positives. Psychological Science, 23 (11), 1314–1323. Available from https://doi.org/10.1177/0956797611435528.

Chabris, C.F. et al. (2015). The Fourth Law of Behavior Genetics. Current Directions in Psychological Science, 24 (4), 304–312. Available from https://doi.org/10.1177/0963721415580430.

Cohen, J. (1992). Statistical Power Analysis. Current Directions in Psychological Science, 1 (3), 98–101. Available from https://doi.org/10.1111/1467-8721.ep10768783.

Davies, N.M. et al. (2018). The causal effects of education on health outcomes in the UK Biobank. Nature Human Behaviour. Available from https://doi.org/10.1038/s41562-017-0279-y.

Domingue, B.W. et al. (2015). Polygenic Influence on Educational Attainment: New evidence from The National Longitudinal Study of Adolescent to Adult Health. AERA Open, 1 (3), 1–13. Available from https://doi.org/10.1177/2332858415599972.

Domingue, B.W. et al. (2017). Mortality selection in a genetic sample and implications for association studies. International Journal of Epidemiology, 46 (4), 1285–1294. Available from https://doi.org/10.1093/ije/dyx041.

Domingue, B.W. et al. (2018). Geographic Clustering of Polygenic Scores at Different Stages of the Life Course. RSF: The Russell Sage Foundation Journal of the Social Sciences, 4 (4), 137–149. Available from https://doi.org/10.7758/RSF.2018.4.4.08.

Goldberger, A.S. (1979). Heritability. Economica, 46 (184), 327–347.

Hewitt, J.K. (2012). Editorial policy on candidate gene association and candidate gene-by-environment interaction studies of complex traits. Behavior Genetics, 42 (1), 1–2. Available from https://doi.org/10.1007/s10519-011-9504-z.

Howe, L.J. et al. (2021). Within-sibship GWAS improve estimates of direct genetic effects. bioRxiv, 2021.03.05.433935. Available from https://doi.org/10.1101/2021.03.05.433935.

Hung, R.J. et al. (2008). A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit genes on 15q25. Nature. Available from https://doi.org/10.1038/nature06885.

Jencks, C. (1980). Heredity, environment, and public policy reconsidered. American Sociological Review, 45 (5), 723–736.

Khera, A. V. et al. (2018). Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nature Genetics, 50 (9), 1219–1224. Available from https://doi.org/10.1038/s41588-018-0183-z.

Khera, A. V. et al. (2019). Polygenic Prediction of Weight and Obesity Trajectories from Birth to Adulthood. Cell, 177 (3), 587-596.e9. Available from https://doi.org/10.1016/j.cell.2019.03.028.

Koellinger, P.D. and Harden, K.P. (2018). Using nature to understand nurture: Genetic associations show how parenting matters for children’s education. Science, 359 (6374), 386–387. Available from https://doi.org/10.1126/science.aar6429.

Kong, A. et al. (2018). The nature of nurture: Effects of parental genotypes. Science, 359 (6374), 424–428. Available from https://doi.org/10.1126/science.aan6877.

Lambert, S.A. et al. (2020). The Polygenic Score Catalog: an open database for reproducibility and systematic evaluation. medRxiv, 2020.05.20.20108217. Available from https://doi.org/10.1101/2020.05.20.20108217.

Lander, E.S. and Schork, N.J. (1994). Genetic dissection of complex traits. Science, 265, 2037–48.

Lee, J.J. et al. (2018). Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nature Genetics, 50 (8), 1112–1121. Available from https://doi.org/10.1038/s41588-018-0147-3.

Martin, A.R. et al. (2017). Human Demographic History Impacts Genetic Risk Prediction across Diverse Populations. American Journal of Human Genetics, 100 (4), 635–649. Available from https://doi.org/10.1016/j.ajhg.2017.03.004.

Nuffield Council on Bioethics. (2002). Genetics and human behaviour: the ethical context. London: Nuffield Council on Bioethics [http://nuffieldbioethics.org/wp-content/uploads/2014/07/Genetics-and-human-behaviour.pdf].

Purcell, S.M. et al. (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460 (7256), 748–752. Available from https://doi.org/10.1038/nature08185.

Robinson, M.R. et al. (2017). Genetic evidence of assortative mating in humans. Nature Human Behaviour. Available from https://doi.org/10.1038/s41562-016-0016.

Schmitz, L.L. and Conley, D. (2017). The effect of Vietnam-era conscription and genetic potential for educational attainment on schooling outcomes. Economics of Education Review, 61, 85–97. Available from https://doi.org/10.1016/j.econedurev.2017.10.001.

Thorgeirsson, T.E. et al. (2008). A variant associated with nicotine dependence, lung cancer and peripheral arterial disease. Nature, 452 (7187), 638–642. Available from https://doi.org/10.1038/nature06846.

Turkheimer, E. (2000). Three laws of behavior genetics and what they mean. Current Directions in Psychological Science, 9 (5), 160–164.

Turley, P. et al. (2018). Multi-trait analysis of genome-wide association summary statistics using MTAG. Nature Genetics, 50 (2), 229–237. Available from https://doi.org/10.1101/118810.

Vassos, E. et al. (2017). An Examination of Polygenic Score Risk Prediction in Individuals With First-Episode Psychosis. Biological Psychiatry, 81 (6), 470–477. Available from https://doi.org/10.1016/j.biopsych.2016.06.028.

Vilhjálmsson, B.J. et al. (2015). Modeling linkage disequilibrium increases accuracy of polygenic risk scores. The American Journal of Human Genetics, 97 (4), 576–592.

Visscher, P.M. et al. (2017). 10 Years of GWAS Discovery: Biology, Function, and Translation. American Journal of Human Genetics, 101 (1), 5–22. Available from https://doi.org/10.1016/j.ajhg.2017.06.005.

Wray, N.R., Goddard, M.E. and Visscher, P.M. (2007). Prediction of individual genetic risk to disease from genome-wide association studies. Genome research, 17 (10), 1520–1528. Available from https://doi.org/10.1101/gr.6665407.

Yengo, L. et al. (2018). Imprint of Assortative Mating on the Human Genome. Nature Human Behaviour, 2 (12), 948–954. Available from https://doi.org/10.1038/s41562-018-0476-3.


FAQs about “Genome-wide association analyses of risk tolerance and risky behaviors in over 1 million individuals identify hundreds of loci and shared genetic influences”

 
 
 
 


Use the quick link menu to jump to a specific question, or scroll down to read all FAQs for this publication. 

 

This document provides information about the study:

 

Karlsson Linnér et al. 2019. “Genome-wide association analyses of risk tolerance and risky behaviors in over 1 million individuals identify hundreds of loci and shared genetic influences.” Nature Genetics.

The document was prepared by Jonathan P. Beauchamp, Daniel J. Benjamin, Richard Karlsson Linnér, Philipp D. Koellinger, and Michelle N. Meyer. It draws from and builds on the FAQs for earlier SSGAC papers. It has the following sections:

          1. Background

          2. Study design and results

          3. Social and ethical implications of the study

          4. Appendices

For clarifications or additional questions, please contact Jonathan P. Beauchamp (jonathan.pierre.beauchamp@gmail.com).

Quick Links

1.1.  Who conducted this study? What was the group’s overarching goal?

1.2.   The current study focuses on a variable called "general risk tolerance." What is general risk tolerance?

1.3.  What was already known about the genetics of risk tolerance prior to this study?

2.1.  What did you do in this paper? How was the study designed?

2.2.  What did you find in the GWAS?

2.3.  Are the SNPs associated with higher risk tolerance in your study also associated with other phenotypes?

2.4.  How much of a particular person's risk tolerance can be predicted from the results of this paper?

2.5.  What do your results tell us about human biology and brain development?

2.6.  How do your results relate to previous research on the genetics of risk tolerance?

3.1.  Did you find “the gene for” (or "the genes for") risk tolerance?

3.2.  Does this study show that an individual's level of risk tolerance is determined and fixed at conception?

3.3.  Can you use the results in this paper to meaningfully predict a particular person's risk tolerance?

3.4.  Can environmental factors modify the effects of the specific SNPs you identified?

3.5.  What policy lessons or practical advice do you draw from this study?

3.6.  Could this kind of research lead to discrimination against, or stigmatization of, people with specific genetic variants? If so, why conduct this research?

Appendix 1:  Quality control measures

Appendix 2:  Additional reading and references

1. Background

1.1.  Who conducted this study? What was the group's overarching goal?

The authors are members of the Social Science Genetic Association Consortium (SSGAC). The SSGAC is a multi-institutional, multi-disciplinary, international research group that aims to identify statistically robust links between genetic variants (for instance, base-pairs of DNA that vary across people) and phenotypes of interest to social scientists. A “phenotype” refers to anything that may be influenced by DNA, such as disease risk or physical characteristics. The phenotypes of interest to social scientists include behaviors, preferences, personality traits, and socioeconomic outcomes.

 

The SSGAC was formed in 2011 to overcome a specific set of scientific challenges. As is now well understood (Chabris et al. 2015), most phenotypes—including virtually all social-science phenotypes—are influenced by hundreds or thousands of genetic variants. Although in combination their collective effects can be sizeable, almost every one of these genetic variants has an extremely small effect on its own. To reliably identify these individual variants, therefore, scientists must study large samples; typically, hundreds of thousands of individuals are required. One approach to obtaining a large enough sample is for many research groups to pool analyses of their data into a single, large study. This approach has borne considerable fruit when used by medical geneticists interested in a range of diseases and conditions (Visscher et al. 2017a). Most of these advances would not have been possible without large research collaborations between multiple research groups interested in similar questions. The SSGAC was formed in an attempt by social scientists to adopt this research model.
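A rough power calculation illustrates why such large samples are needed. The sketch below is not the SSGAC's preregistered calculation. It assumes a single variant that accounts for a hypothetical 0.01% of the variation in a phenotype, uses the conventional genome-wide significance threshold of 5e-8 (an assumption not stated in this FAQ), and relies on a normal approximation to the test statistic.

```python
from statistics import NormalDist

def gwas_power(n, r2, alpha=5e-8):
    """Approximate power to detect one variant in a sample of n people.

    r2    : share of phenotypic variance explained by the variant
            (a hypothetical value chosen for illustration)
    alpha : two-sided significance threshold (5e-8 is the conventional
            genome-wide threshold, assumed here)
    Normal approximation: the z-statistic is centered at
    sqrt(n * r2 / (1 - r2)) when the variant truly has this effect.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = (n * r2 / (1 - r2)) ** 0.5
    upper_tail = 1 - std_normal.cdf(z_crit - shift)
    lower_tail = std_normal.cdf(-z_crit - shift)
    return upper_tail + lower_tail

# Power rises from essentially zero to high only at very large n.
for n in (5_000, 100_000, 500_000):
    print(n, round(gwas_power(n, r2=0.0001), 3))
```

Under these assumptions, a sample of a few thousand people has almost no chance of detecting such a variant, while a sample of several hundred thousand detects it most of the time.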

 

The SSGAC is organized as a working group of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), a successful medical consortium. It was founded by three social scientists—Daniel Benjamin (University of Southern California), David Cesarini (New York University), and Philipp Koellinger (Vrije Universiteit Amsterdam)—who believed that genetic data could have a substantial positive impact on research in the social sciences, and that social-science genetics could make important contributions to medical research. The Advisory Board for the SSGAC is composed of prominent researchers representing various disciplines: Dalton Conley (Sociology, New York University), George Davey Smith (Epidemiology, University of Bristol), Tõnu Esko (Molecular Genetics, Broad Institute and Estonian Genome Center), Albert Hofman (Epidemiology, Harvard University), Robert Krueger (Psychology, University of Minnesota), David Laibson (Economics, Harvard University), James Lee (Psychology, University of Minnesota), Sarah Medland (Statistical Genetics, QIMR Berghofer Medical Research Institute), Michelle Meyer (Bioethics, Geisinger Health System), and Peter Visscher (Statistical Genetics, University of Queensland).

 

The SSGAC is committed to the principles of reproducibility and transparency. Prior to conducting genetic association studies, power calculations are carried out to determine the necessary sample size for the analysis (assuming realistically small effect sizes associated with individual genetic variants). These, together with an analysis plan, are often preregistered on the Open Science Framework (OSF) [The analysis plan for this study can be downloaded here: https://osf.io/cjx9m/]. Major SSGAC publications are usually accompanied by a FAQ document (such as this one). The FAQ document is written to communicate to journalists and the public what was found and what can and cannot be concluded from the research findings.

The SSGAC’s first major project was a genome-wide association study (GWAS) of educational attainment published in Science (Rietveld et al. 2013b). The study is summarized in a FAQ posted on the SSGAC website (https://www.thessgac.org/faqs). The study was followed by two related studies, using successively much larger samples, published in Nature (Okbay et al. 2016b) and Nature Genetics (Lee et al. 2018). Subsequent SSGAC papers have studied subjective well-being, depressive symptoms, the personality trait neuroticism, cognitive performance, and reproductive behavior. These papers have been published in Nature Genetics (Barban et al. 2016, Okbay et al. 2016a), Proceedings of the National Academy of Sciences (Rietveld et al. 2013a, 2014b), and Psychological Science (Chabris et al. 2012, Rietveld et al. 2014a), among other journals. The present study is the SSGAC’s first study that focuses on the genetics of general risk tolerance.

1.2.  The current study focuses on a variable called “general risk tolerance.” What is general risk tolerance?

Risk pervades many aspects of human life and is a central concept in the study of decision-making and behavior. Somewhat surprisingly, then, there is no universally agreed-upon definition of “risk.” For our purposes, we define “risk” as the degree of variability in possible outcomes, and “risk tolerance” as a person’s willingness to choose options that entail more risk, typically to have the chance of obtaining a more rewarding outcome. For example, an engineer with a high degree of risk tolerance would be more willing to quit her job at a stable, large corporation and join a risky start-up. An individual with a high degree of risk tolerance may also be more likely to drive faster than the speed limit on a highway, thus incurring a higher risk of having an accident or a traffic ticket in order to save time.

 

An individual’s risk tolerance typically varies across domains of behavior. For instance, an individual may be willing to take relatively more risks in the career and financial domains, but not in the health and leisure domains. Nonetheless, individuals with greater risk tolerance in one domain are statistically more likely to exhibit greater risk tolerance in other domains as well. For this reason, survey-based measures of general risk tolerance—defined as a person’s general willingness to take risks—have been used as all-around predictors of risky behaviors such as portfolio allocation, occupational choice, smoking, drinking alcohol, and starting one’s own business (Beauchamp et al. 2017, Dohmen et al. 2011, Falk et al. 2015). In our study, we analyze a measure of general risk tolerance based on responses to questions such as: “Would you describe yourself as someone who takes risks? Yes / No.” The exact phrasing and number of response categories varied across the study cohorts, but all questions asked subjects about their overall or general attitudes toward risk.

1.3.  What was already known about the genetics of risk tolerance prior to this study?

Researchers have found that identical twins (who share all of their genes) tend to be more similar to one another in terms of their risk tolerance than fraternal twins (who share, on average, only half of their genes), which suggests that genetic factors influence risk tolerance. With some assumptions, it is possible to translate the greater similarity of identical twins into an estimate of the “heritability” of risk tolerance. The heritability of risk tolerance is the percentage of the variation in risk tolerance among individuals that can be accounted for statistically by genetic differences, given current environmental conditions. Estimates from twin studies suggest that risk tolerance is moderately heritable (~30%) (Beauchamp et al. 2017, Cesarini et al. 2009, Harden et al. 2017). We note, however, that such estimates are based on several assumptions and vary across studies, in part because different studies use different measures of risk tolerance as well as different assumptions and methods.
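One textbook way of making this translation is Falconer's formula, which doubles the difference between the identical-twin and fraternal-twin correlations. The sketch below uses that formula only as an illustration; the cited twin studies may rely on more elaborate models, and the two correlations are hypothetical values chosen to reproduce an estimate of about 30%. The formula assumes, among other things, that identical and fraternal twins share their environments to the same degree and that genetic effects are additive.

```python
def falconer_heritability(r_mz, r_dz):
    """Falconer's formula: h^2 = 2 * (r_MZ - r_DZ).

    r_mz : phenotypic correlation between identical (monozygotic) twins
    r_dz : phenotypic correlation between fraternal (dizygotic) twins
    """
    return 2 * (r_mz - r_dz)

# Hypothetical twin correlations, for illustration only.
h2 = falconer_heritability(r_mz=0.30, r_dz=0.15)
print(f"Estimated heritability: {h2:.0%}")
```

Because the estimate depends directly on the measured twin correlations, differences in how risk tolerance is measured across studies carry through to the heritability estimate.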

As we further discuss in FAQ 2.2, the current study also estimated the “SNP heritability” of risk tolerance, which is the percentage of the variation in risk tolerance among individuals that can be accounted for statistically by “common SNPs” (a type of genetic variant; see FAQ 2.1 for details), given current environmental conditions. Our estimate suggests that common SNPs account for only ~5% to 9% of the variation in risk tolerance across individuals. Importantly, while these heritability estimates all suggest that genetic factors influence risk tolerance, we emphasize that this does not imply that risk tolerance is pre-determined at birth or that genetic factors act independently of the environment, as we discuss below in FAQs 3.2 and 3.4.

Risk tolerance has been one of the most studied phenotypes in social science genetics. To date, however, nearly all published studies attempting to discover the genetic variants associated with risk tolerance have been “candidate-gene studies” conducted in relatively small samples, ranging from a few hundred to a few thousand individuals. A candidate-gene study tests the associations between a phenotype of interest and a few selected genetic variants that are hypothesized to be associated with the phenotype. Though there is nothing wrong in principle with such studies, we now know that the sample sizes of the candidate-gene studies for risk tolerance and other behavioral traits were probably too small to robustly identify genetic variants (Chabris et al. 2012, Hewitt 2012). [As mentioned above, it is now well established that the bulk of the genetic variation in the vast majority of behavioral phenotypes is attributable to a large number of genetic variants, each having a very small effect (Chabris et al. 2015). For that reason, large samples are needed to detect individual genetic variants.] Indeed, as we explain in FAQ 2.6, we used our own results to assess the evidence in favor of the main biological pathways and genetic variants which previous candidate-gene studies had hypothesized or reported to relate to risk tolerance. Although our sample was several orders of magnitude larger than the samples used in the candidate-gene studies, we found no evidence that these biological pathways and genetic variants are associated with risk tolerance.

To the best of our knowledge, prior to our study there had only been two studies with samples that were large enough to provide sufficient statistical power to robustly detect genetic variants with small effect sizes (Day et al. 2016, Strawbridge et al. 2018). From these studies, only two genetic variants associated with risk tolerance had been identified.

In summary, when our study was initiated, despite much interest, little was known about which genetic variants are related to risk tolerance.

Back to Top


2. Study design and results

2.1.  What did you do in this paper? How was the study designed?

We performed the largest-to-date genome-wide association study (GWAS) of risk tolerance. In a GWAS, scientists look across the human genome for genetic variants that are associated with a phenotype of interest. If a genetic variant is associated, then individuals who have a certain “allele” (i.e., a certain version of that variant) are more likely than those with a different allele to exhibit a phenotype (in this case, higher general risk tolerance).
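To make the idea of a single-SNP association test concrete, here is a minimal sketch that regresses a phenotype on the number of copies of one allele (0, 1, or 2) that each person carries. It is an illustration only: the data are simulated, the function name `snp_association` and all numeric values are hypothetical, and a real GWAS also adjusts for covariates and repeats this test across millions of SNPs.

```python
import random
from statistics import NormalDist, fmean


def snp_association(allele_counts, phenotype):
    """Regress the phenotype on the allele count (0, 1 or 2) of one SNP.

    Returns the estimated per-allele effect, its standard error, and a
    two-sided p-value based on the normal approximation.
    """
    n = len(allele_counts)
    mean_g = fmean(allele_counts)
    mean_y = fmean(phenotype)
    sxx = sum((g - mean_g) ** 2 for g in allele_counts)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(allele_counts, phenotype))
    beta = sxy / sxx
    residual_ss = sum(
        (y - mean_y - beta * (g - mean_g)) ** 2
        for g, y in zip(allele_counts, phenotype)
    )
    se = (residual_ss / (n - 2) / sxx) ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(beta / se)))
    return beta, se, p_value


# Simulated toy data (hypothetical values, not results from the study).
rng = random.Random(0)
n_people, allele_freq, true_effect = 20_000, 0.3, 0.05
allele_counts = [
    (rng.random() < allele_freq) + (rng.random() < allele_freq)
    for _ in range(n_people)
]
phenotype = [true_effect * g + rng.gauss(0, 1) for g in allele_counts]

beta, se, p_value = snp_association(allele_counts, phenotype)
print(f"beta = {beta:.3f}, se = {se:.3f}, p = {p_value:.2g}")
```

Because the simulated effect is small relative to the noise, the estimate is only detectable when the sample is large, which mirrors the reason GWAS require very large samples.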

We chose a GWAS study design because it has been a successful research strategy for identifying genetic variants associated with many traits and diseases, including body height (Wood et al. 2014), BMI (Locke et al. 2015), Alzheimer’s disease (Lambert et al. 2013), and schizophrenia (Ripke et al. 2014). GWAS have also recently been used to identify genetic variants associated with a variety of health-relevant social science outcomes, such as the number of children a person has (Barban et al. 2016), happiness (Okbay et al. 2016a, Turley et al. 2018), and educational attainment (Okbay et al. 2016b, Rietveld et al. 2013b). Furthermore, scientists who have attempted to replicate reported GWAS associations in independent samples of sufficiently large size have typically been successful (Visscher et al. 2017b), thereby indicating that GWAS associations are robust findings. 


In our GWAS of general risk tolerance, we tested ~9.3M single nucleotide polymorphisms (SNPs) from across the human genome for association with general risk tolerance. SNPs are the most common type of genetic variant in the genome and are the genetic variants that are captured by the genetic data used in our study and most other modern genome-wide association studies. (There are other types of genetic variants, which we did not analyze.) Some SNPs have alleles that are relatively common in the population and are called “common SNPs,” while other SNPs have one allele that is rare in the population; our GWAS analyzed both common SNPs and some rare SNPs.


As mentioned above, genetic variants associated with social-science phenotypes tend to have very small individual effects on the phenotypes. Therefore, in order to have sufficient statistical power to discover SNPs associated with risk tolerance, we pooled the results from analyses of two very large datasets, the UK Biobank (n = 431,126 individuals) and a dataset of research participants from 23andMe (n = 508,782 individuals), thereby yielding a “discovery” sample of 939,908 individuals. We replicated the findings from this discovery sample in a “replication” sample comprising ten smaller datasets and totaling 35,445 individuals. In all of these samples, to avoid the statistical confounding that arises from studying ethnically diverse populations, we restricted our GWAS to individuals of European ancestries. (For a somewhat technical explanation, see Appendix 1.)
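The following sketch illustrates what "pooling results" from two datasets can look like for a single SNP. It uses inverse-variance weighting, which is one common meta-analysis scheme; this FAQ does not state which weighting scheme the study used, so treat the choice as an assumption. The per-cohort effect estimates and standard errors are hypothetical. Only the two sample sizes come from the text above.

```python
from statistics import NormalDist


def ivw_meta(estimates):
    """Fixed-effects, inverse-variance-weighted pooling for one SNP.

    `estimates` is a list of (beta, standard_error) pairs, one per cohort.
    Returns the pooled beta, its standard error, and a two-sided p-value.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    se = (1.0 / total_weight) ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(beta / se)))
    return beta, se, p_value


# Sample sizes reported in the FAQ for the two discovery cohorts.
n_uk_biobank, n_23andme = 431_126, 508_782
n_discovery = n_uk_biobank + n_23andme  # 939,908

# Hypothetical per-cohort estimates for one SNP (not study results).
cohort_estimates = [(0.012, 0.0040), (0.009, 0.0035)]
pooled_beta, pooled_se, pooled_p = ivw_meta(cohort_estimates)
print(n_discovery, round(pooled_beta, 4), round(pooled_se, 4), pooled_p)
```

The pooled standard error is smaller than either cohort's standard error, which is the sense in which combining datasets increases statistical power.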


We used the results of our GWAS of general risk tolerance for a wide range of additional analyses. For example, to examine the extent to which SNPs that are associated with risk tolerance also tend to be associated with other phenotypes, we estimated “genetic correlations” between risk tolerance and a wide range of phenotypes (see FAQ 2.3). In addition, in several samples of genotyped individuals, we used individuals’ SNP data and the results of our GWAS to construct “polygenic scores” that partially predict individuals’ risk tolerance based on their SNP data (see FAQ 2.4). We also performed a suite of bioinformatics analyses to get insight into the biology of risk tolerance (see FAQs 2.5 and 2.6).
 

In addition to our GWAS of general risk tolerance, we conducted six supplementary GWAS of phenotypes related to risk tolerance and risk-taking behaviors. We conducted a GWAS of “adventurousness,” defined as the self-reported tendency to be adventurous vs. cautious. We also conducted GWAS of four risky behaviors that each plausibly capture risk taking in a different domain of behavior: “automobile speeding propensity” (the tendency to drive faster than the speed limit), “drinks per week” (the average number of alcoholic drinks consumed per week), “ever smoker” (whether one has smoked more than once or twice), and “number of sexual partners” (the lifetime number of sexual partners). Finally, we conducted a GWAS of the first principal component of the four risky behaviors. (The first principal component is a variable that captures the common variation across the four risky behaviors and can be interpreted as capturing the general tendency to take risks across domains.) Section 1.2 of our article’s Supplementary Information provides more detail on the definitions of these phenotypes. The analyses of the six supplementary phenotypes were performed in samples ranging from ~315,000 to ~557,000 individuals. These samples were smaller because of more limited data availability for these phenotypes.
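As an illustration of what a "first principal component" is, the sketch below extracts one from four simulated variables that share a common underlying factor. Everything here is an assumption for illustration: the four simulated columns are stand-ins for the four risky behaviors (the real variables are not normally distributed), the function names are hypothetical, and power iteration is simply one convenient way to find the leading component.

```python
import random
from statistics import fmean, pstdev


def standardize(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    mean, sd = fmean(column), pstdev(column)
    return [(value - mean) / sd for value in column]


def first_principal_component(columns, iterations=200):
    """Return (weights, scores) of the first principal component.

    Works on the correlation matrix of the columns and finds its leading
    eigenvector by power iteration.
    """
    cols = [standardize(c) for c in columns]
    k, n = len(cols), len(cols[0])
    corr = [
        [sum(a * b for a, b in zip(cols[i], cols[j])) / n for j in range(k)]
        for i in range(k)
    ]
    weights = [1.0] * k
    for _ in range(iterations):
        weights = [sum(corr[i][j] * weights[j] for j in range(k)) for i in range(k)]
        norm = sum(w * w for w in weights) ** 0.5
        weights = [w / norm for w in weights]
    scores = [sum(weights[j] * cols[j][i] for j in range(k)) for i in range(n)]
    return weights, scores


# Simulated stand-ins for four behaviors driven by one shared tendency.
rng = random.Random(7)
n_people = 2_000
latent = [rng.gauss(0, 1) for _ in range(n_people)]
behaviors = [[0.6 * t + rng.gauss(0, 1) for t in latent] for _ in range(4)]

weights, pc_scores = first_principal_component(behaviors)
agreement = fmean(s * t for s, t in zip(standardize(pc_scores), standardize(latent)))
print([round(w, 2) for w in weights], round(agreement, 2))
```

In this toy setup all four weights come out positive and the component tracks the shared tendency more closely than any single behavior does, which is the intuition behind using it as a summary of risk taking across domains.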
 

2.2.  What did you find in the GWAS?

Our main GWAS identified 124 SNPs associated with general risk tolerance in our discovery sample. The 124 SNPs are located in 99 “loci” (a locus is a small region of the genome). As expected, the estimated individual effects of the 124 SNPs are all very small: none of the SNPs explain more than 0.02% of the variation in general risk tolerance across individuals. 
 

We verified that the 124 SNPs identified in our discovery sample also tend to be associated with general risk tolerance in our replication sample. Because the replication sample was not large enough to provide adequate statistical power to replicate the associations of each of the 124 SNPs individually, we performed a “holistic” replication analysis. This analysis compares the overall agreement in estimates for the 124 SNPs across the discovery and the replication GWAS. This holistic replication was successful, indicating that it is highly unlikely that the results from our discovery sample were driven by chance alone.
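One simple way to quantify "overall agreement" between two sets of estimates is a sign-concordance test: count how many SNPs have effects in the same direction in both samples and ask how likely that count would be if the directions matched only by chance. The sketch below is a generic illustration of that idea and is not necessarily the procedure used in the paper. The effect sizes are simulated, and the function name and noise levels are hypothetical.

```python
import random
from math import comb


def sign_concordance(discovery, replication):
    """Count same-sign estimates and return a one-sided binomial p-value.

    Under the null hypothesis, each SNP has a 50% chance of showing the
    same sign in both samples.
    """
    n = len(discovery)
    n_same = sum((d > 0) == (r > 0) for d, r in zip(discovery, replication))
    p_value = sum(comb(n, i) for i in range(n_same, n + 1)) / 2 ** n
    return n_same, p_value


# Simulated estimates for 124 SNPs (hypothetical, not study results).
# The replication estimates are much noisier, as in a smaller sample.
rng = random.Random(3)
true_effects = [rng.choice((-1, 1)) * 0.01 for _ in range(124)]
discovery = [b + rng.gauss(0, 0.0015) for b in true_effects]
replication = [b + rng.gauss(0, 0.0075) for b in true_effects]

n_same, p_value = sign_concordance(discovery, replication)
print(n_same, p_value)
```

The point of the example is that a set of SNPs can replicate convincingly as a group even when the replication sample is too small to confirm any one of them individually.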

 
We also estimated the “SNP heritability” of risk tolerance. The SNP heritability of a phenotype is the share of the variation in the phenotype that is statistically accounted for by common SNPs, given current environmental conditions (see FAQ 1.3). We used several methods to obtain our estimates. With all methods, we used a set of common SNPs—that is, SNPs that have alleles that are relatively common in the population—to estimate the heritability. Because the different methods make different assumptions and because we applied the different methods to slightly different data, the methods yielded different heritability estimates. Our estimates suggest that common SNPs account for ~5% to 9% of the variation in risk tolerance across individuals. (The true heritability of risk tolerance is likely to be somewhat higher, since other genetic variants, such as rare SNPs and structural genetic variants, are likely to also contribute to variation in risk tolerance.) 

 

Our six supplementary GWAS (of the phenotypes related to risk tolerance and risk-taking behaviors) identified a total of 741 associations between a specific SNP and one of the phenotypes. Because of the lack of suitable replication samples, we did not perform replication analyses for the GWAS of these six phenotypes.

2.3.  Are the SNPs associated with higher risk tolerance in your study also associated with other phenotypes?

Yes. Of the 124 SNPs we identified as associated with general risk tolerance, we found that 72 are also associated with one or more of the six supplementary phenotypes related to risk tolerance and risk-taking behaviors [Equivalently, as we write in the abstract of the paper, of the 99 loci referred to above and that contain the 124 SNPs associated with general risk tolerance, 46 also contain one or more SNPs associated with at least one of the six supplementary phenotypes.]. We also identified several regions of the genome that stood out as being associated with general risk tolerance and with all or most of the six supplementary phenotypes. We verified that the effects of the SNPs in these regions are concordant, such that SNPs associated with higher general risk tolerance are also associated with more risky behavior. This suggests that these regions represent shared genetic influences on risk tolerance and risky behaviors (rather than just being genomic hot spots containing SNPs associated with many different phenotypes).


In addition, we estimated the “genetic correlation” between general risk tolerance and various other phenotypes. The genetic correlation between two phenotypes is a measure of the extent to which the SNPs that affect one phenotype also tend to affect the other phenotype. We found that general risk tolerance is moderately to highly genetically correlated with a range of risky behaviors. General risk tolerance is genetically correlated with the six supplementary phenotypes (which capture various types of risky behavior), with estimates of the genetic correlations ranging from 0.25 to 0.83. General risk tolerance is also moderately to highly genetically correlated with a number of additional risky behaviors, including cannabis use and self-employment. Importantly, the genetic correlations have the expected sign, with higher risk tolerance being associated with riskier behavior. Moreover, our estimates of the genetic correlations between general risk tolerance and the supplementary risky behaviors are substantially higher than the corresponding phenotypic correlations [Although measurement error partly accounts for the lower phenotypic correlations, the genetic correlations remain considerably higher even after adjustment of the phenotypic correlations for measurement error.], implying that general risk tolerance is more strongly associated with these risky behaviors at the genetic level than at the non-genetic (environmental) level. The relatively high genetic correlations between general risk tolerance and risky behaviors suggest the existence of a genetically-influenced “general factor of risk tolerance” that captures a general tendency to take risk across domains of behavior.


We also found that risk tolerance is moderately genetically correlated with several personality and neuropsychiatric phenotypes. Of note, the estimated genetic correlations with the personality traits extraversion (r_g = 0.51) ["r_g" denotes a genetic correlation estimate.], neuroticism (r_g = –0.42), and openness to experience (r_g = 0.33) are highly statistically significant and are substantially larger in magnitude than previously reported phenotypic correlations, pointing to shared genetic influences among general risk tolerance and these personality traits. We also found statistically significant and positive genetic correlations between general risk tolerance and the neuropsychiatric phenotypes ADHD, bipolar disorder, and schizophrenia.

2.4.  How much of a particular person’s risk tolerance can be predicted from the results of this paper?

Although each individual SNP has a very small effect, the GWAS estimates of the SNPs’ (very small) effects can be combined to create a “polygenic score,” an index that takes into account the effects of many SNPs from across the genome. Because a polygenic score aggregates the information from many SNPs, it can predict far more of the variation in risk tolerance among individuals than any single SNP. We found that polygenic scores constructed using the results of our GWAS of general risk tolerance explain up to ~1.6% of the variation across individuals in general risk tolerance. While 1.6% is far larger than the amount of variation explained by individual SNPs (less than 0.02%, as noted above), it is small in absolute terms. As we explain in FAQ 3.3, such polygenic scores cannot be used to meaningfully predict a particular person’s risk tolerance.
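The sketch below shows the basic arithmetic behind a polygenic score: for each person, multiply the allele count at each SNP by that SNP's estimated effect and add up the products. It then measures how much of the variation in a simulated phenotype the score accounts for. All data are simulated and the function names, numbers of SNPs, and effect sizes are hypothetical. Scores used in practice typically also adjust the weights for correlations between nearby SNPs, which this sketch omits.

```python
import random
from statistics import fmean, pstdev


def polygenic_score(allele_counts, weights):
    """Weighted sum of one person's allele counts across SNPs."""
    return sum(w * x for w, x in zip(weights, allele_counts))


def r_squared(x, y):
    """Squared Pearson correlation between two equally long lists."""
    mean_x, mean_y = fmean(x), fmean(y)
    cov = fmean((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (cov / (pstdev(x) * pstdev(y))) ** 2


# Simulated toy data (hypothetical, not study results).
rng = random.Random(42)
n_people, n_snps = 2_000, 200
true_effects = [rng.gauss(0, 0.02) for _ in range(n_snps)]
# GWAS estimates equal the true effects plus estimation noise.
gwas_weights = [b + rng.gauss(0, 0.02) for b in true_effects]

genotypes = [
    [(rng.random() < 0.5) + (rng.random() < 0.5) for _ in range(n_snps)]
    for _ in range(n_people)
]
phenotype = [polygenic_score(g, true_effects) + rng.gauss(0, 1) for g in genotypes]
scores = [polygenic_score(g, gwas_weights) for g in genotypes]

variance_explained = r_squared(scores, phenotype)
print(f"share of variation explained: {variance_explained:.1%}")
```

Reducing the noise added to `gwas_weights` (the analogue of running a larger GWAS) raises the share of variation explained, up to the ceiling set by the simulated genetic contribution.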


The predictive power of the polygenic scores is so small partly because our estimates of the SNPs’ effect sizes are relatively imprecise. As the available sample sizes for GWAS get larger, estimates of the SNPs’ effect sizes will become more precise, and the scores’ explanatory power will rise; in theory, if environmental conditions remain the same, it should be possible one day to construct a polygenic score whose explanatory power is close to the heritability of risk tolerance. For example, a score constructed using the set of common SNPs we used to estimate the ~5% to 9% SNP heritability of risk tolerance (see FAQ 2.2) may ultimately explain ~5% to 9% of the variation in risk tolerance across individuals.


Although the polygenic scores we constructed have too little explanatory power to usefully predict any individual’s risk tolerance, they have sufficient explanatory power to be useful in social science studies, which focus on average or aggregated behavior in the population (not individual outcomes). Indeed, with 80% statistical power (the conventional threshold for adequate power), the effect of our polygenic scores can be detected in a study with 500 individuals. Therefore, the polygenic scores provided by our study can be useful in social science studies that have at least 500 participants and in which the participants’ genomes have been measured. (Several datasets commonly used in social science research meet these criteria.)
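The figure of roughly 500 individuals can be reproduced with a standard back-of-the-envelope power calculation. The sketch below uses the Fisher z approximation for detecting a correlation and assumes a two-sided test at the 5% significance level. That significance level is an assumption, since the FAQ states only the 80% power target and the ~1.6% of variation explained. The function name is hypothetical.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist


def required_sample_size(r_squared, power=0.80, alpha=0.05):
    """Approximate sample size needed to detect a correlation.

    Uses the Fisher z approximation for a two-sided test of whether the
    correlation between a predictor and an outcome differs from zero.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    r = sqrt(r_squared)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)


# ~1.6% of variation explained, as reported for the polygenic scores.
n_needed = required_sample_size(0.016)
print(n_needed)
```

Under these assumptions the function returns a value a little under 500, in line with the figure quoted above. Halving the variance explained roughly doubles the required sample size.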

2.5.  What do your results tell us about human biology and brain development?

To gain insights into the biological mechanisms through which genetic variation influences general risk tolerance, we conducted a suite of bioinformatics analyses. Our bioinformatics analyses point to the involvement of the neurotransmitters glutamate and GABA, which were heretofore not generally believed to play a role in risk tolerance. Glutamate is the most abundant neurotransmitter in the body and plays an excitatory role (i.e., when one neuron secretes it onto another, the second neuron is more likely in turn to transmit its own signal). GABA, by contrast, is the main inhibitory transmitter. To our knowledge, with the exception of a recent study (Lee et al. 2018) prioritizing a much larger number of pathways, no published large-scale GWAS of cognition, personality, or neuropsychiatric phenotypes has pointed to clear roles both for glutamate and GABA. Our results suggest that the balance between excitatory and inhibitory neurotransmission may contribute to variation in general risk tolerance across individuals.


Perhaps unsurprisingly, our bioinformatics analyses point to a role for the brain and the central nervous system in modulating risk tolerance. Specifically, our analyses point to the involvement of some brain regions that have previously been identified in neuroscientific studies on decision-making, including the prefrontal cortex, basal ganglia, and midbrain.

2.6.  How do your results relate to previous research on the genetics of risk tolerance?

As mentioned above in FAQ 1.3, risk tolerance has been one of the most studied phenotypes in social science genetics. However, almost all previous studies have been “candidate-gene studies” conducted in relatively small samples, whose limitations are now appreciated. 


We used the results of our GWAS to revisit this previous research. We reviewed the literature that aimed to link risk tolerance to biological pathways, and identified five main biological pathways that have been previously hypothesized to relate to risk tolerance: the steroid hormone cortisol, the monoamine neurotransmitters dopamine and serotonin, and the steroid sex hormones estrogen and testosterone. We then tested whether these five biological pathways relate to risk tolerance.


To understand how we tested these five biological pathways, it is helpful to first define what a gene is. A “gene” is a sequence of DNA in the genome that codes for a molecule that has a biological function. The human genome has roughly 20,000 to 25,000 genes; although genes comprise only about 1% to 2% of the human genome, they have important biological functions. Genes, like other parts of the genome, can contain SNPs.


Thus, to test the five biological pathways for association with risk tolerance, we first used external databases created by other researchers to identify the genes that are involved, or are likely to be involved, in each of these five pathways. Then, we conducted various bioinformatics analyses that used the results of our GWAS and tested the hypothesis that SNPs located in the genes involved in each of the five pathways tend to be more strongly associated with general risk tolerance than other SNPs. We found no evidence in support of that hypothesis, suggesting that the five pathways are not particularly important contributors to individual variation in risk tolerance.


We also used our GWAS results to examine whether SNPs located within (or highly correlated with) 15 specific genes, which previous candidate-gene studies had tested for association with risk tolerance, are indeed associated with risk tolerance. Our sample was several orders of magnitude larger than the samples used in the previous candidate-gene studies (as mentioned above in FAQ 1.3, these studies were conducted in relatively small samples). Despite this, we found no evidence that these 15 genes are associated with risk tolerance, and failed to replicate the main associations the previous candidate-gene studies had reported. Our results are consistent with other studies that have found that small-sample candidate-gene studies have a poor replication record (Chabris et al. 2012, Hewitt 2012). 


We also note that our discovery GWAS replicated the associations between general risk tolerance and the two SNPs that had previously been found to be associated with general risk tolerance in the two previous studies with large samples (Day et al. 2016, Strawbridge et al. 2018; see FAQ 1.3). This is not surprising, however, since those two studies analyzed data from the UK Biobank, and the UK Biobank is one of the two large datasets we included in our discovery GWAS.


In summary, instead of pointing to the main genetic variants and biological pathways that had previously been hypothesized to relate to risk tolerance, our analyses identified 124 SNPs associated with risk tolerance (see FAQ 2.2), and point to the involvement of the neurotransmitters glutamate and GABA and of several brain regions (see FAQ 2.5).

3. Social and ethical implications of the study

3.1.  Did you find “the gene for” (or “the genes for”) risk tolerance?

No. We did find several genes [As mentioned in FAQ 2.6, a gene is a sequence of DNA in the genome that codes for a molecule that has a biological function; genes, like other parts of the genome, can contain SNPs.] containing SNPs associated with general risk tolerance, but that does not mean that these genes determine general risk tolerance. The genetic factors we identified are involved in a long chain of biological processes that exert an influence on human behavior, and those processes are intricately entwined with the environment. 


In summary, our findings conform with the expectation that variation in risk tolerance across individuals is influenced by at least thousands, if not millions, of genetic variants (Chabris et al. 2015).

3.2.  Does this study show that an individual’s level of risk tolerance is determined and fixed at conception?

No. A large share of the variation in risk tolerance among individuals is determined by environmental factors, and environmental factors may also interact with genetic factors. As mentioned in FAQ 1.3, twin studies have found that part of the variation in risk tolerance across individuals is statistically accounted for by genetic factors. But even if all of the variation in risk tolerance at a certain point in time were accounted for by genetic factors (which is definitely not the case), this would not rule out the possibility of past or future environmental influences on risk tolerance. For instance, even if poor eyesight were perfectly heritable and hence completely determined by genetic factors (it is not), the invention of eyeglasses, contact lenses, and laser surgery would all drastically improve a person’s poor genetic outlook for clear vision. On the flip side, environmental trauma (e.g., a poke to the eye) could drastically worsen another individual’s genetic outlook for clear vision. The lesson of eyesight as a phenotype is that heritability of a phenotype (even 100% heritability) does not imply biological determinism: environmental factors can still in principle influence the phenotype. And again, risk tolerance is far from being perfectly heritable.

3.3.  Can you use the results in this paper to meaningfully predict a particular person’s risk tolerance?

No, the results cannot be used to meaningfully predict either a particular person’s general risk tolerance or their likelihood of taking any particular risk or engaging in any particular sort of risky behavior. As mentioned in FAQ 2.4, we used the results of our GWAS of general risk tolerance to construct polygenic scores that can explain up to ~1.6% of the variation across individuals in general risk tolerance. That means that ~98.4% of the variation in general risk tolerance is explained by factors other than the polygenic scores.


As we also mentioned in FAQ 2.4, we expect that future, larger GWAS will allow the construction of polygenic scores with higher predictive power. However, the predictive power of such scores would still pale in comparison to some other scientific predictors. For example, professional weather forecasts correctly predict about 95% of the variation in day-to-day temperatures. Weather forecasters are therefore vastly more accurate forecasters than social science geneticists will ever be.


We also note that, while the polygenic scores we constructed can’t usefully predict any individual’s risk tolerance, they can be useful in social science studies, which focus on aggregated behavior in the population.

 

3.4.  Can environmental factors modify the effects of the specific SNPs you identified?

It is a plausible hypothesis that environmental factors are both moderators and mediators of genetic influences on risk tolerance. For example, it is conceivable that some SNPs have alleles [As mentioned above, an allele is a certain version of a genetic variant.] that tend to make individuals relatively less risk tolerant, but only when the individuals are exposed to certain environments (e.g., when they experience a traumatic episode). (Such environmental factors would be said to “moderate” the influence of those SNPs.) It is also conceivable that some SNPs affect risk tolerance indirectly, by influencing individuals’ preferences for certain environments (e.g., by influencing their preferences for socializing with quiet, cautious friends), which may in turn affect their risk tolerance. (Such environments would be said to “mediate” the influence of those SNPs.)


We did not perform any statistical tests of “gene-environment interactions” in our study. (Gene-environment interactions refer to the moderation of genetic influences by environmental factors.) One promising approach for future studies that seek to identify gene-environment interactions will be to use our GWAS results to construct polygenic scores of general risk tolerance, and then test whether environmental or demographic variables moderate the association between the polygenic scores and an outcome of interest. 


To facilitate such research, we have made the summary results of our GWAS publicly available on the SSGAC’s website (www.thessgac.org); interested researchers who have access to datasets with genotypic data can download these results and use them to construct polygenic scores.

 

3.5.  What policy lessons or practical advice do you draw from this study?

None whatsoever. Any practical response—individual or policy-level—to this or similar research would be extremely premature. In this respect, our study is no different from genome-wide association studies (GWAS) of complex medical outcomes. In medical GWAS research, it is well understood that identifying genetic variants that affect disease risk is merely a first step toward understanding the underlying biology of that disease. It is not sufficient to assess risk for any specific individual. It is not appropriate to base policies and practices on such assessments.

3.6.  Could this kind of research lead to discrimination against, or stigmatization of, people with specific genetic variants? If so, why conduct this research?

Unfortunately, like a great deal of research—including, for instance, research identifying genetic variants associated with increased cancer risk—the results can be misunderstood and could be misapplied, including by being used to discriminate against individuals with specific genetic variants (e.g., in insurance markets). Nevertheless, for a variety of reasons, we do not think that the best response to the possibility that useful knowledge might be misused is to refrain from producing the knowledge.


First, even if we believed that some knowledge (and specifically knowledge about genetic influences on risk-taking behavior) should be forbidden, that goal is unattainable. Behavioral genetics research, including studies of the relationships between genes and a variety of social-science phenotypes, including risk tolerance, is already being conducted by many scientists and other individuals around the world and will continue to be conducted. Not all of this work involves the use of appropriate scientific methods or the transparent communication of results. In this context, researchers who are committed to developing, implementing, and spreading best practices for conducting and communicating potentially controversial research, including behavioral genetics research, arguably have an ethical responsibility to participate in the development and dissemination of this body of knowledge—rather than abstain from it because of its sensitive nature. 


An important theme in our earlier work has been to point out that most existing studies in social-science genetics that report genetic associations with behavioral traits have serious methodological limitations, fail to replicate, and are likely to have false-positive findings (Beauchamp et al. 2011, Benjamin et al. 2012, Chabris et al. 2012, 2015). This same point was made in an editorial in Behavior Genetics (the leading journal for the genetics of behavioral traits), which stated that “it now seems likely that many of the published [behavior genetics] findings of the last decade are wrong or misleading and have not contributed to real advances in knowledge” (Hewitt 2012). Consistent with this, the current study was unable to replicate the results of previous candidate-gene studies of risk tolerance (see FAQ 2.6). One of the most important reasons why earlier work has generated unreliable results is that the sample sizes were far too small, given that the true effects of individual genetic variants on behavioral traits are tiny.

 
Second, one should not assume that behavioral genetics research carries only the potential to increase stigmatization. For instance, behavioral phenotypes such as general risk tolerance are often assumed to be fully and equally within the control of every individual. That view of these behaviors likely contributes to a lack of sympathy for those who exhibit a self-destructive level of risk-taking and, perhaps, suboptimal support for programs that attempt to reduce such behavior. Our purpose here is not to advocate for or against any particular policy for addressing risk behaviors; rather, we mean only to point out that a finding that genes do have some influence can reduce, rather than increase, stigma of those who exhibit risk-tolerant or even risk-seeking behavior. 


Third, behavioral genetics research has the potential to yield other benefits, especially as sample sizes continue to increase. Forgoing this research necessarily entails forgoing these and any other possible benefits, some of which will likely be the result of serendipity rather than being foreseeable. For instance, identifying variants associated with risk tolerance may lead to insights regarding the underlying biological pathways. To take an example from medicine, genetic variants in the LMTK2 (lemur tyrosine kinase 2) gene have small effects on an individual’s predisposition to prostate cancer. Nonetheless, knowing that this gene is involved can point scientists toward studying what the gene does, which may end up teaching us something critical about the pathology of prostate cancer. The effect from modifying a biological pathway, e.g., with a pharmaceutical, is potentially much larger than the effect of the gene itself. Moreover, although we are not quite there yet, when many genetic variants taken together capture ~10% of the variation across individuals in risk tolerance, this amount of predictive power (while still too low to be relevant for individual predictions) will be useful for controlling for genetic factors when studying the effect of a policy or program on an outcome that is also affected by risk tolerance. For example, when studying a policy intervention that aims to reduce the use of illicit substances that present health risks, controlling for as many factors as possible, including genetic factors associated with risk taking, can help generate more precise estimates of the effectiveness of the policy.


In sum, the potential benefits of this research, when conducted responsibly, seem reasonable in relation to the risks, especially considering that this research is already being conducted, sometimes with lesser attention to both scientific rigor and thoughtful science communication. We thus agree with the U.K. Nuffield Council on Bioethics, which concluded in a report (Nuffield Council on Bioethics 2002, p. 114) that “research in behavioural genetics has the potential to advance our understanding of human behaviour and that the research can therefore be justified,” but that “researchers and those who report research have a duty to communicate findings in a responsible manner.” In our view, responsible behavioral genetics research includes sound methodology and analysis of data; a commitment to publish all results, including any negative results; and transparent, complete reporting of methodology and findings in publications, presentations, and communications with the media and the public, including particular vigilance regarding what the results do—and do not—show (hence, this FAQ document).

4. Appendices

Appendix 1:  Quality control measures

There are many potential pitfalls that can lead to spurious results in genome-wide association studies (GWAS). We took many precautions to guard against these pitfalls.


One potential source of spurious results is incomplete “quality control (QC)” of the genetic data. To avoid this problem, we used state-of-the-art QC protocols from medical genetics research (Winkler et al. 2014). We supplemented these protocols with a more recent protocol from Okbay et al. (2016a), and we also developed and applied additional, more stringent QC filters.


Another potential source of spurious results is a confound known as “population stratification” (e.g., Hamer & Sirota 2000). To illustrate, suppose we were conducting a GWAS of height. People from Northern Europe are on average taller than people from Southern Europe, and there are also small differences in how often certain genetic variants occur in Northern and Southern Europe. If we combine samples of Northern and Southern Europeans and perform a GWAS that ignores the regions the individuals come from, then we would find genetic associations for these variants. However, those associations would simply reflect the fact that the variants are correlated with a population (Northern or Southern Europe) and may actually have nothing to do with height.


In our study we were extremely careful to avoid population stratification as much as possible. At the outset, we restricted the study to individuals of European ancestries, since population stratification problems are more severe when including individuals of different ancestries in the same sample. As is standard in GWAS of medical outcomes, we controlled for “principal components” of the genetic data in the analysis; these principal components capture the small genetic differences across populations, so controlling for them largely removes the spurious associations arising solely from these small differences. 


After taking these steps to minimize population stratification, we conducted several analyses to assess how much population stratification still remained in our data. First, we analyzed data on 17,684 sibling pairs from the Swedish Twin Registry and the UK Biobank. The key idea underlying our test was to examine whether differences in genetic variants across siblings are associated with differences in the siblings’ risk tolerance. If so, then these associations cannot be the result of population stratification. The reason is that full siblings (from the same two biological parents) share their ancestry entirely, and therefore differences in their genetic variants cannot be due to being from different population groups. Unfortunately, because our sample of siblings is much smaller than our discovery GWAS sample (939,908 individuals), our estimates of the effects of the genetic variants within the sibling pairs are much less precise than those in the GWAS. However, we can test whether the GWAS results are entirely due to population stratification, because if they were, then the sibling estimates would not line up with the GWAS estimates at all. In fact, we found that the within-family estimates are more similar to the GWAS estimates in both sign and magnitude than would be expected by chance. These results imply that our GWAS results are not entirely due to population stratification. A second analysis, known as an “LD score regression intercept” analysis (Bulik-Sullivan et al. 2015), indicated that there is some, but not much, population stratification in our GWAS results.
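The logic of the sibling comparison can be sketched with a toy simulation. This is a simplified, single-variant illustration rather than the analysis performed in the study (which compared the sign and magnitude of estimates across many variants); the effect size, allele frequencies, and environmental differences below are hypothetical. A naive regression across families mixes the variant's effect with the population difference, whereas the regression on sibling differences does not.

```python
import random

def ols_slope(xs, ys):
    """Slope from a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

def draw_parent(allele_freq, rng):
    """Parent's allele count (0, 1, or 2) drawn from the population frequency."""
    return (rng.random() < allele_freq) + (rng.random() < allele_freq)

def draw_child(mother, father, rng):
    """Each parent transmits one of their two alleles at random."""
    return (rng.random() < mother / 2) + (rng.random() < father / 2)

TRUE_EFFECT = 0.1  # hypothetical per-allele effect on the outcome
rng = random.Random(1)

genotypes, outcomes = [], []  # one sibling per family, for the naive analysis
diff_g, diff_y = [], []       # sibling differences, for the within-family analysis

# Two hypothetical populations: (allele frequency, shared environmental shift).
for allele_freq, environment in ((0.7, 1.0), (0.3, 0.0)):
    for _ in range(20000):
        mother = draw_parent(allele_freq, rng)
        father = draw_parent(allele_freq, rng)
        sibs = []
        for _ in range(2):
            g = draw_child(mother, father, rng)
            y = TRUE_EFFECT * g + environment + rng.gauss(0.0, 0.5)
            sibs.append((g, y))
        genotypes.append(sibs[0][0])
        outcomes.append(sibs[0][1])
        diff_g.append(sibs[0][0] - sibs[1][0])
        diff_y.append(sibs[0][1] - sibs[1][1])

naive = ols_slope(genotypes, outcomes)
within_family = ols_slope(diff_g, diff_y)

print(f"true effect:            {TRUE_EFFECT:.3f}")
print(f"naive estimate:         {naive:.3f}")
print(f"within-family estimate: {within_family:.3f}")
```

Because both siblings share the same population and the same simulated environment, those terms cancel in the sibling differences, so the within-family estimate should land near the true effect while the naive estimate is inflated.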

Appendix 2:  Additional reading and references

  1. Barban N, Jansen R, de Vlaming R, Vaez A, Mandemakers JJ, et al. 2016. Genome-wide analysis identifies 12 loci influencing human reproductive behavior. Nat. Genet. 48(12):1462–72

  2. Beauchamp JP, Cesarini D, Johannesson M. 2017. The psychometric and empirical properties of measures of risk preferences. J. Risk Uncertain. 54(3):203–37

  3. Beauchamp JP, Cesarini D, Johannesson M, van der Loos MJHM, Koellinger PD, et al. 2011. Molecular genetics and economics. J. Econ. Perspect. 25(4):57–82

  4. Benjamin DJ, Cesarini D, Chabris CF, Glaeser EL, Laibson DI, et al. 2012. The promises and pitfalls of genoeconomics. Annu. Rev. Econom. 4(1):627–62

  5. Bulik-Sullivan BK, Loh P-R, Finucane HK, Ripke S, Yang J, et al. 2015. LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47(3):291–95

  6. Cesarini D, Dawes CT, Johannesson M, Lichtenstein P, Wallace B. 2009. Genetic variation in preferences for giving and risk taking. Q. J. Econ. 124(2):809–42

  7. Chabris CF, Hebert BM, Benjamin DJ, Beauchamp JP, Cesarini D, et al. 2012. Most reported genetic associations with general intelligence are probably false positives. Psychol. Sci. 23(11):1314–23

  8. Chabris CF, Lee JJ, Cesarini D, Benjamin DJ, Laibson DI. 2015. The fourth law of behavior genetics. Curr. Dir. Psychol. Sci. 24(4):304–12

  9. Day FR, Helgason H, Chasman DI, Rose LM, Loh P-R, et al. 2016. Physical and neurobehavioral determinants of reproductive onset and success. Nat. Genet. 48(6):617–23

  10. Dohmen T, Falk A, Huffman D, Sunde U, Schupp J, Wagner GG. 2011. Individual risk attitudes: Measurement, determinants, and behavioral consequences. J. Eur. Econ. Assoc. 9(3):522–50

  11. Falk A, Becker A, Dohmen T, Enke B, Huffman D, et al. 2015. The nature and predictive power of preferences: Global evidence. IZA Discussion Papers.

  12. Hamer DH, Sirota L. 2000. Beware the chopsticks gene. Mol. Psychiatry. 5(1):11–13

  13. Harden KP, Kretsch N, Mann FD, Herzhoff K, Tackett JL, et al. 2017. Beyond dual systems: A genetically-informed, latent factor model of behavioral and self-report measures related to adolescent risk-taking. Dev. Cogn. Neurosci. 25:221–34

  14. Hewitt JK. 2012. Editorial policy on candidate gene association and candidate gene-by-environment interaction studies of complex traits. Behav. Genet. 42(1):1–2

  15. Lambert J-C, Ibrahim-Verbaas CA, Harold D, Naj AC, Sims R, et al. 2013. Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease. Nat. Genet. 45(12):1452–58

  16. Lee J, Wedow R, Okbay A, Kong E, Maghzian O, et al. 2018. Gene discovery and polygenic prediction from a genome-wide association study of educational attainment in 1.1 million individuals. Nat. Genet. 50:1112–21

  17. Locke AE, Kahali B, Berndt SI, Justice AE, Pers TH, et al. 2015. Genetic studies of body mass index yield new insights for obesity biology. Nature. 518(7538):197–206

  18. Nuffield Council on Bioethics. 2002. Genetics and human behaviour: the ethical context. Nuffield Council on Bioethics [http://nuffieldbioethics.org/wp-content/uploads/2014/07/Genetics-and-human-behaviour.pdf], London

  19. Okbay A, Baselmans BML, Neve J-E De, Turley P, Nivard MG, et al. 2016a. Genetic variants associated with subjective well-being, depressive symptoms, and neuroticism identified through genome-wide analyses. Nat. Genet. 48(6):624–33

  20. Okbay A, Beauchamp JP, Fontana MA, Lee JJ, Pers TH, et al. 2016b. Genome-wide association study identifies 74 loci associated with educational attainment. Nature. 533:539–42

  21. Rietveld CA, Cesarini D, Benjamin DJ, Koellinger PD, De Neve J-E, et al. 2013a. Molecular genetics and subjective well-being. Proc. Natl. Acad. Sci. 110(24):9692–97

  22. Rietveld CA, Conley DC, Eriksson N, Esko T, Medland SE, et al. 2014a. Replicability and robustness of GWAS for behavioral traits. Psychol. Sci. 25(11):1975–86

  23. Rietveld CA, Esko TT, Davies G, Pers TH, Turley PA, et al. 2014b. Common genetic variants associated with cognitive performance identified using the proxy-phenotype method. Proc. Natl. Acad. Sci. U. S. A. 111(38):13790–94

  24. Rietveld CA, Medland SE, Derringer J, Yang J, Esko T, et al. 2013b. GWAS of 126,559 individuals identifies genetic variants associated with educational attainment. Science. 340(6139):1467–71

  25. Ripke S, Neale BM, Corvin A, Walters JTR, Farh K-H, et al. 2014. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 511(7510):421–27

  26. Strawbridge RJ, Ward J, Cullen B, Tunbridge EM, Hartz S, et al. 2018. Genome-wide analysis of self-reported risk-taking behaviour and cross-disorder genetic correlations in the UK Biobank cohort. Transl. Psychiatry. 8(1):1–11

  27. Turley P, Walters RK, Maghzian O, Okbay A, Lee JJ, et al. 2018. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet. 50(2):229–37

  28. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, et al. 2017a. 10 Years of GWAS Discovery: Biology, Function, and Translation. Am. J. Hum. Genet. 101(1):5–22

  29. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, et al. 2017b. 10 years of GWAS discovery: Biology, function, and translation. Am. J. Hum. Genet. 101(1):5–22

  30. Winkler TW, Day FR, Croteau-Chonka DC, Wood AR, Locke AE, et al. 2014. Quality control and conduct of genome-wide association meta-analyses. Nat. Protoc. 9(5):1192–1212

  31. Wood AR, Esko T, Yang J, Vedantam S, Pers TH, et al. 2014. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet. 46(11):1173–86



FAQs about “Gene discovery and polygenic prediction from a 1.1-million-person GWAS of educational attainment”

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 


Use the quick link menu to jump to a specific question, or scroll down to read all FAQs for this publication. 

 

This document provides information about the study:

 

Lee et al. (2018). “Gene discovery and polygenic prediction from a 1.1-million-person GWAS of educational attainment.” Nature Genetics.

The document was prepared by Daniel J. Benjamin, David Cesarini, Christopher F. Chabris, Philipp D. Koellinger, David Laibson, Michelle N. Meyer, Aysu Okbay, Patrick Turley, Peter M. Visscher, and Meghan Zacher. It draws from and builds on the FAQs for earlier SSGAC papers. It has the following sections:

          1. Background

          2. Study design and results

          3. Social and ethical implications of the study

          4. Appendices

 

For clarifications or additional questions, please contact Daniel Benjamin (daniel.benjamin@gmail.com).

 

Quick Links

1.1.  Who conducted this study? What are the group’s overarching goals?

1.2.   The current study focuses on an outcome called “educational attainment.” What is educational attainment?

1.3.  What is a GWAS? Are the genetic variants identified in a GWAS “causal”?

1.4.  In what sense do the genetic variants identified in a GWAS “predict” the outcome of interest? What do you mean by “effect size”?

1.5.  What is a polygenic score?

1.6.  Why conduct a GWAS of educational attainment?

1.7.  What was already known about genetic associations with educational attainment prior to this study?

2.1.  What did you do in this paper? How was the study designed? Why was the study designed in this way?

2.2.  What did you find in the GWAS of educational attainment?

2.3.  How predictive is the polygenic score developed in this study?

2.4.  What did you find in the analysis of siblings?

2.5.  What did you find in the analysis of environmental heterogeneity?

2.6.  What did you find in the analysis of the X chromosome?

2.7.  What did you find in the analysis of cognitive performance and math abilities?

2.8.  Are the genetic variants associated with higher educational attainment in your study also associated with other outcomes?

2.9.  What do your results tell us about human biology and brain development?

3.1.  Did you find “the gene for” educational attainment?

3.2.  Well, then, did you find “the genes for” educational attainment?

3.3.  Does this study show that an individual’s level of educational attainment is determined, or fixed, at conception?

3.4.  Can the polygenic score from this paper be used to accurately predict a particular person’s educational attainment?

3.5.  Can your polygenic score be used for research studies in non-European-ancestry populations?

3.6.  What policy lessons do you draw from this study?

3.7.  Could this kind of research lead to discrimination against, or stigmatization of, people with the relevant genetic variants? If so, why conduct this research?

Appendix 1:  Quality control measures

Appendix 2:  Additional reading and references

1. Background

1.1.  Who conducted this study? What are the group’s overarching goals?

 

The authors of the study are members of the Social Science Genetic Association Consortium (SSGAC). The SSGAC is a multi-institutional, international research group that aims to identify statistically robust links between genetic variants and social-science-relevant traits. These include traits such as behavior, preferences, and personality that are traditionally studied by social and behavioral scientists (e.g., economists, psychologists, sociologists) but are often also of interest to health and other researchers.

The SSGAC was formed in 2011 to overcome a specific set of scientific challenges. Most traits and behaviors are associated with thousands of genetic variants. Although their collective effect can be substantial (see FAQs 1.5 & 2.3), we now know that almost every one of these genetic variants has an extremely weak effect on its own. To identify specific variants with such small effects, scientists must study at least hundreds of thousands of people (to separate weak signals from noise). One promising strategy for doing this is for many investigators to pool their data into one large study. This approach has borne considerable fruit when used by medical geneticists interested in a range of diseases and conditions (Visscher et al. 2017). Most of these advances would not have been possible without large research collaborations between multiple research groups interested in similar questions. The SSGAC was formed in an attempt by social scientists to adopt this research model.

The SSGAC is organized as a working group of the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE), a successful medical consortium. It was founded by three social scientists—Daniel Benjamin (University of Southern California), David Cesarini (New York University), and Philipp Koellinger (Vrije Universiteit Amsterdam)—who believe that studying genetic variants associated with social scientific outcomes can have substantial positive impacts across many research fields. This includes research that aims to better understand the effects of the environment (e.g., research on policy interventions, including the effects of different school environments) and interactions between genetic and environmental effects. The potential benefits also span a diverse set of research questions in the biomedical sciences, such as why and how educational attainment is linked to longevity and better overall health outcomes.

To conduct such research, the SSGAC implements genome-wide association studies (GWAS, see FAQ 1.3) of social-scientific outcomes. For example, to conduct a GWAS of educational attainment, every participating cohort uploads the (within-cohort) statistical association between educational attainment and a single-nucleotide polymorphism (SNP) in the genomes of the individuals in the cohort. A SNP is a base pair in the genome where there is common variation in the human population (see FAQ 1.3). This statistical analysis is repeated for each SNP in the genome. The cohort-level results do not contain individual-level data, only summary statistics about these within-cohort statistical associations. The SSGAC then combines these cohort results to produce the overall GWAS results. By using existing datasets and combining cohort-level results, we can study the genetics of ~1.1 million people at very low cost. The SSGAC publicly shares the overall, aggregated results at www.thessgac.org/data so that other scientists can build on this work. These publicly available data have already catalyzed many research projects and analyses across the social and biomedical sciences (see FAQ 1.6 for examples).
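As a rough illustration of how cohort-level summary statistics can be combined without sharing individual-level data, the sketch below pools one SNP's estimates across cohorts using inverse-variance weighting. This is an assumption made for illustration: the text above does not say which weighting scheme the study used, and the numbers are hypothetical.

```python
import math

def inverse_variance_meta(cohort_results):
    """Combine (beta, standard_error) pairs from several cohorts into one estimate."""
    # More precise cohorts (smaller standard errors) receive larger weights.
    weights = [1.0 / se ** 2 for _, se in cohort_results]
    total_weight = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, cohort_results)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    return beta, se

# Hypothetical summary statistics for a single SNP from three cohorts.
cohorts = [(0.020, 0.010), (0.040, 0.020), (0.010, 0.015)]
pooled_beta, pooled_se = inverse_variance_meta(cohorts)
print(f"pooled beta = {pooled_beta:.4f}, pooled SE = {pooled_se:.4f}")
```

The pooled standard error is smaller than that of any single cohort, which is why combining many cohorts makes it possible to detect very small associations.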

The Advisory Board for the SSGAC is composed of prominent researchers representing various disciplines: Dalton Conley (Sociology, Princeton University), George Davey Smith (Epidemiology, University of Bristol), Tõnu Esko (Molecular Biology and Human Genetics, University of Tartu and Estonian Genome Center), Albert Hofman (Epidemiology, Harvard University), Robert Krueger (Psychology, University of Minnesota), David Laibson (Economics, Harvard University), James Lee (Psychology, University of Minnesota), Sarah Medland (Genetic Epidemiology, QIMR Berghofer Medical Research Institute), Michelle Meyer (Bioethics and Law, Geisinger Health System), and Peter Visscher (Statistical Genetics, University of Queensland).

The SSGAC is committed to the principles of reproducibility and transparency. Prior to conducting genetic association studies, power calculations are carried out to determine the necessary sample size for the analysis (assuming realistically small effect sizes associated with individual genetic variants). Whenever possible, we pre-register our analyses at OSF (formerly Open Science Framework). Major SSGAC publications are usually accompanied by a FAQ document (such as this one). The FAQ document is written to communicate what was found less tersely and technically than in the paper, as well as what can and cannot be concluded from the research findings more broadly. FAQ documents produced for SSGAC publications are available at https://www.thessgac.org/faqs.

In addition to educational attainment, SSGAC-affiliated papers have studied subjective well-being, reproductive behavior, and risk tolerance. The SSGAC website contains an up-to-date list of our major publications, which have been published in journals such as Science, Nature, Nature Genetics, Proceedings of the National Academy of Sciences, Psychological Science, and Molecular Psychiatry.

1.2.  The current study focuses on an outcome called “educational attainment.” What is educational attainment?

Educational attainment is the amount of formal education a person completes (measured as the number of years of education completed for people in our sample, all of whom are at least 30 years old). Although educational attainment is most strongly influenced by social and other environmental factors (see FAQ 1.7), it is also influenced by thousands of genes. People vary considerably in how much education they complete. Education is recognized throughout the social and biomedical sciences as an important “predictor” (see FAQ 1.4) of many other life outcomes, such as income, occupation, health, and longevity (Ross & Wu 1995; Cutler & Lleras-Muney 2008). Educational attainment is also among the relatively few social-scientific traits for which it is feasible to conduct a large-scale genome-wide study, because educational attainment is frequently measured by a variety of cohorts, including medical cohorts, due to its robust association with health. A large-scale study is necessary (but not sufficient) to generate scientific findings that are reproducible.

1.3.  What is a GWAS? Are the genetic variants identified in a GWAS “causal”?

In a genome-wide association study (GWAS), scientists look at genetic variants measured across the entire human genome to see whether any of them are, on average, associated with higher or lower levels of some outcome. Commonly, and in our studies, such analyses focus on the most common genetic variants, so-called single-nucleotide polymorphisms (SNPs). SNPs are sites in the genome where single DNA base pairs commonly differ across individuals. SNPs usually have two different possible base pairs, or alleles. Although there are tens of millions of sites where SNPs are located in the human genome, GWASs typically investigate only SNPs that can be measured (or imputed) with a high level of accuracy. These days, such procedures usually yield millions of SNPs that together capture most common genetic variation across people.

GWAS has been a successful research strategy for identifying genetic variants associated with many traits and diseases, including body height (Wood et al. 2014), BMI (Locke et al. 2015), Alzheimer’s disease (Lambert et al. 2013), and schizophrenia (Ripke et al. 2014). It has also recently been used to identify genetic variants associated with a variety of health-relevant social science outcomes, such as the number of children a person has (Barban et al. 2016), happiness (Okbay, Baselmans, et al. 2016; Turley et al. 2018), and educational attainment (Rietveld et al. 2013; Okbay, Beauchamp, et al. 2016).

GWAS identifies genetic variants that are associated with the outcome, but an observed association with a specific variant need not imply that the variant causes the outcome, for a variety of reasons. First, genetic variants are often highly correlated with other, nearby variants on the same chromosome. As a result, when one or more variants in a region causally influence an outcome (in that particular environment), many non-causal variants in that region may also be identified as associated with the outcome. When GWAS results are analyzed, researchers often emphasize results for the genetic variant in a region that showed the strongest evidence of association. This variant need not be the causal variant. In fact, the causal genetic variant may not have even been measured directly. For example, GWAS that focus on common SNPs would not be able to identify rare or structural genetic variants (e.g., deletions or insertions of an entire genetic region) that are causal, but they may identify SNPs that are correlated with these unobserved variants.

Second, the frequencies of many genetic variants vary systematically across environments. If those environmental factors are not accounted for in the association analyses, some of the associations found may be spurious. To use a well-known example (Lander & Schork 1994), any genetic variants common in people of Asian ancestries will be associated statistically with chopstick use, but these variants would not cause chopstick use; rather, these genetic variants and the outcome of chopstick use are both distributed unevenly among people with different ancestries. This is the problem of “population stratification” discussed in Appendix 1. GWAS researchers have a number of strategies for addressing the challenges posed by population stratification (see FAQs 2.4 & 3.5 and Appendix 1).

Even in studies such as ours that attempt to address and correct for heterogeneity in genetic ancestry, allele frequencies may nonetheless vary systematically with environmental factors. For example, a genetic variant that is associated with improved educational outcomes in the parental generation may have downstream effects on parental income and other factors known to influence children’s educational outcomes (such as neighborhood characteristics). This same genetic variant is likely to be inherited by the children of these parents, creating a correlation between the presence of the genetic variant in a child’s genome and the extent to which the child was reared in an environment with specific characteristics. A recent study of Icelandic families showed that the parental allele that is not passed on to the parent’s offspring is still associated with the child’s educational attainment, suggesting that GWAS results for educational attainment partly represent these intergenerational pathways (Kong et al. 2018). Our sibling analyses yield results that are consistent with this conclusion (see FAQ 2.4).

Third, variants’ effects on an outcome may be indirect, so a variant that may be “causal” in one environment may have a diminished effect or no effect at all in other environments. For example, the nicotinic acetylcholine receptor gene cluster on chromosome 15 is associated with lung cancer (Thorgeirsson et al. 2008; Amos et al. 2008; Hung et al. 2008). From this observation alone we cannot conclude that these genetic variants cause lung cancer through some direct biological mechanism. In fact, it is likely that these genetic variants increase lung cancer risk through their effects on smoking behavior. In a tobacco-free environment, it is plausible that many of the associations would be substantially weaker and perhaps disappear altogether. Thus, even if we have credible evidence that a specific association is not spurious, it is entirely possible that the genetic variant in question influences the outcome through channels that we, in common parlance, would label environmental (e.g., smoking). Nearly forty years ago, the sociologist Christopher Jencks criticized the widespread tendency to mistakenly treat environmental and genetic sources of variation as mutually exclusive (see also Turkheimer 2000). As the example of smoking illustrates, it is often overly simplistic to assume that “genetic explanations of behavior are likely to be exclusively physical explanations while environmental explanations are likely to be social” (Jencks 1980, p.723).

In general, GWAS is just one step in a longer, often complex process of identifying causal pathways, but the results of a large-scale GWAS are a useful tool for that purpose and often lead to novel and important insights (Visscher et al. 2017). In other words, GWAS results provide important signals as to where scientists should invest future in-depth research to understand why the association exists.

1.4.  In what sense do the genetic variants identified in a GWAS “predict” the outcome of interest? What do you mean by “effect size”?

When we and other scientists say that genetic variants (and other variables, such as demographics) “predict” certain outcomes, our use of the word differs in several important ways from how “predict” is used in standard language (e.g., outside of social science research papers). First, we do not mean that the presence of a genetic variant guarantees an outcome with 100% probability, or even with a high degree of likelihood. Rather, we mean that the variant is, on average across people, statistically associated with an outcome. In other words, on average, people with the genetic variant have a higher likelihood of the outcome compared to people without the genetic variant. A genetic variant is said to be statistically “predictive” of an outcome even if the presence of the genetic variant only very weakly increases the likelihood of that outcome—as is the case, for instance, with every SNP that we identify that is associated with educational attainment.

Second, in standard language, “prediction” usually refers to the future. In contrast, when scientists say that genetic variants “predict” an outcome, they mean that they expect to see the association in new data. “New data” means data that haven’t been analyzed yet—regardless of whether that data will be collected in the future or has already been collected.

Finally, in standard language, a “prediction” is often an unconditional guess about what will happen. Instead of meaning it unconditionally, scientists mean that they expect to see an association in new data under certain conditions, for example, that the environment for the new data is the same as the environment in which the variants were found in the previously studied data to be associated with the outcome. In the example given in FAQ 1.3, in which a genetic variant is associated with lung cancer due to its effect on smoking, we would not expect the genetic variant to be as strongly predictive of lung cancer in an environment where cigarettes are absent.

We use the term “effect size” as a concise way to refer to the magnitude of the predicted difference in the outcome resulting from having one allele of a genetic variant as opposed to the other possible allele (for example, see FAQ 2.2). The use of the word “effect” is not intended to imply that we believe it is generally appropriate to use the strength of the association between a variant and educational attainment as a measure of the variant’s causal effect on educational attainment (see FAQ 1.3).

1.5.  What is a polygenic score?

The results of a GWAS can be used to create a “polygenic score,” an index composed of many genetic variants from across the genome. Because a polygenic score aggregates the information from many genetic variants, it can “predict” (see FAQ 1.4) far more of the variation among individuals for the GWAS outcome than any single genetic variant. Often, the polygenic scores with the most predictive power are those created using all the (millions of) genetic variants studied in a GWAS. The larger the GWAS sample size, the greater the predictive power (in other, independent samples) of a polygenic score constructed from the GWAS results. More precisely, the GWAS results are used to create a formula for how to construct a polygenic score. Using this formula, a polygenic score can then be constructed for any individual with genome-wide data. Indeed, some of the value of a GWAS is that the polygenic score it produces can be used in subsequent studies conducted in other samples.
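In its simplest form, the “formula” mentioned above is a weighted sum: each variant's allele count is multiplied by a weight derived from the GWAS results, and the products are added up. The sketch below shows only this basic idea; the variant names and weights are hypothetical, and in practice the weights are usually adjusted further (for example, to account for correlation between nearby variants).

```python
def polygenic_score(weights, genotypes):
    """Weighted sum of allele counts over the variants present in both inputs."""
    return sum(weight * genotypes[snp]
               for snp, weight in weights.items()
               if snp in genotypes)

# Hypothetical per-allele weights taken from GWAS results.
weights = {"snp_a": 0.020, "snp_b": -0.010, "snp_c": 0.005}

# One person's allele counts (0, 1, or 2 copies of the counted allele).
person = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

score = polygenic_score(weights, person)
print(f"polygenic score: {score:.3f}")
```

The same weights can be applied to any individual with genome-wide data, which is what allows a score built from one study to be used in other samples.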

1.6.  Why conduct a GWAS of educational attainment?

We are motivated to conduct this research because we believe it can be fruitful for the social sciences and health research. In addition to the specific findings of our paper, which are discussed in Section 2 of these FAQs, the results of a GWAS of educational attainment also provide inputs for other research. For example, results from our earlier GWASs of educational attainment (Rietveld et al. 2013; Okbay, Beauchamp, et al. 2016), which were conducted in much smaller samples (see also FAQ 1.7), have been used to:

  • examine the genetic overlap between educational attainment and ADHD, schizophrenia, Alzheimer’s disease, intellectual disability, cognitive decline in the elderly, brain morphology, and longevity (Pickrell et al. 2016; Warrier et al. 2016; Anderson et al. 2017; Marioni et al. 2016);

  • help us better identify possible genetic subtypes of schizophrenia (Bansal et al. 2017);

  • explore why educational attainment appears to be protective against coronary artery disease (Tillmann et al. 2017) and obesity (van Kippersluis & Rietveld 2017);

  • control for genetic influences in order to generate more credible estimates of how changes in school policy influence health outcomes (Davies et al. 2018);

  • study why specific genetic variants predict educational attainment. For example, it appears that some genetic effects on educational attainment operate through associations with cognitive performance and traits such as self-control (Belsky et al. 2016), which in turn affect educational attainment;

  • study how the effects of genes on education differ across environmental contexts (Schmitz & Conley 2017; Barcellos et al. 2018); and

  • develop new statistical tools that may advance our understanding of how parenting and other features of a child’s rearing environment influence his or her developmental outcomes (Kong et al. 2018; Koellinger & Harden 2018).

These are just some examples of follow-up studies that previous GWASs of educational attainment have already enabled. By making the results of our analyses publicly available at https://www.thessgac.org/data, we hope to facilitate further valuable work by other researchers.

1.7.  What was already known about genetic associations with educational attainment prior to this study?

Educational attainment is strongly influenced by social and other environmental factors. For example, holding all other influences equal, those who live in communities where education (at least beyond a certain level) is relatively expensive are less likely to obtain a high level of educational attainment. Even when education is free or heavily subsidized, full-time education constitutes an opportunity cost that not everyone is equally able to bear: some individuals, due to a variety of family or economic circumstances, will face more pressure than others to leave school and enter the labor force. More generally, educational outcomes are strongly influenced by environmental factors such as social norms, early-life educational experiences, and economic opportunity.

A variety of findings—from twin, family, and GWAS studies—suggest that in affluent countries, genetic factors account for some of the differences across people in their educational attainment (Branigan et al. 2013; Heath et al. 1985; Silventoinen et al. 2004). Studies have found repeatedly that identical twins raised in the same home are substantially more similar to each other in their educational attainment than fraternal twins (or other full siblings) reared together. Full siblings reared together are, in turn, more similar than half siblings reared together who, in turn, are more similar than genetically unrelated siblings (e.g., siblings who are conventionally unrelated, typically because at least one of them is adopted) reared together (Cesarini & Visscher 2017; Sacerdote 2011; Sacerdote 2007). The studies have also provided strong evidence that so-called common environment (the environmental factors shared by siblings raised in the same household) can have long-lasting effects on educational outcomes. In Sweden, the educational outcomes of adopted (i.e., genetically unrelated) brothers reared in the same households are about as similar as the educational outcomes of full siblings reared in separate homes (Cesarini & Visscher 2017). A study of Korean-American adoptees finds that adoptees assigned to households where both parents had college degrees were 16 percentage points more likely to attend college than children assigned to families in which neither parent completed college (Sacerdote 2007).

Research (like the current study) using molecular genetic data, which measure each person’s DNA and can be used to identify differences between people at the molecular level, has similarly found that common SNPs jointly predict up to 20% of the variation in educational attainment across individuals (Rietveld et al. 2013). This predictive power may derive from many different types of mechanisms. For example, genetic variation may affect neural functions such as memory. Genetic variation may improve sleep quality (making it easier to subsequently stay awake in boring lectures). Genetic variation can affect personality traits, such as the willingness to listen politely to and follow the instructions of teachers (who aren’t always right but nevertheless dictate grades and other outcomes). There may also be even more convoluted pathways. For example, genetic variation can affect one’s sociability, which might draw someone into or drive someone out of the particular social environments that exist in higher education.

In prior GWAS studies, researchers have observed that some genetic variants are associated with educational attainment. In the SSGAC’s first major publication (Rietveld et al. 2013), we conducted a GWAS in a sample of roughly 100,000 people and identified three genetic variants that were statistically associated with educational attainment. In 2016, the SSGAC conducted another GWAS of educational attainment, this time in a sample of around 300,000 people (Okbay, Beauchamp, et al. 2016). We found that 74 genetic variants were associated with educational attainment. These included the three genetic variants identified in our earlier study