The Autism Simplex Collection: an international, expertly phenotyped autism sample for genetic and phenotypic analyses

Background: There is an urgent need to expand and enhance autism spectrum disorder (ASD) samples in order to better understand the causes of ASD.
Methods: In a unique public-private partnership, 13 sites with extensive experience in both the assessment and diagnosis of ASD embarked on an ambitious, 2-year program to collect samples for genetic and phenotypic research and to begin analyses on these samples. The program was called The Autism Simplex Collection (TASC). TASC sample collection began in 2008 and was completed in 2010, and included nine sites from North America and four sites from Western Europe, as well as a centralized Data Coordinating Center.
Results: Over 1,700 trios are part of this collection, with DNA from transformed cells now available through the National Institute of Mental Health (NIMH). Autism Diagnostic Interview-Revised (ADI-R) and Autism Diagnostic Observation Schedule-Generic (ADOS-G) measures are available for all probands, as are standardized IQ measures, Vineland Adaptive Behavior Scales (VABS), the Social Responsiveness Scale (SRS), Peabody Picture Vocabulary Test (PPVT), and physical measures (height, weight, and head circumference). At almost every site, additional phenotypic measures were collected, including the Broad Autism Phenotype Questionnaire (BAPQ) and Repetitive Behavior Scale-Revised (RBS-R), as well as the non-word repetition scale, Communication Checklist (Children’s or Adult), and Aberrant Behavior Checklist (ABC). Moreover, for nearly 1,000 trios, the Autism Genome Project Consortium (AGP) has carried out Illumina 1M SNP genotyping and called copy number variation (CNV) in the samples, with data being made available through the National Institutes of Health (NIH). Whole exome sequencing (WES) has been carried out in over 500 probands, together with ancestry-matched controls, and this data is also available through the NIH. Additional WES is being carried out by the Autism Sequencing Consortium (ASC), where the focus is on sequencing complete trios. ASC sequencing for the first 1,000 samples (all from whole-blood DNA) is complete and data will be released in 2014. Data is being made available through NIH databases (database of Genotypes and Phenotypes (dbGaP) and National Database for Autism Research (NDAR)) with DNA released in Dist 11.0. Primary funding for the collection, genotyping, sequencing and distribution of TASC samples was provided by Autism Speaks and the NIH, including the National Institute of Mental Health (NIMH) and the National Human Genome Research Institute (NHGRI).
Conclusions: TASC represents an important sample set that leverages expert sites. Similar approaches, leveraging expert sites and ongoing studies, represent an important path towards further enhancing available ASD samples.


Background
Autism spectrum disorder (ASD) is a highly heritable neurodevelopmental disorder characterized by fundamental deficits in social reciprocity and the presence of restricted interests and/or repetitive behaviors [1]. ASD is among the most prevalent of developmental disorders, with an estimated prevalence of about 1 in 100 and with males affected at a rate four times that of females [2]. A substantial portion of the risk for ASD traces to genetic variation, either inherited [1,3,4] or arising de novo [5-11]. Through the strong support of families, advocacy groups and international research initiatives, large collaborative networks have been formed to advance knowledge of genes associated with ASD, and in recent years these efforts have led to dramatic progress in the understanding of the genetics of ASD. Recent findings identify examples of specific etiological mechanisms, while underscoring the highly complex nature of the genetic architecture in ASD. To date, a specific genetic etiology can be identified in up to 25% of individuals with ASD. These etiologies include single gene disorders (for example, Fragile X), known genetic syndromes (for example, 22q11 deletion syndrome), chromosomal anomalies, de novo and inherited copy number variation (CNV), indels, and single nucleotide variation (SNV) [12]. Intensive behavioral interventions can have profoundly positive effects on the prognosis of autism (for example, see [13]); however, morbidity and the cost of ASD to society, families and individuals remain high [14,15], reinforcing the need for earlier diagnosis and, ultimately, targeted treatment.
Further advances in the genetics of ASD will require analysis of genetic variation at all levels. This in turn requires taking advantage of new technology, including massively parallel (high-throughput) sequencing, and many thousands of samples to have adequate power to identify rare variants associated with high risk and more common variants associated with lower risk. The Interagency Autism Coordinating Committee (IACC) of the NIH, in its 2010 strategic plan, has called for genetic analysis of some 20,000 unrelated families, a number that is consistent with studies in other psychiatric disorders such as schizophrenia (which also has a similar prevalence and heritability to autism, see [16]). The IACC recognized that the community would require such sample sizes in order to discover additional causes of ASD to inform 'prognosis and treatments and lead to the prevention/preemption of the challenges and disabilities of ASD'.
In spite of the prevalence and burden of ASD, there have not yet been sufficient large-scale initiatives to recruit subjects to reach 20,000 unrelated families, and the number of unrelated families available to researchers through the National Institute of Mental Health (NIMH) Center for Collaborative Genetic Studies on Mental Disorders (CCGSMD) was still moderate in 2008. To begin to redress this, members of the Autism Genome Project Consortium (AGP) made a proposal to Autism Speaks to collect a large number of well-phenotyped trios and to make them publicly available through the existing open process of the NIMH CCGSMD. The proposal was modest in terms of costs because all participating sites were already fully qualified on the relevant instruments and had families in clinical and research studies to turn to for participation, and because all sites agreed to leverage ongoing studies funded by parallel sources to carry out the proposed sample collection at a fraction of the real costs. After positive review at Autism Speaks, and a critical parallel commitment by the NIMH to collect, transform, and distribute biomaterials and data through CCGSMD, The Autism Simplex Collection (TASC) was awarded funds to ascertain 1,700 trios in this initiative. The TASC samples are now available to qualified researchers through the NIMH CCGSMD, and additional genotyping and whole exome sequencing (WES) data is being made available for these samples on an ongoing basis.

The Autism Genome Project Consortium
In 2004, through the support of the National Alliance for Autism Research (NAAR), now Autism Speaks, the AGP was launched as an international consortium of genetic researchers from North America and Europe dedicated to identifying genes that underlie susceptibility to ASD. The goal of the AGP was to identify causes of ASD using modern genetic approaches and then to translate these findings into better clinical practice. The members of the AGP have joined together to reach these goals, recognizing that an open collaborative structure would allow the field to proceed with maximal efficiency. While a detailed description of the AGP is beyond the scope of the current manuscript, it is important to highlight some recent successes as they inter-relate with the goals of the TASC. The AGP has consistently made use of state-of-the-art methods, describing for the first time a genome-wide assessment of CNV in ASD that included a first report of CNV at the NRXN1 gene in ASD [7]. The AGP reported a large genome-wide association study (GWAS) in ASD [17] and has identified many CNVs that contribute to ASD risk [6]. This latter study showed that CNVs in genes previously associated with intellectual disability (ID) are likely contributing to ASD risk and identified pathways that appear to be implicated in ASD risk. The AGP collaborates with other initiatives, carrying out genotyping where requested (for example, see [18]) and providing all genotype data to the Psychiatric Genomics Consortium (PGC; [19]) for mega-analysis. Recently the AGP has completed genotyping another large sample (including much of TASC; Pinto et al. in press) and has contributed over 1,000 TASC samples for WES. WES data from cases was generated and analyzed in a large effort funded by the NIMH and the National Human Genome Research Institute (NHGRI) through an American Recovery and Reinvestment Act grant [3,8,20,21]. WES data from trios is now being generated by the ASC [22], again with support from the NIMH and NHGRI.
The AGP thus represents a unique union of organizations and research sites (over 50), comprising clinicians and researchers with recognized expertise in ASD who work closely with local ASD groups and families. Since 2003, AGP members have published several hundred peer-reviewed papers on autism genetics research. The pooling of extensive scientific and clinical expertise from across the world can provide reliably characterized samples, with phenotype and genotype data, to the scientific community. This matters because large, reliable samples are crucial for dissection of the genetic and genomic architecture of ASD.

The Autism Simplex Collection (TASC)
In 2008, the AGP sought to provide an independent ASD sample for gene discovery and phenotypic analyses, and for replicating and extending findings from prior studies carried out by the AGP and by additional research groups. To this end, 13 sites within the AGP with extensive experience in the diagnosis and assessment of ASD made a proposal that included recruitment of additional families to the NIMH CCGSMD in a two-year period. Recruitment would include in-depth expert phenotyping of all families and biomaterials from all families.

Methods
Autism Speaks agreed to fund the collection of 1,700 samples, while the NIH agreed to assume the costs for collecting, transforming and distributing the samples, as well as warehousing all clinical information. Dr. Louise Gallagher coordinated the recruitment, with principal investigators at each site taking local responsibility (see Table 1). Inclusion criteria included individuals aged between 3 and 21 years with a research diagnosis of autism/ASD based on the Autism Diagnostic Interview-Revised (ADI-R) and the Autism Diagnostic Observation Schedule (ADOS) (administered by research-reliable raters). Individual IQs were preferably greater than or equal to 35 based on standardized testing and a full-scale or nonverbal mental age of 18 months or greater. Only full parent-child trios were included, and the availability of an unaffected sibling was not a requirement for inclusion. Individuals with known medical or genetic causes of autism or a history consistent with Childhood Disintegrative Disorder were excluded. Other exclusion criteria included: extreme prematurity (< 1,000 grams or less than 32 weeks); prematurity with associated neurological complications; birth trauma with associated early neonatal complications; significant brain injury; and in utero exposure to medication known to be associated with autism (for example, retinoic acid or sodium valproate). The Phenotype Protocol was developed and agreed on by the members of the AGP Phenotype Analytic Committee (Chair: Dr. Peter Szatmari). Additionally, consultation with the Simons Simplex Collection, which was also in a phase of phenotype protocol development, allowed for cross-comparability of some instruments.
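For readers who wish to apply comparable criteria to their own screening records, the inclusion and exclusion rules described above can be sketched in code. The following is only an illustrative sketch and not the procedure used by the TASC sites: the record layout, field names, and function names are hypothetical, the remaining perinatal and exposure exclusions are collapsed into a single flag, and the IQ and mental-age thresholds are treated as preferences that are flagged rather than enforced, which is one possible reading of the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical screening record; all field names are illustrative."""
    age_years: float
    meets_adir_and_ados: bool          # research diagnosis of autism/ASD on both instruments
    full_trio: bool                    # proband and both parents available
    iq: float
    mental_age_months: float
    known_medical_or_genetic_cause: bool = False
    cdd_history: bool = False          # history consistent with Childhood Disintegrative Disorder
    birth_weight_g: float = 3500.0
    gestation_weeks: float = 40.0
    other_listed_exclusion: bool = False  # birth trauma, brain injury, in utero exposure, etc.


def exclusion_reasons(c: Candidate) -> List[str]:
    """Return the hard criteria that the candidate fails (empty list if none)."""
    reasons: List[str] = []
    if not (3 <= c.age_years <= 21):
        reasons.append("age outside 3-21 years")
    if not c.meets_adir_and_ados:
        reasons.append("no ADI-R/ADOS research diagnosis")
    if not c.full_trio:
        reasons.append("incomplete parent-child trio")
    if c.known_medical_or_genetic_cause:
        reasons.append("known medical or genetic cause")
    if c.cdd_history:
        reasons.append("history consistent with CDD")
    if c.birth_weight_g < 1000 or c.gestation_weeks < 32:
        reasons.append("extreme prematurity")
    if c.other_listed_exclusion:
        reasons.append("other listed exclusion")
    return reasons


def preference_flags(c: Candidate) -> List[str]:
    """Return soft criteria (assumed here to be preferences) that are not met."""
    flags: List[str] = []
    if c.iq < 35:
        flags.append("IQ below 35")
    if c.mental_age_months < 18:
        flags.append("mental age below 18 months")
    return flags


if __name__ == "__main__":
    child = Candidate(age_years=8, meets_adir_and_ados=True, full_trio=True,
                      iq=72, mental_age_months=84)
    print(exclusion_reasons(child), preference_flags(child))
```

Returning the list of failed criteria, rather than a single yes/no value, keeps the reason for each exclusion available for later review.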
Core assessments carried out at all sites included ADI-R [23] and ADOS-G [24] for the probands, as well as standardized IQ measures, Vineland Adaptive Behavior Scales (VABS) [25], Social Responsiveness Scale (SRS) [26], Peabody Picture Vocabulary Test (PPVT) [27] and physical measures (height, weight, and head circumference) (Table 2). At almost every site, additional phenotype measures were assessed, including the Broad Autism Phenotype Questionnaire (BAPQ) [28] and Repetitive Behavior Scale-Revised (RBS-R) [29], as well as the non-word repetition scale (from the Comprehensive Test of Phonological Processing/CTOPP-Domain V) [30], The Children's Communication Checklist-2 [31], Communication Checklist-Adult Version [32], and Aberrant Behavior Checklist (ABC) [33] (Table 2). Phenotypes were uploaded to a centralized database in the Data Coordinating Center (DCC), which performed point-of-entry data validation, facilitated cross-site harmonization of clinical information, and provided combined output data to ensure a common starting point for analyses performed at different participating sites.

Results
The consortium recruited over 90% of its target sample in approximately two and a half years. The ambitious recruitment was made feasible by capitalizing on sites with active clinical and research programs that were already utilizing standardized, research-quality evaluation methods, and by leveraging site-specific resources. The majority of samples in the TASC collection are of European ancestry, but the collection also includes samples of more diverse ancestry. Summary statistics on ancestry, ADI-R and ADOS-G diagnoses, and available genetic data are shown in Table 3.

Discussion
Genetic analyses of TASC samples
AGP members continue their work on ASD genetics and have been making use of TASC samples to this end. Over 400 TASC families are included in the two recent AGP papers reporting on CNV and common SNPs in ASD [6,17]. Additional TASC families are in the current analysis in the follow-up AGP cohort (Pinto et al., in press), such that to date, both genome-wide SNP data and CNV calls have been generated for some 900 TASC families. This data is being analyzed by the AGP and shared with other sites, including the PGC [19] as well as the database of Genotypes and Phenotypes (dbGaP) and the National Database for Autism Research (NDAR). Genotypes were managed and stored centrally in the DCC along with phenotypes and were transferred from there to the national databases. Called CNV are also available for TASC samples. The TASC samples were also an important part of studies sequencing over 1,000 ASD exomes using next-generation sequencing. Five sites collaborated on this American Recovery and Reinvestment Act (ARRA)-funded initiative, where a premium was placed on making use of publicly available samples that could be matched with control samples using genome-wide SNP data. Using genome-wide SNP data generated by the AGP from the approximately 950 TASC families, over 500 unrelated TASC ASD subjects were selected for WES [3,8,20,21]. Sequencing data from these families were released in early 2012 via dbGaP. The same ARRA grant provided resources to facilitate the release of TASC samples to additional researchers through the NIMH CCGSMD. Most recently, the ASC [22] has begun to sequence whole blood-derived DNA from TASC samples, with support from the NIMH and NHGRI. WES data from 1,000 samples (primarily in the form of trios) will be released through dbGaP in 2014, and data from additional TASC samples will be released as the data become available.

TASC in the broader context
The goal identified by the IACC to examine 20,000 ASD probands for genetic factors is consistent with studies in other complex psychiatric disorders. As noted above, both the prevalence and the heritability of schizophrenia are in the same range as is seen for ASD [16]. In addition, like ASD, schizophrenia risk includes CNV as a source of the risk architecture [35,36]. The PGC has published a GWAS with approximately 18,000 schizophrenia samples and identified over 10 SNPs with very modest effect sizes [37], with a larger study (with twice the number of affected individuals) now being completed. The PGC study in ASD, which includes all available SNP data in ASD (and virtually all AGP and TASC data), accounts for over 5,000 samples. With samples of this size, it is not yet possible to determine whether multiple common SNPs with very low effect sizes are part of the ASD risk, although it is clear that common SNPs with odds ratios greater than 1.3 can be excluded [38]. Even considering rare variation with major effect, large sample sizes are needed, precisely because the variants are rare. In short, there is every reason to think that the target of 20,000 proposed by the IACC is required and that this sample size might begin to achieve the associated crucial goal of the IACC, which is to capture a significant proportion of the genetic risk for ASD.
TASC shares similarities with other large-scale sample collections. These include the Autism Genetic Resource Exchange (AGRE) [39], a network originally funded by Cure Autism Now (CAN) and now integrated into Autism Speaks, and the Simons Simplex Collection [40], funded by the Simons Foundation (see Table 4). In addition, the NIMH has samples from prior collections, and efforts are being made to collect additional samples. Collectively, these important efforts can, over the next several years, provide sample sizes that approach those in schizophrenia and provide greater power for the broader analyses that will be required to fully comprehend the relationship between genotype and phenotype.

Conclusions
There has been remarkable progress in ASD genetics in the past several years, driven to no small degree by the availability of large well-characterized samples. Relevant genetic findings can now be made in over 25% of ASD cases, using Fragile X testing, chromosome microarray (CMA) and WES. Fragile X testing and CMA can identify genetic findings in more than 10% of cases. With WES, studies in the past two years have shown that deleterious de novo, recessive and X-linked mutations are observed in some 12% of subjects, and WES identifies small contributory CNV in some 7% of subjects [3,4,8-11,41,42]. Moreover, recently identified mutations and CNVs that are associated with high risk for ASD are leading to a better understanding of ASD pathogenesis and to model systems for ASD. Identification of high-risk genetic variants can provide an etiological diagnosis and opportunities for genetic counseling, an emerging important aspect in ASD care [43]. In some examples of monogenic forms of ASD, including Fragile X syndrome and tuberous sclerosis, model systems have led to clinical trials with novel therapeutics, highlighting the translational nature of gene discovery in ASD research. Explaining a large proportion of the genetic risk for ASD over the next five years is an achievable goal, but will require larger collections of well-characterized samples. The TASC research group and the TASC samples represent a successful collaboration and public-private partnership that enhanced the number of samples available through the NIMH in less than three years. The phenotypic and genetic data generated by the AGP and others add unique value to this collection. Furthermore, the use of these samples in many ongoing studies will continue to add value to this collection. Similar approaches, leveraging expert sites and ongoing studies, represent an important path towards further enhancing the available ASD sample set.
LG: conception and design; data collection; analysis; manuscript writing; critical revision; overall study coordination; financial support. DG: data collection. AK: data collection; final approval of manuscript. LS: data collection; final approval of manuscript. AT: data collection. SB: data collection. GH: data collection. ML-S: data collection. FL: data collection. JMG: data collection. PC: data collection. SJG: data collection; final approval of manuscript. WMcM: overall study coordination. JM: data collection; analysis. JG: conception and design. MP-V: conception and design. ED: data collection. SS: data collection. CJMcD: data collection. DJP: data collection. JA: conception and design; data collection. GSC: data collection. JLH: conception and design; data collection. MP: data collection. ALN: data collection. CC: data collection. PF: conception and design. All authors read and approved the final manuscript.