PhD, Yale University, Molecular, Cellular and Developmental Biology (2007)
I have over a decade's worth of experience in developing and applying high-throughput and high-resolution genomics analysis tools and procedures, in particular in the context of studying genomic sequence variation in brain development and function.
I have been involved on numerous occasions in using a large-scale and high-throughput setup for genomics analyses as well as carrying out analyses over several levels of genomics and epigenomics information. This includes participation in the ENCODE and 1000 Genomes projects, for the latter as a member of both the analytical and structural variation groups.
I have experience with developing and applying state-of-the-art and emerging genomics and epigenomics technologies (array and next-generation-sequencing based) for the analysis of gene expression, genomic DNA sequence and structure, DNA methylation and chromatin modification, in human cells and human cell culture systems, including stem cell culture models. For example, I was co-first author of the paper in Science (Korbel, Urban, Affourtit et al., 2007, PMID 17901297) on developing next-generation-sequencing based paired-end mapping of CNVs and SVs, an approach that is now a standard part of whole-human-genome sequencing projects such as the 1000 Genomes Project. Paired-end mapping is also a critical component of advanced RNA-Seq approaches, mapping of transposable elements and the study of long-range chromatin interactions using the Hi-C method.
Two main, and connected, directions of research in my laboratory are the investigation of the molecular effects of large genome variants during neuronal development using iPSC model systems and the study of the nature and effects of somatic genome variation in the brain using tissue culture models and primary tissue samples.
Heterozygous NRXN1 deletions constitute the most prevalent currently known single-gene mutation associated with schizophrenia, and additionally predispose to multiple other neurodevelopmental disorders. Engineered heterozygous NRXN1 deletions impaired neurotransmitter release in human neurons, suggesting a synaptic pathophysiological mechanism. Utilizing this observation for drug discovery, however, requires confidence in its robustness and validity. Here, we describe a multicenter effort to test the generality of this pivotal observation, using independent analyses at two laboratories of patient-derived and newly engineered human neurons with heterozygous NRXN1 deletions. Using neurons transdifferentiated from induced pluripotent stem cells that were derived from schizophrenia patients carrying heterozygous NRXN1 deletions, we observed the same synaptic impairment as in engineered NRXN1-deficient neurons. This impairment manifested as a large decrease in spontaneous synaptic events, in evoked synaptic responses, and in synaptic paired-pulse depression. Nrxn1-deficient mouse neurons generated from embryonic stem cells by the same method as human neurons did not exhibit impaired neurotransmitter release, suggesting a human-specific phenotype. Human NRXN1 deletions produced a reproducible increase in the levels of CASK, an intracellular NRXN1-binding protein, and were associated with characteristic gene-expression changes. Thus, heterozygous NRXN1 deletions robustly impair synaptic function in human neurons regardless of genetic background, enabling future drug discovery efforts.
View details for DOI 10.1073/pnas.2025598118
View details for PubMedID 34035170
The obstetrical conditions placenta accreta spectrum (PAS) and placenta previa are a significant source of pregnancy-associated morbidity and mortality, yet the specific molecular and cellular underpinnings of these conditions are not known. In this study, we identified misregulated gene expression patterns in tissues from placenta previa and percreta (the most extreme form of PAS) compared with control cases. By comparing this gene set with existing placental single-cell and bulk RNA-Seq datasets, we show that the upregulated genes predominantly mark extravillous trophoblasts. We performed immunofluorescence on several candidate molecules and found that PRG2 and AQPEP protein levels are upregulated in both the fetal membranes and the placental disk in both conditions. While this increased AQPEP expression remains restricted to trophoblasts, PRG2 is mislocalized and is found throughout the fetal membranes. Using a larger patient cohort with a diverse set of gestationally aged-matched controls, we validated PRG2 as a marker for both previa and PAS and AQPEP as a marker for only previa in the fetal membranes. Our findings suggest that the extraembryonic tissues surrounding the conceptus, including both the fetal membranes and the placental disk, harbor a signature of previa and PAS that is characteristic of EVTs and that may reflect increased trophoblast invasiveness.
View details for DOI 10.1093/biolre/ioab068
View details for PubMedID 33982062
BACKGROUND: Post-zygotic mutations incurred during DNA replication, DNA repair, and other cellular processes lead to somatic mosaicism. Somatic mosaicism is an established cause of various diseases, including cancers. However, detecting mosaic variants in DNA from non-cancerous somatic tissues poses significant challenges, particularly if the variants are present only in a small fraction of cells. RESULTS: Here, the Brain Somatic Mosaicism Network conducts a coordinated, multi-institutional study to examine the ability of existing methods to detect simulated somatic single-nucleotide variants (SNVs) in DNA mixing experiments, generate multiple replicates of whole-genome sequencing data from the dorsolateral prefrontal cortex, other brain regions, dura mater, and dural fibroblasts of a single neurotypical individual, devise strategies to discover somatic SNVs, and apply various approaches to validate somatic SNVs. These efforts lead to the identification of 43 bona fide somatic SNVs that range in variant allele fractions from ~0.005 to ~0.28. Guided by these results, we devise best practices for calling mosaic SNVs from 250× whole-genome sequencing data in the accessible portion of the human genome that achieve 90% specificity and sensitivity. Finally, we demonstrate that analysis of multiple bulk DNA samples from a single individual allows the reconstruction of early developmental cell lineage trees. CONCLUSIONS: This study provides a unified set of best practices to detect somatic SNVs in non-cancerous tissues. The data and methods are freely available to the scientific community and should serve as a guide to assess the contributions of somatic SNVs to neuropsychiatric diseases.
View details for DOI 10.1186/s13059-021-02285-3
View details for PubMedID 33781308
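To give a feel for why low-fraction mosaic variants are hard to detect even at 250× coverage, the following minimal sketch computes the chance of sampling at least a given number of variant-supporting reads under a simple binomial model. It is an illustration only, not the network's calling method: the function name and the threshold of three supporting reads are hypothetical, and the model assumes independent reads and ignores sequencing error and mapping bias.

```python
from math import comb


def prob_at_least_k_alt_reads(depth: int, vaf: float, k: int) -> float:
    """Probability of observing at least k variant-supporting reads.

    Assumes each of `depth` reads independently carries the variant
    with probability `vaf` (simple binomial sampling, no error model).
    """
    # Sum the binomial probabilities of seeing fewer than k variant reads.
    p_fewer = sum(
        comb(depth, i) * vaf**i * (1 - vaf) ** (depth - i) for i in range(k)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # Allele fractions at the two ends of the range quoted in the abstract.
    for vaf in (0.005, 0.28):
        p = prob_at_least_k_alt_reads(depth=250, vaf=vaf, k=3)
        print(f"VAF {vaf}: P(>=3 supporting reads at 250x) = {p:.3f}")
```

Under these assumptions, a variant at an allele fraction of 0.005 yields three or more supporting reads in only about 13% of samplings at 250× depth, whereas a variant at 0.28 is sampled essentially every time.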
Retrotransposons can cause somatic genome variation in the human nervous system, which is hypothesized to have relevance to brain development and neuropsychiatric disease. However, the detection of individual somatic mobile element insertions presents a difficult signal-to-noise problem. Using a machine-learning method (RetroSom) and deep whole-genome sequencing, we analyzed L1 and Alu retrotransposition in sorted neurons and glia from human brains. We characterized two brain-specific L1 insertions in neurons and glia from a donor with schizophrenia. There was anatomical distribution of the L1 insertions in neurons and glia across both hemispheres, indicating retrotransposition occurred during early embryogenesis. Both insertions were within the introns of genes (CNNM2 and FRMD4A) inside genomic loci associated with neuropsychiatric disorders. Proof-of-principle experiments revealed these L1 insertions significantly reduced gene expression. These results demonstrate that RetroSom has broad applications for studies of brain development and may provide insight into the possible pathological effects of somatic retrotransposition.
View details for DOI 10.1038/s41593-020-00767-4
View details for PubMedID 33432196
View details for Web of Science ID 000572825800109
ChIP-seq is one of the core experimental resources available to understand genome-wide epigenetic interactions and identify the functional elements associated with diseases. The analysis of ChIP-seq data is important but poses a difficult computational challenge, due to the presence of irregular noise and bias on various levels. Although many peak-calling methods have been developed, the current computational tools still require, in some cases, human manual inspection using data visualization. However, the huge volumes of ChIP-seq data make it almost impossible for human researchers to manually uncover all the peaks. Recently developed convolutional neural networks (CNN), which are capable of achieving human-like classification accuracy, can be applied to this challenging problem. In this study, we design a novel supervised learning approach for identifying ChIP-seq peaks using CNNs, and integrate it into a software pipeline called CNN-Peaks. We use data labeled by human researchers who annotate the presence or absence of peaks in some genomic segments, as training data for our model. The trained model is then applied to predict peaks in previously unseen genomic segments from multiple ChIP-seq datasets including benchmark datasets commonly used for validation of peak calling methods. We observe a performance superior to that of previous methods.
View details for DOI 10.1038/s41598-020-64655-4
View details for PubMedID 32404971
View details for Web of Science ID 000542687800145
In both Turner syndrome (TS) and Klinefelter syndrome (KS) copy number aberrations of the X chromosome lead to various developmental symptoms. We report a comparative analysis of TS vs. KS regarding differences at the genomic network level measured in primary samples by analyzing gene expression, DNA methylation, and chromatin conformation. X-chromosome inactivation (XCI) silences transcription from one X chromosome in female mammals, on which most genes are inactive, and some genes escape from XCI. In TS, almost all differentially expressed escape genes are down-regulated but most differentially expressed inactive genes are up-regulated. In KS, differentially expressed escape genes are up-regulated while the majority of inactive genes appear unchanged. Interestingly, 94 differentially expressed genes (DEGs) overlapped between the TS vs. female and KS vs. male comparisons, and these almost uniformly display expression changes in opposite directions. DEGs on the X chromosome and the autosomes are coexpressed in both syndromes, indicating that there are molecular ripple effects of the changes in X chromosome dosage. Six potential candidate genes (RPS4X, SEPT6, NKRF, CXorf57, NAA10, and FLNA) for KS are identified on Xq, as well as candidate central genes on Xp for TS. Only promoters of inactive genes are differentially methylated in both syndromes while escape gene promoters remain unchanged. The intrachromosomal contact map of the X chromosome in TS exhibits the structure of an active X chromosome. The discovery of shared DEGs indicates the existence of common molecular mechanisms for gene regulation in TS and KS that transmit the gene dosage changes to the transcriptome.
View details for DOI 10.1073/pnas.1910003117
View details for PubMedID 32071206
Early life adversity and insecure attachment style are known risk factors for perinatal depression. The biological pathways linking these experiences, however, have not yet been elucidated. We hypothesized that overlap in patterns of DNA methylation in association with each of these phenomena could identify genes and pathways of importance. Specifically, we wished to distinguish between allostatic-load and role-transition hypotheses of perinatal depression. We conducted a large-scale analysis of methylation patterns across 5 × 10^6 individual CG dinucleotides in 54 women participating in a longitudinal prospective study of perinatal depression, using clustering-based criteria for significance to control for multiple comparisons. We identified 1580 regions in which methylation density was associated with childhood adversity, 3 in which methylation density was associated with insecure attachment style, and 6 in which methylation density was associated with perinatal depression. Shorter telomeres were observed in association with childhood trauma but not with perinatal depression or attachment insecurity. A detailed analysis of methylation density in the oxytocin receptor gene revealed similar patterns of DNA methylation in association with perinatal depression and with insecure attachment style, while childhood trauma was associated with a distinct methylation pattern in this gene. Clinically, attachment style was strongly associated with depression only in pregnancy and the early postpartum, whereas the association of childhood adversity with depression was time-invariant.
We concluded that the broad DNA methylation signature and reduced telomere length associated with childhood adversity could indicate increased allostatic load across multiple body systems, whereas perinatal depression and attachment insecurity may be narrower phenotypes with more limited DNA methylation signatures outside the CNS, and no apparent association with telomere length or, by extension, allostatic load. In contrast, the finding of matching DNA methylation patterns within the oxytocin receptor gene for perinatal depression and attachment insecurity is consistent with the theory that the perinatal period is a time of activation of existing attachment schemas for the purpose of structuring the mother-child relationship, and that such activation may occur in part through specific patterns of methylation of the oxytocin receptor gene.
View details for DOI 10.1038/s41398-020-0703-3
View details for PubMedID 32066670
Somatic mosaicism, manifesting as single nucleotide variants (SNVs), mobile element insertions and structural changes in the DNA, is a common phenomenon in human brain cells, with potential functional consequences. Using a clonal approach, we previously detected 200-400 mosaic SNVs per cell in three human fetal brains (15 to 21 weeks post-conception). However, structural variation in the human fetal brain has not yet been investigated. Here, we discover and validate four mosaic structural variants (SVs) in the same brains and resolve their precise breakpoints. The SVs were of kilobase scale and complex, consisting of deletion(s) and rearranged genomic fragments, which sometimes originated from different chromosomes. Sequences at the breakpoints of these rearrangements had microhomologies, suggesting their origin from replication errors. One SV was found in two clones and we timed its origin to ~14 weeks post-conception. No large scale mosaic copy number variants (CNVs) were detectable in normal fetal human brains, suggesting that previously reported megabase-scale CNVs in neurons arise at later stages of development. By reanalysis of public single nuclei data from adult brain neurons, we detected an extra-chromosomal circular DNA event. Our study reveals the existence of mosaic SVs in the developing human brain, likely arising from cell proliferation during mid-neurogenesis. Although relatively rare compared to SNVs, and present in ~10% neurons, SVs in developing human brain affect a comparable number of bases in the genome (~6,200 vs ~4,000 bps), implying that they may have similar functional consequences.
View details for DOI 10.1101/gr.262667.120
View details for PubMedID 33122304
The 15q13.3 microdeletion is associated with several neuropsychiatric disorders, including autism and schizophrenia. Previous association and functional studies have investigated the potential role of several genes within the deletion in neuronal dysfunction, but the molecular effects of the deletion as a whole remain largely unknown. Induced pluripotent stem cells, from 3 patients with the 15q13.3 microdeletion and 3 control subjects, were generated and converted into induced neurons. We analyzed the effects of the 15q13.3 microdeletion on genome-wide gene expression, DNA methylation, chromatin accessibility, and sensitivity to cisplatin-induced DNA damage. Furthermore, we measured gene expression changes in induced neurons with CRISPR (clustered regularly interspaced short palindromic repeats) knockouts of individual 15q13.3 microdeletion genes. In both induced pluripotent stem cells and induced neurons, gene copy number change within the 15q13.3 microdeletion was accompanied by significantly decreased gene expression and no compensatory changes in DNA methylation or chromatin accessibility, supporting the model that haploinsufficiency of genes within the deleted region drives the disorder. Furthermore, we observed global effects of the microdeletion on the transcriptome and epigenome, with disruptions in several neuropsychiatric disorder-associated pathways and gene families, including Wnt signaling, ribosome function, DNA binding, and clustered protocadherins. Individual gene knockouts mirrored many of the observed changes in an overlapping fashion between knockouts. Our multiomics analysis of the 15q13.3 microdeletion revealed downstream effects in pathways previously associated with neuropsychiatric disorders and indications of interactions between genes within the deletion. This molecular systems analysis can be applied to other chromosomal aberrations to further our etiological understanding of neuropsychiatric disorders.
View details for DOI 10.1016/j.biopsych.2020.06.021
View details for PubMedID 32919612
During pregnancy, extravillous trophoblasts (EVTs) invade the maternal decidua and remodel the local vasculature to establish blood supply for the growing fetus. Compromised EVT-function has been linked to aberrant pregnancy associated with maternal and fetal morbidity and mortality. However, metabolic features of this invasive trophoblast subtype are largely unknown. Using primary human trophoblasts, isolated from first trimester placenta tissues, we show that cellular cholesterol homeostasis is differentially regulated in EVTs compared to villous cytotrophoblasts. Utilizing RNA-seq, gene set-enrichment analysis and functional validation we provide evidence that EVTs display increased levels of free and esterified cholesterol. In line, EVTs are characterized by increased expression of the HDL-receptor SR-BI and reduced expression of the liver X-receptor (LXR) and its target genes. We further reveal that EVTs express elevated levels of HSD3B1 (a rate-limiting enzyme in progesterone synthesis) and are capable of secreting progesterone. Increasing cholesterol export by LXR-activation reduced progesterone secretion in an ABCA1-dependent manner. Importantly, HSD3B1 expression was decreased in EVTs of idiopathic recurrent spontaneous abortions (RSA), pointing towards compromised progesterone metabolism in EVTs of early miscarriages. Here, we provide insights into the regulation of cholesterol and progesterone metabolism in trophoblastic subtypes and its putative relevance in human miscarriage.
View details for DOI 10.1194/jlr.P093427
View details for PubMedID 31530576
Allele-specific protein-RNA binding is an essential aspect that may reveal functional genetic variants (GVs) mediating post-transcriptional regulation. Recently, genome-wide detection of in vivo binding of RNA-binding proteins is greatly facilitated by the enhanced crosslinking and immunoprecipitation (eCLIP) method. We developed a new computational approach, called BEAPR, to identify allele-specific binding (ASB) events in eCLIP-Seq data. BEAPR takes into account crosslinking-induced sequence propensity and variations between replicated experiments. Using simulated and actual data, we show that BEAPR largely outperforms often-used count analysis methods. Importantly, BEAPR overcomes the inherent overdispersion problem of these methods. Complemented by experimental validations, we demonstrate that the application of BEAPR to ENCODE eCLIP-Seq data of 154 proteins helps to predict functional GVs that alter splicing or mRNA abundance. Moreover, many GVs with ASB patterns have known disease relevance. Overall, BEAPR is an effective method that helps to address the outstanding challenge of functional interpretation of GVs.
View details for PubMedID 30902979
HepG2 is one of the most widely used human cancer cell lines in biomedical research and one of the main cell lines of ENCODE. Although the functional genomic and epigenomic characteristics of HepG2 are extensively studied, its genome sequence has never been comprehensively analyzed and higher order genomic structural features are largely unknown. The high degree of aneuploidy in HepG2 renders traditional genome variant analysis methods challenging and partially ineffective. Correct and complete interpretation of the extensive functional genomics data from HepG2 requires an understanding of the cell line's genome sequence and genome structure. Using a variety of sequencing and analysis methods, we identified a wide spectrum of genome characteristics in HepG2: copy numbers of chromosomal segments at high resolution, SNVs and Indels (corrected for aneuploidy), regions with loss of heterozygosity, phased haplotypes extending to entire chromosome arms, retrotransposon insertions and structural variants (SVs) including complex and somatic genomic rearrangements. A large number of SVs were phased, sequence assembled and experimentally validated. We re-analyzed published HepG2 datasets for allele-specific expression and DNA methylation and assembled an allele-specific CRISPR/Cas9 targeting map. We demonstrate how deeper insights into genomic regulatory complexity are gained by adopting a genome-integrated framework.
View details for PubMedID 30864654
K562 is widely used in biomedical research. It is one of three tier-one cell lines of ENCODE and also most commonly used for large-scale CRISPR/Cas9 screens. Although its functional genomic and epigenomic characteristics have been extensively studied, its genome sequence and genomic structural features have never been comprehensively analyzed. Such information is essential for the correct interpretation and understanding of the vast troves of existing functional genomics and epigenomics data for K562. We performed and integrated deep-coverage whole-genome (short-insert), mate-pair, and linked-read sequencing as well as karyotyping and array CGH analysis to identify a wide spectrum of genome characteristics in K562: copy numbers (CN) of aneuploid chromosome segments at high-resolution, SNVs and indels (both corrected for CN in aneuploid regions), loss of heterozygosity, megabase-scale phased haplotypes often spanning entire chromosome arms, structural variants (SVs), including small and large-scale complex SVs and nonreference retrotransposon insertions. Many SVs were phased, assembled, and experimentally validated. We identified multiple allele-specific deletions and duplications within the tumor suppressor gene FHIT. Taking aneuploidy into account, we reanalyzed K562 RNA-seq and whole-genome bisulfite sequencing data for allele-specific expression and allele-specific DNA methylation. We also show examples of how deeper insights into regulatory complexity are gained by integrating genomic variant information and structural context with functional genomics and epigenomics data. Furthermore, using K562 haplotype information, we produced an allele-specific CRISPR targeting map. This comprehensive whole-genome analysis serves as a resource for future studies that utilize K562 as well as a framework for the analysis of other cancer genomes.
View details for PubMedID 30737237
To understand the health impact of long-duration spaceflight, one identical twin astronaut was monitored before, during, and after a 1-year mission onboard the International Space Station; his twin served as a genetically matched ground control. Longitudinal assessments identified spaceflight-specific changes, including decreased body mass, telomere elongation, genome instability, carotid artery distension and increased intima-media thickness, altered ocular structure, transcriptional and metabolic changes, DNA methylation changes in immune and oxidative stress-related pathways, gastrointestinal microbiota alterations, and some cognitive decline postflight. Although average telomere length, global gene expression, and microbiome changes returned to near preflight levels within 6 months after return to Earth, increased numbers of short telomeres were observed and expression of some genes was still disrupted. These multiomic, molecular, physiological, and behavioral datasets provide a valuable roadmap of the putative health risks for future human spaceflight.
View details for PubMedID 30975860
Identifying structural variation (SV) is essential for genome interpretation but has been historically difficult due to limitations inherent to available genome technologies. Detection methods that use ensemble algorithms and emerging sequencing technologies have enabled the discovery of thousands of SVs, uncovering information about their ubiquity, relationship to disease and possible effects on biological mechanisms. Given the variability in SV type and size, along with unique detection biases of emerging genomic platforms, multiplatform discovery is necessary to resolve the full spectrum of variation. Here, we review modern approaches for investigating SVs and proffer that, moving forwards, studies integrating biological information with detection will be necessary to comprehensively understand the impact of SV in the human genome.
View details for DOI 10.1038/s41576-019-0180-9
View details for PubMedID 31729472
Slow-wave sleep and rapid eye movement (or paradoxical) sleep have been found in mammals, birds and lizards, but it is unclear whether these neuronal signatures are found in non-amniotic vertebrates. Here we develop non-invasive fluorescence-based polysomnography for zebrafish, and show (using unbiased, brain-wide activity recording coupled with assessment of eye movement, muscle dynamics and heart rate) that there are at least two major sleep signatures in zebrafish. These signatures, which we term slow bursting sleep and propagating wave sleep, share commonalities with those of slow-wave sleep and paradoxical or rapid eye movement sleep, respectively. Further, we find that melanin-concentrating hormone signalling (which is involved in mammalian sleep) also regulates propagating wave sleep signatures and the overall amount of sleep in zebrafish, probably via activation of ependymal cells. These observations suggest that common neural signatures of sleep may have emerged in the vertebrate brain over 450 million years ago.
View details for DOI 10.1038/s41586-019-1336-7
View details for PubMedID 31292557
We produced an extensive collection of deep re-sequencing datasets for the Venter/HuRef genome using the Illumina massively-parallel DNA sequencing platform. The original Venter genome sequence is a very-high quality phased assembly based on Sanger sequencing. Therefore, researchers developing novel computational tools for the analysis of human genome sequence variation for the dominant Illumina sequencing technology can test and hone their algorithms by making variant calls from these Venter/HuRef datasets and then immediately confirm the detected variants in the Sanger assembly, freeing them of the need for further experimental validation. This process also applies to implementing and benchmarking existing genome analysis pipelines. We prepared and sequenced 200bp and 350bp short-insert whole-genome sequencing libraries (sequenced to 100x and 40x genomic coverages respectively) as well as 2kb, 5kb, and 12kb mate-pair libraries (49x, 122x, and 145x physical coverages respectively). Lastly, we produced a linked-read library (128x physical coverage) from which we also performed haplotype phasing.
View details for PubMedID 30561434
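The abstract above quotes sequence ("genomic") coverage for the short-insert libraries and physical coverage for the mate-pair and linked-read libraries. As a minimal sketch of how the two measures differ, the standard definitions can be written as below. The function names and the example numbers (read-pair count, read length, genome size) are hypothetical illustrations and are not taken from the Venter/HuRef datasets.

```python
def sequence_coverage(n_pairs: int, read_length: int, genome_size: int) -> float:
    """Average number of sequenced bases covering each genome position.

    Each read pair contributes two reads of `read_length` bases.
    """
    return 2 * n_pairs * read_length / genome_size


def physical_coverage(n_pairs: int, insert_size: int, genome_size: int) -> float:
    """Average number of fragments spanning each genome position.

    Counts the whole fragment between the paired reads, including the
    unsequenced middle part, so it grows with insert size.
    """
    return n_pairs * insert_size / genome_size


if __name__ == "__main__":
    # Hypothetical library: 100 million pairs of 100 bp reads, 5 kb inserts,
    # on a genome rounded to 3 Gb for easy arithmetic.
    pairs, read_len, insert, genome = 100_000_000, 100, 5_000, 3_000_000_000
    print(f"sequence coverage: {sequence_coverage(pairs, read_len, genome):.1f}x")
    print(f"physical coverage: {physical_coverage(pairs, insert, genome):.1f}x")
```

With long inserts, the same number of read pairs gives a much higher physical coverage than sequence coverage, which is why large-insert mate-pair libraries are reported with physical coverage.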
Genome amplification and cellular senescence are commonly associated with pathological processes. While physiological roles for polyploidization and senescence have been described in mouse development, controversy exists over their significance in humans. Here, we describe tetraploidization and senescence as phenomena of normal human placenta development. During pregnancy, placental extravillous trophoblasts (EVTs) invade the pregnant endometrium, termed decidua, to establish an adapted microenvironment required for the developing embryo. This process is critically dependent on continuous cell proliferation and differentiation, which is thought to follow the classical model of cell cycle arrest prior to terminal differentiation. Strikingly, flow cytometry and DNAseq revealed that EVT formation is accompanied by a genome-wide polyploidization, independent of mitotic cycles. DNA replication in these cells was analysed by a fluorescent cell-cycle indicator reporter system, cell cycle marker expression and EdU incorporation. Upon invasion into the decidua, EVTs widely lose their replicative potential and enter a senescent state characterized by high senescence-associated (SA) beta-galactosidase activity, induction of a SA secretory phenotype as well as typical metabolic alterations. Furthermore, we show that the shift from endocycle-dependent genome amplification to growth arrest is disturbed in androgenic complete hydatidiform moles (CHM), a hyperplastic pregnancy disorder associated with increased risk of developing choriocarcinoma. Senescence is decreased in CHM-EVTs, accompanied by exacerbated endoreduplication and hyperploidy. We propose induction of cellular senescence as a ploidy limiting mechanism during normal human placentation and unravel a link between excessive polyploidization and reduced senescence in CHM.
View details for PubMedID 30312291
Microduplication of chromosome 1q21.1 is observed in ~0.03% of adults. It has a highly variable, incompletely penetrant phenotype that can include intellectual disability, global developmental delay, specific learning disabilities, autism, schizophrenia, heart anomalies and dysmorphic features. We evaluated a 10-year-old male with a 1q21.1 duplication by CGH microarray. He presented with major attention deficits, phonological dysphasia, poor fine motor skills, dysmorphia and mild autistic features, but not the typical macrocephaly. Neuropsychiatric evaluation demonstrated a novel phenotype: an unusually large discrepancy between non-verbal capacities (borderline-impaired WISC-IV index scores of 70 for Working Memory and 68 for Processing Speed) vs. strong verbal skills - scores of 126 for Verbal Comprehension (superior) and 111 for Perceptual Reasoning (normal). HYDIN2 has been hypothesized to underlie macrocephaly and perhaps cognitive deficits in this syndrome, but assessment of HYDIN2 copy number by microarray is difficult because of extensive segmental duplications. We performed whole-genome sequencing which supported HYDIN2 duplication (chr1:146,370,001-148,590,000, 2.22 Mb, hg38). To evaluate copy number more rigorously we developed droplet digital PCR assays of HYDIN2 (targeting unique 1 kb and 6 kb insertions) and its paralog HYDIN (targeting a unique 154 bp segment outside the HYDIN2 overlap). In an independent cohort, ddPCR was concordant with previous microarray data. Duplication of HYDIN2 was confirmed in the patient by ddPCR. This case demonstrates that a large discrepancy of verbal and non-verbal abilities can occur in 1q21.1 duplication syndrome, but it remains unclear whether this has a specific genomic basis. These ddPCR assays may be useful for future research on HYDIN2 copy number.
View details for PubMedID 30155272
BACKGROUND: Copy number variation (CNV) analysis is an integral component of the study of human genomes in both research and clinical settings. Array-based CNV analysis is the current first-tier approach in clinical cytogenetics. Decreasing costs in high-throughput sequencing and cloud computing have opened doors for the development of sequencing-based CNV analysis pipelines with fast turnaround times. We carry out a systematic and quantitative comparative analysis for several low-coverage whole-genome sequencing (WGS) strategies to detect CNV in the human genome. METHODS: We compared the CNV detection capabilities of WGS strategies (short insert, 3 kb insert mate pair and 5 kb insert mate pair) each at 1×, 3× and 5× coverages relative to each other and to 17 currently used high-density oligonucleotide arrays. For benchmarking, we used a set of gold standard (GS) CNVs generated for the 1000 Genomes Project CEU subject NA12878. RESULTS: Overall, low-coverage WGS strategies detect drastically more GS CNVs compared with arrays and are accompanied by smaller percentages of CNV calls without validation. Furthermore, we show that WGS (at ≥1× coverage) is able to detect all seven GS deletion CNVs >100 kb in NA12878, whereas only one is detected by most arrays. Lastly, we show that the much larger 15 Mbp Cri du chat deletion can be readily detected with short-insert paired-end WGS at even just 1× coverage. CONCLUSIONS: CNV analysis using low-coverage WGS is efficient and outperforms the array-based analysis that is currently used for clinical cytogenetics.
View details for PubMedID 30061371
View details for Web of Science ID 000429541800083
Somatic mosaicism in the human brain may alter function of individual neurons. We analyzed genomes of single cells from the forebrains of three human fetuses (15 to 21 weeks postconception) using clonal cell populations. We detected 200 to 400 single-nucleotide variations (SNVs) per cell. SNV patterns resembled those found in cancer cell genomes, indicating a role of background mutagenesis in cancer. SNVs with a frequency of >2% in brain were also present in the spleen, revealing a pregastrulation origin. We reconstructed cell lineages for the first five postzygotic cleavages and calculated a mutation rate of ~1.3 mutations per division per cell. Later in development, during neurogenesis, the mutation spectrum shifted toward oxidative damage, and the mutation rate increased. Both neurogenesis and early embryogenesis exhibit substantially more mutagenesis than adulthood.
View details for PubMedID 29217587
Here, we describe approaches using droplet digital polymerase chain reaction (ddPCR) to validate and quantify somatic mosaic events contributed by transposable-element insertions, copy-number variants, and single-nucleotide variants. In the ddPCR assay, sample or template DNA is partitioned into tens of thousands of individual droplets such that when DNA input is low, the vast majority of droplets contains no more than one copy of template DNA. PCR takes place in each individual droplet and produces a fluorescent readout to indicate the presence or absence of the target of interest allowing for the accurate "counting" of the number of copies present in the sample. The number of partitions is large enough to assay somatic mosaic events with frequencies down to less than 1%.
View details for PubMedID 29717444
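The counting principle in the ddPCR abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes template molecules are distributed across droplets according to Poisson statistics (the standard assumption in digital PCR), and the function names and droplet counts are hypothetical.

```python
import math


def copies_per_droplet(positive_droplets, total_droplets):
    """Estimate the mean number of template copies per droplet.

    Assumes Poisson loading: the fraction of negative droplets equals
    exp(-lambda), so lambda = -ln(1 - positive / total).
    """
    if total_droplets <= 0:
        raise ValueError("total_droplets must be positive")
    if not 0 <= positive_droplets < total_droplets:
        # If every droplet is positive the estimate is unbounded.
        raise ValueError("need 0 <= positive_droplets < total_droplets")
    return -math.log(1.0 - positive_droplets / total_droplets)


def variant_fraction(variant_positive, reference_positive, total_droplets):
    """Estimate the fraction of template copies that carry the variant.

    Assumes separate fluorescent channels report the variant and the
    reference allele in the same set of droplets (hypothetical setup).
    """
    lam_variant = copies_per_droplet(variant_positive, total_droplets)
    lam_reference = copies_per_droplet(reference_positive, total_droplets)
    return lam_variant / (lam_variant + lam_reference)


if __name__ == "__main__":
    # Hypothetical run: 20,000 droplets, 100 variant-positive,
    # 10,000 reference-positive.
    print(variant_fraction(100, 10_000, 20_000))
```

With tens of thousands of droplets, even a handful of variant-positive droplets yields a non-zero estimate, which is why low-frequency mosaic events can be quantified this way.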
Large copy number variants (CNVs) in the human genome are strongly associated with common neurodevelopmental, neuropsychiatric disorders such as schizophrenia and autism. Here we report on the epigenomic effects of the prominent large deletion CNVs on chromosome 22q11.2 and on chromosome 1q21.1. We use Hi-C analysis of long-range chromosome interactions, including haplotype-specific Hi-C analysis, ChIP-Seq analysis of regulatory histone marks, and RNA-Seq analysis of gene expression patterns. We observe changes on all the levels of analysis, within the deletion boundaries, in the deletion flanking regions, along chromosome 22q, and genome wide. We detect gene expression changes as well as pronounced and multilayered effects on chromatin states, chromosome folding and on the topological domains of the chromatin, that emanate from the large CNV locus. These findings suggest basic principles of how such large genomic deletions can alter nuclear organization and affect genomic molecular activity.
View details for PubMedID 30559385
View details for PubMedID 29220033
View details for Web of Science ID 000413843800357
View details for Web of Science ID 000413843800380
Neuropsychiatric disorders have a complex genetic architecture. Human genetic population-based studies have identified numerous heritable sequence and structural genomic variants associated with susceptibility to neuropsychiatric disease. However, these germline variants do not fully account for disease risk. During brain development, progenitor cells undergo billions of cell divisions to generate the ~80 billion neurons in the brain. The failure to accurately repair DNA damage arising during replication, transcription, and cellular metabolism amid this dramatic cellular expansion can lead to somatic mutations. Somatic mutations that alter subsets of neuronal transcriptomes and proteomes can, in turn, affect cell proliferation and survival and lead to neurodevelopmental disorders. The long life span of individual neurons and the direct relationship between neural circuits and behavior suggest that somatic mutations in small populations of neurons can significantly affect individual neurodevelopment. The Brain Somatic Mosaicism Network has been founded to study somatic mosaicism both in neurotypical human brains and in the context of complex neuropsychiatric disorders.
View details for DOI 10.1126/science.aal1641
View details for Web of Science ID 000400143000042
View details for PubMedID 28450582
To describe the frequency and characteristics of developmental regression in a sample of 50 patients with Phelan-McDermid Syndrome (PMS) and investigate the possibility of association between regression, epilepsy, and electroencephalogram (EEG) abnormalities and deletion size. The Autism Diagnostic Interview-Revised (ADI-R) was used to evaluate regression in patients with a confirmed diagnosis of PMS. Information on seizure history and EEGs was obtained from medical record review. Deletion size was determined by DNA microarray. A history of regression at any age was present in 43% of all patients. Among those exhibiting regression, 67% had onset after the age of 30 months, affecting primarily motor and self-help skills. In 63% of all patients there was a history of seizures and a history of abnormal EEG was also present in 71%. No significant associations were found between regression and seizures or EEG abnormalities. Deletion size was significantly associated with EEG abnormalities, but not with regression or seizures. This study found a high rate of regression in PMS. In contrast to regression in autism, which often occurs earlier in development and affects language and social skills, we found regression in PMS most frequently has an onset in mid-childhood, affecting motor and self-help skills. We also found high rates of seizures and abnormal EEGs in patients with PMS. However, a history of abnormal EEG and seizures was not associated with an increased risk of regression. Larger deletion sizes were found to be significantly associated with a history of abnormal EEG.
View details for DOI 10.1016/j.jpsychires.2017.03.010
View details for PubMedID 28346892
Few studies have been conducted to understand post-zygotic accumulation of mutations in cells of the healthy human body. We reprogrammed 32 skin fibroblast cells from families of donors into human induced pluripotent stem cell (hiPSC) lines. The clonal nature of hiPSC lines allows a high-resolution analysis of the genomes of the founder fibroblast cells without being confounded by the artifacts of single-cell whole-genome amplification. We estimate that on average a fibroblast cell in children has 1035 mostly benign mosaic SNVs. On average, 235 SNVs could be directly confirmed in the original fibroblast population by ultradeep sequencing, down to an allele frequency (AF) of 0.1%. More sensitive droplet digital PCR experiments confirmed more SNVs as mosaic with AF as low as 0.01%, suggesting that 1035 mosaic SNVs per fibroblast cell is the true average. Similar analyses in adults revealed no significant increase in the number of SNVs per cell, suggesting that a major fraction of mosaic SNVs in fibroblasts arises during development. Mosaic SNVs were distributed uniformly across the genome and were enriched in a mutational signature previously observed in cancers and in de novo variants and which, we hypothesize, is a hallmark of normal cell proliferation. Finally, AF distribution of mosaic SNVs had distinct narrow peaks, which could be a characteristic of clonal cell selection, clonal expansion, or both. These findings reveal a large degree of somatic mosaicism in healthy human tissues, link de novo and cancer mutations to somatic mosaicism, and couple somatic mosaicism with cell proliferation.
View details for DOI 10.1101/gr.215517.116
View details for PubMedID 28235832
View details for PubMedCentralID PMC5378170
High-resolution microarray technology is routinely used in basic research and clinical practice to efficiently detect copy number variants (CNVs) across the entire human genome. A new generation of arrays combining high probe densities with optimized designs will comprise essential tools for genome analysis in the coming years. We systematically compared the genome-wide CNV detection power of all 17 available array designs from the Affymetrix, Agilent, and Illumina platforms by hybridizing the well-characterized genome of 1000 Genomes Project subject NA12878 to all arrays, and performing data analysis using both manufacturer-recommended and platform-independent software. We benchmarked the resulting CNV call sets from each array using a gold standard set of CNVs for this genome derived from 1000 Genomes Project whole genome sequencing data. The arrays tested comprise both SNP and aCGH platforms with varying designs and contain between ~0.5 and ~4.6 million probes. Across the arrays CNV detection varied widely in number of CNV calls (4-489), CNV size range (~40 bp to ~8 Mbp), and percentage of non-validated CNVs (0-86%). We discovered strikingly strong effects of specific array design principles on performance. For example, some SNP array designs with the largest numbers of probes and extensive exonic coverage produced a considerable number of CNV calls that could not be validated, compared to designs with probe numbers that are sometimes an order of magnitude smaller. This effect was only partially ameliorated using different analysis software and optimizing data analysis parameters. High-resolution microarrays will continue to be used as reliable, cost- and time-efficient tools for CNV analysis. However, different applications tolerate different limitations in CNV detection. Our study quantified how these arrays differ in total number and size range of detected CNVs as well as sensitivity, and determined how each array balances these attributes.
This analysis will inform appropriate array selection for future CNV studies, and allow better assessment of the CNV-analytical power of both published and ongoing array-based genomics studies. Furthermore, our findings emphasize the importance of concurrent use of multiple analysis algorithms and independent experimental validation in array-based CNV detection studies.
View details for PubMedID 28438122
View details for PubMedCentralID PMC5402652
The prevalence of autism spectrum disorders (ASDs) is rapidly growing, yet its molecular basis is poorly understood. We used a systems approach in which ASD candidate genes were mapped onto the ubiquitous human protein complexes and the resulting complexes were characterized. The studies revealed the role of histone deacetylases (HDAC1/2) in regulating the expression of ASD orthologs in the embryonic mouse brain. Proteome-wide screens for the co-complexed subunits with HDAC1 and six other key ASD proteins in neuronal cells revealed a protein interaction network, which displayed preferential expression in fetal brain development, exhibited increased deleterious mutations in ASD cases, and were strongly regulated by FMRP and MECP2 causal for Fragile X and Rett syndromes, respectively. Overall, our study reveals molecular components in ASD, suggests a shared mechanism between the syndromic and idiopathic forms of ASDs, and provides a systems framework for analyzing complex human diseases.
View details for PubMedID 26949739
View details for DOI 10.1016/j.cels.2015.11.002
View details for Web of Science ID 000209926300009
View details for PubMedCentralID PMC4776331
The association of 46,XY disorder of sex development (DSD) with congenital diaphragmatic hernia (CDH) is rare, but has been previously described with and without other congenital anomalies. Literature review identified five cases of 46,XY DSD associated with CDH and other congenital anomalies. These five cases share characteristics including CDH, 46,XY karyotype with external female appearing or ambiguous genitalia, cardiac anomalies, and decreased life span. The present case had novel features including truncus arteriosus, bifid thymus, gut malrotation, and limb anomalies consisting of rhizomelia and adactyly. With this case report, we present a review of the literature of cases of 46,XY DSD and CDH in association with multiple congenital abnormalities. This case may represent a unique syndrome of 46,XY DSD and diaphragmatic hernia or a more severe presentation of a syndrome represented in the previously reported cases. © 2015 Wiley Periodicals, Inc.
View details for DOI 10.1002/ajmg.a.37037
View details for Web of Science ID 000355276700028
Large copy number variants (CNVs) are strongly associated with morphogenetic processes and common neurodevelopmental disorders. A new study uses the example of Williams-Beuren syndrome (WBS) and Williams-Beuren region duplication syndrome to illustrate how induced pluripotent stem cells (iPSCs) and next-generation genomics can lead to a better understanding of complex genetics.
View details for DOI 10.1038/ng.3204
View details for PubMedID 25627897
A study of genome-wide gene expression in major depressive disorder (MDD) was undertaken in a large population-based sample to determine whether altered expression levels of genes and pathways could provide insights into biological mechanisms that are relevant to this disorder. Gene expression studies have the potential to detect changes that may be because of differences in common or rare genomic sequence variation, environmental factors or their interaction. We recruited a European ancestry sample of 463 individuals with recurrent MDD and 459 controls, obtained self-report and semi-structured interview data about psychiatric and medical history and other environmental variables, sequenced RNA from whole blood and genotyped a genome-wide panel of common single-nucleotide polymorphisms. We used analytical methods to identify MDD-related genes and pathways using all of these sources of information. In analyses of association between MDD and expression levels of 13,857 single autosomal genes, accounting for multiple technical, physiological and environmental covariates, a significant excess of low P-values was observed, but there was no significant single-gene association after genome-wide correction. Pathway-based analyses of expression data detected significant association of MDD with increased expression of genes in the interferon α/β signaling pathway. This finding could not be explained by potentially confounding diseases and medications (including antidepressants) or by computationally estimated proportions of white blood cell types. Although cause-effect relationships cannot be determined from these data, the results support the hypothesis that altered immune signaling has a role in the pathogenesis, manifestation, and/or the persistence and progression of MDD. Molecular Psychiatry advance online publication, 3 December 2013; doi:10.1038/mp.2013.161.
View details for DOI 10.1038/mp.2013.161
View details for PubMedID 24296977
Understanding the consequences of regulatory variation in the human genome remains a major challenge, with important implications for understanding gene regulation and interpreting the many disease-risk variants that fall outside of protein-coding regions. Here, we provide a direct window into the regulatory consequences of genetic variation by sequencing RNA from 922 genotyped individuals. We present a comprehensive description of the distribution of regulatory variation by the specific expression phenotypes altered, the properties of affected genes, and the genomic characteristics of regulatory variants. We detect variants influencing expression of over ten thousand genes, and through the enhanced resolution offered by RNA-sequencing, for the first time we identify thousands of variants associated with specific phenotypes including splicing and allelic expression. Evaluating the effects of both long-range intra-chromosomal and trans (cross-chromosomal) regulation, we observe modularity in the regulatory network, with three-dimensional chromosomal configuration playing a particular role in regulatory modules within each chromosome. We also observe a significant depletion of regulatory variants affecting central and critical genes, along with a trend of reduced effect sizes as variant frequency increases, providing evidence that purifying selection and buffering have limited the deleterious impact of regulatory variation on the cell. Further, generalizing beyond observed variants, we have analyzed the genomic properties of variants associated with expression and splicing and developed a Bayesian model to predict regulatory consequences of genetic variants, applicable to the interpretation of individual genomes and disease studies. Together, these results represent a critical step toward characterizing the complete landscape of human regulatory variation.
View details for DOI 10.1101/gr.155192.113
View details for PubMedID 24092820
Autism is a complex disease whose etiology remains elusive. We integrated previously and newly generated data and developed a systems framework involving the interactome, gene expression and genome sequencing to identify a protein interaction module with members strongly enriched for autism candidate genes. Sequencing of 25 patients confirmed the involvement of this module in autism, which was subsequently validated using an independent cohort of over 500 patients. Expression of this module was dichotomized with a ubiquitously expressed subcomponent and another subcomponent preferentially expressed in the corpus callosum, which was significantly affected by our identified mutations in the network center. RNA-sequencing of the corpus callosum from patients with autism exhibited extensive gene mis-expression in this module, and our immunochemical analysis showed that the human corpus callosum is predominantly populated by oligodendrocyte cells. Analysis of functional genomic data further revealed a significant involvement of this module in the development of oligodendrocyte cells in mouse brain. Our analysis delineates a natural network involved in autism, helps uncover novel candidate genes for this disease and improves our understanding of its molecular pathology.
View details for DOI 10.15252/msb.20145487
View details for PubMedID 25549968
View details for PubMedCentralID PMC4300495
Structural variation of the human genome sequence is the insertion, deletion, or rearrangement of stretches of DNA sequence sized from around 1,000 to millions of base pairs. Over the past few years, structural variation has been shown to be far more common in human genomes than previously thought. Very little is currently known about the effects of structural variation on normal child development, but such effects could be of considerable significance. This review provides an overview of the phenomenon of structural variation in the human genome sequence, describing the novel genomics technologies that are revolutionizing the way structural variation is studied and giving examples of genomic structural variations that affect child development.
View details for DOI 10.1111/cdev.12051
View details for Web of Science ID 000314112000003
View details for PubMedID 23311762
View details for Web of Science ID 000361763800021
Transcriptomic assays that measure expression levels are widely used to study the manifestation of environmental or genetic variations in cellular processes. RNA-sequencing in particular has the potential to considerably improve such understanding because of its capacity to assay the entire transcriptome, including novel transcriptional events. However, as with earlier expression assays, analysis of RNA-sequencing data requires carefully accounting for factors that may introduce systematic, confounding variability in the expression measurements, resulting in spurious correlations. Here, we consider the problem of modeling and removing the effects of known and hidden confounding factors from RNA-sequencing data. We describe a unified residual framework that encapsulates existing approaches, and using this framework, present a novel method, HCP (Hidden Covariates with Prior). HCP uses a more informed assumption about the confounding factors, and performs as well or better than existing approaches while having a much lower computational cost. Our experiments demonstrate that accounting for known and hidden factors with appropriate models improves the quality of RNA-sequencing data in two very different tasks: detecting genetic variations that are associated with nearby expression variations (cis-eQTLs), and constructing accurate co-expression networks.
View details for DOI 10.1371/journal.pone.0068141
View details for PubMedID 23874524
Reprogramming somatic cells into induced pluripotent stem cells (iPSCs) has been suspected of causing de novo copy number variation. To explore this issue, here we perform a whole-genome and transcriptome analysis of 20 human iPSC lines derived from the primary skin fibroblasts of seven individuals using next-generation sequencing. We find that, on average, an iPSC line manifests two copy number variants (CNVs) not apparent in the fibroblasts from which the iPSC was derived. Using PCR and digital droplet PCR, we show that at least 50% of those CNVs are present as low-frequency somatic genomic variants in parental fibroblasts (that is, the fibroblasts from which each corresponding human iPSC line is derived), and are manifested in iPSC lines owing to their clonal origin. Hence, reprogramming does not necessarily lead to de novo CNVs in iPSCs, because most of the line-manifested CNVs reflect somatic mosaicism in the human skin. Moreover, our findings demonstrate that clonal expansion, and iPSC lines in particular, can be used as a discovery tool to reliably detect low-frequency CNVs in the tissue of origin. Overall, we estimate that approximately 30% of the fibroblast cells have somatic CNVs in their genomes, suggesting widespread somatic mosaicism in the human body. Our study paves the way to understanding the fundamental question of the extent to which cells of the human body normally acquire structural alterations in their DNA post-zygotically.
View details for DOI 10.1038/nature11629
View details for Web of Science ID 000312488200058
View details for PubMedID 23160490
View details for PubMedCentralID PMC3532053
DNA capture technologies combined with high-throughput sequencing now enable cost-effective, deep-coverage, targeted sequencing of complete exomes. This is well suited for SNP discovery and genotyping. However, there has been little attention devoted to Copy Number Variation (CNV) detection from exome capture datasets despite the potentially high impact of CNVs in exonic regions on protein function. As members of the 1000 Genomes Project analysis effort, we investigated 697 samples in which 931 genes were targeted and sampled with 454 or Illumina paired-end sequencing. We developed a rigorous Bayesian method to detect CNVs in the genes, based on read depth within target regions. Despite substantial variability in read coverage across samples and targeted exons, we were able to identify 107 heterozygous deletions in the dataset. The experimentally determined false discovery rate (FDR) of the cleanest dataset from the Wellcome Trust Sanger Institute is 12.5%. We were able to substantially improve the FDR in a subset of gene deletion candidates that were adjacent to another gene deletion call (17 calls). The estimated sensitivity of our call-set was 45%. This study demonstrates that exonic sequencing datasets, collected both in population based and medical sequencing projects, will be a useful substrate for detecting genic CNV events, particularly deletions. Based on the number of events we found and the sensitivity of the methods in the present dataset, we estimate on average 16 genic heterozygous deletions per individual genome. Our power analysis informs ongoing and future projects about sequencing depth and uniformity of read coverage required for efficient detection.
View details for DOI 10.1186/1471-2105-13-305
View details for Web of Science ID 000314688600001
View details for PubMedID 23157288
View details for PubMedCentralID PMC3563612
Genetic variation between individuals has been extensively investigated, but differences between tissues within individuals are far less understood. It is commonly assumed that all healthy cells that arise from the same zygote possess the same genomic content, with a few known exceptions in the immune system and germ line. However, a growing body of evidence shows that genomic variation exists between differentiated tissues. We investigated the scope of somatic genomic variation between tissues within humans. Analysis of copy number variation by high-resolution array-comparative genomic hybridization in diverse tissues from six unrelated subjects reveals a significant number of intraindividual genomic changes between tissues. Many (79%) of these events affect genes. Our results have important consequences for understanding normal genetic and phenotypic variation within individuals, and they have significant implications for both the etiology of genetic diseases such as cancer and for immortalized cell lines that might be used in research and therapeutics.
View details for DOI 10.1073/pnas.1213736109
View details for Web of Science ID 000311149900070
View details for PubMedID 23043118
View details for PubMedCentralID PMC3497787
In their paper "Copy number variations in 6q14.1 and 5q13.2 are associated with alcohol dependence" Lin and colleagues report on the association between alcohol dependence and 2 duplication CNVs in the genome sequence, one containing 8 genes within its boundaries and another that contains no genes. In this commentary, I point out some of the opportunities and challenges that arise from such a finding.
View details for DOI 10.1111/j.1530-0277.2012.01915.x
View details for Web of Science ID 000308435200003
View details for PubMedID 22909245
Autosomal dominant cerebellar ataxia, deafness and narcolepsy (ADCA-DN) is characterized by late onset (30-40 years old) cerebellar ataxia, sensory neuronal deafness, narcolepsy-cataplexy and dementia. We performed exome sequencing in five individuals from three ADCA-DN kindreds and identified DNMT1 as the only gene with mutations found in all five affected individuals. Sanger sequencing confirmed the de novo mutation p.Ala570Val in one family, and showed co-segregation of p.Val606Phe and p.Ala570Val with the ADCA-DN phenotype in two other kindreds. An additional ADCA-DN kindred with a p.Gly605Ala mutation was subsequently identified. Narcolepsy and deafness were the first symptoms to appear in all pedigrees, followed by ataxia. DNMT1 is a widely expressed DNA methyltransferase maintaining methylation patterns in development, and mediating transcriptional repression by direct binding to HDAC2. It is also highly expressed in immune cells and required for the differentiation of CD4+ cells into T regulatory cells. Mutations in exon 20 of this gene were recently reported to cause hereditary sensory neuropathy with dementia and hearing loss (HSAN1). Our mutations are all located in exon 21 and in very close spatial proximity, suggesting distinct phenotypes depending on mutation location within this gene.
View details for DOI 10.1093/hmg/dds035
View details for Web of Science ID 000303333700006
View details for PubMedID 22328086
View details for PubMedCentralID PMC3465691
Accurate and efficient genome-wide detection of copy number variants (CNVs) is essential for understanding human genomic variation, genome-wide CNV association studies, cytogenetics research and diagnostics, and independent validation of CNVs identified from sequencing-based technologies. Numerous array-based platforms for CNV detection exist, utilizing array Comparative Genome Hybridization (aCGH), Single Nucleotide Polymorphism (SNP) genotyping or both. We have quantitatively assessed the abilities of twelve leading genome-wide CNV detection platforms to accurately detect Gold Standard sets of CNVs in the genome of HapMap CEU sample NA12878, and found significant differences in performance. The technologies analyzed were the NimbleGen 4.2 M, 2.1 M and 3×720 K Whole Genome and CNV focused arrays, the Agilent 1×1 M CGH and High Resolution and 2×400 K CNV and SNP+CGH arrays, the Illumina Human Omni1Quad array and the Affymetrix SNP 6.0 array. The Gold Standards used were a 1000 Genomes Project sequencing-based set of 3997 validated CNVs and an ultra high-resolution aCGH-based set of 756 validated CNVs. We found that sensitivity, total number, size range and breakpoint resolution of CNV calls were highest for CNV focused arrays. Our results are important for cost effective CNV detection and validation for both basic and clinical applications.
View details for DOI 10.1371/journal.pone.0027859
View details for Web of Science ID 000298168100021
View details for PubMedID 22140474
View details for PubMedCentralID PMC3227574
As a consequence of the accumulation of insertion events over evolutionary time, mobile elements now comprise nearly half of the human genome. The Alu, L1, and SVA mobile element families are still duplicating, generating variation between individual genomes. Mobile element insertions (MEI) have been identified as causes for genetic diseases, including hemophilia, neurofibromatosis, and various cancers. Here we present a comprehensive map of 7,380 MEI polymorphisms from the 1000 Genomes Project whole-genome sequencing data of 185 samples in three major populations detected with two detection methods. This catalog enables us to systematically study mutation rates, population segregation, genomic distribution, and functional properties of MEI polymorphisms and to compare MEI to SNP variation from the same individuals. Population allele frequencies of MEI and SNPs are described, broadly, by the same neutral ancestral processes despite vastly different mutation mechanisms and rates, except in coding regions where MEI are virtually absent, presumably due to strong negative selection. A direct comparison of MEI and SNP diversity levels suggests a differential mobile element insertion rate among populations.
View details for DOI 10.1371/journal.pgen.1002236
View details for Web of Science ID 000294297000031
View details for PubMedID 21876680
View details for PubMedCentralID PMC3158055
Recent studies have demonstrated the genetic significance of insertions, deletions, and other more complex structural variants (SVs) in the human population. With the development of the next-generation sequencing technologies, high-throughput surveys of SVs on the whole-genome level have become possible. Here we present split-read identification, calibrated (SRiC), a sequence-based method for SV detection. We start by mapping each read to the reference genome in standard fashion using gapped alignment. Then to identify SVs, we score each of the many initial mappings with an assessment strategy designed to take into account both sequencing and alignment errors (e.g. scoring more highly events gapped in the center of a read). All current SV calling methods have multilevel biases in their identifications due to both experimental and computational limitations (e.g. calling more deletions than insertions). A key aspect of our approach is that we calibrate all our calls against synthetic data sets generated from simulations of high-throughput sequencing (with realistic error models). This allows us to calculate sensitivity and the positive predictive value under different parameter-value scenarios and for different classes of events (e.g. long deletions vs. short insertions). We run our calculations on representative data from the 1000 Genomes Project. Coupling the observed numbers of events on chromosome 1 with the calibrations gleaned from the simulations (for different length events) allows us to construct a relatively unbiased estimate for the total number of SVs in the human genome across a wide range of length scales. We estimate in particular that an individual genome contains ~670,000 indels/SVs. Compared with the existing read-depth and read-pair approaches for SV identification, our method can pinpoint the exact breakpoints of SV events, reveal the actual sequence content of insertions, and cover the whole size spectrum for deletions.
Moreover, with the advent of the third-generation sequencing technologies that produce longer reads, we expect our method to be even more useful.
View details for DOI 10.1186/1471-2164-12-375
View details for Web of Science ID 000294205500001
View details for PubMedID 21787423
View details for PubMedCentralID PMC3161018
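The calibration idea in the SRiC abstract above (sensitivity and positive predictive value measured on simulated data, then used to correct observed call counts) can be sketched roughly as follows. This is a minimal illustration, not the published implementation: the `Calibration` class, the `corrected_count` function, the correction formula (observed × PPV / sensitivity), and all numbers are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Calibration:
    """Counts from a simulated data set in which the true events are known."""
    true_positives: int   # simulated events that were called
    false_positives: int  # calls that match no simulated event
    false_negatives: int  # simulated events that were missed

    @property
    def sensitivity(self) -> float:
        # Fraction of true events that the caller recovers.
        return self.true_positives / (self.true_positives + self.false_negatives)

    @property
    def ppv(self) -> float:
        # Fraction of calls that correspond to a true event.
        return self.true_positives / (self.true_positives + self.false_positives)


def corrected_count(observed_calls: int, cal: Calibration) -> float:
    """Keep the share of calls expected to be real, then scale up for missed events."""
    return observed_calls * cal.ppv / cal.sensitivity


# Hypothetical calibration and call counts per event class (not from the paper).
calibrations = {
    "short_insertion": Calibration(true_positives=80, false_positives=20, false_negatives=120),
    "long_deletion": Calibration(true_positives=90, false_positives=10, false_negatives=10),
}
observed = {"short_insertion": 1000, "long_deletion": 500}

estimates = {name: corrected_count(observed[name], calibrations[name]) for name in observed}
for name, value in estimates.items():
    print(f"{name}: observed={observed[name]}, estimated true events={value:.0f}")
```

Calibrating each event class separately is what lets a class with low sensitivity (here, the made-up short insertions) be scaled up more strongly than a class the caller detects well.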
High-throughput sequencing technology enables population-level surveys of human genomic variation. Here, we examine the joint allele frequency distributions across continental human populations and present an approach for combining complementary aspects of whole-genome, low-coverage data and targeted high-coverage data. We apply this approach to data generated by the pilot phase of the Thousand Genomes Project, including whole-genome 2-4× coverage data for 179 samples from HapMap European, Asian, and African panels as well as high-coverage target sequencing of the exons of 800 genes from 697 individuals in seven populations. We use the site frequency spectra obtained from these data to infer demographic parameters for an Out-of-Africa model for populations of African, European, and Asian descent and to predict, by a jackknife-based approach, the amount of genetic diversity that will be discovered as sample sizes are increased. We predict that the number of discovered nonsynonymous coding variants will reach 100,000 in each population after ~1,000 sequenced chromosomes per population, whereas ~2,500 chromosomes will be needed for the same number of synonymous variants. Beyond this point, the number of segregating sites in the European and Asian panel populations is expected to overcome that of the African panel because of faster recent population growth. Overall, we find that the majority of human genomic variable sites are rare and exhibit little sharing among diverged populations. Our results emphasize that replication of disease association for specific rare genetic variants across diverged populations must overcome both reduced statistical power because of rarity and higher population divergence.
View details for DOI 10.1073/pnas.1019276108
View details for PubMedID 21730125
Copy number variation (CNV) in the genome is a complex phenomenon, and not completely understood. We have developed a method, CNVnator, for CNV discovery and genotyping from read-depth (RD) analysis of personal genome sequencing. Our method is based on combining the established mean-shift approach with additional refinements (multiple-bandwidth partitioning and GC correction) to broaden the range of discovered CNVs. We calibrated CNVnator using the extensive validation performed by the 1000 Genomes Project. Because of this, we could use CNVnator for CNV discovery and genotyping in a population and characterization of atypical CNVs, such as de novo and multi-allelic events. Overall, for CNVs accessible by RD, CNVnator has high sensitivity (86%-96%), low false-discovery rate (3%-20%), high genotyping accuracy (93%-95%), and high resolution in breakpoint discovery (<200 bp in 90% of cases with high sequencing coverage). Furthermore, CNVnator is complementary in a straightforward way to split-read and read-pair approaches: It misses CNVs created by retrotransposable elements, but more than half of the validated CNVs that it identifies are not detected by split-read or read-pair. By genotyping CNVs in the CEPH, Yoruba, and Chinese-Japanese populations, we estimated that at least 11% of all CNV loci involve complex, multi-allelic events, a considerably higher estimate than reported earlier. Moreover, among these events, we observed cases with allele distribution strongly deviating from Hardy-Weinberg equilibrium, possibly implying selection on certain complex loci. Finally, by combining discovery and genotyping, we identified six potential de novo CNVs in two family trios.
View details for DOI 10.1101/gr.114876.110
View details for Web of Science ID 000291153400017
View details for PubMedID 21324876
View details for PubMedCentralID PMC3106330
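The CNVnator abstract above mentions GC correction of the read-depth signal without giving a formula. One common form of such a correction, shown here purely as an assumption, rescales each bin's read depth by the ratio of the genome-wide mean depth to the mean depth of bins with the same GC content. The function name `gc_correct` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def gc_correct(read_depth, gc_percent):
    """Rescale each bin by (global mean depth) / (mean depth of bins with the same GC%)."""
    if len(read_depth) != len(gc_percent):
        raise ValueError("read_depth and gc_percent must have the same length")
    global_mean = mean(read_depth)

    # Group read depths by the GC percentage of their bin.
    by_gc = defaultdict(list)
    for rd, gc in zip(read_depth, gc_percent):
        by_gc[gc].append(rd)
    gc_mean = {gc: mean(values) for gc, values in by_gc.items()}

    # Leave a bin unchanged if its GC class has zero mean depth (avoids division by zero).
    return [
        rd * global_mean / gc_mean[gc] if gc_mean[gc] > 0 else rd
        for rd, gc in zip(read_depth, gc_percent)
    ]


# Hypothetical toy data: GC-rich bins are sequenced twice as deeply as GC-poor bins.
rd = [10, 10, 20, 20]
gc = [40, 40, 60, 60]
corrected = gc_correct(rd, gc)
print(corrected)
```

After correction the GC-driven difference between the two groups of bins disappears, so the remaining deviations in a real signal are more likely to reflect copy number than sequencing bias.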
The study of the developing brain has begun to shed light on the underpinnings of both early and adult onset neuropsychiatric disorders. Neuroimaging of the human brain across developmental time points and the use of model animal systems have combined to reveal brain systems and gene products that may play a role in autism spectrum disorders, attention deficit hyperactivity disorder, obsessive compulsive disorder and many other neurodevelopmental conditions. However, precisely how genes may function in human brain development and how they interact with each other leading to psychiatric disorders is unknown. Because of an increasing understanding of neural stem cells and how the nervous system subsequently develops from these cells, we now have the ability to study disorders of the nervous system in a new way - by rewinding and reviewing the development of human neural cells. Induced pluripotent stem cells (iPSCs), developed from mature somatic cells, have allowed the development of specific cells in patients to be observed in real time. Moreover, they have allowed some neuronal-specific abnormalities to be corrected with pharmacological intervention in tissue culture. These exciting advances based on the use of iPSCs hold great promise for understanding, diagnosing and, possibly, treating psychiatric disorders. Specifically, examination of iPSCs from typically developing individuals will reveal how basic cellular processes and genetic differences contribute to individually unique nervous systems. Moreover, by comparing iPSCs from typically developing individuals and patients, differences at stem cell stages, through neural differentiation, and into the development of functional neurons may be identified that will reveal opportunities for intervention. The application of such techniques to early onset neuropsychiatric disorders is still on the horizon but has become a reality of current research efforts as a consequence of the revelations of many years of basic developmental neurobiological science.
View details for DOI 10.1111/j.1469-7610.2010.02348.x
View details for Web of Science ID 000288461400010
View details for PubMedCentralID PMC3124336
Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serves as a resource for sequencing-based association studies.
View details for DOI 10.1038/nature09708
View details for Web of Science ID 000286886400033
View details for PubMedID 21293372
View details for PubMedCentralID PMC3077050
Rare coding variants constitute an important class of human genetic variation, but are underrepresented in current databases that are based on small population samples. Recent studies show that variants altering amino acid sequence and protein function are enriched at low variant allele frequency, 2 to 5%, but because of insufficient sample size it is not clear if the same trend holds for rare variants below 1% allele frequency. The 1000 Genomes Exon Pilot Project has collected deep-coverage exon-capture data in roughly 1,000 human genes, for nearly 700 samples. Although medical whole-exome projects are currently afoot, this is still the deepest reported sampling of a large number of human genes with next-generation technologies. According to the goals of the 1000 Genomes Project, we created effective informatics pipelines to process and analyze the data, and discovered 12,758 exonic SNPs, 70% of them novel, and 74% below 1% allele frequency in the seven population samples we examined. Our analysis confirms that coding variants below 1% allele frequency show increased population-specificity and are enriched for functional variants. This study represents a large step toward detecting and interpreting low frequency coding variation, clearly lays out technical steps for effective analysis of DNA capture data, and articulates functional and population properties of this important class of genetic variation.
View details for DOI 10.1186/gb-2011-12-9-r84
View details for Web of Science ID 000298926900001
View details for PubMedID 21917140
The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.
View details for DOI 10.1038/nature09534
View details for Web of Science ID 000283548600039
View details for PubMedCentralID PMC3042601
Within the last decade or so, there has been an acceleration of research attempting to connect specific genetic lesions to the patterns of brain structure and activation. This article comments on observations that have been made based on these recent data and discusses their importance for the field of investigations into developmental disorders. In making these observations, the authors focus on one specific genomic lesion, the well-studied, yet still incompletely understood, 22q11.2 deletion syndrome. The authors demonstrate the degree of variability in the phenotype that occurs at both the brain and behavioral levels of genomic disorders and describe how this variability is, on close inspection, represented at the genomic level. The authors emphasize the importance of combining genetic/genomic analyses and neuroimaging for research and for future clinical diagnostic purposes and for the purposes of developing individualized, patient-tailored treatment and remediation approaches.
View details for DOI 10.1097/DBP.0b013e3181f5a0a1
View details for Web of Science ID 000281561700011
View details for PubMedID 20814258
Differences in gene expression may play a major role in speciation and phenotypic diversity. We examined genome-wide differences in transcription factor (TF) binding in several humans and a single chimpanzee by using chromatin immunoprecipitation followed by sequencing. The binding sites of RNA polymerase II (PolII) and a key regulator of immune responses, nuclear factor kappaB (p65), were mapped in 10 lymphoblastoid cell lines, and 25 and 7.5% of the respective binding regions were found to differ between individuals. Binding differences were frequently associated with single-nucleotide polymorphisms and genomic structural variants, and these differences were often correlated with differences in gene expression, suggesting functional consequences of binding variation. Furthermore, comparing PolII binding between humans and chimpanzee suggests extensive divergence in TF binding. Our results indicate that many differences in individuals and species occur at the level of TF binding, and they provide insight into the genetic events responsible for these differences.
View details for DOI 10.1126/science.1183621
View details for Web of Science ID 000276459600043
View details for PubMedID 20299548
View details for PubMedCentralID PMC2938768
Epstein-Barr virus (EBV) is associated with several types of lymphomas and epithelial tumors including Burkitt's lymphoma (BL), HIV-associated lymphoma, posttransplant lymphoproliferative disorder, and nasopharyngeal carcinoma. EBV nuclear antigen 1 (EBNA1) is expressed in all EBV associated tumors and is required for latency and transformation. EBNA1 initiates latent viral replication in B cells, maintains the viral genome copy number, and regulates transcription of other EBV-encoded latent genes. These activities are mediated through the ability of EBNA1 to bind viral-DNA. To further elucidate the role of EBNA1 in the host cell, we have examined the effect of EBNA1 on cellular gene expression by microarray analysis using the B cell BJAB and the epithelial 293 cell lines transfected with EBNA1. Analysis of the data revealed distinct profiles of cellular gene changes in BJAB and 293 cell lines. Subsequently, chromatin immune-precipitation revealed a direct binding of EBNA1 to cellular promoters. We have correlated EBNA1 bound promoters with changes in gene expression. Sequence analysis of the 100 promoters most enriched revealed a DNA motif that differs from the EBNA1 binding site in the EBV genome.
View details for DOI 10.1073/pnas.0911676106
View details for Web of Science ID 000273178700069
View details for PubMedID 20080792
Down syndrome (DS), or trisomy 21, is a common disorder associated with several complex clinical phenotypes. Although several hypotheses have been put forward, it is unclear as to whether particular gene loci on chromosome 21 (HSA21) are sufficient to cause DS and its associated features. Here we present a high-resolution genetic map of DS phenotypes based on an analysis of 30 subjects carrying rare segmental trisomies of various regions of HSA21. By using state-of-the-art genomics technologies we mapped segmental trisomies at exon-level resolution and identified discrete regions of 1.8-16.3 Mb likely to be involved in the development of 8 DS phenotypes, 4 of which are congenital malformations, including acute megakaryocytic leukemia, transient myeloproliferative disorder, Hirschsprung disease, duodenal stenosis, imperforate anus, severe mental retardation, DS-Alzheimer Disease, and DS-specific congenital heart disease (DSCHD). Our DS-phenotypic maps located DSCHD to a <2-Mb interval. Furthermore, the map enabled us to present evidence against the necessary involvement of other loci as well as specific hypotheses that have been put forward in relation to the etiology of DS, i.e., the presence of a single DS consensus region and the sufficiency of DSCR1 and DYRK1A, or APP, in causing several severe DS phenotypes. Our study demonstrates the value of combining advanced genomics with cohorts of rare patients for studying DS, a prototype for the role of copy-number variation in complex disease.
View details for DOI 10.1073/pnas.0813248106
View details for Web of Science ID 000268178400040
View details for PubMedID 19597142
Emerging molecular and clinical data suggest that ETS fusion prostate cancer represents a distinct molecular subclass, driven most commonly by a hormonally regulated promoter and characterized by an aggressive natural history. The study of the genomic landscape of prostate cancer in the light of ETS fusion events is required to understand the foundation of this molecularly and clinically distinct subtype. We performed genome-wide profiling of 49 primary prostate cancers and identified 20 recurrent chromosomal copy number aberrations, mainly occurring as genomic losses. Co-occurring events included losses at 19q13.32 and 1p22.1. We discovered three genomic events associated with ERG rearranged prostate cancer, affecting 6q, 7q, and 16q. 6q loss in nonrearranged prostate cancer is accompanied by gene expression deregulation in an independent dataset and by protein deregulation of MYO6. To analyze copy number alterations within the ETS genes, we performed a comprehensive analysis of all 27 ETS genes and of the 3 Mbp genomic area between ERG and TMPRSS2 (21q) with an unprecedented resolution (30 bp). We demonstrate that high-resolution tiling arrays can be used to pin-point breakpoints leading to fusion events. This study provides further support to define a distinct molecular subtype of prostate cancer based on the presence of ETS gene rearrangements.
View details for DOI 10.1002/gcc.20647
View details for Web of Science ID 000263572700007
View details for PubMedID 19156837
Segmental duplications (SDs) are operationally defined as >1 kb stretches of duplicated DNA with high sequence identity. They arise from copy number variants (CNVs) fixed in the population. To investigate the formation of SDs and CNVs, we examine their large-scale patterns of co-occurrence with different repeats. Alu elements, a major class of genomic repeats, had previously been identified as prime drivers of SD formation. We also observe this association; however, we find that it sharply decreases for younger SDs. Continuing this trend, we find only weak associations of CNVs with Alus. Similarly, we find an association of SDs with processed pseudogenes, which is decreasing for younger SDs and absent entirely for CNVs. Next, we find that SDs are significantly co-localized with each other, resulting in a highly skewed "power-law" distribution and chromosomal hotspots. We also observe a significant association of CNVs with SDs, but find that an SD-mediated mechanism only accounts for some CNVs (<28%). Overall, our results imply that a shift in predominant formation mechanism occurred in recent history: approximately 40 million years ago, during the "Alu burst" in retrotransposition activity, non-allelic homologous recombination, first mediated by Alus and then by the newly formed CNVs themselves, was the main driver of genome rearrangements; however, its relative importance has decreased markedly since then, with proportionally more events now stemming from other repeats and from non-homologous end-joining. In addition to a coarse-grained analysis, we performed targeted sequencing of 67 CNVs and then analyzed a combined set of 270 CNVs (540 breakpoints) to verify our conclusions.
View details for DOI 10.1101/gr.081422.108
View details for Web of Science ID 000261398900002
View details for PubMedID 18842824
Olfactory receptors (ORs), which are involved in odorant recognition, form the largest mammalian protein superfamily. The genomic content of OR genes is considerably reduced in humans, as reflected by the relatively small repertoire size and the high fraction (approximately 55%) of human pseudogenes. Since several recent low-resolution surveys suggested that OR genomic loci are frequently affected by copy-number variants (CNVs), we hypothesized that CNVs may play an important role in the evolution of the human olfactory repertoire. We used high-resolution oligonucleotide tiling microarrays to detect CNVs across 851 OR gene and pseudogene loci. Examining genomic DNA from 25 individuals with ancestry from three populations, we identified 93 OR gene loci and 151 pseudogene loci affected by CNVs, generating a mosaic of OR dosages across persons. Our data suggest that approximately 50% of the CNVs involve more than one OR, with the largest CNV spanning 11 loci. In contrast to earlier reports, we observe that CNVs are more frequent among OR pseudogenes than among intact genes, presumably due to both selective constraints and CNV formation biases. Furthermore, our results show an enrichment of CNVs among ORs with a close human paralog or lacking a one-to-one ortholog in chimpanzee. Interestingly, among the latter we observed an enrichment in CNV losses over gains, a finding potentially related to the known diminution of the human OR repertoire. Quantitative PCR experiments performed for 122 sampled ORs agreed well with the microarray results and uncovered 23 additional CNVs. Importantly, these experiments allowed us to uncover nine common deletion alleles that affect 15 OR genes and five pseudogenes. Comparison to the chimpanzee reference genome revealed that all of the deletion alleles are human derived, therefore indicating a profound effect of human-specific deletions on the individual OR gene content. Furthermore, these deletion alleles may be used in future genetic association studies of olfactory inter-individual differences.
View details for DOI 10.1371/journal.pgen.1000249
View details for Web of Science ID 000261481000004
View details for PubMedID 18989455
Highly specific amplification of complex DNA pools without bias or template-independent products (TIPs) remains a challenge. We have developed a method using phi29 DNA polymerase and trehalose and optimized control of amplification to create micrograms of specific amplicons without TIPs from down to subfemtograms of DNA. With an input of as little as 0.5-2.5 ng of human gDNA or a few cells, the product could be close to native DNA in locus representation. The amplicons from 5 and 0.5 ng of DNA faithfully demonstrated all previously known heterozygous segmental duplications and deletions (3 Mb to 18 kb) located on chromosome 22 and even a homozygous deletion smaller than 1 kb with high-resolution chromosome-wide comparative genomic hybridization. With 550k Infinium BeadChip SNP typing, the >99.7% accuracy compared favorably with results on unamplified DNA. Importantly, underrepresentation of chromosome termini that occurred with GenomiPhi v2 was greatly rescued with the present procedure, and the call rate and accuracy of SNP typing were also improved for the amplicons with a 0.5-ng, partially degraded DNA input. In addition, the amplification proceeded logarithmically in terms of total yield before saturation; intact cells were amplified >50 times more efficiently than an equivalent amount of extracted DNA; and the locus imbalance for amplicons with 0.1 ng or lower input of DNA was variable, whereas for higher input it was largely reproducible. This procedure facilitates genomic analysis with single cells or other traces of DNA, and generates products suitable for analysis by massively parallel sequencing as well as microarray hybridization.
View details for DOI 10.1073/pnas.0808028105
View details for Web of Science ID 000260360500052
View details for PubMedID 18832167
DNA methylation is an important component of epigenetic modifications that influences the transcriptional machinery and is aberrant in many human diseases. Several methods have been developed to map DNA methylation for either limited regions or genome-wide. In particular, antibodies specific for methylated CpG have been successfully applied in genome-wide studies. However, despite the relevance of the obtained results, the interpretation of antibody enrichment is not trivial. Of greatest importance, the coupling of antibody-enriched methylated fragments with microarrays generates DNA methylation estimates that are not linearly related to the true methylation level. Here, we present an experimental and analytical methodology, MEDME (modeling experimental data with MeDIP enrichment), to obtain enhanced estimates that better describe the true values of DNA methylation level throughout the genome. We propose an experimental scenario for evaluating the true relationship in a high-throughput setting and a model-based analysis to predict the absolute and relative DNA methylation levels. We successfully applied this model to evaluate DNA methylation status of normal human melanocytes compared to a melanoma cell strain. Despite the low resolution typical of methods based on immunoprecipitation, we show that model-derived estimates of DNA methylation provide relatively high correlation with measured absolute and relative levels, as validated by bisulfite genomic DNA sequencing. Importantly, the model-derived DNA methylation estimates simplify the interpretation of the results both at single-loci and at chromosome-wide levels.
View details for DOI 10.1101/gr.080721.108
View details for Web of Science ID 000259700800012
View details for PubMedID 18765822
Following recent technological advances there has been an increasing interest in genome structural variants (SVs), in particular copy-number variants (CNVs)--large-scale duplications and deletions. Although not immediately evident, CNV surveys make a conceptual connection between the fields of population genetics and protein families, in particular with regard to the stability and expandability of families. The mechanisms giving rise to CNVs can be considered as fundamental processes underlying gene duplication and loss; duplicated genes being the results of 'successful' copies, fixed and maintained in the population. Conversely, many 'unsuccessful' duplicates remain in the genome as pseudogenes. Here, we survey studies on CNVs, highlighting issues related to protein families. In particular, CNVs tend to affect specific gene functional categories, such as those associated with environmental response, and are depleted in genes related to basic cellular processes. Furthermore, CNVs occur more often at the periphery of the protein interaction network. In comparison, protein families associated with successful and unsuccessful duplicates are associated with similar functional categories but are differentially placed in the interaction network. These trends are likely reflective of CNV formation biases and natural selection, both of which differentially influence distinct protein families.
View details for DOI 10.1016/j.sbi.2008.02.005
View details for Web of Science ID 000257539100013
View details for PubMedID 18511261
Recent studies of the mammalian transcriptome have revealed a large number of additional transcribed regions and extraordinary complexity in transcript diversity. However, there is still much uncertainty regarding precisely what portion of the genome is transcribed, the exact structures of these novel transcripts, and the levels of the transcripts produced. We have interrogated the transcribed loci in 420 selected ENCyclopedia Of DNA Elements (ENCODE) regions using rapid amplification of cDNA ends (RACE) sequencing. We analyzed annotated known gene regions, but primarily we focused on novel transcriptionally active regions (TARs), which were previously identified by high-density oligonucleotide tiling arrays and on random regions that were not believed to be transcribed. We found RACE sequencing to be very sensitive and were able to detect low levels of transcripts in specific cell types that were not detectable by microarrays. We also observed many instances of sense-antisense transcripts; further analysis suggests that many of the antisense transcripts (but not all) may be artifacts generated from the reverse transcription reaction. Our results show that the majority of the novel TARs analyzed (60%) are connected to other novel TARs or known exons. Of previously unannotated random regions, 17% were shown to produce overlapping transcripts. Furthermore, it is estimated that 9% of the novel transcripts encode proteins. We conclude that RACE sequencing is an efficient, sensitive, and highly accurate method for characterization of the transcriptome of specific cell/tissue types. Using this method, it appears that much of the genome is represented in polyA+ RNA. Moreover, a fraction of the novel RNAs can encode protein and are likely to be functional.
View details for DOI 10.1186/gb-2008-9-1-r3
View details for Web of Science ID 000253779800011
View details for PubMedID 18173853
Structural variation of the genome involves kilobase- to megabase-sized deletions, duplications, insertions, inversions, and complex combinations of rearrangements. We introduce high-throughput and massive paired-end mapping (PEM), a large-scale genome-sequencing method to identify structural variants (SVs) approximately 3 kilobases (kb) or larger that combines the rescue and capture of paired ends of 3-kb fragments, massive 454 sequencing, and a computational approach to map DNA reads onto a reference genome. PEM was used to map SVs in an African and in a putatively European individual and identified shared and divergent SVs relative to the reference genome. Overall, we fine-mapped more than 1000 SVs and documented that the number of SVs among humans is much larger than initially hypothesized; many of the SVs potentially affect gene function. The breakpoint junction sequences of more than 200 SVs were determined with a novel pooling strategy and computational analysis. Our analysis provided insights into the mechanisms of SV formation in humans.
View details for DOI 10.1126/science.1149504
View details for Web of Science ID 000250230400038
View details for PubMedID 17901297
We report the generation and analysis of functional data from multiple, diverse experiments performed on a targeted 1% of the human genome as part of the pilot phase of the ENCODE Project. These data have been further integrated and augmented by a number of evolutionary and computational analyses. Together, our results advance the collective knowledge about human genome function in several major areas. First, our studies provide convincing evidence that the genome is pervasively transcribed, such that the majority of its bases can be found in primary transcripts, including non-protein-coding transcripts, and those that extensively overlap one another. Second, systematic examination of transcriptional regulation has yielded new understanding about transcription start sites, including their relationship to specific regulatory sequences and features of chromatin accessibility and histone modification. Third, a more sophisticated view of chromatin structure has emerged, including its inter-relationship with DNA replication and transcriptional regulation. Finally, integration of these new sources of information, in particular with respect to mammalian evolution based on inter- and intra-species sequence comparisons, has yielded new mechanistic and evolutionary insights concerning the functional landscape of the human genome. Together, these studies are defining a path for pursuit of a more comprehensive characterization of human genome function.
View details for DOI 10.1038/nature05874
View details for Web of Science ID 000247207500034
View details for PubMedID 17571346
View details for PubMedCentralID PMC2212820
Copy-number variants (CNVs) are an abundant form of genetic variation in humans. However, approaches for determining exact CNV breakpoint sequences (physical deletion or duplication boundaries) across individuals, crucial for associating genotype to phenotype, have been lacking so far, and the vast majority of CNVs have been reported with approximate genomic coordinates only. Here, we report an approach, called BreakPtr, for fine-mapping CNVs (available from http://breakptr.gersteinlab.org). We statistically integrate both sequence characteristics and data from high-resolution comparative genome hybridization experiments in a discrete-valued, bivariate hidden Markov model. Incorporation of nucleotide-sequence information allows us to take into account the fact that recently duplicated sequences (e.g., segmental duplications) often coincide with breakpoints. In anticipation of an upcoming increase in CNV data, we developed an iterative, "active" approach to initially scoring with a preliminary model, performing targeted validations, retraining the model, and then rescoring, and a flexible parameterization system that intuitively collapses from a full model of 2,503 parameters to a core one of only 10. Using our approach, we accurately mapped >400 breakpoints on chromosome 22 and a region of chromosome 11, refining the boundaries of many previously approximately mapped CNVs. Four predicted breakpoints flanked known disease-associated deletions. We validated an additional four predicted CNV breakpoints by sequencing. Overall, our results suggest a predictive resolution of approximately 300 bp. This level of resolution enables more precise correlations between CNVs and across individuals than previously possible, allowing the study of CNV population frequencies. Further, it enabled us to demonstrate a clear Mendelian pattern of inheritance for one of the CNVs.
View details for DOI 10.1073/pnas.0703834104
View details for Web of Science ID 000247363000036
View details for PubMedID 17551006
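The BreakPtr abstract above describes a discrete-valued, bivariate hidden Markov model over array and sequence data. As a rough illustration of that general idea only, the following toy sketch runs Viterbi decoding over pairs of discretized observations. It is not the BreakPtr model: the three states, the conditional-independence assumption between the two observation channels, and every probability below are hypothetical values chosen for the example.

```python
import math

# Hypothetical copy-number states (not the state set used by BreakPtr).
STATES = ["normal", "deletion", "duplication"]

# All probabilities below are made-up illustration values.
START = {"normal": 0.90, "deletion": 0.05, "duplication": 0.05}
TRANS = {
    "normal":      {"normal": 0.98, "deletion": 0.01, "duplication": 0.01},
    "deletion":    {"normal": 0.05, "deletion": 0.94, "duplication": 0.01},
    "duplication": {"normal": 0.05, "deletion": 0.01, "duplication": 0.94},
}
# Channel 1: discretized array log-ratio ("low", "mid", "high").
EMIT_CGH = {
    "normal":      {"low": 0.10, "mid": 0.80, "high": 0.10},
    "deletion":    {"low": 0.80, "mid": 0.15, "high": 0.05},
    "duplication": {"low": 0.05, "mid": 0.15, "high": 0.80},
}
# Channel 2: 1 if the probe overlaps a segmental duplication, else 0.
EMIT_SEQ = {
    "normal":      {0: 0.9, 1: 0.1},
    "deletion":    {0: 0.7, 1: 0.3},
    "duplication": {0: 0.7, 1: 0.3},
}


def log_emit(state, obs):
    """Log-probability of a (cgh_bin, seq_flag) pair, assuming the two
    channels are independent given the state (an assumption of this sketch)."""
    cgh_bin, seq_flag = obs
    return math.log(EMIT_CGH[state][cgh_bin]) + math.log(EMIT_SEQ[state][seq_flag])


def viterbi(observations):
    """Return the most likely state path for a list of (cgh_bin, seq_flag)."""
    scores = {s: math.log(START[s]) + log_emit(s, observations[0]) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev] + math.log(TRANS[best_prev][s])
                             + log_emit(s, obs))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def breakpoints(path):
    """Indices where the decoded state changes between adjacent probes."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]


# Toy probe series: 5 normal probes, 7 low-ratio probes, 5 normal probes.
observations = [("mid", 0)] * 5 + [("low", 1)] + [("low", 0)] * 6 + [("mid", 0)] * 5
print(breakpoints(viterbi(observations)))  # prints [5, 12]
```

In this sketch a breakpoint is simply a probe index where the decoded state changes, so its resolution is bounded by probe spacing.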
Genomic tiling microarrays have become a popular tool for interrogating the transcriptional activity of large regions of the genome in an unbiased fashion. There are several key parameters associated with each tiling experiment (e.g., experimental protocols and genomic tiling density). Here, we assess the role of these parameters as they are manifest in different tiling-array platforms used for transcription mapping. First, we analyze how a number of published tiling-array experiments agree with established gene annotation on human chromosome 22. We observe that the transcription detected from high-density arrays correlates substantially better with annotation than that from other array types. Next, we analyze the transcription-mapping performance of the two main high-density oligonucleotide array platforms in the ENCODE regions of the human genome. We hybridize identical biological samples and develop several ways of scoring the arrays and segmenting the genome into transcribed and nontranscribed regions, with the aim of making the platforms most comparable to each other. Finally, we develop a platform comparison approach based on agreement with known annotation. Overall, we find that the performance improves with more data points per locus, coupled with statistical scoring approaches that properly take advantage of this, where this larger number of data points arises from higher genomic tiling density and the use of replicate arrays and mismatches. While we do find significant differences in the performance of the two high-density platforms, we also find that they complement each other to some extent. Finally, our experiments reveal a significant amount of novel transcription outside of known genes, and an appreciable sample of this was validated by independent experiments.
View details for DOI 10.1101/gr.5014606
View details for Web of Science ID 000247226900020
View details for PubMedID 17119069
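The tiling-array abstract above mentions scoring arrays and segmenting the genome into transcribed and nontranscribed regions. One simple way to picture such a segmentation is sketched below: probes scoring at or above a threshold are joined into a segment when they lie close together, and short segments are discarded. The function name and the `threshold`, `max_gap`, and `min_run` parameters are assumptions made for this illustration, not the scoring or segmentation methods developed in the paper.

```python
def segment_transcribed(positions, scores, threshold, max_gap, min_run):
    """Group probes scoring >= threshold into (start, end) segments.

    positions: sorted genomic coordinates of the probes.
    scores:    one per-probe score for each position.
    max_gap:   largest distance (bp) between consecutive above-threshold
               probes that may still be joined into one segment.
    min_run:   minimum segment span (bp) required to report a segment.
    All parameter names and semantics are hypothetical.
    """
    segments = []
    start = end = None
    for pos, score in zip(positions, scores):
        if score < threshold:
            continue  # below-threshold probes never start or extend a segment
        if start is not None and pos - end <= max_gap:
            end = pos  # close enough to the previous hit: extend the segment
        else:
            if start is not None and end - start >= min_run:
                segments.append((start, end))
            start = end = pos  # open a new candidate segment
    if start is not None and end - start >= min_run:
        segments.append((start, end))
    return segments


positions = [0, 25, 50, 75, 100, 300, 325, 350, 600]
scores = [0.1, 2.0, 2.5, 0.2, 2.2, 0.1, 3.0, 0.3, 2.8]
print(segment_transcribed(positions, scores, threshold=1.0, max_gap=50, min_run=50))
# prints [(25, 100)]
```

With more probes per locus (denser tiling or replicate arrays), a per-probe score of this kind becomes less noisy, which is consistent with the abstract's observation that performance improves with more data points per locus.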
Deletions and amplifications of the human genomic sequence (copy number polymorphisms, CNPs) are the cause of numerous diseases and a potential cause of phenotypic variation in the normal population. Comparative genomic hybridization (CGH) has been developed as a useful tool for detecting alterations in DNA copy number that involve blocks of DNA several kilobases or larger in size. We have developed high-resolution CGH (HR-CGH) to detect accurately and with relatively little bias the presence and extent of chromosomal aberrations in human DNA. Maskless array synthesis was used to construct arrays containing 385,000 oligonucleotides with isothermal probes of 45-85 bp in length; arrays tiling the beta-globin locus and chromosome 22q were prepared. Arrays with a 9-bp tiling path were used to map a 622-bp heterozygous deletion in the beta-globin locus. Arrays with an 85-bp tiling path were used to analyze DNA from patients with copy number changes in the pericentromeric region of chromosome 22q. Heterozygous deletions and duplications as well as partial triploidies and partial tetraploidies of portions of chromosome 22q were mapped with high resolution (typically up to 200 bp) in each patient, and the precise breakpoints of two deletions were confirmed by DNA sequencing. Additional peaks potentially corresponding to known and novel CNPs were also observed. Our results demonstrate that HR-CGH allows the detection of copy number changes in the human genome at an unprecedented level of resolution.
View details for DOI 10.1073/pnas.0511340103
View details for Web of Science ID 000236362600039
View details for PubMedID 16537408
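The HR-CGH abstract above lists heterozygous deletions, duplications, partial triploidies, and partial tetraploidies. In array CGH generally, such changes are read out as a shift in the log2 ratio of test to reference signal. The short sketch below computes the idealized ratio for each copy-number state, assuming a diploid reference and a perfectly linear, noise-free signal; real arrays typically show compressed ratios, and the function name is hypothetical.

```python
import math


def expected_log2_ratio(test_copies, reference_copies=2):
    """Idealized log2(test/reference) signal for a given copy number.

    Assumes signal is exactly proportional to copy number, which real
    hybridization data only approximate.
    """
    return math.log2(test_copies / reference_copies)


for label, copies in [("heterozygous deletion", 1),
                      ("normal diploid", 2),
                      ("duplication / partial triploidy", 3),
                      ("partial tetraploidy", 4)]:
    print(f"{label}: {expected_log2_ratio(copies):+.3f}")
```

Under these assumptions a heterozygous deletion sits at -1, three copies at about +0.585, and four copies at +1, so a gain of one copy is harder to separate from baseline than a loss of one copy.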
Elucidating the transcribed regions of the genome constitutes a fundamental aspect of human biology, yet this remains an outstanding problem. To comprehensively identify coding sequences, we constructed a series of high-density oligonucleotide tiling arrays representing sense and antisense strands of the entire nonrepetitive sequence of the human genome. Transcribed sequences were located across the genome via hybridization to complementary DNA samples, reverse-transcribed from polyadenylated RNA obtained from human liver tissue. In addition to identifying many known and predicted genes, we found 10,595 transcribed sequences not detected by other methods. A large fraction of these are located in intergenic regions distal from previously annotated genes and exhibit significant homology to other mammalian proteins.
View details for DOI 10.1126/science.1103388
View details for Web of Science ID 000225950000042
View details for PubMedID 15539566
The ENCyclopedia Of DNA Elements (ENCODE) Project aims to identify all functional elements in the human genome sequence. The pilot phase of the Project is focused on a specified 30 megabases (approximately 1%) of the human genome sequence and is organized as an international consortium of computational and laboratory-based scientists working to develop and apply high-throughput approaches for detecting all sequence elements that confer biological function. The results of this pilot phase will guide future efforts to analyze the entire human genome.
View details for DOI 10.1126/science.1105136
View details for Web of Science ID 000224756700037
View details for PubMedID 15499007