Preface
Since the discovery of the double helix, the influence of DNA over the structure and function of organisms has been the topic of extensive biological study, underscoring the basic tenets of biology and arguably serving as the core of the science as a whole. In 2003, half a century after the structure of DNA was described, the National Human Genome Research Institute (NHGRI) completed the Human Genome Project, a landmark undertaking that oversaw the sequencing of the approximately 3 billion base pairs comprising the human genome and assessed the presence of an estimated 30,000 protein-coding sequences.1 However, despite the substantial trove of knowledge the project attained, many questions remained unanswered. In particular, researchers were amazed to find that only one to two percent of the human genome's base-pair sequence actually coded for proteins,2 leaving the function of the remainder little more than the subject of speculation. Enter the Encyclopedia of DNA Elements (ENCODE) project, brainchild of the NHGRI and successor to the Human Genome Project.2 Less than two months ago, the project unveiled the fruits of almost a decade's worth of genomic analysis across a plethora of cell types through the publication of thirty research papers2 detailing a massive reserve of data that assigned at least one biological function to 80.4% of the human genome.3 Extending beyond protein-coding sequences to encompass protein binding sites as well as an extensive array of RNA functionality, ENCODE paints a new portrait of the human genome, one that is far more regulated and multidimensional than previously fathomed.
Aims
The ENCODE project affords a mass of statistical and qualitative data detailing elements of the entire human genome. However, the data is quite complex, necessitating a brief overview of the role of DNA in gene expression prior to any examination of ENCODE. In addition to defining the locations of all known gene loci, as annotated by ENCODE's subproject GENCODE,7 the project highlights several other functions of the human genome. Finally, several implications of the ENCODE project, including the revolutionary manner in which its data is presented, will be discussed. The aims of this paper are therefore as follows: to provide a brief summary of DNA and gene expression; to summarize ENCODE's statistical findings pertaining to protein-coding loci, RNA expression, DNA binding sites, and areas of histone modification; and to conclude with a brief discussion of the project's various implications.
DNA: A Brief Summary
All information pertaining to the structure and function of a living cell is preserved and maintained within one or more molecules of DNA. It is this information that is passed from a cell, or any multicellular organism for that matter, to its progeny. Structurally, DNA is composed of two helical strands running antiparallel to each other and bound to each other through complementary hydrogen bonding that occurs between a series of four nitrogenous bases lining the molecule’s interior. Through a process known as gene expression, information stored in the form of nitrogenous base-pair sequences within the DNA molecule is used to create a variety of polypeptides, which form proteins capable of performing a variety of tasks. This process begins with transcription, during which a sequence of DNA is copied onto a molecule of single-stranded RNA, and ends with translation, the process of translating the base sequence of a transcribed RNA molecule into a polypeptide.9
A variety of elements exists to regulate and fine-tune this process. For example, not all transcribed RNA is translated into protein. Prior to translation, RNA is spliced into various new sequences, with only portions known as exons possessing the potential to be translated. Furthermore, not all exons are translated into polypeptides, as many transcripts are folded and modified into what are known as non-coding RNAs, which perform a variety of tasks associated with gene expression.9
In addition, the DNA molecules of humans and other eukaryotic organisms coil around proteinaceous bodies known as histones. The resulting aggregate is known as chromatin. Histones are often chemically modified, which controls how tightly a given portion of DNA is coiled, as well as how accessible its sequence is to the proteins associated with transcription.9
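As a rough illustration of the transcription and translation steps summarized in this section, the following Python sketch copies a DNA template strand into messenger RNA and then reads that RNA three bases at a time. It is a toy model only: the template sequence is hypothetical, the codon table is a small subset of the standard genetic code, and splicing, non-coding RNAs, and histone effects are ignored.

```python
# Toy model of gene expression: transcription followed by translation.
# CODON_TABLE is only a small subset of the standard genetic code.
CODON_TABLE = {
    "AUG": "Met",  # start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "GCU": "Ala",
    "UAA": None,   # stop codons map to None
    "UAG": None,
    "UGA": None,
}

# Base-pairing rules from DNA template to RNA (uracil replaces thymine).
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Build the mRNA (5'->3') complementary to a template strand written 3'->5'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


def translate(mrna: str) -> list[str]:
    """Read codons from the first AUG until a stop codon is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return []
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3])
        if residue is None:  # stop codon, or a codon missing from the toy table
            break
        peptide.append(residue)
    return peptide


template = "TACAAACCGCGAATT"  # hypothetical template strand, written 3'->5'
mrna = transcribe(template)
peptide = translate(mrna)
print(mrna, peptide)
```

Here the template is transcribed into an RNA beginning with the start codon AUG, and translation halts at the stop codon UAA, yielding a four-residue peptide.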
ENCODE: A Summation
Essentially, ENCODE is a compilation of experiments, each testing for specific genomic sequences or features, performed across an array of 147 cell types.3 The project focused primarily on three cell lines: GM12878, a cell line produced from immature lymphocytes; K562, an immortal cell line derived from leukemia cells;4 and the embryonic stem cell line H1-hESC.2 Expanding upon its predecessor's study of protein-coding regions, ENCODE lists exactly 20,687 protein-coding genes, with a mean of 6.3 alternatively spliced transcripts per locus,3 as well as 11,224 pseudogenes,3 which are nonfunctioning gene analogs. Still, protein-coding regions, that is, the exons of protein-coding genes, were shown to comprise only 2.94% of the genome's base pairs,3 leaving the bulk of the project's discoveries to genomic regions serving other purposes.
One such function is the transcription of RNA, a nucleic acid that serves a variety of genetic roles, most notably as the intermediary in the conversion of DNA sequences to protein. A staggering 62% of the genome was found to be expressed in the form of RNA molecules measuring more than 200 nucleotides in length, with protein-coding exons accounting for only 5.5% of that figure.3 In addition, using a method known as CAGE-sequencing to isolate and sequence the capped 5′ ends of RNA molecules, the project pinpointed 62,403 transcription start sites across the genome, with a significant number of such sites located within exons and untranslated regions. Other project-specific data shows a statistically significant portion of these start sites to exhibit “cell-type-restricted expression,” that is, gene expression ascribed only to specific cell types,3 further supporting the idea that portions of the genome of a human, or of any multicellular organism for that matter, are expressed only in specific cell types. Furthermore, sequences coding for long and short non-coding RNAs were found to account for a significant portion of the genome,3 suggesting a higher degree of genomic regulation, particularly with respect to cell type, than previously considered.7 These molecules, known for their roles in translation as tRNAs and in splicing as components of snRNPs,9 have also recently been speculated to play a multitude of significant roles in organismal development.6
Beyond transcriptionally expressed elements, the project found a substantial portion of the genome to be involved in physically binding proteins and other molecules. Using a procedure known as ChIP-seq, which locates protein-bound DNA by means of protein-specific antibodies, ENCODE mapped the binding sites of 119 DNA-binding proteins across 72 cell types and found 8.1% of the genome to be involved in such binding.3 Particular emphasis was placed on transcription factors, which accounted for 87 of the 119 DNA-binding proteins studied, as well as on the correlation of their binding sites with the presence of a strong DNA-binding motif,3 a sequence of DNA associated with increased affinity of DNA-binding proteins for the region in which it is contained.11 In addition, to map areas of DNA accessibility, sensitivity to the nuclease DNase I was documented across the entire genome and cross-referenced with each area's affinity for DNA-binding proteins. Interestingly, DNase I sensitivity was significantly higher in regions with lower affinity than in regions with higher affinity, suggesting that such low-affinity regions are associated with other, as-yet-unspecified factors.3
In addition to transcription factors, areas involved with histone activity were also the topic of substantial study. The locations of as many as 12 histone modifications were studied across 46 cell types. Overall, histone modification varied across the cell types studied in correlation with various transcription patterns, with areas of intense modification constituting 56.1% of the genome,3 further supporting claims in previous studies regarding the consistent transcriptional impact of histone modification across varying cell types.10
Furthermore, areas where DNA was bound directly to outside molecules were the subject of intense statistical analysis. In particular, locations of DNA methylation of cytosine in CpG dinucleotides, a phenomenon correlated with both gene repression and increased transcriptional activity,3 were analyzed across 82 cell lines, with 96% of methylated sites varying across different cell types,3 a figure showing strong statistical correlation between differential methylation and transcriptional expression across various cell types.
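To make the notion of differential CpG methylation more concrete, the short Python sketch below locates CpG dinucleotides in a sequence and computes the fraction of those sites whose methylation state differs between cell types. It is an illustration only and not the project's actual pipeline: the sequence, the three hypothetical cell types, and the methylation calls are invented, and the function names are assumptions introduced here.

```python
def cpg_positions(sequence: str) -> list[int]:
    """Return the 0-based position of every C that is immediately followed by G."""
    seq = sequence.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def fraction_variable(methylation: dict[int, list[bool]]) -> float:
    """Fraction of CpG sites whose methylation state differs between cell types."""
    if not methylation:
        return 0.0
    variable = sum(1 for states in methylation.values() if len(set(states)) > 1)
    return variable / len(methylation)


sequence = "ATCGGCGTACGA"  # invented example sequence
sites = cpg_positions(sequence)

# Invented methylation calls: one True/False per hypothetical cell type.
calls = {
    sites[0]: [True, True, True],     # methylated in every cell type
    sites[1]: [True, False, True],    # varies between cell types
    sites[2]: [False, False, True],   # varies between cell types
}
fraction = fraction_variable(calls)
print(sites, round(fraction, 2))
```

In this invented data set two of the three CpG sites vary between cell types; the 96% figure reported by ENCODE is the analogous proportion computed over its own measurements.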
Finally, the connections between sequences at differing locations across the genome were assessed. Using an approach known as chromosome conformation capture carbon copy (5C) to analyze chromosomal positioning and activity, the project cataloged the long-range interactions of transcription start sites (TSSs) with distal elements across four cell types through the analysis of many statistically significant correlations. The TSSs studied were found to interact, on average, with 3.9 distal elements. The project considers such activity to be indicative of the presence of an undiscovered and interconnected dimension to the genome involving long-range physical interaction between sequences.3
Conclusion: Beyond ENCODE
In summation, according to ENCODE's data, 80.4% of the genome has now been assigned at least one biological function. The wealth of information afforded by such a feat is staggering, and it is already developing into a core reference for researchers across a wide spectrum of study. The data itself is publicly available on several online databases, including those of ENCODE and the National Center for Biotechnology Information.8
To understand the magnitude of what such availability implies, consider the analogy posed by NHGRI program director Elise Feingold, Ph.D., who likened the data to a genetic version of Google Maps:
“Simply by selecting the magnification in Google Maps, you can see countries, states, cities, streets, even individual intersections, and by selecting different features, you can get directions, see street names and photos, and get information about traffic and even weather. The ENCODE maps allow researchers to inspect the chromosomes, genes, functional elements and individual nucleotides in the human genome in much the same way.”8
With respect to the project, much remains undiscovered. Note that the percentages cited in this paper sum to more than 100%, indicating that a significant portion of the genome is involved in more than one of the aforementioned functions. Moreover, while ENCODE is by far the most comprehensive study of the human genome to date, the work the project began is far from complete, as a vast array of cell types has yet to be tested. It is therefore not unreasonable to posit that a variety of genomic elements, as well as a myriad of functions not yet ascribed to previously studied sequences, remain undiscovered. However, ENCODE's approach, as well as its revolutionary presentation of data as a free online resource, has paved the way for similar endeavors. Indeed, the fruits of ENCODE may well be the seeds of a coming age of genomics, one in which a comprehensive view of humanity from the molecular level upward may finally be realized.
References
1. Collins, Francis S. et al. “A Vision for the Future of Genomics Research”. Nature, Vol. 422, no. 6934. 24 April, 2003.
2. Pennisi, Elizabeth. “ENCODE Project Writes Eulogy for Junk DNA”. Science, Vol. 337 no. 6099. 7 September, 2012.
3. The ENCODE Project Consortium. “An integrated encyclopedia of DNA elements in the human genome”. Nature, Vol. 489, no. 7414. 6 September, 2012.
4. “ENCODE Project Common Cell Types”. National Human Genome Research Institute. 9 March, 2012.
5. Mattick, John S. and Makunin, Igor V. “Non-coding RNA”. Human Molecular Genetics, Vol. 15 no. 1. 22 February, 2006. <http://iris.nyit.edu/~apetro01/old-postings-Fall-2011/Week-11/W11-C1/non-coding-RNA.pdf>
6. Kapranov, Philipp and St. Laurent, Georges. “Dark Matter RNA: Existence, Function, and Controversy”. Frontiers in Genetics, Vol. 3 no. 60. 23 April, 2012.
7. “Details”. The GENCODE Project: Encyclopedia of Genes and Gene Variants.
8. McCrimmon, Omar. “ENCODE data describes function of human genome”. National Human Genome Research Institute / NIH News. 5 September, 2012.
9. Freeman, Scott. Biological Science. Custom ed. San Francisco: Addison-Wesley, 2011. pp. 260-261, 276, 279-284, 289-300
10. Jung, I. and Kim, D. “Histone modification profiles characterize function-specific gene regulation.” Journal of Theoretical Biology. Vol. 310. pp.132-142. 7 October, 2012.
11. Eden, Eran et al. “Discovering Motifs in Ranked Lists of DNA Sequences” PLOS: Computational Biology. 5 January, 2007.