Genomics in the World of Big Data Analytics
Data analytics projects have proven successful in accelerating the pace of research, analysis and decision-making at pharmaceutical and biotech corporations. For genomics specifically, the promise of using big data to unlock the field's full potential is exciting, but not without its challenges. The primary technological challenges for data analytics projects are well known: data integration, privacy and security, storage and statistical power. But as data analytics and genomics intersect, new challenges emerge.
Almost thirty years ago, the Human Genome Project engaged over 20 university labs and consumed billions of dollars globally and over a decade of work. Today, the cost of sequencing a genome has fallen to a few hundred dollars with a turnaround time of a few days, and there have even been Black Friday sales vying for consumer affection. As a result of this increased accessibility, companies are compiling data at an exponential rate, and IT engagements are in demand to help keep up with the storage, privacy and computational speed needed to extract meaning from it all.
A Lack of Standardized Nomenclature: Data’s Biggest Enemy
Taking a step back from IT, one of the biggest problems facing big data in the biological sciences is the lack of standardized nomenclature. As the field emerged, naming conventions were non-existent, giving rise to a “Wild West”-type environment where scientists labeled things as they thought best.
In 1979, the HUGO Gene Nomenclature Committee (HGNC) was organized to “approve unique symbols and names for human loci, including protein coding genes, ncRNA genes and pseudogenes, to allow unambiguous scientific communication.” However, this nomenclature effort was confined to humans. It did not carry terminology across the evolutionary tree to model organisms such as yeast, flies, worms or mice, which complicates effective data mining. For example, the KRAS gene has 15 synonyms in humans and 11 in mice, of which only 4 overlap.
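To make the synonym problem concrete, here is a minimal Python sketch of how a curation step might normalize gene symbols and resolve an arbitrary label to known genes. The alias sets are a short illustrative subset chosen for this example, not the curated human and mouse records, and the function names (normalize, build_alias_index, resolve) are hypothetical.

```python
def normalize(symbol: str) -> str:
    """Case-fold and drop punctuation so 'K-ras' and 'KRAS' compare equal."""
    return "".join(ch for ch in symbol.casefold() if ch.isalnum())


# Assumption: small illustrative alias sets, NOT the full curated synonym lists.
ALIASES = {
    ("human", "KRAS"): {"KRAS2", "K-RAS", "RASK2", "c-Ki-ras"},
    ("mouse", "Kras"): {"Kras2", "K-ras", "Ki-ras"},
}


def build_alias_index(aliases_by_gene):
    """Map each normalized alias to every (species, symbol) pair that uses it."""
    index = {}
    for gene, aliases in aliases_by_gene.items():
        official_symbol = gene[1]
        for alias in aliases | {official_symbol}:
            index.setdefault(normalize(alias), set()).add(gene)
    return index


def resolve(term, index):
    """Return all genes a free-text label could refer to (empty set if unknown)."""
    return index.get(normalize(term), set())


def shared_aliases(gene_a, gene_b, aliases_by_gene):
    """Normalized labels (official symbol included) used for both genes."""
    a = {normalize(s) for s in aliases_by_gene[gene_a] | {gene_a[1]}}
    b = {normalize(s) for s in aliases_by_gene[gene_b] | {gene_b[1]}}
    return a & b


index = build_alias_index(ALIASES)
```

With this index, resolve("k-ras", index) returns both the human and the mouse gene, while resolve("RASK2", index) returns only the human one. A label that maps to more than one gene is exactly the kind of ambiguity a curator has to settle by also recording the species.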
In fact, it is not uncommon for different scientists within the same team to prefer different nomenclatures. In addition to genes, there are also proteins, cells, tissues, organisms, diseases, technologies, protocols and algorithms that exist without definitive naming standards. Organizations such as the Pistoia Alliance and the Transforming Genetic Medicine Initiative work to improve consistency and establish FAIR (Findable, Accessible, Interoperable, and Reusable) best practices for the use and management of ontologies.
Lastly, even if the nomenclature were established, it is imperative to capture all the metadata associated with a particular sequence. Specifically, what organism did the sequence come from? What is the gender or ethnicity? What is the context (e.g. normal vs. cancerous tissue) associated with that sample? Finally, which technologies (i.e. physical or computational process) were used to generate the data?
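The questions above can be read as required fields on a metadata record. The following Python sketch shows one way to flag records that leave them unanswered. The class name, field names and the missing_fields helper are assumptions made for illustration, not an established schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SequenceMetadata:
    """Hypothetical record holding the context needed to interpret a sequence."""
    sequence_id: str
    organism: Optional[str] = None             # which organism the sequence came from
    gender: Optional[str] = None
    ethnicity: Optional[str] = None
    sample_context: Optional[str] = None       # e.g. "normal" or "cancerous"
    sequencing_platform: Optional[str] = None  # physical process
    analysis_pipeline: Optional[str] = None    # computational process


def missing_fields(record: SequenceMetadata) -> List[str]:
    """List fields left empty, in declaration order."""
    return [f.name for f in fields(record) if getattr(record, f.name) in (None, "")]


record = SequenceMetadata("S1", organism="Homo sapiens", sample_context="cancerous")
gaps = missing_fields(record)
```

Here gaps lists gender, ethnicity, sequencing_platform and analysis_pipeline, which tells a curator what to request before the sequence enters a shared dataset.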
Drawing Meaningful Conclusions from Disparate Data
The problem described above becomes more pronounced when trying to integrate information from disparate data sources. If the organizational structure of the metadata does not match, analysis of the combined datasets will be a challenge. While the field works to adopt comprehensive standards, curators will be necessary to bring the data together in an integrable and interpretable way.
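As a rough illustration of that curation work, the Python sketch below renames source-specific metadata fields to a shared vocabulary and sets aside anything it cannot map for human review. The source names ("lab_a", "lab_b"), field names and mapping table are all hypothetical.

```python
# Assumption: per-source mappings maintained by curators (names are made up).
FIELD_MAP = {
    "lab_a": {"species": "organism", "tissue_state": "sample_context"},
    "lab_b": {"Organism": "organism", "condition": "sample_context"},
}


def harmonize(record, source, field_map=FIELD_MAP):
    """Split a record into (harmonized fields, unmapped fields needing review)."""
    mapping = field_map[source]
    harmonized, unmapped = {}, {}
    for key, value in record.items():
        target = mapping.get(key)
        if target is None:
            unmapped[key] = value  # leave for a curator rather than guess
        else:
            harmonized[target] = value
    return harmonized, unmapped


a, a_left = harmonize({"species": "Homo sapiens", "tissue_state": "normal"}, "lab_a")
b, b_left = harmonize({"Organism": "Homo sapiens", "batch": "7"}, "lab_b")
```

After harmonization both records expose the same organism key and can be analyzed together, while the unmapped batch field from the second source is kept for review instead of being silently dropped.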
To draw meaningful conclusions, data analytics within genomics projects will need to account for the technological problems, nomenclature standards and metadata issues that turnkey big data implementations cannot fully foresee. To do so, the next generation of data analytics applied to genomics will build on robustly engineered cloud platforms for data storage and analysis.
With public- and government-sponsored projects underway, such as the National Institutes of Health’s 100,000 genomes, the UK's 1 million or Finland's 0.5 million genomes, alongside commercial endeavors, we can access and integrate an almost ludicrous amount of data. However, even the most organized projects still need data scientists to ensure that all data and metadata comply with FAIR standards.
In conclusion, big data platforms will need to proactively anticipate the metadata necessary to turn the volume of disparate data sources across an organization into worthwhile research and commercial applications. Despite the challenges, implementing a genomics-centered big data ecosystem that is powerful, scalable and searchable will yield integrations and interpretations that pave the way for future academic and commercial advances worldwide.