Genomics in the World of Big Data Analytics

Julie Gorenstein

Manager, Life Sciences Consulting

Dennis Leskowski

Senior Manager, Life Sciences Consulting

Data analytics projects have proven successful in accelerating the pace of research, analysis and decision-making at pharmaceutical and biotech corporations. For genomics specifically, the promise of utilizing big data to capture and unlock its full potential is exciting, but not without its challenges. Everyone is aware of the primary technological challenges for data analytics projects, like data integration, privacy/security, storage and statistical power. But as data analytics and genomics intersect, new challenges emerge.

Almost thirty years ago, the Human Genome Project involved more than 20 university labs, cost billions of dollars globally and took over a decade of work. Today, sequencing a genome costs a few hundred dollars with a turnaround time of a few days, and there have even been Black Friday sales vying for consumer affection. As a result of this increased accessibility, companies are compiling data at an exponential rate, and IT engagements are in demand to help keep up with the storage, privacy and computational speeds needed to extract meaning from it all.

A Lack of Standardized Nomenclature: Data’s Biggest Enemy

Taking a step back from IT, one of the biggest problems facing big data in the biological sciences is the lack of standardized nomenclature. As the field emerged, naming conventions were non-existent, giving rise to a “Wild West”-type environment where scientists labeled things as they thought best.

In 1979, the HUGO Gene Nomenclature Committee (HGNC) was organized to “approve unique symbols and names for human loci, including protein coding genes, ncRNA genes and pseudogenes, to allow unambiguous scientific communication.” However, this nomenclature effort was confined to humans. Its terminology was not carried across the evolutionary tree to model organisms such as yeast, flies, worms or mice, which complicates any effort to mine data effectively. For example, the KRAS gene has 15 synonyms in humans and 11 in mice, and only 4 of them overlap.
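To make the synonym problem concrete, here is a minimal sketch of how alias lists for the same gene in two species might be compared and resolved to one canonical symbol. The alias lists below are a short illustrative sample, not the curated 15- and 11-entry lists mentioned above, and the function names (`normalize`, `shared_aliases`, `build_lookup`) are hypothetical helpers rather than part of any existing library.

```python
def normalize(alias: str) -> str:
    """Lower-case an alias and drop punctuation so 'K-Ras' and 'KRAS' compare equal."""
    return "".join(ch for ch in alias.lower() if ch.isalnum())


# Illustrative sample only: real alias lists should come from a curated source.
HUMAN_KRAS_ALIASES = ["KRAS", "KRAS2", "K-RAS", "Ki-RAS"]
MOUSE_KRAS_ALIASES = ["Kras", "Kras2", "K-ras", "p21B"]


def shared_aliases(aliases_a, aliases_b):
    """Return the normalized aliases that appear in both lists."""
    return {normalize(a) for a in aliases_a} & {normalize(b) for b in aliases_b}


def build_lookup(canonical_symbol, aliases):
    """Map every normalized alias to a single canonical symbol."""
    return {normalize(alias): canonical_symbol for alias in aliases}


if __name__ == "__main__":
    print(sorted(shared_aliases(HUMAN_KRAS_ALIASES, MOUSE_KRAS_ALIASES)))
    lookup = build_lookup("KRAS", HUMAN_KRAS_ALIASES)
    print(lookup.get(normalize("k-Ras")))
```

Normalizing before comparing is a deliberate simplification. It catches differences in case and punctuation, but aliases that are spelled differently altogether still need a curated mapping.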

In fact, it is not uncommon for different scientists within the same team to prefer different nomenclatures. Beyond gene naming, proteins, cells, tissues, organisms, diseases, technologies, protocols and algorithms also exist without definitive standards. Organizations such as the Pistoia Alliance and the Transforming Genetic Medicine Initiative work to improve consistency and establish FAIR (Findable, Accessible, Interoperable, and Reusable) best practices for the use and management of ontologies.

Lastly, even if the nomenclature were established, it is imperative to capture all the metadata associated with a particular sequence. Specifically, what organism did the sequence come from? What is the gender or ethnicity? What is the context (e.g. normal vs. cancerous tissue) associated with that sample? Finally, which technologies (i.e. physical or computational process) were used to generate the data?   
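The questions above map naturally onto a structured metadata record. The sketch below is one possible shape for such a record, assuming a hypothetical `SampleMetadata` class whose field names are chosen for illustration and do not follow any particular standard.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SampleMetadata:
    """Hypothetical metadata record for one sequenced sample."""
    sample_id: str
    organism: Optional[str] = None        # what organism did the sequence come from?
    gender: Optional[str] = None          # gender of the donor, if applicable
    ethnicity: Optional[str] = None       # ethnicity of the donor, if applicable
    tissue_context: Optional[str] = None  # e.g. normal vs. cancerous tissue
    technology: Optional[str] = None      # physical or computational process used


def missing_fields(record: SampleMetadata) -> List[str]:
    """List the fields that were left empty, in declaration order."""
    return [f.name for f in fields(record) if getattr(record, f.name) in (None, "")]


if __name__ == "__main__":
    record = SampleMetadata("S1", organism="Homo sapiens", tissue_context="tumor")
    print(missing_fields(record))
```

A check like `missing_fields` could run when data is ingested, so that gaps in the metadata are flagged while the people who generated the sample can still fill them in.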

Drawing Meaningful Conclusions from Disparate Data

The problem described above becomes more pronounced when trying to integrate information from disparate data sources. If the organizational structure of the metadata does not match, analysis of the combined datasets will be a challenge. While the field works to adopt comprehensive standards, curators will be necessary to bring the data together in a way that can be integrated and interpreted.
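As a rough illustration of what that curation step involves, the sketch below compares the metadata fields of two datasets and then renames fields using a curator-supplied mapping. The field names and the mapping are invented for the example, and `schema_diff` and `harmonize` are hypothetical helpers.

```python
def schema_diff(fields_a, fields_b):
    """Report which metadata fields are shared and which exist in only one dataset."""
    a, b = set(fields_a), set(fields_b)
    return {
        "shared": sorted(a & b),
        "only_in_a": sorted(a - b),
        "only_in_b": sorted(b - a),
    }


def harmonize(record, field_map):
    """Rename a record's keys using a curator-supplied mapping; unmapped keys are kept."""
    return {field_map.get(key, key): value for key, value in record.items()}


if __name__ == "__main__":
    # Invented field names for two datasets that describe the same things differently.
    dataset_a_fields = ["sample_id", "organism", "tissue_context"]
    dataset_b_fields = ["sample_id", "species", "tissue_context", "platform"]
    print(schema_diff(dataset_a_fields, dataset_b_fields))

    # A curator decides that 'species' in dataset B means 'organism' in dataset A.
    curated_map = {"species": "organism"}
    print(harmonize({"sample_id": "B7", "species": "Mus musculus"}, curated_map))
```

The diff only reveals that the field names differ. Deciding that two differently named fields mean the same thing remains a human judgment, which is why curators are still needed.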

To draw meaningful conclusions, data analytics within genomics projects will need to account for the technological problems, nomenclature standards and metadata issues that turnkey big data implementations cannot fully foresee. To do so, the next generation of data analytics for genomics will build on robustly engineered cloud platforms for data storage and analysis.

With public- and government-sponsored projects underway, such as the National Institutes of Health’s 100,000 genomes, the UK's 1 million or Finland's 0.5 million genomes, alongside commercial endeavors, we can access and integrate an almost ludicrous amount of data. However, even the most organized projects still need data scientists to ensure that all data and metadata comply with FAIR standards.

In conclusion, big data platforms will need to anticipate the metadata necessary to turn the volume of disparate data sources across an organization into worthwhile research and commercial applications. Despite the challenges, implementing a genomics-centered big data ecosystem that is powerful, scalable and searchable will yield integrations and interpretations that pave the way for future academic and commercial advances worldwide.
