Breaking Down Data Silos: The Path to Data-Centricity and FAIR Principles
Does this scenario sound familiar? You need an answer that requires computation, and the computation is dependent on data. So, you go to your data lake, you peer over the edge and ... you have no idea what you're looking at.
That’s where the fun starts. We've all heard that data scientists (and essentially anybody who uses data) spend around 80% of their time preparing data for its intended purpose. But first you’ll have to find it — consider yourself lucky if everything you need is in your data lake and you can recognize and trust it. Once you locate the data, you need to determine how and whether you’re allowed to access it. After that comes the really hard part: bringing the data together and making sure it plays well with the other data you need to get your answer.
A Familiar Challenge
This scenario plays out daily in environments that contain a non-trivial amount of data, primarily because the data is siloed everywhere. Finding the data, establishing its accessibility and making it interoperable with other data is so difficult because of a decades-long laser focus on containers for data, rather than on the data itself.
These “containers for data” are usually relational databases, data lakes and similar technologies that have been used to store data for many years. Storing data in these containers strips the semantics from the data, either obfuscating them in schemas (in the case of relational databases) or ignoring them altogether (in the case of data lakes). No semantics means no metadata, inadequate identifiers and extremely limited interoperability — which in turn creates data silos. People writing applications to get answers are obligated to look to those silos for their data.
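To make the loss of semantics concrete, here is a minimal, hypothetical sketch in Python: the same measurement stored first as a bare container-style row, where meaning lives only in a schema somewhere else, and then as a record that carries its semantics with it. Every identifier and vocabulary URL below is an invented placeholder, not a real ontology.

```python
# A typical "container-first" row: the meaning of each field exists
# only in an external schema (or the schema designer's head).
bare_row = ("S-1042", 7.2, 3)  # What is 7.2? What unit? What does 3 count?

# The same fact with its semantics carried alongside the data.
# All IRIs here are illustrative placeholders.
semantic_record = {
    "@id": "https://example.org/sample/S-1042",
    "https://example.org/vocab/pH": {
        "value": 7.2,
        "unit": "https://example.org/vocab/unit/pH-scale",
    },
    "https://example.org/vocab/replicateCount": 3,
}

# A program (or a person) can now resolve what the number means
# without consulting an out-of-band schema.
value = semantic_record["https://example.org/vocab/pH"]["value"]
```

The second form is more verbose, but that verbosity is exactly the metadata, identifiers and interoperability the container-first approach throws away.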
The Struggle with Data Silos & Application-Centricity
Data silos, by their very nature, force application developers to spend too much time and effort extracting data, transforming it into something digestible and loading it into their application’s environment, a process called ETL (extract, transform and load). This scenario results in yet another silo of data, owned by the specific application that requires it — and often only useful to that one application. Any fluidity in the data is lost, so answers come slowly, if at all. This is application-centricity: the unfortunate truth that most data belongs exclusively to the application that requires it and has been duplicated and formatted to meet that application’s specifications. In reality, the problem is more pervasive than that, and it begins much earlier in the data lifecycle — starting with the focus on containers for data — rather than on data as the key asset.
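The ETL pattern described above can be sketched in a few lines. This toy Python example (the table name, column names and the pH threshold are invented for illustration) shows how each step bakes the source data into one application's private shape:

```python
import csv
import io
import sqlite3

# -- Extract: pull rows out of a source system (a CSV export stands in
#    for the upstream silo here).
source = io.StringIO("sample_id,ph\nS-1042,7.2\nS-1043,6.9\n")
rows = list(csv.DictReader(source))

# -- Transform: reshape the data to this application's private
#    expectations, discarding everything it doesn't need.
transformed = [(r["sample_id"], float(r["ph"]) > 7.0) for r in rows]

# -- Load: copy the result into an application-owned store,
#    creating yet another silo in the process.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE app_samples (sample_id TEXT, is_basic INTEGER)")
conn.executemany("INSERT INTO app_samples VALUES (?, ?)", transformed)
```

Note that `app_samples` now answers exactly one question (is the sample basic?) for exactly one application; the next team with a different question starts the whole pipeline over from the source.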
Embracing Data-Centricity and the FAIR Principles
Shifting the focus from data containers and bespoke applications to the data itself leads to data-centricity, which is enabled by adherence to the FAIR Principles (findable, accessible, interoperable and reusable). When properly applied, the FAIRification process allows data and metadata to become more independent of any particular repository or application, making them readily usable and reusable for unforeseen purposes:
- Usable signifies that you derive immediate value from a data-centric approach. For example, AstraZeneca’s Integrative Informatics group leverages its FAIR identifier policy to create research and clinical study metadata that is “born FAIR” and primed for alignment with internal and external ontologies in their respective domains.
- Reusable suggests that even more value can be realized in the data when addressing new questions or challenges, such as when your colleague shares the latest laboratory results or during a global pandemic.
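What a “born FAIR” metadata record might contain can be sketched as a simple Python structure, with one field or two touching each FAIR facet. All identifiers, URLs and vocabulary references below are invented for this illustration (only the Creative Commons license URL is real):

```python
# A minimal, illustrative metadata record touching each FAIR facet.
fair_metadata = {
    # Findable: a globally unique, persistent identifier and a rich description.
    "identifier": "https://doi.example.org/10.1234/study-42",
    "title": "Study 42: plasma pH measurements",
    # Accessible: where the data can be retrieved and by what protocol.
    "access_url": "https://data.example.org/study-42",
    "access_protocol": "https",
    # Interoperable: terms drawn from a shared, resolvable vocabulary.
    "vocabulary": "https://ontology.example.org/clinical#",
    # Reusable: license and provenance, so the data can serve
    # questions nobody has asked yet.
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "provenance": {"generated_by": "assay-pipeline-v3", "date": "2024-01-15"},
}
```

The point is not the particular fields but that the record travels with the data, independent of any one repository or application, which is what makes reuse for unforeseen purposes possible.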
Of course, data containers and applications are important. Data must reside safely somewhere, and automation and analysis are critical, making excellence in computer system architecture and software engineering indispensable. But no one has ever made an amazing scientific discovery simply by installing a data lake. A shift from application-centricity to a more data-centric approach — facilitated by FAIR data — moves the focus to data reusability, slowing the proliferation of data silos, reducing application development costs and speeding time to insights.