Big Data: Are You Ready For Fungible Data?

Ghislain Fourny
5 min read · Mar 2, 2022

We are in the middle of a revolution in the database world. In the press, you might have heard of it referred to as Big Data. We are accumulating data at a scale so massive that we can get insights we could never have dreamed of before, across almost all scientific disciplines. So much data. Too much data. The problem has shifted to: how do we deal with it and make it useful?

In the past few years, I have increasingly been forced to realize that, in order to manage large amounts of data, you need to think small. Very small. Nano-small. We need to explore what minuscule amounts of data look like. Data so tiny you can split it no more. Atoms of data. Quanta of data.

Here is a guided tour of what might become the new world of the database space.

Exploring the infinitely small — A particle collision at the LHC in Geneva.

Data Silos

When I talk with my colleagues working with large companies about the issues their employers have to deal with, there is one that pops up regularly. Large corporations have a lot of databases using all kinds of technologies: from Excel files through relational databases to (for the most courageous) cutting-edge innovations such as document stores and column stores. And these silos of data cannot easily communicate with each other because of impedance mismatches that cost a lot of money to overcome — billions of dollars for the economy as a whole.

The much-praised heterogeneity of NoSQL data, originally the revolutionary feature that gave us the flexibility to scale up storage, is slowly turning into the new shortcoming of current technologies.

This challenge is already being addressed with Data Virtualization, which decouples the interaction with data from its underlying storage. Kurt Cagle identified Data Virtualization as one of the ten data science trends for 2015 [1].
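To make that decoupling concrete, here is a minimal sketch in Python; every class and method name in it is hypothetical and not taken from any particular product. The application issues the same query against one abstract interface, while the storage behind that interface can be a relational table, a document store, or anything else.

    from abc import ABC, abstractmethod

    class DataSource(ABC):
        """Abstract interface: the application talks to this, never to the storage itself."""
        @abstractmethod
        def find(self, **criteria):
            ...

    class RelationalSource(DataSource):
        """One possible backend: rows with a fixed schema, held here as plain tuples."""
        def __init__(self, columns, rows):
            self.columns = columns
            self.rows = rows

        def find(self, **criteria):
            for row in self.rows:
                record = dict(zip(self.columns, row))
                if all(record.get(k) == v for k, v in criteria.items()):
                    yield record

    class DocumentSource(DataSource):
        """Another possible backend: schemaless documents, held here as plain dicts."""
        def __init__(self, documents):
            self.documents = documents

        def find(self, **criteria):
            for doc in self.documents:
                if all(doc.get(k) == v for k, v in criteria.items()):
                    yield doc

    # The same query runs unchanged against either backend.
    backends = [
        RelationalSource(("name", "country"), [("ACME", "CH"), ("Globex", "US")]),
        DocumentSource([{"name": "ACME", "country": "CH", "sector": "tech"}]),
    ]
    for source in backends:
        print(list(source.find(country="CH")))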

Why Fungible Data Is A Game Changer

At gigantic scales, we need to homogenize data. This is why many paradigms (MapReduce, Spark, …) are based on large collections of small items. But I do not mean homogeneous in the sense of a single SQL table, taking us back to the 70s. No, I mean homogeneous in the sense of making the entire data universe fungible.
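To make this concrete, here is a minimal sketch in plain Python; the record shapes and names are invented for illustration. Heterogeneous records are broken down into tiny, uniformly shaped items, so that the storage layer no longer needs to care where any of them came from.

    # Two records from completely different silos.
    records = [
        {"kind": "sensor", "id": 42, "temperature": 21.5},
        {"kind": "tweet", "user": "alice", "text": "hello"},
    ]

    def atomize(record_id, record):
        """Break a record into (entity, attribute, value) items that all look the same."""
        for key, value in record.items():
            yield (record_id, key, value)

    atoms = [atom for i, r in enumerate(records) for atom in atomize(i, r)]

    # Every item now has the same shape, whatever its origin:
    # (0, 'kind', 'sensor'), (0, 'id', 42), ..., (1, 'text', 'hello')
    print(atoms)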

Fungible data means that, from a storage perspective, any data looks the same.

  • Electricity, wherever produced, looks the same. It revolutionized the energy industry precisely because of its fungibility. Electric cars are disruptive because they decouple the consumption of energy from its production.
  • Virtual machines, wherever they are physically located, look alike. This brought elasticity to the computing cloud.
  • Stocks of the same company are interchangeable. Stock fungibility has allowed corporations, over the last few centuries, to grow far beyond what a few individuals could achieve on their own.

These are all fungible assets. Think of fungible data the same way: you can store data in buckets, pour buckets of data into one another, open a tap to fill a bucket. Store data without worrying about its shape. Generic clustering, universal data interchange.

Going Big Also Means Going Small. Very Small.

Every time mankind has pushed a boundary in its discovery of the universe over the last five to six centuries, it has meant going to a greater scale. The Earth: 40,000 km around, or tens of megameters. Our solar system: about one light-year, or petameters. Our galaxy: 100,000 light-years, or roughly a zettameter. The visible universe: 13 billion light-years, or over a hundred yottameters.

As we have created and stored more data, we have travelled through the same scales all over again, in just half a century: almost everybody reading this post is familiar with kilobytes, megabytes, gigabytes, terabytes, and petabytes.

In our discovery of large scales, as we approached the boundaries of the visible universe, this also meant exploring the incredibly small. Atoms: nanometers. Protons: femtometers. Electrons: attometers. The Planck scale is even off the chart: roughly a hundred-billionth of a yoctometer. As heterogeneous as our universe seems in everyday life, at very small scales it is amazingly (surprisingly?) homogeneous: the world as we know it is made of as few as 17 elementary particles, including the Higgs boson discovered in 2012, and fewer than half of them are commonplace. They might even all be vibrating strings at an even smaller scale.

Homogeneous in the small. You probably see where this is going: I strongly believe that, as in physics, this odyssey towards large scales of data will be accompanied by a rediscovery of the minuscule, elementary chunks of the data world. Mastering these atoms of data is the way to make data fully fungible.

The Technology Is Already Here, Today

It is worth emphasizing that fungible data is not science fiction. It is already here, and you might even be using it: at least two standards are popular and well established. They are complementary, since they work at different scales:

  • RDF (Resource Description Framework) [2]: to pursue the analogy with particle physics, RDF triples are the baryons (made of three quarks) of linked data. Each triple describes an elementary relationship between a subject, a property, and an object (see the sketch after this list). The entire worldwide RDF data space can be seen as one single, huge graph made of these relationships. My feeling is that RDF and OWL are slowly gaining popularity in the business world.
  • XBRL (eXtensible Business Reporting Language) [3]: XBRL facts are the atoms of tabular data. Each fact makes an elementary statement about reality, within the context of dimensions such as who, what, when, where, why, as of when, and so on. The entire worldwide XBRL data space can be seen as one single, sparse hypercube. Regulatory authorities around the world increasingly require companies to file their reports in XBRL.
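To make the two standards tangible, here is a minimal sketch in Python. The RDF part uses the rdflib library, which is one possible choice and is not prescribed by anything above; the names and URIs are made up. The XBRL part is only an illustrative dictionary showing a fact and its dimensions, not actual XBRL syntax or a real XBRL API.

    from rdflib import Graph, Literal, Namespace

    # One RDF triple: an elementary subject-property-object statement.
    EX = Namespace("http://example.org/")       # hypothetical namespace
    g = Graph()
    g.add((EX.ACME, EX.hasCEO, Literal("Jane Doe")))
    print(g.serialize(format="turtle"))

    # One XBRL-like fact: an elementary value together with the dimensions
    # that give it its context.
    fact = {
        "concept": "Revenue",      # what
        "entity": "ACME Corp",     # who
        "period": "2014-Q4",       # when
        "unit": "USD",             # how it is measured
        "value": 1_250_000,
    }
    print(fact)

In both cases, the elementary piece of data is so small and so regular that data sets from anywhere can be poured into the same bucket.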

We Are Data Explorers

You might have heard about these new, trendy job titles: “Data Scientist”, “Data Engineer”. In a way, the people involved in the design of new NoSQL paradigms (document stores, triple stores, column stores, key-value stores) are explorers of data in the infinitely large.

In this respect, the people involved in the design of RDF and XBRL technologies are none other than explorers of data in the infinitely small.

Long Story Short…

In 1970, Codd threw databases at computers.

In the 90s, Oracle and IBM were among the pioneers of throwing computers at one another and offering computing as a commodity.

In the 2000s, fast-growing start-ups such as Google, Facebook, and Twitter had the genius to throw computers at databases (“move the queries to the data”) to address the challenge of scale.

Maybe it’s time to start throwing databases at one another and see what other Higgs bosons we will find.

References

[1] Kurt Cagle, Ten Trends In Data Science 2015: https://www.linkedin.com/pulse/ten-trends-data-science-2015-kurt-cagle

[2] RDF specifications at the W3C: http://www.w3.org/RDF/

[3] XBRL specifications: http://www.xbrl.org/

Update: after writing this post, I googled around for fungible data and stumbled upon an interesting read that argues along a similar line of reasoning, but from a business perspective: http://itknowledgeexchange.techtarget.com/total-cio/whats-data-fungibility-got-to-do-with-delivering-business-insight/

Picture copyright: generalfmv @ 123RF.com

Originally published at https://www.linkedin.com.

Ghislain Fourny is a senior scientist at ETH Zurich with a focus on databases and game theory.