Article | October 27, 2020
Data platforms and frameworks have been constantly evolving. At one point we were excited by Hadoop (for almost ten years, in fact); then came Snowflake, or as I say the Snowflake Blizzard (which pulled off the biggest software IPO in history); and then there is Google, which solves problems and serves use cases in a way that few companies can match.
The end of the data warehouse
Once upon a time, life was simple; or at least, the basic approach to Business Intelligence was fairly easy to describe: collect information from source systems, build a repository of consistent data, and bolt on one or more reporting and visualisation tools to present that information to users. Data was managed in expensive, slow, inaccessible SQL data warehouses, systems notorious for their lack of scalability. Their demise has been driven by a few technological advances. One of these is the ubiquitous, and still growing, Hadoop.
On April 1, 2006, Apache Hadoop was unleashed upon Silicon Valley. Inspired by Google, Hadoop’s primary purpose was to improve the flexibility and scalability of data processing by splitting the process into smaller functions that run on commodity hardware.
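To make that idea concrete, here is a minimal, single-process sketch in Python of the map-and-reduce pattern Hadoop popularised, using the canonical word-count example. In a real cluster each input split would be handled by a map task on a separate commodity node, with the framework shuffling the intermediate pairs to the reducers; this sketch only illustrates the shape of the computation.

# A minimal sketch of the MapReduce idea: split work into small map
# functions that can run independently, then reduce their partial results.
# Pure Python, single process, for illustration only.
from collections import defaultdict

def map_phase(document):
    """Emit (word, 1) pairs -- the small function run per input split."""
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs):
    """Sum the counts per word -- run after grouping pairs by key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

if __name__ == "__main__":
    splits = ["the quick brown fox", "the lazy dog", "the fox jumps"]
    # In Hadoop, each split would be mapped on a different commodity node.
    mapped = (pair for doc in splits for pair in map_phase(doc))
    print(reduce_phase(mapped))  # {'the': 3, 'quick': 1, ...}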
Hadoop’s intent was to replace enterprise data warehouses based on SQL. Unfortunately, a technology that works for Google may not be the best solution for everyone else. It’s not that others are incompetent: Google solves problems and serves use cases in a way that few companies can match. Google has been running massive-scale applications such as its eponymous search engine, YouTube and the Ads platform. The technologies and infrastructure that make these geographically distributed offerings perform at scale are what make various components of Google Cloud Platform enterprise-ready and well-featured. Google has shown leadership in developing innovations that have been made available to the open-source community and are used extensively by other public cloud vendors and Gartner clients. Examples include the Kubernetes container management framework, the TensorFlow machine learning platform and the Apache Beam data processing programming model. GCP also uses open-source offerings in its cloud, treats third-party data and analytics providers as first-class citizens, and provides unified billing for its customers; examples of the latter include DataStax, Redis Labs, InfluxData, MongoDB, Elastic, Neo4j and Confluent.
Silicon Valley tried to make Hadoop work. The technology was extremely complicated and nearly impossible to use efficiently. Hadoop’s lack of speed was compounded by its focus on unstructured data — you had to be a “flip-flop wearing” data scientist to truly make use of it.
Unstructured datasets are very difficult to query and analyze without deep knowledge of computer science. At one point, Gartner estimated that 70% of Hadoop deployments would not achieve the goal of cost savings and revenue growth, mainly due to insufficient skills and technical integration difficulties. And seventy percent seems like an understatement.
Data storage through the years: from GFS to Snowflake (or the Snowflake Blizzard)
Developing in parallel with Hadoop’s journey was that of Marcin Zukowski, co-founder and CEO of Vectorwise. Marcin took the data warehouse in another direction: the world of advanced vector processing. Despite being almost unheard of among the general public, Snowflake was actually founded back in 2012. Snowflake is not a consumer tech firm like Netflix or Uber; it is business-to-business only, which may explain its high valuation, since enterprise companies are often seen as a more "stable" investment. In short, Snowflake helps businesses manage data stored in the cloud. The firm's motto is "mobilising the world's data", because it allows big companies to make better use of their vast data stores.
Marcin and his teammates rethought the data warehouse by leveraging the elasticity of the public cloud in an unexpected way: separating storage and compute. Their message was this: don’t pay for a data warehouse you don’t need. Only pay for the storage you need, and add capacity as you go. This is considered one of Snowflake’s key innovations: separating storage (where the data is held) from computing (the act of querying). By offering this service before Google, Amazon, and Microsoft had equivalent products of their own, Snowflake was able to attract customers, and build market share in the data warehousing space.
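As an illustration of what that separation looks like from a user's point of view, here is a minimal sketch using the snowflake-connector-python package. The credentials and warehouse name are placeholders, and the workflow shown is an assumption about typical usage rather than an excerpt from Snowflake's documentation; the point is only that compute ("warehouses") is created, resized and suspended independently of the data sitting in storage.

# Minimal sketch: compute is provisioned, scaled and paused on its own,
# while the stored data stays put and is billed separately as storage.
import snowflake.connector

conn = snowflake.connector.connect(user="USER", password="PASSWORD",
                                   account="ACCOUNT")  # placeholders
cur = conn.cursor()

# Provision a small compute cluster on demand.
cur.execute("CREATE WAREHOUSE IF NOT EXISTS reporting_wh "
            "WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE")

# Scale the same compute up for a heavy query, without touching the stored data.
cur.execute("ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'MEDIUM'")

# Suspend it when idle; the data remains available in storage.
cur.execute("ALTER WAREHOUSE reporting_wh SUSPEND")
conn.close()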
Naming the company after a discredited database concept was very brave. For those of us not versed in the details, a snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity-relationship diagram resembles a snowflake: when the schema is completely normalised along all the dimension tables, the resulting structure looks like a snowflake with the fact table in the middle. Needless to say, the snowflake schema is about as far from Hadoop’s design philosophy as technically possible.
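For readers who prefer an example to a definition, the sketch below uses pandas DataFrames as stand-in tables (the table and column names are invented for illustration): the fact table sits in the middle, the product dimension is normalised out into a separate category table, and answering a question means joining back out through the dimensions.

# Illustrative only: a tiny snowflake schema built from pandas DataFrames.
import pandas as pd

category_dim = pd.DataFrame({"category_id": [1, 2],
                             "category_name": ["Beverages", "Snacks"]})
product_dim = pd.DataFrame({"product_id": [10, 11],
                            "product_name": ["Cola", "Crisps"],
                            "category_id": [1, 2]})   # normalised out
sales_fact = pd.DataFrame({"product_id": [10, 10, 11],
                           "units_sold": [3, 5, 2]})

# A query walks the joins outward from the fact table in the middle.
report = (sales_fact
          .merge(product_dim, on="product_id")
          .merge(category_dim, on="category_id")
          .groupby("category_name")["units_sold"].sum())
print(report)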
While Silicon Valley was headed toward a dead end, Snowflake captured an entire cloud data market.
Article | February 10, 2020
There are few movie scenes I can recall from my childhood, but I vividly remember seeing Stanley Kubrick's 1968 sci-fi film 2001: A Space Odyssey in 1970 with my older cousin. What stays with me to this day is the scene where astronaut Dave asks HAL, the homicidal computer based on artificial intelligence (AI), to open the pod bay doors. HAL's eerie reply: "I'm sorry, Dave. I'm afraid I can't do that." In that moment, the concept of man vs. machine was created, predicated on the idea that machines created by man and using AI could (eventually) defy orders, position themselves in the vanguard, and overthrow humankind. Fast forward to today. Within the information governance space, two terms have been used quite frequently in recent years: analytics and AI. Often they are used interchangeably, as if they were practically synonymous.
Article | April 3, 2020
Primarily, the IoT stack is going beyond merely ingesting data to data analytics and management, with a focus on real-time analysis and autonomous AI capabilities. Enterprises are finding more advanced ways to apply IoT for better and more profitable outcomes. IoT platforms have evolved to use standard open-source protocols and components, and enterprises are now primarily focused on resolving business problems such as predictive maintenance or using smart devices to streamline business operations. Platforms focus on similar things, but early attempts at building highly discrete solutions around specific use cases, rather than broad platforms, have been successful. That means more vendors offering more choices for customers, which broadens the chances of success. Clearly, IoT platforms sit at the heart of value creation in the IoT.
Article | July 29, 2020
Headquartered in London, England, BP (NYSE: BP) is a multinational oil and gas company. Operating since 1909, the organization provides its customers with fuel for transportation, energy for heat and light, lubricants to keep engines moving, and petrochemical products.
Business intelligence has always been a key enabler for improving decision-making processes in large enterprises, from the early days of spreadsheet software, to the building of enterprise data warehouses for housing large sets of enterprise data, to more recent efforts to mine those datasets to unearth hidden relationships. One underlying theme throughout this evolution has been the delegation of the crucial task of finding remarkable relationships between various objects of interest to human beings.
What BI technology has been doing, in other words, is making it possible (and often easy) to find the needle in the proverbial haystack, provided you somehow know in which sections of the barn it is likely to be. It is a validatory, as opposed to a predictive, technology.
When the amount of data is huge in terms of variety, volume, and dimensionality (a.k.a. Big Data), and/or the relationships between datasets go beyond the first-order linear relationships amenable to human intuition, this strategy of relying solely on humans to do the essential thinking about the datasets, while using machines only for crucial but dumb data-infrastructure tasks, becomes totally inadequate. The remedy follows directly from this characterisation of the problem: find ways to utilise machines beyond menial tasks and offload some or most of the cognitive work from humans to the machines.
Does this mean all the technology and associated practices developed over the decades in the BI space are no longer useful in the Big Data age? Not at all. On the contrary, they are more useful than ever: whereas in the past humans were in the driving seat, controlling the demand for the use of the datasets so diligently acquired and curated, we now have machines taking up that important role, unleashing manifold different ways of using the data and finding obscure, non-intuitive relationships that elude humans. Moreover, machines bring unprecedented speed and processing scalability to the game, at a level that would be either prohibitively expensive or outright impossible to achieve with a human workforce.
Companies have to recognize the enormous potential of new automated, predictive analytics technologies such as machine learning, and learn how to successfully incorporate those technologies into the data analysis and processing fabric of their existing infrastructure. It is this marrying of the relatively old, stable technologies of data mining, data warehousing, enterprise data models and so on with the new automated predictive technologies that has the huge potential to unleash the benefits so often hyped by the vested interests promoting new tools and applications as the answer to all data analytics problems.
To see this in the context of predictive analytics, let's consider machine learning (ML) technology. The easiest way to understand machine learning is to look at the simplest ML algorithm: linear regression. ML technology builds on the basic interpolation idea of regression and extends it using sophisticated mathematical techniques that are not necessarily obvious to casual users. For example, some ML algorithms extend the linear regression approach to model non-linear (i.e. higher-order) relationships between dependent and independent variables in the dataset via clever mathematical transformations (a.k.a. kernel methods) that express those non-linear relationships in a linear form, making them suitable to run through a linear algorithm.
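A small sketch makes the point. It uses scikit-learn on synthetic data, and an explicit polynomial-feature expansion stands in for the implicit kernel trick mentioned above: a plain linear fit underfits a quadratic relationship, while the same linear algorithm applied to transformed features captures it.

# Plain linear regression vs. the same linear algorithm on transformed
# (polynomial) features. Synthetic data; illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)   # quadratic truth

linear = LinearRegression().fit(X, y)                       # underfits
poly = make_pipeline(PolynomialFeatures(degree=2),
                     LinearRegression()).fit(X, y)          # captures the curve

print("plain linear R^2:", round(linear.score(X, y), 3))
print("polynomial-feature R^2:", round(poly.score(X, y), 3))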
Be it a simple linear algorithm or its more sophisticated kernel-method variations, ML algorithms have no context for the data they process. This is both a strength and a weakness. A strength, because the same algorithms can process many different kinds of data, allowing us to leverage all the work that went into developing those algorithms across different business contexts. A weakness, because the algorithms lack any contextual understanding of the data, so the perennial computer-science truth of "garbage in, garbage out" manifests itself unceremoniously here: ML models have to be fed the "right" kind of data to draw out correct insights that explain the inner relationships in the data being processed.
ML technology provides an impressive set of sophisticated data analysis and modelling algorithms that can uncover very intricate relationships in the datasets they process. It offers not only very sophisticated, advanced analysis and modelling methods, but also the ability to use those methods in an automated, and hence massively distributed and scalable, way. Its Achilles' heel, however, is its heavy dependence on the data it is fed. The best analytic methods are useless, as far as drawing out useful insights is concerned, if they are applied to the wrong kind of data. More seriously, advanced analytical technology can give its users a false sense of confidence in the analysis results it produces, making the whole undertaking not just useless but actually dangerous.
We can address this fundamental weakness of ML technology by deploying its advanced, raw algorithmic processing capabilities in conjunction with existing data analytics technology, whereby contextual data relationships and key domain knowledge from the existing BI estate (data mining efforts, data warehouses, enterprise data models, business rules, etc.) are used to feed the ML analytics pipeline. This approach combines the superior algorithmic processing capabilities of new ML technology with the enterprise knowledge accumulated through BI efforts, and allows companies to build on their existing data analytics investments while transitioning to these advanced technologies. This, I believe, is effectively a win-win situation and will be key to the success of any company engaged in data analytics efforts.
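As a rough sketch of what that marriage might look like in code: curated, context-rich features produced by the existing BI estate are what gets fed to the ML pipeline, rather than raw, uncurated data. The table, column names and churn use case below are hypothetical, chosen only for illustration.

# Hedged sketch: a hypothetical "customer_features" extract, produced by
# existing warehouse/data-mining work, feeds a standard ML model.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# In practice this would be a query against the warehouse; a CSV export
# of the curated feature table stands in for it here.
bi_features = pd.read_csv("customer_features.csv")   # hypothetical extract

X = bi_features[["tenure_months", "avg_monthly_spend", "support_tickets"]]
y = bi_features["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print("holdout accuracy:", round(model.score(X_test, y_test), 3))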