Article | April 13, 2020
THE CORONAVIRUS PANDEMIC has spurred interest in big data to track the spread of the fast-moving pathogen and to plan disease prevention efforts. But the urgent need to contain the outbreak shouldn’t cloud thinking about big data’s potential to do more harm than good. Companies and governments worldwide are tapping the location data of millions of internet and mobile phone users for clues about how the virus spreads and whether social distancing measures are working. Unlike surveillance measures that track the movements of particular individuals, these efforts analyze large data sets to uncover patterns in people’s movements and behavior over the course of the pandemic.
Article | April 13, 2020
The acronym DMaaS can refer to two related but separate things: data center management-as-a-service (referred to here by its other acronym, DCMaaS) and data management-as-a-service. The former looks at infrastructure-level questions such as optimization of data flows in a cloud service; the latter refers to master data management and data preparation as applied to federated cloud services. DCMaaS has been under development for some years; DMaaS is slightly younger and is a product of the growing interest in machine learning and big data analytics, along with increasing concern over privacy, security, and compliance in a cloud environment. DMaaS responds to a developing concern over data quality in machine learning, driven by the large amount of data that must be used for training and the inherent dangers posed by divergence in data structure across multiple sources. To use the rapidly growing array of cloud data, including public cloud information and corporate internal information from hybrid clouds, you must aggregate data in a normalized way so it can be available for model training and processing with ML algorithms. As data volumes and data diversity increase, this becomes increasingly difficult.
Article | April 13, 2020
Data platforms and frameworks have been constantly evolving. At one point we were excited by Hadoop (well, for almost 10 years), followed by Snowflake, or as I say, Snowflake Blizzard (which pulled off the biggest software IPO in history), and then Google (which solves problems and serves use cases in a way that few companies can match).
The end of the data warehouse
Once upon a time, life was simple; or at least, the basic approach to Business Intelligence was fairly easy to describe: a process of collecting information from systems, building a repository of consistent data, and bolting on one or more reporting and visualisation tools that presented information to users. Data used to be managed in expensive, slow, inaccessible SQL data warehouses. SQL systems were notorious for their lack of scalability. Their demise is coming from a few technological advances. One of these is the ubiquitous, and growing, Hadoop.
On April 1, 2006, Apache Hadoop was unleashed upon Silicon Valley. Inspired by Google, Hadoop’s primary purpose was to improve the flexibility and scalability of data processing by splitting the process into smaller functions that run on commodity hardware.
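The idea of splitting a job into small functions can be illustrated with a toy word count. The sketch below is plain single-process Python, not Hadoop's actual Java API. It assumes the pattern meant here is the map/shuffle/reduce model that Hadoop implements, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration.

```python
from collections import defaultdict
from typing import Iterable


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: dict[str, list[int]]) -> dict[str, int]:
    """Reduce step: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks: list[str]) -> dict[str, int]:
    # On a real cluster each chunk could be mapped on a different machine;
    # here the chunks are simply processed one after another.
    emitted = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(emitted))


if __name__ == "__main__":
    print(word_count(["big data big plans", "data moves fast"]))
```

Because each call to `map_phase` only looks at its own chunk, the chunks can in principle be handled independently on cheap machines, which is the scalability argument made above.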
Hadoop’s intent was to replace enterprise data warehouses based on SQL. Unfortunately, a technology used by Google may not be the best solution for everyone else. It’s not that others are incompetent: Google solves problems and serves use cases in a way that few companies can match. Google has been running massive-scale applications such as its eponymous search engine, YouTube and the Ads platform. The technologies and infrastructure that make the geographically distributed offerings perform at scale are what make various components of Google Cloud Platform enterprise ready and well-featured. Google has shown leadership in developing innovations that have been made available to the open-source community and are being used extensively by other public cloud vendors and Gartner clients. Examples of these include the Kubernetes container management framework, TensorFlow machine learning platform and the Apache Beam data processing programming model. GCP also uses open-source offerings in its cloud while treating third-party data and analytics providers as first-class citizens on its cloud and providing unified billing for its customers. The examples of the latter include DataStax, Redis Labs, InfluxData, MongoDB, Elastic, Neo4j and Confluent.
Silicon Valley tried to make Hadoop work. The technology was extremely complicated and nearly impossible to use efficiently. Hadoop’s lack of speed was compounded by its focus on unstructured data — you had to be a “flip-flop wearing” data scientist to truly make use of it.
Unstructured datasets are very difficult to query and analyze without deep knowledge of computer science. At one point, Gartner estimated that 70% of Hadoop deployments would not achieve the goal of cost savings and revenue growth, mainly due to insufficient skills and technical integration difficulties. And seventy percent seems like an understatement.
Data storage through the years: from GFS to Snowflake or Snowflake blizzard
Developing in parallel with Hadoop’s journey was that of Marcin Zukowski, co-founder and CEO of Vectorwise. Marcin took the data warehouse in another direction, to the world of advanced vector processing, and later became one of the co-founders of Snowflake. Despite being almost unheard of among the general public, Snowflake was actually founded back in 2012. Snowflake is not a consumer tech firm like Netflix or Uber. It's business-to-business only, which may explain its high valuation: enterprise companies are often seen as a more "stable" investment. In short, Snowflake helps businesses manage data that's stored on the cloud. The firm's motto is "mobilising the world's data", because it allows big companies to make better use of their vast data stores.
Marcin and his teammates rethought the data warehouse by leveraging the elasticity of the public cloud in an unexpected way: separating storage (where the data is held) from compute (the act of querying). Their message was this: don’t pay for a data warehouse you don’t need. Pay only for the storage you need, and add capacity as you go. This separation is considered one of Snowflake’s key innovations. By offering this service before Google, Amazon, and Microsoft had equivalent products of their own, Snowflake was able to attract customers and build market share in the data warehousing space.
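To see why billing storage and compute separately matters, here is a rough back-of-the-envelope sketch. The rates are invented purely to keep the arithmetic simple and are not Snowflake's prices; `Rates` and `monthly_bill` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices, chosen only to make the arithmetic easy to follow."""
    storage_per_tb_month: float = 25.0  # assumed value, not a real price list
    compute_per_hour: float = 4.0       # assumed value, not a real price list


def monthly_bill(stored_tb: float, compute_hours: float, rates: Rates = Rates()) -> float:
    """Bill storage and compute as two independent line items."""
    storage_cost = stored_tb * rates.storage_per_tb_month
    compute_cost = compute_hours * rates.compute_per_hour
    return storage_cost + compute_cost


if __name__ == "__main__":
    # The same 10 TB of data under two very different query workloads.
    print(monthly_bill(stored_tb=10, compute_hours=0))    # data at rest only
    print(monthly_bill(stored_tb=10, compute_hours=100))  # heavy querying
```

Under these assumed rates, a month with no queries costs only the storage line item, while a busy month adds compute on top without changing what is paid for storage.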
Naming the company after a discredited database concept was very brave. For those of us not versed in the details, the snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity-relationship diagram resembles a snowflake shape. … When it is completely normalized along all the dimension tables, the resultant structure resembles a snowflake with the fact table in the middle. Needless to say, the “snowflake” schema is as far from Hadoop’s design philosophy as technically possible.
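For readers who have never seen one, a tiny snowflake schema might look like the sketch below. It uses Python's built-in `sqlite3` module with an in-memory database; the table names (`sales`, `product`, `category`), the sample rows, and the `revenue_by_category` helper are all invented for illustration and have nothing to do with Snowflake the product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Outer layer: a dimension of a dimension, which is what makes it a snowflake.
    CREATE TABLE category (
        category_id INTEGER PRIMARY KEY,
        name TEXT
    );
    -- Dimension table, normalized so it points at category instead of repeating it.
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name TEXT,
        category_id INTEGER REFERENCES category(category_id)
    );
    -- Fact table in the middle: one row per sale.
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES product(product_id),
        amount REAL
    );
    INSERT INTO category VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO product VALUES (1, 'Laptop', 1), (2, 'Monitor', 1), (3, 'IDE licence', 2);
    INSERT INTO sales VALUES (1, 1, 900.0), (2, 2, 200.0), (3, 3, 50.0), (4, 1, 950.0);
""")


def revenue_by_category(connection: sqlite3.Connection) -> dict[str, float]:
    """Walk from the fact table out through both dimension layers."""
    rows = connection.execute("""
        SELECT c.name, SUM(s.amount)
        FROM sales AS s
        JOIN product AS p ON p.product_id = s.product_id
        JOIN category AS c ON c.category_id = p.category_id
        GROUP BY c.name
        ORDER BY c.name
    """).fetchall()
    return dict(rows)


if __name__ == "__main__":
    print(revenue_by_category(conn))
```

The two joins are the price of full normalization: answering a simple question means hopping from the fact table through each layer of dimension tables.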
While Silicon Valley was headed toward a dead end, Snowflake captured an entire cloud data market.
Article | April 13, 2020
The terms data science and data analytics are not unfamiliar to people who work in the technology field. Indeed, the two terms seem the same, and most people use them as synonyms. However, many are not aware that there is actually a difference between data science and data analytics. It is important that people whose work revolves around these terms, or around the information and technology industries, know how to use them in the appropriate contexts. The reason is quite simple: the right usage of these terms has a significant impact on the management and productivity of a business, especially in today’s increasingly data-dependent world.