Article | April 13, 2020
THE CORONAVIRUS PANDEMIC has spurred interest in big data to track the spread of the fast-moving pathogen and to plan disease prevention efforts. But the urgent need to contain the outbreak shouldn’t cloud thinking about big data’s potential to do more harm than good.Companies and governments worldwide are tapping the location data of millions of internet and mobile phone users for clues about how the virus spreads and whether social distancing measures are working. Unlike surveillance measures that track the movements of particular individuals, these efforts analyze large data sets to uncover patterns in people’s movements and behavior over the course of the pandemic.
Article | April 13, 2020
Data Platforms and frameworks have been constantly evolving. At some point of time; we are excited by Hadoop (well for almost 10 years); followed by Snowflake or as I say Snowflake Blizzard (who managed to launch biggest IPO win historically) and the Google (Google solves problems and serves use cases in a way that few companies can match).
The end of the data warehouse
Once upon a time, life was simple; or at least, the basic approach to Business Intelligence was fairly easy to describe… A process of collecting information from systems, building a repository of consistent data, and bolting on one or more reporting and visualisation tools which presented information to users. Data used to be managed in expensive, slow, inaccessible SQL data warehouses. SQL systems were notorious for their lack of scalability. Their demise is coming from a few technological advances. One of these is the ubiquitous, and growing, Hadoop.
On April 1, 2006, Apache Hadoop was unleashed upon Silicon Valley. Inspired by Google, Hadoop’s primary purpose was to improve the flexibility and scalability of data processing by splitting the process into smaller functions that run on commodity hardware.
Hadoop’s intent was to replace enterprise data warehouses based on SQL. Unfortunately, a technology used by Google may not be the best solution for everyone else. It’s not that others are incompetent: Google solves problems and serves use cases in a way that few companies can match. Google has been running massive-scale applications such as its eponymous search engine, YouTube and the Ads platform. The technologies and infrastructure that make the geographically distributed offerings perform at scale are what make various components of Google Cloud Platform enterprise ready and well-featured. Google has shown leadership in developing innovations that have been made available to the open-source community and are being used extensively by other public cloud vendors and Gartner clients. Examples of these include the Kubernetes container management framework, TensorFlow machine learning platform and the Apache Beam data processing programming model. GCP also uses open-source offerings in its cloud while treating third-party data and analytics providers as first-class citizens on its cloud and providing unified billing for its customers. The examples of the latter include DataStax, Redis Labs, InfluxData, MongoDB, Elastic, Neo4j and Confluent.
Silicon Valley tried to make Hadoop work. The technology was extremely complicated and nearly impossible to use efficiently. Hadoop’s lack of speed was compounded by its focus on unstructured data — you had to be a “flip-flop wearing” data scientist to truly make use of it.
Unstructured datasets are very difficult to query and analyze without deep knowledge of computer science. At one point, Gartner estimated that 70% of Hadoop deployments would not achieve the goal of cost savings and revenue growth, mainly due to insufficient skills and technical integration difficulties. And seventy percent seems like an understatement.
Data storage through the years: from GFS to Snowflake or Snowflake blizzard
Developing in parallel with Hadoop’s journey was that of Marcin Zukowski — co-founder and CEO of Vectorwise. Marcin took the data warehouse in another direction, to the world of advanced vector processing. Despite being almost unheard of among the general public, Snowflake was actually founded back in 2012. Firstly, Snowflake is not a consumer tech firm like Netflix or Uber. It's business-to-business only, which may explain its high valuation – enterprise companies are often seen as a more "stable" investment. In short, Snowflake helps businesses manage data that's stored on the cloud. The firm's motto is "mobilising the world's data", because it allows big companies to make better use of their vast data stores.
Marcin and his teammates rethought the data warehouse by leveraging the elasticity of the public cloud in an unexpected way: separating storage and compute. Their message was this: don’t pay for a data warehouse you don’t need. Only pay for the storage you need, and add capacity as you go. This is considered one of Snowflake’s key innovations: separating storage (where the data is held) from computing (the act of querying). By offering this service before Google, Amazon, and Microsoft had equivalent products of their own, Snowflake was able to attract customers, and build market share in the data warehousing space.
Naming the company after a discredited database concept was very brave. For those of us not in the details of the Snowflake schema, it is a logical arrangement of tables in a multidimensional database such that the entity-relationship diagram resembles a snowflake shape. … When it is completely normalized along all the dimension tables, the resultant structure resembles a snowflake with the fact table in the middle. Needless to say, the “snowflake” schema is as far from Hadoop’s design philosophy as technically possible.
While Silicon Valley was headed toward a dead end, Snowflake captured an entire cloud data market.
Article | April 13, 2020
Powerful technologies and expertise can help provide better data and help people better understand their situation. As the world contends with the ongoing coronavirus outbreak, officials battling the pandemic need tools and valid information at scale to help foster a greater sense of security for the public. As technologists, we have been heartened by the prevalence of projects such as Call for Code, hackathons and other attempts by our colleagues to rapidly create tools that might be able to help stem the crisis. But for these tools to work, they need data from sources they can validate. For example, reopening the world’s economy will likely require not only testing millions of people, but also being able to map who tested positive, where people can and can’t go and who is at exceptionally high risk of exposure and must be quarantined again.
Article | April 13, 2020
Emerging technology has the power to transform history and cultural heritage into a living resource. The Time Machine project will digitise archives from museums and libraries, using Artificial Intelligence and Big Data mining, to offer richer interpretations of our past. An inclusive European identity benefits from a deep engagement with the region’s past. The Time Machine project set out to offer this by exploiting already freely accessible Big Data sources. EU support for a preparatory action enabled the development of a decade-long roadmap for the large-scale digitisation of kilometres of archives, from large museum and library collections, into a distributed information system. Artificial Intelligence (AI) will play a key role at each step, from digitisation planning to document interpretation and fact-checking. Once embedded, this infrastructure could create new business and employment opportunities across a range of sectors including ICT, the creative industries and tourism.