CHITRESH SHARMA | October 27, 2020
Data platforms and frameworks have been constantly evolving. At one point we were excited by Hadoop (for almost 10 years, in fact); then came Snowflake, or as I call it the Snowflake Blizzard (which pulled off the biggest software IPO to date); and then Google (which solves problems and serves use cases in a way that few companies can match).
The end of the data warehouse
Once upon a time, life was simple; or at least, the basic approach to Business Intelligence was fairly easy to describe… A process of collecting information from systems, building a repository of consistent data, and bolting on one or more reporting and visualisation tools which presented information to users. Data used to be managed in expensive, slow, inaccessible SQL data warehouses. SQL systems were notorious for their lack of scalability. Their demise is coming from a few technological advances. One of these is the ubiquitous, and growing, Hadoop.
On April 1, 2006, Apache Hadoop was unleashed upon Silicon Valley. Inspired by Google, Hadoop’s primary purpose was to improve the flexibility and scalability of data processing by splitting the process into smaller functions that run on commodity hardware.
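To make the idea of splitting a job into smaller functions concrete, here is a minimal sketch of a map/reduce word count in plain Python. It is not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are illustrative assumptions, and everything runs in one process rather than across commodity machines.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run map over every chunk, then shuffle and reduce the combined output."""
    pairs = []
    for chunk in chunks:
        # In a real cluster each chunk would be mapped on a different node.
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


# Hypothetical input: two chunks standing in for blocks of a large file.
chunks = ["big data big", "data platforms"]
counts = word_count(chunks)
print(counts)
```

Because each map call only sees its own chunk, the chunks could in principle be processed on separate machines, which is the property that lets this style of processing scale out.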
Hadoop’s intent was to replace enterprise data warehouses based on SQL. Unfortunately, a technology used by Google may not be the best solution for everyone else. It’s not that others are incompetent: Google solves problems and serves use cases in a way that few companies can match. Google has been running massive-scale applications such as its eponymous search engine, YouTube and the Ads platform. The technologies and infrastructure that make the geographically distributed offerings perform at scale are what make various components of Google Cloud Platform enterprise ready and well-featured. Google has shown leadership in developing innovations that have been made available to the open-source community and are being used extensively by other public cloud vendors and Gartner clients. Examples of these include the Kubernetes container management framework, TensorFlow machine learning platform and the Apache Beam data processing programming model. GCP also uses open-source offerings in its cloud while treating third-party data and analytics providers as first-class citizens on its cloud and providing unified billing for its customers. The examples of the latter include DataStax, Redis Labs, InfluxData, MongoDB, Elastic, Neo4j and Confluent.
Silicon Valley tried to make Hadoop work. The technology was extremely complicated and nearly impossible to use efficiently. Hadoop’s lack of speed was compounded by its focus on unstructured data — you had to be a “flip-flop wearing” data scientist to truly make use of it.
Unstructured datasets are very difficult to query and analyze without deep knowledge of computer science. At one point, Gartner estimated that 70% of Hadoop deployments would not achieve the goal of cost savings and revenue growth, mainly due to insufficient skills and technical integration difficulties. And seventy percent seems like an understatement.
Data storage through the years: from GFS to Snowflake or Snowflake blizzard
Developing in parallel with Hadoop’s journey was that of Marcin Zukowski, co-founder and CEO of Vectorwise, who went on to co-found Snowflake. Marcin took the data warehouse in another direction, to the world of advanced vector processing. Despite being almost unheard of among the general public, Snowflake was actually founded back in 2012. Snowflake is not a consumer tech firm like Netflix or Uber. It's business-to-business only, which may explain its high valuation: enterprise companies are often seen as a more "stable" investment. In short, Snowflake helps businesses manage data that's stored in the cloud. The firm's motto is "mobilising the world's data", because it allows big companies to make better use of their vast data stores.
Marcin and his teammates rethought the data warehouse by leveraging the elasticity of the public cloud in an unexpected way: separating storage and compute. Their message was this: don’t pay for a data warehouse you don’t need. Only pay for the storage you need, and add capacity as you go. This is considered one of Snowflake’s key innovations: separating storage (where the data is held) from computing (the act of querying). By offering this service before Google, Amazon, and Microsoft had equivalent products of their own, Snowflake was able to attract customers, and build market share in the data warehousing space.
Naming the company after a discredited database concept was very brave. For those of us not steeped in the details, a snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity-relationship diagram resembles a snowflake shape. … When it is completely normalized along all the dimension tables, the resultant structure resembles a snowflake with the fact table in the middle. Needless to say, the “snowflake” schema is as far from Hadoop’s design philosophy as technically possible.
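For readers who want to see the shape, here is a small sketch of a snowflake schema using Python's built-in sqlite3 module. The table names (sales, product, category) and the sample rows are invented for illustration only; the point is that the fact table sits in the middle and the product dimension is itself normalized into a further category table.

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Outer "branch" of the snowflake: the product dimension is normalized further.
CREATE TABLE category (
    category_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);

-- Dimension table that points outward to category.
CREATE TABLE product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    category_id INTEGER NOT NULL REFERENCES category(category_id)
);

-- Fact table in the middle of the snowflake.
CREATE TABLE sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES product(product_id),
    amount     REAL NOT NULL
);

-- Hypothetical sample data.
INSERT INTO category VALUES (1, 'Hardware'), (2, 'Software');
INSERT INTO product  VALUES (10, 'Laptop', 1), (11, 'Licence', 2);
INSERT INTO sales    VALUES (100, 10, 1200.0), (101, 11, 300.0), (102, 10, 800.0);
""")

# Reporting on category requires walking fact -> dimension -> sub-dimension.
rows = conn.execute("""
    SELECT c.name, SUM(s.amount)
    FROM sales AS s
    JOIN product  AS p ON p.product_id  = s.product_id
    JOIN category AS c ON c.category_id = p.category_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

print(rows)
```

The two joins in the final query are the price of full normalization: every extra level of the snowflake adds another hop between the fact table and the attribute being reported on.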
While Silicon Valley was headed toward a dead end, Snowflake captured an entire cloud data market.
CHITRESH SHARMA | August 13, 2020
The coronavirus outbreak in China has grown into a pandemic and is affecting global health as well as social and economic dynamics. An ever-increasing velocity and scale of analysis, in terms of both processing and access, is required to succeed in the face of unimaginable shifts in market, health and social paradigms. The COVID-19 pandemic is accompanied by an infodemic. With the global novel coronavirus pandemic filling headlines, TV news and social media, it can seem as if we are drowning in information and data about the virus. With so much data being pushed at us and shared, it can be hard for the general public to know what is correct, what is useful and (unfortunately) what is dangerous. In general, levels of trust in scientists are quite high, albeit with differences across countries and regions. A 2019 survey conducted across 140 countries showed that, globally, 72% of the respondents trusted scientists at “high” or “medium” levels. However, the proportion expressing “high” or “medium” levels of trust in science ranged from about 90% in Northern and Western Europe to 68% in South America and 48% in Central Africa (Rabesandratana, 2020).
In times of crisis, like the ongoing spread of COVID-19, both scientific and non-scientific data should be a trusted source for information, analysis and decision making. While global sharing and collaboration of research data has reached unprecedented levels, challenges remain. Trust in at least some of the data is relatively low, and outstanding issues include the lack of specific standards, co-ordination and interoperability, as well as data quality and interpretation. To strengthen the contribution of open science to the COVID-19 response, policy makers need to ensure adequate data governance models, interoperable standards, sustainable data sharing agreements involving the public sector, private sector and civil society, incentives for researchers, sustainable infrastructures, human and institutional capabilities and mechanisms for access to data across borders.
COVID-19 data is cited as critical for vaccine discovery, for planning and forecasting healthcare capacity, and for setting up emergency systems. It is also expected to contribute to policy objectives like higher transparency and accountability, more informed policy debates, better public services, greater citizen engagement, and new business development. This is precisely why “open data” access to COVID-19 information is critical for humanity to succeed. In global emergencies like the coronavirus (COVID-19) pandemic, open science policies can remove obstacles to the free flow of research data and ideas, and thus accelerate the pace of research critical to combating the disease. UNESCO, which has set up open access to some data, is playing a leading role in this direction. Thankfully, scientists around the world working on COVID-19 are able to work together, share data and findings, and hopefully make a difference to the containment, treatment and eventually vaccines for COVID-19.
Science and technology are essential to humanity’s collective response to the COVID-19 pandemic. Yet the extent to which policymaking is shaped by scientific evidence and by technological possibilities varies across governments and societies, and can often be limited. At the same time, collaborations across science and technology communities have grown in response to the current crisis, holding promise for enhanced cooperation in the future as well.
A prominent example of this is the Coalition for Epidemic Preparedness Innovations (CEPI), launched in 2017 as a partnership between public, private, philanthropic and civil society organizations to accelerate the development of epidemic vaccines. Its ongoing work has cut the expected development time for a COVID-19 vaccine to 12–18 months, and its grants are providing quick funding for some promising early candidates. It is estimated that an investment of USD 2 billion will be needed, with resources being made available from a variety of sources (Yamey, et al., 2020).
The Open COVID Pledge was launched in April 2020 by an international coalition of scientists, lawyers, and technology companies, and calls on authors to make all intellectual property (IP) under their control available, free of charge, and without encumbrances to help end the COVID-19 pandemic, and reduce the impact of the disease. Some notable signatories include Intel, Facebook, Amazon, IBM, Sandia National Laboratories, Hewlett Packard, Microsoft, Uber, Open Knowledge Foundation, the Massachusetts Institute of Technology, and AT&T. The signatories will offer a specific non-exclusive royalty-free Open COVID license to use IP for the purpose of diagnosing, preventing and treating COVID-19.
Also illustrating the power of open science, online platforms are increasingly facilitating collaborative work of COVID-19 researchers around the world. A few examples include:
1. Research on treatments and vaccines is supported by Elixir, REACTing, CEPI and others.
2. WHO funded research and data organization.
3. London School of Hygiene and Tropical Medicine releases a dataset about the environments that have led to significant clusters of COVID-19 cases, containing more than 250 records with date, location, whether the event was indoors or outdoors, and how many individuals became infected. (7/24/20)
4. The European Union Science Hub publishes a report on the concept of data-driven Mobility Functional Areas (MFAs). They demonstrate how mobile data calculated at a European regional scale can be useful for informing policies related to COVID-19 and future outbreaks. (7/16/20)
While clinical, epidemiological and laboratory data about COVID-19 is widely available, including genomic sequencing of the pathogen, a number of challenges remain:
1. Not all data is sufficiently findable, accessible, interoperable and reusable (FAIR).
2. Sources of data tend to be dispersed. Even though many pooling initiatives are under way, curation needs to be carried out “on the fly”.
3. In addition, many issues arise around the interpretation of data – this can be illustrated by the widely followed epidemiological statistics. Typically, the statistics concern “confirmed cases”, “deaths” and “recoveries”. Each of these items seems to be treated differently in different countries, and is sometimes subject to methodological changes within the same country.
4. Specific standards for COVID-19 data therefore need to be established, and this is one of the priorities of the UK COVID-19 Strategy. A working group within Research Data Alliance has been set up to propose such standards at an international level.
Given the achievements and challenges of open science in the current crisis, lessons from prior experience and from the SARS and MERS outbreaks globally can be drawn on to assist the design of open science initiatives to address the COVID-19 crisis. The following actions can help to further strengthen open science in support of responses to the COVID-19 crisis:
1. Providing regulatory frameworks that would enable interoperability within the networks of large electronic health records providers, patient mediated exchanges, and peer-to-peer direct exchanges. Data standards need to ensure that data is findable, accessible, interoperable and reusable, including general data standards, as well as specific standards for the pandemic.
2. Working together by public actors, private actors, and civil society to develop and/or clarify a governance framework for the trusted reuse of privately-held research data toward the public interest. This framework should include governance principles, open data policies, trusted data reuse agreements, transparency requirements and safeguards, and accountability mechanisms, including ethical councils, that clearly define duties of care for data accessed in emergency contexts.
3. Securing adequate infrastructure (including data and software repositories, computational infrastructure, and digital collaboration platforms) to allow for recurrent occurrences of emergency situations. This includes a global network of certified trustworthy and interlinked repositories with compatible standards to guarantee the long-term preservation of FAIR COVID-19 data, as well as the preparedness for any future emergencies.
4. Ensuring that adequate human capital and institutional capabilities are in place to manage, create, curate and reuse research data – both in individual institutions and in institutions that act as data aggregators, whose role is real-time curation of data from different sources.
In increasingly knowledge-based societies and economies, data are a key resource. Enhanced access to publicly funded data enables research and innovation, and has far-reaching effects on resource efficiency, productivity and competitiveness, creating benefits for society at large. Yet these benefits must also be balanced against associated risks to privacy, intellectual property, national security and the public interest.
Entities such as UNESCO are helping the open science movement to progress towards establishing norms and standards that will facilitate greater, and more timely, access to scientific research across the world. Independent scientific assessments that inform the work of many United Nations bodies are indicating areas needing urgent action, and international cooperation can help with national capacities to implement them. At the same time, actively engaging with different stakeholders in countries around the dissemination of the findings of such assessments can help in building public trust in science.