How Much Java Is Required to Learn Hadoop?

AMIT VERMA | September 3, 2018

Apache Hadoop is one of the most popular enterprise solutions for big data and has been adopted by most of the major IT companies. Hadoop skills ranked among the top 10 IT jobs in both 2016 and 2017. Professionals who aspire to become proficient in Hadoop therefore need to keep exploring this evolving ecosystem on an ongoing basis.
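As a rough illustration of the kind of Java a Hadoop developer typically writes, here is a minimal word-count mapper sketch against the standard org.apache.hadoop.mapreduce API. The class name, file layout and tokenisation logic are illustrative assumptions, not taken from the article.

// Illustrative only: a minimal word-count mapper using the standard
// org.apache.hadoop.mapreduce API. Class name and tokenisation are hypothetical.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split each input line into tokens and emit (token, 1) pairs;
        // the framework groups the pairs by key before the reduce phase.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

Code of this sort relies only on core Java (classes, generics, collections, exceptions) plus the Hadoop client library, which is why a working knowledge of core Java is generally considered sufficient to get started with Hadoop development.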

Spotlight

Iris Data Science

Iris Data Science is a data science, machine learning and Big Data provider specialising in predictive analytics. More data is being created, rapidly and in novel ways, and real-time responses and predictions on this increasingly diverse data will give early adopters a clear advantage. Iris Data Science is building a suite of services and products to turn data into gold: to discover previously hidden patterns, enable better decision-making, and provide richer analysis and predictive analytics…

OTHER ARTICLES

Big Data Could Undermine the Covid-19 Response

Article | April 13, 2020

The coronavirus pandemic has spurred interest in big data to track the spread of the fast-moving pathogen and to plan disease prevention efforts. But the urgent need to contain the outbreak shouldn't cloud thinking about big data's potential to do more harm than good. Companies and governments worldwide are tapping the location data of millions of internet and mobile phone users for clues about how the virus spreads and whether social distancing measures are working. Unlike surveillance measures that track the movements of particular individuals, these efforts analyze large data sets to uncover patterns in people's movements and behavior over the course of the pandemic.

DATA ARCHITECTURE

Evolution of Capabilities of Data Platforms and the Data Ecosystem

Article | April 13, 2020

Data platforms and frameworks have been evolving constantly. For a while we were excited by Hadoop (for almost ten years, in fact); then came Snowflake, or as I say the Snowflake blizzard (which managed to pull off one of the biggest IPO wins in history); and then Google, which solves problems and serves use cases in a way that few companies can match.

The end of the data warehouse

Once upon a time, life was simple; or at least, the basic approach to Business Intelligence was fairly easy to describe: collect information from systems, build a repository of consistent data, and bolt on one or more reporting and visualisation tools that present the information to users. Data used to be managed in expensive, slow, inaccessible SQL data warehouses, and those systems were notorious for their lack of scalability. Their demise came from a few technological advances, one of which is the ubiquitous, and still growing, Hadoop.

On April 1, 2006, Apache Hadoop was unleashed upon Silicon Valley. Inspired by Google, Hadoop's primary purpose was to improve the flexibility and scalability of data processing by splitting the work into smaller functions that run on commodity hardware. Its intent was to replace enterprise data warehouses based on SQL. Unfortunately, a technology used by Google may not be the best solution for everyone else. It's not that others are incompetent: Google solves problems and serves use cases in a way that few companies can match. Google runs massive-scale applications such as its eponymous search engine, YouTube and the Ads platform, and the technologies and infrastructure that make these geographically distributed offerings perform at scale are what make the components of Google Cloud Platform enterprise-ready and well-featured. Google has also shown leadership in developing innovations that it has made available to the open-source community and that are used extensively by other public cloud vendors and Gartner clients; examples include the Kubernetes container management framework, the TensorFlow machine learning platform and the Apache Beam data processing programming model. GCP likewise uses open-source offerings in its cloud while treating third-party data and analytics providers, such as DataStax, Redis Labs, InfluxData, MongoDB, Elastic, Neo4j and Confluent, as first-class citizens and providing unified billing for its customers.

Silicon Valley tried to make Hadoop work, but the technology was extremely complicated and nearly impossible to use efficiently. Hadoop's lack of speed was compounded by its focus on unstructured data: you had to be a "flip-flop wearing" data scientist to truly make use of it, because unstructured datasets are very difficult to query and analyze without deep knowledge of computer science. At one point, Gartner estimated that 70% of Hadoop deployments would not achieve their goals of cost savings and revenue growth, mainly due to insufficient skills and technical integration difficulties. And seventy percent seems like an understatement.

Data storage through the years: from GFS to the Snowflake blizzard

Developing in parallel with Hadoop's journey was that of Marcin Zukowski, co-founder and CEO of Vectorwise, who took the data warehouse in another direction: the world of advanced vector processing. Despite being almost unheard of among the general public, Snowflake was actually founded back in 2012. Snowflake is not a consumer tech firm like Netflix or Uber; it is business-to-business only, which may explain its high valuation, since enterprise companies are often seen as a more "stable" investment. In short, Snowflake helps businesses manage data that is stored in the cloud. The firm's motto is "mobilising the world's data", because it allows big companies to make better use of their vast data stores.

Marcin and his teammates rethought the data warehouse by leveraging the elasticity of the public cloud in an unexpected way: separating storage and compute. Their message was this: don't pay for a data warehouse you don't need; only pay for the storage you need, and add capacity as you go. This is considered one of Snowflake's key innovations, separating storage (where the data is held) from compute (the act of querying). By offering this service before Google, Amazon and Microsoft had equivalent products of their own, Snowflake was able to attract customers and build market share in the data warehousing space.

Naming the company after a discredited database concept was very brave. For those of us not familiar with the details, a snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity-relationship diagram resembles a snowflake shape: when the dimension tables are completely normalized, the resulting structure has the fact table in the middle with branching dimension tables around it. Needless to say, the snowflake schema is about as far from Hadoop's design philosophy as technically possible. While Silicon Valley was headed toward a dead end, Snowflake captured an entire cloud data market.
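To make the snowflake schema concrete, here is a minimal sketch in Java (using records, Java 16+) of what such a layout looks like: a central fact table whose dimensions are normalized into sub-dimensions, so the references fan out from the middle. All table and field names below are hypothetical illustrations, not taken from the article or from Snowflake's product.

// Illustrative only: a tiny snowflake-style schema expressed as Java records.
// The fact table sits in the middle; the Product dimension is normalized into a
// separate Category sub-dimension, which is what gives the ER diagram its shape.
public class SnowflakeSchemaExample {

    record Category(long categoryId, String name) {}                 // sub-dimension
    record Product(long productId, String name, long categoryId) {}  // normalized dimension
    record DateDim(long dateId, int year, int month, int day) {}     // dimension
    record SalesFact(long productId, long dateId, double amount) {}  // fact table

    public static void main(String[] args) {
        Category beverages = new Category(1, "Beverages");
        // Product stores only the category key, not the category name itself.
        Product coffee = new Product(10, "Coffee", beverages.categoryId());
        DateDim day = new DateDim(20200413, 2020, 4, 13);
        SalesFact sale = new SalesFact(coffee.productId(), day.dateId(), 4.50);
        System.out.println("Recorded sale of product " + sale.productId()
                + " for " + sale.amount());
    }
}

A star schema would instead denormalize the category name directly into the Product dimension; the extra level of normalization is exactly what makes the diagram "snowflake" outward from the fact table.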


MiPasa project and IBM Blockchain team on open data platform to support Covid-19 response

Article | April 13, 2020

Powerful technologies and expertise can help provide better data and help people better understand their situation. As the world contends with the ongoing coronavirus outbreak, officials battling the pandemic need tools and valid information at scale to help foster a greater sense of security for the public. As technologists, we have been heartened by the prevalence of projects such as Call for Code, hackathons and other attempts by our colleagues to rapidly create tools that might be able to help stem the crisis. But for these tools to work, they need data from sources they can validate. For example, reopening the world’s economy will likely require not only testing millions of people, but also being able to map who tested positive, where people can and can’t go and who is at exceptionally high risk of exposure and must be quarantined again.


Time Machine: Big Data of the Past for the Future of Europe

Article | April 13, 2020

Emerging technology has the power to transform history and cultural heritage into a living resource. The Time Machine project will digitise archives from museums and libraries, using Artificial Intelligence and Big Data mining, to offer richer interpretations of our past. An inclusive European identity benefits from a deep engagement with the region’s past. The Time Machine project set out to offer this by exploiting already freely accessible Big Data sources. EU support for a preparatory action enabled the development of a decade-long roadmap for the large-scale digitisation of kilometres of archives, from large museum and library collections, into a distributed information system. Artificial Intelligence (AI) will play a key role at each step, from digitisation planning to document interpretation and fact-checking. Once embedded, this infrastructure could create new business and employment opportunities across a range of sectors including ICT, the creative industries and tourism.

