Integrating Apache Spark and NiFi for Data Lakes

December 6, 2016

Ron Bodkin, founder and president of Think Big, discusses our experience of integrating Spark with NiFi to build data lakes, as presented at Hadoop Summit 2016. In this video, Ron covers the requirements and design, and gives a demonstration.

Spotlight

Zoho Corporation

Zoho offers beautifully smart software to help you grow your business. With over 20 million users worldwide, Zoho's 33+ products aid your sales and marketing, support and collaboration, finance and recruitment needs, letting you focus only on your business. Zoho CRM is our flagship service and has won many awards such as the 2012 CRM Magazine Market Leader Award and the 2012 Sleeter 'Awesome Application' Award.

OTHER ARTICLES

Bringing big data science to Africa

Article | March 24, 2020

Africa is set to establish its first big data hub, boosting knowledge sharing and information extraction from complex data sets. The hub will enable the continent to access and analyse timely data relating to the Sustainable Development Goals for evidence-based decision making, says Oliver Chinganya, director of the Africa Statistics Centre at the United Nations Economic Commission for Africa (UNECA). According to a study, big data is having a positive impact in almost every sphere of life, such as health, aviation, banking, military intelligence and space science.


Will We Be Able to Use AI to Prevent Further Pandemics?

Article | March 9, 2021

For many, 2021 has brought hope that they can cautiously start to prepare for a world after Covid. That includes living with the possibility of future pandemics, and starting to reflect on what has been learned from such a brutal shared experience.

One of the areas that has come into its own during Covid has been artificial intelligence (AI), a technology that helped bring the pandemic under control and allowed life to continue through lockdowns and other disruptions. Plenty has been written about how AI has supported many aspects of life at work and home during Covid, from videoconferencing to online food ordering. But the role of AI in preventing Covid from causing even more havoc is not as widely known. Perhaps even more importantly, little has been said about the role AI is likely to play in preparing for, responding to and even preventing future pandemics. From what we saw in 2020, AI will help prevent global outbreaks of new diseases in three ways: prediction, diagnosis and treatment.

Prediction

Predicting pandemics is all about tracking data that could be an early sign that a new disease is spreading in a disturbing way. The kind of data we’re talking about includes public health information about symptoms presenting to hospitals and doctors around the world. Plenty of this is already captured in healthcare systems globally, and is consolidated into datasets such as the Johns Hopkins reports that many of us are familiar with from news briefings.

Firms like Bluedot and Metabiota are part of a growing number of organisations which use AI to track both publicly available and private data and make relevant predictions about public health threats. Both of these received attention in 2020 by reporting the appearance of Covid before it had been officially acknowledged. Boston Children’s Hospital is an example of a healthcare institution doing something similar with their Healthmap resource.
In addition to conventional healthcare data, AI is uniquely able to make use of informal data sources such as social media, news aggregators and discussion forums. This is because of AI techniques such as natural language processing and sentiment analysis. Firms such as Stratifyd use AI to do this in other business settings such as marketing, but also talk publicly about the use of their platform to predict and prevent pandemics. This is an example of so-called augmented intelligence, where AI is used to guide people to noteworthy data patterns, but stops short of deciding what they mean, leaving that to human judgement.

Another important part of preventing a pandemic is keeping track of the transmission of disease through populations and geographies. A significant issue in 2020 was the difficulty of tracing people who had come into contact with infection. There was some success using mobile phones for this, and AI was critical in generating useful knowledge from mobile phone data. The emphasis of Covid tracing apps in 2020 was keeping track of how the disease had already spread, but future developments are likely to be about predicting future spread patterns from such data. Prediction is a strength of AI, and the principles used to great effect in weather forecasting are similar to those used to model likely pandemic spread.

Diagnosis

To prevent future pandemics, it won’t be enough to predict when a disease is spreading rapidly. To make the most of this knowledge, it’s necessary to diagnose and treat cases. One of the greatest early challenges with Covid was the lack of speedy, reliable tests. For future pandemics, AI is likely to be used to create such tests more quickly than was the case in 2020. Creating a useful test involves modelling a disease’s response to different testing reagents, finding the right balance between speed, convenience and accuracy.
AI modelling simulates in a computer how individual cells respond to different stimuli, and could be used to perform virtual testing of many different types of test, accelerating how quickly the most promising ones reach laboratory and field trials.

In 2020 there were also several novel uses of AI to diagnose Covid, but there were few national and global mechanisms to deploy these at scale. One example was the use of AI imaging, diagnosing Covid by analysing chest x-rays for features specific to Covid. This would have been especially valuable in places that didn’t have access to lab testing equipment. Another example was using AI to analyse the sound of coughs to identify unique characteristics of a Covid cough.

AI research to systematically investigate innovative diagnosis techniques such as these should result in better planning for alternatives to laboratory testing. Faster and wider rollout of this kind of diagnosis would help control the spread of a future disease during the critical period spent waiting for other tests to be developed or shared. This would be another contribution of AI to preventing a localised outbreak from becoming a pandemic.

Treatment

Historically, vaccination has proven to be an effective tool for dealing with pandemics, and was the long-term solution to Covid for most countries. AI was used to accelerate development of Covid vaccines, helping cut the development time from years or decades to months. In principle, the use of AI was similar to that described above for developing diagnostic tests. Different drug development teams used AI in different ways, but they all relied on mathematical modelling of how the Covid virus would respond to many forms of treatment at a microscopic level.

Much of the vaccine research and modelling focused on the “spike” proteins that allow Covid to attack human cells and enter the body. These are also found in other viruses, and were already the subject of research before the 2020 pandemic.
That research allowed scientists to quickly develop AI models to represent the spikes, and simulate the effects of different possible treatments. This was crucial in trialling thousands of possible treatments in computer models, pinpointing the most likely successes for further investigation. This kind of mathematical simulation using AI continued during drug development, and moved substantial amounts of work from the laboratory to the computer.

This modelling also allowed the impact of Covid mutations on vaccines to be assessed quickly. It is why scientists were reasonably confident of developing variants of vaccines for new Covid mutations in days and weeks rather than months. As a result of the global effort to develop Covid vaccines, the body of data and knowledge about virus behaviour has grown substantially. This means it should be possible to understand new pathogens even more rapidly than Covid, potentially in hours or days rather than weeks.

AI has also helped create new ways of approaching vaccine development, for example the use of pre-prepared generic vaccines designed to treat viruses from the same family as Covid. Modifying one of these to the specific features of a new virus is much faster than starting from scratch, and AI may even have already simulated exactly such a variation.

AI has been involved in many parts of the fight against Covid, and we now have a much better idea than in 2020 of how to predict, diagnose and treat pandemics, especially those caused by viruses similar to Covid. So we can be cautiously optimistic that vaccine development for any future Covid-like virus will be possible before it becomes a pandemic. Perhaps a trickier question is how well we will be able to respond if the next pandemic is from a virus that is nothing like Covid.

Was Rahman is an expert in the ethics of artificial intelligence, the CEO of AI Prescience and the author of AI and Machine Learning. See more at www.wasrahman.com


What is Data Integrity and Why is it Important?

Article | July 19, 2021

In an era of big data, data health has become a pressing issue as more and more data is stored and processed. Preserving the integrity of the collected data is therefore becoming increasingly necessary. Understanding the fundamentals of data integrity and how it works is the first step in safeguarding the data.

Data integrity is essential for the smooth running of a company. If a company’s data is altered or deleted and there is no way of knowing how, it can have a significant impact on any data-driven business decisions.

Data integrity is the reliability and trustworthiness of data throughout its lifecycle. It is the overall accuracy, completeness, and consistency of data. It can be indicated by a lack of alteration between two updates of a data record, which means the data is unchanged or intact. Data integrity also refers to the safety of data with regard to regulatory compliance, such as GDPR compliance, and security. A collection of processes, rules, and standards implemented during the design phase maintains the safety and security of data. When the information stored in the database remains secure, complete, and reliable no matter how long it has been stored, you know that the integrity of the data is safe. A data integrity framework also ensures that no outside forces are harming this data.

The term data integrity may refer to either a state or a process. As a state, it describes a data set that is valid and accurate. As a process, it describes the measures used to ensure the validity and accuracy of a data set or of all data contained in a database or a construct. Data integrity can be enforced at both physical and logical levels. Let us understand the fundamentals of data integrity in detail.

Types of Data Integrity

There are two types of data integrity: physical and logical. They are collections of processes and methods that enforce data integrity in both hierarchical and relational databases.
Physical Integrity

Physical integrity protects the wholeness and accuracy of data as it’s stored and retrieved. It refers to storing and collecting data as accurately as possible while maintaining its reliability. The physical level of data integrity includes protecting data against external forces such as power cuts, data breaches, unexpected catastrophes, human-caused damage, and more.

Logical Integrity

Logical integrity keeps the data unchanged as it’s used in different ways in a relational database. Logical integrity checks data accuracy in a particular context. It is compromised when a human operator makes errors while entering data manually into the database. Other causes of compromised data integrity include bugs, malware, and transferring data from one site within the database to another when some fields are missing. There are four types of logical integrity:

Entity Integrity

A database has columns, rows, and tables. These elements need to be as numerous as required for the data to be accurate, but no more than necessary. Entity integrity relies on the primary key, the unique value that identifies each piece of data, to make sure the data is listed only once and that the key field in the table is never null. It is a feature of relational systems, which store data in tables that can be linked and utilized in different ways.

Referential Integrity

Referential integrity is a series of processes that ensure data is stored and used uniformly. The database structure has rules embedded in it about the usage of foreign keys, and these ensure that only proper changes, additions, or deletions of data occur. The rules can include limits that eliminate duplicate data entry, guarantee accurate data, and disallow data entry that doesn’t apply. Foreign keys relate data that can be shared or null. An example is employees who share the same work or work in the same department.
Domain Integrity

Domain integrity can be defined as a collection of processes ensuring the accuracy of each piece of data in a domain. A domain is the set of acceptable values a column is allowed to contain. It includes constraints that limit the format, type, and amount of data entered. In domain integrity, all categories and values in a database are set, including the nulls.

User-Defined Integrity

This type of logical integrity involves constraints and rules that users create to fit their specific requirements. Entity, referential, and domain integrity are not always enough to keep data secure. For example, if an employer creates a column to record corrective actions taken against employees, this data would fall under user-defined integrity.

Difference between Data Integrity and Data Security

The terms data security and data integrity often get muddled and are used interchangeably. As a result, data security is incorrectly substituted for data integrity, but each term has a distinct meaning.

Data integrity and data security each play an essential role in the success of the other. Data security means protecting data against unauthorized access or breach, and it is necessary to ensure data integrity. Data integrity is the result of successful data security. However, the term refers only to the validity and accuracy of data rather than the actual act of protecting data. Data security is one of the many ways to maintain data integrity. Data security focuses on reducing the risk of leaking intellectual property, business documents, healthcare data, emails, trade secrets, and more. Some data security tactics include permissions management, data classification, identity and access management, threat detection, and security analytics.

For modern enterprises, data integrity is necessary for accurate and efficient business processes and for making well-informed decisions.
Data integrity is critical yet manageable for organizations today through backup and replication processes, database integrity constraints, validation processes, and other system protocols and data protection methods.

Threats to Data Integrity

Data integrity can be compromised by human error or by malicious acts. Data can also be altered accidentally during transfer from one device to another. An assortment of factors can affect the integrity of the data stored in databases. The following are a few examples:

Human Error

Data integrity is put in jeopardy when individuals enter information incorrectly, duplicate or delete data, don’t follow the correct protocols, or make mistakes in implementing procedures to protect data.

Transfer Error

A transfer error occurs when data is incorrectly transferred from one location in a database to another. This error also happens when a piece of data is present in the destination table but not in the source table in a relational database.

Bugs and Viruses

Data can be stolen, altered, or deleted by spyware, malware, or viruses.

Compromised Hardware

Hardware is compromised when a computer crashes, a server goes down, or any computer component malfunctions. Compromised hardware can render data incorrectly or incompletely, and can limit or eliminate access to data.

Preserving Data Integrity

Companies routinely make decisions based on data. If that data is compromised or incorrect, those decisions can harm the company and its goals to a great extent. The threats mentioned above highlight the part of data security that can help preserve data integrity.
Minimize the risk to your organization by using the following checklist:

Validate Input

Require input validation whenever your data set is supplied by a known or an unknown source (an end-user, another application, a malicious user, or any number of other sources). The data should be validated and verified to ensure the input is correct.

Validate Data

Verifying that data processes haven’t been corrupted is critical. Identify the key specifications and attributes that are necessary for your organization before you validate the data.

Eliminate Duplicate Data

Sensitive data from a secure database can easily end up in a document, spreadsheet, email, or shared folder where employees without proper access can see it. Therefore, it is sensible to clean up stray data and remove duplicates.

Data Backup

Data backups are a critical process in addition to removing duplicates and ensuring data security. Backing up all necessary information goes a long way towards avoiding permanent loss of data. Back up data as often as possible, since organizations may be attacked by ransomware.

Access Control

Another vital data security practice is access control. Individuals in an organization with malicious intent can harm the data. Implementing a model where only users who need access can get it is a successful form of access control. Sensitive servers should be isolated and bolted to the floor, and only individuals with an access key should be allowed to use them.

Keep an Audit Trail

In case of a data breach, an audit trail will help you track down its source. It serves as breadcrumbs to locate and pinpoint the individual and origin of the breach.

Conclusion

Not too long ago, collecting data was difficult. That is no longer an issue. With the amount of data being collected these days, we must maintain the integrity of the data. Organizations can then make data-driven decisions confidently and take the company ahead in the right direction.
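The entity, referential, and domain integrity rules described above can be illustrated with database constraints. The following is a minimal sketch using Python's built-in sqlite3 module, not a recommended production schema; the department and employee tables, the age range, and the try_insert helper are all hypothetical names and values chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema, chosen only to illustrate the three rule types.
conn.executescript(
    """
    CREATE TABLE department (
        id   INTEGER PRIMARY KEY,   -- entity integrity: one unique key per row
        name TEXT NOT NULL
    );
    CREATE TABLE employee (
        id            INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        age           INTEGER CHECK (age BETWEEN 16 AND 100),  -- domain integrity
        department_id INTEGER REFERENCES department (id)       -- referential integrity
    );
    """
)


def try_insert(sql: str, params: tuple) -> str:
    """Run one INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql, params)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected ({exc})"


EMPLOYEE_SQL = "INSERT INTO employee (id, name, age, department_id) VALUES (?, ?, ?, ?)"

results = {
    "valid department": try_insert(
        "INSERT INTO department (id, name) VALUES (?, ?)", (1, "Engineering")
    ),
    "valid employee": try_insert(EMPLOYEE_SQL, (1, "Ada", 36, 1)),
    # Same primary key again: violates entity integrity.
    "duplicate primary key": try_insert(EMPLOYEE_SQL, (1, "Grace", 45, 1)),
    # Department 99 does not exist: violates referential integrity.
    "unknown department": try_insert(EMPLOYEE_SQL, (2, "Linus", 30, 99)),
    # Age outside the allowed set of values: violates domain integrity.
    "age out of range": try_insert(EMPLOYEE_SQL, (3, "Tim", 7, 1)),
    # A foreign key may be NULL, meaning "no department assigned".
    "no department (NULL)": try_insert(EMPLOYEE_SQL, (4, "Edsger", 50, None)),
}

for label, outcome in results.items():
    print(f"{label}: {outcome}")
```

Putting these rules in the schema means the database itself rejects bad rows, regardless of which application or user tries to write them.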
Frequently Asked Questions

What are integrity rules?

Precise data integrity rules are short statements about constraints that need to be applied or actions that need to be taken on the data when entering the data resource or while in the data resource. For example, precise data integrity rules do not state or enforce accuracy, precision, scale, or resolution.

What is a data integrity example?

Data integrity is the overall accuracy, completeness, and consistency of data. A few examples where data integrity is compromised are:
• When a user tries to enter a date outside an acceptable range
• When a user tries to enter a phone number in the wrong format
• When a bug in an application attempts to delete the wrong record

What are the principles of data integrity?

The principles of data integrity are attributable, legible, contemporaneous, original, and accurate. These simple principles need to be part of a data life cycle, GDP, and data integrity initiatives.
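Two of the examples mentioned above, a date outside an acceptable range and a phone number in the wrong format, are the kind of problem input validation is meant to catch. The sketch below shows one possible approach in Python; the function names, the accepted phone format (7 to 15 digits with an optional leading plus sign), and the sample date range are assumptions made for illustration, not rules taken from the article.

```python
import re
from datetime import date

# Hypothetical phone format: optional leading "+", then 7 to 15 digits.
PHONE_PATTERN = re.compile(r"\+?[0-9]{7,15}")


def validate_date_in_range(value: str, earliest: date, latest: date) -> date:
    """Parse an ISO date (YYYY-MM-DD) and reject it if it falls outside the range."""
    parsed = date.fromisoformat(value)  # raises ValueError if malformed
    if not earliest <= parsed <= latest:
        raise ValueError(f"{value} is outside {earliest} to {latest}")
    return parsed


def validate_phone(value: str) -> str:
    """Strip common separators, then check the number against the pattern."""
    cleaned = re.sub(r"[\s\-()]", "", value)
    if not PHONE_PATTERN.fullmatch(cleaned):
        raise ValueError(f"{value!r} is not a valid phone number")
    return cleaned


if __name__ == "__main__":
    window = (date(2021, 1, 1), date(2021, 12, 31))
    print(validate_date_in_range("2021-07-19", *window))
    print(validate_phone("+1 (555) 010-9999"))
    for bad_input in ("2019-07-19", "not-a-date"):
        try:
            validate_date_in_range(bad_input, *window)
        except ValueError as exc:
            print("rejected:", exc)
```

Rejecting bad values at the point of entry keeps them from reaching the database, where they would be harder to find and correct.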


6 Best SaaS Marketing Metrics for Business Growth

Article | July 22, 2021

The software-as-a-service industry is growing rapidly and is estimated to reach $219.5 billion by 2027. SaaS marketing strategies are very different from those of other industries, so tracking the right marketing metrics is necessary. SaaS KPIs, or metrics, measure an enterprise’s performance, growth, and momentum. These SaaS marketing metrics are designed to evaluate the health of a business by tracking sales, marketing, and customer success. Direct access to this data will help you develop your business and show whether there is any room for improvement.

SaaS KPIs: What Are They and Why Do They Matter?

Marketing metrics for SaaS indicate growth in different ways. SaaS KPIs, just like regular KPIs, help businesses evaluate their business models and strategies. These key metrics for SaaS companies give a deep insight into which areas perform well and which require reassessment. To optimize any company’s exposure, SaaS marketing metrics are essential. They measure the performance of sales, marketing, and customer retention.

SaaS companies focus on the entire life cycle of the customer, while traditional web-based companies focus on immediate sales. The overall goal of SaaS companies is to build long-lasting customer relationships, since most revenue is generated through recurring payments.

SaaS marketing technologies are SaaS marketers’ greatest asset if they take the time and effort to understand and implement them. Some metrics are essential and others are unimportant, and knowing which ones to pay attention to is a challenge. Once you get these metrics right, they will help you detect your company’s strengths and weaknesses and understand whether your strategies are working. There are more than fifteen metrics one can track, but tracking them all can make you lose sight of what matters.
In this article, we have identified the critical metrics every SaaS company should track:

Unique Visitors

This metric measures the number of visitors your website or page sees in a specific time period. If someone visits your website four or five times in that period, they are counted as one unique visitor. Recording this metric is crucial as it shows you what type of visitors your site receives and from which channels they arrive. A high number of unique visitors indicates to SaaS marketers that their content resonates with the target customers. It is vital to note, however, through which channels these unique visitors reach your website. These channels can be:
• Organic traffic
• Social media
• Paid ads

At this point, SaaS marketers should identify which channels are working and double down on those. Once you know these channels, you can allocate budgets and optimize them for better performance. Google Analytics is the best free tool to track unique visitors. The tool enables you to filter by date, compare time periods, and generate a report.

Leads

Leads is a broad term that can be broken down into two sub-categories: Sales Qualified Leads (SQL) and Marketing Qualified Leads (MQL). Defining SQL and MQL is important as they can be different for every business. So, let us break down the definitions of the two:

MQL

MQLs are leads that have moved past the visitor phase in the customer lifecycle. They have taken steps to move ahead and qualify as potential customers. They have engaged with your website multiple times. For example, they have visited your website to check out prices or case studies, or have downloaded your whitepapers more than two times.

SQL

SQLs actively engage with your site and are more qualified than MQLs. This is the lead you have deemed the ideal sales candidate. They are well past the initial search stage, are evaluating vendors, and are ready for a direct sales pitch.
The most crucial distinction between the two is that your sales team has deemed SQLs sales-worthy. After distinguishing between the two types of lead, you need to take the next appropriate steps. The best way to measure these leads is through closed-loop automation tools like HubSpot, Marketo, or Pardot. These automation tools help you set up criteria that automatically classify an individual as a lead based on the SQL and MQL actions they take on your website. Next, track the website traffic to ensure these unique visitors turn into potential leads.

Churn

The churn rate, in short, refers to the number of customers lost in a given time frame. It is the number of paying SaaS customers who cancel their recurring services. Since SaaS is a subscription-based service, losing customers directly correlates to losing money. A high churn rate can also indicate that your customers aren’t getting what they want from your service. Like most of your SaaS KPIs, you will be reporting on the churn rate every month.

To calculate the churn rate, take the total number of customers you lost in the month you’re reporting on. Next, divide that by the number of customers you had at the beginning of the reporting month. Then, multiply that number by 100 to get the percentage.

Some churn is natural for any business. However, a high churn rate is an indicator that your business is in trouble. Therefore, it is an essential metric to track for your SaaS company.

Customer Lifetime Value

Customer lifetime value (CLV) measures how valuable a customer is to your business. It is the average amount of money your customers pay during their involvement with your SaaS company. You measure their value based not only on purchases but also on the overall relationship. Keeping an existing client is more important than acquiring a new one, which makes this metric important. Measuring CLV is a bit more complicated than measuring other metrics.
First, calculate the average customer lifetime by dividing one by the customer churn rate. As an example, let’s say your monthly churn rate is 1%. Your average customer lifetime would be 1/0.01 = 100 months. Then take the average customer lifetime and multiply it by the average revenue per account (ARPA) over a given time period. If your company, for example, brought in $100,000 in revenue last month from 100 customers, that would be $1,000 in revenue per account. Finally, this brings us to CLV. Multiply the customer lifetime (100 months) by your ARPA ($1,000). That brings us to 100 x $1,000, or a CLV of $100,000.

CLV is crucial as it indicates whether or not there is a proper strategy in place for business growth. It also shows investors the value of your company.

Customer Acquisition Cost

Customer acquisition cost (CAC) tells you how much you spend to acquire a new customer. The two main factors that determine the CAC are:
• Lead generation costs
• Cost of converting that lead into a client

The CAC predicts the resources needed to acquire new customers. It is vital to understand this metric if you want to grow your customer base and make a profit. To calculate your CAC for any given period, divide your marketing and sales spend over that time period by the number of customers gained during the same time. It might cost more to acquire a new customer, but what if that customer ends up spending more than most? That’s where the CLV to CAC ratio comes into play.

CLV:CAC Ratio

CLV and CAC go hand in hand. Comparing the two will help you understand the impact of your business. The CLV:CAC ratio shows the lifetime value of your customers and the amount you spend to gain new ones in a single metric. The ultimate goal of your company should be to have a high CLV:CAC ratio. According to SaaS analytics, a healthy business should have a CLV three times its CAC. Just divide your calculated CLV by CAC to get the ratio.
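The churn, CLV, and CAC arithmetic described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names are hypothetical, the churn, revenue, and customer figures reuse the article's worked example, and the CAC inputs ($250,000 of sales and marketing spend for 10 new customers) are assumed for the example rather than taken from the article.

```python
def churn_rate_pct(customers_lost: int, customers_at_start: int) -> float:
    """Customers lost in the month, as a percentage of customers at its start."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost * 100 / customers_at_start


def average_lifetime_months(monthly_churn_pct: float) -> float:
    """Average customer lifetime: one divided by the churn rate (as a fraction)."""
    if monthly_churn_pct <= 0:
        raise ValueError("churn rate must be positive")
    return 100 / monthly_churn_pct


def average_revenue_per_account(revenue: float, accounts: int) -> float:
    """ARPA: revenue for the period divided by the number of accounts."""
    return revenue / accounts


def customer_lifetime_value(lifetime_months: float, arpa: float) -> float:
    """CLV: average lifetime multiplied by average revenue per account."""
    return lifetime_months * arpa


def customer_acquisition_cost(sales_marketing_spend: float, new_customers: int) -> float:
    """CAC: sales and marketing spend divided by customers gained in the period."""
    return sales_marketing_spend / new_customers


# Figures from the worked example in the text.
churn = churn_rate_pct(customers_lost=1, customers_at_start=100)   # 1% monthly churn
lifetime = average_lifetime_months(churn)                          # 100 months
arpa = average_revenue_per_account(100_000, 100)                   # $1,000
clv = customer_lifetime_value(lifetime, arpa)                      # $100,000

# Assumed figures, for illustration only.
cac = customer_acquisition_cost(250_000, 10)                       # $25,000
ratio = clv / cac                                                  # CLV:CAC

print(f"churn {churn:.1f}%, lifetime {lifetime:.0f} months, ARPA ${arpa:,.0f}")
print(f"CLV ${clv:,.0f}, CAC ${cac:,.0f}, CLV:CAC {ratio:.1f}:1")
```

With these assumed inputs the ratio comes out at 4:1, which is above the 3:1 level the text describes as healthy.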
Some top-performing companies even have a ratio of 5:1. SaaS companies use this number to measure the health of their marketing programs so they can invest in, or divert resources to, the campaigns that work well.

Conclusion

Always remember to set healthy marketing KPIs. Reporting on these numbers is never enough. Ensure that everything you do in marketing ties back to the goals you have set for your company. Goal-driven SaaS marketing strategies always pay off and empower you and your company to be successful.

Frequently Asked Questions

What are the 5 most important metrics for SaaS companies?

The five most important metrics for SaaS companies are Unique Visitors, Churn, Customer Lifetime Value, Customer Acquisition Cost, and Lead to Customer Conversion Rate.

Why should we measure SaaS marketing metrics?

Measuring marketing metrics is critically important because they help brands determine whether campaigns are successful and provide insights to adjust future campaigns accordingly. They help marketers understand how their campaigns are driving towards their business goals, and inform decisions for optimizing their campaigns and marketing channels.

How to measure the success of your SaaS marketing?

The success of SaaS marketing can be measured by tracking the metrics that contribute most to it. Some examples of those metrics are: Unique Visitors, Churn, Customer Lifetime Value, Customer Acquisition Cost, and Lead to Customer Conversion Rate.

