Creating a Hortonworks Big Data Pipeline at the Speed of Talend

| August 5, 2016

All data, whether big, small, dark, structured, or unstructured, must be ingested, cleansed, and transformed before insights can be gleaned; this is a basic tenet of the analytics process model.
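To make the ingest, cleanse, and transform steps concrete, here is a minimal PySpark sketch of that flow. It is only an illustration under assumed inputs: the HDFS paths and column names are hypothetical, and the Hortonworks and Talend tooling the article discusses is not shown here.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest_cleanse_transform").getOrCreate()

# Ingest: read raw records from the data lake (path is an illustrative assumption)
raw = spark.read.option("header", True).csv("hdfs:///raw/events/")

# Cleanse: drop duplicates, remove rows missing key fields, cast types
clean = (
    raw.dropDuplicates()
       .na.drop(subset=["customer_id", "amount"])
       .withColumn("amount", F.col("amount").cast("double"))
)

# Transform: aggregate into an analysis-ready table and persist it
summary = clean.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
summary.write.mode("overwrite").parquet("hdfs:///analytics/customer_totals/")
```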

Spotlight

HG Data

HG Data uses advanced data science to provide B2B companies a better way to analyze markets and target prospects to achieve remarkable results in their marketing and sales programs. We offer the most comprehensive technographics in the industry, indexing billions of unstructured documents each day – including B2B social media, case studies, press releases, blog postings, government documents, content libraries, technical support forums, website source code, job postings and much more – to produce a detailed census of the technologies companies use to run their business. This is powerful information you can use to out-market, out-sell and outgrow your competition.

OTHER ARTICLES

Cybersecurity Strategies to Make IT Networks More Resilient to Cyberattacks

Article | February 28, 2020

The increasing use of advanced technologies and the internet has created a large attack surface for malicious actors. With these developments, businesses' IT systems are more vulnerable than ever, which has led them to adopt innovative cybersecurity strategies that can thwart attacks and make their networks more resilient. Cybercriminals can launch a variety of attacks against individuals or businesses, such as accessing, changing, or deleting sensitive data; extorting payment; or interfering with business processes. These kinds of attacks present an evolving danger to organizations, employees, and consumers, and can cost them their reputation, their finances, and, to some extent, their personal lives. So, in order to protect IT networks from cyberattacks, it is important to be aware of the various aspects of cybersecurity.


Roles in a Data Team

Article | December 17, 2020

In this article, we'll talk about different roles in a data team and discuss their responsibilities. In particular, we will cover: the types of roles in a data team; the responsibilities of each role; and the skills and knowledge each role needs to have. This is not a comprehensive list, and the majority of what you will read in this article is my opinion, which comes out of my experience working as a data scientist. You can interpret the following information as "the description of data roles from the perspective of a data scientist". For example, my views on the role of a data engineer may be a bit simplified because I don't see all the complexities of their work firsthand. I do hope you will find this information useful nonetheless.

Roles in a Team

A typical data team consists of the following roles: product managers, data analysts, data scientists, data engineers, machine learning engineers, and site reliability engineers / MLOps engineers. All these people work to create a data product.

To explain the core responsibilities of each role, we will use a case scenario: suppose we work at an online classifieds company. It's a platform where users can go to sell things they don't need (like OLX, where I work). If a user has an iPhone they want to sell, they go to this website, create a listing, and sell their phone. On this platform, sellers sometimes have problems with identifying the correct category for the items they are selling. To help them, we want to build a service that suggests the best category. To sell their iPhone, the user creates a listing, and the site needs to automatically understand that this iPhone has to go in the "mobile phones" category. Let's start with the first role: product manager.

Product Manager

A product manager is someone responsible for developing products. Their goal is to make sure that the team is building the right thing. They are typically less technical than the rest of the team: they don't focus on the implementation aspects of a problem, but rather on the problem itself. Product managers need to ensure that the product is actually used by the end users. This is a common problem: in many companies, engineers create something that doesn't solve real problems. Therefore, the product manager is somebody who speaks to the team on behalf of the users.

The primary skills a PM needs are communication skills. For data scientists, communication is a soft skill, but for a product manager it's a hard skill: they have to have it to perform their work. Product managers also do a lot of planning: they need to understand the problem, come up with a solution, and make sure the solution is implemented in a timely manner. To accomplish this, PMs need to know what's important and plan the work accordingly. When somebody has a problem, they approach the PM with it. Then the task of the PM is to figure out whether users actually need this feature, how important it is, and whether the team has the capacity to implement it.

Let's come back to our example. Suppose somebody comes to the PM and says: "We want to build a feature to automatically suggest the category for a listing. Somebody's selling an iPhone, and we want to create a service that predicts that the item goes in the mobile phones category." Product managers need to answer these questions: "Is this feature that important to the user?" "Is it an important problem to solve in the product at all?" To answer these questions, PMs ask data analysts to help them figure out what to do next.
Data Analyst

Data analysts know how to analyze the data available in the company. They discover insights in the data and then explain their findings to others. So, analysts need to know: what kind of data the company has; how to get the data; how to interpret the results; and how to explain their findings to colleagues and management. Data analysts are also often responsible for defining key metrics and building dashboards. This includes things like showing the company's profits, displaying the number of listings, or how many contacts buyers made with sellers. Thus, data analysts should know how to calculate all the important business metrics, and how to present them in a way that is understandable to others.

When it comes to skills, data analysts should know: SQL, the main tool they work with; programming languages such as Python or R; Tableau or similar tools for building dashboards; the basics of statistics; how to run experiments; and a bit of machine learning, such as regression analysis and time series modeling.

For our example, product managers turn to data analysts to help them quantify the extent of the problem. Together with the PM, the data analyst tries to answer questions like: "How many users are affected by this problem?" "How many users don't finish creating their listing because of this problem?" "How many listings are there on the platform that don't have the right category selected?" After the analyst gets the data, analyzes it, and answers these questions, they may conclude: "Yes, this is actually a problem." Then the PM and the team discuss the report and agree: "Indeed, this problem is actually worth solving." Now the data team will go ahead and start solving this problem.

After the model for the service is created, it's necessary to understand whether the service is effective: whether this model helps people and solves the problem. For that, data analysts usually run experiments, most often A/B tests. When running an experiment, we can see whether more users successfully finish posting an item for sale, or whether fewer ads end up in the wrong category.

Data Scientist

The roles of a data scientist and a data analyst are pretty similar. In some companies, it's the same person who does both jobs. However, data scientists typically focus more on predicting than on explaining. A data analyst fetches the data, looks at it, explains what's going on to the team, and gives some recommendations on what to do about it. A data scientist, on the other hand, focuses more on creating machine learning services. For example, one of the questions that a data scientist would want to answer is "How can we use this data to build a machine learning model for predicting something?" In other words, data scientists incorporate the data into the product. Their focus is more on engineering than on analysis, and they work more closely with engineers on integrating data solutions into the product.

The skills of data scientists include: machine learning, the main tool for building predictive services; Python, the primary programming language; SQL, necessary to fetch the data for training their models; and Flask, Docker, and similar tools for creating simple web services that serve the models.

For our example, the data scientists are the people who develop the model used for predicting the category. Once they have a model, they can develop a simple web service for hosting this model, as sketched below.
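To give a feel for what such a model and service can look like, here is a minimal, hypothetical sketch: a TF-IDF plus logistic regression classifier for listing categories, wrapped in a tiny Flask endpoint. The listing titles, category labels, route name, and payload format are all made up for illustration; they are not taken from the article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; a real marketplace would train on millions of listings
titles = [
    "iPhone 12 64GB black, lightly used",
    "Samsung Galaxy S21, perfect condition",
    "Wooden dining table with four chairs",
    "IKEA sofa, grey, two years old",
    "Mountain bike 29 inch, disc brakes",
    "Kids bicycle with training wheels",
]
categories = ["mobile phones", "mobile phones", "furniture", "furniture", "bikes", "bikes"]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(titles, categories)
print(model.predict(["selling my old iphone 8"]))  # -> ['mobile phones']

# A minimal Flask wrapper around the model (endpoint name and payload are assumptions)
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/suggest-category", methods=["POST"])
def suggest_category():
    title = request.json.get("title", "")
    return jsonify({"category": model.predict([title])[0]})

# app.run(port=5000)  # uncomment to serve the model locally
```

In practice, the data scientist would train on historical listings prepared by the data engineers and evaluate the model offline before the analysts run an A/B test on it.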
Data Engineers

Data engineers do all the heavy lifting when it comes to data. A lot of work needs to happen before data analysts can go to a database, fetch the data, perform their analysis, and come up with a report. This is precisely the focus of data engineers: they make sure this is possible. Their responsibility is to prepare all the necessary data in a form that is consumable by their colleagues. To accomplish this, data engineers create a "data lake": all the data that users generate needs to be captured properly and saved in a separate database. This way, analysts can run their analysis, and data scientists can use this data for training models.

Another thing data engineers often need to do, especially at larger companies, is to ensure that the people who look at the data have the necessary clearance to do so. Some user data is sensitive, and people can't just go looking at personal information (such as emails or phone numbers) unless they have a really good reason to do so. Therefore, data engineers need to set up a system that doesn't let people access all the data at once.

The skills needed for data engineers usually include: AWS or Google Cloud, the popular cloud providers; Kubernetes and Terraform, infrastructure tools; Kafka or RabbitMQ, tools for capturing and processing the data; databases, to save the data in such a way that it's accessible to data analysts; and Airflow or Luigi, data orchestration tools for building complex data pipelines.

In our example, a data engineer prepares all the required data. First, they make sure the analyst has the data to perform the analysis. Then they also work with the data scientist to prepare the information needed for training the model: the title of the listing, its description, the category, and so on. A sketch of such a pipeline follows.
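Below is a minimal, hypothetical Airflow DAG for that kind of preparation work: pull the raw listing events, clean them, and publish a table that analysts and data scientists can query. The DAG name, schedule, and placeholder task functions are assumptions made for illustration; the article does not prescribe a specific pipeline.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_listings(**context):
    # e.g. read yesterday's listing events from Kafka or a raw storage bucket
    ...

def clean_listings(**context):
    # e.g. drop duplicates, mask personal data, normalize category labels
    ...

def load_to_warehouse(**context):
    # e.g. write the cleaned records to the analytics warehouse
    ...

with DAG(
    dag_id="listings_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_listings)
    clean = PythonOperator(task_id="clean", python_callable=clean_listings)
    load = PythonOperator(task_id="load", python_callable=load_to_warehouse)

    extract >> clean >> load
```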
A data engineer isn't the only type of engineer that a data team has. There are also machine learning engineers.

Machine Learning Engineer

Machine learning engineers take whatever data scientists build and help them scale it up. They also ensure that the service is maintainable and that the team follows the best engineering practices. Their focus is more on engineering than on modeling. The skills ML engineers have are similar to those of data engineers: AWS or Google Cloud; infrastructure tools like Kubernetes and Terraform; Python and other programming languages; and Flask, Docker, and other tools for creating web services. Additionally, ML engineers work closely with more "traditional" engineers, like backend, frontend, or mobile engineers, to ensure that the services from the data team are included in the final product.

For our example, ML engineers work together with data scientists on productionizing the category suggestion service. They make sure it's stable once it's rolled out to all the users, and they must also ensure that it's maintainable and that it's possible to make changes to the service in the future. There's another kind of engineer that can be pretty important in a data team: the site reliability engineer.

DevOps / Site Reliability Engineer

The role of SREs is similar to that of the ML engineer, but the focus is more on the availability and reliability of the services. SREs aren't strictly limited to working with data. Their role is more general: they tend to focus less on business logic and more on infrastructure, which includes things like networking and provisioning. Therefore, SREs look after the servers where the services are running and take care of collecting all the operational metrics: CPU usage, requests per second, the services' processes, and so on. As the name suggests, site reliability engineers have to make sure that everything runs reliably. They set up alerts and are constantly on call to make sure that the services are up and running without any interruptions. If something breaks, SREs quickly diagnose the problem and fix it, or involve an engineer to help find the solution.

The skills needed for site reliability engineers include: cloud infrastructure tools; programming languages like Python; Unix/Linux; networking; and DevOps best practices like automation, CI/CD, and the like. Of course, ML engineers and data engineers should also know these best practices, but the focus of DevOps engineers/SREs is to establish them and make sure that they are followed. There is a special type of DevOps engineer, called an "MLOps engineer".

MLOps Engineer

An MLOps engineer is a DevOps engineer who also knows the basics of machine learning. Similar to an SRE, the responsibility of an MLOps engineer is to make sure that the services developed by data scientists, ML engineers, and data engineers are up and running all the time. MLOps engineers know the lifecycle of a machine learning model: the training phase, the serving phase, and so on. Despite having this knowledge, MLOps engineers are still focused more on operational support than on anything else. This means that they need to know and follow all the DevOps practices and make sure that the rest of the team is following them as well. They accomplish this by setting up things like continuous retraining and CI/CD pipelines. Even though everyone in the team has a different focus, they all work together on achieving the same goal: solving the problems of the users.

Summary

To summarize, the roles in a data team and their responsibilities are:

Product managers: make sure that the team is building the right thing, act as a gateway for all the requests, and speak on behalf of the users.
Data analysts: analyze data, define key metrics, and create dashboards.
Data scientists: build models and incorporate them into the product.
Data engineers: prepare the data for analysts and data scientists.
ML engineers: productionize machine learning services and establish the best engineering practices.
Site reliability engineers: focus on availability and reliability and enforce the best DevOps practices.

This list is not comprehensive, but it should be a good starting point if you are just getting into the industry, or if you just want to know how the lines between the different roles are drawn in practice.


How can machine learning detect money laundering?

Article | December 16, 2020

In this article, we will explore different techniques to detect money laundering activities. Despite many potential applications within the financial services sector, and specifically within Anti-Money Laundering (AML), the adoption of Artificial Intelligence and Machine Learning (ML) has been relatively slow.

What is money laundering, and what is anti-money laundering?

Money laundering is when someone unlawfully obtains money and moves it around to cover up their crimes. Anti-money laundering can be characterized as any activity that prevents, or aims to prevent, money laundering from occurring. The United Nations estimates that money-laundering transactions in a single year amount to 2-5% of worldwide GDP, or $800 billion to $3 trillion USD. In 2019, regulators and governmental offices imposed fines of more than $8.14 billion. Even with these staggering numbers, estimates are that only about 1% of illicit global financial flows are ever seized by the authorities. AML activities in banks consume an enormous amount of manpower, resources, and capital to manage the process and comply with regulations.

What are the costs and penalties of money laundering?

In 2019, Celent estimated that spending reached $8.3 billion for technology and $23.4 billion for operations, respectively. This investment is directed toward ensuring anti-money laundering compliance. As we have often seen, reputational costs can also carry a hefty price. In 2012, HSBC was implicated in the laundering of an estimated £5.57 billion over at least seven years.

What is the current situation of banks applying ML to stop money laundering?

Given the wealth of new tools banks have available, the potential reputational risk, the amount of capital involved, and the enormous costs in the form of fines and penalties, this should not be the case. A strong push by nations to curb illicit cash movement has still resulted in only a remarkably small share of money laundering being detected: a success rate of about 2% on average. In September 2019, the Dutch banks ABN Amro, Rabobank, ING, Triodos Bank, and Volksbank announced that they would work toward joint transaction monitoring to fight money laundering.

A typical challenge in transaction monitoring, for instance, is the generation of a huge number of alerts, which in turn requires operations teams to triage and process them. ML models can identify and recognize suspicious behavior and, in addition, classify alerts into classes such as critical, high, medium, or low risk. Critical or high alerts can be routed to senior experts as a high priority so the issue is investigated quickly. Another problem today is the immense number of false positives: estimates show that, on average, between 95% and 99% of the alerts generated are false positives, and this puts an extraordinary burden on banks. Investigating false positives is tedious and costs money; a recent report found that banks were spending nearly €3.01 billion every year investigating false positives. Institutions are looking for increasingly efficient ways to deal with financial crime, and in this context machine learning can prove to be a significant tool. As financial activity grows, the enormous volume and speed of financial transactions require an effective monitoring framework that can process transactions rapidly, ideally in real time.

What types of machine learning algorithms can identify money laundering transactions?
For supervised machine learning, it is essential to have historical data with events precisely labeled and input variables appropriately captured. If biases or errors are left in the data without being dealt with, they will be passed on to the model, resulting in inaccurate models. When accurately labeled historical data is not available, it is better to use unsupervised machine learning: it uncovers unknown patterns and outcomes, and it recognizes suspicious activity without prior knowledge of exactly what a money-laundering scheme looks like.

What are the different techniques to detect money laundering?

K-means sequence miner algorithm: take incoming banking transactions, then run frequent-pattern-mining algorithms over them to identify money laundering; cluster transactions and suspicious activities related to money laundering, and finally display them on a chart.

Time series Euclidean distance: a sequence-matching algorithm for money laundering detection, using sequential detection of suspicious transactions. This method exploits two references to recognize suspicious transactions: the history of each individual account and the transaction data exchanged with other accounts.

Bayesian networks: build a model of the user's previous activities; this model serves as a baseline for future customer activity, and transactions that deviate strongly from it can be flagged as suspicious.

Cluster-based local outlier factor algorithm: money laundering detection using a combination of clustering techniques and outlier detection (a brief code sketch of this outlier-based approach follows after the conclusion).

Conclusion

For banks, now is the ideal time to deploy ML models into their ecosystem. Despite this opportunity, increased knowledge and the growing number of ML implementations have prompted a discussion about the feasibility of these solutions and the degree to which ML should be trusted and potentially replace human analysis and decision-making. In order to further exploit and achieve the promise of ML, banks need to continue to expand their awareness of ML's strengths, risks, and limitations and, most critically, to create an ethical framework by which the production and use of ML can be governed and the feasibility and effect of these emerging models proven and eventually trusted.
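As a concrete illustration of the unsupervised, outlier-based approach listed above, here is a minimal sketch using scikit-learn's LocalOutlierFactor on synthetic transaction features. The feature set, synthetic data, and contamination rate are illustrative assumptions rather than anything prescribed by the article; a real AML system would use far richer features and keep human analysts in the loop.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical per-account features: average amount, transactions per day, share sent abroad
normal = rng.normal(loc=[200.0, 3.0, 0.05], scale=[80.0, 1.0, 0.03], size=(1000, 3))
suspicious = rng.normal(loc=[9000.0, 40.0, 0.9], scale=[2000.0, 10.0, 0.05], size=(10, 3))
X = np.vstack([normal, suspicious])

X_scaled = StandardScaler().fit_transform(X)

# LOF flags points whose local density is much lower than that of their neighbours
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X_scaled)      # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_  # higher score = more anomalous

flagged = np.where(labels == -1)[0]
print(f"{len(flagged)} accounts flagged for review")
for idx in flagged[:5]:
    print(idx, np.round(X[idx], 2), round(float(scores[idx]), 2))
```

In practice, the flagged accounts would feed into the alert triage workflow described above, with the outlier score used to rank critical versus low-risk alerts.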


Natural Language Desiderata: Understanding, explaining and interpreting a model.

Article | May 3, 2021

Clear conceptualization, taxonomies, categories, criteria, and properties when solving complex, real-life, contextualized problems are non-negotiable: a "must" to unveil the hidden potential of NLP and its impact on the transparency of a model. It is common knowledge that many authors and researchers in the field of natural language processing (NLP) and machine learning (ML) are prone to use explainability and interpretability interchangeably, which from the start constitutes a fallacy. They do not mean the same thing, even when looking for a definition from different perspectives.

A formal definition of what explanation, explainable, and explainability mean can be traced to social science, psychology, hermeneutics, philosophy, physics, and biology. In The Nature of Explanation, Craik (1967:7) states that "explanations are not purely subjective things; they win general approval or have to be withdrawn in the face of evidence or criticism." Moreover, the power of explanation means the power of insight and anticipation, and asking why one explanation is satisfactory involves the prior question of why any explanation at all should be satisfactory, or, in machine learning terminology, how a model is performant in different contextual situations. Beyond their utilitarian value, and beyond the impulse to resolve a problem whether or not (in the end) there is a practical application that will be verified or disproved in the course of time, explanations should be "meaningful". We come across explanations every day; perhaps the most common are reason-giving ones.

Before advancing in the realm of ExNLP, it is crucial to conceptualize what constitutes an explanation. Miller (2017) considered explanations to be "social interactions between the explainer and explainee"; therefore, the social context has a significant impact on the actual content of an explanation. Explanations, in general terms, seek to answer the "why" type of question: there is a need for justification. According to Bengtsson (2003), "we will accept an explanation when we feel satisfied that the explanans reaches what we already hold to be true of the explanandum" (the explanandum being a statement that describes the phenomenon to be explained, a description rather than the phenomenon itself, and the explanans being at least two sets of statements used for the purpose of elucidating the phenomenon).

In discourse theory (my approach), it is important to highlight, first and foremost, that there is a correlation between understanding and explanation. Both are articulated, although they belong to different paradigmatic fields. This dichotomous pair is perceived as a duality, which represents an irreducible form of intelligibility. When there are observable external facts subject to empirical validation, systematicity, and subordination to hypothetic procedures, then we can say that we explain. An explanation is inscribed in the analytical domain, the realm of rules, laws, and structures. When we explain, we display propositions and meaning. But we do not explain in a vacuum: the contextual situation permeates the content of an explanation. In other words, explanation is an epistemic activity: it can only relate things described or conceptualized in a certain way. Most authors agree that explanations are answers to questions of the form "why fact?". Understanding, in turn, can mean a number of things in different contexts.
According to Ricoeur, "understanding precedes, accompanies and swathes an explanation, and an explanation analytically develops understanding." Following this line of thought, when we understand, we grasp or perceive the chain of partial senses as a whole in a single act of synthesis. Originally belonging to the field of the so-called human sciences, understanding refers to a circular process, and it is directed at the intentional unity of a discourse, whereas an explanation is oriented toward the analytical structure of a discourse.

Now, to ground any discussion of what interpretation is, it is crucial to highlight that the concept of interpretation opposes the concept of explanation. They cannot be used interchangeably. If considered as a unit, they compose what is called une combinaison éprouvée (a contrasted dichotomy). Besides, in dissecting both definitions we will see that the agent that performs the explanation differs from the one that produces the interpretation. At present, there is the challenge of defining, and evaluating, what constitutes a quality interpretation. Linguistically speaking, "interpretation" is the complete process that encompasses understanding and explanation. It is true that there is more than one way to interpret an explanation (and, by extension, an explanation of a prediction), but it is also true that there is a limited number of possible explanations, if not a unique one, since they are contextualized. It is also true that an interpretation must not only be plausible, but more plausible than another interpretation. Of course, there are certain criteria to resolve this conflict, and proving that an interpretation is more plausible, based on an explanation or on knowledge, relates to the logic of validation rather than to the logic of subjective probability.

Narrowing it down

How are these concepts transferred from theory to praxis? What is the importance of the "interpretability" of an explainable model? What do we call a "good" explainable model? What constitutes a "good explanation"? These are some of the many questions that researchers from both academia and industry are still trying to answer. In the realm of machine learning, current approaches conceptualize interpretation in a rather ad hoc manner, motivated by practical use cases and applications. Some suggest model interpretability as a remedy, but only a few are able to articulate precisely what interpretability means or why it is important. What is more, most in the research community and industry use the term as a synonym of explainability, which it certainly is not; they are not overlapping terms. Needless to say, in most cases technical descriptions of interpretable models are diverse and occasionally discordant.

A model is better interpretable than another model if its decisions are easier for a human to comprehend than decisions from the other model (Molnar, 2021). For a model to be interpretable (interpretability being a quality of the model), the information conferred by an interpretation should be useful. Thus, one purpose of interpretations may be to convey useful information of any kind. In Molnar's words, "the higher the interpretability of a machine learning model, the easier it is for someone to comprehend why certain decisions or predictions have been made." I will make an observation here and add "the higher the interpretability of an explainable machine learning model". Luo et al.
(2021) define interpretability as "the ability [of a model] to explain or to present [its predictions] in understandable terms to a human." Notice that this definition includes "understanding" as part of it, giving the idea of completeness. Thus, the triadic closure of explanation, understanding, and interpretation is fulfilled, in which the explainer and the interpretant (the agents) belong to different instances, and where interpretation allows the extraction and formation of additional knowledge captured by the explainable model.

Now, are models inherently interpretable? It is rather a matter of selecting the method for achieving interpretability: (a) interpreting existing models via post-hoc techniques, or (b) designing inherently interpretable models, which claim to provide more faithful interpretations than post-hoc interpretation of black-box models. The difference also lies in the agency, as noted before, and in whether, in one case, interpretation may affect the explanation process, that is, the model's inner workings, or instead merely provide natural language explanations of learned representations or models.
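To make route (a) concrete, here is a minimal, hypothetical sketch of a post-hoc technique: permutation importance applied to a black-box classifier. The dataset and model are stand-ins, not NLP-specific and not drawn from the article; the point is only that the interpretation is produced after training, from outside the model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Train a "black-box" model on a placeholder dataset
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Post-hoc interpretation: how much does the test score drop when each feature is shuffled?
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

ranked = sorted(zip(data.feature_names, result.importances_mean), key=lambda p: p[1], reverse=True)
for name, importance in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```

An inherently interpretable alternative, route (b), would be something like a sparse linear model whose coefficients can be read directly; deciding which of the two interpretations is more plausible is exactly where the criteria discussed above become practical.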


