Data engineers: Key role in data privacy and protection

JIM HARRIS | January 23, 2019

As a career data geek, I enjoy watching how the growing pervasiveness and popularity of data is reshaping industries and mainstream culture. A related trend is the increasing number of jobs that include data in their title. One that's becoming almost as prevalent as data scientist these days is data engineer. Searching a few of the major job posting websites, I discovered an expected amount of variability in how the role of data engineer was defined.

Spotlight

NAVEOS

NAVEOS®, formerly DSH Mgmt Solutions, is a national health care data analytics firm headquartered in Virginia and was recently approved as KHA Solutions Group's newest Affinity Partner. NAVEOS® has developed technology processes and expert systems to enable automated workflows for hospitals and other health care providers to maximize past, present and future revenues under federal, state and other government health care programs…

OTHER ARTICLES

Topic modelling. Variation on themes and the Holy Grail

Article | September 2, 2021

Massive amounts of data are collected and stored by companies in the search for the “Holy Grail”. One crucial component is the discovery and application of novel approaches that give a more complete picture of datasets than the local (sometimes global) event-based analytic strategy that currently dominates a specific field. Bringing qualitative data to life is essential, since it provides the context and nuance behind management decisions. An NLP perspective for uncovering word-based themes across documents facilitates the exploration and exploitation of qualitative data, which is often hard to “identify” in a global setting.

NLP can be used to perform different analyses mapping drivers. Broadly speaking, drivers are factors that cause change and affect institutions, policies and management decision making. More precisely, a “driver” is a force that has a material impact on a specific activity or an entity, which is contextually dependent, and which affects the financial market at a specific time (Litterio, 2018). Major drivers often lie outside the immediate institutional environment, such as elections or regional upheavals, or are non-institutional factors such as Covid or climate change. In Total global strategy: Managing for worldwide competitive advantage, Yip (1992) develops a framework based on a set of four industry globalization drivers, which highlights the conditions for a company to become more global while also reflecting differentials in a competitive environment. In The lexicons: NLP in the design of Market Drivers Lexicon in Spanish, I have proposed a categorization into micro and macro drivers plus temporality, and a distinction among social, political, economic and technological drivers. Considering the “big picture”, “digging” beyond the usual sectors and timeframes is key to state-of-the-art findings.

Working with qualitative data. There is certainly no unique “recipe” when applying NLP strategies. Different pipelines can be used to analyse any sort of textual data, from social media posts and reviews to focus group notes, blog comments and transcripts, to name just a few, when a MetaQuant team is looking for drivers. Generally, since textual data is the source, it is preferable to avoid manual tasks on the part of the analyst, though sometimes, depending on the domain, content, cultural variables, etc., they may be required. If qualitative data is the core, then the preferred format is .csv because of its plain nature, which typically handles written responses better. Once the data has been collected and exported, the next step is pre-processing. The basics include normalisation, morphosyntactic analysis, sentence-structure analysis, tokenization, lexicalization and contextualization: simplify the data to make analysis easier.

Topic modelling. Topic modelling refers to the task of recognizing the words from the main topics that best describe a document or the corpus of data. LDA (Latent Dirichlet Allocation) is one of the most powerful algorithms, with excellent implementations in Python's Gensim package. The challenge is how to extract good-quality topics that are clear and meaningful. This depends mostly on the nature of the text pre-processing and on the strategy for finding the optimal number of topics, the creation of the lexicon(s) and the corpora. We can say that a topic is defined or construed around its most representative keywords. But are keywords enough? There are some other factors to be observed, such as:
1. The variety of topics included in the corpora.
2. The choice of topic modelling algorithm.
3. The number of topics fed to the algorithm.
4. The algorithm's tuning parameters.
As you have probably noticed, finding “the needle in the haystack” is not that easy, and only those who can use NLP creatively will have the advantage of positioning themselves for global success.
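
The article points to LDA in Python's Gensim package as the workhorse for this step. As a minimal sketch, assuming a handful of already-collected texts (the sample documents, stop-word list and number of topics below are purely illustrative, not taken from the article), the pre-processing, lexicon, corpus and model steps could look like this:

```python
# Minimal topic-modelling sketch with Gensim's LDA implementation.
# The sample documents, stop words and num_topics are illustrative only.
from gensim import corpora
from gensim.models import LdaModel
from gensim.utils import simple_preprocess

documents = [
    "Elections and regional upheavals drive market volatility",
    "Climate change policy affects institutional decision making",
    "Customer reviews mention price, service and delivery times",
    "The central bank decision moved the financial market today",
]

stop_words = {"and", "the", "a", "of", "to", "in", "on", "is", "are"}

# Basic pre-processing: lowercase, tokenize, drop stop words.
texts = [
    [tok for tok in simple_preprocess(doc) if tok not in stop_words]
    for doc in documents
]

# Build the lexicon (dictionary) and the bag-of-words corpus.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit LDA; the number of topics is one of the tuning parameters listed above.
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    passes=10,
    random_state=42,
)

# Inspect the most representative keywords per topic.
for topic_id, keywords in lda.print_topics(num_words=5):
    print(topic_id, keywords)
```

Sweeping num_topics and comparing topic coherence (Gensim also ships a CoherenceModel helper for this) is one common way to tune the parameters the article lists.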

Read More

How Data Analytics in The Hospitality Industry Can be Helpful?

Article | September 2, 2021

In recent years, more industries have adopted data analytics as they realize how important it is, and the hotel industry is no exception. The hospitality industry is data-rich, and the key to maintaining a competitive advantage has come down to how hotels manage and analyze that data. With the changes taking place in the hospitality industry, data analysis can help you gain meaningful insights that can redefine the way hotels conduct business.

Read More
BIG DATA MANAGEMENT

Roles in a Data Team

Article | September 2, 2021

In this article, we’ll talk about the different roles in a data team and discuss their responsibilities. In particular, we will cover: the types of roles in a data team; the responsibilities of each role; and the skills and knowledge each role needs to have. This is not a comprehensive list, and the majority of what you will read in this article is my opinion, which comes out of my experience working as a data scientist. You can interpret the following information as “the description of data roles from the perspective of a data scientist”. For example, my views on the role of a data engineer may be a bit simplified because I don’t see all the complexities of their work firsthand. I do hope you will find this information useful nonetheless.

Roles in a Team

A typical data team consists of the following roles: product managers, data analysts, data scientists, data engineers, machine learning engineers, and site reliability engineers / MLOps engineers. All these people work to create a data product.

To explain the core responsibilities of each role, we will use a case scenario: suppose we work at an online classifieds company. It’s a platform where users can go to sell things they don’t need (like OLX, where I work). If a user has an iPhone they want to sell, they go to this website, create a listing and sell their phone. On this platform, sellers sometimes have problems with identifying the correct category for the items they are selling. To help them, we want to build a service that suggests the best category. To sell their iPhone, the user creates a listing and the site needs to automatically understand that this iPhone has to go in the “mobile phones” category. Let’s start with the first role: product manager.

Product Manager

A product manager is someone responsible for developing products. Their goal is to make sure that the team is building the right thing. They are typically less technical than the rest of the team: they don’t focus on the implementation aspects of a problem, but rather on the problem itself.

Product managers need to ensure that the product is actually used by the end users. This is a common problem: in many companies, engineers create something that doesn’t solve real problems. Therefore, the product manager is somebody who speaks to the team on behalf of the users.

The primary skills a PM needs are communication skills. For data scientists, communication is a soft skill, but for a product manager it’s a hard skill: they have to have it to perform their work. Product managers also do a lot of planning: they need to understand the problem, come up with a solution, and make sure the solution is implemented in a timely manner. To accomplish this, PMs need to know what’s important and plan the work accordingly. When somebody has a problem, they approach the PM with it. Then the task of the PM is to figure out if users actually need this feature, how important this feature is, and whether the team has the capacity to implement it.

Let’s come back to our example. Suppose somebody comes to the PM and says: “We want to build a feature to automatically suggest the category for a listing. Somebody’s selling an iPhone, and we want to create a service that predicts that the item goes in the mobile phones category.” Product managers need to answer these questions: “Is this feature that important to the user?” “Is it an important problem to solve in the product at all?” To answer them, PMs ask data analysts to help figure out what to do next.

Data Analyst

Data analysts know how to analyze the data available in the company. They discover insights in the data and then explain their findings to others. So, analysts need to know: what kind of data the company has; how to get the data; how to interpret the results; and how to explain their findings to colleagues and management.

Data analysts are also often responsible for defining key metrics and building different dashboards. This includes things like showing the company’s profits, displaying the number of listings, or how many contacts buyers made with sellers. Thus, data analysts should know how to calculate all the important business metrics, and how to present them in a way that is understandable to others.

When it comes to skills, data analysts should know: SQL, the main tool they work with; programming languages such as Python or R; Tableau or similar tools for building dashboards; the basics of statistics; how to run experiments; and a bit of machine learning, such as regression analysis and time series modeling.

For our example, product managers turn to data analysts to help them quantify the extent of the problem. Together with the PM, the data analyst tries to answer questions like: “How many users are affected by this problem?” “How many users don’t finish creating their listing because of this problem?” “How many listings are there on the platform that don’t have the right category selected?” After the analyst gets the data, analyzes it and answers these questions, they may conclude: “Yes, this is actually a problem”. Then the PM and the team discuss the report and agree: “Indeed, this problem is actually worth solving”. Now the data team will go ahead and start solving this problem.

After the model for the service is created, it’s necessary to understand whether the service is effective: whether this model helps people and solves the problem. For that, data analysts usually run experiments, typically A/B tests. When running an experiment, we can see whether more users successfully finish posting an item for sale or whether there are fewer ads that end up in the wrong category.

Data Scientist

The roles of a data scientist and data analyst are pretty similar. In some companies, the same person does both jobs. However, data scientists typically focus more on predicting rather than explaining. A data analyst fetches the data, looks at it, explains what’s going on to the team, and gives some recommendations on what to do about it. A data scientist, on the other hand, focuses more on creating machine learning services. For example, one of the questions a data scientist would want to answer is “How can we use this data to build a machine learning model for predicting something?” In other words, data scientists incorporate the data into the product. Their focus is more on engineering than analysis, and they work more closely with engineers on integrating data solutions into the product.

The skills of data scientists include: machine learning, the main tool for building predictive services; Python, the primary programming language; SQL, necessary to fetch the data for training their models; and Flask, Docker and similar tools for creating simple web services for serving the models.

For our example, the data scientists are the people who develop the model used for predicting the category. Once they have a model, they can develop a simple web service for hosting it.
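
Since the article mentions Flask and Docker as the typical tools for wrapping such a model in a simple web service, here is a minimal sketch of what the category-suggestion endpoint could look like. The model file name, the request fields and the route are hypothetical choices for illustration, not something the article prescribes:

```python
# Minimal sketch of a category-suggestion web service with Flask.
# The model artifact, request fields and response format are hypothetical.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load a previously trained model, e.g. a scikit-learn text classifier
# saved by the data scientist as category_model.pkl (hypothetical file).
with open("category_model.pkl", "rb") as f:
    model = pickle.load(f)


@app.route("/suggest-category", methods=["POST"])
def suggest_category():
    payload = request.get_json(force=True)
    title = payload.get("title", "")
    description = payload.get("description", "")

    # Assume the model pipeline accepts raw text and returns a category label.
    predicted = model.predict([f"{title} {description}"])[0]
    return jsonify({"category": predicted})


if __name__ == "__main__":
    # For local testing only; production serving would sit behind gunicorn or similar.
    app.run(host="0.0.0.0", port=8000)
```

A listing form could then POST the title and description and pre-select the returned category, which is the behavior the iPhone example above describes.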

Data Engineers

Data engineers do all the heavy lifting when it comes to data. A lot of work needs to happen before data analysts can go to a database, fetch the data, perform their analysis, and come up with a report. This is precisely the focus of data engineers: they make sure this is possible. Their responsibility is to prepare all the necessary data in a form that is consumable by their colleagues.

To accomplish this, data engineers create “a data lake”. All the data that users generate needs to be captured properly and saved in a separate database. This way, analysts can run their analysis, and data scientists can use this data for training models.

Another thing data engineers often need to do, especially at larger companies, is to ensure that the people who look at the data have the necessary clearance to do so. Some user data is sensitive, and people can’t just go looking at personal information (such as emails or phone numbers) unless they have a really good reason to do so. Therefore, data engineers need to set up a system that doesn’t let people access all the data at once.

The skills needed for data engineers usually include: AWS or Google Cloud, the popular cloud providers; Kubernetes and Terraform, infrastructure tools; Kafka or RabbitMQ, tools for capturing and processing the data; databases, to save the data in such a way that it’s accessible for data analysts; and Airflow or Luigi, data orchestration tools for building complex data pipelines.

In our example, a data engineer prepares all the required data. First, they make sure the analyst has the data to perform the analysis. Then they also work with the data scientist to prepare the information we’ll need for training the model. That includes the title of the listing, its description, the category, and so on.
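
The article names Airflow or Luigi as the orchestration tools for building these pipelines. As a rough sketch, assuming Airflow 2.x, a daily DAG that captures listing events and then builds the title/description/category training table might look like the following; the DAG id, task names and placeholder task bodies are invented for illustration:

```python
# Illustrative Airflow 2.x DAG: land raw listing events, then build a
# training table with title, description and category for the model.
# Task bodies are placeholders; in practice they would call Kafka consumers,
# warehouse loaders, transformation jobs, etc.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_listings(**context):
    # Placeholder: pull new listing events from the message bus / raw store.
    print("extracting listing events")


def build_training_table(**context):
    # Placeholder: join and clean the fields the data scientist needs.
    print("building title/description/category training table")


with DAG(
    dag_id="listing_category_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_listings",
        python_callable=extract_listings,
    )
    build = PythonOperator(
        task_id="build_training_table",
        python_callable=build_training_table,
    )

    extract >> build
```

The point of the sketch is the shape of the work, not the specific operators: captured events flow through a scheduled pipeline into tables that analysts and data scientists can consume.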

A data engineer isn’t the only type of engineer a data team has. There are also machine learning engineers.

Machine Learning Engineer

Machine learning engineers take whatever data scientists build and help them scale it up. They also ensure that the service is maintainable and that the team follows the best engineering practices. Their focus is more on engineering than on modeling.

The skills ML engineers have are similar to those of data engineers: AWS or Google Cloud; infrastructure tools like Kubernetes and Terraform; Python and other programming languages; and Flask, Docker and other tools for creating web services. Additionally, ML engineers work closely with more “traditional” engineers, like backend, frontend or mobile engineers, to ensure that the services from the data team are included in the final product.

For our example, ML engineers work together with data scientists on productionizing the category suggestion service. They make sure it’s stable once it’s rolled out to all the users. They must also ensure that it’s maintainable and that it’s possible to make changes to the service in the future.

There’s another kind of engineer that can be pretty important in a data team: site reliability engineers.

DevOps / Site Reliability Engineer

The role of SREs is similar to that of the ML engineer, but the focus is more on the availability and reliability of the services. SREs aren’t strictly limited to working with data. Their role is more general: they tend to focus less on business logic and more on infrastructure, which includes things like networking and provisioning infrastructure. Therefore, SREs look after the servers where the services are running and take care of collecting all the operational metrics, like CPU usage, how many requests per second there are, the services’ processes, and so on.

As the name suggests, site reliability engineers have to make sure that everything runs reliably. They set up alerts and are constantly on call to make sure that the services are up and running without any interruptions. If something breaks, SREs quickly diagnose the problem and fix it, or involve an engineer to help find the solution.

The skills needed for site reliability engineers include: cloud infrastructure tools; programming languages like Python; Unix/Linux; networking; and best DevOps practices like automation and CI/CD. Of course, ML engineers and data engineers should also know these best practices, but the focus of DevOps engineers/SREs is to establish them and make sure that they are followed.

There is a special type of DevOps engineer called an “MLOps engineer”.

MLOps Engineer

An MLOps engineer is a DevOps engineer who also knows the basics of machine learning. Similar to an SRE, the responsibility of an MLOps engineer is to make sure that the services developed by data scientists, ML engineers, and data engineers are up and running all the time. MLOps engineers know the lifecycle of a machine learning model: the training phase, the serving phase, and so on. Despite having this knowledge, MLOps engineers are still focused more on operational support than on anything else. This means that they need to know and follow all the DevOps practices and make sure that the rest of the team is following them as well. They accomplish this by setting up things like continuous retraining and CI/CD pipelines.

Even though everyone on the team has a different focus, they all work together on achieving the same goal: solving the problems of the users.

Summary

To summarize, the roles in a data team and their responsibilities are:
Product managers: make sure that the team is building the right thing, act as a gateway for all the requests, and speak on behalf of the users.
Data analysts: analyze data, define key metrics, and create dashboards.
Data scientists: build models and incorporate them into the product.
Data engineers: prepare the data for analysts and data scientists.
ML engineers: productionize machine learning services and establish the best engineering practices.
Site reliability engineers: focus on availability and reliability, and enforce the best DevOps practices.
This list is not comprehensive, but it should be a good starting point if you are just getting into the industry, or if you just want to know how the lines between different roles are drawn in the industry.

Read More
THEORY AND STRATEGIES

Rethinking and Recontextualizing Context(s) in Natural Language Processing

Article | September 2, 2021

We discursive creatures are construed within a meaningful, bounded communicative environment, namely context(s), and not in a vacuum. Context(s) co-occur in different scenarios, that is, in mundane talk as well as in academic discourse, where the goal of natural language communication is mutual intelligibility, hence the negotiation of meaning. Discursive research focuses on the context-sensitive use of the linguistic code and its social practice in particular settings, such as medical talk, courtroom interactions, and financial/economic and political discourse, which may restrict its validity when ascribing to a theoretical framework and its propositions regarding its application. This is also reflected in artificial intelligence approaches to context(s), such as the development of context-sensitive parsers, context-sensitive translation machines and context-sensitive information systems, where the validity of an argument and its propositions is at stake.

Context is at the heart of pragmatics, or, better said, context is the anchor of any pragmatic theory: sociopragmatics, discourse analysis and ethnomethodological conversation analysis. Academic disciplines such as linguistics, philosophy, anthropology, psychology and literary theory have also studied various aspects of the context phenomenon. Yet the concept of context has remained fuzzy or is generally undefined. It seems that the denotation of the word [context] has become murkier as its uses have been extended in many directions.

Context or/and contexts? In order to be “felicitously” integrated into the pragmatic construct, the definition of context needs some delimitation. Depending on the frame of research, context is delimited to the global surroundings of the phenomenon to be investigated: if its surroundings are of an extra-linguistic nature, it is called the socio-cultural context; if it comprises features of a speech situation, it is called the linguistic context; and if it refers to cognitive material, that is, a mental representation, it is called the cognitive context.

Context is a transcendental notion which plays a key role in interpretation. Language is no longer considered as decontextualized sentences. Instead, language is seen as embedded in larger activities, through which it becomes meaningful. In a dynamic outlook on communication, the acts of speaking (which generate a form of discourse, for instance conversational discourse, a lecture or a speech) and interpreting build contexts and at the same time constrain the building of such contexts. In Heritage’s terminology, “the production of talk is doubly contextual” (Heritage 1984: 242). An utterance relies upon the existing context for its production and interpretation, and it is, in its own right, an event that shapes a new context for the action that will follow. A linguistic context can be decontextualized at a local level, and it can be recontextualized at a global level. There is intra-discursive recontextualization anchored to local decontextualization, and there is interdiscursive recontextualization anchored to global recontextualization. “A given context not only 'legislates' the interpretation of indexical elements; indexical elements can also mold the background of the context” (Ochs, 1990). In the case of recontextualization, in a particular scenario, it is valid to ask “what do you mean” or “how do you mean”.
Making a reference to context and a reference to meaning helps to clarify matters when there is a controversy about the communicative status, and at the same time provides a frame for the recontextualization. A linguistic context is intrinsically linked to a social context and to a subcategory of the latter, the socio-cultural context. The social context can be considered as unmarked, hence a default context, whereas a socio-cultural context can be conceived as a marked type of context in which specific variables are interpreted in a particular mode. Culture provides us, the participants, with a filter mechanism which allows us to interpret a social context in accordance with particular socio-cultural context constraints and requirements. Besides, the socially constitutive qualities of context are unavoidable, since each interaction updates the existing context and prepares new ground for subsequent interaction.

Now, how are these conceptualizations and views reflected in NLP? Most of the research work has focused on the linguistic context, that is, on word-level surroundings and lexical meaning. One approach produces sense embeddings for the lexical meanings within a lexical knowledge base, embeddings which lie in a space comparable to that of contextualized word vectors. Contextualized word embeddings have been used effectively across several tasks in Natural Language Processing, as they have proved to carry useful semantic information. The task of associating a word in context with the most suitable meaning from a predefined sense inventory is better known as Word Sense Disambiguation (Navigli, 2009).

Linguistically speaking, “context encompasses the total linguistic and non-linguistic background of a text” (Crystal, 1991). Notice that the nature of context(s) is clearly crucial when reconstructing the meaning of a text. Therefore, “meaning-in-context should be regarded as a probabilistic weighting, of the list of potential meanings available to the user of the language.” The so-called disambiguating role of context should be taken with a pinch of salt. The main reason language models such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and SBERT (Reimers, 2019) have proved to be beneficial in most NLP tasks is that contextualized embeddings of words encode the semantics defined by their input context. In the same vein, a novel method for contextualized sense representations has recently been employed: SensEmBERT (Scarlini et al., 2020), which computes sense representations that can be applied directly to disambiguation.

Still, there is a long way to go regarding context(s) research. The linguistic context is just one of the necessary conditions for sentence embeddedness in “a” context. For interpretation to take place, we need well-formed sentences and well-formed constructions, that is, linguistic strings which must be grammatical but may be constrained by cognitive sentence-processability and pragmatic relevance, together with particular linguistic-context and social-context configurations that make their production and interpretation meaningful.
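
Since the passage hinges on the claim that contextualized embeddings encode the semantics defined by their input context, a small sketch can make it concrete. Assuming the Hugging Face transformers library and the standard bert-base-uncased checkpoint (neither is prescribed in the article), the same surface word receives different vectors in different linguistic contexts, which is what sense-aware methods like those cited above build on:

```python
# Sketch: the same word gets different contextualized embeddings in
# different contexts (Hugging Face transformers, bert-base-uncased).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()


def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of the first occurrence of `word`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    idx = tokens.index(word)  # assumes the word survives as a single subtoken
    return hidden[idx]


v_finance = word_vector("She deposited the check at the bank.", "bank")
v_river = word_vector("They had a picnic on the river bank.", "bank")
v_finance2 = word_vector("The bank approved the loan yesterday.", "bank")

cos = torch.nn.functional.cosine_similarity
# The same-sense pair (finance/finance) should score higher than the
# cross-sense pair (finance/river).
print("finance vs finance:", cos(v_finance, v_finance2, dim=0).item())
print("finance vs river:  ", cos(v_finance, v_river, dim=0).item())
```

With a typical BERT checkpoint the same-sense pair usually scores noticeably higher than the cross-sense pair, which is the property that Word Sense Disambiguation systems exploit when mapping a word in context to a sense inventory.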

Read More
