Article | April 16, 2021
There are many articles explaining advanced methods on AI, Machine Learning or Reinforcement Learning. Yet, when it comes to real life, data scientists often have to deal with smaller, operational tasks, that are not necessarily at the edge of science, such as building simple SQL queries to generate lists of email addresses to target for CRM campaigns. In theory, these tasks should be assigned to someone more suited, such as Business Analysts or Data Analysts, but it is not always the case that the company has people dedicated specifically to those tasks, especially if it’s a smaller structure.
In some cases, these activities might consume so much of our time that we don’t have much left for the stuff that matters, and might end up doing a less than optimal work in both. That said, how should we deal with those tasks? In one hand, not only we usually don’t like doing operational tasks, but they are also a bad use of an expensive professional. On the other hand, someone has to do them, and not everyone has the necessary SQL knowledge for it. Let’s see some ways in which you can deal with them in order to optimize your team’s time.
The first and most obvious way of doing less operational tasks is by simply refusing to do them. I know it sounds harsh, and it might be impractical depending on your company and its hierarchy, but it’s worth trying it in some cases. By “refusing”, I mean questioning if that task is really necessary, and trying to find best ways of doing it. Let’s say that every month you have to prepare 3 different reports, for different areas, that contain similar information. You have managed to automate the SQL queries, but you still have to double check the results and eventually add/remove some information upon the user’s request or change something in the charts layout. In this example, you could see if all of the 3 different reports are necessary, or if you could adapt them so they become one report that you send to the 3 different users. Anyways, think of ways through which you can reduce the necessary time for those tasks or, ideally, stop performing them at all.
Sometimes it can pay to take the time to empower your users to perform some of those tasks themselves. If there is a specific team that demands most of the operational tasks, try encouraging them to use no-code tools, putting it in a way that they fell they will be more autonomous. You can either use already existing solutions or develop them in-house (this could be a great learning opportunity to develop your data scientists’ app-building skills).
If you notice it’s a task that you can’t get rid of and can’t delegate, then try to automate it as much as possible. For reports, try to migrate them to a data visualization tool such as Tableau or Google Data Studio and synchronize them with your database. If it’s related to ad hoc requests, try to make your SQL queries as flexible as possible, with variable dates and names, so that you don’t have to re-write them every time.
Especially when you are a manager, you have to prioritize, so you and your team don’t get drowned in the endless operational tasks. In order to do this, set aside one or two days in your week which you will assign to that kind of work, and don’t look at it in the remaining 3–4 days. To achieve this, you will have to adapt your workload by following the previous steps and also manage expectations by taking this smaller amount of work hours when setting deadlines. This also means explaining the paradigm shift to your internal clients, so they can adapt to these new deadlines. This step might require some internal politics, negotiating with your superiors and with other departments.
Once you have mapped all your operational activities, you start by eliminating as much as possible from your pipeline, first by getting rid of unnecessary activities for good, then by delegating them to the teams that request them. Then, whatever is left for you to do, you automate and organize, to make sure you are making time for the relevant work your team has to do. This way you make sure expensive employees’ time is being well spent, maximizing company’s profit.
Article | July 19, 2021
In an era of big data, data health has become a pressing issue when more and more data is being stored and processed. Therefore, preserving the integrity of the collected data is becoming increasingly necessary. Understanding the fundamentals of data integrity and how it works is the first step in safeguarding the data.
Data integrity is essential for the smooth running of a company. If a company’s data is altered, deleted, or changed, and if there is no way of knowing how it can have significant impact on any data-driven business decisions.
Data integrity is the reliability and trustworthiness of data throughout its lifecycle. It is the overall accuracy, completeness, and consistency of data. It can be indicated by lack of alteration between two updates of a data record, which means data is unchanged or intact. Data integrity refers to the safety of data regarding regulatory compliance- like GDPR compliance- and security. A collection of processes, rules, and standards implemented during the design phase maintains the safety and security of data.
The information stored in the database will remain secure, complete, and reliable no matter how long it’s been stored; that’s when you know that the integrity of data is safe. A data integrity framework also ensures that no outside forces are harming this data.
This term of data integrity may refer to either the state or a process. As a state, the data integrity framework defines a data set that is valid and accurate. Whereas as a process, it describes measures used to ensure validity and accuracy of data set or all data contained in a database or a construct.
Data integrity can be enforced at both physical and logical levels. Let us understand the fundamentals of data integrity in detail:
Types of Data Integrity
There are two types of data integrity: physical and logical. They are collections of processes and methods that enforce data integrity in both hierarchical and relational databases.
Physical integrity protects the wholeness and accuracy of that data as it’s stored and retrieved. It refers to the process of storage and collection of data most accurately while maintaining the accuracy and reliability of data. The physical level of data integrity includes protecting data against different external forces like power cuts, data breaches, unexpected catastrophes, human-caused damages, and more.
Logical integrity keeps the data unchanged as it’s used in different ways in a relational database. Logical integrity checks data accuracy in a particular context. The logical integrity is compromised when errors from a human operator happen while entering data manually into the database. Other causes for compromised integrity of data include bugs, malware, and transferring data from one site within the database to another in the absence of some fields.
There are four types of logical integrity:
A database has columns, rows, and tables. These elements need to be as numerous as required for the data to be accurate, but no more than necessary. Entity integrity relies on the primary key, the unique values that identify pieces of data, making sure the data is listed just once and not more to avoid a null field in the table. The feature of relational systems that store data in tables can be linked and utilized in different ways.
Referential integrity means a series of processes that ensure storage and uniform use of data. The database structure has rules embedded into them about the usage of foreign keys and ensures only proper changes, additions, or deletions of data occur. These rules can include limitations eliminating duplicate data entry, accurate data guarantee, and disallowance of data entry that doesn’t apply. Foreign keys relate data that can be shared or null. For example, let’s take a data integrity example, employees that share the same work or work in the same department.
Domain Integrity can be defined as a collection of processes ensuring the accuracy of each piece of data in a domain. A domain is a set of acceptable values a column is allowed to contain. It includes constraints that limit the format, type, and amount of data entered. In domain integrity, all values and categories are set. All categories and values in a database are set, including the nulls.
This type of logical integrity involves the user's constraints and rules to fit their specific requirements. The data isn’t always secure with entity, referential, or domain integrity. For example, if an employer creates a column to input corrective actions of the employees, this data would fall under user-defined integrity.
Difference between Data Integrity and Data Security
Often, the terms data security and data integrity get muddled and are used interchangeably. As a result, the term is incorrectly substituted for data integrity, but each term has a significant meaning.
Data integrity and data security play an essential role in the success of each other. Data security means protecting data against unauthorized access or breach and is necessary to ensure data integrity.
Data integrity is the result of successful data security. However, the term only refers to the validity and accuracy of data rather than the actual act of protecting data. Data security is one of the many ways to maintain data integrity. Data security focuses on reducing the risk of leaking intellectual property, business documents, healthcare data, emails, trade secrets, and more. Some facets of data security tactics include permissions management, data classification, identity, access management, threat detection, and security analytics.
For modern enterprises, data integrity is necessary for accurate and efficient business processes and to make well-intentioned decisions. Data integrity is critical yet manageable for organizations today by backup and replication processes, database integrity constraints, validation processes, and other system protocols through varied data protection methods.
Threats to Data Integrity
Data integrity can be compromised by human error or any malicious acts. Accidental data alteration during the transfer from one device to another can be compromised. There is an assortment of factors that can affect the integrity of the data stored in databases. Following are a few of the examples:
Data integrity is put in jeopardy when individuals enter information incorrectly, duplicate, or delete data, don’t follow the correct protocols, or make mistakes in implementing procedures to protect data.
A transfer error occurs when data is incorrectly transferred from one location in a database to another. This error also happens when a piece of data is present in the destination table but not in the source table in a relational database.
Bugs and Viruses
Data can be stolen, altered, or deleted by spyware, malware, or any viruses.
Hardware gets compromised when a computer crashes, a server gets down, or problems with any computer malfunctions. Data can be rendered incorrectly or incompletely, limit, or eliminate data access when hardware gets compromised.
Preserving Data Integrity
Companies make decisions based on data. If that data is compromised or incorrect, it could harm that company to a great extent. They routinely make data-driven business decisions, and without data integrity, those decisions can have a significant impact on the company’s goals.
The threats mentioned above highlight a part of data security that can help preserve data integrity. Minimize the risk to your organization by using the following checklist:
Require an input validation when your data set is supplied by a known or an unknown source (an end-user, another application, a malicious user, or any number of other sources). The data should be validated and verified to ensure the correct input.
Verifying data processes haven’t been corrupted is highly critical. Identify key specifications and attributes that are necessary for your organization before you validate the data.
Eliminate Duplicate Data
Sensitive data from a secure database can easily be found on a document, spreadsheet, email, or shared folders where employees can see it without proper access. Therefore, it is sensible to clean up stray data and remove duplicates.
Data backups are a critical process in addition to removing duplicates and ensuring data security. Permanent loss of data can be avoided by backing up all necessary information, and it goes a long way. Back up the data as much as possible as it is critical as organizations may get attacked by ransomware.
Another vital data security practice is access control. Individuals in an organization with any wrong intent can harm the data. Implement a model where users who need access can get access is also a successful form of access control. Sensitive servers should be isolated and bolted to the floor, with individuals with an access key are allowed to use them.
Keep an Audit Trail
In case of a data breach, an audit trail will help you track down your source. In addition, it serves as breadcrumbs to locate and pinpoint the individual and origin of the breach.
Data collection was difficult not too long ago. It is no longer an issue these days. With the amount of data being collected these days, we must maintain the integrity of the data. Organizations can thus make data-driven decisions confidently and take the company ahead in a proper direction.
Frequently Asked Questions
What are integrity rules?
Precise data integrity rules are short statements about constraints that need to be applied or actions that need to be taken on the data when entering the data resource or while in the data resource. For example, precise data integrity rules do not state or enforce accuracy, precision, scale, or resolution.
What is a data integrity example?
Data integrity is the overall accuracy, completeness, and consistency of data. A few examples where data integrity is compromised are:
• When a user tries to enter a date outside an acceptable range
• When a user tries to enter a phone number in the wrong format
• When a bug in an application attempts to delete the wrong record
What are the principles of data integrity?
The principles of data integrity are attributable, legible, contemporaneous, original, and accurate. These simple principles need to be part of a data life cycle, GDP, and data integrity initiatives.
"name": "What are integrity rules?",
"text": "Precise data integrity rules are short statements about constraints that need to be applied or actions that need to be taken on the data when entering the data resource or while in the data resource. For example, precise data integrity rules do not state or enforce accuracy, precision, scale, or resolution."
"name": "What is a data integrity example?",
"text": "Data integrity is the overall accuracy, completeness, and consistency of data. A few examples where data integrity is compromised are:
When a user tries to enter a date outside an acceptable range
When a user tries to enter a phone number in the wrong format
When a bug in an application attempts to delete the wrong record"
"name": "What are the principles of data integrity?",
"text": "The principles of data integrity are attributable, legible, contemporaneous, original, and accurate. These simple principles need to be part of a data life cycle, GDP, and data integrity initiatives."
Article | December 21, 2020
Machine Learning (ML) has taken strides over the past few years, establishing its place in data analytics. In particular, ML has become a cornerstone in data science, alongside data wrangling, and data visualization, among other facets of the field. Yet, we observe many organizations still hesitant when allocating a budget for it in their data pipelines. The data engineer role seems to attract lots of attention, but few companies leverage the machine learning expert/engineer. Could it be that ML can add value to other enterprises too? Let's find out by clarifying certain concepts.
What Machine Learning is
So that we are all on the same page, let's look at a down-to-earth definition of ML that you can include in a company meeting, a report, or even within an email to a colleague who isn't in this field. Investopedia defines ML as "the concept that a computer program can learn and adapt to new data without human intervention." In other words, if your machine (be it a computer, a smartphone, or even a smart device) can learn on its own, using some specialized software, then it's under the ML umbrella. It's important to note that ML is also a stand-alone field of research, predating most AI systems, even if the two are linked, as we'll see later on.
How Machine Learning is different from Statistics
It's also important to note that ML is different from Statistics, even if some people like to view the former as an extension of the latter. However, there is a fundamental difference that most people aren't aware of yet. Namely, ML is data-driven while Statistics is, for the most part, model-driven. This statement means that most Stats-based inferences are made by assuming a particular distribution in the data, or the interactions of different variables, and making predictions based on our mathematical models of these distributions. ML may employ distributions in some niche cases, but for the most part, it looks at data as-is, without making any assumptions about it.
Machine Learning’s role in data science work
Let’s now get to the crux of the matter and explore how ML can be a significant value-add to a data science pipeline. First of all, ML can potentially offer better predictions than most Stats models in terms of accuracy, F1 score, etc. Also, ML can work alongside existing models to form model ensembles that can tackle the problems more effectively. Additionally, if transparency is important to the project stakeholders, there are ML-based options for offering some insight as to what variables are important in the data at hand, for making predictions based on it. Moreover, ML is more parametrized, meaning that you can tweak an ML model more, adapting it to the data you have and ensuring more robustness (i.e., reliability). Finally, you can learn ML without needing a Math degree or any other formal training. The latter, however, may prove useful, if you wish to delve deeper into the topic and develop your own models. This innovation potential is a significant aspect of ML since it's not as easy to develop new models in Stats (unless you are an experienced Statistics researcher) or even in AI. Besides, there are a bunch of various "heuristics" that are part of the ML group of algorithms, facilitating your data science work, regardless of what predictive model you end up using.
Machine Learning and AI
Many people conflate ML with AI these days. This confusion is partly because many ML models involve artificial neural networks (ANNs) which are the most modern manifestation of AI. Also, many AI systems are employed in ML tasks, so they are referred to as ML systems since AI can be a bit generic as a term. However, not all ML algorithms are AI-related, nor are all AI algorithms under the ML umbrella. This distinction is of import because certain limitations of AI systems (e.g., the need for lots and lots of data) don't apply to most ML models, while AI systems tend to be more time-consuming and resource-heavy than the average ML one. There are several ML algorithms you can use without breaking the bank and derive value from your data through them. Then, if you find that you need something better, in terms of accuracy, you can explore AI-based ones. Keep in mind, however, that some ML models (e.g., Decision Trees, Random Forests, etc.) offer some transparency, while the vast majority of AI ones are black boxes.
Learning more about the topic
Naturally, it's hard to do this topic justice in a single article. It is so vast that someone can write a book on it! That's what I've done earlier this year, through the Technics Publications publishing house. You can learn more about this topic via this book, which is titled Julia for Machine Learning(Julia is a modern programming language used in data science, among other fields, and it's popular among various technical professionals). Feel free to check it out and explore how you can use ML in your work. Cheers!
Article | February 11, 2020
The latest pace of advancements in technology paves way for businesses to pay attention to digital strategy in order to drive effective digital transformation. Digital strategy focuses on leveraging technology to enhance business performance, specifying the direction where organizations can create new competitive advantages with it. Despite a lot of buzz around its advancement, digital transformation initiatives in most businesses are still in its infancy.Organizations that have successfully implemented and are effectively navigating their way towards digital transformation have seen that deploying a low-code workflow automation platform makes them more efficient.