Article | July 19, 2021
In an era of big data, data health has become a pressing issue when more and more data is being stored and processed. Therefore, preserving the integrity of the collected data is becoming increasingly necessary. Understanding the fundamentals of data integrity and how it works is the first step in safeguarding the data.
Data integrity is essential for the smooth running of a company. If a company’s data is altered, deleted, or changed, and if there is no way of knowing how it can have significant impact on any data-driven business decisions.
Data integrity is the reliability and trustworthiness of data throughout its lifecycle. It is the overall accuracy, completeness, and consistency of data. It can be indicated by lack of alteration between two updates of a data record, which means data is unchanged or intact. Data integrity refers to the safety of data regarding regulatory compliance- like GDPR compliance- and security. A collection of processes, rules, and standards implemented during the design phase maintains the safety and security of data.
The information stored in the database will remain secure, complete, and reliable no matter how long it’s been stored; that’s when you know that the integrity of data is safe. A data integrity framework also ensures that no outside forces are harming this data.
This term of data integrity may refer to either the state or a process. As a state, the data integrity framework defines a data set that is valid and accurate. Whereas as a process, it describes measures used to ensure validity and accuracy of data set or all data contained in a database or a construct.
Data integrity can be enforced at both physical and logical levels. Let us understand the fundamentals of data integrity in detail:
Types of Data Integrity
There are two types of data integrity: physical and logical. They are collections of processes and methods that enforce data integrity in both hierarchical and relational databases.
Physical integrity protects the wholeness and accuracy of that data as it’s stored and retrieved. It refers to the process of storage and collection of data most accurately while maintaining the accuracy and reliability of data. The physical level of data integrity includes protecting data against different external forces like power cuts, data breaches, unexpected catastrophes, human-caused damages, and more.
Logical integrity keeps the data unchanged as it’s used in different ways in a relational database. Logical integrity checks data accuracy in a particular context. The logical integrity is compromised when errors from a human operator happen while entering data manually into the database. Other causes for compromised integrity of data include bugs, malware, and transferring data from one site within the database to another in the absence of some fields.
There are four types of logical integrity:
A database has columns, rows, and tables. These elements need to be as numerous as required for the data to be accurate, but no more than necessary. Entity integrity relies on the primary key, the unique values that identify pieces of data, making sure the data is listed just once and not more to avoid a null field in the table. The feature of relational systems that store data in tables can be linked and utilized in different ways.
Referential integrity means a series of processes that ensure storage and uniform use of data. The database structure has rules embedded into them about the usage of foreign keys and ensures only proper changes, additions, or deletions of data occur. These rules can include limitations eliminating duplicate data entry, accurate data guarantee, and disallowance of data entry that doesn’t apply. Foreign keys relate data that can be shared or null. For example, let’s take a data integrity example, employees that share the same work or work in the same department.
Domain Integrity can be defined as a collection of processes ensuring the accuracy of each piece of data in a domain. A domain is a set of acceptable values a column is allowed to contain. It includes constraints that limit the format, type, and amount of data entered. In domain integrity, all values and categories are set. All categories and values in a database are set, including the nulls.
This type of logical integrity involves the user's constraints and rules to fit their specific requirements. The data isn’t always secure with entity, referential, or domain integrity. For example, if an employer creates a column to input corrective actions of the employees, this data would fall under user-defined integrity.
Difference between Data Integrity and Data Security
Often, the terms data security and data integrity get muddled and are used interchangeably. As a result, the term is incorrectly substituted for data integrity, but each term has a significant meaning.
Data integrity and data security play an essential role in the success of each other. Data security means protecting data against unauthorized access or breach and is necessary to ensure data integrity.
Data integrity is the result of successful data security. However, the term only refers to the validity and accuracy of data rather than the actual act of protecting data. Data security is one of the many ways to maintain data integrity. Data security focuses on reducing the risk of leaking intellectual property, business documents, healthcare data, emails, trade secrets, and more. Some facets of data security tactics include permissions management, data classification, identity, access management, threat detection, and security analytics.
For modern enterprises, data integrity is necessary for accurate and efficient business processes and to make well-intentioned decisions. Data integrity is critical yet manageable for organizations today by backup and replication processes, database integrity constraints, validation processes, and other system protocols through varied data protection methods.
Threats to Data Integrity
Data integrity can be compromised by human error or any malicious acts. Accidental data alteration during the transfer from one device to another can be compromised. There is an assortment of factors that can affect the integrity of the data stored in databases. Following are a few of the examples:
Data integrity is put in jeopardy when individuals enter information incorrectly, duplicate, or delete data, don’t follow the correct protocols, or make mistakes in implementing procedures to protect data.
A transfer error occurs when data is incorrectly transferred from one location in a database to another. This error also happens when a piece of data is present in the destination table but not in the source table in a relational database.
Bugs and Viruses
Data can be stolen, altered, or deleted by spyware, malware, or any viruses.
Hardware gets compromised when a computer crashes, a server gets down, or problems with any computer malfunctions. Data can be rendered incorrectly or incompletely, limit, or eliminate data access when hardware gets compromised.
Preserving Data Integrity
Companies make decisions based on data. If that data is compromised or incorrect, it could harm that company to a great extent. They routinely make data-driven business decisions, and without data integrity, those decisions can have a significant impact on the company’s goals.
The threats mentioned above highlight a part of data security that can help preserve data integrity. Minimize the risk to your organization by using the following checklist:
Require an input validation when your data set is supplied by a known or an unknown source (an end-user, another application, a malicious user, or any number of other sources). The data should be validated and verified to ensure the correct input.
Verifying data processes haven’t been corrupted is highly critical. Identify key specifications and attributes that are necessary for your organization before you validate the data.
Eliminate Duplicate Data
Sensitive data from a secure database can easily be found on a document, spreadsheet, email, or shared folders where employees can see it without proper access. Therefore, it is sensible to clean up stray data and remove duplicates.
Data backups are a critical process in addition to removing duplicates and ensuring data security. Permanent loss of data can be avoided by backing up all necessary information, and it goes a long way. Back up the data as much as possible as it is critical as organizations may get attacked by ransomware.
Another vital data security practice is access control. Individuals in an organization with any wrong intent can harm the data. Implement a model where users who need access can get access is also a successful form of access control. Sensitive servers should be isolated and bolted to the floor, with individuals with an access key are allowed to use them.
Keep an Audit Trail
In case of a data breach, an audit trail will help you track down your source. In addition, it serves as breadcrumbs to locate and pinpoint the individual and origin of the breach.
Data collection was difficult not too long ago. It is no longer an issue these days. With the amount of data being collected these days, we must maintain the integrity of the data. Organizations can thus make data-driven decisions confidently and take the company ahead in a proper direction.
Frequently Asked Questions
What are integrity rules?
Precise data integrity rules are short statements about constraints that need to be applied or actions that need to be taken on the data when entering the data resource or while in the data resource. For example, precise data integrity rules do not state or enforce accuracy, precision, scale, or resolution.
What is a data integrity example?
Data integrity is the overall accuracy, completeness, and consistency of data. A few examples where data integrity is compromised are:
• When a user tries to enter a date outside an acceptable range
• When a user tries to enter a phone number in the wrong format
• When a bug in an application attempts to delete the wrong record
What are the principles of data integrity?
The principles of data integrity are attributable, legible, contemporaneous, original, and accurate. These simple principles need to be part of a data life cycle, GDP, and data integrity initiatives.
"name": "What are integrity rules?",
"text": "Precise data integrity rules are short statements about constraints that need to be applied or actions that need to be taken on the data when entering the data resource or while in the data resource. For example, precise data integrity rules do not state or enforce accuracy, precision, scale, or resolution."
"name": "What is a data integrity example?",
"text": "Data integrity is the overall accuracy, completeness, and consistency of data. A few examples where data integrity is compromised are:
When a user tries to enter a date outside an acceptable range
When a user tries to enter a phone number in the wrong format
When a bug in an application attempts to delete the wrong record"
"name": "What are the principles of data integrity?",
"text": "The principles of data integrity are attributable, legible, contemporaneous, original, and accurate. These simple principles need to be part of a data life cycle, GDP, and data integrity initiatives."
Article | July 19, 2021
Headquartered in London, England, BP (NYSE: BP) is a multinational oil and gas company. Operating since 1909, the organization offers its customers with fuel for transportation, energy for heat and light, lubricants to keep engines moving, and the petrochemicals products.
Business intelligence has always been a key enabler for improving decision making processes in large enterprises from early days of spreadsheet software to building enterprise data warehouses for housing large sets of enterprise data and to more recent developments of mining those datasets to unearth hidden relationships. One underlying theme throughout this evolution has been the delegation of crucial task of finding out the remarkable relationships between various objects of interest to human beings.
What BI technology has been doing, in other words, is to make it possible (and often easy too) to find the needle in the proverbial haystack if you somehow know in which sectors of the barn it is likely to be. It is a validatory as opposed to a predictory technology.
When the amount of data is huge in terms of variety, amount, and dimensionality (a.k.a. Big Data) and/or the relationship between datasets are beyond first-order linear relationships amicable to human intuition, the above strategy of relying solely on humans to make essential thinking about the datasets and utilizing machines only for crucial but dumb data infrastructure tasks becomes totally inadequate. The remedy to the problem follows directly from our characterization of it: finding ways to utilize the machines beyond menial tasks and offloading some or most of cognitive work from humans to the machines.
Does this mean all the technology and associated practices developed over the decades in BI space are not useful anymore in Big Data age? Not at all. On the contrary, they are more useful than ever: whereas in the past humans were in the driving seat and controlling the demand for the use of the datasets acquired and curated diligently, we have now machines taking up that important role and hence unleashing manifold different ways of using the data and finding out obscure, non-intuitive relationships that allude humans. Moreover, machines can bring unprecedented speed and processing scalability to the game that would be either prohibitively expensive or outright impossible to do with human workforce.
Companies have to realize both the enormous potential of using new automated, predictive analytics technologies such as machine learning and how to successfully incorporate and utilize those advanced technologies into the data analysis and processing fabric of their existing infrastructure. It is this marrying of relatively old, stable technologies of data mining, data warehousing, enterprise data models, etc. with the new automated predictive technologies that has the huge potential to unleash the benefits so often being hyped by the vested interests of new tools and applications as the answer to all data analytical problems.
To see this in the context of predictive analytics, let's consider the machine learning(ML) technology. The easiest way to understand machine learning would be to look at the simplest ML algorithm: linear regression. ML technology will build on basic interpolation idea of the regression and extend it using sophisticated mathematical techniques that would not necessarily be obvious to the causal users. For example, some ML algorithms would extend linear regression approach to model non-linear (i.e. higher order) relationships between dependent and independent variables in the dataset via clever mathematical transformations (a.k.a kernel methods) that will express those non-linear relationship in a linear form and hence suitable to be run through a linear algorithm.
Be it a simple linear algorithm or its more sophisticated kernel methods variation, ML algorithms will not have any context on the data they process. This is both a strength and weakness at the same time. Strength because the same algorithms could process a variety of different kinds of data, allowing us to leverage all the work gone through the development of those algorithms in different business contexts, weakness because since the algorithms lack any contextual understanding of the data, perennial computer science truth of garbage in, garbage out manifests itself unceremoniously here : ML models have to be fed "right" kind of data to draw out correct insights that explain the inner relationships in the data being processed.
ML technology provides an impressive set of sophisticated data analysis and modelling algorithms that could find out very intricate relationships among the datasets they process. It provides not only very sophisticated, advanced data analysis and modeling methods but also the ability to use these methods in an automated, hence massively distributed and scalable ways. Its Achilles' heel however is its heavy dependence on the data it is being fed with. Best analytic methods would be useless, as far as drawing out useful insights from them are concerned, if they are applied on the wrong kind of data. More seriously, the use of advanced analytical technology could give a false sense of confidence to their users over the analysis results those methods produce, making the whole undertaking not just useless but actually dangerous.
We can address the fundamental weakness of ML technology by deploying its advanced, raw algorithmic processing capabilities in conjunction with the existing data analytics technology whereby contextual data relationships and key domain knowledge coming from existing BI estate (data mining efforts, data warehouses, enterprise data models, business rules, etc.) are used to feed ML analytics pipeline. This approach will combine superior algorithmic processing capabilities of the new ML technology with the enterprise knowledge accumulated through BI efforts and will allow companies build on their existing data analytics investments while transitioning to use incoming advanced technologies. This, I believe, is effectively a win-win situation and will be key to the success of any company involved in data analytics efforts.
Article | July 19, 2021
In recent years, we have seen more industries adopt data analytics as they realize how important it is. Even the hotel industry is not left behind in this.
This is because the hospitality industry is data-rich. And the key to maintaining a competitive advantage has come down to ‘how hotels manage and analyze this data’.
With the changes taking place in the hospitality industry, data analysis can help you gain meaningful insights that can redefine the way hotels conduct business.
Article | July 19, 2021
Nowadays, everyone with some technical expertise and a data science bootcamp under their belt calls themselves a data scientist. Also, most managers don't know enough about the field to distinguish an actual data scientist from a make-believe one someone who calls themselves a data science professional today but may work as a cab driver next year. As data science is a very responsible field dealing with complex problems that require serious attention and work, the data scientist role has never been more significant. So, perhaps instead of arguing about which programming language or which all-in-one solution is the best one, we should focus on something more fundamental. More specifically, the thinking process of a data scientist.
The challenges of the Data Science professional
Any data science professional, regardless of his specialization, faces certain challenges in his day-to-day work. The most important of these involves decisions regarding how he goes about his work. He may have planned to use a particular model for his predictions or that model may not yield adequate performance (e.g., not high enough accuracy or too high computational cost, among other issues). What should he do then? Also, it could be that the data doesn't have a strong enough signal, and last time I checked, there wasn't a fool-proof method on any data science programming library that provided a clear-cut view on this matter. These are calls that the data scientist has to make and shoulder all the responsibility that goes with them.
Why Data Science automation often fails
Then there is the matter of automation of data science tasks. Although the idea sounds promising, it's probably the most challenging task in a data science pipeline. It's not unfeasible, but it takes a lot of work and a lot of expertise that's usually impossible to find in a single data scientist. Often, you need to combine the work of data engineers, software developers, data scientists, and even data modelers. Since most organizations don't have all that expertise or don't know how to manage it effectively, automation doesn't happen as they envision, resulting in a large part of the data science pipeline needing to be done manually.
The Data Science mindset overall
The data science mindset is the thinking process of the data scientist, the operating system of her mind. Without it, she can't do her work properly, in the large variety of circumstances she may find herself in. It's her mindset that organizes her know-how and helps her find solutions to the complex problems she encounters, whether it is wrangling data, building and testing a model or deploying the model on the cloud. This mindset is her strategy potential, the think tank within, which enables her to make the tough calls she often needs to make for the data science projects to move forward.
Specific aspects of the Data Science mindset
Of course, the data science mindset is more than a general thing. It involves specific components, such as specialized know-how, tools that are compatible with each other and relevant to the task at hand, a deep understanding of the methodologies used in data science work, problem-solving skills, and most importantly, communication abilities. The latter involves both the data scientist expressing himself clearly and also him understanding what the stakeholders need and expect of him. Naturally, the data science mindset also includes organizational skills (project management), the ability to work well with other professionals (even those not directly related to data science), and the ability to come up with creative approaches to the problem at hand.
The Data Science process
The data science process/pipeline is a distillation of data science work in a comprehensible manner. It's particularly useful for understanding the various stages of a data science project and help plan accordingly. You can view one version of it in Fig. 1 below. If the data science mindset is one's ability to navigate the data science landscape, the data science process is a map of that landscape. It's not 100% accurate but good enough to help you gain perspective if you feel overwhelmed or need to get a better grip on the bigger picture.
Learning more about the topic
Naturally, it's impossible to exhaust this topic in a single article (or even a series of articles). The material I've gathered on it can fill a book! If you are interested in such a book, feel free to check out the one I put together a few years back; it's called Data Science Mindset, Methodologies, and Misconceptions and it's geared both towards data scientist, data science learners, and people involved in data science work in some way (e.g. project leaders or data analysts). Check it out when you have a moment. Cheers!