ZACHARIAS VOULGARIS | December 23, 2020
Nowadays, everyone with some technical expertise and a data science bootcamp under their belt calls themselves a data scientist. Also, most managers don't know enough about the field to distinguish an actual data scientist from a make-believe one someone who calls themselves a data science professional today but may work as a cab driver next year. As data science is a very responsible field dealing with complex problems that require serious attention and work, the data scientist role has never been more significant. So, perhaps instead of arguing about which programming language or which all-in-one solution is the best one, we should focus on something more fundamental. More specifically, the thinking process of a data scientist.
The challenges of the Data Science professional
Any data science professional, regardless of his specialization, faces certain challenges in his day-to-day work. The most important of these involves decisions regarding how he goes about his work. He may have planned to use a particular model for his predictions or that model may not yield adequate performance (e.g., not high enough accuracy or too high computational cost, among other issues). What should he do then? Also, it could be that the data doesn't have a strong enough signal, and last time I checked, there wasn't a fool-proof method on any data science programming library that provided a clear-cut view on this matter. These are calls that the data scientist has to make and shoulder all the responsibility that goes with them.
Why Data Science automation often fails
Then there is the matter of automation of data science tasks. Although the idea sounds promising, it's probably the most challenging task in a data science pipeline. It's not unfeasible, but it takes a lot of work and a lot of expertise that's usually impossible to find in a single data scientist. Often, you need to combine the work of data engineers, software developers, data scientists, and even data modelers. Since most organizations don't have all that expertise or don't know how to manage it effectively, automation doesn't happen as they envision, resulting in a large part of the data science pipeline needing to be done manually.
The Data Science mindset overall
The data science mindset is the thinking process of the data scientist, the operating system of her mind. Without it, she can't do her work properly, in the large variety of circumstances she may find herself in. It's her mindset that organizes her know-how and helps her find solutions to the complex problems she encounters, whether it is wrangling data, building and testing a model or deploying the model on the cloud. This mindset is her strategy potential, the think tank within, which enables her to make the tough calls she often needs to make for the data science projects to move forward.
Specific aspects of the Data Science mindset
Of course, the data science mindset is more than a general thing. It involves specific components, such as specialized know-how, tools that are compatible with each other and relevant to the task at hand, a deep understanding of the methodologies used in data science work, problem-solving skills, and most importantly, communication abilities. The latter involves both the data scientist expressing himself clearly and also him understanding what the stakeholders need and expect of him. Naturally, the data science mindset also includes organizational skills (project management), the ability to work well with other professionals (even those not directly related to data science), and the ability to come up with creative approaches to the problem at hand.
The Data Science process
The data science process/pipeline is a distillation of data science work in a comprehensible manner. It's particularly useful for understanding the various stages of a data science project and help plan accordingly. You can view one version of it in Fig. 1 below. If the data science mindset is one's ability to navigate the data science landscape, the data science process is a map of that landscape. It's not 100% accurate but good enough to help you gain perspective if you feel overwhelmed or need to get a better grip on the bigger picture.
Learning more about the topic
Naturally, it's impossible to exhaust this topic in a single article (or even a series of articles). The material I've gathered on it can fill a book! If you are interested in such a book, feel free to check out the one I put together a few years back; it's called Data Science Mindset, Methodologies, and Misconceptions and it's geared both towards data scientist, data science learners, and people involved in data science work in some way (e.g. project leaders or data analysts). Check it out when you have a moment. Cheers!
ZACHARIAS VOULGARIS | December 21, 2020
Machine Learning (ML) has taken strides over the past few years, establishing its place in data analytics. In particular, ML has become a cornerstone in data science, alongside data wrangling, and data visualization, among other facets of the field. Yet, we observe many organizations still hesitant when allocating a budget for it in their data pipelines. The data engineer role seems to attract lots of attention, but few companies leverage the machine learning expert/engineer. Could it be that ML can add value to other enterprises too? Let's find out by clarifying certain concepts.
What Machine Learning is
So that we are all on the same page, let's look at a down-to-earth definition of ML that you can include in a company meeting, a report, or even within an email to a colleague who isn't in this field. Investopedia defines ML as "the concept that a computer program can learn and adapt to new data without human intervention." In other words, if your machine (be it a computer, a smartphone, or even a smart device) can learn on its own, using some specialized software, then it's under the ML umbrella. It's important to note that ML is also a stand-alone field of research, predating most AI systems, even if the two are linked, as we'll see later on.
How Machine Learning is different from Statistics
It's also important to note that ML is different from Statistics, even if some people like to view the former as an extension of the latter. However, there is a fundamental difference that most people aren't aware of yet. Namely, ML is data-driven while Statistics is, for the most part, model-driven. This statement means that most Stats-based inferences are made by assuming a particular distribution in the data, or the interactions of different variables, and making predictions based on our mathematical models of these distributions. ML may employ distributions in some niche cases, but for the most part, it looks at data as-is, without making any assumptions about it.
Machine Learning’s role in data science work
Let’s now get to the crux of the matter and explore how ML can be a significant value-add to a data science pipeline. First of all, ML can potentially offer better predictions than most Stats models in terms of accuracy, F1 score, etc. Also, ML can work alongside existing models to form model ensembles that can tackle the problems more effectively. Additionally, if transparency is important to the project stakeholders, there are ML-based options for offering some insight as to what variables are important in the data at hand, for making predictions based on it. Moreover, ML is more parametrized, meaning that you can tweak an ML model more, adapting it to the data you have and ensuring more robustness (i.e., reliability). Finally, you can learn ML without needing a Math degree or any other formal training. The latter, however, may prove useful, if you wish to delve deeper into the topic and develop your own models. This innovation potential is a significant aspect of ML since it's not as easy to develop new models in Stats (unless you are an experienced Statistics researcher) or even in AI. Besides, there are a bunch of various "heuristics" that are part of the ML group of algorithms, facilitating your data science work, regardless of what predictive model you end up using.
Machine Learning and AI
Many people conflate ML with AI these days. This confusion is partly because many ML models involve artificial neural networks (ANNs) which are the most modern manifestation of AI. Also, many AI systems are employed in ML tasks, so they are referred to as ML systems since AI can be a bit generic as a term. However, not all ML algorithms are AI-related, nor are all AI algorithms under the ML umbrella. This distinction is of import because certain limitations of AI systems (e.g., the need for lots and lots of data) don't apply to most ML models, while AI systems tend to be more time-consuming and resource-heavy than the average ML one. There are several ML algorithms you can use without breaking the bank and derive value from your data through them. Then, if you find that you need something better, in terms of accuracy, you can explore AI-based ones. Keep in mind, however, that some ML models (e.g., Decision Trees, Random Forests, etc.) offer some transparency, while the vast majority of AI ones are black boxes.
Learning more about the topic
Naturally, it's hard to do this topic justice in a single article. It is so vast that someone can write a book on it! That's what I've done earlier this year, through the Technics Publications publishing house. You can learn more about this topic via this book, which is titled Julia for Machine Learning (Julia is a modern programming language used in data science, among other fields, and it's popular among various technical professionals). Feel free to check it out and explore how you can use ML in your work. Cheers!