Article | April 16, 2021
There are many articles explaining advanced methods in AI, Machine Learning or Reinforcement Learning. Yet in real life, data scientists often have to deal with smaller, operational tasks that are not necessarily at the cutting edge of science, such as building simple SQL queries to generate lists of email addresses to target in CRM campaigns. In theory, these tasks should be assigned to someone better suited, such as Business Analysts or Data Analysts, but not every company has people dedicated to those tasks, especially if it’s a smaller organization.
In some cases, these activities might consume so much of our time that we don’t have much left for the work that matters, and we might end up doing a less than optimal job on both. That said, how should we deal with those tasks? On the one hand, not only do we usually dislike doing operational tasks, but they are also a poor use of an expensive professional. On the other hand, someone has to do them, and not everyone has the necessary SQL knowledge. Let’s look at some ways to deal with them and optimize your team’s time.
The first and most obvious way of doing fewer operational tasks is simply refusing to do them. I know it sounds harsh, and it might be impractical depending on your company and its hierarchy, but it’s worth trying in some cases. By “refusing”, I mean questioning whether the task is really necessary, and trying to find better ways of doing it. Let’s say that every month you have to prepare 3 different reports, for different areas, that contain similar information. You have managed to automate the SQL queries, but you still have to double-check the results and occasionally add or remove information at the user’s request, or change something in the chart layout. In this example, you could check whether all 3 reports are necessary, or whether you could merge them into one report that you send to the 3 different users. Either way, think of ways to reduce the time those tasks require or, ideally, stop performing them altogether.
Sometimes it can pay to take the time to empower your users to perform some of those tasks themselves. If there is a specific team that demands most of the operational tasks, try encouraging them to use no-code tools, framing it so that they feel they will be more autonomous. You can either use existing solutions or develop them in-house (this could be a great learning opportunity to develop your data scientists’ app-building skills).
If you notice it’s a task that you can’t get rid of and can’t delegate, then try to automate it as much as possible. For reports, try to migrate them to a data visualization tool such as Tableau or Google Data Studio and synchronize them with your database. If it’s related to ad hoc requests, try to make your SQL queries as flexible as possible, with variable dates and names, so that you don’t have to re-write them every time.
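As an illustration of a flexible query, here is a minimal sketch in Python using the standard sqlite3 module. The `customers` table, its columns and the sample rows are hypothetical, invented for this example; the point is only that the segment name and the dates are passed in as parameters instead of being hard-coded into the SQL.

```python
import sqlite3

def build_campaign_list(conn, segment, start_date, end_date):
    """Return the emails of one segment that signed up within a date range."""
    # Named placeholders keep the query text fixed; only the values change.
    query = """
        SELECT email
        FROM customers
        WHERE segment = :segment
          AND signup_date BETWEEN :start_date AND :end_date
        ORDER BY email
    """
    params = {"segment": segment, "start_date": start_date, "end_date": end_date}
    return [row[0] for row in conn.execute(query, params)]

def make_demo_db():
    """Build a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (email TEXT, segment TEXT, signup_date TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [
            ("ana@example.com", "premium", "2021-01-15"),
            ("bob@example.com", "basic", "2021-02-01"),
            ("cat@example.com", "premium", "2021-03-20"),
            ("dan@example.com", "premium", "2020-12-31"),
        ],
    )
    return conn

conn = make_demo_db()
# The same function serves any ad hoc request: just change the arguments.
print(build_campaign_list(conn, "premium", "2021-01-01", "2021-03-31"))
```

With a function like this, next month's request becomes a new set of arguments, not a rewritten query.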
Especially when you are a manager, you have to prioritize, so you and your team don’t drown in endless operational tasks. To do this, set aside one or two days in your week for that kind of work, and don’t look at it in the remaining 3–4 days. To achieve this, you will have to adapt your workload by following the previous steps and also manage expectations by taking this smaller number of work hours into account when setting deadlines. This also means explaining the paradigm shift to your internal clients, so they can adapt to these new deadlines. This step might require some internal politics, negotiating with your superiors and with other departments.
Once you have mapped all your operational activities, you start by eliminating as much as possible from your pipeline, first by getting rid of unnecessary activities for good, then by delegating them to the teams that request them. Then, whatever is left for you to do, you automate and organize, to make sure you are making time for the relevant work your team has to do. This way you make sure expensive employees’ time is being well spent, maximizing the company’s profit.
Article | December 21, 2020
Machine Learning (ML) has made great strides over the past few years, establishing its place in data analytics. In particular, ML has become a cornerstone of data science, alongside data wrangling and data visualization, among other facets of the field. Yet, we observe that many organizations are still hesitant to allocate a budget for it in their data pipelines. The data engineer role seems to attract lots of attention, but few companies make use of a machine learning expert/engineer. Could it be that ML can add value to other enterprises too? Let's find out by clarifying certain concepts.
What Machine Learning is
So that we are all on the same page, let's look at a down-to-earth definition of ML that you can include in a company meeting, a report, or even within an email to a colleague who isn't in this field. Investopedia defines ML as "the concept that a computer program can learn and adapt to new data without human intervention." In other words, if your machine (be it a computer, a smartphone, or even a smart device) can learn on its own, using some specialized software, then it's under the ML umbrella. It's important to note that ML is also a stand-alone field of research, predating most AI systems, even if the two are linked, as we'll see later on.
How Machine Learning is different from Statistics
It's also important to note that ML is different from Statistics, even if some people like to view the former as an extension of the latter. However, there is a fundamental difference that most people aren't aware of yet. Namely, ML is data-driven while Statistics is, for the most part, model-driven. This statement means that most Stats-based inferences are made by assuming a particular distribution in the data, or the interactions of different variables, and making predictions based on our mathematical models of these distributions. ML may employ distributions in some niche cases, but for the most part, it looks at data as-is, without making any assumptions about it.
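To make the contrast concrete, here is a toy sketch in Python with made-up numbers (an illustration only, not something either field would use as-is). Both functions estimate the chance that a value exceeds a threshold. The first assumes the data follows a Normal distribution and reads the answer off that model; the second makes no assumption and simply counts what the data shows. Neither is an ML algorithm as such, but the second captures the "data as-is" attitude.

```python
from statistics import NormalDist, mean, stdev

def prob_above_model_driven(sample, threshold):
    """Assume a Normal distribution, fit its parameters, then ask the model."""
    model = NormalDist(mu=mean(sample), sigma=stdev(sample))
    return 1.0 - model.cdf(threshold)

def prob_above_data_driven(sample, threshold):
    """Assume nothing about the distribution: count the observed fraction."""
    return sum(1 for x in sample if x > threshold) / len(sample)

# Hypothetical, heavily skewed measurements (e.g. waiting times in minutes).
waits = [1, 1, 1, 2, 2, 3, 3, 4, 20, 40]

print("model-driven:", prob_above_model_driven(waits, 10))
print("data-driven: ", prob_above_data_driven(waits, 10))
```

On skewed data like this, the two estimates diverge noticeably, because the Normal assumption does not describe the sample well.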
Machine Learning’s role in data science work
Let’s now get to the crux of the matter and explore how ML can be a significant value-add to a data science pipeline. First of all, ML can potentially offer better predictions than most Stats models in terms of accuracy, F1 score, etc. Also, ML can work alongside existing models to form model ensembles that can tackle the problems more effectively. Additionally, if transparency is important to the project stakeholders, there are ML-based options for offering some insight into which variables in the data at hand matter most for making predictions. Moreover, ML is more parametrized, meaning that you can tweak an ML model more, adapting it to the data you have and ensuring more robustness (i.e., reliability). Finally, you can learn ML without needing a Math degree or any other formal training. The latter, however, may prove useful, if you wish to delve deeper into the topic and develop your own models. This innovation potential is a significant aspect of ML since it's not as easy to develop new models in Stats (unless you are an experienced Statistics researcher) or even in AI. Besides, there are various "heuristics" that are part of the ML group of algorithms, facilitating your data science work, regardless of what predictive model you end up using.
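The ensemble idea mentioned above can be sketched in a few lines of Python. This is a minimal illustration of majority voting, assuming three hypothetical rule-based classifiers that stand in for whatever trained models you already have; the feature names and thresholds are invented for the example.

```python
from collections import Counter

# Three deliberately simple stand-ins for existing models (hypothetical rules).
# Each takes a dict of customer features and returns a label.
def rule_low_activity(customer):
    return "churn" if customer["logins_last_month"] < 3 else "stay"

def rule_support_tickets(customer):
    return "churn" if customer["open_tickets"] >= 2 else "stay"

def rule_short_tenure(customer):
    return "churn" if customer["tenure_months"] < 6 else "stay"

def majority_vote(models, customer):
    """Ask every model for a label and return the most common answer."""
    votes = Counter(model(customer) for model in models)
    return votes.most_common(1)[0][0]

models = [rule_low_activity, rule_support_tickets, rule_short_tenure]
new_customer = {"logins_last_month": 1, "open_tickets": 0, "tenure_months": 3}
print(majority_vote(models, new_customer))
```

Using an odd number of models with two possible labels avoids ties; real ensembles often weight the votes or average predicted probabilities instead.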
Machine Learning and AI
Many people conflate ML with AI these days. This confusion is partly because many ML models involve artificial neural networks (ANNs) which are the most modern manifestation of AI. Also, many AI systems are employed in ML tasks, so they are referred to as ML systems since AI can be a bit generic as a term. However, not all ML algorithms are AI-related, nor are all AI algorithms under the ML umbrella. This distinction is of import because certain limitations of AI systems (e.g., the need for lots and lots of data) don't apply to most ML models, while AI systems tend to be more time-consuming and resource-heavy than the average ML one. There are several ML algorithms you can use without breaking the bank and derive value from your data through them. Then, if you find that you need something better, in terms of accuracy, you can explore AI-based ones. Keep in mind, however, that some ML models (e.g., Decision Trees, Random Forests, etc.) offer some transparency, while the vast majority of AI ones are black boxes.
Learning more about the topic
Naturally, it's hard to do this topic justice in a single article. It is so vast that someone could write a book on it! That's what I did earlier this year, through the Technics Publications publishing house. You can learn more about this topic via this book, which is titled Julia for Machine Learning (Julia is a modern programming language used in data science, among other fields, and it's popular among various technical professionals). Feel free to check it out and explore how you can use ML in your work. Cheers!
Article | April 7, 2020
According to software vendors executing the big data projects, the answer is clear: More data means more options. Then add a bit of machine learning (ML) for good measure to get told what to do, and the revenue will thrive.
This is not really feasible. Therefore, before starting a big data project, a checklist might come in handy.
Make sure that the insights gained through machine learning are actionable. Gaining insights is always good, but it is even better if you can act on this new knowledge.
A shopping basket analysis shows which products are sold together. What to do with that information?
Companies could place the two products in opposite corners of the shop, so customers walk through all areas and will find other products to buy in addition. Or they could place both products next to each other so each boosts the sales of the other. Or how about discounting one product to gain more customers?
As all actions have unknown side effects, companies have to decide for themselves which action makes sense to take in their case.
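For readers who have not seen one, the core of a shopping basket analysis can be sketched as follows. This is a minimal Python illustration with invented baskets: it only counts how often each pair of products is bought together, which is the raw insight that then has to be turned into an action.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts

# Hypothetical purchase data, one list per shopping basket.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]

counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
support = top_count / len(baskets)  # share of all baskets containing the pair
print(top_pair, top_count, support)
```

The code stops exactly where the article's question begins: it tells you which products sell together, not what to do about it.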
Article | October 27, 2020
Data platforms and frameworks have been constantly evolving. At one point we were excited by Hadoop (well, for almost 10 years), followed by Snowflake, or as I say, Snowflake Blizzard (which managed to launch the biggest IPO win in history), and then Google (Google solves problems and serves use cases in a way that few companies can match).
The end of the data warehouse
Once upon a time, life was simple; or at least, the basic approach to Business Intelligence was fairly easy to describe… A process of collecting information from systems, building a repository of consistent data, and bolting on one or more reporting and visualisation tools which presented information to users. Data used to be managed in expensive, slow, inaccessible SQL data warehouses. SQL systems were notorious for their lack of scalability. Their demise is coming from a few technological advances. One of these is the ubiquitous, and growing, Hadoop.
On April 1, 2006, Apache Hadoop was unleashed upon Silicon Valley. Inspired by Google, Hadoop’s primary purpose was to improve the flexibility and scalability of data processing by splitting the process into smaller functions that run on commodity hardware.
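To give a feel for what "splitting the process into smaller functions" means, here is a minimal sketch of the map/reduce style of processing that Hadoop is known for, written in plain Python. It is not Hadoop's API, and the function names and sample text are invented for the example; in a real cluster, each chunk would be processed on a different machine.

```python
from collections import defaultdict

def map_phase(chunk):
    """Turn one chunk of text into (word, 1) pairs; chunks are independent."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Group the values emitted by all map calls under their key."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Combine the values of each key into one result per key."""
    return {key: sum(values) for key, values in grouped.items()}

# Two chunks stand in for data blocks stored on different machines.
chunks = ["big data big plans", "data beats plans"]
word_counts = reduce_phase(shuffle(map_phase(chunk) for chunk in chunks))
print(word_counts)
```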
Hadoop’s intent was to replace enterprise data warehouses based on SQL. Unfortunately, a technology used by Google may not be the best solution for everyone else. It’s not that others are incompetent: Google solves problems and serves use cases in a way that few companies can match. Google has been running massive-scale applications such as its eponymous search engine, YouTube and the Ads platform. The technologies and infrastructure that make the geographically distributed offerings perform at scale are what make various components of Google Cloud Platform enterprise ready and well-featured. Google has shown leadership in developing innovations that have been made available to the open-source community and are being used extensively by other public cloud vendors and Gartner clients. Examples of these include the Kubernetes container management framework, TensorFlow machine learning platform and the Apache Beam data processing programming model. GCP also uses open-source offerings in its cloud while treating third-party data and analytics providers as first-class citizens on its cloud and providing unified billing for its customers. The examples of the latter include DataStax, Redis Labs, InfluxData, MongoDB, Elastic, Neo4j and Confluent.
Silicon Valley tried to make Hadoop work. The technology was extremely complicated and nearly impossible to use efficiently. Hadoop’s lack of speed was compounded by its focus on unstructured data — you had to be a “flip-flop wearing” data scientist to truly make use of it.
Unstructured datasets are very difficult to query and analyze without deep knowledge of computer science. At one point, Gartner estimated that 70% of Hadoop deployments would not achieve the goal of cost savings and revenue growth, mainly due to insufficient skills and technical integration difficulties. And seventy percent seems like an understatement.
Data storage through the years: from GFS to Snowflake or Snowflake blizzard
Running in parallel with Hadoop’s journey was that of Marcin Zukowski, co-founder and CEO of Vectorwise, who went on to co-found Snowflake. Marcin took the data warehouse in another direction, to the world of advanced vector processing. Despite being almost unheard of among the general public, Snowflake was founded back in 2012. Snowflake is not a consumer tech firm like Netflix or Uber. It's business-to-business only, which may explain its high valuation: enterprise companies are often seen as a more "stable" investment. In short, Snowflake helps businesses manage data that's stored on the cloud. The firm's motto is "mobilising the world's data", because it allows big companies to make better use of their vast data stores.
Marcin and his teammates rethought the data warehouse by leveraging the elasticity of the public cloud in an unexpected way: separating storage and compute. Their message was this: don’t pay for a data warehouse you don’t need. Only pay for the storage you need, and add capacity as you go. This is considered one of Snowflake’s key innovations: separating storage (where the data is held) from computing (the act of querying). By offering this service before Google, Amazon, and Microsoft had equivalent products of their own, Snowflake was able to attract customers, and build market share in the data warehousing space.
Naming the company after a discredited database concept was very brave. For those of us not versed in the details of the snowflake schema, it is a logical arrangement of tables in a multidimensional database such that the entity-relationship diagram resembles a snowflake shape. … When it is completely normalized along all the dimension tables, the resultant structure resembles a snowflake with the fact table in the middle. Needless to say, the “snowflake” schema is as far from Hadoop’s design philosophy as technically possible.
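For the curious, a tiny snowflake schema can be sketched with Python's built-in sqlite3 module. The table names and rows below are hypothetical, made up for this illustration: a fact table of sales sits in the middle, it points to a product dimension, and that dimension is itself normalized into a separate category table, which is what gives the diagram its branching shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Outer branch: the product dimension is normalized into categories.
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );
    -- Centre of the snowflake: the fact table.
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        amount REAL
    );
    INSERT INTO dim_category VALUES (1, 'Dairy'), (2, 'Bakery');
    INSERT INTO dim_product VALUES (10, 'Milk', 1), (11, 'Butter', 1), (12, 'Bread', 2);
    INSERT INTO fact_sales VALUES (100, 10, 2.0), (101, 11, 3.5), (102, 12, 1.5), (103, 10, 2.0);
""")

# Answering a business question means walking from the fact table outwards.
rows = conn.execute("""
    SELECT c.name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_category AS c ON c.category_id = p.category_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)
```

Every extra level of normalization adds another join, which is one reason the schema has its critics.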
While Silicon Valley was headed toward a dead end, Snowflake captured an entire cloud data market.