Article | October 27, 2020
Data Platforms and frameworks have been constantly evolving. At some point of time; we are excited by Hadoop (well for almost 10 years); followed by Snowflake or as I say Snowflake Blizzard (who managed to launch biggest IPO win historically) and the Google (Google solves problems and serves use cases in a way that few companies can match).
The end of the data warehouse
Once upon a time, life was simple; or at least, the basic approach to Business Intelligence was fairly easy to describe… A process of collecting information from systems, building a repository of consistent data, and bolting on one or more reporting and visualisation tools which presented information to users. Data used to be managed in expensive, slow, inaccessible SQL data warehouses. SQL systems were notorious for their lack of scalability. Their demise is coming from a few technological advances. One of these is the ubiquitous, and growing, Hadoop.
On April 1, 2006, Apache Hadoop was unleashed upon Silicon Valley. Inspired by Google, Hadoop’s primary purpose was to improve the flexibility and scalability of data processing by splitting the process into smaller functions that run on commodity hardware.
Hadoop’s intent was to replace enterprise data warehouses based on SQL. Unfortunately, a technology used by Google may not be the best solution for everyone else. It’s not that others are incompetent: Google solves problems and serves use cases in a way that few companies can match. Google has been running massive-scale applications such as its eponymous search engine, YouTube and the Ads platform. The technologies and infrastructure that make the geographically distributed offerings perform at scale are what make various components of Google Cloud Platform enterprise ready and well-featured. Google has shown leadership in developing innovations that have been made available to the open-source community and are being used extensively by other public cloud vendors and Gartner clients. Examples of these include the Kubernetes container management framework, TensorFlow machine learning platform and the Apache Beam data processing programming model. GCP also uses open-source offerings in its cloud while treating third-party data and analytics providers as first-class citizens on its cloud and providing unified billing for its customers. The examples of the latter include DataStax, Redis Labs, InfluxData, MongoDB, Elastic, Neo4j and Confluent.
Silicon Valley tried to make Hadoop work. The technology was extremely complicated and nearly impossible to use efficiently. Hadoop’s lack of speed was compounded by its focus on unstructured data — you had to be a “flip-flop wearing” data scientist to truly make use of it.
Unstructured datasets are very difficult to query and analyze without deep knowledge of computer science. At one point, Gartner estimated that 70% of Hadoop deployments would not achieve the goal of cost savings and revenue growth, mainly due to insufficient skills and technical integration difficulties. And seventy percent seems like an understatement.
Data storage through the years: from GFS to Snowflake or Snowflake blizzard
Developing in parallel with Hadoop’s journey was that of Marcin Zukowski — co-founder and CEO of Vectorwise. Marcin took the data warehouse in another direction, to the world of advanced vector processing. Despite being almost unheard of among the general public, Snowflake was actually founded back in 2012. Firstly, Snowflake is not a consumer tech firm like Netflix or Uber. It's business-to-business only, which may explain its high valuation – enterprise companies are often seen as a more "stable" investment. In short, Snowflake helps businesses manage data that's stored on the cloud. The firm's motto is "mobilising the world's data", because it allows big companies to make better use of their vast data stores.
Marcin and his teammates rethought the data warehouse by leveraging the elasticity of the public cloud in an unexpected way: separating storage and compute. Their message was this: don’t pay for a data warehouse you don’t need. Only pay for the storage you need, and add capacity as you go. This is considered one of Snowflake’s key innovations: separating storage (where the data is held) from computing (the act of querying). By offering this service before Google, Amazon, and Microsoft had equivalent products of their own, Snowflake was able to attract customers, and build market share in the data warehousing space.
Naming the company after a discredited database concept was very brave. For those of us not in the details of the Snowflake schema, it is a logical arrangement of tables in a multidimensional database such that the entity-relationship diagram resembles a snowflake shape. … When it is completely normalized along all the dimension tables, the resultant structure resembles a snowflake with the fact table in the middle. Needless to say, the “snowflake” schema is as far from Hadoop’s design philosophy as technically possible.
While Silicon Valley was headed toward a dead end, Snowflake captured an entire cloud data market.
Article | December 23, 2020
Nowadays, everyone with some technical expertise and a data science bootcamp under their belt calls themselves a data scientist. Also, most managers don't know enough about the field to distinguish an actual data scientist from a make-believe one someone who calls themselves a data science professional today but may work as a cab driver next year. As data science is a very responsible field dealing with complex problems that require serious attention and work, the data scientist role has never been more significant. So, perhaps instead of arguing about which programming language or which all-in-one solution is the best one, we should focus on something more fundamental. More specifically, the thinking process of a data scientist.
The challenges of the Data Science professional
Any data science professional, regardless of his specialization, faces certain challenges in his day-to-day work. The most important of these involves decisions regarding how he goes about his work. He may have planned to use a particular model for his predictions or that model may not yield adequate performance (e.g., not high enough accuracy or too high computational cost, among other issues). What should he do then? Also, it could be that the data doesn't have a strong enough signal, and last time I checked, there wasn't a fool-proof method on any data science programming library that provided a clear-cut view on this matter. These are calls that the data scientist has to make and shoulder all the responsibility that goes with them.
Why Data Science automation often fails
Then there is the matter of automation of data science tasks. Although the idea sounds promising, it's probably the most challenging task in a data science pipeline. It's not unfeasible, but it takes a lot of work and a lot of expertise that's usually impossible to find in a single data scientist. Often, you need to combine the work of data engineers, software developers, data scientists, and even data modelers. Since most organizations don't have all that expertise or don't know how to manage it effectively, automation doesn't happen as they envision, resulting in a large part of the data science pipeline needing to be done manually.
The Data Science mindset overall
The data science mindset is the thinking process of the data scientist, the operating system of her mind. Without it, she can't do her work properly, in the large variety of circumstances she may find herself in. It's her mindset that organizes her know-how and helps her find solutions to the complex problems she encounters, whether it is wrangling data, building and testing a model or deploying the model on the cloud. This mindset is her strategy potential, the think tank within, which enables her to make the tough calls she often needs to make for the data science projects to move forward.
Specific aspects of the Data Science mindset
Of course, the data science mindset is more than a general thing. It involves specific components, such as specialized know-how, tools that are compatible with each other and relevant to the task at hand, a deep understanding of the methodologies used in data science work, problem-solving skills, and most importantly, communication abilities. The latter involves both the data scientist expressing himself clearly and also him understanding what the stakeholders need and expect of him. Naturally, the data science mindset also includes organizational skills (project management), the ability to work well with other professionals (even those not directly related to data science), and the ability to come up with creative approaches to the problem at hand.
The Data Science process
The data science process/pipeline is a distillation of data science work in a comprehensible manner. It's particularly useful for understanding the various stages of a data science project and help plan accordingly. You can view one version of it in Fig. 1 below. If the data science mindset is one's ability to navigate the data science landscape, the data science process is a map of that landscape. It's not 100% accurate but good enough to help you gain perspective if you feel overwhelmed or need to get a better grip on the bigger picture.
Learning more about the topic
Naturally, it's impossible to exhaust this topic in a single article (or even a series of articles). The material I've gathered on it can fill a book! If you are interested in such a book, feel free to check out the one I put together a few years back; it's called Data Science Mindset, Methodologies, and Misconceptions and it's geared both towards data scientist, data science learners, and people involved in data science work in some way (e.g. project leaders or data analysts). Check it out when you have a moment. Cheers!
Article | March 4, 2020
Deep learning, the main innovation that has renewed interest in artificial intelligence in the past years, has helped solve many critical problems in computer vision, natural language processing, and speech recognition. However, as the deep learning matures and moves from hype peak to its trough of disillusionment, it is becoming clear that it is missing some fundamental components.
Article | December 16, 2020
In this article, we will explore different techniques to detect money laundering activities.
Notwithstanding, regardless of various expected applications inside the financial services sector, explicitly inside the Anti-Money Laundering (AML) appropriation of Artificial Intelligence and Machine Learning (ML) has been generally moderate.
What is Money Laundering, Anti Money Laundering?
Money Laundering is where someone unlawfully obtains money and moves it to cover up their crimes.
Anti-Money Laundering can be characterized as an activity that forestalls or aims to forestall money laundering from occurring.
It is assessed by UNO that, money-laundering exchanges account in one year is 2–5% of worldwide GDP or $800 billion — $3 trillion in USD. In 2019, regulators and governmental offices exacted fines of more than $8.14 billion.
Indeed, even with these stunning numbers, gauges are that just about 1 % of unlawful worldwide money related streams are ever seized by the specialists.
AML activities in banks expend an over the top measure of manpower, assets, and cash flow to deal with the process and comply with the guidelines.
What are the punishments for money laundering?
In 2019, Celent evaluated that spending came to $8.3 billion and $23.4 billion for technology and operations, individually. This speculation is designated toward guaranteeing anti-money laundering.
As we have seen much of the time, reputational costs can likewise convey a hefty price. In 2012, HSBC laundering of an expected £5.57 billion over at least seven years.
What is the current situation of the banks applying ML to stop money laundering?
Given the plenty of new instruments the banks have accessible, the potential feature risk, the measure of capital involved, and the gigantic expenses as a form of fines and punishments, this should not be the situation.
A solid impact by nations to curb illicit cash movement has brought about a huge yet amazingly little part of money laundering being recognized — a triumph rate of about 2% average.
Dutch banks — ABN Amro, Rabobank, ING, Triodos Bank, and Volksbank announced in September 2019 to work toward a joint transaction monitoring to stand-up fight against Money Laundering.
A typical challenge in transaction monitoring, for instance, is the generation of a countless number of alerts, which thusly requires operation teams to triage and process the alarms.
ML models can identify and perceive dubious conduct and besides they can classify alerts into different classes such as critical, high, medium, or low risk. Critical or High alerts may be directed to senior experts on a high need to quickly explore the issue.
Today is the immense number of false positives, gauges show that the normal, of false positives being produced, is the range of 95 and 99%, and this puts extraordinary weight on banks.
The examination of false positives is tedious and costs money. An ongoing report found that banks were spending near 3.01€ billion every year exploring false positives.
Establishments are looking for increasing productive ways to deal with crime and, in this specific situation, Machine Learning can end up being a significant tool.
Financial activities become productive, the gigantic sum and speed of money related exchanges require a viable monitoring framework that can process exchanges rapidly, ideally in real-time.
What are the types of machine learning algorithms which can identify money laundering transactions?
Supervised Machine Learning, it is essential to have historical information with events precisely assigned and input variables appropriately captured. If biases or errors are left in the data without being dealt with, they will get passed on to the model, bringing about erroneous models.
It is smarter to utilize Unsupervised Machine Learning to have historical data with events accurately assigned. It sees an obscure pattern and results. It recognizes suspicious activity without earlier information of exactly what a money-laundering scheme resembles.
What are the different techniques to detect money laundering?
K-means Sequence Miner algorithm: Entering banking transactions, at that point running frequent pattern mining algorithms and mining transactions to distinguish money laundering. Clustering transactions and dubious activities to money laundering lastly show them on a chart.
Time Series Euclidean distance: Presenting a sequence matching algorithm to distinguish money laundering detection, utilizing sequential detection of suspicious transactions. This method exploits the two references to recognize dubious transactions: a history of every individual’s account and exchange data with different accounts.
Bayesian networks: It makes a model of the user’s previous activities, and this model will be a measure of future customer activities. In the event that the exchange or user financial transactions have.
Cluster-based local outlier factor algorithm: The money laundering detection utilizing clustering techniques combination and Outliers.
For banks, now is the ideal opportunity to deploy ML models into their ecosystem. Despite this opportunity, increased knowledge and the number of ML implementations prompted a discussion about the feasibility of these solutions and the degree to which ML should be trusted and potentially replace human analysis and decision-making.
In order to further exploit and achieve ML promise, banks need to continue to expand on its awareness of ML strengths, risks, and limitations and, most critically, to create an ethical system by which the production and use of ML can be controlled and the feasibility and effect of these emerging models proven and eventually trusted.