Article | September 2, 2021
Massive amounts of data are collected and stored by companies in the search for the “Holy Grail”. One crucial component is discovering and applying novel approaches that give a more complete picture of a dataset than the local (sometimes global) event-based analytic strategy that currently dominates a given field.
Bringing qualitative data to life is essential, since it gives management decisions context and nuance. An NLP perspective that uncovers word-based themes across documents facilitates the exploration and exploitation of qualitative data, which is often hard to “identify” in a global setting. NLP can be used to perform different kinds of analysis for mapping drivers.
Broadly speaking, drivers are factors that cause change and affect institutions, policies and management decision making. More precisely, a “driver” is a force that has a material impact on a specific activity or entity, that is contextually dependent, and that affects the financial market at a specific time (Litterio, 2018). Major drivers often lie outside the immediate institutional environment, such as elections or regional upheavals, or are non-institutional factors such as Covid or climate change. In Total global strategy: Managing for worldwide competitive advantage, Yip (1992) develops a framework based on a set of four industry globalization drivers, which highlights the conditions for a company to become more global while also reflecting differentials in a competitive environment. In The lexicons: NLP in the design of Market Drivers Lexicon in Spanish, I have proposed a categorization into micro drivers, macro drivers and temporality, and a distinction among social, political, economic and technological drivers. Considering the “big picture”, “digging” beyond the usual sectors and timeframes is key to state-of-the-art findings.
Working with qualitative data
There is certainly no unique “recipe” when applying NLP strategies. Different pipelines can be used to analyse any sort of textual data, from social media posts and reviews to focus group notes, blog comments and transcripts, to name just a few, when a MetaQuant team is looking for drivers.
Generally, when textual data is the source, it is preferable to avoid manual work on the part of the analyst, though sometimes, depending on the domain, content, cultural variables and so on, it may be required. If qualitative data is the core, the preferred format is .csv, because its plain nature typically handles written responses better. Once the data has been collected and exported, the next step is pre-processing. The basics include normalisation, morphosyntactic analysis, sentence structure analysis, tokenization, lexicalization and contextualization; the aim is simply to simplify the data and make analysis easier.
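To make the basic steps concrete, here is a minimal sketch of the first two (normalisation and tokenization, plus stop-word removal) using only the Python standard library; the sample sentence and stop-word list are illustrative, not part of any real pipeline:

```python
import re

def preprocess(text, stop_words=frozenset({"the", "a", "an", "and", "of", "to", "in"})):
    """Lowercase, tokenize, and drop stop words — the simplest pre-processing pass."""
    text = text.lower()                    # normalisation
    tokens = re.findall(r"[a-z']+", text)  # tokenization (letters and apostrophes only)
    return [t for t in tokens if t not in stop_words]

doc = "The drivers of change affect institutions and policies in the market."
tokens = preprocess(doc)
print(tokens)
# ['drivers', 'change', 'affect', 'institutions', 'policies', 'market']
```

Real pipelines would extend this with lemmatization and morphosyntactic tagging (e.g. via spaCy or NLTK), but the shape stays the same: raw text in, a clean token list out.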
Topic modelling refers to the task of recognizing the words from the main topics that best describe a document or corpus. LDA (Latent Dirichlet Allocation) is one of the most powerful algorithms for this, with excellent implementations in Python’s Gensim package.
The challenge is how to extract good-quality topics that are clear and meaningful. This depends mostly on the quality of the text pre-processing, the strategy for finding the optimal number of topics, and the creation of the lexicon(s) and the corpora. We can say that a topic is defined, or construed, around its most representative keywords. But are keywords enough? There are other factors to consider, such as:
1. The variety of topics included in the corpora.
2. The choice of topic modelling algorithm.
3. The number of topics fed to the algorithm.
4. The algorithm’s tuning parameters.
As you have probably noticed, finding “the needle in the haystack” is not that easy. And only those who can use NLP creatively will have the advantage of positioning themselves for global success.
Article | September 2, 2021
THE CORONAVIRUS PANDEMIC has spurred interest in big data to track the spread of the fast-moving pathogen and to plan disease prevention efforts. But the urgent need to contain the outbreak shouldn’t cloud thinking about big data’s potential to do more harm than good.

Companies and governments worldwide are tapping the location data of millions of internet and mobile phone users for clues about how the virus spreads and whether social distancing measures are working. Unlike surveillance measures that track the movements of particular individuals, these efforts analyze large data sets to uncover patterns in people’s movements and behavior over the course of the pandemic.
Article | September 2, 2021
Data platforms and frameworks have been constantly evolving. At one point we were excited by Hadoop (for almost 10 years); then came Snowflake, or as I call it the Snowflake Blizzard (which managed the biggest software IPO in history); and then Google (which solves problems and serves use cases in a way that few companies can match).
The end of the data warehouse
Once upon a time, life was simple; or at least, the basic approach to Business Intelligence was fairly easy to describe: collect information from systems, build a repository of consistent data, and bolt on one or more reporting and visualisation tools that present the information to users. Data used to be managed in expensive, slow, inaccessible SQL data warehouses, and SQL systems were notorious for their lack of scalability. Their demise has come from a few technological advances, one of which is the ubiquitous, and still growing, Hadoop.
On April 1, 2006, Apache Hadoop was unleashed upon Silicon Valley. Inspired by Google, Hadoop’s primary purpose was to improve the flexibility and scalability of data processing by splitting the process into smaller functions that run on commodity hardware.
Hadoop’s intent was to replace enterprise data warehouses based on SQL. Unfortunately, a technology used by Google may not be the best solution for everyone else. It’s not that others are incompetent: Google solves problems and serves use cases in a way that few companies can match. Google has long run massive-scale applications such as its eponymous search engine, YouTube and its Ads platform, and the technologies and infrastructure that make those geographically distributed offerings perform at scale are what make various components of Google Cloud Platform enterprise-ready and well-featured. Google has shown leadership in developing innovations that it has made available to the open-source community and that are used extensively by other public cloud vendors; examples include the Kubernetes container management framework, the TensorFlow machine learning platform and the Apache Beam data processing programming model. GCP also uses open-source offerings in its cloud, treating third-party data and analytics providers as first-class citizens and providing unified billing for its customers; examples of the latter include DataStax, Redis Labs, InfluxData, MongoDB, Elastic, Neo4j and Confluent.
Silicon Valley tried to make Hadoop work. The technology was extremely complicated and nearly impossible to use efficiently. Hadoop’s lack of speed was compounded by its focus on unstructured data — you had to be a “flip-flop wearing” data scientist to truly make use of it.
Unstructured datasets are very difficult to query and analyze without deep knowledge of computer science. At one point, Gartner estimated that 70% of Hadoop deployments would not achieve the goal of cost savings and revenue growth, mainly due to insufficient skills and technical integration difficulties. And seventy percent seems like an understatement.
Data storage through the years: from GFS to Snowflake or Snowflake blizzard
Developing in parallel with Hadoop’s journey was that of Marcin Zukowski, co-founder and CEO of Vectorwise. Marcin took the data warehouse in another direction, into the world of advanced vector processing. Despite being almost unheard of among the general public, Snowflake was actually founded back in 2012. Snowflake is not a consumer tech firm like Netflix or Uber; it is business-to-business only, which may explain its high valuation, as enterprise companies are often seen as a more “stable” investment. In short, Snowflake helps businesses manage data stored in the cloud. The firm’s motto is “mobilising the world’s data”, because it allows big companies to make better use of their vast data stores.
Marcin and his teammates rethought the data warehouse by leveraging the elasticity of the public cloud in an unexpected way: separating storage and compute. Their message was this: don’t pay for a data warehouse you don’t need. Only pay for the storage you need, and add capacity as you go. This is considered one of Snowflake’s key innovations: separating storage (where the data is held) from computing (the act of querying). By offering this service before Google, Amazon, and Microsoft had equivalent products of their own, Snowflake was able to attract customers, and build market share in the data warehousing space.
Naming the company after a discredited database concept was very brave. For those of us not familiar with the details, a snowflake schema is a logical arrangement of tables in a multidimensional database such that the entity-relationship diagram resembles a snowflake: when the dimension tables are completely normalized, the resulting structure looks like a snowflake with the fact table in the middle. Needless to say, the snowflake schema is as far from Hadoop’s design philosophy as technically possible.
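To make the schema shape concrete, here is a tiny hypothetical snowflake schema built in SQLite via Python’s standard library: a `sales` fact table points at a `store` dimension, which is itself normalized out into a `region` sub-dimension, so any query must walk the chain of joins:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Illustrative snowflake schema: fact table -> dimension -> sub-dimension.
cur.executescript("""
CREATE TABLE region (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE store  (store_id  INTEGER PRIMARY KEY, store_name TEXT,
                     region_id INTEGER REFERENCES region(region_id));
CREATE TABLE sales  (sale_id   INTEGER PRIMARY KEY, amount REAL,
                     store_id  INTEGER REFERENCES store(store_id));

INSERT INTO region VALUES (1, 'EMEA');
INSERT INTO store  VALUES (10, 'Berlin', 1);
INSERT INTO sales  VALUES (100, 99.5, 10);
""")

# Reading a single fact with its context requires joining out through
# every normalized layer — the cost of the snowflake's normalization.
row = cur.execute("""
    SELECT s.amount, st.store_name, r.region_name
    FROM sales s
    JOIN store st ON s.store_id = st.store_id
    JOIN region r ON st.region_id = r.region_id
""").fetchone()
print(row)  # (99.5, 'Berlin', 'EMEA')
```

The contrast with Hadoop is visible even at this toy scale: the snowflake schema trades join cost for strict structure and normalization, while Hadoop’s design embraced loosely structured data spread across commodity machines.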
While Silicon Valley was headed toward a dead end, Snowflake captured an entire cloud data market.
Article | September 2, 2021
Saurav Singla is a senior data scientist, machine learning expert, author, technical writer, data science course creator and instructor, mentor and speaker.
While Media 7 has followed Saurav Singla’s story closely, this chat with Saurav was about analytics, his journey as a data scientist, and what he brings to the table with his 15 years of extensive statistical modeling, machine learning, natural language processing, deep learning, and data analytics across Consumer Durable, Retail, Finance, Energy, Human Resource and Healthcare sectors. He has grown multiple businesses in the past and is still a researcher at heart.
In the past, analytics and predictive modeling were predominant in only a few industries, but they are now becoming an eminent part of emerging fields such as health, human resource management, pharma, IoT and other smart solutions as well.
Saurav has worked in data science since 2003. Over the years, he realized that all the people they hired, whether from business or engineering backgrounds, needed extensive training to be able to perform analytics on real-world business datasets.
He got the opportunity to move to Australia in 2003, joining the retail company Harvey Norman and working out of its Melbourne office for four years.
After moving back to India in 2008, he joined one of the verticals of Siemens, then one of the few companies in India using analytics in-house, where he stayed for eight years.
He is a passionate believer that the use of data and analytics will dramatically change not only corporations but also our societies. Building and expanding the application of analytics to supply chain, logistics, sales, marketing and finance at Siemens was a very fulfilling and enjoyable experience for him.
He grew the team from zero to fifteen as its data science leader, and he believes those eight years taught him how to think big and scale organizations using data science.
He has demonstrated success in developing and seamlessly executing plans in complex organizational structures. He has also been recognized for maximizing performance by implementing appropriate project management tools through analysis of details to ensure quality control and understanding of emerging technology.
In 2016, he started feeling a serious inner push toward consulting, and he shifted to a company based in Delhi NCR.
During his ten months with them, he improved the way clients and businesses implement and exploit machine learning in their consumer engagements. As part of that vision, he developed class-defining applications that reduce friction between technologies, processes and people. Another key aspect of his plan was ensuring delivery in very fast, agile cycles, and towards that he actively innovated on operating and engagement models.
In 2017, he moved to London and joined a digital technology company, helping to build artificial intelligence and machine learning products for its clients. He aimed to solve problems and transform costs using technology and machine learning, and he was associated with them for two years.
At the beginning of 2018, he joined Mindrops, where he developed advanced machine learning technologies and processes to solve client problems, mentored the data science function and guided it in developing solutions. He built robust client data science capabilities that can scale across multiple business use cases.
Outside work, Saurav is associated with Mentoring Club and Revive. He volunteers in his spare time, coaching and mentoring young people taking up careers in data science and helping data practitioners build high-performing teams and grow the industry. He keeps data science enthusiasts motivated and guides them along their career paths, fills knowledge gaps so aspirants understand the core of the industry, helps them analyze their progress and upskill accordingly, and connects them with potential job opportunities through his industry-leading network.
Additionally, in 2018 he joined, as a mentor, a transaction behavioral intelligence company that accelerates business growth for banks with Artificial Intelligence and Machine Learning enabled products. He guides their machine learning engineers on their projects and is enhancing the capabilities of their AI-driven recommendation engine product.
Saurav also teaches learners data science in a more engaging way through courses on the Udemy marketplace. He has created two courses on Udemy, with over twenty thousand students enrolled in them. He regularly speaks at meetups on data science topics and writes articles in major publications such as AI Time Journal, Towards Data Science, Data Science Central, KDnuggets, Data-Driven Investor, HackerNoon and Infotech Report. He also actively contributes academic research papers in machine learning, deep learning, natural language processing, statistics and artificial intelligence.
His book on Machine Learning for Finance was published by BPB Publications, Asia's largest publisher of computer and IT books. This is possibly one of the biggest milestones of his career.
Saurav has turned his passion toward making knowledge available to society. He believes sharing knowledge is cool, and he wishes everyone had that same passion for knowledge sharing; that, to him, would be success.