Article | December 23, 2020
Nowadays, everyone with some technical expertise and a data science bootcamp under their belt calls themselves a data scientist. Also, most managers don't know enough about the field to distinguish an actual data scientist from a make-believe one: someone who calls themselves a data science professional today but may work as a cab driver next year. As data science is a very responsible field dealing with complex problems that require serious attention and work, the data scientist role has never been more significant. So, perhaps instead of arguing about which programming language or which all-in-one solution is the best one, we should focus on something more fundamental: the thinking process of a data scientist.
The challenges of the Data Science professional
Any data science professional, regardless of his specialization, faces certain challenges in his day-to-day work. The most important of these involves decisions regarding how he goes about his work. He may have planned to use a particular model for his predictions, but that model may not yield adequate performance (e.g., not high enough accuracy or too high computational cost, among other issues). What should he do then? Also, it could be that the data doesn't have a strong enough signal, and last time I checked, there wasn't a fool-proof method in any data science programming library that provided a clear-cut view on this matter. These are calls that the data scientist has to make, shouldering all the responsibility that goes with them.
Why Data Science automation often fails
Then there is the matter of automation of data science tasks. Although the idea sounds promising, it's probably the most challenging task in a data science pipeline. It's not unfeasible, but it takes a lot of work and a lot of expertise that's usually impossible to find in a single data scientist. Often, you need to combine the work of data engineers, software developers, data scientists, and even data modelers. Since most organizations don't have all that expertise or don't know how to manage it effectively, automation doesn't happen as they envision, resulting in a large part of the data science pipeline needing to be done manually.
The Data Science mindset overall
The data science mindset is the thinking process of the data scientist, the operating system of her mind. Without it, she can't do her work properly, in the large variety of circumstances she may find herself in. It's her mindset that organizes her know-how and helps her find solutions to the complex problems she encounters, whether it is wrangling data, building and testing a model or deploying the model on the cloud. This mindset is her strategy potential, the think tank within, which enables her to make the tough calls she often needs to make for the data science projects to move forward.
Specific aspects of the Data Science mindset
Of course, the data science mindset is more than a general thing. It involves specific components, such as specialized know-how, tools that are compatible with each other and relevant to the task at hand, a deep understanding of the methodologies used in data science work, problem-solving skills, and most importantly, communication abilities. The latter involves both expressing himself clearly and understanding what the stakeholders need and expect of him. Naturally, the data science mindset also includes organizational skills (project management), the ability to work well with other professionals (even those not directly related to data science), and the ability to come up with creative approaches to the problem at hand.
The Data Science process
The data science process/pipeline is a distillation of data science work in a comprehensible manner. It's particularly useful for understanding the various stages of a data science project and planning accordingly. You can view one version of it in Fig. 1 below. If the data science mindset is one's ability to navigate the data science landscape, the data science process is a map of that landscape. It's not 100% accurate but good enough to help you gain perspective if you feel overwhelmed or need to get a better grip on the bigger picture.
Learning more about the topic
Naturally, it's impossible to exhaust this topic in a single article (or even a series of articles). The material I've gathered on it can fill a book! If you are interested in such a book, feel free to check out the one I put together a few years back; it's called Data Science Mindset, Methodologies, and Misconceptions and it's geared towards data scientists, data science learners, and people involved in data science work in some way (e.g. project leaders or data analysts). Check it out when you have a moment. Cheers!
Article | April 13, 2020
The acronym DMaaS can refer to two related but separate things: data center management-as-a-service (referred to here by its other acronym, DCMaaS) and data management-as-a-service. The former looks at infrastructure-level questions such as optimization of data flows in a cloud service; the latter refers to master data management and data preparation as applied to federated cloud services.
DCMaaS has been under development for some years; DMaaS is slightly younger and is a product of the growing interest in machine learning and big data analytics, along with increasing concern over privacy, security, and compliance in a cloud environment.
DMaaS responds to a developing concern over data quality in machine learning, driven by the large amount of data that must be used for training and the inherent dangers posed by divergence in data structure across multiple sources. To use the rapidly growing array of cloud data, including public cloud information and corporate internal information from hybrid clouds, you must aggregate data in a normalized way so it can be available for model training and processing with ML algorithms. As data volumes and data diversity increase, this becomes increasingly difficult.
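To make the idea of aggregating data "in a normalized way" more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the source names (`public_cloud`, `internal`), the field names, and the shared schema are all hypothetical and do not come from any particular DMaaS product.

```python
from typing import Any, Dict, List

# Hypothetical mapping from each source's own field names to one shared schema.
# Real sources would need their own mappings, and usually type and unit rules too.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "public_cloud": {"cust_id": "customer_id", "amt": "amount", "ts": "timestamp"},
    "internal": {"CustomerID": "customer_id", "Amount": "amount", "Time": "timestamp"},
}


def normalize(record: Dict[str, Any], source: str) -> Dict[str, Any]:
    """Rename one record's fields to the shared schema; missing fields become None."""
    mapping = FIELD_MAPS[source]
    row = {target: record.get(src) for src, target in mapping.items()}
    # Coerce the amount to a single numeric type so downstream code sees one format.
    if row["amount"] is not None:
        row["amount"] = float(row["amount"])
    # Keep track of where the row came from, which helps with later quality checks.
    row["source"] = source
    return row


def aggregate(batches: Dict[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Combine records from several sources into one list with a common structure."""
    rows: List[Dict[str, Any]] = []
    for source, records in batches.items():
        rows.extend(normalize(record, source) for record in records)
    return rows


if __name__ == "__main__":
    combined = aggregate({
        "public_cloud": [{"cust_id": "a1", "amt": "12.50", "ts": "2020-04-01"}],
        "internal": [{"CustomerID": "b2", "Amount": 7}],
    })
    for row in combined:
        print(row)
```

Even in this toy form, every new source needs its own mapping and its own handling of missing or oddly typed values, which hints at why the task gets harder as data diversity grows.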
Article | April 9, 2020
Across the world, governments and health authorities are now exploring distinct ways to contain the spread of Covid-19, as the virus has already dispersed across 196 countries in a short time. According to a professor of epidemiology and biostatistics at George Washington University and SAS analytics manager for infectious diseases epidemiology and biostatistics, data, analytics, AI and other technology can play a significant role in helping identify, understand and assist in predicting disease spread and progression.
In its response to the virus, China, where the first case of coronavirus was reported in late December 2019, started utilizing its robust tech sector. The country has specifically deployed AI, data science, and automation technology to track, monitor and defeat the pandemic. Also, tech players in China, such as Alibaba, Baidu, and Huawei, among others, expedited their companies' healthcare initiatives as part of their contribution to combating Covid-19.
Article | February 24, 2020
Emerging technology has the power to transform history and cultural heritage into a living resource. The Time Machine project will digitise archives from museums and libraries, using Artificial Intelligence and Big Data mining, to offer richer interpretations of our past. An inclusive European identity benefits from a deep engagement with the region’s past. The Time Machine project set out to offer this by exploiting already freely accessible Big Data sources. EU support for a preparatory action enabled the development of a decade-long roadmap for the large-scale digitisation of kilometres of archives, from large museum and library collections, into a distributed information system. Artificial Intelligence (AI) will play a key role at each step, from digitisation planning to document interpretation and fact-checking. Once embedded, this infrastructure could create new business and employment opportunities across a range of sectors including ICT, the creative industries and tourism.