Data science professionals also often do not realise that the subject requires continuous learning and application. “Things are changing so rapidly in this field that what is state-of-the-art today will not be so a month later,” says Parul Pandey, data science evangelist at H2O.ai, an open-source AI company.
Platforms like Kaggle and HackerEarth are some of the best places to understand the latest developments. Hackathons hosted on Kaggle help data professionals to collaborate with others globally. “The insights and learnings that come with it are invaluable. We have to look at what is happening in the research world, what is happening in competitions, and which are the latest technologies,” says Pandey.
A data scientist’s job is a unique combination of domain expertise, analytical capability and programming experience. Getting such candidates has been a bit of a challenge for companies.
Parul Pandey, data science evangelist, H2O.ai
HackerEarth’s data science offerings include a practice component, where individual developers can sign up and access plenty of free content, and build, test and run models. “Post the training, there are options for self assessment by attending challenges, where you get to compete with other data scientists,” says Vishwastam Shukla, CTO at HackerEarth. More than 10% of HackerEarth’s 5-million-plus community of developers are into data science.
The quality of professionals required is rising. The 2020 State of Data Science report by Anaconda, an open-source distribution of Python and R, predicts that larger organisations will establish data science centres of excellence to maximise the business impact from data science and cross-trained professionals.
People are starting to understand the real skills and real value that a data scientist brings. So the contours of data science jobs are getting well-defined. Because of that, you see a lot of maturity coming into these candidates, as well as the overall system.
Vishwastam Shukla, CTO, HackerEarth
However, the daily grind of a data scientist will continue. The Anaconda report, which surveyed professionals from 15 domains ranging from finance to healthcare, says that data scientists spend the largest share of their time (26%) cleaning data. The first step in any data science pipeline, Pandey says, is to understand the dataset before you start predicting from it. Since the data is drawn from multiple sources, you don’t know everything it contains or whether it is clean. So you need to explore the data to ensure there’s no bias. Visualisation libraries like Plotly and Bokeh, and tools like Tableau and Power BI, are used to understand data by visualising it. Data scientists spend around 21% of their time on visualisation.
Such data exploration requires domain expertise. When dealing with a healthcare dataset, only a healthcare professional will be able to tell why there’s a particular pattern. A pure data scientist cannot. This is why data science becomes a field for everybody. “Many now are moving from their domain specific jobs to a data analytics sort of job, which has some programming also involved,” says Pandey.
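The kind of first-pass exploration described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the records, the field names (`age`, `city`, `outcome`) and the `profile` helper are hypothetical, and a real project would more likely use a dataframe or visualisation library. The sketch counts missing values, spots exact duplicate rows, and reports how the target labels are distributed, since a heavy skew can be one sign of bias.

```python
from collections import Counter

# Hypothetical records merged from several sources; field names are made up.
records = [
    {"age": 34, "city": "Pune", "outcome": "yes"},
    {"age": None, "city": "Delhi", "outcome": "no"},
    {"age": 34, "city": "Pune", "outcome": "yes"},  # exact duplicate of row 1
    {"age": 51, "city": "", "outcome": "no"},
    {"age": 29, "city": "Mumbai", "outcome": "no"},
]


def is_missing(value):
    """Treat None and empty strings as missing (an assumption of this sketch)."""
    return value is None or value == ""


def profile(rows, target):
    """Return simple cleanliness checks for a list of dict records."""
    # Count missing values per field.
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if is_missing(value):
                missing[field] += 1

    # A row is a duplicate when every field matches an earlier row.
    seen = Counter(
        tuple(sorted(row.items(), key=lambda pair: pair[0])) for row in rows
    )
    duplicates = sum(count - 1 for count in seen.values())

    # Share of each target label across all rows.
    labels = Counter(row[target] for row in rows)
    balance = {label: count / len(rows) for label, count in labels.items()}

    return {"missing": dict(missing), "duplicates": duplicates, "balance": balance}


report = profile(records, target="outcome")
print(report)
```

Checks like these only flag where to look. Deciding whether a gap or a skew actually matters is where the domain expertise comes in.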
After everything is visualised and the data is cleaned, it is fed into libraries like TensorFlow and PyTorch to make predictions.