Automating the Data Scientist

May 15

Talking about automating one's job is a difficult exercise. People who need to have the job done but don't know how (or don't have enough resources) are bound to like the message. People who work on this automation can only approve. But many of those who do that job for a living will disagree. As I said in a previous post, my GigaOM piece on Data Scientists putting themselves out of business received mixed reactions. This is a complex topic, so let me add some more details to explain the perspective and context.

What is a Data Scientist?

The role of a Data Scientist hasn't been standardized, so everyone has their own definition. Also, the duties vary across companies of different sizes operating in different industries and structured in different ways. When I read about what people who have a "Data Scientist" title do, I often have to revisit my own conception of what a DS is.

So instead of giving a definition of the role, let me try to characterize the abilities of the people behind the title. To me, a Data Scientist is someone who has business acumen, technical skills, and who knows Machine Learning. The last point is key. Those who do not have Machine Learning expertise are Data Engineers, Data Analysts, Data Artisans, Data Somethingelse. Therefore, this discussion is focused on the Machine Learning component of the Data Scientist's role.

Apparently, people were shocked to hear me say that "most of a Data Scientist’s time is spent creating predictive models." Based on what I just said, applying Machine Learning technology to data should be an important part of the job. The fact that you're combining this with technical skills and business acumen implies that you're working on that data and that you're creating value from the model. "Creating predictive models" encompasses collecting, cleaning, wrangling data that is used to build the model, and putting the model to good use.

I totally agree with John Foreman when he says that predictive modeling is a means, not an end. This is something that I emphasize in my book (Chapter 5. Applying ML to your domain > Specifying the right problem > Making predictions... so what?): don't build a model before you have an idea of how it will be of value and before you know how to measure its impact. Another very good point that Foreman makes is that "complexity must be justified". Actually, I like simplicity so much that I'm talking about it in my bio. As a side note, one of the things that's cool with tools like BigML is that they allow you to automate the creation of predictive models but also to choose the degree of complexity of your model.

Where we are today

There are three points I want to make about automating the Data Scientist:

When you're in the business of making computer programs, for data-related things or anything else, it's your job to try to automate yourself. Succeeding at this is a different thing. When I did my ML thesis for instance, I wasn't able to fully automate the selection of hyperparameters of my Gaussian Process bandit algorithm, but I still tried and experimented.
People in the fields of Data Science and Machine Learning are actively working on automating parts of their jobs and they've come up with new predictive modeling solutions that are being used directly by domain experts: Ayasdi, BeyondCore, Baynote, Apigee, Emerald Logic, Emcien, BigML, PredictiveDB, Google, Wise.io, Ersatz Labs, Xyggy, PredicSis and many others. Besides, a couple of days ago I came across a research project at Cambridge University led by Zoubin Ghahramani and entitled "The Automatic Statistician", which has won a $750,000 Google Focused Research Award.
Some of these solutions are actually being used in production. There aren't many public benchmarks to assess their performance, but for now, you can check out how BigML compares to popular algorithms in Weka, how Wise.io compares to various random forest implementations, and how Emerald Logic's FACET advances medical research.

A post on automating the Data Scientist wouldn't have made much sense three years ago because the solutions I mention didn't exist or weren't used then, and there wasn't a trend yet. If you are sceptic about what they do, I urge you to set up a free account with BigML (which today is the most accessible of these solutions) and to give it a try — if only for intellectual curiosity. Then, take a minute to consider the benefits that a service like this can bring.

Plenty of companies, problems and data

As for any technology-related activity, people's outlook on Data Science can vary quite much depending on the type of company they work for. In large companies that have been doing Data Science for a while and where people use state-of-the-art solutions that they spent countless hours to set up, you don't see things changing anytime soon.

Then there are businesses who need to stay at the forefront or even to advance the state of the art if they want to stay ahead of the competition. Netflix or Spotify are not going to use automatic solutions like DirectedEdge's to power their movie or music recommendations. They are, and will likely always be, building their own recommender systems. Right now, they are tackling the next big challenge in this area: looking at and understanding the actual content of items before making recommendations to users.

But state-of-the-art doesn't concern everyone. There are businesses that can shine with the simplest application of Machine Learning / predictive analytics, because they use it in a domain where no one else knows about it, or in a way that no one has thought of.

“You don’t need to predict accurately to get great value. Organizations, by making predictions per person that are a good bit better than guessing, actually play the numbers game better.” — Eric Siegel

In a startup or in a small business, you can now do the same thing a Data Scientist would, without knowing ML algorithms. All you need is business acumen, technical skills, and Prediction APIs to remove Machine Learning's barrier to entry. As Dan Putler of Alteryx puts it: "Data artisans allow businesses to do deeper, more predictive analytics than they would otherwise be able to do." He is also one to predict that Data Artisans will partially take over the functions of Data Scientists in the near future.

Is that a threat to Data Scientists? Not necessarily. According to Josh Wills, Senior Director of Data Science at Cloudera, there's no reason to worry about their future: "There is plenty of data and plenty of problems for everybody to work on."

Still hard work

Remember that, even though some of the learning techniques are automated, you still have to prepare the data to learn from (feature engineering, data extraction, cleaning, etc.) and to deliver value with whatever predictive model you come up with. If it was that easy, I wouldn't have written a book on the topic!

Here is a workflow for applying Machine Learning in your business (or in your application):

Only three of these things are being automated (those in black). This means that it still takes lots of efforts to transform data into value, but that Data Scientists can now focus on more meaningful activities. If revenue is on the line, they can later work at making fancier models that improve accuracy, but only after all the rest has been done.

For those who work in smaller companies, you can now "bootstrap" Machine Learning to improve your business. Even though there's hard work ahead of you, it's manageable and it's considerably less than if you'd also had to learn how to use Machine Learning algorithms to create predictive models. Most often, these algorithms are the barrier to entry to Data Science, but they are not what represents most of the work: it's rather in what happens before and after the algorithms are used, and you're likely to already have the right competencies in your team to work on that.

Enjoyed this article? Vote on Hacker News and follow me!