Metaflow: A boon for data scientists!

Mohita Parashar
Jul 6, 2020 · 7 min read
A data scientist’s condition

Soumya is a data scientist. She has to solve important business problems. To start, somebody provides her with a business metric that needs to be optimized. So, Soumya will:

1. Spin up a notebook, gather data, slice and dice it, crunch the numbers, and figure out various ways to optimize the business metric.

2. Use packages like TensorFlow and write her own training code.

3. All of this code needs to run on some data set, so she will write some ETL to feed the data to her model.

4. Production workflows should run on servers, not on a laptop. So she will schedule these models somewhere, such as the cloud.

5. These days, servers execute containers. Workflows need to be containerized for execution.

6. Think about where the results should go. Sometimes they are deployed as containers too.

7. Finally, the containerized models can be consumed by business applications.

The stakeholders evaluate the results and come back with more demands: more models, better features, and so on, requiring Soumya to repeat the entire process again.

So how much time did Soumya actually spend on data science? Very little! Most of her time is spent writing ETL, figuring out where to get the data from, and scheduling the compute, and if anything fails, it's very difficult to triage what happened!

Data science is a very iterative process, and it's almost guaranteed that you won't succeed on your very first attempt. Soumya will be working with a plethora of models, so she needs to keep track of the different experiments she has launched and monitor which ones are working and which ones are not. So experiment management and tracking is yet another job on her plate.

That’s a lot of work for a single data scientist!!

A general ML stack

That's when Metaflow comes in. Metaflow provides a unified abstraction over this ML stack, so that Soumya can be more of a data scientist and less of a systems engineer.

Metaflow provides a unified API to the infrastructure stack that is required to execute data science projects, from prototype to production.

Let’s see how that happens.

At the end of the day, this is what any ML code looks like: a graph of data and compute steps feeding into one another.

Metaflow allows you to declare your computation as a Directed Acyclic Graph (DAG). The graph of operations is called a flow. You define the operations, called steps. Steps become nodes of the graph and contain transitions to the next steps, which serve as edges. You can construct the graph using an arbitrary combination of the following three types of transitions supported by Metaflow:

1. Linear

The most basic type of transition is a linear transition. It moves from one step to the next.

Consider a graph with two linear transitions: start → a → end.

The corresponding Metaflow script looks like this:
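(A sketch along the lines of the LinearFlow example in the Metaflow docs; the artifact value and print messages are illustrative.)

from metaflow import FlowSpec, step

class LinearFlow(FlowSpec):

    @step
    def start(self):
        # Create a data artifact by assigning to an instance variable.
        self.my_var = 'hello world'
        self.next(self.a)

    @step
    def a(self):
        print('the data artifact is: %s' % self.my_var)
        self.next(self.end)

    @step
    def end(self):
        print('the data artifact is still: %s' % self.my_var)

if __name__ == '__main__':
    LinearFlow()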

This flow creates a data artifact called my_var. In Metaflow, data artifacts are created simply by assigning values to instance variables like my_var.

2. Branch

You can express parallel steps with a branch. For example, a start step can fan out into two steps, a and b, which are later joined.

Any number of parallel steps are allowed. A benefit of a branch like this is performance: Metaflow can execute a and b over multiple CPU cores or over multiple instances in the cloud.
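A sketch along the lines of the BranchFlow example in the Metaflow docs (the artifact values are illustrative):

from metaflow import FlowSpec, step

class BranchFlow(FlowSpec):

    @step
    def start(self):
        # Fan out into two parallel steps, a and b.
        self.next(self.a, self.b)

    @step
    def a(self):
        self.x = 1
        self.next(self.join)

    @step
    def b(self):
        self.x = 2
        self.next(self.join)

    @step
    def join(self, inputs):
        # The join step receives the results of both branches.
        print('a computed %s' % inputs.a.x)
        print('b computed %s' % inputs.b.x)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    BranchFlow()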

3. Foreach

The above transition is useful only if you already know the branches at development time. But what if you want to branch dynamically based on the data? This is the use case for a foreach branch.

Foreach works like the branch above, but instead of creating named step methods, many parallel copies of the steps inside the foreach are executed.

A foreach loop can iterate over any list, like the titles list below.

Steps inside a foreach loop create separate tasks to process each item of the list. Here, Metaflow creates three parallel tasks for the step a to process the three items of the titles list in parallel.
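A sketch following the ForeachFlow example in the Metaflow docs (the titles and the processing done in step a are illustrative):

from metaflow import FlowSpec, step

class ForeachFlow(FlowSpec):

    @step
    def start(self):
        self.titles = ['Stranger Things', 'House of Cards', 'Narcos']
        # Launch one parallel task of step a per item in titles.
        self.next(self.a, foreach='titles')

    @step
    def a(self):
        # self.input holds the item assigned to this particular task.
        self.title = '%s processed' % self.input
        self.next(self.join)

    @step
    def join(self, inputs):
        # Collect the results from all parallel tasks.
        self.results = [inp.title for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        print('\n'.join(self.results))

if __name__ == '__main__':
    ForeachFlow()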

Features:

1. Automatic State Transfer:

In tools like Airflow, the nodes are isolated units of computation. State transfer is left up to the users, leaving them to figure out what to pass to the other containers, what to store, and what not to.

That's exactly what Metaflow takes care of. Every time a step finishes, Metaflow snapshots the entire state and transfers it to all subsequent steps. So if something fails at any point, we can come back and check what the state of the workflow was at that stage.

2. Resuming a Run:

The fact is, storage is cheap, so when things fail you want access to the full history of the run; at times like that, you wish you had recorded the state of the entire world.

As a result of state transfer, if a run failed (or was stopped intentionally) and you fixed the error in your code at a particular stage, you can restart the workflow from the stage where it failed or stopped.
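For example, assuming the LinearFlow script above failed midway and the bug has been fixed, the run can be resumed from where it left off (the step name in the second command is illustrative):

$ python LinearFlow.py resume
$ python LinearFlow.py resume a    # resume from a specific step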

3. Versioning:

Every time you execute a workflow, it runs under a namespace tied to that user and is allocated a run id, all of which is exposed through a Client API.

Metaflow persists all runs and all the data artifacts they produce. Every run gets a unique run ID, e.g. HelloFlow/546, which can be used to refer to a specific set of results. You can access these results with the Client API.
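A sketch of the Client API, assuming the LinearFlow above has been run at least once (the run id in the comment is illustrative):

from metaflow import Flow, Run

# Fetch the most recent run of LinearFlow and read its artifacts.
run = Flow('LinearFlow').latest_run
print(run.id)              # e.g. '546'
print(run.data.my_var)     # 'hello world', created in the start step

# A specific run can also be addressed directly by its id:
# run = Run('LinearFlow/546')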

4. Real-time Monitoring

You can monitor the status of your workflows from a Jupyter notebook through the same Client API.

5. Vertical Scalability

In this world of the cloud, we have incredible resources at our fingertips. But our laptops have limited capacity, and more often than not the data sets run into petabytes and the training needs 8 GPUs. What do you do then?

You wish you could package that piece of code, deploy it, and run it on machines with the right resources. But this is difficult even for an engineer, let alone a data scientist!

Metaflow relies on decorators that let us specify the resources we need. Your function then gets packaged and runs inside a Docker container on an instance that has those resources.
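A sketch of how that looks with the @resources decorator (the flow name and resource numbers are illustrative):

from metaflow import FlowSpec, step, resources

class TrainFlow(FlowSpec):

    # Request a beefy instance for the training step.
    @resources(memory=64000, cpu=16, gpu=1)
    @step
    def start(self):
        # heavy training code would run here
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    TrainFlow()

When the flow is run with --with batch, this step is executed in a container on AWS Batch with the requested resources:

$ python TrainFlow.py run --with batch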

6. Horizontal Scalability:

Users can parallelize tasks using foreach, as shown earlier.
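For instance, the ForeachFlow above can fan its parallel tasks out to AWS Batch (assuming Batch has been configured) while capping the number of concurrent tasks:

$ python ForeachFlow.py run --with batch --max-workers 16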

7. Dependency Management / Reproducibility:

Reproducibility is a core value of Machine Learning Infrastructure. It is hard to collaborate on data science projects effectively without being able to reproduce past results reliably. Metaflow tries to solve several questions related to reproducible research, principal among them dependency management: how can you, the data scientist, specify the libraries that your code needs so that the results are reproducible?

Metaflow aims to solve both questions at once: how can we handle dependencies so that the results are reproducible? Specifically, it addresses the following three issues:

  1. How to make external dependencies available locally during development?
  2. How to execute code remotely on AWS Batch with external dependencies?
  3. How to ensure that anyone can reproduce past results even months later?

Metaflow provides an execution environment context, --environment=conda, which runs every step in a separate environment that only contains dependencies that are explicitly listed as requirements for that step. The solution relies on Conda, a language-agnostic open-source package manager by the authors of Numpy.

$ python LinearFlow.py --environment=conda run
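Individual steps declare their dependencies with the @conda and @conda_base decorators; a sketch (the flow name and library versions are illustrative), run with the same --environment=conda flag as above:

from metaflow import FlowSpec, step, conda, conda_base

@conda_base(python='3.7.4')
class CondaFlow(FlowSpec):

    # Only the libraries listed here are available inside this step.
    @conda(libraries={'pandas': '1.0.5'})
    @step
    def start(self):
        import pandas as pd
        print('pandas version:', pd.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    CondaFlow()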

8. Sandbox:

Metaflow comes with built-in integrations to various services on AWS. This seamless integration with the cloud is a key benefit of Metaflow.

Downsides:

  1. Tightly coupled with AWS.
  2. Not supported on Windows.

And that’s all! While Metaflow seems to be a step in the right direction, there’s still a long way to go until it is ready for universal use. Thank you for reading!

References:
1. Metaflow documentation: https://docs.metaflow.org/
2. "More Data Science, Less Engineering with Netflix's Metaflow" by Savin Goyal
