My 2 cents' worth after reviewing an academic project

Posted on Mon 21 December 2020 in posts

Background

We recently completed an R&D project with a prominent university. Although the results of the project were insightful and possibly applicable to our organization, the workflow that the academic R&D team used seemed inadequate.

In this post, I wish to highlight some simple steps that can assist in running Data Science (DS) projects (from initiation through deployment), especially for DS teams working without the support of specialized tools.

I'm not going to cover the Project Management aspects, but rather some tools and tips for any DS project.

Setup

  1. Using a template for project structure.

    Although this team included MSc and PhD students who were running multiple collaborative projects, they had no convention for structuring a DS project. While reviewing the deliverables, we needed to contact several team members just to locate a specific file. Working with a known template allows all team members to save files in designated folders and to readily locate files in any project.
    The specific template used is not really the issue - we use this Data Science template, which is sometimes "overkill" for simple projects, but normally most of the structure is used.
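A shared skeleton can even be created programmatically so every project starts the same way. The sketch below is a minimal, hypothetical layout (the folder names are illustrative assumptions, not our actual template):

```python
from pathlib import Path

# Hypothetical minimal layout -- the real folder names depend on the
# template your team adopts.
FOLDERS = [
    "data/raw",        # immutable input data
    "data/processed",  # cleaned/derived data
    "notebooks",       # analysis notebooks, one per pipeline step
    "src",             # shared .py modules extracted from notebooks
    "models",          # serialized models
    "reports",         # figures and write-ups
]

def scaffold(root: str) -> Path:
    """Create the agreed project skeleton under `root`."""
    root_path = Path(root)
    for folder in FOLDERS:
        (root_path / folder).mkdir(parents=True, exist_ok=True)
    return root_path
```

With a script like this, "where does this file go?" has a single, obvious answer from day one.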

  2. Using version control (!@$#%)

Yes - still in the year 2020 - teams run projects without a version control system! The entire project was offline, so the team thought they did not need one. How difficult is it to set up a local git server? Although we never had a failing disk, we did lose a specific file that somehow went missing...

Running

  1. CI/CD (low-tech solution)
    This one is a bit trickier. CI/CD is a "must" these days for companies shipping a product, but what about a Data Science team? This is even more challenging when using Jupyter Notebooks, which are not "git friendly".
    Recently our team decided on a simple CI/CD routine for our notebooks: restart the kernel and run all cells. This allows anyone to pick up a notebook and know that whatever is inside it runs without errors.
    We supplement this solution with the following procedures:

    • Moving functions into separate .py files, leaving the notebook clean and more readable.
    • Treating each notebook as a single step in the analysis pipeline.
    • Complementing a set of notebooks with a README file describing the general process and, specifically, the data input/output files.

    Once the project is mature, you can upgrade the pipeline into a designated framework such as Dagster.
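The "restart and run all" check can be automated with a short script. This is a sketch, not our exact tooling: it assumes notebooks live under a `notebooks/` folder and shells out to `jupyter nbconvert --execute`, which runs every cell in a fresh kernel and fails loudly on the first error.

```python
import subprocess
from pathlib import Path

def find_notebooks(root):
    """Collect every notebook in the project, skipping checkpoint copies."""
    return sorted(
        p for p in Path(root).rglob("*.ipynb")
        if ".ipynb_checkpoints" not in p.parts
    )

def run_notebook(path):
    """Execute a notebook in place (fresh kernel, all cells)."""
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook",
         "--execute", "--inplace", str(path)],
        check=True,  # raise if any cell errors out
    )

if __name__ == "__main__":
    for nb in find_notebooks("notebooks"):
        print(f"Running {nb} ...")
        run_notebook(nb)
```

Run nightly (or before every merge), this gives a notebook-based project most of the value of a CI pipeline with almost no infrastructure.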

  2. Monitoring experiments
    As scientists, experimentation and failure are part of our daily life. Working systematically gives confidence in the results and makes the science reproducible. Stating that "we checked the various parameters and these values were the best" is not best practice unless those checks can be easily reviewed and reproduced.
    Running print statements without a central logging module is also very problematic. Being able to run the exact same code and get comparable logs is very valuable for understanding how the project runs.
    In recent years, many platforms/frameworks have been developed for managing ML projects. We settled on MLflow, which is easy to install and use even in an offline environment.

  3. Code design
    When code that transforms dataframes is full of iterrows loops, that is a serious code smell. Iterating row by row instead of using vectorized computation is a serious efficiency problem and most likely reflects a misunderstanding of the Python and Pandas paradigms.
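To make the smell concrete, here is a toy example (the column names are invented for illustration) contrasting an iterrows loop with the equivalent one-line vectorized expression, which is typically orders of magnitude faster on real-sized data:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Code smell: row-by-row iteration over the DataFrame.
totals = []
for _, row in df.iterrows():
    totals.append(row["price"] * row["qty"])
df["total_slow"] = totals

# Idiomatic: a single vectorized expression over whole columns.
df["total"] = df["price"] * df["qty"]

assert df["total"].tolist() == df["total_slow"].tolist()
```

The vectorized version is not only faster - it is also shorter and states the intent ("multiply these two columns") directly.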

Summary

There are many constraints when running a project. However, some minimal infrastructure can get you a long way. Working without any guidelines will normally lead to chaos and inefficiency, while lowering the quality of both the science and the project.
Today, MLOps and DataOps tools and guidelines are constantly being developed, so I'm sure we will see ease of use and improvements in the coming years.