Project Tooling & Automations
Poetry : Python packaging
Poetry is a tool that simplifies Python packaging and dependency management. It is used in this project.
Markdown : documentation tooling
This project uses Markdown. According to John Gruber, "Markdown is intended to be as easy-to-read and easy-to-write as is feasible". This is a nice guide to Markdown basic syntax.
Documentation build automation and serving are done with Read the Docs and mkdocs. To set up the documentation, the following steps were followed.
1. Set up your mkdocs project
2. Create the .readthedocs.yaml file by following those instructions
3. Follow the Read the Docs tutorial to link your GitHub project with Read the Docs.
Pylint : static Code Analysis (Linting)
Linting is the process of running static code analysis with the goal of flagging programming errors, bugs, stylistic errors and suspicious constructs [[2]]. An example of a rule enforced by linting in this project is the use of snake_case, which requires that the words in a compound name be lowercase and separated by underscores. The linter used in this project is Pylint. Development environments like Visual Studio Code integrate linting tools like Pylint and automatically highlight issues. Pylint settings are located in the .pylintrc file.
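As a rough illustration of the snake_case rule described above, the sketch below checks names against a simplified pattern. The pattern and the `is_snake_case` helper are assumptions made for this example. They approximate the idea and are not Pylint's actual naming check.

```python
import re

# Simplified pattern for snake_case names: lowercase letters, digits and
# underscores, not starting with a digit. This is an approximation for
# illustration only, not the regular expression Pylint uses internally.
SNAKE_CASE = re.compile(r"^[a-z_][a-z0-9_]*$")


def is_snake_case(name: str) -> bool:
    """Return True if `name` looks like a snake_case identifier."""
    return SNAKE_CASE.match(name) is not None


if __name__ == "__main__":
    for candidate in ("training_report", "trainingReport", "TrainingReport"):
        print(candidate, is_snake_case(candidate))
```

In practice the linter reports such names directly in the editor or on the command line, so a helper like this is not needed in the project itself.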
Black : automatic code formatting
Black is used for automatic code formatting. An automatic formatter rewrites code to enforce a programming style, which ensures a uniform code style and makes code maintenance easier.
Visual Studio Code : development environment
This project was developed using Visual Studio Code and includes a very limited set of workspace settings for activating the tools used to develop it. As of writing this documentation, the following tools are integrated with Visual Studio Code:
* Automatic Testing : Pytest
* Automatic Code Formatting : Black (runs when saving files)
* Linting : Pylint (runs when saving files)
Visual Studio Code workspace settings are located in .vscode/settings.json. Settings can be accessed with ⌘, on Mac and ctrl+, on other OS.
By the way, what is the difference between user settings and workspace settings? From the Visual Studio Code documentation:
* User Settings - Settings that apply globally to any instance of VS Code you open.
* Workspace Settings - Settings stored inside your workspace and only apply when the workspace is opened.
Of course, using Visual Studio Code is neither required nor specifically recommended to work on this project. Use whatever tool you like!
Git : source control
Git is a "free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency"[[9]].
pre-commit : pre-commit verifications
Git can automatically call scripts at every commit (and at other particular events). "Git hook scripts are useful for identifying simple issues before submission to code review" [[8]]. pre-commit is used for managing pre-commit verifications. As of writing this documentation, it automatically:
* Checks YAML files
* Fixes end of files
* Trims trailing whitespace
* Runs automatic code formatting (Black)
* Runs Pytest pre-commit tests
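To make two of these verifications more concrete, here is a minimal sketch of what trimming trailing whitespace and fixing the end of a file amount to. The function names are hypothetical and the logic is a simplified illustration (it ignores Windows line endings, for instance). It is not the code that the pre-commit hooks actually run.

```python
def trim_trailing_whitespace(text: str) -> str:
    """Remove spaces and tabs at the end of every line."""
    return "\n".join(line.rstrip(" \t") for line in text.split("\n"))


def fix_end_of_file(text: str) -> str:
    """Ensure non-empty text ends with exactly one newline."""
    stripped = text.rstrip("\n")
    return stripped + "\n" if stripped else ""


def clean(text: str) -> str:
    """Apply both fixes, in the order a commit hook might run them."""
    return fix_end_of_file(trim_trailing_whitespace(text))


if __name__ == "__main__":
    messy = "x = 1  \ny = 2\t\n\n\n"
    print(repr(clean(messy)))
```

When a hook modifies a file in this way, the commit is typically stopped so that the corrected file can be reviewed and staged again.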
bump2version : versioning automation
bump2version automates project versioning. It is especially useful when the version number appears in multiple locations in a project. It is not yet used in this project as of this writing, but it is planned. This project uses Semantic Versioning 2.0.
See the versioning page for more details.
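The sketch below illustrates how a version number changes under Semantic Versioning when the major, minor or patch part is bumped. The `bump` function is a hypothetical helper written for this example. It handles only plain `MAJOR.MINOR.PATCH` strings and does not reflect how bump2version is implemented or configured.

```python
def bump(version: str, part: str) -> str:
    """Return `version` with the given part incremented, SemVer-style."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Incompatible change: minor and patch are reset to zero.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backward compatible feature: patch is reset to zero.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Backward compatible bug fix.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


if __name__ == "__main__":
    for part in ("major", "minor", "patch"):
        print(part, bump("1.4.2", part))
```

A versioning tool adds value on top of this arithmetic by updating every file where the version string appears in one step.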
GitHub workflow and automations
This project uses GitHub flow as a contribution workflow. A pull request template is implemented in .github/pull_request_template.md
GitHub Actions are used to automate workflows. GitHub Actions workflow files are located in .github/workflows/
CML : continuous machine learning
CML is a tool for continuous integration in ML. Using this tool, a GitHub Action that trains the model, creates a training report and adds it as a pull request comment is defined in .github/workflows/cml.yaml.
* A tutorial on the CML tool from Iterative
* Continuous machine learning explained
DVC : dataset version control, experiment tracking
DVC is a tool for dataset version control, experiment tracking and monitoring. A full page dedicated to DVC is here
Background documentation
Project Structure
The following articles were used as inspiration for this project's folder structure:
* Folder Structure for Machine Learning Projects
* Machine Learning: Models to Production
Refactoring a data science project
YouTube series on refactoring a data science project by arjan_codes.
* Part 1
* Part 2
* Part 3
Learning References
Courses
* Full Stack Deep Learning 2022
* ML Ops Tutorials using iterative.io tools
* ML Ops Guide
Virtual environments
A Python virtual environment is a "self-contained directory tree that contains a Python installation for a particular version of Python" [[1]].
* Official Python Documentation
* Python Virtual Environments Primer by Martin Breuss on RealPython
* Managing Application Dependencies
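As a small illustration of the definition above, the following sketch creates a virtual environment with the standard library `venv` module and checks whether the current interpreter is itself running inside one. The helper names and the `example-env` directory are assumptions made for this example. In this project, environment management would normally be left to the packaging tooling.

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def create_environment(target: Path) -> Path:
    """Create a virtual environment in `target` and return its config file path."""
    # with_pip=False keeps the example fast and avoids bootstrapping pip.
    venv.create(target, with_pip=False)
    return target / "pyvenv.cfg"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config = create_environment(Path(tmp) / "example-env")
        print("environment created:", config.exists())
    print("running inside a virtual environment:", in_virtual_environment())
```

The `pyvenv.cfg` file at the root of the created directory is what marks it as a virtual environment and records which base Python installation it was built from.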
ML Ops Definitions
* By Databricks
* By Arrikto
ML Ops Challenges
* Why Production Machine Learning Fails — And How To Fix It
* The Ultimate Guide: Challenges of Machine Learning Model Deployment
* Model Deployment Challenges: 6 Lessons From 6 ML Engineers
Maturity model in ML Ops
* Three Levels of ML Software
Could be useful but not used in this project
Scikit-Learn Pipelines
- Scikit-Learn Pipelines
- Basic Tutorial
- Advanced Tutorial
- Pipelines & Custom Transformers in scikit-learn
MLFlow : Machine Learning LifeCycle Management
"An open source platform for the machine learning lifecycle", according to how the project describes itself.
* MLFlow
* YouTube Series ML Lifecycle by Isaac Reis
* Getting Started on Databricks Community Edition