# Track Experiments
In the previous part of the Get Started section, we learned how to track and push files to DagsHub using Git and DVC. This part covers how to track your data science experiments and save their parameters and metrics. We assume you already have a project that you want to add experiment tracking to. The example below builds on the result of the last section, but you can adapt it to your own project in a straightforward way.
Video for this tutorial
Prefer to follow along with a video instead of reading? Check out the video for this section below:
## Start From This Part
To start the project from this part, follow the instructions below:

- Fork the hello-world repository.
- Clone the repository and work on the `start-track-experiments` branch using the following command (change the user name):

  ```bash
  git clone -b start-track-experiments https://dagshub.com/<DagsHub-user-name>/hello-world.git
  ```

- Create and activate a virtual environment.
- Install the Python dependencies:

  ```bash
  pip3 install -r requirements.txt
  pip3 install dvc
  ```

- Configure DVC locally and set DagsHub storage as the remote.
- Download the data files using the following command:

  ```bash
  dvc get --rev processed-data https://dagshub.com/nirbarazida/hello-world-files data/
  ```

- Track the `data` directory using DVC and the `data.dvc` file using Git.
- Push the files to the Git and DVC remotes.
Important
To avoid conflicts, work on the `start-track-experiments` branch for the rest of the tutorial.
## Add DagsHub Logger
DagsHub Logger is a plain Python logger for your metrics and parameters. The logger saves the information as human-readable files: CSV for metrics and YAML for parameters. Once you push these files to your DagsHub repository, they will be automatically parsed and visualized in the Experiments tab. For further information, please see the Experiment Tab documentation and the DagsHub Logger repository.
Note
Since DagsHub Experiments uses generic formats, you don't have to use DagsHub Logger. Instead, you can write your metrics and parameters into `metrics.csv` and `params.yml` files however you want, and push them to your DagsHub repository, where they will automatically be scanned and added to the Experiments tab.
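For example, here is a minimal sketch of writing those two files by hand, without the logger. The parameter names and values are illustrative, not taken from a real run:

```python
import csv
import time

# Write one metric to metrics.csv in the format DagsHub parses:
# Name, Value, Timestamp (milliseconds since the Unix epoch), Step
with open('metrics.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Value', 'Timestamp', 'Step'])
    writer.writerow(['roc_auc_score', 0.931, int(time.time() * 1000), 1])

# Write hyperparameters to params.yml as flat "key: value" YAML lines
# (illustrative values)
params = {'model_class': 'RandomForestClassifier', 'n_estimators': 1, 'random_state': 0}
with open('params.yml', 'w') as f:
    for key, value in params.items():
        f.write(f'{key}: {value}\n')
```

After committing and pushing these two files, DagsHub treats them exactly as if the logger had produced them.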
- Start by installing the `dagshub` Python package in the project's virtual environment:

  ```bash
  pip3 install dagshub
  ```
- Next, import `dagshub` in the `modeling.py` module and track the Random Forest Classifier hyperparameters and ROC AUC score. You can copy the code below into your `modeling.py` file:

  ```python
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import roc_auc_score
  import pandas as pd
  from const import *
  import dagshub

  print(M_MOD_INIT, '\n' + M_MOD_LOAD_DATA)

  X_train = pd.read_csv(X_TRAIN_PATH)
  X_test = pd.read_csv(X_TEST_PATH)
  y_train = pd.read_csv(Y_TRAIN_PATH)
  y_test = pd.read_csv(Y_TEST_PATH)

  print(M_MOD_RFC)

  with dagshub.dagshub_logger() as logger:
      rfc = RandomForestClassifier(n_estimators=1, random_state=0)

      # Log the model's parameters
      logger.log_hyperparams(model_class=type(rfc).__name__)
      logger.log_hyperparams({'model': rfc.get_params()})

      # Train the model and predict on the test set
      rfc.fit(X_train, y_train.values.ravel())
      y_pred = rfc.predict(X_test)

      # Log the model's performance
      logger.log_metrics({'roc_auc_score': round(roc_auc_score(y_test, y_pred), 3)})

  print(M_MOD_SCORE, round(roc_auc_score(y_test, y_pred), 3))
  ```
Checkpoint
Check that the current status of your Git tracking matches the following:
```bash
git status -s
 M src/modeling.py
```
- Track and commit the changes with Git:

  ```bash
  git add src/modeling.py
  git commit -m "Add DagsHub Logger to the modeling module"
  ```
## Create a New Experiment
As mentioned above, to create a new experiment we need to update at least one of the `metrics.csv` and `params.yml` files, track them using Git, and push them to the DagsHub repository. After editing the `modeling.py` module, running its script will generate those two files.
- Run the `modeling.py` script:

  ```bash
  python3 src/modeling.py
  ```

  ```
  [DEBUG] Initialize Modeling
  [DEBUG] Loading data sets for modeling
  [DEBUG] Running Random Forest Classifier
  [INFO] Finished modeling with AUC Score: 0.931
  ```

  Then check the Git status:

  ```bash
  git status -s
  ```

  ```
  ?? metrics.csv
  ?? params.yml
  ```
As we can see from the output above, two new files were created, containing the current experiment's information.
The Files' Content

The `metrics.csv` file has four fields:

- Name - the name of the metric.
- Value - the value of the metric.
- Timestamp - the time that the log was written, in milliseconds since the Unix epoch.
- Step - the step number when logging multi-step metrics like loss.
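As a small illustration of the Timestamp field, the sketch below parses a row in the `metrics.csv` format and converts the millisecond timestamp to a readable UTC time:

```python
import csv
import io
from datetime import datetime, timezone

# A sample row in the metrics.csv format (Name, Value, Timestamp, Step)
sample = 'Name,Value,Timestamp,Step\n"roc_auc_score",0.931,1615794229099,1\n'

for row in csv.DictReader(io.StringIO(sample)):
    # The Timestamp field is milliseconds since the Unix epoch
    ts = datetime.fromtimestamp(int(row['Timestamp']) / 1000, tz=timezone.utc)
    print(f"{row['Name']} = {row['Value']} at {ts.isoformat(timespec='seconds')} (step {row['Step']})")
    # → roc_auc_score = 0.931 at 2021-03-15T07:43:49+00:00 (step 1)
```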
The `params.yml` file holds all the hyperparameters of the Random Forest Classifier.

Example of the files' content:
```bash
cat metrics.csv
```

```
Name,Value,Timestamp,Step
"roc_auc_score",0.931,1615794229099,1
```

```bash
cat params.yml
```

```yaml
model:
  bootstrap: true
  ccp_alpha: 0.0
  class_weight: null
  criterion: gini
  max_depth: null
  max_features: auto
  max_leaf_nodes: null
  max_samples: null
  min_impurity_decrease: 0.0
  min_impurity_split: null
  min_samples_leaf: 1
  min_samples_split: 2
  min_weight_fraction_leaf: 0.0
  n_estimators: 1
  n_jobs: null
  oob_score: false
  random_state: 0
  verbose: 0
  warm_start: false
model_class: RandomForestClassifier
```
On Windows, use `type metrics.csv` and `type params.yml` to display the same content.
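If you want to read such values back in code, here is a dependency-free sketch that parses flat `key: value` lines like the top-level entries of `params.yml`. The sample keys below are illustrative; a nested block like `model:` above needs a real YAML parser (e.g. PyYAML's `yaml.safe_load`):

```python
# Illustrative flat key-value content in the style of params.yml
sample = """model_class: RandomForestClassifier
n_estimators: 1
random_state: 0
"""

# Parse each "key: value" line into a dict (values stay strings)
params = {}
for line in sample.splitlines():
    key, sep, value = line.partition(':')
    if sep and value.strip():
        params[key.strip()] = value.strip()

print(params)
# → {'model_class': 'RandomForestClassifier', 'n_estimators': '1', 'random_state': '0'}
```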
- Commit and push the files to our DagsHub repository using Git:

  ```bash
  git add metrics.csv params.yml
  git commit -m "New Experiment - Random Forest Classifier with basic processing"
  git push
  ```
- Let's check the new status of our repository. The two files were added to the repository, and one experiment was created.
- The information about the experiment is displayed under the Experiments tab.
Congratulations - you created your first experiment!

This part covered the experiment tracking workflow. We highly recommend reading the Experiment Tab documentation to explore the various features it has to offer. In the next part, we will learn how to explore a new hypothesis and switch between project versions.