# Track Experiments
In the previous part of the Get Started section, we learned how to track and push files to DagsHub using Git and DVC. This part covers how to track your data science experiments and save their parameters and metrics. We assume you already have a project that you want to add experiment tracking to. The example below builds on the result of the last section, but you can adapt it to your own project in a straightforward way.
Video for this tutorial
Prefer to follow along with a video instead of reading? Check out the video for this section below:
## Start From This Part
To start the project from this part, follow the instructions below:

- Fork the hello-world repository.
- Clone the repository and work on the `start-track-experiments` branch using the following command (change the user name):

  ```bash
  git clone -b start-track-experiments https://dagshub.com/<DagsHub-user-name>/hello-world.git
  ```

- Create and activate a virtual environment.
- Install the Python dependencies:

  ```bash
  pip3 install -r requirements.txt
  pip3 install dvc
  ```

- Configure DVC locally and set DagsHub storage as the remote.
- Download the data files using the following command:

  ```bash
  dvc get --rev processed-data https://dagshub.com/nirbarazida/hello-world-files data/
  ```

- Track the `data` directory using DVC and the `data.dvc` file using Git.
- Push the files to the Git and DVC remotes.
Important
To avoid conflicts, work on the `start-track-experiments` branch for the rest of the tutorial.
## Add DagsHub Logger
DagsHub Logger is a plain Python logger for your metrics and parameters. The logger saves the information as human-readable files: CSV for metrics and YAML for parameters. Once you push these files to your DagsHub repository, they will be automatically parsed and visualized in the Experiments tab. For further information, please see the Experiment Tab documentation and the DagsHub Logger repository.
Note
Since DagsHub Experiments uses generic formats, you don't have to use DagsHub Logger. Instead, you can write your metrics and parameters into `metrics.csv` and `params.yml` files however you want, and push them to your DagsHub repository, where they will automatically be scanned and added to the Experiments tab.
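For example, here is a minimal sketch of writing those two files by hand, without the logger. The parameter names and values are illustrative, not taken from a real run:

```python
import csv
import time

# Write one metric to metrics.csv in the format DagsHub parses:
# Name, Value, Timestamp (milliseconds since the Unix epoch), Step
with open('metrics.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Value', 'Timestamp', 'Step'])
    writer.writerow(['roc_auc_score', 0.931, int(time.time() * 1000), 1])

# Write hyperparameters to params.yml as flat "key: value" YAML lines
# (illustrative values)
params = {'model_class': 'RandomForestClassifier', 'n_estimators': 1, 'random_state': 0}
with open('params.yml', 'w') as f:
    for key, value in params.items():
        f.write(f'{key}: {value}\n')
```

After committing and pushing these two files, DagsHub treats them exactly as if the logger had produced them.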
- Start by installing the `dagshub` Python package in the project's virtual environment:

  ```bash
  pip3 install dagshub
  ```
- Next, import `dagshub` in the `modeling.py` module and track the Random Forest Classifier hyperparameters and ROC AUC score. You can copy the code below into your `modeling.py` file:

  ```python
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.metrics import roc_auc_score
  import pandas as pd
  from const import *
  import dagshub

  print(M_MOD_INIT, '\n' + M_MOD_LOAD_DATA)

  X_train = pd.read_csv(X_TRAIN_PATH)
  X_test = pd.read_csv(X_TEST_PATH)
  y_train = pd.read_csv(Y_TRAIN_PATH)
  y_test = pd.read_csv(Y_TEST_PATH)

  print(M_MOD_RFC)

  with dagshub.dagshub_logger() as logger:
      rfc = RandomForestClassifier(n_estimators=1, random_state=0)

      # Log the model's parameters
      logger.log_hyperparams(model_class=type(rfc).__name__)
      logger.log_hyperparams({'model': rfc.get_params()})

      # Train the model and predict on the test set
      rfc.fit(X_train, y_train.values.ravel())
      y_pred = rfc.predict(X_test)

      # Log the model's performance
      logger.log_metrics({'roc_auc_score': round(roc_auc_score(y_test, y_pred), 3)})

  print(M_MOD_SCORE, round(roc_auc_score(y_test, y_pred), 3))
  ```
Checkpoint
Check that the current status of your Git tracking matches the following:
```bash
git status -s
 M src/modeling.py
```
- Track and commit the changes with Git:

  ```bash
  git add src/modeling.py
  git commit -m "Add DagsHub Logger to the modeling module"
  ```
## Create a New Experiment
As mentioned above, to create a new experiment we need to update at least one of the `metrics.csv` and `params.yml` files, track them using Git, and push them to the DagsHub repository. After editing the `modeling.py` module, running its script will generate those two files.
- Run the `modeling.py` script:

  ```bash
  python3 src/modeling.py
  ```

  ```
  [DEBUG] Initialize Modeling
  [DEBUG] Loading data sets for modeling
  [DEBUG] Running Random Forest Classifier
  [INFO] Finished modeling with AUC Score: 0.931
  ```

  Then check the Git status:

  ```bash
  git status -s
  ```

  ```
  ?? metrics.csv
  ?? params.yml
  ```
As we can see from the output above, two new files were created, containing the current experiment's information.
The Files' Content

The `metrics.csv` file has four fields:

- Name - the name of the metric.
- Value - the value of the metric.
- Timestamp - the time that the log was written, in milliseconds since the Unix epoch.
- Step - the step number when logging multi-step metrics like loss.
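As a small illustration of the Timestamp field, the sketch below parses a row in the `metrics.csv` format and converts the millisecond timestamp to a readable UTC time:

```python
import csv
import io
from datetime import datetime, timezone

# A sample row in the metrics.csv format (Name, Value, Timestamp, Step)
sample = 'Name,Value,Timestamp,Step\n"roc_auc_score",0.931,1615794229099,1\n'

for row in csv.DictReader(io.StringIO(sample)):
    # The Timestamp field is milliseconds since the Unix epoch
    ts = datetime.fromtimestamp(int(row['Timestamp']) / 1000, tz=timezone.utc)
    print(f"{row['Name']} = {row['Value']} at {ts.isoformat(timespec='seconds')} (step {row['Step']})")
    # → roc_auc_score = 0.931 at 2021-03-15T07:43:49+00:00 (step 1)
```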
The `params.yml` file holds all the hyperparameters of the Random Forest Classifier.

Example of the files' content:
```bash
cat metrics.csv
```

```
Name,Value,Timestamp,Step
"roc_auc_score",0.931,1615794229099,1
```

```bash
cat params.yml
```

```yaml
model:
  bootstrap: true
  ccp_alpha: 0.0
  class_weight: null
  criterion: gini
  max_depth: null
  max_features: auto
  max_leaf_nodes: null
  max_samples: null
  min_impurity_decrease: 0.0
  min_impurity_split: null
  min_samples_leaf: 1
  min_samples_split: 2
  min_weight_fraction_leaf: 0.0
  n_estimators: 1
  n_jobs: null
  oob_score: false
  random_state: 0
  verbose: 0
  warm_start: false
model_class: RandomForestClassifier
```
On Windows, use `type metrics.csv` and `type params.yml` to display the same content.
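If you want to read such values back in code, here is a dependency-free sketch that parses flat `key: value` lines like the top-level entries of `params.yml`. The sample keys below are illustrative; a nested block like `model:` above needs a real YAML parser (e.g. PyYAML's `yaml.safe_load`):

```python
# Illustrative flat key-value content in the style of params.yml
sample = """model_class: RandomForestClassifier
n_estimators: 1
random_state: 0
"""

# Parse each "key: value" line into a dict (values stay strings)
params = {}
for line in sample.splitlines():
    key, sep, value = line.partition(':')
    if sep and value.strip():
        params[key.strip()] = value.strip()

print(params)
# → {'model_class': 'RandomForestClassifier', 'n_estimators': '1', 'random_state': '0'}
```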
- Commit and push the files to our DagsHub repository using Git:

  ```bash
  git add metrics.csv params.yml
  git commit -m "New Experiment - Random Forest Classifier with basic processing"
  git push
  ```
- Let's check the new status of our repository. The two files were added to the repository, and one experiment was created.
- The information about the experiment is displayed under the Experiments tab.
Congratulations - you created your first experiment!

This part covered the experiment tracking workflow. We highly recommend reading the Experiment Tab documentation to explore the various features it has to offer. In the next part, we will learn how to explore a new hypothesis and switch between project versions.