Version Code and Data¶
In the previous part of the Get Started section, we created and configured a DAGsHub repository. In this part, we will download and add a project to our local directory, track the files using DVC and Git, and push the files to the remotes.
Start From This Part
To start the project from this part, please follow the instructions below.
- Fork the hello-world repository.
- Clone the repository and work on the start-version-project branch using the following command:
git clone -b start-version-project https://dagshub.com/<DAGsHub-user-name>/hello-world.git
- Create and activate a virtual environment.
- Install and initialize DVC
- Configure DVC locally and set DAGsHub storage as the remote.
Check that the current DVC configuration matches the following:
To avoid conflicts, work on the start-version-project branch for the rest of the tutorial.
Add a Project¶
At this point, we want to add the required files for our ML project to the local directory. We will use the dvc get command, which downloads files from a Git repository or DVC storage without tracking them.
Run the following commands from your CLI:
dvc get https://dagshub.com/nirbarazida/hello-world-files requirements.txt
dvc get https://dagshub.com/nirbarazida/hello-world-files src
dvc get https://dagshub.com/nirbarazida/hello-world-files data/
This project is a simple 'Ham or Spam' classifier for emails using the Enron data set.
tree -I venv
.
├── data
│   └── enron.csv
├── requirements.txt
└── src
    ├── const.py
    ├── data_preprocessing.py
    └── modeling.py

2 directories, 5 files
- src directory - Holds the data-preprocessing, modeling, and const files:
  - data_preprocessing.py - Processes the raw data, splits it into train and test sets, and saves them to the data directory.
  - modeling.py - Trains a simple Random Forest model.
  - const.py - Holds the project's constants.
- data directory - Contains the raw data file, enron.csv.
- requirements.txt - The Python dependencies required to run the Python files.
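To confirm that the downloads landed where the listing above expects them, a short check along these lines may help. This is a minimal sketch: the helper name missing_project_files is hypothetical and not part of the project, and the file list is simply copied from the tree output.

```python
from pathlib import Path

# Paths copied from the tree listing above.
EXPECTED_FILES = [
    "data/enron.csv",
    "requirements.txt",
    "src/const.py",
    "src/data_preprocessing.py",
    "src/modeling.py",
]


def missing_project_files(root="."):
    """Return the expected project files that are not present under root."""
    root_path = Path(root)
    return [name for name in EXPECTED_FILES if not (root_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_project_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected project files are present.")
```

Run it from the project's root directory; an empty result means all five files are in place.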
We will use the requirements.txt file to install the project's dependencies in the virtual environment. Make sure that the virtual environment is activated.
Run the following command from your CLI:
pip3 install -r requirements.txt
Check that the current status of your Git tracking matches the following:
git status -s
?? data/
?? requirements.txt
?? src/
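If you prefer to check the status programmatically, the short format is easy to parse: each line is a two-character status code, a space, and a path, with ?? marking untracked entries. The sketch below assumes plain paths without spaces or quoting, and the function name untracked_paths is made up for illustration.

```python
def untracked_paths(short_status):
    """Return the paths marked untracked ('??') in `git status -s` output."""
    paths = []
    for line in short_status.splitlines():
        # Each line is a two-character status code, a space, then the path.
        if line.startswith("?? "):
            paths.append(line[3:])
    return paths


# The status shown above, written out as a string for illustration.
sample = "?? data/\n?? requirements.txt\n?? src/\n"
expected = ["data/", "requirements.txt", "src/"]
print(untracked_paths(sample) == expected)
```

To check a real repository, the text could come from subprocess.run(["git", "status", "-s"], capture_output=True, text=True).stdout instead of the sample string.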
Track Files Using Git and DVC¶
At this point, we need to decide which files will be tracked by Git and which will be tracked by DVC. We will start with files tracked by DVC because this action will generate new files tracked by Git.
Track Files with DVC¶
The data directory contains the data sets for this project, which are quite big. Thus, we will track this directory using DVC and use Git to track the rest of the project's files.
Add the data directory to DVC tracking
dvc add data
100% Add|███████████████████████████████████████████████████████████████|1/1 [00:01, 1.28s/file]

To track the changes with git, run:

    git add data.dvc .gitignore
As the output above shows, DVC reports which files this action created or modified and tells us how to track them with Git.
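The data.dvc file is small because it points at the data by a content hash rather than storing the data itself. The idea can be illustrated with a plain MD5 digest of a single file. This is only a sketch of the concept: file_md5 is a hypothetical helper, and DVC's own handling of a whole directory is more involved than hashing one file.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling file_md5("data/enron.csv") before and after editing the file would give two different digests, which is the kind of signal a content-addressed tool can use to notice that data has changed.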
Track the changes with Git
git add data.dvc .gitignore
git commit -m "Add the data directory to DVC tracking"
2 files changed, 6 insertions(+)
create mode 100644 data.dvc
Track Files with Git¶
Check the current status of the files in the directory:
git status -s
?? requirements.txt
?? src/
As we can see from the above, all the remaining files are either code or configuration files. Thus, we will track them using Git.
Track the files with Git
git add requirements.txt src/
git commit -m "Add requirements and src to Git tracking"
Push the Files to the Remotes¶
At this point, we would like to push the files tracked by Git and DVC to our DAGsHub remotes. Performing this action will make the entire project available for sharing & collaborating on DAGsHub.
Push DVC tracked files¶
dvc push -r origin
Enter a password for host <storage-provider> user <user-name>:
1 file pushed
You will be asked to enter the password of your user at the storage provider.
To see the DVC tracked files in your DAGsHub repository, you will need to push the <filename>.dvc file, which is tracked by Git, to the remote repository. We will do this in the next step.
Push Git tracked files¶
After completing these steps, our repository will look like this:
- The main repository page:
  The DVC tracked files are marked with a blue background.
- The data directory:
- The data file itself:

As we can see in the image above, DAGsHub displays the content of files (e.g., CSV, YAML, images) tracked by both Git and DVC. In this case, the CSV file, tracked by DVC, is displayed as a table that you can filter and compare across commits.
Process and Track Data Changes¶
Now, we would like to preprocess our data and track the results using DVC. Let's run the data_preprocessing.py file from our CLI.
python3 src/data_preprocessing.py
[DEBUG] Preprocessing raw data
[DEBUG] Loading raw data
[DEBUG] Removing punctuation from Emails
[DEBUG] Label encoding target column
[DEBUG] Vectorizing the emails by words
[DEBUG] Splitting data to train and test
[DEBUG] Saving data to file
This action generated four new files of processed data in the data directory.
tree data
data
├── X_test.csv
├── X_train.csv
├── enron.csv
├── y_test.csv
└── y_train.csv

0 directories, 5 files
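The steps named in the log (removing punctuation, label encoding, vectorizing by words, and splitting) can be pictured with the rough sketch below. It is not the project's data_preprocessing.py: it uses only the standard library, the sample emails are invented, and every function name here is an assumption made for illustration. The real script may well use a library vectorizer and a different split ratio.

```python
import random
import string
from collections import Counter


def remove_punctuation(text):
    """Strip ASCII punctuation characters from a string."""
    return text.translate(str.maketrans("", "", string.punctuation))


def encode_labels(labels):
    """Map each distinct label to an integer, in sorted label order."""
    mapping = {label: index for index, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


def vectorize(texts):
    """Turn each text into a list of word counts over a shared vocabulary."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({word for words in tokenized for word in words})
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append([counts[word] for word in vocabulary])
    return rows, vocabulary


def split_train_test(features, targets, test_fraction=0.25, seed=42):
    """Shuffle row indices reproducibly and split them into train and test parts."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    test_size = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:test_size], indices[test_size:]

    def pick(rows, chosen):
        return [rows[i] for i in chosen]

    return (pick(features, train_idx), pick(features, test_idx),
            pick(targets, train_idx), pick(targets, test_idx))


# Made-up sample emails; the real project reads data/enron.csv instead.
emails = [
    "Win a prize now!!!",
    "Meeting at noon, see you.",
    "Free prize, click now!",
    "Lunch tomorrow?",
]
labels = ["spam", "ham", "spam", "ham"]

cleaned = [remove_punctuation(email) for email in emails]
y, label_map = encode_labels(labels)
X, vocab = vectorize(cleaned)
X_train, X_test, y_train, y_test = split_train_test(X, y)
print(label_map)
print(len(X_train), len(X_test))
```

The four values returned by split_train_test correspond in spirit to the four generated files: X_train.csv, X_test.csv, y_train.csv, and y_test.csv.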
Check that the current status of your Git and DVC tracking matches the following:
git status
On branch master
Your branch is up to date with 'origin/<branch-name>'.

nothing to commit, working tree clean

dvc status
data.dvc:
    changed outs:
        modified:           data
Nothing was changed in Git tracking because the data directory is being tracked by DVC.
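DVC reports the data directory as modified because its content no longer matches what was recorded when we last ran dvc add. The comparison can be pictured roughly as taking a snapshot of per-file digests before and after, then diffing the two. This is a simplified sketch, not DVC's implementation; snapshot and describe_changes are hypothetical names, and reading whole files into memory is a shortcut that would not suit very large data.

```python
import hashlib
from pathlib import Path


def snapshot(directory):
    """Map each file's relative path to the MD5 digest of its content."""
    root = Path(directory)
    return {
        path.relative_to(root).as_posix(): hashlib.md5(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def describe_changes(before, after):
    """Return sorted lists of added, removed, and modified relative paths."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    modified = sorted(
        name for name in set(before) & set(after) if before[name] != after[name]
    )
    return added, removed, modified
```

Taking one snapshot of data before running the preprocessing script and another afterwards, describe_changes would list the four new CSV files as added while enron.csv stays unchanged.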
Let's version the new status of the data directory with DVC:
dvc add data
100% Add|███████████████████████████████████████████████████████|1/1 [00:01, 1.18s/file]

To track the changes with git, run:

    git add data.dvc

git add data.dvc
git commit -m "Process raw-data and save it to data directory"
Push our changes to the remote
dvc push -r origin
Enter a password for host <storage provider> user <username>:
5 files pushed

git push
Let's see the new status of the data directory in DAGsHub.
In this section, we covered the basic workflow of DVC and Git. We added our project files to the repository and tracked them using Git and DVC. We generated preprocessed data files and learned how to add these changes to DVC as well. In the next sections, we will learn how to: