Direct Data Access¶
Direct Data Access, or DDA for short, is a magical component of the DagsHub client and API libraries that lets you stream your data from, and upload it to, any DagsHub project. It makes it extremely simple to get started with your data science work, even when you have a large dataset.
In other words, DDA gives you the organization and reproducibility provided by DVC, with the ease of use and flexibility of a data API, without requiring any changes to your project. It also means you can stream and upload data directly, working with only part of your project's files, without pulling everything to your machine.
Direct Data Access has two main components:

- Data Streaming
    - Python Hooks approach
    - Mounted Filesystem approach – Experimental
- Data Upload
Installation and Setup¶
DDA comes with the new DagsHub client libraries. To install it, simply type in the following:
$ pip3 install dagshub
Using all functionality of DDA requires authentication, and you can do this easily by running:
$ dagshub login
Info
If you prefer to use a non-temporary token for logging in, you can run the following command:
$ dagshub login --token <your dagshub token>
Data Streaming¶
How does Data Streaming work?¶
Data streaming via DDA has two main implementations, Python Hooks and Mounted Filesystem, each valid for different cases. Review the support matrix below for more details and recommendations on when to use each one.
The Python Hooks method automatically detects calls to Python's built-in file operations (such as `open()`), and if the files exist on your DagsHub repo, it loads them on the fly as they're requested. This means that most Python ML and data libraries will automatically work with this method, without requiring manual integration.
The Mounted Filesystem implementation relies on FUSE (Filesystem in Userspace). It creates a virtual mounted filesystem reflecting your DagsHub repo, which behaves like a part of your local filesystem for all intents and purposes.
How to use Data Streaming?¶
Python Hooks – Recommended¶
To use Python Hooks, open your DagsHub project and copy the following two lines into the Python code that accesses your data:
from dagshub.streaming import install_hooks
install_hooks()
To see an example of this that actually runs, check out the Colab below:
Known Limitations
- Some frameworks, such as TensorFlow and OpenCV, that rely on routines written in C or C++ for file input/output are currently not supported.
- `dvc repro` and `dvc run` commands for stages that have DVC-tracked files in `deps` will not work, and will show errors about missing data. To run such a stage, use the `--downstream` flag instead, or run it manually and use `dvc commit`.
Mounted Filesystem – Experimental¶
The Mounted Filesystem approach uses FUSE under the hood. This bypasses the limitations of the Python Hooks approach by creating a fully virtual filesystem that connects your remote to the local workspace. It supports all frameworks and non-Python languages. However, note that FUSE only supports Linux machines and is currently unstable. Read more about it in the DagsHub client README.
Non-magical API approach¶
Magic is awesome, but sometimes you need more control over how you access your project files and prefer a direct API. If you want to state explicitly that you're using DagsHub streaming, or if none of the other methods are supported on your machine, we also offer a straightforward Python client class that you can use.
Just copy the following code into your Python code:
from dagshub.streaming import DagsHubFilesystem
fs = DagsHubFilesystem()
Then replace any use of Python's file-handling functions as follows:

- `open()` → `fs.open()`
- `os.stat()` → `fs.stat()`
- `os.listdir()` → `fs.listdir()`
- `os.scandir()` → `fs.scandir()`
You can pass the same arguments you would to the built-in functions to our client's functions, and streaming functionality will be provided. e.g.:
fs.open('/full/path/from/root/to/dvc/managed/file')
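Because `DagsHubFilesystem` mirrors the built-in signatures, a handy pattern is to write your loading code against an injectable filesystem object, so the same function runs with or without streaming. This is a sketch of the pattern under our own assumptions: `load_first_line` and the `local_fs` shim are not part of the DagsHub API.

```python
import os
import tempfile
from types import SimpleNamespace

# Fallback shim exposing the same four methods, backed by the local filesystem
local_fs = SimpleNamespace(open=open, stat=os.stat,
                           listdir=os.listdir, scandir=os.scandir)

def load_first_line(path, fs=local_fs):
    """Read one line through whichever filesystem object is injected."""
    with fs.open(path) as f:
        return f.readline().rstrip("\n")

# Works against the local shim; passing fs=DagsHubFilesystem() would stream instead
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("id,label\n1,cat\n")
print(load_first_line(tmp.name))  # → id,label
```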
Data Upload¶
You don't need to pull the entire dataset anymore.
The upload API lets you upload or append files to existing DVC directories, without downloading anything to your machine, quickly and efficiently.
How does Data Upload work?¶
Data Upload consists of an API and a Python client library that let you send files you'd like to track to DagsHub and have them automatically added to your project, using DVC or Git for versioning. To accomplish this, we implement all the logic for tracking new files on our server, so that you end up with a fully DVC- (or Git-) tracked file or folder.
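DVC identifies file versions by content hash (MD5), so conceptually the server-side tracking boils down to hashing the uploaded bytes and recording the file under that hash. The following is a simplified sketch of that idea, not DagsHub's actual code:

```python
import hashlib

def dvc_style_hash(data: bytes) -> str:
    """MD5 content hash, the identifier DVC uses for tracked files."""
    return hashlib.md5(data).hexdigest()

# The server receives the raw bytes and derives the hash for you,
# so the client never needs a local copy of the rest of the dataset
print(dvc_style_hash(b"hello"))  # → 5d41402abc4b2a76b9719d911017c592
```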
How to use Data Upload?¶
After installing the client, you can use the upload function for both Git and DVC-tracked files.
Upload single files using the DagsHub CLI¶
You can upload a single file to any location in your repository, including DVC directories, by using the DagsHub CLI in your terminal. This utility is useful for active learning scenarios where you want to append a new file to your dataset.
A basic usage example is:
$ dagshub upload <repo_owner>/<repo_name> <local_file_path> <path_in_remote>
Options:
-m, --message TEXT Commit message for the upload
-b, --branch TEXT Branch to upload the file to - this is required for private repositories
--update Force update an existing file
-v, --verbose Verbosity level
--help Show this message and exit.
Upload a single file using the Python client¶
Basic usage example is as follows:
from dagshub.upload import Repo
repo = Repo("<repo_owner>", "<repo_name>") # Optional: username, password, token, branch
# Upload a single file to a repository in one line
repo.upload(file="<local_file_path>", path="<path_in_remote>", versioning="dvc") # Optional: versioning, new_branch, commit_message
Upload multiple files using the Python client¶
To upload multiple files, use:
# Upload multiple files to a dvc folder in a repository with a single commit
ds = repo.directory("<name_of_remote_folder>")
# Add file-like object (path_in_remote is the relative path inside of the remote folder)
with open("<local_file_path>", 'rb') as f:
    ds.add(file=f, path="<path_in_remote>")
# Or add a local file path
ds.add(file="<local_file_path>", path="<path_in_remote>")
ds.commit("<commit_message>", versioning="dvc")
Automagic Repo Configuration¶
Parts of DDA will try to pick up configuration required to communicate with DagsHub. For example, Data Streaming will use the configuration of your git repository to get the branch you're currently working on and your authentication username and password.
The OAuth token acquired via `dagshub login` is cached locally, so you don't need to log in every time you run your scripts.
If you need to override the automatically detected configuration, use the following environment variables and options in the CLI:
- `--repo` (a command line option)
- `DAGSHUB_USERNAME`
- `DAGSHUB_PASSWORD`
- `DAGSHUB_USER_TOKEN`
Or provide the relevant arguments to the Python entrypoints:
- `repo_url=` (for Data Streaming)
- `username=`
- `password=`
- `token=`
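For example, in a CI job you might override the automatic detection by exporting a token before running your script. The token value and the script name here are placeholders:

```shell
# Use an explicit access token instead of the auto-detected git credentials
export DAGSHUB_USER_TOKEN="<your dagshub token>"
python train_with_streaming.py
```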
Data Upload Use Cases¶
Appending files to a DVC directory¶
Adding files to an existing DVC directory stored in a remote is time-consuming and sometimes expensive. Normally, you need to start by running `dvc pull` to download the dataset to your local system, add the new files to the folder, then run `dvc commit` and `dvc push` to re-upload them.
In cases where you want to add 10 files to a million-file dataset, this can become a very painful process.
With Direct Data Access and the Data Upload functionality, DagsHub takes care of that for you. Since we host or connect with your DVC remote, we calculate the new hashes, and commit the new DVC-tracked and modified Git-tracked files on your behalf.
All the above methods for uploading files work, but the easiest way to do this is to use the CLI, by running the following:
dagshub upload <repo_owner>/<repo_name> <local_file_path> <path_in_remote>
With `<path_in_remote>` being a DVC-tracked folder in your DagsHub repository. After running this, you will see a new commit, and the appended file will appear as part of the directory on DagsHub.
Important
For uploading to private repositories, you must use the `--branch BRANCH_NAME` option.
Creating a dataset repo from scratch with Python¶
Sometimes you need to set up a repository to track a new dataset. This might be useful to share it for training and experimentation across your team, to start labeling raw data, or just to centrally manage something that exists on a single server.
With DDA, we provide a simple command to do this, which can be accomplished fully in Python, without needing to do anything (except signing up) on DagsHub.
This is the easiest way to create a versioned dataset repo and upload it directly from your IDE or Notebook.
To do this, we can use the `create_dataset()` function. Here is a basic usage example:
from dagshub.upload import create_dataset
repo = create_dataset("<dataset_name>", "<path/to/data/directory>") # Optional: glob_exclude, org_name, private
Support Matrix¶
The following table shows which use cases are supported for each component of Direct Data Access and recommendations for when to use each data streaming implementation. We’ll update it as we add support for additional use cases. Please let us know on our Discord community if you have any requests.
Data Streaming Support¶
|  | Python Hooks | Mounted Filesystem |
| --- | --- | --- |
| Stable | V | X |
| Tensorflow | X | V |
| DVC Repro | X | V |
| DVC Repro (with `--downstream`) | V | V |
| Python Support | V | V |
| Non-Python Support | X | V |
| No additional Installations | V | X |
| Files are visible in the file explorer | X | V |
| Support for Windows & Mac | V | X |
| Support for Linux | V | V |
| C (low-level) Open | X | V |
Recommendations¶
- Python Hooks are recommended for use on Windows & Mac, and with any framework that uses Python and doesn't rely on C-level file opens.
- Mounted Filesystem is recommended for cases where Python Hooks don't apply and you're using a Linux system (Colab is a great example).
Data Upload Support¶
- The upload client currently supports DagsHub repositories only. GitHub-connected repositories and empty repositories are not yet supported.
- Uploading files that are outputs of DVC pipeline stages is not supported either, as such files are expected to be produced by a `dvc repro` command.
- Deleting files is not yet supported.
Contributing¶
Contributions are welcome! Direct Data Access is part of the open-source DagsHub client, and contributions, such as dedicated support for auto-logging, data streaming, and more, will be greatly appreciated. You can start by creating an issue, or by asking for guidance on our Discord community.