OpenProteinSet Dataset for Machine Learning

Install DagsHub:

pip install dagshub

To stream this data directly from DagsHub:

from dagshub.streaming import DagsHubFilesystem

fs = DagsHubFilesystem(".", repo_url="https://dagshub.com/DagsHub-Datasets/openfold-dataset")

fs.listdir("s3://openfold")
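A minimal sketch of how the listing call above could be wrapped for exploration. The helper names `list_prefix` and `open_openfold_fs` are hypothetical and not part of the DagsHub client; the only DagsHub calls used are the `DagsHubFilesystem` constructor and `listdir` shown above. The bucket layout under `s3://openfold` is not described on this page, so the sketch makes no assumption about which entries come back.

```python
from typing import Any, List


def list_prefix(fs: Any, prefix: str = "s3://openfold", limit: int = 10) -> List[str]:
    """Return up to `limit` entry names under `prefix`, sorted by name.

    `fs` is any object with an os-like `listdir(path)` method, such as the
    DagsHubFilesystem instance created above.
    """
    names = sorted(str(name) for name in fs.listdir(prefix))
    return names[:limit]


def open_openfold_fs(project_root: str = "."):
    """Build the streaming filesystem for this dataset (hypothetical helper).

    dagshub is imported lazily so that list_prefix can be used with any
    listdir-capable object, even where the package is not installed.
    """
    from dagshub.streaming import DagsHubFilesystem

    return DagsHubFilesystem(
        project_root,
        repo_url="https://dagshub.com/DagsHub-Datasets/openfold-dataset",
    )
```

With the package installed, `list_prefix(open_openfold_fs())` would give a short, sorted preview of the top-level entries instead of the full listing.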

Description

OpenProteinSet provides multiple sequence alignments (MSAs) for 132,000 unique Protein Data Bank (PDB) chains, covering 640,000 PDB chains in total, and for 4,850,000 UniClust30 clusters. Template hits are also provided for the PDB chains and for 270,000 UniClust30 clusters chosen for maximal diversity and MSA depth. MSAs were generated with HHBlits (-n3) and JackHMMER against MGnify, BFD, UniRef90, and UniClust30, while templates were identified from PDB70 with HHSearch, all according to the procedures outlined in the supplement to the AlphaFold 2 Nature paper (Jumper et al., 2021). We expect the database to be broadly useful to structural biologists training or validating deep learning models for protein structure prediction and related tasks.
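To make "MSA depth" concrete, here is a rough sketch of reading one alignment and counting its sequences. It assumes the alignment text is in A3M format (the FASTA-like format HHBlits writes, where lowercase letters mark insertions relative to the query and `-` marks deletions); this page does not state the on-disk format or file names, so treat both as assumptions. The function names `parse_a3m` and `aligned_columns` are illustrative only.

```python
from typing import List, Optional, Tuple


def parse_a3m(text: str) -> List[Tuple[str, str]]:
    """Parse A3M/FASTA-style text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequences may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def aligned_columns(sequence: str) -> str:
    """Drop lowercase insertion states, keeping match columns and '-' gaps."""
    return "".join(c for c in sequence if not c.islower())


# Tiny made-up alignment, only to show the shape of the data.
EXAMPLE_A3M = """>query
MKVLA
>hit1
MKVaaLA
>hit2
M-V-A
"""

records = parse_a3m(EXAMPLE_A3M)
depth = len(records)  # MSA depth = number of sequences, query included
rows = [aligned_columns(seq) for _, seq in records]
```

After insertions are dropped, every row has the same length as the query, which is the rectangular form most structure-prediction pipelines expect as input.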

Additional information

Documentation

https://docs.google.com/document/d/1R90-VJSLQEbot7tgXF3zb068Y1ZJAmsckQ_t2sJTv2c/edit?usp=sharing

Update frequency

Never

Managed by

OpenFold

License

CC BY 4.0

Related datasets

Allen Brain Observatory – Visual Coding AWS Public Data Set

Allen Cell Imaging Collections

Biological and Physical Sciences (BPS) Microscopy Benchmark Training Dataset

Cancer Cell Line Encyclopedia (CCLE)
