DS569k: Protein Sequence and Function Joint Embeddings Dataset

By Donald Bertucci and Alex Endert

Protein embeddings based on function (ProteinCLIP + ESM2) for ~569k proteins from UniProtKB, along with a web app for querying similar proteins given a sequence.
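
As a rough illustration of what querying similar proteins over an embedding table can look like, here is a minimal sketch using cosine similarity in NumPy. It is not the web app's actual implementation: the function name `top_k_similar`, the 128-dimensional vectors, and the random stand-in data are all assumptions made for the example, and the real embedding dimension and file format of DS569k are not specified here.

```python
import numpy as np


def top_k_similar(query, embeddings, k=5):
    """Return indices and cosine similarities of the k rows most similar to query.

    Hypothetical helper for illustration; not part of the DS569k code.
    """
    # Normalize to unit length so a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q
    # Sort by descending similarity and keep the first k entries.
    order = np.argsort(-sims)[:k]
    return order, sims[order]


# Stand-in data: random vectors, NOT real DS569k embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 128))

# A query close to row 42, simulating the embedding of a near-identical protein.
query = embeddings[42] + 0.01 * rng.normal(size=128)

indices, scores = top_k_similar(query, embeddings, k=3)
print(indices, scores)
```

For a table of ~569k rows, a brute-force scan like this may still be workable, though an approximate nearest-neighbor index would likely be a better fit for interactive use.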

Demo

Paper

Cite

BibTeX
@misc{bertucci2024ds569k,
  author = {Donald Bertucci and Alex Endert},
  title = {DS569k: Protein Sequence and Function Joint Embeddings Dataset},
  year = {2024},
  url = {https://xnought.github.io/files/DS569k.pdf},
}