Agent Skill · Hugging Face

train-sentence-transformers

Train or fine-tune sentence-transformers models across `SentenceTransformer` (bi-encoder; dense or static embedding model; for retrieval, similarity, clustering, classification, paraphrase mining, dedup, multimodal), `CrossEncoder` (reranker; pair scoring for two-stage retrieval / pair classification), and `SparseEncoder` (SPLADE, sparse embedding model; for learned-sparse retrieval). Covers loss selection, hard-negative mining, evaluators, distillation, LoRA, Matryoshka, and Hugging Face Hub publishing. Use for any sentence-transformers training task.

Provider: Hugging Face
Path in repo: skills/train-sentence-transformers/SKILL.md

Train a sentence-transformers Model

This SKILL.md is a router, not a manual. It tells you which references and example scripts to load for your task. The actual content — recommended losses, evaluators, training-script structure, model selection, training-arg knobs, troubleshooting — lives in references/ and scripts/.

Do not synthesize a training script from this file alone. Open the per-type production template (scripts/train_<type>_example.py) and copy it as your starting point. The templates contain load-bearing scaffolding (autocast helper, model-card class, logger silencing list, force=True, seed, TF32, version-compatible imports, named-evaluator metric handling) that prior agent runs have repeatedly missed when rolling their own from a synthesized snippet.

1. Identify the model type

| Tag | Class | What it does | When to pick |
| --- | --- | --- | --- |
| [SentenceTransformer] | SentenceTransformer (bi-encoder) | Maps each input to a fixed-dim dense vector | Retrieval, similarity, clustering, classification, paraphrase mining, dedup |
| [CrossEncoder] | CrossEncoder (reranker) | Scores (query, passage) pairs jointly | Two-stage retrieval (rerank top-100 from bi-encoder), pair classification |
| [SparseEncoder] | SparseEncoder (SPLADE) | Sparse vectors over the vocabulary | Learned-sparse retrieval, inverted-index backends (Elasticsearch / OpenSearch / Lucene) |

Tiebreakers when the request is ambiguous: “embedding model” / “vector search” / “similarity” → [SentenceTransformer]. “rerank” / “ranker” / “two-stage” → [CrossEncoder]. “SPLADE” / “sparse” / “inverted index” → [SparseEncoder]. If still unclear, ask.
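The tiebreakers above can be read as a small keyword router. The sketch below is illustrative only: the function name `route_model_type` and the substring-matching strategy are assumptions, not part of the skill, and the keyword lists are copied from the tiebreaker sentence. A request that matches no group, or more than one, returns `None`, which corresponds to "ask".

```python
# Hypothetical helper: keyword groups taken from the tiebreaker text above.
KEYWORDS = {
    "SentenceTransformer": ("embedding model", "vector search", "similarity"),
    "CrossEncoder": ("rerank", "ranker", "two-stage"),
    "SparseEncoder": ("splade", "sparse", "inverted index"),
}


def route_model_type(request):
    """Return the single matching model-type tag, or None if unclear."""
    text = request.lower()
    matches = [
        tag
        for tag, keywords in KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]
    # Zero matches or conflicting matches both mean "ask the user".
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    for example in (
        "Fine-tune an embedding model for vector search",
        "Train a reranker for two-stage retrieval",
        "SPLADE for an inverted index in OpenSearch",
        "Train a sparse embedding model",
    ):
        print(example, "->", route_model_type(example))
```

Plain substring matching is deliberately naive; a request such as "sparse embedding model" hits two groups and falls through to the ask-the-user path rather than guessing.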

2. Required reading

Read these in full before writing any code. Do not triage by perceived relevance.

Per-type — always required

[SentenceTransformer]

[CrossEncoder]

[SparseEncoder]

Cross-cutting — always required (regardless of task)

Cross-cutting — load when applicable

Variant scripts (open when the task matches)

3. Defaults

Override only if the user specifies otherwise:

4. Constraints the produced script must satisfy

These are non-negotiable contracts. Implementation lives in the production templates and references — do not reinvent.

5. Workflow

  1. Identify the model type (§1). Ask if ambiguous.
  2. Load the §2 required-reading files for that type.
  3. Open scripts/train_<type>_example.py and copy it as your starting point.
  4. Set MODEL_NAME, DATASET_NAME, RUN_NAME, the loss, and the evaluator to match the user’s task. Cross-check that the loss matches the data shape against references/losses_<type>.md, and that the metric_for_best_model key matches references/evaluators_<type>.md (named evaluators format the key as eval_{name}_{primary_metric}).
  5. Smoke-test (max_steps=1).
  6. Run.
  7. After the run, append to logs/experiments.md and propose iteration if the verdict is weak/marginal.
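The key format mentioned in step 4, eval_{name}_{primary_metric}, is easy to get wrong by hand, so a small check after the smoke test can help. The sketch below is a minimal illustration: `best_model_metric_key` and `find_key_mismatch` are hypothetical helpers, the evaluator name `sts-dev` and metric `spearman_cosine` are example values, and the unnamed-evaluator branch is an assumption that should be verified against references/evaluators_<type>.md.

```python
def best_model_metric_key(evaluator_name, primary_metric):
    """Build a metric_for_best_model key as eval_{name}_{primary_metric}."""
    if not primary_metric:
        raise ValueError("primary_metric must be a non-empty string")
    if evaluator_name:
        return f"eval_{evaluator_name}_{primary_metric}"
    # Assumption: an unnamed evaluator contributes no name segment.
    return f"eval_{primary_metric}"


def find_key_mismatch(configured_key, reported_metrics):
    """Return None if the key was reported, else a message listing candidates."""
    if configured_key in reported_metrics:
        return None
    candidates = sorted(k for k in reported_metrics if k.startswith("eval_"))
    return f"{configured_key!r} not reported; available keys: {candidates}"


if __name__ == "__main__":
    key = best_model_metric_key("sts-dev", "spearman_cosine")
    # Example metrics dict, standing in for what a smoke-test evaluation logs.
    reported = {"eval_sts-dev_spearman_cosine": 0.0, "eval_loss": 0.0}
    print(key, find_key_mismatch(key, reported))
```

Running the comparison against the metrics logged by the one-step smoke test surfaces a mismatched key before the full run starts.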

Prerequisites

pip install "sentence-transformers[train]>=5.0"        # add [train,image] / [audio] / [video] for [SentenceTransformer] multimodal
pip install trackio                                    # optional tracker; or wandb / tensorboard / mlflow
hf auth login                                          # or set HF_TOKEN with write scope (for Hub push)

GPU strongly recommended. CPU works only for demos and [SentenceTransformer] StaticEmbedding.
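Before training, it can be worth confirming that the installed package satisfies the `>=5.0` requirement from the install command. The following is a sketch using only the standard library; the helper names `parse_release`, `meets_minimum`, and `check_sentence_transformers` are hypothetical, and the parser is intentionally simple: it reads leading numeric components and treats a pre-release such as `5.0.0.dev0` as meeting the minimum.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (5, 0)  # from the pip requirement "sentence-transformers[train]>=5.0"


def parse_release(version_string):
    """Extract leading numeric components, e.g. '5.1.0rc1' -> (5, 1, 0)."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric segment such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version_string, minimum=MIN_VERSION):
    """Compare release tuples after zero-padding them to equal length."""
    release = parse_release(version_string)
    width = max(len(release), len(minimum))
    padded_release = release + (0,) * (width - len(release))
    padded_minimum = tuple(minimum) + (0,) * (width - len(minimum))
    return padded_release >= padded_minimum


def check_sentence_transformers():
    """Return (ok, message) for the installed sentence-transformers package."""
    try:
        installed = version("sentence-transformers")
    except PackageNotFoundError:
        return False, "sentence-transformers is not installed"
    return meets_minimum(installed), f"sentence-transformers {installed}"


if __name__ == "__main__":
    print(check_sentence_transformers())
```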