Agent Skill · Hugging Face

huggingface-best

Use when the user asks about finding the best, top, or recommended model for a task, wants to know what AI model to use, or wants to compare models by benchmark scores. Triggers on: "best model for X", "what model should I use for", "top models for [task]", "which model runs on my laptop/machine/device", "recommend a model for", "what LLM should I use for", "compare models for", "what's state of the art for", or any question about choosing an AI model for a specific use case. Always use this skill when the user wants model recommendations or comparisons, even if they don't explicitly mention HuggingFace or benchmarks.

Provider: Hugging Face Path in repo: skills/huggingface-best/SKILL.md

Skill body

HuggingFace Best Model Finder

Finds the best models for a task by querying official HF benchmark leaderboards, enriching results with model size data, filtering for what fits on the user’s device, and returning a comparison table with benchmark scores.


Step 1: Parse the request

Extract from the user’s message:

If device is not mentioned, skip filtering entirely and return the highest-performing models regardless of size. If the task is genuinely ambiguous, ask one clarifying question.

Device → max parameter budget

When a device is specified, extract its available memory (unified RAM for Apple Silicon, VRAM for discrete GPUs) and apply:

Examples: 16GB → 8B fp16 / 32B Q4 — 24GB VRAM → 12B fp16 / 48B Q4 — 8GB → 4B fp16 / 16B Q4


Step 2: Find relevant benchmark datasets

Fetch the full list of official HF benchmarks:

curl -s -H "Authorization: Bearer $(cat ~/.cache/huggingface/token)" \
  "https://huggingface.co/api/datasets?filter=benchmark:official&limit=500" | jq '[.[] | {id, tags, description}]'

Read the returned list and select the datasets most relevant to the user’s task — match on dataset id, tags, and description. Use your judgment; don’t limit yourself to 2-3. Aim for comprehensive coverage: if 5 benchmarks clearly cover the task, use all 5.


Step 3: Fetch top models from leaderboards

For each selected benchmark dataset:

curl -s -H "Authorization: Bearer $(cat ~/.cache/huggingface/token)" \
  "https://huggingface.co/api/datasets/<namespace>/<repo>/leaderboard" | jq '[.[:15] | .[] | {rank, modelId, value, verified}]'

Collect model IDs and scores across all benchmarks. If a leaderboard returns an error (404, 401, etc.), skip it and note it in the output.


Step 4: Enrich with model metadata

For the top 10-15 candidate model IDs, get model infos.

# REST API
curl -s -H "Authorization: Bearer $(cat ~/.cache/huggingface/token)" \
  "https://huggingface.co/api/models/org/model1" | jq '{safetensors, tags, cardData}'

# CLI (hf-cli)
hf models info org/model1 --json | jq '{safetensors, tags, cardData}'

Extract from each response:


Step 5: Filter and rank

If a device was specified:

  1. Remove models exceeding the fp16 parameter budget for the device
  2. Flag models that fit only with Q4 quantization (multiply budget by ~4 for Q4 capacity)
  3. If a highly-ranked model is slightly over budget, keep it with a “needs Q4” note — don’t silently drop it

If no device was mentioned: skip all size filtering — just rank by benchmark score.

Then: rank by benchmark score (descending), keep top 5-8 models.

Include proprietary models (GPT-4, Claude, Gemini) if they appear on leaderboards, but flag them as “API only / not self-hostable”. If the user explicitly asked for local/open models only, exclude them.


Step 6: Output

Comparison table

| # | Model | Params | [Benchmark 1] | [Benchmark 2] | License | On device |
|---|-------|--------|--------------|--------------|---------|-----------|
| ⭐1 | [org/name](https://huggingface.co/org/name) | 7B | 85.2% | — | Apache 2.0 | Yes (fp16) |
| 2 | [org/name](https://huggingface.co/org/name) | 13B | 83.1% | 71.5% | MIT | Q4 only |
| 3 | [org/name](https://huggingface.co/org/name) | 70B | 90.0% | 81.0% | Llama | Too large |

Follow-up

After presenting the table, ask the user: “Would you like to run [top recommended model]?”

If they say yes, ask whether they’d prefer to:


Error handling