Agent Skill · OpenAI

speech

Use when the user asks for text-to-speech narration or voiceover, accessibility reads, audio prompts, or batch speech generation via the OpenAI Audio API; run the bundled CLI (`scripts/text_to_speech.py`) with built-in voices and require `OPENAI_API_KEY` for live calls. Custom voice creation is out of scope.

Provider: OpenAI
Path in repo: skills/.curated/speech/SKILL.md


Speech Generation Skill

Generate spoken audio for the current project (narration, product demo voiceover, IVR prompts, accessibility reads). Defaults to gpt-4o-mini-tts-2025-12-15 and built-in voices, and prefers the bundled CLI for deterministic, reproducible runs.

When to use

Decision tree (single vs batch)

Workflow

  1. Decide intent: single vs batch (see decision tree above).
  2. Collect inputs up front: exact text (verbatim), desired voice, delivery style, format, and any constraints.
  3. If batch: write a temporary JSONL under tmp/ (one job per line), run once, then delete the JSONL.
  4. Augment instructions into a short labeled spec without rewriting the input text.
  5. Run the bundled CLI (scripts/text_to_speech.py) with sensible defaults (see references/cli.md).
  6. For important clips, validate: intelligibility, pacing, pronunciation, and adherence to constraints.
  7. Iterate with a single targeted change (voice, speed, or instructions), then re-check.
  8. Save/return final outputs and note the final text + instructions + flags used.
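Step 3 of the workflow (write a temporary JSONL under tmp/, run once, then delete it) can be sketched as a small context manager. This is a minimal illustration, not part of the bundled CLI: the helper name `temp_jobs_file` is hypothetical, and the actual CLI invocation is left as a comment because its flags are documented in references/cli.md rather than here.

```python
import json
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temp_jobs_file(jobs, tmp_dir="tmp"):
    """Write one JSON job per line under tmp_dir, then delete the file on exit.

    Hypothetical helper for illustration; not part of scripts/text_to_speech.py.
    """
    directory = Path(tmp_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # delete=False so another process (the CLI) can reopen the file by path.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".jsonl", dir=directory, delete=False, encoding="utf-8"
    ) as handle:
        for job in jobs:
            handle.write(json.dumps(job, ensure_ascii=False) + "\n")
        path = Path(handle.name)
    try:
        yield path
    finally:
        # Clean up even if the run fails, so no stale job files linger in tmp/.
        path.unlink(missing_ok=True)


if __name__ == "__main__":
    jobs = [{"input": "Thank you for calling. Please hold.", "voice": "cedar"}]
    with temp_jobs_file(jobs) as jobs_path:
        # Run scripts/text_to_speech.py against jobs_path here (flags: see references/cli.md).
        print(jobs_path.read_text(encoding="utf-8"))
```

Deleting the file in a `finally` block keeps the "run once, then delete" rule intact even when the run raises.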

Temp and output conventions

Dependencies (install if missing)

Prefer uv for dependency management.

Python packages:

uv pip install openai

If uv is unavailable:

python3 -m pip install openai
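The "install if missing" check can be scripted before any install is attempted. The sketch below assumes only the two install commands shown above; the function name `install_hint` is hypothetical.

```python
import importlib.util
import shutil


def install_hint(package="openai"):
    """Return None if the package is importable, else the install command to suggest.

    Hypothetical helper for illustration.
    """
    if importlib.util.find_spec(package) is not None:
        return None
    # Prefer uv when it is on PATH; otherwise fall back to pip.
    if shutil.which("uv"):
        return f"uv pip install {package}"
    return f"python3 -m pip install {package}"


if __name__ == "__main__":
    hint = install_hint()
    print("openai is installed" if hint is None else f"Missing dependency. Install with: {hint}")
```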

Environment

If OPENAI_API_KEY is missing, give the user these steps:

  1. Create an API key in the OpenAI platform UI: https://platform.openai.com/api-keys
  2. Set OPENAI_API_KEY as an environment variable on their system.
  3. Offer to guide them through setting the environment variable for their OS/shell if needed.
    • Never ask the user to paste the full key in chat. Ask them to set it locally and confirm when ready.

If installation isn’t possible in this environment, tell the user which dependency is missing and how to install it locally.
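One way to confirm the key is set without ever reading it back to the user is a presence-only check. This is a minimal sketch; the function name `api_key_is_set` is hypothetical, and it deliberately returns only a boolean so the key value never appears in logs or chat.

```python
import os


def api_key_is_set(env=None):
    """Report whether OPENAI_API_KEY is present and non-empty, without exposing its value.

    Hypothetical helper for illustration.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if api_key_is_set():
        print("OPENAI_API_KEY is set.")
    else:
        print("OPENAI_API_KEY is not set. Set it locally, then confirm when ready.")
```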

Defaults & rules

Instruction augmentation

Reformat user direction into a short, labeled spec. Only make implicit details explicit; do not invent new requirements.

Quick clarification (augmentation vs invention):

Template (include only relevant lines):

Voice Affect: <overall character and texture of the voice>
Tone: <attitude, formality, warmth>
Pacing: <slow, steady, brisk>
Emotion: <key emotions to convey>
Pronunciation: <words to enunciate or emphasize>
Pauses: <where to add intentional pauses>
Emphasis: <key words or phrases to stress>
Delivery: <cadence or rhythm notes>
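The "include only relevant lines" rule can be expressed as a small formatter that emits a labeled line only when the user supplied that detail. This is a sketch under assumptions: the function name `build_instructions` is hypothetical, and rejecting unknown labels is one possible way to avoid inventing requirements, not a rule stated by the skill.

```python
# Labels in the same order as the template.
LABELS = (
    "Voice Affect",
    "Tone",
    "Pacing",
    "Emotion",
    "Pronunciation",
    "Pauses",
    "Emphasis",
    "Delivery",
)


def build_instructions(spec):
    """Format a dict of label -> text into the labeled spec, skipping empty entries.

    Hypothetical helper for illustration.
    """
    unknown = sorted(set(spec) - set(LABELS))
    if unknown:
        # Assumption: labels outside the template are treated as errors.
        raise ValueError(f"unknown labels: {', '.join(unknown)}")
    lines = [
        f"{label}: {spec[label].strip()}"
        for label in LABELS
        if spec.get(label, "").strip()
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_instructions({"Tone": "Friendly and confident.", "Pacing": "Steady and moderate."}))
```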

Augmentation rules:

Examples

Single example (narration)

Input text: "Welcome to the demo. Today we'll show how it works."
Instructions:
Voice Affect: Warm and composed.
Tone: Friendly and confident.
Pacing: Steady and moderate.
Emphasis: Stress "demo" and "show".

Batch example (IVR prompts)

{"input":"Thank you for calling. Please hold.","voice":"cedar","response_format":"mp3","out":"hold.mp3"}
{"input":"For sales, press 1. For support, press 2.","voice":"marin","instructions":"Tone: Clear and neutral. Pacing: Slow.","response_format":"wav"}
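Before handing a batch file to the CLI, it can help to check that every line parses as a JSON object. The sketch below is illustrative only: `parse_jobs` is a hypothetical helper, and the rule that each job needs a non-empty "input" string is an assumption drawn from the two example lines, not from the CLI's documented schema (see references/cli.md for that).

```python
import json


def parse_jobs(jsonl_text):
    """Parse JSONL text into a list of job dicts, reporting the first bad line.

    Hypothetical helper for illustration.
    """
    jobs = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between jobs
        try:
            job = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number}: invalid JSON ({exc.msg})") from exc
        # Assumption: every job must carry the verbatim text under "input".
        text = job.get("input") if isinstance(job, dict) else None
        if not isinstance(text, str) or not text.strip():
            raise ValueError(f"line {number}: expected an object with a non-empty 'input' string")
        jobs.append(job)
    return jobs


if __name__ == "__main__":
    sample = '{"input":"Thank you for calling. Please hold.","voice":"cedar"}\n'
    print(parse_jobs(sample))
```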

Instruction-writing best practices (short list)

More principles: references/prompting.md. Copy/paste specs: references/sample-prompts.md.

Guidance by use case

Use these modules when the request is for a specific delivery style. They provide targeted defaults and templates.

CLI + environment notes

Reference map