Agent Skill · Databricks

databricks-synthetic-data-gen

Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.

Provider: Databricks Path in repo: experimental/databricks-synthetic-data-gen/SKILL.md

Skill body

Catalog and schema are always user-supplied — never default to any value. If the user hasn’t provided them, ask. For any UC write, always create the schema if it doesn’t exist before writing data.

Databricks Synthetic Data Generation

Generate realistic, story-driven synthetic data for Databricks using Spark + Faker + Pandas UDFs (strongly recommended).

Data Must Tell a Business Story

Synthetic data should demonstrate how Databricks helps solve real business problems.

The pattern: Something goes wrong → business impact ($) → analyze root cause → identify affected customers → fix and prevent.

Key principles:

Why no flat distributions: Uniform data has no story — no spikes, no anomalies, no cohort, no 20/80, no skew, nothing to investigate. It can’t show Databricks’ value for root cause analysis.

References

When Guide
User mentions ML model training or complex time patterns references/1-data-patterns.md — ML-ready data, time multipliers, row coherence
Errors during generation references/2-troubleshooting.md — Fixing common issues

Critical Rules

  1. Data tells a story — Something goes wrong, impacts $, can be analyzed and fixed. Show Databricks value.
  2. All data serves the story — Every table and column must be coherent and usable in dashboards or ML models. No orphan data, no random noise — if it doesn’t help explain or plot a futur dashboard or predict, don’t generate it.
  3. Industry terms, simple schema — Use domain-specific vocabulary but keep it easy to understand (few tables, clear relationships)
  4. Never uniform distributions — Skewed categories, log-normal amounts, 80/20 patterns. Flat = no story = useless
  5. Enough data for trends — ~100K+ rows for main tables so patterns survive aggregation
  6. Ask for catalog/schema — Never default, always confirm before generating
  7. Present plan for approval — Show tables, distributions, assumptions before writing code
  8. Master tables first — Generate parent tables, write to Delta, then create children with valid FKs
  9. Use Spark + Faker + Pandas UDFs — Scalable, parallel. Polars only if user explicitly wants local + <30K rows
  10. Use Databricks Connect Serverless by default to generate data — Update databricks-connect on python 3.12 if required (avoid using execute_code unless instructed to not use Databricks Connect)
  11. No .cache() or .persist() — Not supported on serverless. Write to Delta, read back for joins
  12. No Python loops or .collect() — Use Spark parallelism. No driver-side iteration, avoid Pandas↔Spark conversions

Generation Planning Workflow

Before generating any code, you MUST present a plan for user approval.

⚠️ MUST DO: Confirm Catalog Before Proceeding

You MUST explicitly ask the user which catalog to use. Do not assume or proceed without confirmation.

Example prompt to user:

“Which Unity Catalog should I use for this data?”

When presenting your plan, always show the selected catalog prominently:

📍 Output Location: catalog_name.schema_name
   Volume: /Volumes/catalog_name/schema_name/raw_data/

This makes it easy for the user to spot and correct if needed.

Step 1: Gather Requirements

Ask the user about:

If user doesn’t specify a story: Propose one. Don’t generate bland data — suggest an incident, anomaly, or trend that shows Databricks value (e.g., “I’ll include a system outage that causes ticket spike and churn — this lets you demo root cause analysis”).

Step 2: Present Plan with Story

Show a clear specification with the business story and your assumptions surfaced:

📍 Output Location: {user_catalog}.support_demo
   Volume: /Volumes/{user_catalog}/support_demo/raw_data/

📖 Story: A payment system outage causes support ticket spike. Resolution times
   degrade, enterprise customers churn, revenue drops $2.3M. With Databricks we
   identify the root cause, affected customers, and prevent future impact.
Table Description Rows Key Assumptions
customers Customer profiles with tier, MRR 10,000 Enterprise 10% but 60% of revenue
tickets Support tickets with priority, resolution_time 80,000 Spike during outage, SLA breaches
incidents System events (outages, deployments) 50 Payment outage mid-month
churn_events Customer cancellations with reason 500 Spike after poor support experience

Business metrics:

The story this data tells:

Ask user: “Does this story work? Any adjustments?”

Step 3: Ask About Data Features

Pre-Generation Checklist

Do NOT proceed to code generation until user approves the plan, including the catalog.

Post-Generation Validation

Use databricks experimental aitools tools query to validate generated data (row counts, distributions, referential integrity). Query parquet files directly:

databricks experimental aitools tools query --warehouse $WAREHOUSE_ID "
SELECT COUNT(*) FROM parquet.\`/Volumes/CATALOG/SCHEMA/raw_data/customers\`
"

See references/2-troubleshooting.md for full validation examples.

Use Databricks Connect Spark + Faker Pattern

from databricks.connect import DatabricksSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType
import pandas as pd

# Setup serverless Spark session
spark = DatabricksSession.builder.serverless(True).getOrCreate()

# Pandas UDF pattern - import lib INSIDE the function (libs must be installed locally)
@F.pandas_udf(StringType())
def fake_name(ids: pd.Series) -> pd.Series:
    from faker import Faker  # Import inside UDF
    fake = Faker()
    return pd.Series([fake.name() for _ in range(len(ids))])

# Generate with spark.range, apply UDFs
customers_df = spark.range(0, 10000, numPartitions=16).select(
    F.concat(F.lit("CUST-"), F.lpad(F.col("id").cast("string"), 5, "0")).alias("customer_id"),
    fake_name(F.col("id")).alias("name"),
)

# Write to Volume as Parquet (default for raw data)
# Path is a folder with table name: /Volumes/catalog/schema/raw_data/customers/
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
customers_df.write.mode("overwrite").parquet(f"/Volumes/{CATALOG}/{SCHEMA}/raw_data/customers")

Partitions by scale: spark.range(N, numPartitions=P)

Output formats:

Performance Rules

Generated scripts must be highly performant. Never do these:

Anti-Pattern Why It’s Slow Do This Instead
Python loops on driver Single-threaded, no parallelism Use spark.range() + Spark operations
.collect() then iterate Brings all data to driver memory Keep data in Spark, use DataFrame ops
Pandas → Spark → Pandas Serialization overhead, defeats distribution Stay in Spark, use pandas_udf only for UDFs
Read/write temp files Unnecessary I/O Chain DataFrame transformations
Scalar UDFs Row-by-row processing Use pandas_udf for batch processing

Good pattern: spark.range() → Spark transforms → pandas_udf for Faker → write directly

Common Patterns

Weighted Categories (never uniform)

F.when(F.rand() < 0.6, "Free").when(F.rand() < 0.9, "Pro").otherwise("Enterprise")

Log-Normal Amounts (in a pandas UDF)

Use np.random.lognormal(mean, sigma) — always positive, long tail:

Date Range (Last 6 Months)

END_DATE = datetime.now()
START_DATE = END_DATE - timedelta(days=180)

Infrastructure (always create in script)

spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")

Referential Integrity (FK pattern)

Write master table to Delta first, then read back for FK joins (no .cache() on serverless):

# 1. Write master table
customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")

# 2. Read back for FK lookup
customer_lookup = spark.table(f"{CATALOG}.{SCHEMA}.customers").select("customer_idx", "customer_id")

# 3. Generate child table with valid FKs via join
orders_df = spark.range(N_ORDERS).select(
    (F.abs(F.hash(F.col("id"))) % N_CUSTOMERS).alias("customer_idx")
)
orders_with_fk = orders_df.join(customer_lookup, on="customer_idx")

Setup

Requires Python 3.12 and databricks-connect>=16.4. Use uv:

uv pip install "databricks-connect>=16.4,<17.4" faker numpy pandas holidays

Common Issues

Issue Solution
ModuleNotFoundError: faker Install locally: uv pip install faker, import inside UDF
Faker UDF is slow Use pandas_udf for batch processing
Out of memory Increase numPartitions in spark.range()
Referential integrity errors Write master table to Delta first, read back for FK joins
PERSIST TABLE is not supported on serverless NEVER use .cache() or .persist() with serverless - write to Delta table first, then read back
F.window vs Window confusion Use from pyspark.sql.window import Window for row_number(), rank(), etc. F.window is for streaming only.
Broadcast variables not supported NEVER use spark.sparkContext.broadcast() with serverless

See references/2-troubleshooting.md for full troubleshooting guide.