Tuning and Metrics

Two concerns share this page because they share one vocabulary: batches, retries, pools, counters. Use this page once the job works and you need it to run faster or be easier to observe.

batchSize (default 1000) controls how many rows PGStyx buffers before it flushes them to PostgreSQL.

Larger batches trade memory for fewer round trips. Typical tuning range is 500 to 5000. Above that, returns usually flatten while per-task memory pressure rises.

df.write
  .format("pgstyx")
  .option("batchSize", "5000")
  // ...
  .save()
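To see why returns flatten at larger batch sizes, it can help to count flushes. The following is a minimal sketch, assuming one flush corresponds to roughly one round trip and using a hypothetical partition of one million rows; `BatchMath` and `flushes` are illustrative names, not part of PGStyx.

```scala
object BatchMath {
  // Number of flushes needed to write `rows` rows when `batchSize` rows are
  // buffered per flush. The last flush may be partial, hence the ceiling.
  def flushes(rows: Long, batchSize: Int): Long = {
    require(rows >= 0 && batchSize > 0, "rows must be >= 0 and batchSize > 0")
    (rows + batchSize - 1) / batchSize
  }

  def main(args: Array[String]): Unit = {
    val rowsPerPartition = 1000000L // hypothetical partition size
    for (size <- Seq(500, 1000, 5000, 20000))
      println(s"batchSize=$size -> ${flushes(rowsPerPartition, size)} flushes")
  }
}
```

Under these assumptions, moving from 1000 to 5000 removes 800 flushes per partition, while moving from 5000 to 20000 removes only 150 more and buffers four times as many rows per task.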

The pool settings determine how many database connections PGStyx can hold and how long idle connections stay around.

That matters for capacity planning:

maxPoolSize defaults to spark.executor.cores when Spark provides it, or 2 otherwise. High core counts can become high connection counts quickly.

| Option | Default | Notes |
| --- | --- | --- |
| maxPoolSize | spark.executor.cores (fallback: 2) | Hard ceiling on concurrent connections per active Spark process |
| minIdle | 0 | Warm connections kept open |
| connectionTimeout | 30000 ms | How long a task waits for a connection before failing |
| idleTimeout | 30000 ms | When idle connections close |
| maxLifetime | 1800000 ms | When a connection is retired and replaced |

All five pool options are available on every plan.
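One way to keep the pool settings in one place is a plain options map that is merged into whatever the job already passes to `.options(...)`. This is a sketch only: it assumes durations are passed as plain millisecond strings, and `PoolOptions` and `poolOpts` are hypothetical names. The `maxPoolSize` value of 4 is an illustrative choice; the other values are the documented defaults.

```scala
object PoolOptions {
  // Pool settings as string options. Only maxPoolSize differs from the
  // documented defaults here, and its value is illustrative.
  val poolOpts: Map[String, String] = Map(
    "maxPoolSize"       -> "4",       // cap connections per Spark process
    "minIdle"           -> "0",       // documented default
    "connectionTimeout" -> "30000",   // ms, documented default
    "idleTimeout"       -> "30000",   // ms, documented default
    "maxLifetime"       -> "1800000"  // ms, documented default
  )
}
```

The map could then be combined with the job's existing connection options using `++` before the write.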

If the database is connection-bound, reduce the number of writer tasks before you change anything else:

df.coalesce(n)
  .write
  .format("pgstyx")
  .options(opts)
  .save()

Rule of thumb:

n ≤ (postgres_max_connections − headroom) / maxPoolSize

Prefer coalesce over repartition when reducing partitions. It merges existing partitions and avoids a full shuffle.
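The rule of thumb can be turned into a small helper. This is a minimal sketch: `WriterSizing` and `maxWriterTasks` are hypothetical names, and flooring the result at 1 (so that `coalesce` always receives a valid partition count) is a design choice of this example rather than something PGStyx prescribes.

```scala
object WriterSizing {
  // Largest n that satisfies
  //   n <= (postgresMaxConnections - headroom) / maxPoolSize
  // using integer division, and never less than 1.
  def maxWriterTasks(postgresMaxConnections: Int, headroom: Int, maxPoolSize: Int): Int = {
    require(maxPoolSize > 0, "maxPoolSize must be positive")
    math.max(1, (postgresMaxConnections - headroom) / maxPoolSize)
  }
}
```

For example, with a PostgreSQL limit of 100 connections, 20 reserved as headroom, and maxPoolSize of 4, `WriterSizing.maxWriterTasks(100, 20, 4)` gives 20, which would be the value passed to `coalesce`.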

Default: three total attempts with exponential backoff. With retryBackoffMs=1000 and retryBackoffMultiplier=2.0, the delays are 1 second, then 2 seconds.
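The delay schedule can be worked out ahead of time. The sketch below assumes the attempt count includes the first try (so three attempts produce two delays, matching the default above) and that delays are rounded to whole milliseconds; `Backoff` and `delaysMs` are illustrative names, not PGStyx API.

```scala
object Backoff {
  // Delays in milliseconds between consecutive attempts. The delay before
  // retry number i (0-based) is backoffMs * multiplier^i. There is no delay
  // before the first attempt, so totalAttempts yields totalAttempts - 1 delays.
  def delaysMs(totalAttempts: Int, backoffMs: Long, multiplier: Double): Seq[Long] =
    (0 until math.max(0, totalAttempts - 1)).map { i =>
      math.round(backoffMs * math.pow(multiplier, i))
    }
}
```

Under these assumptions, `Backoff.delaysMs(3, 1000L, 2.0)` gives 1000 and 2000 ms, and five attempts with a 2000 ms base and a 1.5 multiplier would wait 2000, 3000, 4500, and 6750 ms.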

Only these failures retry:

  • SQLSTATE 08000, 08003, 08006 — connection errors.
  • SQLSTATE 40001, 40P01 — serialization failure, deadlock.
  • SQLSTATE 53000, 53100, 53200, 53300, 53400 — resource errors.

Everything else throws on the first failure. Constraint violations (23xxx), permission errors (42xxx), and type mismatches do not retry.
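The retry rules above amount to an allow-list of SQLSTATE codes. The following sketch shows how such a check might look in application code, for example to decide whether a failed job is worth resubmitting. `RetryPolicy` and `isRetryable` are hypothetical names; this is not PGStyx's internal implementation.

```scala
import java.sql.SQLException

object RetryPolicy {
  // SQLSTATE codes listed as retryable in this page.
  private val retryableStates: Set[String] = Set(
    "08000", "08003", "08006",                 // connection errors
    "40001", "40P01",                          // serialization failure, deadlock
    "53000", "53100", "53200", "53300", "53400" // resource errors
  )

  // True only for codes on the allow-list; a missing SQLSTATE is not retryable.
  def isRetryable(sqlState: String): Boolean =
    sqlState != null && retryableStates.contains(sqlState)

  // Convenience overload for JDBC exceptions.
  def isRetryable(e: SQLException): Boolean = isRetryable(e.getSQLState)
}
```

A unique-constraint violation such as SQLSTATE 23505 falls outside the set, so `RetryPolicy.isRetryable("23505")` is false, while a deadlock (`40P01`) returns true.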

df.write
  .format("pgstyx")
  .option("maxRetries", "5")
  .option("retryBackoffMs", "2000")
  .option("retryBackoffMultiplier", "1.5")
  // ...
  .save()

After maxRetries attempts, the original exception is wrapped in PGStyxException with a message that starts with "Operation failed after N retries:".

While metricsEnabled=true (the default), PGStyx tracks four counters:

  • pgstyx.rows.written
  • pgstyx.rows.filtered
  • pgstyx.retries
  • pgstyx.errors

Read the latest batch report with Metrics.getReport():

import com.pgstyx.metrics.Metrics

df.write
  .format("pgstyx")
  .options(opts)
  .save()

println(Metrics.getReport())

The report is a multi-line string:

PGStyx Metrics Report:
Rows Written: 1234567
Rows Filtered: 42
Retries: 0
Errors: 0
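If the counters need to go to a log aggregator or an alert rule, the report string can be parsed into numbers. This is a sketch that assumes the report keeps the "Label: value" layout shown above; `ReportParser` is a hypothetical helper, not part of PGStyx.

```scala
object ReportParser {
  // Parses "Label: value" lines into a map of label -> count.
  // The header line and any line without a purely numeric value are skipped.
  def parse(report: String): Map[String, Long] =
    report.linesIterator.flatMap { line =>
      line.split(":", 2) match {
        case Array(label, value) if value.trim.nonEmpty && value.trim.forall(_.isDigit) =>
          Some(label.trim -> value.trim.toLong)
        case _ => None
      }
    }.toMap
}
```

A caller could then check, for instance, `ReportParser.parse(report).getOrElse("Errors", 0L) > 0` and fail the job or raise an alert.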

Metrics.getReport() must be called on the driver after the write completes. Calling it from an executor throws IllegalStateException.

metricsEnabled=false stops counter updates. Metrics.getReport() then returns zeros. Disabling metrics is only useful when per-row overhead matters; the updates are cheap but not free.