Tuning and Metrics
Two concerns share this page because they share one vocabulary: batches, retries, pools, counters. Use it after the job works, when you need it to be faster or more visible.
Batch size
batchSize (default 1000) controls how many rows PGStyx buffers before it flushes them to PostgreSQL.
Larger batches trade memory for fewer round trips. Typical tuning range is 500 to 5000. Above that, returns usually flatten while per-task memory pressure rises.
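To see why returns flatten, it helps to count flushes. The sketch below is only arithmetic, not PGStyx code: it assumes each flush costs roughly one round trip, and the object name `BatchMath` and the 1,000,000-row task are hypothetical.

```scala
object BatchMath {
  // Number of flushes needed to write `rows` rows from one task
  // when each flush carries at most `batchSize` rows.
  // Assumption: one flush is roughly one round trip to PostgreSQL.
  def flushes(rows: Long, batchSize: Int): Long = {
    require(rows >= 0, "rows must be non-negative")
    require(batchSize > 0, "batchSize must be positive")
    (rows + batchSize - 1) / batchSize // ceiling division
  }

  def main(args: Array[String]): Unit = {
    val rowsPerTask = 1000000L // hypothetical task size
    Seq(500, 1000, 5000, 20000).foreach { size =>
      println(s"batchSize=$size -> ${flushes(rowsPerTask, size)} flushes")
    }
  }
}
```

For this hypothetical task the counts are 2000, 1000, 200, and 50. Going from 1000 to 5000 removes 800 flushes; going from 5000 to 20000 removes only 150 more while quadrupling the buffer.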
```scala
df.write
  .format("pgstyx")
  .option("batchSize", "5000")
  // ...
  .save()
```
Connection pool
The pool settings determine how many database connections PGStyx can hold and how long idle connections stay around.
That matters for capacity planning:
maxPoolSize defaults to spark.executor.cores when Spark provides it, or 2 otherwise. High core counts can become high connection counts quickly.
| Option | Default | Notes |
|---|---|---|
| maxPoolSize | spark.executor.cores (fallback: 2) | Hard ceiling on concurrent connections per active Spark process |
| minIdle | 0 | Warm connections kept open |
| connectionTimeout | 30000 ms | How long a task waits for a connection before failing |
| idleTimeout | 30000 ms | When idle connections close |
| maxLifetime | 1800000 ms | When a connection is retired and replaced |
All five pool options are available on every plan.
If the database is connection-bound, reduce the number of writer tasks before you change anything else:
```scala
df.coalesce(n)
  .write
  .format("pgstyx")
  .options(opts)
  .save()
```
Rule of thumb:
```
n ≤ (postgres_max_connections − headroom) / maxPoolSize
```
Prefer coalesce when reducing partitions. It avoids a full shuffle.
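The rule of thumb can be worked through with concrete numbers. This is a minimal sketch of the arithmetic only; the object name `WriterBudget` and the example values (100 connections, 20 reserved, pool size 4) are illustrative assumptions, not PGStyx defaults.

```scala
object WriterBudget {
  // Largest n that satisfies
  //   n <= (postgres_max_connections - headroom) / maxPoolSize
  def maxWriters(postgresMaxConnections: Int, headroom: Int, maxPoolSize: Int): Int = {
    require(maxPoolSize > 0, "maxPoolSize must be positive")
    // Never report a negative budget when headroom exceeds the limit.
    val available = math.max(0, postgresMaxConnections - headroom)
    available / maxPoolSize // integer division rounds down
  }

  def main(args: Array[String]): Unit = {
    // Illustrative: max_connections = 100, 20 kept for other clients, maxPoolSize = 4.
    println(maxWriters(100, 20, 4)) // prints 20
  }
}
```

The result is an upper bound for the value passed to coalesce; headroom should cover every other client of the same database.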
Retries
Default: three total attempts with exponential backoff. With retryBackoffMs=1000 and retryBackoffMultiplier=2.0, the delays are 1 second, then 2 seconds.
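The delay schedule follows from the two backoff options. The sketch below reproduces the arithmetic under the assumption that the delay before retry i is retryBackoffMs × retryBackoffMultiplier^i with no jitter; the `Backoff` object is hypothetical, and how PGStyx rounds delays internally is not stated here.

```scala
object Backoff {
  // Delays in milliseconds between consecutive attempts:
  // base, base * m, base * m^2, ...
  // `totalAttempts` attempts leave `totalAttempts - 1` gaps to wait in.
  def delaysMs(totalAttempts: Int, backoffMs: Long, multiplier: Double): Seq[Long] =
    (0 until math.max(0, totalAttempts - 1)).map { i =>
      math.round(backoffMs * math.pow(multiplier, i.toDouble))
    }

  def main(args: Array[String]): Unit = {
    // Defaults described above: three attempts, 1000 ms base, multiplier 2.0.
    println(delaysMs(3, 1000L, 2.0).mkString(", ")) // prints 1000, 2000
  }
}
```

Summing the sequence gives the worst-case time a task spends waiting before it fails, which is useful when choosing values that must fit inside a job timeout.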
Only these failures retry:
- SQLSTATE 08000, 08003, 08006: connection errors.
- SQLSTATE 40001, 40P01: serialization failure, deadlock.
- SQLSTATE 53000, 53100, 53200, 53300, 53400: resource errors.
Everything else throws on the first failure. Constraint violations (23xxx), permission errors (42xxx), and type mismatches do not retry.
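If your own code needs to predict whether a failure would have been retried, for example when triaging logs, the list above translates into a small lookup. This is a sketch under that assumption; `RetryPolicy` and `isRetryable` are hypothetical names, not part of the PGStyx API.

```scala
import java.sql.SQLException

object RetryPolicy {
  // SQLSTATE codes listed above as retryable.
  val retryableStates: Set[String] = Set(
    "08000", "08003", "08006",                  // connection errors
    "40001", "40P01",                           // serialization failure, deadlock
    "53000", "53100", "53200", "53300", "53400" // resource errors
  )

  // getSQLState may return null, so wrap it in Option before the lookup.
  def isRetryable(e: SQLException): Boolean =
    Option(e.getSQLState).exists(retryableStates.contains)
}
```

A unique-constraint violation (SQLSTATE 23505) is not in the set, so `isRetryable` returns false for it, matching the rule that constraint violations fail on the first attempt.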
```scala
df.write
  .format("pgstyx")
  .option("maxRetries", "5")
  .option("retryBackoffMs", "2000")
  .option("retryBackoffMultiplier", "1.5")
  // ...
  .save()
```
After maxRetries attempts, the original exception is wrapped in PGStyxException with a message starting "Operation failed after N retries:".
Metrics
While metricsEnabled=true (the default), PGStyx tracks four counters:
- pgstyx.rows.written
- pgstyx.rows.filtered
- pgstyx.retries
- pgstyx.errors
Read the latest batch report with Metrics.getReport():
```scala
import com.pgstyx.metrics.Metrics

df.write
  .format("pgstyx")
  .options(opts)
  .save()

println(Metrics.getReport())
```
The report is a multi-line string:
```
PGStyx Metrics Report:
Rows Written: 1234567
Rows Filtered: 42
Retries: 0
Errors: 0
```
Metrics.getReport() must be called on the driver after the write completes. Calling it from an executor throws IllegalStateException.
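Because the report is a plain string, checking it in code means parsing it. The sketch below assumes the report keeps the "Label: number" shape of the example, one counter per line; `ReportParser` is a hypothetical helper, not something PGStyx ships.

```scala
object ReportParser {
  // Parses "Label: 123" lines into a map. Lines without a numeric value,
  // such as the "PGStyx Metrics Report:" header, are skipped.
  def parse(report: String): Map[String, Long] =
    report.linesIterator.flatMap { line =>
      line.split(":", 2) match {
        case Array(label, value)
            if value.trim.nonEmpty && value.trim.forall(_.isDigit) =>
          Some(label.trim -> value.trim.toLong)
        case _ => None
      }
    }.toMap
}
```

On the driver, something like `ReportParser.parse(Metrics.getReport()).getOrElse("Errors", 0L)` could then feed an alert or a job-level assertion. If the report layout differs from the example, the parser returns fewer keys instead of failing.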
Turning metrics off
metricsEnabled=false stops counter updates. Metrics.getReport() then returns zeros. This is only useful when per-row overhead matters; the updates are cheap but not free.