PGStyx Docs
PGStyx writes Apache Spark DataFrames into PostgreSQL for teams that have outgrown Spark’s generic JDBC path, whether that is df.write.jdbc(...) or .write.format("jdbc"). PGStyx is the river between the Spark world and the PostgreSQL world.
Why not stock JDBC
Five problems show up as soon as a pipeline leaves a laptop:
- Connection pressure. Spark parallelism can open more PostgreSQL connections than the database comfortably supports.
- No built-in upsert. Duplicate-key handling and record refresh logic turn into custom job code.
- Type fidelity loss. JSONB, arrays, UUID, and precise numerics need extra handling.
- Streaming delivery is underspecified. Some pipelines need structured streaming with replay-safe delivery, including exactly-once semantics when required.
- Schema drift becomes manual work. Schema changes should not require manual ALTER TABLE or a job restart.
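To make the upsert point concrete: with stock JDBC, a job usually has to assemble PostgreSQL's INSERT ... ON CONFLICT statement itself and execute it per partition. The sketch below shows what that hand-written code tends to look like. It is an illustration of the custom work, not PGStyx's implementation; the helper name build_upsert_sql and the table and column names are hypothetical, the %s placeholders follow the psycopg convention, and identifiers are assumed to come from trusted configuration because no quoting is applied.

```python
from typing import Sequence


def build_upsert_sql(table: str, columns: Sequence[str], key_columns: Sequence[str]) -> str:
    """Build a parameterized PostgreSQL upsert statement.

    Hypothetical helper for illustration only. Identifiers are interpolated
    without quoting, so they must come from trusted configuration.
    """
    if not key_columns:
        raise ValueError("at least one key column is required")
    missing = [k for k in key_columns if k not in columns]
    if missing:
        raise ValueError(f"key columns not in column list: {missing}")

    col_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    conflict_target = ", ".join(key_columns)

    # Non-key columns are refreshed from the incoming row, which PostgreSQL
    # exposes as the EXCLUDED pseudo-table inside ON CONFLICT.
    updates = [f"{c} = EXCLUDED.{c}" for c in columns if c not in key_columns]
    action = "DO UPDATE SET " + ", ".join(updates) if updates else "DO NOTHING"

    return (
        f"INSERT INTO {table} ({col_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({conflict_target}) {action}"
    )


sql = build_upsert_sql("events", ["id", "payload", "updated_at"], ["id"])
print(sql)
```

Every job that needs this also has to decide on batching, transaction boundaries, and retries, which is the part that tends to grow into custom job code.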
What PGStyx is
An Apache Spark 3.x / 4.x datasource named pgstyx for writing to PostgreSQL. It is PostgreSQL-specific and currently write-only.
It gives you upsert, schema evolution, JSONB and array handling, retry controls, TLS settings, structured streaming, and exactly-once semantics when the workload requires them.
When to reach for PGStyx
- The target table is not append-only.
- PostgreSQL shows connection pressure as Spark parallelism increases.
- Semi-structured payloads need to land as JSONB or arrays.
- Streaming jobs need replay-safe delivery, including exactly-once semantics when required.
- Schema changes should not require manual ALTER TABLE or a job restart.
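A quick way to check the connection-pressure criterion is to estimate how many connections a stock JDBC write can hold open at once. The sketch below assumes one connection per running write task, which is how Spark's generic JDBC writer behaves, and compares the result with PostgreSQL's documented default max_connections of 100. The function name and the cluster numbers are hypothetical; substitute your own settings.

```python
def peak_write_connections(
    num_partitions: int,
    executors: int,
    cores_per_executor: int,
    task_cpus: int = 1,
) -> int:
    """Upper bound on simultaneous connections for one JDBC write stage.

    Assumes one connection per running task. Concurrency is capped by the
    number of task slots: executors * (cores per executor // spark.task.cpus).
    """
    slots = executors * (cores_per_executor // task_cpus)
    return min(num_partitions, slots)


# Hypothetical cluster: 200 partitions, 40 executors with 4 cores each.
peak = peak_write_connections(num_partitions=200, executors=40, cores_per_executor=4)
pg_max_connections = 100  # PostgreSQL default; check SHOW max_connections on your server

print(f"peak connections: {peak}, server limit: {pg_max_connections}")
if peak > pg_max_connections:
    print("write stage can exceed the server limit")
```

If the estimate is near or above the server limit, and other clients share the database, the write is a candidate for the criteria above.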
Start with Install if the JAR is not yet on your cluster. Go to Getting Started if the JAR is already on the cluster.