PGStyx Docs

PGStyx writes Apache Spark DataFrames into PostgreSQL for teams that have outgrown Spark’s generic JDBC path, whether that is df.write.jdbc(...) or .write.format("jdbc"). PGStyx is the river between the Spark world and the PostgreSQL world.

Five problems show up as soon as a pipeline leaves a laptop:

  • Connection pressure. Spark parallelism can open more PostgreSQL connections than the database comfortably supports.
  • No built-in upsert. Duplicate-key handling and record refresh logic turn into custom job code.
  • Type fidelity loss. JSONB, arrays, UUID, and precise numerics need extra handling.
  • Streaming delivery is underspecified. Some pipelines need structured streaming with replay-safe delivery, including exactly-once semantics when required.
  • Schema drift becomes manual work. Schema changes should not require manual ALTER TABLE or a job restart.
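
To make the upsert item concrete: with the generic JDBC path, refresh-on-duplicate logic usually ends up as hand-built PostgreSQL `INSERT ... ON CONFLICT` statements inside the job. The sketch below shows what such custom code tends to look like. It is not PGStyx's API; the function name, table name, and column names are hypothetical, and the `%s` placeholders assume a psycopg-style driver.

```python
from typing import Sequence


def build_upsert_sql(table: str, columns: Sequence[str], key_columns: Sequence[str]) -> str:
    """Build a PostgreSQL upsert statement by hand (illustrative only).

    Identifiers are interpolated without quoting, so this is safe only for
    trusted, simple names. Real job code would also need identifier quoting.
    """
    # Non-key columns are the ones refreshed when a duplicate key is hit.
    update_columns = [c for c in columns if c not in key_columns]
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    conflict_target = ", ".join(key_columns)
    if update_columns:
        # EXCLUDED refers to the row that was proposed for insertion.
        assignments = ", ".join(f"{c} = EXCLUDED.{c}" for c in update_columns)
        action = f"DO UPDATE SET {assignments}"
    else:
        # Every column is part of the key, so there is nothing to refresh.
        action = "DO NOTHING"
    return (
        f"INSERT INTO {table} ({column_list}) VALUES ({placeholders}) "
        f"ON CONFLICT ({conflict_target}) {action}"
    )


# Hypothetical table and columns, for illustration.
sql = build_upsert_sql("events", ["id", "payload", "updated_at"], ["id"])
print(sql)
```

Each pipeline that needs this has to carry its own copy of the statement builder, plus batching and error handling around it, which is the custom job code the list item refers to.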

PGStyx is an Apache Spark 3.x / 4.x datasource named pgstyx for writing to PostgreSQL. It is PostgreSQL-specific and currently write-only.

It gives you upsert, schema evolution, JSONB and array handling, retry controls, TLS settings, structured streaming, and exactly-once semantics when the workload requires them.

PGStyx is a good fit when one or more of the following apply:

  • The target table is not append-only.
  • PostgreSQL shows connection pressure as Spark parallelism increases.
  • Semi-structured payloads need to land as JSONB or arrays.
  • Streaming jobs need replay-safe delivery, including exactly-once semantics when required.
  • Schema changes should not require manual ALTER TABLE or a job restart.
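
The connection-pressure item can be estimated with simple arithmetic. The sketch below assumes a writer that opens one connection per running task, as Spark's generic JDBC writer does per partition, and assumes `spark.task.cpus` is 1. The function names and the example cluster sizes are hypothetical.

```python
def peak_postgres_connections(num_partitions: int, num_executors: int, cores_per_executor: int) -> int:
    """Upper bound on simultaneous connections for a one-connection-per-task writer."""
    # Spark runs at most one task per core, so this is the concurrency ceiling.
    task_slots = num_executors * cores_per_executor
    # Fewer partitions than slots means only that many tasks ever run at once.
    return min(num_partitions, task_slots)


def partitions_within_budget(num_partitions: int, connection_budget: int) -> int:
    """Partition count that keeps the write within a connection budget."""
    # Never go below one partition, and never increase the partition count.
    return max(1, min(num_partitions, connection_budget))


# Hypothetical cluster: 200 partitions on 20 executors with 8 cores each.
print(peak_postgres_connections(200, 20, 8))
print(partitions_within_budget(200, 40))
```

With these example numbers the write can hold 160 connections at once, which already exceeds PostgreSQL's default `max_connections` of 100. On the generic JDBC path the usual workaround is to reduce the partition count before writing, for example with `df.coalesce(n)`, at the cost of write parallelism.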

If the JAR is not yet on your cluster, start with Install. If it is already there, go to Getting Started.