Understanding PostgreSQL Parallel Query Execution: Multi-Core Performance

When PostgreSQL introduced parallel query execution in version 9.6, it unlocked a fundamental capability: leveraging multi-core CPUs to accelerate analytical workloads. Before parallel query, PostgreSQL was strictly single-threaded—one query, one CPU core. On a 32-core server, a full table scan used 1 core at 100%, leaving 31 cores idle. For OLTP workloads (short, selective queries), this was fine. For analytics (scanning millions of rows), it was a bottleneck. Parallel query changes this by allowing operations like Seq Scan, Hash Join, and Aggregate to split work across multiple background worker processes. A Gather or Gather Merge node collects results from workers and returns them to the client. The result? 2x–4x speedup (or more) on suitable workloads. But parallel query isn’t free. It adds overhead, consumes resources, and can slow down queries if misused. This guide explains how it works, when to use it, and how to tune it.

1 The Problem: Single-Threaded Execution

The Pre-9.6 Reality

SELECT COUNT(*) FROM transactions;  -- 100 million rows

Execution (single-threaded):

┌─────────────────────────────────────┐
│  Seq Scan: transactions             │
│  ─────────────────────────────────  │
│  Rows: 100,000,000                  │
│  Time: 12,000ms                     │
│  CPU: 1 core @ 100%                 │
│  Other 31 cores: Idle               │
└─────────────────────────────────────┘

The bottleneck: Disk I/O and CPU-bound operations (filtering, aggregation) can’t be parallelized. One core does all the work.

The Parallel Solution

SET max_parallel_workers_per_gather = 4;
SELECT COUNT(*) FROM transactions;

Execution (parallel):

┌─────────────────────────────────────────────────────────┐
│                    Gather (Aggregate)                   │
│                    ──────────────────                   │
│                    Combines partial results             │
└─────────────────────────────────────────────────────────┘
           │              │              │              │
    ┌──────┴─────┐ ┌─────┴──────┐ ┌─────┴──────┐ ┌─────┴──────┐
    │  Worker 1  │ │  Worker 2  │ │  Worker 3  │ │  Worker 4  │
    │  Scan 25M  │ │  Scan 25M  │ │  Scan 25M  │ │  Scan 25M  │
    │  rows      │ │  rows      │ │  rows      │ │  rows      │
    └────────────┘ └────────────┘ └────────────┘ └────────────┘
           │              │              │              │
    ┌──────┴──────────────┴──────────────┴──────────────┴──────┐
    │              Seq Scan: transactions                       │
    │              (table split into 4 ranges)                  │
    └───────────────────────────────────────────────────────────┘
Total Time: ~3,500ms (3.4x speedup)
CPU: 4 cores @ ~90% each

Key insight: The table is divided into ranges, each worker scans a portion, and results are combined at the top.

2 Architecture: How Parallel Query Works

The Parallel Execution Stack

flowchart TD A[Client] --> B[Gather Node] B --> C[Parallel Leader] C --> D[Worker 1] C --> E[Worker 2] C --> F[Worker 3] C --> G[Worker 4] D --> H[Seq Scan: Range 1] E --> I[Seq Scan: Range 2] F --> J[Seq Scan: Range 3] G --> K[Seq Scan: Range 4] style B fill:#fce4ec,stroke:#c2185b style C fill:#e3f2fd,stroke:#1976d2 style D fill:#fff3e0,stroke:#f57c00 style E fill:#fff3e0,stroke:#f57c00 style F fill:#fff3e0,stroke:#f57c00 style G fill:#fff3e0,stroke:#f57c00

Components:

Component	Role
Parallel Leader	Main backend that coordinates workers
Parallel Workers	Background processes that execute portions of the plan
Gather Node	Collects results from workers and returns to client
Gather Merge Node	Like Gather, but preserves sort order
Shared Memory	Communication channel between leader and workers

The Gather Node: Collecting Results

The Gather node is the bridge between parallel workers and the client:

/* Simplified from src/backend/executor/nodeGather.c */
typedef struct GatherState {
    PlanState ps;
    int nworkers;              /* Number of workers */
    struct ParallelExecutorContext *pei;  /* Communication context */
    TupleTableSlot *pending_result;  /* Next row from workers */
} GatherState;
static TupleTableSlot *
ExecGather(GatherState *node)
{
    /* Initialize workers on first call */
    if (!node->workers_initialized) {
        InitializeParallelWorkers(node);
        node->workers_initialized = true;
    }
    /* Get next row from any worker */
    while (node->pending_result == NULL) {
        if (all_workers_done(node))
            return NULL;  /* No more rows */
        node->pending_result = FetchRowFromWorker(node);
    }
    /* Return row to parent */
    TupleTableSlot *result = node->pending_result;
    node->pending_result = NULL;
    return result;
}

The Contract:

Method	Purpose
`InitializeParallelWorkers()`	Launch worker processes
`FetchRowFromWorker()`	Get next row from any worker (round-robin)
`all_workers_done()`	Check if all workers finished
Gather vs. Gather Merge:
Node Type	Use Case
-----------	----------
Gather	General parallel execution
Gather Merge	Parallel execution with ORDER BY

-- Uses Gather
SELECT COUNT(*) FROM large_table;
-- Uses Gather Merge
SELECT * FROM large_table ORDER BY created_at;
-- Workers return sorted chunks; Gather Merge combines them

Parallel Workers: Background Processes

Workers are separate PostgreSQL backend processes:

postgres: parallel worker for database neo01              # Worker 1
postgres: parallel worker for database neo01              # Worker 2
postgres: parallel worker for database neo01              # Worker 3
postgres: parallel worker for database neo01              # Worker 4
postgres: neo01                                           # Parallel Leader

Key properties:

Property	Description
Separate processes	Not threads—full OS processes with own memory
Shared memory	Communicate via DSM (Dynamic Shared Memory)
Inherit transaction	Workers see the same transaction snapshot
No direct client access	Only the leader communicates with the client
Temporary	Workers exit when query completes

⚠️Process Overhead

Because workers are separate processes (not threads), there’s overhead:

Process creation: ~1-5ms per worker
DSM setup: Shared memory allocation
Context switching: OS must schedule multiple processes For queries that run in < 100ms, parallel overhead often exceeds the benefit.

3 Parallel-Aware Operators

Not all operators can run in parallel. PostgreSQL supports parallelism for specific node types:

Parallel-Capable Operators

Operator	Parallel Strategy	Speedup Potential
Seq Scan	Split table into ranges; each worker scans a range	⭐⭐⭐ High
Hash Join	Build phase: parallel hash; Probe phase: parallel probe	⭐⭐⭐ High
Aggregate	Each worker computes partial aggregate; leader combines	⭐⭐⭐ High
Bitmap Heap Scan	Split bitmap into ranges	⭐⭐ Medium
Index Scan	Limited support (index-only scans)	⭐ Low
Sort	Each worker sorts a chunk; Gather Merge combines	⭐⭐ Medium

Non-Parallel Operators

Operator	Why Not Parallel?
Nested Loop Join	Inherently sequential (inner depends on outer)
Window Functions	Require ordered input; hard to partition
CTEs (pre-PG15)	Optimization fence; executed separately
Functions (most)	Not marked `PARALLEL SAFE`
INSERT/UPDATE/DELETE	Row-level locking conflicts

💡PARALLEL Safety Levels

Functions are marked with parallel safety:

Level	Meaning	Can Run in Parallel?
`PARALLEL UNSAFE`	May modify state	❌ No
`PARALLEL RESTRICTED`	Read-only, but can’t execute parallel plans	⚠️ Leader only
`PARALLEL SAFE`	Pure read-only	✅ Yes

-- Check function parallel safety
SELECT proname, proparallel 
FROM pg_proc 
WHERE proname = 'random';  -- 'u' = UNSAFE
-- random() is UNSAFE (uses global RNG state)

Example: Parallel Seq Scan

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM transactions
WHERE created_at >= '2025-01-01';

Plan:

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=1234567.89..1234568.01 rows=1 width=8) (actual time=3521.234..3521.456 rows=1 loops=1)
   Buffers: shared hit=123456, read=567890
   ->  Gather  (cost=1234567.00..1234567.89 rows=4 width=8) (actual time=3520.123..3521.234 rows=4 loops=1)
         Workers Planned: 4
         Workers Launched: 4
         Buffers: shared hit=123456, read=567890
         ->  Partial Aggregate  (cost=1234567.00..1234567.00 rows=1 width=8) (actual time=3510.567..3510.678 rows=1 loops=4)
               Buffers: shared hit=30864, read=141972
               ->  Parallel Seq Scan on transactions  (cost=0.00..1234000.00 rows=226667 width=0) (actual time=12.345..3450.123 rows=25000000 loops=4)
                     Filter: (created_at >= '2025-01-01'::date)
                     Rows Removed by Filter: 75000000
                     Buffers: shared hit=30864, read=141972
 Planning Time: 0.456 ms
 Workers Planned: 4
 Workers Launched: 4
 JIT:
   Functions: 8
   Options: inlining false, optimization true, expression_decomposition true, code_generation true
 Execution Time: 3545.678 ms

Key observations:

Metric	Value	Meaning
`Workers Planned`	4	Planner requested 4 workers
`Workers Launched`	4	All 4 workers started (no resource limits hit)
`loops=4`	4	Each worker executed this node once
`rows=25000000`	25M per worker	Total: 100M rows scanned
`Partial Aggregate`	Per-worker	Each worker counts its portion
`Finalize Aggregate`	Leader	Combines 4 partial counts

Example: Parallel Hash Join

EXPLAIN (ANALYZE, BUFFERS)
SELECT u.name, COUNT(o.id)
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2025-01-01'
GROUP BY u.name;

Plan:

                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=5678901.23..5678912.34 rows=100000 width=40) (actual time=8234.567..8245.678 rows=95432 loops=1)
   Group Key: u.name
   Buffers: shared hit=234567, read=1234567
   ->  Gather Merge  (cost=5678901.23..5678909.87 rows=400000 width=40) (actual time=8230.123..8240.234 rows=381728 loops=1)
         Workers Planned: 4
         Workers Launched: 4
         Buffers: shared hit=234567, read=1234567
         ->  Partial GroupAggregate  (cost=5678901.23..5678905.45 rows=100000 width=40) (actual time=8200.456..8210.567 rows=95432 loops=4)
               Group Key: u.name
               Buffers: shared hit=58641, read=308641
               ->  Sort  (cost=5678901.23..5678902.34 rows=400000 width=40) (actual time=8190.345..8195.456 rows=95432 loops=4)
                     Sort Key: u.name
                     Sort Method: quicksort  Memory: 256kB
                     Buffers: shared hit=58641, read=308641
                     ->  Hash Join  (cost=123456.78..5678000.00 rows=400000 width=40) (actual time=456.789..8100.123 rows=95432 loops=4)
                           Hash Cond: (o.user_id = u.id)
                           Buffers: shared hit=58641, read=308641
                           ->  Parallel Seq Scan on orders o  (cost=0.00..5500000.00 rows=2500000 width=8) (actual time=12.345..7800.234 rows=625000 loops=4)
                                 Buffers: shared hit=45678, read=234567
                           ->  Hash  (cost=100000.00..100000.00 rows=250000 width=12) (actual time=440.123..440.234 rows=250000 loops=1)
                                 Buckets: 262144  Batches: 2  Memory Usage: 12288kB
                                 ->  Seq Scan on users u  (cost=0.00..100000.00 rows=250000 width=12) (actual time=0.123..350.456 rows=250000 loops=1)
                                       Filter: (created_at > '2025-01-01'::date)
                                       Rows Removed by Filter: 750000
                                       Buffers: shared hit=12345, read=67890
 Planning Time: 1.234 ms
 Workers Planned: 4
 Workers Launched: 4
 Execution Time: 8267.890 ms

Execution Flow:

┌─────────────────────────────────────────────────────────────────┐
│                    Finalize GroupAggregate                      │
│                    (Leader: combines 4 partial aggregates)      │
└─────────────────────────────────────────────────────────────────┘
                              │
                    ┌─────────┴─────────┐
                    │   Gather Merge    │  (Preserves sort order)
                    └─────────┬─────────┘
                              │
        ┌──────────┬──────────┼──────────┬──────────┐
        │          │          │          │          │
   ┌────┴───┐ ┌────┴───┐ ┌────┴───┐ ┌────┴───┐
   │Worker 1│ │Worker 2│ │Worker 3│ │Worker 4│
   │Partial │ │Partial │ │Partial │ │Partial │
   │Agg     │ │Agg     │ │Agg     │ │Agg     │
   └────┬───┘ └────┬───┘ └────┬───┘ └────┬───┘
        │          │          │          │
        └──────────┴──────────┴──────────┘
                   │
         ┌─────────┴─────────┐
         │   Parallel Hash   │
         │   (Shared hash    │
         │    table in DSM)  │
         └─────────┬─────────┘
                   │
        ┌──────────┴──────────┐
        │                     │
   ┌────┴────┐          ┌─────┴─────┐
   │Parallel │          │   Hash    │
   │Scan     │          │   Build   │
   │orders   │          │   (users) │
   └─────────┘          └───────────┘

Key insight: The hash table is built once (by the leader or workers collectively) and stored in shared memory. All workers probe the same hash table.

4 Configuration: Tuning Parallel Query

PostgreSQL provides several GUCs (Grand Unified Configuration) to control parallel query:

Core Parameters

Parameter	Default	Description
`max_parallel_workers_per_gather`	2	Max workers per Gather node
`max_parallel_workers`	8	Total workers available system-wide
`max_parallel_maintenance_workers`	2	Workers for maintenance (VACUUM, CREATE INDEX)
`parallel_leader_participation`	on	Whether leader also works
`min_parallel_table_scan_size`	8MB	Minimum table size for parallel scan
`min_parallel_index_scan_size`	512kB	Minimum index size for parallel scan
`parallel_setup_cost`	1000.0	Planner cost estimate for parallel setup
`parallel_tuple_cost`	0.1	Planner cost per tuple transferred

Parameter Deep Dive

`max_parallel_workers_per_gather`

-- Default: 2
SET max_parallel_workers_per_gather = 4;
-- For a single large query
SET LOCAL max_parallel_workers_per_gather = 8;
SELECT COUNT(*) FROM huge_table;

Trade-off:

Value	Pros	Cons
Low (1-2)	Less overhead, more predictable	Limited speedup
High (8+)	Maximum parallelism	Diminishing returns, resource contention
Rule of thumb: Start with 4, measure, adjust.

`max_parallel_workers`

-- System-wide limit (postgresql.conf)
max_parallel_workers = 16  -- Total workers across all queries
-- Check current usage
SELECT count(*) 
FROM pg_stat_activity 
WHERE backend_type = 'parallel worker';

Relationship:

max_parallel_workers_per_gather ≤ max_parallel_workers
Example:
  max_parallel_workers = 8
  max_parallel_workers_per_gather = 4
  → Max 2 concurrent parallel queries (8 / 4 = 2)

`parallel_leader_participation`

-- Default: on (leader also does work)
SET parallel_leader_participation = on;
-- Leader coordinates AND scans a portion of the table

When to disable:

Scenario	Setting	Why
Many concurrent queries	`off`	Leader handles coordination, workers do all scanning
Small number of workers	`on`	Maximize parallelism (leader + workers)
I/O-bound workloads	`on`	Leader can help with I/O
CPU-bound workloads	`off`	Leader focuses on coordination

`min_parallel_table_scan_size`

-- Default: 8MB
SET min_parallel_table_scan_size = 16MB;
-- Tables smaller than 16MB won't use parallel scan

Why it exists: Parallel overhead (~5-10ms) exceeds benefit for small tables. Tuning:

Workload	Recommended Value
OLTP (small queries)	32MB+ (discourage parallelism)
Analytics (large scans)	4-8MB (encourage parallelism)
Mixed	8-16MB (balance)

Planner Cost Parameters

The planner uses cost estimates to decide whether parallelism is worthwhile:

-- Default costs
parallel_setup_cost = 1000.0    -- Equivalent to ~10ms of work
parallel_tuple_cost = 0.1       -- Per-row transfer cost

How the planner decides:

Total Parallel Cost = parallel_setup_cost 
                    + (parallel_tuple_cost × rows)
                    + (scan_cost / workers)
If Parallel Cost < Serial Cost → Choose Parallel

Tuning:

-- Make parallelism more attractive
SET parallel_setup_cost = 500.0;   -- Lower setup penalty
SET parallel_tuple_cost = 0.05;    -- Lower transfer cost
-- Make parallelism less attractive
SET parallel_setup_cost = 2000.0;  -- Higher setup penalty

💡Debugging Planner Decisions

Use EXPLAIN (VERBOSE, COSTS) to see why the planner chose (or rejected) parallelism:

EXPLAIN (VERBOSE, COSTS)
SELECT COUNT(*) FROM transactions;
-- Look for:
-- "Workers Planned: 0" → Planner rejected parallelism
-- "Workers Planned: 4" → Planner chose parallel

5 When Parallel Query Helps (and When It Hurts)

Ideal Workloads for Parallel Query

Workload	Characteristics	Expected Speedup
Large table scans	>100MB tables, selective filters	2x-4x
Bulk aggregations	COUNT, SUM, AVG over millions of rows	3x-5x
Hash joins	Large fact tables joining to dimensions	2x-4x
Index-only scans	Covering indexes on large tables	2x-3x
Parallel CREATE INDEX	`CREATE INDEX CONCURRENTLY` (PG11+)	2x-3x
Example: Large Aggregation

-- Before: 12 seconds (single-threaded)
-- After: 3.5 seconds (4 workers)
SET max_parallel_workers_per_gather = 4;
SELECT 
    DATE_TRUNC('hour', created_at) as hour,
    COUNT(*) as transactions,
    SUM(amount) as total_amount
FROM transactions
WHERE created_at >= '2025-01-01'
GROUP BY hour;

When Parallel Query Hurts

Scenario	Why It Hurts	Alternative
Small tables (< 10MB)	Overhead > benefit	Disable parallelism
OLTP queries (< 100ms)	Setup time dominates	Keep `max_parallel_workers_per_gather = 0`
Many concurrent queries	Worker starvation	Limit `max_parallel_workers`
Memory-bound workloads	Workers compete for cache	Reduce workers
Functions marked UNSAFE	Falls back to serial	Mark functions SAFE if possible
Example: Parallel Overhead

-- Small table: parallel is SLOWER
SELECT COUNT(*) FROM small_table;  -- 50ms serial, 80ms parallel
-- Why?
-- - Process creation: 5ms × 4 workers = 20ms
-- - DSM setup: 10ms
-- - Coordination: 5ms
-- - Actual scan: 5ms (table is tiny)
-- Total parallel: 40ms overhead + 5ms scan = 45ms+ (vs 50ms serial)

⚠️The OLTP Trap

For OLTP workloads (short, frequent queries), parallel query often degrades performance:

-- Typical OLTP query
SELECT * FROM users WHERE id = 12345;  -- Uses index, returns 1 row
-- Parallel overhead: 30-50ms
-- Serial execution: 0.5ms
-- Result: 60-100x slower with parallelism!

Solution: Disable parallelism for OLTP:

-- In postgresql.conf for OLTP databases
max_parallel_workers_per_gather = 0
max_parallel_workers = 0

6 Debugging Parallel Query Issues

Common Problems

Problem 1: Workers Not Launched

Workers Planned: 4
Workers Launched: 2  -- Only 2 started!

Causes:

Cause	Check	Fix
Worker limit hit	`SHOW max_parallel_workers;`	Increase limit
Memory pressure	Check `work_mem × workers`	Reduce `max_parallel_workers`
Table too small	Check table size vs `min_parallel_table_scan_size`	Lower threshold or accept serial
Function is UNSAFE	Check `proparallel` in `pg_proc`	Mark SAFE or rewrite

Problem 2: Parallel Scan Falls Back to Index Scan

-- Expected: Parallel Seq Scan
-- Actual: Index Scan (serial)

Why: The planner determined index scan is cheaper. Force parallel scan (for testing):

SET enable_indexscan = off;
SET enable_seqscan = on;
SET max_parallel_workers_per_gather = 4;
EXPLAIN ANALYZE
SELECT COUNT(*) FROM large_table;

⚠️Don't Force in Production

Disabling index scans is for debugging only. The planner usually knows best. If parallel seq scan is truly better, adjust cost parameters instead.

Problem 3: Uneven Worker Load

Worker 1: 50M rows, 3000ms
Worker 2: 30M rows, 1800ms
Worker 3: 15M rows, 900ms
Worker 4: 5M rows, 300ms

Why: Data skew (workers scan different-sized ranges). Impact: Slowest worker determines total time (Amdahl’s Law). Mitigation:

Strategy	How
Better data distribution	Reorganize table (PARTITION BY)
More workers	More granular ranges
Accept imbalance	Often not worth optimizing

Monitoring Parallel Query

-- Check active parallel workers
SELECT 
    pid,
    usename,
    query,
    backend_type
FROM pg_stat_activity
WHERE backend_type = 'parallel worker';
-- Check parallel query statistics (PG14+)
SELECT 
    datname,
    numbackends,
    xact_commit,
    xact_rollback
FROM pg_stat_database;
-- View parallel worker memory usage
SELECT 
    pid,
    usename,
    query,
    temp_blks_read,
    temp_blks_written
FROM pg_stat_activity
WHERE query LIKE '%parallel%';

7 Advanced: Parallel Query Internals

Dynamic Shared Memory (DSM)

Workers communicate via Dynamic Shared Memory:

/* Simplified from src/backend/storage/ipc/dsm.c */
typedef struct dsm_segment {
    dsm_handle handle;      /* Unique identifier */
    void *mapped_address;   /* Mapped memory address */
    Size size;              /* Segment size */
} dsm_segment;
/* Create shared memory segment */
dsm_segment *
dsm_create(Size size, int flags)
{
    /* Allocate shared memory */
    /* Return segment handle */
}
/* Attach to shared memory (workers) */
void *
dsm_attach(dsm_handle handle)
{
    /* Map shared memory into worker's address space */
}

What’s stored in DSM:

Data	Size	Purpose
Query plan	~10-100KB	Workers need to know what to execute
Table scan ranges	~1KB per worker	Which blocks each worker scans
Hash tables	Variable (MBs)	Shared hash for parallel hash join
Partial aggregates	~1KB per worker	Intermediate results
Tuple queues	Variable	Rows transferred from workers to leader

Parallel Context Switching

┌─────────────────────────────────────────────────────────────┐
│                     Parallel Leader                         │
│  ────────────────────────────────────────────────────────   │
│  1. Parse & analyze query                                   │
│  2. Generate parallel-aware plan                            │
│  3. Create DSM segment                                      │
│  4. Launch worker processes                                 │
│  5. Send plan + context to workers via DSM                  │
│  6. Wait for workers to start                               │
│  7. Begin execution (Gather node pulls from workers)        │
│  8. Collect results, return to client                       │
│  9. Clean up DSM, terminate workers                         │
└─────────────────────────────────────────────────────────────┘

Timeline:

0ms     5ms     10ms    15ms    20ms    100ms   3500ms
│───────│───────│───────│───────│───────│───────│───────│
│ DSM   │Worker │Workers│First  │       │       │Final  │
│ setup │launch │ready  │row    │Scan   │Scan   │result │
│       │       │       │       │       │       │       │
└───────────────┘
    Overhead (~20ms)

8 Practical Tuning Guide

OLTP Database

-- postgresql.conf
max_parallel_workers_per_gather = 0  -- Disable parallel query
max_parallel_workers = 0
max_parallel_maintenance_workers = 1  -- Keep for maintenance
-- Why: OLTP queries are short; parallel overhead hurts more than helps

Analytics Database

-- postgresql.conf
max_parallel_workers_per_gather = 8   -- Maximize parallelism
max_parallel_workers = 32             -- Support multiple concurrent queries
max_parallel_maintenance_workers = 4  -- Parallel VACUUM, CREATE INDEX
min_parallel_table_scan_size = 4MB    -- Encourage parallelism
parallel_setup_cost = 500.0           -- Lower barrier to parallelism
parallel_tuple_cost = 0.05
-- Per-query (for critical reports)
SET LOCAL max_parallel_workers_per_gather = 16;
SELECT /* complex analytical query */;

Mixed Workload

-- postgresql.conf
max_parallel_workers_per_gather = 4   -- Moderate parallelism
max_parallel_workers = 16             -- Support ~4 concurrent parallel queries
max_parallel_maintenance_workers = 2
min_parallel_table_scan_size = 16MB   -- Only parallelize large scans
-- Application-level tuning
-- OLTP queries: Use default (4 workers)
-- Analytical queries: SET LOCAL max_parallel_workers_per_gather = 8

Tuning Checklist

□ Check current settings:
  SHOW max_parallel_workers_per_gather;
  SHOW max_parallel_workers;
  SHOW min_parallel_table_scan_size;
□ Identify candidate queries:
  -- Find queries with sequential scans on large tables
  SELECT query, total_exec_time, calls
  FROM pg_stat_statements
  WHERE query LIKE '%Seq Scan%'
  ORDER BY total_exec_time DESC;
□ Test with parallelism:
  SET max_parallel_workers_per_gather = 4;
  EXPLAIN (ANALYZE, BUFFERS) <query>;
□ Measure speedup:
  -- Compare serial vs parallel execution time
  -- Check Workers Launched matches Workers Planned
□ Adjust based on results:
  -- Good speedup (2x+): Keep settings
  -- Marginal speedup (< 1.5x): Reduce workers or disable
  -- Slower: Disable parallelism for this query type

Summary: Parallel Query in One Diagram

flowchart TD subgraph "Parallel Query Execution" A[Client] --> B[Gather Node] B --> C[Parallel Leader] C --> D[Worker 1] C --> E[Worker 2] C --> F[Worker 3] C --> G[Worker 4] D --> H[Parallel Seq Scan] E --> I[Parallel Seq Scan] F --> J[Parallel Seq Scan] G --> K[Parallel Seq Scan] H --> L[Table Range 1] I --> M[Table Range 2] J --> N[Table Range 3] K --> O[Table Range 4] end subgraph "Configuration" P[max_parallel_workers_per_gather] Q[max_parallel_workers] R[min_parallel_table_scan_size] end subgraph "Best For" S[Large table scans] T[Bulk aggregations] U[Hash joins] end style B fill:#fce4ec,stroke:#c2185b style D fill:#fff3e0,stroke:#f57c00 style E fill:#fff3e0,stroke:#f57c00 style F fill:#fff3e0,stroke:#f57c00 style G fill:#fff3e0,stroke:#f57c00 style P fill:#e3f2fd,stroke:#1976d2 style Q fill:#e3f2fd,stroke:#1976d2 style R fill:#e3f2fd,stroke:#1976d2

Key Takeaways:

Aspect	Parallel Query
Architecture	Leader + workers via DSM
Gather node	Collects results from workers
Gather Merge	Preserves sort order
Best for	Large scans, aggregations, hash joins
Avoid for	OLTP, small tables, sub-100ms queries
Speedup	2x-4x typical, up to 10x for ideal workloads
Overhead	~20-50ms setup time

Further Reading:

PostgreSQL Docs: “Parallel Query”
Robert Haas: “Parallel Query in PostgreSQL” (2015)
Andres Freund: “Parallelism in PostgreSQL”
Source: src/backend/executor/nodeGather.c

Comments

Please accept the "Functionality" cookie category to view and post comments.