Thursday, February 12, 2026

BullMQ Python vs RQ Performance Benchmark

Python developers have long relied on RQ (Redis Queue) as the go-to solution for background job processing with Redis. It's simple, well-documented, and has been around since 2012. But as Python applications scale, the question arises: is there a faster alternative?

BullMQ Python is a native Python port of the popular BullMQ library, bringing the same battle-tested architecture that powers millions of jobs in production Node.js applications. With its async-first design and optimized Lua scripts, BullMQ Python promises significant performance improvements over traditional synchronous queues.

In this benchmark, we'll put both libraries head-to-head to see how they perform in real-world scenarios.

Architecture Differences

Before diving into the numbers, it's important to understand the fundamental architectural differences between these two libraries:

RQ (Redis Queue)

  • Synchronous, blocking architecture
  • One job processed at a time per worker
  • Simple and straightforward design
  • To scale, you run multiple worker processes

BullMQ Python

  • Asynchronous, non-blocking architecture (built on asyncio)
  • Configurable concurrency per worker
  • Lua scripts for atomic Redis operations
  • Single worker can handle many concurrent jobs

These design choices have real implications. RQ's simplicity is its strength for basic use cases, but BullMQ's async architecture allows a single worker process to handle multiple jobs concurrently, significantly improving resource efficiency.
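
To make the contrast concrete, here is a minimal sketch of the same 10ms I/O job under each model. The queue name, connection URL, and payloads are illustrative; the worker options follow BullMQ Python's documented connection/concurrency keys:

```python
# BullMQ Python: one process, many jobs in flight on one event loop
import asyncio

from bullmq import Worker

async def process(job, job_token):
    await asyncio.sleep(0.01)  # the await yields, so other jobs run meanwhile
    return job.data

async def main():
    worker = Worker("emails", process,
                    {"connection": "redis://localhost:6379", "concurrency": 50})
    await asyncio.sleep(60)  # in a real app, wait on a shutdown signal instead
    await worker.close()

asyncio.run(main())
```

```python
# RQ: one process, one job at a time; scaling out means more processes
import time

from redis import Redis
from rq import Queue, SimpleWorker

def process(payload):
    time.sleep(0.01)  # blocks the entire worker process for the duration
    return payload

q = Queue("emails", connection=Redis())
SimpleWorker([q], connection=Redis()).work()
```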

Feature Comparison

Beyond performance, the two libraries differ significantly in available features:

| Feature | BullMQ Python | RQ |
| --- | --- | --- |
| Async support | Native (asyncio) | No (synchronous) |
| Concurrency per worker | Configurable (1–1000+) | 1 (scale via processes) |
| Job priorities | Yes (numeric) | Separate queues only |
| Rate limiting | Yes (global, per-queue) | No |
| Delayed jobs | Yes (built-in) | Via rq-scheduler |
| Repeatable / cron jobs | Yes (built-in) | Via rq-scheduler |
| Retries with backoff | Yes (exponential, custom) | Basic (fixed count) |
| Parent–child flows | Yes (FlowProducer) | Basic dependencies |
| Job progress tracking | Yes | No |
| Global events | Yes | No |
| Stalled job recovery | Automatic | Manual |
| Sandboxed processors | Yes | Via forking worker |
| Dashboard / UI | Taskforce.sh / Bull Board | rq-dashboard |
| Atomic operations | Lua scripts | Python + Redis calls |
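
Two of those rows in code form: delayed and prioritized jobs are one-line options in BullMQ Python, whereas RQ delegates scheduling to the separate rq-scheduler package. Queue names, payloads, and the send_report task below are illustrative:

```python
import asyncio
from datetime import timedelta

from bullmq import Queue
from redis import Redis
from rq_scheduler import Scheduler

from tasks import send_report  # hypothetical RQ task function

async def bullmq_delayed_priority():
    queue = Queue("reports")
    # Built-in: run in 5 seconds at priority 1 (lower number = higher priority)
    await queue.add("send-report", {"id": 42}, {"delay": 5000, "priority": 1})
    await queue.close()

def rq_delayed():
    # Needs the separate rq-scheduler process; priorities need separate queues
    scheduler = Scheduler(queue_name="reports", connection=Redis())
    scheduler.enqueue_in(timedelta(seconds=5), send_report, 42)

asyncio.run(bullmq_delayed_priority())
rq_delayed()
```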

Benchmark Configuration

I ran these benchmarks on my local machine:

  • Machine: MacBook Pro with M2 Pro chip, 16GB RAM
  • Python: 3.13
  • Redis: 6.4.0 (local Docker instance)
  • BullMQ Python: 2.19.5
  • RQ: 2.6.1
  • Methodology: 5 runs per test, reporting mean values

For RQ, we used SimpleWorker, which processes jobs in the same process without forking; this provides the fairest comparison, since BullMQ also processes jobs in-process. For processing tests with simulated I/O work, we also tested RQ with multiple worker processes to match BullMQ's concurrency level.
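
The full harness is linked at the end of the post; the core measurement pattern for the processing tests looks roughly like this sketch (queue name, job count, and concurrency are illustrative):

```python
import asyncio
import time

from bullmq import Queue, Worker

N = 5_000
done = asyncio.Event()
processed = 0

async def process(job, token):
    global processed
    processed += 1
    if processed == N:
        done.set()

async def bench():
    queue = Queue("bench")
    worker = Worker("bench", process, {"concurrency": 10})

    start = time.perf_counter()
    for i in range(N):
        await queue.add("noop", {"i": i})
    await done.wait()  # wait until the worker has drained all N jobs
    elapsed = time.perf_counter() - start

    print(f"{N / elapsed:,.0f} jobs/sec")
    await worker.close()
    await queue.close()

asyncio.run(bench())
```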

Benchmark Results

Bulk Job Insertion

Adding 50,000 jobs at once using bulk insertion:

[Chart] Jobs per second - Bulk job insertion (50,000 jobs): BullMQ 21,900 vs RQ 18,700

For bulk insertions, BullMQ holds a modest ~17% edge at ~21,900 vs ~18,700 jobs/sec. Both libraries efficiently batch Redis round-trips: BullMQ uses pipelined Lua scripts, while RQ uses enqueue_many() with its own pipelining, so the results are in the same ballpark.
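
In code, the two bulk paths look roughly like this (assuming the Python port's addBulk mirrors the Node.js API; noop is a hypothetical task):

```python
# BullMQ Python: a single addBulk call, pipelined through Lua scripts
import asyncio

from bullmq import Queue

async def main():
    queue = Queue("bench")
    await queue.addBulk([{"name": "noop", "data": {"i": i}}
                         for i in range(50_000)])
    await queue.close()

asyncio.run(main())
```

```python
# RQ: enqueue_many pipelines its own Redis calls
from redis import Redis
from rq import Queue

from tasks import noop  # hypothetical task module

q = Queue("bench", connection=Redis())
q.enqueue_many([Queue.prepare_data(noop, args=(i,)) for i in range(50_000)])
```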

Single Job Insertion (Concurrent)

Adding 5,000 jobs with concurrent add() calls (concurrency=10):

[Chart] Jobs per second - Single job insertion with concurrency: BullMQ 6,200 vs RQ 2,800

Here BullMQ shows a ~2.2x advantage (6,200 vs 2,800 jobs/sec). The async architecture allows multiple add() calls to be in-flight simultaneously, while RQ's synchronous design means each insertion blocks until complete. This is a significant win for applications that need to enqueue jobs from async web frameworks like FastAPI or Starlette.
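
The concurrent-insert pattern can be expressed with a semaphore capping the number of in-flight add() calls, roughly:

```python
import asyncio

from bullmq import Queue

async def main():
    queue = Queue("bench")
    sem = asyncio.Semaphore(10)  # concurrency=10, as in the benchmark

    async def add_one(i):
        async with sem:  # at most 10 add() calls in flight at once
            await queue.add("noop", {"i": i})

    await asyncio.gather(*(add_one(i) for i in range(5_000)))
    await queue.close()

asyncio.run(main())
```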

Processing with Simulated I/O Work

Most real-world jobs involve some I/O: calling APIs, querying databases, reading files. To simulate this, each job performs a 10ms async sleep (BullMQ) or 10ms thread sleep (RQ).
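
The two job bodies are deliberately trivial; only the blocking model differs:

```python
import asyncio
import time

# BullMQ processor: the await yields, so other jobs overlap the wait
async def bull_io_job(job, token):
    await asyncio.sleep(0.01)

# RQ task: time.sleep blocks, so the worker process idles for 10ms
def rq_io_job():
    time.sleep(0.01)
```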

For a fair scaling comparison, we match BullMQ's concurrency level with an equal number of RQ worker processes. At concurrency=10:

[Chart] Jobs per second - 10ms I/O work: BullMQ (c=10) 654 vs RQ (10 workers) 471

With 10 concurrent workers, BullMQ is 39% faster (654 vs 471 jobs/sec). RQ performs respectably here: 10 OS processes provide real parallelism. But the gap widens dramatically when we scale to 50:

[Chart] Jobs per second - 10ms I/O work: BullMQ (c=50) 3,100 vs RQ (50 workers) 1,200

At concurrency=50, BullMQ is 2.6x faster (3,100 vs 1,200 jobs/sec). BullMQ scales near-linearly (654 → 3,100, a 4.7x increase for 5x more concurrency), while RQ shows sub-linear scaling (471 → 1,200, only 2.5x for 5x more workers). The reason? 50 RQ processes all compete for the same Redis queue, creating contention. BullMQ handles all 50 concurrent jobs in a single process with zero contention.

Pure Queue Overhead

To isolate the queue machinery cost from the job work itself, we process no-op jobs (jobs that return immediately without doing any work). This test deliberately uses a single RQ worker to measure the true per-job overhead floor, which is why the RQ figure here is lower than in the I/O test above, where 10 worker processes ran in parallel:

[Chart] Jobs per second - No-op jobs (pure queue overhead): BullMQ (c=10) 3,400 vs RQ (1 worker) 335

BullMQ is ~10x faster for raw job turnover (3,400 vs 335 jobs/sec). This explains why RQ struggles with high-throughput workloads: each RQ job cycle involves approximately 24 sequential Redis round-trips (dequeue, deserialize, status updates, result storage, cleanup). At ~0.14ms per Redis round-trip on localhost, that's ~3.3ms of unavoidable per-job overhead, capping a single worker at ~300 jobs/sec regardless of how lightweight the job is.

BullMQ uses optimized Lua scripts that batch multiple Redis operations into atomic calls, reducing per-job overhead to a fraction of RQ's.
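
As an illustration of the pattern (this is not BullMQ's actual script), redis-py lets you fold several per-job operations into one atomic round-trip:

```python
import redis

r = redis.Redis()

# Three bookkeeping steps that would otherwise cost three network hops,
# collapsed into a single atomic Lua call (hypothetical key layout).
pick_job = r.register_script("""
local job_id = redis.call('LPOP', KEYS[1])
if job_id then
  redis.call('HSET', KEYS[2] .. job_id, 'status', 'active')
  redis.call('ZADD', KEYS[3], ARGV[1], job_id)
end
return job_id
""")

job_id = pick_job(keys=["q:wait", "q:job:", "q:active"], args=[1700000000])
```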

CPU-Bound Processing

Processing jobs with CPU work (1,000 sin/cos operations per job, matching the Elixir benchmark methodology):

[Chart] Jobs per second - CPU-bound processing (1,000 sin/cos per job): BullMQ (c=10) 2,900 vs RQ (1 worker) 351

BullMQ is ~8x faster for CPU-bound work (2,900 vs 351 jobs/sec). Since both libraries are single-threaded for CPU work (Python's GIL limits parallelism), the difference comes entirely from per-job overhead: BullMQ's ~0.3ms vs RQ's ~3ms. When the job work itself is lightweight (~0.3ms for 1000 sin/cos), BullMQ's lower overhead translates directly into higher throughput.
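
For reference, the per-job CPU payload is along these lines:

```python
import math

def cpu_work():
    # ~0.3ms of floating-point work per job on the benchmark machine
    acc = 0.0
    for i in range(1000):
        acc += math.sin(i) * math.cos(i)
    return acc
```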

Summary

| Benchmark | BullMQ Python | RQ | Winner |
| --- | --- | --- | --- |
| Bulk Insert | 21,900/sec | 18,700/sec | BullMQ (1.2x) |
| Single Insert | 6,200/sec | 2,800/sec | BullMQ (2.2x) |
| 10ms I/O (c=10) | 654/sec | 471/sec (10 workers) | BullMQ (1.4x) |
| 10ms I/O (c=50) | 3,100/sec | 1,200/sec (50 workers) | BullMQ (2.6x) |
| Pure Overhead | 3,400/sec | 335/sec (1 worker) | BullMQ (10x) |
| CPU Processing | 2,900/sec | 351/sec | BullMQ (8x) |

Understanding Python's Performance Ceiling

Readers familiar with BullMQ's Node.js implementation may notice that the Python version tops out at ~3,400 jobs/sec for pure overhead, while the Node.js counterpart routinely exceeds 30,000 jobs/sec. The gap is not a design issue; it is a platform-level constraint rooted in the performance of Python's Redis client.

To pinpoint the bottleneck we profiled every layer of BullMQ Python's job-processing pipeline:

| Operation | Time per call |
| --- | --- |
| Redis round-trip (PING) | 0.22 ms |
| Lua script eval (simple) | 0.23 ms |
| Full job cycle (c=1) | 0.51 ms |
| JSON encode / decode | < 0.002 ms |
| asyncio scheduling | 0.05 ms |
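
The round-trip number is easy to reproduce with a few lines of redis-py against a local Redis (a minimal sketch):

```python
import asyncio
import time

import redis.asyncio as redis

async def main():
    r = redis.Redis()
    await r.ping()  # warm up the connection first

    start = time.perf_counter()
    for _ in range(1_000):
        await r.ping()
    per_call = (time.perf_counter() - start) / 1_000  # seconds per PING

    print(f"Redis round-trip: {per_call * 1000:.3f} ms")
    await r.aclose()  # redis-py 5+

asyncio.run(main())
```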

The dominant cost is the Redis round-trip itself. BullMQ Python uses redis-py, the standard async Redis client for Python. Node.js uses ioredis, which benefits from several structural advantages:

  • C++ networking via libuv: ioredis performs socket I/O through Node.js's libuv layer, written in C/C++. redis-py uses Python's asyncio stream layer, which is implemented in pure Python.
  • Native protocol parsing: ioredis can use hiredis (a C library) to parse Redis protocol responses. redis-py parses them in Python.
  • Lower event-loop overhead: Node.js's libuv dispatches callbacks roughly 10× faster than Python's asyncio event loop.

The net effect is that a single Redis round-trip takes ~0.22 ms in Python versus ~0.05–0.08 ms in Node.js: a 3–4× difference. Because every job requires at least one Redis Lua-script call, this per-call gap directly caps throughput.

Unfortunately, redis-py is the only mature async Redis client available for Python, so this ceiling is effectively a platform limitation rather than a BullMQ design issue. Despite this constraint, BullMQ Python still delivers the fastest job processing in the Python ecosystem, up to 10× faster than RQ and considerably ahead of any other Python queue library we've tested.

The Resource Efficiency Story

The raw numbers tell only part of the story. Consider what these results mean in practice:

Single Process Comparison:

  • 1 BullMQ worker (concurrency=50): ~3,100 jobs/sec (10ms I/O work)
  • 50 RQ worker processes: ~1,200 jobs/sec (same 10ms I/O work)

BullMQ achieves 2.6x higher throughput in a single process compared to 50 RQ processes. Each RQ process consumes memory independently, needs its own Redis connection, and adds operational complexity. BullMQ's async architecture eliminates this overhead entirely.

Scaling Behavior:

  • BullMQ scales near-linearly with concurrency: c=10 → c=50 yields 4.7x throughput
  • RQ scales sub-linearly with workers: 10 → 50 workers yields only 2.5x throughput

The sub-linear scaling for RQ comes from Redis contention: 50 processes all polling the same queue simultaneously. BullMQ avoids this by scheduling all work in a single event loop.

For a production workload processing 100,000 jobs per hour (~28 jobs/sec, assuming on the order of a second of real I/O per job):

  • BullMQ: 1 worker process
  • RQ: 25-30 worker processes

This translates directly to infrastructure savings, simpler deployments, and reduced resource consumption.

When to Choose Each

Choose RQ when:

  • You need simplicity above all else
  • Your job volume is moderate (less than 1,000 jobs/min)
  • You prefer synchronous Python code
  • You're already running multiple worker processes anyway

Choose BullMQ Python when:

  • You need high throughput from minimal workers
  • You're using async Python (FastAPI, Starlette, etc.)
  • You want advanced features (priorities, rate limiting, job dependencies)
  • Resource efficiency matters for your infrastructure costs

Conclusion

BullMQ Python delivers substantial performance improvements over RQ across all benchmark categories. The async architecture and optimized Lua scripts provide up to 10x speedups for pure job throughput, 2.6x gains for realistic I/O workloads at scale, 8x for CPU-bound work, and 2.2x for concurrent single insertions. BullMQ also scales near-linearly with concurrency, while RQ's multi-process scaling hits diminishing returns from Redis contention.

The key architectural insight: RQ's SimpleWorker performs ~24 sequential Redis round-trips per job, creating an unavoidable ~3ms overhead floor. BullMQ batches these operations into atomic Lua scripts, keeping per-job overhead under 0.3ms, a 10x reduction that compounds across every job processed.

For Python applications that need to process high volumes of background jobs efficiently, BullMQ Python offers a compelling alternative to traditional synchronous queues. The ability to handle thousands of jobs per second with a single worker process can significantly reduce infrastructure complexity and costs.

The benchmark code is available at bullmq-python-bench if you'd like to run these tests yourself.


Ready to try BullMQ Python? Check out the documentation to get started.