Queue scaling strategies for real workloads
Most teams respond to queue backlogs by tweaking worker counts. The real fix is not owning the process management layer in the first place.
There is a moment most Laravel teams hit somewhere in year two or three. Queue depth starts climbing. Jobs that used to clear in seconds are taking minutes. A user submits a form and waits. Support tickets appear. Someone checks the monitor and finds three thousand jobs backed up behind a process that stalled forty minutes ago.
Someone increases the worker count. The backlog clears. The incident closes.
And nothing changes.
The same incident will happen again, probably under different circumstances, probably at a worse time. Because the response was operational, not architectural. And because the architecture, even when it gets fixed, still has to be operated by someone. That someone is your team.
Queue scaling is not an optimisation problem. It is an ownership problem. The question is not how to configure your workers correctly. The question is why your team is responsible for configuring, running, and recovering them at all.
The default setup and where it breaks
Most Laravel applications start with a single worker running `php artisan queue:work`. One queue, one process, everything handled in arrival order. For early-stage volume this is fine.
The problem is that it is still in place when the application is processing thousands of jobs per hour, because nobody ever made a deliberate decision to change it. The single-queue setup has no concept of priority. A weekly analytics report and a password reset email sit in the same queue. The report arrived first, so the user waiting on their password reset waits behind it.
It also has no resilience to volume spikes. A large batch of jobs dispatched at once serialises everything behind it. There is no mechanism to route work differently. When it breaks, and it will, the failure is not a bug. It is the predictable result of running a configuration that was never designed for the workload it is carrying.
The application-layer fix is straightforward. Priority queues, named lanes, deliberate dispatch:
```php
// User is waiting on this
dispatch(new SendPasswordReset($user))->onQueue('critical');

// Standard background work
dispatch(new SyncUserToMailingList($user));

// Expensive, non-urgent
dispatch(new GenerateMonthlyReport($account))->onQueue('bulk');

// Delay is fine
dispatch(new UpdateSearchIndex($post))->onQueue('low');
```

That code is readable, reviewable, and entirely within the skill set of any Laravel developer. The design decisions it encodes belong in the application layer. They are product decisions: which work is urgent, which can wait, which should never block anything else.
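Running this architecture means giving each lane its own worker process. Before any process manager enters the picture, the commands themselves are simple. A sketch of what that looks like on a single box; the flags shown are illustrative defaults, not prescriptions:

```shell
# One dedicated worker per lane
php artisan queue:work redis --queue=critical --sleep=1 --tries=3 --timeout=30 &
php artisan queue:work redis --queue=default --sleep=3 --tries=3 --timeout=60 &

# Or one worker draining several lanes in strict priority order:
# a 'critical' job is always taken before anything in 'default' or 'low'
php artisan queue:work redis --queue=critical,default,low &
```

Backgrounding processes with `&` is exactly the thing that does not survive a crash, a deploy, or a reboot.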
What happens next is the problem.
The Supervisor problem
To actually run this architecture on a self-managed server, you need Supervisor. A configuration block for each worker type. Process counts. Restart policies. Log routing. Memory limits. Here is what a production queue setup with four named queues looks like as Supervisor configuration:
```ini
[program:laravel-worker-critical]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/artisan queue:work redis --queue=critical --sleep=1 --tries=3 --timeout=30
autostart=true
autorestart=true
stopasgroup=true
killasgroup=true
user=www-data
numprocs=5
redirect_stderr=true
stdout_logfile=/var/log/supervisor/critical.log
stopwaitsecs=30

[program:laravel-worker-default]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/artisan queue:work redis --queue=default --sleep=3 --tries=3 --timeout=60
autostart=true
autorestart=true
stopasgroup=true
killasgroup=true
user=www-data
numprocs=3
redirect_stderr=true
stdout_logfile=/var/log/supervisor/default.log
stopwaitsecs=60

[program:laravel-worker-bulk]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/artisan queue:work redis --queue=bulk --sleep=5 --tries=2 --timeout=300
autostart=true
autorestart=true
stopasgroup=true
killasgroup=true
user=www-data
numprocs=2
redirect_stderr=true
stdout_logfile=/var/log/supervisor/bulk.log
stopwaitsecs=300

[program:laravel-worker-low]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/artisan queue:work redis --queue=low --sleep=10 --tries=1 --timeout=120
autostart=true
autorestart=true
stopasgroup=true
killasgroup=true
user=www-data
numprocs=1
redirect_stderr=true
stdout_logfile=/var/log/supervisor/low.log
stopwaitsecs=120
```

Your team now owns sixty-plus lines of INI configuration that has to be version-controlled, kept in sync with your application, and deployed to every environment you run. When you add a new queue type, it needs a new block. When you change concurrency on the critical queue, you edit this file, reload Supervisor, and hope the reload does not interrupt in-flight jobs. When a worker silently dies and Supervisor does not recover it, you read Supervisor logs to find out why, at whatever time that happens to occur.
This is the moment most teams realise the problem is not queue design. The design is fine. The problem is that they are responsible for running it.
Even if you do everything right, you still own the operational surface. The INI files. The reload procedures. The log locations. The process counts that need revisiting as traffic grows. The knowledge of how all of it fits together, concentrated in one or two engineers, unavailable when those engineers are not available. You can architect perfect queue isolation and still spend your Friday night debugging a Supervisor reload that behaved unexpectedly on one environment but not another.
This is not a solvable configuration problem. It is what infrastructure ownership looks like in practice.
If your platform requires this, the problem is your platform
The application architecture described here is not complex. Priority queues, deliberate dispatch, isolated workers for different job classes. Laravel supports all of it natively. The design decisions cost a few hours of thought and almost nothing to implement.
The sixty lines of Supervisor configuration required to run it are not part of the design. They are the overhead of running on infrastructure that was not built for this. AWS, bare VPS, self-managed Kubernetes: all of them require your team to own the process management layer. All of them put the operational burden of keeping workers running, recovering from failures, and maintaining consistency across environments on your engineering team.
If your team is still writing Supervisor config, managing worker restarts, and debugging queue isolation on self-managed infrastructure, they are doing infrastructure work that the platform should be handling. That is not a reflection of the team's skill. It is a reflection of the wrong tool for the job.
Sevalla runs your queue workers as persistent background processes managed by the platform. You define which queue each worker listens on, the concurrency, and the memory allocation. The platform handles process supervision, restarts on failure, and log collection. Your critical queue gets five dedicated workers. Your bulk queue gets two. When a worker fails, the platform restarts it. The failure is visible in the same place as your application logs. When you add a new queue type, you update the worker definition alongside the rest of your application.
There is no Supervisor configuration. There is no reload procedure. There is no INI file drifting out of sync across environments. There is no engineer whose weekend depends on a process monitor behaving correctly.
This is what it looks like when you stop owning queue infrastructure.
The failure modes that do not go away
Getting rid of the Supervisor problem does not get rid of the thinking. The application-layer decisions still matter, and they are still yours to make.
Jobs that never finish hold worker processes indefinitely. If three of your five workers are stuck on hung HTTP requests to an external service, your effective capacity is two workers. Every job that does external work needs an explicit timeout and a retry policy that reflects how that external dependency actually behaves.
```php
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;

class SyncToExternalService implements ShouldQueue
{
    use Queueable;

    public int $timeout = 30; // the worker kills the job after 30 seconds
    public int $tries = 3;    // attempts before it lands in failed_jobs
    public int $backoff = 60; // seconds to wait between retries

    public function handle(): void
    {
        // If this hangs, it gets killed at 30 seconds
    }
}
```

Jobs that fail and retry immediately, fail again, and retry again consume worker capacity without making progress. Backoff keeps them from flooding your queue while an underlying problem resolves. Jobs that should not retry at all should say so explicitly. Not every failure is recoverable.
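Both policies can be encoded directly on the job class. A sketch of the two ends of that spectrum; the class names and values here are illustrative, not prescriptions:

```php
use Illuminate\Contracts\Queue\ShouldQueue;

// Recoverable failure: back off progressively between attempts,
// waiting 60 seconds after the first failure and 5 minutes after the second
class SyncInvoiceToAccounting implements ShouldQueue
{
    public int $tries = 3;

    public array $backoff = [60, 300];
}

// Unrecoverable failure: one attempt, then straight to failed_jobs
// where a human can look at it
class ChargePaymentMethod implements ShouldQueue
{
    public int $tries = 1;
}
```

The point is that the retry policy is a statement about the external dependency, not a default you inherit.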
Volume spikes need chunking. A user action that dispatches a thousand jobs at once through a single loop floods the queue. A job that processes recipients in batches of fifty fans the work out gradually and does not serialise everything behind it.
```php
// Floods the queue
foreach ($users as $user) {
    dispatch(new SendNotification($user, $notification));
}

// Fans out gradually
$users->chunk(50)->each(function ($chunk) use ($notification) {
    dispatch(new SendChunkedNotifications($chunk, $notification));
});
```

These are application decisions. They are the right things for your team to be thinking about. They are not infrastructure work.
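When the fan-out also needs a completion signal, Laravel's job batching is the natural next step. A sketch, assuming a SendChunkedNotifications job as above that also uses the Batchable trait:

```php
use Illuminate\Support\Facades\Bus;
use Illuminate\Support\Facades\Log;

// Build one job per chunk of fifty recipients
$jobs = $users->chunk(50)->map(
    fn ($chunk) => new SendChunkedNotifications($chunk, $notification)
);

Bus::batch($jobs)
    ->onQueue('bulk')     // keep the fan-out off the critical lane
    ->allowFailures()     // one bad chunk does not cancel the rest
    ->then(fn () => Log::info('All notification chunks processed'))
    ->dispatch();
```

The batch also gives you progress and failure counts for free, which matters once the fan-out is large enough that "did it finish?" becomes a support question.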
The failed jobs table is a signal you are already paying for
Most teams set up the failed jobs table because the documentation says to. Fewer treat it as the production monitoring signal it is.
Failed jobs tell you which job classes are failing most frequently, whether failures are concentrated in a particular queue, whether retries are recovering or producing more failures, and when an external dependency started degrading. A failure rate that suddenly exceeds your baseline is an early warning. Reading it is faster than correlating application logs.
If you are not monitoring failed job rates and alerting on abnormal patterns, you are missing one of the better production signals available to you. It is already there. You are just not looking at it.
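Getting that signal out of the table takes very little. A sketch of a scheduled check, assuming the default failed_jobs table; the one-hour window and the threshold are illustrative, and the alerting channel is whatever you already use:

```php
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;

// Failures in the last hour, grouped by queue
$failures = DB::table('failed_jobs')
    ->where('failed_at', '>=', now()->subHour())
    ->selectRaw('queue, count(*) as total')
    ->groupBy('queue')
    ->get();

// Alert when any queue exceeds its baseline
foreach ($failures as $row) {
    if ($row->total > 25) {
        Log::alert("Queue {$row->queue} failed {$row->total} jobs in the last hour");
    }
}
```

Run that from the scheduler every few minutes and you have a production signal that predates most user-facing symptoms.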
The question worth asking
The queue configuration that works at two hundred jobs per day is not the one you want at twenty thousand. Most teams find this out during an incident rather than before one, because the system technically works until it does not, and it never stops working at a convenient time.
The design work is not complicated. Identify your job types. Decide which are critical and which are not. Set timeouts and retry limits that match actual behaviour. Monitor failure rates. None of this is beyond any Laravel team.
What should not be complicated is running the result. If your current platform requires a separate process management layer, a set of configuration files that live outside your application, and operational knowledge concentrated in a small number of engineers to keep your queue workers running, the problem is not your architecture.
The problem is your platform.
Sevalla is built for Laravel teams who have reached that conclusion. Your queue architecture runs the way you designed it. The operational layer is handled. Your team gets back to the work they are actually here to do.