vAndu

09 Locust Quick Setup Guide

Posted on June 3, 2026

Short description

Locust is a load testing tool. You describe simulated users in a small Python file, then watch how your service behaves as concurrency rises.

Purpose and how people use it

People use Locust to find where a service slows down or breaks under load. For an AI stack it answers a specific question: how many simultaneous requests can my model endpoint handle before latency climbs? It pairs well with a gateway, since you can load test the single endpoint that fronts everything.

Prerequisites

  1. Python 3 with the venv module. On Debian or Ubuntu you may need to install it first with sudo apt install python3-venv.
  2. A running endpoint to test, for example the LiteLLM gateway on port 4000.

Quick setup

mkdir -p ~/locust && cd ~/locust
python3 -m venv .venv
source .venv/bin/activate
pip install locust

Create a test file that hits an OpenAI-compatible chat endpoint:

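cat > locustfile.py << 'EOF'
import os
from locust import HttpUser, task, between

# Read the key and model name from the environment so secrets stay out of the file.
LITELLM_KEY = os.getenv("LITELLM_KEY", "sk-REPLACE-ME")
MODEL = os.getenv("MODEL", "qwen3-32b")

class LiteLLMUser(HttpUser):
    # Each simulated user pauses 1 to 3 seconds between requests.
    wait_time = between(1, 3)

    @task
    def chat(self):
        # One short chat completion; max_tokens keeps each generation small.
        self.client.post(
            "/v1/chat/completions",
            headers={
                "Authorization": f"Bearer {LITELLM_KEY}",
                "Content-Type": "application/json",
            },
            json={
                "model": MODEL,
                "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
                "max_tokens": 50,
            },
            # Group every request under one row in the statistics table.
            name="/v1/chat/completions",
        )
EOF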

Set your key and run:

export LITELLM_KEY="sk-your-real-key"
export MODEL="qwen3-32b"
locust -f locustfile.py --host http://localhost:4000

Open http://localhost:8089, set a small number of users (start with 5) and a spawn rate of 1, then press Start. Watch requests per second, the median and 95th percentile response times, and the failure rate.
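If you would rather not raise the user count by hand, a stepped ramp can do it for you. The sketch below uses Locust's LoadTestShape class. The step length, step size, and ceiling are my own example values, and users_at is a hypothetical helper name, so adjust them to your hardware. The import fallback is only there so the helper can be tried without Locust installed.

```python
try:
    from locust import LoadTestShape
except ImportError:
    # Fallback so users_at() can be exercised without Locust installed.
    LoadTestShape = object

# Assumed example values, not taken from the guide above.
STEP_SECONDS = 60   # how long each load level is held
STEP_USERS = 2      # users added at every step
MAX_USERS = 20      # the test stops once the ramp would exceed this


def users_at(run_time):
    """Return the target user count for a given run time, or None to stop."""
    step = int(run_time // STEP_SECONDS) + 1
    target = step * STEP_USERS
    return target if target <= MAX_USERS else None


class StepLoad(LoadTestShape):
    """Raise the user count in fixed steps so the latency bend is easy to spot."""

    def tick(self):
        target = users_at(self.get_run_time())
        if target is None:
            return None  # returning None ends the test
        # (total users, spawn rate in users per second)
        return (target, STEP_USERS)
```

Appended to locustfile.py, Locust should pick up the shape class on its own and drive the user count itself, so the user and spawn rate fields in the web UI no longer apply. Each plateau then gives you one clean latency reading per user count.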

The one thing that trips everyone up

Locust here runs natively, not in a container, so the target host is http://localhost:4000, not host.docker.internal. Also remember that a local LLM is slow compared to a normal web app, so expect response times in seconds rather than milliseconds. The useful result is the saturation point: the user count where the median response time stops being flat and starts climbing. That bend is your real concurrency limit.
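Finding the bend by eye works, but it can also be written down. Here is a minimal sketch that takes (users, median response time) pairs you noted from a few runs and reports the first user count where the median rises well above the low-load baseline. The function name, the 1.5x tolerance, and the sample numbers are all assumptions for illustration, not measurements from a real run.

```python
def find_saturation_point(samples, tolerance=1.5):
    """Return the first user count whose median latency exceeds
    tolerance times the lowest-load median, or None if it never does.

    samples: iterable of (users, median_ms) pairs, in any order.
    """
    ordered = sorted(samples)  # sort by user count
    if not ordered:
        return None
    baseline = ordered[0][1]  # median at the lowest user count
    for users, median_ms in ordered:
        if median_ms > baseline * tolerance:
            return users
    return None


# Hypothetical readings: flat up to 8 users, then climbing.
runs = [(1, 900), (2, 920), (4, 950), (8, 1000), (16, 1900), (32, 3800)]
print(find_saturation_point(runs))  # prints 16
```

A ratio against the baseline is a crude rule. It is enough to flag roughly where the curve leaves the flat region, and the last user count before that point is a reasonable working limit.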

Reading the result

Do not chase a high requests-per-second number. With a single GPU, generation is the bottleneck, and concurrency is governed by how many requests the model server runs in parallel. The valuable output is the shape of the latency curve as users rise, not the peak throughput.
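A bit of arithmetic shows why the requests per second figure stays modest. Each simulated user waits for a response, pauses, then sends again, so throughput is roughly users divided by (response time + pause). The sketch below assumes a mean pause of 2 seconds, which is the midpoint of between(1, 3) in the test file. The function names and the response times are made up for illustration.

```python
def expected_rps(users, response_s, think_s=2.0):
    """Approximate throughput of a closed loop of simulated users.

    Each user spends response_s waiting and think_s pausing per request.
    """
    return users / (response_s + think_s)


def in_flight(users, response_s, think_s=2.0):
    """Average number of requests the server is handling at once."""
    return expected_rps(users, response_s, think_s) * response_s


# Hypothetical: 5 users with 1 s responses, then 20 users with 8 s responses.
print(round(expected_rps(5, 1.0), 2))   # prints 1.67
print(expected_rps(20, 8.0))            # prints 2.0
print(in_flight(20, 8.0))               # prints 16.0
```

In this made-up case, four times the users buys almost no extra throughput while 16 requests sit in flight at once. That is the pattern to expect once generation on the GPU is the limit.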
