Rate limiting is what keeps your APIs stable under pressure. It controls how many requests a user or system can make, which matters especially when you are serving heavy AI models.
This guide walks through how API rate limiting works and how to implement it in real-world systems: the common strategies, how to enforce limits, and how to handle rate-limit errors across different stacks.
How to Implement Rate Limiting in an API (Step by Step)
Step 1: Define what you want to limit
Start by selecting the key used to track requests. Common choices are:
IP address (simple, but less accurate)
User ID (better for authenticated systems)
API key (common for AI APIs)
For an AI system, API keys or user IDs give more control and fairness.
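A small sketch of that selection logic, falling back from the most specific identifier to the least (the header names here are illustrative, not a standard):

```python
def request_key(headers: dict, client_ip: str) -> str:
    """Pick the most specific identifier available for rate limiting."""
    # Header names are illustrative; use whatever your API actually defines.
    api_key = headers.get("X-API-Key")
    user_id = headers.get("X-User-Id")
    if api_key:
        return f"key:{api_key}"
    if user_id:
        return f"user:{user_id}"
    return f"ip:{client_ip}"  # fallback: simplest, least accurate

key = request_key({"X-API-Key": "abc123"}, "10.0.0.1")  # "key:abc123"
```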
Step 2: Set a clear rate limit policy
Decide the number of requests you want within a particular time.
Examples:
100 requests per minute per user
1,000 requests per hour per API key
Keep the limits realistic. AI endpoints are often resource-heavy, so reduce the limits where necessary.
Step 3: Choose a rate-limiting algorithm
Make sure you pick a strategy based on your use case:
Fixed window: it’s simple, but can cause bursts at window edges
Sliding window: it gives smoother control over traffic
Token bucket: very flexible, and allows short bursts while enforcing limits
For most AI APIs, a token bucket or a sliding window works best.
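As a sketch of the token-bucket idea, independent of any framework: the bucket holds up to `capacity` tokens (the burst size), refills at a steady rate, and each request spends one token.

```python
import time

class TokenBucket:
    """Token bucket: capacity allows short bursts; refill_rate enforces the average."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # max tokens (burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=5, refill_rate=1.0)  # burst of 5, 1 req/s sustained
results = [bucket.allow() for _ in range(6)]
# The first 5 rapid requests pass; the 6th is rejected until tokens refill.
```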
Step 4: Save request counts efficiently
You need a fast storage layer that can help you track requests in real time.
Here are some common options:
In-memory store (Redis is widely used)
Application memory (only for single-instance apps)
Example Redis key:
rate_limit:user_123
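A minimal sketch of a fixed-window counter keyed this way, using a plain dict as a stand-in for Redis (in production the lookup and increment would be Redis commands instead):

```python
import time

WINDOW = 60        # seconds
MAX_REQUESTS = 100

store = {}  # stand-in for Redis: key -> (window_start, count)

def hit(key, now=None):
    """Count a request against `key`; return True if it is still within the limit."""
    now = time.time() if now is None else now
    window_start, count = store.get(key, (now, 0))
    if now - window_start >= WINDOW:
        window_start, count = now, 0   # window expired: reset the counter
    count += 1
    store[key] = (window_start, count)
    return count <= MAX_REQUESTS

allowed = [hit("rate_limit:user_123", now=0.0) for _ in range(101)]
# requests 1-100 are allowed; the 101st in the same window is rejected
```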
Step 5: Intercept requests with middleware
Rate limiting should run before your main logic. In most frameworks this is done with middleware, which:
Extracts the identifier (IP, user ID, or API key)
Checks the current request count
Decides whether to allow or reject the request
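Those three steps can be sketched as a framework-agnostic wrapper; the request shape and the toy counter backend below are illustrative, since real frameworks provide their own middleware hooks:

```python
def rate_limit_middleware(handler, is_allowed):
    """Wrap an endpoint handler with a rate-limit check before it runs."""
    def wrapped(request):
        # 1. Extract the identifier (API key, user ID, or IP).
        key = request.get("api_key") or request.get("ip")
        # 2. Check the current count and decide.
        if not is_allowed(key):
            return {"status": 429, "body": "Too Many Requests"}
        # 3. Within limits: forward to the real handler.
        return handler(request)
    return wrapped

# Toy backend that allows the first 2 requests per key:
counts = {}
def is_allowed(key):
    counts[key] = counts.get(key, 0) + 1
    return counts[key] <= 2

endpoint = rate_limit_middleware(lambda req: {"status": 200}, is_allowed)
statuses = [endpoint({"ip": "10.0.0.1"})["status"] for _ in range(3)]
# [200, 200, 429]
```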
Step 6: Enforce the limit
If a request exceeds its limit, block it immediately by returning a standard HTTP response:
HTTP/1.1 429 Too Many Requests
Include helpful headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 60
This helps clients understand when they can retry.
Step 7: Handle allowed requests
If the request is within limits:
Increase the counter
Forward the request to the API handler
Keep this step fast. Rate limiting should not introduce noticeable latency.
Step 8: Add retry and backoff guidance
Clients should not retry immediately after hitting limits. Encouraging exponential backoff reduces pressure on your API during spikes.
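A common client-side sketch is exponential backoff with jitter; the base delay and cap below are illustrative defaults:

```python
import random

def backoff_delays(base=1.0, cap=60.0, attempts=5):
    """Exponential backoff with jitter: roughly base * 2^n seconds, capped."""
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (2 ** attempt))
        delays.append(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries
    return delays

# A client that receives a 429 sleeps delays[n] before retry n, ideally
# honoring a Retry-After or X-RateLimit-Reset header when the server sends one.
```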
Step 9: Log and monitor rate limit activity
Be sure to track:
The number of blocked requests
The most active users or API keys
The patterns of abuse
This will help to fine-tune limits and detect misuse early.
Step 10: Test under load
Simulate high traffic before deploying, and be sure to check:
Whether limits are enforced correctly
If legitimate users are blocked too early
How the system behaves under burst traffic
Rate limiting should protect your API without breaking normal usage.
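A quick offline simulation can check all three behaviors before deployment; here a simple fixed-window limiter stands in for whatever implementation you actually ship:

```python
def make_limiter(max_requests, window):
    """Fixed-window limiter factory; per-key state lives in the closure."""
    state = {}

    def allow(key, now):
        start, count = state.get(key, (now, 0))
        if now - start >= window:
            start, count = now, 0   # window rolled over
        count += 1
        state[key] = (start, count)
        return count <= max_requests

    return allow

allow = make_limiter(max_requests=10, window=60)

# Burst: 15 requests from one user at t=0 -> exactly 10 should pass.
burst = [allow("user_123", now=0) for _ in range(15)]
accepted = sum(burst)

# The same user should be allowed again once the window rolls over.
later_ok = allow("user_123", now=60)
```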
Implementing Rate Limiting in C-Based APIs
C-based APIs usually run in high-performance environments where efficiency matters, so rate limiting needs to be fast, memory-conscious, and thread-safe. Start by choosing a lightweight tracking mechanism.
In C, this is usually done with:
in-memory hash tables
shared memory (for multi-process setups)
external stores like Redis (for distributed systems)
Defining a rate limit structure
Create a struct to track request counts and timing:
typedef struct {
    int request_count;     /* requests seen in the current window */
    time_t window_start;   /* when the current window began */
} RateLimit;
Use the IP address or API key as the lookup key, and store records in a hash table for fast lookups. Each incoming request should:
Extract the identifier (IP or API key)
Look up its rate limit record
Update or reset the counter
Example using pseudo-hash map logic:
RateLimit* rl = get_rate_limit(key);
if (current_time - rl->window_start > WINDOW_SIZE) {
    /* window expired: start a new one */
    rl->request_count = 1;
    rl->window_start = current_time;
} else {
    rl->request_count++;
}
Check the count before processing the request:
if (rl->request_count > MAX_REQUESTS) {
    return 429; /* Too Many Requests */
}
Avoid heavy operations in this path.
C-based APIs often run in multi-threaded or multi-process environments, where race conditions can break rate-limiting logic.
Use:
mutex locks (for threads)
atomic operations (for counters)
shared memory locks (for multi-process systems)
Example with a mutex:
pthread_mutex_lock(&lock);
/* update rate limit */
pthread_mutex_unlock(&lock);
Keep lock duration short to avoid performance bottlenecks.
Each active user or IP consumes memory; without cleanup, usage grows without bound over time.
Implement expiration:
remove entries after inactivity
periodically clean old records
This prevents memory leaks in long-running services.
Use Redis for distributed rate limiting
If your C API runs across multiple servers, per-process in-memory tracking is no longer enough.
Use Redis with atomic operations:
INCR rate_limit:user_123
EXPIRE rate_limit:user_123 60
Set the expiry only when INCR returns 1 (the first request of the window); calling EXPIRE on every request would push the window forward each time. Done this way, limits stay consistent across all instances.
In C environments, simplicity often wins.
Fixed window: easiest to implement
Token bucket: better for handling bursts
A token bucket can be implemented with a counter that refills over time.
Return proper HTTP responses
Even in C-based servers, follow standard responses:
HTTP/1.1 429 Too Many Requests
Be sure to include the X-RateLimit-* headers shown earlier so clients can back off intelligently.
Finally, load-test the implementation. C systems are often chosen for high-throughput APIs, so simulate:
burst traffic
concurrent requests
edge cases around time windows
Efficient rate limiting in C rests on tight control over memory, concurrency, and execution time. Implemented correctly, it protects your API without slowing it down.
Adding Rate Limiting to Python and FastAPI Services
Rate limiting in Python APIs is normally implemented at the middleware or dependency level. For FastAPI, this approach keeps the logic centralized and easy to reuse across routes.
For most AI APIs, API keys or user IDs give better control than IP-based limits.
Instead of building everything from scratch, use proven tools like slowapi or fastapi-limiter; they integrate directly with FastAPI and reduce implementation complexity.
Example using fastapi-limiter with Redis:
from fastapi import FastAPI, Request, Depends
from fastapi.responses import JSONResponse
from fastapi_limiter import FastAPILimiter
from fastapi_limiter.depends import RateLimiter
import aioredis

app = FastAPI()

@app.on_event("startup")
async def startup():
    redis = await aioredis.from_url("redis://localhost")
    await FastAPILimiter.init(redis)

@app.get("/chat", dependencies=[Depends(RateLimiter(times=5, seconds=60))])
async def chat_endpoint():
    return {"message": "Request allowed"}
This limits requests to 5 per minute per client. Each request:
Extracts a unique identifier
Stores or increments a counter in Redis
Checks if the limit is exceeded
Blocks or allows the request
Redis is commonly used here because it supports atomic operations and works well in distributed systems.
Example:
@app.get("/inference", dependencies=[Depends(RateLimiter(times=10, seconds=60))])
This keeps resource-heavy endpoints protected.
When a limit is exceeded, the limiter automatically returns:
429 Too Many Requests
You can customize the response:
@app.exception_handler(429)
async def rate_limit_handler(request: Request, exc):
    return JSONResponse(
        status_code=429,
        content={"detail": "Rate limit exceeded. Try again later."},
    )
If you want a global limit across all routes, apply middleware instead of per-route dependencies; this ensures every request passes through the same rate-limiting logic.
In production, FastAPI apps often run with multiple workers, and in-memory counters are never shared across those instances.
Always use a shared store like Redis for:
consistent limits
distributed deployments
horizontal scaling
FastAPI makes rate limiting straightforward when combined with Redis and middleware patterns. The most important part is keeping it efficient, consistent, and aligned with how your API is being used.



