Designing a Document AI System for 100K DAU
A system design note on asynchronous document processing, GPU workers, queues, and storage choices for Document AI
I recently studied how a large-scale Document AI pipeline could be designed, using a service with roughly 100,000 daily active users as the working assumption.
Document AI is different from a typical web application. A single request may involve image preprocessing, OCR, layout analysis, PII masking, LLM processing, indexing, and retrieval. Some of those tasks are CPU-friendly, but OCR, layout models, and LLM inference can quickly become GPU-bound. That changes the architecture.
The main point of this design is simple: the API server should not process the document all the way through synchronously. It should accept the upload, validate it, store the original file, create a job, and hand the heavy work to a queue-backed worker layer.
Requirements
The functional requirements are straightforward.
- Upload PDF and image documents
- Extract text from uploaded documents
- Analyze layout, tables, paragraphs, and document structure
- Convert results into Markdown, JSON, or another structured format
- Support search or question answering over processed documents
- Expose document processing status
The important part is that this is not just raw OCR. If the system only extracts text, it loses much of the document’s meaning. A useful Document AI pipeline needs structure: sections, tables, fields, coordinates, and enough metadata to support later retrieval or question answering.
The non-functional requirements matter even more.
- Accuracy for OCR and layout recognition
- Latency control for small documents
- Asynchronous handling for large documents
- Throughput under peak traffic
- Independent scaling of API servers and workers
- Retry and dead-letter handling for failures
- PII masking before external model calls
- Human review for low-confidence or high-risk cases
Accuracy does not mean only average OCR accuracy. In insurance, finance, contracts, or medical documents, a small error in an amount, date, name, or field mapping can affect a real decision. A system that is 95% accurate overall may still need human review for the remaining uncertain cases.
That is where confidence scores and rule validation become important. A payment-related field may require a higher threshold than ordinary text. If the confidence is low, the total does not match the line items, or a required field is missing, the job should move into a human review queue instead of being automatically accepted.
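The routing rule described above can be sketched in a few lines. This is a minimal illustration, not a production validator: the field names, the threshold values, and the `review_reasons` helper are all hypothetical, and real thresholds would have to come from evaluation data for the specific domain.

```python
from dataclasses import dataclass

# Hypothetical thresholds: payment-related fields get a stricter bar than ordinary text.
DEFAULT_THRESHOLD = 0.80
FIELD_THRESHOLDS = {"total_amount": 0.98, "payment_date": 0.95}
REQUIRED_FIELDS = {"total_amount", "payment_date"}


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the OCR or extraction model


def review_reasons(fields: list[Field], line_items: list[float]) -> list[str]:
    """Return the reasons a document needs human review; an empty list means auto-accept."""
    reasons = []
    by_name = {f.name: f for f in fields}

    # Rule 1: every required field must be present.
    for name in sorted(REQUIRED_FIELDS - by_name.keys()):
        reasons.append(f"missing required field: {name}")

    # Rule 2: each field must meet its own confidence threshold.
    for f in fields:
        threshold = FIELD_THRESHOLDS.get(f.name, DEFAULT_THRESHOLD)
        if f.confidence < threshold:
            reasons.append(f"low confidence on {f.name}: {f.confidence:.2f} < {threshold:.2f}")

    # Rule 3: the extracted total must agree with the sum of the line items.
    total = by_name.get("total_amount")
    if total is not None and line_items:
        try:
            if abs(float(total.value) - sum(line_items)) > 0.005:
                reasons.append("total does not match line items")
        except ValueError:
            reasons.append("total_amount is not a number")

    return reasons


fields = [Field("total_amount", "35.00", 0.90)]
print(review_reasons(fields, [10.0, 20.0]))
```

Returning a list of reasons rather than a boolean gives the reviewer something to act on, and the same list can be stored with the job for auditing.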
Capacity estimate
I used a simple traffic assumption.
DAU: 100,000 users
Average uploads per user per day: 3
Daily uploads: 300,000
There are 86,400 seconds in a day, so the average upload TPS is roughly:
300,000 / 86,400 = about 3.47 TPS
Average TPS alone is misleading because traffic is rarely uniform. If peak traffic is five times the average, the peak is about 17 upload requests per second, so planning for roughly 20 upload requests per second leaves a small margin.
For a CRUD API, 20 TPS is not scary. For an AI document pipeline, it can be significant. One document may contain many pages, and each page may require preprocessing, OCR, layout detection, masking, chunking, embedding, and possibly LLM calls. The real worker capacity depends on page count, model latency, GPU utilization, and retry behavior, not only upload TPS.
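The arithmetic is easy to reproduce, and extending it from uploads to pages shows why upload TPS understates the load. In the sketch below, the DAU, uploads per user, and peak factor come from the estimate above (the exact peak is about 17.4, which rounds up to the 20 used in the text). The average page count and the per-worker throughput are made-up numbers for illustration only; real values have to be measured.

```python
import math

DAU = 100_000
UPLOADS_PER_USER = 3
PEAK_FACTOR = 5            # assumed ratio of peak to average traffic
SECONDS_PER_DAY = 86_400

daily_uploads = DAU * UPLOADS_PER_USER
avg_tps = daily_uploads / SECONDS_PER_DAY
peak_tps = avg_tps * PEAK_FACTOR

# Hypothetical workload figures, not taken from any measurement.
AVG_PAGES_PER_DOC = 5
PAGES_PER_SEC_PER_WORKER = 2.0

peak_pages_per_sec = peak_tps * AVG_PAGES_PER_DOC
workers_needed = math.ceil(peak_pages_per_sec / PAGES_PER_SEC_PER_WORKER)

print(f"daily uploads:      {daily_uploads}")
print(f"average upload TPS: {avg_tps:.2f}")
print(f"peak upload TPS:    {peak_tps:.2f}")
print(f"peak pages/sec:     {peak_pages_per_sec:.1f}")
print(f"workers to keep up with peak: {workers_needed}")
```

Sizing workers for the instantaneous peak is an upper bound. With a queue in front of the workers, short peaks can be absorbed as waiting time instead of extra GPUs.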
That is why the system should separate request intake from document processing.
High-level architecture
The simplified flow looks like this.
Client
  │
  ▼
API Gateway / Load Balancer
  │
  ▼
API Server
  │
  ├── store original file ──► Object Storage
  │
  └── publish job ─────────► Message Queue
                                  │
                                  ▼
                             GPU Workers
                                  │
                                  ▼
                  RDB / Vector DB / Object Storage
The API server is responsible for fast and safe intake. It validates file type and size, checks user permissions, creates metadata, stores the original file, and publishes a processing job. After that, it can return a job ID to the client.
The client can poll the job status or subscribe to updates depending on the product requirements. The expensive work happens outside the request-response path.
Entry and routing layer
The first layer receives external traffic and forwards valid requests to the API server.
Client
│
▼
WAF / CDN
│
▼
API Gateway or ALB
│
▼
API Server
An API Gateway is useful when API management features matter: authentication integration, API keys, rate limiting, request validation, versioning, and logging. An Application Load Balancer is simpler and works well when the main goal is distributing traffic across API servers and removing unhealthy instances through health checks.
For this kind of service, either can be valid depending on the product shape. If the public API surface is large and user-specific rate limiting is important, API Gateway is a natural fit. If it is mostly a web application with internal APIs, ALB may be enough.
This layer should not analyze documents. Its job is to protect and route traffic.
API server layer
The API server handles request validation and job creation.
- Validate file type and size
- Check user authorization
- Create document metadata
- Store the original file in object storage
- Publish a processing job to the queue
- Return the document ID and job status
A common mistake in this kind of system is trying to do too much inside the upload request. If the API server waits for OCR and LLM processing before responding, a large PDF can easily cause timeouts, bad user experience, and poor horizontal scaling.
Instead, the API server should treat document processing as a background job. The upload request only starts the workflow.
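A minimal sketch of that intake path follows. The `Backends` class uses in-memory dictionaries and a list as stand-ins for object storage, the relational database, and the queue, and the size limit, allowed types, and `accept_upload` name are all assumptions. The authorization check is left out to keep the example short.

```python
import uuid
from dataclasses import dataclass, field

MAX_BYTES = 50 * 1024 * 1024  # hypothetical upload limit
ALLOWED_TYPES = {"application/pdf", "image/png", "image/jpeg"}


@dataclass
class Backends:
    """In-memory stand-ins for object storage, the relational DB, and the queue."""
    objects: dict = field(default_factory=dict)
    documents: dict = field(default_factory=dict)
    queue: list = field(default_factory=list)


def accept_upload(b: Backends, user_id: str, content_type: str, data: bytes) -> dict:
    """Validate, store, record, enqueue, and return immediately with a job handle."""
    if content_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported content type: {content_type}")
    if not data or len(data) > MAX_BYTES:
        raise ValueError("file is empty or exceeds the size limit")

    doc_id = str(uuid.uuid4())
    key = f"originals/{user_id}/{doc_id}"

    b.objects[key] = data  # store the original file
    b.documents[doc_id] = {"owner": user_id, "key": key, "status": "UPLOADED"}
    b.queue.append({"doc_id": doc_id, "key": key})  # publish the processing job
    b.documents[doc_id]["status"] = "QUEUED"

    # No OCR or LLM work happens here; the client gets a handle it can poll.
    return {"document_id": doc_id, "status": "QUEUED"}


backends = Backends()
print(accept_upload(backends, "user-1", "application/pdf", b"%PDF-1.7 ..."))
```

The function does nothing whose duration depends on the number of pages, which is what keeps the upload request fast regardless of document size.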
Queue layer and idempotency
A message queue sits between the API server and the worker layer. In AWS, SQS is a reasonable default.
The queue gives the system backpressure. If upload traffic temporarily exceeds worker capacity, jobs can wait in the queue instead of overwhelming the GPU workers. Workers can scale out based on queue depth, age of oldest message, or GPU utilization.
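One simple way to turn queue depth into a scaling decision is to ask how many workers would drain the current backlog within a target time. The sketch below assumes a hypothetical `desired_workers` helper and made-up limits; a real autoscaler would also smooth the signal and respect cooldown periods.

```python
import math


def desired_workers(
    queue_depth: int,
    avg_job_seconds: float,
    target_drain_seconds: float,
    min_workers: int = 1,
    max_workers: int = 50,
) -> int:
    """Workers needed to drain the backlog within the target time, clamped to limits."""
    if target_drain_seconds <= 0:
        raise ValueError("target_drain_seconds must be positive")
    backlog_seconds = queue_depth * avg_job_seconds
    needed = math.ceil(backlog_seconds / target_drain_seconds)
    return max(min_workers, min(max_workers, needed))


# 100 waiting jobs at 30 seconds each, to be drained within 5 minutes.
print(desired_workers(queue_depth=100, avg_job_seconds=30, target_drain_seconds=300))
```

The upper clamp matters as much as the formula: it is what caps GPU cost when a burst arrives, and the queue holds the excess until workers catch up.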
Idempotency is important here. The same job may be delivered more than once, or a worker may crash after partially completing a job. The worker should be able to safely retry without duplicating results or corrupting state.
A simple state machine helps.
UPLOADED
-> QUEUED
-> PROCESSING
-> SUCCEEDED | FAILED | REVIEW_REQUIRED
Each transition should be explicit. If a job is retried, the worker should check the current state and decide whether it can continue, restart, or skip.
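A common way to make transitions explicit and safe under duplicate delivery is a compare-and-set update: the status changes only if the job is still in the state the worker expects. The sketch below uses SQLite so it runs anywhere. The `jobs` table, the `transition` helper, and the transition map are assumptions, and the map reflects my reading that SUCCEEDED, FAILED, and REVIEW_REQUIRED are alternative outcomes of PROCESSING.

```python
import sqlite3

# Allowed transitions, keyed by the current state.
ALLOWED = {
    "UPLOADED": {"QUEUED"},
    "QUEUED": {"PROCESSING"},
    "PROCESSING": {"SUCCEEDED", "FAILED", "REVIEW_REQUIRED"},
}


def transition(conn: sqlite3.Connection, job_id: str, expected: str, target: str) -> bool:
    """Move a job from `expected` to `target` only if it is still in `expected`."""
    if target not in ALLOWED.get(expected, set()):
        raise ValueError(f"illegal transition {expected} -> {target}")
    cur = conn.execute(
        "UPDATE jobs SET status = ? WHERE id = ? AND status = ?",
        (target, job_id, expected),
    )
    conn.commit()
    # Zero rows updated means another delivery already moved the job.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO jobs VALUES ('job-1', 'QUEUED')")

first = transition(conn, "job-1", "QUEUED", "PROCESSING")      # this worker claims the job
duplicate = transition(conn, "job-1", "QUEUED", "PROCESSING")  # a redelivered message loses
print(first, duplicate)
```

A worker that gets `False` back knows the message is a duplicate and can skip it. Recovering a job stuck in PROCESSING after a crash needs an extra signal, such as a lease or heartbeat timestamp, which this sketch leaves out.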
GPU worker layer
The worker layer performs the expensive pipeline.
1. Download original document
2. Split PDF into pages if needed
3. Preprocess images
4. Run OCR and layout analysis
5. Calculate confidence scores
6. Mask PII before external model calls
7. Run LLM processing when needed
8. Store structured output and embeddings
9. Update job status
Image preprocessing may include resizing, rotation correction, denoising, binarization, and page boundary detection. These steps are not glamorous, but they often decide whether OCR performs well.
OCR and layout analysis should preserve more than text. The system needs coordinates, paragraphs, tables, reading order, and field boundaries. A document is not just a sequence of characters.
PII masking should happen before sensitive content is sent to an external LLM. Names, resident numbers, account numbers, phone numbers, addresses, and other identifiers may need to be replaced or redacted depending on the domain.
LLM processing should be selective. Not every document needs a large model call. Some tasks can be handled by OCR output, rules, embeddings, or smaller models. Sending everything to an LLM is simple, but it can be expensive and risky.
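As a rough illustration of the masking step, the sketch below replaces pattern-matched identifiers with placeholders and keeps a mapping so the original values can be restored after the model call. The patterns and the `mask_pii` name are assumptions, and they are deliberately simplistic: regular expressions catch well-formatted emails, phone numbers, and long digit runs, but names and addresses generally need a trained entity recognizer.

```python
import re

# Illustrative patterns only; real formats vary by country and document type.
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("PHONE", re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b")),
    ("ACCOUNT", re.compile(r"\b\d{10,16}\b")),
]


def mask_pii(text: str) -> tuple[str, dict[str, str]]:
    """Replace matches with numbered placeholders and return the placeholder mapping."""
    mapping: dict[str, str] = {}
    for label, pattern in PATTERNS:

        def replace(match: re.Match, label: str = label) -> str:
            token = f"[{label}_{len(mapping) + 1}]"
            mapping[token] = match.group(0)
            return token

        text = pattern.sub(replace, text)
    return text, mapping


masked, mapping = mask_pii(
    "Call 010-1234-5678 or mail kim@example.com about account 1234567890123."
)
print(masked)
print(mapping)
```

The mapping should stay inside the trusted boundary. Only the masked text is sent to the external model, and placeholders in the response can be swapped back afterwards.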
Storage choices
A Document AI system usually needs more than one storage layer.
Object Storage
- Original files
- Page images
- Processed artifacts
- Large JSON or Markdown outputs
Relational Database
- Users
- Documents
- Job status
- Metadata
- Review status
- Audit records
Vector Database
- Embeddings
- Chunk retrieval
- Semantic search
- RAG context lookup
Object storage is the right place for large binary files and generated artifacts. The relational database owns metadata and transactional state. The vector database is used for retrieval, not as the source of truth.
That boundary matters. If the vector database is treated as the main store, it becomes harder to reason about job status, document ownership, audit trails, and recovery.
Large documents and queue starvation
Large documents can create a starvation problem. If a 300-page PDF and a one-page receipt sit in the same queue, workers may spend a long time on the large document while small jobs wait behind it.
There are a few ways to handle this.
- Split large documents into page-level jobs
- Use separate queues for small and large jobs
- Prioritize short jobs when user latency matters
- Limit per-job processing time
- Reassemble page-level results after processing
Splitting by page gives better parallelism, but it adds coordination overhead. The system needs to know when all pages are done, how to merge results, and how to handle partial failures. A separate queue is simpler, but it may be less efficient if traffic patterns vary.
The right answer depends on the product. For interactive uploads, keeping small jobs fast is usually worth the extra complexity.
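The first two options and the reassembly step can be sketched together. The queue names, the page cutoff, and the `PageTracker` class are hypothetical, and the tracker keeps state in memory where a real system would keep it in the database.

```python
SMALL_DOC_MAX_PAGES = 10  # hypothetical cutoff between the two queues


def route(page_count: int) -> str:
    """Pick a queue so short jobs never wait behind long ones."""
    return "small-docs" if page_count <= SMALL_DOC_MAX_PAGES else "large-docs"


def page_jobs(doc_id: str, page_count: int) -> list[dict]:
    """Fan a large document out into one job per page."""
    return [{"doc_id": doc_id, "page": n, "total": page_count} for n in range(1, page_count + 1)]


class PageTracker:
    """Collects page-level results and reports when the document is complete."""

    def __init__(self, total_pages: int) -> None:
        self.total_pages = total_pages
        self.results: dict[int, str] = {}

    def record(self, page: int, result: str) -> bool:
        # Keyed by page number, so a duplicate delivery overwrites instead of double counting.
        self.results[page] = result
        return len(self.results) == self.total_pages

    def merged(self) -> list[str]:
        """Return results in page order, regardless of completion order."""
        return [self.results[p] for p in sorted(self.results)]


tracker = PageTracker(total_pages=3)
for job in reversed(page_jobs("doc-1", 3)):  # pages may finish in any order
    done = tracker.record(job["page"], f"text of page {job['page']}")
print(route(1), route(300), done, tracker.merged())
```

Keying results by page number is what makes retries cheap here: a page that is processed twice simply replaces its earlier result.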
Chunking and context loss
For retrieval and LLM processing, the document eventually needs to be chunked. Chunking by a fixed character count is simple, but it can split a table, paragraph, or legal clause in the wrong place.
Better chunking should use document structure when possible.
- Section headings
- Paragraph boundaries
- Table boundaries
- Page numbers
- Coordinates
- Semantic similarity
The goal is to preserve enough context for retrieval without making each chunk too large. If chunks are too small, the model loses context. If chunks are too large, retrieval becomes noisy and expensive.
This is one of the places where Document AI differs from plain text processing. Layout matters.
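A minimal sketch of structure-aware chunking, assuming the layout step has already produced typed blocks: blocks are never split, a heading always starts a new chunk, and page numbers travel with each chunk as metadata. The `Block` type, the `chunk_blocks` name, and the character budget are assumptions, and semantic similarity is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # "heading", "paragraph", or "table"
    text: str
    page: int


def chunk_blocks(blocks: list[Block], max_chars: int = 1000) -> list[dict]:
    """Group layout blocks into chunks without ever splitting a block."""
    groups: list[list[Block]] = []
    current: list[Block] = []
    size = 0

    for block in blocks:
        starts_section = block.kind == "heading"
        over_budget = size + len(block.text) > max_chars
        if current and (starts_section or over_budget):
            groups.append(current)
            current, size = [], 0
        current.append(block)
        size += len(block.text)

    if current:
        groups.append(current)

    return [
        {"text": "\n\n".join(b.text for b in g), "pages": sorted({b.page for b in g})}
        for g in groups
    ]


blocks = [
    Block("heading", "1. Coverage", 1),
    Block("paragraph", "a" * 600, 1),
    Block("table", "b" * 600, 2),
    Block("heading", "2. Exclusions", 2),
    Block("paragraph", "c" * 100, 2),
]
for chunk in chunk_blocks(blocks):
    print(len(chunk["text"]), chunk["pages"])
```

A single block larger than the budget still becomes one oversized chunk rather than being cut in the middle. Whether to split such a block, and where, is a separate decision that depends on the block type.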
Failure handling
Failures are normal in this system. OCR can fail. A page image can be corrupt. An external model can time out. A worker can crash. A queue message can be delivered twice.
The system needs retry policies, dead-letter queues, and clear status updates.
Retryable failures
- Temporary model timeout
- Network failure
- Worker interruption
Non-retryable or review-required cases
- Unsupported file format
- Corrupt document
- Low confidence on critical fields
- Policy violation
A dead-letter queue is useful when a job repeatedly fails. It prevents one bad job from being retried forever and gives operators a place to inspect failures.
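The split between retryable and non-retryable failures can be sketched as follows. The exception classes, the attempt limit, and the `handle` function are hypothetical, and plain lists stand in for the main queue and the dead-letter queue. With SQS, the attempt limit would normally be enforced by the queue itself through a redrive policy and its `maxReceiveCount` setting rather than in worker code.

```python
class RetryableError(Exception):
    """Temporary problems: model timeout, network failure, worker interruption."""


class PermanentError(Exception):
    """Problems a retry cannot fix: unsupported format, corrupt document."""


MAX_ATTEMPTS = 3  # hypothetical limit


def backoff_seconds(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff delay before the next attempt, capped at `cap` seconds."""
    return min(cap, base * 2 ** (attempt - 1))


def handle(job: dict, process, queue: list, dead_letters: list) -> str:
    """Run one delivery attempt and decide where the job goes next."""
    job["attempts"] = job.get("attempts", 0) + 1
    try:
        process(job)
        return "SUCCEEDED"
    except PermanentError:
        return "FAILED"  # retrying would not help, so fail immediately
    except RetryableError:
        if job["attempts"] >= MAX_ATTEMPTS:
            dead_letters.append(job)  # park it for operators to inspect
            return "FAILED"
        job["retry_after"] = backoff_seconds(job["attempts"])
        queue.append(job)
        return "QUEUED"


def always_times_out(job: dict) -> None:
    raise RetryableError("model timeout")


queue, dead_letters = [], []
job = {"doc_id": "doc-1"}
print([handle(job, always_times_out, queue, dead_letters) for _ in range(3)])
print(len(dead_letters))
```

Low-confidence results are not modeled as errors here. They are successful runs whose output fails validation, so they belong in the review queue rather than in the retry path.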
For user experience, status should be honest. “Processing” is not enough for every case. The product may need states like queued, extracting text, analyzing layout, waiting for review, failed, or completed.
Closing thought
The main lesson from this design is that Document AI is not just an OCR endpoint. It is an asynchronous processing system with heavy compute, uncertain outputs, privacy boundaries, and recovery requirements.
The architecture becomes much easier to reason about when each layer has a clear job. The API server receives and validates. Object storage keeps large files. The queue absorbs load. GPU workers process documents. The database owns state. The vector store supports retrieval.
Once those boundaries are clear, the system can scale without turning every upload request into a long, fragile synchronous operation.