Designing a Document AI System for 100K DAU

I recently studied how a large-scale Document AI pipeline could be designed, using a service with roughly 100,000 daily active users as the working assumption.

Document AI is different from a typical web application. A single request may involve image preprocessing, OCR, layout analysis, PII masking, LLM processing, indexing, and retrieval. Some of those tasks are CPU-friendly, but OCR, layout models, and LLM inference can quickly become GPU-bound. That changes the architecture.

The main point of this design is simple: the API server should not process the document all the way through synchronously. It should accept the upload, validate it, store the original file, create a job, and hand the heavy work to a queue-backed worker layer.

Requirements

The functional requirements are straightforward.

- Upload PDF and image documents
- Extract text from uploaded documents
- Analyze layout, tables, paragraphs, and document structure
- Convert results into Markdown, JSON, or another structured format
- Support search or question answering over processed documents
- Expose document processing status

The important part is that this is not just raw OCR. If the system only extracts text, it loses much of the document’s meaning. A useful Document AI pipeline needs structure: sections, tables, fields, coordinates, and enough metadata to support later retrieval or question answering.

The non-functional requirements matter even more.

- Accuracy for OCR and layout recognition
- Latency control for small documents
- Asynchronous handling for large documents
- Throughput under peak traffic
- Independent scaling of API servers and workers
- Retry and dead-letter handling for failures
- PII masking before external model calls
- Human review for low-confidence or high-risk cases

Accuracy does not mean only average OCR accuracy. In insurance, finance, contracts, or medical documents, a small error in an amount, date, name, or field mapping can affect a real decision. A system that is 95% accurate overall may still need human review for the remaining uncertain cases.

That is where confidence scores and rule validation become important. A payment-related field may require a higher threshold than ordinary text. If the confidence is low, the total does not match the line items, or a required field is missing, the job should move into a human review queue instead of being automatically accepted.
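The routing rule above can be sketched as a small function. This is only an illustration: the threshold values, the field names, and the `Field` type are assumptions made for the example, not values from any particular OCR product.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values depend on the domain and measured error rates.
DEFAULT_THRESHOLD = 0.80
CRITICAL_THRESHOLD = 0.95
CRITICAL_FIELDS = {"total_amount", "payment_date", "account_number"}
REQUIRED_FIELDS = {"total_amount", "payment_date"}


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, as reported by the OCR or extraction model


def review_reasons(fields: list[Field], line_items: list[float]) -> list[str]:
    """Return the reasons a job needs human review; an empty list means auto-accept."""
    reasons = []
    by_name = {f.name: f for f in fields}

    # Rule 1: every required field must be present.
    for name in sorted(REQUIRED_FIELDS - set(by_name)):
        reasons.append(f"missing required field: {name}")

    # Rule 2: payment-related fields use a stricter confidence threshold.
    for f in fields:
        threshold = CRITICAL_THRESHOLD if f.name in CRITICAL_FIELDS else DEFAULT_THRESHOLD
        if f.confidence < threshold:
            reasons.append(f"low confidence on {f.name}: {f.confidence:.2f} < {threshold:.2f}")

    # Rule 3: the extracted total must match the sum of the line items.
    total = by_name.get("total_amount")
    if total is not None and line_items:
        try:
            if abs(float(total.value) - sum(line_items)) > 0.005:
                reasons.append("total does not match line items")
        except ValueError:
            reasons.append("total is not a number")

    return reasons
```

A job with a non-empty list would move to the human review queue, and the reasons can be shown to the reviewer.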

Capacity estimate

I used a simple traffic assumption.

DAU: 100,000 users
Average uploads per user per day: 3
Daily uploads: 300,000

There are 86,400 seconds in a day, so the average upload TPS is roughly:

300,000 / 86,400 = about 3.47 TPS

Average TPS alone is misleading because traffic is rarely uniform. If peak traffic is five times the average, that is about 17 upload requests per second, so the service should plan for roughly 20.

For a CRUD API, 20 TPS is not scary. For an AI document pipeline, it can be significant. One document may contain many pages, and each page may require preprocessing, OCR, layout detection, masking, chunking, embedding, and possibly LLM calls. The real worker capacity depends on page count, model latency, GPU utilization, and retry behavior, not only upload TPS.
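To make the gap between upload TPS and worker load concrete, here is a rough back-of-the-envelope calculation. The DAU, uploads per user, and peak factor come from the estimate above. The pages per document, seconds per page, and target utilization are made-up numbers for illustration, and the model assumes each worker processes one page at a time.

```python
import math

SECONDS_PER_DAY = 86_400


def upload_tps(dau, uploads_per_user, peak_factor):
    """Return (average TPS, peak TPS) for uploads."""
    daily_uploads = dau * uploads_per_user
    average = daily_uploads / SECONDS_PER_DAY
    return average, average * peak_factor


def workers_needed(peak_tps, pages_per_doc, seconds_per_page, utilization):
    """Workers needed so page arrivals do not outrun page processing at peak."""
    pages_per_second = peak_tps * pages_per_doc
    busy_worker_seconds = pages_per_second * seconds_per_page
    # Divide by target utilization to leave headroom for retries and jitter.
    return math.ceil(busy_worker_seconds / utilization)


average, peak = upload_tps(dau=100_000, uploads_per_user=3, peak_factor=5)
print(f"average {average:.2f} TPS, peak {peak:.2f} TPS")

# Assumed workload: 4 pages per document, 1.5 s per page, 70% target utilization.
print(workers_needed(peak, pages_per_doc=4, seconds_per_page=1.5, utilization=0.7))
```

With these assumed numbers the peak is about 17.36 uploads per second, which turns into roughly 69 pages per second and 149 single-page workers. Changing the page count or per-page latency moves that number far more than changing the upload rate does.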

That is why the system should separate request intake from document processing.

High-level architecture

The simplified flow looks like this.

flowchart TD
    Client[Client / B2B User] --> Gateway[API Gateway / ALB Auth / Rate Limit]
    Gateway --> API[API Server Ingest / Extract API]
    API -->|store original| S3[S3 Originals / Artifacts]
    API -->|create job state| DB[(RDS Documents / Jobs / Billing)]
    API -->|publish job| Queue[SQS]
    Queue --> Worker[Worker Convert / Parse / Extract]
    Worker -->|read and write files| S3
    Worker --> Vision[Vision / Document Parser]
    Worker --> LLM[LLM]
    Worker -->|status / result / usage| DB
    Worker -->|publish progress event| EventBus[EventBridge / PubSub]
    EventBus -->|status event| API
    Client -. SSE / WebSocket progress .-> Gateway
    Gateway -. proxied stream .-> API

    classDef api fill:#172554,stroke:#60a5fa,color:#f8fafc;
    classDef async fill:#422006,stroke:#f59e0b,color:#f8fafc;
    classDef storage fill:#581c87,stroke:#c084fc,color:#fdf4ff;
    classDef model fill:#064e3b,stroke:#34d399,color:#ecfdf5;
    classDef event fill:#312e81,stroke:#a78bfa,color:#f8fafc;
    class Client,Gateway,API api;
    class Queue,Worker async;
    class S3,DB storage;
    class Vision,LLM model;
    class EventBus event;

Document AI ingest, extract, and progress-update paths

The API server is responsible for fast and safe intake. It validates file type and size, checks user permissions, creates metadata, stores the original file, and publishes a processing job. After that, it can return a job ID to the client.

For live progress delivery, the client opens an SSE or WebSocket connection to API Gateway or ALB, and the entry layer proxies that connection to the API server. The worker does not need to know which API server owns the client connection. It updates job state in the database and publishes status-change events to an event bus such as EventBridge or to a Pub/Sub channel. The API server receives those events and forwards them over the open client connection.

Entry and routing layer

The first layer receives external traffic and forwards valid requests to the API server.

Client
  │
  ▼
WAF / CDN
  │
  ▼
API Gateway or ALB
  │
  ▼
API Server

An API Gateway is useful when API management features matter: authentication integration, API keys, rate limiting, request validation, versioning, and logging. An Application Load Balancer is simpler and works well when the main goal is distributing traffic across API servers and removing unhealthy instances through health checks.

For this kind of service, either can be valid depending on the product shape. If the public API surface is large and user-specific rate limiting is important, API Gateway is a natural fit. If it is mostly a web application with internal APIs, ALB may be enough.

This layer should not analyze documents. Its job is to protect and route traffic.

API server layer

The API server handles request validation and job creation.

- Validate file type and size
- Check user authorization
- Create document metadata
- Store the original file in object storage
- Publish a processing job to the queue
- Return the document ID and job status

A common mistake in this kind of system is trying to do too much inside the upload request. If the API server waits for OCR and LLM processing before responding, a large PDF can easily cause timeouts, bad user experience, and poor horizontal scaling.

Instead, the API server should treat document processing as a background job. The upload request only starts the workflow.

With that boundary, POST /document/ingest becomes a job-creation API rather than an API that promises the final extraction result. GET /document/extract/{taskId} reads the current state and, once the job has finished, the result. While the job is waiting or running, it returns a state such as PENDING or PROCESSING. If processing failed, it returns FAILED instead of a result.
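A framework-free sketch of those two endpoints might look like the following. The `IngestService` class, the allow-list, the size limit, and the status codes are assumptions for the example. The dictionaries and list stand in for object storage, the jobs table, and the queue.

```python
import uuid
from dataclasses import dataclass, field

ALLOWED_TYPES = {"application/pdf", "image/png", "image/jpeg"}  # assumed allow-list
MAX_BYTES = 50 * 1024 * 1024  # assumed 50 MiB limit


@dataclass
class IngestService:
    """In-memory stand-ins for object storage, the jobs table, and the queue."""
    objects: dict = field(default_factory=dict)
    jobs: dict = field(default_factory=dict)
    queue: list = field(default_factory=list)

    def ingest(self, user_id: str, content_type: str, body: bytes) -> dict:
        """POST /document/ingest: validate, store, create a job, enqueue, return."""
        if content_type not in ALLOWED_TYPES:
            return {"status": 415, "error": "unsupported file type"}
        if len(body) > MAX_BYTES:
            return {"status": 413, "error": "file too large"}

        task_id = str(uuid.uuid4())
        key = f"originals/{user_id}/{task_id}"
        self.objects[key] = body  # store the original before anything else
        self.jobs[task_id] = {"owner": user_id, "state": "PENDING", "result": None}
        self.queue.append({"task_id": task_id, "key": key})  # hand off the heavy work
        return {"status": 202, "taskId": task_id, "state": "PENDING"}

    def extract(self, user_id: str, task_id: str) -> dict:
        """GET /document/extract/{taskId}: report state, plus the result when ready."""
        job = self.jobs.get(task_id)
        if job is None or job["owner"] != user_id:
            return {"status": 404, "error": "not found"}
        return {"status": 200, "taskId": task_id,
                "state": job["state"], "result": job["result"]}
```

The ingest call never touches OCR or an LLM, so its latency depends only on validation, one object write, one row insert, and one queue publish.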

The API server should not invent progress state by itself. When the worker enters steps such as conversion, parsing, or extraction, it updates the job status in the database and publishes a status-change event. On AWS, EventBridge is a reasonable event-bus option for this kind of event fan-out. For lighter real-time notifications, Redis Pub/Sub can also work, as long as the database remains the source of truth.

This keeps responsibilities separate. The worker owns document processing and state updates. The API server owns client connections and event forwarding. The event channel is not the source of truth; it is the delivery path for state changes.
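Here is a minimal in-process sketch of that split. `JobStore` stands in for the database and `EventChannel` for EventBridge or Redis Pub/Sub. Both classes, the `ApiServer` type, and the step names are hypothetical, and a real deployment would use network clients rather than direct callbacks.

```python
from collections import defaultdict


class JobStore:
    """Stand-in for the database, which remains the source of truth."""
    def __init__(self):
        self.states = {}

    def set_state(self, task_id, state):
        self.states[task_id] = state


class EventChannel:
    """Stand-in for the event bus: delivery only, no storage."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, event):
        for callback in self.subscribers:
            callback(event)


class ApiServer:
    """Owns client connections and forwards events for the jobs they watch."""
    def __init__(self, store, channel):
        self.store = store
        self.connections = defaultdict(list)  # task_id -> list of client outboxes
        channel.subscribe(self.on_event)

    def open_stream(self, task_id):
        outbox = []
        state = self.store.states.get(task_id)
        if state is not None:
            outbox.append(state)  # the first message always comes from the database
        self.connections[task_id].append(outbox)
        return outbox

    def on_event(self, event):
        # .get avoids creating entries for jobs nobody on this server is watching.
        for outbox in self.connections.get(event["task_id"], []):
            outbox.append(event["state"])


def worker_step(store, channel, task_id, state):
    store.set_state(task_id, state)                        # 1. persist the state
    channel.publish({"task_id": task_id, "state": state})  # 2. then notify
```

Because a new stream starts from the stored state, a client that connects late or reconnects still sees the right status even if it missed earlier events.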

Queue layer and idempotency

A message queue sits between the API server and the worker layer. In AWS, SQS is a reasonable default.

The queue gives the system backpressure. If upload traffic temporarily exceeds worker capacity, jobs can wait in the queue instead of overwhelming the GPU workers. Workers can scale out based on queue depth, age of oldest message, or GPU utilization.
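As a rough illustration of scaling on queue signals, the function below picks a worker count from queue depth and the age of the oldest message. The per-worker throughput, drain target, age limit, and worker bounds are all assumed values, not recommendations.

```python
import math


def desired_workers(queue_depth, oldest_age_s, current_workers, *,
                    jobs_per_worker_per_min=6.0, target_drain_min=5.0,
                    max_age_s=120.0, min_workers=1, max_workers=50):
    """Pick a worker count that drains the backlog within the target window."""
    # How many jobs one worker can clear inside the drain window.
    capacity_per_worker = jobs_per_worker_per_min * target_drain_min
    needed = math.ceil(queue_depth / capacity_per_worker)

    # If the oldest message has waited too long, add at least one worker
    # even when the depth alone looks fine.
    if oldest_age_s > max_age_s:
        needed = max(needed, current_workers + 1)

    # Clamp to the allowed range so a traffic spike cannot scale without bound.
    return max(min_workers, min(max_workers, needed))
```

For example, a backlog of 300 jobs with these defaults asks for 10 workers, while an empty queue falls back to the minimum of 1.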

Idempotency is important here. The same job may be delivered more than once, or a worker may crash after partially completing a job. The worker should be able to safely retry without duplicating results or corrupting state.

A simple state machine helps.

UPLOADED
  -> QUEUED
  -> PROCESSING
  -> SUCCEEDED | FAILED | REVIEW_REQUIRED

Each transition should be explicit. If a job is retried, the worker should check the current state and decide whether it can continue, restart, or skip.
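One way to make the transitions explicit is a small transition table plus a delivery handler that decides between start, restart, and skip. This sketch treats SUCCEEDED, FAILED, and REVIEW_REQUIRED as terminal states and keeps jobs in a plain dictionary. In a real database the check and the update would need to be a single conditional write, and telling a crashed attempt from a concurrent one would need something like a lease, which is not shown here.

```python
TERMINAL = {"SUCCEEDED", "FAILED", "REVIEW_REQUIRED"}

# Allowed transitions; anything not listed is rejected.
TRANSITIONS = {
    "UPLOADED": {"QUEUED"},
    "QUEUED": {"PROCESSING"},
    "PROCESSING": TERMINAL,
}


class InvalidTransition(Exception):
    pass


def transition(jobs, job_id, new_state):
    """Move a job to new_state only if the table allows it."""
    current = jobs[job_id]
    if new_state not in TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {new_state}")
    jobs[job_id] = new_state


def on_delivery(jobs, job_id):
    """Decide what a worker does with a possibly duplicated queue message."""
    state = jobs[job_id]
    if state in TERMINAL:
        return "skip"      # already finished; a duplicate delivery is harmless
    if state == "PROCESSING":
        return "restart"   # assumes the previous attempt crashed mid-job
    if state == "QUEUED":
        transition(jobs, job_id, "PROCESSING")
        return "start"
    raise InvalidTransition(f"unexpected delivery in state {state}")
```

A second delivery of the same message after success returns "skip", so results are not written twice.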

GPU worker layer

The worker layer performs the expensive pipeline.

1. Download original document
2. Normalize the document format
3. Split PDF into pages if needed
4. Preprocess images
5. Run OCR and layout analysis
6. Calculate confidence scores
7. Mask PII before external model calls
8. Run LLM processing when needed
9. Store structured output and embeddings
10. Update job status

Format normalization should happen before the model calls. If the vision API only accepts JPG, then PDF, PNG, and DOCX files have to be converted into page-level JPG images first. Page count and image count then become part of both worker capacity planning and billing metadata.

Image preprocessing may include resizing, rotation correction, denoising, binarization, and page boundary detection. These steps are not glamorous, but they often decide whether OCR performs well.

OCR and layout analysis should preserve more than text. The system needs coordinates, paragraphs, tables, reading order, and field boundaries. A document is not just a sequence of characters.

PII masking should happen before sensitive content is sent to an external LLM. Names, resident numbers, account numbers, phone numbers, addresses, and other identifiers may need to be replaced or redacted depending on the domain.
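A minimal sketch of reversible masking with regular expressions is shown below. The three patterns are illustrative assumptions only: a 6-digit plus 7-digit resident number format, a hyphenated phone format, and a simple email shape. Names and addresses generally need a trained entity recognizer rather than regexes, which this example does not attempt.

```python
import re

# Illustrative patterns; real deployments need locale-specific and audited rules.
PATTERNS = [
    ("RESIDENT_NO", re.compile(r"\b\d{6}-\d{7}\b")),
    ("PHONE", re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b")),
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
]


def mask_pii(text):
    """Replace matches with numbered placeholders; return masked text and the mapping."""
    mapping = {}
    for label, pattern in PATTERNS:
        def replace(match, label=label):
            token = f"[{label}_{len(mapping) + 1}]"
            mapping[token] = match.group(0)  # keep the original value locally
            return token
        text = pattern.sub(replace, text)
    return text, mapping


def unmask(text, mapping):
    """Restore original values in text that came back from the external model."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text
```

Only the masked text leaves the system. The mapping stays inside the trust boundary so placeholders in the model's answer can be restored afterwards.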

LLM processing should be selective. Not every document needs a large model call. Some tasks can be handled by OCR output, rules, embeddings, or smaller models. Sending everything to an LLM is simple, but it can be expensive and risky.

Storage choices

A Document AI system usually needs more than one storage layer.

Object Storage
- Original files
- Page images
- Processed artifacts
- Large JSON or Markdown outputs

Relational Database
- Users
- Documents
- Job status
- Metadata
- Review status
- Audit records

Vector Database
- Embeddings
- Chunk retrieval
- Semantic search
- RAG context lookup

Object storage is the right place for large binary files and generated artifacts. The relational database owns metadata and transactional state. The vector database is used for retrieval, not as the source of truth.

That boundary matters. If the vector database is treated as the main store, it becomes harder to reason about job status, document ownership, audit trails, and recovery.

Large documents and queue starvation

Large documents can create a starvation problem. If a 300-page PDF and a one-page receipt sit in the same queue, workers may spend a long time on the large document while small jobs wait behind it.

There are a few ways to handle this.

- Split large documents into page-level jobs
- Use separate queues for small and large jobs
- Prioritize short jobs when user latency matters
- Limit per-job processing time
- Reassemble page-level results after processing

Splitting by page gives better parallelism, but it adds coordination overhead. The system needs to know when all pages are done, how to merge results, and how to handle partial failures. A separate queue is simpler, but it may be less efficient if traffic patterns vary.
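The coordination overhead for page-level jobs can be sketched as a small tracker that records per-page outcomes and merges them once every page is accounted for. The `PageTracker` class and its in-memory fields are assumptions for illustration. In practice this state would live in the database.

```python
from dataclasses import dataclass, field


@dataclass
class PageTracker:
    """Tracks page-level results for one document."""
    total_pages: int
    results: dict = field(default_factory=dict)  # page number -> extracted text
    failed: set = field(default_factory=set)     # pages that gave up after retries

    def record_success(self, page, text):
        # Overwriting by page number makes duplicate deliveries harmless.
        self.results[page] = text
        self.failed.discard(page)

    def record_failure(self, page):
        # A late failure must not erase a page that already succeeded.
        if page not in self.results:
            self.failed.add(page)

    def is_complete(self):
        return len(self.results) + len(self.failed) == self.total_pages

    def merge(self):
        """Join page texts in page order and report which pages are missing."""
        if not self.is_complete():
            raise RuntimeError("pages still in flight")
        ordered = [self.results[p] for p in sorted(self.results)]
        return {"text": "\n".join(ordered), "missing_pages": sorted(self.failed)}
```

Whether a document with missing pages is marked failed, sent to review, or returned as a partial result is a product decision that the caller makes from `missing_pages`.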

The right answer depends on the product. For interactive uploads, keeping small jobs fast is usually worth the extra complexity.

Chunking and context loss

For retrieval and LLM processing, the document eventually needs to be chunked. Chunking by a fixed character count is simple, but it can split a table, paragraph, or legal clause in the wrong place.

Better chunking should use document structure when possible.

- Section headings
- Paragraph boundaries
- Table boundaries
- Page numbers
- Coordinates
- Semantic similarity

The goal is to preserve enough context for retrieval without making each chunk too large. If chunks are too small, the model loses context. If chunks are too large, retrieval becomes noisy and expensive.
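As a sketch of structure-aware chunking, the function below packs layout blocks into chunks without splitting any block, starts a new chunk at each heading, and keeps the heading and page numbers as metadata. The block format (`type`, `text`, `page`) and the 800-character limit are assumptions for the example, and semantic similarity is not covered.

```python
def chunk_blocks(blocks, max_chars=800):
    """Group layout blocks into chunks, never splitting a paragraph or table."""
    chunks, texts, pages, size = [], [], [], 0
    heading = ""        # most recent section heading seen
    chunk_heading = ""  # heading in effect when the current chunk started

    def flush():
        nonlocal texts, pages, size
        if texts:
            chunks.append({
                "heading": chunk_heading,
                "pages": sorted(set(pages)),
                "text": "\n\n".join(texts),
            })
        texts, pages, size = [], [], 0

    for block in blocks:
        if block["type"] == "heading":
            flush()  # a new section always starts a new chunk
            heading = block["text"]
            continue
        # Close the current chunk if adding this block would exceed the limit.
        if texts and size + len(block["text"]) > max_chars:
            flush()
        if not texts:
            chunk_heading = heading
        texts.append(block["text"])
        pages.append(block["page"])
        size += len(block["text"])

    flush()
    return chunks
```

A block that is larger than the limit on its own becomes a single oversized chunk rather than being cut, which keeps tables and clauses intact at the cost of uneven chunk sizes.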

This is one of the places where Document AI differs from plain text processing. Layout matters.

Failure handling

Failures are normal in this system. OCR can fail. A page image can be corrupt. An external model can time out. A worker can crash. A queue message can be delivered twice.

The system needs retry policies, dead-letter queues, and clear status updates.

Retryable failures
- Temporary model timeout
- Network failure
- Worker interruption

Non-retryable or review-required cases
- Unsupported file format
- Corrupt document
- Low confidence on critical fields
- Policy violation

A dead-letter queue is useful when a job repeatedly fails. It prevents one bad job from being retried forever and gives operators a place to inspect failures.
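The retry and dead-letter policy can be sketched as follows. The exception classes, the limit of three attempts, and the backoff values are assumptions for illustration, and plain lists stand in for the retry and dead-letter queues. A managed queue can often do the attempt counting itself through its own redrive settings.

```python
class RetryableError(Exception):
    """Temporary problems: model timeout, network failure, worker interruption."""


class PermanentError(Exception):
    """Problems a retry cannot fix: unsupported format, corrupt document."""


MAX_ATTEMPTS = 3  # assumed limit


def backoff_seconds(attempt, base=2.0, cap=60.0):
    """Exponential backoff: 2 s, 4 s, 8 s, ... capped at 60 s."""
    return min(cap, base * 2 ** (attempt - 1))


def handle(job, process, retry_queue, dead_letter_queue):
    """Run one attempt and route the job according to how it ended."""
    job["attempts"] += 1
    try:
        process(job)
    except PermanentError as exc:
        # Retrying cannot help, so fail immediately with a clear reason.
        job["state"], job["error"] = "FAILED", str(exc)
        return
    except RetryableError as exc:
        job["error"] = str(exc)
        if job["attempts"] >= MAX_ATTEMPTS:
            job["state"] = "FAILED"
            dead_letter_queue.append(job)  # park it for operators to inspect
        else:
            job["state"] = "QUEUED"
            retry_queue.append((backoff_seconds(job["attempts"]), job))
        return
    job["state"] = "SUCCEEDED"
```

Separating the two exception types keeps a corrupt file from consuming three GPU attempts before anyone learns it was never going to work.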

For user experience, status should be honest. “Processing” is not enough for every case. The product may need states like queued, extracting text, analyzing layout, waiting for review, failed, or completed.

Closing thought

The main lesson from this design is that Document AI is not just an OCR endpoint. It is an asynchronous processing system with heavy compute, uncertain outputs, privacy boundaries, and recovery requirements.

The architecture becomes much easier to reason about when each layer has a clear job. The API server receives and validates. Object storage keeps large files. The queue absorbs load. GPU workers process documents. The database owns state. The vector store supports retrieval.

Once those boundaries are clear, the system can scale without turning every upload request into a long, fragile synchronous operation.