A Model Name Wasn't Enough

Provider boundaries, latency handling, and failure modes that surfaced while wiring Gemini and Ollama into the same product

At first, it looked like a cost problem.

Once roadmap analysis and topic generation started running through Gemini, the quality improved. But the next question followed immediately: could I keep sending every experiment, every retry, every long-running generation task to an external paid model?

It felt wrong to make the whole development loop depend on that. So I tried adding local models through Ollama. I started the server, tested llama3:8b and gemma2:9b, and expected the change to be mostly mechanical.

In theory, I just needed to swap the model name.

That was not what happened.


Adding another LLM was not the same as adding another API endpoint. Gemini and Ollama both sit under the broad label of “AI model,” but from the application’s point of view, they behave like different providers with different failure modes.

Gemini means network calls, external availability, billing, and timeout management. Ollama removes some of that cost pressure, but local models vary much more in speed and output stability. The difference became especially obvious when I asked them to return structured JSON. Answering like a person and answering in a shape the backend can safely parse are not the same skill.
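
One way to make that gap concrete is to refuse any reply the backend cannot parse and validate, instead of trusting it. The following is a minimal sketch, not the project's actual code: the `topics` / `title` / `questions` schema, the `Topic` class, and `ModelOutputError` are all assumptions made up for illustration.

```python
import json
from dataclasses import dataclass

FENCE = "`" * 3  # Markdown code fence that some models wrap around JSON


class ModelOutputError(ValueError):
    """Raised when a model reply cannot be used as structured data."""


@dataclass
class Topic:
    title: str
    questions: list[str]


def parse_topics(raw: str) -> list[Topic]:
    """Parse a model reply into Topic objects, or raise ModelOutputError."""
    text = raw.strip()
    # Drop a surrounding Markdown fence (opening line and closing line) if present.
    if text.startswith(FENCE):
        lines = text.splitlines()
        if lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines[1:])
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ModelOutputError(f"reply is not valid JSON: {exc}") from exc

    # Hypothetical schema: {"topics": [{"title": str, "questions": [str, ...]}]}
    if not isinstance(data, dict) or not isinstance(data.get("topics"), list):
        raise ModelOutputError("expected an object with a 'topics' list")

    topics = []
    for item in data["topics"]:
        if not isinstance(item, dict):
            raise ModelOutputError("each topic must be a JSON object")
        title = item.get("title")
        questions = item.get("questions")
        if not isinstance(title, str) or not title.strip():
            raise ModelOutputError("topic is missing a non-empty 'title'")
        if not isinstance(questions, list) or not all(isinstance(q, str) for q in questions):
            raise ModelOutputError(f"topic {title!r} has invalid 'questions'")
        topics.append(Topic(title=title, questions=questions))
    return topics
```

A chatty reply such as "Sure! Here are your topics:" fails at the `json.loads` step, which gives the caller a single exception type to retry on or report.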

gemma2:9b was slower than I wanted and more fragile under stricter schema requirements. llama3:8b was the more practical local default because it got through both the analysis tasks and the topic-generation tasks more reliably. At that stage, I did not need the model that looked best on paper. I needed the one that could finish the workflow.

That was when the AI calls needed a real routing layer. Roadmap analysis, topic generation, and interview turns all needed their own provider, model, and timeout settings. Analysis reads long inputs. Topic generation needs structured output. Interview turns need to preserve conversational flow. They are all LLM calls, but they do not have the same product requirements.
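
A routing layer like this can be as small as a lookup table keyed by operation. The sketch below is illustrative only: the `Operation` and `Route` names are hypothetical, and the provider assignments and timeout values are placeholders, since the post does not state the real settings.

```python
from dataclasses import dataclass
from enum import Enum


class Operation(Enum):
    ROADMAP_ANALYSIS = "roadmap_analysis"
    TOPIC_GENERATION = "topic_generation"
    INTERVIEW_TURN = "interview_turn"


@dataclass(frozen=True)
class Route:
    provider: str           # "gemini" or "ollama"
    model: str              # only meaningful together with the provider
    timeout_seconds: float


# Placeholder values: each operation gets its own provider, model, and timeout.
ROUTES = {
    Operation.ROADMAP_ANALYSIS: Route("gemini", "gemini-2.5-flash", 120.0),
    Operation.TOPIC_GENERATION: Route("ollama", "llama3:8b", 180.0),
    Operation.INTERVIEW_TURN: Route("gemini", "gemini-2.5-flash", 30.0),
}


def route_for(operation: Operation) -> Route:
    """Return the settings for one kind of LLM call."""
    return ROUTES[operation]
```

Keeping the table exhaustive over `Operation` means a new kind of call cannot quietly inherit another call's timeout.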


A small bug showed the problem.

I added a test model selector to the dashboard and chose Gemini 2.5 Flash. The backend sent that model name to Ollama. Ollama, quite reasonably, said the model did not exist.

That was not just a small wiring mistake. It exposed a bad assumption: I had treated the model name as if it carried enough information by itself. But once there is more than one provider, a model name is only meaningful inside its provider boundary. gemini-2.5-flash belongs to Gemini. llama3:8b belongs to Ollama.

So the model override and provider override had to travel together. When the test UI selected a model, it also had to determine where that request should go.
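
A minimal sketch of that rule, assuming a small catalog that records which provider owns each model name. `MODEL_CATALOG`, `ModelTarget`, and `resolve_override` are hypothetical names, not the project's code.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical catalog: a model name is only valid inside its provider.
MODEL_CATALOG = {
    "gemini-2.5-flash": "gemini",
    "llama3:8b": "ollama",
    "gemma2:9b": "ollama",
}


@dataclass(frozen=True)
class ModelTarget:
    provider: str
    model: str


def resolve_override(default: ModelTarget, model_override: Optional[str]) -> ModelTarget:
    """Return the target for a request, moving the provider along with the model."""
    if model_override is None:
        return default
    provider = MODEL_CATALOG.get(model_override)
    if provider is None:
        # Fail in the backend instead of forwarding an unknown name to a provider.
        raise ValueError(f"unknown model: {model_override!r}")
    return ModelTarget(provider=provider, model=model_override)
```

With an Ollama default, `resolve_override(ModelTarget("ollama", "llama3:8b"), "gemini-2.5-flash")` yields a Gemini target, so the dashboard selection can no longer land on the wrong server.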

Obvious in hindsight. Most design mistakes are.


Local models also made the latency problem harder to ignore.

In the previous post, I moved roadmap analysis into a background flow so the user would not sit on a form submit for thirty seconds. Topic generation had not received the same treatment yet. It was still too close to a synchronous request: the user asked for topics, the model generated them, and the UI waited for the response.

That was risky with Gemini. With local models, it was worse. A slower model could stretch the request long enough to hit infrastructure timeouts, and the user would end up staring at another frozen-looking screen. The local model I added to reduce cost could easily create UX debt instead.

So topic generation moved into the same status-driven pattern. The request starts the job, the roadmap stores topic-generation state, and the frontend polls for whether the job is running, failed, or complete. The UI does not pretend to know an exact percentage. It shows only what the system actually knows.
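
The pattern can be sketched as a small state machine that the frontend polls. This version keeps state in memory and runs the job on a thread purely for illustration; the post describes the state as stored on the roadmap, and `TopicJobs`, `JobStatus`, and the `idle` state are assumptions.

```python
import threading
from enum import Enum
from typing import Callable


class JobStatus(Enum):
    IDLE = "idle"
    RUNNING = "running"
    FAILED = "failed"
    COMPLETE = "complete"


class TopicJobs:
    """Tracks topic-generation state per roadmap (in memory, for illustration)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._status: dict[str, JobStatus] = {}
        self._errors: dict[str, str] = {}

    def start(self, roadmap_id: str, generate: Callable[[], None]) -> threading.Thread:
        """Mark the job as running and return immediately; work happens off-request."""
        with self._lock:
            if self._status.get(roadmap_id) is JobStatus.RUNNING:
                raise RuntimeError("topic generation is already running")
            self._status[roadmap_id] = JobStatus.RUNNING
            self._errors.pop(roadmap_id, None)

        def run() -> None:
            try:
                generate()
            except Exception as exc:  # record the failure instead of losing it
                with self._lock:
                    self._status[roadmap_id] = JobStatus.FAILED
                    self._errors[roadmap_id] = str(exc)
            else:
                with self._lock:
                    self._status[roadmap_id] = JobStatus.COMPLETE

        thread = threading.Thread(target=run, daemon=True)
        thread.start()
        return thread

    def poll(self, roadmap_id: str) -> dict:
        """What the frontend sees: a state, never an invented percentage."""
        with self._lock:
            status = self._status.get(roadmap_id, JobStatus.IDLE)
            payload = {"status": status.value}
            if status is JobStatus.FAILED:
                payload["error"] = self._errors.get(roadmap_id, "")
            return payload
```

The polling payload carries only the states the system can vouch for, which matches the decision not to show a progress percentage.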

I also added a warning around regeneration. If topics are deleted and recreated, existing mock-interview history connected to those topics can disappear through cascading deletes. To a developer, that is a database relationship doing what it was told to do. To a user, it is “where did my history go?” That needs to be said before the button is pressed.
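
To show what such a warning could be based on, here is a sketch using SQLite with an invented two-table schema. The table names, column names, and warning text are all assumptions; the real project's database and wording are not given in this post.

```python
import sqlite3
from typing import Optional

# Hypothetical schema: deleting a topic cascades to its interview sessions.
SCHEMA = """
CREATE TABLE topics (
    id INTEGER PRIMARY KEY,
    roadmap_id INTEGER NOT NULL,
    title TEXT NOT NULL
);
CREATE TABLE interview_sessions (
    id INTEGER PRIMARY KEY,
    topic_id INTEGER NOT NULL REFERENCES topics(id) ON DELETE CASCADE
);
"""


def count_history_at_risk(conn: sqlite3.Connection, roadmap_id: int) -> int:
    """Count interview sessions that a topic regeneration would delete."""
    row = conn.execute(
        "SELECT COUNT(*) FROM interview_sessions s "
        "JOIN topics t ON t.id = s.topic_id "
        "WHERE t.roadmap_id = ?",
        (roadmap_id,),
    ).fetchone()
    return row[0]


def regeneration_warning(conn: sqlite3.Connection, roadmap_id: int) -> Optional[str]:
    """Return warning text to show before the button is pressed, or None."""
    count = count_history_at_risk(conn, roadmap_id)
    if count == 0:
        return None
    return (
        f"Regenerating topics will permanently delete {count} "
        "mock-interview session(s) linked to the current topics."
    )
```

Note that SQLite only enforces the cascade when `PRAGMA foreign_keys = ON` has been set on the connection.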


That same session was also when the interview loop started becoming real.

Instead of burying the mock interview inside a topic card, I moved it into a dedicated interview page. The screen got a conversation area, a side panel for topic and interviewer metadata, a compact answer composer, Enter-to-send, Shift+Enter for line breaks, optimistic user messages, and a loading bubble while the interviewer responded.

Those details sound small, but they change the posture of the feature. Interview practice should feel like a conversation, not a document submission. If the screen feels like a form, the user becomes someone filling in an answer. If the screen feels like a chat with pressure, the user is sitting across from an interviewer.

I also removed live grading from the conversation. Scores and detailed feedback after every answer break the illusion. A real interviewer does not pause after each response to hand you a rubric. The pressure comes from answering, being challenged, and staying coherent as the follow-up digs deeper. The detailed evaluation belongs at the end.

The first turn needed special handling too. An interviewer’s first message should not sound like a mid-session evaluation. It should greet the user and ask the opening question. So I split opening-turn generation from follow-up-turn generation. The first message starts the interview. Later turns respond to what the user actually said.
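
The split can be sketched as two prompt builders behind one dispatcher. The prompt wording, the `Message` type, and the role names are invented for illustration; only the opening versus follow-up distinction comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str  # "interviewer" or "candidate" (hypothetical role names)
    text: str


def build_opening_prompt(topic: str) -> str:
    """First turn: greet and ask the opening question, with nothing to evaluate."""
    return (
        f"You are interviewing a candidate about {topic}. "
        "Greet the candidate briefly and ask one opening question. "
        "Do not evaluate anything yet."
    )


def build_follow_up_prompt(topic: str, history: list[Message]) -> str:
    """Later turns: respond to what the candidate actually said, without grading."""
    transcript = "\n".join(f"{m.role}: {m.text}" for m in history)
    return (
        f"You are interviewing a candidate about {topic}. "
        "Ask one follow-up question that digs into the candidate's last answer. "
        "Do not give scores or feedback during the conversation.\n\n"
        f"Transcript so far:\n{transcript}"
    )


def build_turn_prompt(topic: str, history: list[Message]) -> str:
    """Use the opening prompt until the candidate has answered at least once."""
    if not any(m.role == "candidate" for m in history):
        return build_opening_prompt(topic)
    return build_follow_up_prompt(topic, history)
```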


Language behavior needed the same kind of explicit policy.

If you do not tell an AI system what language to use, it drifts. An English JD can pull the answer into English. A Korean resume can pull it back into Korean. A single sentence in the prompt can tilt the result. The user does not experience that as model nuance. They experience it as inconsistency.

So I made the priority order explicit. If the user chooses a language manually, that wins. In auto mode, the system looks at the JD first, then the resume, then the user’s default preference. For interview preparation, the JD often reflects the language of the actual evaluation environment, so it deserves priority.
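
Written as code, that priority order is a short chain of fallbacks. This sketch assumes language detection for the JD and resume happens elsewhere and yields a language code or `None`; the function and parameter names are hypothetical.

```python
from typing import Optional


def resolve_language(
    manual_choice: Optional[str],
    jd_language: Optional[str],
    resume_language: Optional[str],
    default_language: str,
) -> str:
    """Pick the response language: manual choice, then JD, then resume, then default."""
    if manual_choice is not None:
        return manual_choice
    # Auto mode: the JD outranks the resume because it reflects the interview setting.
    for detected in (jd_language, resume_language):
        if detected is not None:
            return detected
    return default_language
```

For example, `resolve_language(None, "en", "ko", "ko")` returns `"en"`, while passing `"ko"` as the manual choice returns `"ko"` regardless of the documents.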

This was the same lesson again. Some decisions should be left to the model. Product policy should not be one of them. Language is part of the user experience, not a casual side effect of the prompt.


Looking back, that session collapsed into one lesson.

An AI feature cannot be abstracted by model name alone. Different providers fail differently. Different models have different speed and output-stability profiles. Different operations demand different UX. Roadmap analysis, topic generation, interview conversation, and language selection all use LLMs, but treating them as the same kind of call hides the important parts.

I started by adding Ollama to reduce cost. What I got was more important than cost reduction: a clearer boundary around AI work. Some operations belong on Gemini. Some are good enough on a local model. Some need to run in the background. Some need immediacy because conversation quality depends on it.

Making the model swappable mattered. Making the product flow survive a model swap mattered more.