Why Roadmap Analysis Looked Stuck

The backend log said the analysis had finished.

The roadmap analysis was complete. The data had been saved. But the page still said running. The user was waiting, and I was staring at a screen that did not seem to know what the backend already knew.

Bugs like this are confusing because many possible causes hide behind a single status. Maybe Gemini was slow. Maybe the model call had failed. Maybe the database write did not happen. Maybe the frontend had the wrong state. From the user’s side, all of that collapsed into one quiet label: still analyzing.

That day became about closing the gap between what the system knew and what the user could see.

A few days earlier, I had changed how roadmap creation worked.

Originally, the user clicked a button and waited for Gemini to finish the analysis. That could take close to thirty seconds. Technically, it worked, but it made the product feel frozen. So I moved the analysis into the background. Create the roadmap first, send the user to the detail page, and let the analysis attach its result when it is done.

That was the right direction. The user should not be trapped behind a long model call just to see the page they created. The detail page can show a pending state, then update when the analysis completes.

The tricky part was that last step: updating the page when the analysis completes.

A background job is not finished, from the user’s point of view, just because the server finished it. The backend can save the result, the database can contain the right data, and the logs can be clean. If the page never asks again, the user still sees an unfinished job.

That is exactly what happened.

At first, I suspected Gemini.

That was not unreasonable. Gemini can be busy. A request can time out, hit a rate limit, or come back with a try-again-later kind of failure. Roadmap analysis was not a tiny call either. It had to read a job description and a resume, then produce structured output the product could use.

So I added retry and backoff to the roadmap analysis path. Timeout, rate-limit, and high-demand errors should not immediately become final failures. They should be retried after a short wait. I also raised the analysis timeout, because it runs in the background and does not need to be constrained like an interactive request.
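A minimal sketch of that retry path, written in TypeScript for illustration. The error shape (a `kind` field), the delay values, and the names `isRetryable`, `backoffDelayMs`, and `withRetry` are all assumptions, not the project's actual code or the Gemini client's API.

```typescript
// Hypothetical failure categories that should be retried instead of failing the analysis.
const RETRYABLE_KINDS: readonly string[] = ["timeout", "rate_limit", "high_demand"];

// Assumes failures carry a `kind` field; a real client may expose status codes instead.
function isRetryable(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const kind = (error as { kind?: unknown }).kind;
  return typeof kind === "string" && RETRYABLE_KINDS.indexOf(kind) !== -1;
}

// Exponential backoff with a cap: 1s, 2s, 4s, 8s, then 8s for every later attempt.
function backoffDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 8000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Runs `task` up to `maxAttempts` times, waiting between retryable failures.
// Non-retryable errors and the final failed attempt are rethrown to the caller.
async function withRetry<T>(task: () => Promise<T>, maxAttempts: number = 3): Promise<T> {
  let attempt = 0;
  while (true) {
    try {
      return await task();
    } catch (error) {
      attempt += 1;
      if (attempt >= maxAttempts || !isRetryable(error)) throw error;
      await sleep(backoffDelayMs(attempt - 1));
    }
  }
}
```

Because the analysis runs in the background, the total wait across retries can be longer than an interactive request would tolerate. The cap keeps a long run of failures from stretching the delay indefinitely.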

That work was necessary, but it did not explain the stuck screen.

The logs showed that analysis had completed. The backend had done its job. So the remaining question was the page.

The detail page polling logic was the real issue.

The intention was simple. If the analysis state is running, fetch again later. When the state becomes completed, render the result. But the implementation behaved like a one-shot timeout, not a continuous polling loop. It scheduled one more refresh, then stopped. Because the dependencies that triggered the scheduling stayed stable while the state was still running, nothing scheduled another refresh, so the page stopped checking before the state changed.

That explained why a manual browser refresh fixed the page.

If a refresh shows the completed result, the backend data is not missing. A new request can read the correct state. The problem is that the page should have made that request without asking the user to refresh.

The fix was to make polling behave like polling. I replaced the one-off timeout with interval-based refresh until the analysis reached a terminal state. I also added overlap protection so a slow request would not be followed by another request before it finished. Without that guard, poor network timing can create unnecessary load and stale responses racing each other.
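The shape of that fix can be sketched without any framework, again in TypeScript. The names `startPolling`, `fetchState`, and `onState`, and the three-second interval, are assumptions for illustration. The real page may wire this into its own component lifecycle.

```typescript
type AnalysisState = "pending" | "running" | "completed" | "failed";

// Terminal states are the only ones that should end the polling loop.
function isTerminal(state: AnalysisState): boolean {
  return state === "completed" || state === "failed";
}

// Polls on an interval until a terminal state arrives. Returns a function that
// stops polling, for example when the user leaves the page.
function startPolling(
  fetchState: () => Promise<AnalysisState>, // hypothetical API call
  onState: (state: AnalysisState) => void,
  intervalMs: number = 3000,
): () => void {
  let inFlight = false; // overlap guard: at most one request at a time
  let stopped = false;

  const stop = (): void => {
    stopped = true;
    clearInterval(timer);
  };

  const tick = async (): Promise<void> => {
    if (inFlight || stopped) return; // skip this tick if the last request is still open
    inFlight = true;
    try {
      const state = await fetchState();
      if (stopped) return; // ignore a response that arrives after stop()
      onState(state);
      if (isTerminal(state)) stop();
    } catch (_error) {
      // A failed fetch is treated as transient here; the next tick tries again.
    } finally {
      inFlight = false;
    }
  };

  const timer = setInterval(tick, intervalMs);
  return stop;
}
```

The difference from a one-shot timeout is that the interval keeps firing on its own. Nothing has to reschedule it, and it ends only when a terminal state arrives or the caller stops it.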

It was not a flashy change. But this is the kind of change that makes a product feel trustworthy.

I cleaned up the progress UI at the same time.

The page had too much changing text. Messages like “analyzing” and “please wait” can make the interface feel alive, but they can also make it feel noisy. If the product does not know the real progress percentage, it should not pretend it does.

The system knew only a few states: pending, running, completed, and failed. It did not know whether the model was 37% done or 82% done. So the UI needed to be calm and honest.

I kept a stable animated chip and progress bar instead of rotating status copy. The page still communicated that work was happening, but it stopped inventing precision. When polling found a new state, the UI changed because the state changed, not because the text wanted to look busy.
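One way to keep the UI from inventing precision is to derive everything it shows from the four known states and nothing else. The sketch below is illustrative only. The label text and the `ProgressView` shape are assumptions, not the product's actual copy or component props.

```typescript
type AnalysisState = "pending" | "running" | "completed" | "failed";

interface ProgressView {
  label: string; // one fixed label per state, never rotated on a timer
  indeterminate: boolean; // true means "animate, but show no percentage"
}

// The view is a pure function of the state, so it can only change when the state does.
function progressViewFor(state: AnalysisState): ProgressView {
  switch (state) {
    case "pending":
      return { label: "Queued", indeterminate: true };
    case "running":
      return { label: "Analyzing", indeterminate: true };
    case "completed":
      return { label: "Analysis complete", indeterminate: false };
    case "failed":
      return { label: "Analysis failed", indeterminate: false };
  }
}
```

There is deliberately no percentage field. The backend does not report one, so the type gives the UI nowhere to put a made-up number.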

That matters more with AI features than with ordinary CRUD screens. Model calls have uneven latency and strange failure modes. The more uncertain the backend work is, the more grounded the UI should be.

The same session also changed how roadmap titles worked.

At first, I tried to infer a title locally when the user left it blank. Look for something that seemed like a company name or role, then build a title from that. But heuristics like that break quickly. A job description may not contain a clean company name. The role may be ambiguous. Sometimes the resume and the job description need to be read together before the title makes sense.

Gemini was already reading the full context for analysis, so it made more sense to let the model suggest a title as part of that response. I added suggestedTitle to the analysis schema. If the user had written a title, the product kept it. If the title was still the placeholder, the backend could promote the model-suggested title after analysis finished.

The important part was not letting AI overwrite user intent.

AI can fill in blanks. It should not win against an explicit user choice. That small rule says a lot about how I want the product to behave.
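That rule is small enough to write as one function. This is a sketch under assumptions: the placeholder string and the name `resolveTitle` are hypothetical, and treating an empty title the same as the placeholder is a guess at the intended behavior. Only `suggestedTitle` comes from the analysis schema described above.

```typescript
// Hypothetical placeholder; the real product may use a different default title.
const PLACEHOLDER_TITLE = "Untitled roadmap";

// Returns the title to store after analysis finishes.
// An explicit user title always wins; the suggestion only fills a blank.
function resolveTitle(currentTitle: string, suggestedTitle: string | undefined): string {
  const userWroteTitle =
    currentTitle.trim() !== "" && currentTitle !== PLACEHOLDER_TITLE;
  if (userWroteTitle) return currentTitle;

  const suggestion = suggestedTitle ? suggestedTitle.trim() : "";
  // Keep the current title if the model returned nothing usable.
  return suggestion !== "" ? suggestion : currentTitle;
}
```

For example, `resolveTitle("Backend role at Acme", "Senior Engineer Roadmap")` keeps the user's title, while `resolveTitle(PLACEHOLDER_TITLE, "Senior Engineer Roadmap")` promotes the suggestion.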

Topic generation had a related problem waiting nearby.

Roadmap analysis had moved into the background, but topic generation was still mostly synchronous. The user clicked Generate, the model created topics, and the UI waited for the response. That was acceptable for the moment, but the same failure pattern could come back as topics became larger or models became slower.

For that day, I focused on making the state clearer. Generate and Regenerate got animated loading states. The UI made it explicit that one generation attempt used one LLM request. Success messaging included the topic count and request count.

I also collapsed the AI interview modules by default. If every generated section is expanded at once, the page becomes too dense too quickly. Keeping only the first topic expanded made the page easier to scan while still letting the user open details when needed.

The direction was the same as the polling fix. As AI creates more material, the product has to become better at pacing what it shows.

The lesson from that day was simple.

For asynchronous work, finishing the job is not enough. The fact that it finished has to reach the user. If the backend stores completion but the screen does not update, the user is still waiting.

Polling is not just a timer that calls an API. It is a promise between the product and the user. The product is saying, “You can wait here. I will tell you when this changes.” If the product makes that promise, it has to keep checking until the state actually changes.

AI features create more of these promises. Models can be slow, fail transiently, or need retries. That makes state handling more important, not less. Unknown work should look unknown. Finished work should look finished. Failed work should say that it failed.

The problem was not the AI itself. It was state delivery.

It was a screen that could not say a finished job was finished.