Today, we are conducting an interview with one of our AI engineers on the messy reality of building software with large language models. Meet Ivan, a software engineer who specializes in LLM application development. He spends his days connecting heavy models to legacy business systems and making them work together.
We sat down with Ivan to talk about the friction between theory and practice. He shared his frustrations, his favorite workarounds, and the harsh lessons he learned along the way.
“What is one surprising limitation you discovered when integrating an LLM into applications?”
The absolute lack of deterministic outputs. When you build traditional software, you know exactly what a function will return. If you put a "2" into a math function, you get a "4" back every single time. With LLM integration, you can pass the exact same prompt to the exact same model twice and get two completely different JSON structures back. I mean, it’s impressive that the model acts creatively, but it’s a nightmare for backend systems.
“How did you work around it, and what would you recommend to others?”
Honestly, I haven’t seen a fully automatic fix here. A lot depends on how you configure the model framework. Even if you send the LLM an error and ask it to fix invalid JSON, there’s no guarantee the next response will be correct. That’s why we always put a hard limit on retries. Without it, you can get stuck in an endless loop, burning through tokens for nothing. Our approach is to always validate the response with, for example, Pydantic or a similar tool. If validation fails, we either raise an exception right away or retry the request with an explicit error message for the model. It’s about having a safeguard so you’re not letting unpredictable output get into your pipeline unnoticed.
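In practice, that safeguard looks roughly like this. This is a minimal sketch: `call_llm` is a hypothetical stand-in for your provider SDK, and the `Summary` schema is invented for illustration.

```python
from pydantic import BaseModel, ValidationError

MAX_RETRIES = 3  # hard limit so a bad response can't trap you in an endless loop


class Summary(BaseModel):
    title: str
    key_points: list[str]


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whatever LLM provider SDK you use."""
    raise NotImplementedError


def get_validated_summary(prompt: str) -> Summary:
    last_error = ""
    for _ in range(MAX_RETRIES):
        full_prompt = prompt
        if last_error:
            # Retry with an explicit error message so the model can fix its JSON.
            full_prompt += (
                f"\n\nYour previous reply was invalid: {last_error}\n"
                "Return only valid JSON."
            )
        raw = call_llm(full_prompt)
        try:
            # Pydantic validates both the JSON structure and the field types.
            return Summary.model_validate_json(raw)
        except ValidationError as exc:
            last_error = str(exc)
    # Fail loudly instead of letting unpredictable output into the pipeline.
    raise RuntimeError(f"LLM returned invalid JSON {MAX_RETRIES} times: {last_error}")
```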
“Can you share one prompt engineering technique that dramatically improved your LLM's output quality? What specific use case did you apply it to?”
Few-shot prompting with negative examples. Most people just give the model examples of what they want. I found that showing the model exactly what I don't want is actually way more effective. We were building a feature that summarized long customer complaint threads. The model kept writing summaries that sounded overly dramatic and apologetic. So, we started passing it specific examples of bad summaries. We literally wrote: "Do not write like this example." Then we provided the good example. This approach forces the model to understand the boundaries of the tone.
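The resulting prompt looked roughly like this. The example summaries below are invented for illustration; the real ones were pulled from production output.

```python
# Few-shot prompt that pairs a negative example with a positive one,
# so the model learns the boundaries of the tone, not just the target.
PROMPT_TEMPLATE = """Summarize the customer complaint thread below in 2-3 neutral sentences.

Do NOT write like this example (too dramatic and apologetic):
"We are deeply, truly sorry for the devastating experience this valued customer has endured..."

DO write like this example (factual and calm):
"The customer reported two failed deliveries in March. Support issued a refund on the second ticket and promised a follow-up once the shipping issue is fixed."

Complaint thread:
{thread}

Summary:"""

prompt = PROMPT_TEMPLATE.format(thread="...")  # substitute the real thread text
```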
“What prompt pattern failed completely for you, and why do you think it didn’t work?”
The "You are an expert" persona trick. Everyone says you should start a prompt with "You are an expert lawyer" or "You are a senior developer." I tried this when building an internal code review assistant. I told the model it was a "world-class security auditor." It failed completely. Did it help the model write better code? No. It just started using unnecessarily complex jargon. It acted incredibly arrogant in its responses and over-explained basic concepts while completely missing the actual security vulnerability.
I think it failed because the model just mimics the linguistic style of an "expert" found in its training data, but not the actual reasoning. Now, I just skip the roleplay and give it very clear, direct instructions on the mechanics of the task.
“What is one cost optimization strategy you've implemented for LLM API usage that made a significant difference?”
Semantic caching. This was a total game-changer for our operating budget. When you run a popular application, users ask the exact same questions constantly. Before, every single query went straight to the premium LLM API, which was pure waste. We implemented a semantic cache layer using Redis. When a user asks a question, we convert it to an embedding and check the cache. If someone asked a similar question five minutes ago, close enough in embedding space, we just return the cached answer. We don't hit the expensive LLM API at all. It cut our monthly token bill by about 40%.
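The core idea fits in a few lines. This in-process sketch uses cosine similarity over stored embeddings; the production version kept them in Redis, and `embed_text` / `call_llm` are hypothetical stand-ins.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.95  # tune carefully: too low and users get someone else's answer

_cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)


def embed_text(text: str) -> np.ndarray:
    """Hypothetical: returns a unit-normalized embedding vector for the text."""
    raise NotImplementedError


def call_llm(prompt: str) -> str:
    """Hypothetical call to the expensive premium LLM API."""
    raise NotImplementedError


def answer(query: str) -> str:
    query_vec = embed_text(query)
    for cached_vec, cached_answer in _cache:
        # Dot product of unit vectors equals cosine similarity.
        if float(np.dot(query_vec, cached_vec)) >= SIMILARITY_THRESHOLD:
            return cached_answer  # cache hit: the premium API is never touched
    result = call_llm(query)
    _cache.append((query_vec, result))
    return result
```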
“What monitoring tools/metrics helped you identify cost inefficiencies?”
We don’t usually have direct access to full product analytics, so we rely on whatever logging and API monitoring the client allows. Typically, we set up custom middleware to capture API call logs, aggregate token usage, and flag unusually large or frequent requests. We also monitor per-endpoint spending by tagging requests and looking for any method or route that suddenly spikes in monthly cost. Most of the insights come from reviewing these logs and negotiating with clients to access dashboards from third-party tools their cloud environments provide.
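A minimal version of that middleware might look like this. It assumes an OpenAI-style response object exposing `usage.total_tokens`; the threshold and tagging scheme are invented for illustration.

```python
from collections import defaultdict

token_spend: dict[str, int] = defaultdict(int)  # running token total per endpoint tag
LARGE_REQUEST_TOKENS = 10_000  # flag anything above this for review


def track_usage(endpoint: str):
    """Decorator that aggregates token usage per tagged endpoint."""
    def wrap(fn):
        def inner(*args, **kwargs):
            response = fn(*args, **kwargs)
            used = response.usage.total_tokens  # OpenAI-style usage field
            token_spend[endpoint] += used
            if used > LARGE_REQUEST_TOKENS:
                print(f"[warn] unusually large request on {endpoint}: {used} tokens")
            return response
        return inner
    return wrap
```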
“How did you handle one instance where your integrated LLM produced unexpected or incorrect outputs?”
We built a financial data assistant for a client. The user asked for a summary of Q3 earnings. The model confidently hallucinated an entire revenue report, inventing numbers out of thin air. It even cited a fake SEC filing. I was terrified. There is something unsettling about a machine lying with absolute confidence. We had to shut the feature off temporarily. The problem was that we asked the model to answer the question without actually retrieving the source documents first. It just guessed based on its general training weights.
“What safeguards do you now have in place?”
We implemented a stricter RAG (Retrieval-Augmented Generation) pipeline with source citation enforcement. Now, the model isn't allowed to answer from memory. It must search our internal vector database first. And we also added a smaller LLM that acts as a judge. Before the final answer goes to the user, this "judge model" reads the answer and verifies if the numbers actually match the retrieved documents. If they don't match, it throws a generic "I don't know" error.
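Sketched out, the pipeline looks something like this. `vector_db`, `build_prompt`, `call_llm`, and `call_judge_llm` are hypothetical stand-ins for the real components.

```python
def answer_with_guardrails(question: str) -> str:
    # Retrieval first: the model is never allowed to answer from memory.
    docs = vector_db.search(question, top_k=5)  # hypothetical vector-store client
    draft = call_llm(build_prompt(question, docs))

    # A smaller, cheaper "judge" model checks the draft against the sources.
    verdict = call_judge_llm(
        f"Documents:\n{docs}\n\nAnswer:\n{draft}\n\n"
        "Do all numbers and claims in the answer appear in the documents? Reply YES or NO."
    )
    if "YES" not in verdict.upper():
        return "I don't know."  # fail closed rather than ship a hallucinated figure
    return draft
```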
“What is one creative way you've used LLM embeddings in your application beyond basic semantic search?”
We used embeddings to detect anomalous user behavior in a security tool. Usually, embeddings are just used to build search engines. You turn text into numbers, find similar numbers, and return the result. Instead, we started taking the logs of what our users were doing in the app and turning those log sequences into text embeddings. We plotted them in a vector space. Normal user behavior clustered together in one dense area.
“What problem did it solve for you?”
It solved the problem of static security rules. Traditional security relies on rigid "if-then" rules to catch bad actors, but hackers adapt quickly. By using embeddings, we could instantly flag when a user's behavior drifted away from the normal cluster. If a session embedding suddenly mapped to a totally different part of the vector space, we knew something weird was happening.
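Here is one simple way to operationalize that "drift from the normal cluster" check, assuming you already have session embeddings as NumPy arrays; the 3-sigma cutoff is an illustrative choice, not necessarily what the real system used.

```python
import numpy as np


def fit_normal_cluster(normal_embeddings: np.ndarray) -> tuple[np.ndarray, float]:
    """Learn the center of normal behavior and a distance cutoff from known-good sessions."""
    centroid = normal_embeddings.mean(axis=0)
    distances = np.linalg.norm(normal_embeddings - centroid, axis=1)
    cutoff = distances.mean() + 3 * distances.std()  # simple 3-sigma rule
    return centroid, cutoff


def is_anomalous(session_embedding: np.ndarray, centroid: np.ndarray, cutoff: float) -> bool:
    # A session that maps far from the normal cluster gets flagged for review.
    return float(np.linalg.norm(session_embedding - centroid)) > cutoff
```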
“What user behavior surprised you the most when interacting with your LLM feature?”
One thing that genuinely surprised me was how quickly users adopted "workarounds" for the LLM’s limitations. For example, users started including extra instructions or context in their prompts, sometimes pasting entire email threads or writing detailed background stories, so the model would better understand what they were asking. It was clear people were willing to experiment and invest extra effort if they thought it would coax better results from the AI, which made me realize we needed to support this kind of back-and-forth usage in our product design.
“Can you describe one user experience consideration that changed how you designed your LLM-powered feature?”
Seeing users invent their own workarounds made us rethink how much control and feedback we should give them in the UI. We added features that let users easily edit and resubmit prompts and provided suggestions for how to rephrase their requests. We also built in a running history so users could refer back to previous interactions, which helped them experiment and iterate even more effectively.
“What is one performance bottleneck you encountered with LLM integration, and how did you resolve it?”
The biggest bottleneck by far was the time to first token (TTFT). We were building a voice-to-text-to-voice agent. The user would speak, the LLM would generate a response, and a text-to-speech engine would read it back. The delay was brutal. It took nearly five seconds from the moment the user stopped talking to the moment the AI started speaking. It ruined the illusion of a conversation. The culprit was payload size: we were sending massive system prompts and huge conversation histories with every single API call.
“What metrics improved as a result?”
We tackled LLM performance optimization by radically pruning our payloads. We stopped sending the entire conversation history. Instead, we used a tiny model to summarize the past ten messages into three sentences, and only appended that short summary to the prompt. We also switched to a provider located in the same AWS region as our application servers to cut network latency. Our time to first token dropped from 2.5 seconds to 600 milliseconds.
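The history-pruning step, roughly. `call_small_llm` and `SHORT_SYSTEM_PROMPT` are hypothetical placeholders for the cheap summarizer model and the trimmed-down system prompt.

```python
SHORT_SYSTEM_PROMPT = "You are a concise voice assistant."  # placeholder
MAX_RAW_MESSAGES = 2  # only the most recent turns travel verbatim


def call_small_llm(prompt: str) -> str:
    """Hypothetical call to a tiny, cheap model used only for summarization."""
    raise NotImplementedError


def build_request_messages(history: list[dict], user_input: str) -> list[dict]:
    older, recent = history[:-MAX_RAW_MESSAGES], history[-MAX_RAW_MESSAGES:]
    messages = [{"role": "system", "content": SHORT_SYSTEM_PROMPT}]
    if older:
        # Compress everything older into a few sentences instead of resending it all.
        summary = call_small_llm(f"Summarize this conversation in 3 sentences:\n{older}")
        messages.append({"role": "system", "content": f"Conversation so far: {summary}"})
    messages.extend(recent)
    messages.append({"role": "user", "content": user_input})
    return messages
```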
“What is one lesson you learned about managing context windows in production LLM applications?”
Just because a model has a 128k context window doesn’t mean you should actually use it.
Early on, we took plenty of shortcuts to get the job done, like dumping entire documents into that huge window. Did it work? Sure, but we quickly hit massive LLM implementation challenges. First, it was ridiculously expensive. Second, the model suffered from the "lost in the middle" phenomenon: it would perfectly recall the first paragraph and the last paragraph of a document but completely ignore critical data buried on page 25.
“How has this shaped your integration approach?”
Proper LLM context window management is now the foundation of our architecture. You have to actively curate what the model sees. Do not treat the context window as a garbage disposal. The cleaner your prompt, the more accurate and cheaper your application will be.
Integrating large language models into software is not just about making an API call. It’s an exercise in managing chaos. The models are inherently unpredictable, expensive to run, and prone to hallucination.
A successful implementation relies on robust engineering wrapped around the AI. You have to build strict parsing middleware to handle unpredictable JSON. You have to monitor your token usage aggressively to prevent runaway costs. Most importantly, you have to design user interfaces that mask the latency and guide the user to write better prompts.
This technology is incredibly powerful, but it requires a highly disciplined approach. Stop treating LLMs like magic, and start treating them like messy, unvalidated data streams. Once you accept their flaws, you can build truly resilient software.