Today, we are conducting an interview with one of our AI engineers on the messy reality of building software with large language models. Meet Ivan, a software engineer who specializes in LLM application development. He spends his days connecting heavy models to legacy business systems and making them work together.
We sat down with Ivan to talk about the friction between theory and practice. He shared his frustrations, his favorite workarounds, and the harsh lessons he learned along the way.
“What is one surprising limitation you discovered when integrating an LLM into applications?”
The absolute lack of deterministic outputs. When you build traditional software, you know exactly what a function will return. If you put a "2" into a math function, you get a "4" back every single time. With LLM integration, you can pass the exact same prompt to the exact same model twice and get two completely different JSON structures back. I mean, it’s impressive that the model acts creatively, but it’s a nightmare for backend systems.
“How did you work around it, and what would you recommend to others?”
Honestly, I haven’t seen a fully automatic fix here. A lot depends on how you configure the model framework. Even if you send the LLM an error and ask it to fix invalid JSON, there’s no guarantee the next response will be correct. That’s why we always put a hard limit on retries. Without it, you can get stuck in an endless loop, burning through tokens for nothing. Our approach is to always validate the response with, for example, Pydantic or a similar tool. If validation fails, we either raise an exception right away or retry the request with an explicit error message for the model. It’s about having a safeguard so you’re not letting unpredictable output get into your pipeline unnoticed.
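In practice, that safeguard looks roughly like this. This is a minimal sketch: `call_llm` is a hypothetical stand-in for your provider SDK, and the `Summary` schema is invented for illustration.

```python
from pydantic import BaseModel, ValidationError

MAX_RETRIES = 3  # hard limit so a bad response can't trap you in an endless loop


class Summary(BaseModel):
    title: str
    key_points: list[str]


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around whatever LLM provider SDK you use."""
    raise NotImplementedError


def get_validated_summary(prompt: str) -> Summary:
    last_error = ""
    for _ in range(MAX_RETRIES):
        full_prompt = prompt
        if last_error:
            # Retry with an explicit error message so the model can fix its JSON.
            full_prompt += (
                f"\n\nYour previous reply was invalid: {last_error}\n"
                "Return only valid JSON."
            )
        raw = call_llm(full_prompt)
        try:
            # Pydantic validates both the JSON structure and the field types.
            return Summary.model_validate_json(raw)
        except ValidationError as exc:
            last_error = str(exc)
    # Fail loudly instead of letting unpredictable output into the pipeline.
    raise RuntimeError(f"LLM returned invalid JSON {MAX_RETRIES} times: {last_error}")
```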
“Can you share one prompt engineering technique that dramatically improved your LLM's output quality? What specific use case did you apply it to?”
Few-shot prompting with negative examples. Most people just give the model examples of what they want. I found that showing the model exactly what I don't want is actually way more effective. We were building a feature that summarized long customer complaint threads. The model kept writing summaries that sounded overly dramatic and apologetic. So, we started passing it specific examples of bad summaries. We literally wrote: "Do not write like this example." Then we provided the good example. This approach forces the model to understand the boundaries of the tone.
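The resulting prompt looked roughly like this. The example summaries below are invented for illustration; the real ones were pulled from production output.

```python
# Few-shot prompt that pairs a negative example with a positive one,
# so the model learns the boundaries of the tone, not just the target.
PROMPT_TEMPLATE = """Summarize the customer complaint thread below in 2-3 neutral sentences.

Do NOT write like this example (too dramatic and apologetic):
"We are deeply, truly sorry for the devastating experience this valued customer has endured..."

DO write like this example (factual and calm):
"The customer reported two failed deliveries in March. Support issued a refund on the second ticket and promised a follow-up once the shipping issue is fixed."

Complaint thread:
{thread}

Summary:"""

prompt = PROMPT_TEMPLATE.format(thread="...")  # substitute the real thread text
```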
“What prompt pattern failed completely for you, and why do you think it didn’t work?”
The "You are an expert" persona trick. Everyone says you should start a prompt with "You are an expert lawyer" or "You are a senior developer." I tried this when building an internal code review assistant. I told the model it was a "world-class security auditor." It failed completely. Did it help the model write better code? No. It just started using unnecessarily complex jargon. It acted incredibly arrogant in its responses and over-explained basic concepts while completely missing the actual security vulnerability.
I think it failed because the model just mimics the linguistic style of an "expert" found in its training data, but not the actual reasoning. Now, I just skip the roleplay and give it very clear, direct instructions on the mechanics of the task.
“What is one cost optimization strategy you've implemented for LLM API usage that made a significant difference?”
Semantic caching. This was a total game-changer for our operating budget. When you run a popular application, users ask the exact same questions constantly. Before, every single query went straight to the premium LLM API, which was pure waste. We implemented a semantic cache layer using Redis. When a user asks a question, we convert it to an embedding and check the cache. If someone asked a similar question five minutes ago, close enough in embedding space, we just return the cached answer. We don't hit the expensive LLM API at all. It cut our monthly token bill by about 40%.
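The core idea fits in a few lines. This in-process sketch uses cosine similarity over stored embeddings; the production version kept them in Redis, and `embed_text` / `call_llm` are hypothetical stand-ins.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.95  # tune carefully: too low and users get someone else's answer

_cache: list[tuple[np.ndarray, str]] = []  # (query embedding, cached answer)


def embed_text(text: str) -> np.ndarray:
    """Hypothetical: returns a unit-normalized embedding vector for the text."""
    raise NotImplementedError


def call_llm(prompt: str) -> str:
    """Hypothetical call to the expensive premium LLM API."""
    raise NotImplementedError


def answer(query: str) -> str:
    query_vec = embed_text(query)
    for cached_vec, cached_answer in _cache:
        # Dot product of unit vectors equals cosine similarity.
        if float(np.dot(query_vec, cached_vec)) >= SIMILARITY_THRESHOLD:
            return cached_answer  # cache hit: the premium API is never touched
    result = call_llm(query)
    _cache.append((query_vec, result))
    return result
```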
“What monitoring tools/metrics helped you identify cost inefficiencies?”
We don’t usually have direct access to full product analytics, so we rely on whatever logging and API monitoring the client allows. Typically, we set up custom middleware to capture API call logs, aggregate token usage, and flag unusually large or frequent requests. We also monitor per-endpoint spending by tagging requests and looking for any method or route that suddenly spikes in monthly cost. Most of the insights come from reviewing these logs and negotiating with clients to access dashboards from third-party tools their cloud environments provide.
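A minimal version of that middleware might look like this. It assumes an OpenAI-style response object exposing `usage.total_tokens`; the threshold and tagging scheme are invented for illustration.

```python
from collections import defaultdict

token_spend: dict[str, int] = defaultdict(int)  # running token total per endpoint tag
LARGE_REQUEST_TOKENS = 10_000  # flag anything above this for review


def track_usage(endpoint: str):
    """Decorator that aggregates token usage per tagged endpoint."""
    def wrap(fn):
        def inner(*args, **kwargs):
            response = fn(*args, **kwargs)
            used = response.usage.total_tokens  # OpenAI-style usage field
            token_spend[endpoint] += used
            if used > LARGE_REQUEST_TOKENS:
                print(f"[warn] unusually large request on {endpoint}: {used} tokens")
            return response
        return inner
    return wrap
```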
“How did you handle one instance where your integrated LLM produced unexpected or incorrect outputs?”
We built a financial data assistant for a client. The user asked for a summary of Q3 earnings. The model confidently hallucinated an entire revenue report, inventing numbers out of thin air. It even cited a fake SEC filing. I was terrified. There is something unsettling about a machine lying with absolute confidence. We had to shut the feature off temporarily. The problem was that we asked the model to answer the question without actually retrieving the source documents first. It just guessed based on its general training weights.
“What safeguards do you now have in place?”
We implemented a stricter RAG (Retrieval-Augmented Generation) pipeline with source citation enforcement. Now, the model isn't allowed to answer from memory. It must search our internal vector database first. And we also added a smaller LLM that acts as a judge. Before the final answer goes to the user, this "judge model" reads the answer and verifies if the numbers actually match the retrieved documents. If they don't match, it throws a generic "I don't know" error.
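Sketched out, the pipeline looks something like this. `vector_db`, `build_prompt`, `call_llm`, and `call_judge_llm` are hypothetical stand-ins for the real components.

```python
def answer_with_guardrails(question: str) -> str:
    # Retrieval first: the model is never allowed to answer from memory.
    docs = vector_db.search(question, top_k=5)  # hypothetical vector-store client
    draft = call_llm(build_prompt(question, docs))

    # A smaller, cheaper "judge" model checks the draft against the sources.
    verdict = call_judge_llm(
        f"Documents:\n{docs}\n\nAnswer:\n{draft}\n\n"
        "Do all numbers and claims in the answer appear in the documents? Reply YES or NO."
    )
    if "YES" not in verdict.upper():
        return "I don't know."  # fail closed rather than ship a hallucinated figure
    return draft
```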
“What is one creative way you've used LLM embeddings in your application beyond basic semantic search?”
We used embeddings to detect anomalous user behavior in a security tool. Usually, embeddings are just used to build search engines. You turn text into numbers, find similar numbers, and return the result. Instead, we started taking the logs of what our users were doing in the app and turning those log sequences into text embeddings. We plotted them in a vector space. Normal user behavior clustered together in one dense area.
“What problem did it solve for you?”
It solved the problem of static security rules. Traditional security relies on rigid "if-then" rules to catch bad actors, but hackers adapt quickly. By using embeddings, we could instantly flag when a user's behavior drifted away from the normal cluster. If a session embedding suddenly mapped to a totally different part of the vector space, we knew something weird was happening.
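Here is one simple way to operationalize that "drift from the normal cluster" check, assuming you already have session embeddings as NumPy arrays; the 3-sigma cutoff is an illustrative choice, not necessarily what the real system used.

```python
import numpy as np


def fit_normal_cluster(normal_embeddings: np.ndarray) -> tuple[np.ndarray, float]:
    """Learn the center of normal behavior and a distance cutoff from known-good sessions."""
    centroid = normal_embeddings.mean(axis=0)
    distances = np.linalg.norm(normal_embeddings - centroid, axis=1)
    cutoff = distances.mean() + 3 * distances.std()  # simple 3-sigma rule
    return centroid, cutoff


def is_anomalous(session_embedding: np.ndarray, centroid: np.ndarray, cutoff: float) -> bool:
    # A session that maps far from the normal cluster gets flagged for review.
    return float(np.linalg.norm(session_embedding - centroid)) > cutoff
```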
“What user behavior surprised you the most when interacting with your LLM feature?”
One thing that genuinely surprised me was how quickly users adopted "workarounds" for the LLM’s limitations. For example, users started including extra instructions or context in their prompts, sometimes pasting entire email threads or writing detailed background stories, so the model would better understand what they were asking. It was clear people were willing to experiment and invest extra effort if they thought it would coax better results from the AI, which made me realize we needed to support this kind of back-and-forth usage in our product design.
“Can you describe one user experience consideration that changed how you designed your LLM-powered feature?”
Seeing users invent their own workarounds made us rethink how much control and feedback we should give them in the UI. We added features that let users easily edit and resubmit prompts and provided suggestions for how to rephrase their requests. We also built in a running history so users could refer back to previous interactions, which helped them experiment and iterate even more effectively.
“What is one performance bottleneck you encountered with LLM integration, and how did you resolve it?”
The biggest bottleneck by far was the time to first token (TTFT). We were building a voice-to-text-to-voice agent. The user would speak, the LLM would generate a response, and a text-to-speech engine would read it back. The delay was brutal. It took nearly five seconds from the moment the user stopped talking to the moment the AI started speaking. It ruined the illusion of a conversation. The culprit was payload size: we were sending massive system prompts and huge conversation histories with every single API call.
“What metrics improved as a result?”
We tackled LLM performance optimization by radically pruning our payloads. We stopped sending the entire conversation history. Instead, we used a tiny model to summarize the past ten messages into three sentences, and only appended that short summary to the prompt. We also switched to a provider located in the same AWS region as our application servers to cut network latency. Our time to first token dropped from 2.5 seconds to 600 milliseconds.
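The history-pruning step, roughly. `call_small_llm` and `SHORT_SYSTEM_PROMPT` are hypothetical placeholders for the cheap summarizer model and the trimmed-down system prompt.

```python
SHORT_SYSTEM_PROMPT = "You are a concise voice assistant."  # placeholder
MAX_RAW_MESSAGES = 2  # only the most recent turns travel verbatim


def call_small_llm(prompt: str) -> str:
    """Hypothetical call to a tiny, cheap model used only for summarization."""
    raise NotImplementedError


def build_request_messages(history: list[dict], user_input: str) -> list[dict]:
    older, recent = history[:-MAX_RAW_MESSAGES], history[-MAX_RAW_MESSAGES:]
    messages = [{"role": "system", "content": SHORT_SYSTEM_PROMPT}]
    if older:
        # Compress everything older into a few sentences instead of resending it all.
        summary = call_small_llm(f"Summarize this conversation in 3 sentences:\n{older}")
        messages.append({"role": "system", "content": f"Conversation so far: {summary}"})
    messages.extend(recent)
    messages.append({"role": "user", "content": user_input})
    return messages
```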
“What is one lesson you learned about managing context windows in production LLM applications?”
Just because a model has a 128k context window doesn’t mean you should actually use it.
Early on, we took plenty of shortcuts to get the job done, like dumping entire documents into that huge window. Did it work? Sure, but we quickly hit massive LLM implementation challenges. First, it was ridiculously expensive. Second, the model suffered from the "lost in the middle" phenomenon: it would perfectly recall the first paragraph and the last paragraph of a document but completely ignore critical data buried on page 25.
“How has this shaped your integration approach?”
Proper LLM context window management is now the foundation of our architecture. You have to actively curate what the model sees. Do not treat the context window as a garbage disposal. The cleaner your prompt, the more accurate and cheaper your application will be.
Integrating large language models into software is not just about making an API call. It’s an exercise in managing chaos. The models are inherently unpredictable, expensive to run, and prone to hallucination.
A successful implementation relies on robust engineering wrapped around the AI. You have to build strict parsing middleware to handle unpredictable JSON. You have to monitor your token usage aggressively to prevent runaway costs. Most importantly, you have to design user interfaces that mask the latency and guide the user to write better prompts.
This technology is incredibly powerful, but it requires a highly disciplined approach. Stop treating LLMs like magic, and start treating them like messy, unvalidated data streams. Once you accept their flaws, you can build truly resilient software.