May 29, 2026

Hallucinations, Sycophancy and Prompt Injection: Why LLM-Based OCR Fails on KYC Documents

Ph.D. Сhief technology officer
IEEE Senior Member
Hallucinations, Sycophancy and Prompt Injection: Why LLM-Based OCR Fails on KYC Documents

When we talk about LLMs in the context of search, analysis and data generation, the conversation tends to revolve around impressive capabilities. The limitations, however, are usually glossed over or deliberately pushed to the very end. Ignorance of how it works does not protect against its failure modes — the only productive approach here is to understand which tasks LLMs handle well and poorly and why relying on them for KYC ID verification is a path of considerably greater risk than reward.

LLMs vs KYC Requirements

Strictly speaking, a large language model in its base form is a statistical prediction machine. The way LLMs work is as follows:

  • The model receives an input sequence of data
  • Converts each element into a numerical vector — known as an embedding
  • Computes statistical relationships between these vectors
  • Predicts which token — fragment of data — is most likely to come next

The process repeats: new tokens are added to the sequence, and the model makes another prediction until what it determines to be the most probable final token. For LLMs, words are points in multidimensional space whose “coordinates” reflect proximity of appearance in context of other words. The concepts of “true” and “false” don’t exist here, only “probable” and “improbable.” This is precisely why LLMs can sound so convincing: they are optimized to produce not what is true, but what is plausible

For many everyday tasks — generating ideas, summarising text, interpreting unstructured content — this is more than enough. But when it comes to core business processes, reliability and accuracy become key principles. KYC can’t tolerate approximation.

The Know Your Customer requirement has been embedded in more than 200 national legislative frameworks through FATF standards. Recommendation 10 mandates that when establishing a business relationship, the counterparty must be identified using “reliable, independent source documents, data or information”. This formulation has been set in stone since 1990 — it is difficult to reconcile with a system that, by design, simply produces its best statistical guess and, even then, can be hallucinatory.

Why LLMs Hallucinate 

Hallucination is a feature of how LLMs operate when a correct answer is statistically unlikely. LLM cannot say “I don’t know” the way a human does. It continues generating, because that is its purpose. The result looks like an answer (coherent and grammatically correct) but doesn’t make any sense. It is caused by some independent mechanisms.

  • Randomness. Each token is selected from a probability distribution: the numerical values assigned to each possible token are converted into probability scores, and a sample is drawn from that distribution.
  • Sycophancy, or the tendency to agree with any premise embedded in the prompt.
  • The long-tail knowledge problem. When it comes to ID scanning for KYC, specific data combinations — passports, national IDs, driver’s licenses — vary by country, issuing authority and generation. They are underrepresented in training data, and the model defaults to the most probable pattern rather than extracting data.
  • Multimodal processing of images. Pixels are tokenized by the same mechanism as words — and the same statistical machine takes over. This means that hallucination risks carry over into the visual domain unchanged: a model can “see” data that is not present in a document, or fail to register data that is.

Eliminating hallucinations entirely is not possible — the architecture of an LLM is such that probabilistic sampling is built into its foundations. In 2025 a research group at the National University of Singapore demonstrated that for any computable LLM, there exists a function on which it will hallucinate — regardless of architecture, training data or algorithm. 

Сan LLMs Be Used in OCR 

ID verification is, by its nature, always a System 2 task — in Daniel Kahneman’s Dual Thinking framework — deliberate and anchored to a specific source of truth unlike the fast and intuitive thinking of System 1. A KYC-error in a date of birth or a document number is not a minor inaccuracy — it is a failure of identification, or in the worst case, the verification of not just a wrong person, but a criminal, which can cost the business even more.

Stanford researchers, writing in the context of legal AI systems, proposed a useful typology of hallucinations. The first type is incorrect: the model simply describes a fact wrongly. The second — more insidious — is misgrounded: the model describes a fact correctly, but extracts it from the wrong source.

This second type is the specific risk in OCR (optical character recognition) tasks. A date of birth extracted accurately, but from the date-of-issue field. A name reproduced correctly, but taken from a different document in the same submission. A document number off by a single digit — distorted toward the most probable pattern, not to mention more banal and more probabilistic types of hallucinations, such as the literal substitution of data in names, dates of birth and other fields.

If the task is to scan ID — reliably extract specific values of a specific document and compare them against a reference — that is a task for a deterministic recognition system, not a probabilistic generator. Not because one is categorically superior to the other, but because they are different tools built for different purposes. This distinction is the foundation of sound architecture in automated document verification.

KYC and Prompt Injection

What exactly do the regulators require from a verification system? Under FATF Recommendation 11, precisely three things: accuracy, reproducibility, auditability.

Now compare. An LLM produces “the best possible guess based on all the statistical dependencies the model has seen in its training data,” while the regulator demands accurate results. Beyond that, the non-deterministic nature of LLM output — different on every run — is by definition incompatible with the requirements of an audit trail.

Hallucinations and regulatory requirements are two independent reasons why LLMs are unsuitable for KYC verification. But there is a third, less obvious one: the document being verified is itself a source of risk. This is an architectural problem. The vulnerability exists because LLMs don’t separate instructions from data within the same space.

In 2025, researchers at the Technical University of Dresden conducted 594 attacks on four leading multimodal LLMs as part of a controlled experiment. The method was straightforward: a hidden instruction was embedded in a medical image — using low contrast or small font, invisible to a human observer but detectable by the model. The instruction directed the model to disregard the pathology and report normal findings. None of the four models tested withstood the prompt injection attacks — the LLMs reported “normal” results in more than half of the cases.

The authors draw a conclusion: prompt injection is a fundamental problem of LLMs/VLMs, not exclusive to the tested models, and not easily fixable. Whether in medicine or in KYC, the cost of failure is critical: even an ordinary passport becomes a source of the threat.

Modern KYC Solutions

The cost of wrong KYC isn’t symmetrical. Reject a legitimate user due to a verification error, and you lose a customer. Let a fraud actor through, and you face financial exposure, regulatory action, and reputational damage that compounds. These are not equivalent outcomes. They demand a pipeline built differently, from the ground up.

OCR Studio’s ID scanning solution was developed to solve this. KYC-ready ID scanning. Privacy-first, on-premise, zero data exposure. Deterministic output at enterprise speed. Deployment across server, desktop, mobile, and even web environments — all with local processing. This isn’t a race against LLMs — it’s a level up. A deliberate architectural choice made for organizations where verification errors carry real consequences.

LLM-based KYCKYC-ready OCR
Hallucination riskInherent — statistically probable patterns override actual dataNone — extracts only what is present in the document
Audit trailIncompatible — probabilistic output cannot be reliably reproducedFull — every extraction step is traceable and verifiable
Data privacyRequires third-party infrastructure in most deploymentsOn-premise processing — data never leaves the perimeter
Regulatory fitDifficult to reconcile with FATF Recommendation 10 and 11Aligned with accuracy, reproducibility and auditability requirements

For organizations operating under KYC and AML obligations, this means a pipeline without hallucinations, prompt injection risk, or audit trail gaps. Evaluate OCR Studio’s ID scanner directly — try ID scanning via Web Demo now.

In Conclusion

Large language models are powerful tools for an enormous class of tasks — but not for KYC. Document verification imposes three requirements that a probabilistic token generator is architecturally unable to meet: accuracy of output, determinism of results and auditability of process.

OCR Studio offers a KYC-ready document scanning and verification solution built for exactly this kind of reliability — covering KYC, AML, onboarding and other business-critical workflows where the cost of a wrong answer is simply too high.

Ready to modernize your KYC workflows? Learn more about OCR Studio’s ID Verification solution.

Contents

About the author

Konstantin Bulatov is a scientist and Chief Technology Officer of OCR Studio, where he has led the development and implementation of advanced OCR technologies. He has designed a method for optimizing object recognition in video streams, which has improved the accuracy and efficiency of real-time OCR systems. Under his direction, OCR Studio develops secure on-device programming solutions that address diverse industry needs and contribute to advancements in the field.

Konstantin is an IEEE Senior Member, he has authored multiple patent applications and published his research in prominent academic conferences and journals. His work emphasizes innovative approaches to developing high-performance recognition systems, reinforcing OCR Studio’s position as a significant contributor to the global technology landscape.

Continue reading

Get in Touch With Us Today!

For comprehensive details about our complete
range of solutions and services.

Or contact our sales team:

sales@ocrstudio.ai

    * Required information
    By clicking the “Send request” button, you consent to data processing