Dev.to
6/2/2026

How a Scanned PDF Broke My Invoice Agent in Production
Short summary
A production invoice extraction agent failed silently on scanned PDFs with rotated text and poor OCR: it shifted decimal places while still returning confident, well-formed structured output. The fix was not a better model but a deterministic validation layer that checks field presence, amount ranges, date windows, and line-item consistency before any downstream system accepts a document. In its first week the layer caught 23 failed documents that would otherwise have contaminated downstream data. The same pattern applies across agent deployments.
- Scanned PDFs with poor OCR caused silent extraction failures despite model confidence
- Added synchronous validation layer with 4 deterministic checks before downstream acceptance
- Pattern caught 23 bad documents in first week; applicable across all agent deployments
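
The four deterministic checks named above could look roughly like the following. This is a minimal sketch, not the post's actual code: the `Invoice` shape, the `validate_invoice` function, and every threshold (`max_total`, `max_age_days`, `tolerance`) are hypothetical and would need tuning to real data.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from decimal import Decimal
from typing import List, Optional


@dataclass
class Invoice:
    # Hypothetical shape of the agent's structured output.
    vendor: Optional[str]
    invoice_date: Optional[date]
    total: Optional[Decimal]
    line_items: List[Decimal] = field(default_factory=list)


def validate_invoice(
    inv: Invoice,
    today: date,
    max_total: Decimal = Decimal("100000"),  # assumed upper bound
    max_age_days: int = 365,                 # assumed date window
    tolerance: Decimal = Decimal("0.01"),    # assumed rounding slack
) -> List[str]:
    """Return a list of validation errors; an empty list means accept."""
    errors: List[str] = []

    # 1. Field presence: every required field must be populated.
    if not inv.vendor:
        errors.append("missing vendor")
    if inv.invoice_date is None:
        errors.append("missing invoice_date")
    if inv.total is None:
        errors.append("missing total")

    # 2. Amount range: the total must be positive and below a sanity cap.
    if inv.total is not None and not (Decimal("0") < inv.total <= max_total):
        errors.append("total out of range")

    # 3. Date window: the invoice date must be recent and not in the future.
    if inv.invoice_date is not None:
        earliest = today - timedelta(days=max_age_days)
        if not (earliest <= inv.invoice_date <= today):
            errors.append("invoice_date outside window")

    # 4. Line-item consistency: the items must sum to the stated total.
    #    This is the check that catches a shifted decimal place.
    if inv.total is not None and inv.line_items:
        items_sum = sum(inv.line_items, Decimal("0"))
        if abs(items_sum - inv.total) > tolerance:
            errors.append("line items do not sum to total")

    return errors


# A decimal-shifted total: the items sum to 150.50, but OCR produced 1505.00.
shifted = Invoice(
    vendor="Acme Ltd",
    invoice_date=date(2026, 5, 20),
    total=Decimal("1505.00"),
    line_items=[Decimal("120.00"), Decimal("30.50")],
)
print(validate_invoice(shifted, today=date(2026, 6, 2)))
# -> ['line items do not sum to total']
```

The shifted total passes the range check on its own, which is why the cross-field consistency check matters: no single field looks wrong, and only the relationship between fields exposes the error. Running the checks synchronously, before the write, means a rejected document never reaches the downstream system.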



