Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding

Short summary

Vision-Language Models (VLMs) frequently hallucinate objects because they rely too heavily on linguistic priors and too little on visual evidence. The authors introduce Positive-and-Negative Decoding (PND), a training-free inference framework that uses dual-path contrast to amplify visual features while penalizing outputs dominated by language priors. PND is reported to achieve state-of-the-art performance on the POPE, MME, and CHAIR benchmarks without retraining the model.

• VLMs suffer from attention imbalance: they under-weight visual features relative to language priors, which causes hallucination
• PND uses dual-path contrast during inference: the positive path amplifies visual evidence, and the negative path constructs counterfactuals
• The training-free approach reaches SOTA on standard benchmarks without model retraining
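
The dual-path idea above can be illustrated with a small sketch. The summary does not give PND's actual scoring formula, so the combination rule below is an assumption borrowed from generic contrastive decoding: it shifts the base log-probabilities toward a visually amplified path and away from a counterfactual path. The function names, the weights `alpha` and `beta`, and the toy logits are all hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def dual_path_contrast(base, positive, negative, alpha=1.0, beta=1.0):
    """Combine three next-token distributions (ASSUMED rule, not the paper's).

    base:     logits from the unmodified model
    positive: logits from a path with amplified visual evidence
    negative: logits from a counterfactual, language-prior-dominant path
    """
    lb = log_softmax(base)
    lp = log_softmax(positive)
    ln = log_softmax(negative)
    # Move toward the positive path and away from the negative path.
    return [b + alpha * (p - b) - beta * (n - b) for b, p, n in zip(lb, lp, ln)]

def greedy_pick(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)

# Toy vocabulary (hypothetical): ["dog", "cat", "frisbee"]
base = [2.0, 1.0, 2.2]      # the prior slightly favors "frisbee"
positive = [3.0, 1.0, 1.5]  # amplified visual evidence favors "dog"
negative = [1.0, 1.0, 3.0]  # counterfactual path favors "frisbee"

plain_choice = greedy_pick(base)
contrast_choice = greedy_pick(dual_path_contrast(base, positive, negative))
```

In this toy setup, plain greedy decoding picks the prior-favored token, while the contrasted scores pick the token supported by the positive path. Setting `alpha` and `beta` to zero recovers the base distribution, which is one way to sanity-check such a rule.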

Generated with AI, which can make mistakes.

#research-breakthrough #ai-tools #ai-agents

Read full article at arXiv cs.LG
