Learn Self-Adaptation Thinking through RL and switch thinking modes flexibly according to scenarios

Effective social intelligence simulation requires language agents to adjust their reasoning depth dynamically, a capability conspicuously absent from current methods: they either lack such reasoning altogether or enforce uniform long-chain reasoning in every scenario, leading to excessive token usage and inappropriate social simulations. This paper proposes an Adaptive Mindset Learning (AML) framework that strategically selects among four mindsets, ranging from intuitive reaction to deep thinking, based on real-time context. The framework's core innovation, the Adaptive Mindset Policy Optimization (AMPO) algorithm, delivers three advances over existing methods: (1) a multi-granularity mindset design, (2) context-aware mindset switching during social interactions, and (3) token-efficient reasoning through depth-adaptive processing. Extensive experiments on social intelligence tasks show that AML outperforms the current state-of-the-art method by 15.6%. Notably, while shortening reasoning chains by 32.8%, our method still outperforms GRPO by 7.0%. These results demonstrate that the context-sensitive mindset selection achieved by AMPO is closer to human adaptive thinking than GRPO's fixed-depth reasoning.
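The selection mechanism described above can be illustrated with a toy sketch. The mindset names, context features, and thresholds below are assumptions for illustration only; in the paper AMPO learns the switching policy via RL rather than using fixed rules.

```python
# Hypothetical sketch of context-aware mindset selection, loosely inspired by
# the AML framework. All feature names and thresholds here are illustrative
# assumptions, not the paper's learned policy.

MINDSETS = ["intuitive_reaction", "shallow_thinking",
            "strategic_thinking", "deep_thinking"]

def complexity_score(context: dict) -> float:
    """Crude proxy for social-context complexity, in [0, 1]."""
    score = 0.0
    score += 0.4 if context.get("conflicting_goals") else 0.0
    score += 0.3 if context.get("new_information") else 0.0
    score += min(context.get("turns_so_far", 0), 10) / 10 * 0.3
    return score

def select_mindset(context: dict) -> str:
    """Map the complexity score onto one of four reasoning depths."""
    s = complexity_score(context)
    if s < 0.25:
        return MINDSETS[0]
    elif s < 0.5:
        return MINDSETS[1]
    elif s < 0.75:
        return MINDSETS[2]
    return MINDSETS[3]

print(select_mindset({"turns_so_far": 1}))        # simple context
print(select_mindset({"conflicting_goals": True,
                      "new_information": True,
                      "turns_so_far": 8}))        # complex context
```

A rule-based selector like this keeps the token budget proportional to situation complexity, which is the intuition the RL-trained policy formalizes.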

Read more
Inspecting and Editing Knowledge Representations in Language Models
A Survey of LLM Knowledge Editing

The remarkable knowledge-retention ability of large language models (LLMs) is attributed to the way they process and compress vast amounts of data, which may yield a more concise, coherent, and interpretable model of the generative process, in essence creating a kind of "world model".

Read more
Jailbreaking Attack against Multimodal Large Language Model

This paper focuses on jailbreaking attacks against multimodal large language models (MLLMs), which attempt to elicit harmful responses from an MLLM to harmful queries. It proposes a maximum-likelihood-based algorithm to find an image jailbreaking prompt (imgJP) that can jailbreak MLLMs across multiple unseen prompts and images (a data-universal property). The method exhibits strong model transferability: the generated imgJP can be transferred in a black-box manner to jailbreak various models, including MiniGPT-v2, LLaVA, InstructBLIP, and mPLUG-Owl2. The paper further reveals a connection between MLLM jailbreaking and LLM jailbreaking, and finally introduces a construction-based method for LLM jailbreaking that proves more efficient than current state-of-the-art approaches.

Read more
Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA

This paper presents MCLE, a self-supervised Multi-level Contrastive Learning based natural language Explanation model for VQA, which constructs factual and counterfactual samples at the semantic, image, and instance levels. MCLE extracts discriminative features and aligns explanations with visual questions and answers in the feature space, producing more consistent explanations. The authors conduct extensive experiments and case studies to demonstrate the effectiveness of the proposed method on two VQA-NLE benchmarks.
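The contrastive objective behind models like MCLE is typically an InfoNCE-style loss that pulls a factual (positive) sample toward the anchor while pushing counterfactual (negative) samples away. The following is a minimal generic sketch, not the paper's exact multi-level loss; the cosine similarity and temperature value are assumptions.

```python
import math

# Generic InfoNCE-style contrastive loss: the standard building block behind
# multi-level contrastive learning. Vectors are plain Python lists.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log( exp(sim(a,p)/t) / (exp(sim(a,p)/t) + sum_n exp(sim(a,n)/t)) )"""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

anchor    = [1.0, 0.0]
positive  = [0.9, 0.1]                     # factual sample: close to anchor
negatives = [[-1.0, 0.0], [0.0, 1.0]]      # counterfactual samples
loss = info_nce(anchor, positive, negatives)
print(loss)
```

Minimizing this loss at several levels (semantic, image, instance) is what aligns explanations with the question-answer pair in feature space.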

Read more
Inversion-based Style Transfer with Diffusion Models

Our key idea is to learn the artistic style directly from a single painting and then guide the synthesis without providing complex textual descriptions. Specifically, we perceive style as a learnable textual description of a painting. We propose an inversion-based style transfer method (InST), which can efficiently and accurately learn the key information of an image, thus capturing and transferring the artistic style of a painting.

Read more
InstructPix2Pix: Learning to Follow Image Editing Instructions

We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models—a language model (GPT-3) and a text-to-image model (Stable Diffusion)—to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. Since it performs edits in the forward pass and does not require per-example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions.
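At inference, InstructPix2Pix applies classifier-free guidance over two conditionings, with separate scales for the input image and the text instruction. Below is a minimal sketch of that dual-guidance combination; the toy denoiser and the default scale values are illustrative assumptions, not the released model.

```python
# Sketch of two-way classifier-free guidance as used by InstructPix2Pix:
# separate guidance scales for the image condition (s_img) and the text
# instruction (s_txt). `eps` stands in for the denoising network.

def guided_noise(eps, z, image_cond, text_cond, s_img=1.5, s_txt=7.5):
    e_uncond = eps(z, None, None)              # both conditions dropped
    e_img    = eps(z, image_cond, None)        # image condition only
    e_full   = eps(z, image_cond, text_cond)   # image + instruction
    return [u + s_img * (i - u) + s_txt * (f - i)
            for u, i, f in zip(e_uncond, e_img, e_full)]

# Toy denoiser: pretends each condition shifts the predicted noise by a
# constant, just to make the combination observable.
def toy_eps(z, image_cond, text_cond):
    shift = (0.1 if image_cond else 0.0) + (0.2 if text_cond else 0.0)
    return [v + shift for v in z]

z = [0.0, 1.0]
print(guided_noise(toy_eps, z, image_cond="img", text_cond="make it sunny"))
```

With both scales set to 1, the formula collapses to the fully conditioned prediction; raising either scale independently strengthens fidelity to the input image or adherence to the instruction.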

Read more
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

In this work, we present a new approach for “personalization” of text-to-image diffusion models. Given as input just a few images of a subject, we fine-tune a pretrained text-to-image model such that it learns to bind a unique identifier with that specific subject. Once the subject is embedded in the output domain of the model, the unique identifier can be used to synthesize novel photorealistic images of the subject contextualized in different scenes. By leveraging the semantic prior embedded in the model with a new autogenous class-specific prior preservation loss, our technique enables synthesizing the subject in diverse scenes, poses, views and lighting conditions that do not appear in the reference images.
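The prior-preservation idea can be sketched as a two-term objective: a reconstruction loss on the few subject images plus a weighted reconstruction loss on class-prior images generated by the frozen model. This simplified version operates on generic vectors rather than diffusion noise predictions, and the default weight is an assumption.

```python
# Simplified sketch of DreamBooth's class-specific prior-preservation
# objective. In the actual method both terms are diffusion denoising losses;
# here they are plain mean-squared errors on list-based "tensors".

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def dreambooth_loss(pred_subject, subject, pred_prior, prior,
                    lambda_prior=1.0):
    """Subject reconstruction + weighted reconstruction of class-prior samples."""
    return mse(pred_subject, subject) + lambda_prior * mse(pred_prior, prior)

loss = dreambooth_loss([0.9, 1.1], [1.0, 1.0],   # subject prediction vs. target
                       [0.2, 0.1], [0.0, 0.0])   # prior prediction vs. target
print(loss)
```

The second term is what keeps the fine-tuned model from collapsing the whole class (e.g. "dog") onto the one subject it was personalized with.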

Read more
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy?
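The paper's answer, textual inversion, optimizes a single new token embedding against a frozen model. The toy sketch below replaces the frozen diffusion model with stand-in "image feature" vectors, so the embedding simply converges to their mean by gradient descent; everything here is illustrative, not the actual training setup.

```python
# Toy illustration of textual inversion: a new pseudo-word embedding v* is
# the only learnable parameter; the "model" is frozen. Here the frozen model
# is replaced by fixed feature vectors, and v* minimizes the mean squared
# distance to them.

def optimize_embedding(image_feats, steps=200, lr=0.1):
    dim = len(image_feats[0])
    v = [0.0] * dim                        # the learnable pseudo-word "S*"
    for _ in range(steps):
        # gradient of the mean squared distance to every image feature
        grad = [0.0] * dim
        for feat in image_feats:
            for j in range(dim):
                grad[j] += 2 * (v[j] - feat[j]) / len(image_feats)
        v = [vj - lr * gj for vj, gj in zip(v, grad)]
    return v

feats = [[1.0, 2.0], [3.0, 2.0]]           # stand-in encodings of concept images
v_star = optimize_embedding(feats)
print(v_star)                              # converges to the mean, [2.0, 2.0]
```

In the real method the same single-vector optimization runs through the frozen text encoder and denoiser, which is why one learned "word" suffices to reproduce the concept.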

Read more
Diffusion Models: A Comprehensive Survey of Methods and Applications

Diffusion models have emerged as a powerful new family of deep generative models with record-breaking performance in many applications, including image synthesis, video generation, and molecule design. In this survey, we provide an overview of the rapidly expanding body of work on diffusion models, categorizing the research into three key areas: efficient sampling, improved likelihood estimation, and handling data with special structures. We also discuss the potential for combining diffusion models with other generative models for enhanced results. We further review the wide-ranging applications of diffusion models in fields spanning from computer vision, natural language processing, temporal data modeling, to interdisciplinary applications in other scientific disciplines.
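All the diffusion models covered by the survey share the same forward (noising) process, which admits a closed form: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. A minimal sketch with DDPM's linear beta schedule follows; the schedule endpoints match the original DDPM defaults.

```python
import math
import random

# Closed-form forward diffusion: sample x_t directly from x_0 without
# iterating through every intermediate step.

def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear beta schedule from the DDPM paper."""
    return [beta_1 + (beta_T - beta_1) * t / (T - 1) for t in range(T)]

def alpha_bar(betas, t):
    """Cumulative product of (1 - beta) up to and including step t."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod

def q_sample(x0, t, betas, rng=None):
    """Draw x_t ~ q(x_t | x_0) in one shot."""
    rng = rng or random.Random(0)
    ab = alpha_bar(betas, t)
    return [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)
            for x in x0]

betas = linear_betas()
print(alpha_bar(betas, 0))      # close to 1: sample still resembles the data
print(alpha_bar(betas, 999))    # close to 0: sample is nearly pure noise
```

This one-shot sampling is what makes diffusion training efficient: each training example can be noised to a random timestep independently.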

Read more