<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>arXiv AIGC Image/Video Generation Paper Digest</title>
    <description>Automatically curated recent arXiv papers on AIGC image generation, video generation, and diffusion models</description>
    <link>https://arxiv.org</link>
    <atom:link href="http://101.32.50.67:8080/arxiv_aigc_feed.xml" rel="self" type="application/rss+xml"/>
    <lastBuildDate>Tue, 28 Apr 2026 09:01:04 GMT</lastBuildDate>
    <language>en-us</language>
    <generator>arXiv AIGC RSS Generator v1.0</generator>
    <item>
      <title>[Diffusion Models] [Image Generation] [Model Architecture] Generative Modeling of Neurodegenerative Brain Anatomy with 4D Longitudinal Diffusion Model</title>
      <link>https://arxiv.org/abs/2604.22700v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22700v1</guid>
      <pubDate>Fri, 24 Apr 2026 16:33:48 GMT</pubDate>
      <dc:creator>Nivetha Jayakumar, Swakshar Deb, Bahram Jafrasteh et al.</dc:creator>
      <category>Diffusion Models</category>
      <category>Image Generation</category>
      <category>Model Architecture</category>
      <description>Understanding and predicting the progression of neurodegenerative diseases remains a major challenge in medical AI, with significant implications for early diagnosis, disease monitoring, and treatment...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Generative Modeling of Neurodegenerative Brain Anatomy with 4D Longitudinal Diffusion Model</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Nivetha Jayakumar, Swakshar Deb, Bahram Jafrasteh et al.</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Diffusion Models, Image Generation, Model Architecture</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> diffusion model, dit, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">Understanding and predicting the progression of neurodegenerative diseases remains a major challenge in medical AI, with significant implications for early diagnosis, disease monitoring, and treatment planning. However, most available longitudinal neuroimaging datasets are temporally sparse with a few follow-up scans per subject. This scarcity of temporal data limits our ability to model and accurately capture the continuous anatomical changes related to disease progression in individual subjects. To address this problem, we propose a novel 4D (3DxT) diffusion-based generative framework that effectively models and synthesizes longitudinal brain anatomy over time, conditioned on available clinical variables such as health status, age, sex, and other relevant factors. Moreover, while most current approaches focus on manipulating image intensity or texture, our method explicitly learns the data distribution of topology-preserving spatiotemporal deformations to effectively capture the geometric changes of brain structures over time. This design enables the realistic generation of future anatomical states and the reconstruction of anatomically consistent disease trajectories, providing a more faithful representation of longitudinal brain changes. We validate our model through both synthetic sequence generation and downstream longitudinal disease classification, as well as brain segmentation. Experiments on two large-scale longitudinal neuroimage datasets demonstrate that our method outperforms state-of-the-art baselines in generating anatomically accurate, temporally consistent, and clinically meaningful brain trajectories. Our code is available on Github.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 Links:</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22700v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv page</a>
      <a href="https://arxiv.org/pdf/2604.22700v1" style="color: #FF5722; text-decoration: none;">📥 Download PDF</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>Note:</strong> This is a recent paper in the AIGC field, covering topics such as Diffusion Models and Image Generation.</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Model Architecture] PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views</title>
      <link>https://arxiv.org/abs/2604.22658v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22658v1</guid>
      <pubDate>Fri, 24 Apr 2026 15:32:03 GMT</pubDate>
      <dc:creator>Jiaxin Shi, Guofeng Zhang, Wufei Ma et al.</dc:creator>
      <category>Model Architecture</category>
      <description>Single-view 3D shape retrieval is a fundamental yet challenging task that is increasingly important with the growth of available 3D data. Existing approaches largely fall into two categories: those us...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">PASR: Pose-Aware 3D Shape Retrieval from Occluded Single Views</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Jiaxin Shi, Guofeng Zhang, Wufei Ma et al.</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Model Architecture</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">Single-view 3D shape retrieval is a fundamental yet challenging task that is increasingly important with the growth of available 3D data. Existing approaches largely fall into two categories: those using contrastive learning to map point cloud features into existing vision-language spaces and those that learn a common embedding space for 2D images and 3D shapes. However, these feed-forward, holistic alignments are often difficult to interpret, which in turn limits their robustness and generalization to real-world applications. To address this problem, we propose Pose-Aware 3D Shape Retrieval (PASR), a framework that formulates retrieval as a feature-level analysis-by-synthesis problem by distilling knowledge from a 2D foundation model (DINOv3) into a 3D encoder. By aligning pose-conditioned 3D projections with 2D feature maps, our method bridges the gap between real-world images and synthetic meshes. During inference, PASR performs a test-time optimization via analysis-by-synthesis, jointly searching for the shape and pose that best reconstruct the patch-level feature map of the input image. This synthesis-based optimization is inherently robust to partial occlusion and sensitive to fine-grained geometric details. PASR substantially outperforms existing methods on both clean and occluded 3D shape retrieval datasets by a wide margin. Additionally, PASR demonstrates strong multi-task capabilities, achieving robust shape retrieval, competitive pose estimation, and accurate category classification within a single framework.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 Links:</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22658v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv page</a>
      <a href="https://arxiv.org/pdf/2604.22658v1" style="color: #FF5722; text-decoration: none;">📥 Download PDF</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>Note:</strong> This is a recent paper in the AIGC field, covering topics such as Model Architecture.</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Model Architecture] A Non-Invasive Alternative to RFID: Self-Sufficient 3D Identification of Group-Housed Livestock</title>
      <link>https://arxiv.org/abs/2604.22657v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22657v1</guid>
      <pubDate>Fri, 24 Apr 2026 15:31:47 GMT</pubDate>
      <dc:creator>Shiva Paudel, TsungCheng Tsai, Dongyi Wang</dc:creator>
      <category>Model Architecture</category>
      <description>Accurate identification of individual farm animals in group-housed environments is a cornerstone of precision livestock management. However, current industry standards rely heavily on Radio Frequency ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">A Non-Invasive Alternative to RFID: Self-Sufficient 3D Identification of Group-Housed Livestock</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Shiva Paudel, TsungCheng Tsai, Dongyi Wang</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Model Architecture</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">Accurate identification of individual farm animals in group-housed environments is a cornerstone of precision livestock management. However, current industry standards rely heavily on Radio Frequency Identification (RFID) ear tags, which are invasive, prone to loss, and restricted by the spatial limitations of antenna fields. In this paper, we propose a non-intrusive, vision-based identification system leveraging 3D point cloud data captured within a commercial electronic feeding station (EFS). Departing from traditional supervised frame-level inference, we introduce the Temporal Adaptive Recognition Architecture (TARA), a self-sufficient, semi-supervised framework designed to maintain identity consistency over time. TARA employs a dynamic recalibration mechanism that updates individual identity profiles to account for morphological changes in the livestock. To facilitate training in label-scarce environments, we utilize a visit-level majority voting strategy to generate high-fidelity pseudo-labels from raw temporal sequences. Experimental results on a group housed sow dataset collected from an operational commercial barn demonstrate that our approach achieves 100% identification accuracy at the visit level. These results suggest that vision-based 3D point cloud analysis offers a robust, superior alternative to RFID-based systems, paving the way for fully autonomous individual animal monitoring.</p>
  </div>
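  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> The visit-level pseudo-labelling step described above amounts to a majority vote over per-frame identity predictions. A minimal Python sketch; the ID strings and the 0.8 agreement cut-off are illustrative assumptions, not values from the paper.</p>
    <pre style="background: #fafafa; padding: 10px; border-radius: 6px; overflow-x: auto;"><code>from collections import Counter

def visit_pseudo_label(frame_ids):
    """Majority vote over the per-frame identity predictions of one feeder visit."""
    winner, votes = Counter(frame_ids).most_common(1)[0]
    return winner, votes / len(frame_ids)   # pseudo-label and its vote share

# Keep only high-agreement visits as pseudo-labelled training samples (0.8 is an assumed cut-off).
label, agreement = visit_pseudo_label(["sow_12", "sow_12", "sow_07", "sow_12", "sow_12"])
if agreement >= 0.8:
    print(f"pseudo-label: {label} (agreement {agreement:.0%})")
</code></pre>
  </div>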
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 Links:</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22657v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv page</a>
      <a href="https://arxiv.org/pdf/2604.22657v1" style="color: #FF5722; text-decoration: none;">📥 Download PDF</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>Note:</strong> This is a recent paper in the AIGC field, covering topics such as Model Architecture.</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Diffusion Models] [Image Generation] Structure-Guided Diffusion Model for EEG-Based Visual Cognition Reconstruction</title>
      <link>https://arxiv.org/abs/2604.22649v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22649v1</guid>
      <pubDate>Fri, 24 Apr 2026 15:20:59 GMT</pubDate>
      <dc:creator>Yongxiang Lian, Yueyang Cang, Pingge Hu et al.</dc:creator>
      <category>Diffusion Models</category>
      <category>Image Generation</category>
      <description>Objective: Decoding visual information from electroencephalography (EEG) is an important problem in neuroscience and brain-computer interface (BCI) research. Existing methods are largely restricted to...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Structure-Guided Diffusion Model for EEG-Based Visual Cognition Reconstruction</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Yongxiang Lian, Yueyang Cang, Pingge Hu et al.</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Diffusion Models, Image Generation</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> diffusion model, controlnet, diffusion, image generation</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">Objective: Decoding visual information from electroencephalography (EEG) is an important problem in neuroscience and brain-computer interface (BCI) research. Existing methods are largely restricted to natural images and categorical representations, with limited capacity to capture structural features and to differentiate objective perception from subjective cognition. We propose a Structure-Guided Diffusion Model (SGDM) that incorporates explicit structural information for EEG-based visual reconstruction. Approach: SGDM is evaluated on the Kilogram abstract visual object dataset and the THINGS natural image dataset using a two-stage generative mechanism. The framework combines a structurally supervised variational autoencoder with a spatiotemporal EEG encoder aligned to a visual embedding space via contrastive learning. Structural information is integrated into a diffusion model through ControlNet to guide image generation from EEG features. Results: SGDM outperforms existing methods on both abstract and natural image datasets. Reconstructed images achieve higher fidelity in low-level visual features and semantic representations, indicating improved decoding accuracy and strong generalization across diverse visual domains. Spatiotemporal analysis of EEG signals further reveals hierarchical structural encoding patterns, consistent with the neural dynamics of visual cognition. Significance: These findings validate the effectiveness of SGDM in capturing explicit structural geometry and generating images with high fidelity to individual cognitive representations. By enabling decoding of complex visual content from EEG signals, the framework extends neural decoding beyond low-dimensional or categorical outputs. This supports BCIs with increased degrees of freedom for intention decoding and more flexible brain-to-machine communication.</p>
  </div>
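  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> The contrastive alignment between the spatiotemporal EEG encoder and a visual embedding space, as described in the abstract, can be read as a CLIP-style symmetric InfoNCE objective. A minimal PyTorch sketch; the batch pairing, embedding size and temperature are assumptions, not the authors' implementation.</p>
    <pre style="background: #fafafa; padding: 10px; border-radius: 6px; overflow-x: auto;"><code>import torch
import torch.nn.functional as F

def eeg_image_alignment_loss(eeg_emb, img_emb, temperature=0.07):
    """Symmetric InfoNCE: pull each EEG embedding toward the embedding of the
    image the subject was viewing, push it away from the other images in the batch."""
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    img_emb = F.normalize(img_emb, dim=-1)
    logits = eeg_emb @ img_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(eeg_emb.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# eeg_emb: spatiotemporal EEG encoder output; img_emb: visual embedding of the viewed stimulus
loss = eeg_image_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
</code></pre>
  </div>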
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 Links:</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22649v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv page</a>
      <a href="https://arxiv.org/pdf/2604.22649v1" style="color: #FF5722; text-decoration: none;">📥 Download PDF</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>Note:</strong> This is a recent paper in the AIGC field, covering topics such as Diffusion Models and Image Generation.</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Video Generation] [Model Architecture] FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing</title>
      <link>https://arxiv.org/abs/2604.22586v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22586v1</guid>
      <pubDate>Fri, 24 Apr 2026 14:17:11 GMT</pubDate>
      <dc:creator>Ze Chen, Lan Chen, Yuanhang Li et al.</dc:creator>
      <category>Video Generation</category>
      <category>Model Architecture</category>
      <description>We propose FlowAnchor, a training-free framework for stable and efficient inversion-free, flow-based video editing. Inversion-free editing methods have recently shown impressive efficiency and structu...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Ze Chen, Lan Chen, Yuanhang Li et al.</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Video Generation, Model Architecture</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> video editing, dit</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">We propose FlowAnchor, a training-free framework for stable and efficient inversion-free, flow-based video editing. Inversion-free editing methods have recently shown impressive efficiency and structure preservation in images by directly steering the sampling trajectory with an editing signal. However, extending this paradigm to videos remains challenging, often failing in multi-object scenes or with increased frame counts. We identify the root cause as the instability of the editing signal in high-dimensional video latent spaces, which arises from imprecise spatial localization and length-induced magnitude attenuation. To overcome this challenge, FlowAnchor explicitly anchors both where to edit and how strongly to edit. It introduces Spatial-aware Attention Refinement, which enforces consistent alignment between textual guidance and spatial regions, and Adaptive Magnitude Modulation, which adaptively preserves sufficient editing strength. Together, these mechanisms stabilize the editing signal and guide the flow-based evolution toward the desired target distribution. Extensive experiments demonstrate that FlowAnchor achieves more faithful, temporally coherent, and computationally efficient video editing across challenging multi-object and fast-motion scenarios. The project page is available at https://cuc-mipg.github.io/FlowAnchor.github.io/.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 Links:</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22586v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv page</a>
      <a href="https://arxiv.org/pdf/2604.22586v1" style="color: #FF5722; text-decoration: none;">📥 Download PDF</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>Note:</strong> This is a recent paper in the AIGC field, covering topics such as Video Generation and Model Architecture.</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Diffusion Models] Useful nonrobust features are ubiquitous in biomedical images</title>
      <link>https://arxiv.org/abs/2604.22579v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22579v1</guid>
      <pubDate>Fri, 24 Apr 2026 14:12:18 GMT</pubDate>
      <dc:creator>Coenraad Mouton, Randle Rabe, Niklas C. Koser et al.</dc:creator>
      <category>Diffusion Models</category>
      <description>We study whether deep networks for medical imaging learn useful nonrobust features - predictive input patterns that are not human interpretable and highly susceptible to small adversarial perturbation...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Useful nonrobust features are ubiquitous in biomedical images</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Coenraad Mouton, Randle Rabe, Niklas C. Koser et al.</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Diffusion Models</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> edm</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">We study whether deep networks for medical imaging learn useful nonrobust features - predictive input patterns that are not human interpretable and highly susceptible to small adversarial perturbations - and how these features impact test performance. We show that models trained only on nonrobust features achieve well above chance accuracy across five MedMNIST classification tasks, confirming their predictive value in-distribution. Conversely, adversarially trained models that primarily rely on robust features sacrifice in-distribution accuracy but yield markedly better performance under controlled distribution shifts (MedMNIST-C). Overall, nonrobust features boost standard accuracy yet degrade out-of-distribution performance, revealing a practical robustness-accuracy trade-off in medical imaging classification tasks that should be tailored to the requirements of the deployment setting.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 Links:</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22579v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv page</a>
      <a href="https://arxiv.org/pdf/2604.22579v1" style="color: #FF5722; text-decoration: none;">📥 Download PDF</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>Note:</strong> This is a recent paper in the AIGC field, covering topics such as Diffusion Models.</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Model Architecture] [Image Generation] [Multimodal Generation] Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors</title>
      <link>https://arxiv.org/abs/2604.22560v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22560v1</guid>
      <pubDate>Fri, 24 Apr 2026 13:54:48 GMT</pubDate>
      <dc:creator>Gautam Kumar Jain, Carsten Markgraf, Julian Stähler</dc:creator>
      <category>Model Architecture</category>
      <category>Image Generation</category>
      <category>Multimodal Generation</category>
      <description>Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent wi...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Cross-Stage Coherence in Hierarchical Driving VQA: Explicit Baselines and Learned Gated Context Projectors</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Gautam Kumar Jain, Carsten Markgraf, Julian Stähler</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Model Architecture, Image Generation, Multimodal Generation</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> gan, vlm, dit, lora</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">Graph Visual Question Answering (GVQA) for autonomous driving organizes reasoning into ordered stages, namely Perception, Prediction, and Planning, where planning decisions should remain consistent with the model&apos;s own perception. We present a comparative study of cross-stage context passing on DriveLM-nuScenes using two complementary mechanisms. The explicit variant evaluates three prompt-based conditioning strategies on a domain-adapted 4B VLM (Mini-InternVL2-4B-DA-DriveLM) without additional training, reducing NLI contradiction by up to 42.6% and establishing a strong zero-training baseline. The implicit variant introduces gated context projectors, which extract a hidden-state vector from one stage and inject a normalized, gated projection into the next stage&apos;s input embeddings. These projectors are jointly trained with stage-specific QLoRA adapters on a general-purpose 8B VLM (InternVL3-8B-Instruct) while updating only approximately 0.5% of parameters. The implicit variant achieves a statistically significant 34% reduction in planning-stage NLI contradiction (bootstrap 95% CIs, p &lt; 0.05) and increases cross-stage entailment by 50%, evaluated with a multilingual NLI classifier to account for mixed-language outputs. Planning language quality also improves (CIDEr +30.3%), but lexical overlap and structural consistency degrade due to the absence of driving-domain pretraining. Since the two variants use different base models, we present them as complementary case studies: explicit context passing provides a strong training-free baseline for surface consistency, while implicit gated projection delivers significant planning-stage semantic gains, suggesting domain adaptation as a plausible next ingredient for full-spectrum improvement.</p>
  </div>
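  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> The gated context projector described above (project a pooled hidden state from one stage, normalise and gate it, then add it to the next stage's input embeddings) roughly maps onto a module like the following PyTorch sketch. The LayerNorm, scalar tanh gate and zero initialisation are illustrative assumptions.</p>
    <pre style="background: #fafafa; padding: 10px; border-radius: 6px; overflow-x: auto;"><code>import torch
import torch.nn as nn

class GatedContextProjector(nn.Module):
    """Injects a normalised, gated projection of the previous stage's pooled
    hidden state into the next stage's input embeddings."""
    def __init__(self, hidden_dim, embed_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, embed_dim)
        self.norm = nn.LayerNorm(embed_dim)
        self.gate = nn.Parameter(torch.zeros(1))   # starts closed; learns how much context to pass

    def forward(self, next_stage_embeds, prev_stage_hidden):
        # next_stage_embeds: (B, T, D) input embeddings; prev_stage_hidden: (B, H) pooled state
        ctx = self.norm(self.proj(prev_stage_hidden))
        return next_stage_embeds + torch.tanh(self.gate) * ctx.unsqueeze(1)
</code></pre>
  </div>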
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 Links:</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22560v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv page</a>
      <a href="https://arxiv.org/pdf/2604.22560v1" style="color: #FF5722; text-decoration: none;">📥 Download PDF</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>Note:</strong> This is a recent paper in the AIGC field, covering topics such as Model Architecture and Image Generation.</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Model Architecture] Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction?</title>
      <link>https://arxiv.org/abs/2604.22557v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22557v1</guid>
      <pubDate>Fri, 24 Apr 2026 13:52:25 GMT</pubDate>
      <dc:creator>Anam Hashmi, Mayug Maniparambil, Julia Dietlmeier et al.</dc:creator>
      <category>Model Architecture</category>
      <description>The emergence of large-scale pretrained foundation models has transformed computer vision, enabling strong performance across diverse downstream tasks. However, their potential for physics-based inver...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction?</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Anam Hashmi, Mayug Maniparambil, Julia Dietlmeier et al.</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Model Architecture</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">The emergence of large-scale pretrained foundation models has transformed computer vision, enabling strong performance across diverse downstream tasks. However, their potential for physics-based inverse problems, such as accelerated cardiac MRI reconstruction, remains largely underexplored. In this work, we investigate whether natural-domain foundation models can serve as effective image priors for accelerated cardiac MRI reconstruction, and compare the performance obtained against domain-specific counterparts such as BiomedCLIP. We propose an unrolled reconstruction framework that incorporates pretrained, frozen visual encoders, such as CLIP, DINOv2, and BiomedCLIP, within each cascade to guide the reconstruction process. Through extensive experiments, we show that while task-specific state-of-the-art reconstruction models such as E2E-VarNet achieve superior performance in standard in-distribution settings, foundation-model-based approaches remain competitive. More importantly, in challenging cross-domain scenarios, where models are trained on cardiac MRI and evaluated on anatomically distinct knee and brain datasets--foundation models exhibit improved robustness, particularly under high acceleration factors and limited low-frequency sampling. We further observe that natural-image-pretrained models, such as CLIP, learn highly transferable structural representations, while domain-specific pretraining (BiomedCLIP) provides modest additional gains in more ill-posed regimes. Overall, our results suggest that pretrained foundation models offer a promising source of transferable priors, enabling improved robustness and generalization in accelerated MRI reconstruction.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 Links:</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22557v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv page</a>
      <a href="https://arxiv.org/pdf/2604.22557v1" style="color: #FF5722; text-decoration: none;">📥 Download PDF</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>Note:</strong> This is a recent paper in the AIGC field, covering topics such as Model Architecture.</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Video Generation] Video Analysis and Generation via a Semantic Progress Function</title>
      <link>https://arxiv.org/abs/2604.22554v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22554v1</guid>
      <pubDate>Fri, 24 Apr 2026 13:48:51 GMT</pubDate>
      <dc:creator>Gal Metzer, Sagi Polaczek, Ali Mahdavi-Amiri et al.</dc:creator>
      <category>Video Generation</category>
      <description>Transformations produced by image and video generation models often evolve in a highly non-linear manner: long stretches where the content barely changes are followed by sudden, abrupt semantic jumps....</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Video Analysis and Generation via a Semantic Progress Function</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Gal Metzer, Sagi Polaczek, Ali Mahdavi-Amiri et al.</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Video Generation</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> video generation</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">Transformations produced by image and video generation models often evolve in a highly non-linear manner: long stretches where the content barely changes are followed by sudden, abrupt semantic jumps. To analyze and correct this behavior, we introduce a Semantic Progress Function, a one-dimensional representation that captures how the meaning of a given sequence evolves over time. For each frame, we compute distances between semantic embeddings and fit a smooth curve that reflects the cumulative semantic shift across the sequence. Departures of this curve from a straight line reveal uneven semantic pacing. Building on this insight, we propose a semantic linearization procedure that reparameterizes (or retimes) the sequence so that semantic change unfolds at a constant rate, yielding smoother and more coherent transitions. Beyond linearization, our framework provides a model-agnostic foundation for identifying temporal irregularities, comparing semantic pacing across different generators, and steering both generated and real-world video sequences toward arbitrary target pacing.</p>
  </div>
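  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> The semantic progress function sketched in the abstract is, at its core, a normalised cumulative distance over frame embeddings, and linearization is a retiming that makes that progress linear. A minimal NumPy sketch, using raw cumulative distances rather than the smooth curve fit the authors describe.</p>
    <pre style="background: #fafafa; padding: 10px; border-radius: 6px; overflow-x: auto;"><code>import numpy as np

def semantic_progress(frame_embeddings):
    """Normalised cumulative semantic shift along the sequence: 0 at the first
    frame, 1 at the last. Plateaus and sudden steps reveal uneven pacing."""
    emb = np.asarray(frame_embeddings, dtype=np.float64)
    step = np.linalg.norm(np.diff(emb, axis=0), axis=1)     # per-transition semantic change
    progress = np.concatenate([[0.0], np.cumsum(step)])
    return progress / progress[-1]

def linearize(progress, n_out):
    """Retiming: pick frame indices whose progress values are evenly spaced,
    so that semantic change unfolds at a roughly constant rate."""
    targets = np.linspace(0.0, 1.0, n_out)
    return np.searchsorted(progress, targets).clip(0, len(progress) - 1)
</code></pre>
  </div>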
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 Links:</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22554v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv page</a>
      <a href="https://arxiv.org/pdf/2604.22554v1" style="color: #FF5722; text-decoration: none;">📥 Download PDF</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>Note:</strong> This is a recent paper in the AIGC field, covering topics such as Video Generation.</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Model Architecture] Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models</title>
      <link>https://arxiv.org/abs/2604.22552v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22552v1</guid>
      <pubDate>Fri, 24 Apr 2026 13:45:11 GMT</pubDate>
      <dc:creator>Shihui Yan, Ziqi Zhou, Yufei Song et al.</dc:creator>
      <category>Model Architecture</category>
      <description>Physical adversarial patch attacks critically threaten pedestrian detection, causing surveillance and autonomous driving systems to miss pedestrians and creating severe safety risks. Despite their eff...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Transferable Physical-World Adversarial Patches Against Pedestrian Detection Models</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Shihui Yan, Ziqi Zhou, Yufei Song et al.</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Model Architecture</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">Physical adversarial patch attacks critically threaten pedestrian detection, causing surveillance and autonomous driving systems to miss pedestrians and creating severe safety risks. Despite their effectiveness in controlled settings, existing physical attacks face two major limitations in practice: they lack systematic disruption of the multi-stage decision pipeline, enabling residual modules to offset perturbations, and they fail to model complex physical variations, leading to poor robustness. To overcome these limitations, we propose a novel pedestrian adversarial patch generation method that combines multi-stage collaborative attacks with robustness enhancement under physical diversity, called TriPatch. Specifically, we design a triplet loss consisting of detection confidence suppression, bounding-box offset amplification, and non-maximum suppression (NMS) disruption, which jointly act across different stages of the detection pipeline. In addition, we introduce an appearance consistency loss to constrain the color distribution of the patch, thereby improving its adaptability under diverse imaging conditions, and incorporate data augmentation to further enhance robustness against complex physical perturbations. Extensive experiments demonstrate that TriPatch achieves a higher attack success rate across multiple detector models compared to existing approaches.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 Links:</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22552v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv page</a>
      <a href="https://arxiv.org/pdf/2604.22552v1" style="color: #FF5722; text-decoration: none;">📥 Download PDF</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>Note:</strong> This is a recent paper in the AIGC field, covering topics such as Model Architecture.</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Diffusion Models] [Model Architecture] Distilling Vision Transformers for Distortion-Robust Representation Learning</title>
      <link>https://arxiv.org/abs/2604.22529v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22529v1</guid>
      <pubDate>Fri, 24 Apr 2026 13:15:26 GMT</pubDate>
      <dc:creator>Konstantinos Alexis, Giorgos Giannopoulos, Dimitrios Gunopulos</dc:creator>
      <category>Diffusion Models</category>
      <category>Model Architecture</category>
      <description>Self-supervised learning has achieved remarkable success in learning visual representations from clean data, yet remains challenging when clean observations are sparse or not available at all. In this...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Distilling Vision Transformers for Distortion-Robust Representation Learning</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Konstantinos Alexis, Giorgos Giannopoulos, Dimitrios Gunopulos</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Diffusion Models, Model Architecture</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> distillation, transformer</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">Self-supervised learning has achieved remarkable success in learning visual representations from clean data, yet remains challenging when clean observations are sparse or not available at all. In this paper, we demonstrate that pretrained vision models can be leveraged to learn distortion-robust representations, which can then be effectively applied to downstream tasks operating on distorted observations. In particular, we propose an asymmetric knowledge distillation framework in which both teacher and student are initialized from the same pretrained Vision Transformer but receive different views of each image: the teacher processes clean images, while the student sees their distorted versions. We introduce multi-level distillation that aligns global embeddings, patch-level features, and attention maps and show that the student is able to approximate clean-image representations despite never directly accessing clean data. We evaluate our approach on image classification tasks across several datasets and under various distortions, consistently outperforming existing alternatives for the same amount of human supervision.</p>
  </div>
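  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> The asymmetric setup above (a frozen teacher sees clean images, the student sees distorted versions, with alignment at several levels) can be sketched as a simple two-term loss. The encoders are assumed to return a global embedding plus patch tokens; the attention-map term and the loss weighting are simplified assumptions.</p>
    <pre style="background: #fafafa; padding: 10px; border-radius: 6px; overflow-x: auto;"><code>import torch
import torch.nn.functional as F

def distortion_robust_distillation_loss(teacher, student, clean, distorted):
    """Teacher encodes the clean image, student encodes the distorted view of
    the same image; the student is pulled toward the teacher at two levels."""
    with torch.no_grad():
        t_global, t_patches = teacher(clean)        # assumed to return (global embedding, patch tokens)
    s_global, s_patches = student(distorted)
    global_term = 1.0 - F.cosine_similarity(s_global, t_global, dim=-1).mean()
    patch_term = F.mse_loss(s_patches, t_patches)
    return global_term + patch_term                 # equal weighting assumed
</code></pre>
  </div>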
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 Links:</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22529v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv page</a>
      <a href="https://arxiv.org/pdf/2604.22529v1" style="color: #FF5722; text-decoration: none;">📥 Download PDF</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>Note:</strong> This is a recent paper in the AIGC field, covering topics such as Diffusion Models and Model Architecture.</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Image Generation] [Model Architecture] ICPR 2026 Competition on Low-Resolution License Plate Recognition</title>
      <link>https://arxiv.org/abs/2604.22506v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22506v1</guid>
      <pubDate>Fri, 24 Apr 2026 12:36:09 GMT</pubDate>
      <dc:creator>Rayson Laroca, Valfride Nascimento, Donggun Kim et al.</dc:creator>
      <category>Image Generation</category>
      <category>Model Architecture</category>
      <description>Low-Resolution License Plate Recognition (LRLPR) remains a challenging problem in real-world surveillance scenarios, where long capture distances, compression artifacts, and adverse imaging conditions...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">ICPR 2026 Competition on Low-Resolution License Plate Recognition</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Rayson Laroca, Valfride Nascimento, Donggun Kim et al.</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Image Generation, Model Architecture</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> gan, dit</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">Low-Resolution License Plate Recognition (LRLPR) remains a challenging problem in real-world surveillance scenarios, where long capture distances, compression artifacts, and adverse imaging conditions can severely degrade license plate legibility. To promote progress in this area, we organized the ICPR 2026 Competition on Low-Resolution License Plate Recognition, the first competition specifically dedicated to LRLPR using real low-quality data collected under operationally relevant conditions. The competition was based on the LRLPR-26 dataset, which comprises 20,000 training tracks and 3,000 test tracks; each training track contains five low-resolution and five high-resolution images of the same license plate. Notably, a total of 269 teams from 41 countries registered for the competition, and 99 teams submitted valid entries in the Blind Test Phase. The winning team achieved a Recognition Rate of 82.13%, and four teams surpassed the 80% mark, highlighting both the high level of competition at the top of the leaderboard and the continued difficulty of the task. In addition to presenting the competition design, evaluation protocol, and main results, this paper summarizes the methods adopted by the top-5 teams and discusses current trends and promising directions for future research on LRLPR. The competition webpage is available at https://icpr26lrlpr.github.io/</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 Links:</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22506v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv page</a>
      <a href="https://arxiv.org/pdf/2604.22506v1" style="color: #FF5722; text-decoration: none;">📥 Download PDF</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>Note:</strong> This is a recent paper in the AIGC field, covering topics such as Image Generation and Model Architecture.</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Multimodal Generation] [Model Architecture] CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding</title>
      <link>https://arxiv.org/abs/2604.22498v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22498v1</guid>
      <pubDate>Fri, 24 Apr 2026 12:26:49 GMT</pubDate>
      <dc:creator>Lihao Zheng, Zhenwei Shao, Yu Zhou et al.</dc:creator>
      <category>Multimodal Generation</category>
      <category>Model Architecture</category>
      <description>Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Lihao Zheng, Zhenwei Shao, Yu Zhou et al.</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Multimodal Generation, Model Architecture</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> vlm, dit</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (abbr. CGC), a low-cost full framework for boosting fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).</p>
  </div>
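  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> The rule-based spatial reward mentioned above combines source-image attribution, spatial alignment and structured-output validity. A simplified Python analogue; the JSON field names, the 0/1 attribution term and the equal weighting are illustrative assumptions rather than the paper's exact reward.</p>
    <pre style="background: #fafafa; padding: 10px; border-radius: 6px; overflow-x: auto;"><code>def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def spatial_reward(pred, target):
    """0 for malformed output (structured-output validity); otherwise reward
    correct source-image attribution plus box alignment with the reference."""
    try:
        src, box = pred["image_index"], pred["bbox"]
    except (TypeError, KeyError):
        return 0.0
    attribution = 1.0 if src == target["image_index"] else 0.0
    return 0.5 * attribution + 0.5 * box_iou(box, target["bbox"])
</code></pre>
  </div>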
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 Links:</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22498v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv page</a>
      <a href="https://arxiv.org/pdf/2604.22498v1" style="color: #FF5722; text-decoration: none;">📥 Download PDF</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>Note:</strong> This is a recent paper in the AIGC field, covering topics such as Multimodal Generation and Model Architecture.</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Model Architecture] Improving Driver Drowsiness Detection via Personalized EAR/MAR Thresholds and CNN-Based Classification</title>
      <link>https://arxiv.org/abs/2604.22479v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22479v1</guid>
      <pubDate>Fri, 24 Apr 2026 12:00:08 GMT</pubDate>
      <dc:creator>Gökdeniz Ersoy, Mehmet Alper Tatar, Eray Tonbul et al.</dc:creator>
      <category>Model Architecture</category>
      <description>Driver drowsiness is a major cause of traffic accidents worldwide, posing a serious threat to public safety. Vision-based driver monitoring systems often rely on fixed Eye Aspect Ratio (EAR) and Mouth...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Improving Driver Drowsiness Detection via Personalized EAR/MAR Thresholds and CNN-Based Classification</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Gökdeniz Ersoy, Mehmet Alper Tatar, Eray Tonbul et al.</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Model Architecture</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">Driver drowsiness is a major cause of traffic accidents worldwide, posing a serious threat to public safety. Vision-based driver monitoring systems often rely on fixed Eye Aspect Ratio (EAR) and Mouth Aspect Ratio (MAR) thresholds; however, such fixed values frequently fail to generalize across individuals due to variations in facial structure, illumination, and driving conditions. This paper proposes a personalized driver drowsiness detection system that monitors eyelid movements, head position, and yawning behavior in real time and provides warnings when signs of fatigue are detected. The system employs driver-specific EAR and MAR thresholds, calibrated before driving, to improve classical metric-based detection. In addition, deep learning-based Convolutional Neural Network (CNN) models are integrated to enhance accuracy in challenging scenarios. The system is evaluated using publicly available datasets as well as a custom dataset collected under diverse lighting conditions, head poses, and user characteristics. Experimental results show that personalized thresholding improves detection accuracy by 2-3% compared to fixed thresholds, while CNN-based classification achieves 99.1% accuracy for eye state detection and 98.8% for yawning detection, demonstrating the effectiveness of combining classical metrics with deep learning for robust real-time driver monitoring.</p>
  </div>
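  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> The Eye Aspect Ratio used above is the standard six-landmark ratio EAR = (|p2-p6| + |p3-p5|) / (2|p1-p4|), and personalization replaces a global cut-off with a per-driver fraction of a calibrated baseline. A minimal NumPy sketch; the median baseline and the 0.75 fraction are assumptions, not values from the paper.</p>
    <pre style="background: #fafafa; padding: 10px; border-radius: 6px; overflow-x: auto;"><code>import numpy as np

def eye_aspect_ratio(eye):
    """EAR over the six eye landmarks p1..p6; low values indicate a closed eye."""
    eye = np.asarray(eye, dtype=np.float64)
    v1 = np.linalg.norm(eye[1] - eye[5])
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])
    return (v1 + v2) / (2.0 * h)

def personalized_threshold(calibration_ears, fraction=0.75):
    """Per-driver threshold from a short open-eyes calibration clip, replacing
    a single fixed EAR cut-off shared by all drivers."""
    return fraction * float(np.median(calibration_ears))
</code></pre>
  </div>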
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 Links:</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22479v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv page</a>
      <a href="https://arxiv.org/pdf/2604.22479v1" style="color: #FF5722; text-decoration: none;">📥 Download PDF</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>Note:</strong> This is a recent paper in the AIGC field, covering topics such as Model Architecture.</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Multimodal Generation] Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples</title>
      <link>https://arxiv.org/abs/2604.22477v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22477v1</guid>
      <pubDate>Fri, 24 Apr 2026 11:55:50 GMT</pubDate>
      <dc:creator>Oussama Bouanani, Jim Berend, Wojciech Samek et al.</dc:creator>
      <category>Multimodal Generation</category>
      <description>Neuron labeling assigns textual descriptions to internal units of deep networks. Existing approaches typically rely on highly activating examples, often yielding broad or misleading labels by focusing...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 Authors:</strong> Oussama Bouanani, Jim Berend, Wojciech Samek et al.</p>
    <p style="margin: 5px 0;"><strong>📂 Categories:</strong> Multimodal Generation</p>
    <p style="margin: 5px 0;"><strong>🔖 Keywords:</strong> vlm</p>
    <p style="margin: 5px 0;"><strong>📅 Published:</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 Abstract</h3>
    <p style="line-height: 1.8; color: #333;">Neuron labeling assigns textual descriptions to internal units of deep networks. Existing approaches typically rely on highly activating examples, often yielding broad or misleading labels by focusing on dominant but incidental visual factors. Prior work such as FALCON introduced contrastive examples -- inputs that are semantically similar to activating examples but elicit low activations -- to sharpen explanations, but it primarily addresses subspace-level interpretability rather than scalable neuron-level labeling. We revisit contrastive explanations for neuron-level labeling in two stages: (1) candidate label generation with vision language models (VLMs) and (2) label assignment with CLIP-like encoders. First, we show that providing contrastive image sets to VLMs yields candidate labels that are more specific and more faithful. Second, we introduce Contrastive Semantic Projection (CSP), an extension of SemanticLens that incorporates contrastive examples directly into its CLIP-based scoring and selection pipeline. Across extensive experiments and a case study on melanoma detection, contrastive labeling improves both faithfulness and semantic granularity over state-of-the-art baselines. Our results demonstrate that contrastive examples are a simple yet powerful and currently underutilized component of neuron labeling and analysis pipelines.</p>
  </div>
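  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> a minimal reading of the contrastive scoring idea above, assuming CLIP-style embeddings: a candidate label is ranked by how well it matches the activating images minus how well it matches the contrastive ones. All names and the exact scoring rule are our assumptions, not the paper's code.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import numpy as np

def score_labels(act_feats, con_feats, label_feats):
    """act_feats: (Na, d) L2-normalized embeddings of activating images;
    con_feats: (Nc, d) embeddings of contrastive examples;
    label_feats: (L, d) embeddings of candidate text labels."""
    sim_act = label_feats @ act_feats.T  # (L, Na) cosine similarities
    sim_con = label_feats @ con_feats.T  # (L, Nc)
    # A faithful label explains activating images but NOT contrastive ones.
    return sim_act.mean(axis=1) - sim_con.mean(axis=1)

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
act = l2n(rng.normal(size=(8, 512)))   # stand-ins for CLIP image features
con = l2n(rng.normal(size=(8, 512)))
lab = l2n(rng.normal(size=(5, 512)))   # stand-ins for CLIP text features
print("selected label index:", int(np.argmax(score_labels(act, con, lab))))
</code></pre>
  </div>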
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22477v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22477v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams</title>
      <link>https://arxiv.org/abs/2604.22476v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22476v1</guid>
      <pubDate>Fri, 24 Apr 2026 11:54:56 GMT</pubDate>
      <dc:creator>Marco Pegoraro, Jonas Seng, Dustin Heller et al.</dc:creator>
      <category>图像生成</category>
      <description>Disciplines such as business process management and process mining aid organizations by discovering insights about processes on the basis of recorded event data. However, an obstacle to process analys...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">All Eyes on the Workflow: Automated and Efficient Event Discovery from Video Streams</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Marco Pegoraro, Jonas Seng, Dustin Heller et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> gan</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Disciplines such as business process management and process mining aid organizations by discovering insights about processes on the basis of recorded event data. However, an obstacle to process analysis is data multi-modality: for instance, data in video form are not directly interpretable as events. In this work, we present SnapLog, an approach to extract event data from videos by converting frames to feature vectors using image embeddings and performing temporal segmentation through frame-wise similarity matrices. A generalized few-shot classification is then used to assign labels to the video segments, yielding labeled, timestamped sub-sequences of frames that are interpretable as events. Conventional process mining techniques can be used to analyze the resulting data. We show that our approach produces logs that accurately reflect the process in the videos.</p>
  </div>
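  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> a minimal version of the segmentation step the abstract names, building a frame-wise similarity matrix from frame embeddings and cutting segments where consecutive frames diverge. The threshold rule and the synthetic features are our assumptions; any image-embedding model could supply the features.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import numpy as np

def segment_by_similarity(frame_feats, threshold=0.8):
    """frame_feats: (T, d) image embeddings, one per video frame.
    Returns (start, end) index pairs for the temporal segments."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sim = f @ f.T                 # (T, T) frame-wise similarity matrix
    consec = np.diag(sim, k=1)    # similarity of frame t to frame t+1
    cuts = np.where(consec &lt; threshold)[0] + 1
    bounds = [0, *cuts.tolist(), len(f)]
    return list(zip(bounds[:-1], bounds[1:]))

# Three synthetic "activities" of 10 frames each; expect cuts at 10 and 20.
rng = np.random.default_rng(0)
feats = np.vstack([np.tile(v, (10, 1)) + 0.01 * rng.normal(size=(10, 64))
                   for v in rng.normal(size=(3, 64))])
print(segment_by_similarity(feats))
</code></pre>
  </div>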
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22476v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22476v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] NRGS: Neural Regularization for Robust 3D Semantic Gaussian Splatting</title>
      <link>https://arxiv.org/abs/2604.22439v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22439v1</guid>
      <pubDate>Fri, 24 Apr 2026 10:57:05 GMT</pubDate>
      <dc:creator>Zaiyan Yang, Xinpeng Liu, Heng Guo et al.</dc:creator>
      <category>模型架构</category>
      <description>We propose a neural regularization method that refines the noisy 3D semantic field produced by lifting multi-view inconsistent 2D features, in order to obtain an accurate and robust 3D semantic Gaussi...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">NRGS: Neural Regularization for Robust 3D Semantic Gaussian Splatting</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Zaiyan Yang, Xinpeng Liu, Heng Guo et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">We propose a neural regularization method that refines the noisy 3D semantic field produced by lifting multi-view inconsistent 2D features, in order to obtain an accurate and robust 3D semantic Gaussian Splatting. The 2D features extracted from vision foundation models suffer from multi-view inconsistency due to a lack of cross-view constraints. Lifting these inconsistent features directly into 3D Gaussians results in a noisy semantic field, which degrades the performance of downstream tasks. Previous methods either focus on obtaining consistent multi-view features in the preprocessing stage or aim to mitigate noise through improved optimization strategies, often at the cost of increased preprocessing time or expensive computational overhead. In contrast, we introduce a variance-aware conditional MLP that operates directly on the 3D Gaussians, leveraging their geometric and appearance attributes to correct semantic errors in 3D space. Experiments on different datasets show that our method enhances the accuracy of lifted semantics, providing an efficient and effective approach to robust 3D semantic Gaussian Splatting.</p>
  </div>
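  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> a minimal PyTorch guess at the general shape of the component the abstract names, a variance-aware conditional MLP over per-Gaussian attributes. All dimensions and layer sizes are our illustrative assumptions, not the released architecture.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import torch
import torch.nn as nn

class VarianceAwareMLP(nn.Module):
    """Refines each Gaussian's semantic feature from its geometry (position,
    rotation, scale), appearance (color), noisy lifted semantics, and a
    per-Gaussian semantic variance that flags cross-view disagreement."""
    def __init__(self, geo_dim=10, app_dim=3, sem_dim=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(geo_dim + app_dim + sem_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, sem_dim))

    def forward(self, geo, app, sem, var):
        return self.net(torch.cat([geo, app, sem, var], dim=-1))

n = 1024
mlp = VarianceAwareMLP()
refined = mlp(torch.randn(n, 10), torch.randn(n, 3),
              torch.randn(n, 16), torch.rand(n, 1))
print(refined.shape)  # torch.Size([1024, 16])
</code></pre>
  </div>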
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22439v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22439v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] [模型架构] SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments</title>
      <link>https://arxiv.org/abs/2604.22409v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22409v1</guid>
      <pubDate>Fri, 24 Apr 2026 10:06:41 GMT</pubDate>
      <dc:creator>Chih-Ting Liao, Xi Xiao, Chunlei Meng et al.</dc:creator>
      <category>多模态生成</category>
      <category>模型架构</category>
      <description>Multimodal large language models (MLLMs) have advanced static visual--spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be conti...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Chih-Ting Liao, Xi Xiao, Chunlei Meng et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vlm, dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Multimodal large language models (MLLMs) have advanced static visual--spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22409v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22409v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Region Matters: Efficient and Reliable Region-Aware Visual Place Recognition</title>
      <link>https://arxiv.org/abs/2604.22390v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22390v1</guid>
      <pubDate>Fri, 24 Apr 2026 09:28:35 GMT</pubDate>
      <dc:creator>Shunpeng Chen, Yukun Song, Changwei Wang et al.</dc:creator>
      <category>模型架构</category>
      <description>Visual Place Recognition (VPR) determines a query image&apos;s geographic location by matching it against geotagged databases. However, existing methods struggle with perceptual aliasing caused by irreleva...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Region Matters: Efficient and Reliable Region-Aware Visual Place Recognition</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Shunpeng Chen, Yukun Song, Changwei Wang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Visual Place Recognition (VPR) determines a query image&apos;s geographic location by matching it against geotagged databases. However, existing methods struggle with perceptual aliasing caused by irrelevant regions and inefficient re-ranking due to rigid candidate scheduling. To address these issues, we introduce FoL++, a method combining robust discriminative region modeling with adaptive re-ranking. Specifically, we propose a Reliability Estimation Branch to generate spatial reliability maps that explicitly model occlusion resistance. This representation is further optimized by two spatial alignment losses (SAL and SCEL) to effectively align features and highlight salient regions. For weakly supervised learning without manual annotations, a pseudo-correspondence strategy generates dense local feature supervision directly from aggregation clusters. Our Adaptive Candidate Scheduler dynamically resizes candidate pools based on global similarity. By weighting local matches by reliability and adaptively fusing global and local evidence, FoL++ surpasses traditional independent matching systems. Extensive experiments across seven benchmarks demonstrate that FoL++ achieves state-of-the-art performance with a lightweight memory footprint, improving inference speed by 40% over FoL. Code and models will be released (and merged with FoL) at https://github.com/chenshunpeng/FoL.</p>
  </div>
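  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> one plausible reading of the Adaptive Candidate Scheduler mentioned above: shrink the re-ranking pool when the global top-1 match is confident, grow it when it is ambiguous. The margin-to-size mapping below is entirely our own illustrative choice.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import numpy as np

def adaptive_pool_size(global_sims, k_min=5, k_max=100, margin_scale=0.2):
    """global_sims: query-to-database global similarities (any order)."""
    top1, top2 = np.sort(global_sims)[::-1][:2]
    margin = float(top1 - top2)          # confidence of the global match
    frac = np.clip(1.0 - margin / margin_scale, 0.0, 1.0)
    return int(round(k_min + frac * (k_max - k_min)))

rng = np.random.default_rng(0)
print("re-rank", adaptive_pool_size(rng.random(5000)), "candidates")
</code></pre>
  </div>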
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22390v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22390v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[视频生成] HFS-TriNet: A Three-Branch Collaborative Feature Learning Network for Prostate Cancer Classification from TRUS Videos</title>
      <link>https://arxiv.org/abs/2604.22388v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22388v1</guid>
      <pubDate>Fri, 24 Apr 2026 09:24:19 GMT</pubDate>
      <dc:creator>Xu Lu, Qianhong Peng, Qihao Zhou et al.</dc:creator>
      <category>视频生成</category>
      <description>Transrectal ultrasound (TRUS) imaging is a cost-effective and non-invasive modality widely used in the diagnosis of prostate cancer. The computer-aided diagnosis (CAD) relying on TRUS images has been ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">HFS-TriNet: A Three-Branch Collaborative Feature Learning Network for Prostate Cancer Classification from TRUS Videos</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Xu Lu, Qianhong Peng, Qihao Zhou et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 视频生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> temporal consistency</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Transrectal ultrasound (TRUS) imaging is a cost-effective and non-invasive modality widely used in the diagnosis of prostate cancer. The computer-aided diagnosis (CAD) relying on TRUS images has been extensively investigated recently. Compared to static images, TRUS video provides richer spatial-temporal information, which make it a promising alternative for improving the accuracy and robustness of CAD systems. However, TRUS video analysis also introduces new challenges. These include information redundancy, which increases computational costs; high intra- and inter-class similarity, which complicates feature extraction; and a low signal-to-noise ratio, which hinders the identification of clinically relevant information. To address these problems, we propose a heuristic frame selection (HFS) and a three-branch collaborative feature learning network (HFS-TriNet) for prostate cancer classification from TRUS videos. Specifically, selecting a clip of video frames at intervals for training can mitigate redundancy. The HFS strategy dynamically initializes the starting point of each training clip, which ensures that the sampled clips span the entire video sequence. For better feature extraction, besides a regular ResNet50 branch, we also utilize 1) a large model branch based a pre-trained medical segment anything model (SAM) to extract deep features of each frame and a normalization-based attention module to explore the temporal consistency; and 2) a wavelet transform convolutional residual (WTCR) branch that extracts lesion edge information in the high-frequency domain and performs denoising in the low-frequency domain.</p>
  </div>
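  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> a minimal version of the interval-sampling idea in the abstract: frames are taken at a fixed stride, and the clip's starting point is re-drawn each time so that training clips collectively span the whole video. Clip length and stride values are our placeholders.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import random

def sample_clip(num_frames, clip_len=16, stride=4, rng=random):
    """Return frame indices for one training clip, taken at `stride`
    intervals, with a dynamically initialized starting point."""
    span = (clip_len - 1) * stride + 1
    start = rng.randrange(max(num_frames - span + 1, 1))
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]

print(sample_clip(300))  # e.g. [57, 61, 65, ...]
</code></pre>
  </div>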
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22388v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22388v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 视频生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[控制与编辑] [扩散模型] [图像生成] [模型架构] Efficient Diffusion Distillation via Embedding Loss</title>
      <link>https://arxiv.org/abs/2604.22379v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22379v1</guid>
      <pubDate>Fri, 24 Apr 2026 09:16:47 GMT</pubDate>
      <dc:creator>Jincheng Ying, Yitao Chen, Li Wenlin et al.</dc:creator>
      <category>控制与编辑</category>
      <category>扩散模型</category>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Recent advances in distilling expensive diffusion models into efficient few-step generators show significant promise. However, these methods typically demand substantial computational resources and ex...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Efficient Diffusion Distillation via Embedding Loss</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Jincheng Ying, Yitao Chen, Li Wenlin et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 控制与编辑, 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> gan, conditional generation, diffusion model, dit, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Recent advances in distilling expensive diffusion models into efficient few-step generators show significant promise. However, these methods typically demand substantial computational resources and extended training periods, limiting accessibility for resource-constrained researchers, and existing supplementary loss functions have notable limitations. Regression loss requires pre-generating large datasets before training and limits the student model to the teacher&apos;s performance, while GAN-based losses suffer from training instability and require careful tuning. In this paper, we propose Embedding Loss (EL), a novel supplementary loss function that complements existing diffusion distillation methods to enhance generation quality and accelerate training with smaller batch sizes. Leveraging feature embeddings from a diverse set of randomly initialized networks, EL effectively aligns the feature distributions between the distilled few-step generator and the original data. By computing Maximum Mean Discrepancy (MMD) in the embedded feature space, EL ensures robust distribution matching, thereby preserving sample fidelity and diversity during distillation. Within distribution matching distillation frameworks, EL demonstrates strong empirical performance for one-step generators. On the CIFAR-10 dataset, our approach achieves state-of-the-art FID values of 1.475 for unconditional generation and 1.380 for conditional generation. Beyond CIFAR-10, we further validate EL across multiple benchmarks and distillation methods, including ImageNet, AFHQ-v2, and FFHQ datasets, using DMD, DI, and CM distillation frameworks, demonstrating consistent improvements over existing one-step distillation methods. Our method also reduces training iterations by up to 80%, offering a more practical and scalable solution for deploying diffusion-based generative models in resource-constrained environments.</p>
  </div>
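  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> a minimal version of the core computation described above: embed real and generated batches with a frozen, randomly initialized network and compare the two feature distributions with an RBF-kernel MMD. The network shape and the bandwidth are our assumptions, not the paper's configuration.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import torch
import torch.nn as nn

# Random, frozen embedding network: no pretraining is needed because its
# features are used only to compare distributions.
embed = nn.Sequential(nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
                      nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
                      nn.Flatten()).requires_grad_(False)

def mmd_rbf(x, y, sigma=1.0):
    x = x / x.norm(dim=1, keepdim=True)   # normalize so one bandwidth fits
    y = y / y.norm(dim=1, keepdim=True)
    xx, yy, xy = x @ x.T, y @ y.T, x @ y.T
    rx, ry = xx.diag().unsqueeze(0), yy.diag().unsqueeze(0)
    k = lambda d: torch.exp(-d / (2 * sigma ** 2))  # RBF kernel on sq. dists
    return (k(rx.T + rx - 2 * xx).mean() + k(ry.T + ry - 2 * yy).mean()
            - 2 * k(rx.T + ry - 2 * xy).mean())

real, fake = torch.randn(16, 3, 32, 32), torch.randn(16, 3, 32, 32)
print(float(mmd_rbf(embed(real), embed(fake))))
</code></pre>
  </div>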
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22379v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22379v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 控制与编辑, 扩散模型 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] One Shot Learning for Edge Detection on Point Clouds</title>
      <link>https://arxiv.org/abs/2604.22354v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22354v1</guid>
      <pubDate>Fri, 24 Apr 2026 08:42:47 GMT</pubDate>
      <dc:creator>Zhikun Tu, Yuhe Zhang, Yiou Jia et al.</dc:creator>
      <category>模型架构</category>
      <description>Each scanner possesses its unique characteristics and exhibits its distinct sampling error distribution. Training a network on a dataset that includes data collected from different scanners is less ef...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">One Shot Learning for Edge Detection on Point Clouds</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Zhikun Tu, Yuhe Zhang, Yiou Jia et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Each scanner possesses its unique characteristics and exhibits its distinct sampling error distribution. Training a network on a dataset that includes data collected from different scanners is less effective than training it on data specific to a single scanner. Therefore, we present a novel one-shot learning method allowing for edge extraction on point clouds, by learning the specific data distribution of the target point cloud, and thus achieve superior results compared to networks that were trained on general data distributions. More specifically, we present how to train a lightweight network named OSFENet (One-Shot edge Feature Extraction Network), by designing a filtered-KNN-based surface patch representation that supports a one-shot learning framework. Additionally, we introduce an RBF_DoS module, which integrates Radial Basis Function-based Descriptor of the Surface patch, highly beneficial for the edge extraction on point clouds. The advantage of the proposed OSFENet is demonstrated through comparative analyses against 7 baselines on the ABC dataset, and its practical utility is validated by results across diverse real-scanned datasets, including indoor scenes like S3DIS dataset, and outdoor scenes such as the Semantic3D dataset and UrbanBIS dataset.</p>
  </div>
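  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> a hedged guess at what a filtered-KNN surface patch could look like: gather each point's K nearest neighbors, then drop neighbors whose normals disagree with the center point so a patch does not leak across sharp edges. The normal-agreement filter is purely our assumption; the abstract does not spell out the actual filtering rule.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import numpy as np
from scipy.spatial import cKDTree

def filtered_knn_patches(points, normals, k=16, cos_thresh=0.7):
    """points, normals: (N, 3). Returns one centered local patch per point."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k + 1)  # first neighbor is the point itself
    patches = []
    for i, nbrs in enumerate(idx):
        nbrs = nbrs[1:]
        keep = nbrs[(normals[nbrs] @ normals[i]) &gt; cos_thresh]
        patches.append(points[keep] - points[i])  # centered surface patch
    return patches

rng = np.random.default_rng(0)
pts = rng.random((200, 3))
nrm = np.tile([0.0, 0.0, 1.0], (200, 1))
print(len(filtered_knn_patches(pts, nrm)), "patches")
</code></pre>
  </div>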
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22354v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22354v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [模型架构] PoseFM: Relative Camera Pose Estimation Through Flow Matching</title>
      <link>https://arxiv.org/abs/2604.22350v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22350v1</guid>
      <pubDate>Fri, 24 Apr 2026 08:36:53 GMT</pubDate>
      <dc:creator>Dominik Kuczkowski, Laura Ruotsalainen</dc:creator>
      <category>扩散模型</category>
      <category>模型架构</category>
      <description>Monocular visual odometry (VO) is a fundamental computer vision problem with applications in autonomous navigation, augmented reality and more. While deep learning-based methods have recently shown su...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">PoseFM: Relative Camera Pose Estimation Through Flow Matching</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Dominik Kuczkowski, Laura Ruotsalainen</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> flow matching, dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Monocular visual odometry (VO) is a fundamental computer vision problem with applications in autonomous navigation, augmented reality and more. While deep learning-based methods have recently shown superior accuracy compared to traditional geometric pipelines, particularly in environments where handcrafted features struggle due to poor structure or lighting conditions, most rely on deterministic regression, which lacks the uncertainty awareness required for robust applications. We propose PoseFM, the first framework to reformulate monocular frame-to-frame VO as a generative task using Flow Matching (FM). By leveraging FM, we model camera motion as a distribution rather than a point estimate, learning to transform noise into realistic pose predictions via continuous-time ODEs. This approach provides a principled mechanism for uncertainty estimation and enables robust motion inference under challenging visual conditions. In our evaluations, PoseFM achieves strong performance on TartanAir, KITTI and TUM-RGBD benchmarks, achieving the lowest absolute trajectory error (ATE) on some of the trajectories and overall being competitive with the best frame-to-frame monocular VO methods. Code and model checkpoints will be made available at https://github.com/helsinki-sda-group/posefm.</p>
  </div>
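  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> a minimal conditional flow-matching setup on a 6-DoF pose vector, assuming a linear noise-to-target path and Euler integration at inference. The network and conditioning are toy placeholders, not the PoseFM architecture; sampling repeatedly with fresh noise is one way to read off the uncertainty the abstract mentions.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, pose_dim=6, cond_dim=128, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(pose_dim + cond_dim + 1, hidden),
                                 nn.ReLU(), nn.Linear(hidden, pose_dim))
    def forward(self, x, t, cond):
        return self.net(torch.cat([x, t, cond], dim=-1))

def fm_loss(model, pose, cond):
    t = torch.rand(pose.size(0), 1)
    noise = torch.randn_like(pose)
    x_t = (1 - t) * noise + t * pose       # linear path from noise to pose
    return ((model(x_t, t, cond) - (pose - noise)) ** 2).mean()

@torch.no_grad()
def sample(model, cond, steps=20):
    x = torch.randn(cond.size(0), 6)
    for i in range(steps):
        t = torch.full((cond.size(0), 1), i / steps)
        x = x + model(x, t, cond) / steps  # Euler step of the learned ODE
    return x

model, cond = VelocityNet(), torch.randn(4, 128)
print(fm_loss(model, torch.randn(4, 6), cond).item(), sample(model, cond).shape)
</code></pre>
  </div>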
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22350v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22350v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] [模型架构] Thermal background reduction for mid-infrared imaging by low-rank background and sparse point-source modelling</title>
      <link>https://arxiv.org/abs/2604.22351v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22351v1</guid>
      <pubDate>Fri, 24 Apr 2026 08:36:53 GMT</pubDate>
      <dc:creator>R. A. R. Moens, A. G. M. Pietrow, B. Brandl et al.</dc:creator>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Mid-infrared astronomy from the ground faces critical challenges in accurately detecting and quantifying sources due to the dominant spatially and time-variable background noise. Moreover, chopping an...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Thermal background reduction for mid-infrared imaging by low-rank background and sparse point-source modelling</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> R. A. R. Moens, A. G. M. Pietrow, B. Brandl et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit, dit, lora</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Mid-infrared astronomy from the ground faces critical challenges in accurately detecting and quantifying sources due to the dominant spatially and time-variable background noise. Moreover, chopping and nodding, the traditional methods for dealing with these background issues, will not be technically feasible on the next generation of extremely large telescopes. This limitation requires the development of novel computational methods for a robust background reduction. We present and evaluate a novel method named LOw-RAnk Background ELimination (LORABEL) to improve the sensitivity of mid-infrared astronomical observations, without the need for classical telescope nodding, source masking, or other overheads in observing time. We applied a low-rank background-reduction strategy to (1) data taken on the ground with the VISIR with synthetically injected sources, and (2) airborne data from SOFIA. We compared the performance of our new method to classical chopping and nodding techniques, and analysed the effect on source photometry and detection precision for different observational scenarios. In regimes with a low signal-to-noise ratio (S/N $&lt;5$) in the ground-based VISIR data, LORABEL reduces variation in the photometric error with respect to chopping differences alone and even the classical chop-nod sequence, at the cost of introducing a bias. Secondly, we demonstrate that LORABEL increases detection precision in comparison to traditional background-reduction methods. For the SOFIA dataset, we achieve a $20-100$ fold decrease in mean background flux with respect to the traditional chop-nod method while preserving most of the source flux. Our findings suggest that LORABEL is applicable to a wider range of instrumental observation, that is, both ground-based and airborne, and it is a suitable tool in the context of faint-source detection.</p>
  </div>
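  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> a simplified stand-in for the decomposition named in the title: alternate a truncated-SVD fit of the slowly varying background with soft-thresholding of the residual, so the background lands in the low-rank term and point sources in the sparse term. Rank, threshold, and the toy data are our choices, not LORABEL's algorithm.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import numpy as np

def lowrank_sparse(frames, rank=1, lam=0.1, iters=25):
    """frames: (T, H*W) matrix, one flattened exposure per row.
    Returns (L, S): low-rank background and sparse point sources."""
    S = np.zeros_like(frames)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(frames - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]          # background
        R = frames - L
        S = np.sign(R) * np.maximum(np.abs(R) - lam, 0)   # point sources
    return L, S

T, H, W = 20, 32, 32
rng = np.random.default_rng(0)
X = np.outer(np.linspace(1, 2, T), rng.random(H * W))  # drifting background
X[:, 500] += 5.0                                       # one point source
L, S = lowrank_sparse(X)
print("source recovered:", bool(S[:, 500].mean() &gt; 1.0))
</code></pre>
  </div>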
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22351v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22351v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] FILTR: Extracting Topological Features from Pretrained 3D Models</title>
      <link>https://arxiv.org/abs/2604.22334v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22334v1</guid>
      <pubDate>Fri, 24 Apr 2026 08:07:03 GMT</pubDate>
      <dc:creator>Louis Martinez, Maks Ovsjanikov</dc:creator>
      <category>模型架构</category>
      <description>Recent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">FILTR: Extracting Topological Features from Pretrained 3D Models</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Louis Martinez, Maks Ovsjanikov</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> transformer, mae</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Recent advances in pretraining 3D point cloud encoders (e.g., Point-BERT, Point-MAE) have produced powerful models, whose abilities are typically evaluated on geometric or semantic tasks. At the same time, topological descriptors have been shown to provide informative summaries of a shape&apos;s multiscale structure. In this paper we pose the question whether topological information can be derived from features produced by 3D encoders. To address this question, we first introduce DONUT, a synthetic benchmark with controlled topological complexity, and propose FILTR (Filtration Transformer), a learnable framework to predict persistence diagrams directly from frozen encoders. FILTR adapts a transformer decoder to treat diagram generation as a set prediction task. Our analysis on DONUT reveals that existing encoders retain only limited global topological signals, yet FILTR successfully leverages information produced by these encoders to approximate persistence diagrams. Our approach enables, for the first time, data-driven extraction of persistence diagrams from raw point clouds through an efficient learnable feed-forward mechanism.</p>
  </div>
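  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> a minimal version of the set-prediction view of diagram regression: match predicted (birth, death) points to ground truth with SciPy's Hungarian solver and average the matched distances, in the spirit of DETR-style losses. This illustrates the framing only, not the FILTR objective.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import numpy as np
from scipy.optimize import linear_sum_assignment

def diagram_matching_loss(pred, target):
    """pred: (P, 2) predicted (birth, death) pairs; target: (G, 2).
    Matches min(P, G) pairs one-to-one at minimal total cost."""
    cost = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)   # Hungarian matching
    return cost[rows, cols].mean()

pred = np.array([[0.10, 0.90], [0.20, 0.35], [0.00, 0.05]])
target = np.array([[0.00, 1.00], [0.25, 0.30]])
print(diagram_matching_loss(pred, target))
</code></pre>
  </div>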
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22334v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22334v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] ChangeQuery: Advancing Remote Sensing Change Analysis for Natural and Human-Induced Disasters from Visual Detection to Semantic Understanding</title>
      <link>https://arxiv.org/abs/2604.22333v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22333v1</guid>
      <pubDate>Fri, 24 Apr 2026 08:05:39 GMT</pubDate>
      <dc:creator>Dongwei Sun, Jing Yao, Kan Wei et al.</dc:creator>
      <category>模型架构</category>
      <description>Rapid situational awareness is critical in post-disaster response. While remote sensing damage assessment is evolving from pixel-level change detection to high-level semantic analysis, existing vision...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">ChangeQuery: Advancing Remote Sensing Change Analysis for Natural and Human-Induced Disasters from Visual Detection to Semantic Understanding</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Dongwei Sun, Jing Yao, Kan Wei et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Rapid situational awareness is critical in post-disaster response. While remote sensing damage assessment is evolving from pixel-level change detection to high-level semantic analysis, existing vision-language methodologies still struggle to provide actionable intelligence for complex strategic queries. They remain severely constrained by unimodal optical dependence, a prevailing bias towards natural disasters, and a fundamental lack of grounded interactivity. To address these limitations, we present ChangeQuery, a unified multimodal framework designed for comprehensive, all-weather disaster situation awareness. To overcome modality constraints and scenario biases, we construct the Disaster-Induced Change Query (DICQ) dataset, a large-scale benchmark coupling pre-event optical semantics with post-event SAR structural features across a balanced distribution of natural catastrophes and armed conflicts. Furthermore, to provide the high-quality supervision required for interactive reasoning, we propose a novel Automated Semantic Annotation Pipeline. Adhering to a ``statistics-first, generation-later&apos;&apos; paradigm, this engine automatically transforms raw segmentation masks into grounded, hierarchical instruction sets, effectively equipping the model with fine-grained spatial and quantitative awareness. Trained on this structured data, the ChangeQuery architecture operates as an interactive disaster analyst. It supports multi-task reasoning driven by diverse user queries, delivering precise damage quantification, region-specific descriptions, and holistic post-disaster summaries. Extensive experiments demonstrate that ChangeQuery establishes a new state-of-the-art, providing a robust and interpretable solution for complex disaster monitoring. The code is available at \href{https://sundongwei.github.io/changequery/}{https://sundongwei.github.io/changequery/}.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22333v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22333v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization</title>
      <link>https://arxiv.org/abs/2604.22310v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22310v1</guid>
      <pubDate>Fri, 24 Apr 2026 07:44:57 GMT</pubDate>
      <dc:creator>Jeonggon Kim, Heejoon Moon, Je Hyeong Hong</dc:creator>
      <category>模型架构</category>
      <description>Privacy-Preserving Image Queries (PPIQ) are an emerging mechanism for cloud-based visual localization, enabling pose estimation from obfuscated features instead of private images or raw keypoints. How...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Revisiting Geometric Obfuscation with Dual Convergent Lines for Privacy-Preserving Image Queries in Visual Localization</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Jeonggon Kim, Heejoon Moon, Je Hyeong Hong</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Privacy-Preserving Image Queries (PPIQ) are an emerging mechanism for cloud-based visual localization, enabling pose estimation from obfuscated features instead of private images or raw keypoints. However, the main approaches for PPIQ, primarily geometry-based and segmentation-based obfuscation, both suffer from vulnerabilities to recent privacy attacks. In particular, a fundamental limitation of geometry-based obfuscation is that the spatial distribution of obfuscated neighboring lines still effectively surrounds the original keypoint location, providing exploitable cues for recovering the original points. We revisit this geometric paradigm and introduce Dual Convergent Lines (DCL), a novel keypoint obfuscation method demonstrating strong resilience against such attack. DCL places two fixed anchors on a central partition line and lifts each keypoint to a line originating from one of them, with the active anchor determined by the keypoint&apos;s location. This arrangement invalidates the geometry-recovery attack by making its optimization ill-posed: Neighboring lines either misleadingly converge to one anchor, yielding a trivial solution, or become near-parallel at the partition boundary, yielding an unstable high-variance solution. Both outcomes thwart point recovery. DCL is also compatible with an existing line-based solver, enabling deployment in traditional localization pipelines. Experiments on both indoor and large-scale outdoor datasets demonstrate DCL&apos;s robustness against privacy attacks, efficiency, and scalability, while achieving practical localization performance.</p>
  </div>
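  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>Illustrative sketch:</strong> a minimal version of the obfuscation rule the abstract describes: two fixed anchors sit on a central partition line, the active anchor is chosen from the keypoint's location, and only the line through anchor and keypoint is released. The coordinates and the side-based selection rule are our illustrative assumptions.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import numpy as np

ANCHOR_A = np.array([320.0, 120.0])  # both anchors lie on the central
ANCHOR_B = np.array([320.0, 360.0])  # partition line x = 320
PARTITION_X = 320.0

def obfuscate(keypoint):
    """Replace a 2D keypoint with a line through one of two fixed anchors;
    the keypoint's own coordinates are discarded."""
    anchor = ANCHOR_A if keypoint[0] &lt; PARTITION_X else ANCHOR_B
    d = keypoint - anchor
    return anchor, d / np.linalg.norm(d)   # line = anchor + t * direction

anchor, direction = obfuscate(np.array([101.5, 243.0]))
# Lines from one side all pass through the same anchor, so a recovery attack
# either converges trivially to the anchor or turns unstable near x = 320.
print(anchor, direction)
</code></pre>
  </div>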
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22310v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22310v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] [模型架构] Knowledge Visualization: A Benchmark and Method for Knowledge-Intensive Text-to-Image Generation</title>
      <link>https://arxiv.org/abs/2604.22302v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22302v1</guid>
      <pubDate>Fri, 24 Apr 2026 07:33:52 GMT</pubDate>
      <dc:creator>Ran Zhao, Sheng Jin, Size Wu et al.</dc:creator>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Recent text-to-image (T2I) models have demonstrated impressive capabilities in photorealistic synthesis and instruction following. However, their reliability in knowledge-intensive settings remains la...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Knowledge Visualization: A Benchmark and Method for Knowledge-Intensive Text-to-Image Generation</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Ran Zhao, Sheng Jin, Size Wu et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> text-to-image, dit, t2i, image generation</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Recent text-to-image (T2I) models have demonstrated impressive capabilities in photorealistic synthesis and instruction following. However, their reliability in knowledge-intensive settings remains largely unexplored. Unlike natural image generation, knowledge visualization requires not only semantic alignment but also strict adherence to domain knowledge, structural constraints, and symbolic conventions, exposing a critical gap between visual plausibility and scientific correctness. To systematically study this problem, we introduce KVBench, a curriculum-grounded benchmark for evaluating knowledge-intensive T2I generation. KVBench covers six senior high-school subjects: Biology, Chemistry, Geography, History, Mathematics, and Physics. The benchmark consists of 1,800 expert-curated prompts derived from over 30 authoritative textbooks. Using this benchmark, we evaluate 14 state-of-the-art open- and closed-source models, revealing substantial deficiencies in logical reasoning, symbolic precision, and multilingual robustness, with open-source models consistently underperforming proprietary systems. To address these limitations, we further propose KE-Check, a two-stage framework that improves scientific fidelity via (1) Knowledge Elaboration for structured prompt enrichment, and (2) Checklist-Guided Refinement for explicit constraint enforcement through violation identification and constraint-guided editing. KE-Check effectively mitigates scientific hallucinations, narrowing the performance gap between open-source and leading closed-source models. Data and codes are publicly available at https://github.com/zhaoran66/KVBench.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22302v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22302v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] [模型架构] Evaluation of image simulation open source solutions for simulation of synthetic images in lunar environment</title>
      <link>https://arxiv.org/abs/2604.22296v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22296v1</guid>
      <pubDate>Fri, 24 Apr 2026 07:26:48 GMT</pubDate>
      <dc:creator>Jai G Singla, Hinal B Patel, Nitant Dube</dc:creator>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Synthetic image generation is one of the crucial input for planetary missions. It enables researchers and engineers to visualize planned planetary missions, test imaging systems and plan exploration a...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Evaluation of image simulation open source solutions for simulation of synthetic images in lunar environment</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Jai G Singla, Hinal B Patel, Nitant Dube</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit, dit, lora, image generation</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Synthetic image generation is one of the crucial input for planetary missions. It enables researchers and engineers to visualize planned planetary missions, test imaging systems and plan exploration activities in a virtual environment before actual deployment. Image simulation is essential for assessing landing sites, detecting hazards, and validating navigation systems in a missions. This study offers a detailed evaluation of various image simulation approaches for the lunar environment, with particular emphasis on the effects of different camera models and light illumination conditions on the quality of synthetic lunar images. These images are produced using real Digital Elevation Models (DEM) and terrain data derived from instruments such as Chandrayaan-2 Orbiter High Resolution Camera (OHRC) and NASA&apos;s Wide Angle Camera (WAC), and Narrow Angle Camera (NAC) instruments. This research aims to improve the reliability of synthetic imagery in supporting autonomous navigation and decision-making systems in lunar exploration. This work contributes to the development of more effective tools for generating important information for future lunar missions and enhances the understanding of the moon&apos;s surface environment.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22296v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22296v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] [模型架构] DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning</title>
      <link>https://arxiv.org/abs/2604.22281v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22281v1</guid>
      <pubDate>Fri, 24 Apr 2026 06:51:58 GMT</pubDate>
      <dc:creator>Joonmyung Choi, Sanghyeok Lee, Jongha Kim et al.</dc:creator>
      <category>多模态生成</category>
      <category>模型架构</category>
      <description>Recent advances in vision-language models have demonstrated remarkable performance across diverse multi-modal tasks, including document question answering that leverages structured visual cues from te...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">DocPrune:Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Joonmyung Choi, Sanghyeok Lee, Jongha Kim et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Recent advances in vision-language models have demonstrated remarkable performance across diverse multi-modal tasks, including document question answering that leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to the inefficient consumption of substantial computational resources, especially for long documents. We observe that existing token-reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DocPrune, a training-free and progressive document token pruning framework designed for efficient long-document understanding. The proposed method preserves only the essential tokens for the task while removing unnecessary ones, such as background or question-irrelevant tokens. Moreover, it automatically selects the appropriate layers to initiate token pruning based on the model&apos;s level of comprehension. Our experiments on the M3DocRAG show that DocPrune improves throughput by 3.0x and 3.3x in the encoder and decoder, respectively, while boosting the F1 score by +1.0, achieving both higher accuracy and efficiency without any additional training.</p>
  </div>
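  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-left: 4px solid #9E9E9E; border-radius: 8px;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>示意代码：</strong>下面是一段与"按问题相关性保留视觉 token"思路相近的 Python 草图：以问题 token 与视觉 token 的余弦相似度打分，只保留得分最高的一部分。prune_visual_tokens 及其打分方式均为假设性示意，并非 DocPrune 的原始实现。</p>
    <pre style="background: #2d2d2d; color: #f8f8f2; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import torch
import torch.nn.functional as F

def prune_visual_tokens(visual_tokens, question_tokens, keep_ratio=0.3):
    """visual_tokens: (N, d)；question_tokens: (M, d)。返回保留的 token 及其下标。"""
    v = F.normalize(visual_tokens, dim=-1)
    q = F.normalize(question_tokens, dim=-1)
    # 每个视觉 token 的相关性 = 与任一问题 token 的最大余弦相似度
    relevance = (v @ q.T).max(dim=-1).values                # (N,)
    k = max(1, int(keep_ratio * visual_tokens.size(0)))
    keep_idx = relevance.topk(k).indices.sort().values      # 保持原有顺序
    return visual_tokens[keep_idx], keep_idx

tokens = torch.randn(1024, 768)    # 文档 patch token（示例）
question = torch.randn(16, 768)    # 问题 token（示例）
kept, idx = prune_visual_tokens(tokens, question, keep_ratio=0.25)</code></pre>
  </div>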
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22281v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22281v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] [模型架构] CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation</title>
      <link>https://arxiv.org/abs/2604.22274v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22274v1</guid>
      <pubDate>Fri, 24 Apr 2026 06:34:45 GMT</pubDate>
      <dc:creator>Suiyang Guang, Chenyu Liu, Ruohan Zhang et al.</dc:creator>
      <category>多模态生成</category>
      <category>模型架构</category>
      <description>Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models gre...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Suiyang Guang, Chenyu Liu, Ruohan Zhang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-rounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-pecific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support, contact, containment, depth, motion, and state. A relation-conditioned evidence encoder extracts predicate-relevant cues, while a counterfactual verifier tests whether the relation score decreases when necessary vidence is removed and remains stable under irrelevant perturbations. We further introduce contradiction-aware predicate learning and graph-level preference optimization to improve fine-grained discrimination and global graph consistency. Experiments on conventional, open-vocabulary, and panoptic SGG benchmarks show that our method consistently improves standard recall-based metrics, unseen predicate generalization, and counterfactual grounding quality. These results demonstrate that moving from relation generation to relation verification leads to more reliable, interpretable, and evidence-grounded scene graphs.</p>
  </div>
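  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-left: 4px solid #9E9E9E; border-radius: 8px;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>示意代码：</strong>摘要中的反事实验证可以概括为两条判据：遮去必要证据后分数应明显下降，扰动无关区域后分数应基本不变。下面的 Python 草图仅示意这一控制流；score_fn、两个掩码与阈值均为假设接口，并非论文实现。</p>
    <pre style="background: #2d2d2d; color: #f8f8f2; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import torch

def counterfactual_verify(score_fn, image, relation, evidence_mask,
                          irrelevant_mask, drop_thresh=0.2, stable_thresh=0.05):
    """score_fn(image, relation) 返回标量张量，表示该关系得到的视觉支持度。"""
    base = score_fn(image, relation)
    # 反事实一：遮去被声称的证据区域，分数应显著下降
    drop = base - score_fn(image * (1 - evidence_mask), relation)
    # 反事实二：扰动与关系无关的区域，分数应基本不变
    noisy = image + irrelevant_mask * 0.1 * torch.randn_like(image)
    shift = (base - score_fn(noisy, relation)).abs()
    necessary = drop.item() > drop_thresh
    unstable = shift.item() > stable_thresh
    return necessary and not unstable</code></pre>
  </div>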
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22274v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22274v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset</title>
      <link>https://arxiv.org/abs/2604.22260v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22260v1</guid>
      <pubDate>Fri, 24 Apr 2026 06:09:41 GMT</pubDate>
      <dc:creator>Wenhui Huang, Songyan Zhang, Collister Chua et al.</dc:creator>
      <category>模型架构</category>
      <description>Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale m...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Wenhui Huang, Songyan Zhang, Collister Chua et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22260v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22260v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] [模型架构] OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space</title>
      <link>https://arxiv.org/abs/2604.22240v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22240v1</guid>
      <pubDate>Fri, 24 Apr 2026 05:30:35 GMT</pubDate>
      <dc:creator>Zhuding Liang, Tianyi Yan, Dubing Chen et al.</dc:creator>
      <category>多模态生成</category>
      <category>模型架构</category>
      <description>Generative world models increasingly rely on 4D occupancy for realistic autonomous driving simulation. However, existing generation frameworks depend on rigid geometric conditions (e.g., explicit traj...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">OccDirector: Language-Guided Behavior and Interaction Generation in 4D Occupancy Space</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Zhuding Liang, Tianyi Yan, Dubing Chen et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vlm, dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Generative world models increasingly rely on 4D occupancy for realistic autonomous driving simulation. However, existing generation frameworks depend on rigid geometric conditions (e.g., explicit trajectories) or simplistic attribute-level text, failing to orchestrate complex, sequential multi-agent interactions. To address this semantic-spatiotemporal gap, we propose OccDirector, a pioneering framework that generates 4D occupancy dynamics conditioned solely on natural language. Operating as a ``scenario director&apos;&apos;, OccDirector maps natural language scripts into physically plausible voxel dynamics without requiring geometric priors. Technically, it employs a VLM-driven Spatio-Temporal MMDiT equipped with a history-prefix anchoring strategy to ensure long-horizon interaction consistency. Furthermore, we introduce OccInteract-85k, a novel dataset uniquely annotated with multi-level language instructions: ranging from static layouts to intricate multi-agent behaviors, alongside a novel VLM-based evaluation benchmark. Extensive experiments demonstrate that OccDirector achieves state-of-the-art generation quality and unprecedented instruction-following capabilities, successfully shifting the paradigm from appearance synthesis to language-driven behavior orchestration.</p>
  </div>
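  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-left: 4px solid #9E9E9E; border-radius: 8px;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>示意代码：</strong>history-prefix anchoring 的一种常见理解是：把已生成的历史帧编码为 token 前缀，与文本条件拼接后一起送入生成模型，使长时生成保持一致。下面的草图仅示意这一拼接方式；frame_encoder 等接口为假设，并非论文实现。</p>
    <pre style="background: #2d2d2d; color: #f8f8f2; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import torch

def build_condition(history_frames, text_tokens, frame_encoder):
    """history_frames: 历史占据栅格列表；text_tokens: (L, d) 文本条件。"""
    # 把历史帧编码成 token 前缀，放在文本条件之前，
    # 使后续生成锚定在已经发生的内容上
    prefix = [frame_encoder(f) for f in history_frames]   # 每项 (n, d)
    return torch.cat(prefix + [text_tokens], dim=0)</code></pre>
  </div>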
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22240v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22240v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Towards Temporal Compositional Reasoning in Long-Form Sports Videos</title>
      <link>https://arxiv.org/abs/2604.22226v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22226v1</guid>
      <pubDate>Fri, 24 Apr 2026 05:02:03 GMT</pubDate>
      <dc:creator>Siyu Cao, Lu Zhang, Ruizhe Zeng et al.</dc:creator>
      <category>模型架构</category>
      <description>Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Towards Temporal Compositional Reasoning in Long-Form Sports Videos</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Siyu Cao, Lu Zhang, Ruizhe Zeng et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult, as answering questions requires both locating temporally sparse evidence and integrating it into reasoning. We attribute this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. To address these gaps, we introduce SportsTime, a large-scale benchmark for long-form sports video understanding, comprising 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations. Building on SportsTime, we propose Chain-of-Time Reasoning (CoTR), which treats reasoning as a process of temporally grounded evidence composition. Specifically, during training, CoTR introduces a temporal-reward GRPO to encourage temporally grounded reasoning. During inference, it employs an anchor-observe-infer evidence-seeking loop to iteratively localize, verify, and compose temporal evidence before producing the final answer. Experiments demonstrate the usefulness of SportsTime as a benchmark and the effectiveness of CoTR, which consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.</p>
  </div>
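  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-left: 4px solid #9E9E9E; border-radius: 8px;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>示意代码：</strong>下面以伪代码级的 Python 草图示意摘要中 anchor-observe-infer 证据搜寻循环的控制流。anchor、observe、infer 均为假设的模型调用接口，仅示意循环结构，并非论文实现。</p>
    <pre style="background: #2d2d2d; color: #f8f8f2; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>def evidence_seeking_loop(video, question, anchor, observe, infer, max_steps=5):
    evidence = []
    for _ in range(max_steps):
        # 1) anchor：提出最可能包含证据的时间窗
        window = anchor(video, question, evidence)
        # 2) observe：核验并描述该时间窗中的内容
        observation = observe(video, window)
        evidence.append((window, observation))
        # 3) infer：尝试作答；一旦置信则提前结束
        answer, confident = infer(question, evidence)
        if confident:
            return answer, evidence
    return infer(question, evidence)[0], evidence</code></pre>
  </div>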
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22226v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22226v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [图像生成] Breaking Watermarks in the Frequency Domain: A Modulated Diffusion Attack Framework</title>
      <link>https://arxiv.org/abs/2604.22220v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22220v1</guid>
      <pubDate>Fri, 24 Apr 2026 04:57:13 GMT</pubDate>
      <dc:creator>Chunpeng Wang, Binyan Qu, Xiaoyu Wang et al.</dc:creator>
      <category>扩散模型</category>
      <category>图像生成</category>
      <description>Digital image watermarking has advanced rapidly for copyright protection of generative AI, yet the comparatively limited progress in watermark attack techniques has broken the attack-defense balance a...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Breaking Watermarks in the Frequency Domain: A Modulated Diffusion Attack Framework</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Chunpeng Wang, Binyan Qu, Xiaoyu Wang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> diffusion model, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Digital image watermarking has advanced rapidly for copyright protection of generative AI, yet the comparatively limited progress in watermark attack techniques has broken the attack-defense balance and hindered further advances in the field. In this paper, we propose FMDiffWA, a frequency-domain modulated diffusion framework for watermark attacks. Specifically, we introduce a frequency-domain watermark modulation (FWM) module and incorporate it into the sampling stages both the forward and reverse diffusion processes. This mechanism enables selective modulation of watermark-related frequency components, thereby allowing FMDiffWA to effectively neutralize the invisible watermark signals while preserving the perceptual quality of the attacked watermarked images. To achieve a better trade-off between attack efficacy and visual fidelity, we reformulate the training strategy of conventional diffusion models by augmenting the canonical noise estimation objective with an auxiliary refinement constraint. Comprehensive experiments demonstrate that FMDiffWA achieves superior visual fidelity compared to existing watermark attacks, while exhibiting strong generalization across diverse watermarking schemes.</p>
  </div>
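  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-left: 4px solid #9E9E9E; border-radius: 8px;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>示意代码：</strong>下面的 Python 草图示意"对特定频段做选择性调制"这一基本操作：对图像做 FFT，按掩码衰减指定频带后再逆变换。modulate_frequencies 与环形频带掩码均为示意假设，并非 FMDiffWA 的实际模块。</p>
    <pre style="background: #2d2d2d; color: #f8f8f2; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import numpy as np

def modulate_frequencies(image, band_mask, attenuation=0.5):
    """衰减二维图像中由 band_mask 选中的频率分量。"""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    spectrum = spectrum * (1.0 - attenuation * band_mask)
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))

# 示例：构造一个中频环形频带掩码（半径 30 到 60 之间为 1）
h, w = 256, 256
yy, xx = np.mgrid[0:h, 0:w]
r = np.hypot(yy - h / 2, xx - w / 2)
band_mask = ((r - 30.0) * (60.0 - r) > 0).astype(float)
cleaned = modulate_frequencies(np.random.rand(h, w), band_mask)</code></pre>
  </div>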
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22220v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22220v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [图像生成] [模型架构] Multimodal Diffusion to Mutually Enhance Polarized Light and Low Resolution EBSD Data</title>
      <link>https://arxiv.org/abs/2604.22212v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22212v1</guid>
      <pubDate>Fri, 24 Apr 2026 04:30:02 GMT</pubDate>
      <dc:creator>Harry Dong, Timofey Efimov, Megna Shah et al.</dc:creator>
      <category>扩散模型</category>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>In spite of the utility of 3-D electron back-scattered diffraction (EBSD) microscopy, the data collection process can be time-consuming with serial-sectioning. Hence, it is natural to look at other mo...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Multimodal Diffusion to Mutually Enhance Polarized Light and Low Resolution EBSD Data</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Harry Dong, Timofey Efimov, Megna Shah et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> diffusion model, dit, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">In spite of the utility of 3-D electron back-scattered diffraction (EBSD) microscopy, the data collection process can be time-consuming with serial-sectioning. Hence, it is natural to look at other modalities, such as polarized light (PL) data, to accelerate EBSD data collection, supplemented with shared information. Complementarily, features in chaotic PL data could even be enriched with a handful of EBSD measurements. To inherently learn the complex dynamics between EBSD and PL to solve these inverse problems, we use an unconditional multimodal diffusion model, motivated by progress in diffusion models for inverse problems. Although trained solely on synthetic data once, our model has strong generalizable capabilities on real data which can be low-resolution, noisy, corrupted, and misregistered. With inference-time scaling, we show gains in performance on a variety of objectives including grain boundary prediction, super-resolution, and denoising. With our model, we demonstrate that there is little difference from full resolution performance with only 25% (1/4 the resolution) of EBSD data and corrupted PL data.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22212v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22212v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution</title>
      <link>https://arxiv.org/abs/2604.22192v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22192v1</guid>
      <pubDate>Fri, 24 Apr 2026 03:39:51 GMT</pubDate>
      <dc:creator>Xiangxi Zheng, Kuang He, Jiayi Hu et al.</dc:creator>
      <category>多模态生成</category>
      <description>Chart-to-code generation demands strict visual precision and syntactic correctness from Vision-Language Models (VLMs). However, existing approaches are fundamentally constrained by data-centric limita...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Xiangxi Zheng, Kuang He, Jiayi Hu et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vlm, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Chart-to-code generation demands strict visual precision and syntactic correctness from Vision-Language Models (VLMs). However, existing approaches are fundamentally constrained by data-centric limitations: despite the availability of growing chart-to-code datasets, simply scaling homogeneous chart-code pairs conflates visual perception with program logic, preventing models from fully leveraging the richness of multimodal supervision. We present CharTide, a novel data-centric framework that systematically redesigns both training and alignment data for chart-to-code generation. First, we construct a 2M-sample dataset via a Tri-Perspective Tuning strategy, explicitly decoupling training into visual perception, pure-text code logic, and modality fusion streams, enabling a 7B model to surpass specialized baselines using only supervised data. Second, we reformulate alignment as a data verification problem rather than a heuristic scoring task. To this end, we introduce an Inquiry-Driven RL framework grounded in the principle of information invariance: a downstream model should yield consistent answers to identical visual queries across both original and generated charts. Moving beyond rigid rule matching or VLM scoring, we employ a frozen Inspector to objectively verify generated charts through atomic QA tasks, providing verifiable reward signals based on answer accuracy. Experiments on ChartMimic, Plot2Code, and ChartX show that CharTide-7B/8B significantly outperforms open-source baselines, surpasses GPT-4o, and is competitive with GPT-5.</p>
  </div>
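  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-left: 4px solid #9E9E9E; border-radius: 8px;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>示意代码：</strong>摘要中的"信息不变性"奖励可以概括为：同一视觉问题在原图与生成图上应得到一致答案。下面的 Python 草图按"答案一致比例"计算奖励；inspector 为假设的冻结问答接口，并非论文实现。</p>
    <pre style="background: #2d2d2d; color: #f8f8f2; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>def qa_consistency_reward(inspector, original_chart, generated_chart, questions):
    """inspector(image, question) 返回答案字符串；奖励为两图答案一致的比例。"""
    agree = 0
    for q in questions:
        a_orig = inspector(original_chart, q)
        a_gen = inspector(generated_chart, q)
        if a_orig.strip().lower() == a_gen.strip().lower():
            agree += 1
    return agree / max(1, len(questions))</code></pre>
  </div>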
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22192v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22192v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] From Global to Local: Rethinking CLIP Feature Aggregation for Person Re-Identification</title>
      <link>https://arxiv.org/abs/2604.22190v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22190v1</guid>
      <pubDate>Fri, 24 Apr 2026 03:37:21 GMT</pubDate>
      <dc:creator>Aotian Zheng, Winston Sun, Bahaa Alattar et al.</dc:creator>
      <category>模型架构</category>
      <description>CLIP-based person re-identification (ReID) methods aggregate spatial features into a single global [CLS] token optimized for image-text alignment rather than spatial selectivity, making repre...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">From Global to Local: Rethinking CLIP Feature Aggregation for Person Re-Identification</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Aotian Zheng, Winston Sun, Bahaa Alattar et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit, dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">CLIP-based person re-identification (ReID) methods aggregate spatial features into a single global \texttt{[CLS]} token optimized for image-text alignment rather than spatial selectivity, making representations fragile under occlusion and cross-camera variation. We propose SAGA-ReID, which reconstructs identity representations by aligning intermediate patch tokens with anchor vectors parameterized in CLIP&apos;s text embedding space -- emphasizing spatially stable evidence while suppressing corrupted or absent regions, without requiring textual descriptions of individual images. Controlled experiments isolate the aggregation mechanism under two qualitatively distinct conditions -- synthetic masking, where identity signal is absent, and realistic human distractors, where an overlapping person introduces semantically confusing signal -- with SAGA&apos;s advantage over global pooling growing substantially as occlusion increases across both conditions. Benchmark evaluations confirm consistent gains over CLIP-ReID across standard and occluded settings, with the largest improvements where global pooling is most unreliable: up to +10.6 Rank-1 on occluded benchmarks. SAGA&apos;s aggregation outperforms dedicated sequential patch aggregation on a stronger backbone, confirming that structured reconstruction addresses a bottleneck that backbone quality and architectural complexity alone cannot resolve. Code available at https://github.com/ipl-uw/Structured-Anchor-Guided-Aggregation-for-ReID.</p>
  </div>
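  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-left: 4px solid #9E9E9E; border-radius: 8px;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>示意代码：</strong>下面的 Python 草图示意"锚向量引导的 patch 聚合"的基本形式：每个锚对全部 patch token 做 softmax 加权求和，从而弱化被遮挡或缺失区域的贡献。anchor_guided_aggregate 与各维度均为示意假设，并非论文代码。</p>
    <pre style="background: #2d2d2d; color: #f8f8f2; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import torch
import torch.nn.functional as F

def anchor_guided_aggregate(patch_tokens, anchors, temperature=0.07):
    """patch_tokens: (N, d)；anchors: (K, d)。返回每个锚一条特征：(K, d)。"""
    p = F.normalize(patch_tokens, dim=-1)
    a = F.normalize(anchors, dim=-1)
    # 每个锚对全部 patch 的注意力；被遮挡的 patch 得到低权重
    weights = torch.softmax((a @ p.T) / temperature, dim=-1)   # (K, N)
    return weights @ patch_tokens

patches = torch.randn(196, 512)   # ViT patch token（示例）
anchors = torch.randn(8, 512)     # 部件级锚向量（示例）
features = anchor_guided_aggregate(patches, anchors)</code></pre>
  </div>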
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22190v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22190v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] EvFlow-GS: Event Enhanced Motion Deblurring with Optical Flow for 3D Gaussian Splatting</title>
      <link>https://arxiv.org/abs/2604.22183v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22183v1</guid>
      <pubDate>Fri, 24 Apr 2026 03:17:52 GMT</pubDate>
      <dc:creator>Feiyu An, Yufei Deng, Zihui Zhang et al.</dc:creator>
      <category>模型架构</category>
      <description>Achieving sharp 3D reconstruction from motion-blurred images alone becomes challenging, motivating recent methods to incorporate event cameras, benefiting from microsecond temporal resolution. However...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">EvFlow-GS: Event Enhanced Motion Deblurring with Optical Flow for 3D Gaussian Splatting</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Feiyu An, Yufei Deng, Zihui Zhang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Achieving sharp 3D reconstruction from motion-blurred images alone becomes challenging, motivating recent methods to incorporate event cameras, benefiting from microsecond temporal resolution. However, they suffer from residual artifacts and blurry texture details due to misleading supervision from inaccurate event double integral priors and noisy, blurry events. In this study, we propose EvFlow-GS, a unified framework that leverages event streams and optical flow to optimize an end-to-end learnable double integral (LDI), camera poses, and 3D Gaussian Splatting (3DGS) jointly on-the-fly. Specifically, we first extract edge information from the events using optical flow and then formulate a novel event-based loss applied separately to different modules. Additionally, we exploit a novel event-residual prior to strengthen the supervision of intensity changes between images rendered from 3DGS. Finally, we integrate the outputs of both 3DGS and LDI into a joint loss, enabling their optimization to mutually facilitate each other. Experiments demonstrate the leading performance of our EvFlow-GS.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22183v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22183v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities</title>
      <link>https://arxiv.org/abs/2604.22177v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22177v1</guid>
      <pubDate>Fri, 24 Apr 2026 03:02:26 GMT</pubDate>
      <dc:creator>Peibo Song, Xiaotian Xue, Jinshuo Zhang et al.</dc:creator>
      <category>模型架构</category>
      <description>Multimodal MRI offers complementary information for brain tumor segmentation, but clinical scans often lack one or more modalities, which degrades segmentation performance. In this paper, we propose U...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Peibo Song, Xiaotian Xue, Jinshuo Zhang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Multimodal MRI offers complementary information for brain tumor segmentation, but clinical scans often lack one or more modalities, which degrades segmentation performance. In this paper, we propose UniME (Uni-Encoder Meets Multi-Encoders), a two-stage heterogeneous method for brain tumor segmentation with missing modalities that reconciles the trade-offs among fine-grained structure capture, cross-modal complementarity modeling, and exploitation of available modalities. The idea is to decouple representation learning from segmentation via a two-stage heterogeneous architecture. Stage 1 pretrains a single ViT Uni-Encoder with masked image modeling to establish a unified representation robust to missing modalities. Stage 2 adds modality-specific CNN Multi-Encoders to extract high-resolution, multi-scale, fine-grained features. We fuse these features with the global representation to produce precise segmentations. Experiments on BraTS 2023 and BraTS 2024 show that UniME outperforms previous methods under incomplete multi-modal scenarios. The code is available at https://github.com/Hooorace-S/UniME</p>
  </div>
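  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-left: 4px solid #9E9E9E; border-radius: 8px;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>示意代码：</strong>下面的草图示意"全局统一表示与模态专属细粒度特征融合"的一种最简形式：线性对齐后在体素维广播相加。FuseGlobalLocal 及其维度均为示意假设，并非 UniME 的实际融合模块。</p>
    <pre style="background: #2d2d2d; color: #f8f8f2; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import torch
import torch.nn as nn

class FuseGlobalLocal(nn.Module):
    def __init__(self, d_global=768, d_local=256):
        super().__init__()
        self.align = nn.Linear(d_global, d_local)

    def forward(self, global_token, local_feat):
        """global_token: (B, d_global)；local_feat: (B, d_local, D, H, W)。"""
        g = self.align(global_token)[:, :, None, None, None]   # 在体素维广播
        return local_feat + g                                   # 最简加性融合

fuse = FuseGlobalLocal()
out = fuse(torch.randn(2, 768), torch.randn(2, 256, 8, 16, 16))</code></pre>
  </div>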
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22177v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22177v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[视频生成] [模型架构] Learning Reactive Human Motion Generation from Paired Interaction Data Using Transformer-Based Models</title>
      <link>https://arxiv.org/abs/2604.22164v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22164v1</guid>
      <pubDate>Fri, 24 Apr 2026 02:27:13 GMT</pubDate>
      <dc:creator>Masato Soga, Ryuki Takebayashi</dc:creator>
      <category>视频生成</category>
      <category>模型架构</category>
      <description>Recent advances in deep learning have enabled the generation of videos from textual descriptions as well as the prediction of future sequences from input videos. Similarly, in human motion modeling, m...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Learning Reactive Human Motion Generation from Paired Interaction Data Using Transformer-Based Models</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Masato Soga, Ryuki Takebayashi</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 视频生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> motion generation, dit, transformer</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Recent advances in deep learning have enabled the generation of videos from textual descriptions as well as the prediction of future sequences from input videos. Similarly, in human motion modeling, motions can be generated from text or predicted from a single person&apos;s motion sequence. However, these approaches primarily focus on single-agent motion generation. In contrast, this study addresses the problem of generating the motion of one person based on the motion of another in interaction scenarios, where the two motions are mutually dependent. We construct a dataset of paired action-reaction motion sequences extracted from boxing match videos and investigate the effectiveness of Transformer-based models for this task. Specifically, we implement and compare three models: a simple Transformer, iTransformer, and Crossformer. In addition, we introduce a person ID embedding to explicitly distinguish between individuals, enabling the model to maintain structural consistency and better capture interaction dynamics. Experimental results show that the simple Transformer can generate plausible interaction-aware motions without suffering from posture collapse, while iTransformer and Crossformer accumulate errors over time, leading to unstable motion generation. Furthermore, the proposed person ID embedding contributes to preventing structural collapse and improving motion consistency. These results highlight the importance of explicitly modeling individual identity in interaction-aware motion generation.</p>
  </div>
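  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-left: 4px solid #9E9E9E; border-radius: 8px;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>示意代码：</strong>摘要中的 person ID embedding 思路很直接：为每个个体学习一个嵌入向量，加到对应的动作 token 上，使模型能显式区分交互双方。下面的 PyTorch 草图为示意性简化，维度与接口均为假设。</p>
    <pre style="background: #2d2d2d; color: #f8f8f2; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import torch
import torch.nn as nn

class MotionWithPersonID(nn.Module):
    def __init__(self, motion_dim=64, d_model=256, num_persons=2):
        super().__init__()
        self.proj = nn.Linear(motion_dim, d_model)
        self.person_embed = nn.Embedding(num_persons, d_model)

    def forward(self, motion, person_ids):
        """motion: (B, T, motion_dim)；person_ids: (B, T)，取值 0 或 1。"""
        # 把个体身份嵌入加到动作 token 上，显式区分交互双方
        return self.proj(motion) + self.person_embed(person_ids)

tokens = MotionWithPersonID()(torch.randn(2, 100, 64),
                              torch.randint(0, 2, (2, 100)))</code></pre>
  </div>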
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22164v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22164v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 视频生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] GenMatter: Perceiving Physical Objects with Generative Matter Models</title>
      <link>https://arxiv.org/abs/2604.22160v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22160v1</guid>
      <pubDate>Fri, 24 Apr 2026 02:18:18 GMT</pubDate>
      <dc:creator>Eric Li, Arijit Dasgupta, Yoni Friedman et al.</dc:creator>
      <category>模型架构</category>
      <description>Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute ind...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">GenMatter: Perceiving Physical Objects with Generative Matter Models</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Eric Li, Arijit Dasgupta, Yoni Friedman et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.</p>
  </div>
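  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-left: 4px solid #9E9E9E; border-radius: 8px;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>示意代码：</strong>下面给出"对粒子到簇的分配做 Gibbs 采样"的一般化极简草图（各向同性高斯簇、簇参数固定，仅采样一轮分配）。这是该类采样的教学性示意，并非论文的并行块 Gibbs 实现。</p>
    <pre style="background: #2d2d2d; color: #f8f8f2; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import numpy as np

def gibbs_assign(particles, centers, sigma=1.0, rng=None):
    """particles: (N, 3)；centers: (K, 3)。为每个粒子采样一次簇分配。"""
    rng = rng or np.random.default_rng(0)
    d2 = ((particles[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K)
    logp = -0.5 * d2 / sigma**2
    p = np.exp(logp - logp.max(axis=1, keepdims=True))   # 数值稳定的 softmax
    p /= p.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(centers), p=row) for row in p])</code></pre>
  </div>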
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22160v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22160v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] [模型架构] Sum-of-Checks: Structured Reasoning for Surgical Safety with Large Vision-Language Models</title>
      <link>https://arxiv.org/abs/2604.22156v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22156v1</guid>
      <pubDate>Fri, 24 Apr 2026 02:07:23 GMT</pubDate>
      <dc:creator>Weiqiu You, Cassandra Goldberg, Amin Madani et al.</dc:creator>
      <category>多模态生成</category>
      <category>模型架构</category>
      <description>Purpose: Accurate assessment of the Critical View of Safety (CVS) during laparoscopic cholecystectomy is essential to prevent bile duct injury, a complication associated with significant morbidity and...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Sum-of-Checks: Structured Reasoning for Surgical Safety with Large Vision-Language Models</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Weiqiu You, Cassandra Goldberg, Amin Madani et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, vlm, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Purpose: Accurate assessment of the Critical View of Safety (CVS) during laparoscopic cholecystectomy is essential to prevent bile duct injury, a complication associated with significant morbidity and mortality. While large vision-language models (LVLMs) offer flexible reasoning, their predictions remain difficult to audit and unreliable on safety-critical surgical tasks.   Methods: We introduce Sum-of-Checks, a framework that decomposes each CVS criterion into expert-defined reasoning checks reflecting clinically relevant visual evidence. Given a laparoscopic frame, an LVLM evaluates each check, producing a binary judgment and justification. Criterion-level scores are computed via fixed, weighted aggregation of check outcomes. We evaluate on the Endoscapes2023 benchmark using three frontier LVLMs, comparing against direct prompting, chain-of-thought, and sub-question decomposition, each with and without few-shot examples.   Results: Sum-of-Checks improves average frame-level mean average precision by 12--14% relative to the best baseline across all three models and criteria. Analysis of individual checks reveals that LVLMs are reliable on observational checks (e.g., visibility, tool obstruction) but show substantial variability on decision-critical anatomical evidence.   Conclusion: Structuring surgical reasoning into expert-aligned verification checks improves both accuracy and transparency of LVLM-based CVS assessment, demonstrating that explicitly separating evidence elicitation from decision-making is critical for reliable and auditable surgical AI systems.   Code is available at https://github.com/BrachioLab/SumOfChecks.</p>
  </div>
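  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-left: 4px solid #9E9E9E; border-radius: 8px;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>示意代码：</strong>摘要中的准则级打分是对各检查项结果做固定权重聚合。下面的 Python 草图示意这一计算；检查项名称与权重均为假设示例，并非论文的实际检查表。</p>
    <pre style="background: #2d2d2d; color: #f8f8f2; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>def criterion_score(check_results, weights):
    """check_results: 检查名到布尔结果；weights: 检查名到固定权重。"""
    total = sum(weights.values())
    passed = sum(w for name, w in weights.items() if check_results.get(name, False))
    return passed / total

# 假设的检查项与权重，仅作演示
checks = {"two_structures_visible": True, "plate_cleared": True, "no_tool_obstruction": False}
weights = {"two_structures_visible": 0.5, "plate_cleared": 0.3, "no_tool_obstruction": 0.2}
score = criterion_score(checks, weights)   # 0.8</code></pre>
  </div>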
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22156v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22156v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] [模型架构] Anatomy-Aware Unsupervised Detection and Localization of Retinal Abnormalities in Optical Coherence Tomography</title>
      <link>https://arxiv.org/abs/2604.22139v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22139v1</guid>
      <pubDate>Fri, 24 Apr 2026 01:09:05 GMT</pubDate>
      <dc:creator>Tania Haghighi, Sina Gholami, Hamed Tabkhi et al.</dc:creator>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Reliable automated analysis of Optical Coherence Tomography (OCT) imaging is crucial for diagnosing retinal disorders but faces a critical barrier: the need for expensive, labor-intensive expert annot...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Anatomy-Aware Unsupervised Detection and Localization of Retinal Abnormalities in Optical Coherence Tomography</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Tania Haghighi, Sina Gholami, Hamed Tabkhi et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> gan, vae, dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-24</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Reliable automated analysis of Optical Coherence Tomography (OCT) imaging is crucial for diagnosing retinal disorders but faces a critical barrier: the need for expensive, labor-intensive expert annotations. Supervised deep learning models struggle to generalize across diverse pathologies, imaging devices, and patient populations due to their restricted vocabulary of annotated abnormalities. We propose an unsupervised anomaly detection framework that learns the normative distribution of healthy retinal anatomy without lesion annotations, directly addressing annotation efficiency challenges in clinical deployment. Our approach leverages a discrete latent model trained on normal B-scans to capture OCT-specific structural patterns. To enhance clinical robustness, we incorporate retinal layer-aware supervision and structured triplet learning to separate healthy from pathological representations, improving model reliability across varied imaging conditions. During inference, anomalies are detected and localized via reconstruction discrepancies, enabling both image and pixel-level identification without requiring disease-specific labels. On the Kermany dataset (AUROC: 0.799), our method substantially outperforms VAE, VQVAE, VQGAN, and f-AnoGAN baselines. Critically, cross-dataset evaluation on Srinivasan achieves AUROC 0.884 with superior generalization, demonstrating robust domain adaptation. On the external RETOUCH benchmark, unsupervised anomaly segmentation achieves competitive Dice (0.200) and mIoU (0.117) scores, validating reproducibility across institutions.</p>
  </div>
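  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-left: 4px solid #9E9E9E; border-radius: 8px;">
    <p style="margin: 5px 0; color: #666;">💻 <strong>示意代码：</strong>"重建差异"式异常检测的通用形式如下：用只见过正常样本的重建模型重建输入，残差大的位置视为异常。model 为假设的重建接口，阈值取法亦为示意，并非论文实现。</p>
    <pre style="background: #2d2d2d; color: #f8f8f2; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import numpy as np

def anomaly_map(model, image, percentile=95):
    """model 只在正常 B-scan 上训练过；重建残差大处视为异常。"""
    recon = model(image)
    residual = np.abs(image - recon)            # 像素级差异
    image_score = residual.mean()               # 图像级异常分数
    thresh = np.percentile(residual, percentile)
    localization = (residual > thresh).astype(np.uint8)
    return image_score, localization</code></pre>
  </div>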
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22139v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22139v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] [模型架构] How Many Visual Levers Drive Urban Perception? Interventional Counterfactuals via Multiple Localised Edits</title>
      <link>https://arxiv.org/abs/2604.22103v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22103v1</guid>
      <pubDate>Thu, 23 Apr 2026 22:20:59 GMT</pubDate>
      <dc:creator>Jason Tang, Stephen Law</dc:creator>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Street-view perception models predict subjective attributes such as safety at scale, but remain correlational: they do not identify which localized visual changes would plausibly shift human judgement...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">How Many Visual Levers Drive Urban Perception? Interventional Counterfactuals via Multiple Localised Edits</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Jason Tang, Stephen Law</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, image editing</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Street-view perception models predict subjective attributes such as safety at scale, but remain correlational: they do not identify which localized visual changes would plausibly shift human judgement for a specific scene. We propose a lever-based interventional counterfactual framework that recasts scene-level explainability as a bounded search over structured counterfactual edits. Each lever specifies a semantic concept, spatial support, intervention direction, and constrained edit template. Candidate edits are generated through prompt-conditioned image editing and retained only if they satisfy validity checks for same-place preservation, locality, realism, and plausibility. In a pilot across 50 scenes from five cities, the framework reveals preliminary proxy-based directional patterns and a practical failure taxonomy under prompt-only editing, with Mobility Infrastructure and Physical Maintenance showing the largest auxiliary safety shifts. Human pairwise judgements remain the ground-truth endpoint for future validation.</p>
  </div>
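  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-radius: 8px;">
    <p style="line-height: 1.8; color: #333;">The lever definition above (semantic concept, spatial support, intervention direction, constrained edit template) maps naturally onto a small record type. A minimal sketch; the field names are illustrative, not the authors' schema.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>from dataclasses import dataclass

@dataclass
class Lever:
    """One structured counterfactual edit, following the abstract's
    four-part definition. Field names are hypothetical."""
    concept: str          # semantic concept, e.g. "Mobility Infrastructure"
    spatial_support: str  # region or mask the edit is allowed to touch
    direction: str        # intervention direction: "add" or "remove"
    edit_template: str    # constrained prompt template for the image editor

# Candidate edits instantiated from a lever would then be kept only if
# they pass the paper's validity checks (same-place preservation,
# locality, realism, plausibility).
</code></pre>
  </div>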
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22103v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22103v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] [模型架构] FLARE-BO: Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation for Low-Light Robotic Vision</title>
      <link>https://arxiv.org/abs/2604.22093v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22093v1</guid>
      <pubDate>Thu, 23 Apr 2026 21:51:22 GMT</pubDate>
      <dc:creator>Nathan Shankar, Pawel Ladosz, Hujun Yin</dc:creator>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Reliable visual perception under low illumination remains a core challenge for autonomous robotic systems, where degraded image quality directly compromises navigation, inspection, and various operati...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">FLARE-BO: Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation for Low-Light Robotic Vision</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Nathan Shankar, Pawel Ladosz, Hujun Yin</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, lora</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Reliable visual perception under low illumination remains a core challenge for autonomous robotic systems, where degraded image quality directly compromises navigation, inspection, and various operations. A recent training free approach showed that Bayesian optimisation with Gaussian Processes can adaptively select brightness, contrast, and denoising parameters on a per-image basis, achieving competitive enhancement without any learned model. However, that framework is limited to three parameters, applies no illumination decomposition or white balance correction, and relies on Non-Local Means denoising, which tends to over smooth edges under noisy conditions. This paper proposes FLARE-BO (Fused Luminance and Adaptive Retinex Enhancement via Bayesian Optimisation), an extended framework that jointly optimises eight parameters spanning across gamma correction, LIME-style illumination normalisation, chrominance denoising, bilateral filtering, NLM denoising, Grey-World automatic white balance, and adaptive post smoothing. The search engine employs a unit hypercube parameter normalisation, objective standardisation, Sobol quasi-random initialisation, and Log Expected Improvement acquisition for principled exploration of the expanded space. Performance of the proposed method is benchmarked using the Low Light paired dataset (LOL) and results show marked improvements of the proposed method over existing methods that were not specifically trained using this dataset.</p>
  </div>
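  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-radius: 8px;">
    <p style="line-height: 1.8; color: #333;">The search setup described above, unit-hypercube parameter normalisation plus Sobol quasi-random initialisation, can be sketched as follows. The eight parameter names and bounds are hypothetical placeholders (the abstract does not state the actual ranges), and the GP surrogate with Log Expected Improvement is omitted.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import numpy as np
from scipy.stats import qmc

# Hypothetical bounds for the eight enhancement parameters; the
# paper's actual ranges are not given in the abstract.
BOUNDS = {
    "gamma":           (0.3, 3.0),
    "lime_strength":   (0.0, 1.0),
    "chroma_denoise":  (0.0, 1.0),
    "bilateral_sigma": (1.0, 50.0),
    "nlm_h":           (1.0, 20.0),
    "awb_gain":        (0.5, 2.0),
    "post_smooth":     (0.0, 1.0),
    "contrast":        (0.5, 2.0),
}

def sobol_init(n_init=16, seed=0):
    """Quasi-random initial designs in the unit hypercube, rescaled
    to each parameter's physical range (n_init a power of two)."""
    sampler = qmc.Sobol(d=len(BOUNDS), scramble=True, seed=seed)
    unit = sampler.random(n_init)                  # (n_init, 8) in [0, 1]
    lo = np.array([b[0] for b in BOUNDS.values()])
    hi = np.array([b[1] for b in BOUNDS.values()])
    return lo + unit * (hi - lo)
</code></pre>
  </div>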
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22093v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22093v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] [模型架构] H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers</title>
      <link>https://arxiv.org/abs/2604.22045v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22045v1</guid>
      <pubDate>Thu, 23 Apr 2026 20:03:17 GMT</pubDate>
      <dc:creator>Ayushi Mehrotra, Dipkamal Bhusal, Michael Clifford et al.</dc:creator>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Feature attribution methods explain the predictions of deep neural networks by assigning importance scores to individual input features. However, most existing methods focus solely on marginal effects...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Ayushi Mehrotra, Dipkamal Bhusal, Michael Clifford et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, imagen</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Feature attribution methods explain the predictions of deep neural networks by assigning importance scores to individual input features. However, most existing methods focus solely on marginal effects, overlooking feature interactions, where groups of features jointly influence model output. Such interactions are especially important in image classification tasks, where semantic meaning often arises from pixel interdependencies rather than isolated features. Existing interaction-based methods for images are either coarse (e.g., superpixel-only) or, fail to satisfy core interpretability axioms. In this work, we introduce H-Sets, a novel two-stage framework for discovering and attributing higher-order feature interactions in image classifiers. First, we detect locally interacting pairs via input Hessians and recursively merge them into semantically coherent sets; segmentation from Segment Anything (SAM) is used as a spatial grouping prior but can be replaced by other segmentations. Second, we attribute each set with IDG-Vis, a set-level extension of Integrated Directional Gradients that integrates directional gradients along pixel-space paths and aggregates them with Harsanyi dividends. While Hessians introduce additional compute at the detection stage, this targeted cost consistently yields saliency maps that are sparser and more faithful. Evaluations across VGG, ResNet, DenseNet and MobileNet models on ImageNet and CUB datasets show that H-Sets generate more interpretable and faithful saliency maps compared to existing methods.</p>
  </div>
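  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-radius: 8px;">
    <p style="line-height: 1.8; color: #333;">Detecting "locally interacting pairs via input Hessians" amounts to reading entries of the Hessian of a class logit with respect to the input. A minimal autograd sketch under that reading; the model, indices, and shapes are illustrative, and the recursive merging and IDG-Vis attribution stages are not shown.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import torch

def interaction_score(model, image, idx_a, idx_b, class_idx):
    """One entry of the input Hessian of a class logit, read as the
    local interaction strength between two flattened input pixels."""
    x = image.clone().requires_grad_(True)   # (1, C, H, W)
    logit = model(x)[0, class_idx]
    grad = torch.autograd.grad(logit, x, create_graph=True)[0].flatten()
    # Differentiating one gradient component again yields a Hessian row.
    row = torch.autograd.grad(grad[idx_a], x)[0].flatten()
    return row[idx_b].item()
</code></pre>
  </div>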
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22045v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22045v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] [模型架构] EgoMAGIC: An Egocentric Video Field Medicine Dataset for Training Perception Algorithms</title>
      <link>https://arxiv.org/abs/2604.22036v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22036v1</guid>
      <pubDate>Thu, 23 Apr 2026 19:49:16 GMT</pubDate>
      <dc:creator>Brian VanVoorst, Nicholas Walczak, Christopher Gilleo et al.</dc:creator>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA&apos;s Perceptually-enabled Task Guidance (PTG)...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Brian VanVoorst, Nicholas Walczak, Christopher Gilleo et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit, dit, lora</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA&apos;s Perceptually-enabled Task Guidance (PTG) program. This dataset comprises 3,355 videos of 50 medical tasks, with at least 50 labeled videos per task. The primary objective of the PTG program was to develop virtual assistants integrated into augmented reality headsets to assist users in performing complex tasks.   To encourage exploration and research using this dataset, the medical training data has been released along with an action detection challenge focused on eight medical tasks. The majority of the videos were recorded using a head-mounted stereo camera with integrated audio. From this dataset, 40 YOLO models were trained using 1.95 million labels to detect 124 medical objects, providing a robust starting point for developers working on medical AI applications.   In addition to introducing the dataset, this paper presents baseline results on action detection for the eight selected medical tasks across three models, with the best-performing method achieving average mAP 0.526. Although this paper primarily addresses action detection as the benchmark, the EgoMAGIC dataset is equally suitable for action recognition, object identification and detection, error detection, and other challenging computer vision tasks.   The dataset is accessible via zenodo.org (DOI: 10.5281/zenodo.19239154).</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22036v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22036v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] LTBs-KAN: Linear-Time B-splines Kolmogorov-Arnold Networks</title>
      <link>https://arxiv.org/abs/2604.22034v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.22034v1</guid>
      <pubDate>Thu, 23 Apr 2026 19:47:13 GMT</pubDate>
      <dc:creator>Eduardo Said Merin-Martinez, Andres Mendez-Vazquez, Eduardo Rodriguez-Tello</dc:creator>
      <category>模型架构</category>
      <description>Kolmogorov-Arnold Networks (KANs) are a recent neural network architecture offering an alternative to Multilayer Perceptrons (MLPs) with improved explainability and expressibility. However, KANs are s...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">LTBs-KAN: Linear-Time B-splines Kolmogorov-Arnold Networks</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Eduardo Said Merin-Martinez, Andres Mendez-Vazquez, Eduardo Rodriguez-Tello</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Kolmogorov-Arnold Networks (KANs) are a recent neural network architecture offering an alternative to Multilayer Perceptrons (MLPs) with improved explainability and expressibility. However, KANs are significantly slower than MLPs due to the recursive nature of B-spline function computations, limiting their application. This work addresses these issues by proposing a novel base-spline Linear-Time B-splines Kolmogorov-Arnold Network (LTBs-KAN) with linear complexity. Unlike previous methods that rely on the Boor-Mansfield-Cox spline algorithm or other computationally intensive mathematical functions, our approach significantly reduces the computational burden. Additionally, we further reduce model&apos;s parameter through product-of-sums matrix factorization in the forward pass without sacrificing performance. Experiments on MNIST, Fashion-MNIST and CIFAR-10 demonstrate that LTBs-KAN achieves good time complexity and parameter reduction, when used as building architectural blocks, compared to other KAN implementations.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.22034v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.22034v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Soft Anisotropic Diagrams for Differentiable Image Representation</title>
      <link>https://arxiv.org/abs/2604.21984v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21984v1</guid>
      <pubDate>Thu, 23 Apr 2026 18:07:27 GMT</pubDate>
      <dc:creator>Laki Iinbor, Zhiyang Dou, Wojciech Matusik</dc:creator>
      <category>模型架构</category>
      <description>We introduce Soft Anisotropic Diagrams (SAD), an explicit and differentiable image representation parameterized by a set of adaptive sites in the image plane. In SAD, each site specifies an anisotropi...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Soft Anisotropic Diagrams for Differentiable Image Representation</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Laki Iinbor, Zhiyang Dou, Wojciech Matusik</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">We introduce Soft Anisotropic Diagrams (SAD), an explicit and differentiable image representation parameterized by a set of adaptive sites in the image plane. In SAD, each site specifies an anisotropic metric and an additively weighted distance score, and we compute pixel colors as a softmax blend over a small per-pixel top-K subset of sites. We induce a soft anisotropic additively weighted Voronoi partition (i.e., an Apollonius diagram) with learnable per-site temperatures, preserving informative gradients while allowing clear, content-aligned boundaries and explicit ownership. Such a formulation enables efficient rendering by maintaining a per-query top-K map that approximates nearest neighbors under the same shading score, allowing GPU-friendly, fixed-size local computation. We update this list using our top-K propagation scheme inspired by jump flooding, augmented with stochastic injection to provide probabilistic global coverage. Training follows a GPU-first pipeline with gradient-weighted initialization, Adam optimization, and adaptive budget control through densification and pruning. Across standard benchmarks, SAD consistently outperforms Image-GS and Instant-NGP at matched bitrate. On Kodak, SAD reaches 46.0 dB PSNR with 2.2 s encoding time (vs. 28 s for Image-GS), and delivers 4-19 times end-to-end training speedups over state-of-the-art baselines. We demonstrate the effectiveness of SAD by showcasing the seamless integration with differentiable pipelines for forward and inverse problems, efficiency of fast random access, and compact storage.</p>
  </div>
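  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-radius: 8px;">
    <p style="line-height: 1.8; color: #333;">The per-pixel shading rule, a softmax blend over a small top-K subset of sites, can be sketched in a few lines. Array names are illustrative; in SAD the temperature is learnable per site and the scores come from per-site anisotropic, additively weighted metrics.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import numpy as np

def shade_pixel(scores, colors, temperature=1.0, k=8):
    """Blend one pixel's color from its top-K sites by softmax over
    negative distance scores. `scores`: (N,) per-site distance scores
    for this pixel; `colors`: (N, 3) per-site RGB."""
    top = np.argsort(scores)[:k]          # K nearest sites under the score
    logits = -scores[top] / temperature
    w = np.exp(logits - logits.max())
    w /= w.sum()                          # softmax ownership weights
    return w @ colors[top]                # (3,) blended color
</code></pre>
  </div>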
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21984v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21984v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Forecasting Solar Energy Using a Single Image</title>
      <link>https://arxiv.org/abs/2604.21982v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21982v1</guid>
      <pubDate>Thu, 23 Apr 2026 18:01:38 GMT</pubDate>
      <dc:creator>Jeremy Klotz, Shree K. Nayar</dc:creator>
      <category>模型架构</category>
      <description>Solar panels are increasingly deployed in cities on rooftops, walls, and urban infrastructure. Although the panel costs have fallen in recent years, the soft costs of installing them have not. These s...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Forecasting Solar Energy Using a Single Image</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Jeremy Klotz, Shree K. Nayar</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Solar panels are increasingly deployed in cities on rooftops, walls, and urban infrastructure. Although the panel costs have fallen in recent years, the soft costs of installing them have not. These soft costs include assessing the illumination (irradiance) of a panel, which is typically performed using a 3D model that fails to capture small nearby structures that impact the irradiance. Our approach uses a single image taken at the panel&apos;s location to forecast its irradiance at any time in the future. We use visual cues in the image to find the camera&apos;s orientation and the portion of the sky visible to the panel in order to forecast the irradiance due to the sun and the sky. In addition, we show that the irradiance due to reflections from nearby buildings varies smoothly over time and can be forecasted from the image. This approach enables assessing the solar energy potential of any surface and forecasting the temporal variation of a panel&apos;s irradiance. We validate our approach using real irradiance measurements in urban canyons. We show that our approach often yields more accurate irradiance forecasts compared to conventional irradiance-based transposition methods and 3D model-based simulations. We also show that a single spherical image can be used to find the best fixed orientation of a panel. Finally, we present Solaris, a device to capture the image seen by a panel in a variety of urban settings.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21982v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21982v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[视频生成] [模型架构] Seeing Fast and Slow: Learning the Flow of Time in Videos</title>
      <link>https://arxiv.org/abs/2604.21931v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21931v1</guid>
      <pubDate>Thu, 23 Apr 2026 17:59:57 GMT</pubDate>
      <dc:creator>Yen-Siang Wu, Rundong Luo, Jingsen Zhu et al.</dc:creator>
      <category>视频生成</category>
      <category>模型架构</category>
      <description>How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Seeing Fast and Slow: Learning the Flow of Time in Videos</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yen-Siang Wu, Rundong Luo, Jingsen Zhu et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 视频生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> video generation, dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and controlling the passage of time. In this paper, we study time as a learnable visual concept and develop models for reasoning about and manipulating the flow of time in videos. We first exploit the multimodal cues and temporal structure naturally present in videos to learn, in a self-supervised manner, to detect speed changes and estimate playback speed. We then show that these learned temporal reasoning models enable us to curate the largest slow-motion video dataset to date from noisy in-the-wild sources. Such slow-motion footage, typically filmed by high-speed cameras, contains substantially richer temporal detail than standard videos. Using this data, we further develop models capable of temporal control, including speed-conditioned video generation, which produces motion at specified playback speed, and temporal super-resolution, which tranforms low-FPS, blurry videos into high-FPS sequences with fine-grained temporal details. Our findings highlight time as a manipulable, perceptual dimension in video learning, opening doors to temporally controllable video generation, temporal forensics detection, and potentially richer world-models that understand how events unfold over time.</p>
  </div>
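  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-radius: 8px;">
    <p style="line-height: 1.8; color: #333;">One common way to obtain the self-supervision described above is to resample ordinary footage at a known rate and train a model to recover that rate. A crude sketch under that assumption, not the authors' pipeline; genuinely slow-motion footage (the curated data in the paper) cannot be synthesized this way.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import random

def make_speed_example(frames, rates=(1, 2, 4, 8), clip_len=16):
    """Build a (clip, rate) pair for self-supervised playback-speed
    estimation: subsample a frame list at a known integer rate and
    keep that rate as the label. Names and rates are illustrative."""
    rate = random.choice(rates)
    resampled = frames[::rate]            # 2x speed = every 2nd frame, etc.
    start = random.randint(0, max(0, len(resampled) - clip_len))
    return resampled[start:start + clip_len], rate
</code></pre>
  </div>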
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21931v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21931v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 视频生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs</title>
      <link>https://arxiv.org/abs/2604.21926v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21926v1</guid>
      <pubDate>Thu, 23 Apr 2026 17:59:16 GMT</pubDate>
      <dc:creator>Hao-Yu Hsu, Tianhang Cheng, Jing Wen et al.</dc:creator>
      <category>模型架构</category>
      <description>Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Hao-Yu Hsu, Tianhang Cheng, Jing Wen et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision. Its goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. For this we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D uses data from a few inertial sensors from earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than SoTA cascaded pipelines, suggesting wearable motion sensors alone can support rich 4D understanding.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21926v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21926v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] Context Unrolling in Omni Models</title>
      <link>https://arxiv.org/abs/2604.21921v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21921v1</guid>
      <pubDate>Thu, 23 Apr 2026 17:58:38 GMT</pubDate>
      <dc:creator>Ceyuan Yang, Zhijie Lin, Yang Zhao et al.</dc:creator>
      <category>多模态生成</category>
      <description>We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Context Unrolling in Omni Models</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Ceyuan Yang, Zhijie Lin, Yang Zhao et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> multimodal generation</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21921v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21921v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs</title>
      <link>https://arxiv.org/abs/2604.21911v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21911v1</guid>
      <pubDate>Thu, 23 Apr 2026 17:54:36 GMT</pubDate>
      <dc:creator>Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny et al.</dc:creator>
      <category>多模态生成</category>
      <description>Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior w...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Pegah Khayatan, Jayneel Parekh, Arnaud Dapogny et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vlm, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Despite impressive progress in capabilities of large vision-language models (LVLMs), these systems remain vulnerable to hallucinations, i.e., outputs that are not grounded in the visual input. Prior work has attributed hallucinations in LVLMs to factors such as limitations of the vision backbone or the dominance of the language component, yet the relative importance of these factors remains unclear. To resolve this ambiguity, We propose HalluScope, a benchmark to better understand the extent to which different factors induce hallucinations. Our analysis indicates that hallucinations largely stem from excessive reliance on textual priors and background knowledge, especially information introduced through textual instructions. To mitigate hallucinations induced by textual instruction priors, we propose HalluVL-DPO, a framework for fine-tuning off-the-shelf LVLMs towards more visually grounded responses. HalluVL-DPO leverages preference optimization using a curated training dataset that we construct, guiding the model to prefer grounded responses over hallucinated ones. We demonstrate that our optimized model effectively mitigates the targeted hallucination failure mode, while preserving or improving performance on other hallucination benchmarks and visual capability evaluations. To support reproducibility and further research, we will publicly release our evaluation benchmark, preference training dataset, and code at https://pegah-kh.github.io/projects/prompts-override-vision/ .</p>
  </div>
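  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-radius: 8px;">
    <p style="line-height: 1.8; color: #333;">HalluVL-DPO builds on preference optimization; the standard DPO objective over (grounded, hallucinated) response pairs is sketched below, assuming summed token log-probabilities from the tuned model and a frozen reference model. Beta and the pairing scheme are illustrative, not the paper's exact configuration.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss: prefer the grounded
    (chosen) response over the hallucinated (rejected) one, measured
    relative to a frozen reference model."""
    chosen_margin = logp_chosen - ref_chosen        # tuned vs. reference
    rejected_margin = logp_rejected - ref_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
</code></pre>
  </div>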
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21911v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21911v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision</title>
      <link>https://arxiv.org/abs/2604.21909v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21909v1</guid>
      <pubDate>Thu, 23 Apr 2026 17:52:16 GMT</pubDate>
      <dc:creator>Leyla Roksan Caglar, Pedro A. M. Mediano, Baihan Lin</dc:creator>
      <category>图像生成</category>
      <description>Humans and modern vision models can reach similar classification accuracy while making systematically different kinds of mistakes - differing not in how often they err, but in who gets mistaken for wh...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Directional Confusions Reveal Divergent Inductive Biases Through Rate-Distortion Geometry in Human and Machine Vision</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Leyla Roksan Caglar, Pedro A. M. Mediano, Baihan Lin</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> gan</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Humans and modern vision models can reach similar classification accuracy while making systematically different kinds of mistakes - differing not in how often they err, but in who gets mistaken for whom, and in which direction. We show that these directional confusions reveal distinct inductive biases that are invisible to accuracy alone. Using matched human and deep vision model responses on a natural-image categorization task under 12 perturbation types, we quantify asymmetry in confusion matrices and link it to generalization geometry through a Rate-Distortion (RD) framework, summarized by three geometric signatures (slope (beta), curvature (kappa)) and efficiency (AUC). We find that humans exhibit broad but weak asymmetries, whereas deep vision models show sparser, stronger directional collapses. Robustness training reduces global asymmetry but fails to recover the human-like breadth-strength profile of graded similarity. Mechanistic simulations further show that different asymmetry organizations shift the RD frontier in opposite directions, even when matched for performance. Together, these results position directional confusions and RD geometry as compact, interpretable signatures of inductive bias under distribution shift.</p>
  </div>
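  <div style="margin-top: 20px; padding: 15px; background: #f5f5f5; border-radius: 8px;">
    <p style="line-height: 1.8; color: #333;">Asymmetry in a confusion matrix can be quantified in several ways; one simple choice is the fraction of off-diagonal confusion mass carried by the antisymmetric part of the matrix. A sketch of that choice, not necessarily the paper's estimator.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>import numpy as np

def directional_asymmetry(conf):
    """Fraction of off-diagonal confusion mass that is directional:
    0 means fully symmetric errors; values near 1 mean strong
    'A mistaken for B but not B for A' collapses."""
    conf = np.asarray(conf, dtype=float)
    off = conf - np.diag(np.diag(conf))   # drop correct predictions
    antisym = 0.5 * np.abs(off - off.T)   # directional excess per pair
    return antisym.sum() / max(off.sum(), 1e-12)
</code></pre>
  </div>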
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21909v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21909v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection</title>
      <link>https://arxiv.org/abs/2604.21904v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21904v1</guid>
      <pubDate>Thu, 23 Apr 2026 17:49:25 GMT</pubDate>
      <dc:creator>Yanran Zhang, Wenzhao Zheng, Yifei Li et al.</dc:creator>
      <category>图像生成</category>
      <description>In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved dis...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yanran Zhang, Wenzhao Zheng, Yifei Li et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> image generation</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges. Departing from previous approaches, we propose UniGenDet: a Unified generative-discriminative framework for co-evolutionary image Generation and generated image Detection. To bridge the task gap, we design a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm. This synergy allows the generation task to improve the interpretability of authenticity identification, while authenticity criteria guide the creation of higher-fidelity images. Furthermore, we introduce a detector-informed generative alignment mechanism to facilitate seamless information exchange. Extensive experiments on multiple datasets demonstrate that our method achieves state-of-the-art performance. Code: \href{https://github.com/Zhangyr2022/UniGenDet}{https://github.com/Zhangyr2022/UniGenDet}.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21904v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21904v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Grounding Video Reasoning in Physical Signals</title>
      <link>https://arxiv.org/abs/2604.21873v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21873v1</guid>
      <pubDate>Thu, 23 Apr 2026 17:17:18 GMT</pubDate>
      <dc:creator>Alibay Osmanli, Zixu Cheng, Shaogang Gong</dc:creator>
      <category>模型架构</category>
      <description>Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Grounding Video Reasoning in Physical Signals</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Alibay Osmanli, Zixu Cheng, Shaogang Gong</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Physical video understanding requires more than naming an event correctly. A model can answer a question about pouring, sliding, or collision from textual regularities while still failing to localize the event in time or space. We introduce a grounded benchmark for physical video understanding that extends the what--when--where evaluation structure of V-STaR to four video sources, six physics domains, three prompt families (physics, vstar_like, and neutral_rstr), and four input conditions (original, shuffled, ablated, and frame-masked). The benchmark contains 1,560 base video clips from SSV2, YouCook2, HoloAssist, and Roundabout-TAU. Each clip is first converted into a shared grounded event record, and the three query families are derived from that record. Temporal and spatial targets are shared across prompt families, while the non-physics families use deterministic family-appropriate semantic a_what targets derived from the same record. Across models and prompt families, physics remains the strongest regime overall, vstar_like is the clearest non-physics semantic comparison, and neutral_rstr behaves as a harder templated control. Prompt-family robustness is selective rather than universal, perturbation gains cluster in weak original cases, and spatial grounding is the weakest across settings. These results suggest that video Q&amp;A reasoning benchmarks shall report physically grounded, prompt-aware, and perturbation-aware diagnostics alongside aggregate accuracy.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21873v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21873v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos</title>
      <link>https://arxiv.org/abs/2604.21814v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21814v1</guid>
      <pubDate>Thu, 23 Apr 2026 16:07:51 GMT</pubDate>
      <dc:creator>Bowen Liu, Li Yang, Shanshan Song et al.</dc:creator>
      <category>图像生成</category>
      <description>Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis undere...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Bowen Liu, Li Yang, Shanshan Song et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> gan</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that covers clinically meaningful findings and making accurate diagnoses from those evidence frames. This setting is challenging because diagnostically relevant events are extremely sparse and can be overwhelmed by tens of thousands of redundant normal frames, while individual observations are often ambiguous due to motion blur, debris, specular highlights, and rapid viewpoint changes. To facilitate research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical reports. VideoCAP comprises 240 full-length videos and provides realistic supervision for both key evidence frame extraction and diagnosis. To address this task, we further propose DiCE, a clinician-inspired framework that mirrors the standard CE reading workflow. DiCE first performs efficient candidate screening over the raw video, then uses a Context Weaver to organize candidates into coherent diagnostic contexts that preserve distinct lesion events, and an Evidence Converger to aggregate multi-frame evidence within each context into robust clip-level judgments. Experiments show that DiCE consistently outperforms state-of-the-art methods, producing concise and clinically reliable diagnostic summaries. These results highlight diagnosis-driven contextual reasoning as a promising paradigm for ultra-long CE video summarization.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21814v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21814v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval</title>
      <link>https://arxiv.org/abs/2604.21806v2</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21806v2</guid>
      <pubDate>Thu, 23 Apr 2026 16:03:12 GMT</pubDate>
      <dc:creator>Zixu Li, Yupeng Hu, Zhiheng Fu et al.</dc:creator>
      <category>模型架构</category>
      <description>Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and modification text....</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Zixu Li, Yupeng Hu, Zhiheng Fu et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and modification text. Although research on CIR has made significant progress, prevailing setups still rely simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment. In order to address these issues and bring CIR closer to real-world use, we construct two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. In addition, we propose TEMA, the Text-oriented Entity Mapping Architecture, which is the first CIR framework designed for multi-modification while also accommodating simple modifications. Extensive experiments on four benchmark datasets demonstrate that TEMA&apos;s superiority in both original and multi-modification scenarios, while maintaining an optimal balance between retrieval accuracy and computational efficiency. Our codes and constructed multi-modification dataset (M-FashionIQ and M-CIRR) are available at https://github.com/lee-zixu/ACL26-TEMA/.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21806v2" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21806v2" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[控制与编辑] [多模态生成] From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media</title>
      <link>https://arxiv.org/abs/2604.21786v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21786v1</guid>
      <pubDate>Thu, 23 Apr 2026 15:44:14 GMT</pubDate>
      <dc:creator>Katharina Prasse, Steffen Jung, Isaac Bravo et al.</dc:creator>
      <category>控制与编辑</category>
      <category>多模态生成</category>
      <description>Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobil...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Katharina Prasse, Steffen Jung, Isaac Bravo et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 控制与编辑, 多模态生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> prompt engineering, vlm, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Social media platforms have become primary arenas for climate communication, generating millions of images and posts that - if systematically analysed - can reveal which communication strategies mobilise public concern and which fall flat. We aim to facilitate such research by analysing how computer vision methods can be used for social media discourse analysis. This analysis includes application-based taxonomy design, model selection, prompt engineering, and validation. We benchmark six promptable vision-language models and 15 zero-shot CLIP-like models on two datasets from X (formerly Twitter) - a 1,038-image expert-annotated set and a larger corpus of over 1.2 million images, with 50,000 labels manually validated - spanning five annotation dimensions: animal content, climate change consequences, climate action, image setting, and image type. Among the models benchmarked, Gemini-3.1-flash-lite outperforms all others across all super-categories and both datasets, while the gap to open-weight models of moderate size remains relatively small. Beyond instance-level metrics, we advocate for distributional evaluation: VLM predictions can reliably recover population level trends even when per-image accuracy is moderate, making them a viable starting point for discourse analysis at scale. We find that chain-of-thought reasoning reduces rather than improves performance, and that annotation dimension specific prompt design improves performance. We release tweet IDs and labels along with our code at https://github.com/KathPra/Codebooks2VLMs.git.</p>
  </div>
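  
  <div style="background: #f9f9f9; border-left: 4px solid #9E9E9E; padding: 15px; margin: 20px 0;">
    <p style="margin: 0 0 10px 0; color: #666;">💻 <strong>Illustrative sketch (not from the paper):</strong> distributional evaluation can be pictured as comparing the label distribution recovered by a VLM with the human-annotated one (here via total variation distance), separately from instance accuracy. The category names and labels below are made-up placeholders.</p>
    <pre style="background: #f0f0f0; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>
# Minimal sketch of "distributional evaluation": compare the label distribution
# recovered by a VLM with the human-annotated distribution, independently of
# per-image accuracy. Category names and labels are placeholders.
from collections import Counter

def label_distribution(labels, categories):
    counts = Counter(labels)
    total = max(1, len(labels))
    return {c: counts.get(c, 0) / total for c in categories}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)

categories = ["animal", "protest", "disaster", "infographic", "other"]
human = ["protest", "protest", "disaster", "animal", "other", "protest"]
vlm   = ["protest", "disaster", "disaster", "animal", "protest", "protest"]

p_human = label_distribution(human, categories)
p_vlm   = label_distribution(vlm, categories)
accuracy = sum(h == v for h, v in zip(human, vlm)) / len(human)
print(f"instance accuracy: {accuracy:.2f}")
print(f"distributional TV distance: {total_variation(p_human, p_vlm):.3f}")
</code></pre>
  </div>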
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21786v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21786v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 控制与编辑, 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[视频生成] [扩散模型] [模型架构] Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting</title>
      <link>https://arxiv.org/abs/2604.21776v2</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21776v2</guid>
      <pubDate>Thu, 23 Apr 2026 15:32:56 GMT</pubDate>
      <dc:creator>Avinash Paliwal, Adithya Iyer, Shivin Yadav et al.</dc:creator>
      <category>视频生成</category>
      <category>扩散模型</category>
      <category>模型架构</category>
      <description>Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supe...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Reshoot-Anything: A Self-Supervised Model for In-the-Wild Video Reshooting</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Avinash Paliwal, Adithya Iyer, Shivin Yadav et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 视频生成, 扩散模型, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> temporal consistency, transformer, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Precise camera control for reshooting dynamic videos is bottlenecked by the severe scarcity of paired multi-view data for non-rigid scenes. We overcome this limitation with a highly scalable self-supervised framework capable of leveraging internet-scale monocular videos. Our core contribution is the generation of pseudo multi-view training triplets, consisting of a source video, a geometric anchor, and a target video. We achieve this by extracting distinct smooth random-walk crop trajectories from a single input video to serve as the source and target views. The anchor is synthetically generated by forward-warping the first frame of the source with a dense tracking field, which effectively simulates the distorted point-cloud inputs expected at inference. Because our independent cropping strategy introduces spatial misalignment and artificial occlusions, the model cannot simply copy information from the current source frame. Instead, it is forced to implicitly learn 4D spatiotemporal structures by actively routing and re-projecting missing high-fidelity textures across distinct times and viewpoints from the source video to reconstruct the target. At inference, our minimally adapted diffusion transformer utilizes a 4D point-cloud derived anchor to achieve state-of-the-art temporal consistency, robust camera control, and high-fidelity novel view synthesis on complex dynamic scenes.</p>
  </div>
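  
  <div style="background: #f9f9f9; border-left: 4px solid #9E9E9E; padding: 15px; margin: 20px 0;">
    <p style="margin: 0 0 10px 0; color: #666;">💻 <strong>Illustrative sketch (not from the paper):</strong> the pseudo multi-view construction relies on smooth random-walk crop trajectories from a single video. A minimal version is sketched below; crop size, step scale, and smoothing window are chosen arbitrarily for illustration.</p>
    <pre style="background: #f0f0f0; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>
# Minimal sketch of generating a smooth random-walk crop trajectory from a
# monocular video, the kind of trajectory used to build pseudo multi-view
# training pairs. All parameters here are illustrative assumptions.
import numpy as np

def random_walk_crops(num_frames, frame_hw, crop_hw, step=4.0, smooth=15, seed=0):
    rng = np.random.default_rng(seed)
    h, w = frame_hw
    ch, cw = crop_hw
    # Cumulative random steps give a jittery walk of crop centres.
    steps = rng.normal(scale=step, size=(num_frames, 2))
    centres = np.cumsum(steps, axis=0) + np.array([h / 2, w / 2])
    # Moving-average smoothing turns the jitter into a camera-like drift.
    kernel = np.ones(smooth) / smooth
    centres[:, 0] = np.convolve(centres[:, 0], kernel, mode="same")
    centres[:, 1] = np.convolve(centres[:, 1], kernel, mode="same")
    # Clamp so the crop stays inside the frame, then return top-left corners.
    centres[:, 0] = np.clip(centres[:, 0], ch / 2, h - ch / 2)
    centres[:, 1] = np.clip(centres[:, 1], cw / 2, w - cw / 2)
    return (centres - np.array([ch / 2, cw / 2])).astype(int)

src = random_walk_crops(120, (720, 1280), (384, 384), seed=1)   # source view
tgt = random_walk_crops(120, (720, 1280), (384, 384), seed=2)   # target view
print(src[:3], tgt[:3])
</code></pre>
  </div>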
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21776v2" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21776v2" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 视频生成, 扩散模型 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation</title>
      <link>https://arxiv.org/abs/2604.21772v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21772v1</guid>
      <pubDate>Thu, 23 Apr 2026 15:29:29 GMT</pubDate>
      <dc:creator>Yingkai Yang, Chaoqi Chen, Hui Huang</dc:creator>
      <category>模型架构</category>
      <description>Test-Time Adaptation (TTA) aims to mitigate distributional shifts between training and test domains during inference time. However, existing TTA methods fall short in the realistic scenario where mode...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Back to Source: Open-Set Continual Test-Time Adaptation via Domain Compensation</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yingkai Yang, Chaoqi Chen, Hui Huang</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Test-Time Adaptation (TTA) aims to mitigate distributional shifts between training and test domains during inference time. However, existing TTA methods fall short in the realistic scenario where models face both continually changing domains and the simultaneous emergence of unknown semantic classes, a challenging setting we term Open-set Continual Test-Time Adaptation (OCTTA). The coupling of domain and semantic shifts often collapses the feature space, severely degrading both classification and out-of-distribution detection. To tackle this, we propose DOmain COmpensation (DOCO), a lightweight and effective framework that robustly performs domain adaptation and OOD detection in a synergistic, closed loop. DOCO first performs dynamic, adaptation-conditioned sample splitting to separate likely ID from OOD samples. Then, using only the ID samples, it learns a domain compensation prompt by aligning feature statistics with the source domain, guided by a structural preservation regularizer that prevents semantic distortion. This learned prompt is then propagated to the OOD samples within the same batch, effectively isolating their semantic novelty for more reliable detection. Extensive experiments on multiple challenging benchmarks demonstrate that DOCO outperforms prior CTTA and OSTTA methods, establishing a new state-of-the-art for the demanding OCTTA setting.</p>
  </div>
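  
  <div style="background: #f9f9f9; border-left: 4px solid #9E9E9E; padding: 15px; margin: 20px 0;">
    <p style="margin: 0 0 10px 0; color: #666;">💻 <strong>Illustrative sketch (not from the paper):</strong> the ID/OOD sample splitting can be pictured as a simple confidence-based rule. Maximum softmax probability with a per-batch quantile threshold is a stand-in assumption here, not the adaptation-conditioned criterion used in DOCO.</p>
    <pre style="background: #f0f0f0; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>
# Minimal sketch of splitting a test batch into likely in-distribution (ID)
# and out-of-distribution (OOD) samples from classifier confidence.
# The confidence measure and threshold rule are illustrative assumptions.
import torch
import torch.nn.functional as F

def split_id_ood(logits, quantile=0.5):
    probs = F.softmax(logits, dim=-1)
    confidence = probs.max(dim=-1).values          # max softmax probability
    threshold = torch.quantile(confidence, quantile)
    id_mask = confidence.ge(threshold)             # likely ID: high confidence
    return id_mask, ~id_mask

logits = torch.randn(32, 10)                       # a batch of test logits
id_mask, ood_mask = split_id_ood(logits)
print(int(id_mask.sum()), "likely ID,", int(ood_mask.sum()), "likely OOD")
</code></pre>
  </div>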
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21772v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21772v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Interpretable facial dynamics as behavioral and perceptual traces of deepfakes</title>
      <link>https://arxiv.org/abs/2604.21760v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21760v1</guid>
      <pubDate>Thu, 23 Apr 2026 15:07:30 GMT</pubDate>
      <dc:creator>Timothy Joseph Murphy, Jennifer Cook, Hélio Clemente José Cuve</dc:creator>
      <category>模型架构</category>
      <description>Deepfake detection research has largely converged on deep learning approaches that, despite strong benchmark performance, offer limited insight into what distinguishes real from manipulated facial beh...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Interpretable facial dynamics as behavioral and perceptual traces of deepfakes</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Timothy Joseph Murphy, Jennifer Cook, Hélio Clemente José Cuve</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Deepfake detection research has largely converged on deep learning approaches that, despite strong benchmark performance, offer limited insight into what distinguishes real from manipulated facial behavior. This study presents an interpretable alternative grounded in bio-behavioral features of facial dynamics and evaluates how computational detection strategies relate to human perceptual judgments. We identify core low-dimensional patterns of facial movement, from which temporal features characterizing spatiotemporal structure were derived. Traditional machine learning classifiers trained on these features achieved modest but significant above-chance deepfake classification, driven by higher-order temporal irregularities that were more pronounced in manipulated than real facial dynamics. Notably, detection was substantially more accurate for videos containing emotive expressions than those without. An emotional valence classification analysis further indicated that emotive signals are systematically degraded in deepfakes, explaining the differential impact of emotive dynamics on detection. Furthermore, we provide an additional and often overlooked dimension of explainability by assessing the relationship between model decisions and human perceptual detection. Model and human judgments converged for emotive but diverged for non-emotive videos, and even where outputs aligned, underlying detection strategies differed. These findings demonstrate that face-swapped deepfakes carry a measurable behavioral fingerprint, most salient during emotional expression. Additionally, model-human comparisons suggest that interpretable computational features and human perception may offer complementary rather than redundant routes to detection.</p>
  </div>
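  
  <div style="background: #f9f9f9; border-left: 4px solid #9E9E9E; padding: 15px; margin: 20px 0;">
    <p style="margin: 0 0 10px 0; color: #666;">💻 <strong>Illustrative sketch (not from the paper):</strong> one way to picture the higher-order temporal irregularities mentioned above is to summarize the velocity, acceleration, and jerk of a facial-movement signal. The specific statistics below are illustrative, not the feature set used in the study.</p>
    <pre style="background: #f0f0f0; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>
# Minimal sketch of interpretable temporal features from a facial-movement
# signal: successive differences give velocity, acceleration, and jerk, whose
# variability can expose higher-order temporal irregularities.
# The statistics and toy signals are illustrative assumptions.
import numpy as np

def temporal_irregularity_features(signal):
    velocity = np.diff(signal, n=1)
    acceleration = np.diff(signal, n=2)
    jerk = np.diff(signal, n=3)
    return {
        "vel_std": float(np.std(velocity)),
        "acc_std": float(np.std(acceleration)),
        "jerk_std": float(np.std(jerk)),
        "jerk_kurtosis": float(((jerk - jerk.mean()) ** 4).mean() / (jerk.std() ** 4 + 1e-8)),
    }

# `signal` could be one facial-movement component over time (e.g. an action-unit trace).
t = np.linspace(0, 4 * np.pi, 200)
real_like = np.sin(t) + 0.02 * np.random.randn(200)
fake_like = np.sin(t) + 0.02 * np.random.randn(200) + 0.3 * (np.random.rand(200) > 0.97)
print(temporal_irregularity_features(real_like))
print(temporal_irregularity_features(fake_like))
</code></pre>
  </div>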
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21760v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21760v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] [模型架构] Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection</title>
      <link>https://arxiv.org/abs/2604.21728v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21728v1</guid>
      <pubDate>Thu, 23 Apr 2026 14:33:27 GMT</pubDate>
      <dc:creator>Wenxuan Bao, Yanjun Zhao, Xiyuan Yang et al.</dc:creator>
      <category>多模态生成</category>
      <category>模型架构</category>
      <description>Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access t...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Wenxuan Bao, Yanjun Zhao, Xiyuan Yang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and prediction balance, which mitigates adaptation bias caused by skewed predictions. To improve efficiency, Ramen employs an embedding-gradient cache that stores the embeddings and sample-level gradients of past test images. The stored embeddings are used to retrieve relevant samples, and the corresponding gradients are aggregated for model updates, eliminating the need for any additional forward or backward passes. Our theoretical analysis provides insight into why the proposed adaptation mechanism is effective under mixed-domain shifts. Experiments on multiple image corruption and domain-shift benchmarks demonstrate that Ramen achieves strong and consistent performance, offering robust and efficient adaptation in complex mixed-domain scenarios. Our code is available at https://github.com/baowenxuan/Ramen .</p>
  </div>
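  
  <div style="background: #f9f9f9; border-left: 4px solid #9E9E9E; padding: 15px; margin: 20px 0;">
    <p style="margin: 0 0 10px 0; color: #666;">💻 <strong>Illustrative sketch (not from the paper):</strong> the embedding-gradient cache can be pictured as storing (embedding, gradient) pairs for past test samples, retrieving neighbours of the current sample by cosine similarity, and averaging their cached gradients. Shapes, capacity, and the retrieval rule below are assumptions for illustration.</p>
    <pre style="background: #f0f0f0; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>
# Minimal sketch of an embedding-gradient cache: store (embedding, gradient)
# pairs for past test samples, retrieve neighbours of the current sample by
# cosine similarity, and average their cached gradients for the update.
# All shapes and the similarity-based retrieval rule are illustrative.
import torch
import torch.nn.functional as F

class EmbeddingGradientCache:
    def __init__(self, capacity=512):
        self.capacity = capacity
        self.embeddings = []   # one L2-normalised embedding per past sample
        self.gradients = []    # one flattened gradient vector per past sample

    def add(self, embedding, gradient):
        self.embeddings.append(F.normalize(embedding, dim=0))
        self.gradients.append(gradient)
        if len(self.embeddings) > self.capacity:
            self.embeddings.pop(0)
            self.gradients.pop(0)

    def retrieve_update(self, query_embedding, k=8):
        emb = torch.stack(self.embeddings)                    # (N, D)
        sims = emb @ F.normalize(query_embedding, dim=0)      # cosine similarity
        k = min(k, sims.numel())
        idx = sims.topk(k).indices
        grads = torch.stack([self.gradients[i] for i in idx.tolist()])  # (k, P)
        return grads.mean(dim=0)                              # aggregated gradient

cache = EmbeddingGradientCache()
for _ in range(20):
    cache.add(torch.randn(128), torch.randn(1000))
update = cache.retrieve_update(torch.randn(128), k=4)
print(update.shape)
</code></pre>
  </div>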
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21728v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21728v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [图像生成] [模型架构] Conditional Diffusion Posterior Alignment for Sparse-View CT Reconstruction</title>
      <link>https://arxiv.org/abs/2604.21960v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21960v1</guid>
      <pubDate>Thu, 23 Apr 2026 14:09:13 GMT</pubDate>
      <dc:creator>Luis Barba, Johannes Kirschner, Benjamin Bejar</dc:creator>
      <category>扩散模型</category>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Computed Tomography (CT) is a widely used imaging modality in medical and industrial applications. To limit radiation exposure and measurement time, there is a growing interest in sparse-view CT, wher...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Conditional Diffusion Posterior Alignment for Sparse-View CT Reconstruction</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Luis Barba, Johannes Kirschner, Benjamin Bejar</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> diffusion model, u-net, dit, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Computed Tomography (CT) is a widely used imaging modality in medical and industrial applications. To limit radiation exposure and measurement time, there is a growing interest in sparse-view CT, where the number of projection views is significantly reduced. Deep neural networks have shown great promise in improving reconstruction quality in sparse-view CT, especially generative diffusion models. However, these methods struggle to scale to large 3D volumes due to several reasons: (i) the high memory and computational requirements of 3D models, (ii) the lack of large 3D training datasets, and (iii) the inconsistencies across slices when using 2D models independently on each slice. We overcome these limitations and scale diffusion-based sparse-view CT reconstruction to large 3D volumes by combining conditional diffusion with explicit data consistency. We propose Conditional Diffusion Posterior Alignment (CDPA) to enable scalable 3D sparse-view CT reconstruction. A 2D U-Net diffusion model is conditioned on an initial 3D reconstruction to improve inter-slice consistency, combined with data-consistency alignment to match measured projections. Experiments on synthetic and real Cone Beam CT (CBCT) data show state-of-the-art performance, with ablations that confirm the synergistic effects of the proposed pipeline. Finally, we show that the same principles also strengthen fast denoising U-Nets, yielding near-diffusion quality at a fraction of the computational cost.</p>
  </div>
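  
  <div style="background: #f9f9f9; border-left: 4px solid #9E9E9E; padding: 15px; margin: 20px 0;">
    <p style="margin: 0 0 10px 0; color: #666;">💻 <strong>Illustrative sketch (not from the paper):</strong> the explicit data-consistency idea can be pictured as a gradient step that pulls the current volume estimate toward the measured projections. A random linear operator stands in for the CT projector, and a single-step update simplifies the alignment procedure described above.</p>
    <pre style="background: #f0f0f0; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>
# Minimal sketch of a data-consistency step: nudge the current volume estimate
# so that its forward projections better match the measured sinogram.
# The operator, step size, and loop count are illustrative assumptions.
import torch

def data_consistency_step(x, forward_op, measurements, step_size=0.1):
    """One gradient step on 0.5 * ||A(x) - y||^2 with respect to x."""
    x = x.detach().requires_grad_(True)
    residual = forward_op(x) - measurements
    loss = 0.5 * (residual ** 2).sum()
    loss.backward()
    with torch.no_grad():
        x_new = x - step_size * x.grad
    return x_new.detach(), float(loss)

# Toy example: a random linear "projector" acting on a flattened volume.
A = torch.randn(64, 256)
forward_op = lambda v: A @ v
x_true = torch.randn(256)
y = forward_op(x_true)
x = torch.zeros(256)
for _ in range(5):
    x, loss = data_consistency_step(x, forward_op, y, step_size=0.002)
print(f"projection loss at step 5: {loss:.2f}")
</code></pre>
  </div>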
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21960v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21960v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition</title>
      <link>https://arxiv.org/abs/2604.21689v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21689v1</guid>
      <pubDate>Thu, 23 Apr 2026 13:55:22 GMT</pubDate>
      <dc:creator>Kwan Yun, Changmin Lee, Ayeong Jeong et al.</dc:creator>
      <category>扩散模型</category>
      <description>Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable identity. However, current identity encoders, which a...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">StyleID: A Perception-Aware Dataset and Metric for Stylization-Agnostic Facial Identity Recognition</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Kwan Yun, Changmin Lee, Ayeong Jeong et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Creative face stylization aims to render portraits in diverse visual idioms such as cartoons, sketches, and paintings while retaining recognizable identity. However, current identity encoders, which are typically trained and calibrated on natural photographs, exhibit severe brittleness under stylization. They often mistake changes in texture or color palette for identity drift or fail to detect geometric exaggerations. This reveals the lack of a style-agnostic framework to evaluate and supervise identity consistency across varying styles and strengths. To address this gap, we introduce StyleID, a human perception-aware dataset and evaluation framework for facial identity under stylization. StyleID comprises two datasets: (i) StyleBench-H, a benchmark that captures human same-different verification judgments across diffusion- and flow-matching-based stylization at multiple style strengths, and (ii) StyleBench-S, a supervision set derived from psychometric recognition-strength curves obtained through controlled two-alternative forced-choice (2AFC) experiments. Leveraging StyleBench-S, we fine-tune existing semantic encoders to align their similarity orderings with human perception across styles and strengths. Experiments demonstrate that our calibrated models yield significantly higher correlation with human judgments and enhanced robustness for out-of-domain, artist drawn portraits. All of our datasets, code, and pretrained models are publicly available at https://kwanyun.github.io/StyleID_page/</p>
  </div>
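  
  <div style="background: #f9f9f9; border-left: 4px solid #9E9E9E; padding: 15px; margin: 20px 0;">
    <p style="margin: 0 0 10px 0; color: #666;">💻 <strong>Illustrative sketch (not from the paper):</strong> a psychometric recognition-strength curve from 2AFC trials can be pictured as the fraction of trials, at each style strength, in which the true identity is chosen over a distractor. The trial data below are invented.</p>
    <pre style="background: #f0f0f0; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>
# Minimal sketch of turning 2AFC trials into a recognition-strength curve:
# at each style strength, recognition strength is the fraction of trials in
# which the observer picks the true-identity image over the distractor.
# The trial format and values are illustrative assumptions.
from collections import defaultdict

# Each trial: (style_strength, chose_true_identity)
trials = [
    (0.2, True), (0.2, True), (0.2, True), (0.2, False),
    (0.5, True), (0.5, True), (0.5, False), (0.5, False),
    (0.8, True), (0.8, False), (0.8, False), (0.8, False),
]

by_strength = defaultdict(list)
for strength, correct in trials:
    by_strength[strength].append(correct)

curve = {s: sum(v) / len(v) for s, v in sorted(by_strength.items())}
print(curve)   # e.g. {0.2: 0.75, 0.5: 0.5, 0.8: 0.25}
</code></pre>
  </div>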
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21689v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21689v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[视频生成] [多模态生成] [模型架构] WorldMark: A Unified Benchmark Suite for Interactive Video World Models</title>
      <link>https://arxiv.org/abs/2604.21686v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21686v1</guid>
      <pubDate>Thu, 23 Apr 2026 13:50:47 GMT</pubDate>
      <dc:creator>Xiaojie Xu, Zhengyuan Lin, Kang He et al.</dc:creator>
      <category>视频生成</category>
      <category>多模态生成</category>
      <category>模型架构</category>
      <description>Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">WorldMark: A Unified Benchmark Suite for Interactive Video World Models</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Xiaojie Xu, Zhengyuan Lin, Kang He et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 视频生成, 多模态生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vlm, video generation, dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Interactive video generation models such as Genie, YUME, HY-World, and Matrix-Game are advancing rapidly, yet every model is evaluated on its own benchmark with private scenes and trajectories, making fair cross-model comparison impossible. Existing public benchmarks offer useful metrics such as trajectory error, aesthetic scores, and VLM-based judgments, but none supplies the standardized test conditions -- identical scenes, identical action sequences, and a unified control interface -- needed to make those metrics comparable across models with heterogeneous inputs. We introduce WorldMark, the first benchmark that provides such a common playing field for interactive Image-to-Video world models. WorldMark contributes: (1) a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model&apos;s native control format, enabling apples-to-apples comparison across six major models on identical scenes and trajectories; (2) a hierarchical test suite of 500 evaluation cases covering first- and third-person viewpoints, photorealistic and stylized scenes, and three difficulty tiers from Easy to Hard spanning 20-60s; and (3) a modular evaluation toolkit for Visual Quality, Control Alignment, and World Consistency, designed so that researchers can reuse our standardized inputs while plugging in their own metrics as the field evolves. We will release all data, evaluation code, and model outputs to facilitate future research. Beyond offline metrics, we launch World Model Arena (warena.ai), an online platform where anyone can pit leading world models against each other in side-by-side battles and watch the live leaderboard.</p>
  </div>
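  
  <div style="background: #f9f9f9; border-left: 4px solid #9E9E9E; padding: 15px; margin: 20px 0;">
    <p style="margin: 0 0 10px 0; color: #666;">💻 <strong>Illustrative sketch (not from the paper):</strong> the unified action-mapping layer can be pictured as per-model adapters that translate a shared WASD-style vocabulary into the native control format of each model. The two target formats below are invented placeholders; only the translation pattern is the point.</p>
    <pre style="background: #f0f0f0; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>
# Minimal sketch of a unified action-mapping layer: a shared WASD-style
# vocabulary is translated into a per-model native control format by small
# adapters. The target formats below are invented placeholders.
SHARED_VOCAB = ["forward", "backward", "left", "right", "camera_up", "camera_down"]

def adapter_model_a(action):
    # Hypothetical model expecting discrete key tokens.
    keys = {"forward": "W", "backward": "S", "left": "A", "right": "D",
            "camera_up": "UP", "camera_down": "DOWN"}
    return {"key": keys[action]}

def adapter_model_b(action):
    # Hypothetical model expecting a continuous (dx, dy, yaw) vector.
    vectors = {"forward": (0, 1, 0), "backward": (0, -1, 0),
               "left": (-1, 0, 0), "right": (1, 0, 0),
               "camera_up": (0, 0, 1), "camera_down": (0, 0, -1)}
    return {"motion": vectors[action]}

ADAPTERS = {"model_a": adapter_model_a, "model_b": adapter_model_b}

def map_trajectory(actions, model_name):
    return [ADAPTERS[model_name](a) for a in actions]

trajectory = ["forward", "forward", "left", "camera_up"]
print(map_trajectory(trajectory, "model_a"))
print(map_trajectory(trajectory, "model_b"))
</code></pre>
  </div>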
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21686v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21686v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 视频生成, 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Sapiens2</title>
      <link>https://arxiv.org/abs/2604.21681v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21681v1</guid>
      <pubDate>Thu, 23 Apr 2026 13:45:32 GMT</pubDate>
      <dc:creator>Rawal Khirodkar, He Wen, Julieta Martinez et al.</dc:creator>
      <category>模型架构</category>
      <description>We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 bill...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Sapiens2</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Rawal Khirodkar, He Wen, Julieta Martinez et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> transformer</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state-of-the-art and improves over the first generation on pose (+4 mAP), body-part segmentation (+24.3 mIoU), normal estimation (45.6% lower angular error) and extends to new tasks such as pointmap and albedo estimation. Code: https://github.com/facebookresearch/sapiens2</p>
  </div>
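  
  <div style="background: #f9f9f9; border-left: 4px solid #9E9E9E; padding: 15px; margin: 20px 0;">
    <p style="margin: 0 0 10px 0; color: #666;">💻 <strong>Illustrative sketch (not from the paper):</strong> a pretraining objective that combines masked-patch reconstruction with a self-distilled contrastive term could look roughly like the sketch below. Loss weights, temperature, and tensor shapes are assumptions, not the configuration used in Sapiens2.</p>
    <pre style="background: #f0f0f0; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>
# Minimal sketch of a combined pretraining objective: masked-patch
# reconstruction plus a self-distilled contrastive term between student and
# (EMA) teacher embeddings. All weights and shapes are illustrative.
import torch
import torch.nn.functional as F

def combined_pretraining_loss(pred_patches, target_patches, mask,
                              student_emb, teacher_emb, temperature=0.1, w_con=1.0):
    # 1) Reconstruction: MSE only on the masked patches (mask == 1).
    per_patch = ((pred_patches - target_patches) ** 2).mean(dim=-1)
    recon = (per_patch * mask).sum() / mask.sum().clamp(min=1)
    # 2) Self-distilled contrastive: each student embedding should match the
    #    teacher embedding of the same image (InfoNCE over the batch).
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = (s @ t.T) / temperature
    targets = torch.arange(s.size(0))
    contrastive = F.cross_entropy(logits, targets)
    return recon + w_con * contrastive

B, N, D, E = 8, 196, 768, 512
loss = combined_pretraining_loss(
    torch.randn(B, N, D), torch.randn(B, N, D), torch.randint(0, 2, (B, N)).float(),
    torch.randn(B, E), torch.randn(B, E))
print(float(loss))
</code></pre>
  </div>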
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21681v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21681v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] [模型架构] Encoder-Free Human Motion Understanding via Structured Motion Descriptions</title>
      <link>https://arxiv.org/abs/2604.21668v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21668v1</guid>
      <pubDate>Thu, 23 Apr 2026 13:33:28 GMT</pubDate>
      <dc:creator>Yao Zhang, Zhuchenyang Liu, Thomas Ploetz et al.</dc:creator>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answeri...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Encoder-Free Human Motion Understanding via Structured Motion Descriptions</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yao Zhang, Zhuchenyang Liu, Thomas Ploetz et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, lora</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">The world knowledge and reasoning capabilities of text-based large language models (LLMs) are advancing rapidly, yet current approaches to human motion understanding, including motion question answering and captioning, have not fully exploited these capabilities. Existing LLM-based methods typically learn motion-language alignment through dedicated encoders that project motion features into the LLM&apos;s embedding space, remaining constrained by cross-modal representation and alignment. Inspired by biomechanical analysis, where joint angles and body-part kinematics have long served as a precise descriptive language for human movement, we propose \textbf{Structured Motion Description (SMD)}, a rule-based, deterministic approach that converts joint position sequences into structured natural language descriptions of joint angles, body part movements, and global trajectory. By representing motion as text, SMD enables LLMs to apply their pretrained knowledge of body parts, spatial directions, and movement semantics directly to motion reasoning, without requiring learned encoders or alignment modules. We show that this approach goes beyond state-of-the-art results on both motion question answering (66.7\% on BABEL-QA, 90.1\% on HuMMan-QA) and motion captioning (R@1 of 0.584, CIDEr of 53.16 on HumanML3D), surpassing all prior methods. SMD additionally offers practical benefits: the same text input works across different LLMs with only lightweight LoRA adaptation (validated on 8 LLMs from 6 model families), and its human-readable representation enables interpretable attention analysis over motion descriptions. Code, data, and pretrained LoRA adapters are available at https://yaozhang182.github.io/motion-smd/.</p>
  </div>
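  
  <div style="background: #f9f9f9; border-left: 4px solid #9E9E9E; padding: 15px; margin: 20px 0;">
    <p style="margin: 0 0 10px 0; color: #666;">💻 <strong>Illustrative sketch (not from the paper):</strong> a rule-based motion-to-text conversion can be pictured as computing a joint angle from three 3D joint positions and rendering a templated sentence. Joint names, thresholds, and wording below are illustrative rather than the exact SMD rules.</p>
    <pre style="background: #f0f0f0; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>
# Minimal sketch of a rule-based motion-to-text conversion: compute a joint
# angle from three 3D joint positions and render it as a templated sentence.
# Joint names, thresholds, and wording are illustrative assumptions.
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by the segments from b to a and from b to c."""
    v1, v2 = np.asarray(a) - np.asarray(b), np.asarray(c) - np.asarray(b)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def describe_elbow(shoulder, elbow, wrist, side="right"):
    angle = joint_angle(shoulder, elbow, wrist)
    state = "fully extended" if angle > 150 else "bent"
    return f"The {side} elbow is {state} at about {angle:.0f} degrees."

# Toy frame: shoulder, elbow, wrist positions in metres.
print(describe_elbow([0.2, 1.4, 0.0], [0.45, 1.15, 0.0], [0.7, 1.35, 0.0]))
</code></pre>
  </div>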
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21668v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21668v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures</title>
      <link>https://arxiv.org/abs/2604.21631v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21631v1</guid>
      <pubDate>Thu, 23 Apr 2026 12:50:13 GMT</pubDate>
      <dc:creator>Xu Wang, Zhiru Wang, Shiyun Xie et al.</dc:creator>
      <category>多模态生成</category>
      <description>While 3D Gaussian Splatting (3DGS) achieves real-time photorealistic rendering, its performance degrades significantly when training images contain transient objects that violate multi-view consistenc...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">DualSplat: Robust 3D Gaussian Splatting via Pseudo-Mask Bootstrapping from Reconstruction Failures</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Xu Wang, Zhiru Wang, Shiyun Xie et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> nerf</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">While 3D Gaussian Splatting (3DGS) achieves real-time photorealistic rendering, its performance degrades significantly when training images contain transient objects that violate multi-view consistency. Existing methods face a circular dependency: accurate transient detection requires a well-reconstructed static scene, while clean reconstruction itself depends on reliable transient masks. We address this challenge with DualSplat, a Failure-to-Prior framework that converts first-pass reconstruction failures into explicit priors for a second reconstruction stage. We observe that transients, which appear in only a subset of views, often manifest as incomplete fragments during conservative initial training. We exploit these failures to construct object-level pseudo-masks by combining photometric residuals, feature mismatches, and SAM2 instance boundaries. These pseudo-masks then guide a clean second-pass 3DGS optimization, while a lightweight MLP refines them online by gradually shifting from prior supervision to self-consistency. Experiments on RobustNeRF and NeRF On-the-go show that DualSplat outperforms existing baselines, demonstrating particularly clear advantages in transient-heavy scenes and transient regions.</p>
  </div>
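  
  <div style="background: #f9f9f9; border-left: 4px solid #9E9E9E; padding: 15px; margin: 20px 0;">
    <p style="margin: 0 0 10px 0; color: #666;">💻 <strong>Illustrative sketch (not from the paper):</strong> the pseudo-mask construction can be pictured as flagging pixels where the rendered static scene disagrees photometrically with the observation and where features mismatch. Thresholds are illustrative; the paper additionally snaps such maps to SAM2 instance boundaries to obtain object-level masks.</p>
    <pre style="background: #f0f0f0; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>
# Minimal sketch of a transient pseudo-mask from reconstruction failures:
# pixels with large photometric residual and feature mismatch are flagged as
# likely transient. Thresholds and toy inputs are illustrative assumptions.
import numpy as np

def pseudo_transient_mask(rendered, observed, feat_mismatch, tau_photo=0.15, tau_feat=0.5):
    photometric = np.abs(rendered - observed).mean(axis=-1)   # per-pixel residual
    photo_flag = np.greater(photometric, tau_photo)
    feat_flag = np.greater(feat_mismatch, tau_feat)
    return np.logical_and(photo_flag, feat_flag).astype(np.uint8)

H, W = 120, 160
rendered = np.random.rand(H, W, 3)
observed = rendered.copy()
observed[40:80, 60:100] = np.random.rand(40, 40, 3)           # a "transient" patch
feat_mismatch = np.zeros((H, W))
feat_mismatch[40:80, 60:100] = 0.9
mask = pseudo_transient_mask(rendered, observed, feat_mismatch)
print(mask.sum(), "pixels flagged as transient")
</code></pre>
  </div>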
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21631v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21631v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [图像生成] [模型架构] DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion</title>
      <link>https://arxiv.org/abs/2604.21627v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21627v1</guid>
      <pubDate>Thu, 23 Apr 2026 12:46:07 GMT</pubDate>
      <dc:creator>Tahar Chettaoui, Eduarda Caldeira, Guray Ozgur et al.</dc:creator>
      <category>扩散模型</category>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Advancing face morphing attack techniques is crucial to anticipate evolving threats and develop robust defensive mechanisms for identity verification systems. This work introduces DCMorph, a dual-stre...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">DCMorph: Face Morphing via Dual-Stream Cross-Attention Diffusion</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Tahar Chettaoui, Eduarda Caldeira, Guray Ozgur et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> gan, diffusion model, latent diffusion, dit, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Advancing face morphing attack techniques is crucial to anticipate evolving threats and develop robust defensive mechanisms for identity verification systems. This work introduces DCMorph, a dual-stream diffusion-based morphing framework that simultaneously operates at both identity conditioning and latent space levels. Unlike image-level methods suffering from blending artifacts or GAN-based approaches with limited reconstruction fidelity, DCMorph leverages identity-conditioned latent diffusion models through two mechanisms: (1) decoupled cross-attention interpolation that injects identity-specific features from both source faces into the denoising process, enabling explicit dual-identity conditioning absent in existing diffusion-based methods, and (2) DDIM inversion with spherical interpolation between inverted latent representations from both source faces, providing geometrically consistent initial latent representation that preserves structural attributes. Vulnerability analyses across four state-of-the-art face recognition systems demonstrate that DCMorph achieves the highest attack success rates compared to existing methods at both operational thresholds, while remaining challenging to detect by current morphing attack detection solutions.</p>
  </div>
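  
  <div style="background: #f9f9f9; border-left: 4px solid #9E9E9E; padding: 15px; margin: 20px 0;">
    <p style="margin: 0 0 10px 0; color: #666;">💻 <strong>Illustrative sketch (not from the paper):</strong> spherical linear interpolation (slerp) between two inverted latents is a standard operation and can be sketched as below. The latents here are random placeholders rather than actual DDIM inversions.</p>
    <pre style="background: #f0f0f0; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>
# Minimal sketch of spherical linear interpolation (slerp) between two latent
# tensors, the kind of geometric blend used to initialise a morph. The latents
# below are random placeholders, not DDIM inversions of real faces.
import torch

def slerp(z0, z1, alpha, eps=1e-7):
    """Spherically interpolate between flattened latents z0 and z1."""
    a, b = z0.flatten(), z1.flatten()
    a_n, b_n = a / a.norm(), b / b.norm()
    omega = torch.arccos(torch.clamp(torch.dot(a_n, b_n), -1 + eps, 1 - eps))
    so = torch.sin(omega)
    out = (torch.sin((1 - alpha) * omega) / so) * a + (torch.sin(alpha * omega) / so) * b
    return out.reshape(z0.shape)

z_a = torch.randn(4, 64, 64)     # stand-in latent for source face A
z_b = torch.randn(4, 64, 64)     # stand-in latent for source face B
z_morph = slerp(z_a, z_b, alpha=0.5)
print(z_morph.shape)
</code></pre>
  </div>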
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21627v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21627v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [多模态生成] [模型架构] Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers</title>
      <link>https://arxiv.org/abs/2604.21592v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21592v1</guid>
      <pubDate>Thu, 23 Apr 2026 12:18:55 GMT</pubDate>
      <dc:creator>Minghao Yin, Wenbo Hu, Jiale Xu et al.</dc:creator>
      <category>扩散模型</category>
      <category>多模态生成</category>
      <category>模型架构</category>
      <description>Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and p...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Sculpt4D: Generating 4D Shapes via Sparse-Attention Diffusion Transformers</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Minghao Yin, Wenbo Hu, Jiale Xu et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 多模态生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> transformer, 3d diffusion, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Recent breakthroughs in 3D generative modeling have yielded remarkable progress in static shape synthesis, yet high-fidelity dynamic 4D generation remains elusive, hindered by temporal artifacts and prohibitive computational demand. We present Sculpt4D, a native 4D generative framework that seamlessly integrates efficient temporal modeling into a pretrained 3D Diffusion Transformer (Hunyuan3D 2.1), thereby mitigating the scarcity of 4D training data. At its core lies a Block Sparse Attention mechanism that preserves object identity by anchoring to the initial frame while capturing rich motion dynamics via a time-decaying sparse mask. This design faithfully models complex spatiotemporal dependencies with high fidelity, while sidestepping the quadratic overhead of full attention and reducing network total computation by 56%. Consequently, Sculpt4D establishes a new state-of-the-art in temporally coherent 4D synthesis and charts a path toward efficient and scalable 4D generation.</p>
  </div>
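  
  <div style="background: #f9f9f9; border-left: 4px solid #9E9E9E; padding: 15px; margin: 20px 0;">
    <p style="margin: 0 0 10px 0; color: #666;">💻 <strong>Illustrative sketch (not from the paper):</strong> a time-decaying sparse attention mask over frames, anchored to the initial frame, can be pictured as below. The block structure and decay schedule of the actual model are not reproduced here.</p>
    <pre style="background: #f0f0f0; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code>
# Minimal sketch of a frame-level sparse attention mask that always attends
# to the initial (anchor) frame and keeps long-range links with a probability
# that decays with temporal distance. Window size and decay are illustrative.
import numpy as np

def time_decaying_mask(num_frames, local_window=2, decay=0.5, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((num_frames, num_frames), dtype=bool)
    for t in range(num_frames):
        for s in range(num_frames):
            dist = abs(t - s)
            if s == 0 or local_window >= dist:          # anchor frame + local window
                mask[t, s] = True
            else:                                        # decaying long-range links
                mask[t, s] = decay ** dist > rng.random()
    return mask

mask = time_decaying_mask(16)
print(f"attention density: {mask.mean():.2f} (full attention would be 1.00)")
</code></pre>
  </div>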
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21592v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21592v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction</title>
      <link>https://arxiv.org/abs/2604.21575v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21575v1</guid>
      <pubDate>Thu, 23 Apr 2026 11:55:19 GMT</pubDate>
      <dc:creator>Zeyu Cai, Yuliang Xiu, Renke Wang et al.</dc:creator>
      <category>模型架构</category>
      <description>Fitting an underlying body model to 3D clothed human assets has been extensively studied, yet most approaches focus on either single-modal inputs such as point clouds or multi-view images alone, often...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Zeyu Cai, Yuliang Xiu, Renke Wang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, transformer</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Fitting an underlying body model to 3D clothed human assets has been extensively studied, yet most approaches focus on either single-modal inputs such as point clouds or multi-view images alone, often requiring a known metric scale. This constraint is frequently impractical, especially for AI-generated assets where scale distortion is common. We propose OmniFit, a method that can seamlessly handle diverse multi-modal inputs, including full scans, partial depth observations, and image captures, while remaining scale-agnostic for both real and synthetic assets. Our key innovation is a simple yet effective conditional transformer decoder that directly maps surface points to dense body landmarks, which are then used for SMPL-X parameter fitting. In addition, an optional plug-and-play image adapter incorporates visual cues to compensate for missing geometric information. We further introduce a dedicated scale predictor that rescales subjects to canonical body proportions. OmniFit substantially outperforms state-of-the-art methods by 57.1 to 80.9 percent across daily and loose clothing scenarios. To the best of our knowledge, it is the first body fitting method to surpass multi-view optimization baselines and the first to achieve millimeter-level accuracy on the CAPE and 4D-DRESS benchmarks.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21575v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21575v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Model Architecture] CHRep: Cross-modal Histology Representation and Post-hoc Calibration for Spatial Gene Expression Prediction</title>
      <link>https://arxiv.org/abs/2604.21573v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21573v1</guid>
      <pubDate>Thu, 23 Apr 2026 11:53:07 GMT</pubDate>
      <dc:creator>Changfan Wang, Xinran Wang, Donghai Liu et al.</dc:creator>
      <category>Model Architecture</category>
      <description>Spatial transcriptomics (ST) enables spatially resolved gene profiling but remains expensive and low-throughput, limiting large-cohort studies and routine clinical use. Predicting spatial gene express...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">CHRep: Cross-modal Histology Representation and Post-hoc Calibration for Spatial Gene Expression Prediction</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Changfan Wang, Xinran Wang, Donghai Liu et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> mae</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Spatial transcriptomics (ST) enables spatially resolved gene profiling but remains expensive and low-throughput, limiting large-cohort studies and routine clinical use. Predicting spatial gene expression from routine hematoxylin and eosin (H&amp;E) slides is a promising alternative, yet under realistic leave-one-slide-out evaluation, existing models often suffer from slide-level appearance shifts and regression-driven over-smoothing that suppress biologically meaningful variation. CHRep is a two-phase framework for robust histology-to-expression prediction. In the training phase, CHRep learns a structure-aware representation by jointly optimizing correlation-aware regression, symmetric image-expression alignment, and coordinate-induced spatial topology regularization. In the inference phase, cross-slide robustness is improved without backbone fine-tuning through a lightweight calibration module trained on the training slides, which combines a non-parametric estimate from a training gallery with a magnitude-regularized correction module. Unlike prior embedding-alignment or retrieval-based transfer methods that rely on a single prediction route, CHRep couples topology-preserving representation learning with post-hoc calibration, enabling stable neighborhood retrieval and controlled bias correction under slide-level shifts. Across the three cohorts, CHRep consistently improves gene-wise correlation under leave-one-slide-out evaluation, with the largest gains observed on Alex+10x. Relative to HAGE, the Pearson correlation coefficient on all considered genes [PCC(ACG)] increases by 4.0% on cSCC and 9.8% on HER2+. Relative to mclSTExp, PCC(ACG) further improves by 39.5% on Alex+10x, together with 9.7% and 9.0% reductions in mean squared error (MSE) and mean absolute error (MAE), respectively.</p>
  </div>
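  
  <div style="background: #fafafa; padding: 20px; border-left: 4px solid #607D8B; margin: 20px 0;">
    <h3 style="color: #607D8B; margin-top: 0;">💻 Code Sketch</h3>
    <p style="line-height: 1.8; color: #333;">The post-hoc calibration described above lends itself to a compact sketch. The code below is a minimal illustration of one plausible reading, not the paper's implementation: a k-nearest-neighbour estimate over a training gallery is blended with the backbone prediction, and the correction magnitude is clipped as a crude stand-in for the magnitude regularizer. All function and parameter names are hypothetical.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code># Hypothetical sketch of gallery-based post-hoc calibration; names and the
# clipping rule are illustrative, not the paper's actual API.
import numpy as np

def knn_gallery_estimate(query_feats, gallery_feats, gallery_exprs, k=16):
    """Non-parametric estimate: mean expression of the k most similar
    gallery patches in feature space."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sim = q @ g.T                           # (n_query, n_gallery) cosine sim
    idx = np.argsort(-sim, axis=1)[:, :k]   # indices of top-k neighbours
    return gallery_exprs[idx].mean(axis=1)  # (n_query, n_genes)

def calibrate(pred, query_feats, gallery_feats, gallery_exprs,
              alpha=0.5, max_corr=1.0):
    """Blend the backbone prediction with the gallery estimate; clip the
    correction as a stand-in for magnitude regularization."""
    ref = knn_gallery_estimate(query_feats, gallery_feats, gallery_exprs)
    correction = np.clip(alpha * (ref - pred), -max_corr, max_corr)
    return pred + correction</code></pre>
  </div>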
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21573v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21573v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Model Architecture] Deep kernel video approximation for unsupervised action segmentation</title>
      <link>https://arxiv.org/abs/2604.21572v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21572v1</guid>
      <pubDate>Thu, 23 Apr 2026 11:52:56 GMT</pubDate>
      <dc:creator>Silvia L. Pintea, Jouke Dijkstra</dc:creator>
      <category>Model Architecture</category>
      <description>This work focuses on per-video unsupervised action segmentation, which is of interest to applications where storing large datasets is either not possible or not permitted. We propose to segment video...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Deep kernel video approximation for unsupervised action segmentation</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Silvia L. Pintea, Jouke Dijkstra</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">This work focuses on per-video unsupervised action segmentation, which is of interest to applications where storing large datasets is either not possible, or nor permitted. We propose to segment videos by learning in deep kernel space, to approximate the underlying frame distribution, as closely as possible. To define this closeness metric between the original video distribution and its approximation, we rely on maximum mean discrepancy (MMD) which is a geometry-preserving metric in distribution space, and thus gives more reliable estimates. Moreover, unlike the commonly used optimal transport metric, MMD is both easier to optimize, and faster. We choose to use neural tangent kernels (NTKs) to define the kernel space where MMD operates, because of their improved descriptive power as opposed to fixed kernels. And, also, because NTKs sidestep the trivial solution, when jointly learning the inputs (video approximation) and the kernel function. Finally, we show competitive results when compared to state-of-the-art per-video methods, on six standard benchmarks. Additionally, our method has higher F1 scores than prior agglomerative work, when the number of segments is unknown.</p>
  </div>
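  
  <div style="background: #fafafa; padding: 20px; border-left: 4px solid #607D8B; margin: 20px 0;">
    <h3 style="color: #607D8B; margin-top: 0;">💻 Code Sketch</h3>
    <p style="line-height: 1.8; color: #333;">As a rough illustration of the closeness metric used above, the sketch below computes a (biased) squared MMD between two sets of frame features. An RBF kernel stands in for the paper's neural tangent kernel, and all shapes and values are illustrative.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code># Biased V-statistic estimate of squared MMD between frame sets x and y,
# with an RBF kernel standing in for the paper's neural tangent kernel.
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # x: (n, d), y: (m, d) -> (n, m) Gram matrix
    sq_dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dist)

def mmd2(x, y, gamma=1.0):
    """Squared maximum mean discrepancy between samples x and y."""
    kxx = rbf_kernel(x, x, gamma).mean()
    kyy = rbf_kernel(y, y, gamma).mean()
    kxy = rbf_kernel(x, y, gamma).mean()
    return kxx + kyy - 2.0 * kxy

# toy usage: compare original frame features with an approximation
frames = np.random.randn(200, 64)
approx = frames + 0.1 * np.random.randn(200, 64)
print(mmd2(frames, approx))   # small value = close distributions</code></pre>
  </div>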
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21572v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21572v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Model Architecture] Component-Based Out-of-Distribution Detection</title>
      <link>https://arxiv.org/abs/2604.21546v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21546v1</guid>
      <pubDate>Thu, 23 Apr 2026 11:19:39 GMT</pubDate>
      <dc:creator>Wenrui Liu, Hong Chang, Ruibing Hou et al.</dc:creator>
      <category>Model Architecture</category>
      <description>Out-of-Distribution (OOD) detection requires sensitivity to subtle shifts without overreacting to natural In-Distribution (ID) diversity. However, from the viewpoint of detection granularity, global r...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Component-Based Out-of-Distribution Detection</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Wenrui Liu, Hong Chang, Ruibing Hou et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Out-of-Distribution (OOD) detection requires sensitivity to subtle shifts without overreacting to natural In-Distribution (ID) diversity. However, from the viewpoint of detection granularity, global representation inevitably suppress local OOD cues, while patch-based methods are unstable due to entangled spurious-correlation and noise. And neither them is effective in detecting compositional OODs composed of valid ID components. Inspired by recognition-by-components theory, we present a training-free Component-Based OOD Detection (CoOD) framework that addresses the existing limitations by decomposing inputs into functional components. To instantiate CoOD, we derive Component Shift Score (CSS) to detect local appearance shifts, and Compositional Consistency Score (CCS) to identify cross-component compositional inconsistencies. Empirically, CoOD achieves consistent improvements on both coarse- and fine-grained OOD detection.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21546v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21546v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Image Generation] [Multimodal Generation] Seeing Isn&apos;t Believing: Uncovering Blind Spots in Evaluator Vision-Language Models</title>
      <link>https://arxiv.org/abs/2604.21523v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21523v1</guid>
      <pubDate>Thu, 23 Apr 2026 10:36:50 GMT</pubDate>
      <dc:creator>Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand et al.</dc:creator>
      <category>Image Generation</category>
      <category>Multimodal Generation</category>
      <description>Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Seeing Isn&apos;t Believing: Uncovering Blind Spots in Evaluator Vision-Language Models</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Mohammed Safi Ur Rahman Khan, Sanjay Suryanarayanan, Tushar Anand et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成, 多模态生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> text-to-image, vlm, t2i, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Large Vision-Language Models (VLMs) are increasingly used to evaluate outputs of other models, for image-to-text (I2T) tasks such as visual question answering, and text-to-image (T2I) generation tasks. Despite this growing reliance, the reliability of these Evaluator VLMs remains under explored. In this work, we systematically evaluate the reliability of Evaluator VLMs across both I2T and T2I tasks. We introduce targeted perturbations that degrade output quality along key error dimensions, including object hallucinations, spatial reasoning, factual grounding, and visual fidelity. These perturbations test whether Evaluator VLMs can reliably account for these quality degrading errors in their evaluations. Using a comprehensive benchmark of over 4000 perturbed instances spanning 40 perturbation dimensions, we evaluate 4 prominent VLMs using single-answer scoring, pairwise comparison, and reference-guided paradigms. Our findings reveal that current VLM evaluators exhibit substantial blind spots: they often fail to detect perturbed outputs - in some cases exceeding 50%, struggle particularly with fine-grained compositional and spatial errors, and are often insensitive to hallucinated content that contradicts the input image. Pairwise comparison proves more reliable, though failure rates persist. These results highlight the unreliable nature of current Evaluator VLMs and urge caution in their deployment for benchmarking and development decisions. Code and data have been made publicly available.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21523v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21523v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成, 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Diffusion Models] [Image Generation] [Model Architecture] DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction</title>
      <link>https://arxiv.org/abs/2604.21518v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21518v1</guid>
      <pubDate>Thu, 23 Apr 2026 10:27:54 GMT</pubDate>
      <dc:creator>Shiyan Su, Ruyi Zha, Danli Shi et al.</dc:creator>
      <category>Diffusion Models</category>
      <category>Image Generation</category>
      <category>Model Architecture</category>
      <description>Neural representations (NRs), such as neural fields and 3D Gaussians, effectively model volumetric data in computed tomography (CT) but suffer from severe artifacts under sparse-view settings. To addr...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Shiyan Su, Ruyi Zha, Danli Shi et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> diffusion model, dit, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Neural representations (NRs), such as neural fields and 3D Gaussians, effectively model volumetric data in computed tomography (CT) but suffer from severe artifacts under sparse-view settings. To address this, we propose DiffNR, a novel framework that enhances NR optimization with diffusion priors. At its core is SliceFixer, a single-step diffusion model designed to correct artifacts in degraded slices. We integrate specialized conditioning layers into the network and develop tailored data curation strategies to support model finetuning. During reconstruction, SliceFixer periodically generates pseudo-reference volumes, providing auxiliary 3D perceptual supervision to fix underconstrained regions. Compared to prior methods that embed CT solvers into time-consuming iterative denoising, our repair-and-augment strategy avoids frequent diffusion model queries, leading to better runtime performance. Extensive experiments show that DiffNR improves PSNR by 3.99 dB on average, generalizes well across domains, and maintains efficient optimization.</p>
  </div>
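  
  <div style="background: #fafafa; padding: 20px; border-left: 4px solid #607D8B; margin: 20px 0;">
    <h3 style="color: #607D8B; margin-top: 0;">💻 Code Sketch</h3>
    <p style="line-height: 1.8; color: #333;">The repair-and-augment strategy can be read as a training loop in which the one-step repair model is queried only periodically, instead of at every solver step. The skeleton below is a hypothetical sketch under that reading; nr, slice_fixer, the loss weight, and the schedule are all stand-ins, not the released code.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code># Hypothetical skeleton of the repair-and-augment schedule: the one-step
# SliceFixer is queried only every `repair_every` steps, so the diffusion
# model stays out of the per-step optimization loop. All callables are
# stand-ins for the paper's components.
def optimize_nr(nr, slice_fixer, projections, steps=5000, repair_every=500):
    pseudo_ref = None
    for step in range(steps):
        loss = nr.render_loss(projections)       # data-fidelity term
        if pseudo_ref is not None:
            # auxiliary 3D perceptual supervision from the repaired volume
            loss = loss + 0.1 * nr.perceptual_loss(pseudo_ref)
        nr.step(loss)                             # one gradient update
        if (step + 1) % repair_every == 0:
            degraded = nr.render_volume()
            pseudo_ref = slice_fixer(degraded)    # single-step repair
    return nr</code></pre>
  </div>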
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21518v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21518v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Diffusion Models] [Model Architecture] VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection</title>
      <link>https://arxiv.org/abs/2604.21502v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21502v1</guid>
      <pubDate>Thu, 23 Apr 2026 10:04:36 GMT</pubDate>
      <dc:creator>Yupeng Zhang, Ruize Han, Ningnan Guo et al.</dc:creator>
      <category>Diffusion Models</category>
      <category>Model Architecture</category>
      <description>In real-world scenarios, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, leading detectors trained on a single source domain to degrade severely in ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">VFM$^{4}$SDG: Unveiling the Power of VFMs for Single-Domain Generalized Object Detection</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yupeng Zhang, Ruize Han, Ningnan Guo et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, distillation</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">In real-world scenarios, continual changes in weather, illumination, and imaging conditions cause significant domain shifts, leading detectors trained on a single source domain to degrade severely in unseen environments. Existing single-domain generalized object detection (SDGOD) methods mainly rely on data augmentation or domain-invariant representation learning, but pay limited attention to detector mechanisms, leaving clear limitations under complex domain shifts. Through analytical experiments, we find that performance degradation is dominated by increasing missed detections, which fundamentally arises from reduced cross-domain stability of the detector: object-background and inter-instance relations become less stable in the encoding stage, while semantic-spatial alignment of query representations also becomes harder to maintain in the decoding stage. To this end, we propose VFM$^{4}$SDG, a dual-prior learning framework for SDGOD, which introduces a frozen vision foundation model (VFM) as a transferable cross-domain stability prior into detector representation learning and query modeling. In the encoding stage, we propose Cross-domain Stable Relational Prior Distillation to enhance the robustness of object-background and inter-instance relational modeling. In the decoding stage, we propose Semantic-Contextual Prior-based Query Enhancement, which injects category-level semantic prototypes and global visual context into queries to improve their semantic recognition and spatial localization stability in unseen domains. Extensive experiments show that the proposed method consistently outperforms existing SOTA methods on standard SDGOD benchmarks and two mainstream DETR-based detectors, demonstrating its effectiveness, robustness, and generality.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21502v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21502v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Model Architecture] Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts</title>
      <link>https://arxiv.org/abs/2604.21478v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21478v1</guid>
      <pubDate>Thu, 23 Apr 2026 09:39:33 GMT</pubDate>
      <dc:creator>Yuhan Luo, Tao Chen, Decheng Liu</dc:creator>
      <category>Model Architecture</category>
      <description>Nowadays, visual data forgery detection plays an increasingly important role in social and economic security with the rapid development of generative models. Existing face forgery detectors still can&apos;...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Rethinking Cross-Domain Evaluation for Face Forgery Detection with Semantic Fine-grained Alignment and Mixture-of-Experts</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yuhan Luo, Tao Chen, Decheng Liu</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Nowadays, visual data forgery detection plays an increasingly important role in social and economic security with the rapid development of generative models. Existing face forgery detectors still can&apos;t achieve satisfactory performance because of poor generalization ability across datasets. The key factor that led to this phenomenon is the lack of suitable metrics: the commonly used cross-dataset AUC metric fails to reveal an important issue where detection scores may shift significantly across data domains. To explicitly evaluate cross-domain score comparability, we propose \textbf{Cross-AUC}, an evaluation metric that can compute AUC across dataset pairs by contrasting real samples from one dataset with fake samples from another (and vice versa). It is interesting to find that evaluating representative detectors under the Cross-AUC metric reveals substantial performance drops, exposing an overlooked robustness problem. Besides, we also propose the novel framework \textbf{S}emantic \textbf{F}ine-grained \textbf{A}lignment and \textbf{M}ixture-of-Experts (\textbf{SFAM}), consisting of a patch-level image-text alignment module that enhances CLIP&apos;s sensitivity to manipulation artifacts, and the facial region mixture-of-experts module, which routes features from different facial regions to specialized experts for region-aware forgery analysis. Extensive qualitative and quantitative experiments on the public datasets prove that the proposed method achieves superior performance compared with the state-of-the-art methods with various suitable metrics.</p>
  </div>
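  
  <div style="background: #fafafa; padding: 20px; border-left: 4px solid #607D8B; margin: 20px 0;">
    <h3 style="color: #607D8B; margin-top: 0;">💻 Code Sketch</h3>
    <p style="line-height: 1.8; color: #333;">The Cross-AUC idea admits a direct sketch: compute a standard AUC in which the real samples come from one dataset and the fake samples from another, in both directions of the pair. The helper below is an illustration under that reading; variable names are not from the paper.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code># Cross-dataset AUC sketch: rank real samples of dataset A against fake
# samples of dataset B to probe score comparability across domains.
import numpy as np
from sklearn.metrics import roc_auc_score

def cross_auc(real_scores_a, fake_scores_b):
    """AUC for real samples of A vs fake samples of B; assumes a higher
    detector score means 'more likely fake'."""
    y_true = np.concatenate([np.zeros(len(real_scores_a)),
                             np.ones(len(fake_scores_b))])
    y_score = np.concatenate([real_scores_a, fake_scores_b])
    return roc_auc_score(y_true, y_score)

# evaluate both directions of a dataset pair:
# auc_ab = cross_auc(real_A, fake_B)
# auc_ba = cross_auc(real_B, fake_A)</code></pre>
  </div>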
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21478v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21478v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Diffusion Models] [Model Architecture] Instance-level Visual Active Tracking with Occlusion-Aware Planning</title>
      <link>https://arxiv.org/abs/2604.21453v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21453v1</guid>
      <pubDate>Thu, 23 Apr 2026 09:11:50 GMT</pubDate>
      <dc:creator>Haowei Sun, Kai Zhou, Hao Gao et al.</dc:creator>
      <category>Diffusion Models</category>
      <category>Model Architecture</category>
      <description>Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottl...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Instance-level Visual Active Tracking with Occlusion-Aware Planning</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Haowei Sun, Kai Zhou, Hao Gao et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Visual Active Tracking (VAT) aims to control cameras to follow a target in 3D space, which is critical for applications like drone navigation and security surveillance. However, it faces two key bottlenecks in real-world deployment: confusion from visually similar distractors caused by insufficient instance-level discrimination and severe failure under occlusions due to the absence of active planning. To address these, we propose OA-VAT, a unified pipeline with three complementary modules. First, a training-free Instance-Aware Offline Prototype Initialization aggregates multi-view augmented features via DINOv3 to construct discriminative instance prototypes, mitigating distractor confusion. Second, an Online Prototype Enhancement Tracker enhances prototypes online and integrates a confidence-aware Kalman filter for stable tracking under appearance and motion changes. Third, an Occlusion-Aware Trajectory Planner, trained on our new Planning-20k dataset, uses conditional diffusion to generate obstacle-avoiding paths for occlusion recovery. Experiments demonstrate OA-VAT achieves 0.93 average SR on UnrealCV (+2.2% vs. SOTA TrackVLA), 90.8% average CAR on real-world datasets (+12.1% vs. SOTA GC-VAT), and 81.6% TSR on a DJI Tello drone. Running at 35 FPS on an RTX 3090, it delivers robust, real-time performance for practical deployment.</p>
  </div>
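  
  <div style="background: #fafafa; padding: 20px; border-left: 4px solid #607D8B; margin: 20px 0;">
    <h3 style="color: #607D8B; margin-top: 0;">💻 Code Sketch</h3>
    <p style="line-height: 1.8; color: #333;">One plausible reading of a confidence-aware Kalman filter is to inflate the measurement noise when tracker confidence is low, so uncertain detections pull the state estimate less. The sketch below applies that idea to a constant-velocity model; it is an illustration, not the paper's implementation, and all noise values are arbitrary.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code># Confidence-aware Kalman update for a 2D constant-velocity tracker:
# low-confidence measurements get inflated noise and therefore less weight.
import numpy as np

dt = 1.0
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
              [0, 0, 1, 0], [0, 0, 0, 1]], float)  # state transition
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)  # observe position only
Q = 0.01 * np.eye(4)                                # process noise
R0 = np.eye(2)                                      # base measurement noise

def kf_step(x, P, z, conf, eps=1e-3):
    # predict
    x = F @ x
    P = F @ P @ F.T + Q
    # update, scaling measurement noise by inverse confidence
    R = R0 / max(conf, eps)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P</code></pre>
  </div>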
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21453v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21453v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Diffusion Models] [Image Generation] [Model Architecture] VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution</title>
      <link>https://arxiv.org/abs/2604.21450v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21450v1</guid>
      <pubDate>Thu, 23 Apr 2026 09:09:36 GMT</pubDate>
      <dc:creator>Yixuan Zhu, Shilin Ma, Haolin Wang et al.</dc:creator>
      <category>Diffusion Models</category>
      <category>Image Generation</category>
      <category>Model Architecture</category>
      <description>Recent advancements in visual autoregressive models (VAR) have demonstrated their effectiveness in image generation, highlighting their potential for real-world image super-resolution (Real-ISR). Howe...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yixuan Zhu, Shilin Ma, Haolin Wang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> autoregressive model, image generation, text-to-image, dit, distillation</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Recent advancements in visual autoregressive models (VAR) have demonstrated their effectiveness in image generation, highlighting their potential for real-world image super-resolution (Real-ISR). However, adapting VAR for ISR presents critical challenges. The next-scale prediction mechanism, constrained by causal attention, fails to fully exploit global low-quality (LQ) context, resulting in blurry and inconsistent high-quality (HQ) outputs. Additionally, error accumulation in the iterative prediction severely degrades coherence in ISR task. To address these issues, we propose VARestorer, a simple yet effective distillation framework that transforms a pre-trained text-to-image VAR model into a one-step ISR model. By leveraging distribution matching, our method eliminates the need for iterative refinement, significantly reducing error propagation and inference time. Furthermore, we introduce pyramid image conditioning with cross-scale attention, which enables bidirectional scale-wise interactions and fully utilizes the input image information while adapting to the autoregressive mechanism. This prevents later LQ tokens from being overlooked in the transformer. By fine-tuning only 1.2\% of the model parameters through parameter-efficient adapters, our method maintains the expressive power of the original VAR model while significantly enhancing efficiency. Extensive experiments show that VARestorer achieves state-of-the-art performance with 72.32 MUSIQ and 0.7669 CLIPIQA on DIV2K dataset, while accelerating inference by 10 times compared to conventional VAR inference.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21450v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21450v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Model Architecture] UHR-DETR: Efficient End-to-End Small Object Detection for Ultra-High-Resolution Remote Sensing Imagery</title>
      <link>https://arxiv.org/abs/2604.21435v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21435v1</guid>
      <pubDate>Thu, 23 Apr 2026 08:51:24 GMT</pubDate>
      <dc:creator>Jingfang Li, Haoran Zhu, Wen Yang et al.</dc:creator>
      <category>Model Architecture</category>
      <description>Ultra-High-Resolution (UHR) imagery has become essential for modern remote sensing, offering unprecedented spatial coverage. However, detecting small objects in such vast scenes presents a critical di...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">UHR-DETR: Efficient End-to-End Small Object Detection for Ultra-High-Resolution Remote Sensing Imagery</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Jingfang Li, Haoran Zhu, Wen Yang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> transformer</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Ultra-High-Resolution (UHR) imagery has become essential for modern remote sensing, offering unprecedented spatial coverage. However, detecting small objects in such vast scenes presents a critical dilemma: retaining the original resolution for small objects causes prohibitive memory bottlenecks. Conversely, conventional compromises like image downsampling or patch cropping either erase small objects or destroy context. To break this dilemma, we propose UHR-DETR, an efficient end-to-end transformer-based detector designed for UHR imagery. First, we introduce a Coverage-Maximizing Sparse Encoder that dynamically allocates finite computational resources to informative high-resolution regions, ensuring maximum object coverage with minimal spatial redundancy. Second, we design a Global-Local Decoupled Decoder. By integrating macroscopic scene awareness with microscopic object details, this module resolves semantic ambiguities and prevents scene fragmentation. Extensive experiments on the UHR imagery datasets (e.g., STAR and SODA-A) demonstrate the superiority of UHR-DETR under strict hardware constraints (e.g., a single 24GB RTX 3090). It achieves a 2.8\% mAP improvement while delivering a 10$\times$ inference speedup compared to standard sliding-window baselines on the STAR dataset. Our codes and models will be available at https://github.com/Li-JingFang/UHR-DETR.</p>
  </div>
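  
  <div style="background: #fafafa; padding: 20px; border-left: 4px solid #607D8B; margin: 20px 0;">
    <h3 style="color: #607D8B; margin-top: 0;">💻 Code Sketch</h3>
    <p style="line-height: 1.8; color: #333;">A toy version of coverage-oriented sparse region selection: score every high-resolution tile (here with a simple gradient-energy proxy) and keep the top-k under a fixed budget. This is only a stand-in for the paper's learned encoder routing; the scoring rule and all names are assumptions.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code># Toy tile selection for a UHR grayscale image: keep the `budget` tiles
# with the highest gradient energy as a crude proxy for informativeness.
import numpy as np

def select_tiles(image, tile=256, budget=64):
    """image: 2D array; returns (y, x) corners of the selected tiles."""
    h, w = image.shape
    scores, coords = [], []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            patch = image[y:y + tile, x:x + tile].astype(float)
            gy, gx = np.gradient(patch)
            scores.append(np.abs(gy).mean() + np.abs(gx).mean())
            coords.append((y, x))
    order = np.argsort(scores)[::-1][:budget]  # highest-energy tiles first
    return [coords[i] for i in order]</code></pre>
  </div>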
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21435v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21435v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Diffusion Models] [Model Architecture] Pre-process for segmentation task with nonlinear diffusion filters</title>
      <link>https://arxiv.org/abs/2604.21422v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21422v1</guid>
      <pubDate>Thu, 23 Apr 2026 08:38:45 GMT</pubDate>
      <dc:creator>Javier Sanguino, Carlos Platero, Olga Velasco</dc:creator>
      <category>Diffusion Models</category>
      <category>Model Architecture</category>
      <description>This paper deals with using nonlinear diffusion filters to obtain piecewise constant images as a preprocessing step for segmentation techniques. We first show an intrinsic formulation for ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Pre-process for segmentation task with nonlinear diffusion filters</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Javier Sanguino, Carlos Platero, Olga Velasco</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit, dit, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">This paper deals with the case of using nonlinear diffusion filters to obtain piecewise constant images as a previous process for segmentation techniques.   We first show an intrinsic formulation for the nonlinear diffusion equation to provide some design conditions on the diffusion filters. According to this theoretical framework, we propose a new family of diffusivities; they are obtained from nonlinear diffusion techniques and are related with backward diffusion. Their goal is to split the image in closed contours with a homogenized grey intensity inside and with no blurred edges.   We also prove that our filters satisfy the well-posedness semi-discrete and full discrete scale-space requirements. This shows that by using semi-implicit schemes, a forward nonlinear diffusion equation is solved, instead of a backward nonlinear diffusion equation, connecting with an edge-preserving process. Under the conditions established for the diffusivity and using a stopping criterion for the diffusion time, we get piecewise constant images with a low computational effort.   Finally, we test our filter with real images and we illustrate the effects of our diffusivity function as a method to get piecewise constant images.   The code is available at https://github.com/cplatero/NonlinearDiffusion.</p>
  </div>
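  
  <div style="background: #fafafa; padding: 20px; border-left: 4px solid #607D8B; margin: 20px 0;">
    <h3 style="color: #607D8B; margin-top: 0;">💻 Code Sketch</h3>
    <p style="line-height: 1.8; color: #333;">For readers unfamiliar with nonlinear diffusion filtering, the sketch below runs a generic explicit scheme with the classic Perona-Malik diffusivity as a placeholder. It does not reproduce the paper's contributions (the new diffusivity family, the semi-implicit scheme, or the stopping criterion); it only illustrates the kind of edge-preserving smoothing under discussion.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code># Generic explicit nonlinear-diffusion filtering of a 2D image, using the
# classic Perona-Malik diffusivity g(s^2) = 1 / (1 + s^2 / lmbda^2) as a
# placeholder for the paper's diffusivity family. Periodic borders via
# np.roll, for brevity.
import numpy as np

def diffusivity(grad2, lmbda=10.0):
    return 1.0 / (1.0 + grad2 / lmbda ** 2)

def diffuse(u, steps=50, tau=0.15, lmbda=10.0):
    u = u.astype(float)
    for _ in range(steps):
        n = np.roll(u, -1, 0) - u   # differences to the four neighbours
        s = np.roll(u, 1, 0) - u
        e = np.roll(u, -1, 1) - u
        w = np.roll(u, 1, 1) - u
        # one diffusivity per direction, evaluated on the local gradient
        u = u + tau * (diffusivity(n ** 2, lmbda) * n +
                       diffusivity(s ** 2, lmbda) * s +
                       diffusivity(e ** 2, lmbda) * e +
                       diffusivity(w ** 2, lmbda) * w)
    return u</code></pre>
  </div>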
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21422v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21422v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Multimodal Generation] [Model Architecture] VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought</title>
      <link>https://arxiv.org/abs/2604.21396v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21396v1</guid>
      <pubDate>Thu, 23 Apr 2026 08:04:07 GMT</pubDate>
      <dc:creator>Byeonggeuk Lim, Kyeonghyun Kim, JungMin Yun et al.</dc:creator>
      <category>Multimodal Generation</category>
      <category>Model Architecture</category>
      <description>The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model&apos;s logic in actual visual evidence. However, existing datasets fa...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Byeonggeuk Lim, Kyeonghyun Kim, JungMin Yun et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, vlm, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">The advancement of Large Vision-Language Models (LVLMs) requires precise local region-based reasoning that faithfully grounds the model&apos;s logic in actual visual evidence. However, existing datasets face limitations in scalability due to extensive manual annotation and lack of explicit alignment between multi-step reasoning and corresponding image regions, which constrains the evaluation of model trustworthiness. To address these challenges, we propose the Visual Grounding Chain-of-Thought (VG-CoT) dataset, which explicitly links each reasoning step to real visual evidence within the image through a fully automated three-stage pipeline. The pipeline first extracts object- and text-level visual evidence using state-of-the-art detection and OCR models, then generates step-by-step grounded reasoning with GPT-4o, and finally refines the grounding through a rationale-driven open-set detection process. In addition, we introduce a new benchmark that comprehensively evaluates LVLMs reasoning across three complementary dimensions: Rationale Quality, Answer Accuracy, and Reasoning-Answer Alignment. Experiments with representative LVLMs, including LLaVA-1.5 and Qwen2-VL, demonstrate consistent improvements on most evaluation metrics, confirming that VG-CoT effectively enhances trustworthy, evidence-based reasoning while maintaining scalable and cost-efficient dataset construction. The dataset and code will be released publicly upon acceptance to facilitate further research.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21396v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21396v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Image Generation] [Model Architecture] Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair</title>
      <link>https://arxiv.org/abs/2604.21395v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21395v1</guid>
      <pubDate>Thu, 23 Apr 2026 08:03:33 GMT</pubDate>
      <dc:creator>Vishal Rajput</dc:creator>
      <category>Image Generation</category>
      <category>Model Architecture</category>
      <description>We prove that empirical risk minimisation (ERM) imposes a necessary geometric constraint on learned representations: any encoder that minimises supervised loss must retain non-zero Jacobian sensitivit...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Vishal Rajput</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit, dit, imagen</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">We prove that empirical risk minimisation (ERM) imposes a necessary geometric constraint on learned representations: any encoder that minimises supervised loss must retain non-zero Jacobian sensitivity in directions that are label-correlated in training data but nuisance at test time. This is not a contingent failure of current methods; it is a mathematical consequence of the supervised objective itself. We call this the geometric blind spot of supervised learning (Theorem 1), and show it holds across proper scoring rules, architectures, and dataset sizes.   This single theorem unifies four lines of prior empirical work that were previously treated separately: non-robust predictive features, texture bias, corruption fragility, and the robustness-accuracy tradeoff. In this framing, adversarial vulnerability is one consequence of a broader structural fact about supervised learning geometry.   We introduce Trajectory Deviation Index (TDI), a diagnostic that measures the theorem&apos;s bounded quantity directly, and show why common alternatives miss the key failure mode. PGD adversarial training reaches Jacobian Frobenius 2.91 yet has the worst clean-input geometry (TDI 1.336), while PMH achieves TDI 0.904. TDI is the only metric that detects this dissociation because it measures isotropic path-length distortion -- the exact quantity Theorem 1 bounds.   Across seven vision tasks, BERT/SST-2, and ImageNet ViT-B/16 backbones used by CLIP, DINO, and SAM, the blind spot is measurable and repairable. It is present at foundation-model scale, worsens monotonically across language-model sizes (blind-spot ratio 0.860 to 0.765 to 0.742 from 66M to 340M), and is amplified by task-specific ERM fine-tuning (+54%), while PMH repairs it by 11x with one additional training term whose Gaussian form Proposition 5 proves is the unique perturbation law that uniformly penalises the encoder Jacobian.</p>
  </div>
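  
  <div style="background: #fafafa; padding: 20px; border-left: 4px solid #607D8B; margin: 20px 0;">
    <h3 style="color: #607D8B; margin-top: 0;">💻 Code Sketch</h3>
    <p style="line-height: 1.8; color: #333;">The Gaussian repair term has a standard finite-difference reading: for small sigma, E||f(x + sigma*eps) - f(x)||^2 / sigma^2 estimates the squared Frobenius norm of the encoder Jacobian, since E||J eps||^2 = ||J||_F^2 for eps ~ N(0, I). The sketch below implements that estimator as an auxiliary loss; it illustrates the general mechanism, not the paper's PMH code, and the weight beta is a stand-in.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code># Gaussian-perturbation estimator of the squared Jacobian Frobenius norm,
# usable as an auxiliary penalty; names and weights are illustrative.
import torch

def jacobian_penalty(encoder, x, sigma=0.1, n_samples=2):
    z = encoder(x)
    pen = 0.0
    for _ in range(n_samples):
        eps = torch.randn_like(x)
        z_pert = encoder(x + sigma * eps)
        pen = pen + ((z_pert - z) ** 2).sum(dim=1).mean()
    # for small sigma this approaches E||J eps||^2 = ||J||_F^2
    return pen / (n_samples * sigma ** 2)

# hypothetical total loss: task loss plus the geometric repair term
# loss = task_loss + beta * jacobian_penalty(encoder, x)</code></pre>
  </div>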
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21395v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21395v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Model Architecture] EdgeFormer: local patch-based edge detection transformer on point clouds</title>
      <link>https://arxiv.org/abs/2604.21387v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21387v1</guid>
      <pubDate>Thu, 23 Apr 2026 07:57:04 GMT</pubDate>
      <dc:creator>Yifei Xie, Zhikun Tu, Tong Yang et al.</dc:creator>
      <category>Model Architecture</category>
      <description>Edge points on 3D point clouds can clearly convey 3D geometry and surface characteristics, therefore, edge detection is widely used in many vision applications with high industrial and commercial dema...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">EdgeFormer: local patch-based edge detection transformer on point clouds</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yifei Xie, Zhikun Tu, Tong Yang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> transformer</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Edge points on 3D point clouds can clearly convey 3D geometry and surface characteristics, therefore, edge detection is widely used in many vision applications with high industrial and commercial demands. However, the fine-grained edge features are difficult to detect effectively as they are generally densely distributed or exhibit small-scale surface gradients. To address this issue, we present a learning-based edge detection network, named EdgeFormer, which mainly consists of two stages. Based on the observation that spatially neighboring points tend to exhibit high correlation, forming the local underlying surface, we convert the edge detection of the entire point cloud into a point classification based on local patches. Therefore, in the first stage, we construct local patch feature descriptors that describe the local neighborhood around each point. In the second stage, we classify each point by analyzing the local patch feature descriptors generated in the first stage. Due to the conversion of the point cloud into local patches, the proposed method can effectively extract the finer details. The experimental results show that our model demonstrates competitive performance compared to six baselines.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21387v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21387v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[Video Generation] KD-CVG: A Knowledge-Driven Approach for Creative Video Generation</title>
      <link>https://arxiv.org/abs/2604.21362v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21362v1</guid>
      <pubDate>Thu, 23 Apr 2026 07:24:15 GMT</pubDate>
      <dc:creator>Linkai Liu, Wei Feng, Xi Zhao et al.</dc:creator>
      <category>Video Generation</category>
      <description>Creative Generation (CG) leverages generative models to automatically produce advertising content that highlights product features, and it has been a significant focus of recent research. However, whi...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">KD-CVG: A Knowledge-Driven Approach for Creative Video Generation</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Linkai Liu, Wei Feng, Xi Zhao et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 视频生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> text-to-video, t2v, video generation</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Creative Generation (CG) leverages generative models to automatically produce advertising content that highlights product features, and it has been a significant focus of recent research. However, while CG has advanced considerably, most efforts have concentrated on generating advertising text and images, leaving Creative Video Generation (CVG) relatively underexplored. This gap is largely due to two major challenges faced by Text-to-Video (T2V) models: (a) \textbf{ambiguous semantic alignment}, where models struggle to accurately correlate product selling points with creative video content, and (b) \textbf{inadequate motion adaptability}, resulting in unrealistic movements and distortions. To address these challenges, we develop a comprehensive Advertising Creative Knowledge Base (ACKB) as a foundational resource and propose a knowledge-driven approach (KD-CVG) to overcome the knowledge limitations of existing models. KD-CVG consists of two primary modules: Semantic-Aware Retrieval (SAR) and Multimodal Knowledge Reference (MKR). SAR utilizes the semantic awareness of graph attention networks and reinforcement learning feedback to enhance the model&apos;s comprehension of the connections between selling points and creative videos. Building on this, MKR incorporates semantic and motion priors into the T2V model to address existing knowledge gaps. Extensive experiments have demonstrated KD-CVG&apos;s superior performance in achieving semantic alignment and motion adaptability, validating its effectiveness over other state-of-the-art methods. The code and dataset will be open source at https://kdcvg.github.io/KDCVG/.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21362v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21362v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 视频生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] [多模态生成] Prototype-Based Test-Time Adaptation of Vision-Language Models</title>
      <link>https://arxiv.org/abs/2604.21360v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21360v1</guid>
      <pubDate>Thu, 23 Apr 2026 07:20:56 GMT</pubDate>
      <dc:creator>Zhaohong Huang, Yuxin Zhang, Wenjing Liu et al.</dc:creator>
      <category>图像生成</category>
      <category>多模态生成</category>
      <description>Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpro...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Prototype-Based Test-Time Adaptation of Vision-Language Models</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Zhaohong Huang, Yuxin Zhang, Wenjing Liu et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成, 多模态生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vlm, vision-language model, imagen</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. Particularly, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample&apos;s visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP&apos;s accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP&apos;s inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP&apos;s inference speed.</p>
  </div>
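  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>💻 示意代码：</strong> A minimal sketch of the prototype-update idea above, assuming CLIP-style normalized embeddings. The confidence-weighted update rule, the fusion weight, and all values here are one reading of the abstract, not the official implementation.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto;"><code>import torch
import torch.nn.functional as F

C, D = 10, 512                                     # classes, feature dim
text_emb = F.normalize(torch.randn(C, D), dim=-1)  # frozen CLIP text embeddings
protos = text_emb.clone()                          # class prototypes, init from text

def adapt_and_predict(img_feat, alpha=0.2, fuse=0.5):
    """Update the class prototypes with one test sample, then classify it."""
    img_feat = F.normalize(img_feat, dim=-1)
    zs_logits = 100.0 * img_feat @ text_emb.t()    # zero-shot CLIP logits
    conf, cls = zs_logits.softmax(-1).max(-1)      # zero-shot class confidence
    # confidence-weighted accumulation into the predicted class prototype;
    # no cache is kept, so memory and lookup cost stay constant per class
    protos[cls] = F.normalize((1 - alpha * conf) * protos[cls]
                              + alpha * conf * img_feat, dim=-1)
    proto_logits = 100.0 * img_feat @ protos.t()
    return (1 - fuse) * zs_logits + fuse * proto_logits

print("predicted class:", adapt_and_predict(torch.randn(D)).argmax().item())</code></pre>
  </div>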
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21360v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21360v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成, 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] SparseGF: A Height-Aware Sparse Segmentation Framework with Context Compression for Robust Ground Filtering Across Urban to Natural Scenes</title>
      <link>https://arxiv.org/abs/2604.21356v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21356v1</guid>
      <pubDate>Thu, 23 Apr 2026 07:15:12 GMT</pubDate>
      <dc:creator>Nannan Qin, Pengjie Tao, Haiyan Guan et al.</dc:creator>
      <category>图像生成</category>
      <description>High-quality digital terrain models derived from airborne laser scanning (ALS) data are essential for a wide range of geospatial analyses, and their generation typically relies on robust ground filter...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">SparseGF: A Height-Aware Sparse Segmentation Framework with Context Compression for Robust Ground Filtering Across Urban to Natural Scenes</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Nannan Qin, Pengjie Tao, Haiyan Guan et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> lora</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">High-quality digital terrain models derived from airborne laser scanning (ALS) data are essential for a wide range of geospatial analyses, and their generation typically relies on robust ground filtering (GF) to separate point clouds across diverse landscapes into ground and non-ground parts. Although current deep-learning-based GF methods have demonstrated impressive performance, especially in specific challenging terrains, their cross-scene generalization remains limited by two persistent issues: the context-detail dilemma in large-scale processing due to limited computational resources, and the random misclassification of tall objects arising from classification-only optimization. To overcome these limitations, we propose SparseGF, a height-aware sparse segmentation framework enhanced with context compression. It is built upon three key innovations: (1) a convex-mirror-inspired context compression module that condenses expansive contexts into compact representations while preserving central details; (2) a hybrid sparse voxel-point network architecture that effectively interprets compressed representations while mitigating compression-induced geometric distortion; and (3) a height-aware loss function that explicitly enforces topographic elevation priors during training to suppress random misclassification of tall objects. Extensive evaluations on two large-scale ALS benchmark datasets demonstrate that SparseGF delivers robust GF across urban to natural terrains, achieving leading performance in complex urban scenes, competitive results on mixed terrains, and moderate yet non-catastrophic accuracy in densely forested steep areas. This work offers new insights into deep-learning-based GF research and encourages further exploration toward truly cross-scene generalization for large-scale environmental monitoring.</p>
  </div>
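  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>💻 示意代码：</strong> A hedged sketch of what the height-aware loss term could look like: probability mass assigned to the ground class is penalized for points far above a local elevation floor. The local-minimum prior, the margin, and the weight are assumptions; the abstract does not give the exact formula.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto;"><code>import torch
import torch.nn.functional as F

def height_aware_loss(logits, labels, z, z_floor, margin=2.0, lam=0.5):
    """logits: (N, 2) ground/non-ground scores; z: (N,) point heights in m;
    z_floor: (N,) local minimum elevation around each point (assumed given)."""
    ce = F.cross_entropy(logits, labels)
    p_ground = logits.softmax(-1)[:, 0]
    # penalty grows when ground probability is assigned to points sitting
    # more than `margin` meters above the local floor (e.g. roofs, treetops)
    excess = F.relu(z - z_floor - margin)
    return ce + lam * (p_ground * excess).mean()

N = 64
loss = height_aware_loss(torch.randn(N, 2), torch.randint(0, 2, (N,)),
                         torch.rand(N) * 10, torch.zeros(N))
print(loss.item())</code></pre>
  </div>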
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21356v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21356v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Trust-SSL: Additive-Residual Selective Invariance for Robust Aerial Self-Supervised Learning</title>
      <link>https://arxiv.org/abs/2604.21349v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21349v1</guid>
      <pubDate>Thu, 23 Apr 2026 07:07:59 GMT</pubDate>
      <dc:creator>Wadii Boulila, Adel Ammar, Bilel Benjdira et al.</dc:creator>
      <category>模型架构</category>
      <description>Self-supervised learning (SSL) is a standard approach for representation learning in aerial imagery. Existing methods enforce invariance between augmented views, which works well when augmentations pr...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Trust-SSL: Additive-Residual Selective Invariance for Robust Aerial Self-Supervised Learning</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Wadii Boulila, Adel Ammar, Bilel Benjdira et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Self-supervised learning (SSL) is a standard approach for representation learning in aerial imagery. Existing methods enforce invariance between augmented views, which works well when augmentations preserve semantic content. However, aerial images are frequently degraded by haze, motion blur, rain, and occlusion that remove critical evidence. Enforcing alignment between a clean and a severely degraded view can introduce spurious structure into the latent space. This study proposes a training strategy and architectural modification to enhance SSL robustness to such corruptions. It introduces a per-sample, per-factor trust weight into the alignment objective, combined with the base contrastive loss as an additive residual. A stop-gradient is applied to the trust weight instead of a multiplicative gate. While a multiplicative gate is a natural choice, experiments show it impairs the backbone, whereas our additive-residual approach improves it. Using a 200-epoch protocol on a 210,000-image corpus, the method achieves the highest mean linear-probe accuracy among six backbones on EuroSAT, AID, and NWPU-RESISC45 (90.20% compared to 88.46% for SimCLR and 89.82% for VICReg). It yields the largest improvements under severe information-erasing corruptions on EuroSAT (+19.9 points on haze at s=5 over SimCLR). The method also demonstrates consistent gains of +1 to +3 points in Mahalanobis AUROC on a zero-shot cross-domain stress test using BDD100K weather splits. Two ablations (scalar uncertainty and cosine gate) indicate the additive-residual formulation is the primary source of these improvements. An evidential variant using Dempster-Shafer fusion introduces interpretable signals of conflict and ignorance. These findings offer a concrete design principle for uncertainty-aware SSL. Code is publicly available at https://github.com/WadiiBoulila/trust-ssl.</p>
  </div>
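  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>💻 示意代码：</strong> A minimal PyTorch sketch of the additive-residual formulation named above: a detached (stop-gradient) trust weight scales a per-sample alignment residual that is added to, rather than multiplied into, the base contrastive loss. The InfoNCE base and the cosine alignment term are illustrative stand-ins; how the trust weights themselves are produced is not sketched.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto;"><code>import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.2):
    """Per-sample InfoNCE between two augmented views (illustrative base loss)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau
    return F.cross_entropy(logits, torch.arange(z1.size(0)), reduction="none")

def trust_ssl_loss(z1, z2, trust):
    """Base contrastive loss plus stop-gradient(trust) * alignment residual."""
    base = info_nce(z1, z2)                        # (B,)
    align = 1.0 - F.cosine_similarity(z1, z2)      # per-sample alignment residual
    # additive residual with a stop-gradient on the trust weight: the trust
    # signal modulates alignment strength without gating the base objective
    return (base + trust.detach() * align).mean()

B, D = 8, 128
loss = trust_ssl_loss(torch.randn(B, D, requires_grad=True),
                      torch.randn(B, D), torch.rand(B))
loss.backward()
print(loss.item())</code></pre>
  </div>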
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21349v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21349v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning</title>
      <link>https://arxiv.org/abs/2604.21346v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21346v1</guid>
      <pubDate>Thu, 23 Apr 2026 07:03:48 GMT</pubDate>
      <dc:creator>Mohit Vaishnav, Tanel Tammet</dc:creator>
      <category>多模态生成</category>
      <description>Vision-language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Mohit Vaishnav, Tanel Tammet</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vlm</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Vision--language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our \emph{Componential--Grammatical (C--G)} paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid--90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21346v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21346v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [图像生成] [模型架构] Latent Denoising Improves Visual Alignment in Large Multimodal Models</title>
      <link>https://arxiv.org/abs/2604.21343v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21343v1</guid>
      <pubDate>Thu, 23 Apr 2026 06:58:08 GMT</pubDate>
      <dc:creator>Dhruv Parikh, Jacob Fein-Ashley, Rajgopal Kannan et al.</dc:creator>
      <category>扩散模型</category>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak inte...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Latent Denoising Improves Visual Alignment in Large Multimodal Models</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Dhruv Parikh, Jacob Fein-Ashley, Rajgopal Kannan et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, distillation, imagen</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak internal visual representations and brittle behavior under distribution shift. Inspired by recent progress on latent denoising for learning high-quality visual tokenizers, we show that the same principle provides an effective form of visual supervision for improving internal visual feature alignment and multimodal understanding in LMMs. We propose a latent denoising framework that corrupts projected visual tokens using a saliency-aware mixture of masking and Gaussian noising. The LMM is trained to denoise these corrupted tokens by recovering clean teacher patch features from hidden states at a selected intermediate LLM layer using a decoder. To prevent representation collapse, our framework also preserves the teacher&apos;s intra-image similarity structure and applies intra-image contrastive patch distillation. During inference, corruption and auxiliary heads are disabled, introducing no additional inference-time overhead. Across a broad suite of standard multimodal benchmarks, our method consistently improves visual understanding and reasoning over strong baselines, and yields clear gains on compositional robustness benchmarks (e.g., NaturalBench). Moreover, under ImageNet-C-style non-adversarial common corruptions applied to benchmark images, our method maintains higher accuracy and exhibits reduced degradation at both moderate and severe corruption levels. Our code is available at https://github.com/dhruvashp/latent-denoising-for-lmms.</p>
  </div>
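  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>💻 示意代码：</strong> A minimal sketch of a saliency-aware mixture of masking and Gaussian noising over projected visual tokens, as described above. Corrupting the most salient patches, the 50/50 mask-vs-noise split, and all ratios are illustrative assumptions, not reported settings.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto;"><code>import torch

def corrupt_tokens(tokens, saliency, mask_ratio=0.3, noise_std=0.5):
    """tokens: (B, N, D) projected visual tokens; saliency: (B, N)."""
    B, N, D = tokens.shape
    k = int(mask_ratio * N)
    idx = saliency.topk(k, dim=1).indices          # most salient patches (B, k)
    sel = torch.zeros(B, N, dtype=torch.bool)
    sel.scatter_(1, idx, True)
    coin = torch.rand(B, N) > 0.5                  # mixture: mask vs. noise
    corrupted = tokens.clone()
    corrupted[torch.logical_and(sel, coin)] = 0.0  # zero-mask half of them
    noisy = torch.logical_and(sel, ~coin)          # Gaussian-noise the rest
    corrupted[noisy] += noise_std * torch.randn_like(corrupted[noisy])
    return corrupted, sel

toks, sal = torch.randn(2, 16, 64), torch.rand(2, 16)
out, sel = corrupt_tokens(toks, sal)
print(out.shape, sel.float().mean().item())        # fraction of tokens corrupted</code></pre>
  </div>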
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21343v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21343v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] Teacher-Guided Routing for Sparse Vision Mixture-of-Experts</title>
      <link>https://arxiv.org/abs/2604.21330v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21330v1</guid>
      <pubDate>Thu, 23 Apr 2026 06:34:10 GMT</pubDate>
      <dc:creator>Masahiro Kada, Ryota Yoshihashi, Satoshi Ikehata et al.</dc:creator>
      <category>图像生成</category>
      <description>Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck. Sparse Mixture of Experts (MoE) offers an ef...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Teacher-Guided Routing for Sparse Vision Mixture-of-Experts</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Masahiro Kada, Ryota Yoshihashi, Satoshi Ikehata et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> imagen</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Recent progress in deep learning has been driven by increasingly large-scale models, but the resulting computational cost has become a critical bottleneck. Sparse Mixture of Experts (MoE) offers an effective solution by activating only a small subset of experts for each input, achieving high scalability without sacrificing inference speed. Although effective, sparse MoE training exhibits characteristic optimization difficulties. Because the router receives informative gradients only through the experts selected in the forward pass, it suffers from gradient blocking and obtains little information from unselected routes. This limited, highly localized feedback makes it difficult for the router to learn appropriate expert-selection scores and often leads to unstable routing dynamics, such as fluctuating expert assignments during training. To address this issue, we propose TGR-MoE: Teacher-Guided Routing for Sparse Vision Mixture-of-Experts, a simple yet effective method that stabilizes router learning using supervision derived from a pretrained dense teacher model. TGR-MoE constructs a teacher router from the teacher&apos;s intermediate representations and uses its routing outputs as pseudo-supervision for the student router, suppressing frequent routing fluctuations during training and enabling knowledge-guided expert selection from the early stages of training. Extensive experiments on ImageNet-1K and CIFAR-100 demonstrate that TGR consistently improves both accuracy and routing consistency, while maintaining stable training even under highly sparse configurations.</p>
  </div>
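  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>💻 示意代码：</strong> A sketch of teacher-guided routing as pseudo-supervision: a KL term pulls the student router toward routing targets computed from the dense teacher representations. The linear teacher router and the loss weight are assumptions; the abstract gives no formulas.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto;"><code>import torch
import torch.nn.functional as F

def routing_loss(student_logits, teacher_feats, teacher_router, beta=0.5):
    """student_logits: (T, E) router scores for T tokens over E experts."""
    with torch.no_grad():
        teacher_probs = teacher_router(teacher_feats).softmax(-1)  # pseudo-targets
    log_p = F.log_softmax(student_logits, dim=-1)
    # KL(teacher || student): stabilize routing from early training by
    # pulling student assignments toward the teacher-derived distribution
    kl = F.kl_div(log_p, teacher_probs, reduction="batchmean")
    return beta * kl

T, E, D = 32, 8, 256
teacher_router = torch.nn.Linear(D, E)   # built on teacher intermediate features
loss = routing_loss(torch.randn(T, E), torch.randn(T, D), teacher_router)
print(loss.item())</code></pre>
  </div>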
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21330v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21330v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment</title>
      <link>https://arxiv.org/abs/2604.21326v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21326v1</guid>
      <pubDate>Thu, 23 Apr 2026 06:29:42 GMT</pubDate>
      <dc:creator>Juan Li, Chuanghao Ding, Xujie Zhang et al.</dc:creator>
      <category>模型架构</category>
      <description>Universal Multimodal Retrieval (UMR) aims to map different modalities (e.g., visual and textual) into a shared embedding space for multi-modal retrieval. Existing UMR methods can be broadly divided in...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Juan Li, Chuanghao Ding, Xujie Zhang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Universal Multimodal Retrieval (UMR) aims to map different modalities (e.g., visual and textual) into a shared embedding space for multi-modal retrieval. Existing UMR methods can be broadly divided into two categories: early-fusion approaches, such as Marvel, which projects visual features into the language model (LM) space for integrating with text modality, and late-fusion approaches, such as UniVL-DR, which encode visual and textual inputs using separate encoders and obtain fused embeddings through addition. Our pilot study reveals that Marvel exhibits visual modality collapse, which is characterized by the model&apos;s tendency to disregard visual features while depending excessively on textual cues. In contrast, although UniVL-DR is less affected by this issue, it is more susceptible to semantic misalignment, where semantically related content is positioned far apart in the embedding space. To address these challenges, we propose MiMIC, which introduces two key innovations: (1) a fusion-in-decoder architecture for effective multimodal integration, and (2) robust training through single modality mixin and random caption dropout. Experiments on the WebQA+ and EVQA+ datasets, where image in documents or queries might lack captions, indicate that MiMIC consistently outperforms both early- and late-fusion baselines.</p>
  </div>
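  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>💻 示意代码：</strong> A small sketch of the two robustness augmentations named above, single-modality mixin and random caption dropout; the probabilities and the document layout are illustrative.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto;"><code>import random

def augment_document(doc, p_drop_caption=0.3, p_single_modality=0.2):
    """doc: {'image': ..., 'caption': str or None}; returns an augmented copy."""
    doc = dict(doc)
    r = random.random()
    if p_single_modality / 2 > r and doc.get("image") is not None:
        doc["caption"] = None              # mixin: image-only training sample
    elif p_single_modality > r and doc.get("caption"):
        doc["image"] = None                # mixin: text-only training sample
    elif p_drop_caption > random.random():
        doc["caption"] = None              # caption dropout: keep the image only
    return doc

print(augment_document({"image": "img_0.jpg", "caption": "a red car"}))</code></pre>
  </div>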
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21326v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21326v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] FryNet: Dual-Stream Adversarial Fusion for Non-Destructive Frying Oil Oxidation Assessment</title>
      <link>https://arxiv.org/abs/2604.21321v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21321v1</guid>
      <pubDate>Thu, 23 Apr 2026 06:24:10 GMT</pubDate>
      <dc:creator>Khaled R Ahmed, Toqi Tahamid Sarker, Taminul Islam et al.</dc:creator>
      <category>模型架构</category>
      <description>Monitoring frying oil degradation is critical for food safety, yet current practice relies on destructive wet-chemistry assays that provide no spatial information and are unsuitable for real-time use....</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">FryNet: Dual-Stream Adversarial Fusion for Non-Destructive Frying Oil Oxidation Assessment</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Khaled R Ahmed, Toqi Tahamid Sarker, Taminul Islam et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> mae</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Monitoring frying oil degradation is critical for food safety, yet current practice relies on destructive wet-chemistry assays that provide no spatial information and are unsuitable for real-time use. We identify a fundamental obstacle in thermal-image-based inspection, the camera-fingerprint shortcut, whereby models memorize sensor-specific noise and thermal bias instead of learning oxidation chemistry, collapsing under video-disjoint evaluation. We propose FryNet, a dual-stream RGB-thermal framework that jointly performs oil-region segmentation, serviceability classification, and regression of four chemical oxidation indices (PV, p-AV, Totox, temperature) in a single forward pass. A ThermalMiT-B2 backbone with channel and spatial attention extracts thermal features, while an RGB-MAE Encoder learns chemically grounded representations via masked autoencoding and chemical alignment. Dual-Encoder DANN adversarially regularizes both streams against video identity via Gradient Reversal Layers, and FiLM fusion bridges thermal structure with RGB chemical context. On 7,226 paired frames across 28 frying videos, FryNet achieves 98.97% mIoU, 100% classification accuracy, and 2.32 mean regression MAE, outperforming all seven baselines.</p>
  </div>
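  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>💻 示意代码：</strong> A standard Gradient Reversal Layer of the kind used for DANN-style adversarial regularization against video identity, as the abstract describes. The GRL construction is classic; the linear video-identity head wired to it here is only illustrative.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto;"><code>import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)                # identity on the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        # negated, scaled gradient on the backward pass: the feature
        # extractor learns to confuse the video-identity discriminator,
        # suppressing the camera-fingerprint shortcut
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

feats = torch.randn(4, 128, requires_grad=True)
domain_logits = torch.nn.Linear(128, 28)(grad_reverse(feats))  # 28 frying videos
domain_logits.sum().backward()
print(feats.grad.abs().sum().item())       # nonzero, sign-flipped gradients</code></pre>
  </div>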
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21321v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21321v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] PLAS-Net: Pixel-Level Area Segmentation for UAV-Based Beach Litter Monitoring</title>
      <link>https://arxiv.org/abs/2604.21313v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21313v1</guid>
      <pubDate>Thu, 23 Apr 2026 06:08:56 GMT</pubDate>
      <dc:creator>Yongying Liu, Jiaqi Wang, Jian Song et al.</dc:creator>
      <category>模型架构</category>
      <description>Accurate quantification of the physical exposure area of beach litter, rather than simple item counts, is essential for credible ecological risk assessment of marine debris. However, automated UAV-bas...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">PLAS-Net: Pixel-Level Area Segmentation for UAV-Based Beach Litter Monitoring</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yongying Liu, Jiaqi Wang, Jian Song et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Accurate quantification of the physical exposure area of beach litter, rather than simple item counts, is essential for credible ecological risk assessment of marine debris. However, automated UAV-based monitoring predominantly relies on bounding-box detection, which systematically overestimates the planar area of irregular litter objects. To address this geometric limitation, we develop PLAS-Net (Pixel-level Litter Area Segmentor), an instance segmentation framework that extracts pixel-accurate physical footprints of coastal debris. Evaluated on UAV imagery from a monsoon-driven pocket beach in Koh Tao, Thailand, PLAS-Net achieves a mAP_50 of 58.7% with higher precision than eleven baseline models, demonstrating improved mask fidelity under complex coastal conditions. To illustrate how the accuracy of the masking affects the conclusions of environmental analysis, we conducted three downstream demonstrations: (i) power-law fitting of normalized plastic density (NPD) to characterize fragmentation dynamics; (ii) area-weighted ecological risk index (ERI) to map spatial pollution hotspots; and (iii) source composition analysis revealing the abundance-area paradox: fishing gear constitutes a small proportion of the total number of items, but has the largest physical area per unit item. Pixel-level area extraction can provide more valuable information for coastal monitoring compared to methods based solely on counting.</p>
  </div>
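  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>💻 示意代码：</strong> A minimal sketch of the measurement the paper motivates: converting a pixel-accurate instance mask into a physical footprint area via the ground sampling distance (GSD), plus a toy area-weighted index. The GSD and the hazard weights are made-up values.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto;"><code>import numpy as np

def mask_area_m2(mask, gsd_m_per_px):
    """mask: boolean (H, W) instance mask; returns planar area in m^2."""
    return mask.sum() * gsd_m_per_px ** 2

def area_weighted_index(masks, weights, gsd):
    """Toy area-weighted pollution index over per-instance hazard weights."""
    return sum(w * mask_area_m2(m, gsd) for m, w in zip(masks, weights))

m1 = np.zeros((100, 100), bool); m1[10:30, 10:40] = True   # e.g. fishing net
m2 = np.zeros((100, 100), bool); m2[50:55, 50:55] = True   # e.g. bottle cap
print(mask_area_m2(m1, gsd_m_per_px=0.01))                 # 0.06 m^2
print(area_weighted_index([m1, m2], weights=[3.0, 1.0], gsd=0.01))</code></pre>
  </div>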
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21313v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21313v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] [模型架构] An Interpretable Vision Transformer Framework for Automated Brain Tumor Classification</title>
      <link>https://arxiv.org/abs/2604.21311v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21311v1</guid>
      <pubDate>Thu, 23 Apr 2026 06:07:10 GMT</pubDate>
      <dc:creator>Chinedu Emmanuel Mbonu, Tochukwu Sunday Belonwu, Okwuchukwu Ejike Chukwuogo et al.</dc:creator>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Brain tumors represent one of the most critical neurological conditions, where early and accurate diagnosis is directly correlated with patient survival rates. Manual interpretation of Magnetic Resona...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">an interpretable vision transformer framework for automated brain tumor classification</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Chinedu Emmanuel Mbonu, Tochukwu Sunday Belonwu, Okwuchukwu Ejike Chukwuogo et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> transformer, vit, dit, imagen</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Brain tumors represent one of the most critical neurological conditions, where early and accurate diagnosis is directly correlated with patient survival rates. Manual interpretation of Magnetic Resonance Imaging (MRI) scans is time-intensive, subject to inter-observer variability, and demands significant specialist expertise. This paper proposes a deep learning framework for automated four-class brain tumor classification distinguishing glioma, meningioma, pituitary tumor, and healthy brain tissue from a dataset of 7,023 MRI scans. The proposed system employs a Vision Transformer (ViT-B/16) pretrained on ImageNet-21k as the backbone, augmented with a clinically motivated preprocessing and training pipeline. Contrast Limited Adaptive Histogram Equalization (CLAHE) is applied to enhance local contrast and accentuate tumor boundaries invisible to standard normalization. A two-stage fine-tuning strategy is adopted: the classification head is warmed up with the backbone frozen, followed by full fine-tuning with discriminative learning rates. MixUp and CutMix augmentation is applied per batch to improve generalization. Exponential Moving Average (EMA) of weights and Test-Time Augmentation (TTA) further stabilize and boost performance. Attention Rollout visualization provides clinically interpretable heatmaps of the brain regions driving each prediction. The proposed model achieves a test accuracy of 99.29%, macro F1-score of 99.25%, and perfect recall on both healthy and meningioma classes, outperforming all CNN-based baselines</p>
  </div>
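  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>💻 示意代码：</strong> A minimal OpenCV sketch of the CLAHE preprocessing step described above; clipLimit and tileGridSize here are common defaults, not the settings reported in the paper.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto;"><code>import cv2
import numpy as np

def clahe_enhance(gray, clip_limit=2.0, tile_grid=(8, 8)):
    """Apply Contrast Limited Adaptive Histogram Equalization to a uint8
    grayscale MRI slice, boosting local contrast around tumor boundaries."""
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    return clahe.apply(gray)

slice_u8 = (np.random.rand(224, 224) * 255).astype(np.uint8)  # stand-in MRI slice
enhanced = clahe_enhance(slice_u8)
print(enhanced.shape, enhanced.dtype)</code></pre>
  </div>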
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21311v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21311v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[视频生成] [扩散模型] [图像生成] Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation</title>
      <link>https://arxiv.org/abs/2604.21291v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21291v1</guid>
      <pubDate>Thu, 23 Apr 2026 05:10:15 GMT</pubDate>
      <dc:creator>Yuanchen Fei, Yude Zou, Zejian Kang et al.</dc:creator>
      <category>视频生成</category>
      <category>扩散模型</category>
      <category>图像生成</category>
      <description>Controllable human video generation aims to produce realistic videos of humans with explicitly guided motions and appearances, serving as a foundation for digital humans, animation, and embodied AI. How...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Exploring the Role of Synthetic Data Augmentation in Controllable Human-Centric Video Generation</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yuanchen Fei, Yude Zou, Zejian Kang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 视频生成, 扩散模型, 图像生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> lora, temporal consistency, diffusion, video synthesis, video generation</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Controllable human video generation aims to produce realistic videos of humans with explicitly guided motions and appearances,serving as a foundation for digital humans, animation, and embodied AI.However, the scarcity of largescale, diverse, and privacy safe human video datasets poses a major bottleneck, especially for rare identities and complex actions.Synthetic data provides a scalable and controllable alternative,yet its actual contribution to generative modeling remains underexplored due to the persistent Sim2Real gap.In this work,we systematically investigate the impact of synthetic data on controllable human video generation. We propose a diffusion-based framework that enables fine-grained control over appearance and motion while providing a unfied testbed to analyze how synthetic data interacts with real world data during training. Through extensive experiments, we reveal the complementary roles of synthetic and real data and demonstrate possible methods for efficiently selecting synthetic samples to enhance motion realism,temporal consistency,and identity preservation.Our study offers the first comprehensive exploration of synthetic data&apos;s role in human-centric video synthesis and provides practical insights for building data-efficient and generalizable generative models.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21291v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21291v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 视频生成, 扩散模型 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA</title>
      <link>https://arxiv.org/abs/2604.21290v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21290v1</guid>
      <pubDate>Thu, 23 Apr 2026 05:09:54 GMT</pubDate>
      <dc:creator>Anvitha Ramachandran, Dhruv Parikh, Viktor Prasanna</dc:creator>
      <category>模型架构</category>
      <description>Vision Graph Neural Networks (ViGs) represent an image as a graph of patch tokens, enabling adaptive, feature-driven neighborhoods. Unlike CNNs with fixed grid biases or Vision Transformers with globa...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">GraphLeap: Decoupling Graph Construction and Convolution for Vision GNN Acceleration on FPGA</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Anvitha Ramachandran, Dhruv Parikh, Viktor Prasanna</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> transformer</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Vision Graph Neural Networks (ViGs) represent an image as a graph of patch tokens, enabling adaptive, feature-driven neighborhoods. Unlike CNNs with fixed grid biases or Vision Transformers with global token interactions, ViGs rely on dynamic graph convolution: at each layer, a feature-dependent graph is built via k-nearest-neighbor (kNN) search on current patch features, followed by message passing. This per-layer graph construction is the main bottleneck, consuming 50--95\% of graph convolution time on CPUs and GPUs, scaling as $O(N^2)$ with the number of patches $N$, and creating a sequential dependency between graph construction and feature updates.   We introduce GraphLeap, a simple reformulation that removes this dependency by decoupling graph construction from feature update across layers. GraphLeap performs the feature update at layer $\ell$ using a graph built from the previous layer&apos;s features, while simultaneously using the current layer&apos;s features to construct the graph for layer $\ell+1$. This one-layer-lookahead graph construction enables concurrent graph construction and message passing. Although using prior-layer features can introduce minor accuracy degradation, lightweight fine-tuning for a few epochs is sufficient to recover the original accuracy. Building on GraphLeap, we present the first end-to-end FPGA accelerator for Vision GNNs. Our streaming, layer-pipelined design overlaps a kNN graph construction engine with a feature update engine, exploits node- and channel-level parallelism, and enables efficient on-chip dataflow without explicit edge-feature materialization. Evaluated on isotropic and pyramidal ViG models on an Alveo U280 FPGA, GraphLeap achieves up to $95.7\times$ speedup over CPU and $8.5\times$ speedup over GPU baselines, demonstrating the feasibility of real-time Vision GNN inference.</p>
  </div>
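  
  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>💻 示意代码：</strong> A NumPy sketch of the one-layer-lookahead reformulation: each feature update consumes the kNN graph built from the previous features, while the graph for the next layer is built from the new features (sequential here; overlapped on the FPGA). The mean-aggregation update is a placeholder for the real graph-convolution block.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto;"><code>import numpy as np

def knn_graph(x, k=4):
    """x: (N, D) patch features; returns (N, k) neighbor indices."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # pairwise sq. dists
    np.fill_diagonal(d2, np.inf)                          # exclude self-loops
    return np.argsort(d2, axis=1)[:, :k]

def vig_forward(x, num_layers=3, k=4):
    graph = knn_graph(x, k)                 # built once from the input features
    for _ in range(num_layers):
        nbrs = x[graph]                     # (N, k, D) gather via the stale graph
        x_new = 0.5 * x + 0.5 * nbrs.mean(1)  # placeholder feature update
        graph = knn_graph(x_new, k)         # lookahead: graph for the NEXT layer;
        x = x_new                           # in hardware this overlaps with the
    return x                                # feature-update engine

print(vig_forward(np.random.randn(16, 8)).shape)</code></pre>
  </div>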
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21290v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21290v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [图像生成] [模型架构] AttDiff-GAN: A Hybrid Diffusion-GAN Framework for Facial Attribute Editing</title>
      <link>https://arxiv.org/abs/2604.21289v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21289v1</guid>
      <pubDate>Thu, 23 Apr 2026 05:05:46 GMT</pubDate>
      <dc:creator>Wenmin Huang, Weiqi Luo, Xiaochun Cao et al.</dc:creator>
      <category>扩散模型</category>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Facial attribute editing aims to modify target attributes while preserving attribute-irrelevant content and overall image fidelity. Existing GAN-based methods provide favorable controllability, but of...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">AttDiff-GAN: A Hybrid Diffusion-GAN Framework for Facial Attribute Editing</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Wenmin Huang, Weiqi Luo, Xiaochun Cao et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> gan, image generation, image synthesis, dit, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Facial attribute editing aims to modify target attributes while preserving attribute-irrelevant content and overall image fidelity. Existing GAN-based methods provide favorable controllability, but often suffer from weak alignment between style codes and attribute semantics. Diffusion-based methods can synthesize highly realistic images; however, their editing precision is limited by the entanglement of semantic directions among different attributes. In this paper, we propose AttDiff-GAN, a hybrid framework that combines GAN-based attribute manipulation with diffusion-based image generation. A key challenge in such integration lies in the inconsistency between one-step adversarial learning and multi-step diffusion denoising, which makes effective optimization difficult. To address this issue, we decouple attribute editing from image synthesis by introducing a feature-level adversarial learning scheme to learn explicit attribute manipulation, and then using the manipulated features to guide the diffusion process for image generation, while also removing the reliance on semantic direction-based editing. Moreover, we enhance style-attribute alignment by introducing PriorMapper, which incorporates facial priors into style generation, and RefineExtractor, which captures global semantic relationships through a Transformer for more precise style extraction. Experimental results on CelebA-HQ show that the proposed method achieves more accurate facial attribute editing and better preservation of non-target attributes than state-of-the-art methods in both qualitative and quantitative evaluations.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21289v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21289v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [图像生成] [模型架构] LatRef-Diff: Latent and Reference-Guided Diffusion for Facial Attribute Editing and Style Manipulation</title>
      <link>https://arxiv.org/abs/2604.21279v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21279v1</guid>
      <pubDate>Thu, 23 Apr 2026 04:47:40 GMT</pubDate>
      <dc:creator>Wenmin Huang, Weiqi Luo, Xiaochun Cao et al.</dc:creator>
      <category>扩散模型</category>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Facial attribute editing and style manipulation are crucial for applications like virtual avatars and photo editing. However, achieving precise control over facial attributes without altering unrelate...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">LatRef-Diff: Latent and Reference-Guided Diffusion for Facial Attribute Editing and Style Manipulation</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Wenmin Huang, Weiqi Luo, Xiaochun Cao et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> diffusion model, gan, dit, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Facial attribute editing and style manipulation are crucial for applications like virtual avatars and photo editing. However, achieving precise control over facial attributes without altering unrelated features is challenging due to the complexity of facial structures and the strong correlations between attributes. While conditional GANs have shown progress, they are limited by accuracy issues and training instability. Diffusion models, though promising, face challenges in style manipulation due to the limited expressiveness of semantic directions. In this paper, we propose LatRef-Diff, a novel diffusion-based framework that addresses these limitations. We replace the traditional semantic directions in diffusion models with style codes and propose two methods for generating them: latent and reference guidance. Based on these style codes, we design a style modulation module that integrates them into the target image, enabling both random and customized style manipulation. This module incorporates learnable vectors, cross-attention mechanisms, and a hierarchical design to improve accuracy and image quality. Additionally, to enhance training stability while eliminating the need for paired images (e.g., before and after editing), we propose a forward-backward consistency training strategy. This strategy first removes the target attribute approximately using image-specific semantic directions and then restores it via style modulation, guided by perceptual and classification losses. Extensive experiments on CelebA-HQ demonstrate that LatRef-Diff achieves state-of-the-art performance in both qualitative and quantitative evaluations. Ablation studies validate the effectiveness of our model&apos;s design choices.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21279v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21279v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding</title>
      <link>https://arxiv.org/abs/2604.21268v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21268v1</guid>
      <pubDate>Thu, 23 Apr 2026 04:23:31 GMT</pubDate>
      <dc:creator>Wenkai Wang, Xiyun Li, Hongcan Guo et al.</dc:creator>
      <category>图像生成</category>
      <description>Graphical User Interface (GUI) grounding requires mapping natural language instructions to precise pixel coordinates. However, due to visually homogeneous elements and dense layouts, models typically ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Wenkai Wang, Xiyun Li, Hongcan Guo et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> lora</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Graphical User Interface (GUI) grounding requires mapping natural language instructions to precise pixel coordinates. However, due to visually homogeneous elements and dense layouts, models typically grasp semantic intent yet struggle with achieving precise localization. While scaling sampling attempts (Pass@k) reveals potential gains, static self-consistency strategies derived from geometric clustering often yield limited improvements, as the model&apos;s predictions tend to be spatially dispersed. In this paper, we propose replacing static consistency strategies with a learnable selection mechanism that selects the optimal target by critiquing its own proposals rendered on the screenshot. Given the significant disparity between the model&apos;s grounding and critiquing capabilities, we propose a co-evolving Propose-then-Critic framework. To jointly optimize these, we introduce a maturity-aware adaptive co-evolutionary reinforcement learning paradigm. This approach dynamically balances the training objectives of proposer and critic, where the diversity of the proposer&apos;s outputs enhances critic robustness, while the critic&apos;s maturing discrimination capability conversely unlocks the proposer&apos;s potential for extensive spatial exploration, fostering the mutual reinforcement and co-evolution of both capabilities, thereby ensuring generalizability to adapt to diverse and complex interface layouts. Extensive experiments over 6 benchmarks show that our method significantly enhances both grounding accuracy and critic reliability.</p>
  </div>
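  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0; color: #555;">🧪 <strong>Editor's sketch (illustrative, not the paper's code):</strong> a minimal propose-then-critic selection loop in which the critic scores each proposal rendered on the screenshot, replacing geometric-clustering self-consistency; <code>proposer</code>, <code>critic</code>, and <code>render_marker</code> are hypothetical stand-ins, and the co-evolutionary RL training is not shown.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Toy propose-then-critic grounding: the critic picks the click, not geometric clustering.
import numpy as np

rng = np.random.default_rng(0)

def proposer(instruction, screenshot, k=8):
    # stand-in: sample k candidate click coordinates, normalized to [0, 1]
    return rng.uniform(size=(k, 2))

def render_marker(screenshot, xy):
    # stand-in: draw the proposed click on the screenshot for the critic to inspect
    return {"image": screenshot, "marker": tuple(xy)}

def critic(instruction, rendered):
    # stand-in: score how well the marked location matches the instruction
    x, y = rendered["marker"]
    return -abs(x - 0.5) - abs(y - 0.5)   # dummy preference for the screen center

def ground(instruction, screenshot):
    candidates = proposer(instruction, screenshot)
    scores = [critic(instruction, render_marker(screenshot, xy)) for xy in candidates]
    return candidates[int(np.argmax(scores))]

print(ground("open the settings menu", screenshot=None))
</code></pre>
  </div>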
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21268v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21268v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection</title>
      <link>https://arxiv.org/abs/2604.21227v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21227v1</guid>
      <pubDate>Thu, 23 Apr 2026 02:47:24 GMT</pubDate>
      <dc:creator>Yuze Li, Zhilei Liu</dc:creator>
      <category>模型架构</category>
      <description>Facial action unit (AU) detection remains challenging because it involves heterogeneous, AU-specific uncertainties arising at both the representation and decision stages. Recent methods have improved ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">UAU-Net: Uncertainty-aware Representation Learning and Evidential Classification for Facial Action Unit Detection</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yuze Li, Zhilei Liu</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> u-net, vae, dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Facial action unit (AU) detection remains challenging because it involves heterogeneous, AU-specific uncertainties arising at both the representation and decision stages. Recent methods have improved discriminative feature learning, but they often treat the AU representations as deterministic, overlooking uncertainty caused by visual noise, subject-dependent appearance variations, and ambiguous inter-AU relationships, all of which can substantially degrade robustness. Meanwhile, conventional point-estimation classifiers often provide poorly calibrated confidence, producing overconfident predictions, especially under the severe label imbalance typical of AU datasets. We propose UAU-Net, an Uncertainty-aware AU detection framework that explicitly models uncertainty at both stages. At the representation stage, we introduce CV-AFE, a conditional VAE (CVAE)-based AU feature extraction module that learns probabilistic AU representations by jointly estimating feature means and variances across multiple spatio-temporal scales; conditioning on AU labels further enables CV-AFE to capture uncertainty associated with inter-AU dependencies. At the decision stage, we design AB-ENN, an Asymmetric Beta Evidential Neural Network for multi-label AU detection, which parameterizes predictive uncertainty with Beta distributions and mitigates overconfidence via an asymmetric loss tailored to highly imbalanced binary labels. Extensive experiments on BP4D and DISFA show that UAU-Net achieves strong AU detection performance, and further analyses indicate that modeling uncertainty in both representation learning and evidential prediction improves robustness and reliability.</p>
  </div>
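  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0; color: #555;">🧪 <strong>Editor's sketch (illustrative, not the paper's code):</strong> a toy view of the two uncertainty stages, a reparameterized (mean, variance) AU feature and a Beta-evidential prediction whose expected probability and uncertainty come from positive/negative evidence; the feature extractor is omitted, and the evidence values and the formula u = 2/(alpha+beta) are assumptions following standard evidential deep learning, not the paper's exact modules.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Toy view of both uncertainty stages (formulas follow common evidential deep learning).
import numpy as np

rng = np.random.default_rng(0)

# representation stage: probabilistic AU feature via reparameterization
mu, logvar = rng.normal(size=32), rng.normal(size=32) * 0.1
feature = mu + np.exp(0.5 * logvar) * rng.normal(size=32)

# decision stage: Beta-evidential prediction for one AU
alpha_ev, beta_ev = 4.2, 1.3                    # positive / negative evidence (made up)
p_active = alpha_ev / (alpha_ev + beta_ev)      # mean of Beta(alpha, beta)
uncertainty = 2.0 / (alpha_ev + beta_ev)        # little total evidence, high uncertainty
print(feature.shape, round(p_active, 3), round(uncertainty, 3))
</code></pre>
  </div>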
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21227v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21227v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[视频生成] [扩散模型] [图像生成] Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation</title>
      <link>https://arxiv.org/abs/2604.21221v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21221v1</guid>
      <pubDate>Thu, 23 Apr 2026 02:22:25 GMT</pubDate>
      <dc:creator>Boxun Xu, Yuming Du, Zichang Liu et al.</dc:creator>
      <category>视频生成</category>
      <category>扩散模型</category>
      <category>图像生成</category>
      <description>We introduce Sparse Forcing, a training-and-inference paradigm for autoregressive video diffusion models that improves long-horizon generation quality while reducing decoding latency. Sparse Forcing i...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Boxun Xu, Yuming Du, Zichang Liu et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 视频生成, 扩散模型, 图像生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> text-to-video, diffusion model, diffusion, video generation, video diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">We introduce Sparse Forcing, a training-and-inference paradigm for autoregressive video diffusion models that improves long-horizon generation quality while reducing decoding latency. Sparse Forcing is motivated by an empirical observation in autoregressive diffusion rollouts: attention concentrates on a persistent subset of salient visual blocks, forming an implicit spatiotemporal memory in the KV cache, and exhibits a locally structured block-sparse pattern within sliding windows. Building on this observation, we propose a trainable native sparsity mechanism that learns to compress, preserve, and update these persistent blocks while restricting computation within each local window to a dynamically selected local neighborhood. To make the approach practical at scale for both training and inference, we further propose Persistent Block-Sparse Attention (PBSA), an efficient GPU kernel that accelerates sparse attention and memory updates for low-latency, memory-efficient decoding. Experiments show that Sparse Forcing improves the VBench score by +0.26 over Self-Forcing on 5-second text-to-video generation while delivering a 1.11-1.17x decoding speedup and 42% lower peak KV-cache footprint. The gains are more pronounced on longer-horizon rollouts, delivering improved visual quality with +0.68 and +2.74 VBench improvements, and 1.22x and 1.27x speedups on 20-second and 1-minute generations, respectively.</p>
  </div>
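  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0; color: #555;">🧪 <strong>Editor's sketch (illustrative, not the paper's code):</strong> a toy block-sparse attention step that attends only to a few persistent salient KV blocks plus a local sliding-window block, the pattern the abstract describes; the block size, block-scoring rule, and selection budget are assumptions, and the real PBSA kernel is a fused GPU implementation.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Toy block-sparse attention over a KV cache (illustrative only; PBSA is a fused GPU kernel).
import numpy as np

def block_sparse_attention(q, K, V, block=4, keep_blocks=2, local_blocks=1):
    n_blocks = K.shape[0] // block
    # score each KV block by its average key similarity to the query
    block_scores = np.array([
        float(np.mean(K[b * block:(b + 1) * block] @ q)) for b in range(n_blocks)
    ])
    persistent = set(np.argsort(block_scores)[-keep_blocks:].tolist())   # salient "memory" blocks
    local = set(range(max(0, n_blocks - local_blocks), n_blocks))        # sliding-window blocks
    keep = sorted(persistent | local)
    idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep])
    logits = K[idx] @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[idx]   # attention output computed from the selected tokens only

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=16), rng.normal(size=(32, 16)), rng.normal(size=(32, 16))
print(block_sparse_attention(q, K, V).shape)
</code></pre>
  </div>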
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21221v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21221v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 视频生成, 扩散模型 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response</title>
      <link>https://arxiv.org/abs/2604.21199v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21199v1</guid>
      <pubDate>Thu, 23 Apr 2026 01:45:10 GMT</pubDate>
      <dc:creator>Stephan Xie, Ben Cohen, Mononito Goswami et al.</dc:creator>
      <category>多模态生成</category>
      <description>Time series question-answering (TSQA), in which we ask natural language questions to infer and reason about properties of time series, is a promising yet underexplored capability of foundation models....</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">ARFBench: Benchmarking Time Series Question Answering Ability for Software Incident Response</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Stephan Xie, Ben Cohen, Mononito Goswami et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vlm</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Time series question-answering (TSQA), in which we ask natural language questions to infer and reason about properties of time series, is a promising yet underexplored capability of foundation models. In this work, we present ARFBench, a TSQA benchmark that evaluates the understanding of multimodal foundation models (FMs) on time series anomalies prevalent in software incident data. ARFBench consists of 750 questions across 142 time series and 5.38M data points from 63 production incidents sourced exclusively from internal telemetry at Datadog. We evaluate leading proprietary and open-source LLMs, VLMs, and time series FMs and observe that frontier VLMs perform markedly better than existing baselines; the leading model (GPT-5) achieves a 62.7% accuracy and 51.9% F1. We next demonstrate the promise of specialized multimodal approaches. We develop a novel TSFM + VLM hybrid prototype which we post-train on a small set of synthetic and real data that yields comparable overall F1 and accuracy with frontier models. Lastly, we find models and human domain experts exhibit complementary strengths. We define a model-expert oracle, a best-of-2 oracle selector over model and expert answers, yielding 82.8% F1 and 87.2% accuracy and establishing a new superhuman frontier for future TSQA models. The benchmark is available at https://huggingface.co/datasets/Datadog/ARFBench.</p>
  </div>
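  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0; color: #555;">🧪 <strong>Editor's sketch (illustrative, not the paper's code):</strong> the best-of-2 model-expert oracle reduces to counting a question as solved if either the model or the expert answers it correctly; the flags below are toy data, and the benchmark itself also reports F1 alongside accuracy.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Best-of-2 "model-expert oracle" on toy per-question correctness flags.
model_correct  = [1, 0, 1, 0, 1, 1, 0, 0]
expert_correct = [0, 1, 1, 0, 0, 1, 1, 0]

# a question counts as solved if either answer is correct
oracle_correct = [max(m, e) for m, e in zip(model_correct, expert_correct)]

def accuracy(flags):
    return sum(flags) / len(flags)

print(accuracy(model_correct), accuracy(expert_correct), accuracy(oracle_correct))
</code></pre>
  </div>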
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21199v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21199v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images</title>
      <link>https://arxiv.org/abs/2604.21182v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21182v1</guid>
      <pubDate>Thu, 23 Apr 2026 00:58:43 GMT</pubDate>
      <dc:creator>Yuki Fujimura, Takahiro Kushida, Kazuya Kitano et al.</dc:creator>
      <category>模型架构</category>
      <description>We propose WildSplatter, a feed-forward 3D Gaussian Splatting (3DGS) model for unconstrained images with unknown camera parameters and varying lighting conditions. 3DGS is an effective scene represent...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yuki Fujimura, Takahiro Kushida, Kazuya Kitano et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">We propose WildSplatter, a feed-forward 3D Gaussian Splatting (3DGS) model for unconstrained images with unknown camera parameters and varying lighting conditions. 3DGS is an effective scene representation that enables high-quality, real-time rendering; however, it typically requires iterative optimization and multi-view images captured under consistent lighting with known camera parameters. WildSplatter is trained on unconstrained photo collections and jointly learns 3D Gaussians and appearance embeddings conditioned on input images. This design enables flexible modulation of Gaussian colors to represent significant variations in lighting and appearance. Our method reconstructs 3D Gaussians from sparse input views in under one second, while also enabling appearance control under diverse lighting conditions. Experimental results demonstrate that our approach outperforms existing pose-free 3DGS methods on challenging real-world datasets with varying illumination.</p>
  </div>
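  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0; color: #555;">🧪 <strong>Editor's sketch (illustrative, not the paper's code):</strong> one plausible reading of appearance-conditioned color modulation, a FiLM-style scale and shift on predicted Gaussian colors driven by a per-image appearance embedding; <code>appearance_mlp</code> and all sizes are hypothetical.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Hypothetical appearance modulation of 3D Gaussian colors (FiLM-style scale and shift).
import numpy as np

rng = np.random.default_rng(0)

def appearance_mlp(embed, out_dim=3):
    # stand-in for a small learned MLP mapping the embedding to per-channel (scale, shift)
    w = rng.normal(size=(embed.shape[0], 2 * out_dim)) * 0.1
    gamma, beta = np.split(embed @ w, 2)
    return 1.0 + gamma, beta

gaussian_colors = rng.uniform(size=(1000, 3))   # base colors of the predicted 3D Gaussians
appearance_embed = rng.normal(size=16)          # embedding inferred from the input image's lighting

gamma, beta = appearance_mlp(appearance_embed)
modulated = np.clip(gaussian_colors * gamma + beta, 0.0, 1.0)
print(modulated.shape)
</code></pre>
  </div>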
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21182v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21182v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] [模型架构] Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment</title>
      <link>https://arxiv.org/abs/2604.21160v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21160v1</guid>
      <pubDate>Thu, 23 Apr 2026 00:01:40 GMT</pubDate>
      <dc:creator>Jingkun Chen, Ruoshi Xu, Mingqi Gao et al.</dc:creator>
      <category>多模态生成</category>
      <category>模型架构</category>
      <description>Point-Vision-Language Models promise to empower embodied agents with executable spatial reasoning, yet they frequently succumb to geometric hallucination where predicted 3D structures contradict the o...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Jingkun Chen, Ruoshi Xu, Mingqi Gao et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, vlm, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-23</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Point-Vision-Language Models promise to empower embodied agents with executable spatial reasoning, yet they frequently succumb to geometric hallucination where predicted 3D structures contradict the observed 2D reality. We identify a key cause of this failure not as a representation bottleneck but as a structural misalignment in reinforcement learning, where sparse geometric tokens are drowned out by noisy and broadcasted sequence-level rewards. To resolve this causal dilution, we propose Geometric Reward Credit Assignment, a framework that disentangles holistic supervision into field-specific signals and routes them exclusively to their responsible token spans. This mechanism transforms vague feedback into precise gradient updates and effectively turns generic policy optimization into targeted structural alignment. Furthermore, we internalize physical constraints via a Reprojection-Consistency term which serves as a cross-modal verifier to penalize physically impossible geometries. Validated on a calibrated benchmark derived from ShapeNetCore, our approach bridges the reliability gap by boosting 3D KPA from 0.64 to 0.93, increasing 3D bounding box intersection over union to 0.686, and raising reprojection consistency scores to 0.852. Crucially, these gains are achieved while maintaining robust 2D localization performance, marking a meaningful step from plausible textual outputs toward physically verifiable spatial predictions.</p>
  </div>
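  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0; color: #555;">🧪 <strong>Editor's sketch (illustrative, not the paper's code):</strong> the credit-assignment idea of routing each field-specific reward only to the token span that produced that field, compared with broadcasting one sequence-level reward to every token; the spans and reward values below are made up.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Routing field-specific rewards to responsible token spans vs. broadcasting one reward.
import numpy as np

seq_len = 12
spans = {                         # token index ranges responsible for each predicted field (made up)
    "bbox_3d": range(2, 6),
    "pose": range(6, 9),
    "reprojection": range(9, 12),
}
field_rewards = {"bbox_3d": 0.8, "pose": -0.2, "reprojection": 0.5}
sequence_reward = sum(field_rewards.values()) / len(field_rewards)

broadcast_credit = np.full(seq_len, sequence_reward)   # the diluted baseline
routed_credit = np.zeros(seq_len)
for field, span in spans.items():
    routed_credit[list(span)] = field_rewards[field]    # each field's signal hits only its own tokens

print(broadcast_credit.round(2))
print(routed_credit.round(2))
</code></pre>
  </div>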
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21160v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21160v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [图像生成] [模型架构] WFM: 3D Wavelet Flow Matching for Ultrafast Multi-Modal MRI Synthesis</title>
      <link>https://arxiv.org/abs/2604.21146v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21146v1</guid>
      <pubDate>Wed, 22 Apr 2026 23:19:50 GMT</pubDate>
      <dc:creator>Yalcin Tur, Mihajlo Stojkovic, Ulas Bagci</dc:creator>
      <category>扩散模型</category>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Diffusion models have achieved remarkable quality in multi-modal MRI synthesis, but their computational cost (hundreds of sampling steps and separate models per modality) limits clinical deployment. W...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">WFM: 3D Wavelet Flow Matching for Ultrafast Multi-Modal MRI Synthesis</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yalcin Tur, Mihajlo Stojkovic, Ulas Bagci</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> flow matching, diffusion model, dit, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Diffusion models have achieved remarkable quality in multi-modal MRI synthesis, but their computational cost (hundreds of sampling steps and separate models per modality) limits clinical deployment. We observe that this inefficiency stems from an unnecessary starting point: diffusion begins from pure noise, discarding the structural information already present in available MRI sequences. We propose WFM (Wavelet Flow Matching), which instead learns a direct flow from an informed prior, the mean of conditioning modalities in wavelet space, to the target distribution. Because the source and target share underlying anatomy and differ primarily in contrast, this formulation enables accurate synthesis in just 1-2 integration steps. A single 82M-parameter model with class conditioning synthesizes all four BraTS modalities (T1, T1c, T2, FLAIR), replacing four separate diffusion models totaling 326M parameters. On BraTS 2024, WFM achieves 26.8 dB PSNR and 0.94 SSIM, within 1-2 dB of diffusion baselines, while running 250-1000x faster (0.16-0.64s vs. 160s per volume). This speed-quality trade-off makes real-time MRI synthesis practical for clinical workflows. Code is available at https://github.com/yalcintur/WFM.</p>
  </div>
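  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0; color: #555;">🧪 <strong>Editor's sketch (illustrative, not the paper's code):</strong> flow matching from an informed prior, here the mean of the conditioning modalities, using the usual linear interpolation path and constant-velocity target, plus a 2-step Euler sampler; the toy arrays stand in for wavelet-space volumes, and class conditioning is omitted.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Flow matching from an informed prior (mean of the conditioning modalities) instead of noise.
import numpy as np

rng = np.random.default_rng(0)

def fm_training_pair(cond_modalities, target):
    source = np.mean(cond_modalities, axis=0)   # informed prior, e.g. mean in wavelet space
    t = rng.uniform()
    x_t = (1.0 - t) * source + t * target       # linear interpolation path
    v_target = target - source                  # constant velocity the network should predict
    return t, x_t, v_target

def euler_sample(v_model, source, steps=2):
    x, dt = source.copy(), 1.0 / steps
    for i in range(steps):                      # 1-2 steps suffice when source and target are close
        x = x + dt * v_model(x, i * dt)
    return x

t1, t1c, t2 = rng.normal(size=(3, 8, 8))        # toy stand-ins for conditioning volumes
flair = rng.normal(size=(8, 8))                 # toy target modality
prior = np.mean([t1, t1c, t2], axis=0)

_, x_t, v = fm_training_pair([t1, t1c, t2], flair)
sample = euler_sample(lambda x, t: flair - prior, prior)   # oracle velocity reproduces the target
print(x_t.shape, np.allclose(sample, flair))
</code></pre>
  </div>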
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21146v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21146v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] HyperFM: An Efficient Hyperspectral Foundation Model with Spectral Grouping</title>
      <link>https://arxiv.org/abs/2604.21127v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21127v1</guid>
      <pubDate>Wed, 22 Apr 2026 22:33:15 GMT</pubDate>
      <dc:creator>Zahid Hassan Tushar, Sanjay Purushotham</dc:creator>
      <category>模型架构</category>
      <description>The NASA PACE mission provides unprecedented hyperspectral observations of ocean color, aerosols, and clouds, offering new insights into how these components interact and influence Earth&apos;s climate and...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">HyperFM: An Efficient Hyperspectral Foundation Model with Spectral Grouping</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Zahid Hassan Tushar, Sanjay Purushotham</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">The NASA PACE mission provides unprecedented hyperspectral observations of ocean color, aerosols, and clouds, offering new insights into how these components interact and influence Earth&apos;s climate and air quality. Its Ocean Color Instrument measures light across hundreds of finely spaced wavelength bands, enabling detailed characterization of features such as phytoplankton composition, aerosol properties, and cloud microphysics. However, hyperspectral data of this scale is large, complex, and difficult to label, requiring specialized processing and analysis techniques. Existing foundation models, which have transformed computer vision and natural language processing, are generally trained on standard RGB imagery and therefore struggle to interpret the continuous spectral signatures captured by PACE. While recent advances have introduced hyperspectral foundation models, they are typically trained on cloud-free observations and often remain limited to single-sensor datasets due to spectral inconsistencies across instruments. Moreover, existing models tend to be parameter-heavy and computationally expensive, limiting scalability and adoption in operational settings. To address these challenges, we introduce HyperFM, a parameter-efficient hyperspectral foundation model that leverages intra-group and inter-group spectral attention along with hybrid parameter decomposition to better capture spectral spatial relationships while reducing computational cost. HyperFM demonstrates consistent performance improvements over existing hyperspectral foundation models and task-specific state-of-the-art methods across four benchmark downstream atmospheric cloud property retrieval tasks. To support further research, we additionally release HyperFM250K, a large-scale hyperspectral dataset from the PACE mission that includes both clear and cloudy scenes.</p>
  </div>
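  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0; color: #555;">🧪 <strong>Editor's sketch (illustrative, not the paper's code):</strong> spectral grouping with toy single-head attention, first within each group of adjacent bands and then across per-group summary tokens; the group size, the pooling, and the attention form are assumptions, and the hybrid parameter decomposition is not shown.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Toy intra-group then inter-group spectral attention (all sizes are illustrative).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(x):
    d = x.shape[-1]
    return softmax(x @ x.T / np.sqrt(d)) @ x

rng = np.random.default_rng(0)
bands = rng.normal(size=(120, 32))            # 120 spectral bands, 32-dim features each
groups = bands.reshape(12, 10, 32)            # 12 groups of 10 adjacent bands

intra = np.stack([attend(g) for g in groups]) # intra-group spectral attention
group_tokens = intra.mean(axis=1)             # one summary token per group
inter = attend(group_tokens)                  # inter-group attention over summaries
print(intra.shape, inter.shape)
</code></pre>
  </div>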
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21127v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21127v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Materialistic RIR: Material Conditioned Realistic RIR Generation</title>
      <link>https://arxiv.org/abs/2604.21119v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21119v1</guid>
      <pubDate>Wed, 22 Apr 2026 22:04:35 GMT</pubDate>
      <dc:creator>Mahnoor Fatima Saad, Sagnik Majumder, Kristen Grauman et al.</dc:creator>
      <category>模型架构</category>
      <description>Rings like gold, thuds like wood! The sound we hear in a scene is shaped not only by the spatial layout of the environment but also by the materials of the objects and surfaces within it. For instance...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Materialistic RIR: Material Conditioned Realistic RIR Generation</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Mahnoor Fatima Saad, Sagnik Majumder, Kristen Grauman et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit, dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Rings like gold, thuds like wood! The sound we hear in a scene is shaped not only by the spatial layout of the environment but also by the materials of the objects and surfaces within it. For instance, a room with wooden walls will produce a different acoustic experience from a room with the same spatial layout but concrete walls. Accurately modeling these effects is essential for applications such as virtual reality, robotics, architectural design, and audio engineering. Yet, existing methods for acoustic modeling often entangle spatial and material influences in correlated representations, which limits user control and reduces the realism of the generated acoustics. In this work, we present a novel approach for material-controlled Room Impulse Response (RIR) generation that explicitly disentangles the effects of spatial and material cues in a scene. Our approach models the RIR using two modules: a spatial module that captures the influence of the spatial layout of the scene, and a material module that modulates this spatial RIR according to a user-specified material configuration. This explicitly disentangled design allows users to easily modify the material configuration of a scene and observe its impact on acoustics without altering the spatial structure or scene content. Our model provides significant improvements over prior approaches on both acoustic-based metrics (up to +16% on RTE) and material-based metrics (up to +70%). Furthermore, through a human perceptual study, we demonstrate the improved realism and material sensitivity of our model compared to the strongest baselines.</p>
  </div>
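  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0; color: #555;">🧪 <strong>Editor's sketch (illustrative, not the paper's code):</strong> one way to read "modulating a spatial RIR by material", scaling the RIR's spectrum per frequency band according to an assumed absorption profile; the absorption numbers and the frequency split are invented for illustration and do not come from the paper.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Toy material modulation of a spatial RIR in the frequency domain (all values made up).
import numpy as np

rng = np.random.default_rng(0)
fs = 16000
spatial_rir = rng.normal(size=fs // 4) * np.exp(-np.linspace(0, 8, fs // 4))   # toy decaying RIR

freqs = np.fft.rfftfreq(spatial_rir.size, d=1.0 / fs)
absorption = np.where(freqs > 2000, 0.6, 0.2)       # assumed band-dependent absorption profile
material_filter = 1.0 - absorption

modulated = np.fft.irfft(np.fft.rfft(spatial_rir) * material_filter, n=spatial_rir.size)
print(modulated.shape)
</code></pre>
  </div>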
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21119v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21119v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [模型架构] Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery</title>
      <link>https://arxiv.org/abs/2604.21102v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21102v1</guid>
      <pubDate>Wed, 22 Apr 2026 21:42:09 GMT</pubDate>
      <dc:creator>Siyuan Yao, Siavash Ghorbany, Kuangshi Ai et al.</dc:creator>
      <category>扩散模型</category>
      <category>模型架构</category>
      <description>We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-t...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Leveraging Multimodal LLMs for Built Environment and Housing Attribute Assessment from Street-View Imagery</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Siyuan Yao, Siavash Ghorbany, Kuangshi Ai et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, distillation, transformer</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">We present a novel framework for automatically evaluating building conditions nationwide in the United States by leveraging large language models (LLMs) and Google Street View (GSV) imagery. By fine-tuning Gemma 3 27B on a modest human-labeled dataset, our approach achieves strong alignment with human mean opinion scores (MOS), outperforming even individual raters on SRCC and PLCC relative to the MOS benchmark. To enhance efficiency, we apply knowledge distillation, transferring the capabilities of Gemma 3 27B to a smaller Gemma 3 4B model that achieves comparable performance with a 3x speedup. Further, we distill the knowledge into a CNN-based model (EfficientNetV2-M) and a transformer (SwinV2-B), delivering close performance while achieving a 30x speed gain. Furthermore, we investigate LLMs&apos; capabilities for assessing an extensive list of built environment and housing attributes through a human-AI alignment study and develop a visualization dashboard that integrates LLM assessment outcomes for downstream analysis by homeowners. Our framework offers a flexible and efficient solution for large-scale building condition assessment, enabling high accuracy with minimal human labeling effort.</p>
  </div>
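  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0; color: #555;">🧪 <strong>Editor's sketch (illustrative, not the paper's code):</strong> a toy distillation objective in which a small student matches the fine-tuned teacher's condition scores while staying close to the few human MOS labels; both scorers, the loss weights, and the data are hypothetical stand-ins.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Toy score distillation: student regresses the teacher's outputs plus sparse human labels.
import numpy as np

rng = np.random.default_rng(0)

def teacher(images):                 # stand-in for the fine-tuned large model's condition scores
    return images.mean(axis=(1, 2))

def student(images, w):              # stand-in for a small CNN / transformer scorer
    return images.mean(axis=(1, 2)) * w

images = rng.uniform(size=(16, 8, 8))
human_mos = teacher(images) + rng.normal(scale=0.05, size=16)   # few human-labeled scores

t_scores, s_scores = teacher(images), student(images, 0.5)
distill_loss = np.mean((s_scores - t_scores) ** 2)
label_loss = np.mean((s_scores - human_mos) ** 2)
print(round(float(distill_loss + 0.1 * label_loss), 4))
</code></pre>
  </div>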
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21102v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21102v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models</title>
      <link>https://arxiv.org/abs/2604.21079v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21079v1</guid>
      <pubDate>Wed, 22 Apr 2026 20:44:24 GMT</pubDate>
      <dc:creator>Juhong Min, Lazar Valkov, Vitali Petsiuk et al.</dc:creator>
      <category>多模态生成</category>
      <description>Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides &quot;where ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Juhong Min, Lazar Valkov, Vitali Petsiuk et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Vision-language models benefit from high-resolution images, but the increase in visual-token count incurs high compute overhead. Humans resolve this tension via foveation: a coarse view guides &quot;where to look&quot;, while selectively acquired high-acuity evidence refines &quot;what to think&quot;. We introduce Foveated Reasoner, an autoregressive vision-language framework that unifies foveation and reasoning within a single decoding trajectory. Starting from a low-resolution view, the model triggers foveation only when needed, retrieves high-resolution evidence from selected regions, and injects it back into the same decoding trajectory. We train the method with a two-stage pipeline: coldstart supervision to bootstrap foveation behavior, followed by reinforcement learning to jointly improve evidence acquisition and task accuracy while discouraging trivial &quot;see-everything&quot; solutions. Experiments show that the method learns effective foveation policies and achieves stronger accuracy under tight visual-token budgets across multiple vision-language benchmarks.</p>
  </div>
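  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0; color: #555;">🧪 <strong>Editor's sketch (illustrative, not the paper's code):</strong> the control flow of foveated decoding, keep generating on coarse tokens, and when the model emits a foveation action, crop the named region at high resolution, encode it, and append the new tokens to the same trajectory; the action format and every function here are hypothetical.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Control flow of foveated decoding; every function and the action format are made up.
def encode(view):                    # stand-in visual encoder: one token per view/crop
    return [f"tok({view})"]

def decode_step(tokens):             # stand-in LM step that occasionally asks to look closer
    if len(tokens) == 2:
        return {"action": "foveate", "region": (0.2, 0.3, 0.4, 0.5)}
    if len(tokens) > 5:
        return {"action": "answer", "text": "the save button, top-left"}
    return {"action": "token", "text": "..."}

def crop_highres(image, region):     # stand-in for cropping the full-resolution image
    return f"crop{region}"

tokens, image = encode("lowres_view"), "full_image"
while True:
    step = decode_step(tokens)
    if step["action"] == "foveate":
        tokens += encode(crop_highres(image, step["region"]))   # inject high-acuity evidence
    elif step["action"] == "answer":
        print(step["text"])
        break
    else:
        tokens.append(step["text"])
</code></pre>
  </div>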
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21079v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21079v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [图像生成] [模型架构] Optimizing Diffusion Priors with a Single Observation</title>
      <link>https://arxiv.org/abs/2604.21066v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21066v1</guid>
      <pubDate>Wed, 22 Apr 2026 20:18:57 GMT</pubDate>
      <dc:creator>Frederic Wang, Katherine L. Bouman</dc:creator>
      <category>扩散模型</category>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>While diffusion priors generate high-quality posterior samples across many inverse problems, they are often trained on limited training sets or purely simulated data, thus inheriting the errors and bi...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Optimizing Diffusion Priors with a Single Observation</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Frederic Wang, Katherine L. Bouman</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> diffusion model, dit, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">While diffusion priors generate high-quality posterior samples across many inverse problems, they are often trained on limited training sets or purely simulated data, thus inheriting the errors and biases of these underlying sources. Current approaches to finetuning diffusion models rely on a large number of observations with varying forward operators, which can be difficult to collect for many applications, and thus lead to overfitting when the measurement set is small. We propose a method for tuning a prior from only a single observation by combining existing diffusion priors into a single product-of-experts prior and identifying the exponents that maximize the Bayesian evidence. We validate our method on real-world inverse problems, including black hole imaging, where the true prior is unknown a priori, and image deblurring with text-conditioned priors. We find that the evidence is often maximized by priors that extend beyond those trained on a single dataset. By generalizing the prior through exponent weighting, our approach enables posterior sampling from both tempered and combined diffusion models, yielding more flexible priors that improve the trustworthiness of the resulting posterior image distribution.</p>
  </div>
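  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0; color: #555;">🧪 <strong>Editor's sketch (illustrative, not the paper's code):</strong> a product-of-experts prior p(x) proportional to prod_i p_i(x)^w_i has score sum_i w_i * score_i(x), so combining and tempering reduces to exponent weights; the grid search and the <code>evidence_estimate</code> placeholder only gesture at the paper's Bayesian-evidence objective.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Product-of-experts score combination with a placeholder evidence-driven exponent search.
import itertools
import numpy as np

rng = np.random.default_rng(0)

def score_expert_a(x):   # stand-in denoising score of prior A
    return -(x - 1.0)

def score_expert_b(x):   # stand-in denoising score of prior B
    return -(x + 1.0)

def combined_score(x, w):
    # score of p(x) proportional to p_a(x)^w0 * p_b(x)^w1
    return w[0] * score_expert_a(x) + w[1] * score_expert_b(x)

def evidence_estimate(w, observation):
    # placeholder objective: reward exponents whose combined prior mode sits near the observation
    mode = (w[0] * 1.0 + w[1] * (-1.0)) / max(sum(w), 1e-8)
    return -float(np.mean((observation - mode) ** 2))

observation = rng.normal(loc=0.4, size=32)
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
best = max(itertools.product(grid, grid), key=lambda w: evidence_estimate(w, observation))
print(best, round(evidence_estimate(best, observation), 4))
</code></pre>
  </div>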
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21066v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21066v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Clinically-Informed Modeling for Pediatric Brain Tumor Classification from Whole-Slide Histopathology Images</title>
      <link>https://arxiv.org/abs/2604.21060v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21060v1</guid>
      <pubDate>Wed, 22 Apr 2026 20:04:09 GMT</pubDate>
      <dc:creator>Joakim Nguyen, Jian Yu, Jinrui Fang et al.</dc:creator>
      <category>模型架构</category>
      <description>Accurate diagnosis of pediatric brain tumors, starting with histopathology, presents unique challenges for deep learning, including severe data scarcity, class imbalance, and fine-grained morphologic ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Clinically-Informed Modeling for Pediatric Brain Tumor Classification from Whole-Slide Histopathology Images</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Joakim Nguyen, Jian Yu, Jinrui Fang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Accurate diagnosis of pediatric brain tumors, starting with histopathology, presents unique challenges for deep learning, including severe data scarcity, class imbalance, and fine-grained morphologic overlap across diagnostically distinct subtypes. While pathology foundation models have advanced patch-level representation learning, their effective adaptation to weakly supervised pediatric brain tumor classification under limited data remains underexplored. In this work, we introduce an expert-guided contrastive fine-tuning framework for pediatric brain tumor diagnosis from whole-slide images (WSI). Our approach integrates contrastive learning into slide-level multiple instance learning (MIL) to explicitly regularize the geometry of slide-level representations during downstream fine-tuning. We propose both a general supervised contrastive setting and an expert-guided variant that incorporates clinically informed hard negatives targeting diagnostically confusable subtypes. Through comprehensive experiments on pediatric brain tumor WSI classification under realistic low-sample and class-imbalanced conditions, we demonstrate that contrastive fine-tuning yields measurable improvements in fine-grained diagnostic distinctions. Our experimental analyses reveal complementary strengths across different contrastive strategies, with expert-guided hard negatives promoting more compact intra-class representations and improved inter-class separation. This work highlights the importance of explicitly shaping slide-level representations for robust fine-grained classification in data-scarce pediatric pathology settings.</p>
  </div>
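  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0; color: #555;">🧪 <strong>Editor's sketch (illustrative, not the paper's code):</strong> a supervised contrastive loss over slide-level embeddings in which negatives from clinically confusable subtype pairs are up-weighted; the confusable-pair set, the weighting, and the temperature are assumptions.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Supervised contrastive loss with expert-guided hard negatives (weights/pairs are made up).
import numpy as np

def supcon_with_hard_negatives(emb, labels, confusable, tau=0.1, hard_weight=2.0):
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T / tau
    np.fill_diagonal(sim, -np.inf)                       # never contrast a slide with itself
    loss, n = 0.0, len(labels)
    for i in range(n):
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue
        w = np.ones(n)
        for j in range(n):
            if labels[j] != labels[i] and (labels[i], labels[j]) in confusable:
                w[j] = hard_weight                       # expert-guided hard negatives count more
        denom = np.sum(w * np.exp(sim[i]) * (np.arange(n) != i))
        loss += -np.mean([np.log(np.exp(sim[i, j]) / denom) for j in pos])
    return loss / n

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
labels = [0, 0, 1, 1, 2, 2, 3, 3]
confusable = {(0, 1), (1, 0)}                            # e.g. two morphologically similar subtypes
print(round(float(supcon_with_hard_negatives(emb, labels, confusable)), 4))
</code></pre>
  </div>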
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21060v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21060v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Neuro-Symbolic Manipulation Understanding with Enriched Semantic Event Chains</title>
      <link>https://arxiv.org/abs/2604.21053v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21053v1</guid>
      <pubDate>Wed, 22 Apr 2026 19:53:48 GMT</pubDate>
      <dc:creator>Fatemeh Ziaeetabar</dc:creator>
      <category>模型架构</category>
      <description>Robotic systems operating in human environments must reason about how object interactions evolve over time, which actions are currently being performed, and what manipulation step is likely to follow....</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Neuro-Symbolic Manipulation Understanding with Enriched Semantic Event Chains</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Fatemeh Ziaeetabar</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Robotic systems operating in human environments must reason about how object interactions evolve over time, which actions are currently being performed, and what manipulation step is likely to follow. Classical enriched Semantic Event Chains (eSECs) provide an interpretable relational description of manipulation, but remain primarily descriptive and do not directly support uncertainty-aware decision making. In this paper, we propose eSEC-LAM, a neuro-symbolic framework that transforms eSECs into an explicit event-level symbolic state for manipulation understanding. The proposed formulation augments classical eSECs with confidence-aware predicates, functional object roles, affordance priors, primitive-level abstraction, and saliency-guided explanation cues. These enriched symbolic states are derived from a foundation-model-based perception front-end through deterministic predicate extraction, while current-action inference and next-primitive prediction are performed using lightweight symbolic reasoning over primitive pre- and post-conditions. We evaluate the proposed framework on EPIC-KITCHENS-100, EPIC-KITCHENS VISOR, and Assembly101 across action recognition, next-primitive prediction, robustness to perception noise, and explanation consistency. Experimental results show that eSEC-LAM achieves competitive action recognition, substantially improves next-primitive prediction, remains more robust under degraded perceptual conditions than both classical symbolic and end-to-end video baselines, and provides temporally consistent explanation traces grounded in explicit relational evidence. These findings demonstrate that enriched Semantic Event Chains can serve not only as interpretable descriptors of manipulation, but also as effective internal states for neuro-symbolic action reasoning.</p>
  </div>
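  <div style="margin-top: 20px; padding: 15px; background: #ede7f6; border-radius: 8px;">
    <p style="margin: 5px 0; color: #555;">🧪 <strong>Editor's sketch (illustrative, not the paper's code):</strong> symbolic next-primitive prediction as precondition matching over the current predicate state; the predicates and the primitive library below are invented, and confidence-aware predicates, affordance priors, and saliency cues are omitted.</p>
    <pre style="background: #263238; color: #eceff1; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Toy next-primitive prediction from symbolic pre/post-conditions (all symbols are made up).
state = {"holding(hand,cup)", "near(cup,kettle)"}

primitives = {
    "pour":  {"pre": {"holding(hand,cup)", "near(cup,kettle)"}, "post": {"filled(cup)"}},
    "grasp": {"pre": {"free(hand)"},                            "post": {"holding(hand,cup)"}},
    "place": {"pre": {"holding(hand,cup)"},                     "post": {"on(cup,table)"}},
}

# candidates are primitives whose preconditions hold in the current state
candidates = [name for name, p in primitives.items() if p["pre"].issubset(state)]
# prefer the candidate whose preconditions are most strongly supported
best = max(candidates, key=lambda n: len(primitives[n]["pre"].intersection(state)))
print(candidates, best)
</code></pre>
  </div>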
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21053v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21053v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[控制与编辑] [模型架构] StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling</title>
      <link>https://arxiv.org/abs/2604.21052v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21052v1</guid>
      <pubDate>Wed, 22 Apr 2026 19:52:35 GMT</pubDate>
      <dc:creator>Liqi Jing, Dingming Zhang, Peinian Li et al.</dc:creator>
      <category>控制与编辑</category>
      <category>模型架构</category>
      <description>We build on the Visual Autoregressive Modeling (VAR) framework and formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">StyleVAR: Controllable Image Style Transfer via Visual Autoregressive Modeling</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Liqi Jing, Dingming Zhang, Peinian Li et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 控制与编辑, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> autoregressive model, style transfer, dit, vae, transformer</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">We build on the Visual Autoregressive Modeling (VAR) framework and formulate style transfer as conditional discrete sequence modeling in a learned latent space. Images are decomposed into multi-scale representations and tokenized into discrete codes by a VQ-VAE; a transformer then autoregressively models the distribution of target tokens conditioned on style and content tokens. To inject style and content information, we introduce a blended cross-attention mechanism in which the evolving target representation attends to its own history, while style and content features act as queries that decide which aspects of this history to emphasize. A scale-dependent blending coefficient controls the relative influence of style and content at each stage, encouraging the synthesized representation to align with both the content structure and the style texture without breaking the autoregressive continuity of VAR. We train StyleVAR in two stages from a pretrained VAR checkpoint: supervised fine-tuning on a large triplet dataset of content--style--target images, followed by reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) against a DreamSim-based perceptual reward, with per-action normalization weighting to rebalance credit across VAR&apos;s multi-scale hierarchy. Across three benchmarks spanning in-, near-, and out-of-distribution regimes, StyleVAR consistently outperforms an AdaIN baseline on Style Loss, Content Loss, LPIPS, SSIM, DreamSim, and CLIP similarity, and the GRPO stage yields further gains over the SFT checkpoint, most notably on the reward-aligned perceptual metrics. Qualitatively, the method transfers texture while maintaining semantic structure, especially for landscapes and architectural scenes, while a generalization gap on internet images and difficulty with human faces highlight the need for better content diversity and stronger structural priors.</p>
  </div>
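  
  <div style="background: white; padding: 20px; border-left: 4px solid #9C27B0; margin: 20px 0;">
    <h3 style="color: #9C27B0; margin-top: 0;">💻 概念示意（编者注）</h3>
    <p style="line-height: 1.8; color: #333;">A minimal PyTorch sketch of the blended cross-attention idea described above, added by the editor for illustration: style and content features act as queries over the target-token history, and a scale-dependent coefficient blends the two pathways. The function name and the convex-combination form of the blend are assumptions, not the authors&apos; implementation.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code>import torch
import torch.nn.functional as F

def blended_cross_attention(history, style, content, alpha):
    """history: (B, T, D) target tokens generated so far;
    style, content: (B, S, D) conditioning features;
    alpha: scalar in [0, 1]; larger values emphasize style at this scale."""
    # Style/content act as queries; the token history supplies keys and values.
    attn_style = F.scaled_dot_product_attention(style, history, history)
    attn_content = F.scaled_dot_product_attention(content, history, history)
    # Scale-dependent blend (editor's assumption: a convex combination;
    # the abstract only states that the coefficient varies per scale).
    return alpha * attn_style + (1.0 - alpha) * attn_content   # (B, S, D)

# Toy usage: a coarse scale might use a small alpha (content structure),
# a fine scale a large alpha (style texture).
B, T, S, D = 2, 16, 4, 64
out = blended_cross_attention(torch.randn(B, T, D), torch.randn(B, S, D),
                              torch.randn(B, S, D), alpha=0.7)
</code></pre>
  </div>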
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21052v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21052v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 控制与编辑, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [图像生成] Projected Gradient Unlearning for Text-to-Image Diffusion Models: Defending Against Concept Revival Attacks</title>
      <link>https://arxiv.org/abs/2604.21041v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21041v1</guid>
      <pubDate>Wed, 22 Apr 2026 19:39:56 GMT</pubDate>
      <dc:creator>Aljalila Aladawi, Mohammed Talha Alam, Fakhri Karray</dc:creator>
      <category>扩散模型</category>
      <category>图像生成</category>
      <description>Machine unlearning for text-to-image diffusion models aims to selectively remove undesirable concepts from pre-trained models without costly retraining. Current unlearning methods share a common weakn...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Projected Gradient Unlearning for Text-to-Image Diffusion Models: Defending Against Concept Revival Attacks</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Aljalila Aladawi, Mohammed Talha Alam, Fakhri Karray</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> text-to-image, diffusion model, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Machine unlearning for text-to-image diffusion models aims to selectively remove undesirable concepts from pre-trained models without costly retraining. Current unlearning methods share a common weakness: erased concepts return when the model is fine-tuned on downstream data, even when that data is entirely unrelated. We adapt Projected Gradient Unlearning (PGU) from classification to the diffusion domain as a post-hoc hardening step. By constructing a Core Gradient Space (CGS) from the retain concept activations and projecting gradient updates into its orthogonal complement, PGU ensures that subsequent fine-tuning cannot undo the achieved erasure. Applied on top of existing methods (ESD, UCE, Receler), the approach eliminates revival for style concepts and substantially delays it for object concepts, running in roughly 6 minutes versus the ~2 hours required by Meta-Unlearning. PGU and Meta-Unlearning turn out to be complementary: which performs better depends on how the concept is encoded, and retain concept selection should follow visual feature similarity rather than semantic grouping.</p>
  </div>
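  
  <div style="background: white; padding: 20px; border-left: 4px solid #9C27B0; margin: 20px 0;">
    <h3 style="color: #9C27B0; margin-top: 0;">💻 概念示意（编者注）</h3>
    <p style="line-height: 1.8; color: #333;">The projection step the abstract describes has a standard linear-algebra form, sketched below by the editor (function names and the rank choice are illustrative, not the authors&apos; code): take the top singular directions of the retain-concept activation matrix as the Core Gradient Space (CGS), then strip that component from every gradient update so fine-tuning cannot move along it.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code>import torch

def core_gradient_space(retain_acts, rank):
    """retain_acts: (N, D) activations collected on retain concepts.
    Returns an orthonormal basis U of shape (D, rank) spanning the CGS."""
    _, _, Vh = torch.linalg.svd(retain_acts, full_matrices=False)
    return Vh[:rank].T

def project_out(grad, U):
    """g_perp = g - U (U^T g): the update restricted to the orthogonal
    complement of the core space."""
    return grad - U @ (U.T @ grad)

retain_acts = torch.randn(512, 128)      # toy retain-concept activations
U = core_gradient_space(retain_acts, rank=16)
g = torch.randn(128)                     # toy gradient of one layer
g_safe = project_out(g, U)
# The projected update has no component inside the CGS.
assert torch.allclose(U.T @ g_safe, torch.zeros(16), atol=1e-5)
</code></pre>
  </div>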
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21041v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21041v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] A Deep U-Net Framework for Flood Hazard Mapping Using Hydraulic Simulations of the Wupper Catchment</title>
      <link>https://arxiv.org/abs/2604.21028v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21028v1</guid>
      <pubDate>Wed, 22 Apr 2026 19:19:59 GMT</pubDate>
      <dc:creator>Christian Lammers, Fernando Arévalo, Leonie Märker-Neuhaus et al.</dc:creator>
      <category>模型架构</category>
      <description>The increasing frequency and severity of global flood events highlights the need for the development of rapid and reliable flood prediction tools. This process traditionally relies on computationally ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">A Deep U-Net Framework for Flood Hazard Mapping Using Hydraulic Simulations of the Wupper Catchment</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Christian Lammers, Fernando Arévalo, Leonie Märker-Neuhaus et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> u-net, dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">The increasing frequency and severity of global flood events highlights the need for the development of rapid and reliable flood prediction tools. This process traditionally relies on computationally expensive hydraulic simulations. This research presents a prediction tool by developing a deep-learning based surrogate model to accurately and efficiently predict the maximum water level across a grid. This was achieved by conducting a series of experiments to optimize a U-Net architecture, patch generation, and data handling for approximating a hydraulic model. This research demonstrates that a deep learning surrogate model can serve as a computationally efficient alternative to traditional hydraulic simulations. The framework was tested using hydraulic simulations of the Wupper catchment in the North-Rhein Westphalia region (Germany), obtaining comparable results.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21028v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21028v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [图像生成] [模型架构] Linear Image Generation by Synthesizing Exposure Brackets</title>
      <link>https://arxiv.org/abs/2604.21008v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.21008v1</guid>
      <pubDate>Wed, 22 Apr 2026 18:55:35 GMT</pubDate>
      <dc:creator>Yuekun Dai, Zhoutong Zhang, Shangchen Zhou et al.</dc:creator>
      <category>扩散模型</category>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>The life of a photo begins with photons striking the sensor, whose signals are passed through a sophisticated image signal processing (ISP) pipeline to produce a display-referred image. However, such ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Linear Image Generation by Synthesizing Exposure Brackets</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yuekun Dai, Zhoutong Zhang, Shangchen Zhou et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> image editing, image generation, diffusion model, latent diffusion, dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">The life of a photo begins with photons striking the sensor, whose signals are passed through a sophisticated image signal processing (ISP) pipeline to produce a display-referred image. However, such images are no longer faithful to the incident light, being compressed in dynamic range and stylized by subjective preferences. In contrast, RAW images record direct sensor signals before non-linear tone mapping. After camera response curve correction and demosaicing, they can be converted into linear images, which are scene-referred representations that directly reflect true irradiance and are invariant to sensor-specific factors. Since image sensors have better dynamic range and bit depth, linear images contain richer information than display-referred ones, leaving users more room for editing during post-processing. Despite this advantage, current generative models mainly synthesize display-referred images, which inherently limits downstream editing. In this paper, we address the task of text-to-linear-image generation: synthesizing a high-quality, scene-referred linear image that preserves full dynamic range, conditioned on a text prompt, for professional post-processing. Generating linear images is challenging, as pre-trained VAEs in latent diffusion models struggle to simultaneously preserve extreme highlights and shadows due to the higher dynamic range and bit depth. To this end, we represent a linear image as a sequence of exposure brackets, each capturing a specific portion of the dynamic range, and propose a DiT-based flow-matching architecture for text-conditioned exposure bracket generation. We further demonstrate downstream applications including text-guided linear image editing and structure-conditioned generation via ControlNet.</p>
  </div>
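  
  <div style="background: white; padding: 20px; border-left: 4px solid #9C27B0; margin: 20px 0;">
    <h3 style="color: #9C27B0; margin-top: 0;">💻 概念示意（编者注）</h3>
    <p style="line-height: 1.8; color: #333;">An editor&apos;s sketch of the exposure-bracket representation the abstract describes, under simplifying assumptions (a global gamma tone curve, fixed EV stops, and a naive validity-weighted merge; the paper&apos;s actual bracket definition and decoder are not specified here): each bracket clips a different portion of the dynamic range, and the linear image is recoverable from the unclipped pixels.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code>import numpy as np

def to_brackets(linear, stops=(-4, -2, 0, 2, 4), gamma=2.2):
    """linear: float array of scene-referred irradiance (non-negative).
    Returns one display-referred [0, 1] image per exposure value."""
    brackets = []
    for ev in stops:
        exposed = linear * (2.0 ** ev)             # simulate exposure gain
        clipped = np.clip(exposed, 0.0, 1.0)       # sensor/display clipping
        brackets.append(clipped ** (1.0 / gamma))  # simple global tone curve
    return brackets

def merge_brackets(brackets, stops=(-4, -2, 0, 2, 4), gamma=2.2):
    """Naive merge: undo the tone curve and exposure, then average the
    estimates over pixels that were not clipped in each bracket."""
    est, w = 0.0, 0.0
    for img, ev in zip(brackets, stops):
        lin = (img ** gamma) / (2.0 ** ev)         # back to linear irradiance
        valid = (img > 0.01) &amp; (img &lt; 0.99)  # ignore clipped pixels
        est, w = est + lin * valid, w + valid
    return est / np.maximum(w, 1)

hdr = np.random.rand(64, 64) * 50.0      # toy high-dynamic-range frame
rec = merge_brackets(to_brackets(hdr))   # matches hdr where unclipped
</code></pre>
  </div>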
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.21008v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.21008v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] [模型架构] Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry</title>
      <link>https://arxiv.org/abs/2604.20983v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20983v1</guid>
      <pubDate>Wed, 22 Apr 2026 18:12:07 GMT</pubDate>
      <dc:creator>Syed Nazmus Sakib, Nafiul Haque, Shahrear Bin Amin et al.</dc:creator>
      <category>多模态生成</category>
      <category>模型架构</category>
      <description>Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botan...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Syed Nazmus Sakib, Nafiul Haque, Shahrear Bin Amin et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.</p>
  </div>
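  
  <div style="background: white; padding: 20px; border-left: 4px solid #9C27B0; margin: 20px 0;">
    <h3 style="color: #9C27B0; margin-top: 0;">💻 概念示意（编者注）</h3>
    <p style="line-height: 1.8; color: #333;">For concreteness, an editor&apos;s sketch of how a Chain of Inquiry record could be represented: an ordered question-answer trajectory in which each step carries its grounded visual cue and explicit epistemic intent. All field names here are hypothetical and do not reflect the released dataset&apos;s schema.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code>from dataclasses import dataclass, field

@dataclass
class InquiryStep:
    question: str
    answer: str
    intent: str        # e.g. "identify_symptom", "assess_severity"
    visual_cue: tuple  # normalized grounding box (x0, y0, x1, y1)

@dataclass
class ChainOfInquiry:
    image_path: str
    species: str
    severity: str
    steps: list = field(default_factory=list)

    def add(self, question, answer, intent, box):
        # Later questions can condition on everything asked so far.
        self.steps.append(InquiryStep(question, answer, intent, box))

chain = ChainOfInquiry("leaf_0001.jpg", "tomato", "moderate")
chain.add("What lesions are visible?", "Brown concentric rings.",
          "identify_symptom", (0.2, 0.3, 0.6, 0.7))
</code></pre>
  </div>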
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20983v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20983v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] PanGuide3D: Cohort-Robust Pancreas Tumor Segmentation via Probabilistic Pancreas Conditioning and a Transformer Bottleneck</title>
      <link>https://arxiv.org/abs/2604.20981v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20981v1</guid>
      <pubDate>Wed, 22 Apr 2026 18:10:01 GMT</pubDate>
      <dc:creator>Sunny Joy Ma, Xiang Ma</dc:creator>
      <category>模型架构</category>
      <description>Pancreatic tumor segmentation in contrast-enhanced computed tomography (CT) is clinically important yet technically challenging: lesions are often small, heterogeneous, and easily confused with surrou...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">PanGuide3D: Cohort-Robust Pancreas Tumor Segmentation via Probabilistic Pancreas Conditioning and a Transformer Bottleneck</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Sunny Joy Ma, Xiang Ma</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> u-net, dit, transformer</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Pancreatic tumor segmentation in contrast-enhanced computed tomography (CT) is clinically important yet technically challenging: lesions are often small, heterogeneous, and easily confused with surrounding soft tissue, and models that perform well on one cohort frequently degrade under cohort shift. Our goal is to improve cross-cohort generalization while keeping the model architecture simple, efficient, and practical for 3D CT segmentation. We introduce PanGuide3D, a cohort-robust architecture with a shared 3D encoder, a pancreas decoder that predicts a probabilistic pancreas map, and a tumor decoder that is explicitly conditioned on this pancreas probability at multiple scales via differentiable soft gating. To capture long-range context under distribution shift, we further add a lightweight Transformer bottleneck in the U-Net bottleneck representation. We evaluate cohort transfer by training on the PanTS (Pancreatic Tumor Segmentation) cohort and testing both in-cohort (PanTS) and out-of-cohort on MSD (Medical Segmentation Decathlon) Task07 Pancreas, using matched preprocessing and training protocols across strong baselines. We collect voxel-level segmentation metrics, patient-level tumor detection, subgroup analyses by tumor size and anatomical location, volume-conditioned performance analyses, and calibration measurements to assess reliability. Across the evaluated models, PanGuide3D achieves the best overall tumor performance and shows improved cross-cohort generalization, particularly for small tumors and challenging anatomical locations, while reducing anatomically implausible false positives. These findings support probabilistic anatomical conditioning as a practical strategy for improving cross-cohort robustness in an end-to-end model and suggest potential utility for contouring support, treatment planning, and multi-institutional studies.</p>
  </div>
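  
  <div style="background: white; padding: 20px; border-left: 4px solid #9C27B0; margin: 20px 0;">
    <h3 style="color: #9C27B0; margin-top: 0;">💻 概念示意（编者注）</h3>
    <p style="line-height: 1.8; color: #333;">The multi-scale probabilistic conditioning described above reduces to a simple differentiable gate; the sketch below is the editor&apos;s illustration (the gate floor and interpolation mode are assumptions): the pancreas probability map is resized to each tumor-decoder scale and softly modulates the features, rather than hard-masking them.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code>import torch
import torch.nn.functional as F

def soft_gate(tumor_feat, pancreas_prob, floor=0.1):
    """tumor_feat: (B, C, D, H, W) decoder features at one scale;
    pancreas_prob: (B, 1, D0, H0, W0) probabilities in [0, 1].
    floor keeps a small response outside the pancreas so gradients flow."""
    prob = F.interpolate(pancreas_prob, size=tumor_feat.shape[2:],
                         mode="trilinear", align_corners=False)
    gate = floor + (1.0 - floor) * prob     # differentiable soft mask
    return tumor_feat * gate

feat = torch.randn(1, 32, 16, 32, 32)       # one tumor-decoder scale
prob = torch.rand(1, 1, 64, 128, 128)       # full-resolution pancreas map
gated = soft_gate(feat, prob)               # same shape as feat
</code></pre>
  </div>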
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20981v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20981v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation</title>
      <link>https://arxiv.org/abs/2604.20841v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20841v1</guid>
      <pubDate>Wed, 22 Apr 2026 17:59:55 GMT</pubDate>
      <dc:creator>Hyeonwoo Kim, Jeonghwan Kim, Kyungwon Cho et al.</dc:creator>
      <category>模型架构</category>
      <description>Recent advances in video generative models enable the synthesis of realistic human-object interaction videos across a wide range of scenarios and object categories, including complex dexterous manipul...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Hyeonwoo Kim, Jeonghwan Kim, Kyungwon Cho et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Recent advances in video generative models enable the synthesis of realistic human-object interaction videos across a wide range of scenarios and object categories, including complex dexterous manipulations that are difficult to capture with motion capture systems. While the rich interaction knowledge embedded in these synthetic videos holds strong potential for motion planning in dexterous robotic manipulation, their limited physical fidelity and purely 2D nature make them difficult to use directly as imitation targets in physics-based character control. We present DeVI (Dexterous Video Imitation), a novel framework that leverages text-conditioned synthetic videos to enable physically plausible dexterous agent control for interacting with unseen target objects. To overcome the imprecision of generative 2D cues, we introduce a hybrid tracking reward that integrates 3D human tracking with robust 2D object tracking. Unlike methods relying on high-quality 3D kinematic demonstrations, DeVI requires only the generated video, enabling zero-shot generalization across diverse objects and interaction types. Extensive experiments demonstrate that DeVI outperforms existing approaches that imitate 3D human-object interaction demonstrations, particularly in modeling dexterous hand-object interactions. We further validate the effectiveness of DeVI in multi-object scenes and text-driven action diversity, showcasing the advantage of using video as an HOI-aware motion planner.</p>
  </div>
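  
  <div style="background: white; padding: 20px; border-left: 4px solid #9C27B0; margin: 20px 0;">
    <h3 style="color: #9C27B0; margin-top: 0;">💻 概念示意（编者注）</h3>
    <p style="line-height: 1.8; color: #333;">A rough editor&apos;s sketch of a hybrid tracking reward in the spirit of the abstract. The specific distance terms, weights, and exponential shaping are all assumptions; the point is only that a 3D human-tracking error and a 2D object-tracking error are combined into one scalar reward.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code>import numpy as np

def hybrid_reward(pose_sim, pose_ref, obj_sim, obj_ref,
                  w_human=0.6, w_obj=0.4, k=10.0):
    """pose_*: (J, 3) 3D joint positions from simulation vs. reference;
    obj_*: (2,) object centroid in normalized image coordinates.
    Exponential shaping maps each error to (0, 1]."""
    e_human = np.linalg.norm(pose_sim - pose_ref, axis=-1).mean()
    e_obj = np.linalg.norm(obj_sim - obj_ref)
    return w_human * np.exp(-k * e_human) + w_obj * np.exp(-k * e_obj)

r = hybrid_reward(np.random.rand(22, 3), np.random.rand(22, 3),
                  np.array([0.50, 0.50]), np.array([0.52, 0.48]))
</code></pre>
  </div>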
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20841v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20841v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels</title>
      <link>https://arxiv.org/abs/2604.20825v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20825v1</guid>
      <pubDate>Wed, 22 Apr 2026 17:49:20 GMT</pubDate>
      <dc:creator>Sina Gholami, Abdulmoneam Ali, Tania Haghighi et al.</dc:creator>
      <category>扩散模型</category>
      <description>Federated learning (FL) enables collaborative model training without sharing raw data; however, the presence of noisy labels across distributed clients can severely degrade the learning performance. I...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">FedSIR: Spectral Client Identification and Relabeling for Federated Learning with Noisy Labels</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Sina Gholami, Abdulmoneam Ali, Tania Haghighi et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> distillation</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Federated learning (FL) enables collaborative model training without sharing raw data; however, the presence of noisy labels across distributed clients can severely degrade the learning performance. In this paper, we propose FedSIR, a multi-stage framework for robust FL under noisy labels. Different from existing approaches that mainly rely on designing noise-tolerant loss functions or exploiting loss dynamics during training, our method leverages the spectral structure of client feature representations to identify and mitigate label noise.   Our framework consists of three key components. First, we identify clean and noisy clients by analyzing the spectral consistency of class-wise feature subspaces with minimal communication overhead. Second, clean clients provide spectral references that enable noisy clients to relabel potentially corrupted samples using both dominant class directions and residual subspaces. Third, we employ a noise-aware training strategy that integrates logit-adjusted loss, knowledge distillation, and distance-aware aggregation to further stabilize federated optimization. Extensive experiments on standard FL benchmarks demonstrate that FedSIR consistently outperforms state-of-the-art methods for FL with noisy labels. The code is available at https://github.com/sinagh72/FedSIR.</p>
  </div>
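  
  <div style="background: white; padding: 20px; border-left: 4px solid #9C27B0; margin: 20px 0;">
    <h3 style="color: #9C27B0; margin-top: 0;">💻 概念示意（编者注）</h3>
    <p style="line-height: 1.8; color: #333;">One standard way to realize the &quot;spectral consistency of class-wise feature subspaces&quot; mentioned above is via principal angles between subspaces; the sketch below is the editor&apos;s illustration, not the released implementation, and the rank and scoring rule are assumptions.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code>import numpy as np

def class_subspace(features, rank=8):
    """features: (N, D) embeddings of one class on one client."""
    _, _, vt = np.linalg.svd(features - features.mean(0), full_matrices=False)
    return vt[:rank].T                       # (D, rank) orthonormal basis

def spectral_consistency(U_client, U_ref):
    """Mean squared cosine of the principal angles between two subspaces;
    equals 1.0 for identical subspaces and drops toward rank/D for
    unrelated ones."""
    s = np.linalg.svd(U_client.T @ U_ref, compute_uv=False)
    return float((s ** 2).mean())

U_a = class_subspace(np.random.randn(200, 64))
U_b = class_subspace(np.random.randn(200, 64))
score = spectral_consistency(U_a, U_b)       # low: likely inconsistent client
</code></pre>
  </div>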
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20825v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20825v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Global Offshore Wind Infrastructure: Deployment and Operational Dynamics from Dense Sentinel-1 Time Series</title>
      <link>https://arxiv.org/abs/2604.20822v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20822v1</guid>
      <pubDate>Wed, 22 Apr 2026 17:47:33 GMT</pubDate>
      <dc:creator>Thorsten Hoeser, Felix Bachofer, Claudia Kuenzer</dc:creator>
      <category>模型架构</category>
      <description>The offshore wind energy sector is expanding rapidly, increasing the need for independent, high-temporal-resolution monitoring of infrastructure deployment and operation at global scale. While Earth O...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Global Offshore Wind Infrastructure: Deployment and Operational Dynamics from Dense Sentinel-1 Time Series</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Thorsten Hoeser, Felix Bachofer, Claudia Kuenzer</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">The offshore wind energy sector is expanding rapidly, increasing the need for independent, high-temporal-resolution monitoring of infrastructure deployment and operation at global scale. While Earth Observation based offshore wind infrastructure mapping has matured for spatial localization, existing open datasets lack temporally dense and semantically fine-grained information on construction and operational dynamics. We introduce a global Sentinel-1 synthetic aperture radar (SAR) time series data corpus that resolves deployment and operational phases of offshore wind infrastructure from 2016Q1 to 2025Q1. Building on an updated object detection workflow, we compile 15,606 time series at detected infrastructure locations, with overall 14,840,637 events as analysis-ready 1D SAR backscatter profiles, one profile per Sentinel-1 acquisition and location. To enable direct use and benchmarking, we release (i) the analysis ready 1D SAR profiles, (ii) event-level baseline semantic labels generated by a rule-based classifier, and (iii) an expert-annotated benchmark dataset of 553 time series with 328,657 event labels. The baseline classifier achieves a macro F1 score of 0.84 in event-wise evaluation and an area under the collapsed edit similarity-quality threshold curve (AUC) of 0.785, indicating temporal coherence. We demonstrate that the resulting corpus supports global-scale analyses of deployment dynamics, the identification of differences in regional deployment patterns, vessel interactions, and operational events, and provides a reference for developing and comparing time series classification methods for offshore wind infrastructure monitoring.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20822v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20822v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[评估与优化] [扩散模型] [图像生成] [模型架构] ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control</title>
      <link>https://arxiv.org/abs/2604.20816v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20816v1</guid>
      <pubDate>Wed, 22 Apr 2026 17:44:56 GMT</pubDate>
      <dc:creator>Shelly Golan, Michael Finkelson, Ariel Bereslavsky et al.</dc:creator>
      <category>评估与优化</category>
      <category>扩散模型</category>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter,...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">ParetoSlider: Diffusion Models Post-Training for Continuous Reward Control</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Shelly Golan, Michael Finkelson, Ariel Bereslavsky et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 评估与优化, 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> image editing, diffusion model, dit, diffusion, human preference</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Reinforcement Learning (RL) post-training has become the standard for aligning generative models with human preferences, yet most methods rely on a single scalar reward. When multiple criteria matter, the prevailing practice of ``early scalarization&apos;&apos; collapses rewards into a fixed weighted sum. This commits the model to a single trade-off point at training time, providing no inference-time control over inherently conflicting goals -- such as prompt adherence versus source fidelity in image editing. We introduce ParetoSlider, a multi-objective RL (MORL) framework that trains a single diffusion model to approximate the entire Pareto front. By training the model with continuously varying preference weights as a conditioning signal, we enable users to navigate optimal trade-offs at inference time without retraining or maintaining multiple checkpoints. We evaluate ParetoSlider across three state-of-the-art flow-matching backbones: SD3.5, FluxKontext, and LTX-2. Our single preference-conditioned model matches or exceeds the performance of baselines trained separately for fixed reward trade-offs, while uniquely providing fine-grained control over competing generative goals.</p>
  </div>
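  
  <div style="background: white; padding: 20px; border-left: 4px solid #9C27B0; margin: 20px 0;">
    <h3 style="color: #9C27B0; margin-top: 0;">💻 概念示意（编者注）</h3>
    <p style="line-height: 1.8; color: #333;">The preference-conditioning recipe the abstract describes can be summarized in a few lines; this editor&apos;s sketch assumes Dirichlet-sampled weights and a linear scalarization, and the model.generate call is a hypothetical stand-in for a conditioned diffusion rollout.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code>import numpy as np

def sample_preference(n_objectives, rng):
    # Dirichlet draws cover the simplex of trade-off weights.
    return rng.dirichlet(np.ones(n_objectives))

def scalarize(rewards, w):
    """rewards: (n_objectives,) per-sample scores; w: matching weights."""
    return float(np.dot(w, rewards))

rng = np.random.default_rng(0)
for step in range(3):
    w = sample_preference(2, rng)           # e.g. [adherence, fidelity]
    # sample = model.generate(prompt, preference=w)  # hypothetical call:
    # the same weights condition the model and weight its reward.
    rewards = rng.random(2)                 # stand-in for the real metrics
    r = scalarize(rewards, w)               # scalar used for the RL update
</code></pre>
  </div>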
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20816v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20816v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 评估与优化, 扩散模型 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning</title>
      <link>https://arxiv.org/abs/2604.20813v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20813v1</guid>
      <pubDate>Wed, 22 Apr 2026 17:43:50 GMT</pubDate>
      <dc:creator>Yonatan Haile Medhanie, Yuanhua Ni</dc:creator>
      <category>模型架构</category>
      <description>Transformer-based OCR models have shown strong performance on Latin and CJK scripts, but their application to African syllabic writing systems remains limited. We present the first adaptation of TrOCR...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yonatan Haile Medhanie, Yuanhua Ni</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> transformer</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Transformer-based OCR models have shown strong performance on Latin and CJK scripts, but their application to African syllabic writing systems remains limited. We present the first adaptation of TrOCR for printed Tigrinya using the Ge&apos;ez script. Starting from a pre-trained model, we extend the byte-level BPE tokenizer to cover 230 Ge&apos;ez characters and introduce Word-Aware Loss Weighting to resolve systematic word-boundary failures that arise when applying Latin-centric BPE conventions to a new script. The unmodified model produces no usable output on Ge&apos;ez text. After adaptation, the TrOCR-Printed variant achieves 0.22% Character Error Rate and 97.20% exact match accuracy on a held-out test set of 5,000 synthetic images from the GLOCR dataset. An ablation study confirms that Word-Aware Loss Weighting is the critical component, reducing CER by two orders of magnitude compared to vocabulary extension alone. The full pipeline trains in under three hours on a single 8 GB consumer GPU. All code, model weights, and evaluation scripts are publicly released.</p>
  </div>
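  
  <div style="background: white; padding: 20px; border-left: 4px solid #9C27B0; margin: 20px 0;">
    <h3 style="color: #9C27B0; margin-top: 0;">💻 概念示意（编者注）</h3>
    <p style="line-height: 1.8; color: #333;">The abstract does not spell out the Word-Aware Loss Weighting rule; the editor&apos;s sketch below shows one plausible form, upweighting the cross-entropy of tokens that open a new word, with the boundary mask and weight value as assumptions.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code>import torch
import torch.nn.functional as F

def word_aware_loss(logits, targets, starts_word, boundary_weight=2.0):
    """logits: (T, V); targets: (T,); starts_word: (T,) bool mask marking
    tokens that begin a word in the reference transcription."""
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.where(starts_word,
                          torch.full_like(per_token, boundary_weight),
                          torch.ones_like(per_token))
    # Weighted mean: boundary mistakes cost more than in-word mistakes.
    return (weights * per_token).sum() / weights.sum()

T, V = 12, 300
loss = word_aware_loss(torch.randn(T, V), torch.randint(0, V, (T,)),
                       torch.rand(T) &lt; 0.3)   # toy boundary mask
</code></pre>
  </div>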
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20813v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20813v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model</title>
      <link>https://arxiv.org/abs/2604.20806v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20806v1</guid>
      <pubDate>Wed, 22 Apr 2026 17:37:40 GMT</pubDate>
      <dc:creator>Qiguang Chen, Chengyu Luan, Jiajun Wu et al.</dc:creator>
      <category>多模态生成</category>
      <description>Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models ofte...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Qiguang Chen, Chengyu Luan, Jiajun Wu et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vlm, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resources for studying and improving multi-image reasoning in LVLMs.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20806v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20806v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [模型架构] LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image</title>
      <link>https://arxiv.org/abs/2604.20800v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20800v1</guid>
      <pubDate>Wed, 22 Apr 2026 17:27:13 GMT</pubDate>
      <dc:creator>Dimitrije Antić, Alvaro Budria, George Paschalidis et al.</dc:creator>
      <category>扩散模型</category>
      <category>模型架构</category>
      <description>Reconstructing 3D Human-Object Interaction from an RGB image is essential for perceptive systems. Yet, this remains challenging as it requires capturing the subtle physical coupling between the body a...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">LEXIS: LatEnt ProXimal Interaction Signatures for 3D HOI from an Image</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Dimitrije Antić, Alvaro Budria, George Paschalidis et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vae, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Reconstructing 3D Human-Object Interaction from an RGB image is essential for perceptive systems. Yet, this remains challenging as it requires capturing the subtle physical coupling between the body and objects. While current methods rely on sparse, binary contact cues, these fail to model the continuous proximity and dense spatial relationships that characterize natural interactions. We address this limitation via InterFields, a representation that encodes dense, continuous proximity across the entire body and object surfaces. However, inferring these fields from single images is inherently ill-posed. To tackle this, our intuition is that interaction patterns are characteristically structured by the action and object geometry. We capture this structure in LEXIS, a novel discrete manifold of interaction signatures learned via a VQ-VAE. We then develop LEXIS-Flow, a diffusion framework that leverages LEXIS signatures to estimate human and object meshes alongside their InterFields. Notably, these InterFields help in a guided refinement that ensures physically-plausible, proximity-aware reconstructions without requiring post-hoc optimization. Evaluation on Open3DHOI and BEHAVE shows that LEXIS-Flow significantly outperforms existing SotA baselines in reconstruction, contact, and proximity quality. Our approach not only improves generalization but also yields reconstructions perceived as more realistic, moving us closer to holistic 3D scene understanding. Code &amp; models will be public at https://anticdimi.github.io/lexis.</p>
  </div>
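  
  <div style="background: white; padding: 20px; border-left: 4px solid #9C27B0; margin: 20px 0;">
    <h3 style="color: #9C27B0; margin-top: 0;">💻 概念示意（编者注）</h3>
    <p style="line-height: 1.8; color: #333;">A simplified editor&apos;s sketch of a dense proximity field in the spirit of InterFields: per-body-vertex distance to the nearest object vertex, squashed to a continuous (0, 1] score. In the paper these fields are learned and predicted; computing them geometrically here, and the length scale tau, are assumptions for illustration.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code>import numpy as np

def proximity_field(body_verts, obj_verts, tau=0.05):
    """body_verts: (Nb, 3); obj_verts: (No, 3); tau: length scale in metres.
    Returns a per-body-vertex score in (0, 1], equal to 1 at contact."""
    # Pairwise distances, then the per-body-vertex minimum.
    d = np.linalg.norm(body_verts[:, None, :] - obj_verts[None, :, :], axis=-1)
    nearest = d.min(axis=1)
    return np.exp(-nearest / tau)            # continuous, dense proximity

body = np.random.rand(500, 3)                # toy body surface samples
obj = np.random.rand(300, 3) + np.array([0.5, 0.0, 0.0])
field = proximity_field(body, obj)           # (500,) proximity values
</code></pre>
  </div>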
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20800v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20800v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [模型架构] [图像生成] [多模态生成] LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model</title>
      <link>https://arxiv.org/abs/2604.20796v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20796v1</guid>
      <pubDate>Wed, 22 Apr 2026 17:20:42 GMT</pubDate>
      <dc:creator>Inclusion AI, Tiwei Bie, Haoxing Chen et al.</dc:creator>
      <category>扩散模型</category>
      <category>模型架构</category>
      <category>图像生成</category>
      <category>多模态生成</category>
      <description>We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combine...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Inclusion AI, Tiwei Bie, Haoxing Chen et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 模型架构, 图像生成, 多模态生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vlm, image generation, dit, diffusion, distillation</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Codes and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.</p>
  </div>
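  
  <div style="background: white; padding: 20px; border-left: 4px solid #9C27B0; margin: 20px 0;">
    <h3 style="color: #9C27B0; margin-top: 0;">💻 概念示意（编者注）</h3>
    <p style="line-height: 1.8; color: #333;">An editor&apos;s sketch of block-level masked diffusion over a mixed text/vision token sequence, as named in the abstract. The block size, mask rate, and mask token id are assumptions; a dLLM backbone would be trained to recover the masked tokens from the corrupted sequence.</p>
    <pre style="background: #f5f5f5; padding: 15px; border-radius: 8px; overflow-x: auto; line-height: 1.5;"><code>import torch

MASK_ID = 0                                  # assumed reserved mask token

def block_mask(tokens, block=8, t=0.5):
    """tokens: (B, L) discrete text+vision ids. Masks whole blocks with
    probability t; returns the corrupted sequence and the mask."""
    B, L = tokens.shape
    n_blocks = (L + block - 1) // block
    drop = torch.rand(B, n_blocks) &lt; t     # one coin flip per block
    mask = drop.repeat_interleave(block, dim=1)[:, :L]
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    return corrupted, mask

toks = torch.randint(1, 1000, (2, 30))
noisy, m = block_mask(toks, block=8, t=0.4)
# Training target: recover toks[m] given noisy (and any conditioning).
</code></pre>
  </div>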
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20796v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20796v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[视频生成] [扩散模型] GeoRect4D: Geometry-Compatible Generative Rectification for Dynamic Sparse-View 3D Reconstruction</title>
      <link>https://arxiv.org/abs/2604.20784v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20784v1</guid>
      <pubDate>Wed, 22 Apr 2026 17:12:02 GMT</pubDate>
      <dc:creator>Zhenlong Wu, Zihan Zheng, Xuanxuan Wang et al.</dc:creator>
      <category>视频生成</category>
      <category>扩散模型</category>
      <description>Reconstructing dynamic 3D scenes from sparse multi-view videos is highly ill-posed, often leading to geometric collapse, trajectory drift, and floating artifacts. Recent attempts introduce generative ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">GeoRect4D: Geometry-Compatible Generative Rectification for Dynamic Sparse-View 3D Reconstruction</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Zhenlong Wu, Zihan Zheng, Xuanxuan Wang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 视频生成, 扩散模型</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> temporal consistency, distillation, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Reconstructing dynamic 3D scenes from sparse multi-view videos is highly ill-posed, often leading to geometric collapse, trajectory drift, and floating artifacts. Recent attempts introduce generative priors to hallucinate missing content, yet naive integration frequently causes structural drift and temporal inconsistency due to the mismatch between stochastic 2D generation and deterministic 3D geometry. In this paper, we propose GeoRect4D, a novel unified framework for sparse-view dynamic reconstruction that couples explicit 3D consistency with generative refinement via a closed-loop optimization process. Specifically, GeoRect4D introduces a degradation-aware feedback mechanism that incorporates a robust anchor-based dynamic 3DGS substrate with a single-step diffusion rectifier to hallucinate high-fidelity details. This rectifier utilizes a structural locking mechanism and spatiotemporal coordinated attention, effectively preserving physical plausibility while restoring missing content. Furthermore, we present a progressive optimization strategy that employs stochastic geometric purification to eliminate floaters and generative distillation to infuse texture details into the explicit representation. Extensive experiments demonstrate that GeoRect4D achieves state-of-the-art performance in reconstruction fidelity, perceptual quality, and spatiotemporal consistency across multiple datasets.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20784v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20784v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 视频生成, 扩散模型 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Lifecycle-Aware Federated Continual Learning in Mobile Autonomous Systems</title>
      <link>https://arxiv.org/abs/2604.20745v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20745v1</guid>
      <pubDate>Wed, 22 Apr 2026 16:34:17 GMT</pubDate>
      <dc:creator>Beining Wu, Jun Huang</dc:creator>
      <category>模型架构</category>
      <description>Federated continual learning (FCL) allows distributed autonomous fleets to adapt collaboratively to evolving terrain types across extended mission lifecycles. However, current approaches face several ...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Lifecycle-Aware Federated Continual Learning in Mobile Autonomous Systems</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Beining Wu, Jun Huang</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Federated continual learning (FCL) allows distributed autonomous fleets to adapt collaboratively to evolving terrain types across extended mission lifecycles. However, current approaches face several key challenges: 1) they use uniform protection strategies that do not account for the varying sensitivities to forgetting on different network layers; 2) they focus primarily on preventing forgetting during training, without addressing the long-term effects of cumulative drift; and 3) they often depend on idealized simulations that fail to capture the real-world heterogeneity present in distributed fleets. In this paper, we propose a lifecycle-aware dual-timescale FCL framework that incorporates training-time (pre-forgetting) prevention and (post-forgetting) recovery. Under this framework, we design a layer-selective rehearsal strategy that mitigates immediate forgetting during local training, and a rapid knowledge recovery strategy that restores degraded models after long-term cumulative drift. We present a theoretical analysis that characterizes heterogeneous forgetting dynamics and establishes the inevitability of long-term degradation. Our experimental results show that this framework achieves up to 8.3\% mIoU improvement over the strongest federated baseline and up to 31.7\% over conventional fine-tuning. We also deploy the FCL framework on a real-world rover testbed to assess system-level robustness under realistic constraints; the testing results further confirm the effectiveness of our FCL design.</p>
  </div>
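  <div style="margin-top: 20px; padding: 15px; background: #fafafa; border-left: 4px solid #90a4ae; border-radius: 8px;">
    <h3 style="color: #607d8b; margin-top: 0;">💻 示意代码</h3>
    <p style="line-height: 1.6; color: #333;">A minimal sketch of the layer-selective rehearsal idea described in the abstract, under assumptions of our own: per-layer sensitivity to forgetting is proxied by gradient norms on a small rehearsal buffer, and only the top-k most sensitive layers receive the rehearsal gradient during local training. The helper names and the sensitivity proxy are illustrative, not the paper's implementation.</p>
    <pre style="background: #f5f5f5; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Minimal sketch (assumptions, not the paper's code): per-layer forgetting
# sensitivity is proxied by gradient norms on a small rehearsal buffer, and
# only the top-k most sensitive layers receive the rehearsal gradient.
import torch.nn.functional as F

def estimate_layer_sensitivity(model, rehearsal_buffer):
    """Gradient-norm proxy for how strongly each parameter tensor is tied to old tasks."""
    x_old, y_old = rehearsal_buffer          # a few cached (input, label) pairs
    model.zero_grad()
    F.cross_entropy(model(x_old), y_old).backward()
    scores = {n: p.grad.norm().item() for n, p in model.named_parameters()
              if p.grad is not None}
    model.zero_grad()
    return scores

def local_training_step(model, optimizer, batch, rehearsal_buffer, top_k=10):
    scores = estimate_layer_sensitivity(model, rehearsal_buffer)
    protected = set(sorted(scores, key=scores.get, reverse=True)[:top_k])

    x_old, y_old = rehearsal_buffer
    x_new, y_new = batch

    # rehearsal gradient, kept only for the protected (most sensitive) layers
    optimizer.zero_grad()
    F.cross_entropy(model(x_old), y_old).backward()
    rehearsal_grads = {n: p.grad.clone() for n, p in model.named_parameters()
                       if p.grad is not None and n in protected}

    # new-task gradient for every layer, plus rehearsal on the protected ones
    optimizer.zero_grad()
    F.cross_entropy(model(x_new), y_new).backward()
    for n, p in model.named_parameters():
        if n in rehearsal_grads and p.grad is not None:
            p.grad = p.grad + rehearsal_grads[n]
    optimizer.step()
</code></pre>
  </div>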
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20745v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20745v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback</title>
      <link>https://arxiv.org/abs/2604.20730v2</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20730v2</guid>
      <pubDate>Wed, 22 Apr 2026 16:15:28 GMT</pubDate>
      <dc:creator>Guotao Liang, Zhangcheng Wang, Juncheng Hu et al.</dc:creator>
      <category>模型架构</category>
      <description>Multimodal Large Language Models (MLLMs) have shown promising capabilities in generating Scalable Vector Graphics (SVG) via direct code synthesis. However, existing paradigms typically adopt an open-l...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Guotao Liang, Zhangcheng Wang, Juncheng Hu et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Multimodal Large Language Models (MLLMs) have shown promising capabilities in generating Scalable Vector Graphics (SVG) via direct code synthesis. However, existing paradigms typically adopt an open-loop &quot;blind drawing&quot; approach, where models generate symbolic code sequences without perceiving intermediate visual outcomes. This methodology severely underutilizes the powerful visual priors embedded in MLLMs vision encoders, treating SVG generation as a disjointed textual sequence modeling task rather than an integrated visuo-spatial one. Consequently, models struggle to reason about partial canvas states and implicit occlusion relationships, which are visually explicit but textually ambiguous. To bridge this gap, we propose Render-in-the-Loop, a novel generation paradigm that reformulates SVG synthesis as a step-wise, visual-context-aware process. By rendering intermediate code states into a cumulative canvas, the model explicitly observes the evolving visual context at each step, leveraging on-the-fly feedback to guide subsequent generation. However, we demonstrate that applying this visual loop naively to off-the-shelf models is suboptimal due to their inability to leverage incremental visual-code mappings. To address this, we first utilize fine-grained path decomposition to construct dense multi-step visual trajectories, and then introduce a Visual Self-Feedback (VSF) training strategy to condition the next primitive generation on intermediate visual states. Furthermore, a Render-and-Verify (RaV) inference mechanism is proposed to effectively filter degenerate and redundant primitives. Our framework, instantiated on a multimodal foundation model, outperforms strong open-weight baselines on the standard MMSVGBench. This result highlights the remarkable data efficiency and generalization capability of our Render-in-the-Loop paradigm for both Text-to-SVG and Image-to-SVG tasks.</p>
  </div>
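  <div style="margin-top: 20px; padding: 15px; background: #fafafa; border-left: 4px solid #90a4ae; border-radius: 8px;">
    <h3 style="color: #607d8b; margin-top: 0;">💻 示意代码</h3>
    <p style="line-height: 1.6; color: #333;">A minimal sketch of the render-in-the-loop cycle described above. The helpers mllm_next_primitive, render_svg and is_degenerate are hypothetical stand-ins for the MLLM call, an SVG rasterizer, and the Render-and-Verify filter; the Visual Self-Feedback training itself is not shown.</p>
    <pre style="background: #f5f5f5; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Minimal sketch of a render-in-the-loop generation cycle. All helper names
# (mllm_next_primitive, render_svg, is_degenerate) are hypothetical stand-ins,
# not APIs from the paper; they would wrap the MLLM call, an SVG rasterizer,
# and the Render-and-Verify filter respectively.
def generate_svg_primitives(prompt, mllm_next_primitive, render_svg, is_degenerate,
                            max_steps=64):
    primitives = []                                # accepted SVG primitives so far
    for _ in range(max_steps):
        canvas = render_svg(primitives)            # cumulative canvas the model "sees"
        candidate = mllm_next_primitive(prompt, canvas, primitives)
        if candidate is None:                      # model signals completion
            break
        # Render-and-Verify: re-render with the candidate and keep it only if
        # the verification step does not flag it as degenerate or redundant.
        new_canvas = render_svg(primitives + [candidate])
        if not is_degenerate(candidate, canvas, new_canvas):
            primitives.append(candidate)
    return primitives                              # caller wraps these in an svg element
</code></pre>
  </div>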
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20730v2" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20730v2" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [图像生成] [模型架构] GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers</title>
      <link>https://arxiv.org/abs/2604.20715v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20715v1</guid>
      <pubDate>Wed, 22 Apr 2026 16:01:04 GMT</pubDate>
      <dc:creator>Yuxuan Xue, Ruofan Liang, Egor Zakharov et al.</dc:creator>
      <category>扩散模型</category>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequent...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">GeoRelight: Learning Joint Geometrical Relighting and Reconstruction with Flexible Multi-Modal Diffusion Transformers</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Yuxuan Xue, Ruofan Liang, Egor Zakharov et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> diffusion model, latent diffusion, dit, diffusion, transformer</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Relighting a person from a single photo is an attractive but ill-posed task, as a 2D image ambiguously entangles 3D geometry, intrinsic appearance, and illumination. Current methods either use sequential pipelines that suffer from error accumulation, or they do not explicitly leverage 3D geometry during relighting, which limits physical consistency. Since relighting and estimation of 3D geometry are mutually beneficial tasks, we propose a unified Multi-Modal Diffusion Transformer (DiT) that jointly solves for both: GeoRelight. We make this possible through two key technical contributions: isotropic NDC-Orthographic Depth (iNOD), a distortion-free 3D representation compatible with latent diffusion models; and a strategic mixed-data training method that combines synthetic and auto-labeled real data. By solving geometry and relighting jointly, GeoRelight achieves better performance than both sequential models and previous systems that ignored geometry.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20715v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20715v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs</title>
      <link>https://arxiv.org/abs/2604.20696v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20696v1</guid>
      <pubDate>Wed, 22 Apr 2026 15:41:33 GMT</pubDate>
      <dc:creator>Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr et al.</dc:creator>
      <category>多模态生成</category>
      <description>Large vision-language models (LVLMs) have demonstrated impressive performance in various multimodal understanding and reasoning tasks. However, they still struggle with object hallucinations, i.e., th...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">R-CoV: Region-Aware Chain-of-Verification for Alleviating Object Hallucinations in LVLMs</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Jiahao Xie, Alessio Tonioni, Nathalie Rauschmayr et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vlm, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Large vision-language models (LVLMs) have demonstrated impressive performance in various multimodal understanding and reasoning tasks. However, they still struggle with object hallucinations, i.e., the claim of nonexistent objects in the visual input. To address this challenge, we propose Region-aware Chain-of-Verification (R-CoV), a visual chain-of-verification method to alleviate object hallucinations in LVLMs in a post-hoc manner. Motivated by how humans comprehend intricate visual information -- often focusing on specific image regions or details within a given sample -- we elicit such region-level processing from LVLMs themselves and use it as a chaining cue to detect and alleviate their own object hallucinations. Specifically, our R-CoV consists of six steps: initial response generation, entity extraction, coordinate generation, region description, verification execution, and final response generation. As a simple yet effective method, R-CoV can be seamlessly integrated into various LVLMs in a training-free manner and without relying on external detection models. Extensive experiments on several widely used hallucination benchmarks across multiple LVLMs demonstrate that R-CoV can significantly alleviate object hallucinations in LVLMs. Project page: https://github.com/Jiahao000/R-CoV.</p>
  </div>
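  <div style="margin-top: 20px; padding: 15px; background: #fafafa; border-left: 4px solid #90a4ae; border-radius: 8px;">
    <h3 style="color: #607d8b; margin-top: 0;">💻 示意代码</h3>
    <p style="line-height: 1.6; color: #333;">A minimal sketch of the six R-CoV steps listed in the abstract. The callable lvlm and the helpers parse_entities, parse_box and crop are hypothetical placeholders for prompting the same LVLM at each step; the authors' exact prompts are not reproduced.</p>
    <pre style="background: #f5f5f5; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Minimal sketch of the six R-CoV steps named in the abstract. `lvlm` is a
# hypothetical callable (images + prompt to text); `parse_entities`,
# `parse_box` and `crop` are hypothetical parsing/cropping helpers.
def r_cov(image, question, lvlm, parse_entities, parse_box, crop):
    # 1. initial response generation
    initial = lvlm([image], question)
    # 2. entity extraction: objects claimed in the initial answer
    entities = parse_entities(lvlm([image], f"List the objects mentioned in: {initial}"))
    verdicts = {}
    for entity in entities:
        # 3. coordinate generation: ask the model where the entity would be
        box = parse_box(lvlm([image], f"Give a bounding box for the {entity}."))
        # 4. region description: describe only that region
        region_text = lvlm([crop(image, box)], "Describe this image region.")
        # 5. verification execution: check the claim against the region description
        verdicts[entity] = lvlm([crop(image, box)],
            f"The region was described as: {region_text}. "
            f"Is a {entity} actually present here? Answer yes or no.")
    # 6. final response generation conditioned on the verification results
    return lvlm([image], f"Question: {question}\nDraft answer: {initial}\n"
                         f"Verification: {verdicts}\nGive a corrected final answer.")
</code></pre>
  </div>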
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20696v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20696v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm</title>
      <link>https://arxiv.org/abs/2604.20665v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20665v1</guid>
      <pubDate>Wed, 22 Apr 2026 15:15:32 GMT</pubDate>
      <dc:creator>Karan Goyal, Dikshant Kukreja</dc:creator>
      <category>多模态生成</category>
      <description>The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery but its foundation operates on a dangerous, unquestioned axiom: that...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Karan Goyal, Dikshant Kukreja</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vlm, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">The rapid proliferation of Vision-Language Models (VLMs) is widely celebrated as the dawn of unified multimodal knowledge discovery but its foundation operates on a dangerous, unquestioned axiom: that current VLMs faithfully synthesise multimodal data. We argue they do not. Instead, a profound crisis of trustworthiness underlies the dominant Vision Encoder-Projector-LLM paradigm. Rather than extracting grounded knowledge from visual inputs, state-of-the-art models frequently exhibit functional blindness, i.e., exploiting strong language priors to bypass severe visual representation bottlenecks. In this work, we challenge the conventional methodology of multimodal evaluation, which relies on data ablation or new dataset creation and therefore fatally conflates dataset biases with architectural incapacity. We propose a radical, information-theoretic departure: the Modality Translation Protocol, designed to quantifiably unmask the Expense of Seeing. By translating semantic payloads rather than ablating them, we formulate three novel metrics -- the Toll (ToS), Curse (CoS), and Fallacy (FoS) of Seeing -- culminating in the Semantic Sufficiency Criterion (SSC). Furthermore, we posit a provocative Divergence Law of Multimodal Scaling, hypothesising that as the underlying language engines scale to unprecedented reasoning capabilities, the mathematical penalty of the visual knowledge bottleneck paradoxically increases. We challenge the KDD community to abandon the illusory pursuit of &quot;multimodal gain&quot;. By elevating the SSC from a passive diagnostic constraint to an active architectural blueprint, we provide the rigorous, trustworthy foundation required to force the next generation of AI systems to truly see the data, achieving true multimodal reasoning.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20665v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20665v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking</title>
      <link>https://arxiv.org/abs/2604.20623v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20623v1</guid>
      <pubDate>Wed, 22 Apr 2026 14:38:41 GMT</pubDate>
      <dc:creator>Roie Kazoom, Yotam Gigi, George Leifman et al.</dc:creator>
      <category>模型架构</category>
      <description>Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Roie Kazoom, Yotam Gigi, George Leifman et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.</p>
  </div>
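  <div style="margin-top: 20px; padding: 15px; background: #fafafa; border-left: 4px solid #90a4ae; border-radius: 8px;">
    <h3 style="color: #607d8b; margin-top: 0;">💻 示意代码</h3>
    <p style="line-height: 1.6; color: #333;">A minimal sketch of a Best-of-N ranking stage used as a final ambiguity-resolution filter, under assumptions: propose_qa and score_qa stand in for the vision-language proposer and the retrieval-augmented judge, and the keep threshold is illustrative rather than from the paper.</p>
    <pre style="background: #f5f5f5; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Minimal sketch of a Best-of-N ranking stage for candidate curation.
# `propose_qa` and `score_qa` are hypothetical callables standing in for the
# paper's vision-language proposer and retrieval-augmented judge.
def best_of_n_filter(change_region, propose_qa, score_qa, n=8, keep_threshold=0.7):
    candidates = [propose_qa(change_region) for _ in range(n)]     # N candidate QA pairs
    scored = [(score_qa(change_region, qa), qa) for qa in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)            # rank best-first
    best_score, best_qa = scored[0]
    # ambiguity resolution: keep the region only if its best candidate is
    # confidently grounded; otherwise the noisy/ambiguous region is dropped.
    if best_score >= keep_threshold:
        return best_qa
    return None
</code></pre>
  </div>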
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20623v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20623v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Beyond ZOH: Advanced Discretization Strategies for Vision Mamba</title>
      <link>https://arxiv.org/abs/2604.20606v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20606v1</guid>
      <pubDate>Wed, 22 Apr 2026 14:20:59 GMT</pubDate>
      <dc:creator>Fady Ibrahim, Guangjun Liu, Guanghui Wang</dc:creator>
      <category>模型架构</category>
      <description>Vision Mamba, as a state space model (SSM), employs a zero-order hold (ZOH) discretization, which assumes that input signals remain constant between sampling instants. This assumption degrades tempora...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Beyond ZOH: Advanced Discretization Strategies for Vision Mamba</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Fady Ibrahim, Guangjun Liu, Guanghui Wang</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Vision Mamba, as a state space model (SSM), employs a zero-order hold (ZOH) discretization, which assumes that input signals remain constant between sampling instants. This assumption degrades temporal fidelity in dynamic visual environments and constrains the attainable accuracy of modern SSM-based vision models. In this paper, we present a systematic and controlled comparison of six discretization schemes instantiated within the Vision Mamba framework: ZOH, first-order hold (FOH), bilinear/Tustin transform (BIL), polynomial interpolation (POL), higher-order hold (HOH), and the fourth-order Runge-Kutta method (RK4). We evaluate each method on standard visual benchmarks to quantify its influence in image classification, semantic segmentation, and object detection. Our results demonstrate that POL and HOH yield the largest gains in accuracy at the cost of higher training-time computation. In contrast, the BIL provides consistent improvements over ZOH with modest additional overhead, offering the most favorable trade-off between precision and efficiency. These findings elucidate the pivotal role of discretization in SSM-based vision architectures and furnish empirically grounded justification for adopting BIL as the default discretization baseline for state-of-the-art SSM models.</p>
  </div>
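  <div style="margin-top: 20px; padding: 15px; background: #fafafa; border-left: 4px solid #90a4ae; border-radius: 8px;">
    <h3 style="color: #607d8b; margin-top: 0;">💻 示意代码</h3>
    <p style="line-height: 1.6; color: #333;">For reference, a minimal numpy sketch of two of the discretization rules compared above, ZOH and the bilinear (Tustin) transform, using their standard textbook formulas for the continuous system x'(t) = A x(t) + B u(t). This is not code from the paper, and in selective SSMs such as Mamba the step size is input-dependent rather than a fixed constant.</p>
    <pre style="background: #f5f5f5; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Minimal numpy sketch of the standard ZOH and bilinear (Tustin) discretizations
# of x'(t) = A x(t) + B u(t); textbook formulas, not code from the paper.
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, dt):
    """Zero-order hold: assumes u(t) is constant over each step of length dt."""
    n = A.shape[0]
    A_bar = expm(dt * A)
    B_bar = np.linalg.solve(A, (A_bar - np.eye(n)) @ B)   # A^{-1}(e^{dtA} - I)B
    return A_bar, B_bar

def bilinear_discretize(A, B, dt):
    """Bilinear (Tustin) transform: trapezoidal rule between sampling instants."""
    n = A.shape[0]
    M = np.eye(n) - (dt / 2.0) * A
    A_bar = np.linalg.solve(M, np.eye(n) + (dt / 2.0) * A)
    B_bar = np.linalg.solve(M, dt * B)
    return A_bar, B_bar

# usage on a toy 2-state SSM
A = np.array([[-1.0, 0.5], [0.0, -2.0]])
B = np.array([[1.0], [0.5]])
for name, (Ab, Bb) in {"ZOH": zoh_discretize(A, B, 0.1),
                       "Bilinear": bilinear_discretize(A, B, 0.1)}.items():
    print(name, np.round(Ab, 4).tolist(), np.round(Bb, 4).tolist())
</code></pre>
  </div>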
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20606v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20606v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[扩散模型] [图像生成] [模型架构] Physics-Informed Conditional Diffusion for Motion-Robust Retinal Temporal Laser Speckle Contrast Imaging</title>
      <link>https://arxiv.org/abs/2604.20594v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20594v1</guid>
      <pubDate>Wed, 22 Apr 2026 14:11:27 GMT</pubDate>
      <dc:creator>Qian Chen, Yuehao Chen, Qiang Wang et al.</dc:creator>
      <category>扩散模型</category>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Retinal laser speckle contrast imaging (LSCI) is a noninvasive optical modality for monitoring retinal blood flow dynamics. However, conventional temporal LSCI (tLSCI) reconstruction relies on suffici...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Physics-Informed Conditional Diffusion for Motion-Robust Retinal Temporal Laser Speckle Contrast Imaging</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Qian Chen, Yuehao Chen, Qiang Wang et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 扩散模型, 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> diffusion model, dit, diffusion</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Retinal laser speckle contrast imaging (LSCI) is a noninvasive optical modality for monitoring retinal blood flow dynamics. However, conventional temporal LSCI (tLSCI) reconstruction relies on sufficiently long speckle sequences to obtain stable temporal statistics, which makes it vulnerable to acquisition disturbances and limits effective temporal resolution. A physically informed reconstruction framework, termed RetinaDiff (Retinal Diffusion Model), is proposed for retinal tLSCI that is robust to motion and requires only a few frames. In RetinaDiff, registration based on phase correlation is first applied to stabilize the raw speckle sequence before contrast computation, reducing interframe misalignment so that fluctuations at each pixel primarily reflect true flow dynamics. This step provides a physics prior corrected for motion and a high quality multiframe tLSCI reference. Next, guided by the physics prior, a conditional diffusion model performs inverse reconstruction by jointly conditioning on the registered speckle sequence and the corrected prior. Experiments on data acquired with a retinal LSCI system developed in house show improved structural continuity and statistical stability compared with direct reconstruction from few frames and representative baselines. The framework also remains effective in a small number of extremely challenging cases, where both the direct 5-frame input and the conventional multiframe reconstruction are severely degraded. Overall, this work provides a practical and physically grounded route for reliable retinal tLSCI reconstruction from extremely limited frames. The source code and model weights will be publicly available at https://github.com/QianChen113/RetinaDiff.</p>
  </div>
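  <div style="margin-top: 20px; padding: 15px; background: #fafafa; border-left: 4px solid #90a4ae; border-radius: 8px;">
    <h3 style="color: #607d8b; margin-top: 0;">💻 示意代码</h3>
    <p style="line-height: 1.6; color: #333;">A minimal numpy sketch of the two standard operations behind the first stage described above: phase-correlation registration of the raw speckle frames, followed by the temporal contrast K = sigma / mean at each pixel. The conditional diffusion stage is omitted, and this is not the authors' code.</p>
    <pre style="background: #f5f5f5; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Minimal numpy sketch (not the paper's code): phase-correlation registration
# of speckle frames, then the temporal speckle contrast K = sigma / mean.
import numpy as np

def phase_correlation_shift(ref, frame):
    """Integer (dy, dx) shift that best aligns `frame` to `ref`."""
    F1, F2 = np.fft.fft2(ref), np.fft.fft2(frame)
    cross_power = F1 * np.conj(F2)
    cross_power /= np.abs(cross_power) + 1e-12
    corr = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # wrap shifts larger than half the image size to negative offsets
    if dy > ref.shape[0] // 2:
        dy -= ref.shape[0]
    if dx > ref.shape[1] // 2:
        dx -= ref.shape[1]
    return dy, dx

def temporal_contrast(frames):
    """Stabilize a (T, H, W) speckle stack and return the tLSCI contrast map."""
    ref = frames[0]
    aligned = [ref]
    for f in frames[1:]:
        dy, dx = phase_correlation_shift(ref, f)
        aligned.append(np.roll(f, shift=(dy, dx), axis=(0, 1)))
    stack = np.stack(aligned, axis=0)
    return stack.std(axis=0) / (stack.mean(axis=0) + 1e-12)   # K = sigma / mean
</code></pre>
  </div>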
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20594v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20594v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 扩散模型, 图像生成 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] On the Impact of Face Segmentation-Based Background Removal on Recognition and Morphing Attack Detection</title>
      <link>https://arxiv.org/abs/2604.20585v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20585v1</guid>
      <pubDate>Wed, 22 Apr 2026 14:02:30 GMT</pubDate>
      <dc:creator>Eduarda Caldeira, Guray Ozgur, Fadi Boutros et al.</dc:creator>
      <category>模型架构</category>
      <description>This study investigates the impact of face image background correction through segmentation on face recognition and morphing attack detection performance in realistic, unconstrained image capture scen...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">On the Impact of Face Segmentation-Based Background Removal on Recognition and Morphing Attack Detection</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Eduarda Caldeira, Guray Ozgur, Fadi Boutros et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">This study investigates the impact of face image background correction through segmentation on face recognition and morphing attack detection performance in realistic, unconstrained image capture scenarios. The motivation is driven by operational biometric systems such as the European Entry/Exit System (EES), which require facial enrolment at airports and other border crossing points where controlled backgrounds usually required for such captures cannot always be guaranteed, as well as by accessibility needs that may necessitate image capture outside traditional office environments. By analyzing how such preprocessing steps influence both recognition accuracy and security mechanisms, this work addresses a critical gap between usability-driven image normalization and the reliability requirements of large-scale biometric identification systems. Our study evaluates a comprehensive range of segmentation techniques, three families of morphing attack detection methods, and four distinct face recognition models, using databases that include both controlled and in-the-wild image captures. The results reveal consistent patterns linking segmentation to both recognition performance and face image quality. Additionally, segmentation is shown to systematically influence morphing attack detection performance. These findings highlight the need for careful consideration when deploying such preprocessing techniques in operational biometric systems.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20585v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20585v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] Where are they looking in the operating room?</title>
      <link>https://arxiv.org/abs/2604.20574v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20574v1</guid>
      <pubDate>Wed, 22 Apr 2026 13:53:49 GMT</pubDate>
      <dc:creator>Keqi Chen, Séraphin Baributsa, Lilien Schewski et al.</dc:creator>
      <category>模型架构</category>
      <description>Purpose: Gaze-following, the task of inferring where individuals are looking, has been widely studied in computer vision, advancing research in visual attention modeling, social scene understanding, a...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Where are they looking in the operating room?</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Keqi Chen, Séraphin Baributsa, Lilien Schewski et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> vit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Purpose: Gaze-following, the task of inferring where individuals are looking, has been widely studied in computer vision, advancing research in visual attention modeling, social scene understanding, and human-robot interaction. However, gaze-following has never been explored in the operating room (OR), a complex, high-stakes environment where visual attention plays an important role in surgical workflow analysis. In this work, we introduce the concept of gaze-following to the surgical domain, and demonstrate its great potential for understanding clinical roles, surgical phases, and team communications in the OR. Methods: We extend the 4D-OR dataset with gaze-following annotations, and extend the Team-OR dataset with gaze-following and a new team communication activity annotations. Then, we propose novel approaches to address clinical role prediction, surgical phase recognition, and team communication detection using a gaze-following model. For role and phase recognition, we propose a gaze heatmap-based approach that uses gaze predictions solely; for team communication detection, we train a spatial-temporal model in a self-supervised way that encodes gaze-based clip features, and then feed the features into a temporal activity detection model. Results: Experimental results on the 4D-OR and Team-OR datasets demonstrate that our approach achieves state-of-the-art performance on all downstream tasks. Quantitatively, our approach obtains F1 scores of 0.92 for clinical role prediction and 0.95 for surgical phase recognition. Furthermore, it significantly outperforms existing baselines in team communication detection, improving previous best performances by over 30%. Conclusion: We introduce gaze-following in the OR as a novel research direction in surgical data science, highlighting its great potential to advance surgical workflow analysis in computer-assisted interventions.</p>
  </div>
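  <div style="margin-top: 20px; padding: 15px; background: #fafafa; border-left: 4px solid #90a4ae; border-radius: 8px;">
    <h3 style="color: #607d8b; margin-top: 0;">💻 示意代码</h3>
    <p style="line-height: 1.6; color: #333;">A minimal sketch of one way to turn per-frame gaze predictions into heatmap features for role or phase classification, in the spirit of the gaze heatmap-based approach mentioned above. The grid size, Gaussian width, and temporal averaging are assumptions, and no classifier is included.</p>
    <pre style="background: #f5f5f5; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Minimal sketch (assumptions only): render predicted gaze targets as Gaussian
# heatmaps and average them over a clip to form a fixed-size feature vector.
import numpy as np

def gaze_heatmap(gaze_points, hw=(64, 64), sigma=2.0):
    """Render predicted 2D gaze targets (normalized coordinates) as a Gaussian heatmap."""
    H, W = hw
    yy, xx = np.mgrid[0:H, 0:W]
    heat = np.zeros((H, W), dtype=np.float32)
    for gx, gy in gaze_points:                       # one target per person in the frame
        cy, cx = gy * (H - 1), gx * (W - 1)
        heat += np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))
    return heat / max(heat.max(), 1e-6)

def clip_feature(per_frame_gaze, hw=(64, 64)):
    """Average the frame-level heatmaps over a clip to get a fixed-size feature."""
    heatmaps = [gaze_heatmap(points, hw) for points in per_frame_gaze]
    return np.stack(heatmaps, axis=0).mean(axis=0).ravel()   # feed to any classifier
</code></pre>
  </div>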
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20574v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20574v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[图像生成] [模型架构] Exploring Spatial Intelligence from a Generative Perspective</title>
      <link>https://arxiv.org/abs/2604.20570v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20570v1</guid>
      <pubDate>Wed, 22 Apr 2026 13:50:00 GMT</pubDate>
      <dc:creator>Muzhi Zhu, Shunyao Jiang, Huanyi Zheng et al.</dc:creator>
      <category>图像生成</category>
      <category>模型架构</category>
      <description>Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified mul...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Exploring Spatial Intelligence from a Generative Perspective</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Muzhi Zhu, Shunyao Jiang, Huanyi Zheng et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 图像生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, image editing, image generation</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20570v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20570v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 图像生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[多模态生成] [模型架构] Evian: Towards Explainable Visual Instruction-tuning Data Auditing</title>
      <link>https://arxiv.org/abs/2604.20544v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20544v1</guid>
      <pubDate>Wed, 22 Apr 2026 13:28:27 GMT</pubDate>
      <dc:creator>Zimu Jia, Mingjie Xu, Andrew Estornell et al.</dc:creator>
      <category>多模态生成</category>
      <category>模型架构</category>
      <description>The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capabi...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">Evian: Towards Explainable Visual Instruction-tuning Data Auditing</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Zimu Jia, Mingjie Xu, Andrew Estornell et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 多模态生成, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, vlm, vision-language model</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel &quot;Decomposition-then-Evaluation&quot; paradigm that breaks model responses into constituent cognitive components: visual description, subjective inference, and factual claim, enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on orders-of-magnitude larger datasets. We also reveal that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20544v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20544v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 多模态生成, 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] RefAerial: A Benchmark and Approach for Referring Detection in Aerial Images</title>
      <link>https://arxiv.org/abs/2604.20543v2</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20543v2</guid>
      <pubDate>Wed, 22 Apr 2026 13:27:26 GMT</pubDate>
      <dc:creator>Guyue Hu, Hao Song, Yuxing Tong et al.</dc:creator>
      <category>模型架构</category>
      <description>Referring detection refers to locate the target referred by natural languages, which has recently attracted growing research interests. However, existing datasets are limited to ground images with lar...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">RefAerial: A Benchmark and Approach for Referring Detection in Aerial Images</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Guyue Hu, Hao Song, Yuxing Tong et al.</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">Referring detection refers to locate the target referred by natural languages, which has recently attracted growing research interests. However, existing datasets are limited to ground images with large object centered in relative small scenes. This paper introduces a large-scale challenging dataset for referring detection in aerial images, termed as RefAerial. It distinguishes from conventional ground referring detection datasets by 4 characteristics: (1) low but diverse object-to-scene ratios, (2) numerous targets and distractors, (3)complex and fine-grained referring descriptions, (4) diverse and broad scenes in the aerial view. We also develop a human-in-the-loop referring expansion and annotation engine (REA-Engine) for efficient semi-automated referring pair annotation. Besides, we observe that existing ground referring detection approaches exhibiting serious performance degradation on our aerial dataset since the intrinsic scale variety issue within or across aerial images. Therefore, we further propose a novel scale-comprehensive and sensitive (SCS) framework for referring detection in aerial images. It consists of a mixture-of-granularity (MoG) attention and a two-stage comprehensive-to-sensitive (CtS) decoding strategy. Specifically, the mixture-of-granularity attention is developed for scale-comprehensive target understanding. In addition, the two-stage comprehensive-to-sensitive decoding strategy is designed for coarse-to-fine referring target decoding. Eventually, the proposed SCS framework achieves remarkable performance on our aerial referring detection dataset and even promising performance boost on conventional ground referring detection datasets.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20543v2" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20543v2" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[视频生成] [扩散模型] [模型架构] AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe</title>
      <link>https://arxiv.org/abs/2604.20936v1</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20936v1</guid>
      <pubDate>Wed, 22 Apr 2026 13:11:21 GMT</pubDate>
      <dc:creator>Adam Cole, Mick Grierson</dc:creator>
      <category>视频生成</category>
      <category>扩散模型</category>
      <category>模型架构</category>
      <description>We present AttentionBender, a tool that manipulates cross-attention in Video Diffusion Transformers to help artists probe the internal mechanics of black-box video generation. While generative outputs...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Adam Cole, Mick Grierson</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 视频生成, 扩散模型, 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit, diffusion, video generation, video diffusion, transformer</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">We present AttentionBender, a tool that manipulates cross-attention in Video Diffusion Transformers to help artists probe the internal mechanics of black-box video generation. While generative outputs are increasingly realistic, prompt-only control limits artists&apos; ability to build intuition for the model&apos;s material process or to work beyond its default tendencies. Using an autobiographical research-through-design approach, we built on Network Bending to design AttentionBender, which applies 2D transforms (rotation, scaling, translation, etc.) to cross-attention maps to modulate generation. We assess AttentionBender by visualizing 4,500+ video generations across prompts, operations, and layer targets. Our results suggest that cross-attention is highly entangled: targeted manipulations often resist clean, localized control, producing distributed distortions and glitch aesthetics over linear edits. AttentionBender contributes a tool that functions both as an Explainable AI style probe of transformer attention mechanisms, and as a creative technique for producing novel aesthetics beyond the model&apos;s learned representational space.</p>
  </div>
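  <div style="margin-top: 20px; padding: 15px; background: #fafafa; border-left: 4px solid #90a4ae; border-radius: 8px;">
    <h3 style="color: #607d8b; margin-top: 0;">💻 示意代码</h3>
    <p style="line-height: 1.6; color: #333;">A minimal sketch of applying a 2D transform to a single cross-attention map before it modulates generation, in the spirit described above. It uses scipy.ndimage and assumes the map is a flat weight vector over a known latent grid; it is not the authors' implementation.</p>
    <pre style="background: #f5f5f5; padding: 12px; border-radius: 6px; overflow-x: auto; font-size: 0.85em;"><code># Minimal sketch of bending one cross-attention map with a 2D transform
# (rotation / scaling / translation). Uses scipy.ndimage; not the authors' code.
import numpy as np
from scipy.ndimage import rotate, shift, zoom

def bend_attention_map(attn, hw, angle=0.0, scale=1.0, translate=(0, 0)):
    """attn: (num_image_tokens,) weights for one text token; hw: latent grid size."""
    H, W = hw
    amap = attn.reshape(H, W)
    if angle != 0.0:
        amap = rotate(amap, angle, reshape=False, order=1, mode="nearest")
    if scale != 1.0:
        zoomed = zoom(amap, scale, order=1)
        # crop or zero-pad back to (H, W) so the token count is unchanged
        out = np.zeros((H, W), dtype=amap.dtype)
        zh, zw = zoomed.shape
        h, w = min(H, zh), min(W, zw)
        out[:h, :w] = zoomed[:h, :w]
        amap = out
    if translate != (0, 0):
        amap = shift(amap, translate, order=1, mode="nearest")
    amap = np.clip(amap, 0.0, None)
    return (amap / (amap.sum() + 1e-12)).reshape(-1)    # renormalize so weights sum to one
</code></pre>
  </div>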
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20936v1" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20936v1" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 视频生成, 扩散模型 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
    <item>
      <title>[模型架构] From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR</title>
      <link>https://arxiv.org/abs/2604.20522v2</link>
      <guid isPermaLink="true">https://arxiv.org/abs/2604.20522v2</guid>
      <pubDate>Wed, 22 Apr 2026 13:01:36 GMT</pubDate>
      <dc:creator>Nan Xu, Shiheng Li, Shengchao Hou</dc:creator>
      <category>模型架构</category>
      <description>We propose a new approach for a practical two-stage Optical Music Recognition (OMR) pipeline, with a particular focus on its second stage. Given symbol and event candidates from the visual pipeline, w...</description>
      <content:encoded><![CDATA[
<div style="font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px;">
  <h2 style="color: #1a1a1a; margin-bottom: 10px;">From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR</h2>
  
  <div style="background: #f5f5f5; padding: 15px; border-radius: 8px; margin: 20px 0;">
    <p style="margin: 5px 0;"><strong>👤 作者：</strong> Nan Xu, Shiheng Li, Shengchao Hou</p>
    <p style="margin: 5px 0;"><strong>📂 分类：</strong> 模型架构</p>
    <p style="margin: 5px 0;"><strong>🔖 关键词：</strong> dit</p>
    <p style="margin: 5px 0;"><strong>📅 发布：</strong> 2026-04-22</p>
  </div>
  
  <div style="background: white; padding: 20px; border-left: 4px solid #4CAF50; margin: 20px 0;">
    <h3 style="color: #4CAF50; margin-top: 0;">📝 摘要</h3>
    <p style="line-height: 1.8; color: #333;">We propose a new approach for a practical two-stage Optical Music Recognition (OMR) pipeline, with a particular focus on its second stage. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphonic staff notation, especially piano scores, where voice separation and intra-measure timing are the main bottlenecks. Our approach formulates second-stage decoding as a structure decoding problem and uses topology recognition with probability-guided search (BeadSolver) as its core method. We also describe a data strategy that combines procedural generation with recognition-feedback annotations. The result is a practical decoding component for real OMR systems and a path to accumulate structured score data for future end-to-end, multimodal, and RL-style methods.</p>
  </div>
  
  <div style="margin-top: 30px; padding: 15px; background: #e3f2fd; border-radius: 8px;">
    <p style="margin: 5px 0;"><strong>🔗 链接：</strong></p>
    <p style="margin: 10px 0;">
      <a href="https://arxiv.org/abs/2604.20522v2" style="color: #2196F3; text-decoration: none; margin-right: 20px;">📄 arXiv 页面</a>
      <a href="https://arxiv.org/pdf/2604.20522v2" style="color: #FF5722; text-decoration: none;">📥 PDF 下载</a>
    </p>
  </div>
  
  <div style="margin-top: 20px; padding: 10px; background: #fff3e0; border-radius: 8px; font-size: 0.9em;">
    <p style="margin: 5px 0; color: #666;">💡 <strong>提示：</strong>这是一篇 AIGC 领域的最新论文，涵盖 模型架构 等主题。</p>
  </div>
</div>
]]></content:encoded>
      <enclosure url="https://arxiv.org/static/browse/0.3.4/images/arxiv-logo-fb.png" type="image/png"/>
    </item>
  </channel>
</rss>