- Megrez2 Technical Report
- Who Attacks, and Why? Using LLMs to Identify Negative Campaigning in 18M Tweets across 19 Countries
- Can One Domain Help Others? A Data-Centric Study on Multi-Domain Reasoning via Reinforcement Learning
- Investigating Training Data Detection in AI Coders
- EarthLink: Interpreting Climate Signals with Self-Evolving AI Agents
- ACT: Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training
- LOCOFY Large Design Models -- Design to code conversion solution
- Compositional Coordination for Multi-Robot Teams with Large Language Models
- 3LM: Bridging Arabic, STEM, and Code through Benchmarking
- Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR
- Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training
- ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution
- SimdBench: Benchmarking Large Language Models for SIMD-Intrinsic Code Generation
- Survey of GenAI for Automotive Software Development: From Requirements to Executable Code
- Doc2Chart: Intent-Driven Zero-Shot Chart Generation from Documents
- Mind the Gap: A Review of Arabic Post-Training Datasets and Their Limitations
- Exploring Human-AI Complementarity in CPS Diagnosis Using Unimodal and Multimodal BERT Models
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
- Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models
- Towards Formal Verification of LLM-Generated Code from Natural Language Prompts
- MRT at IberLEF-2025 PRESTA Task: Maximizing Recovery from Tables with Multiple Steps
- GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities
- Chain-of-Descriptions: Improving Code LLMs for VHDL Code Generation and Summarization
- MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks
- ExpliCIT-QA: Explainable Code-Based Image Table Question Answering
- MetaLint: Generalizable Idiomatic Code Quality Analysis through Instruction-Following and Easy-to-Hard Generalization
- The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs
- Function-to-Style Guidance of LLMs for Code Translation
- CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
- CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance
- A Code Comprehension Benchmark for Large Language Models for Code
- Turning the Tide: Repository-based Code Reflection
- A Mixture of Linear Corrections Generates Secure Code
- OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique
- On Evaluating Performance of LLM Inference Serving Systems
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
- Multilingual Multimodal Software Developer for Code Generation
- Agentic Large Language Models for Conceptual Systems Engineering and Design
- Automating MD simulations for Proteins using Large language Models: NAMD-Agent
- Rethinking Verification for LLM Code Generation: From Generation to Testing
- Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams
- Foundation Model Self-Play: Open-Ended Strategy Innovation via Foundation Models
- A Semantic Parsing Framework for End-to-End Time Normalization
- Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
- Coding Triangle: How Does Large Language Model Understand Code?
- CogniSQL-R1-Zero: Lightweight Reinforced Reasoning for Efficient SQL Generation
- Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning
- ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
- ChipSeek-R1: Generating Human-Surpassing RTL with LLM via Hierarchical Reward-Driven Reinforcement Learning
- A Technical Survey of Reinforcement Learning Techniques for Large Language Models
- Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy
- EvoAgentX: An Automated Framework for Evolving Agentic Workflows
- CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark
- Discovering Algorithms with Computational Language Processing
- LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users
- OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding
- FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference
- Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection
- CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks
- The Anatomy of Evidence: An Investigation Into Explainable ICD Coding
- LogitSpec: Accelerating Retrieval-based Speculative Decoding via Next Next Token Speculation
- LLM-based Realistic Safety-Critical Driving Video Generation
- Echoes of AI: Investigating the Downstream Effects of AI Assistants on Software Maintainability
- Cognitive Load-Aware Inference: A Neuro-Symbolic Framework for Optimizing the Token Economy of Large Language Models
- iPanda: An Intelligent Protocol Testing and Debugging Agent for Conformance Testing
- An AST-guided LLM Approach for SVRF Code Synthesis
- Teaching Programming in the Age of Generative AI: Insights from Literature, Pedagogical Proposals, and Student Perspectives
- VoyagerVision: Investigating the Role of Multi-modal Information for Open-ended Learning Systems
- Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation
- Beyond Code: The Multidimensional Impacts of Large Language Models in Software Development
- P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code
- VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs
- QuickSilver -- Speeding up LLM Inference through Dynamic Token Halting, KV Skipping, Contextual Token Fusion, and Adaptive Matryoshka Quantization
- Concept-Level AI for Telecom: Moving Beyond Large Language Models
- Exploring Modularity of Agentic Systems for Drug Discovery
- Training Language Model to Critique for Better Refinement
- Estimating Correctness Without Oracles in LLM-Based Code Generation
- DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
- Large Language Model-Driven Code Compliance Checking in Building Information Modeling
- ReCode: Updating Code API Knowledge with Reinforcement Learning
- SV-LLM: An Agentic Approach for SoC Security Verification using Large Language Models
- Language Modeling by Language Models
- Zero-Shot Attribution for Large Language Models: A Distribution Testing Approach
- SACL: Understanding and Combating Textual Bias in Code Retrieval with Semantic-Augmented Reranking and Localization
- QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges
- Scaling Speculative Decoding with Lookahead Reasoning
- Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study
- From Reproduction to Replication: Evaluating Research Agents with Progressive Code Masking
- Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs
- Steering Conceptual Bias via Transformer Latent-Subspace Activation
- LLMs on a Budget? Say HOLA
- The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs
- Use Property-Based Testing to Bridge LLM Code Generation and Validation
- RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
- Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs
- TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs
- AI-Driven Tools in Modern Software Quality Assurance: An Assessment of Benefits, Challenges, and Future Directions
- LMR-BENCH: Evaluating LLM Agent's Ability on Reproducing Language Modeling Research
- Massive Supervised Fine-tuning Experiments Reveal How Data, Layer, and Training Factors Shape LLM Alignment Quality
- StreetLens: Enabling Human-Centered AI Agents for Neighborhood Assessment from Street View Imagery
- Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees
- Sampling from Your Language Model One Byte at a Time
- How Does LLM Reasoning Work for Code? A Survey and a Call to Action
- LocationReasoner: Evaluating LLMs on Real-World Site Selection Reasoning
- A Technical Study into Small Reasoning Language Models
- FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation
- Structured Program Synthesis using LLMs: Results and Insights from the IPARC Challenge
- Humanity's Last Code Exam: Can Advanced LLMs Conquer Human's Hardest Code Competition?
- QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm
- PRO-V: An Efficient Program Generation Multi-Agent System for Automatic RTL Verification
- code_transformed: The Influence of Large Language Models on Code
- Configurable Preference Tuning with Rubric-Guided Synthetic Data
- Leveraging GPT-4 for Vulnerability-Witnessing Unit Test Generation
- Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards
- LLM-as-a-Judge for Reference-less Automatic Code Validation and Refinement for Natural Language to Bash in IT Automation
- AutoMind: Adaptive Knowledgeable Agent for Automated Data Science
- Specification and Evaluation of Multi-Agent LLM Systems -- Prototype and Cybersecurity Applications
- Edit Flows: Flow Matching with Edit Operations
- SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner
- Draft-based Approximate Inference for LLMs
- Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study
- Repeton: Structured Bug Repair with ReAct-Guided Patch-and-Test Cycles
- Worst-Case Symbolic Constraints Analysis and Generalisation with Large Language Models
- AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists
- SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design
- ProtocolLLM: RTL Benchmark for SystemVerilog Generation of Communication Protocols
- Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning
- LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments
- VeriLoC: Line-of-Code Level Prediction of Hardware Design Quality from Verilog Code
- Break-The-Chain: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation
- KnowCoder-V2: Deep Knowledge Analysis
- Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems
- Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit
- KramaBench: A Benchmark for AI Systems on Data-to-Insight Pipelines over Data Lakes
- DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation
- Can Theoretical Physics Research Benefit from Language Agents?
- Table-r1: Self-supervised and Reinforcement Learning for Program-based Table Reasoning in Small Language Models
- Reinforcing Code Generation: Improving Text-to-SQL with Execution-Based Learning
- CP-Bench: Evaluating Large Language Models for Constraint Modelling
- SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code
- Deployability-Centric Infrastructure-as-Code Generation: An LLM-based Iterative Framework
- Toward Greater Autonomy in Materials Discovery Agents: Unifying Planning, Physics, and Scientists
- ScaleRTL: Scaling LLMs with Reasoning Data and Test-Time Compute for Accurate RTL Code Generation
- ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests
- Accelerated Test-Time Scaling with Model-Free Speculative Sampling
- Agents of Change: Self-Evolving LLM Agents for Strategic Planning
- Demonstrations of Integrity Attacks in Multi-Agent Systems
- hdl2v: A Code Translation Dataset for Enhanced LLM Verilog Generation
- Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning
- Adaptive Graph Pruning for Multi-Agent Communication
- Rethinking the effects of data contamination in Code Intelligence
- Consultant Decoding: Yet Another Synergistic Mechanism
- ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code
- Improving LLM-Generated Code Quality with GRPO
- SALAD: Systematic Assessment of Machine Unlearning on LLM-Aided Hardware Design
- Flow2Code: Evaluating Large Language Models for Flowchart-based Code Generation Capability
- DeepSeek in Healthcare: A Survey of Capabilities, Risks, and Clinical Applications of Open-Source Large Language Models
- Mamba Drafters for Speculative Decoding
- XAI-Units: Benchmarking Explainability Methods with Unit Tests
- Legal Compliance Evaluation of Smart Contracts Generated By Large Language Models
- A "Wenlu" Brain System for Multimodal Cognition and Embodied Decision-Making: A Secure New Architecture for Deep Integration of Foundation Models and Domain Knowledge
- Accelerating Diffusion LLMs via Adaptive Parallel Decoding
- Writing-Zero: Bridge the Gap Between Non-verifiable Problems and Verifiable Rewards
- Eye of Judgement: Dissecting the Evaluation of Russian-speaking LLMs with POLLUX
- Cross-Attention Speculative Decoding
- RMoA: Optimizing Mixture-of-Agents through Diversity Maximization and Residual Compensation
- SwiftEval: Developing a Language-Specific Benchmark for LLM-generated Code Evaluation
- A Reward-driven Automated Webshell Malicious-code Generator for Red-teaming
- Mitigating Overthinking in Large Reasoning Models via Manifold Steering
- Text2Grad: Reinforcement Learning from Natural Language Feedback
- MRT at SemEval-2025 Task 8: Maximizing Recovery from Tables with Multiple Steps
- Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
- RAD: Redundancy-Aware Distillation for Hybrid Models via Self-Speculative Decoding
- From Coders to Critics: Empowering Students through Peer Assessment in the Age of AI Copilots
- R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning
- Hardware-Efficient Attention for Fast Decoding
- rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset
- RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving
- An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks
- Rendering-Aware Reinforcement Learning for Vector Graphics Generation
- SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
- AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage
- Large Language Models for IT Automation Tasks: Are We There Yet?
- HAMburger: Accelerating LLM Inference via Token Smashing
- SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
- An Empirical Study on Strong-Weak Model Collaboration for Repo-level Code Generation
- Compliance-to-Code: Enhancing Financial Compliance Checking via Code Generation
- ReChisel: Effective Automatic Chisel Code Generation by LLM with Reflection
- Learning to Reason via Mixture-of-Thought for Logical Reasoning
- Long-Form Information Alignment Evaluation Beyond Atomic Facts
- Large Language Models as Computable Approximations to Solomonoff Induction
- dKV-Cache: The Cache for Diffusion Language Models
- Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
- Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval
- HybridProver: Augmenting Theorem Proving with LLM-Driven Proof Synthesis and Refinement
- VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models
- Advancing LLM Safe Alignment with Safety Representation Ranking
- LyapLock: Bounded Knowledge Preservation in Sequential Large Language Model Editing
- HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases
- Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model
- Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
- Guidelines for the Quality Assessment of Energy-Aware NAS Benchmarks
- DS-Bench: A Realistic Benchmark for Data Science Code Generation
- Deep Learning for Continuous-time Stochastic Control with Jumps
- UWSAM: Segment Anything Model Guided Underwater Instance Segmentation and A Large-scale Benchmark Dataset
- Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models
- Bridging the Domain Gap in Equation Distillation with Reinforcement Feedback
- Moonbeam: A MIDI Foundation Model Using Both Absolute and Relative Music Attributes