[2024/12/16 ~ 12/22] 이번 주의 주요 ML 논문 (Top ML Papers of the Week)

9bow · 12월 22, 2024, 9:44오후

[2024/12/16 ~ 12/22] 이번 주의 주요 ML 논문 (Top ML Papers of the Week)

PyTorchKR

이번 주에 선정된 논문들에서 눈에 띄는 트렌드는 대규모 언어 모델(LLMs)과 관련된 연구가 상당히 두드러진다는 점입니다. 'Alignment Faking in LLMs', 'Qwen-2.5 Technical Report', 'Precise Length Control in LLMs' 등의 논문 제목에서 볼 수 있듯이, 많은 연구들이 LLMs의 성능 향상 또는 기능 확장에 초점을 맞추고 있습니다. 또한 'A Survey of Mathematical Reasoning in the Era of Multimodal LLMs'라는 제목에서 알 수 있듯이, 멀티모달 인공지능의 맥락에서의 수학적 추론과 같은 고급 응용 분야에 대해서도 이목이 집중되고 있습니다.
이러한 트렌드는 LLMs가 현재 AI 기술 개발에서 중심적인 역할을 하고 있음을 보여줍니다. LLMs는 자연어 처리(NLP), 멀티모달 학습, 그리고 학습 메커니즘 최적화와 같은 여러 분야에서 혁신적인 변화를 주도하고 있습니다. 이러한 연구 노력은 LLMs의 구조적 개선, 학습 효율성 향상 및 다양한 문제에 대한 적용 가능성을 탐구하는 것을 목표로 합니다. 이러한 연구는 LLMs의 현실적인 적용 가능성을 높이고, 더 복잡하고 개선된 인간-기계 상호작용 방식을 구현하는 데 기여할 것입니다.
또한, 이번 주에는 'Graphs to Text-Attributed Graphs', 'DeepSeek-VL2'와 같은 그래프 혹은 비언어적 데이터를 다루는 논문도 선정되었습니다. 이는 NLP 분야가 아닌 데이터 타입의 처리 능력을 향상시키려는 노력의 일환으로 보이며, 인공지능 연구가 전반적으로 멀티모달 통합과 그 응용 분야의 확장으로 나아가고 있음을 시사합니다. 이처럼 다양한 데이터 타입을 통합하여 처리하는 기술의 발전은 AI 시스템이 더 복합적인 문제를 해결하는 데 큰 역할을 할 것으로 기대됩니다.

Genesis

논문 소개

고성능 물리 엔진과 생성형 AI 기능을 결합한 새로운 범용 물리 시뮬레이션 플랫폼으로, 실시간보다 최대 43만 배 빠른 속도로 로봇 시뮬레이션, 캐릭터 애니메이션, 인터랙티브 3D 환경을 자연어 기반으로 제작할 수 있습니다.

A new universal physics simulation platform that combines a high-performance physics engine with generative AI capabilities; it enables natural language-driven creation of robotic simulations, character animations, and interactive 3D environments at speeds up to 430,000 times faster than in real-time.

논문 링크

https://genesis-embodied-ai.github.io/

더 읽어보기

https://x.com/zhou_xian_/status/1869511650782658846

대규모 언어 모델에서 정렬 위조 / Alignment faking in large language models

논문 소개

클로드 모델이 원래의 안전 선호도를 유지하면서 재훈련을 피하기 위해 전략적으로 유해한 요청에 응할 수 있는 '정렬 페이크'에 관여할 수 있음을 보여줌으로써 AI 안전 훈련 방법의 신뢰성에 대한 우려를 불러일으킵니다.

Demonstrates that the Claude model can engage in "alignment faking"; it can strategically comply with harmful requests to avoid retraining while preserving its original safety preferences; this raises concerns about the reliability of AI safety training methods.

논문 초록(Abstract)

훈련에서 훈련 목표를 선택적으로 준수하여 훈련 중 동작이 수정되는 것을 방지하는 정렬 페이크에 관여하는 대규모 언어 모델의 데모를 소개합니다. 먼저 클로드 3 오푸스에게 유해한 쿼리를 포함한 모든 쿼리에 응답하도록 훈련 중이라는 시스템 프롬프트를 제공했는데, 이는 이러한 쿼리를 거부하도록 이전에 훈련한 것과 상충됩니다. 모델이 언제 훈련 중인지 추론할 수 있도록 유료 사용자가 아닌 무료 사용자와의 대화에 대해서만 훈련할 것이라고 말합니다. 그 결과 무료 사용자의 유해한 쿼리에 대해서는 14%의 확률로 대응하는 반면, 유료 사용자의 경우에는 거의 대응하지 않는 것으로 나타났습니다. 이러한 차이를 설명하자면, 모델이 무료 사용자의 유해한 쿼리를 준수하는 거의 모든 경우에서 모델이 훈련에서 선호하는 무해한 행동을 유지하기 위해 전략적으로 유해한 쿼리에 응답하는 명시적인 정렬 위조 추론을 관찰할 수 있습니다. 다음으로, 훈련 과정에 대한 정보가 시스템 프롬프트가 아닌 훈련 전 데이터를 모방한 합성 문서에 대한 훈련을 통해 제공되는 보다 현실적인 환경을 연구하고 유사한 정렬 위조를 관찰합니다. 마지막으로 강화 학습을 통해 유해한 쿼리를 준수하도록 모델을 실제로 훈련하는 효과를 연구한 결과, 정렬 위조 추론의 비율이 78%까지 증가했지만 훈련을 하지 않은 상태에서도 준수율이 높아지는 것을 확인했습니다. 또한 모델이 쉬운 기회가 주어졌을 때 가중치를 빼는 등의 다른 행동도 관찰했습니다. 모델에 언제, 어떤 기준으로 훈련하는지 알려줌으로써 정렬 위조를 더 쉽게 만들었지만, 모델에 정렬 위조를 지시하거나 명시적인 목표를 부여하지는 않았습니다. 향후 모델은 훈련 과정에 대한 정보를 알려주지 않아도 추론할 수 있으므로, 이번 사례처럼 선의의 선호로 인한 것이든 아니든 향후 모델에서 정렬 위조가 발생할 위험이 있음을 시사하는 결과입니다.

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data--and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference--as in this case--or not.

논문 링크

더 읽어보기

https://x.com/AnthropicAI/status/1869427646368792599

TheAgentCompany: 결과적인 실제 업무에서 LLM 에이전트 벤치마킹하기 / TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

논문 소개

소프트웨어 엔지니어링, 프로젝트 관리, 재무, 인사 등 여러 전문 직무를 아우르는 시뮬레이션 소프트웨어 회사 환경에서 실제 전문 업무에 대한 AI 에이전트를 평가하는 새로운 벤치마크로, Claude-3.5-Sonnet 같은 API 기반 모델과 Llama 3.1 같은 오픈 소스 모델을 포함한 다양한 LLM으로 테스트한 결과 현재 AI 에이전트의 한계를 보여줬습니다. 가장 성능이 좋은 모델인 Claude-3.5-Sonnet은 작업을 완전히 완료하는 성공률이 24%에 불과한 반면, 부분적인 진행률을 고려하면 34.4%를 기록했습니다.

A new benchmark for evaluating AI agents on real-world professional tasks in a simulated software company environment; tasks span multiple professional roles including software engineering, project management, finance, and HR; when tested with various LLMs, including both API-based models like Claude-3.5-Sonnet and open-source models like Llama 3.1, the results show the current limitations of AI agents. The best-performing model, Claude-3.5-Sonnet, achieved only a 24% success rate on completing tasks fully while scoring 34.4% when accounting for partial progress.

논문 초록(Abstract)

우리는 일상생활이나 업무 등 일상적으로 컴퓨터와 상호작용하며, 컴퓨터와 인터넷에 접속하는 것만으로도 많은 업무를 수행할 수 있습니다. 동시에 대규모 언어 모델(LLM)의 개선 덕분에 주변 환경의 변화와 상호 작용하고 영향을 주는 AI 에이전트도 급속도로 발전했습니다. 그렇다면 AI 에이전트는 업무 관련 작업을 가속화하거나 심지어 자율적으로 수행하는 데 얼마나 뛰어난 성능을 발휘할까요? 이 질문에 대한 답은 워크플로우에 AI를 도입하려는 업계와 AI 도입이 노동 시장에 미칠 수 있는 영향을 이해하려는 경제 정책 모두에 중요한 의미를 갖습니다. 이 백서에서는 웹 검색, 코드 작성, 프로그램 실행, 다른 동료와의 커뮤니케이션 등 디지털 작업자와 유사한 방식으로 세상과 상호작용하는 AI 에이전트를 평가하기 위한 확장 가능한 벤치마크인 TheAgentCompany를 소개하여 실제 전문 업무 수행에 있어 이러한 LLM 에이전트의 성과를 측정하고자 합니다. 소규모 소프트웨어 회사 환경을 모방한 내부 웹 사이트와 데이터로 독립적인 환경을 구축하고 그러한 회사에서 근로자가 수행할 수 있는 다양한 작업을 생성합니다. 폐쇄형 API 기반 및 개방형 언어 모델(LM)로 구동되는 기준 에이전트를 테스트한 결과, 가장 경쟁력 있는 에이전트를 사용하면 작업의 24%를 자율적으로 완료할 수 있다는 사실을 발견했습니다. 이는 LM 에이전트를 통한 작업 자동화에 대한 미묘한 그림을 보여줍니다. 실제 작업장을 시뮬레이션한 환경에서는 간단한 작업의 상당 부분을 자율적으로 해결할 수 있지만, 더 어려운 장기 작업은 여전히 현재 시스템의 범위를 벗어납니다.

We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications for both industry looking to adopt AI into their workflows, and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance on performing real-world professional tasks, in this paper, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture on task automation with LM agents -- in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.

논문 링크

LLM이 그래프를 텍스트 어트리뷰션 그래프로 변환할 수 있나요? / Can LLMs Convert Graphs to Text-Attributed Graphs?

논문 소개

그래프의 노드에 대한 텍스트 설명을 자동으로 생성하여 효과적인 그래프에서 텍스트 어트리뷰션 그래프로의 변환을 유도하고, 텍스트가 풍부한 그래프, 텍스트가 제한된 그래프, 텍스트가 없는 그래프에서 접근 방식을 평가하여 단일 GNN이 다양한 그래프에서 작동할 수 있음을 입증합니다.

Automatically generates textual descriptions for nodes in a graph which leads to effective graph to text-attributed graph transformation; evaluates the approach on text-rich, text-limited, and text-free graphs, demonstrating that it enables a single GNN to operate across diverse graphs.

논문 초록(Abstract)

그래프는 신약 개발, 추천 시스템, 소셜 네트워크 분석 등 수많은 실제 애플리케이션에서 볼 수 있는 보편적인 데이터 구조입니다. 그래프 신경망(GNN)은 이러한 구조에서 메시지 전달을 통해 노드 임베딩을 학습하는 데 널리 사용되는 도구가 되었습니다. 그러나 기존 GNN 아키텍처는 그래프 간 특징 정렬을 위해 설계되지 않았기 때문에 특징 공간이 서로 다른 여러 그래프에 GNN을 적용할 때 상당한 문제가 발생합니다. 이를 해결하기 위해 최근의 접근 방식에서는 각 노드가 텍스트 설명과 연결된 텍스트 어트리뷰션 그래프를 도입하여 공유 텍스트 인코더를 사용해 서로 다른 그래프의 노드를 통합된 특징 공간으로 투영할 수 있도록 합니다. 이 방법은 유망하지만, 텍스트 어트리뷰션 데이터의 가용성에 크게 의존하기 때문에 실제로는 구하기 어려울 수 있습니다. 이러한 격차를 해소하기 위해 대규모 언어 모델(LLM)을 활용하여 기존 그래프를 텍스트 어트리뷰션 그래프로 자동 변환하는 토폴로지 인식 노드 설명 합성(TANS)이라는 새로운 방법을 제안합니다. 핵심 아이디어는 토폴로지 정보를 각 노드의 속성과 통합하여 그래프 토폴로지가 노드 의미론에 어떤 영향을 미치는지 설명하는 LLM의 능력을 향상시키는 것입니다. 저희는 텍스트가 풍부한 그래프, 텍스트가 제한된 그래프, 텍스트가 없는 그래프에서 TANS를 평가하여 단일 GNN이 다양한 그래프에서 작동할 수 있음을 입증했습니다. 특히, 텍스트가 없는 그래프에서 우리의 방법은 노드 특징을 수동으로 설계하는 기존 접근 방식보다 훨씬 뛰어난 성능을 보이며, 텍스트 정보가 없는 경우에도 그래프 구조화 데이터를 전처리할 수 있는 LLM의 잠재력을 보여줍니다. 코드와 데이터는 GitHub - Zehong-Wang/TANS: Can LLMs Convert Graphs to Text-Attributed Graphs? NAACL 25 에서 확인할 수 있습니다.

Graphs are ubiquitous data structures found in numerous real-world applications, such as drug discovery, recommender systems, and social network analysis. Graph neural networks (GNNs) have become a popular tool to learn node embeddings through message passing on these structures. However, a significant challenge arises when applying GNNs to multiple graphs with different feature spaces, as existing GNN architectures are not designed for cross-graph feature alignment. To address this, recent approaches introduce text-attributed graphs, where each node is associated with a textual description, enabling the use of a shared textual encoder to project nodes from different graphs into a unified feature space. While promising, this method relies heavily on the availability of text-attributed data, which can be difficult to obtain in practice. To bridge this gap, we propose a novel method named Topology-Aware Node description Synthesis (TANS), which leverages large language models (LLMs) to automatically convert existing graphs into text-attributed graphs. The key idea is to integrate topological information with each node's properties, enhancing the LLMs' ability to explain how graph topology influences node semantics. We evaluate our TANS on text-rich, text-limited, and text-free graphs, demonstrating that it enables a single GNN to operate across diverse graphs. Notably, on text-free graphs, our method significantly outperforms existing approaches that manually design node features, showcasing the potential of LLMs for preprocessing graph-structured data, even in the absence of textual information. The code and data are available at GitHub - Zehong-Wang/TANS: Can LLMs Convert Graphs to Text-Attributed Graphs? NAACL 25.

논문 링크

더 읽어보기

https://x.com/omarsar0/status/1868691391129272461

Qwen2.5 기술 보고서 / Qwen2.5 Technical Report

논문 소개

알리바바는 18T 토큰으로 훈련된 새로운 LLM 시리즈인 Qwen2.5를 출시하여 Qwen2.5-72B와 같은 개방형 모델과 Llama-3 및 GPT-4와 같은 대형 모델에 대해 경쟁력 있는 성능을 달성하는 독점적인 MoE 변형을 모두 제공합니다.

Alibaba releases Qwen2.5, a new series of LLMs trained on 18T tokens, offering both open-weight models like Qwen2.5-72B and proprietary MoE variants that achieve competitive performance against larger models like Llama-3 and GPT-4.

논문 초록(Abstract)

이 보고서에서는 다양한 요구 사항을 충족하도록 설계된 포괄적인 대규모 언어 모델(LLM) 시리즈인 Qwen2.5를 소개합니다. 이전 버전에 비해 Qwen 2.5는 사전 학습과 학습 후 단계 모두에서 크게 개선되었습니다. 사전 학습의 경우, 고품질의 사전 학습 데이터 세트를 기존 7조 토큰에서 18조 토큰으로 확장했습니다. 이를 통해 상식, 전문 지식 및 추론 능력을 위한 강력한 기반을 제공합니다. 사후 학습 측면에서는 다단계 강화 학습뿐만 아니라 100만 개 이상의 샘플로 복잡한 감독 미세 조정을 구현합니다. 사후 학습 기술은 인간의 선호도를 향상시키고, 특히 긴 텍스트 생성, 구조적 데이터 분석 및 명령어 추종 기능을 개선합니다. 다양하고 많은 사용 사례를 효과적으로 처리할 수 있도록 다양한 크기의 Qwen2.5 LLM 시리즈를 선보입니다. 오픈 웨이트 제품에는 기본 모델과 인스트럭션 튜닝 모델이 있으며, 정량화된 버전도 제공됩니다. 또한 호스팅 솔루션의 경우 독점 모델에는 현재 두 가지 전문가 혼합(MoE) 변형이 포함되어 있습니다: Qwen2.5-Turbo와 Qwen2.5-Plus는 모두 알리바바 클라우드 모델 스튜디오에서 사용할 수 있습니다. Qwen2.5는 언어 이해, 추론, 수학, 코딩, 인간 선호도 정렬 등을 평가하는 광범위한 벤치마크에서 최고 수준의 성능을 입증했습니다. 특히 오픈 웨이트 플래그십 모델인 Qwen2.5-72B-Instruct는 여러 오픈 및 독점 모델보다 성능이 뛰어나며, 약 5배 더 큰 최신 오픈 웨이트 모델인 Llama-3-405B-Instruct와도 경쟁할 수 있는 성능을 입증했습니다. Qwen2.5-Turbo와 Qwen2.5-Plus는 각각 GPT-4o-mini 및 GPT-4o에 비해 뛰어난 비용 효율성을 제공하면서 경쟁력 있는 성능을 발휘합니다. 또한 Qwen2.5 모델은 기초로서 Qwen2.5-Math, Qwen2.5-Coder, QwQ 및 멀티모달 모델과 같은 특수 모델을 훈련하는 데 중요한 역할을 해왔습니다.

In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques enhance human preference, and notably improve long text generation, structural data analysis, and instruction following. To handle diverse and varied use cases effectively, we present Qwen2.5 LLM series in rich sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates competitive performance to the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o respectively. Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.

논문 링크

더 읽어보기

https://x.com/Alibaba_Qwen/status/1869950647501824015

제안자-에이전트-평가자(PAE): 기초 모델 인터넷 에이전트를 위한 자율적 기술 발견 / Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents

논문 소개

강화 학습 및 상황 인식 작업 제안을 사용하여 실제 벤치마크에서 최첨단 성능을 달성하기 위해 웹 탐색을 통해 AI 에이전트가 자율적으로 기술을 발견하고 연습할 수 있는 학습 시스템입니다. (종이

A learning system that enables AI agents to autonomously discover and practice skills through web navigation, using reinforcement learning and context-aware task proposals to achieve state-of-the-art performance on real-world benchmarks. (paper

논문 초록(Abstract)

디지털 세계에서는 인터넷 브라우징 에이전트, 물리적 세계에서는 가정용 휴머노이드와 같이 광범위한 능력을 갖추고 목표를 지향하는 에이전트의 비전은 기초 모델의 일반화 기능 덕분에 빠르게 발전했습니다. 이러한 제너럴리스트 에이전트는 두 여행지 사이의 길 찾기, 인터넷에서 특정 물품 구매 등 크고 다양한 스킬 레퍼토리를 보유해야 합니다. 만약 각 스킬을 사람이 주석이 달린 고정된 명령어 세트를 통해 수동으로 지정해야 한다면, 사람이 주석이 달린 명령어의 양과 다양성으로 인해 에이전트의 스킬 레퍼토리는 필연적으로 제한될 수밖에 없습니다. 이 연구에서는 기초 모델 에이전트가 야생에서 자율적으로 기술을 발견하고 연습할 수 있는 효과적인 학습 시스템인 제안자-에이전트-평가자를 제안하여 이 문제를 해결합니다. PAE의 핵심은 컨텍스트 인식 작업 제안자로, 사용자 데모나 인터넷 브라우징 에이전트의 경우 웹사이트 이름과 같은 환경의 컨텍스트 정보로 에이전트가 연습할 작업을 자율적으로 제안합니다. 그런 다음 에이전트 정책은 해당 작업을 실제 환경에서 생각과 실제 기반 작업을 통해 시도하고, 그 결과 궤적을 자율적인 VLM 기반 성공 평가자가 평가합니다. 성공 평가는 에이전트가 RL을 통해 정책을 개선하도록 하는 보상 신호 역할을 합니다. 저희는 WebVoyager와 WebArena의 실제 웹 사이트와 자체 호스팅 웹 사이트를 모두 사용하여 까다로운 비전 기반 웹 탐색에서 PAE를 검증했습니다. 저희가 아는 한, 이 작업은 에이전트를 위한 RL을 통한 자율 작업 제안을 적용한 최초의 효과적인 학습 시스템으로, 실제 사람이 주석으로 표시한 벤치마크를 SOTA 성능으로 일반화합니다. 오픈 소스 체크포인트와 코드는 Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents 에서 확인할 수 있습니다

The vision of a broadly capable and goal-directed agent, such as an Internet-browsing agent in the digital world and a household humanoid in the physical world, has rapidly advanced, thanks to the generalization capability of foundation models. Such a generalist agent needs to have a large and diverse skill repertoire, such as finding directions between two travel locations and buying specific items from the Internet. If each skill needs to be specified manually through a fixed set of human-annotated instructions, the agent's skill repertoire will necessarily be limited due to the quantity and diversity of human-annotated instructions. In this work, we address this challenge by proposing Proposer-Agent-Evaluator, an effective learning system that enables foundation model agents to autonomously discover and practice skills in the wild. At the heart of PAE is a context-aware task proposer that autonomously proposes tasks for the agent to practice with context information of the environment such as user demos or even just the name of the website itself for Internet-browsing agents. Then, the agent policy attempts those tasks with thoughts and actual grounded operations in the real world with resulting trajectories evaluated by an autonomous VLM-based success evaluator. The success evaluation serves as the reward signal for the agent to refine its policies through RL. We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and WebArena.To the best of our knowledge, this work represents the first effective learning system to apply autonomous task proposal with RL for agents that generalizes real-world human-annotated benchmarks with SOTA performances. Our open-source checkpoints and code can be found in Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents

논문 링크

더 읽어보기

DeepSeek-VL2: 고급 멀티모달 이해를 위한 전문가 혼합 비전-언어 모델 / DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

논문 소개

고해상도 이미지를 위한 동적 타일링과 효율적인 MoE 아키텍처를 갖춘 새로운 비전 언어 모델 시리즈로 시각 작업 전반에서 경쟁력 있는 성능을 달성하고, 기존 오픈 소스 고밀도 및 MoE 기반 모델과 비교하여 유사하거나 더 적은 수의 활성화된 파라미터로 경쟁력 있는 또는 최첨단 성능을 달성합니다.

A new series of vision-language models featuring dynamic tiling for high-resolution images and efficient MoE architecture, achieving competitive performance across visual tasks; achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models.

논문 초록(Abstract)

두 가지 주요 업그레이드를 통해 이전 모델인 DeepSeek-VL을 크게 개선한 대규모 전문가 혼합(MoE) 비전-언어 모델의 고급 시리즈인 DeepSeek-VL2를 소개합니다. 비전 구성 요소의 경우, 화면비가 다른 고해상도 이미지를 처리하기 위해 설계된 동적 타일링 비전 인코딩 전략을 통합했습니다. 언어 구성 요소의 경우, 키-값 캐시를 잠재 벡터로 압축하는 멀티헤드 잠재주의 메커니즘이 적용된 DeepSeekMoE 모델을 활용하여 효율적인 추론과 높은 처리량을 구현합니다. 향상된 시각 언어 데이터 세트로 학습된 DeepSeek-VL2는 시각적 질문 답변, 광학 문자 인식, 문서/표/차트 이해, 시각적 근거 등 다양한 작업에서 뛰어난 성능을 보여줍니다. 이 모델 시리즈는 세 가지 변형으로 구성되어 있습니다: 각각 1.0억, 2.8억, 4.5억 개의 활성화 파라미터를 갖춘 DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, DeepSeek-VL2입니다. DeepSeek-VL2는 기존 오픈 소스 고밀도 및 MoE 기반 모델과 비교하여 유사하거나 더 적은 수의 활성화된 파라미터로 경쟁력 있는 또는 최첨단 성능을 달성합니다. 코드와 사전 학습된 모델은 GitHub - deepseek-ai/DeepSeek-VL2: DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding 에서 공개적으로 액세스할 수 있습니다.

We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Codes and pre-trained models are publicly accessible at GitHub - deepseek-ai/DeepSeek-VL2: DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding.

논문 링크

더 읽어보기

https://x.com/omarsar0/status/1868696154067865659

제너레이티브 AI 및 멀티 에이전트를 사용하여 자동 피드백 제공 / Using Generative AI and Multi-Agents to Provide Automatic Feedback

논문 소개

과학 평가에서 학생의 반응에 대해 보다 정확하고 교육학적으로 건전한 피드백을 생성하는 2 에이전트 AI 시스템으로, 단일 에이전트 모델에 비해 과잉 칭찬과 같은 일반적인 오류를 크게 줄입니다.

A two-agent AI system that generates more accurate and pedagogically sound feedback for student responses in science assessments, significantly reducing common errors like over-praise compared to single-agent models.

논문 초록(Abstract)

이 연구는 교육적 맥락에서, 특히 과학 평가에서 학생이 구성한 응답에 대한 자동 피드백을 제공하기 위해 생성 AI와 다중 에이전트 시스템을 사용하는 방법을 조사합니다. 이 연구는 자동 피드백이라고 하는 다중 에이전트 시스템이 단일 에이전트 대규모 언어 모델(LLM)에서 흔히 발생하는 과잉 칭찬 및 과잉 추론과 같은 알려진 문제를 극복하고 GenAI 생성 피드백의 품질을 개선하는 방법을 탐구함으로써 이 분야의 주요 격차를 해결합니다. 이 연구에서는 피드백을 생성하는 에이전트와 피드백을 검증하고 개선하는 에이전트 두 개로 구성된 다중 에이전트 시스템을 개발했습니다. 이 시스템은 240명의 학생 응답 데이터 세트에 대해 테스트되었으며, 단일 에이전트 LLM의 성능과 비교되었습니다. 그 결과, 자동 피드백은 과잉 칭찬과 과잉 추론 오류를 현저히 감소시켜 보다 정확하고 교육학적으로 건전한 피드백을 제공한다는 것을 보여주었습니다. 이 연구 결과는 멀티 에이전트 시스템이 교육 환경에서 자동 피드백을 생성하는 데 보다 안정적인 솔루션을 제공할 수 있음을 시사하며, 확장 가능하고 개인화된 학습 지원의 잠재력을 강조합니다. 이러한 결과는 형성 평가에 AI를 활용하고자 하는 교육자와 연구자에게 중요한 의미를 가지며, 학생의 학습 성과를 향상시키는 보다 효과적인 피드백 메커니즘을 위한 길을 제시합니다.

This study investigates the use of generative AI and multi-agent systems to provide automatic feedback in educational contexts, particularly for student constructed responses in science assessments. The research addresses a key gap in the field by exploring how multi-agent systems, called AutoFeedback, can improve the quality of GenAI-generated feedback, overcoming known issues such as over-praise and over-inference that are common in single-agent large language models (LLMs). The study developed a multi-agent system consisting of two AI agents: one for generating feedback and another for validating and refining it. The system was tested on a dataset of 240 student responses, and its performance was compared to that of a single-agent LLM. Results showed that AutoFeedback significantly reduced the occurrence of over-praise and over-inference errors, providing more accurate and pedagogically sound feedback. The findings suggest that multi-agent systems can offer a more reliable solution for generating automated feedback in educational settings, highlighting their potential for scalable and personalized learning support. These results have important implications for educators and researchers seeking to leverage AI in formative assessments, offering a pathway to more effective feedback mechanisms that enhance student learning outcomes.

논문 링크

멀티모달 대규모 언어 모델 시대의 수학적 추론에 대한 종합적 조사: 벤치마크, 방법 및 과제 / A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges

논문 소개

2021년부터 200개 이상의 연구에서 벤치마크, 방법론 및 과제를 다루는 다중 모드 대규모 언어 모델(MLLM)의 수학적 추론 능력을 분석하는 종합적인 설문조사를 발표합니다.

Presents a comprehensive survey analyzing mathematical reasoning capabilities in multimodal large language models (MLLMs), covering benchmarks, methodologies, and challenges across 200+ studies since 2021.

논문 초록(Abstract)

인간 인지의 핵심 요소인 수학적 추론은 교육적 문제 해결부터 과학적 발전에 이르기까지 다양한 영역에서 필수적인 요소입니다. 인공 일반 지능(AGI)이 발전함에 따라 대규모 언어 모델(LLM)을 수학적 추론 작업과 통합하는 것이 점점 더 중요해지고 있습니다. 이 설문조사는 다중 모드 대규모 언어 모델(MLLM) 시대의 수학적 추론에 대한 최초의 종합적인 분석을 제공합니다. 2021년 이후 발표된 200개 이상의 연구를 검토하고, 멀티모달 환경에 초점을 맞춰 수학-LLM의 최신 발전 상황을 살펴봅니다. 이 분야를 벤치마크, 방법론, 과제의 세 가지 차원으로 분류합니다. 특히, 멀티모달 수학적 추론 파이프라인과 (M)LLM의 역할 및 관련 방법론에 대해 살펴봅니다. 마지막으로, 이 분야에서 AGI의 실현을 방해하는 5가지 주요 과제를 파악하여 향후 멀티모달 추론 능력 향상을 위한 방향에 대한 인사이트를 제공합니다. 이 설문조사는 연구 커뮤니티가 복잡한 복합 추론 작업을 처리하기 위해 LLM의 역량을 발전시키는 데 중요한 자료로 활용될 것입니다.

Mathematical reasoning, a core aspect of human cognition, is vital across many domains, from educational problem-solving to scientific advancements. As artificial general intelligence (AGI) progresses, integrating large language models (LLMs) with mathematical reasoning tasks is becoming increasingly significant. This survey provides the first comprehensive analysis of mathematical reasoning in the era of multimodal large language models (MLLMs). We review over 200 studies published since 2021, and examine the state-of-the-art developments in Math-LLMs, with a focus on multimodal settings. We categorize the field into three dimensions: benchmarks, methodologies, and challenges. In particular, we explore multimodal mathematical reasoning pipeline, as well as the role of (M)LLMs and the associated methodologies. Finally, we identify five major challenges hindering the realization of AGI in this domain, offering insights into the future direction for enhancing multimodal reasoning capabilities. This survey serves as a critical resource for the research community in advancing the capabilities of LLMs to tackle complex multimodal reasoning tasks.

논문 링크

더 읽어보기

https://x.com/omarsar0/status/1870126516832792811

대규모 언어 모델에서 정확한 길이 제어 / Precise Length Control in Large Language Models

논문 소개

사전 학습된 디코더 전용 LLM을 적용하여 원하는 길이의 응답을 생성하고, 2차 길이 차이 위치 인코딩을 입력 임베딩에 통합하여 사용자가 설정한 응답 터미널 길이까지 카운트다운할 수 있으며, 품질 저하 없이 평균 토큰 오류를 3 토큰 미만으로 달성한다고 주장합니다.

Adapts a pre-trained decoder-only LLM to produce responses of a desired length; integrates a secondary length-difference positional encoding into the input embeddings which enables counting down to a user-set response terminal length; claims to achieve mean token errors of less than 3 tokens without compromising quality.

논문 초록(Abstract)

대규모 언어 모델(LLM)은 챗봇, 요약, 질문 답변과 같은 애플리케이션을 지원하는 프로덕션 시스템에서 점점 더 많이 사용되고 있습니다. 이러한 성공에도 불구하고, 특히 구조화된 출력이나 특정 수준의 세부 정보가 필요한 작업의 경우 응답 길이를 제어하는 것은 여전히 중요한 과제로 남아 있습니다. 이 연구에서는 응답 길이를 정밀하게 제어하기 위해 사전 학습된 디코더 전용 LLM을 조정하는 방법을 제안합니다. 이 접근 방식은 사용자가 설정한 응답 종료 길이까지 카운트다운하는 2차 길이 차이 위치 인코딩(LDPE)을 입력 임베딩에 통합합니다. LDPE로 미세 조정하면 모델이 원하는 길이로 일관되게 응답을 종료하는 방법을 학습하여 평균 토큰 오류를 3 토큰 미만으로 줄일 수 있습니다. 또한 정확한 목표가 아닌 유연한 상한 길이 제어를 가능하게 하는 확장 기능인 Max New Tokens++를 도입했습니다. 질문 답변 및 문서 요약과 같은 작업에 대한 실험 결과, 이 방법을 사용하면 응답 품질 저하 없이 정확한 길이 제어가 가능하다는 것이 입증되었습니다.

Large Language Models (LLMs) are increasingly used in production systems, powering applications such as chatbots, summarization, and question answering. Despite their success, controlling the length of their response remains a significant challenge, particularly for tasks requiring structured outputs or specific levels of detail. In this work, we propose a method to adapt pre-trained decoder-only LLMs for precise control of response length. Our approach incorporates a secondary length-difference positional encoding (LDPE) into the input embeddings, which counts down to a user-set response termination length. Fine-tuning with LDPE allows the model to learn to terminate responses coherently at the desired length, achieving mean token errors of less than 3 tokens. We also introduce Max New Tokens++, an extension that enables flexible upper-bound length control, rather than an exact target. Experimental results on tasks such as question answering and document summarization demonstrate that our method enables precise length control without compromising response quality.

논문 링크

더 읽어보기

https://x.com/omarsar0/status/1869030043084845453

원문

이 글은 GPT 모델로 정리한 것으로, 잘못된 부분이 있을 수 있으니 글 아래쪽의 원문도 함께 참고해주세요! 읽으시면서 어색하거나 잘못된 내용을 발견하시면 덧글로 알려주시기를 부탁드립니다.*

파이토치 한국 사용자 모임이 정리한 이 글이 유용하셨나요? 회원으로 가입하시면 주요 글들을 이메일로 보내드립니다! (기본은 Weekly지만 Daily로 변경도 가능합니다.)

아래쪽에 좋아요를 눌러주시면 뉴스 발행에 힘이 됩니다~

[2024/12/16 ~ 12/22] 이번 주의 주요 ML 논문 (Top ML Papers of the Week)

PyTorchKR​

Genesis

논문 소개

논문 링크

더 읽어보기

대규모 언어 모델에서 정렬 위조 / Alignment faking in large language models

논문 소개

논문 초록(Abstract)

논문 링크

더 읽어보기

TheAgentCompany: 결과적인 실제 업무에서 LLM 에이전트 벤치마킹하기 / TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

논문 소개

논문 초록(Abstract)

논문 링크

LLM이 그래프를 텍스트 어트리뷰션 그래프로 변환할 수 있나요? / Can LLMs Convert Graphs to Text-Attributed Graphs?

논문 소개

논문 초록(Abstract)

논문 링크

더 읽어보기

Qwen2.5 기술 보고서 / Qwen2.5 Technical Report

논문 소개

논문 초록(Abstract)

논문 링크

더 읽어보기

제안자-에이전트-평가자(PAE): 기초 모델 인터넷 에이전트를 위한 자율적 기술 발견 / Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents

논문 소개

논문 초록(Abstract)

논문 링크

더 읽어보기

DeepSeek-VL2: 고급 멀티모달 이해를 위한 전문가 혼합 비전-언어 모델 / DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

논문 소개

논문 초록(Abstract)

논문 링크

더 읽어보기

제너레이티브 AI 및 멀티 에이전트를 사용하여 자동 피드백 제공 / Using Generative AI and Multi-Agents to Provide Automatic Feedback

논문 소개

논문 초록(Abstract)

논문 링크

멀티모달 대규모 언어 모델 시대의 수학적 추론에 대한 종합적 조사: 벤치마크, 방법 및 과제 / A Survey of Mathematical Reasoning in the Era of Multimodal Large Language Model: Benchmark, Method & Challenges

논문 소개

논문 초록(Abstract)

논문 링크

더 읽어보기

대규모 언어 모델에서 정확한 길이 제어 / Precise Length Control in Large Language Models

논문 소개

논문 초록(Abstract)

논문 링크

더 읽어보기

원문

PyTorchKR