[2024/12/09 ~ 12/15] Top ML Papers of the Week

9bow · December 16, 2024, 3:15 AM

PyTorchKR

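An analysis of this week's selected papers shows that the main trend is a range of topics around large language models (LLMs). In particular, many papers try to strengthen LLM reasoning or explore new ways to use LLMs. For example, 'Training LLMs to Reason in a Continuous Latent Space' and 'AutoReason Improves Multi-step Reasoning' study approaches for refining LLM reasoning, and 'A Survey on LLMs-as-Judges' explores how LLMs can be used across various fields.
This trend appears to stand out because the AI community is actively working to make the most of the potential of LLMs. LLMs have become a powerful tool that can greatly improve natural language processing and conversational AI, but many open problems remain. Extending their usefulness to complex problems and to situations that require multi-step reasoning is especially important. These studies can be read as attempts to lay the groundwork for applying LLMs more usefully in a variety of real-world applications.
This research direction also matters because it takes the ethical use and social impact of the technology into account. As the possibility of LLMs acting as 'judges' in various fields is discussed, research that strengthens their accountability and accuracy becomes essential. Beyond raw technical performance, such work should help LLMs develop in a direction that is safer and more trustworthy for society.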

Training Large Language Models to Reason in a Continuous Latent Space

Paper introduction

Presents Coconut (Chain of Continuous Thought), a novel paradigm that enables LLMs to reason in a continuous latent space rather than in natural language. Coconut takes the last hidden state of the LLM as the reasoning state and feeds it back to the LLM as the next input embedding, directly in the continuous space. The authors call the result a "continuous thought", which augments an LLM's capability on reasoning tasks. The paper reports improved performance on complex reasoning tasks through emergent breadth-first search capabilities.

Abstract

Large language models (LLMs) are restricted to reason in the "language space", where they typically express the reasoning process with a chain-of-thought (CoT) to solve a complex reasoning problem. However, we argue that language space may not always be optimal for reasoning. For example, most word tokens are primarily for textual coherence and not essential for reasoning, while some critical tokens require complex planning and pose huge challenges to LLMs. To explore the potential of LLM reasoning in an unrestricted latent space instead of using natural language, we introduce a new paradigm Coconut (Chain of Continuous Thought). We utilize the last hidden state of the LLM as a representation of the reasoning state (termed "continuous thought"). Rather than decoding this into a word token, we feed it back to the LLM as the subsequent input embedding directly in the continuous space. Experiments show that Coconut can effectively augment the LLM on several reasoning tasks. This novel latent reasoning paradigm leads to emergent advanced reasoning patterns: the continuous thought can encode multiple alternative next reasoning steps, allowing the model to perform a breadth-first search (BFS) to solve the problem, rather than prematurely committing to a single deterministic path like CoT. Coconut outperforms CoT in certain logical reasoning tasks that require substantial backtracking during planning, with fewer thinking tokens during inference. These findings demonstrate the promise of latent reasoning and offer valuable insights for future research.
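The feedback loop described in the abstract can be illustrated with a toy model. This is only a rough sketch, not the authors' implementation: `forward`, `decode`, `reason`, the three-token `EMBED` table, and all weights are hypothetical stand-ins, and the paper's training procedure is not shown. The only point is the difference between re-embedding a decoded token (CoT style) and feeding the hidden state straight back in (Coconut style).

```python
import math

# Toy token embeddings; the values are made up for illustration.
EMBED = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [0.0, 0.0, 1.0],
}

def forward(state, x):
    """Hypothetical stand-in for one LLM forward pass; returns the last hidden state."""
    rotated = state[1:] + state[:1]  # mix information across dimensions
    return [math.tanh(0.6 * r + 0.9 * xi) for r, xi in zip(rotated, x)]

def decode(hidden):
    """Pick the token whose embedding has the largest dot product with the hidden state."""
    return max(EMBED, key=lambda t: sum(h * e for h, e in zip(hidden, EMBED[t])))

def reason(prompt_token, n_steps, latent):
    """Run n_steps reasoning steps and return (decoded token, final hidden state)."""
    state = [0.0, 0.0, 0.0]
    x = EMBED[prompt_token]
    for _ in range(n_steps):
        state = forward(state, x)
        if latent:
            x = state                 # continuous thought: hidden state is the next input
        else:
            x = EMBED[decode(state)]  # language space: collapse to one token, then re-embed
    return decode(state), state

latent_token, latent_state = reason("a", 2, latent=True)
cot_token, cot_state = reason("a", 2, latent=False)
```

After one step both modes hold the same hidden state; they diverge from the second step on, because the language-space path discards everything except the single decoded token.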

Paper link

Read more

https://x.com/omarsar0/status/1866518791733342563

Phi-4 Technical Report

Paper introduction

Presents phi-4, a 14B model that surpasses its teacher model on STEM-QA capabilities. It also reports strong performance on reasoning-focused benchmarks due to improved data, training curriculum, and innovations in the post-training scheme.

Abstract

We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size, especially on reasoning-focused benchmarks, due to improved data, training curriculum, and innovations in the post-training scheme.

Paper link

Read more

https://x.com/omarsar0/status/1867609628529635574

Asynchronous LLM Function Calling

Paper introduction

Proposes AsyncLM, a system for asynchronous LLM function calling. The authors design an in-context protocol for function calls and interrupts, provide a fine-tuning strategy to adapt LLMs to the interrupt semantics, and implement these mechanisms efficiently in the LLM inference process. AsyncLM can reduce task completion latency by 1.6x-5.4x compared to synchronous function calling, and it enables LLMs to generate and execute function calls concurrently.

Abstract

Large language models (LLMs) use function calls to interface with external tools and data sources. However, the current approach to LLM function calling is inherently synchronous, where each call blocks LLM inference, limiting LLM operation and concurrent function execution. In this work, we propose AsyncLM, a system for asynchronous LLM function calling. AsyncLM improves LLM's operational efficiency by enabling LLMs to generate and execute function calls concurrently. Instead of waiting for each call's completion, AsyncLM introduces an interrupt mechanism to asynchronously notify the LLM in-flight when function calls return. We design an in-context protocol for function calls and interrupts, provide a fine-tuning strategy to adapt LLMs to the interrupt semantics, and implement these mechanisms efficiently in the LLM inference process. We demonstrate that AsyncLM can reduce end-to-end task completion latency by 1.6x-5.4x compared to synchronous function calling on a set of benchmark tasks in the Berkeley function calling leaderboard (BFCL). Furthermore, we discuss how interrupt mechanisms can be extended to enable novel human-LLM or LLM-LLM interactions.
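To make the idea of overlapping generation with function execution concrete, here is a small `asyncio` sketch. It is an illustration under assumptions, not AsyncLM itself: `call_tool`, `generate_tokens`, `run_async`, and the `[INTR ...]` marker are hypothetical names, the sleeps stand in for tool latency and decoding time, and the paper's actual in-context protocol and fine-tuning are not reproduced.

```python
import asyncio

async def call_tool(name, delay):
    """Hypothetical external function; the sleep simulates I/O latency."""
    await asyncio.sleep(delay)
    return f"{name}:done"

async def generate_tokens(n, per_token=0.01):
    """Stand-in for LLM decoding that yields one token at a time."""
    for i in range(n):
        await asyncio.sleep(per_token)
        yield f"tok{i}"

async def run_async(calls, n_tokens):
    """Keep decoding while calls run; append an interrupt marker when a call returns."""
    pending = [asyncio.create_task(call_tool(name, delay)) for name, delay in calls]
    transcript = []
    async for tok in generate_tokens(n_tokens):
        transcript.append(tok)
        # Deliver results of any calls that finished while this token was decoded.
        for task in [t for t in pending if t.done()]:
            transcript.append(f"[INTR {task.result()}]")
            pending.remove(task)
    for task in pending:  # calls that outlived decoding are awaited at the end
        transcript.append(f"[INTR {await task}]")
    return transcript

transcript = asyncio.run(run_async([("search", 0.02), ("weather", 0.03)], n_tokens=5))
print(transcript)
```

In a synchronous design the decoder would stop at each call until it returned, so total time would be roughly decoding time plus the sum of call latencies. Here the calls run in the background and their results are spliced into the stream when they arrive.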

Paper link

Read more

https://x.com/omarsar0/status/1866855077983686804

MAG-V: A Multi-Agent Framework for Synthetic Data Generation and Verification

Paper introduction

A multi-agent framework that first generates a dataset of questions that mimic customer queries, then reverse-engineers alternate questions from the responses to verify agent trajectories. The paper reports that the generated synthetic data can improve agent performance on actual customer queries. For trajectory verification, it finds that simple ML baselines with feature engineering can match the performance of more expensive and capable models.

Abstract

Extending the capabilities of Large Language Models (LLMs) with functions or tools for environment interaction has led to the emergence of the agent paradigm. In industry, training an LLM is not always feasible because of the scarcity of domain data, legal holds on proprietary customer data, rapidly changing business requirements, and the need to prototype new assistants. Agents provide an elegant solution to the above by relying on the zero-shot reasoning abilities of the underlying LLM and utilizing tools to explore and reason over customer data and respond to user requests. However, there are two concerns here: (I) acquiring large-scale customer queries for agent testing is time-consuming, and (II) high reliance on the tool call sequence (or trajectory) followed by the agent to respond to user queries may lead to unexpected or incorrect behavior. To address this, we propose MAG-V, a multi-agent framework to first generate a dataset of questions that mimic customer queries; and second, reverse-engineer alternate questions from the responses for trajectory verification. Initial results indicate that our synthetic data can improve agent performance on actual customer queries. Furthermore, our trajectory verification methodology, inspired by distant supervision and using traditional machine learning (ML) models, outperforms a GPT-4o judge baseline by 11% accuracy and matches the performance of a GPT-4 judge on our constructed dataset. Overall, our approach is a step towards unifying diverse task agents into a cohesive framework for achieving an aligned objective.

Paper link

Read more

https://x.com/omarsar0/status/1866143542726340890

Clio

Paper introduction

Proposes a platform that uses AI assistants to analyze and surface private, aggregated usage patterns from millions of Claude.ai conversations. It enables insights into real-world AI use while protecting user privacy. The system helps identify usage trends, safety risks, and coordinated misuse attempts without human reviewers needing to read raw conversations.

Paper link

Read more

https://x.com/AnthropicAI/status/1867325190352576780

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

Paper introduction

Presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations.

Abstract

The rapid advancement of Large Language Models (LLMs) has driven their expanding application across various fields. One of the most promising applications is their role as evaluators based on natural language responses, referred to as "LLMs-as-judges". This framework has attracted growing attention from both academia and industry due to its excellent effectiveness, ability to generalize across tasks, and interpretability in the form of natural language. This paper presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations. We begin by providing a systematic definition of LLMs-as-Judges and introduce their functionality (Why use LLM judges?). Then we address methodology to construct an evaluation system with LLMs (How to use LLM judges?). Additionally, we investigate the potential domains for their application (Where to use LLM judges?) and discuss methods for evaluating them in various contexts (How to evaluate LLM judges?). Finally, we provide a detailed analysis of the limitations of LLM judges and discuss potential future directions. Through a structured and comprehensive analysis, we aim to provide insights on the development and application of LLMs-as-judges in both research and practice. We will continue to maintain the relevant resource list at https://github.com/CSHaitao/Awesome-LLMs-as-Judges.
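For the meta-evaluation question ("How to evaluate LLM judges?"), one common way to check a judge is to measure how well its verdicts agree with human labels. The abstract does not name a specific metric, so the choice of Cohen's kappa below is an assumption made for illustration, and the label lists are invented data.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled at random with their own label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1.0 - expected)

# Invented pairwise-preference verdicts ("A" or "B" wins) for six comparisons.
human = ["A", "A", "B", "B", "A", "B"]
judge = ["A", "A", "B", "A", "A", "B"]
kappa = cohens_kappa(human, judge)
```

Raw accuracy would be 5/6 here, but kappa discounts the agreement that two raters would reach by chance, which matters when one verdict is much more frequent than the other.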

Paper link

Read more

https://github.com/CSHaitao/Awesome-LLMs-as-Judges

https://x.com/omarsar0/status/1866541394015518824

AutoReason: Automatic Few-Shot Reasoning Decomposition

Paper introduction

Proposes a method to automatically generate rationales for queries using CoT prompting. This transforms zero-shot queries into few-shot reasoning traces, which the LLM then uses as CoT exemplars. The authors claim it improves reasoning in weaker LLMs.

Abstract

Chain of Thought (CoT) was introduced in recent research as a method for improving step-by-step reasoning in Large Language Models. However, CoT has limitations, such as its need for hand-crafted few-shot exemplar prompts and its lack of any capability to adjust itself to different queries. In this work, we propose a system to automatically generate rationales using CoT. Our method improves multi-step implicit reasoning capabilities by decomposing the implicit query into several explicit questions. This provides interpretability for the model, improving reasoning in weaker LLMs. We test our approach with two Q&A datasets: StrategyQA and HotpotQA. We show an increase in accuracy with both, especially on StrategyQA. To facilitate further research in this field, the complete source code for this study has been made publicly available on GitHub: https://github.com/miralab-ai/autoreason.
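The decomposition step can be pictured as a two-call pipeline: one model turns the implicit query into explicit sub-questions, and a second model answers with that rationale in its context. The sketch below assumes this split. The prompt wording, the function names, and the stub models are all hypothetical and are not taken from the AutoReason repository.

```python
def build_rationale_prompt(query):
    """Instruction asking for an explicit decomposition (wording is hypothetical)."""
    return (
        "Break the question into explicit sub-questions, one per line.\n"
        f"Question: {query}\nSub-questions:"
    )

def build_answer_prompt(query, rationale):
    """Attach the generated rationale as a reasoning trace before asking for the answer."""
    return f"Question: {query}\nReasoning steps:\n{rationale}\nAnswer:"

def autoreason(query, rationale_llm, answer_llm):
    """Two calls: one model writes the rationale, another answers with it in context."""
    rationale = rationale_llm(build_rationale_prompt(query)).strip()
    return answer_llm(build_answer_prompt(query, rationale))

# Stub models so the sketch runs without any API; real model calls would go here.
def fake_rationale_llm(prompt):
    return "- Is a tomato a fruit?\n- Do fruit salads contain fruit?"

def fake_answer_llm(prompt):
    return "yes" if "Reasoning steps:" in prompt else "unknown"

result = autoreason("Could a tomato go in a fruit salad?", fake_rationale_llm, fake_answer_llm)
```

Passing the models in as plain callables keeps the sketch independent of any particular API, so a stronger model can write the rationale while a weaker one produces the final answer.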

Paper link

Read more

https://github.com/miralab-ai/autoreason

https://x.com/omarsar0/status/1867224350287372555

The Byte Latent Transformer (BLT)

Paper introduction

Introduces a byte-level language model architecture that matches tokenization-based LLM performance while improving efficiency and robustness. It uses a dynamic method of grouping bytes into patches based on the entropy of the next byte, allocating more compute to complex predictions while using larger patches for more predictable sequences. BLT demonstrates the ability to match or exceed the performance of models like Llama 3 while using up to 50% fewer FLOPs during inference.

Abstract

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented dynamically based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first FLOP-controlled scaling study of byte-level models up to 8B parameters with 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
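Entropy-based patching can be sketched in a few lines. The version below is a simplification under stated assumptions: it estimates next-byte entropy with a bigram count table rather than a learned byte-level model, and it starts a new patch whenever that entropy exceeds a fixed threshold. The function names, the threshold rule, and the 8-bit fallback for unseen contexts are illustrative choices, not details from the paper.

```python
import math
from collections import Counter, defaultdict

def train_bigram(data: bytes):
    """Count which byte follows which; a toy stand-in for a next-byte model."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        counts[prev][nxt] += 1
    return counts

def next_byte_entropy(counts, prev: int) -> float:
    """Shannon entropy (in bits) of the next byte given the previous byte."""
    dist = counts.get(prev)
    if not dist:
        return 8.0  # unseen context: assume a uniform guess over 256 byte values
    total = sum(dist.values())
    return -sum(c / total * math.log2(c / total) for c in dist.values())

def entropy_patches(data: bytes, counts, threshold: float):
    """Split data into patches, opening a new patch where the next byte is hard to predict."""
    if not data:
        return []
    patches, start = [], 0
    for i in range(1, len(data)):
        if next_byte_entropy(counts, data[i - 1]) > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

counts = train_bigram(b"ababab")
predictable = entropy_patches(b"abab", counts, threshold=0.5)
surprising = entropy_patches(b"abxab", counts, threshold=0.5)
```

With this toy model the fully predictable input stays in one long patch, while the unseen byte `x` forces a boundary right after it. That is the intended behaviour: long patches where the data is easy, shorter ones where it is not.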

Paper link

Read more

https://x.com/ArtidoroPagnoni/status/1867601413741981804

Does RLHF Scale?

Paper introduction

This new paper explores the impacts of key components in the RLHF framework. Summary of main findings: 1) RLHF doesn't scale as effectively as pretraining in LLMs, with larger policy models benefiting less from RLHF when using a fixed reward model, 2) when increasing the number of responses sampled per prompt during policy training, performance improves initially but plateaus quickly, typically around 4-8 samples, 3) using larger reward models leads to better performance in reasoning tasks, but the improvements can be inconsistent across different types of tasks, and 4) increasing training data diversity for reward models is more effective than increasing response diversity per prompt, but policy training shows diminishing returns after the early stages regardless of additional data.

Abstract

This study explores the scaling properties of Reinforcement Learning from Human Feedback (RLHF) in Large Language Models (LLMs). Although RLHF is considered an important step in post-training of LLMs, its scaling potential is still largely unknown. We systematically analyze key components in the RLHF framework (model size, data composition, and inference budget) and their impacts on performance. Our findings show that increasing data diversity and volume improves reward model performance, helping process-supervision models scale better. For policy training, more response samples per prompt boost performance initially but quickly plateau. Larger reward models offer modest gains in policy training. In addition, larger policy models benefit less from RLHF with a fixed reward model. Overall, RLHF scales less efficiently than pretraining, with diminishing returns from additional computational resources. Based on these observations, we propose strategies to optimize RLHF performance within computational limits.

Paper link

Read more

https://x.com/omarsar0/status/1866525606562680954

Granite Guardian

Paper introduction

IBM open-sources Granite Guardian, a suite of safeguards for risk detection in LLMs. The authors claim that, with AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space.

Abstract

We introduce the Granite Guardian models, a suite of safeguards designed to provide risk detection for prompts and responses, enabling safe and responsible use in combination with any large language model (LLM). These models offer comprehensive coverage across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset combining human annotations from diverse sources and synthetic data, Granite Guardian models address risks typically overlooked by traditional risk detection models, such as jailbreaks and RAG-specific issues. With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. Released as open-source, Granite Guardian aims to promote responsible AI development across the community. GitHub: https://github.com/ibm-granite/granite-guardian
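To read the AUC figures, it may help to recall what the metric measures. Assuming AUC here means the area under the ROC curve (the abstract does not spell this out), it equals the probability that a randomly chosen risky example gets a higher risk score than a randomly chosen safe one, with ties counted as half. The sketch below computes it directly from that definition. The labels and scores are invented and have nothing to do with the Granite Guardian benchmarks.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # A positive outranking a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented detector outputs: 1 = risky prompt, 0 = safe prompt.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so values such as 0.871 mean the detector ranks a risky example above a safe one in most, but not all, pairs.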

Paper link

Read more

https://github.com/ibm-granite/granite-guardian

https://x.com/omarsar0/status/1866852443621036228

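Original

This post was compiled with a GPT model and may contain errors, so please also refer to the original linked at the bottom of the post. If you notice anything awkward or incorrect while reading, please let us know in the comments.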
