[2025/01/06 ~ 01/12] Top ML Papers of the Week

9bow · January 13, 2025, 12:39 AM

PyTorchKR

Research on large language models (LLMs) and on improving their performance makes up a large share of this week's selected papers. Papers such as "Long Context vs. RAG for LLMs", "Towards System 2 Reasoning", "Can LLMs Design Good Questions?", and "A Survey on LLMs" focus on applications of LLMs and on methodological improvements. Others, such as "Agent Laboratory" and "Cosmos World Foundation Model", explore how large models can be applied in more complex environments.
LLM research is this prominent because the AI research community has recently paid close attention to the potential of LLMs. LLMs perform strongly across many natural language processing tasks and may also prove useful for other AI problems. Researchers are therefore working to improve LLM performance further, explore new application areas, and overcome the models' limitations.
As the range of LLM applications expands, improving model interpretability and efficiency is also becoming an important research topic. This work is essential for AI technology to develop in a more practical and safe way, which appears to be why research in this area is increasing.

Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks

Paper Introduction

An approach that leverages the capabilities of long-context LLMs by preloading the LLM with all relevant documents in advance and precomputing the key-value (KV) cache. The preloaded context helps the model provide contextually accurate answers without additional retrieval at runtime. The authors suggest that CAG is a useful alternative to RAG when the documents or knowledge to retrieve from are of limited, manageable size.

Abstract

Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources. However, RAG introduces challenges such as retrieval latency, potential errors in document selection, and increased system complexity. With the advent of large language models (LLMs) featuring significantly extended context windows, this paper proposes an alternative paradigm, cache-augmented generation (CAG), that bypasses real-time retrieval. Our method involves preloading all relevant resources, especially when the documents or knowledge for retrieval are of a limited and manageable size, into the LLM's extended context and caching its runtime parameters. During inference, the model utilizes these preloaded parameters to answer queries without additional retrieval steps. Comparative analyses reveal that CAG eliminates retrieval latency and minimizes retrieval errors while maintaining context relevance. Performance evaluations across multiple benchmarks highlight scenarios where long-context LLMs either outperform or complement traditional RAG pipelines. These findings suggest that, for certain applications, particularly those with a constrained knowledge base, CAG provides a streamlined and efficient alternative to RAG, achieving comparable or superior results with reduced complexity.
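
The preloading idea can be illustrated with a minimal sketch, assuming a toy single-layer, single-head attention model written in pure Python. The weight matrices, function names, and token vectors below are hypothetical and are not taken from the paper's code. The sketch only shows that keys and values for a fixed set of documents can be computed once and reused, and that this gives the same outputs as re-encoding the documents for every query.

```python
import math

# Hypothetical toy projection matrices (2-dimensional model), for illustration only.
W_Q = [[1.0, 0.5], [0.0, 1.0]]
W_K = [[1.0, 0.0], [0.5, 1.0]]
W_V = [[0.0, 1.0], [1.0, 0.5]]


def project(x, w):
    """Multiply a row vector x by a square matrix w."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum((w / total) * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def build_kv_cache(context_tokens):
    """Offline step: compute keys and values for every context token once."""
    keys = [project(t, W_K) for t in context_tokens]
    values = [project(t, W_V) for t in context_tokens]
    return keys, values


def answer_with_cache(cache, query_tokens):
    """Online step: only the query tokens are encoded; the context K/V are reused."""
    keys, values = list(cache[0]), list(cache[1])  # copy so the cache stays reusable
    outputs = []
    for token in query_tokens:
        keys.append(project(token, W_K))
        values.append(project(token, W_V))
        outputs.append(attend(project(token, W_Q), keys, values))
    return outputs


def answer_from_scratch(context_tokens, query_tokens):
    """Baseline: re-encode context plus query on every call (causal attention)."""
    seq = context_tokens + query_tokens
    keys = [project(t, W_K) for t in seq]
    values = [project(t, W_V) for t in seq]
    start = len(context_tokens)
    return [
        attend(project(seq[i], W_Q), keys[: i + 1], values[: i + 1])
        for i in range(start, len(seq))
    ]
```

Because `answer_with_cache` copies the cached lists before appending query tokens, the same cache can serve many queries, which is where the latency saving would come from when the knowledge base is small enough to fit in context.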

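Paper Link

https://arxiv.org/pdf/2412.15605

Read More

CAG (Cache-Augmented Generation): Research on a RAG Alternative That Uses LLM Long Context (Reading & Info Sharing)

Introduction to the CAG research: Advances in large language models (LLMs) have brought breakthrough results in natural language processing (NLP) tasks. In particular, retrieval-augmented generation (RAG) has drawn attention as a way to substantially improve a model's contextual understanding by drawing on external knowledge bases. RAG dynamically retrieves from external data sources for a given task and generates a contextually appropriate response from what it retrieves. It has performed very well on knowledge-intensive tasks such as open-domain question answering. However, RAG has the following limitations: real-time retrieval introduces latency, which can degrade the user experience. Selecting inaccurate or irrelevant documents at the retrieval stage significantly degrades response quality. A system architecture that must integrate the retrieval and generation modules … complexity…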

https://x.com/omarsar0/status/1876721221083214200

Agent Laboratory: Using LLM Agents as Research Assistants

Paper Introduction

An approach that leverages LLM agents capable of completing the entire research process. The main findings are: 1) agents driven by o1-preview produced the best research outcomes, 2) the generated machine learning code can achieve state-of-the-art performance compared to existing methods, 3) human feedback further improves the quality of research, and 4) Agent Laboratory significantly reduces research expenses.

Abstract

Historically, scientific discovery has been a lengthy and costly process, demanding substantial time and resources from initial conception to final results. To accelerate scientific discovery, reduce research costs, and improve research quality, we introduce Agent Laboratory, an autonomous LLM-based framework capable of completing the entire research process. This framework accepts a human-provided research idea and progresses through three stages (literature review, experimentation, and report writing) to produce comprehensive research outputs, including a code repository and a research report, while enabling users to provide feedback and guidance at each stage. We deploy Agent Laboratory with various state-of-the-art LLMs and invite multiple researchers to assess its quality by participating in a survey, providing human feedback to guide the research process, and then evaluating the final paper. We found that: (1) Agent Laboratory driven by o1-preview generates the best research outcomes; (2) the generated machine learning code is able to achieve state-of-the-art performance compared to existing methods; (3) human involvement, providing feedback at each stage, significantly improves the overall quality of research; (4) Agent Laboratory significantly reduces research expenses, achieving an 84% decrease compared to previous autonomous research methods. We hope Agent Laboratory enables researchers to allocate more effort toward creative ideation rather than low-level coding and writing, ultimately accelerating scientific discovery.
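
The three-stage flow with optional human feedback described above can be sketched roughly as follows. This is only an illustrative skeleton under stated assumptions: the stage functions are hypothetical stubs standing in for LLM agents, and `run_pipeline` and its signature are invented here, not part of the Agent Laboratory codebase.

```python
from typing import Callable, Dict, List, Optional, Tuple

Stage = Callable[[str], str]
Feedback = Callable[[str, str], Optional[str]]


# Hypothetical stub stages; a real system would call LLM agents here.
def literature_review(idea: str) -> str:
    return f"review({idea})"


def experimentation(review: str) -> str:
    return f"experiment({review})"


def report_writing(results: str) -> str:
    return f"report({results})"


DEFAULT_STAGES: List[Tuple[str, Stage]] = [
    ("literature_review", literature_review),
    ("experimentation", experimentation),
    ("report_writing", report_writing),
]


def run_pipeline(
    idea: str,
    stages: List[Tuple[str, Stage]] = DEFAULT_STAGES,
    feedback: Optional[Feedback] = None,
) -> Dict[str, str]:
    """Run the stages in order, passing each stage's output to the next.

    If a feedback callback is given, it is consulted after every stage and
    any note it returns is attached to the artifact the next stage receives.
    """
    artifact = idea
    outputs: Dict[str, str] = {}
    for name, stage in stages:
        artifact = stage(artifact)
        if feedback is not None:
            note = feedback(name, artifact)
            if note:
                artifact = f"{artifact} [human note: {note}]"
        outputs[name] = artifact
    return outputs
```

Calling `run_pipeline("my idea")` runs fully autonomously, while passing a `feedback` callable models the human-in-the-loop mode that the abstract reports as improving overall quality.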

Paper Link

Read More

https://x.com/omarsar0/status/1877382581358047375

Long Context vs. RAG for LLMs: An Evaluation and Revisits

Paper Introduction

A comprehensive evaluation of long-context (LC) LLMs against RAG systems. The three main findings are: 1) LC generally outperforms RAG on question-answering benchmarks, 2) summarization-based retrieval performs comparably to LC, while chunk-based retrieval lags behind, and 3) RAG has advantages on dialogue-based and general question queries.

Abstract

Extending context windows (i.e., Long Context, LC) and using retrievers to selectively access relevant information (i.e., Retrieval-Augmented Generation, RAG) are the two main strategies to enable LLMs to incorporate extremely long external contexts. This paper revisits recent studies on this topic, highlighting their key insights and discrepancies. We then provide a more comprehensive evaluation by filtering out questions answerable without external context, identifying the most effective retrieval methods, and expanding the datasets. We show that LC generally outperforms RAG in question-answering benchmarks, especially for Wikipedia-based questions. Summarization-based retrieval performs comparably to LC, while chunk-based retrieval lags behind. However, RAG has advantages in dialogue-based and general question queries. These insights underscore the trade-offs between RAG and LC strategies, offering guidance for future optimization of LLMs with external knowledge sources. We also provide an in-depth discussion on this topic, highlighting the overlooked importance of context relevance in existing studies.
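
To make the two strategies concrete, here is a minimal sketch that builds the model input both ways. It assumes a naive fixed-size word chunker and a word-overlap scorer as stand-ins for a real retriever; all function names and defaults are hypothetical and the paper's actual retrievers are not reproduced here.

```python
def chunk_words(text, size):
    """Split text into consecutive chunks of `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap_score(question, chunk):
    """Count distinct lowercase words shared by the question and the chunk."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))


def rag_context(document, question, chunk_size=4, top_k=1):
    """Chunk-based retrieval: keep only the top_k chunks most similar to the question."""
    chunks = chunk_words(document, chunk_size)
    ranked = sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)
    return "\n".join(ranked[:top_k])


def long_context(document, question):
    """Long-context strategy: hand the whole document to the model unchanged."""
    return document
```

The sketch also hints at one reason chunk-based retrieval might lag: an answer that straddles a chunk boundary, or whose wording differs from the question, can be dropped by `rag_context`, whereas `long_context` always keeps it at the cost of a longer input.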

Paper Link

Read More

https://x.com/omarsar0/status/1876281074147299569

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Paper Introduction

A framework that combines large reasoning models (LRMs) with agentic search and document refinement capabilities to tackle knowledge insufficiency. The framework enables autonomous knowledge retrieval during reasoning and demonstrates strong performance across complex tasks, outperforming both baseline models and human experts.

Abstract

Large reasoning models (LRMs) like OpenAI-o1 have demonstrated impressive long stepwise reasoning capabilities through large-scale reinforcement learning. However, their extended reasoning processes often suffer from knowledge insufficiency, leading to frequent uncertainties and potential errors. To address this limitation, we introduce Search-o1, a framework that enhances LRMs with an agentic retrieval-augmented generation (RAG) mechanism and a Reason-in-Documents module for refining retrieved documents. Search-o1 integrates an agentic search workflow into the reasoning process, enabling dynamic retrieval of external knowledge when LRMs encounter uncertain knowledge points. Additionally, due to the verbose nature of retrieved documents, we design a separate Reason-in-Documents module to deeply analyze the retrieved information before injecting it into the reasoning chain, minimizing noise and preserving coherent reasoning flow. Extensive experiments on complex reasoning tasks in science, mathematics, and coding, as well as six open-domain QA benchmarks, demonstrate the strong performance of Search-o1. This approach enhances the trustworthiness and applicability of LRMs in complex reasoning tasks, paving the way for more reliable and versatile intelligent systems. The code is available at https://github.com/sunnynexus/Search-o1.
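
The interleaving of reasoning, search, and document refinement described in the abstract can be sketched as a simple control loop. This is a sketch under stated assumptions, not the Search-o1 implementation: the `<search>` and `<result>` markers, the function name, and the three callables (`reason_step`, `search`, `refine`) are hypothetical placeholders for the reasoning model, the search backend, and the Reason-in-Documents module.

```python
import re

# Hypothetical marker the reasoning model is assumed to emit when it needs a fact.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)


def agentic_reasoning(question, reason_step, search, refine, max_searches=3):
    """Alternate between reasoning and retrieval until no more searches are requested.

    reason_step(chain_text) -> next reasoning segment (may contain a <search> request)
    search(query)           -> list of raw documents
    refine(chain_text, docs)-> short, relevant digest of the documents
    """
    chain = [question]
    searches = 0
    while True:
        step = reason_step("\n".join(chain))
        chain.append(step)
        match = SEARCH_RE.search(step)
        # Stop when the model did not ask for a search or the budget is used up.
        if match is None or searches >= max_searches:
            return "\n".join(chain)
        docs = search(match.group(1).strip())
        # Only the refined digest is injected, not the verbose raw documents.
        digest = refine("\n".join(chain), docs)
        chain.append(f"<result>{digest}</result>")
        searches += 1
```

Keeping `refine` as a separate step mirrors the motivation in the abstract: raw retrieved pages are verbose, so only a condensed digest enters the reasoning chain.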

Paper Link

Read More

https://github.com/sunnynexus/Search-o1

https://x.com/omarsar0/status/1877742469213004015

Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought

Paper Introduction

Proposes Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by modeling the underlying reasoning required to arrive at a particular CoT. The main argument is that CoT is naive and that Meta-CoT gets closer to the cognitive process required for advanced problem-solving.

Abstract

We propose a novel framework, Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by explicitly modeling the underlying reasoning required to arrive at a particular CoT. We present empirical evidence from state-of-the-art models exhibiting behaviors consistent with in-context search, and explore methods for producing Meta-CoT via process supervision, synthetic data generation, and search algorithms. We then outline a concrete pipeline for training a model to produce Meta-CoTs, incorporating instruction tuning with linearized search traces and reinforcement learning post-training. Finally, we discuss open research questions, including scaling laws, verifier roles, and the potential for discovering novel reasoning algorithms. This work provides a theoretical and practical roadmap to enable Meta-CoT in LLMs, paving the way for more powerful and human-like reasoning in artificial intelligence.
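
One way to picture a "linearized search trace" is to run a small search with backtracking and write out every visited state, dead ends included, as a flat sequence. The toy problem below (reach a target number using `*2` and `+3`) and the trace format are assumptions chosen purely for illustration; the paper does not prescribe this task or format.

```python
def search_with_trace(start, target, max_depth):
    """Depth-first search that records both the final path and the full trace.

    Returns (found, path, trace) where `path` is the clean solution (a plain
    CoT) and `trace` is the linearized search including failed branches.
    """
    trace = []  # every visited state, including dead ends
    path = []   # only the steps on the current candidate solution

    def dfs(value, depth):
        trace.append(f"try {value}")
        if value == target:
            return True
        if depth == max_depth or value > target:
            trace.append(f"backtrack from {value}")
            return False
        for name, nxt in (("*2", value * 2), ("+3", value + 3)):
            path.append(f"{value} {name} = {nxt}")
            if dfs(nxt, depth + 1):
                return True
            path.pop()
        trace.append(f"backtrack from {value}")
        return False

    found = dfs(start, 0)
    return found, path, " ; ".join(trace)
```

In this picture, training on `path` alone corresponds to ordinary CoT supervision, while training on `trace` exposes the exploration and backtracking that produced it, which is the distinction the abstract draws.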

Paper Link

Read More

https://x.com/rm_rafailov/status/1877446475271037314

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Paper Introduction

A new approach that proposes three core components to enhance math reasoning: 1) a code-augmented CoT data synthesis method that uses MCTS to generate step-by-step verified reasoning trajectories for training the policy SLM, 2) an SLM-based process reward model that reliably predicts a reward label for each math reasoning step, and 3) a self-evolution recipe in which the policy SLM and PPM are iteratively evolved to improve math reasoning. On the MATH benchmark, rStar-Math improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%.

Abstract

We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naive step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
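
The idea of training a preference model instead of regressing on per-step scores can be sketched as follows. This is a deliberately simplified illustration, assuming that a step's quality is estimated as the fraction of rollouts through it that end in a correct answer and that a preference pair contrasts the best and worst candidate steps. The function names and the exact estimator are assumptions, not the paper's procedure.

```python
from collections import defaultdict


def step_q_values(rollouts):
    """Estimate a quality score per step from finished rollouts.

    rollouts: list of (steps, is_correct) where steps is a list of step ids.
    Returns {step: fraction of rollouts containing it that ended correct}.
    """
    wins = defaultdict(int)
    visits = defaultdict(int)
    for steps, is_correct in rollouts:
        for step in set(steps):  # count each step once per rollout
            visits[step] += 1
            wins[step] += int(is_correct)
    return {step: wins[step] / visits[step] for step in visits}


def preference_pairs(q_values, candidates):
    """Build (preferred, rejected) pairs from candidate next steps.

    Only the relative order of the noisy scores is used, never their exact
    values; candidates with identical scores yield no pair.
    """
    ranked = sorted(candidates, key=lambda s: q_values[s])
    worst, best = ranked[0], ranked[-1]
    if q_values[worst] == q_values[best]:
        return []
    return [(best, worst)]
```

Using only the ordering makes the training signal less sensitive to noise in the individual score estimates, which is the motivation the abstract gives for a process preference model.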

Paper Link

Read More

https://github.com/microsoft/rStar

https://x.com/omarsar0/status/1877378301293142050

Cosmos World Foundation Model

Paper Introduction

A framework for training Physical AI systems in digital environments before real-world deployment. The platform includes pre-trained world foundation models that act as digital twins of the physical world, allowing AI systems to learn and interact safely without risking damage to physical hardware. These models can be fine-tuned for specific applications such as camera control, robotic manipulation, and autonomous driving.

Abstract

Physical AI needs to be trained digitally first. It needs a digital twin of itself, the policy model, and a digital twin of the world, the world model. In this paper, we present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. We position a world foundation model as a general-purpose world model that can be fine-tuned into customized world models for downstream applications. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training of pre-trained world foundation models, and video tokenizers. To help Physical AI builders solve the most critical problems of our society, we make our platform open-source and our models open-weight with permissive licenses available via NVIDIA Cosmos.

Paper Link

Read More

https://github.com/NVIDIA/Cosmos

https://x.com/EthanHe_42/status/1876487556755521798

Process Reinforcement through Implicit Rewards

Paper Introduction

A framework for online reinforcement learning that uses process rewards to improve language model reasoning. The proposed algorithm combines online prompt filtering, RLOO return/advantage estimation, PPO loss, and online updates of implicit process reward modeling. Their model, Eurus-2-7B-PRIME, achieves 26.7% pass@1 on AIME 2024, surpassing GPT-4 and other models while using only 1/10 of the training data of similar models.
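
Of the components listed above, RLOO (REINFORCE leave-one-out) advantage estimation is easy to show in isolation. The sketch below implements the standard leave-one-out baseline for a group of responses sampled for the same prompt; the function name is hypothetical, and how PRIME combines this with its implicit process rewards is not reproduced here.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for K sampled responses to one prompt.

    Each response's baseline is the mean reward of the other K - 1 responses,
    so the advantage is r_i - (sum(r) - r_i) / (K - 1).
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

For example, with rewards `[1.0, 0.0, 0.0, 1.0]` each correct response is compared against a baseline of 1/3 and each incorrect one against 2/3, so the advantages sum to zero without needing a learned value function.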

Paper Link

Read More

https://github.com/PRIME-RL/PRIME

https://x.com/lifan__yuan/status/1874867809983033649

Can LLMs Design Good Questions Based on Context?

Paper Introduction

Systematically evaluates the quality of questions generated with LLMs. The main findings are: 1) both LLaMA and GPT models show a strong preference for asking about specific facts and figures, 2) question lengths tend to be around 20 words, though different LLMs exhibit distinct length preferences, 3) LLM-generated questions typically require significantly longer answers, and 4) human-generated questions tend to concentrate on the beginning of the context, while LLM-generated questions show a more balanced distribution, with a slight decrease in focus at both ends.

Abstract

This paper evaluates questions generated by LLMs from context, comparing them to human-generated questions across six dimensions. We introduce an automated LLM-based evaluation method, focusing on aspects like question length, type, context coverage, and answerability. Our findings highlight unique characteristics of LLM-generated questions, contributing insights that can support further research in question quality and downstream applications.
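
Two of the dimensions mentioned above, question length and where in the context a question focuses, can be approximated with very simple statistics. The sketch below is a rough stand-in only: it assumes word overlap as a proxy for context coverage, whereas the paper uses an LLM-based evaluator, and both function names are hypothetical.

```python
def question_length(question):
    """Length of a question in whitespace-separated words."""
    return len(question.split())


def focus_position(context_sentences, question):
    """Relative position of the context sentence the question most overlaps with.

    Returns a value in [0.0, 1.0], where 0.0 is the first sentence and 1.0 the
    last. Ties go to the earliest sentence.
    """
    if not context_sentences:
        raise ValueError("context must contain at least one sentence")
    q_words = set(question.lower().split())
    overlaps = [len(q_words & set(s.lower().split())) for s in context_sentences]
    best = overlaps.index(max(overlaps))
    if len(context_sentences) == 1:
        return 0.0
    return best / (len(context_sentences) - 1)
```

Collecting `focus_position` over many questions gives a histogram over the context, which is the kind of distribution the finding about human versus LLM-generated questions refers to.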

Paper Link

Read More

https://x.com/omarsar0/status/1877008618207560049

A Survey on Large Language Models with some Insights on their Capabilities and Limitations

Paper Introduction

A new survey on LLMs that includes some insights into their capabilities and limitations.

Abstract

The rapid advancement of artificial intelligence, particularly with the development of Large Language Models (LLMs) built on the transformer architecture, has redefined the capabilities of natural language processing. These models now exhibit remarkable performance across various language-related tasks, such as text generation, question answering, translation, and summarization, often rivaling human-like comprehension. More intriguingly, LLMs have demonstrated emergent abilities extending beyond their core functions, showing proficiency in tasks like commonsense reasoning, code generation, and arithmetic. This survey paper explores the foundational components, scaling mechanisms, and architectural strategies that drive these capabilities. Emphasizing models like GPT and LLaMA, we analyze the impact of exponential data and computational growth on LLM performance, while also addressing the trade-offs associated with scaling. We also examine LLM applications across sectors, such as healthcare, finance, education, and law, highlighting their adaptability and potential to solve domain-specific challenges. Central to this work are the questions of how LLMs generalize across diverse tasks, exhibit planning and reasoning abilities, and whether these emergent abilities can be systematically elicited or enhanced. In particular, we provide some insights into the CoT (Chain of Thought) and PoT (Plan of Thought) abilities within LLMs, focusing on how pre-training data influences their emergence. Additionally, we investigate LLM-modulo frameworks that integrate external systems, allowing LLMs to handle complex, dynamic tasks. By analyzing these factors, this paper aims to foster the ongoing discussion on the capabilities and limits of LLMs, promoting their responsible development and application in novel and increasingly complex environments.

Paper Link

Read More

https://x.com/omarsar0/status/1877416049999802408

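Original Source

This post was compiled with a GPT model and may contain errors, so please also refer to the original source at the bottom of the post. If you notice anything awkward or incorrect while reading, please let us know in the comments.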

