[2025/02/03 ~ 02/09] Top ML Papers of the Week

9bow · February 10, 2025, 1:31 PM

PyTorchKR

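Taken together, the papers selected this week show a strong focus on strengthening the reasoning abilities of large language models. For example, s1, LIMO, CoAT, and the paper on long CoT reasoning present a range of strategies that use small datasets or additional test-time compute to help models carry out complex reasoning. These approaches show that an LLM can draw on the rich prior knowledge it already holds and reach advanced reasoning without a large fine-tuning dataset. LIMO in particular achieves strong performance with only 817 curated examples, giving experimental support to the “less is more” hypothesis.
Multimodal generation and data augmentation also saw notable progress. OmniHuman-1 generates highly realistic human videos from a single image plus audio or video input, showing what fusing multiple modalities can do for realistic generation. Syntriever and a survey on text data augmentation propose new ways to use synthetic data to improve retrieval models and LLMs. In addition, the agent-ensemble approach Self-MoA and the agentic supernet MaAS show that, rather than simply mixing the outputs of several models, aggregating diverse outputs from a single strong model or dynamically assembling the best agent team can be both cost-efficient and better performing.
These research trends show which problems are considered important at the frontier of AI. Complex models such as LLMs already have substantial learning capacity, but improving them further requires attention not only to data quality but also to fine details of the training process, such as the reward structure. Recent work on strengthening reasoning in particular aims at AI systems that can solve more complex and varied problems in real applications. These efforts should ultimately contribute to better generalization and performance, and broaden the range of industries where AI can be applied.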

s1: Simple test-time scaling

Paper introduction

Researchers from Stanford, UW, and others introduce s1, a method to boost LLM performance by using extra compute at inference (“test-time scaling”). Key ideas include:

Small yet powerful dataset – They curated s1K, only 1,000 challenging questions with detailed reasoning traces, to fine-tune a 32B model. Despite the tiny data, this provides strong reasoning exemplars.

“Budget forcing” for reasoning – A new decoding trick appends the token “Wait” when the model tries to stop, forcing it to think longer. This leads the model to double-check and fix its reasoning step. By also cutting off overly long reasoning, they control inference time.

Big gains over OpenAI’s o1 – The resulting model, s1-32B (a fine-tuned version of Qwen2.5-32B-Instruct), outperforms OpenAI’s o1-preview model by up to 27% on competition-level math questions (MATH and AIME24). Notably, budget forcing lifts its AIME24 accuracy from 50% to 57%, beyond what the model reaches without test-time intervention.
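
The budget-forcing idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the s1 authors' implementation: the `<end_think>` delimiter, the `min_thinking_tokens` / `max_thinking_tokens` / `max_waits` parameters, and the `toy_model` stand-in for an LLM are all hypothetical names chosen for this example.

```python
from typing import Callable, List

END_OF_THINKING = "<end_think>"  # hypothetical end-of-thinking delimiter
WAIT = "Wait"


def budget_forced_decode(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    min_thinking_tokens: int,
    max_thinking_tokens: int,
    max_waits: int = 2,
) -> List[str]:
    """Return a thinking trace generated under a token budget."""
    trace: List[str] = []
    waits_used = 0
    while len(trace) < max_thinking_tokens:
        token = next_token(prompt + trace)
        if token == END_OF_THINKING:
            # The model tries to stop. If it is still under the minimum
            # budget, drop the stop token and append "Wait" instead.
            if len(trace) < min_thinking_tokens and waits_used < max_waits:
                trace.append(WAIT)
                waits_used += 1
                continue
            break
        trace.append(token)
    # Reached when the model stopped on its own or the cap was hit:
    # close the thinking phase explicitly.
    trace.append(END_OF_THINKING)
    return trace


def toy_model(context: List[str]) -> str:
    """Hypothetical stand-in for an LLM: emits two reasoning tokens after
    the prompt or after the latest "Wait", then tries to stop."""
    since_restart = 0
    for tok in reversed(context):
        if tok in (WAIT, "Q"):
            break
        since_restart += 1
    return "step" if since_restart < 2 else END_OF_THINKING


print(budget_forced_decode(toy_model, ["Q"], min_thinking_tokens=5,
                           max_thinking_tokens=20))
```

Raising `min_thinking_tokens` lengthens the reasoning by inserting more "Wait" tokens, while lowering `max_thinking_tokens` cuts off overly long traces; together they give the two-sided control over inference compute that the summary describes.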

Abstract

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at this https URL

Paper link

Further reading

s1: A study on a simple way to implement test-time scaling (Reading & Info Sharing)

s1: Simple Test-Time Scaling. Research background: With the rapid progress of artificial intelligence (AI), the use of large language models (LLMs) keeps expanding. Recent models such as GPT-4o, Claude, and Gemini are trained on vast amounts of data and can then perform complex natural language processing (NLP) tasks, showing language understanding and generation abilities that increasingly approach human level. These models are used in areas such as machine translation, document summarization, and question answering, and have become a core component of AI-based applications. The traditional way to improve such language models is scaling at train time: training on more data, increasing model size, and extending training time is generally…

http://twitter.com/omarsar0/status/1886428631041225030

OmniHuman-1: Scaling One-Stage Human Animation

Paper introduction

A team at ByteDance AI Lab unveiled OmniHuman-1, a diffusion-transformer model that can generate highly realistic human videos from just a single image plus motion input (audio or video). Highlights:

End-to-end human video generation – OmniHuman takes one image (any aspect ratio, from face only to full-body) and an audio clip or video motion and produces a lifelike video of that person speaking, singing, or performing actions. The outputs are remarkably realistic in motion, lighting, and texture detail.

Mixed modality training – A key innovation is Omni-Conditions Training: mixing various motion modalities during training (audio-driven, video-driven, pose, etc.). This greatly expands the training data and overcomes the usual scarcity of high-quality talking-head video data. The model learns to handle diverse inputs (speech, song, instruments) and challenging poses.

Outperforms prior methods – Compared to earlier one-stage models (e.g. audio-driven talking heads), OmniHuman generates more realistic videos and is more flexible in input types. It can even handle cartoons or animal figures as input, transferring motion naturally to each style.

Broader support – The approach supports any portrait content (face close-up, half-body, full-body) and multiple driving signals simultaneously. This generality is a first for end-to-end human animation models.

Abstract

End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in the recent few years. However, existing methods still struggle to scale up as large general video generation models, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the project page (this https URL)

Paper link

Further reading

https://omnihuman-lab.github.io/

http://twitter.com/unseenvie/status/1886672598576325011

LIMO: Less Is More for Reasoning

Paper introduction

Can a handful of examples teach complex math reasoning to LLMs? This new LIMO paper challenges the notion that we need huge fine-tuning datasets for tough reasoning tasks. Key findings:

Surprisingly few examples – With only 817 carefully curated training samples, the LIMO model achieves 57.1% accuracy on the AIME math competition and 94.8% on MATH. This is a giant leap from prior SFT-based models (which scored 6.5% and 59.2% respectively), achieved using just 1% of the data those earlier approaches needed.

Generalization with less data? – LIMO shows impressive OOD generalization: a +40.5% absolute improvement on average across 10 diverse benchmarks, even outperforming models trained on 100× more data. This challenges the assumption that more data is always required for complex skills and that fine-tuning only leads to memorization.

“Less-Is-More” Hypothesis – The authors propose that if an LLM’s pre-training has already endowed it with rich knowledge, then only a minimal set of carefully designed examples (which they call “cognitive templates”) is needed to unlock advanced reasoning. Essentially, the model just needs to see how to use its knowledge, not thousands of repetitive problems.

Open-source suite – The complete LIMO training suite is released for the community, supporting further research on data-efficient reasoning. This work hints that small, high-quality datasets might yield state-of-the-art reasoning, lowering the barrier to fine-tuning powerful LLMs.

Abstract

We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (>100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from previous SFT-based models' 6.5% and 59.2% respectively, while only using 1% of the training data required by previous approaches. LIMO demonstrates exceptional out-of-distribution generalization, achieving 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data, challenging the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the model's encoded knowledge foundation during pre-training, and (2) the effectiveness of post-training examples as "cognitive templates" that show the model how to utilize its knowledge base to solve complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning, we release LIMO as a comprehensive open-source suite at this https URL.

Paper link

Further reading

http://twitter.com/omarsar0/status/1887514592747937984

CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning

Paper introduction

This work introduces CoAT, a new “slow thinking” inference framework that enables an LLM to reason more like a human by exploring and updating its thoughts. Main components:

MCTS + associative memory – CoAT marries Monte Carlo Tree Search (MCTS) with an associative memory mechanism. MCTS lets the model systematically explore different reasoning branches (possible solutions), while the associative memory dynamically injects new relevant information into the context as needed (mimicking how humans recall facts mid-thought).

Iterative, self-improving reasoning – The framework can expand the search space of solutions and revisit or refine earlier intermediate conclusions. As it evaluates branches, it can incorporate new clues or correct itself, ensuring the final answer is more accurate and comprehensive. This is in contrast to standard one-pass LLM reasoning, which can’t easily backtrack or gather new info on the fly.

Improved accuracy and diversity – In experiments across various generation and reasoning tasks, CoAT outperformed conventional single-pass inference on metrics like accuracy, coherence of reasoning steps, and solution diversity. The ability to iteratively broaden the search while keeping relevant context yields better results than “fast thinking” alone.

Closer to human thought – CoAT is inspired by how humans solve problems: we iteratively consider alternatives, recall facts, and refine our thinking. It points toward LLM agents that can use search algorithms and memory to achieve more reliable reasoning.
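
To make the combination of tree search and associative memory more concrete, here is a small sketch of one search step. It assumes the standard UCT rule for MCTS child selection, which the summary above does not spell out, and the `ThoughtNode` structure, the `[recalled]` marker, and the exploration constant are hypothetical choices for illustration rather than details taken from the paper.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThoughtNode:
    thought: str                      # reasoning step proposed at this node
    associated_memory: str = ""       # extra information recalled for this step
    visits: int = 0
    total_value: float = 0.0
    children: List["ThoughtNode"] = field(default_factory=list)


def uct_score(parent_visits: int, child: ThoughtNode,
              exploration: float = 1.4) -> float:
    """Standard UCT: mean value plus an exploration bonus (assumed rule)."""
    if child.visits == 0:
        return math.inf  # unvisited branches are tried first
    mean_value = child.total_value / child.visits
    bonus = exploration * math.sqrt(math.log(parent_visits) / child.visits)
    return mean_value + bonus


def select_child(node: ThoughtNode) -> ThoughtNode:
    """Pick the branch to explore next."""
    return max(node.children, key=lambda c: uct_score(node.visits, c))


def build_context(path: List[ThoughtNode]) -> str:
    """Concatenate thoughts along a path, injecting recalled information."""
    parts: List[str] = []
    for node in path:
        parts.append(node.thought)
        if node.associated_memory:
            parts.append(f"[recalled] {node.associated_memory}")
    return "\n".join(parts)


root = ThoughtNode("Q: which branch?", visits=10)
a = ThoughtNode("Try approach A", visits=8, total_value=4.0)
b = ThoughtNode("Try approach B", associated_memory="a relevant fact",
                visits=2, total_value=1.2)
root.children = [a, b]
chosen = select_child(root)
print(build_context([root, chosen]))
```

In this toy tree the less-visited branch wins because its exploration bonus is larger, and the context passed to the next LLM call carries both the chosen thought and the information recalled for it.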

Abstract

Research on LLM technologies is rapidly emerging, with most of them employing a 'fast thinking' approach to inference. Most LLMs generate the final result based solely on a single query and LLM's reasoning capabilities. However, with the advent of OpenAI-o1, 'slow thinking' techniques have garnered increasing attention because its process is closer to the human thought process. Inspired by the human ability to constantly associate and replenish knowledge during thinking, we developed the novel Chain-of-Associated-Thoughts (CoAT) framework, which introduces an innovative synergy between the Monte Carlo Tree Search (MCTS) algorithm and a dynamic mechanism for integrating new key information, termed 'associative memory'. By combining the structured exploration capabilities of MCTS with the adaptive learning capacity of associative memory, CoAT significantly expands the LLM search space, enabling our framework to explore diverse reasoning pathways and dynamically update its knowledge base in real-time. This allows the framework to not only revisit and refine earlier inferences but also adaptively incorporate evolving information, ensuring that the final output is both accurate and comprehensive. To validate the effectiveness of our framework, we conducted extensive experiments across a range of generative and reasoning tasks. These experiments demonstrated that our framework outperforms conventional inference processes on accuracy, coherence, and diversity. The framework's ability to iteratively expand its search space while retaining contextually relevant information results.

Paper link

Further reading

http://twitter.com/omarsar0/status/1887187689247752370

Syntriever: Training Retrievers with LLM-Generated Data

Paper introduction

How can we build a high-quality text retriever without large labeled datasets or access to an LLM’s internals? Syntriever presents a two-stage framework to distill knowledge from a black-box LLM into a retrieval model using synthetic data. Steps:

Stage 1 – Distillation via synthetic Q&A: Given a query, they prompt a powerful LLM (e.g. GPT-4) to generate a relevant passage (answer) and also plausible but incorrect passages, using chain-of-thought to ensure variety. The LLM then self-verifies these generated passages to filter out any hallucinations or low-quality data. The result is a synthetic dataset of queries with positive and negative passages. A retriever is trained on this, with a loss that clusters embeddings of relevant passages closer than irrelevant ones.

Stage 2 – Alignment with LLM preferences: They further align the retriever to prefer results the LLM would prefer. Using a partial Plackett-Luce ranking method, the retriever learns to rank passages similarly to the LLM’s judgments, with regularization to not drift too far from the Stage 1 model. This step fine-tunes the retriever to mimic the black-box LLM’s preferences.

State-of-the-art results – Syntriever achieves new SOTA on several retrieval benchmarks across domains. This was achieved without any real training queries: all training data was synthetically generated by the LLM.

No logits needed – Prior LLM-to-retriever distillation needed model logits or probabilities (not available from closed APIs). Syntriever gets around this by using only generated text and LLM scoring, making it applicable even to closed models.
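
The Stage 2 ranking objective can be illustrated with the standard Plackett-Luce likelihood restricted to the top positions of a ranking. This is a sketch under assumptions: it reads "partial" as "only the first `top_k` positions are modelled", which may differ from the paper's exact formulation, and it leaves out the regularization term that keeps the retriever close to the Stage 1 model. The function name and parameters are hypothetical.

```python
import math
from typing import Sequence


def partial_plackett_luce_nll(scores: Sequence[float], top_k: int) -> float:
    """Negative log-likelihood of the first `top_k` positions of a ranking.

    `scores` are retriever similarity scores for the candidate passages,
    already ordered by the LLM's preference (index 0 = most preferred).
    """
    nll = 0.0
    for i in range(top_k):
        remaining = scores[i:]
        # Numerically stable log-sum-exp over the passages not yet ranked.
        m = max(remaining)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in remaining))
        nll -= scores[i] - log_denominator
    return nll


# Retriever scores that agree with the LLM's ordering give a lower loss
# than scores that contradict it.
agree = partial_plackett_luce_nll([2.0, 1.0, 0.0], top_k=2)
disagree = partial_plackett_luce_nll([0.0, 1.0, 2.0], top_k=2)
print(agree < disagree)
```

Minimizing this quantity pushes the retriever to score passages in the same order the LLM prefers, using only the LLM's ranking and no logits or probabilities from the LLM itself.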

Abstract

LLMs have boosted progress in many AI applications. Recently, there were attempts to distill the vast knowledge of LLMs into information retrieval systems. Those distillation methods mostly use output probabilities of LLMs which are unavailable in the latest black-box LLMs. We propose Syntriever, a training framework for retrievers using synthetic data from black-box LLMs. Syntriever consists of two stages. Firstly in the distillation stage, we synthesize relevant and plausibly irrelevant passages and augmented queries using chain-of-thoughts for the given queries. LLM is asked to self-verify the synthetic data for possible hallucinations, after which retrievers are trained with a loss designed to cluster the embeddings of relevant passages. Secondly in the alignment stage, we align the retriever with the preferences of LLMs. We propose a preference modeling called partial Plackett-Luce ranking to learn LLM preferences with regularization which prevents the model from deviating excessively from that trained in the distillation stage. Experiments show that Syntriever achieves state-of-the-art performances on benchmark datasets from various domains in nDCG@K. The code is available at this https URL.

Paper link

Further reading

https://x.com/omarsar0/status/1887878242276954557

Demystifying Long Chain-of-Thought Reasoning in LLMs

Paper introduction

This work investigates how LLMs develop extended CoT reasoning, focusing on RL and compute scaling. Key insights include:

Supervised fine-tuning (SFT) boosts performance – While not strictly necessary, SFT simplifies training and increases efficiency. Models fine-tuned with long CoT data achieve higher accuracy than those using short CoT sequences.

Reward shaping is crucial for stable RL – The study finds that naive RL approaches don’t always extend CoT length effectively. To address this, the authors introduce a cosine length-scaling reward with repetition penalties, which balances reasoning depth and prevents meaningless length increases.

Scaling verifiable reward signals – RL models trained with noisy, web-extracted “silver” supervision signals can generalize better to OOD tasks, such as STEM reasoning. Filtering such data is crucial to maintaining training stability.

Emergent reasoning abilities in base models – Skills like error correction and backtracking exist in base models but require careful RL incentives to be effectively utilized in complex tasks.

This paper provides a structured roadmap for researchers looking to refine CoT training strategies for LLMs, highlighting how RL and reward tuning impact reasoning depth.
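
A cosine length-scaling reward of the kind mentioned above can be sketched as a cosine interpolation between two endpoint rewards. The functional form, the endpoint values, and the overflow penalty below are assumptions chosen for illustration, and the repetition penalty is omitted; only the general idea (reward depends on correctness and on CoT length through a cosine schedule) comes from the summary.

```python
import math


def cosine_interp(length: int, max_length: int,
                  value_at_zero: float, value_at_max: float) -> float:
    """Cosine interpolation from value_at_zero (length 0) to value_at_max."""
    progress = min(max(length / max_length, 0.0), 1.0)
    return value_at_max + 0.5 * (value_at_zero - value_at_max) * (
        1.0 + math.cos(math.pi * progress)
    )


def length_scaled_reward(
    correct: bool,
    length: int,
    max_length: int = 1000,        # hypothetical length cap
    correct_short: float = 2.0,    # hypothetical endpoint rewards
    correct_long: float = 1.0,
    wrong_short: float = -10.0,
    wrong_long: float = 0.0,
    overflow_penalty: float = -10.0,
) -> float:
    """Reward for one sampled chain of thought of `length` tokens."""
    if length >= max_length:
        return overflow_penalty    # hitting the cap is penalized outright
    if correct:
        # Correct answers: shorter chains earn slightly more.
        return cosine_interp(length, max_length, correct_short, correct_long)
    # Wrong answers: longer chains are penalized less, which encourages
    # the model to keep thinking when it has not found the answer.
    return cosine_interp(length, max_length, wrong_short, wrong_long)


for n in (0, 500, 900):
    print(n, length_scaled_reward(True, n), length_scaled_reward(False, n))
```

With these illustrative endpoints, length is rewarded only when it is useful: a wrong answer gains by thinking longer, a correct one does not, and running into the cap is never worthwhile.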

Abstract

Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: this https URL

Paper link

Further reading

https://x.com/xiangyue96/status/1887332772198371514

Rethinking Mixture-of-Agents: Ensemble One Strong LLM

Paper introduction

Ensembling multiple models (Mixture-of-Agents, MoA) is a popular way to boost performance. This paper asks: is mixing different LLMs actually helpful, or are we better off ensembling one top model’s outputs? The surprising answer: “Self-MoA” (single-model ensemble) often wins over multi-model ensembles. Key points:

Self-MoA vs. MoA – The authors propose Self-MoA, which simply generates multiple outputs from the single best model and then aggregates them (e.g., by majority voting or ranking), instead of combining outputs from various models. This increases diversity via multiple attempts, without introducing weaker models.

Better performance – Extensive tests show Self-MoA outperforms a mixture of different LLMs in many cases. For example, using one strong model, Self-MoA scored 6.6% higher than a mixed-model MoA on the AlpacaEval 2.0 benchmark, and 3.8% higher on average across tasks like MMLU, CRUX, and MATH. In fact, applying Self-MoA to a top AlpacaEval model set a new state of the art on the leaderboard.

Why it works – Mixing models can hurt because the overall quality is limited by the weaker members. The study finds MoA’s benefit is highly sensitive to the quality of each model – adding a weaker model dilutes performance. Unless all models are very strong and complementary, you’re better off with one model’s outputs. They do identify niche scenarios where diverse models help, but those are exceptions.

Sequential aggregation – They also introduce a sequential version of Self-MoA that can combine a large number of outputs over multiple rounds (rather than all at once). This sequential Self-MoA is as effective as one-shot aggregation, scaling ensembling to many outputs efficiently.
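
The single-model ensemble can be sketched as repeated sampling followed by aggregation. The example below uses majority voting, one of the aggregation options mentioned above; the `sample` callable stands in for a call to the strong model and, like the other names here, is a hypothetical placeholder. Exact-match voting only makes sense for short, canonical answers, and the aggregation used in the paper may differ.

```python
from collections import Counter
from typing import Callable, List


def self_moa(sample: Callable[[str], str], prompt: str,
             num_samples: int) -> str:
    """Draw several outputs from one model and return the most common one."""
    candidates: List[str] = [sample(prompt) for _ in range(num_samples)]
    winner, _count = Counter(candidates).most_common(1)[0]
    return winner


# Hypothetical stand-in for stochastic sampling from a single strong model.
canned_answers = iter(["42", "41", "42", "42", "40"])


def fake_sample(prompt: str) -> str:
    return next(canned_answers)


answer = self_moa(fake_sample, "What is 6 * 7?", num_samples=5)
print(answer)
```

Diversity here comes only from sampling the same model several times, so no weaker model can drag down the aggregate, which is the trade-off between diversity and quality that the paper examines.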

Abstract

Ensembling outputs from diverse sources is a straightforward yet effective approach to boost performance. Mixture-of-Agents (MoA) is one such popular ensemble method that aggregates outputs from multiple different Large Language Models (LLMs). This paper raises the question in the context of language models: is mixing different LLMs truly beneficial? We propose Self-MoA -- an ensemble method that aggregates outputs from only the single top-performing LLM. Our extensive experiments reveal that, surprisingly, Self-MoA outperforms standard MoA that mixes different LLMs in a large number of scenarios: Self-MoA achieves 6.6% improvement over MoA on the AlpacaEval 2.0 benchmark, and an average of 3.8% improvement across various benchmarks, including MMLU, CRUX, and MATH. Applying Self-MoA to one of the top-ranking models in AlpacaEval 2.0 directly achieves the new state-of-the-art performance on the leaderboard. To understand the effectiveness of Self-MoA, we systematically investigate the trade-off between diversity and quality of outputs under various MoA settings. We confirm that the MoA performance is rather sensitive to the quality, and mixing different LLMs often lowers the average quality of the models. To complement the study, we identify the scenarios where mixing different LLMs could be helpful. This paper further introduces a sequential version of Self-MoA, that is capable of aggregating a large number of LLM outputs on-the-fly over multiple rounds, and is as effective as aggregating all outputs at once.

Paper link

Read more

http://twitter.com/omarsar0/status/1886792384954163347

MaAS: Multi-agent Architecture Search (Agentic Supernet)

Paper overview

Building multi-agent systems of LLMs (where multiple agents collaborate, each with specific roles or tools) is powerful but usually requires hand-designing a single complex pipeline. MaAS (Multi-agent Architecture Search) instead learns a universal “agentic supernet” from which it can spawn an optimal agent team on the fly for each query. It automates designing the agent workflow per task:

Agentic supernet – The authors define a continuous space of possible agent architectures (chains of LLM calls, tool uses, etc.). Rather than picking one static architecture, they train a supernet that encompasses many configurations. Each query can trigger a different sub-network of agents tailored to that query’s domain and difficulty.

Dynamic resource allocation – Because the system adapts per query, it can allocate resources efficiently. Easy questions might use a simple, fast agent chain; hard problems invoke a more elaborate reasoning team. This avoids the one-size-fits-all cost of a monolithic agent system.

Huge cost savings – On six benchmarks, MaAS used only 6–45% of the inference cost of existing multi-agent pipelines, yet still outperformed them by ~0.5–11.8% in accuracy. It finds cheaper ways to reach equal or better performance by tuning the agent configuration to the task.

Robust and transferable – The agentic supernet approach showed strong generalization: architectures found effective on one task transferred well to new domains and even with different LLM backbones, outperforming static designs. This suggests the method learns general principles of how to orchestrate LLM agents optimally.
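To make the idea of query-dependent sampling concrete, here is a toy sketch. It is not the MaAS algorithm: the operator list, the relative costs, the `difficulty` score in [0, 1], and the hand-written logits in `layer_probs` are all assumptions chosen for illustration. In the paper the distribution is learned, whereas here it is fixed by a formula.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Operator:
    name: str
    cost: float  # hypothetical relative inference cost


OPERATORS = [
    Operator("direct_answer", 1.0),
    Operator("chain_of_thought", 3.0),
    Operator("tool_call", 4.0),
    Operator("debate", 8.0),
]
EXIT = Operator("early_exit", 0.0)


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def layer_probs(difficulty: float, choices: List[Operator]) -> List[float]:
    """Hand-written stand-in for a learned, query-conditioned distribution."""
    tilt = 2.0 * difficulty - 1.0  # -1 for an easy query, +1 for a hard one
    logits = [(-4.0 * tilt) if op == EXIT else (tilt * op.cost) for op in choices]
    return softmax(logits)


def sample_architecture(difficulty: float, max_layers: int = 4,
                        rng: Optional[random.Random] = None) -> List[Operator]:
    """Sample a chain of operators layer by layer; stop when EXIT is drawn."""
    rng = rng or random.Random()
    chain: List[Operator] = []
    for layer in range(max_layers):
        # The first layer must do some work, so EXIT is only offered afterwards.
        choices = OPERATORS if layer == 0 else OPERATORS + [EXIT]
        op = rng.choices(choices, weights=layer_probs(difficulty, choices), k=1)[0]
        if op == EXIT:
            break
        chain.append(op)
    return chain


def chain_cost(chain: List[Operator]) -> float:
    return sum(op.cost for op in chain)


if __name__ == "__main__":
    rng = random.Random(0)
    for d in (0.1, 0.9):
        chain = sample_architecture(d, rng=rng)
        print(d, [op.name for op in chain], chain_cost(chain))
```

With these toy logits, easy queries mostly draw a single cheap operator and exit early, while hard queries tend to stack expensive operators, which mirrors the dynamic resource allocation described above.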

Abstract

Large Language Model (LLM)-empowered multi-agent systems extend the cognitive boundaries of individual agents through disciplined collaboration and interaction, while constructing these systems often requires labor-intensive manual designs. Despite the availability of methods to automate the design of agentic workflows, they typically seek to identify a static, complex, one-size-fits-all system, which, however, fails to dynamically allocate inference resources based on the difficulty and domain of each query. To address this challenge, we shift away from the pursuit of a monolithic agentic system, instead optimizing the agentic supernet, a probabilistic and continuous distribution of agentic architectures. We introduce MaAS, an automated framework that samples query-dependent agentic systems from the supernet, delivering high-quality solutions and tailored resource allocation (e.g., LLM calls, tool calls, token cost). Comprehensive evaluation across six benchmarks demonstrates that MaAS (I) requires only 6–45% of the inference costs of existing handcrafted or automated multi-agent systems, (II) surpasses them by 0.54%–11.82%, and (III) enjoys superior cross-dataset and cross-LLM-backbone transferability.

Paper link

Read more

http://twitter.com/omarsar0/status/1887884027530727876

Advancing Reasoning in LLMs

Paper overview

This survey paper provides a timely overview of emerging methods to enhance reasoning capabilities in LLMs. It organizes the literature into several key approach categories:

Prompting strategies – Techniques that guide the model’s reasoning via clever prompts, e.g. Chain-of-Thought prompting (having the model generate step-by-step solutions), Self-Consistency (sampling multiple reasoning paths and choosing the best answer), Tree-of-Thought strategies, etc. These methods improve logical deduction and multi-step solutions without changing the model’s architecture.

Architectural innovations – Modifications to the model or its context to better facilitate reasoning. This includes retrieval-augmented models (LLMs that can fetch external facts), modular reasoning networks (systems that break a problem into sub-tasks handled by different modules or experts), and neuro-symbolic integration (combining neural nets with symbolic logic or tools). Such changes aim to give LLMs access to either more knowledge or more structured reasoning processes.

Learning paradigms – New training methods to instill reasoning skills: fine-tuning on reasoning-specific datasets (e.g. math word problems), reinforcement learning approaches (rewarding correct reasoning chains), and self-supervised objectives that train the model to reason (like predicting masked steps in a proof). These improve the model’s inherent reasoning ability beyond what general pre-training provides.

Evaluation & challenges – The survey also reviews how we evaluate reasoning in LLMs (benchmarks for logic, math, commonsense, etc.) and identifies open challenges. Key issues include hallucinations (the model fabricating illogical or untrue intermediate steps), brittleness to small changes (robustness), and generalization of reasoning methods across different tasks and domains. Addressing these will be crucial for the next generation of reasoning-augmented LLMs.
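As a small illustration of the Self-Consistency idea from the prompting strategies above (sample several reasoning paths, then keep the most common final answer), consider the sketch below. The `Answer: ...` line format and both function names are assumptions made for the example. In practice the reasoning paths would come from repeated sampling of an LLM, which is not shown here.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_answer(reasoning: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if there is none.
    The marker format is an assumption about how the model was prompted."""
    found = re.findall(r"answer\s*:\s*(.+)", reasoning, flags=re.IGNORECASE)
    return found[-1].strip() if found else None


def self_consistency(paths: List[str]) -> Optional[str]:
    """Majority vote over the final answers of several sampled reasoning paths."""
    answers = [a for a in map(extract_answer, paths) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    sampled_paths = [
        "3 apples plus 4 apples is 7 apples.\nAnswer: 7",
        "3 + 4 = 8 (an arithmetic slip).\nAnswer: 8",
        "Count up from 3 four times: 4, 5, 6, 7.\nAnswer: 7",
    ]
    print(self_consistency(sampled_paths))
```

The vote is taken over extracted answers rather than over whole reasoning strings, so two paths that reason differently but agree on the result still reinforce each other.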

Abstract

Large Language Models (LLMs) have succeeded remarkably in various natural language processing (NLP) tasks, yet their reasoning capabilities remain a fundamental challenge. While LLMs exhibit impressive fluency and factual recall, their ability to perform complex reasoning (spanning logical deduction, mathematical problem-solving, commonsense inference, and multi-step reasoning) often falls short of human expectations. This survey provides a comprehensive review of emerging techniques enhancing reasoning in LLMs. We categorize existing methods into key approaches, including prompting strategies (e.g., Chain-of-Thought reasoning, Self-Consistency, and Tree-of-Thought reasoning), architectural innovations (e.g., retrieval-augmented models, modular reasoning networks, and neuro-symbolic integration), and learning paradigms (e.g., fine-tuning with reasoning-specific datasets, reinforcement learning, and self-supervised reasoning objectives). Additionally, we explore evaluation frameworks used to assess reasoning in LLMs and highlight open challenges, such as hallucinations, robustness, and reasoning generalization across diverse tasks. By synthesizing recent advancements, this survey aims to provide insights into promising directions for future research and practical applications of reasoning-augmented LLMs.

Paper link

Read more

http://twitter.com/omarsar0/status/1887875470269849659

Survey: Text Data Augmentation for LLMs

Paper overview

This comprehensive survey covers text data augmentation techniques for LLMs. Because LLMs demand massive training data, augmenting datasets with synthetic or transformed text is vital. The survey covers the following:

Classification of augmentation methods – The survey defines four categories: (1) Simple augmentation – basic text manipulations such as synonym replacement and cropping; (2) Prompt-based augmentation – using an LLM with specific prompts to generate new training examples (taking advantage of the LLM’s own generative power); (3) Retrieval-based augmentation – pulling in external knowledge or contexts (via search or databases) to ground the generated text in facts; and (4) Hybrid augmentation – combinations of the above, or multi-step strategies.

LLMs as data generators – A key insight is that modern LLMs can create high-quality synthetic data to improve themselves. By carefully prompting an LLM to produce variations of a task (for example, ask ChatGPT to come up with new math word problems), one can dramatically expand a training set. The survey discusses prompt design for this purpose and how to ensure the generated data is diverse and useful.

Post-processing and filtering – Augmented data isn’t always perfect. The survey covers techniques to refine and filter generated data, for instance verifying facts with a secondary model or removing examples that might introduce errors. This step is crucial to prevent “garbage in, garbage out” when augmenting data.

Evaluation and future directions – It outlines common tasks where data augmentation is used (like low-resource language translation, QA, etc.) and how to evaluate the impact (improvement in accuracy, robustness, etc.). Finally, it discusses challenges (e.g. ensuring augmentation doesn’t distort data distribution, avoiding model bias reinforcement) and opportunities for new research.
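The simple-augmentation category and the filtering step can be illustrated with a short sketch. Everything here is a toy assumption rather than a method taken from the survey: the `SYNONYMS` lexicon, the function names, and the filtering rules (drop near-verbatim duplicates and very short outputs) are placeholders for whatever resources and checks a real pipeline would use.

```python
import random
from typing import List

# Toy synonym lexicon (assumption); a real pipeline would use a thesaurus or an LLM.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "cheerful"]}


def synonym_replace(text: str, rng: random.Random, p: float = 0.5) -> str:
    """Replace each word that has a known synonym with probability p."""
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        out.append(rng.choice(options) if options and rng.random() < p else word)
    return " ".join(out)


def random_crop(text: str, rng: random.Random, min_words: int = 3) -> str:
    """Keep a random contiguous span of at least min_words words."""
    words = text.split()
    if len(words) <= min_words:
        return text
    length = rng.randint(min_words, len(words))
    start = rng.randint(0, len(words) - length)
    return " ".join(words[start:start + length])


def normalize(text: str) -> str:
    return " ".join(text.lower().split())


def filter_augmented(original: str, candidates: List[str], min_words: int = 3) -> List[str]:
    """Post-processing: drop copies of the original, repeats, and too-short outputs."""
    seen = {normalize(original)}
    kept = []
    for cand in candidates:
        key = normalize(cand)
        if key in seen or len(key.split()) < min_words:
            continue
        seen.add(key)
        kept.append(cand)
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    source = "the quick dog is happy to see large language models"
    candidates = [synonym_replace(source, rng) for _ in range(3)]
    candidates += [random_crop(source, rng) for _ in range(3)]
    print(filter_augmented(source, candidates))
```

Prompt-based and retrieval-based augmentation would replace the two generator functions with LLM or search calls, while a filter stage of this shape could stay in place after them.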

Abstract

The increasing size and complexity of pre-trained language models have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. Insufficient training sets could unexpectedly make the model overfit and fail to cope with complex tasks. Large language models (LLMs) trained on extensive corpora have prominent text generation capabilities, which improve the quality and quantity of data and play a crucial role in data augmentation. Specifically, distinctive prompt templates are given in personalised tasks to guide LLMs in generating the required content. Recent promising retrieval-based techniques further improve the expressive performance of LLMs in data augmentation by introducing external knowledge to enable them to produce more grounded-truth data. This survey provides an in-depth analysis of data augmentation in LLMs, classifying the techniques into Simple Augmentation, Prompt-based Augmentation, Retrieval-based Augmentation and Hybrid Augmentation. We summarise the post-processing approaches in data augmentation, which contributes significantly to refining the augmented data and enabling the model to filter out unfaithful content. Then, we provide the common tasks and evaluation metrics. Finally, we introduce existing challenges and future opportunities that could bring further improvement to data augmentation.

Paper link

Read more

http://twitter.com/omarsar0/status/1886428687350006067

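Original post

This post was compiled with a GPT model and may contain errors, so please also refer to the original post linked at the bottom. If you notice anything awkward or incorrect while reading, please let us know in the comments.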

