[2024/01/15 ~ 01/21] Top ML Papers of the Week

9bow · January 22, 2024, 5:38 AM

PyTorchKR

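This week's selected papers appear to center on three themes: tuning and evaluating language models, reasoning through reinforcement learning, and optimizing AI for specific application areas. Papers such as "AlphaCodium", "RAG vs Fine-tuning", "Tuning Language Models by Proxy", and "Self-Rewarding Language Models" show a research trend focused on language models and fine-tuning. This seems to reflect the growing importance of transfer learning and language model customization in natural language processing over the past few years.
"AlphaGeometry" and "Reasoning with Reinforced Fine-Tuning" both target reasoning: the former guides a symbolic deduction engine with a neural language model, and the latter refines reasoning with reinforcement learning. "MoE-Mamba", by contrast, concerns efficient model scaling rather than reinforcement learning. The appearance of models specialized for particular domains also suggests that AI is developing toward processing and understanding domain-specific information more effectively.
In addition, "Leveraging Large Language Models for NLG Evaluation: A Survey" and "The Unreasonable Effectiveness of Easy Training Data for Hard Tasks" emphasize evaluation methodology for language models and the effectiveness of training data. This may reflect the need for research that measures the performance of large language models objectively and identifies efficient training techniques. Overall, this week's papers focus on advances in language models, model optimization strategies, and finer-grained tuning, and they offer useful clues about future research and development trends in AI.
This post was summarized with a GPT model and may contain errors, so please also refer to the original post linked at the bottom. If you notice anything awkward or incorrect while reading, please let us know in the comments.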

Solving olympiad geometry without human demonstrations

Paper introduction

An AI system that acts as a theorem prover and solves olympiad geometry problems without human demonstrations. It is trained on synthetic data comprising millions of theorems and proofs across different levels of complexity; this data is used to train a neural language model that solves olympiad-level problems and approaches the performance of an average International Mathematical Olympiad (IMO) gold medallist.

Abstract

Proving mathematical theorems at the olympiad level represents a notable milestone in human-level automated reasoning, owing to their reputed difficulty among the world’s best talents in pre-university mathematics. Current machine-learning approaches, however, are not applicable to most mathematical domains owing to the high cost of translating human proofs into machine-verifiable format. The problem is even worse for geometry because of its unique translation challenges, resulting in severe scarcity of training data. We propose AlphaGeometry, a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity. AlphaGeometry is a neuro-symbolic system that uses a neural language model, trained from scratch on our large-scale synthetic data, to guide a symbolic deduction engine through infinite branching points in challenging problems. On a test set of 30 latest olympiad-level problems, AlphaGeometry solves 25, outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist. Notably, AlphaGeometry produces human-readable proofs, solves all geometry problems in the IMO 2000 and 2015 under human expert evaluation and discovers a generalized version of a translated IMO theorem in 2004.
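
The interplay described in the abstract, where a language model suggests constructions and a symbolic engine does the deduction, can be sketched with a toy loop. This is only an illustration under stated assumptions: the rule format, the `prove` function, and the `propose` callback (a stand-in for the neural language model) are hypothetical and are not AlphaGeometry's actual engine or API.

```python
from typing import Callable, FrozenSet, List, Optional, Set, Tuple

Rule = Tuple[FrozenSet[str], str]  # (premises, conclusion); hypothetical format


def deduce_closure(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Toy symbolic engine: forward-chain until no rule adds a new fact."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def prove(
    facts: Set[str],
    rules: List[Rule],
    goal: str,
    propose: Callable[[Set[str], str], Optional[str]],
    max_constructions: int = 3,
) -> Optional[List[str]]:
    """Alternate deduction with proposed auxiliary constructions.

    Returns the constructions that were needed, or None if the goal was not
    reached. `propose` stands in for the language model (an assumption here).
    """
    known = set(facts)
    used: List[str] = []
    for step in range(max_constructions + 1):
        known = deduce_closure(known, rules)
        if goal in known:
            return used
        if step == max_constructions:
            break
        suggestion = propose(known, goal)
        if suggestion is None or suggestion in known:
            return None
        known.add(suggestion)
        used.append(suggestion)
    return None


# Isosceles-triangle example: the goal needs one auxiliary point.
rules = [
    (frozenset({"AB = AC", "M is the midpoint of BC"}),
     "triangle ABM is congruent to triangle ACM"),
    (frozenset({"triangle ABM is congruent to triangle ACM"}),
     "angle ABC = angle ACB"),
]


def propose_midpoint(known: Set[str], goal: str) -> Optional[str]:
    construction = "M is the midpoint of BC"
    return None if construction in known else construction


constructions = prove({"AB = AC"}, rules, "angle ABC = angle ACB", propose_midpoint)
```

In this toy the deduction engine alone gets stuck, and the proof only goes through once the proposer adds the midpoint, which mirrors the division of labor the abstract describes.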

Paper link

https://www.nature.com/articles/s41586-023-06747-5

Further reading

https://x.com/GoogleDeepMind/status/1747651817461125352

Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering

Paper introduction

A code-oriented iterative flow that improves LLMs on code generation. It involves two key steps: i) generating additional data (problem self-reflection and test reasoning) to aid the iterative process, and ii) enriching the public tests with additional AI-generated tests. On the CodeContests validation dataset, GPT-4 pass@5 accuracy increased from 19% with a single well-crafted prompt to 44% with the AlphaCodium flow; it even outperforms AlphaCode while using a significantly smaller computation budget and 4 orders of magnitude fewer LLM calls.

Abstract

Code generation problems differ from common natural language problems - they require matching the exact syntax of the target language, identifying happy paths and edge cases, paying attention to numerous small details in the problem spec, and addressing other code-specific issues and requirements. Hence, many of the optimizations and tricks that have been successful in natural language generation may not be effective for code tasks. In this work, we propose a new approach to code generation by LLMs, which we call AlphaCodium - a test-based, multi-stage, code-oriented iterative flow, that improves the performances of LLMs on code problems. We tested AlphaCodium on a challenging code generation dataset called CodeContests, which includes competitive programming problems from platforms such as Codeforces. The proposed flow consistently and significantly improves results. On the validation set, for example, GPT-4 accuracy (pass@5) increased from 19% with a single well-designed direct prompt to 44% with the AlphaCodium flow. Many of the principles and best practices acquired in this work, we believe, are broadly applicable to general code generation tasks. Full implementation is available at: https://github.com/Codium-ai/AlphaCodium
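
A test-based iterative flow of this kind can be sketched as a loop that runs each candidate against public and AI-generated tests and feeds the failures back to the generator. This is a minimal sketch under assumptions: `iterate_on_tests`, `failing_tests`, and `make_stub_generator` are hypothetical names, the generator is a stub that replays fixed attempts rather than calling an LLM, and the real AlphaCodium flow has more stages than shown here.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Test = Tuple[tuple, object]          # (positional arguments, expected output)
Candidate = Callable[..., object]    # a candidate solution to the problem


def failing_tests(candidate: Candidate, tests: Sequence[Test]) -> List[Test]:
    """Return the tests the candidate fails; an exception counts as a failure."""
    failures: List[Test] = []
    for args, expected in tests:
        try:
            passed = candidate(*args) == expected
        except Exception:
            passed = False
        if not passed:
            failures.append((args, expected))
    return failures


def iterate_on_tests(
    generate: Callable[[List[Test]], Candidate],
    public_tests: Sequence[Test],
    ai_tests: Sequence[Test],
    max_rounds: int = 5,
) -> Optional[Candidate]:
    """Ask for a candidate, run all tests, and feed failures back until it passes."""
    tests = list(public_tests) + list(ai_tests)
    feedback: List[Test] = []
    for _ in range(max_rounds):
        candidate = generate(feedback)
        feedback = failing_tests(candidate, tests)
        if not feedback:
            return candidate
    return None


def make_stub_generator(attempts: Sequence[Candidate]) -> Callable[[List[Test]], Candidate]:
    """Hypothetical stand-in for an LLM call: replays fixed attempts in order.

    It ignores the feedback and raises StopIteration once attempts run out.
    """
    remaining = iter(attempts)

    def generate(feedback: List[Test]) -> Candidate:
        return next(remaining)

    return generate


# The first attempt passes the public test but fails an AI-generated one.
public_tests = [((1, 2), 3)]
ai_tests = [((0, 0), 0), ((-1, 1), 0)]
generate = make_stub_generator([lambda a, b: a * b + 1, lambda a, b: a + b])
solution = iterate_on_tests(generate, public_tests, ai_tests)
```

The example is chosen so that the public test alone would accept the wrong first attempt, which is the situation the extra AI-generated tests are meant to catch.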

Paper link

Further reading

https://github.com/Codium-ai/AlphaCodium

https://x.com/itamar_mar/status/1747957348293824676

RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture

Paper introduction

A report discussing the tradeoffs between RAG and fine-tuning when using LLMs such as Llama 2 and GPT-4. It performs a detailed analysis and highlights insights from applying both pipelines to an agricultural dataset. It observes an accuracy increase of over 6 p.p. from fine-tuning the model; this gain is cumulative with RAG, which increases accuracy by a further 5 p.p.

Abstract

There are two common ways in which developers are incorporating proprietary and domain-specific data when building applications of Large Language Models (LLMs): Retrieval-Augmented Generation (RAG) and Fine-Tuning. RAG augments the prompt with the external data, while fine-Tuning incorporates the additional knowledge into the model itself. However, the pros and cons of both approaches are not well understood. In this paper, we propose a pipeline for fine-tuning and RAG, and present the tradeoffs of both for multiple popular LLMs, including Llama2-13B, GPT-3.5, and GPT-4. Our pipeline consists of multiple stages, including extracting information from PDFs, generating questions and answers, using them for fine-tuning, and leveraging GPT-4 for evaluating the results. We propose metrics to assess the performance of different stages of the RAG and fine-Tuning pipeline. We conduct an in-depth study on an agricultural dataset. Agriculture as an industry has not seen much penetration of AI, and we study a potentially disruptive application - what if we could provide location-specific insights to a farmer? Our results show the effectiveness of our dataset generation pipeline in capturing geographic-specific knowledge, and the quantitative and qualitative benefits of RAG and fine-tuning. We see an accuracy increase of over 6 p.p. when fine-tuning the model and this is cumulative with RAG, which increases accuracy by 5 p.p. further. In one particular experiment, we also demonstrate that the fine-tuned model leverages information from across geographies to answer specific questions, increasing answer similarity from 47% to 72%. Overall, the results point to how systems built using LLMs can be adapted to respond and incorporate knowledge across a dimension that is critical for a specific industry, paving the way for further applications of LLMs in other industrial domains.
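
To make "RAG augments the prompt with the external data" concrete, here is a minimal sketch of retrieval followed by prompt construction. Everything in it is an assumption for illustration: the bag-of-words cosine retriever, the prompt template, the function names, and the placeholder documents are invented and are not the paper's pipeline or data.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, documents: List[str], k: int = 1) -> List[str]:
    """Return the k documents most similar to the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        documents,
        key=lambda doc: cosine(query, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(question: str, documents: List[str], k: int = 1) -> str:
    """Prepend the retrieved context to the question (hypothetical template)."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Placeholder documents, invented purely for illustration.
documents = [
    "Region A: crop X is sown in March.",
    "Region B: crop Y is harvested in October.",
    "General note: store seed in a dry place.",
]
prompt = build_rag_prompt("When is crop X sown in Region A?", documents, k=1)
```

Fine-tuning, by contrast, would change the model's weights and leave the prompt untouched; the two can be combined by sending a prompt like this one to a fine-tuned model.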

Paper link

Further reading

https://x.com/omarsar0/status/1747676541876596779

Self-Rewarding Language Models

Paper introduction

Proposes a self-alignment method that uses the model itself, via LLM-as-a-Judge prompting, to provide its own rewards during training. Iterative DPO is used for instruction-following training on preference pairs built from data generated in a self-instruction creation phase. Fine-tuning a Llama 2 70B model for three iterations of this approach can yield a model that outperforms LLMs such as Claude 2 and Gemini Pro on the AlpacaEval 2.0 leaderboard.

Abstract

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While only a preliminary study, this work opens the door to the possibility of models that can continually improve in both axes.
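
One step of such a loop, turning self-judged candidate responses into a preference pair for DPO training, might look like the sketch below. The pairing rule (highest score versus lowest score, skipping ties), the function name, and the toy `judge` are assumptions for illustration; in the method described above the judge would be the language model itself prompted as LLM-as-a-Judge.

```python
from typing import Callable, Dict, List, Optional


def build_preference_pair(
    prompt: str,
    candidates: List[str],
    judge: Callable[[str, str], float],
) -> Optional[Dict[str, str]]:
    """Score each candidate with the judge and pair the best with the worst.

    Returns None when all scores are equal, since no preference can be derived.
    """
    scored = [(judge(prompt, response), response) for response in candidates]
    best = max(scored, key=lambda item: item[0])
    worst = min(scored, key=lambda item: item[0])
    if best[0] == worst[0]:
        return None
    return {"prompt": prompt, "chosen": best[1], "rejected": worst[1]}


def toy_judge(prompt: str, response: str) -> float:
    """Hypothetical stand-in for LLM-as-a-Judge: rewards longer answers, capped at 5."""
    return float(min(len(response.split()), 5))


pair = build_preference_pair(
    "Explain overfitting.",
    ["ok", "a longer and more detailed answer", "fine then"],
    toy_judge,
)
```

Pairs like this would then be passed to a DPO trainer, and the updated model would generate and judge the next round of data.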

Paper link

Further reading

https://x.com/jaseweston/status/1748158323369611577

Tuning Language Models by Proxy

Paper introduction

Introduces proxy-tuning, a decoding-time algorithm that modifies the logits of a target LLM using the difference between the logits of a small fine-tuned model and those of its small untuned base model. This can enable a larger target base model to perform nearly as well as a fine-tuned version of itself. Applied to Llama2-70B using proxies of only 7B size, proxy-tuning closes 88% of the gap between Llama2-70B and its tuned chat version.

Abstract

Despite the general capabilities of large pretrained language models, they consistently benefit from further adaptation to better achieve desired behaviors. However, tuning these models has become increasingly resource-intensive, or impossible when model weights are private. We introduce proxy-tuning, a lightweight decoding-time algorithm that operates on top of black-box LMs to achieve the result of directly tuning the model, but by accessing only its prediction over the output vocabulary. Our method instead tunes a smaller LM, then applies the difference between the predictions of the small tuned and untuned LMs to shift the original predictions of the base model in the direction of tuning, while retaining the benefits of larger scale pretraining. In experiments, when we apply proxy-tuning to Llama2-70B using proxies of only 7B size, we can close 88% of the gap between Llama2-70B and its truly-tuned chat version, when evaluated across knowledge, reasoning, and safety benchmarks. Interestingly, when tested on TruthfulQA, proxy-tuned models are actually more truthful than directly tuned models, possibly because decoding-time guidance better retains the model's factual knowledge. We then demonstrate the generality of proxy-tuning by applying it for domain adaptation on code, and task-specific finetuning on question-answering and math problems. Our work demonstrates the promise of using small tuned LMs to efficiently customize large, potentially proprietary LMs through decoding-time guidance.
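
The logit arithmetic behind this idea can be sketched for a single decoding step. The sketch assumes all three models share one vocabulary and that their next-token logits are already available as plain lists; the function names are hypothetical and no real model is called.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(
    large_base: List[float],
    small_tuned: List[float],
    small_base: List[float],
) -> List[float]:
    """Shift the large base model's logits by (small tuned - small untuned)."""
    if not (len(large_base) == len(small_tuned) == len(small_base)):
        raise ValueError("all models must share the same vocabulary size")
    shifted = [
        base + (tuned - untuned)
        for base, tuned, untuned in zip(large_base, small_tuned, small_base)
    ]
    return softmax(shifted)


# Toy 3-token vocabulary: tuning the small model moved mass toward token 1.
large_base = [2.0, 1.0, 0.0]
small_tuned = [0.0, 2.0, 0.0]
small_base = [1.0, 0.0, 0.0]
probs = proxy_tuned_distribution(large_base, small_tuned, small_base)
```

Here the large base model alone prefers token 0, while the shifted distribution prefers token 1. Only the output logits of the large model are needed, which is why the approach can work with models whose weights are private.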

Paper link

Further reading

https://x.com/rasbt/status/1748021765790376385

ReFT: Reasoning with Reinforced Fine-Tuning

Paper introduction

Proposes ReFT (Reinforced Fine-Tuning), an approach to enhance the generalizability of LLMs for reasoning. It starts with supervised fine-tuning (SFT) and then applies online reinforcement learning (RL) for further refinement while automatically sampling reasoning paths to learn from. This differs from RLHF in that it does not use a reward model learned from human-labeled data. ReFT demonstrates improved performance and generalization on math problem-solving.

Abstract

One way to enhance the reasoning capability of Large Language Models (LLMs) is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT) annotations. This approach does not show sufficiently strong generalization ability, however, because the training only relies on the given CoT data. In math problem-solving, for example, there is usually only one annotated reasoning path for each question in the training data. Intuitively, it would be better for the algorithm to learn from multiple annotated reasoning paths given a question. To address this issue, we propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of learning LLMs for reasoning, with math problem-solving as an example. ReFT first warmups the model with SFT, and then employs on-line reinforcement learning, specifically the PPO algorithm in this paper, to further fine-tune the model, where an abundance of reasoning paths are automatically sampled given the question and the rewards are naturally derived from the ground-truth answers. Extensive experiments on GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT, and the performance can be potentially further boosted by combining inference-time strategies such as majority voting and re-ranking. Note that ReFT obtains the improvement by learning from the same training questions as SFT, without relying on extra or augmented training questions. This indicates a superior generalization ability for ReFT.
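
Two pieces of this description lend themselves to a small sketch: deriving a reward directly from the ground-truth answer, and majority voting over sampled reasoning paths at inference time. The sketch makes several assumptions: answers are taken to be the last number in a path, the reward is strictly binary, and the function names are hypothetical; the paper's actual answer extraction and reward shaping may differ, and the PPO update itself is not shown.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_answer(reasoning_path: str) -> Optional[str]:
    """Take the last number in the text as the final answer (an assumption)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reasoning_path)
    return numbers[-1] if numbers else None


def reward(reasoning_path: str, ground_truth: str) -> float:
    """Binary reward derived from the ground-truth answer; no learned reward model."""
    return 1.0 if extract_answer(reasoning_path) == ground_truth else 0.0


def majority_vote(reasoning_paths: List[str]) -> Optional[str]:
    """Inference-time strategy: return the most common extracted answer."""
    answers = [a for a in map(extract_answer, reasoning_paths) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None


# Three sampled reasoning paths for the same question (toy strings).
paths = ["3 + 4 = 7", "4 + 3 gives 7", "3 * 4 = 12"]
rewards = [reward(path, "7") for path in paths]
voted = majority_vote(paths)
```

Because the reward only checks the final answer, different reasoning paths that reach the correct result are all reinforced, which is how the method can learn from more than the single annotated path per question.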

Paper link

Further reading

https://x.com/_akhaliq/status/1747820246268887199

Leveraging Large Language Models for NLG Evaluation: A Survey

Paper introduction

Thoroughly surveys methodologies for LLM-based NLG evaluation and explores their strengths and limitations; provides a taxonomy of approaches involving prompt engineering or calibrating open-source LLMs for evaluation.

Abstract

In the rapidly evolving domain of Natural Language Generation (NLG) evaluation, introducing Large Language Models (LLMs) has opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance. This survey aims to provide a thorough overview of leveraging LLMs for NLG evaluation, a burgeoning area that lacks a systematic analysis. We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods. Our detailed exploration includes critically assessing various LLM-based methodologies, as well as comparing their strengths and limitations in evaluating NLG outputs. By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this survey seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.

Paper link

Further reading

https://x.com/omarsar0/status/1748016227090305167

Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models

Paper introduction

Proposes a framework that leverages a model itself to explain its internal representations. It decodes information from LLM hidden representations by "patching" them into a separate inference pass that encourages the extraction of that information. It can be used to answer questions about an LLM's computation and even to fix latent multi-hop reasoning errors.

Abstract

Inspecting the information encoded in hidden representations of large language models (LLMs) can explain models' behavior and verify their alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer a wide range of questions about an LLM's computation. We show that prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as instances of this framework. Moreover, several of their shortcomings such as failure in inspecting early layers or lack of expressivity can be mitigated by Patchscopes. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities such as using a more capable model to explain the representations of a smaller model, and unlocks new applications such as self-correction in multi-hop reasoning.
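
The basic mechanic of patching, capturing a hidden state in one forward pass and overwriting a hidden state in a separate pass, can be shown with a deliberately tiny toy. This is not a transformer and not the Patchscopes implementation: the "model" is a list of position-wise scalar functions, and `run` and its `patch` argument are hypothetical names chosen for illustration.

```python
from typing import Callable, List, Optional, Tuple

Layer = Callable[[float], float]   # toy layer applied to each position independently
Patch = Tuple[int, int, float]     # (layer index, position, replacement value)


def run(
    layers: List[Layer],
    inputs: List[float],
    patch: Optional[Patch] = None,
) -> Tuple[List[float], List[List[float]]]:
    """Run the toy model and return (final states, states after each layer).

    If `patch` is given, the hidden state at that position is overwritten
    right after the given layer has been applied.
    """
    hidden = list(inputs)
    trace: List[List[float]] = []
    for index, layer in enumerate(layers):
        hidden = [layer(h) for h in hidden]
        if patch is not None and patch[0] == index:
            hidden[patch[1]] = patch[2]
        trace.append(list(hidden))
    return hidden, trace


layers = [lambda h: h + 1, lambda h: h * 2, lambda h: h - 3]

# Source pass: record the hidden state at layer 0, position 0.
source_final, source_trace = run(layers, [5.0, 1.0])
captured = source_trace[0][0]

# Target pass: inject the captured state at layer 0, position 1.
target_final, _ = run(layers, [0.0, 0.0], patch=(0, 1, captured))
```

After patching, the target pass continues computing from the source's representation, so its output at the patched position reflects information carried over from the source pass. In the framework described above, the target pass would use a prompt designed to make the model verbalize that information.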

Paper link

Further reading

https://x.com/ghandeharioun/status/1746946621215003041

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks

Paper introduction

Suggests that language models often generalize well from easy to hard data, i.e., easy-to-hard generalization. It argues that it can be better to train on easy data than on hard data, even when the goal is to improve performance on hard data, and suggests that the scalable oversight problem may be easier than previously thought.

Abstract

How can we train models to perform well on hard test data when hard training data is by definition difficult to label correctly? This question has been termed the scalable oversight problem and has drawn increasing attention as language models have continually improved. In this paper, we present the surprising conclusion that current language models often generalize relatively well from easy to hard data, even performing as well as "oracle" models trained on hard data. We demonstrate this kind of easy-to-hard generalization using simple training methods like in-context learning, linear classifier heads, and QLoRA for seven different measures of datapoint hardness, including six empirically diverse human hardness measures (like grade level) and one model-based measure (loss-based). Furthermore, we show that even if one cares most about model performance on hard data, it can be better to collect and train on easy data rather than hard data, since hard data is generally noisier and costlier to collect. Our experiments use open models up to 70b in size and four publicly available question-answering datasets with questions ranging in difficulty from 3rd grade science questions to college level STEM questions and general-knowledge trivia. We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied, suggesting the scalable oversight problem may be easier than previously thought. Our code is available in the GitHub repository allenai/easy-to-hard-generalization.
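
The evaluation protocol implied here, train only on easy examples and measure accuracy only on hard ones, can be sketched as follows. The sketch assumes grade level as the hardness measure, leaves a middle band unused, and uses a trivial majority-label "trainer" as a placeholder; the field names, thresholds, and toy examples are all hypothetical and do not come from the paper's datasets.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Dict[str, object]               # keys: "question", "answer", "grade"
Predictor = Callable[[str], str]


def split_by_hardness(
    examples: List[Example], easy_max: int, hard_min: int
) -> Tuple[List[Example], List[Example]]:
    """Easy = grade <= easy_max, hard = grade >= hard_min; the rest is unused."""
    easy = [e for e in examples if e["grade"] <= easy_max]
    hard = [e for e in examples if e["grade"] >= hard_min]
    return easy, hard


def accuracy(predict: Predictor, examples: List[Example]) -> float:
    """Fraction of examples whose predicted answer matches the label."""
    if not examples:
        raise ValueError("no examples to evaluate")
    correct = sum(predict(e["question"]) == e["answer"] for e in examples)
    return correct / len(examples)


def easy_to_hard_accuracy(
    train: Callable[[List[Example]], Predictor],
    examples: List[Example],
    easy_max: int,
    hard_min: int,
) -> float:
    """Train on the easy split only, then evaluate on the hard split only."""
    easy, hard = split_by_hardness(examples, easy_max, hard_min)
    return accuracy(train(easy), hard)


def majority_label_trainer(train_set: List[Example]) -> Predictor:
    """Placeholder trainer: always predicts the most common training answer."""
    label = Counter(e["answer"] for e in train_set).most_common(1)[0][0]
    return lambda question: label


# Toy examples, invented for illustration only.
examples = [
    {"question": "q1", "answer": "A", "grade": 3},
    {"question": "q2", "answer": "A", "grade": 4},
    {"question": "q3", "answer": "B", "grade": 3},
    {"question": "q4", "answer": "A", "grade": 8},
    {"question": "q5", "answer": "B", "grade": 12},
]
score = easy_to_hard_accuracy(majority_label_trainer, examples, easy_max=5, hard_min=8)
```

Swapping `majority_label_trainer` for a real method (a linear head or QLoRA fine-tune, as the abstract mentions) and comparing against a model trained on the hard split would give the kind of easy-versus-oracle comparison the paper reports.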

Paper link

Further reading

https://x.com/peterbhase/status/1747301128683839998

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

Paper introduction

An approach to efficiently scale LLMs by combining state space models (SSMs) with mixture of experts (MoE). The resulting model, MoE-Mamba, outperforms both Mamba and Transformer-MoE; it reaches the same performance as Mamba in 2.2x fewer training steps while preserving Mamba's inference performance gains over the Transformer.

Abstract

State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based LLMs, including recent state-of-the-art open-source models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable, Transformer-like performance. Our model, MoE-Mamba, outperforms both Mamba and Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in 2.2x less training steps while preserving the inference performance gains of Mamba against the Transformer.
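
For readers unfamiliar with mixture of experts, the routing step can be sketched for a single token. This illustrates generic top-1 routing only: the choice of top-1 gating, the function names, and the toy experts are assumptions, and the Mamba (SSM) layers that MoE-Mamba combines with such routing are not modeled here.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_top1(
    token: Vector, router_weights: List[Vector], experts: List[Expert]
) -> Tuple[int, Vector]:
    """Route one token to its highest-scoring expert and scale by the gate value.

    router_weights holds one weight vector per expert; the router logit for an
    expert is the dot product of its weight vector with the token.
    """
    logits = [sum(w * x for w, x in zip(weights, token)) for weights in router_weights]
    gates = softmax(logits)
    chosen = max(range(len(gates)), key=gates.__getitem__)
    output = [gates[chosen] * value for value in experts[chosen](token)]
    return chosen, output


# Two toy experts; only the selected one is evaluated for this token.
experts = [
    lambda v: [x + 1.0 for x in v],
    lambda v: [2.0 * x for x in v],
]
router_weights = [[0.0, 1.0], [2.0, 0.0]]
chosen, output = moe_top1([1.0, 0.0], router_weights, experts)
```

Since only one expert runs per token, the parameter count can grow with the number of experts while per-token compute stays roughly constant, which is the general appeal of MoE for scaling.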

Paper link

Further reading

https://x.com/arankomatsuzaki/status/1744552215946100969

Original post
