[2024/12/30 ~ 2025/01/05] Top ML Papers of the Week

9bow · January 6, 2025, 1:31 AM

PyTorchKR

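A clear trend among this week's selected papers is the amount of work on language models, particularly natural language processing (NLP) and logical reasoning. Examples include "OLMo 2", "On the Overthinking of LLMs", "Machine-Assisted Proof", and "Measuring Higher Level Mathematical Reasoning", each of which covers a different aspect of language models. These papers focus on methods for improving the comprehension, logical reasoning, and mathematical thinking of NLP systems.
Memory-related research also appears to be drawing attention. "Memory Layers at Scale", for example, studies memory layers that add parameters to a model through a trainable key-value lookup without increasing FLOPs, scaled up to 128B memory parameters. This underscores how important it is to store and retrieve information cheaply as models and data keep growing.
Taken together, these trends show that research on AI, and on NLP and logical reasoning in particular, is deepening. They suggest that this line of work may play an important role in helping AI systems emulate complex human thinking and reasoning, and follow-up research will likely examine how much further the performance of such systems can be pushed. Scaling is also one of the core challenges for increasingly complex models, and active research in this area has the potential to make computation and data processing considerably more efficient.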

Agents Are Not Enough

Paper introduction

Argues that while AI agents show promise, they alone cannot address the challenges in autonomous task execution; proposes a new ecosystem combining three key components: Agents (narrow, purpose-driven modules for specific tasks), Sims (digital representations of user preferences and behaviors), and Assistants (programs that coordinate between users, Sims, and Agents).

Abstract

In the midst of the growing integration of Artificial Intelligence (AI) into various aspects of our lives, agents are experiencing a resurgence. These autonomous programs that act on behalf of humans are neither new nor exclusive to the mainstream AI movement. By exploring past incarnations of agents, we can understand what has been done previously, what worked, and more importantly, what did not pan out and why. This understanding lets us examine what distinguishes the current focus on agents. While generative AI is appealing, this technology alone is insufficient to make new generations of agents more successful. To make the current wave of agents effective and sustainable, we envision an ecosystem that includes not only agents but also Sims, which represent user preferences and behaviors, as well as Assistants, which directly interact with the user and coordinate the execution of user tasks with the help of the agents.
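
One way to picture the division of roles between Agents, Sims, and Assistants is the minimal sketch below. Every class, field, and function name here is hypothetical; the abstract describes the ecosystem only conceptually and does not prescribe an implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Sim:
    """Hypothetical stand-in for a user's preferences and behaviors."""
    preferences: Dict[str, str] = field(default_factory=dict)


@dataclass
class Agent:
    """Hypothetical narrow, purpose-driven module for one kind of task."""
    task: str
    run: Callable[[str, Sim], str]


class Assistant:
    """Hypothetical coordinator between the user, the Sim, and the Agents."""

    def __init__(self, sim: Sim):
        self.sim = sim
        self.agents: Dict[str, Agent] = {}

    def register(self, agent: Agent) -> None:
        self.agents[agent.task] = agent

    def handle(self, task: str, request: str) -> str:
        # Route the user's request to the agent registered for this task.
        agent = self.agents.get(task)
        if agent is None:
            return f"no agent registered for task '{task}'"
        return agent.run(request, self.sim)


def book_table(request: str, sim: Sim) -> str:
    # The agent consults the Sim instead of asking the user again.
    cuisine = sim.preferences.get("cuisine", "any")
    return f"booking for '{request}' with cuisine preference: {cuisine}"


assistant = Assistant(Sim(preferences={"cuisine": "vegetarian"}))
assistant.register(Agent(task="dining", run=book_table))
result = assistant.handle("dining", "dinner on Friday")
print(result)
```

In this sketch the user talks only to the Assistant, which holds the Sim and passes it to whichever Agent handles the task.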

Paper link

Further reading

https://x.com/omarsar0/status/1874196827115061741

2 OLMo 2 Furious

Paper introduction

Introduces an enhanced architecture, training methods, and a specialized data mixture called Dolmino Mix 1124; the fully transparent model, released at 7B and 13B parameter scales with complete training data and code, matches or outperforms similar open-weight models like Llama 3.1 and Qwen 2.5 while using fewer computational resources, and its instruction-tuned version (OLMo 2-Instruct) remains competitive with comparable models.

Abstract

We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes dense autoregressive models with improved architecture and training recipe, pretraining data mixtures, and instruction tuning recipes. Our modified model architecture and training recipe achieve both better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e. specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from Tülu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance to compute, often matching or outperforming open-weight only models like Llama 3.1 and Qwen 2.5 while using fewer FLOPs and with fully transparent training data, code, and recipe. Our fully open OLMo 2-Instruct models are competitive with or surpassing open-weight only models of comparable size, including Qwen 2.5, Llama 3.1 and Gemma 2. We release all OLMo 2 artifacts openly: models at 7B and 13B scales, both pretrained and post-trained, including their full training data, training code and recipes, training logs and thousands of intermediate checkpoints. The final instruction model is available on the Ai2 Playground as a free research demo.

Paper link

Further reading

https://x.com/soldni/status/1875266934943649808

Machine-Assisted Proof

Paper introduction

Examines how mathematicians have long used machines to assist with mathematics research and discusses recent AI tools that are transforming mathematical proof assistance.

Paper link

Further reading

https://x.com/omarsar0/status/1873045937259462656

Measuring Higher Level Mathematical Reasoning

Paper introduction

Introduces Putnam-AXIOM, a new math reasoning benchmark with 236 Putnam Competition problems and 52 variations; even the best model considered (OpenAI's o1-preview) achieves only 41.95% accuracy on original problems and performs significantly worse on variations.

Abstract

As large language models (LLMs) continue to advance, many existing benchmarks designed to evaluate their reasoning capabilities are becoming saturated. Therefore, we present the Putnam-AXIOM Original benchmark consisting of 236 mathematical problems from the William Lowell Putnam Mathematical Competition, along with detailed step-by-step solutions. To preserve the Putnam-AXIOM benchmark's validity and mitigate potential data contamination, we created the Putnam-AXIOM Variation benchmark with functional variations of 52 problems. By programmatically altering problem elements like variables and constants, we can generate unlimited novel, equally challenging problems not found online. We see that almost all models have significantly lower accuracy in the variations than the original problems. Our results reveal that OpenAI's o1-preview, the best performing model, achieves merely 41.95% accuracy on the Putnam-AXIOM Original but experiences around a 30% reduction in accuracy on the variations' dataset when compared to corresponding original problems.
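
The idea of a functional variation can be sketched with a toy problem. The code below is not taken from the benchmark: the problem template, the constant ranges, and all names are assumptions chosen only to show how resampling constants yields a new statement together with a programmatically computed answer.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    text: str
    answer: int


def make_variation(rng: random.Random) -> Problem:
    """Sample new constants and rebuild both the statement and its answer."""
    n = rng.randint(10, 50)
    k = rng.randint(2, 9)
    text = f"Compute the sum of the first {n} positive multiples of {k}."
    # k + 2k + ... + nk = k * n * (n + 1) / 2, and n * (n + 1) is always even.
    answer = k * n * (n + 1) // 2
    return Problem(text, answer)


rng = random.Random(0)  # fixed seed so the generated set is reproducible
variations = [make_variation(rng) for _ in range(3)]
for p in variations:
    print(p.text, "->", p.answer)
```

Because the answer is derived from the same constants as the statement, each sampled variation comes with a ground truth that cannot simply be looked up online.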

Paper link

Further reading

https://x.com/omarsar0/status/1874489752243597635

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Paper introduction

Proposes a self-training strategy to mitigate overthinking in o1-like LLMs; applied to QwQ-32B-Preview, it reduces token output by 48.6% while maintaining accuracy on the widely used MATH500 test set.

Abstract

The remarkable performance of models like the OpenAI o1 can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: How to intelligently and efficiently scale computational resources during testing. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated for simple problems with minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by o1-like models. Using a self-training paradigm, we propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of testsets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME.
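
The abstract does not spell out its efficiency metrics, so the following is only a rough illustration of what an outcome-oriented measure could look like: the share of generated tokens spent up to the first correct solution. The function name, the input format, and the numbers are all assumptions made for this sketch.

```python
from typing import List, Tuple


def outcome_efficiency(solutions: List[Tuple[int, bool]]) -> float:
    """Share of generated tokens spent up to and including the first correct solution.

    Each item is (token_count, is_correct) for one solution attempt, in the
    order the model produced it. Returns 0.0 if no attempt is correct.
    """
    total = sum(tokens for tokens, _ in solutions)
    used = 0
    for tokens, is_correct in solutions:
        used += tokens
        if is_correct:
            return used / total
    return 0.0


# Made-up response: the first attempt is already correct, and the model
# then re-derives the same answer three more times.
attempts = [(40, True), (120, True), (90, True), (150, True)]
efficiency = outcome_efficiency(attempts)
print(f"{efficiency:.2f}")
```

A low value under this hypothetical measure means most tokens were generated after the answer was already reached, which is the kind of waste the paper calls overthinking.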

Paper link

Further reading

https://x.com/omarsar0/status/1874848885170176364

MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes

Paper introduction

Introduces MEDEC, a publicly available benchmark for medical error detection and correction in clinical notes, covering five types of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism); it consists of 3,848 clinical texts, including 488 clinical notes from three US hospital systems; experimental results show that Claude 3.5 Sonnet performs better at detecting errors while o1-preview is better at correcting errors.

Abstract

Several studies showed that Large Language Models (LLMs) can answer medical questions correctly, even outperforming the average human score in some medical exams. However, to our knowledge, no study has been conducted to assess the ability of language models to validate existing or generated medical text for correctness and consistency. In this paper, we introduce MEDEC (GitHub - abachaa/MEDEC), the first publicly available benchmark for medical error detection and correction in clinical notes, covering five types of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism). MEDEC consists of 3,848 clinical texts, including 488 clinical notes from three US hospital systems that were not previously seen by any LLM. The dataset has been used for the MEDIQA-CORR shared task to evaluate seventeen participating systems [Ben Abacha et al., 2024]. In this paper, we describe the data creation methods and we evaluate recent LLMs (e.g., o1-preview, GPT-4, Claude 3.5 Sonnet, and Gemini 2.0 Flash) for the tasks of detecting and correcting medical errors requiring both medical knowledge and reasoning capabilities. We also conducted a comparative study where two medical doctors performed the same task on the MEDEC test set. The results showed that MEDEC is a sufficiently challenging benchmark to assess the ability of models to validate existing or generated notes and to correct medical errors. We also found that although recent LLMs have a good performance in error detection and correction, they are still outperformed by medical doctors in these tasks. We discuss the potential factors behind this gap, the insights from our experiments, the limitations of current evaluation metrics, and share potential pointers for future research.

Paper link

Further reading

https://x.com/omarsar0/status/1875232390265577675

1.58-bit FLUX

Paper introduction

Presents the first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e., values in {-1, 0, +1}); the method relies on self-supervision from the FLUX.1-dev model and maintains comparable performance for generating 1024 x 1024 images as the original FLUX model.

Abstract

We present 1.58-bit FLUX, the first successful approach to quantizing the state-of-the-art text-to-image generation model, FLUX.1-dev, using 1.58-bit weights (i.e., values in {-1, 0, +1}) while maintaining comparable performance for generating 1024 x 1024 images. Notably, our quantization method operates without access to image data, relying solely on self-supervision from the FLUX.1-dev model. Additionally, we develop a custom kernel optimized for 1.58-bit operations, achieving a 7.7x reduction in model storage, a 5.1x reduction in inference memory, and improved inference latency. Extensive evaluations on the GenEval and T2I Compbench benchmarks demonstrate the effectiveness of 1.58-bit FLUX in maintaining generation quality while significantly enhancing computational efficiency.
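
The "1.58-bit" figure comes from the three possible weight values: log2(3) is about 1.58 bits per weight. The abstract does not describe the quantizer itself, so the sketch below uses a generic absolute-mean rounding scheme as a stand-in; the function names and the sample weights are assumptions, and this should not be read as the paper's method.

```python
import math
from typing import List, Tuple


def ternarize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Map each weight to -1, 0, or +1 using an absolute-mean scale."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate weights from ternary codes and the shared scale."""
    return [q * scale for q in quantized]


bits_per_weight = math.log2(3)  # three states per weight, about 1.58 bits

weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]  # made-up example values
q, scale = ternarize(weights)
approx = dequantize(q, scale)
print(q)  # [1, 0, 1, -1, 0, -1]
```

Only the ternary codes and one scale per group of weights need to be stored, which is where the storage savings of such schemes come from.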

Paper link

Further reading

https://x.com/_akhaliq/status/1873782702178263549

Aviary: training language agents on challenging scientific tasks

Paper introduction

Introduces Aviary, an extensible open-source gymnasium that can help build language agents that exceed the performance of zero-shot frontier LLMs and even humans on several challenging scientific tasks.

Abstract

Solving complex real-world tasks requires cycles of actions and observations. This is particularly true in science, where tasks require many cycles of analysis, tool use, and experimentation. Language agents are promising for automating intellectual tasks in science because they can interact with tools via natural language or code. Yet their flexibility creates conceptual and practical challenges for software implementations, since agents may comprise non-standard components such as internal reasoning, planning, tool usage, as well as the inherent stochasticity of temperature-sampled language models. Here, we introduce Aviary, an extensible gymnasium for language agents. We formalize agents as policies solving language-grounded partially observable Markov decision processes, which we term language decision processes. We then implement five environments, including three challenging scientific environments: (1) manipulating DNA constructs for molecular cloning, (2) answering research questions by accessing scientific literature, and (3) engineering protein stability. These environments were selected for their focus on multi-step reasoning and their relevance to contemporary biology research. Finally, with online training and scaling inference-time compute, we show that language agents backed by open-source, non-frontier LLMs can match and exceed both frontier LLM agents and human experts on multiple tasks at up to 100x lower inference cost.
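
The cycle of actions and observations described above can be illustrated with a toy gym-style loop in which both observations and actions are text. This is not the Aviary API: the class, the `calc:` and `submit:` action formats, and the scripted policy standing in for a language model are all hypothetical.

```python
from typing import Callable, Tuple


class ToyLanguageEnv:
    """Hypothetical text environment: observations and actions are strings."""

    def __init__(self, question: str, answer: str, max_steps: int = 5):
        self.question = question
        self.answer = answer
        self.max_steps = max_steps
        self.steps = 0

    def reset(self) -> str:
        self.steps = 0
        return self.question

    def step(self, action: str) -> Tuple[str, float, bool]:
        """Return (observation, reward, done) for a textual action."""
        self.steps += 1
        if action.startswith("calc:"):
            # Tool call: add two integers and return the result as text.
            a, b = action[len("calc:"):].split("+")
            return str(int(a) + int(b)), 0.0, self.steps >= self.max_steps
        if action.startswith("submit:"):
            correct = action[len("submit:"):].strip() == self.answer
            return "done", 1.0 if correct else 0.0, True
        return "unknown action", 0.0, self.steps >= self.max_steps


def scripted_policy(observation: str) -> str:
    """Stand-in for a language model: use the tool first, then submit."""
    if observation.startswith("What is"):
        return "calc:17+25"
    return f"submit:{observation}"


def rollout(env: ToyLanguageEnv, policy: Callable[[str], str]) -> float:
    """Run one episode and return the accumulated reward."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total


env = ToyLanguageEnv("What is 17 + 25?", "42")
total_reward = rollout(env, scripted_policy)
print(total_reward)  # 1.0
```

The policy only ever sees the text the environment returns, never its internal answer, which is the partially observable setting the abstract refers to.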

Paper link

Further reading

https://x.com/omarsar0/status/1875270927304511535

Memory Layers at Scale

Paper introduction

Demonstrates the effectiveness of memory layers at scale; shows that models with these memory layers outperform traditional dense models using half the computation, particularly in factual tasks; includes a parallelizable memory layer implementation that scales to 128B memory parameters and 1 trillion training tokens, tested against base models up to 8B parameters.

Abstract

Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
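
A key-value lookup with sparse activation can be sketched in a few lines. This is a simplified illustration rather than the paper's implementation: the function name, the tiny key and value tables, and the choice of softmax over the top-k scores are assumptions, and the sketch scores every key, which a large-scale implementation would need to avoid.

```python
import math
from typing import List


def memory_lookup(query: List[float],
                  keys: List[List[float]],
                  values: List[List[float]],
                  top_k: int = 2) -> List[float]:
    """Return a softmax-weighted sum of the values whose keys best match the query."""
    # Dot-product similarity between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Keep only the indices of the top_k best-matching keys.
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Softmax over the selected scores (shifted by the max for stability).
    max_score = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - max_score) for i in top]
    total = sum(exps)
    out = [0.0] * len(values[0])
    for weight, i in zip(exps, top):
        for d, v in enumerate(values[i]):
            out[d] += (weight / total) * v
    return out


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
values = [[10.0], [20.0], [30.0], [40.0]]
output = memory_lookup([5.0, 0.0], keys, values, top_k=2)
print(output)
```

Only `top_k` value rows contribute to the output, so the value table can grow without the output computation growing with it; in a trained model both `keys` and `values` would be learned parameters.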

Paper link

Further reading

https://x.com/AIatMeta/status/1874897646542033030

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

Paper introduction

Presents a novel approach to improving medical reasoning in language models by using a medical verifier to validate model outputs and guide the development of complex reasoning abilities; the system employs a two-stage approach combining fine-tuning and reinforcement learning with verifier-based rewards, achieving superior performance over existing models while using only 40,000 verifiable medical problems.

Abstract

The breakthrough of OpenAI o1 highlights the potential of enhancing reasoning to improve LLMs. Yet, most research in reasoning has focused on mathematical tasks, leaving domains like medicine underexplored. The medical domain, though distinct from mathematics, also demands robust reasoning to provide reliable answers, given the high standards of healthcare. However, verifying medical reasoning is challenging, unlike that in mathematics. To address this, we propose verifiable medical problems with a medical verifier to check the correctness of model outputs. This verifiable nature enables advancements in medical reasoning through a two-stage approach: (1) using the verifier to guide the search for a complex reasoning trajectory for fine-tuning LLMs, (2) applying reinforcement learning (RL) with verifier-based rewards to enhance complex reasoning further. Finally, we introduce HuatuoGPT-o1, a medical LLM capable of complex reasoning, which outperforms general and medical-specific baselines using only 40K verifiable problems. Experiments show complex reasoning improves medical problem-solving and benefits more from RL. We hope our approach inspires advancements in reasoning across medical and other specialized domains.
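
The two uses of the verifier can be sketched as follows. The abstract does not say how the medical verifier is built, so the string-matching verifier below is a deliberately crude stand-in, and the function names and the example answers are made up for illustration.

```python
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def verifier(model_answer: str, reference: str) -> bool:
    """Hypothetical verifier: accept an answer that matches the reference."""
    return normalize(model_answer) == normalize(reference)


def search_trajectory(candidates: List[str], reference: str) -> Optional[str]:
    """Stage 1 sketch: keep the first candidate answer the verifier accepts."""
    for answer in candidates:
        if verifier(answer, reference):
            return answer
    return None


def reward(model_answer: str, reference: str) -> float:
    """Stage 2 sketch: binary verifier-based reward for reinforcement learning."""
    return 1.0 if verifier(model_answer, reference) else 0.0


# Made-up example data, not taken from the paper's dataset.
reference = "Iron deficiency anemia"
candidates = ["Vitamin B12 deficiency", "iron deficiency  anemia"]
accepted = search_trajectory(candidates, reference)
print(accepted, reward("Iron Deficiency Anemia", reference))
```

In stage 1 the accepted trajectories would become fine-tuning data, and in stage 2 the same check supplies the reward signal.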

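Paper link

Further reading

https://x.com/_akhaliq/status/1873572891092283692

Original

This post was compiled with a GPT model and may contain errors, so please also refer to the original linked at the bottom of the post. If you notice anything awkward or incorrect while reading, please let us know in the comments.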

