[2024/09/30 ~ 10/06] Top ML Papers of the Week

9bow · October 7, 2024, 7:21 AM

PyTorchKR

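Two trends stand out among this week's selected papers. First, a large share of the work concerns large language models (LLMs). Papers such as 'LLMs Know More Than They Show' and 'Not All LLM Reasoners Are Created Equal' fall into this category. They focus on the capabilities and limitations of LLMs and on how their performance varies across settings, reflecting the sustained interest of academia and industry in how LLMs are evolving and where they can be applied in NLP.
Second, several papers address neural network architecture and design, including 'Were RNNs All We Needed?' and 'Archon: An Architecture Search Framework for Inference-Time Techniques'. These studies ask whether RNNs and other established architectures remain viable alongside current methods and how they can be used more efficiently, in search of new ways to optimize model performance in deep learning.
On the first trend, LLMs have become a central area of recent AI research. Because each new finding or advance directly affects NLP, researchers keep returning to the topic, and a growing body of work tries to make better use of the knowledge these models acquire during pre-training or to compensate for their weaknesses so that they can be applied more broadly.
On the second trend, research on architecture and design bears directly on model performance. Efficiency, compute cost, and model compression matter a great deal in real-world applications, and finding well-suited architectures is essential to realizing the potential of AI, so continued progress in this area is just as important.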

Movie Gen

Paper summary

A set of foundation models to generate high-quality, 1080p HD videos, including different aspect ratios and synchronized audio; the 30B parameter model supports a context length of 73K video tokens, which enables generation of 16-second videos at 16fps; it also presents a 13B parameter video-to-audio generation model and a novel video editing model that’s attained via post-training; achieves state-of-the-art performance on tasks such as text-to-video synthesis, video personalization, video-to-audio generation and more.
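The figures in the summary can be related with a little arithmetic. The sketch below only restates the quoted numbers (16 seconds, 16fps, roughly 73K tokens); the constant names are illustrative, and 73K is treated as exactly 73,000 purely for the calculation.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the summary.
# Assumption: "73K" is taken as exactly 73,000 tokens for illustration.
CONTEXT_TOKENS = 73_000
DURATION_SECONDS = 16
FRAMES_PER_SECOND = 16

# Number of output frames in one generated clip.
total_frames = DURATION_SECONDS * FRAMES_PER_SECOND

# Average token budget per second of generated video.
tokens_per_second = CONTEXT_TOKENS / DURATION_SECONDS

print(f"frames per clip: {total_frames}")
print(f"tokens per second of video: {tokens_per_second:.1f}")
```

This says nothing about how the model actually lays tokens out over space and time; it only shows the scale of context the 30B model has to attend over for a single clip.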

Paper link

https://ai.meta.com/static-resource/movie-gen-research-paper

Further reading

https://discuss.pytorch.kr/t/meta-ai-movie-gen/5296

https://x.com/AIatMeta/status/1842188252541043075

Were RNNs All We Needed?

Paper summary

Revisits RNNs and shows that removing the hidden-state dependencies from the input, forget, and update gates lets RNNs be trained efficiently in parallel; with this change, architectures such as LSTMs and GRUs no longer require backpropagation through time (BPTT). The authors introduce minLSTMs and minGRUs that train 175x faster at a sequence length of 512.

Abstract

The scalability limitations of Transformers regarding sequence length have renewed interest in recurrent sequence models that are parallelizable during training. As a result, many novel recurrent architectures, such as S4, Mamba, and Aaren, have been proposed that achieve comparable performance. In this work, we revisit traditional recurrent neural networks (RNNs) from over a decade ago: LSTMs (1997) and GRUs (2014). While these models were slow due to requiring to backpropagate through time (BPTT), we show that by removing their hidden state dependencies from their input, forget, and update gates, LSTMs and GRUs no longer need to BPTT and can be efficiently trained in parallel. Building on this, we introduce minimal versions (minLSTMs and minGRUs) that (1) use significantly fewer parameters than their traditional counterparts and (2) are fully parallelizable during training (175x faster for a sequence of length 512). Lastly, we show that these stripped-down versions of decade-old RNNs match the empirical performance of recent sequence models.
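To see why dropping the hidden-state dependency from the gates matters, here is a minimal sketch of a minGRU-style recurrence, h_t = (1 - z_t) * h_{t-1} + z_t * h~_t, with a scalar hidden state. The scalar weights `w_z` and `w_h` and all function names are illustrative assumptions, not the paper's code. Because z_t and h~_t depend only on x_t, each step is an affine map of h_{t-1}, and affine maps compose associatively, which is the property a parallel scan relies on.

```python
import math
from itertools import accumulate


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def min_gru_sequential(xs, w_z, w_h, h0=0.0):
    """Reference loop: h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t."""
    h, out = h0, []
    for x in xs:
        z = sigmoid(w_z * x)   # gate depends on x_t only, not on h_{t-1}
        h_tilde = w_h * x      # candidate state, also input-only
        h = (1.0 - z) * h + z * h_tilde
        out.append(h)
    return out


def combine(left, right):
    """Associative operator composing affine maps h -> a * h + b (left first)."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def min_gru_scan(xs, w_z, w_h, h0=0.0):
    # Every (a_t, b_t) can be computed up front because no term needs h_{t-1}.
    coeffs = []
    for x in xs:
        z = sigmoid(w_z * x)
        coeffs.append((1.0 - z, z * w_h * x))
    # accumulate runs left to right here; a parallel scan could apply the same
    # associative operator in logarithmic depth instead.
    prefixes = accumulate(coeffs, combine)
    return [a * h0 + b for a, b in prefixes]


xs = [0.5, -1.0, 2.0, 0.1]
sequential = min_gru_sequential(xs, w_z=0.7, w_h=1.3, h0=0.2)
scanned = min_gru_scan(xs, w_z=0.7, w_h=1.3, h0=0.2)
print(sequential)
print(scanned)
```

Both functions produce the same hidden states up to floating-point rounding. In a standard GRU the gate also reads h_{t-1}, so the coefficients cannot be precomputed and the scan formulation is unavailable.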

Paper link

https://arxiv.org/abs/2410.01201

Further reading

https://x.com/omarsar0/status/1842246985790914608

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

Paper summary

Finds that the "truthfulness" information in LLMs is concentrated in specific tokens; this insight can help enhance error detection performance and further mitigate some of these issues; they also claim that internal representations can be used to predict the types of errors the LLMs are likely to make.

Abstract

Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that -- contrary to prior claims -- truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation.
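Detecting errors from internal states is commonly done with a probing classifier: a small model trained to predict correctness from a hidden vector. The sketch below trains a logistic-regression probe on synthetic 4-dimensional vectors that stand in for hidden states taken at a chosen token. The data, dimensions, and function names are all assumptions made for illustration; a real probe would use activations extracted from an LLM.

```python
import math
import random


def train_linear_probe(features, labels, epochs=200, lr=0.1):
    """Logistic regression trained with plain stochastic gradient descent."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            logit = max(-30.0, min(30.0, logit))  # keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-logit))
            err = p - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Synthetic stand-ins for hidden states: label 1 = "answer was correct".
rng = random.Random(0)
labels = [i % 2 for i in range(200)]
features = [
    [(1.5 if y else -1.5) + rng.gauss(0.0, 1.0) for _ in range(4)]
    for y in labels
]

# Fit on the first 150 examples, evaluate on the held-out 50.
w, b = train_linear_probe(features[:150], labels[:150])
held_out = list(zip(features[150:], labels[150:]))
accuracy = sum(predict(w, b, x) == y for x, y in held_out) / len(held_out)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

The abstract's generalization caveat maps onto this setup directly: a probe fit on vectors from one dataset need not transfer to vectors from another.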

Paper link

https://arxiv.org/abs/2410.02707

Further reading

https://x.com/omarsar0/status/1842240840389001381

Archon: An Architecture Search Framework for Inference-Time Techniques

Paper summary

Introduces a modular framework for building and optimizing LLMs by combining multiple inference-time techniques; this approach reframes the challenge of LLM system design as a hyperparameter optimization problem; tested on benchmarks including MT-Bench and CodeContests, Archon surpasses leading models such as GPT-4o and Claude 3.5 Sonnet, achieving a 15.1% average accuracy improvement.

Abstract

Inference-time techniques are emerging as highly effective tools to enhance large language model (LLM) capabilities. However, best practices for developing systems that combine these techniques remain underdeveloped due to our limited understanding of the utility of individual inference-time techniques and the interactions between them. Additionally, efficiently and automatically searching the space of model choices, inference-time techniques, and their compositions is challenging due to the large design space. To address these challenges, we introduce Archon, a modular framework for selecting, combining, and stacking layers of inference-time techniques to construct optimized LLM systems for target benchmarks. Rather than relying on a single LLM called once, we leverage a diverse set of LLMs and inference-time techniques, creating LLM systems greater than the sum of their parts. Archon defines an extensible design space, encompassing techniques such as generation ensembling, repeated sampling, ranking, fusion, critiquing, verification, and unit testing. It transforms the problem of building LLM systems into a hyperparameter optimization objective. Given the available LLMs, inference-time techniques, and compute budget, Archon utilizes hyperparameter search techniques to discover optimized architectures for target benchmark(s). We evaluate Archon architectures across a range of instruction-following, reasoning, and coding benchmarks, including MT-Bench, Arena-Hard-Auto, AlpacaEval 2.0, MixEval, MixEval Hard, MATH, and CodeContests. Archon architectures outperform frontier models, such as GPT-4o and Claude 3.5 Sonnet, on these benchmarks, achieving an average accuracy increase of 15.1 percentage points by using all available LLMs. We make our code and datasets available publicly on Github: https://github.com/ScalingIntelligence/Archon.
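Treating system design as hyperparameter optimization can be pictured with a toy search. The sketch below enumerates a small, made-up design space under a budget on LLM calls and keeps the best-scoring configuration. It does not use Archon's API: the search space, the cost model, and `toy_benchmark_score` are hypothetical, and exhaustive grid search is swapped in as the simplest stand-in for the hyperparameter search techniques the abstract mentions.

```python
from itertools import product

# Hypothetical design space: how many candidates to sample, and whether to
# add a ranking step and a fusion step on top of them.
SEARCH_SPACE = {
    "num_samples": [1, 5, 10],
    "use_ranker": [False, True],
    "use_fusion": [False, True],
}


def inference_calls(config):
    """Toy cost model: one LLM call per sample plus one per extra layer."""
    calls = config["num_samples"]
    if config["use_ranker"]:
        calls += 1
    if config["use_fusion"]:
        calls += 1
    return calls


def toy_benchmark_score(config):
    """Made-up score standing in for a real benchmark evaluation."""
    score = 0.50 + 0.02 * config["num_samples"]
    if config["use_ranker"]:
        score += 0.05
    if config["use_fusion"]:
        score += 0.08
    return score


def search(budget):
    """Return the best (config, score) whose cost fits within the budget."""
    best = None
    for values in product(*SEARCH_SPACE.values()):
        config = dict(zip(SEARCH_SPACE.keys(), values))
        if inference_calls(config) > budget:
            continue
        score = toy_benchmark_score(config)
        if best is None or score > best[1]:
            best = (config, score)
    return best


best_config, best_score = search(budget=7)
print(best_config, round(best_score, 2))
```

In a real setting the scoring function is the expensive part, since each evaluation means running a candidate system on the target benchmark, which is why smarter search strategies than a full grid become attractive.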

Paper link

https://arxiv.org/abs/2409.15254

Further reading

https://github.com/ScalingIntelligence/Archon

https://x.com/Azaliamirh/status/1840892626096345530

RATIONALYST: Pre-training Process-Supervision for Improving Reasoning

Paper summary

A model for process-supervision of reasoning that enables generalization across diverse reasoning tasks; this process is achieved with pre-training on a collection of 79k rationales from the Pile and a combination of reasoning datasets with minimal human intervention; fine-tuned from LLaMa-3-8B, the proposed model improves the accuracy of reasoning by an average of 3.9% on 7 reasoning benchmarks.

Abstract

The reasoning steps generated by LLMs might be incomplete, as they mimic logical leaps common in everyday communication found in their pre-training data: underlying rationales are frequently left implicit (unstated). To address this challenge, we introduce RATIONALYST, a model for process-supervision of reasoning based on pre-training on a vast collection of rationale annotations extracted from unlabeled data. We extract 79k rationales from web-scale unlabelled dataset (the Pile) and a combination of reasoning datasets with minimal human intervention. This web-scale pre-training for reasoning allows RATIONALYST to consistently generalize across diverse reasoning tasks, including mathematical, commonsense, scientific, and logical reasoning. Fine-tuned from LLaMa-3-8B, RATIONALYST improves the accuracy of reasoning by an average of 3.9% on 7 representative reasoning benchmarks. It also demonstrates superior performance compared to significantly larger verifiers like GPT-4 and similarly sized models fine-tuned on matching training sets.

Paper link

https://arxiv.org/abs/2410.01044

When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1

Paper summary

Reports that large reasoning models like o1-preview, while improving on more difficult tasks, display similar qualitative trends as previous LLMs; o1 is sensitive to the probability of examples and tasks, performing better and requiring fewer “thinking tokens” in high-probability settings than in low-probability ones.

Abstract

In "Embers of Autoregression" (McCoy et al., 2023), we showed that several large language models (LLMs) have some important limitations that are attributable to their origins in next-word prediction. Here we investigate whether these issues persist with o1, a new system from OpenAI that differs from previous LLMs in that it is optimized for reasoning. We find that o1 substantially outperforms previous LLMs in many cases, with particularly large improvements on rare variants of common tasks (e.g., forming acronyms from the second letter of each word in a list, rather than the first letter). Despite these quantitative improvements, however, o1 still displays the same qualitative trends that we observed in previous systems. Specifically, o1 -- like previous LLMs -- is sensitive to the probability of examples and tasks, performing better and requiring fewer "thinking tokens" in high-probability settings than in low-probability ones. These results show that optimizing a language model for reasoning can mitigate but might not fully overcome the language model's probability sensitivity.
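The acronym example in the abstract is easy to state in code, which makes the point that the rare variant is no harder algorithmically than the common one. The function below is an illustrative sketch (the name `acronym` and the sample words are made up): the two tasks differ only in an index, yet the abstract reports that models do noticeably better on the familiar first-letter version.

```python
def acronym(words, letter_index=0):
    """Build an acronym from the letter at `letter_index` of each word."""
    return "".join(word[letter_index].upper() for word in words)


words = ["apple", "banana", "cherry"]
common_variant = acronym(words)                 # first letters
rare_variant = acronym(words, letter_index=1)   # second letters
print(common_variant, rare_variant)
```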

Paper link

https://arxiv.org/abs/2410.01792

Further reading

https://x.com/omarsar0/status/1841842414157472240

Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation

Paper summary

A unified framework to evaluate an LLM’s ability to provide factual responses, assess retrieval capabilities, and the reasoning required to generate final responses; includes multi-hop questions that require the integration of information from multiple sources; reports that state-of-the-art LLMs struggle on the task and only achieve 40% accuracy with no retrieval; the proposed multi-step retrieval approach improves performance to 66% accuracy.

Abstract

Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize coherent and accurate responses. Given the increasing real-world deployment of such systems, comprehensive evaluation becomes crucial. To this end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses, assess retrieval capabilities, and evaluate the reasoning required to generate final answers. While previous work has provided datasets and benchmarks to evaluate these abilities in isolation, FRAMES offers a unified framework that provides a clearer picture of LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions that require the integration of information from multiple sources. We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval. The accuracy is significantly improved with our proposed multi-step retrieval pipeline, achieving an accuracy of 0.66 (>50% improvement). We hope our work will help bridge evaluation gaps and assist in developing more robust and capable RAG systems.
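The ">50% improvement" in the abstract is a relative figure, which is worth separating from the absolute gain. The short sketch below works through both from the two reported accuracies; the function and variable names are illustrative.

```python
def relative_improvement(baseline, improved):
    """Gain expressed as a fraction of the baseline value."""
    return (improved - baseline) / baseline


baseline_accuracy = 0.40    # reported accuracy with no retrieval
multi_step_accuracy = 0.66  # reported accuracy with multi-step retrieval

absolute_gain = multi_step_accuracy - baseline_accuracy
relative_gain = relative_improvement(baseline_accuracy, multi_step_accuracy)

print(f"absolute gain: {absolute_gain:.2f}")
print(f"relative gain: {relative_gain:.0%}")
```

So the pipeline adds 26 percentage points of accuracy, which is about 65% more than the no-retrieval baseline and consistent with the abstract's ">50%".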

Paper link

https://arxiv.org/abs/2409.12941

Further reading

https://x.com/_philschmid/status/1840628834275602585

Not All LLM Reasoners Are Created Equal

Paper summary

Investigates in depth the grade-school math problem-solving capabilities of LLMs; reports that LLMs show a significant gap in reasoning; finds that LLMs display a huge performance difference when solving compositional pairs and solving questions independently.

Abstract

We study the depth of grade-school math (GSM) problem-solving capabilities of LLMs. To this end, we evaluate their performance on pairs of existing math word problems together so that the answer to the second problem depends on correctly answering the first problem. Our findings reveal a significant reasoning gap in most LLMs, that is performance difference between solving the compositional pairs and solving each question independently. This gap is more pronounced in smaller, more cost-efficient, and math-specialized models. Moreover, instruction-tuning recipes and code generation have varying effects across LLM sizes, while finetuning on GSM can lead to task overfitting. Our analysis indicates that large reasoning gaps are not because of test-set leakage, but due to distraction from additional context and poor second-hop reasoning. Overall, LLMs exhibit systematic differences in their reasoning abilities, despite what their performance on standard benchmarks indicates.
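One way to make the "reasoning gap" concrete is to compare accuracy on the chained pairs with what independent accuracy would predict. The sketch below assumes the baseline is the product of the two standalone accuracies, which is a natural choice when the second question can only be solved after the first; treat that formula and the numbers as illustrative assumptions and check the paper for its exact definition.

```python
def reasoning_gap(acc_q1, acc_q2, acc_compositional):
    """Compositional accuracy minus the accuracy expected from independence.

    Assumption: solving a chained pair requires both questions to be right,
    so the expected accuracy is the product of the standalone accuracies.
    """
    expected = acc_q1 * acc_q2
    return acc_compositional - expected


# Made-up accuracies for a hypothetical model.
gap = reasoning_gap(acc_q1=0.9, acc_q2=0.8, acc_compositional=0.6)
print(f"reasoning gap: {gap:+.2f}")
```

A negative value means the model does worse on chained questions than its standalone scores would suggest, which is the pattern the abstract reports for most LLMs.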

Paper link

https://arxiv.org/abs/2410.01748

Further reading

https://x.com/arianTBD/status/1841875515860517130

Evaluation of OpenAI o1: Opportunities and Challenges of AGI

Paper summary

Provides a comprehensive evaluation of OpenAI's o1-preview LLM; shows strong performance across many tasks such as competitive programming, generating coherent and accurate radiology reports, high school-level mathematical reasoning tasks, chip design tasks, anthropology and geology, quantitative investing, social media analysis, and many other domains and problems.

Abstract

This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include:

- 83.3% success rate in solving complex competitive programming problems, surpassing many human experts.
- Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models.
- 100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions.
- Advanced natural language inference capabilities across general and specialized domains like medicine.
- Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis.
- Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields.
- Strong capabilities in quantitative investing. O1 has comprehensive financial knowledge and statistical modeling skills.
- Effective performance in social media analysis, including sentiment analysis and emotion recognition.

The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.

Paper link

https://arxiv.org/abs/2409.18486

Further reading

https://x.com/omarsar0/status/1840953712635732006

Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis

Paper summary

Training generative models such as GANs with limited data is difficult; current Implicit Maximum Likelihood Estimation (IMLE) approaches have an inadequate correspondence between the latent codes selected for training and those drawn during inference; the proposed approach, RS-IMLE, changes the prior distribution used for training, which improves test-time performance and leads to higher-quality image generation.

Abstract

An emerging area of research aims to learn deep generative models with limited training data. Prior generative models like GANs and diffusion models require a lot of data to perform well, and their performance degrades when they are trained on only a small amount of data. A recent technique called Implicit Maximum Likelihood Estimation (IMLE) has been adapted to the few-shot setting, achieving state-of-the-art performance. However, current IMLE-based approaches encounter challenges due to inadequate correspondence between the latent codes selected for training and those drawn during inference. This results in suboptimal test-time performance. We theoretically show a way to address this issue and propose RS-IMLE, a novel approach that changes the prior distribution used for training. This leads to substantially higher quality image generation compared to existing GAN and IMLE-based methods, as validated by comprehensive experiments conducted on nine few-shot image datasets.
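The idea of changing the training prior through rejection sampling can be illustrated in one dimension. In the sketch below, latent codes are drawn from a standard normal, and any code whose generated output already lies within `eps` of a training point is rejected before the usual IMLE step of matching each data point to its nearest sample. The identity generator, the threshold, and the rejection rule as written are assumptions made for illustration, not a reproduction of the paper's algorithm.

```python
import random


def generator(z):
    # Hypothetical stand-in for a trained generator network: identity map.
    return z


def rejection_sample_latents(data, num_samples, eps, rng):
    """Draw latents from N(0, 1), rejecting any whose output is within eps
    of a training point (the assumed rejection rule for this sketch)."""
    accepted = []
    while len(accepted) < num_samples:
        z = rng.gauss(0.0, 1.0)
        if all(abs(generator(z) - x) >= eps for x in data):
            accepted.append(z)
    return accepted


def nearest_latent_per_point(data, latents):
    """IMLE-style selection: for each data point, the closest generated sample."""
    return [min(latents, key=lambda z: abs(generator(z) - x)) for x in data]


rng = random.Random(0)
data = [-1.0, 0.0, 1.5]
latents = rejection_sample_latents(data, num_samples=20, eps=0.1, rng=rng)
matches = nearest_latent_per_point(data, latents)
print(matches)
```

In training, the selected latents would then be pulled toward their data points by a gradient step. Rejecting samples that are already close changes which latents get selected, which is the sense in which the prior used for training is altered.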

Paper link

https://arxiv.org/abs/2409.17439

Further reading

https://x.com/KL_Div/status/1841729946302943295

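Original post

https://nlp.elvissaravia.com/p/top-ml-papers-of-the-week-b77

This post was compiled with a GPT model and may contain errors, so please also refer to the original post linked above. If you notice anything awkward or incorrect while reading, please let us know in a comment.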

