[2024/05/13 ~ 05/19] Top ML Papers of the Week

9bow · May 19, 2024, 10:12 PM

PyTorchKR

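The main trend in this week's papers is a concentration of work on natural language processing (NLP) and reinforcement learning (RL). For example, "GPT-4o", "Fine-tuning and Hallucinations", and "Zero-shot Tokenizer Transfer" cover recent NLP techniques and methods, exploring advances in areas such as generative models and tokenizer transfer. "RLHF Workflow" presents work on designing an efficient training workflow with reinforcement learning, which can be read as an attempt to broaden the practical applicability of RL.
This trend reflects the growing importance of NLP and RL, which play a central role in the progress of AI, and of machine learning and deep learning in particular. With the explosive growth of language generation models such as GPT in recent years, their range of applications keeps widening, and researchers have turned their attention to tuning these models more precisely and to compensating for the weaknesses of existing models. RL, meanwhile, has become an important technique for optimizing decision-making and improving learning in complex environments, which is in line with the current research emphasis on maximizing training efficiency.
This week's papers therefore give a good picture of where research and development in academia and industry is heading. NLP is at the core of making interaction between humans and machines more natural and efficient, and RL plays an important role in optimizing the decision-making built on that interaction. These research trends will remain an important indicator of where AI technology goes next.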

GPT-4o

Introduction

A new model with multimodal reasoning capabilities and real-time support across audio, vision, and text. It accepts any combination of text, audio, image, and video as input and generates any combination of text, audio, and image outputs. It is reported to match GPT-4 Turbo performance while being much faster and 50% cheaper via the API.

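Further reading

OpenAI releases GPT-4o, faster and more powerful (& the content was written by GPT-4o itself) (Reading & Information Sharing)

Introduction: The content below was written directly with GPT-4o. The first prompt was as follows, and follow-up requests were then made to add some content and sources. 9bow: Search the internet and gather detailed information about GPT-4o, released today. Then read all the articles slowly and thoughtfully, and write an introductory post. To convey the content well, write a table of contents first, then write each item as two or more paragraphs. Introducing GPT-4o. 1. Introduction: OpenAI released its latest model, GPT-4o, on May 13, 2024. This model has multimodal capabilities that let it process text and images at the same time, giving it stronger performance than previous models. GPT-4o is designed to provide performance similar to GPT-4 Turbo while producing more efficient and accurate results. 2. GPT-4o's…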

https://x.com/OpenAI/status/1790072174117613963

Gemini 1.5 Flash

Paper summary

A lightweight transformer decoder model with a 2M context window and multimodal capabilities. It is designed for efficiency and yields the fastest output generation of all models on several evaluated languages. Overall, Gemini 1.5 Flash performs uniformly better than Gemini 1.0 Pro and even performs at a similar level to 1.0 Ultra on several benchmarks.

Abstract

In this report, we introduce the Gemini 1.5 family of models, representing the next generation of highly compute-efficient multimodal models capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. The family includes two new models: (1) an updated Gemini 1.5 Pro, which exceeds the February version on the great majority of capabilities and benchmarks; (2) Gemini 1.5 Flash, a more lightweight variant designed for efficiency with minimal regression in quality. Gemini 1.5 models achieve near-perfect recall on long-context retrieval tasks across modalities, improve the state-of-the-art in long-document QA, long-video QA and long-context ASR, and match or surpass Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 3.0 (200k) and GPT-4 Turbo (128k). Finally, we highlight real-world use cases, such as Gemini 1.5 collaborating with professionals on completing their tasks, achieving 26 to 75% time savings across 10 different job categories, as well as surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.

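Paper link

Further reading

[GN] Everything announced at Google I/O 2024 (Reading & Information Sharing)

Everything announced at Google I/O 2024 [[GN] Everything announced at Google I/O 2024] https://www.youtube.com/watch?v=WsEQjeZoEng Gemini 1.5 Flash model announced: a new multimodal model as powerful as Gemini 1.5 Pro but optimized for narrow, high-frequency, low-latency tasks, and better suited to generating fast responses. Gemini 1.5's translation, reasoning, and coding abilities have also improved. The context window of Gemini 1.5 Pro (the amount of information it can absorb) doubles from 1 million to 2 million tokens. Project Astra: Google's Star Trek vision of the future of AI https://www.youtube.com/watch?v=nXVvvRhiGjI A multimodal AI assistant that aims to see and understand through the device's camera, remember where things are, and carry out tasks on the user's behalf. Most of the most impressive demos at this year's I/O…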

https://x.com/OriolVinyalsML/status/1791521517211107515

Veo

Introduction

Google DeepMind's most capable video generation model generates high-quality 1080p videos longer than one minute. It supports masked editing on videos and can also generate videos from an input image along with text. With its latent diffusion transformer, the model can extend video clips to 60 seconds and beyond while keeping them consistent.

Further reading

Veo: a high-quality video generation model released by Google DeepMind (Reading & Information Sharing)

Veo: a text-to-video generation model released by Google DeepMind https://deepmind.google/api/blob/website/media/veo_cowboy_sun_1.mp4 Prompt: A lone cowboy rides his horse across an open plain at beautiful sunset, soft light, warm colors. Introduction: Veo is a video generation model that Google DeepMind unveiled at Google I/O 2024. It can generate high-resolution (1080p) videos longer than one minute in a wide range of cinematic and visual styles. The model accurately captures the nuance and tone of a prompt and understands cinematic effects, which gives the user creative control. Veo helps make video production accessible to everyone, …

https://x.com/GoogleDeepMind/status/1790435824598716704

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Paper summary

A family of token-based mixed-modal models for generating images and text in any arbitrary sequence. It reports state-of-the-art performance in image captioning, outperforms Llama 2 on text-only tasks, and is competitive with Mixtral 8x7B and Gemini-Pro. It also matches or exceeds the performance of Gemini Pro and GPT-4V on a new long-form mixed-modal generation evaluation.

Abstract

We present Chameleon, a family of early-fusion token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities, including state-of-the-art performance in image captioning tasks, outperforms Llama-2 in text-only tasks while being competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in a unified modeling of full multimodal documents.
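To make "early-fusion, token-based" a little more concrete, here is a minimal sketch of how text tokens and discrete image tokens could share one vocabulary and one sequence. The vocabulary sizes, the begin/end-of-image sentinels, and the function names are all hypothetical and are not taken from the paper; the sketch only illustrates the idea of mapping both modalities into a single token stream that one model can consume.

```python
# Hypothetical sizes; the real tokenizer and image codebook are not specified here.
TEXT_VOCAB_SIZE = 1000        # ids 0..999 are text tokens
IMAGE_CODEBOOK_SIZE = 512     # the image tokenizer emits codes 0..511
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE   # begin-of-image sentinel (assumed)
EOI = BOI + 1                                 # end-of-image sentinel (assumed)


def image_code_to_token(code: int) -> int:
    """Shift an image codebook index into the shared vocabulary."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError(f"image code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def interleave(segments):
    """Flatten ("text", ids) / ("image", codes) segments into one token list."""
    tokens = []
    for kind, values in segments:
        if kind == "text":
            tokens.extend(values)
        elif kind == "image":
            tokens.append(BOI)
            tokens.extend(image_code_to_token(c) for c in values)
            tokens.append(EOI)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return tokens


def modality_of(token: int) -> str:
    """Recover the modality of a token id in the shared vocabulary."""
    if token < TEXT_VOCAB_SIZE:
        return "text"
    if token < BOI:
        return "image"
    return "sentinel"


sequence = interleave([("text", [5, 17]), ("image", [0, 511]), ("text", [42])])
print(sequence)
print([modality_of(t) for t in sequence])
```

Because every position is just an id in one vocabulary, the same next-token objective can in principle cover text and image positions alike, which is the contrast with designs that attach a separate encoder per modality.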

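Paper link

Further reading

Meta releases Chameleon, a fusion model that delivers better performance across modalities (Reading & Information Sharing)

Chameleon: Mixed-Modal Early-Fusion Foundation Models [Meta releases Chameleon, a fusion model that delivers better performance across modalities] Paper introduction: The Chameleon paper recently released by Meta presents a new kind of model that can process mixed text and images. It is an early-fusion, token-based mixed-modal model that can understand and generate text and images in any order, and it shows strong performance on a range of tasks including text generation, image generation, and visual question answering. Multimodal models so far have integrated modalities by using a separate encoder (e.g., ViT for images) or decoder for each modality. For example, models such as LLaVA below use a separate Vision Encoder, and to align the embeddings from the Vision Encoder with the LLM's input …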

https://x.com/AIatMeta/status/1791263344714014733

Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?

Paper summary

Studies how fine-tuning on new knowledge affects the hallucination tendencies of LLMs. The setup uses fine-tuning examples that include new knowledge. It shows that LLMs struggle to acquire new factual knowledge via fine-tuning, and finds that as the new knowledge is eventually learned, the model's tendency to hallucinate increases.

Abstract

When large language models are aligned via supervised fine-tuning, they may encounter new factual information that was not acquired through pre-training. It is often conjectured that this can teach the model the behavior of hallucinating factually incorrect responses, as the model is trained to generate facts that are not grounded in its pre-existing knowledge. In this work, we study the impact of such exposure to new knowledge on the capability of the fine-tuned model to utilize its pre-existing knowledge. To this end, we design a controlled setup, focused on closed-book QA, where we vary the proportion of the fine-tuning examples that introduce new knowledge. We demonstrate that large language models struggle to acquire new factual knowledge through fine-tuning, as fine-tuning examples that introduce new knowledge are learned significantly slower than those consistent with the model's knowledge. However, we also find that as the examples with new knowledge are eventually learned, they linearly increase the model's tendency to hallucinate. Taken together, our results highlight the risk in introducing new factual knowledge through fine-tuning, and support the view that large language models mostly acquire factual knowledge through pre-training, whereas fine-tuning teaches them to use it more efficiently.
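The controlled setup described in the abstract varies the proportion of fine-tuning examples that introduce new knowledge. A minimal sketch of that kind of data mixing is shown below. The "known"/"unknown" labels, the pool contents, and the function name are hypothetical, and the sketch assumes the examples have already been categorized; how the paper decides what a model already knows is not reproduced here.

```python
import random


def build_finetuning_mix(known, unknown, total, unknown_fraction, seed=0):
    """Sample `total` examples with the requested share of unknown examples."""
    if not 0.0 <= unknown_fraction <= 1.0:
        raise ValueError("unknown_fraction must be in [0, 1]")
    n_unknown = round(total * unknown_fraction)
    n_known = total - n_unknown
    if n_unknown > len(unknown) or n_known > len(known):
        raise ValueError("not enough examples in one of the pools")
    rng = random.Random(seed)  # fixed seed so each mix is reproducible
    mix = rng.sample(unknown, n_unknown) + rng.sample(known, n_known)
    rng.shuffle(mix)
    return mix


# Toy pools of (question, answer, label) triples with made-up content.
known_pool = [(f"known question {i}", f"answer {i}", "known") for i in range(100)]
unknown_pool = [(f"new question {i}", f"answer {i}", "unknown") for i in range(100)]

for fraction in (0.0, 0.25, 0.5, 1.0):
    mix = build_finetuning_mix(known_pool, unknown_pool, total=40, unknown_fraction=fraction)
    share = sum(1 for _, _, label in mix if label == "unknown") / len(mix)
    print(f"requested {fraction:.2f} -> actual {share:.2f}")
```

Holding the total size fixed while sweeping the fraction keeps dataset size from being confounded with the share of new knowledge, which is the point of a controlled comparison.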

Paper link

Further reading

https://x.com/arankomatsuzaki/status/1788859706187882960

Zero-Shot Tokenizer Transfer

Paper summary

Trains a hypernetwork that takes a tokenizer as input and predicts the corresponding embeddings. It demonstrates generalization to new tokenizers with both encoder and decoder LLMs, and reports that the method achieves performance close to the original models' in cross-lingual and coding tasks while reducing the length of the tokenized sequence.

Abstract

Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for the tokens in the vocabulary of the new tokenizer. Since prior heuristics for initializing embeddings often perform at chance level in a ZeTT setting, we propose a new solution: we train a hypernetwork taking a tokenizer as input and predicting the corresponding embeddings. We empirically demonstrate that the hypernetwork generalizes to new tokenizers both with encoder (e.g., XLM-R) and decoder LLMs (e.g., Mistral-7B). Our method comes close to the original models' performance in cross-lingual and coding tasks while markedly reducing the length of the tokenized sequence. We also find that the remaining gap can be quickly closed by continued training on less than 1B tokens. Finally, we show that a ZeTT hypernetwork trained for a base (L)LM can also be applied to fine-tuned variants without extra training. Overall, our results make substantial strides toward detaching LMs from their tokenizer.
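The abstract contrasts the hypernetwork with "prior heuristics for initializing embeddings". As an illustration of what such a heuristic can look like, the sketch below initializes each new token by splitting it with the old vocabulary and averaging the embeddings of the pieces. This is a simple baseline of the kind the paper argues is insufficient, not the paper's hypernetwork; the toy vocabulary, the greedy longest-match segmentation, and the embedding values are assumptions made up for the example.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation of `text` into pieces from `vocab`."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No piece of any length matched at position i.
            raise ValueError(f"cannot tokenize {text!r} at position {i}")
    return pieces


def init_embedding(new_token, old_embeddings):
    """Average the old embeddings of the pieces that make up `new_token`."""
    pieces = greedy_tokenize(new_token, old_embeddings)
    dim = len(next(iter(old_embeddings.values())))
    return [sum(old_embeddings[p][d] for p in pieces) / len(pieces) for d in range(dim)]


# Toy 2-dimensional embeddings for the original vocabulary (made-up values).
old_embeddings = {
    "to": [1.0, 0.0],
    "ken": [0.0, 1.0],
    "t": [0.5, 0.5],
    "o": [0.2, 0.2],
    "k": [0.1, 0.3],
    "e": [0.3, 0.1],
    "n": [0.4, 0.4],
}

new_vocab = ["token", "to", "ten"]
new_embeddings = {tok: init_embedding(tok, old_embeddings) for tok in new_vocab}
print(new_embeddings)
```

A token that already exists in the old vocabulary keeps its embedding unchanged, while a genuinely new token gets only a crude average, which gives some intuition for why a learned predictor might do better.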

Paper link

Further reading

https://x.com/bminixhofer/status/1790267652587258343

WavCraft: Audio Editing and Generation with Large Language Models

Paper summary

Leverages LLMs to connect task-specific models for audio content creation and editing. It decomposes users' instructions into several tasks and tackles each task collaboratively with the appropriate module, and it can let users interact with and produce audio content without explicit commands.

Abstract

We introduce WavCraft, a collective system that leverages large language models (LLMs) to connect diverse task-specific models for audio content creation and editing. Specifically, WavCraft describes the content of raw audio materials in natural language and prompts the LLM conditioned on audio descriptions and user requests. WavCraft leverages the in-context learning ability of the LLM to decompose users' instructions into several tasks and tackle each task collaboratively with the particular module. Through task decomposition along with a set of task-specific models, WavCraft follows the input instruction to create or edit audio content with more details and rationales, facilitating user control. In addition, WavCraft is able to cooperate with users via dialogue interaction and even produce the audio content without explicit user commands. Experiments demonstrate that WavCraft yields a better performance than existing methods, especially when adjusting the local regions of audio clips. Moreover, WavCraft can follow complex instructions to edit and create audio content on top of input recordings, facilitating audio producers in a broader range of applications. Our implementation and demos are available at https://github.com/JinhuaLiang/WavCraft.
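The decompose-then-dispatch pattern described in the abstract can be sketched roughly as follows. This is an illustration only and not WavCraft's actual interface: the module names, the plan format, and the representation of audio as a list of event labels are hypothetical, and the plan is written by hand where the real system would obtain it from an LLM.

```python
from typing import Callable, Dict, List

# An "audio clip" is modelled as a list of event labels for illustration;
# real modules would operate on waveforms.
Audio = List[str]


def generate(audio: Audio, description: str) -> Audio:
    """Stand-in for a text-to-audio model: add a newly generated event."""
    return audio + [description]


def remove(audio: Audio, description: str) -> Audio:
    """Stand-in for a source-separation model: drop the matching event."""
    return [event for event in audio if event != description]


# Registry mapping task names (as they would appear in a plan) to modules.
MODULES: Dict[str, Callable[[Audio, str], Audio]] = {
    "generate": generate,
    "remove": remove,
}


def run_plan(audio: Audio, plan: List[dict]) -> Audio:
    """Execute each step of the plan with the module it names."""
    for step in plan:
        module = MODULES[step["task"]]
        audio = module(audio, step["target"])
    return audio


# Hand-written plan standing in for the LLM's decomposition of a request
# such as "replace the traffic noise with birdsong".
plan = [
    {"task": "remove", "target": "traffic noise"},
    {"task": "generate", "target": "birdsong"},
]
edited = run_plan(["speech", "traffic noise"], plan)
print(edited)
```

Keeping the plan as plain data makes each step inspectable before it runs, which is one way such a system could expose the "details and rationales" the abstract mentions.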

Paper link

Further reading

https://github.com/JinhuaLiang/WavCraft

RLHF Workflow: From Reward Modeling to Online RLHF

Paper summary

Provides an easily reproducible recipe for online iterative RLHF, and discusses the theoretical insights, algorithmic principles, and practical implementation of online iterative RLHF.

Abstract

We present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF) in this technical report, which is widely reported to outperform its offline counterpart by a large margin in the recent large language model (LLM) literature. However, existing open-source RLHF projects are still largely confined to the offline learning setting. In this technical report, we aim to fill in this gap and provide a detailed recipe that is easy to reproduce for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we start by constructing preference models using a diverse set of open-source datasets and use the constructed proxy preference model to approximate human feedback. Then, we discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM, SFR-Iterative-DPO-LLaMA-3-8B-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as other academic benchmarks such as HumanEval and TruthfulQA. We have shown that supervised fine-tuning (SFT) and iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. Further, we have made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to the RLHFlow/RLHF-Reward-Modeling and RLHFlow/Online-RLHF repositories on GitHub for more detailed information.
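One round of the kind of loop the abstract describes can be sketched as: sample several responses for a prompt, score them with a proxy preference model, turn the scores into a preference pair, and train on that pair with a DPO-style loss. The sketch below shows only the pair construction and the standard DPO loss for a single pair. The toy reward, the best-versus-worst pairing rule, and the value of `beta` are assumptions chosen for illustration and are not confirmed by the abstract.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)); fall back to the linear asymptote for very
    # negative margins to avoid overflow in exp().
    return math.log1p(math.exp(-margin)) if margin > -30 else -margin


def build_preference_pair(prompt, candidates, reward_fn):
    """Rank sampled responses with the proxy reward and keep best and worst."""
    ranked = sorted(candidates, key=lambda response: reward_fn(prompt, response))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}


def toy_reward(prompt, response):
    """Made-up proxy reward that simply prefers longer responses."""
    return len(response)


pair = build_preference_pair(
    "Explain RLHF.",
    ["It is a method.", "RLHF fine-tunes a model using human preference signals.", "No."],
    toy_reward,
)

# When the policy equals the reference model the margin is zero.
loss_at_init = dpo_loss(-12.0, -9.0, -12.0, -9.0)
# Raising the chosen response's log-probability relative to the reference
# lowers the loss.
loss_after_update = dpo_loss(-5.0, -9.0, -12.0, -9.0)
print(pair["chosen"], loss_at_init, loss_after_update)
```

In an online iterative setting the responses would be resampled from the current policy at each round, so the preference data tracks the model as it improves rather than staying fixed as in offline training.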

Paper link

Further reading

GitHub: RLHFlow/RLHF-Reward-Modeling (recipes to train reward models for RLHF) and RLHFlow/Online-RLHF (a recipe for online RLHF and online iterative DPO).

https://x.com/CaimingXiong/status/1790379121719361776

You Only Cache Once: Decoder-Decoder Architectures for Language Models

Paper summary

A decoder-decoder LLM architecture that caches key-value pairs only once. It stacks a cross-decoder on top of a self-decoder; the self-decoder efficiently encodes global key-value caches, and the cross-decoder reuses them via cross-attention. This leads to a significant reduction in GPU memory use without sacrificing capabilities, and the model achieves performance comparable to Transformer in various settings of scaling up model size and number of training tokens.

Abstract

We introduce a decoder-decoder architecture, YOCO, for large language models, which only caches key-value pairs once. It consists of two components, i.e., a cross-decoder stacked upon a self-decoder. The self-decoder efficiently encodes global key-value (KV) caches that are reused by the cross-decoder via cross-attention. The overall model behaves like a decoder-only Transformer, although YOCO only caches once. The design substantially reduces GPU memory demands, yet retains global attention capability. Additionally, the computation flow enables prefilling to early exit without changing the final output, thereby significantly speeding up the prefill stage. Experimental results demonstrate that YOCO achieves favorable performance compared to Transformer in various settings of scaling up model size and number of training tokens. We also extend YOCO to 1M context length with near-perfect needle retrieval accuracy. The profiling results show that YOCO improves inference memory, prefill latency, and throughput by orders of magnitude across context lengths and model sizes. Code is available in the YOCO directory of the microsoft/unilm repository on GitHub.
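A back-of-the-envelope calculation helps show why caching key-value pairs once matters at long context lengths. The sketch below compares the KV cache of a standard decoder-only Transformer, which stores keys and values in every layer, with a cache that is stored a single time. The model shape is hypothetical, and the comparison is a simplification: it ignores whatever additional, smaller state the self-decoder itself keeps, so the numbers are illustrative rather than a reproduction of the paper's profiling.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_layers` layers."""
    # Factor 2 accounts for storing both the key and the value tensors.
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
config = dict(num_kv_heads=8, head_dim=128)
seq_len = 1_000_000
num_layers = 32

per_layer_cache = kv_cache_bytes(seq_len, num_layers, **config)
cache_once = kv_cache_bytes(seq_len, 1, **config)

print(f"cache in every layer: {per_layer_cache / 2**30:.1f} GiB")
print(f"cache once:           {cache_once / 2**30:.1f} GiB")
print(f"ratio:                {per_layer_cache // cache_once}x")
```

Under this simplification the saving scales with the number of layers, which is why the benefit grows with both model depth and context length.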

Paper link

Further reading

https://x.com/arankomatsuzaki/status/1788435838474355098

CAT3D: Create Anything in 3D with Multi-View Diffusion Models

Paper summary

Presents a method for creating anything in 3D by simulating the real-world capture process with a multi-view diffusion model. It generates consistent novel views of a scene, which can be used as input to 3D reconstruction techniques to produce a 3D representation that renders in real time. A CAT3D scene can be generated in as little as one minute, and the method is reported to outperform existing methods on single-image and few-view 3D scene creation tasks.

Abstract

Advances in 3D reconstruction have enabled high-quality 3D capture, but require a user to collect hundreds to thousands of images to create a 3D scene. We present CAT3D, a method for creating anything in 3D by simulating this real-world capture process with a multi-view diffusion model. Given any number of input images and a set of target novel viewpoints, our model generates highly consistent novel views of a scene. These generated views can be used as input to robust 3D reconstruction techniques to produce 3D representations that can be rendered from any viewpoint in real-time. CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single image and few-view 3D scene creation. See our project page at https://cat3d.github.io for results and interactive demos.

Paper link

Further reading

https://x.com/_akhaliq/status/1791294630614442009

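Original

This post was compiled with a GPT model and may contain errors, so please also refer to the original linked at the bottom of the post. If you notice anything awkward or incorrect while reading, please let us know in the comments.

Did you find this post, compiled by the PyTorch Korea User Group, useful? Sign up as a member and we will send you the main posts by email. (The default is weekly, but you can switch to daily.)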

