[2024/03/04 ~ 03/10] Top ML Papers of the Week

9bow · March 11, 2024, 2:20 AM

PyTorchKR

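This week's selected papers are dominated by topics related to reasoning capabilities and robust model evaluation in AI. Papers addressing the comprehension and prediction abilities of large language models (LLMs), and their applicability to specific domains such as law, also stand out.
As AI technology advances, demand keeps growing for higher-order problem solving that goes beyond simple pattern recognition. Against this backdrop, a range of studies evaluate whether AI systems have human-like reasoning abilities and try to improve them. In addition, to build robust AI models, evaluation methods that check how well a model copes with environmental changes or input errors are becoming increasingly important.
This trend likely stems from the need for research and development that strengthens AI's ability to handle the varied problems that arise when the technology is applied in real-world settings. AI that can reason and plan could play an important role in fields such as robotics, autonomous vehicles, medical diagnosis, and disaster response systems. This week's papers can therefore be seen as research focused on improving the real-world problem-solving ability of AI systems and on firming up how that ability is evaluated.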

Claude 3

Paper Introduction

Consists of a family of three models (Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus); Claude 3 Opus, the strongest model, seems to outperform GPT-4 on common benchmarks like MMLU and HumanEval; Claude 3 capabilities include analysis, forecasting, content creation, code generation, and conversing in non-English languages like Spanish, Japanese, and French; 200K-token context windows are supported and can be extended to 1M tokens for select customers; the models also have strong vision capabilities for processing formats like photos, charts, and graphs; Anthropic claims these models have a more nuanced understanding of requests and make fewer refusals.

Abstract

We introduce Claude 3, a new family of large multimodal models – Claude 3 Opus, our most capable offering, Claude 3 Sonnet, which provides a combination of skills and speed, and Claude 3 Haiku, our fastest and least expensive model. All new models have vision capabilities that enable them to process and analyze image data. The Claude 3 family demonstrates strong performance across benchmark evaluations and sets a new standard on measures of reasoning, math, and coding. Claude 3 Opus achieves state-of-the-art results on evaluations like GPQA [1], MMLU [2], MMMU [3] and many more. Claude 3 Haiku performs as well or better than Claude 2 [4] on most pure-text tasks, while Sonnet and Opus significantly outperform it. Additionally, these models exhibit improved fluency in non-English languages, making them more versatile for a global audience. In this report, we provide an in-depth analysis of our evaluations, focusing on core capabilities, safety, societal impacts, and the catastrophic risk assessments we committed to in our Responsible Scaling Policy [5].

Paper Link

Read More

https://x.com/AnthropicAI/status/1764653830468428150

Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap

Paper Introduction

Proposes functional benchmarks for evaluating the reasoning capabilities of LLMs; finds reasoning gaps ranging from 58.35% to 80.31% among current models; however, the authors also note that those gaps are likely to be smaller with more sophisticated prompting strategies.

Abstract

We propose a framework for robust evaluation of reasoning capabilities of language models, using functional variants of benchmarks. Models that solve a reasoning test should exhibit no difference in performance over the static version of a problem compared to a snapshot of the functional variant. We have rewritten the relevant fragment of the MATH benchmark into its functional variant MATH(), with functionalization of other benchmarks to follow. When evaluating current state-of-the-art models over snapshots of MATH(), we find a reasoning gap -- the percentage difference between the static and functional accuracies. We find reasoning gaps from 58.35% to 80.31% among the state-of-the-art closed and open weights models that perform well on static benchmarks, with the caveat that the gaps are likely to be smaller with more sophisticated prompting strategies. Here we show that models which anecdotally have good reasoning performance over real-world tasks, have quantifiable lower gaps, motivating the open problem of building "gap 0" models. Code for evaluation and new evaluation datasets, three MATH() snapshots, are publicly available at https://github.com/consequentai/fneval/.
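To make the idea of a "functional variant" and the reasoning gap concrete, here is a minimal sketch. It is not the authors' code: the function names are hypothetical, the toy addition problem stands in for a real MATH() item, and the gap is assumed to be the relative drop from static to functional accuracy, which is one plausible reading of "percentage difference".

```python
import random


def make_snapshot(seed: int) -> dict:
    """Instantiate a toy functional problem (hypothetical example).

    A static benchmark would fix the numbers once; a functional variant
    regenerates them, so every seed yields a fresh "snapshot".
    """
    rng = random.Random(seed)
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return {"question": f"What is {a} + {b}?", "answer": a + b}


def reasoning_gap(static_acc: float, functional_acc: float) -> float:
    """Relative drop, in percent, from static to functional accuracy.

    Assumption: gap = 100 * (static - functional) / static.
    """
    if static_acc <= 0:
        raise ValueError("static accuracy must be positive")
    return 100.0 * (static_acc - functional_acc) / static_acc
```

Under this reading, a model that scores 80% on the static problems but only 20% on fresh snapshots has a gap of 75%, while a "gap 0" model scores the same on both.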

Paper Link

Read More

https://github.com/consequentai/fneval/

https://x.com/_saurabh/status/1763626711407816930

GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection

Paper Introduction

Proposes a memory-efficient approach for training LLMs through low-rank projection of gradients; the training strategy allows full-parameter learning and is more memory-efficient than common low-rank adaptation methods such as LoRA; reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures.

Abstract

Training Large Language Models (LLMs) presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both pre-training and fine-tuning stages since they limit the parameter search to a low-rank subspace and alter the training dynamics, and further, may require full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures with C4 dataset with up to 19.7B tokens, and on fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallel, checkpointing, or offloading strategies.
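The source of the optimizer-state savings can be illustrated with a rough, per-matrix accounting sketch. The assumptions here are mine, not taken from the abstract: an Adam-style optimizer keeps two full-size states per weight matrix, while a gradient-projection scheme keeps one projection matrix on the smaller dimension plus two states at the projected size. The function names are hypothetical, and the single-matrix figure is not meant to reproduce the model-level 65.5% reported above.

```python
def adam_state_elements(m: int, n: int) -> int:
    """Adam keeps first- and second-moment estimates, each of size m x n."""
    return 2 * m * n


def projected_state_elements(m: int, n: int, r: int) -> int:
    """Assumed layout for rank-r gradient projection:
    one projection matrix (min(m, n) x r) plus two moment
    estimates of the projected gradient (r x max(m, n))."""
    small, large = min(m, n), max(m, n)
    return small * r + 2 * r * large


def reduction_percent(m: int, n: int, r: int) -> float:
    """Percentage of optimizer-state elements saved for one weight matrix."""
    saved = 1.0 - projected_state_elements(m, n, r) / adam_state_elements(m, n)
    return 100.0 * saved
```

For a single 4096 x 4096 matrix at rank 128, this accounting gives 1,572,864 state elements instead of 33,554,432. Whole-model savings are smaller because not every parameter is a large matrix and rank choices vary by layer.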

Paper Link

Read More

https://x.com/AnimaAnandkumar/status/1765613815146893348

Can Large Language Models Reason and Plan?

Paper Introduction

A new position paper discusses the topic of reasoning and planning for LLMs; here is a summary of the author's conclusion: "To summarize, nothing that I have read, verified, or done gives me any compelling reason to believe that LLMs do reasoning/planning, as normally understood. What they do instead, armed with web-scale training, is a form of universal approximate retrieval, which, as I have argued, can sometimes be mistaken for reasoning capabilities."

Abstract

While humans sometimes do show the capability of correcting their own erroneous guesses with self-critiquing, there seems to be no basis for that assumption in the case of LLMs.

Paper Link

Read More

https://x.com/omarsar0/status/1766123621326475285

Retrieval-Augmented Generation for AI-Generated Content: A Survey

Paper Introduction

Provides an overview of RAG as used in different generation scenarios such as code, image, and audio, including a taxonomy of RAG enhancements with reference to key papers.

Abstract

The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by advancements in model algorithms, scalable foundation model architectures, and the availability of ample high-quality datasets. While AIGC has achieved remarkable performance, it still faces challenges, such as the difficulty of maintaining up-to-date and long-tail knowledge, the risk of data leakage, and the high costs associated with training and inference. Retrieval-Augmented Generation (RAG) has recently emerged as a paradigm to address such challenges. In particular, RAG introduces the information retrieval process, which enhances AIGC results by retrieving relevant objects from available data stores, leading to greater accuracy and robustness. In this paper, we comprehensively review existing efforts that integrate RAG technique into AIGC scenarios. We first classify RAG foundations according to how the retriever augments the generator. We distill the fundamental abstractions of the augmentation methodologies for various retrievers and generators. This unified perspective encompasses all RAG scenarios, illuminating advancements and pivotal technologies that help with potential future progress. We also summarize additional enhancements methods for RAG, facilitating effective engineering and implementation of RAG systems. Then from another view, we survey on practical applications of RAG across different modalities and tasks, offering valuable references for researchers and practitioners. Furthermore, we introduce the benchmarks for RAG, discuss the limitations of current RAG systems, and suggest potential directions for future research. Project: https://github.com/hymie122/RAG-Survey
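One of the simplest ways a retriever can augment a generator is by placing retrieved items into the generator's input. The sketch below illustrates that pattern only. It is not code from the survey: the token-overlap scorer stands in for a real retriever, all names are hypothetical, and the generator itself is left out (the returned prompt would be handed to whatever model is in use).

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase alphanumeric tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query: str, store: list, k: int = 2) -> list:
    """Return the k documents sharing the most tokens with the query."""
    query_counts = Counter(tokenize(query))

    def overlap(doc: str) -> int:
        doc_counts = Counter(tokenize(doc))
        return sum(min(query_counts[t], doc_counts[t]) for t in query_counts)

    # sorted() is stable, so ties keep their original store order.
    return sorted(store, key=overlap, reverse=True)[:k]


def build_prompt(query: str, store: list, k: int = 2) -> str:
    """Augment the query with retrieved context before generation."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, store, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
```

Because the knowledge lives in the data store rather than in model weights, updating the store is enough to refresh what the generator sees, which is the up-to-date and long-tail knowledge benefit the abstract describes.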

Paper Link

Read More

https://github.com/hymie122/RAG-Survey

https://x.com/omarsar0/status/1765414854397985175

KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents

Paper Introduction

Proposes an approach to enhance the planning capabilities of LLMs through explicit action knowledge; uses an action knowledge base and a knowledgeable self-learning phase to guide the model's action generation, mitigate planning hallucination, and enable continuous improvement; outperforms existing baselines and shows the potential of integrating external action knowledge to streamline planning with LLMs and solve complex planning challenges.

Abstract

Large Language Models (LLMs) have demonstrated great potential in complex reasoning tasks, yet they fall short when tackling more sophisticated challenges, especially when interacting with environments through generating executable actions. This inadequacy primarily stems from the lack of built-in action knowledge in language agents, which fails to effectively guide the planning trajectories during task solving and results in planning hallucination. To address this issue, we introduce KnowAgent, a novel approach designed to enhance the planning capabilities of LLMs by incorporating explicit action knowledge. Specifically, KnowAgent employs an action knowledge base and a knowledgeable self-learning strategy to constrain the action path during planning, enabling more reasonable trajectory synthesis, and thereby enhancing the planning performance of language agents. Experimental results on HotpotQA and ALFWorld based on various backbone models demonstrate that KnowAgent can achieve comparable or superior performance to existing baselines. Further analysis indicates the effectiveness of KnowAgent in terms of planning hallucinations mitigation. Code is available at https://github.com/zjunlp/KnowAgent.
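The phrase "constrain the action path during planning" can be pictured as a lookup table of permitted transitions. The sketch below is an illustration under my own assumptions, not the KnowAgent implementation: the action names and transition rules are hypothetical placeholders for a question-answering setting, and a real system would apply such rules while prompting and fine-tuning the model.

```python
# Hypothetical action knowledge base: which actions may follow which.
ACTION_RULES = {
    "Start": {"Search", "Retrieve"},
    "Search": {"Search", "Retrieve", "Lookup", "Finish"},
    "Retrieve": {"Search", "Retrieve", "Lookup", "Finish"},
    "Lookup": {"Search", "Retrieve", "Lookup", "Finish"},
    "Finish": set(),  # terminal: nothing may follow
}


def allowed_next(history: list) -> set:
    """Actions the knowledge base permits after the last action taken."""
    last = history[-1] if history else "Start"
    return ACTION_RULES.get(last, set())


def filter_candidates(history: list, candidates: list) -> list:
    """Drop model-proposed actions that violate the transition rules."""
    allowed = allowed_next(history)
    return [action for action in candidates if action in allowed]


def is_valid_trajectory(actions: list) -> bool:
    """Check a whole plan, e.g. to filter self-generated training data."""
    previous = "Start"
    for action in actions:
        if action not in ACTION_RULES.get(previous, set()):
            return False
        previous = action
    return True
```

A plan that calls `Lookup` before anything has been searched, or keeps acting after `Finish`, is rejected. That kind of invalid step is what the paper calls planning hallucination.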

Paper Link

Read More

https://github.com/zjunlp/KnowAgent

https://x.com/omarsar0/status/1765408813467759037

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Paper Introduction

A comprehensive review of Sora and some of the key developments powering this model, including the limitations and opportunities of large vision models.

Abstract

Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The model is trained to generate videos of realistic or imaginative scenes from text instructions and show potential in simulating the physical world. Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model's background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models. We first trace Sora's development and investigate the underlying technologies used to build this "world simulator". Then, we describe in detail the applications and potential impact of Sora in multiple industries ranging from film-making and education to marketing. We discuss the main challenges and limitations that need to be addressed to widely deploy Sora, such as ensuring safe and unbiased video generation. Lastly, we discuss the future development of Sora and video generation models in general, and how advancements in the field could enable new ways of human-AI interaction, boosting productivity and creativity of video generation.

Paper Link

Read More

https://x.com/omarsar0/status/1765756669659603015

SaulLM-7B: A pioneering Large Language Model for Law

Paper Introduction

Introduces SaulLM-7B, a large language model for the legal domain explicitly designed for legal text comprehension and generation; presents an instructional fine-tuning method that leverages legal datasets to further enhance performance in legal tasks.

Abstract

In this paper, we introduce SaulLM-7B, a large language model (LLM) tailored for the legal domain. With 7 billion parameters, SaulLM-7B is the first LLM designed explicitly for legal text comprehension and generation. Leveraging the Mistral 7B architecture as its foundation, SaulLM-7B is trained on an English legal corpus of over 30 billion tokens. SaulLM-7B exhibits state-of-the-art proficiency in understanding and processing legal documents. Additionally, we present a novel instructional fine-tuning method that leverages legal datasets to further enhance SaulLM-7B's performance in legal tasks. SaulLM-7B is released under the MIT License.

Paper Link

Read More

https://x.com/_akhaliq/status/1765614083875738028

Design2Code: How Far Are We From Automating Front-End Engineering?

Paper Introduction

Investigates the use of multimodal LLMs for converting a visual design into a code implementation, which is key to automating front-end engineering; introduces a benchmark of 484 diverse real-world webpages and a set of evaluation metrics to measure design-to-code capability; further develops a suite of multimodal prompting methods and shows their effectiveness on GPT-4V and Gemini Pro Vision; an open-source fine-tuned Design2Code-18B model matches the performance of Gemini Pro Vision, although GPT-4V performs best on the task.

Abstract

Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development, in which multimodal LLMs might directly convert visual designs into code implementations. In this work, we formalize this as a Design2Code task and conduct comprehensive benchmarking. Specifically, we manually curate a benchmark of 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics to assess how well current multimodal LLMs can generate the code implementations that directly render into the given reference webpages, given the screenshots as input. We also complement automatic metrics with comprehensive human evaluations. We develop a suite of multimodal prompting methods and show their effectiveness on GPT-4V and Gemini Pro Vision. We further finetune an open-source Design2Code-18B model that successfully matches the performance of Gemini Pro Vision. Both human evaluation and automatic metrics show that GPT-4V performs the best on this task compared to other models. Moreover, annotators think GPT-4V generated webpages can replace the original reference webpages in 49% of cases in terms of visual appearance and content; and perhaps surprisingly, in 64% of cases GPT-4V generated webpages are considered better than the original reference webpages. Our fine-grained break-down metrics indicate that open-source models mostly lag in recalling visual elements from the input webpages and in generating correct layout designs, while aspects like text content and coloring can be drastically improved with proper finetuning.
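As a rough intuition for what an automatic metric on "text content" might look at, the toy sketch below compares the visible text of a reference page and a generated page. This is explicitly not one of the paper's metrics, which are not specified in the abstract. It is a stand-in built from Python's standard library: the names are hypothetical, and the extractor makes no attempt to skip script or style contents.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects non-empty text nodes, ignoring tags and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def visible_text(html: str) -> str:
    """Concatenate the text nodes of an HTML string (simplified)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def text_similarity(reference_html: str, generated_html: str) -> float:
    """Similarity in [0, 1] between the two pages' text content."""
    return SequenceMatcher(
        None, visible_text(reference_html), visible_text(generated_html)
    ).ratio()
```

A score like this is blind to layout, position, and color, which is why a benchmark of this kind needs several complementary metrics plus human evaluation, as the abstract notes.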

Paper Link

Read More

https://x.com/_akhaliq/status/1765199160653828385

TripoSR: Fast 3D Object Reconstruction from a Single Image

Paper Introduction

A transformer-based 3D reconstruction model for fast feed-forward 3D generation; it can produce a 3D mesh from a single image in under 0.5 seconds; improvements include better data processing, model design, and training.

Abstract

This technical report introduces TripoSR, a 3D reconstruction model leveraging transformer architecture for fast feed-forward 3D generation, producing 3D mesh from a single image in under 0.5 seconds. Building upon the LRM network architecture, TripoSR integrates substantial improvements in data processing, model design, and training techniques. Evaluations on public datasets show that TripoSR exhibits superior performance, both quantitatively and qualitatively, compared to other open-source alternatives. Released under the MIT license, TripoSR is intended to empower researchers, developers, and creatives with the latest advancements in 3D generative AI.

Paper Link

Read More

https://x.com/_akhaliq/status/1764841524431392794

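Original

This post was compiled with a GPT model and may contain errors, so please also refer to the original linked at the bottom of the post. If you notice anything awkward or incorrect while reading, please let us know in the comments.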
