Road to Sora: An Introduction to the Prior Research Behind OpenAI's Sora (feat. Oxen.AI)

9bow · March 25, 2024, 3:57 AM

PyTorchKR:

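Oxen.AI, which builds tooling for high-quality AI datasets, runs ArXiv Dives, a weekly Friday session for reading AI papers and sharing insights.
This post shares, with permission, a translation of Road to Sora, which was covered at ArXiv Dives in early March.
Road to Sora aims to walk through the background needed to understand the Sora model, based on the technical report for Sora, the video generation model released by OpenAI.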

"Road to Sora" Paper Reading List

by Greg Schoeninger, Mar 5, 2024

This post is an effort to put together a reading list for our Friday paper club, ArXiv Dives. Since no official paper has been released for Sora yet, the goal is to follow the breadcrumbs from OpenAI's technical report on Sora. We plan to go over a few of the fundamental papers in the coming weeks during our Friday paper club, to help paint a better picture of what is going on behind the curtain of Sora.

What is Sora?

Sora has taken the generative AI space by storm with its ability to generate high-fidelity videos from natural language prompts. If you haven't seen an example yet, here's a generated video of a turtle swimming in a coral reef for your enjoyment.

While the team at OpenAI has not released an official research paper on the technical details of the model itself, they did release a technical report that covers some high-level details of the techniques they used and some qualitative results.

https://openai.com/research/video-generation-models-as-world-simulators

Sora Architecture Overview

After reading the papers below, the architecture here should start to make sense. The technical report is a 10,000-foot view, and my hope is that each paper will zoom into a different aspect and paint the full picture. There is a nice literature review called "Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models" that gives a high-level diagram of a reverse-engineered architecture.

The team at OpenAI states that Sora is a "Diffusion Transformer", which combines many of the concepts covered in the papers in this reading list but applies them to latent spacetime patches generated from video.

This is a combination of the style of patches used in the Vision Transformer (ViT) paper with latent spaces similar to those in the Latent Diffusion paper, combined in the style of the Diffusion Transformer. The patches span not only the width and height of the image but also the time dimension of the video.

It's hard to say exactly how they collected the training data for all of this, but it seems to be a combination of the techniques in the DALL·E 3 paper and using GPT-4 to elaborate on textual descriptions of images, which they then turn into videos. Training data is likely the main secret sauce here, which is why it gets the least detail in the technical report.

Use Cases

There are many interesting use cases and applications for video generation technologies like Sora. Whether it be movies, education, gaming, healthcare, or robotics, generating realistic videos from natural language prompts is likely to shake up multiple industries.

The note at the bottom of this diagram rings true for us at Oxen.ai. If you are not familiar with Oxen.ai, we are building open source tools to help you collaborate on and evaluate the data that comes in and out of machine learning models. We believe that many people need visibility into this data, and that it should be a collaborative effort. AI is touching many different fields and industries, and the more eyes on the data that trains and evaluates these models, the better.

Check us out here: https://oxen.ai

Paper Reading List

There are many papers linked in the references section of the OpenAI technical report, but it is a bit hard to know which ones to read first or which are important background knowledge. We've sifted through them, selected the ones we think are the most impactful and interesting to read, and organized them by type.

Background Papers

The quality of generated images and video has been steadily increasing since 2015. The biggest gains that caught the general public's eye began in 2022 with Midjourney, Stable Diffusion, and DALL·E. This section contains some foundational papers and model architectures that are referenced over and over again in the literature. While not all of these papers are directly involved in the Sora architecture, they are all important context for how the state of the art has improved over time.

Source: [2402.17177] Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

We have covered many of the papers below in previous ArXiv Dives, so if you want to catch up, everything is available on the Oxen.ai blog.

U-Net

"U-Net: Convolutional Networks for Biomedical Image Segmentation" is a great example of a paper written for a task in one domain (biomedical imaging) that ended up being applied across many different use cases. Most notably, it is the backbone of many diffusion models such as Stable Diffusion, where it facilitates learning to predict and remove noise at each step. While it is not directly used in the Sora architecture, it is important background knowledge for the previous state of the art.

Language Transformers

"Attention Is All You Need" is another paper that proved itself on a machine translation task but ended up being a seminal paper for all of natural language processing research. Transformers are now the backbone of many LLM applications such as ChatGPT. Transformers turned out to be extensible to many modalities and are used as a component of the Sora architecture.

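As a rough illustration of the core operation in that paper, here is a minimal NumPy sketch of single-head scaled dot-product attention. The function name, array shapes, and random inputs are illustrative assumptions; the sketch leaves out multi-head projections, masking, and everything else a full Transformer layer would need.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Single-head attention: q is (n_q, d), k is (n_k, d), v is (n_k, d_v)."""
    d = q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d).
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted average of the value rows.
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))    # 4 query tokens (hypothetical sizes)
k = rng.normal(size=(6, 8))    # 6 key tokens
v = rng.normal(size=(6, 16))   # 6 value vectors
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.shape)
```

Nothing in the function is specific to text: the rows of `q`, `k`, and `v` could just as well be embeddings of image or video patches, which is what makes the architecture reusable across modalities.
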
Vision Transformer (ViT)

"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" was one of the first papers to apply Transformers to image recognition, showing that they can outperform ResNets and other convolutional neural networks if you train them on large enough datasets. It takes the architecture from the "Attention Is All You Need" paper and makes it work for computer vision tasks. Instead of text tokens, ViT uses 16x16 image patches as input.

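To make the "image patches as tokens" idea concrete, here is a small NumPy sketch that cuts an image into non-overlapping 16x16 patches and flattens each one into a vector. The `patchify` helper and the 224x224 input size are assumptions for illustration; a real ViT would follow this with a learned linear projection and position embeddings.

```python
import numpy as np


def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, in row-major order over the patch grid.
    return x.reshape(gh * gw, patch * patch * c)


image = np.zeros((224, 224, 3), dtype=np.float32)  # hypothetical input size
tokens = patchify(image)
print(tokens.shape)  # (196, 768): a 14x14 grid of patches, 16*16*3 values each
```
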
Latent Diffusion Models

"High-Resolution Image Synthesis with Latent Diffusion Models" describes the technique behind many image generation models such as Stable Diffusion. The authors show how image generation can be reformulated as a sequence of denoising autoencoders operating on a latent representation. They use the U-Net architecture referenced above as the backbone of the generative process. These models can generate photorealistic images from a text prompt.

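The denoising steps mentioned above are trained to undo a forward noising process. The sketch below shows only that forward process, applied to an array that stands in for an autoencoder's latent. The linear beta schedule and its values are borrowed from the DDPM paper as an assumption, and the latent shape is hypothetical; none of these specifics come from this post.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(z0, t, alpha_bar, rng):
    """Sample z_t ~ q(z_t | z_0) = N(sqrt(a_bar_t) * z_0, (1 - a_bar_t) * I)."""
    noise = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return zt, noise


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((4, 32, 32))  # stand-in for an encoded image latent
zt, noise = add_noise(z0, t=500, alpha_bar=alpha_bar, rng=rng)
# A denoising network would be trained to predict `noise` from (zt, t, text condition).
print(zt.shape)
```

Working in the latent space rather than pixel space keeps the arrays small, which is the main efficiency argument of the paper.
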
CLIP

"Learning Transferable Visual Models From Natural Language Supervision", often referred to as Contrastive Language-Image Pre-training (CLIP), is a technique for embedding text data and image data into the same latent space. This technique helps connect the language understanding half of generative models to the visual understanding half by making sure that the cosine similarity between the text and image representations is high for matching text-image pairs.

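A minimal sketch of a CLIP-style objective, assuming the text and image embeddings have already been produced by two encoders (not shown). The function names are illustrative, and the temperature of 0.07 is used here as a fixed assumption; in CLIP itself the temperature is a learned parameter.

```python
import numpy as np


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_style_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric cross-entropy over the cosine-similarity matrix of a batch.

    Row i of text_emb and row i of image_emb are assumed to be a matching pair.
    """
    t = l2_normalize(text_emb)
    i = l2_normalize(image_emb)
    logits = t @ i.T / temperature          # (n, n) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    loss_text = -log_softmax(logits, axis=1)[idx, idx].mean()   # text -> image
    loss_image = -log_softmax(logits, axis=0)[idx, idx].mean()  # image -> text
    return (loss_text + loss_image) / 2


rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
aligned = clip_style_loss(emb, emb)                        # perfectly matched pairs
shuffled = clip_style_loss(emb, np.roll(emb, 1, axis=0))   # every pair mismatched
print(aligned < shuffled)
```

Minimizing this loss pushes matching pairs toward high cosine similarity and mismatched pairs toward low similarity, which is the behavior the paragraph above describes.
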
VQ-VAE

According to the technical report, they reduce the dimensionality of the raw video with a Vector Quantised Variational AutoEncoder (VQ-VAE). VAEs have been shown to be a powerful unsupervised pre-training method for learning latent representations.
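
The "vector quantised" part of a VQ-VAE can be sketched as a nearest-neighbor lookup into a codebook. The snippet below shows only that lookup with a tiny hand-written codebook; the encoder, decoder, and the training losses that learn the codebook are omitted, and all names and values are illustrative assumptions.

```python
import numpy as np


def quantize(z, codebook):
    """Replace each row of z (n, d) with its nearest codebook row (k, d)."""
    # Squared Euclidean distance between every latent and every code: (n, k).
    dist = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist.argmin(axis=1)
    return codebook[idx], idx


codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])   # hypothetical codes
z = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 1.7]])          # hypothetical encoder outputs
z_q, idx = quantize(z, codebook)
print(idx)  # [0 1 2]
```

After quantization each latent vector can be stored as a single integer index, which is where the compression comes from.
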

Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution

The Sora technical report talks about how they take in videos of any aspect ratio, and how this allows them to train on a much larger set of data. The more data they can feed the model without having to crop it, the better the results. This paper uses the same technique for images, and Sora extends it to video.
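
One way to picture training on mixed aspect ratios without cropping: each image yields a different number of patches, and several short patch sequences are packed into one fixed-length sequence. The sketch below uses a greedy first-fit packing as a simple stand-in; the function names, image sizes, and sequence length are assumptions for illustration, and attention masking between packed examples is not shown.

```python
def num_patches(height, width, patch=16):
    """Number of non-overlapping patches in an image of the given size."""
    return (height // patch) * (width // patch)


def pack_first_fit(lengths, max_len):
    """Greedily place each example in the first sequence with enough room."""
    bins = []    # bins[b] is the list of example indices packed into sequence b
    space = []   # space[b] is the remaining token capacity of sequence b
    for i, n in enumerate(lengths):
        if n > max_len:
            raise ValueError(f"example {i} has {n} tokens, more than max_len={max_len}")
        for b in range(len(bins)):
            if space[b] >= n:
                bins[b].append(i)
                space[b] -= n
                break
        else:
            bins.append([i])
            space.append(max_len - n)
    return bins


# Hypothetical images with different aspect ratios (height, width).
sizes = [(224, 224), (128, 384), (320, 160), (64, 64)]
lengths = [num_patches(h, w) for h, w in sizes]
print(lengths)                               # [196, 192, 200, 16]
print(pack_first_fit(lengths, max_len=400))  # [[0, 1], [2, 3]]
```
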

Video Generation Papers

The technical report references a few video generation papers that inspired Sora. These take the generative models above to the next level by applying them to video.

ViViT: A Video Vision Transformer

This paper goes into detail about how you can chop a video into the "spatio-temporal tokens" needed for video tasks. The paper focuses on video classification, but the same tokenization can be applied to generating video.
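
As a sketch of what chopping a video into spatio-temporal tokens can look like, the NumPy snippet below cuts a clip into non-overlapping blocks that span a few frames as well as a square of pixels, then flattens each block into one token. The block size of 2 frames by 16x16 pixels and the clip shape are assumptions for illustration, and the learned embedding that would normally follow is left out.

```python
import numpy as np


def spacetime_tokens(video, t=2, p=16):
    """Split a (frames, H, W, C) video into flattened t x p x p blocks."""
    f, h, w, c = video.shape
    assert f % t == 0 and h % p == 0 and w % p == 0, "video must divide evenly"
    gt, gh, gw = f // t, h // p, w // p
    # (gt, t, gh, p, gw, p, C) -> (gt, gh, gw, t, p, p, C)
    x = video.reshape(gt, t, gh, p, gw, p, c).transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per block, ordered by time, then row, then column.
    return x.reshape(gt * gh * gw, t * p * p * c)


video = np.zeros((16, 224, 224, 3), dtype=np.float32)  # hypothetical 16-frame clip
tokens = spacetime_tokens(video)
print(tokens.shape)  # (1568, 1536): 8*14*14 blocks, 2*16*16*3 values each
```

Compared with per-frame image patches, each token here already mixes information across neighboring frames, which is the property the spacetime patches in the Sora report build on.
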

Imagen Video: High Definition Video Generation with Diffusion Models

Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models. It uses convolutions in the temporal direction and super-resolution to generate high-quality videos from text.

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

This paper takes the latent diffusion models from the image generation papers above and introduces a temporal dimension to the latent space. The authors apply some interesting techniques in the temporal dimension by aligning the latent spaces, but the results do not quite have the temporal consistency of Sora yet.

Photorealistic video generation with diffusion models

This paper introduces W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. As far as I can tell, this is the closest technique to Sora in the reference list. It was released in December 2023 by teams at Google, Stanford, and Georgia Tech.

Vision-Language Understanding

In order to generate videos from text prompts, they need to collect a large dataset. It is not feasible to have humans label that many videos, so it seems they use synthetic data techniques similar to those described in the DALL·E 3 paper.

DALL·E 3

Training text-to-video generation systems requires a large number of videos with corresponding text captions. They apply the re-captioning technique introduced in DALL·E 3 to videos. As with DALL·E 3, they also use GPT to turn short user prompts into longer, detailed captions that are sent to the video model.

https://openai.com/dall-e-3

LLaVA

In order for the model to follow user instructions, they likely did some instruction fine-tuning similar to the LLaVA paper. This paper also shows some synthetic data techniques for creating a large instruction dataset, which could be interesting in combination with the DALL·E methods above.

Make-A-Video & Tune-A-Video

Papers like Make-A-Video and Tune-A-Video have shown how prompt engineering leverages a model's natural language understanding to decode complex instructions and render them into cohesive, lively, high-quality video narratives. One example is taking a simple user prompt and extending it with adjectives and verbs to flesh out the scene more fully.

Conclusion

We hope this gives you a jumping-off point for all the important components that could make up a system like Sora! If you think we missed anything, feel free to email us at hello@oxen.ai.

This is by no means a light set of reading. That is why on Fridays we take one paper at a time, slow down, and break down the topics in plain language so anyone can understand. We believe anyone can contribute to building AI systems, and the more you understand the fundamentals, the more patterns you will spot and the better the products you will build.

Join us on a learning journey either by signing up for ArXiv Dives or simply joining the Oxen.ai Discord community.

Original post


yug6789 · March 29, 2024, 4:28 AM

Wow! So many resources, all easy to find in one place.
Thank you, as always. (__)