Open-Sora, an open-source model for high-quality video generation (feat. HPC-AI)

9bow · March 18, 2024, 9:57 PM

Introduction

https://hpc-ai.com/hs-fs/hubfs/1-街景.gif?width=512&height=512&name=1-街景.gif
A city night scene video generated with Open-Sora 1.0

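Inspired by OpenAI's Sora, the Open-Sora project is an open-source project for generating high-quality video. It supports the full video generation pipeline, from video data preprocessing through training and inference, and the model weights have been released as well. The released weights, which took only 3 days of training, can generate 2-second 512x512 videos. In particular, Open-Sora focuses on producing high-quality video efficiently and on making the model, tools, and content accessible to everyone.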

Key points

Open-Sora 1.0 is based on the currently popular Diffusion Transformer (DiT) architecture and extends PixArt-α, a high-quality open-source text-to-image model, to video generation. The model is an STDiT (Spatial Temporal Diffusion Transformer) that uses spatial-temporal attention: to model temporal relationships in video data, it stacks a 1D temporal attention module on top of a 2D spatial attention module in a serial fashion.
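To make the serial "spatial, then temporal" arrangement concrete, here is a minimal pure-Python sketch. It is not the STDiT implementation: the attention has no learned projections (q = k = v), there is a single head, and the token counts used in the usage note are hypothetical. It only illustrates which tokens attend to which, and why factorizing attention this way costs fewer token pairs than full attention over every token in the clip.

```python
import math


def self_attention(tokens):
    """Single-head softmax self-attention with q = k = v = tokens.

    Simplification for illustration: a real block would apply learned
    query/key/value projections first.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w / total * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return out


def spatial_then_temporal(video):
    """video[t][s] is the feature vector of spatial token s in frame t."""
    num_frames, num_tokens = len(video), len(video[0])
    # 2D spatial attention: tokens attend only to tokens in the same frame.
    video = [self_attention(frame) for frame in video]
    # 1D temporal attention: each spatial position attends across frames.
    columns = [
        self_attention([video[t][s] for t in range(num_frames)])
        for s in range(num_tokens)
    ]
    # Transpose back to the [frame][token] layout.
    return [[columns[s][t] for s in range(num_tokens)] for t in range(num_frames)]


def attention_pairs(num_frames, num_tokens):
    """Token pairs scored by full attention vs. the factorized variant."""
    full = (num_frames * num_tokens) ** 2
    factorized = num_frames * num_tokens ** 2 + num_tokens * num_frames ** 2
    return full, factorized
```

For example, with 16 frames and a hypothetical 256 spatial tokens per frame, `attention_pairs(16, 256)` returns `(16777216, 1114112)`, so the factorized form scores roughly 15 times fewer pairs.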

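In the training stage, a pretrained VAE (Variational Autoencoder) encoder first compresses the video data, and the proposed STDiT model is then trained together with text embeddings in the compressed latent space. In the inference stage, Gaussian noise is randomly sampled in the VAE latent space and fed into STDiT together with the prompt embedding to obtain denoised features, which are finally passed to the VAE decoder to generate the video.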
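The inference flow described above (sample noise in latent space, denoise conditioned on the prompt, decode) can be sketched as plain control flow. Everything here is a stand-in: `generate`, `toy_denoiser`, and `toy_decoder` are hypothetical names, the "latent" is a flat list of floats, and the "prompt embedding" is a single number. The real STDiT and VAE are neural networks, and this sketch says nothing about their actual interfaces or sampling schedule.

```python
import random


def generate(prompt_embedding, denoiser, decoder, latent_size, num_steps=100, seed=0):
    """Schematic latent-diffusion inference loop with pluggable components."""
    rng = random.Random(seed)
    # 1. Sample Gaussian noise in the (toy) latent space.
    latent = [rng.gauss(0.0, 1.0) for _ in range(latent_size)]
    # 2. Denoise step by step, conditioning every step on the prompt embedding.
    for step in reversed(range(num_steps)):
        latent = denoiser(latent, step, prompt_embedding)
    # 3. Decode the denoised latent back to output space.
    return decoder(latent)


def toy_denoiser(latent, step, prompt_embedding):
    # Stand-in for STDiT: nudges each value toward the conditioning signal.
    return [0.9 * x + 0.1 * prompt_embedding for x in latent]


def toy_decoder(latent):
    # Stand-in for the VAE decoder: rounds values instead of decoding frames.
    return [round(x, 3) for x in latent]


sample = generate(1.0, toy_denoiser, toy_decoder, latent_size=4)
```

The point of the sketch is the ordering of the three stages; the denoiser and decoder are passed in as arguments precisely because their real counterparts are out of scope here.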

Example videos generated with Open-Sora

Prompt: A soaring drone footage captures the majestic beauty of a coastal cliff, [...] The water gently laps at the rock base and the greenery that clings to the top of the cliff.

https://hpc-ai.com/hs-fs/hubfs/4-悬崖海岸.gif?width=512&height=512&name=4-悬崖海岸.gif

Prompt: The majestic beauty of a waterfall cascading down a cliff into a serene lake. [...] The camera angle provides a bird's eye view of the waterfall.

https://hpc-ai.com/hs-fs/hubfs/5-瀑布.gif?width=512&height=512&name=5-瀑布.gif

Prompt: A serene underwater scene featuring a sea turtle swimming through a coral reef. The turtle, with its greenish-brown shell [...]

https://hpc-ai.com/hs-fs/hubfs/6-海龟.gif?width=512&height=512&name=6-海龟.gif

Prompt: A serene night scene in a forested area. [...] The video is a time-lapse, capturing the transition from day to night, with the lake and forest serving as a constant backdrop.

https://hpc-ai.com/hs-fs/hubfs/7-星空.gif?width=512&height=512&name=7-星空.gif

How to use

The Open-Sora project provides a video production pipeline consisting of the following steps:

Installation: Install the required libraries and the project.

# create a virtual env and activate it
conda create -n opensora python=3.10
conda activate opensora

# install torch
# the command below is for CUDA 12.1, choose install commands from 
# https://pytorch.org/get-started/locally/ based on your own CUDA version
pip3 install torch torchvision

# install flash attention (optional)
pip install packaging ninja
pip install flash-attn --no-build-isolation

# install apex (optional)
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" git+https://github.com/NVIDIA/apex.git

# install xformers
pip3 install -U xformers --index-url https://download.pytorch.org/whl/cu121

# install this project
git clone https://github.com/hpcaitech/Open-Sora
cd Open-Sora
pip install -v .
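
After running the install commands, a quick way to see which of the required and optional packages ended up importable is to probe them with the standard library. This is a small sketch, not part of the project. The import names `flash_attn`, `apex`, `xformers`, and `opensora` are assumed to be what the packages above install, and `check_packages` is a hypothetical helper.

```python
import importlib.util


def check_packages(names):
    """Return {name: True/False} depending on whether each top-level module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed import names for the packages installed above.
    wanted = ["torch", "torchvision", "flash_attn", "apex", "xformers", "opensora"]
    for name, found in check_packages(wanted).items():
        print(f"{name:12s} {'found' if found else 'missing'}")
```

`find_spec` only locates the module without importing it, so the check is fast and does not initialize CUDA. A "missing" result for `flash_attn` or `apex` is expected if the optional steps were skipped.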

Download model weights: Download the model weights provided through Hugging Face.

Resolution	Data	#iterations	Batch Size	GPU days (H800)
16×256×256	366K	80k	8×64	117
16×256×256	20K HQ	24k	8×64	45
16×512×512	20K HQ	20k	2×64	35
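
As a rough sanity check, the GPU-day figures in the table can be related to the "3 days of training" mentioned in the introduction. The sketch below assumes that the batch size column reads as per-GPU batch × number of GPUs, so that all three stages ran on 64 GPUs. The table itself does not state this, so treat the result as an estimate under that assumption.

```python
# Rows copied from the table above:
# (resolution, data, iterations, (assumed per-GPU batch, assumed GPU count), GPU days on H800)
stages = [
    ("16x256x256", "366K", 80_000, (8, 64), 117),
    ("16x256x256", "20K HQ", 24_000, (8, 64), 45),
    ("16x512x512", "20K HQ", 20_000, (2, 64), 35),
]

ASSUMED_NUM_GPUS = 64  # assumption: the "×64" in the batch size column is the GPU count

total_gpu_days = sum(gpu_days for *_, gpu_days in stages)
wall_clock_days = total_gpu_days / ASSUMED_NUM_GPUS

for resolution, data, iterations, (per_gpu, gpus), gpu_days in stages:
    samples_seen = iterations * per_gpu * gpus
    print(f"{resolution} on {data}: {samples_seen:,} samples seen, {gpu_days} GPU days")

print(f"total: {total_gpu_days} GPU days, about {wall_clock_days:.2f} days on {ASSUMED_NUM_GPUS} GPUs")
```

Under that reading, the three stages add up to 197 GPU days, or just over 3 days of wall-clock time on 64 GPUs, which is consistent with the 3-day figure.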

Run inference: Generate video samples with the provided weights. Examples for custom settings are provided. (You need to download the T5 weights and place them in pretrained_models/t5_ckpts/t5-v1_1-xxl.)

# Sample 16x256x256 (5s/sample)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x256x256.py --ckpt-path ./path/to/your/ckpt.pth

# Sample 16x512x512 (20s/sample, 100 time steps)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/16x512x512.py --ckpt-path ./path/to/your/ckpt.pth

# Sample 64x512x512 (40s/sample, 100 time steps)
torchrun --standalone --nproc_per_node 1 scripts/inference.py configs/opensora/inference/64x512x512.py --ckpt-path ./path/to/your/ckpt.pth

# Sample 64x512x512 with sequence parallelism (30s/sample, 100 time steps)
# sequence parallelism is enabled automatically when nproc_per_node is larger than 1
torchrun --standalone --nproc_per_node 2 scripts/inference.py configs/opensora/inference/64x512x512.py --ckpt-path ./path/to/your/ckpt.pth
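
The four commands above differ only in the config file and the number of processes, so when scripting several runs it can help to build the argument list in one place. The helper below is a hypothetical convenience wrapper, not something the project provides. It uses only the flags that appear in the commands above.

```python
import shlex


def build_inference_command(config_path, ckpt_path, num_gpus=1):
    """Build the torchrun argument list for scripts/inference.py.

    Per the comment in the commands above, sequence parallelism is enabled
    automatically when num_gpus is larger than 1.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "torchrun", "--standalone", "--nproc_per_node", str(num_gpus),
        "scripts/inference.py", config_path,
        "--ckpt-path", ckpt_path,
    ]


cmd = build_inference_command(
    "configs/opensora/inference/64x512x512.py", "./path/to/your/ckpt.pth", num_gpus=2
)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Building a list rather than a single string avoids shell quoting problems when the checkpoint path contains spaces.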

Data processing: Prepare a dataset using the provided tools for processing video data.
Downloading datasets. [docs]
Split videos into clips. [docs]
Generate video captions. [docs]

Training: Download the T5 weights and train the model on the prepared dataset.

# 1 GPU, 16x256x256
torchrun --nnodes=1 --nproc_per_node=1 scripts/train.py configs/opensora/train/16x256x256.py --data-path YOUR_CSV_PATH
# 8 GPUs, 64x512x512
torchrun --nnodes=1 --nproc_per_node=8 scripts/train.py configs/opensora/train/64x512x512.py --data-path YOUR_CSV_PATH --ckpt-path YOUR_PRETRAINED_CKPT

References

Original article URL: Open-Sora GitHub

This post introduces the main features and usage of the Open-Sora project, including technical details and code snippets. Readers should come away with a deeper understanding of a recent open-source project for video production.

Further reading

Pretrained Open-Sora model weights

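This post is based on a summary produced by a GPT model, so some content may differ from the original text or its intent; please refer to the original as well! If you find anything awkward or incorrect while reading, please let us know in the comments.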

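Did you find this post, put together by the PyTorch Korea User Group, useful? If you sign up as a member, we will email you the main posts! (The default is Weekly, but you can change it to Daily.)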

A like at the bottom of the post would be encouraging~

9bow · March 18, 2024, 10:38 PM

The Citation section of the Open-Sora GitHub repository is the most surprising part.

What even is an internship these days....

@software{opensora,
  author = {Zangwei Zheng and Xiangyu Peng and Yang You},
  title = {Open-Sora: Democratizing Efficient Video Production for All},
  month = {March},
  year = {2024},
  url = {https://github.com/hpcaitech/Open-Sora}
}

Zangwei Zheng and Xiangyu Peng equally contributed to this work during their internship at HPC-AI Tech.

yug6789 · March 21, 2024, 7:51 AM

... Seriously, wow... they really built something like that during an internship....