llm.c: GPT-2 training code in pure C/CUDA, without an ML framework

9bow · April 9, 2024, 1:11 PM

PyTorchKR

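Andrej Karpathy ~~(a.k.a. "God-pathy")~~, known for his work at OpenAI and Tesla, has published a new GitHub repository called llm.c. The repository shows that a large language model (LLM) can be trained in pure C/CUDA, without the usual heavyweight machine learning libraries. It trains a model like GPT-2 with simple, clean code and few dependencies, which makes it a good opportunity to explore how approachable understanding and optimizing such a model can be. Let's look at llm.c's main features and how to use it.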


Introduction

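llm.c is a project for training large language models (LLMs) in plain C and CUDA. Its goal is to make LLM training possible without existing machine learning frameworks such as PyTorch or TensorFlow. GPT-2 is the first model it implements, and working through that example helps show how such models are put together and how they can be trained efficiently.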

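Compared with existing machine learning frameworks such as PyTorch or TensorFlow, llm.c takes a much simpler and lighter approach: it has fewer dependencies, far less code, and lets you follow how the model works directly from the source. That makes it especially useful for developers and researchers who want a deep understanding of the model's structure and behavior.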

Key features and highlights

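Pure C/CUDA implementation: llm.c runs without PyTorch (about 245MB) or CPython (about 107MB), and trains GPT-2 (CPU, fp32) in roughly 1,000 lines of code.
Simple, compact code: it compiles and runs almost instantly, and it exactly matches the PyTorch reference implementation.
Planned improvements: a direct CUDA implementation for more speed, SIMD instructions to speed up the CPU version, and support for more modern architectures (e.g., Llama2, Gemma).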

Usage

The llm.c project fine-tunes GPT-2 on small datasets such as tinyshakespeare or TinyStories. The steps are as follows:

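Download and tokenize a dataset. The tinyshakespeare dataset, for example, is small, so it downloads and tokenizes quickly.
Download the GPT-2 weights released by OpenAI, which serve as the starting point for fine-tuning, and save them as a checkpoint that can be loaded from C.
Compile the code with make, then start training. The program loads the model weights and tokens, runs a few training iterations, and then generates a sample from the model.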
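
To make the second step more concrete, here is a minimal sketch of what "a checkpoint that can be loaded from C" can look like: a small integer header followed by raw float32 values, which a C program could read with two `fread` calls. The file layout, the `MAGIC` value, and the function names are all hypothetical and chosen for illustration; this is not llm.c's actual checkpoint format.

```python
import os
import struct
import tempfile

# Hypothetical layout, NOT llm.c's actual file format:
# [magic:int32][num_params:int32][params:float32 * num_params], little-endian.
MAGIC = 20240409  # arbitrary made-up marker for this sketch


def save_checkpoint(path, params):
    """Write a flat list of floats as a header followed by raw float32 values."""
    with open(path, "wb") as f:
        f.write(struct.pack("<ii", MAGIC, len(params)))
        f.write(struct.pack(f"<{len(params)}f", *params))


def load_checkpoint(path):
    """Read the file back the way a C program would: header first, then one bulk read."""
    with open(path, "rb") as f:
        magic, n = struct.unpack("<ii", f.read(8))
        if magic != MAGIC:
            raise ValueError("not a checkpoint written by save_checkpoint")
        return list(struct.unpack(f"<{n}f", f.read(4 * n)))


def roundtrip_demo():
    """Save a toy weight vector, then return the file size and the restored values."""
    weights = [0.5, -1.25, 3.0, 0.0]  # values exactly representable in float32
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy_checkpoint.bin")
        save_checkpoint(path, weights)
        size = os.path.getsize(path)
        restored = load_checkpoint(path)
    return size, restored
```

A flat, header-plus-array layout like this is convenient on the C side because the whole parameter block can be read into one contiguous buffer, with no parsing beyond the header.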

Training results

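The llm.c repository reports the following output from training GPT-2 on Apple Silicon (M3 Max). You first set the number of threads to match your CPU core count; the run then trains with the Adam optimizer at a learning rate of 1e-4 for a few iterations and generates a text sample from the model. The code includes the forward and backward pass of every layer, so you can inspect the model's structure and behavior directly.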

[GPT-2]
max_seq_len: 1024
vocab_size: 50257
num_layers: 12
num_heads: 12
channels: 768
num_parameters: 124439808
train dataset num_batches: 1192
val dataset num_batches: 128
num_activations: 73323776
val loss 5.252026
step 0: train loss 5.356189 (took 1452.121000 ms)
step 1: train loss 4.301069 (took 1288.673000 ms)
step 2: train loss 4.623322 (took 1369.394000 ms)
step 3: train loss 4.600470 (took 1290.761000 ms)
... (truncated) ...
step 39: train loss 3.970751 (took 1323.779000 ms)
val loss 4.107781
generated: 50256 16773 18162 21986 11 198 13681 263 23875 198 3152 262 11773 2910 198 1169 6002 6386 2583 286 262 11858 198 20424 428 3135 7596 995 3675 13 198 40 481 407 736 17903 11 329 703 6029 706 4082 198 42826 1028 1128 633 263 11 198 10594 407 198 2704 454 680 1028 262 1027 28860 286 198 3237 323
step 40: train loss 4.377757 (took 1366.368000 ms)
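
As a sanity check on the log above, the `num_parameters` value can be reproduced from the printed configuration. The sketch below assumes the standard GPT-2 layout: learned position embeddings, a fused query/key/value projection, a 4x-wide MLP, biases on every linear layer, and an output projection that reuses the token embedding matrix. The function name is made up for this example. Note that `num_heads` does not affect the count, since the heads only partition the existing channels.

```python
def gpt2_param_count(vocab_size, max_seq_len, num_layers, channels):
    """Count parameters for the standard GPT-2 layout (an assumption of this sketch)."""
    C = channels
    wte = vocab_size * C          # token embeddings (also reused as the output projection)
    wpe = max_seq_len * C         # learned position embeddings
    layernorm = 2 * C             # weight + bias
    attn_qkv = C * 3 * C + 3 * C  # fused query/key/value projection
    attn_proj = C * C + C         # attention output projection
    mlp_fc = C * 4 * C + 4 * C    # MLP expansion to 4*C
    mlp_proj = 4 * C * C + C      # MLP projection back to C
    per_layer = 2 * layernorm + attn_qkv + attn_proj + mlp_fc + mlp_proj
    # embeddings + all transformer blocks + the final layernorm
    return wte + wpe + num_layers * per_layer + layernorm


total = gpt2_param_count(vocab_size=50257, max_seq_len=1024, num_layers=12, channels=768)
print(total)  # 124439808
```

The result matches the `num_parameters: 124439808` line in the log, which is consistent with the smallest (124M) GPT-2 configuration.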

Further reading

llm.c repository

https://github.com/karpathy/llm.c

Tutorial: implementing GPT-2's layer normalization in C

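This tutorial is aimed at developers who want to understand complex machine learning models more deeply by implementing a layer normalization layer. Layer normalization normalizes the activations inside the model (for each position, across the channel dimension), which helps mitigate the "Internal Covariate Shift" problem that can arise during training. This tends to speed up training and lead to better results: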

github.com/karpathy/llm.c

doc/layernorm/layernorm.md

master


# layernorm

Quick tutorial. Let's look at how LayerNorm is handled, as one example layer in the model. We start with the [PyTorch docs for LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html). LayerNorm of course comes from this original paper by [Ba et al. 2016](https://arxiv.org/abs/1607.06450), and was incorporated into the Transformer in [Vaswani et al.](https://arxiv.org/abs/1706.03762) famous paper Attention is All You Need. [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) picked up the same architecture as the Transformer, but the position of the LayerNorm was famously moved into what is now called the pre-normalization version. That is, the residual path of the Transformer is kept clean, and the LayerNorms are now the first layer of each block of the Transformer. This positively improves training stability.

The first thing to note when looking at [PyTorch LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) is that you will most likely not be able to find the actual implementation of the equation. That's because it is buried 30 layers deep in the code, behind an inscrutable dynamical dispatcher, in some possibly auto-generated CUDA code (for those who are interested in details, see [layer_norm.cpp](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/layer_norm.cpp) and  [layer_norm_kernel.cu](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/layer_norm_kernel.cu)). This is done because PyTorch really really cares about efficiency, fair enough. For our purposes though, we have to start by first implementing LayerNorm manually using simpler PyTorch operations. This will be a lot less efficient than just forwarding a `LayerNorm` module, but it is algorithmically instructive. So here is the direct implementation of the math of LayerNorm using simpler PyTorch operations:

```python
import torch
eps = 1e-5

class LayerNorm:

    @staticmethod
    def forward(x, w, b):
        # x is the input activations, of shape B,T,C
        # w are the weights, of shape C
        # b are the biases, of shape C
        B, T, C = x.size()
        # calculate the mean
        # ... (excerpt truncated here; the full file is doc/layernorm/layernorm.md in the repository)
```
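
Since the excerpt stops at the mean computation, here is a minimal, dependency-free sketch of the LayerNorm forward pass it is building toward. It is not the code from the tutorial: it uses plain Python lists instead of tensors, treats `x` as a list of length-C rows rather than a B,T,C tensor, and the function name is chosen for this example. The formula is the standard one: subtract the mean, divide by the square root of the (biased) variance plus epsilon, then scale by `w` and shift by `b`.

```python
import math

EPS = 1e-5  # same epsilon as the excerpt above


def layernorm_forward(x, w, b, eps=EPS):
    """Normalize each length-C row of x, then scale by w and shift by b."""
    out = []
    for row in x:
        C = len(row)
        mean = sum(row) / C
        var = sum((v - mean) ** 2 for v in row) / C  # biased variance (divide by C)
        rstd = 1.0 / math.sqrt(var + eps)            # reciprocal standard deviation
        out.append([(v - mean) * rstd * wi + bi for v, wi, bi in zip(row, w, b)])
    return out


x = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0]]
w = [1.0] * 4  # identity scale
b = [0.0] * 4  # zero shift
y = layernorm_forward(x, w, b)
```

With identity scale and zero shift, each output row has mean approximately 0 and variance approximately 1; a constant row such as the second one maps to all zeros, because epsilon keeps the division well defined when the variance is 0.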

Andrej Karpathy's post introducing his latest project

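This post is based on a summary produced with a GPT model, so some of it may differ from the content or intent of the original. If the topic interests you, please consult the original as well! If you notice anything awkward or incorrect while reading, please let us know in the comments.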
