📖 Foundations of Large Language Models book [PDF/English/231p] (feat. arXiv)

Introducing Foundations of Large Language Models

"Foundations of Large Language Models" is a document written by Tong Xiao and Jingbo Zhu and published on arXiv (it is long enough that I will call it a book :sweat_smile:). It covers the basic principles, techniques, and development of LLMs, and explains them comprehensively, from the core concepts through training methodology.

Table of contents of Foundations of Large Language Models

Chapter 1: Pre-training

Covers the concepts and methodology of pre-training, the foundation of NLP models. It explains the principles of unsupervised, supervised, and self-supervised learning and how they differ, and uses models such as BERT to walk through a concrete pre-training process and its applications.

  • 1.1. Pre-training NLP Models
    • Unsupervised, Supervised and Self-supervised Pre-training: methodology and characteristics of unsupervised, supervised, and self-supervised learning
    • Adapting Pre-trained Models: how to adapt pre-trained models to a variety of downstream tasks
  • 1.2. Self-supervised Pre-training Tasks
    • Decoder-only Pre-training: decoder-only pre-training methods and language generation models
    • Encoder-only Pre-training: encoder-only pre-training methods and producing text representations
    • Encoder-Decoder Pre-training: encoder-decoder pre-training and its uses
  • 1.3. Example: BERT
    • The Standard Model: the architecture of the BERT model and how it is trained
    • Multilingual Models: extending to multilingual models and improving efficiency
  • 1.4. Applying BERT Models
    • Concrete examples of applying pre-trained BERT to a variety of NLP tasks

Chapter 2: Generative Models

Explores the design, training, and scalability of generative models, the main type of large language model. It compares decoder-only models such as GPT with encoder-decoder models such as T5, and covers long-sequence modeling and distributed training techniques for these models.

  • 2.1. A Brief Introduction to LLMs
    • Decoder-only Transformers: the architecture and characteristics of decoder-only models such as GPT
    • Training LLMs: the training process and techniques for large language models
    • Fine-tuning LLMs: optimizing task-specific performance by fine-tuning the model
    • Prompting LLMs: adapting the model to specific tasks with prompts
  • 2.2. Training at Scale
    • Data Preparation: data preparation and preprocessing
    • Distributed Training: distributed training techniques and efficiency improvements
    • Scaling Laws: the relationship between training performance and model size
  • 2.3. Long Sequence Modeling
    • Efficient Architectures: efficient model architectures for long sequences
    • Cache and Memory Management: memory and caching optimization techniques

Chapter 3: Prompting

Covers prompt design and advanced techniques for taking advantage of the strong adaptability of large language models. It introduces the principles and design strategies of zero-shot and few-shot learning, along with advanced prompting methods for problem solving.

  • 3.1. General Prompt Design
    • Basics: basic principles of prompt design
    • In-context Learning: learning from examples given in the prompt
    • Prompt Engineering Strategies: strategies for designing effective prompts
  • 3.2. Advanced Prompting Methods
    • Chain of Thought: prompt design for step-by-step problem solving
    • Problem Decomposition: breaking a problem into smaller units
    • RAG and Tool Use: retrieval-augmented generation and the use of external tools
  • 3.3. Learning to Prompt
    • Prompt Optimization: methods for optimizing prompts

Chapter 4: Alignment

Covers techniques for aligning large language models with human values and expectations. It explains user-centered approaches to improving models, including reinforcement learning from human feedback (RLHF), data collection, and reward modeling.

  • 4.1. An Overview of LLM Alignment
    • Instruction Alignment: instruction-based alignment
    • Human Feedback Alignment: alignment using human feedback
  • 4.2. Improved Human Preference Alignment
    • Better Reward Modeling: designing better reward models
    • Direct Preference Optimization: training directly to optimize for preferences

License

"Foundations of Large Language Models" is distributed under the Creative Commons Attribution-NonCommercial 4.0 Unported License (CC BY-NC 4.0) and can be used freely for non-commercial purposes.

:open_book: Foundations of Large Language Models book

This post is based on a summary produced with a GPT model, so some of it may differ from the content or intent of the original. If the topic interests you, please read the original as well! If you notice anything awkward or incorrect while reading, please let me know in the comments. :hugs:
