## Add full GPU inference of LLaMA on Apple Silicon using Metal
- The initial idea was proposed and explained here: https://github.com/ggerganov/llama.cpp/discussions/915
- A basic PoC was demonstrated here: https://github.com/ggerganov/ggml/pull/108
### Demo
M1 Pro + 7B LLaMA:
https://github.com/ggerganov/llama.cpp/assets/1991296/1d4591b4-34e7-407c-8a7b-9b92d6d50287
M2 Max + 7B LLaMA:
https://github.com/ggerganov/llama.cpp/assets/1991296/fd99d9e4-9f6b-4954-b4e9-ce5fbfb4a7d1
M2 Max + 13B LLaMA:
https://github.com/ggerganov/llama.cpp/assets/1991296/60fb8bc1-21d0-4091-aaa9-6c764592511d
M2 Max + 65B LLaMA:
https://github.com/ggerganov/llama.cpp/assets/1991296/b96552c6-9011-4d12-8cba-a95a6fb5bea7
### Details
- The `ggml` API is extended in [ggml-metal.h](https://github.com/ggerganov/llama.cpp/blob/metal/ggml-metal.h)
- The Metal shaders / kernels are implemented in [ggml-metal.metal](https://github.com/ggerganov/llama.cpp/blob/metal/ggml-metal.metal)
- This PR implements support only for `Q4_0`, but all other quantizations can easily be added in the future
- Works well with `mmap` to avoid duplicating the model data in memory. Still, there are a few memory improvements that can be made in the future to reduce memory usage when Metal is enabled
- The core of the implementation is contained in the [ggml_metal_graph_compute()](https://github.com/ggerganov/llama.cpp/blob/metal/ggml-metal.m#L233-L672) function. It is analogous to the CPU-only `ggml_graph_compute()` and its purpose is to evaluate a `ggml_cgraph` on the GPU in a similar way (a minimal usage sketch follows this list)
- The implemented shaders currently focus on `qMatrix` x `Vector` multiplication, which is normally needed for LLM text generation. For other tasks that involve `Matrix` x `Matrix` (for example, prompt ingestion, perplexity computation, etc.) we don't have an efficient implementation yet, so we [fall back to the CPU / ANE](https://github.com/ggerganov/llama.cpp/blob/db3db9e7749c4b7681c96272c87fdbf6b1e235e7/llama.cpp#L1438-L1463)
- There is a nice separation of the implementation: the new `ggml-metal.h`, `ggml-metal.m` and `ggml-metal.metal` files are optional and all Metal-related code is contained within them. 3rd party user apps can decide whether they want to include / modify / ignore them
- The proposed implementation can easily be extended to other backends, such as CUDA, by following the same pattern as demonstrated in this PR
- Optionally, we now have support for exporting static computation graphs. Creation and usage are demonstrated in the [metal](https://github.com/ggerganov/llama.cpp/tree/metal/examples/metal) example
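
For illustration, here is a minimal sketch (not part of this PR) of how a 3rd party app could drive the optional Metal backend. The function names come from `ggml-metal.h` / `ggml.h` as added here, but the exact signatures may differ between versions, and the tensor sizes and buffer label are arbitrary - treat this as a hedged outline, not reference code:

```c
// Hedged sketch: evaluate a single F16 matrix x F32 vector product on the GPU.
// Assumes the API shape of ggml-metal.h from this PR; verify signatures before use.
#include "ggml.h"
#include "ggml-metal.h"

int main(void) {
    // regular ggml context that owns the tensor data and the graph
    struct ggml_init_params params = {
        /*.mem_size   =*/ 64*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // the kernels in this PR cover Q4_0 and F16 matrices multiplied by F32 vectors
    // (tensor data left uninitialized for brevity)
    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, 4096, 4096);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4096);
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b); // qMatrix x Vector, as in text generation

    struct ggml_cgraph gf = ggml_build_forward(c);

    // the Metal context lives next to the regular ggml context - it is purely optional
    struct ggml_metal_context * ctx_metal = ggml_metal_init();

    // map the host memory backing the tensors so the GPU can access it in place
    // (unified memory - no copies); "data" is just a label used in the log output
    ggml_metal_add_buffer(ctx_metal, "data", ggml_get_mem_buffer(ctx), ggml_get_mem_size(ctx));

    ggml_metal_graph_compute(ctx_metal, &gf); // GPU analogue of ggml_graph_compute()
    ggml_metal_get_tensor (ctx_metal, c);     // copy the result back for CPU-side use

    ggml_metal_free(ctx_metal);
    ggml_free(ctx);
    return 0;
}
```

Because the buffers are only mapped rather than copied, the CPU and GPU share the same model data, which is what makes `mmap` loading work well with Metal.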
### Usage
- Add `LLAMA_METAL=1` to your `make` command or `-DLLAMA_METAL=ON` to your `cmake` command.
- Add `-ngl 1` to the `main` command-line arguments to enable GPU inference
```bash
$ make clean
$ LLAMA_METAL=1 make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" --ignore-eos -n 64 -ngl 1
I llama.cpp build info:
I UNAME_S: Darwin
I UNAME_P: arm
I UNAME_M: arm64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL
I LDFLAGS: -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
I CC: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX: Apple clang version 14.0.3 (clang-1403.0.22.14.1)
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml.c -o ggml.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL -c examples/common.cpp -o common.o
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_ACCELERATE -DGGML_USE_METAL -DGGML_METAL_NDEBUG -c ggml-metal.m -o ggml-metal.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/main/main.cpp ggml.o llama.o common.o ggml-metal.o -o main -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/quantize/quantize.cpp ggml.o llama.o ggml-metal.o -o quantize -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-metal.o -o quantize-stats -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-metal.o -o perplexity -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-metal.o -o embedding -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_METAL pocs/vdot/vdot.cpp ggml.o ggml-metal.o -o vdot -framework Accelerate -framework Foundation -framework Metal -framework MetalKit -framework MetalPerformanceShaders
==== Run ./main -h for help. ====
main: build = 653 (db3db9e)
main: seed = 1685893102
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size = 256.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/ggerganov/development/github/llama.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x120a06020
ggml_metal_init: loaded kernel_mul 0x120a065a0
ggml_metal_init: loaded kernel_mul_row 0x120a06bd0
ggml_metal_init: loaded kernel_scale 0x120a070f0
ggml_metal_init: loaded kernel_silu 0x120a07610
ggml_metal_init: loaded kernel_relu 0x120a07b30
ggml_metal_init: loaded kernel_soft_max 0x120a081e0
ggml_metal_init: loaded kernel_diag_mask_inf 0x120a08840
ggml_metal_init: loaded kernel_get_rows_q4_0 0x120a08ec0
ggml_metal_init: loaded kernel_rms_norm 0x120a09570
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x120a09dd0
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x120a0a7a0
ggml_metal_init: loaded kernel_rope 0x120a0b090
ggml_metal_init: loaded kernel_cpy_f32_f16 0x120a0b920
ggml_metal_init: loaded kernel_cpy_f32_f32 0x120a0c1b0
ggml_metal_add_buffer: allocated 'data ' buffer, size = 3616.07 MB
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 768.00 MB
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 258.00 MB
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 512.00 MB
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 512.00 MB
system_info: n_threads = 8 / 10 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0
I believe the meaning of life is to be happy.
That's what I would call my philosophy on how to live life, that's what I want people to remember me for.
I was actually diagnosed with a tumor when I was 17 years old and had a very long surgery in order to get it removed.
llama_print_timings: load time = 1685.43 ms
llama_print_timings: sample time = 45.70 ms / 64 runs ( 0.71 ms per token)
llama_print_timings: prompt eval time = 342.51 ms / 8 tokens ( 42.81 ms per token)
llama_print_timings: eval time = 3079.50 ms / 63 runs ( 48.88 ms per token)
llama_print_timings: total time = 4816.85 ms
```
## Implementation process of this PR (archive)
- [x] Export a `ggml` computation graph of a LLaMA model:
```bash
./bin/main -m ../models/7B/ggml-model-q4_0.bin --export
```
This creates the `llama.ggml` file, which contains the computation graph
- [x] We will now load it with a separate tool and attempt to evaluate it with Metal (a hedged sketch of this step follows the checklist):
```bash
./bin/mtl llama.ggml
```
- [x] Implement the entire network layer by layer, comparing the CPU and GPU results
- [x] GET_ROWS_Q4_0
- [x] RMS_NORM
- [x] MUL
- [x] MUL_MAT
- [x] RESHAPE
- [x] TRANSPOSE
- [x] ROPE
- [x] VIEW
- [x] CPY
- [x] SCALE
- [x] DIAG_MASK_INF
- [x] SOFT_MAX
- [x] SILU
- [x] Optimize the kernels to achieve at least parity with CPU-only speed
- [x] Adjust dynamic shapes before evaluating the graph (i.e. `n_past`, `N`)
- [x] Simplify encoder dispatch code, reduce duplication
- [x] Add basic text-generation example
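
For reference, a hedged sketch of what a standalone loader such as `./bin/mtl` above might do with the exported `llama.ggml`. The `ggml_graph_export` / `ggml_graph_import` calls are the ggml API used for this; the input tensor name `"embd"` and the placeholder token are assumptions - see the [metal](https://github.com/ggerganov/llama.cpp/tree/metal/examples/metal) example for the real code:

```c
// Hedged sketch of importing a statically exported graph and evaluating it with Metal.
#include "ggml.h"
#include "ggml-metal.h"

#include <stdio.h>

int main(int argc, char ** argv) {
    const char * fname = argc > 1 ? argv[1] : "llama.ggml";

    // import the exported graph; ggml allocates two contexts:
    // one for the tensor data and one for the graph / eval metadata
    struct ggml_context * ctx_data = NULL;
    struct ggml_context * ctx_eval = NULL;

    struct ggml_cgraph gf = ggml_graph_import(fname, &ctx_data, &ctx_eval);

    struct ggml_metal_context * ctx_metal = ggml_metal_init();

    // expose the imported host buffers to the GPU (unified memory - no copies)
    ggml_metal_add_buffer(ctx_metal, "data", ggml_get_mem_buffer(ctx_data), ggml_get_mem_size(ctx_data));
    ggml_metal_add_buffer(ctx_metal, "eval", ggml_get_mem_buffer(ctx_eval), ggml_get_mem_size(ctx_eval));

    // inputs are located by name; "embd" is an assumed name for the token input -
    // adjust to whatever the exported graph actually uses. A real loader also has to
    // adjust the dynamic shapes (n_past, N) before each evaluation, as noted above.
    struct ggml_tensor * input = ggml_graph_get_tensor(&gf, "embd");
    if (input) {
        ((int32_t *) input->data)[0] = 1; // placeholder BOS token
        ggml_metal_set_tensor(ctx_metal, input);
    }

    ggml_metal_graph_compute(ctx_metal, &gf);

    // read back the output from the last node of the graph
    struct ggml_tensor * out = gf.nodes[gf.n_nodes - 1];
    ggml_metal_get_tensor(ctx_metal, out);
    printf("out[0] = %f\n", ((float *) out->data)[0]);

    ggml_metal_free(ctx_metal);
    ggml_free(ctx_eval);
    ggml_free(ctx_data);
    return 0;
}
```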
---
## Robots
<!--
copilot:all
-->
### <samp>🤖 Generated by Copilot at 324e823</samp>
### Summary
🍎📝🚀
<!--
1. 🍎 - This emoji represents Metal support for Apple devices, which is a major feature of this pull request.
2. 📝 - This emoji represents the improvements to the documentation, formatting, and comments of the code, which make it more readable and understandable.
3. 🚀 - This emoji represents the GPU acceleration and computation graph export/import features, which enhance the performance and usability of llama.
-->
This pull request adds Metal support for llama, a library for tensor manipulation and computation graph export/import. It introduces a new CMake option `LLAMA_METAL` and a new header file `ggml-metal.h` that enable GPU acceleration of llama expressions on Apple devices. It also improves the readability, consistency, and usability of the existing code and documentation, and adds some new features and examples. It fixes a bug in the `main` example program and adds a new `metal` example program that demonstrates how to evaluate a statically exported ggml computation graph with Metal.
> _If you want to use llama with Metal_
> _You can now do so with this pull request, all_
> _You need is to set `LLAMA_METAL`_
> _And then you can export your `ggml`_
> _To a file or a graph that is special_
### Walkthrough
* Add Metal support for llama, a GPU backend for Apple devices ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aR204-R228), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aL208-R234), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aL373-R402), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aL387-R423), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52L169-R189), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5L54-R57), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-48ef5f62f6b0ec28f7ac35de1f3ee9adaaa55d7d16649195b124c8df56885348R1-R3), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-f5177339d387bef8794ff76aa0cdae0aa28b68d63067bfc2ffe82904c096b619R1-R102), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-3cebf7a1bc643557eff27cae1c81d9894615a7f944c271eefcfca861aaa013e1R1-R63), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR19-R22), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR245-R248), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR1258-R1264), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efL1437-R1480), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR2330-R2348), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-a2f09a47e379eeeb66a7398e3a1d11a391af75829d3e7a6dd7218e221b4fcaf3L34-R34))
* Fix a bug in the example program `main.cpp` that used subtraction instead of addition to compute the sum of two numbers ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-2d3599a9fad195f2c3c60bd06691bc1815325b3560b5feda41a91fa71194e805L137-R144))
* Add a command-line option `--export` to the example program `main.cpp` that allows exporting the computation graph to a file named `llama.ggml` ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-81037668d1ec4be4b72740f4070add30efb0b021c28e93e41c2a0a2062ba10e8R302-R303), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-81037668d1ec4be4b72740f4070add30efb0b021c28e93e41c2a0a2062ba10e8R443), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-9ee6fe1df475323e1bc2457aa555d36474236452b27af58909fc410c9a5fe642R74))
* Add a function `llama_eval_export` that exports a static computation graph for a context of 511 and a batch size of 1 using `llama_eval_internal` ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR2979-R2992), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-a2f09a47e379eeeb66a7398e3a1d11a391af75829d3e7a6dd7218e221b4fcaf3R176-R181))
* Change the logic of the function `ggml_graph_import` to parse the arguments of the tensor before creating it, and to handle different cases of view operations differently ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dR15011-R15012), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dL14987-R15099))
* Change the logic of the function `ggml_nbytes` to handle cases where the tensor is not contiguous in memory ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dL3735-R3742))
* Add a call to `ggml_scratch_save` and `ggml_scratch_load` to the functions `ggml_view_1d`, `ggml_view_2d`, `ggml_view_3d` and `ggml_view_4d` to preserve the scratch memory state when creating a new tensor for the offset ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dL5805-R5823), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dR5852-R5858), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dR5898-R5904), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dR5946-R5952))
* Add a call to `ggml_set_name` to the functions `ggml_view_2d`, `ggml_view_3d` and `ggml_view_4d` to assign a name to the result tensor for debugging purposes ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dR5867), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dR5913), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dR5961))
* Add a call to `ggml_set_name` to the function `llama_eval_internal` to assign a name to the tensor `Vcur` for debugging purposes ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR1293))
* Add a parameter `cgraph_fname` to the function `llama_eval_internal` that allows exporting the computation graph to a file if not null ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efL1192-R1212), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efL1437-R1480), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efL2902-R2964))
* Add a variable `eop` to the function `ggml_graph_import` that stores the enum value of the operation code for convenience ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dR15011-R15012))
* Add a `const` qualifier to the variables `mean` and `x0` in the functions `ggml_compute_forward_rms_norm_f32` and `ggml_compute_forward_rope_f32` to indicate that they are not modified after initialization ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dL9248-R9287), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dL11159-R11198), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dL11180-R11219))
* Change the return type of the function `ggml_nrows` from `int` to `int64_t` to match the type of the `ne` field of the `ggml_tensor` struct ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dL3726-R3726))
* Change the visibility of the functions `ggml_is_transposed` and `ggml_is_contiguous` from static inline to public by adding them to the ggml.h header file ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dL3817-R3828), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-f0f2d0dc971e0aa60560e7e3bc1d512b4bf914aedf44333f7008c605433cd394R445-R447))
* Increase the width of the last column in the format strings of the functions `ggml_graph_export_leaf` and `ggml_graph_export_node` to accommodate longer tensor names ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dL14584-R14623), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dL14598-R14637))
* Comment out two assertions in the function `ggml_graph_export` that check the work buffer size of the computation graph, because they are not valid when exporting a graph with Metal support ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dL14611-R14651))
* Remove an empty line from the function `ggml_graph_export` for consistency ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-6d9ce99fcb6f51ff76f59e479f6e6fc0bb62edef7442805d7a5bb15b23996b5dL14833))
* Remove the declaration of the variable `cur` from the function `llama_eval_internal` because it is declared later in the same scope ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efL1253-L1254))
* Replace the variable `inpL` with `cur` in the function `llama_eval_internal` to reflect the previous changes in the tensor creation logic ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efL1404-R1467), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efL1448-R1495))
* Remove an empty line from the function `llama_eval_internal` for consistency ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efL1322))
* Add an empty line to the function `llama_eval_internal` for readability ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR1283))
* Format the call to `llama_model_load` in the function `llama_init` to use multiple lines and indentation for readability ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efL2248-R2292))
* Format the declarations of the functions `ggml_init` and `ggml_free` in the ggml.h header file to use multiple lines and indentation for readability ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-f0f2d0dc971e0aa60560e7e3bc1d512b4bf914aedf44333f7008c605433cd394L450-R454))
* Format the target link libraries command for llama to use multiple lines and indentation for readability ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aL387-R423))
* Align the spacing of the memory requirements expressions in the function `llama_model_load_internal` for readability ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efL1080-R1088))
* Align the spacing of the CMake options for llama to make them more consistent and readable ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aL67-R74))
* Rename the variable `GGML_CUDA_SOURCES` to `GGML_SOURCES_CUDA` to match the naming convention of other source variables in the CMake file ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aL186-R187), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aL398-R433))
* Add a subdirectory `metal` to the examples CMake file if `LLAMA_METAL` is enabled ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-940e5356fa1137e646ea135a6f68cacbfad4fe7124c3a69163468f588acf9283L40-R43))
* Add an empty line to the README.md file for readability ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R238-R263))
* Add empty lines to the Makefile to separate different conditional blocks for readability ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52R108), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52R120), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52L178-R206))
* Add comments to mark the end of the conditional blocks in the Makefile ([link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52L126-R129), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52L134-R143), [link](https://github.com/ggerganov/llama.cpp/pull/1642/files?diff=unified&w=0#diff-76ed074a9305c04054cdebb9e9aad2d818052b07091de1f20cad0bbac34ffb52L159-R167))