Google Gemma 7B and 2B LLM models are now available to developers as OSS on Hugging Face
On 21 February 2024, Google open sourced the Gemma 7B and 2B LLM models, publishing them on Hugging Face.
I did some quick testing of these two models (roughly 10 GB and 32 GB of RAM required, respectively) on the hardware configurations below using the GGUF format.
See the announcement on the Google Developers Blog, "Gemma: Introducing new state-of-the-art open models", and the Google Gemma and Google DeepMind teams' research report:
https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf
This open sourcing of the Google Gemma LLM models is extremely good news for the developer community we are a part of. They join the existing ecosystem of open models hosted on Hugging Face.
The GGUF-format models run on the llama.cpp engine by default.
See my previous article https://medium.com/@obrienlabs/running-the-70b-llama-2-llm-locally-on-metal-via-llama-cpp-on-mac-studio-m2-ultra-32b3179e9cbe on setting up a CPU, GPU, or shared-memory Apple Silicon environment as a local stand-in for CSP hardware such as the L4 on Google Cloud.
The models were trained on the latest generation of Google TPU hardware, the TPUv5e.
Running the Gemma 2B and 7B models on local hardware
On Apple Silicon
Make sure to clone or pull the latest version of llama.cpp (post 2024-02-22 03:00 UTC) to get Gemma model support.
(Thank you to the early adopters on the GitHub issue below, which I found while searching on a segmentation fault from a week-old build of llama.cpp.)
https://github.com/abetlen/llama-cpp-python/issues/1207
Clone or pull from
https://github.com/ggerganov/llama.cpp
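For context, a minimal end-to-end setup sketch on Apple Silicon is below. The Hugging Face repo id google/gemma-2b and the gemma-2b.gguf filename are assumptions based on the launch listing, and huggingface-cli needs a token for an account that has accepted the Gemma terms:

# build llama.cpp with Metal enabled (the default on Apple Silicon)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# pull the fp32 GGUF weights published by Google (repo id and filename assumed)
huggingface-cli download google/gemma-2b gemma-2b.gguf --local-dir models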
llama.cpp % ./main -m models/gemma-2b.gguf -p "Describe how gold is made in colliding neutron stars" -n 2000 -e --color
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 19/19 layers to GPU
llm_load_tensors: Metal buffer size = 9561.30 MiB
llm_load_tensors: CPU buffer size = 2001.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 508.25 MiB, (10080.44 / 21845.34)
llama_print_timings: eval time = 14966.81 ms / 472 runs ( 31.71 ms per token, 31.54 tokens per second)
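The 2B model fits in about 10 GB of unified memory at the full 32 bits per weight; if the 32 GB fp32 7B GGUF is too large for your machine, the quantize tool that ships with llama.cpp can shrink it, at some quality cost. A rough sketch (the q4_0 output filename is my own choice):

./quantize models/gemma-7b.gguf models/gemma-7b-q4_0.gguf q4_0
./main -m models/gemma-7b-q4_0.gguf -p "Describe how gold is made in colliding neutron stars" -n 2000 -e --color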
On Windows with an i9-13900KS or RTX A6000+
C:/wse_github/llama.cpp $ ./main.exe -m g:/models/gemma-7b.gguf -p "what portion of gold is made in exploding stars" -n 2000 -e --color -t 24
llm_load_print_meta: model size = 31.81 GiB (32.00 BPW)
llama_print_timings: eval time = 412311.19 ms / 1029 runs ( 400.69 ms per token, 2.50 tokens per second)
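The Windows numbers above are CPU-only (24 threads, no layers offloaded). If llama.cpp is rebuilt with cuBLAS support, the -ngl flag offloads layers to the RTX A6000; a hedged sketch using the CMake flags from the February 2024 tree (the output path may differ by generator):

cmake -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
./build/bin/Release/main.exe -m g:/models/gemma-7b.gguf -p "what portion of gold is made in exploding stars" -n 2000 -e --color -ngl 99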
I look forward to working with the ML developer community on these additional models.