Google Gemma 7B and 2B LLM models are now available to developers as OSS on Hugging Face

Michael O'Brien
3 min read · Feb 22, 2024


On February 21, 2024, Google open sourced the Gemma 7B and 2B LLM models and published them on Hugging Face.
I did some quick testing of these two models (roughly 10 GB and 32 GB of RAM required, respectively) on the hardware configurations below using the GGUF format.

See the announcement on the Google Developers Blog

Gemma: Introducing new state-of-the-art open models

and the Gemma team's technical report from Google DeepMind

https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf

This open sourcing of the Google Gemma LLM models is very good news for the developer community we are a part of. They join the existing ecosystem of open models hosted on Hugging Face.

The GGUF models run on the llama.cpp engine by default.
See my previous article https://medium.com/@obrienlabs/running-the-70b-llama-2-llm-locally-on-metal-via-llama-cpp-on-mac-studio-m2-ultra-32b3179e9cbe on setting up a CPU, GPU, or shared-memory Apple Silicon environment as a local proxy for CSP hardware such as the L4 GPU on Google Cloud.
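As a sketch, the GGUF weights themselves can be pulled from Hugging Face with the huggingface_hub CLI. Note that the repository and file names below are assumptions based on the Google model cards, and Gemma is a gated model, so you need to accept the license on Hugging Face and authenticate with an access token first:

# install the Hugging Face CLI, log in, and download the Gemma 2B GGUF weights
# (google/gemma-2b and gemma-2b.gguf are assumed names; use gemma-7b for the larger model)
pip install huggingface_hub
huggingface-cli login
huggingface-cli download google/gemma-2b gemma-2b.gguf --local-dir models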

The model was trained on the latest generation of Google TPU hardware — the TPUv5e.

Running Gemma 2B and 7B on local hardware

On Apple Silicon

Make sure to clone or pull the latest version of llama.cpp (post 2024-02-22 03:00 UTC) to get Gemma model support.

(Thank you to the early adopters on the GitHub issue below, which I found while searching on a segmentation fault after running a week-old version of llama.cpp.)
https://github.com/abetlen/llama-cpp-python/issues/1207

From

https://github.com/ggerganov/llama.cpp
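A fresh clone and build (or a pull and rebuild of an existing checkout) along these lines should pick up Gemma support; on Apple Silicon the standard make target builds with Metal enabled:

# fresh clone and Metal build; for an existing checkout use "git pull" then "make clean && make"
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make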

llama.cpp % ./main -m models/gemma-2b.gguf -p "Describe how gold is made in colliding neutron stars" -n 2000 -e --color
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 19/19 layers to GPU
llm_load_tensors: Metal buffer size = 9561.30 MiB
llm_load_tensors: CPU buffer size = 2001.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 508.25 MiB, (10080.44 / 21845.34)
llama_print_timings: eval time = 14966.81 ms / 472 runs ( 31.71 ms per token, 31.54 tokens per second)
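The log above shows all 19 layers of the 2B model being offloaded to the Metal GPU without any extra flags; the number of offloaded layers can also be set explicitly with -ngl, which is a useful knob when a larger model does not fit in unified memory (the value below is illustrative):

# offload a specific number of layers to the GPU (19 covers the whole 2B model)
./main -m models/gemma-2b.gguf -p "Describe how gold is made in colliding neutron stars" -n 2000 -e --color -ngl 19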

On Windows on an i9-13900KS or RTX A6000+

C:/wse_github/llama.cpp $  ./main.exe -m g:/models/gemma-7b.gguf  -p "what portion of gold is made in exploding stars" -n 2000 -e --color -t 24
llm_load_print_meta: model size = 31.81 GiB (32.00 BPW)
llama_print_timings: eval time = 412311.19 ms / 1029 runs ( 400.69 ms per token, 2.50 tokens per second)
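The log reports 32.00 BPW, i.e. fp32 weights, which explains the ~32 GiB footprint. If that is too large for your machine, llama.cpp ships a quantize tool that shrinks the model considerably at some cost in output quality; the quant type and output file name below are illustrative:

# quantize the fp32 GGUF to 4-bit (roughly 7x smaller), then run the quantized model
./quantize.exe g:/models/gemma-7b.gguf g:/models/gemma-7b-q4_0.gguf q4_0
./main.exe -m g:/models/gemma-7b-q4_0.gguf -p "what portion of gold is made in exploding stars" -n 2000 -e --color -t 24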

I look forward to working with the ML developer community on these additional models.

https://www.linkedin.com/posts/michaelobrien-developer_google-gemma-7b-and-2b-llm-models-are-now-activity-7166291905583005696-su7g?utm_source=share&utm_medium=member_desktop


Written by Michael O'Brien
Architect/Developer @ Cisco | ex Google