Google Gemma 7B and 2B LLM models are now available to developers as OSS on Hugging Face
On 21 February 2024, Google open sourced the Gemma 7B and 2B LLM models, publishing them on Hugging Face.
I did some quick testing of these two models (roughly 10 GB and 32 GB of RAM required, respectively) on the hardware configurations below using the GGUF format.
See the announcement on the Google Developers Blog, "Gemma: Introducing new state-of-the-art open models", and the Google Gemma and Google DeepMind teams' research report:
https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf
This open sourcing of the Google Gemma LLM models is extremely good news for the developer community we are a part of. They join the existing ecosystem of open models hosted on Hugging Face.
The GGUF-format models run on the llama.cpp engine by default.
See my previous article https://medium.com/@obrienlabs/running-the-70b-llama-2-llm-locally-on-metal-via-llama-cpp-on-mac-studio-m2-ultra-32b3179e9cbe on setting up a CPU, GPU, or shared-memory Apple Silicon environment as a local stand-in for CSP hardware such as the L4 on Google Cloud.
The models were trained on the latest generation of Google TPU hardware, the TPUv5e.
Running the Gemma 2B and 7B models on local hardware
On Apple Silicon
Make sure to clone or pull the latest version of llama.cpp (post 2024-02-22 03:00 UTC) to get Gemma model support.
(Thank you to the early adopters on the GitHub issue below, which I found while searching on a segmentation fault from a week-old build of llama.cpp.)
https://github.com/abetlen/llama-cpp-python/issues/1207
Clone or pull from
https://github.com/ggerganov/llama.cpp
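For context, a minimal end-to-end setup sketch on Apple Silicon is below. The Hugging Face repo id google/gemma-2b and the gemma-2b.gguf filename are assumptions based on the launch listing, and huggingface-cli needs a token for an account that has accepted the Gemma terms:

# build llama.cpp with Metal enabled (the default on Apple Silicon)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# pull the fp32 GGUF weights published by Google (repo id and filename assumed)
huggingface-cli download google/gemma-2b gemma-2b.gguf --local-dir models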
llama.cpp % ./main -m models/gemma-2b.gguf -p "Describe how gold is made in colliding neutron stars" -n 2000 -e --color
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 19/19 layers to GPU
llm_load_tensors: Metal buffer size = 9561.30 MiB
llm_load_tensors: CPU buffer size = 2001.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 508.25 MiB, (10080.44 / 21845.34)
llama_print_timings: eval time = 14966.81 ms / 472 runs ( 31.71 ms per token, 31.54 tokens per second)
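The 2B model fits in about 10 GB of unified memory at the full 32 bits per weight; if the 32 GB fp32 7B GGUF is too large for your machine, the quantize tool that ships with llama.cpp can shrink it, at some quality cost. A rough sketch (the q4_0 output filename is my own choice):

./quantize models/gemma-7b.gguf models/gemma-7b-q4_0.gguf q4_0
./main -m models/gemma-7b-q4_0.gguf -p "Describe how gold is made in colliding neutron stars" -n 2000 -e --color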
On Windows with an i9-13900KS or RTX A6000+
C:/wse_github/llama.cpp $ ./main.exe -m g:/models/gemma-7b.gguf -p "what portion of gold is made in exploding stars" -n 2000 -e --color -t 24
llm_load_print_meta: model size = 31.81 GiB (32.00 BPW)
llama_print_timings: eval time = 412311.19 ms / 1029 runs ( 400.69 ms per token, 2.50 tokens per second)
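The Windows numbers above are CPU-only (24 threads, no layers offloaded). If llama.cpp is rebuilt with cuBLAS support, the -ngl flag offloads layers to the RTX A6000; a hedged sketch using the CMake flags from the February 2024 tree (the output path may differ by generator):

cmake -B build -DLLAMA_CUBLAS=ON
cmake --build build --config Release
./build/bin/Release/main.exe -m g:/models/gemma-7b.gguf -p "what portion of gold is made in exploding stars" -n 2000 -e --color -ngl 99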
I look forward to working with the ML developer community on these additional models.