Gemma 4 12B Brings Encoder-Free Multimodal Agentic AI to Local Devices
Google's Gemma 4 12B is built to run agentic, multimodal AI directly on everyday laptops using a novel encoder-free architecture. Rather than routing inputs through separate vision and audio encoders, the model projects raw pixel patches and audio waveform frames straight into its decoder-only transformer, which reduces latency and memory fragmentation. It supports on-device coding, visual reasoning, and tool use via Google AI Edge, LiteRT-LM, and llama.cpp, and is available through Hugging Face and Ollama. Early users praise its local performance and context handling, though some note that it is better suited to simpler tasks than to replacing larger coding models.
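To make the encoder-free idea concrete, here is a minimal, illustrative sketch of what "projecting raw patches and frames straight into the decoder" could look like. It is not Gemma's actual implementation: the function names, patch size, frame length, embedding width, and random weights are all hypothetical stand-ins, and a real model would use learned projection matrices and positional information.

```python
import random

def patchify(image, patch):
    """Split an H x W grid (list of rows) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches

def frame_audio(samples, frame_len):
    """Cut a 1-D waveform into non-overlapping frames, dropping any remainder."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def make_projection(in_dim, d_model, rng):
    """Random in_dim x d_model matrix standing in for learned weights (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_model)]
            for _ in range(in_dim)]

def project(vectors, weights):
    """Linear projection: multiply each input vector by the weight matrix."""
    return [[sum(x * w for x, w in zip(vec, col)) for col in zip(*weights)]
            for vec in vectors]

def build_sequence(image, samples, text_embeddings,
                   patch=2, frame_len=4, d_model=8, seed=0):
    """Project patches and frames to d_model and join them with text tokens.

    No separate vision or audio encoder runs here: each modality gets a
    single linear map into the same embedding space the decoder would consume.
    """
    rng = random.Random(seed)
    image_tokens = project(patchify(image, patch),
                           make_projection(patch * patch, d_model, rng))
    audio_tokens = project(frame_audio(samples, frame_len),
                           make_projection(frame_len, d_model, rng))
    return image_tokens + audio_tokens + text_embeddings

if __name__ == "__main__":
    image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # toy 4x4 image
    samples = [0.1] * 10                                              # toy waveform
    text = [[0.0] * 8 for _ in range(3)]                              # 3 toy text tokens
    seq = build_sequence(image, samples, text)
    print(len(seq), len(seq[0]))
```

In this toy setup the 4x4 image yields four 2x2 patches, the ten-sample waveform yields two four-sample frames, and the three text tokens are appended, so the decoder would see a single sequence of nine vectors of width eight. The point of the sketch is only that each modality needs one cheap projection rather than a separate encoder network.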