How LLMs Are Actually Being Used in Mobile Apps

AI · Mobile
5 min read

Let’s start here: people want smart apps. That’s about as settled as the fact that the Earth is round.

As of mid-2025, downloads and usage of mobile apps with integrated AI features keep growing. The hype might have cooled a bit, but the real use hasn’t.

Apps with AI in their metadata were downloaded over 17 billion times in 2024 (like, wow), and momentum hasn’t slowed in 2025, especially in education, health, finance and productivity categories. But don’t let the buzzwords fool you. Not every app with “AI” in its metadata is meaningfully powered by LLMs. The fact is: users are seeking out smarter apps, and companies are rushing to meet that demand.

So how do these apps actually run LLMs? Three main patterns show up.

1. Cloud-based APIs (most common):
Apps call hosted services like the OpenAI, Gemini, Claude or Mistral APIs. It’s fast, flexible and constantly improving, and it powers everything from ChatGPT’s 1+ billion daily queries to AI-backed email tools, document editors and customer support chat.
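Here’s the shape of that pattern in Kotlin. A minimal sketch, assuming an OpenAI-style chat-completions endpoint; the URL, model name and OkHttp wiring are illustrative, not any particular app’s code:

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONArray
import org.json.JSONObject

val client = OkHttpClient()

// Call a hosted LLM over HTTPS. Run this off the main thread on Android.
fun askCloudLlm(prompt: String, apiKey: String): String {
    val body = JSONObject()
        .put("model", "gpt-4o-mini") // illustrative; swap for Gemini, Claude, Mistral...
        .put("messages", JSONArray().put(
            JSONObject().put("role", "user").put("content", prompt)))
        .toString()
        .toRequestBody("application/json".toMediaType())

    val request = Request.Builder()
        .url("https://api.openai.com/v1/chat/completions")
        .header("Authorization", "Bearer $apiKey")
        .post(body)
        .build()

    client.newCall(request).execute().use { response ->
        val json = JSONObject(response.body!!.string())
        return json.getJSONArray("choices")
            .getJSONObject(0)
            .getJSONObject("message")
            .getString("content")
    }
}
```

The app ships no weights at all; the trade-off is that every request costs money and a network roundtrip.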

2. On-device models:
Smaller models like Gemma 2B, Llama 3 8B Instruct or Apple’s on-device Apple Intelligence models are optimized to run locally on mobile chips. Google’s AI Edge Gallery lets developers run models directly on Android phones, no roundtrip to the cloud required.
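What does the local path look like? A sketch using the MediaPipe LLM Inference task from Google AI Edge; the model path is a placeholder and option names may shift between releases, so treat this as a starting point, not gospel:

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

// Run a quantized model that has already been copied onto the device.
fun runLocalLlm(context: Context, prompt: String): String {
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma-2b-it-gpu-int4.bin") // placeholder path
        .setMaxTokens(512) // total budget for prompt + response
        .build()

    // Everything below happens on the phone's CPU/GPU — no network call.
    val llm = LlmInference.createFromOptions(context, options)
    val answer = llm.generateResponse(prompt)
    llm.close() // release native resources
    return answer
}
```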

3. Hybrid systems (cloud + edge):
This is where things get interesting. Lightweight tasks (autocomplete, summarization, basic classification) happen on-device. Heavier lifting (like RAG pipelines or reasoning) happens in the cloud. This setup gives you better battery life, lower latency and more privacy control without losing firepower.
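The router itself can be tiny. A sketch reusing the two hypothetical helpers above (runLocalLlm and askCloudLlm); the task split is an illustrative heuristic, and real apps also weigh battery, connectivity and privacy flags:

```kotlin
import android.content.Context

enum class Task { AUTOCOMPLETE, SUMMARIZE, CLASSIFY, RAG_QUERY, DEEP_REASONING }

fun route(task: Task, prompt: String, context: Context, apiKey: String): String =
    when (task) {
        // Lightweight: stay on-device for latency, battery and privacy.
        Task.AUTOCOMPLETE, Task.SUMMARIZE, Task.CLASSIFY ->
            runLocalLlm(context, prompt)
        // Heavy: retrieval pipelines and long-chain reasoning go to the cloud.
        Task.RAG_QUERY, Task.DEEP_REASONING ->
            askCloudLlm(prompt, apiKey)
    }
```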

Inside enterprises, the same patterns are showing up (a sketch of the RAG flow follows the list):

  • Instacart uses an internal LLM called Ava for dev workflows;
  • Grab automates reporting with RAG-powered assistants;
  • Royal Bank of Canada built a RAG system called Arcane to help staff access and interpret complex investment policies and procedures.
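Strip away the product names and “RAG-powered” boils down to a simple loop: embed the question, fetch the closest internal documents, answer grounded in them. A bare-bones sketch — embed and vectorSearch are hypothetical stubs standing in for a real embedding model and vector database:

```kotlin
fun answerWithRag(question: String, apiKey: String): String {
    val queryVector = embed(question)              // hypothetical embedding call
    val topDocs = vectorSearch(queryVector, k = 3) // hypothetical retrieval call
    val prompt = buildString {
        appendLine("Answer using only the context below.")
        topDocs.forEachIndexed { i, doc -> appendLine("[$i] $doc") }
        append("Question: $question")
    }
    return askCloudLlm(prompt, apiKey) // reuses the cloud sketch above
}

// Stubs so the sketch compiles; swap in real services.
fun embed(text: String): FloatArray = FloatArray(768)
fun vectorSearch(v: FloatArray, k: Int): List<String> = emptyList()
```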

Yes, LLMs are computational beasts. But compression and quantization are changing the game. A quantized Gemma 3 1B fits in under 600MB and pushes 2,585 prefill tokens/second on modern mobile GPUs. That’s practical.
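The arithmetic is easy to sanity-check: at 4 bits per weight, a 1-billion-parameter model is roughly 500MB before tokenizer, KV cache and runtime overhead, which is exactly why it squeezes under that 600MB line. A quick sketch of the back-of-envelope math:

```kotlin
// Approximate on-disk size of a model at a given quantization level.
fun approxModelSizeMb(params: Long, bitsPerWeight: Int): Double =
    params * bitsPerWeight / 8.0 / (1024 * 1024)

fun main() {
    println(approxModelSizeMb(1_000_000_000L, 16)) // fp16: ~1907 MB
    println(approxModelSizeMb(1_000_000_000L, 8))  // int8: ~954 MB
    println(approxModelSizeMb(1_000_000_000L, 4))  // int4: ~477 MB
}
```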

And on the cloud side, cost and latency are the new constraints. Devs are tweaking prompt caching, switching models dynamically and offloading tasks based on complexity just to stay within API budgets and still deliver fluid UX.
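Dynamic model switching, for one, can be as blunt as a threshold check. A sketch with illustrative model names and prices (not any provider’s actual rate card):

```kotlin
data class LlmTier(val model: String, val costPer1kTokens: Double)

// Illustrative tiers and pricing; check your provider's current rates.
val cheapTier = LlmTier("gpt-4o-mini", 0.00015)
val strongTier = LlmTier("gpt-4o", 0.0025)

fun pickTier(prompt: String, needsReasoning: Boolean): LlmTier =
    // Crude heuristic: long prompts or explicit reasoning needs get the big model.
    if (needsReasoning || prompt.length > 2_000) strongTier else cheapTier
```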

Maryia Puhachova
