Building Domus: A Multimodal Household Memory Agent With Gemini and Google Cloud

[Figure: Domus architecture flow]

AI assistants are good at answering questions, but they often struggle with something simpler: remembering everyday information over time. In a household setting, this challenge becomes even more complex because information often needs to be shared between multiple people.

Domus was built to explore a practical idea. What if a chat-based AI system could function as a shared household memory that captures tasks, reminders, appointments, and notes through text and images?

To test that idea, I built Domus as a multimodal AI agent using Google AI models and Google Cloud infrastructure. The system allows users to interact through a web chat interface and converts text messages and uploaded images into structured household records that can be retrieved and managed later.

Domus currently supports text chat and image inputs only. Voice input and audio output are not part of the current prototype.

This project and article were created as an entry for the Google Gemini Live Agent Challenge, which ran from February 16, 2026 through March 16, 2026.

Project Background

After discovering the event, my husband and I began building Domus on March 12, 2026 with the goal of creating a prototype that demonstrates how multimodal AI agents can manage shared household information.

The project focuses on combining conversational AI, structured memory, and multimodal inputs in a practical application. Rather than building a general-purpose assistant, the goal was to design a system that can reliably store and retrieve everyday household information through a simple chat interface.

The initial working prototype was built during the final days of the hackathon period, demonstrating how quickly modern AI tooling and cloud infrastructure can be combined to create functional multimodal agents.

What Domus Does

Domus is a chat-based AI household memory system that helps people capture and manage everyday information through text messages and images.

Users interact with the assistant through a chat interface to:

  • store tasks, reminders, appointments, and notes
  • convert screenshots or images into actionable reminders
  • retrieve the shared household memory list
  • update, complete, or delete existing entries

For example, a user might type "Remind me to buy milk tomorrow" or upload a screenshot of a message that contains a date or event. In either case, the assistant interprets the input and converts it into a structured memory record.

Each entry is stored using a consistent schema that includes:

  • text — plain language description
  • type — category of the memory entry
  • subject — who the entry is for
  • details — additional context
  • scheduled_for — date/time if applicable
  • status — active, completed, etc.

This structured approach allows the system to manage information reliably instead of relying on raw conversation history.
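
As a rough illustration, the schema above could be modeled as a small Python dataclass. This is a minimal sketch rather than the actual Domus code: the class name MemoryEntry, the method to_record, the optional fields, the "active" default, and the "task" category value are all assumptions made for the example.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class MemoryEntry:
    """One household memory record, using the field names listed above.

    Which fields are optional, and their defaults, are assumptions.
    """

    text: str                            # plain language description
    type: str                            # category, e.g. "task" or "note"
    subject: Optional[str] = None        # who the entry is for
    details: Optional[str] = None        # additional context
    scheduled_for: Optional[str] = None  # ISO 8601 date/time, if applicable
    status: str = "active"               # "active", "completed", etc.

    def to_record(self) -> dict:
        """Return a plain dict suitable for writing to a document store."""
        return asdict(self)


entry = MemoryEntry(text="Buy milk", type="task", scheduled_for="2026-03-17")
print(entry.to_record())
```

Keeping every field in one typed structure means a missing description or category fails at construction time instead of surfacing later as a malformed document.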

System Architecture

Domus is organized as a layered architecture that separates the user interface, backend services, and AI agent logic.

At a high level the system operates as follows:

User → Next.js frontend → FastAPI backend → Gemini agent → tool calls → Firestore storage

The frontend collects user input, the backend processes multimodal requests, and the Gemini agent determines which memory operations should be executed.

Frontend

The frontend is a Next.js application located in the monorepo under apps/web.

Its responsibilities include:

  • rendering the chat interface
  • handling text input
  • supporting image uploads and pasted screenshots
  • displaying assistant responses
  • rendering the household memory list
  • managing authentication state

The frontend communicates with the backend API using HTTP requests and sends chat messages as multipart form data so that text and images can be transmitted together.

Backend API

The backend is implemented as a FastAPI application located in apps/api.

Core endpoints include:

Endpoint     Method    Purpose
/chat        POST      Multimodal chat input
/memory      GET       Retrieve stored household memory
/memory      POST      Create memory entries
/memory      PUT       Update memory entries
/memory      DELETE    Delete memory entries
/briefing    GET       Summary of current household items

Agent Layer

The conversational agent is implemented using Google's Agent Development Kit (ADK). The agent runs through an ADK runner that manages agent execution, session handling, and tool invocation.

The underlying model used by the agent is Gemini 2.5 Flash.

Instead of allowing the model to interact directly with the database, the agent operates through a defined set of structured tools:

  • get_current_datetime
  • get_household_memory
  • add_household_memory
  • update_household_memory
  • delete_household_memory

When a user message arrives, the agent determines the user's intent and decides whether to call one of these tools. The tool executes the operation and returns a result, after which the agent generates a concise confirmation message.

This tool-driven architecture keeps the system predictable while still allowing flexible natural language interaction.

Multimodal Input Processing

Domus supports both text and image inputs.

Users can upload images or paste screenshots directly into the chat interface. These files are sent to the backend as multipart form data.

The backend converts uploaded images into Gemini-compatible inline data objects before sending them to the model. This allows Gemini to analyze visual content alongside the user's text.
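
A minimal sketch of what that conversion could look like, using only the Python standard library. The helper name to_inline_data_part is hypothetical, and the dictionary layout (a MIME type plus base64-encoded data under an inline_data key) follows the part shape used in Gemini API request bodies; the actual Domus backend may build this through an SDK helper instead.

```python
import base64
import mimetypes


def to_inline_data_part(filename: str, image_bytes: bytes) -> dict:
    """Wrap raw image bytes as a base64 inline-data part.

    The MIME type is guessed from the file name; non-image uploads are rejected.
    """
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"unsupported upload: {filename}")
    return {
        "inline_data": {
            "mime_type": mime_type,
            "data": base64.b64encode(image_bytes).decode("ascii"),
        }
    }


part = to_inline_data_part("screenshot.png", b"\x89PNG\r\n")
print(part["inline_data"]["mime_type"])
```

The resulting part would sit next to a text part in the same request, which is what lets the model read the screenshot and the user's message together.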

Examples of supported inputs include:

  • screenshots of text messages
  • appointment confirmations
  • shopping lists
  • notes or reminders captured from images

The model extracts relevant information from the image and converts it into structured memory entries when appropriate.

Structured Memory Model

Household memory is stored in Google Cloud Firestore, which acts as the persistent storage layer. Using structured documents rather than raw conversation history makes it easier to retrieve, update, and manage household information over time.

For example, a memory entry might look like:

{
  "text": "Museum tickets",
  "details": "8 adults (+ Em), 1 child (<2). Total cost: $119.60",
  "type": "note",
  "scheduled_for": "2026-03-16T18:00",
  "status": "active"
}

This structure allows Domus to track tasks, schedule reminders, and maintain a consistent household memory list.
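
To illustrate why structured fields help, here is a small sketch that picks out upcoming items from a list of entries shaped like the example above. The function name upcoming_entries and the filtering rules are assumptions for illustration, not the query logic Domus actually uses.

```python
from datetime import datetime
from typing import List


def upcoming_entries(entries: List[dict], now: datetime) -> List[dict]:
    """Return active entries scheduled at or after `now`, soonest first."""
    dated = []
    for entry in entries:
        # Skip completed items and items without a schedule.
        if entry.get("status") != "active" or not entry.get("scheduled_for"):
            continue
        when = datetime.fromisoformat(entry["scheduled_for"])
        if when >= now:
            dated.append((when, entry))
    dated.sort(key=lambda pair: pair[0])
    return [entry for _, entry in dated]


entries = [
    {"text": "Museum tickets", "type": "note",
     "scheduled_for": "2026-03-16T18:00", "status": "active"},
    {"text": "Buy milk", "type": "task",
     "scheduled_for": "2026-03-16T10:00", "status": "completed"},
    {"text": "Wifi password is on the fridge", "type": "note", "status": "active"},
]
for entry in upcoming_entries(entries, now=datetime(2026, 3, 16, 9, 0)):
    print(entry["scheduled_for"], entry["text"])
```

The same question asked of a raw chat transcript would require the model to re-read and re-interpret earlier messages each time.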

Session Management

The agent maintains conversational context using ADK's InMemorySessionService. Each session is identified by a user_id and session_id, which allows the system to preserve context between messages in a conversation.

For the hackathon prototype, session data is stored in memory rather than in a persistent store, so conversational context does not survive a backend restart. This is one of the areas earmarked for future improvement.
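
The idea of keying context by user_id and session_id can be shown with a toy store. This sketch is not ADK's InMemorySessionService and does not mirror its interface; the class InMemorySessions and its methods are hypothetical and exist only to show the trade-off of keeping sessions in process memory.

```python
from typing import Dict, List, Tuple


class InMemorySessions:
    """Toy session store keyed by (user_id, session_id).

    Everything lives in a dict, so it disappears when the process exits.
    """

    def __init__(self) -> None:
        self._sessions: Dict[Tuple[str, str], List[dict]] = {}

    def append(self, user_id: str, session_id: str, role: str, text: str) -> None:
        """Add one conversation turn to the matching session."""
        key = (user_id, session_id)
        self._sessions.setdefault(key, []).append({"role": role, "text": text})

    def history(self, user_id: str, session_id: str) -> List[dict]:
        """Return a copy of the turns recorded for this session."""
        return list(self._sessions.get((user_id, session_id), []))


sessions = InMemorySessions()
sessions.append("user-1", "session-a", "user", "Remind me to buy milk tomorrow")
sessions.append("user-1", "session-a", "assistant", "Saved a reminder to buy milk.")
print(len(sessions.history("user-1", "session-a")))

# A fresh instance models a restarted server: the earlier turns are gone.
print(InMemorySessions().history("user-1", "session-a"))
```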

Authentication

Authentication is implemented using Firebase Authentication. The frontend uses the Firebase Web SDK to manage authentication state. In the intended production setup, the backend verifies Firebase ID tokens before allowing access to protected memory routes.

During hackathon development, this verification step is temporarily bypassed so that internal agent tool calls to the memory API can run without authentication failures. This allows the prototype to function correctly while the full authentication flow is still being integrated.
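
One way to structure such a check so that the bypass is explicit is sketched below. The names authenticate, AuthError, and the bypass flag are hypothetical, and verify_token is an injected callable; in a real backend it might wrap the Firebase Admin SDK's verify_id_token function, which is replaced here by a stub so the example runs on its own.

```python
from typing import Callable, Mapping, Optional


class AuthError(Exception):
    """Raised when a request lacks a valid bearer token."""


def authenticate(headers: Mapping[str, str],
                 verify_token: Callable[[str], dict],
                 bypass: bool = False) -> Optional[dict]:
    """Return decoded token claims, or None when verification is bypassed."""
    if bypass:
        # Prototype mode: skip verification entirely.
        return None
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme.lower() != "bearer" or not token:
        raise AuthError("missing bearer token")
    try:
        return verify_token(token)
    except Exception as exc:
        raise AuthError("invalid token") from exc


def fake_verify(token: str) -> dict:
    """Stub verifier for the example; accepts a single hard-coded token."""
    if token != "good-token":
        raise ValueError("token rejected")
    return {"uid": "user-1"}


print(authenticate({"Authorization": "Bearer good-token"}, fake_verify))
print(authenticate({}, fake_verify, bypass=True))
```

Keeping the bypass as a single named flag makes it easy to find and remove once the agent's internal tool calls carry proper credentials.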

Example Request Flow

A typical reminder creation follows this sequence:

  1. A user sends a text message or image through the chat interface
  2. The frontend sends a POST /chat request to the backend
  3. The FastAPI service receives the request
  4. The Gemini agent interprets the user's intent
  5. The agent calls the add_household_memory tool
  6. The backend writes the memory entry to Firestore
  7. The agent generates a confirmation message
  8. The frontend displays the response to the user

This pattern allows chat-based interaction to trigger structured operations on the memory system.

Key Design Principles

Several design principles guided the architecture of Domus.

Structured memory over raw chat history. Important information is stored as structured documents rather than relying on conversation transcripts.

Tool-driven agent architecture. The model interacts with the system through explicit tools instead of direct database access.

Multimodal interaction. Text and images are treated as equal inputs for capturing information.

Backend as the source of truth. Firestore maintains the canonical record of household memory.

Current Limitations

The current prototype focuses on demonstrating the core architecture and has several limitations:

  • authentication checks are temporarily bypassed
  • session state is stored in memory rather than persistent storage
  • there is no background scheduling service yet
  • household memory is not yet scoped to individual users

These areas represent opportunities for further development.

Future Improvements

Possible next steps for the project include:

  • persistent session storage
  • user-scoped memory collections
  • real-time synchronization across household devices
  • notification scheduling
  • semantic search for memory retrieval

Conclusion

Domus explores how modern AI models can move beyond answering questions and instead help manage everyday information.

By combining Gemini models, tool-based agents, multimodal inputs, and Google Cloud infrastructure, it is possible to build systems that transform simple chat interactions into structured, actionable data. The Domus prototype demonstrates how conversational agents can manage shared information in practical settings such as household organization.

This project and this article were created for the purposes of entering the Google Gemini Live Agent Challenge.