On-Device and Cloud Agent Collaboration
WIP
Distributed agents and what they mean for foundation models
The next step in the AI paradigm is a push towards distributed, federated networks of AI agents. This is a response to a current bottleneck in AI adoption: the best models are controlled by a handful of companies, and those models offer weak data-control guarantees (that is, there is a chance sensitive data is absorbed into the training data of the foundation models). An obvious workaround is to create your own model, either by training one from scratch or by fine-tuning an open-source model locally (using Ollama, for example). The problem with this solution is that it requires an immense amount of resources to build, host and maintain such models, and many companies would not be willing to take on that proposition.
To compromise between performance and some semblance of data control, entities (companies or people) need their own local models, but to reach a certain threshold of performance (especially on reasoning tasks) these entities must also be able to communicate with the larger foundation models. This is naturally framed as a distributed-systems problem: we need communication protocols that can handle dynamic resource allocation to maximize the efficiency of the group of local LLMs, support hierarchical structures within that group, and tolerate Byzantine faults and network partitions between local LLMs. A small step in this direction has been taken with the development and release of Google's A2A protocol and the pre-print of Minions.
In the following I will explain the two protocols and other related explorations, and then discuss future directions for these communication protocols.
Minions paper
The Minions protocol explores reducing inference costs on data-intensive reasoning tasks through collaboration between on-device (on-prem) LMs and frontier LMs.
The protocol achieves a 30.4x reduction in inference costs while giving up, on average, only 9.4 accuracy points compared to a single frontier model. The key innovation is the communication protocol itself, called Minions. It targets reasoning tasks in environments where the context greatly exceeds the context window of the local or frontier models. In fact, it can answer queries without the frontier model ever seeing the full context of the problem. This is important not only because it avoids wasting memory and bandwidth, but because it can potentially act as a form of protection against data leakage. That is not a use case studied for the protocol, although studies have been conducted in similar environments [1], [2]. It is also important to note that all of the work is done on long-form document data; an interesting extension would be to test it on image/video data.
The paper starts with a naive implementation known as Minion, in which free-form natural-language chat between the local and frontier models is the communication protocol. This runs into problems because it is difficult for smaller models to follow multi-step instructions, and they get confused by large contexts (> 65k tokens).
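A minimal sketch of that naive loop is below, assuming a generic chat(model, messages) helper that wraps whatever inference API is in use; the function names, prompts, and message format are illustrative, not the paper's implementation.

# Sketch of the naive Minion loop: the local model holds the full context,
# while the frontier model only sees the running conversation.
def minion_chat(task, context, local_llm, frontier_llm, chat, max_rounds=8):
    remote_msgs = [{
        "role": "user",
        "content": f"Task: {task}\nAsk the local assistant questions to solve it. "
                   "Reply with 'FINAL: <answer>' once you are done.",
    }]
    local_msgs = [{"role": "system", "content": f"Use this context to answer questions:\n{context}"}]
    for _ in range(max_rounds):
        question = chat(frontier_llm, remote_msgs)      # frontier model decides the next instruction
        if question.startswith("FINAL:"):
            return question.removeprefix("FINAL:").strip()
        local_msgs.append({"role": "user", "content": question})
        answer = chat(local_llm, local_msgs)            # small model answers from the long context
        local_msgs.append({"role": "assistant", "content": answer})
        remote_msgs.append({"role": "assistant", "content": question})
        remote_msgs.append({"role": "user", "content": answer})
    return None  # give up once the round budget is exhausted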
These limitations give rise to a new protocol, called Minions, which loops over three steps:
- Job preparation on remote
- Job execution and filtering locally
- Job aggregation on remote
In many ways this resembles the MapReduce programming model found in distributed systems.
Job Preparation on Remote
The frontier model writes code that generates a list of job specifications for the local (on-device) models to run in parallel. Each job specification, \(t^{(i)}\), is an instruction-context pair \((q^{(i)}, c^{(i)})\), where \(q^{(i)}\) is the task query for the job and \(c^{(i)}\) is the partial context necessary to answer it. A question arises: how can the frontier LLM know what partial context to provide when it never holds the full context in memory?
It does so by first prompting the frontier model to decompose the task into a general plan; using this plan, the frontier model is then prompted to write a function that accepts the context, chunks it, and creates the job tasks.
Mapping the local LLMs onto these jobs plays the role of the Map function in MapReduce.
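A minimal sketch of this preparation step, assuming a hypothetical generate(model, prompt) inference helper and a plain character-based chunker (in the actual protocol the frontier model writes the chunking code itself):

from dataclasses import dataclass

@dataclass
class JobSpec:
    query: str    # q_i: the task query for this job
    context: str  # c_i: the partial context the job needs

def prepare_jobs(task, context, frontier_llm, generate, chunk_size=4000, tasks_per_round=3):
    """Sketch of job preparation on the remote (frontier) model.

    The decomposition plan and the chunker are stubbed in for illustration."""
    plan = generate(frontier_llm, f"Decompose this task into {tasks_per_round} sub-queries, one per line: {task}")
    sub_queries = [q.strip() for q in plan.splitlines() if q.strip()][:tasks_per_round]
    chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]
    # Every sub-query is paired with every chunk; each pair is one job t_i = (q_i, c_i).
    return [JobSpec(query=q, context=c) for q in sub_queries for c in chunks]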
Job Execution and Filtering Locally
These jobs are executed by the local LLMs in parallel, with each local LLM making a decision about its output: either abstain from sending anything back to the frontier model, or output a JSON object, \(z^{(i)}\), with the fields explanation, citation, and answer. This filtering reduces memory load and helps verify reasoning. The surviving outputs are then aggregated into a string \(w\).
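Continuing the sketch from the previous step (same hypothetical generate helper and JobSpec jobs), the local execution and filtering might look roughly like this, with abstention modeled as returning nothing:

import json
from concurrent.futures import ThreadPoolExecutor

def run_job(job, local_llm, generate):
    """One local worker: answer the job's query from its partial context, or abstain."""
    prompt = (f"Context:\n{job.context}\n\nQuestion: {job.query}\n"
              'Reply with JSON {"explanation": ..., "citation": ..., "answer": ...} '
              'or the single word ABSTAIN if the context is not relevant.')
    raw = generate(local_llm, prompt)
    if "ABSTAIN" in raw:
        return None                      # filtered out: nothing is sent back to the frontier model
    try:
        return json.loads(raw)           # z_i with explanation, citation, answer
    except json.JSONDecodeError:
        return None                      # malformed outputs are also dropped

def execute_jobs(jobs, local_llm, generate):
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(lambda j: run_job(j, local_llm, generate), jobs))
    kept = [z for z in outputs if z is not None]
    return "\n".join(json.dumps(z) for z in kept)   # aggregated string w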
Job Aggregation on Remote
The frontier model will receive \(w\) along with a synthesis prompt, \(p_{synthesize}\), and generate a JSON object with two fields: a decision field and a response (final answer) field. Based on this JSON object, the frontier model can restart the process or return a final answer to the original task query.
This is analogous to the Reduce function of a MapReduce program.
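A rough sketch of this synthesis step, using the same hypothetical generate helper; the exact prompt wording and JSON field values are illustrative, not the paper's:

import json

def synthesize(task, w, frontier_llm, generate):
    """Sketch of job aggregation on the remote (frontier) model."""
    p_synthesize = (f"Original task: {task}\nWorker outputs:\n{w}\n"
                    'Reply with JSON {"decision": "final" or "retry", "response": ...}.')
    result = json.loads(generate(frontier_llm, p_synthesize))
    if result["decision"] == "final":
        return result["response"]        # final answer to the original task query
    return None                          # signal the loop to prepare another round of jobs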
Scaling Parallel Workloads on-device
The paper notes that scaling the workloads on-device is positively correlated with performance: increasing the number of tasks per round, increasing the number of samples per local LLM for each task, and using smaller chunk sizes all improve performance. However, this comes with increased cost and memory requirements, which can wipe out any performance gain or even break the frontier model's ability to reach a decision.
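These knobs can be read as a small hyperparameter set for the prepare/execute/synthesize loop sketched above; the names and values below are illustrative rather than the paper's.

# Illustrative scaling knobs for the Minions loop (names are not from the paper).
minions_config = {
    "tasks_per_round": 5,    # more sub-queries per round tends to help accuracy
    "samples_per_task": 3,   # repeated local samples per job, at extra local compute
    "chunk_size": 2000,      # smaller chunks help, but multiply the number of jobs
    "max_rounds": 4,         # cap on prepare/execute/synthesize iterations
}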
A2A protocol
MCP is the emerging standard for connecting LLMs to tools, resources and data. However, under MCP, from the perspective of an individual AI model (or agent) every other agent is just a tool. This imbalance makes it difficult for agents to collaborate on tasks in a bidirectional manner. A2A addresses this: it allows bidirectional and frequent communication, while also giving an agent an understanding of the capabilities of the agents it interacts with. Similar to an MCP tool, each agent is mapped to its own server, which controls the API endpoint used to serve requests. The interaction patterns (for now) include polling, Server-Sent Events (SSE) (useful for real-time event-streaming tasks), and push notifications (for async tasks with long wait times).

Communication happens over JSON-RPC; however, there is no built-in way to validate the structure of a message before it reaches a target and the target reads it. Libraries such as legion-a2a handle validating the JSON used by the interaction patterns described above. Below are JSON-RPC validation models for polling or SSE, followed by the models for push notifications. For more complex multi-agent systems this can also be handled by using CrewAI with A2A; for example, combining legion-a2a with CrewAI would be redundant for a heavily asynchronous, notification-heavy communication pattern, since CrewAI defines its own notification event bus.
from typing import Any, Dict, Literal, Optional, Union

from pydantic import BaseModel, Field

# JSONRPCError and AuthenticationInfo are defined elsewhere in the A2A type definitions.

class JSONRPCMessage(BaseModel):
    jsonrpc: Literal['2.0'] = Field('2.0', title='Jsonrpc')
    id: Optional[Union[int, str]] = Field(None, title='Id')

class JSONRPCRequest(BaseModel):
    jsonrpc: Literal['2.0'] = Field('2.0', title='Jsonrpc')
    id: Optional[Union[int, str]] = Field(None, title='Id')
    method: str = Field(..., title='Method')
    params: Optional[Dict[str, Any]] = Field(None, title='Params')

class JSONRPCResponse(BaseModel):
    jsonrpc: Literal['2.0'] = Field('2.0', title='Jsonrpc')
    id: Optional[Union[int, str]] = Field(None, title='Id')
    result: Optional[Dict[str, Any]] = Field(None, title='Result')
    error: Optional[JSONRPCError] = None

class PushNotificationConfig(BaseModel):
    url: str = Field(..., title='Url')
    token: Optional[str] = Field(None, title='Token')
    authentication: Optional[AuthenticationInfo] = None

class TaskPushNotificationConfig(BaseModel):
    id: str = Field(..., title='Id')
    pushNotificationConfig: PushNotificationConfig
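As a quick illustration of how these models can be used to validate an incoming payload before an agent acts on it (plain Pydantic usage of the models above, not legion-a2a's actual API; the payload shape is illustrative):

from pydantic import ValidationError

# Validate an incoming JSON-RPC request before acting on it.
incoming = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tasks/send",
    "params": {"message": {"role": "user", "parts": [{"type": "text", "text": "hello"}]}},
}
try:
    request = JSONRPCRequest(**incoming)   # raises ValidationError on a malformed payload
    print(request.method, request.id)
except ValidationError as exc:
    print("rejected malformed JSON-RPC message:", exc)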
The most important aspect of A2A is that it defines an "agent context" that captures the capabilities of an agent: this is the Agent Card. Below is the class definition of the Agent Card used in A2A, followed by an example of how one is defined.
# Agent Card definition
# (AgentProvider, AgentCapabilities, AgentAuthentication and AgentSkill are other A2A types.)
from pydantic import BaseModel

class AgentCard(BaseModel):
    name: str
    description: str | None = None
    url: str
    provider: AgentProvider | None = None
    version: str
    documentationUrl: str | None = None
    capabilities: AgentCapabilities
    authentication: AgentAuthentication | None = None
    defaultInputModes: list[str] = ['text']
    defaultOutputModes: list[str] = ['text']
    skills: list[AgentSkill]
# Agent Card example (from the A2A Python tutorial; host, port and
# YoutubeMCPAgent come from the surrounding server code)
capabilities = AgentCapabilities(streaming=True)
skills = [
    AgentSkill(
        id='download_closed_captions',
        name='Download YouTube Closed Captions',
        description='Retrieve closed captions/transcripts from YouTube videos',
        tags=['youtube', 'captions', 'transcription', 'video'],
        examples=[
            'Extract the transcript from this YouTube video: https://www.youtube.com/watch?v=dQw4w9WgXcQ',
            'Download the captions for this YouTube tutorial',
        ],
    )
]
agent_card = AgentCard(
    name='YouTube Captions Agent',
    description=(
        'AI agent that can extract closed captions and transcripts from YouTube videos. '
        'This agent provides raw transcription data that can be used for further processing.'
    ),
    url=f'http://{host}:{port}/',
    version='1.0.0',
    defaultInputModes=YoutubeMCPAgent.SUPPORTED_CONTENT_TYPES,
    defaultOutputModes=YoutubeMCPAgent.SUPPORTED_CONTENT_TYPES,
    capabilities=capabilities,
    skills=skills,
)
The question, then, is what is the best way to define the skills and capabilities on an Agent Card so as to maximize the usefulness of each agent? The A2A protocol answers this purely qualitatively and in general terms. I do not believe this is the optimal path, but it works well enough as a first step. This is an important problem, and one that is not really dealt with (though I may be mistaken) in the field of distributed systems, where the nodes are usually assumed to be servers with identical specifications: systems do not usually communicate their specifications to each other, because they are engineered with those specifications in mind (at least as a lower bound). This is not the case in multi-agent systems, where we can have agents with redundant specifications that nevertheless perform very differently in certain environments. For example, given two agents whose job is to transcribe video into text, one may have low precision when dealing with filler words, while the other handles filler words comparatively well. This situation is currently unaccounted for. Future iterations of a communication protocol need to decide whether such metrics should be included, and how. It is also worth discussing whether this should look like a report card that exposes results at pre-specified levels of granularity, that is, the server responds with more detailed capability descriptions based on a flag or on the number of requests for that Agent Card from a specific agent (a rough sketch of this idea follows the links below).
https://google.github.io/A2A/tutorials/python/9-ollama-agent/#integrating-ollama-into-our-a2a-server
https://google.github.io/A2A/tutorials/python/4-agent-skills/#test-run
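One way to make the report-card idea concrete, purely as a sketch and not as part of the A2A spec, would be to attach quantitative evaluation results to each skill; all class and field names below are hypothetical.

from pydantic import BaseModel

# Hypothetical extension of an A2A-style skill with benchmark metrics (not part of the spec).
class SkillMetrics(BaseModel):
    benchmark: str          # evaluation suite the numbers come from
    word_error_rate: float  # task-specific metric for a transcription skill
    sample_size: int

class GradedAgentSkill(BaseModel):
    id: str
    name: str
    description: str
    metrics: list[SkillMetrics] = []   # omitted at coarse granularity, returned on request

skill = GradedAgentSkill(
    id='transcribe_video',
    name='Video Transcription',
    description='Transcribes video audio tracks into text',
    metrics=[SkillMetrics(benchmark='internal-filler-word-set', word_error_rate=0.12, sample_size=500)],
)

Under this scheme a coarse request would return only id, name and description, while a flagged (or frequently repeating) requester would also receive the metrics list.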
Related Explorations
As explained above, the Minions protocol resembles a MapReduce program, but it is not the only paper with that flavour. The LLMxMapReduce paper handles long-form documents with a MapReduce program whose Map and Reduce functions are themselves defined through LLMs. This is a common use case for a system that needs to be hosted on-prem.
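In rough terms, such an LLM-defined MapReduce over a long document might look like the sketch below; this is a simplification rather than the paper's algorithm, and generate(model, prompt) is again a hypothetical inference helper.

from functools import reduce

def llm_map_reduce(question, document, llm, generate, chunk_size=4000):
    """Map: answer the question per chunk. Reduce: merge partial answers pairwise."""
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    partials = [generate(llm, f"From this excerpt, answer '{question}':\n{c}") for c in chunks]
    merge = lambda a, b: generate(llm, f"Merge these two partial answers to '{question}':\n{a}\n---\n{b}")
    return reduce(merge, partials)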
https://github.com/huggingface/smolagents
LLMxMapreduce (https://arxiv.org/abs/2410.09342)
https://proceedings.neurips.cc/paper_files/paper/2024/hash/ee71a4b14ec26710b39ee6be113d7750-Abstract-Conference.html
https://github.com/lmnr-ai/index
https://arxiv.org/abs/2502.15920
https://arxiv.org/abs/2409.18014
https://proceedings.neurips.cc/paper_files/paper/2024/file/df2d62b96a4003203450cf89cd338bb7-Paper-Conference.pdf
https://arxiv.org/pdf/2501.12485
https://ojs.aaai.org/index.php/AAAI/article/view/32957
https://arxiv.org/pdf/2502.04506
Future Protocols
There are many directions that can be explored in the future. The obvious one is to explore how co-training the local LLMs together impacts their ability to get job tasks done. This could be contrasted against cost, especially if tasks can be made more expansive, allowing quorums of local LLMs to solve a few jobs each so that fewer tokens are exchanged between the local LLMs and the frontier model. This could be done by adapting a co-training framework like [MaPoRL](https://arxiv.org/pdf/2502.18439).
Other directions would be to use role-specific LLMs or specialized neural networks as local models and to communicate this specialization to the frontier model (using A2A). Communicating between local LLMs at the logit or weight level could also improve their ability to perform tasks. Exploring how these protocols handle [data leakages](https://arxiv.org/abs/2410.17127) would be helpful as well.
Finally, protocols that address classical distributed-systems problems can be explored within the combined protocol. As discussed in the introduction, network partitions can arise in which specific local LLMs fail or their messages are lost; in that case jobs may not finish and job tasks would need to be restarted. These are problems that a simple, naive integration of the Minions and A2A protocols cannot handle. Secondly, hierarchical structures within the group of local LLMs could allow job tasks to be more complex, reducing costs because the frontier model would be used less often. Also, since on-prem resources can be highly constrained, the dynamic allocation of resources across the local LLMs needs to be scheduled in a way that maximizes utilization of what is available, taking the abilities of the individual LLMs into account. This could be treated as a multi-armed bandit problem, for example, as in the toy sketch below.
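As a toy illustration of that bandit framing (a sketch only, with a made-up reward signal such as "answer accepted by the frontier model per unit cost"): each local LLM is an arm, and the scheduler routes the next job to the arm with the best observed utility while occasionally exploring.

import random
from collections import defaultdict

# Toy epsilon-greedy scheduler: each local LLM is an arm, rewards are hypothetical utility scores.
class BanditScheduler:
    def __init__(self, local_llms, epsilon=0.1):
        self.llms = local_llms
        self.epsilon = epsilon
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def pick(self):
        if random.random() < self.epsilon:
            return random.choice(self.llms)   # explore an arbitrary local LLM
        # Exploit the best average reward so far; untried LLMs are prioritized.
        return max(self.llms, key=lambda m: self.totals[m] / self.counts[m] if self.counts[m] else float("inf"))

    def update(self, llm, reward):
        self.totals[llm] += reward
        self.counts[llm] += 1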