On-Device and Cloud Agent Collaboration
WIP
Distributed Agents and What They Mean for Foundation Models
The next step in the AI paradigm is a push towards distributed, federated networks of AI agents. This push is a response to a current bottleneck in AI adoption: the best models are controlled by a handful of companies, and these models offer weak guarantees of data control (that is, there is a chance that sensitive data is absorbed into the training data of those foundation models). An obvious workaround is to build your own model, either by training one from scratch or by fine-tuning an open-source model and running it locally (using Ollama, for example). The problem with this solution is that building, hosting and maintaining such models requires an immense amount of resources, and many companies would not be willing to take that on.
To strike a compromise between performance and some semblance of data control, entities (companies or people) need their own local models; but to reach a certain threshold of performance, especially on reasoning tasks, those local models must be able to communicate with the larger foundation models. This can be framed as a distributed systems problem, which calls for one or more communication protocols that can handle the usual distributed systems concerns: dynamic resource allocation to maximize the efficiency of the group of local LLMs, hierarchical structures within that group, Byzantine faults, and network partitions between local LLMs. A small step in this direction has been taken with the release of Google’s A2A protocol and the pre-print of Minions.
In what follows I explain the two protocols, survey related explorations and papers, and then discuss future directions for these communication protocols.
Minions paper
The Minions protocol explores reducing inference costs on data-intensive tasks through collaboration between on-device (on-prem) LMs and frontier LMs.
The paper reports a 30.4x reduction in inference costs with an average loss of only 9.4 accuracy points compared to a single frontier model. The key innovation is a communication protocol called Minions, whose main strength is its ability to perform reasoning tasks where the context greatly exceeds the context window of a single local or frontier model. In fact, it is able to answer queries without the frontier model ever seeing the full context of the problem. This is important because it not only avoids wasting memory and bandwidth, but could also offer some protection against data leakage. The paper does not study this as a use case for the protocol, although studies have been conducted in similar environments [1], [2]. Unfortunately, the protocol was evaluated only on long-form document data, which is quite low in information density. A possible extension would be to apply the protocol to image/video data or even multi-modal data.
The paper starts with a naive implementation known as Minion, in which the local and frontier models communicate in free-form natural language. This runs into problems because smaller models struggle to follow multi-step instructions and get confused by large contexts (> 65k tokens).
This gives rise to a new protocol, Minions, which loops over three steps:
- Job preparation on remote
- Job execution and filtering locally
- Job aggregation on remote
In many ways this resembles the MapReduce programming model from distributed systems.
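The three-step loop can be sketched as follows. This is a minimal illustration and not the paper's implementation: `prepare_jobs`, `run_local` and `synthesize` are hypothetical callables standing in for the frontier and local models, and `max_rounds` is an assumed safety cap.

```python
from typing import Callable, List, Optional, Tuple

Job = Tuple[str, str]  # (instruction, context chunk)


def minions_loop(
    query: str,
    context: str,
    prepare_jobs: Callable[[str, str, str], List[Job]],
    run_local: Callable[[Job], Optional[str]],
    synthesize: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> str:
    """Repeat prepare -> execute/filter -> aggregate until an answer is final."""
    feedback = ""  # notes carried over from the previous round
    answer = ""
    for _ in range(max_rounds):
        # 1. Job preparation: build jobs from the query and earlier feedback.
        jobs = prepare_jobs(query, context, feedback)
        # 2. Local execution and filtering: None means the worker abstained.
        outputs = [out for out in map(run_local, jobs) if out is not None]
        # 3. Aggregation: decide whether the answer is final.
        done, answer = synthesize(query, "\n".join(outputs))
        if done:
            return answer
        feedback = answer
    return answer  # best effort once max_rounds is exhausted


# Toy stand-ins for the models (assumptions, for illustration only).
def prepare(query: str, context: str, feedback: str) -> List[Job]:
    return [(query, line) for line in context.splitlines()]


def local(job: Job) -> Optional[str]:
    instruction, chunk = job
    return chunk if "revenue" in chunk else None


def synth(query: str, w: str) -> Tuple[bool, str]:
    return True, w


result = minions_loop("What was revenue?", "costs: 5\nrevenue: 12", prepare, local, synth)
print(result)  # revenue: 12
```

Passing the three steps in as callables keeps the control flow separate from any particular model API, which is the part the sections below fill in.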
Job Preparation on Remote
The frontier model writes code that generates a list of job specifications for the local (on-device) models to run in parallel. Each job specification \(\mathbf{t}^{(i)}\) is an instruction-context pair \((q^{(i)}, c^{(i)})\), where \(q^{(i)}\) is the task query for the job and \(c^{(i)}\) is the partial context needed to answer it. A question arises: how can the frontier LLM know what partial context to provide when it never holds the full context in its memory?
It does so in two stages. The frontier model is first asked to decompose the task into a general plan; then, using this plan, it is prompted to write a function that accepts the context, chunks it, and creates the job tasks. Because that function runs on-device, the full context never has to be sent to the frontier model.
Assigning the local LLMs to these jobs plays the role of the map function in MapReduce.
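As an illustration, the kind of function the frontier model might write could look like the sketch below. The names `JobSpec`, `chunk_text` and `prepare_jobs`, the character-based chunking, and the choice to pair every instruction with every chunk are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JobSpec:
    instruction: str  # q: what the local model should do
    context: str      # c: the chunk it should do it on


def chunk_text(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def prepare_jobs(context: str, instructions: List[str], chunk_size: int = 1000) -> List[JobSpec]:
    """Pair every instruction with every chunk of the context."""
    return [
        JobSpec(instruction=q, context=c)
        for q in instructions
        for c in chunk_text(context, chunk_size)
    ]


jobs = prepare_jobs("a" * 2500, ["Find the fiscal year.", "Find total revenue."])
print(len(jobs))  # 2 instructions x 3 chunks = 6 jobs
```

The function only needs the context at call time, so it can be written remotely and executed where the data lives.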
Job Execution and Filtering Locally
These jobs are then executed by the local LLMs in parallel. For each job, the local LLM decides either to abstain, returning nothing to the frontier model, or to output a JSON object \(z^{(i)}\) with the fields explanation, citation, and answer. This is done to reduce memory load and to help verify the reasoning. The outputs that remain are then aggregated into a string \(w\).
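A minimal sketch of this step is shown below. It runs the jobs sequentially for simplicity (the paper runs them in parallel), and `run_and_filter` and `toy_model` are hypothetical names; `toy_model` is only a stand-in for a local LM.

```python
import json
from typing import Callable, List, Optional, Tuple

Job = Tuple[str, str]  # (instruction, context chunk)


def run_and_filter(jobs: List[Job], local_model: Callable[[str, str], Optional[dict]]) -> str:
    """Run each job locally, drop abstentions, and join the rest into w."""
    kept = []
    for instruction, chunk in jobs:
        out = local_model(instruction, chunk)
        # Abstain: the worker returned nothing or found no answer.
        if out is None or out.get("answer") is None:
            continue
        # Keep only the three expected fields, one JSON object per line.
        kept.append(json.dumps(
            {key: out.get(key) for key in ("explanation", "citation", "answer")}
        ))
    return "\n".join(kept)


def toy_model(instruction: str, chunk: str) -> Optional[dict]:
    """Stand-in for a local LM: it only 'answers' if the chunk mentions revenue."""
    if "revenue" not in chunk:
        return None
    return {
        "explanation": "chunk mentions revenue",
        "citation": chunk,
        "answer": chunk.split(":")[1].strip(),
    }


w = run_and_filter(
    [("Find revenue", "costs: 5"), ("Find revenue", "revenue: 12")],
    toy_model,
)
print(w)
```

Only the second job survives the filter, so the frontier model receives a single short JSON line instead of both chunks.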
Job Aggregation on Remote
The frontier model receives \(w\) along with a synthesis prompt, \(p_{\text{synthesize}}\), and generates a JSON object with two fields: a decision field and a response (final answer) field. Based on this JSON object, the process is either restarted for another round or a final answer to the original task query is returned.
This is analogous to the Reduce function of a MapReduce program.
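Handling the output of the aggregation step could look like the sketch below. The decision values `provide_final_answer` and `request_additional_info` are assumed for illustration, as is the choice to treat malformed output as a request for another round.

```python
import json
from typing import Optional, Tuple


def parse_synthesis(raw: str) -> Tuple[bool, Optional[str]]:
    """Return (is_final, response) from the frontier model's JSON output."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False, None  # malformed output: ask for another round
    if not isinstance(obj, dict):
        return False, None
    is_final = obj.get("decision") == "provide_final_answer"
    return is_final, obj.get("response")


print(parse_synthesis('{"decision": "provide_final_answer", "response": "12"}'))
# (True, '12')
print(parse_synthesis('{"decision": "request_additional_info", "response": "check Q3"}'))
# (False, 'check Q3')
```

When the result is not final, the response can be fed back into the next round of job preparation.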
Scaling Parallel Workloads on-device
The paper notes that scaling the workload on-device is positively correlated with performance: increasing the number of tasks per round, increasing the number of samples per local LLM for each task, and decreasing the chunk size all improve performance. However, this comes with increased cost and memory requirements, which can offset any performance gain or even overwhelm the frontier model's ability to make a decision.
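To see how quickly the three knobs compound, consider the rough count below. It assumes every task is run on every chunk, which is a simplification made for this example, and all the numbers are illustrative rather than taken from the paper.

```python
import math


def jobs_per_round(context_tokens: int, chunk_tokens: int, tasks: int, samples: int) -> int:
    """Local jobs in one round: every task runs on every chunk, `samples` times."""
    chunks = math.ceil(context_tokens / chunk_tokens)
    return tasks * chunks * samples


def worst_case_remote_tokens(n_jobs: int, tokens_per_output: int) -> int:
    """Upper bound on tokens sent to the frontier model if no worker abstains."""
    return n_jobs * tokens_per_output


baseline = jobs_per_round(100_000, 5_000, tasks=2, samples=1)
scaled = jobs_per_round(100_000, 2_500, tasks=4, samples=2)
print(baseline, scaled)                       # 40 320
print(worst_case_remote_tokens(scaled, 100))  # 32000
```

Doubling each knob multiplies the number of local jobs by eight, and the amount of text the frontier model may have to aggregate grows with it.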
A2A protocol
MCP is the emerging standard for connecting LLMs to tools, resources and data. However, from the perspective of an individual AI model (or agent), every other agent is just a tool. This imbalance makes it difficult for agents to collaborate on tasks in a bidirectional manner. A2A addresses this: it allows frequent, bidirectional communication, while also giving an agent an understanding of the capabilities of the agents it is interacting with. Similar to an MCP tool, each agent is mapped to its own server, which controls the API endpoint used to serve requests. The interaction patterns (for now) include polling, Server-Sent Events (SSE) (useful for real-time event streaming tasks), and push notifications (for async tasks that involve long wait times). Communication is done through JSON-RPC; however, there is no built-in way to validate the structure of a message before it reaches its target and the target reads it. Libraries such as legion-a2a deal with validating the JSON exchanged in the interaction patterns described above. For more complex multi-agent systems, this can be handled by using CrewAI with A2A. Note, however, that using legion-a2a with CrewAI would be redundant if you wanted a heavily asynchronous, notification-heavy communication pattern, as CrewAI defines its own notification event bus. Below you can see JSON-RPC validation models for polling or SSE, followed by those for push notifications.
from typing import Any, Dict, List, Literal, Optional, Union
from pydantic import BaseModel, Field

# Stand-in so the snippet runs on its own; it follows the JSON-RPC 2.0 error object.
class JSONRPCError(BaseModel):
    code: int
    message: str
    data: Optional[Any] = None

# Stand-in so the snippet runs on its own; the field names are an assumption.
class AuthenticationInfo(BaseModel):
    schemes: List[str]
    credentials: Optional[str] = None

class JSONRPCMessage(BaseModel):
    jsonrpc: Literal['2.0'] = Field('2.0', title='Jsonrpc')
    id: Optional[Union[int, str]] = Field(None, title='Id')

class JSONRPCRequest(BaseModel):
    jsonrpc: Literal['2.0'] = Field('2.0', title='Jsonrpc')
    id: Optional[Union[int, str]] = Field(None, title='Id')
    method: str = Field(..., title='Method')
    params: Optional[Dict[str, Any]] = Field(None, title='Params')

class JSONRPCResponse(BaseModel):
    jsonrpc: Literal['2.0'] = Field('2.0', title='Jsonrpc')
    id: Optional[Union[int, str]] = Field(None, title='Id')
    result: Optional[Dict[str, Any]] = Field(None, title='Result')
    error: Optional[JSONRPCError] = None

class PushNotificationConfig(BaseModel):
    url: str = Field(..., title='Url')
    token: Optional[str] = Field(None, title='Token')
    authentication: Optional[AuthenticationInfo] = None

class TaskPushNotificationConfig(BaseModel):
    id: str = Field(..., title='Id')
    pushNotificationConfig: PushNotificationConfig
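To make the idea of validating a message before the target acts on it concrete, here is a dependency-free sketch that checks the same fields as the JSONRPCRequest model. The helper names `validate_request` and `is_valid` are hypothetical, and the method name `tasks/send` is used only as an example value.

```python
import json
from typing import Any, Dict


def validate_request(raw: str) -> Dict[str, Any]:
    """Parse a JSON-RPC 2.0 request and raise ValueError if it is malformed."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(msg, dict):
        raise ValueError("request must be a JSON object")
    if msg.get("jsonrpc") != "2.0":
        raise ValueError("jsonrpc must be '2.0'")
    if not isinstance(msg.get("method"), str):
        raise ValueError("method must be a string")
    msg_id = msg.get("id")  # absent and null are both accepted
    if isinstance(msg_id, bool) or not isinstance(msg_id, (int, str, type(None))):
        raise ValueError("id must be an int, a string, or null")
    if not isinstance(msg.get("params"), (dict, type(None))):
        raise ValueError("params must be an object")
    return msg


def is_valid(raw: str) -> bool:
    """Convenience wrapper that reports validity instead of raising."""
    try:
        validate_request(raw)
        return True
    except ValueError:
        return False


ok = validate_request(
    '{"jsonrpc": "2.0", "id": 1, "method": "tasks/send", "params": {"id": "task-1"}}'
)
print(ok["method"])                                         # tasks/send
print(is_valid('{"jsonrpc": "1.0", "method": "tasks/send"}'))  # False
```

Rejecting a malformed message at the boundary means the receiving agent never has to interpret a payload it cannot trust.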
The most important aspect of A2A is that it defines an “agent context” that is useful for understanding the capabilities of agents (this is the Agent Card in A2A). Below is a class definition of the Agent Card used in A2A and an example of how it is defined.
# Agent Card definition (the `X | None` syntax requires Python 3.10+).
# AgentProvider, AgentCapabilities, AgentAuthentication and AgentSkill are
# the corresponding A2A models (not shown here).
from pydantic import BaseModel

class AgentCard(BaseModel):
    name: str
    description: str | None = None
    url: str
    provider: AgentProvider | None = None
    version: str
    documentationUrl: str | None = None
    capabilities: AgentCapabilities
    authentication: AgentAuthentication | None = None
    defaultInputModes: list[str] = ['text']
    defaultOutputModes: list[str] = ['text']
    skills: list[AgentSkill]

# Agent Card example.
# host, port and YoutubeMCPAgent come from the surrounding application (not shown here).
capabilities = AgentCapabilities(streaming=True)
skills = [
    AgentSkill(
        id='download_closed_captions',
        name='Download YouTube Closed Captions',
        description='Retrieve closed captions/transcripts from YouTube videos',
        tags=['youtube', 'captions', 'transcription', 'video'],
        examples=[
            'Extract the transcript from this YouTube video: https://www.youtube.com/watch?v=dQw4w9WgXcQ',
            'Download the captions for this YouTube tutorial',
        ],
    )
]
agent_card = AgentCard(
    name='YouTube Captions Agent',
    description='AI agent that can extract closed captions and transcripts from YouTube videos. This agent provides raw transcription data that can be used for further processing.',
    url=f'http://{host}:{port}/',
    version='1.0.0',
    defaultInputModes=YoutubeMCPAgent.SUPPORTED_CONTENT_TYPES,
    defaultOutputModes=YoutubeMCPAgent.SUPPORTED_CONTENT_TYPES,
    capabilities=capabilities,
    skills=skills,
)
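Once several Agent Cards are available, a client could use the skills and their tags to decide which agent to contact. The sketch below uses plain dictionaries shaped like the card above rather than the A2A classes; `find_agents` and the second card are hypothetical and exist only for illustration.

```python
from typing import Dict, List


def find_agents(cards: List[Dict], required_tags: List[str]) -> List[str]:
    """Return names of agents with at least one skill carrying all required tags."""
    wanted = set(required_tags)
    return [
        card["name"]
        for card in cards
        if any(wanted <= set(skill.get("tags", [])) for skill in card.get("skills", []))
    ]


cards = [
    {
        "name": "YouTube Captions Agent",
        "skills": [
            {
                "id": "download_closed_captions",
                "tags": ["youtube", "captions", "transcription", "video"],
            }
        ],
    },
    {
        "name": "Summarizer Agent",  # hypothetical second agent
        "skills": [{"id": "summarize_text", "tags": ["summary", "text"]}],
    },
]

print(find_agents(cards, ["youtube", "captions"]))  # ['YouTube Captions Agent']
```

Tag matching of this kind is only as good as the tags themselves, which is why the way skills are described on the card matters.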
What is the best way to define the skills and capabilities of an agent? For this protocol (and especially for future protocols), how can we define those skills in a systematic way on an Agent Card? While A2A’s approach may not be optimal, it serves as a viable first step. This is an important problem, considering the jagged nature of models today and our ability to customize them (through both context engineering and fine-tuning). It is also a gap that traditional distributed systems largely bypass. Distributed systems are typically engineered around a known lower bound of hardware capability, on the assumption that performance metrics remain relatively static and predictable across nodes; consequently, nodes rarely need to communicate or negotiate their underlying specifications dynamically. Multi-agent systems (MAS) break this assumption: agents may have identical computational specifications yet exhibit vastly different performance profiles depending on their environmental context, localized data, or emergent behaviors. As a result, the distributed paradigms we have used previously may need to be modified to measure and manage this form of contextual capability variance. For example, given two agents whose job is to transcribe video into text, one may be prone to low precision when dealing with filler words, while the other handles them comparatively well. This type of situation is currently unaccounted for.
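One way to picture contextual capability variance is to attach per-condition metrics to each agent and route on them. The sketch below is purely illustrative: the agent names, the conditions, and the precision values are invented for this example and are not measurements, and nothing like `measured_precision` exists in A2A today.

```python
from typing import Dict, Optional

# Invented per-condition precision values for two transcription agents.
measured_precision: Dict[str, Dict[str, float]] = {
    "transcriber_a": {"clean_speech": 0.97, "filler_words": 0.71},
    "transcriber_b": {"clean_speech": 0.95, "filler_words": 0.90},
}


def best_agent(condition: str, metrics: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Pick the agent with the highest recorded precision for a condition."""
    candidates = {name: m[condition] for name, m in metrics.items() if condition in m}
    if not candidates:
        return None  # no agent has been measured under this condition
    return max(candidates, key=candidates.get)


print(best_agent("filler_words", measured_precision))  # transcriber_b
print(best_agent("clean_speech", measured_precision))  # transcriber_a
```

The two agents would look identical if described only by their skill name, yet the better choice flips depending on the input.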
Will the format be independent of context? Is there a limit to the granularity of the capability metrics and descriptions? Future iterations of Minions need to decide how these questions will be answered.
https://google.github.io/A2A/tutorials/python/9-ollama-agent/#integrating-ollama-into-our-a2a-server
https://google.github.io/A2A/tutorials/python/4-agent-skills/#test-run
Related Explorations
As explained above, the Minions protocol resembles a MapReduce program. However, it is not the only paper with that flavour. The LLMxMapReduce paper handles long-form documents with a MapReduce program whose Map and Reduce functions are implemented by LLMs. This is a typical use case for a system that would need to be hosted on-prem.
https://github.com/huggingface/smolagents
LLMxMapReduce (https://arxiv.org/abs/2410.09342)
https://proceedings.neurips.cc/paper_files/paper/2024/hash/ee71a4b14ec26710b39ee6be113d7750-Abstract-Conference.html
https://github.com/lmnr-ai/index
https://arxiv.org/abs/2502.15920
https://arxiv.org/abs/2409.18014
https://proceedings.neurips.cc/paper_files/paper/2024/file/df2d62b96a4003203450cf89cd338bb7-Paper-Conference.pdf
https://arxiv.org/pdf/2501.12485
https://ojs.aaai.org/index.php/AAAI/article/view/32957
https://arxiv.org/pdf/2502.04506
Future Protocols
There are many directions that can be explored in the future. The obvious one is to explore how co-training the local LLMs together would affect their ability to get job tasks done. This could be contrasted against cost, especially as tasks could become more expansive, allowing quorums of LLMs to solve a few jobs among themselves, which would reduce the tokens exchanged between the local LLMs and the frontier model. This could be done by adapting a co-training framework like [MaPoRL](https://arxiv.org/pdf/2502.18439).
Another direction would be to use role-specific LLMs or specialized neural networks as local models, and to advertise this to the frontier models (using A2A to communicate it). Communicating at the logit or weight level between the local LLMs could potentially improve their ability to perform tasks. Exploring how these protocols deal with [data leakages](https://arxiv.org/abs/2410.17127) would also be helpful.
Finally, protocols that help deal with classical distributed systems problems can be explored within the combined protocol. As discussed in the introduction, problems like network partitions arise, where specific LLMs could fail and their messages could be lost. In this case jobs may not finish and job tasks would need to be restarted. These are problems that a naive integration of the Minions and A2A protocols could not handle. Secondly, the group of local LLMs could be organized hierarchically, allowing job tasks to be more complex and reducing costs, as the frontier model would need to be used less. Also, since on-prem resources could be highly constrained, there needs to be an exploration of how to dynamically allocate resources to the local LLMs and schedule work so as to maximize utilization of the available resources. This could be treated as a multi-armed bandit problem, for example, in which case the abilities of the individual LLMs would need to be taken into account.
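As one possible reading of the bandit framing, the sketch below uses the standard UCB1 rule to decide which local model receives the next job. The reward definition (1 if the job succeeded, 0 otherwise), the model names, and the success rates are assumptions made for illustration; a real scheduler would also have to account for cost and resource limits.

```python
import math
import random
from typing import Dict, List


class UCB1Scheduler:
    """Choose which local model gets the next job using the UCB1 rule."""

    def __init__(self, models: List[str]) -> None:
        self.counts: Dict[str, int] = {m: 0 for m in models}
        self.totals: Dict[str, float] = {m: 0.0 for m in models}

    def select(self) -> str:
        # Try every model once before trusting the estimates.
        for model, n in self.counts.items():
            if n == 0:
                return model
        total_pulls = sum(self.counts.values())

        def score(model: str) -> float:
            mean = self.totals[model] / self.counts[model]
            bonus = math.sqrt(2 * math.log(total_pulls) / self.counts[model])
            return mean + bonus  # optimism in the face of uncertainty

        return max(self.counts, key=score)

    def update(self, model: str, reward: float) -> None:
        self.counts[model] += 1
        self.totals[model] += reward


rng = random.Random(0)  # fixed seed so the run is repeatable
success_rate = {"small-model": 0.3, "large-model": 0.8}  # made-up values
scheduler = UCB1Scheduler(list(success_rate))
for _ in range(1000):
    model = scheduler.select()
    reward = 1.0 if rng.random() < success_rate[model] else 0.0
    scheduler.update(model, reward)

print(scheduler.counts)
```

The scheduler learns each model's ability from observed outcomes rather than from a declared specification, which fits the capability-variance concern raised earlier.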