A configurable extension to Amazon Bedrock that enhances performance by enabling additional compute at inference time, allowing you to trade off the cost, accuracy and latency of an Amazon Bedrock solution using the following features:
- Reflection - Enables models to iteratively refine their responses through multiple rounds of self-reflection with external verification systems
- Multi-Model - Supports parallel inference across multiple Amazon Bedrock models for collaborative problem solving with aggregated responses
- Prompt Caching - Caches model responses to avoid redundant API calls and reduce costs
- Structured Outputs - Validates and formats model outputs into consistent structured data
- Budget Optimisation - Automatically searches for the optimal inference-time configuration considering cost and latency constraints
We have seen significant performance gains over a single Amazon Bedrock call across a wide range of domains using these techniques.
See the task details and expanded evaluation results for more information!
The easiest way to install the latest version of the package is by running the following command:
pip install git+https://github.com/aws-samples/sample-genai-reflection-for-bedrock.git
Alternatively, if you prefer a specific commit or version:
pip install git+https://github.com/aws-samples/sample-genai-reflection-for-bedrock.git@<commit_sha> # replace <commit_sha>
pip install git+https://github.com/aws-samples/sample-genai-reflection-for-bedrock.git@<version> # replace <version> e.g. v0.7.1
Additionally, these installation URLs can be appended to a requirements.txt file as shown:
git+https://github.com/aws-samples/sample-genai-reflection-for-bedrock.git
... # other packages
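A specific commit or version can be pinned inside requirements.txt with the same `@` syntax shown above, for example:
git+https://github.com/aws-samples/sample-genai-reflection-for-bedrock.git@v0.7.1
... # other packages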
Finally, if you want full control over the source, you can clone the repository and install it directly.
git clone https://github.com/aws-samples/sample-genai-reflection-for-bedrock.git
cd sample-genai-reflection-for-bedrock
pip install -e .
The above methods have analogues for GitLab's package registry as well.
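For example, if the repository is mirrored on a private GitLab instance, the equivalent pip command would look like the following (the host, group and project are placeholders to substitute with your own):
pip install git+https://gitlab.example.com/<group>/sample-genai-reflection-for-bedrock.git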
There are a variety of ways to leverage this in your project:
Now that you've set up the `Hive` client, the easiest way to use it in your project is with a single model and an optional number of reflection rounds, as shown below. This configuration enables the model to reflect on its response, applying more compute to solve a more difficult problem.
graph LR;
A[Input] --> B[Initial Thought]
B --> C[Round 1: Revision]
C --> D[Round 2: Revision]
D --> E[Output]
from bhive import Hive, HiveConfig

bhive_client = Hive()

# a single model that reflects on its own answer twice
bhive_config = HiveConfig(
    bedrock_model_ids=["anthropic.claude-3-sonnet-20240229-v1:0"],
    num_reflections=2,
)

messages = [{"role": "user", "content": [{"text": "What is 2 + 2?"}]}]
response = bhive_client.converse(messages, bhive_config)
print(response)
You can also optionally pass a verifier function to the `HiveConfig`. The verifier consumes a model output from a previous round of reflection and returns additional context about that response, which allows external information to be integrated. It must be a Callable that takes a single str and returns another str.
graph LR;
A[Input] --> B[Initial Thought]
B --> V0[Verifier]
B --> C
V0 --> C[Round 1: Revision]
C --> V1[Verifier]
C --> D
V1 --> D[Round 2: Revision]
D --> E[Output]
style V0 fill:#800080,stroke:#000000,stroke-width:2px
style V1 fill:#800080,stroke:#000000,stroke-width:2px
from bhive import Hive, HiveConfig

bhive_client = Hive()

def twoplustwo_verifier(context: str) -> str:
    # feedback returned here is passed into the next round of reflection
    if "4" in context:
        return "this answer is correct"
    else:
        return "this answer is wrong"

bhive_config = HiveConfig(
    bedrock_model_ids=["anthropic.claude-3-sonnet-20240229-v1:0"],
    num_reflections=2,
    verifier=twoplustwo_verifier,
)

messages = [{"role": "user", "content": [{"text": "What is 2 + 2?"}]}]
response = bhive_client.converse(messages, bhive_config)
print(response)
There are often cases where additional context can be used to help steer the model during a problem-solving iteration. For example, in a text-to-code application such as text-to-SQL, a verifier can execute the generated SQL and return additional information about runtime errors or returned data, as shown below.
import pandas as pd

# extract_sql, execute_sql and db_path are application-specific helpers (a sketch follows below)
def text2sql_verifier(context: str) -> str:
    """Extracts SQL and validates it against a database."""
    extracted_sql_query = extract_sql(context, "<SQL>")
    try:
        result = execute_sql(db_path=db_path, sql=extracted_sql_query[0])
        result_df = pd.DataFrame(result)
        base_msg = "The query was executed successfully against the database."
        if not result_df.empty:
            return f"{base_msg} It returned the following results:\n{result_df.to_string(index=False)}"
        else:
            return f"{base_msg} It returned no results."
    except Exception as e:
        return f"Error executing the SQL query: {str(e)}"
You can also incorporate multiple different Amazon Bedrock models to collaboratively solve your task. To use this functionality you need to provide an `aggregator_model_id`, which summarises the last debate round into a final response. The example code below implements the following inference method, where blue signifies a Claude response and red a response from Mistral.
graph LR;
A[Input] --> B1[Initial Thought]
A[Input] --> B2[Initial Thought]
B1 --> C1[Round 1: Revision]
B2 --> C1
B2 --> C2[Round 1: Revision]
B1 --> C2
C1 --> D1[Round 2: Revision]
C2 --> D1
C2 --> D2[Round 2: Revision]
C1 --> D2
D2 --> AG
D1 --> AG[Aggregator]
AG --> E[Output]
style B1 fill:#005f99,stroke:#333,stroke-width:2px;
style B2 fill:#e63946,stroke:#333,stroke-width:2px;
style C1 fill:#005f99,stroke:#333,stroke-width:2px;
style C2 fill:#e63946,stroke:#333,stroke-width:2px;
style D1 fill:#005f99,stroke:#333,stroke-width:2px;
style D2 fill:#e63946,stroke:#333,stroke-width:2px;
from bhive import Hive, HiveConfig

bhive_client = Hive()

models = ["anthropic.claude-3-sonnet-20240229-v1:0", "mistral.mistral-large-2402-v1:0"]

bhive_config = HiveConfig(
    bedrock_model_ids=models,
    num_reflections=2,
    aggregator_model_id="anthropic.claude-3-sonnet-20240229-v1:0",
)

messages = [{"role": "user", "content": [{"text": "What is 2 + 2?"}]}]
response = bhive_client.converse(messages, bhive_config)
print(response)
This can also be done with multiple instances of the same model by modifying the list as shown in the diagram and code below:
graph LR;
A[Input] --> B1[Initial Thought]
A[Input] --> B2[Initial Thought]
B1 --> C1[Round 1: Revision]
B2 --> C1
B2 --> C2[Round 1: Revision]
B1 --> C2
C1 --> D1[Round 2: Revision]
C2 --> D1
C2 --> D2[Round 2: Revision]
C1 --> D2
D2 --> AG
D1 --> AG[Aggregator]
AG --> E[Output]
style B1 fill:#005f99,stroke:#333,stroke-width:2px;
style B2 fill:#005f99,stroke:#333,stroke-width:2px;
style C1 fill:#005f99,stroke:#333,stroke-width:2px;
style C2 fill:#005f99,stroke:#333,stroke-width:2px;
style D1 fill:#005f99,stroke:#333,stroke-width:2px;
style D2 fill:#005f99,stroke:#333,stroke-width:2px;
from bhive import Hive, HiveConfig

bhive_client = Hive()

models = ["anthropic.claude-3-sonnet-20240229-v1:0", "anthropic.claude-3-sonnet-20240229-v1:0"]

bhive_config = HiveConfig(
    bedrock_model_ids=models,
    num_reflections=2,
    aggregator_model_id="anthropic.claude-3-sonnet-20240229-v1:0",
)

messages = [{"role": "user", "content": [{"text": "What is 2 + 2?"}]}]
response = bhive_client.converse(messages, bhive_config)
print(response)
You can also apply a verifier, as in the previous section, to this inference method; it is applied independently to each revision from each model, as shown in the sketch below.
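The sketch below simply combines the multi-model setup with a verifier in a single `HiveConfig`; every parameter shown already appears in the examples above.

from bhive import Hive, HiveConfig

bhive_client = Hive()

def twoplustwo_verifier(context: str) -> str:
    # feedback passed into each model's next revision round
    if "4" in context:
        return "this answer is correct"
    else:
        return "this answer is wrong"

models = ["anthropic.claude-3-sonnet-20240229-v1:0", "mistral.mistral-large-2402-v1:0"]

bhive_config = HiveConfig(
    bedrock_model_ids=models,
    num_reflections=2,
    aggregator_model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    verifier=twoplustwo_verifier,
)

messages = [{"role": "user", "content": [{"text": "What is 2 + 2?"}]}]
response = bhive_client.converse(messages, bhive_config)
print(response)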
If you are not sure which exact hyperparameter configuration will suit your needs, you can use the hyperparameter optimisation functionality. Here, you define a set of ranges for the inference parameters, such as the Amazon Bedrock models or rounds of reflection, and these are evaluated against a test dataset. You can also specify a budget constraining the maximum cost ($) and maximum latency (seconds) per example.
graph LR
B[Generate all configurations]
B --> C[For each config in configurations]
C --> D[Evaluate candidate]
D --> F{Is candidate better?}
F -->|Yes| G[Does the candidate meet the budget constraints?]
G -->|Yes| H[Update best candidate]
G -->|No| D
F -->|No| D
An example using the API is shown below:
from bhive import Hive, TrialConfig  # assumed import path for TrialConfig

# craft a test dataset of (prompt, response) pairs
dataset = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
    ("Who wrote Hamlet?", "William Shakespeare"),
]

# define a configuration of models and reflection rounds to search over
trial_config = TrialConfig(
    bedrock_model_combinations=[
        ["anthropic.claude-3-sonnet-20240229-v1:0"],
        ["anthropic.claude-3-haiku-20240307-v1:0"],
        ["mistral.mistral-small-2402-v1:0"],
        ["mistral.mistral-large-2402-v1:0"],
    ],
    reflection_range=[0, 1, 3],
    # other parameter ranges / choices
)

# instantiate a client and run optimise
hive_client = Hive()
results = hive_client.optimise(dataset, trial_config)
By default, Hive.optimise will directly compare string responses, but you can pick from (and extend) the other evaluators available in `bhive.evaluators`.
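The exact evaluator interface is defined in `bhive.evaluators`, so check that module before extending it; purely as an illustration of the kind of comparison logic involved, a normalised exact-match check might look like this:

def normalised_exact_match(prediction: str, reference: str) -> bool:
    """Illustrative comparison: exact match after trimming whitespace and lower-casing."""
    return prediction.strip().lower() == reference.strip().lower()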
| Jack Butler | Nikita Kozodoi |
| --- | --- |
Chat to the team if you have new feature suggestions or bug fixes!
- Can I use this with any model on Amazon Bedrock?
The model must support conversation history to be used; this rules out certain models, such as Jurassic-2 Ultra, which do not have this capability.
- Does it support multimodal queries?
Yes, it mirrors the BedrockRuntime `converse()` messages structure and will perform inference with any modality.
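For instance, an image-plus-text query can be passed using the same content-block structure as the Bedrock Converse API, re-using the client and configuration from the earlier examples (the image path below is a placeholder):

# read a local image as raw bytes
with open("diagram.png", "rb") as f:
    image_bytes = f.read()

messages = [
    {
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            {"text": "Describe what is shown in this image."},
        ],
    }
]
response = bhive_client.converse(messages, bhive_config)
print(response)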
- Can I authenticate with my own `boto3` client?
Yes, you can pass an already initialised client instance to the `Hive` class; otherwise we will try to create a client from the `AWS_PROFILE` environment variable.
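As a sketch, creating a BedrockRuntime client from a named profile and handing it to `Hive` could look like the following; the keyword argument name is an assumption, so check the `Hive` signature in your installed version:

import boto3
from bhive import Hive

# build a BedrockRuntime client from an explicit profile and region
session = boto3.Session(profile_name="my-profile", region_name="us-east-1")
bedrock_client = session.client("bedrock-runtime")

# the "client" keyword is assumed; adjust to match the Hive constructor
bhive_client = Hive(client=bedrock_client)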