Skip to content

Conversation

@Siete-F
Copy link
Contributor

@Siete-F Siete-F commented Jul 8, 2025

Sidenotes first:

  • add_document_to_knowledge_base is only used once and named inconsistently, therefore I renamed it to load_document (there is also an async_load_document, calling async_load_documents with a list with one document).
  • upsert should receive a list of Documents (2 out of all the calls to upsert are in a loop, passing one doc at a time, and the other calls are with the list of documents... There is no reason for that distinction). Leaving the looping to the upsert function allows for some additional flexibility inside the upsert, like obtaining the complete document content for example or checking the total number of chunks.
  • num_documents was not used, the final logging should have likely been unindented, printing it's value.

Summary

Upsert was practically not implemented, so kind of a feature.

As I tried to state in the issue #3700, it is technically NOT possible to implement a proper "per-chunk" upsert. with the following assumptions:

  • Changed vector db documents must be deleted (for the given document name that is being 'upserted') to keep the vector db clean.
  • There is no (consistent) way to determine if a full document is changed somehow after chunking.
  • It's not common (or backwards compatible) to add a full document hash as meta data.
  • Weaviate update operations cannot be used since the document ID is actually the content hash, so processing a "change" can only be done by a delete and insert.

Because of this I resorted to the following approach:

  • For the first chunk (or when not chunking) I can check if content changed only by checking if a certain (chunk) content hash is available in the vector db.
    • Then, since it changed, the only way to find the related documents is by name. Since I dont know if the chunking strategy changed, we can only delete all documents with that name.
  • For a second chunk, the name exists, but, after the prior deletion, the content-hash (document id) does not occur anymore... so it is considered a "change". But we should NOT delete the chunks by name... Only add the chunk.
  • Because of this, I resorted to only deleting all the documents by name when it is the first chunk (or the (only) document with no chunks). This means that changes in chunks 2 and higher will be recognized and added to the knowledge base, but the old chunk 2 will remain...

I am open to feedback and am willing to have a discussion on a better implementation.

As stated in the issue, for a better implementation I would need:

Long story short, I think we need to include the hash of the content of the full document+metadata in the document as metadata, so that we can determine if the hash of the document content + metadata is the same or changed. If changed, then delete all documents with that name, and start looping for the upsert (INSIDE the Weaviate module, not outside the module in the Agent class, which is the current state).

The benefit is that on the first chunk you can relyably say with the full-document hash (any of all the chunks) changed. Then delete all 'documents' from the vector db with that name and insert.

Beside that, I would recomment just using a random UUID4 as document id.

Type of change

  • Bug fix
  • New feature
  • Breaking change
  • Improvement
  • Model update
  • Other:

Checklist

  • Code complies with style guidelines
  • Ran format/validation scripts (./scripts/format.sh and ./scripts/validate.sh)
  • Self-review completed
  • Documentation updated (comments, docstrings)
  • Examples and guides: Relevant cookbook examples have been included or updated (if applicable)
  • Tested in clean environment
  • Tests added/updated (if applicable)

Siete Frouws added 4 commits July 7, 2025 20:38
…mitations that are build in, namely using the content as UUID hash for storing the document and not having a document wide indicator (full document hash) that indicates if a document changed somewhere.
@Siete-F Siete-F requested a review from a team as a code owner July 8, 2025 20:04
@Siete-F Siete-F changed the title [bug] Implement weaviate knowledge upsert. Improve AgentKnowledge class. [fix] Implement weaviate knowledge upsert. Improve AgentKnowledge class. Jul 9, 2025
@willemcdejongh
Copy link
Contributor

Hey @Siete-F

Thanks for your contributions!

I think the changes you did are all quite valid. Except for this

if upsert and not self.vector_db.upsert_available():
            raise ValueError(
                f"Vector db '{self.vector_db.__class__.__module__}' does not support upsert. Please set upsert to False or use a vector db that supports upsert."

This restricts users to having very specific rulesets when they setup their knowledge. It could also be that more than one vectorDB is used for a knowledgebase. One of which might support upsert and the other not. I think instead of raising the error, we should keep the default behaviour where it falls back to insert

Would you please mind adding a cookbook implementation for weaviate that uses upsert=True.

@Siete-F
Copy link
Contributor Author

Siete-F commented Jul 14, 2025

Hi @willemcdejongh, thanks for your feedback!
Do you agree that some kind of notification would still be appropriate? Most cases will be people who are convinced they configured "upsert" where it suddenly doesn't work.
If you agree, do you think it should be a logger.info or debug_log or something else? if your scenario occurs more often, it should be suppressible maybe?

Thanks again! will remove the part you discussed.

@kausmeows
Copy link
Contributor

Hi @willemcdejongh, thanks for your feedback! Do you agree that some kind of notification would still be appropriate? Most cases will be people who are convinced they configured "upsert" where it suddenly doesn't work. If you agree, do you think it should be a logger.info or debug_log or something else? if your scenario occurs more often, it should be suppressible maybe?

Thanks again! will remove the part you discussed.

Hey @Siete-F i think a logger info makes more sense with a message like- “upsert functionality not available, falling back to insert.”

Something like that.

@Siete-F
Copy link
Contributor Author

Siete-F commented Jul 16, 2025

@willemcdejongh, I changed it to an info log.

Regarding the cookbook example:

Added it. For upsert only the logs and not having an error make sense, so enabled debug logging and that shows the following.

INFO Initializing local Weaviate client
INFO Creating collection
DEBUG Creating collection 'recipes' in Weaviate.
DEBUG Collection 'recipes' created in Weaviate.
INFO Loading knowledge base
INFO Reading: https://agno-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf
DEBUG Upserting 14 documents into Weaviate.
DEBUG Inserting 14 documents into Weaviate.
DEBUG Inserted document: ThaiRecipes ({'page': 1, 'chunk': 1, 'chunk_size': 54})
DEBUG Inserted document: ThaiRecipes ({'page': 2, 'chunk': 1, 'chunk_size': 1387})
DEBUG Inserted document: ThaiRecipes ({'page': 3, 'chunk': 1, 'chunk_size': 2088})
DEBUG Inserted document: ThaiRecipes ({'page': 4, 'chunk': 1, 'chunk_size': 1649})
DEBUG Inserted document: ThaiRecipes ({'page': 5, 'chunk': 1, 'chunk_size': 1275})
DEBUG Inserted document: ThaiRecipes ({'page': 6, 'chunk': 1, 'chunk_size': 1412})
DEBUG Inserted document: ThaiRecipes ({'page': 7, 'chunk': 1, 'chunk_size': 1427})
DEBUG Inserted document: ThaiRecipes ({'page': 8, 'chunk': 1, 'chunk_size': 1497})
DEBUG Inserted document: ThaiRecipes ({'page': 9, 'chunk': 1, 'chunk_size': 1441})
DEBUG Inserted document: ThaiRecipes ({'page': 10, 'chunk': 1, 'chunk_size': 1258})
DEBUG Inserted document: ThaiRecipes ({'page': 11, 'chunk': 1, 'chunk_size': 1280})
DEBUG Inserted document: ThaiRecipes ({'page': 12, 'chunk': 1, 'chunk_size': 1295})
DEBUG Inserted document: ThaiRecipes ({'page': 13, 'chunk': 1, 'chunk_size': 1116})
DEBUG Inserted document: ThaiRecipes ({'page': 14, 'chunk': 1, 'chunk_size': 267})
INFO Added 14 documents to knowledge base
Knowledge base loaded with PDF content. Loading the same data again will not recreate it.
INFO Loading knowledge base
INFO Reading: https://agno-public.s3.amazonaws.com/recipes/ThaiRecipes.pdf
DEBUG Upserting 14 documents into Weaviate.
DEBUG Document skipped, content is unchanged: ThaiRecipes
DEBUG Document skipped, content is unchanged: ThaiRecipes
DEBUG Document skipped, content is unchanged: ThaiRecipes
DEBUG Document skipped, content is unchanged: ThaiRecipes
DEBUG Document skipped, content is unchanged: ThaiRecipes
DEBUG Document skipped, content is unchanged: ThaiRecipes
DEBUG Document skipped, content is unchanged: ThaiRecipes
DEBUG Document skipped, content is unchanged: ThaiRecipes
DEBUG Document skipped, content is unchanged: ThaiRecipes
DEBUG Document skipped, content is unchanged: ThaiRecipes
DEBUG Document skipped, content is unchanged: ThaiRecipes
DEBUG Document skipped, content is unchanged: ThaiRecipes
DEBUG Document skipped, content is unchanged: ThaiRecipes
DEBUG Document skipped, content is unchanged: ThaiRecipes
DEBUG Inserting 0 documents into Weaviate.
INFO Added 14 documents to knowledge base
First example finished. Now dropping the knowledge base.
DEBUG Deleting collection 'recipes' from Weaviate.


Start second example. Load initial data...
INFO Creating collection
DEBUG Creating collection 'recipes' in Weaviate.
DEBUG Collection 'recipes' created in Weaviate.
INFO Loading knowledge base
DEBUG Upserting 1 documents into Weaviate.
DEBUG Inserting 1 documents into Weaviate.
DEBUG Inserted document: doc1 ({})
DEBUG Upserting 1 documents into Weaviate.
DEBUG Inserting 1 documents into Weaviate.
INFO Loading knowledge base
DEBUG Upserting 1 documents into Weaviate.
DEBUG Inserting 1 documents into Weaviate.
DEBUG Inserted document: doc1 ({})
DEBUG Upserting 1 documents into Weaviate.
DEBUG Inserting 1 documents into Weaviate.
DEBUG Inserted document: doc2 ({})
INFO Added 2 documents to knowledge base

Now uploading the changed data...
INFO Loading knowledge base
DEBUG Upserting 1 documents into Weaviate.
DEBUG Document already exists, but content changed. Document will be deleted and added again: doc1
DEBUG Deleted document by name: 'doc1' - 1 documents deleted.
DEBUG Inserting 1 documents into Weaviate.
DEBUG Inserted document: doc1 ({})
DEBUG Upserting 1 documents into Weaviate.
DEBUG Document skipped, content is unchanged: doc2
DEBUG Inserting 0 documents into Weaviate.
INFO Added 2 documents to knowledge base
Example finished. Now dropping the knowledge base.
DEBUG Deleting collection 'recipes' from Weaviate.

@Siete-F
Copy link
Contributor Author

Siete-F commented Jul 16, 2025

@willemcdejongh

When running validate.bat (mypy), I run into the following error:

(agno) C:\Users\scfro\Documents\code\agno\libs\agno\scripts>validate.bat

##################################################
# Validating agno
##################################################


##################################################
# Running: ruff check C:\Users\scfro\Documents\code\agno\libs\agno\scripts\\..
##################################################

All checks passed!

##################################################
# Running: mypy C:\Users\scfro\Documents\code\agno\libs\agno\scripts\\.. --config-file C:\Users\scfro\Documents\code\agno\libs\agno\scripts\\..\pyproject.toml
##################################################

C:\Users\scfro\Documents\code\agno\libs\agno\build\lib\agno\__init__.py: error: Duplicate module named "agno" (also at "C:\Users\scfro\Documents\code\agno\libs\agno\agno\__init__.py")
C:\Users\scfro\Documents\code\agno\libs\agno\build\lib\agno\__init__.py: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#mapping-file-paths-to-modules for more info
C:\Users\scfro\Documents\code\agno\libs\agno\build\lib\agno\__init__.py: note: Common resolutions include: a) using `--exclude` to avoid checking one of them, b) adding `__init__.py` somewhere, c) using `--explicit-package-bases` or adjusting MYPYPATH
Found 1 error in 1 file (errors prevented further checking)
[ERROR] mypy validation failed with exit code 2

Is that an error in setup on my side? Or shall I create a ticket for the bug?

Copy link
Contributor

@kausmeows kausmeows left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! 🚀

@kausmeows kausmeows merged commit 125861b into agno-agi:main Aug 8, 2025
3 checks passed
dirkbrnd pushed a commit that referenced this pull request Aug 27, 2025
## Summary

Fixes #4258 
This is a regression from #3784 - the for loop was removed but the
`filters` argument has to be specified per doc, not globally.

## Type of change

- [x] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Improvement
- [ ] Model update
- [ ] Other:

---

## Checklist

- [x] Code complies with style guidelines
- [x] Ran format/validation scripts (`./scripts/format.sh` and
`./scripts/validate.sh`)
- [x] Self-review completed
- [ ] Documentation updated (comments, docstrings)
- [ ] Examples and guides: Relevant cookbook examples have been included
or updated (if applicable)
- [x] Tested in clean environment
- [ ] Tests added/updated (if applicable)

---

## Additional Notes

I guess a more efficient and thorough solution would be to change the
`insert()` and `upsert()` methods to receive a list of filters, but it
would also be a breaking change.

Co-authored-by: Kaustubh <shuklakaustubh84@gmail.com>
@kausmeows kausmeows mentioned this pull request Aug 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants