You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
### Sidenotes first:
- `add_document_to_knowledge_base` is only used once and named
inconsistently, therefore I renamed it to `load_document` (there is also
an `async_load_document`, calling `async_load_documents` with a list
with one document).
- `upsert` should receive a list of Documents (2 out of all the calls to
upsert are in a loop, passing one doc at a time, and the other calls are
with the list of documents... There is no reason for that distinction).
Leaving the looping to the upsert function allows for some additional
flexibility inside the upsert, like obtaining the complete document
content for example or checking the total number of chunks.
- `num_documents` was not used, the final logging should have likely
been unindented, printing it's value.
## Summary
Upsert was practically not implemented, so kind of a feature.
As I tried to state in the issue #3700, it is technically NOT possible
to implement a proper "per-chunk" upsert. with the following
assumptions:
- Changed vector db documents must be deleted (for the given document
name that is being 'upserted') to keep the vector db clean.
- There is no (consistent) way to determine if a full document is
changed somehow after chunking.
- It's not common (or backwards compatible) to add a **full document
hash** as meta data.
- Weaviate `update` operations cannot be used since the document ID is
actually the content hash, so processing a "change" can only be done by
a delete and insert.
Because of this I resorted to the following approach:
- For the first chunk (or when not chunking) I can check if content
changed only by checking if a certain (chunk) content hash is available
in the vector db.
- Then, since it changed, the only way to find the related documents is
by name. Since I dont know if the chunking strategy changed, we can only
delete all documents with that name.
- For a second chunk, the name exists, but, after the prior deletion,
the content-hash (document id) does not occur anymore... so it is
considered a "change". But we should NOT delete the chunks by name...
Only add the chunk.
- Because of this, I resorted to only deleting all the documents by name
when it is the first chunk (or the (only) document with no chunks). This
means that changes in chunks 2 and higher will be recognized and added
to the knowledge base, but the old chunk 2 will remain...
I am open to feedback and am willing to have a discussion on a better
implementation.
As stated in the issue, for a better implementation I would need:
> Long story short, I think we need to **include the hash of the content
of the full document+metadata** in the document as metadata, so that we
can determine if the hash of the document content + metadata is the same
or changed. If changed, then delete all documents with that name, and
start looping for the upsert (INSIDE the Weaviate module, not outside
the module in the Agent class, which is the current state).
The benefit is that on the first chunk you can relyably say with the
full-document hash (any of all the chunks) changed. Then delete all
'documents' from the vector db with that name and insert.
Beside that, I would recomment just using a random UUID4 as document id.
## Type of change
- [x] Bug fix
- [ ] New feature
- [ ] Breaking change
- [x] Improvement
- [ ] Model update
- [ ] Other:
---
## Checklist
- [x] Code complies with style guidelines
- [ ] Ran format/validation scripts (`./scripts/format.sh` and
`./scripts/validate.sh`)
- [x] Self-review completed
- [x] Documentation updated (comments, docstrings)
- [ ] Examples and guides: Relevant cookbook examples have been included
or updated (if applicable)
- [ ] Tested in clean environment
- [x] Tests added/updated (if applicable)
---
---------
Co-authored-by: Siete Frouws <siete.frouws@bincy.nl>
Co-authored-by: Kaustubh <shuklakaustubh84@gmail.com>
0 commit comments