Chunking table contents for embeddings #1487

rafaeltuelho · 2025-04-28T16:30:11Z


I'm trying to understand how HybridChunker works, and what the proper format is for the text I embed and store in a vector DB.

Suppose a document has a table like this:

| | A | C | R6 | Administrator | Institutional |
|---|---|---|---|---|---|
| Maximum sales charge (load) imposed on purchases (as a percentage of offering price) | 5.75% | None | None | None | None |
| Maximum deferred sales charge (load) (as a percentage of offering price) | None 1 | 1.00% | None | None | None |

Using HybridChunker, I get this chunk for the table:

Maximum sales charge (load) imposed on purchases (as a percentage of offering price), A = 5.75%. 
Maximum sales charge (load) imposed on purchases (as a percentage of offering price), C = None. 
Maximum sales charge (load) imposed on purchases (as a percentage of offering price), R6 = None. 
Maximum sales charge (load) imposed on purchases (as a percentage of offering price), Administrator = None. 
Maximum sales charge (load) imposed on purchases (as a percentage of offering price), Institutional = None. 

Maximum deferred sales charge (load) (as a percentage of offering price), A = None 1. 
Maximum deferred sales charge (load) (as a percentage of offering price), C = 1.00%. 
Maximum deferred sales charge (load) (as a percentage of offering price), R6 = None.
Maximum deferred sales charge (load) (as a percentage of offering price), Administrator = None. 
Maximum deferred sales charge (load) (as a percentage of offering price), Institutional = None\n

All of this comes out as one long line (I broke it up here to make it easier to read).

So it repeats the first cell of each row (the row label) once for every column header. I'm afraid this is bad for embeddings, since the repeated text can skew the top-k results.
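To make the pattern concrete, here is a minimal sketch in plain Python that produces the same "row label, column = value" shape as the chunk above. It is only an illustration of the observed output, not docling's actual implementation; `triplet_serialize` and the `columns`/`rows` variables are hypothetical names made up for this example.

```python
def triplet_serialize(columns, rows):
    """Flatten a table into 'row label, column = value' statements.

    Hypothetical helper: it mimics the shape of the chunk text shown above,
    it is not taken from docling.
    """
    parts = []
    for label, *values in rows:
        for col, val in zip(columns, values):
            # The row label is emitted again for every column.
            parts.append(f"{label}, {col} = {val}")
    return ". ".join(parts)


columns = ["A", "C", "R6", "Administrator", "Institutional"]
rows = [
    ["Maximum sales charge (load) imposed on purchases (as a percentage of offering price)",
     "5.75%", "None", "None", "None", "None"],
    ["Maximum deferred sales charge (load) (as a percentage of offering price)",
     "None 1", "1.00%", "None", "None", "None"],
]

text = triplet_serialize(columns, rows)
print(text)
```

With 2 rows and 5 columns, each row label ends up in the text 5 times, which is the duplication I'm worried about.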

What would be a better approach to extracting table contents for chunking/embedding?
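One possible direction, sketched here in plain Python rather than with anything docling provides, would be to serialize each row once, so the row label appears a single time followed by all of its column values. `row_serialize` is a hypothetical helper name, and whether this format actually retrieves better is an assumption that would need testing against real queries.

```python
def row_serialize(columns, rows):
    """Emit one line per table row: 'label: col = val; col = val; ...'.

    Hypothetical alternative format for illustration only.
    """
    lines = []
    for label, *values in rows:
        cells = "; ".join(f"{col} = {val}" for col, val in zip(columns, values))
        lines.append(f"{label}: {cells}")
    return "\n".join(lines)


columns = ["A", "C", "R6", "Administrator", "Institutional"]
rows = [
    ["Maximum sales charge (load) imposed on purchases (as a percentage of offering price)",
     "5.75%", "None", "None", "None", "None"],
    ["Maximum deferred sales charge (load) (as a percentage of offering price)",
     "None 1", "1.00%", "None", "None", "None"],
]

compact = row_serialize(columns, rows)
print(compact)
```

This keeps every value tied to its column header while cutting the repeated label text down to one occurrence per row.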
