
Commit 748ede1

updated all notebooks - Adithya S K
1 parent 94de9b5 commit 748ede1

2 files changed, +79 -131 lines changed


docs/demo.ipynb

Lines changed: 64 additions & 0 deletions
@@ -13,6 +13,70 @@
     "This makes it easy to test and experiment with different approaches in real-time."
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "!git clone https://github.com/adithya-s-k/VARAG\n",
+    "%cd VARAG\n",
+    "%pwd"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
41+
"!apt-get update && apt-get install -y && apt-get install -y poppler-utils"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "%pip install -e .\n",
+    "\n",
+    "## We will be using Docling for OCR\n",
+    "%pip install docling"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Run Gradio"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "!python demo.py --share"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
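Taken together, the cells added above bootstrap the demo environment and launch it. As a single IPython session, a minimal sketch (assuming a Colab-like host where apt-get runs as root and pip is preinstalled):

    # Minimal bootstrap sketch mirroring the added cells.
    !git clone https://github.com/adithya-s-k/VARAG
    %cd VARAG
    # poppler-utils supplies pdftoppm/pdftotext, which PDF-to-image tooling typically needs
    !apt-get update && apt-get install -y poppler-utils
    %pip install -e .        # install VARAG itself in editable mode
    %pip install docling     # Docling is used for OCR
    !python demo.py --share  # launch the Gradio demo with a public share link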

docs/simpleRAG.ipynb

Lines changed: 15 additions & 131 deletions
@@ -396,93 +396,15 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": null,
    "metadata": {
     "colab": {
      "base_uri": "https://localhost:8080/"
     },
     "id": "m4izqMiY-knW",
     "outputId": "d6427938-f1f2-4f08-b051-73e841a48840"
    },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "This was the retrieved Context\n",
-      "======================================================================================================================================================\n",
-      "\n",
-      "\n",
-      "Chunk 1:\n",
-      "Text: The table uses \"W\" and \"T\" markers to denote which system or department serves as the primary source (writer) or storage location (trailer) for each type of document.\n",
-      "\n",
-      "## C More similarity maps\n",
-      "\n",
-      "In Figure 7, ColPali assigns a high similarity to all patches with the word \"Kazakhstan\" when given the token <_Kazakhstan> . Moreover, our model seems to exhibit world knowledge capabilities as the patch around the word \"Kashagan\" - an offshore oil field in Kazakhstan - also shows a high similarity score. On the other hand, in Figure 8, we observe that ColPali is also capable of complex image understanding. Not only are the patches containing the word \"formulations\" highly similar to the query token _formula , but so is the upper-left molecule structure.\n",
-      "\n",
-      "It is also interesting to highlight that both similarity maps showcase a few white patches with high similarity scores. This behavior might first seem surprising as the white patches should not carry a meaningful signal from the original images\n",
-      "Chunk Index: 68\n",
-      "Document Name: colpali.pdf\n",
-      "\n",
-      "\n",
-      "======================================================================================================================================================\n",
-      "======================================================================================================================================================\n",
-      "\n",
-      "\n",
-      "Chunk 2:\n",
-      "Text: 4% ) on the aggregated benchmark.\n",
-      "\n",
-      "## Can the model adapt to new tasks?\n",
-      "\n",
-      "Contrary to more complex multi-step retrieval pipelines, ColPali can be trained end-to-end, directly optimizing the downstream retrieval task which greatly facilitates fine-tuning to boost performance on specialized domains, multilingual retrieval, or specific visual elements the model struggles with. To demonstrate, we add 1552 samples representing French tables and associated queries to the training set. This represents the only French data in the training set, with all other examples being kept unchanged. We see significant NDCG@5 improvements (Figure 4) and even starker Recall@1 gains ( +6 . 63% ) on the TabFQuAD benchmark, with no performance degradation on the rest of the benchmark tasks ( +0\n",
-      "Chunk Index: 36\n",
-      "Document Name: colpali.pdf\n",
-      "\n",
-      "\n",
-      "======================================================================================================================================================\n",
-      "======================================================================================================================================================\n",
-      "\n",
-      "\n",
-      "Chunk 3:\n",
-      "Text: Optimized late interaction engines (Santhanam et al., 2022; Lee et al., 2023) enable to easily scale corpus sizes to millions of documents with reduced latency degradations.\n",
-      "\n",
-      "Offline Indexing. (R3) Standard retrieval methods using bi-encoders represent each chunk as a single vector embedding, which is easy to store and fast to compute. However, processing a PDF to get the different chunks is the most time-consuming part (layout detection, OCR, chunking), and using captioning to handle multimodal data will only exacerbate this already lengthy process. On the other hand, ColPali directly encodes pages from their image representation. Although the encoder model is larger than standard retrieval encoders, skipping the preprocessing allows large speedups at indexing 10 (Figure 3).\n",
-      "\n",
-      "Memory Footprint. Our method requires storing a vector per image patch\n",
-      "Chunk Index: 31\n",
-      "Document Name: colpali.pdf\n",
-      "\n",
-      "\n",
-      "======================================================================================================================================================\n",
-      "======================================================================================================================================================\n",
-      "\n",
-      "\n",
-      "Chunk 4:\n",
-      "Text: This is particularly notable as our training dataset does not contain non-English samples.\n",
-      "\n",
-      "ColPali : Adding Late Interaction. One benefit of inputting image patch embeddings through a language model is that they are natively mapped to a latent space similar to textual input (query). This enables leveraging the ColBERT strategy to compute interactions between text tokens and image patches, which enables a step-change improvement in performance compared to BiPali. Results in Table 2 show that our ColPali model also largely outperforms the strong baselines based on Unstructured and captioning, as well as all evaluated text-image embedding models. The difference is particularly stark on the more visually complex benchmark tasks, such as InfographicVQA, ArxivQA, and TabFQuAD representing respectively infographics, figures, and tables\n",
-      "Chunk Index: 25\n",
-      "Document Name: colpali.pdf\n",
-      "\n",
-      "\n",
-      "======================================================================================================================================================\n",
-      "======================================================================================================================================================\n",
-      "\n",
-      "\n",
-      "Chunk 5:\n",
-      "Text: Subsequently, our vision is to combine visual retrieval and visually grounded query answering to create RAG systems that purely function from visual features. An interesting line of research could be attempting to generate answers leveraging information stored in the indexed multivector patch embeddings.\n",
-      "\n",
-      "## Limitations\n",
-      "\n",
-      "Focus. In this work, we evaluate models on document retrieval tasks, covering several modalities (figures, text, tables, infographics). We however primarily focus on PDF-type documents, and evaluating systems on image retrieval with documents stemming from web page screenshots or handwritten documents might be an interesting generalization. We also focus on high-resource languages (English and French) and although we have shown the capacity of the ColPali model to generalize to languages outside of its fine-tuning set, it is unclear how the model would perform on languages that are not as represented in the model's language backbone\n",
-      "Chunk Index: 38\n",
-      "Document Name: colpali.pdf\n",
-      "\n",
-      "\n",
-      "======================================================================================================================================================\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "query = \"what is colpali ?\"\n",
     "num_results = 5\n",
@@ -501,7 +423,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": null,
    "metadata": {
     "colab": {
      "base_uri": "https://localhost:8080/",
@@ -510,20 +432,7 @@
     "id": "2flyc-fc-knX",
     "outputId": "00f23adb-0387-4de3-f7f4-bd4ac61d458d"
    },
-   "outputs": [
-    {
-     "data": {
-      "text/markdown": [
-       "ColPali is a model designed for document retrieval tasks that combines visual retrieval with language processing to enhance performance, particularly in the context of multimodal documents like PDFs, figures, tables, and infographics. It utilizes a late interaction mechanism to compute interactions between text tokens and image patches, improving retrieval capabilities significantly compared to previous models. ColPali can be trained end-to-end, allowing it to adapt to new tasks and specialized domains efficiently. Additionally, it can handle various languages and leverages visual features for query answering, aiming to create systems that function purely from visual information."
-      ],
-      "text/plain": [
-       "<IPython.core.display.Markdown object>"
-      ]
-     },
-     "metadata": {},
-     "output_type": "display_data"
-    }
-   ],
+   "outputs": [],
    "source": [
     "from IPython.display import display, Markdown, Latex\n",
     "\n",
@@ -557,45 +466,11 @@
     "id": "cTvt2kHk-knY",
     "outputId": "3d4d0aba-6b57-4401-ae0e-4692d65da39a"
    },
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "/content/VARAG/examples\n",
-      "2024-09-28 09:45:43.314833: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
-      "2024-09-28 09:45:43.339060: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
-      "2024-09-28 09:45:43.347860: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n",
-      "2024-09-28 09:45:44.422675: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT\n",
-      "INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cuda\n",
-      "INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2\n",
-      "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884\n",
-      " warnings.warn(\n",
-      "Using device: cuda\n",
-      "INFO:httpx:HTTP Request: GET https://api.gradio.app/pkg-version \"HTTP/1.1 200 OK\"\n",
-      "Running on local URL: http://127.0.0.1:7860\n",
-      "INFO:httpx:HTTP Request: GET http://127.0.0.1:7860/startup-events \"HTTP/1.1 200 OK\"\n",
-      "INFO:httpx:HTTP Request: HEAD http://127.0.0.1:7860/ \"HTTP/1.1 200 OK\"\n",
-      "\n",
-      "To create a public link, set `share=True` in `launch()`.\n",
-      "INFO:httpx:HTTP Request: GET https://checkip.amazonaws.com/ \"HTTP/1.1 200 \"\n",
-      "INFO:httpx:HTTP Request: GET https://checkip.amazonaws.com/ \"HTTP/1.1 200 \"\n"
-     ]
-    }
-   ],
+   "outputs": [],
    "source": [
     "%cd examples\n",
     "!python textDemo.py --share"
    ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "id": "fRFHnpC7BW_K"
-   },
-   "outputs": [],
-   "source": []
   }
  ],
  "metadata": {
@@ -609,7 +484,16 @@
    "name": "python3"
   },
   "language_info": {
-   "name": "python"
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.9"
   },
   "widgets": {
    "application/vnd.jupyter.widget-state+json": {
