-
Notifications
You must be signed in to change notification settings - Fork 5
Incorporación de markup_doc, model_ai y utilidades en core #27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
… de nuevas apps y aumento del límite de campos
…s, textos con idioma y manejo flexible de fechas
…ón de referencias
…s y ampliación de tipos soportados (confproc, full_text, etc.)
…istas de búsqueda, utilidades y hooks de Wagtail
…o, utilidades y hooks de Wagtail
…ones OMML a MathML
…es de inferencia, tareas y hooks de Wagtail
…s de procesamiento de datos
…ial.py y eliminación de migraciones intermedias
…n de Django y traducción de verbose_name a inglés
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR incorporates significant new functionality into the codebase by adding markup document processing, AI model integration, and utilities for handling DOCX files and references. The changes focus on creating a comprehensive document processing pipeline with AI-powered metadata extraction and XML generation capabilities.
Key Changes
- Creation of new apps
markup_docandmodel_aifor document processing and AI model management - Integration of DOCX processing capabilities with OMML to MathML transformation
- Implementation of reference processing with LLaMA model integration
- Consolidation and cleanup of database migrations
Reviewed Changes
Copilot reviewed 53 out of 64 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| requirements/base.txt | Added dependencies for Google Generative AI, langid, and python-docx |
| reference/wagtail_hooks.py | Updated admin interface to use new snippet system and model AI tasks |
| reference/models.py | Enhanced Reference model with status choices and improved field definitions |
| reference/migrations/ | Consolidated migration files for cleaner database schema |
| reference/marker.py | Updated import path for GenericLlama class |
| reference/data_utils.py | Added ReferenceStatus usage and updated task imports |
| reference/config_gemini.py | New configuration for Gemini API integration with reference processing |
| reference/config.py | Enhanced configuration with JSON formatting and additional reference types |
| reference/api/v1/views.py | Updated to use ReferenceStatus enum |
| model_ai/ | Complete new app for AI model management and inference utilities |
| markuplib/ | New library with DOCX processing functions and OMML to MathML transformation |
| markup_doc/ | Complete new app for document markup processing with XML generation |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
reference/models.py
Outdated
| # Create your models here | ||
| class Reference(CommonControlField, ClusterableModel): | ||
| mixed_citation = models.TextField(_("Mixed Citation"), null=False, blank=True) | ||
| mixed_citation = models.TextField(("Mixed Citation"), null=False, blank=True) |
Copilot
AI
Sep 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing underscore function call. Should be _('Mixed Citation') to enable internationalization.
| mixed_citation = models.TextField(("Mixed Citation"), null=False, blank=True) | |
| mixed_citation = models.TextField(_("Mixed Citation"), null=False, blank=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Necesita incluir el _ antes de (
mixed_citation = models.TextField(_("Mixed Citation"), null=False, blank=True)
reference/models.py
Outdated
| MaxValueValidator(10) # Máximo 10 | ||
| ], | ||
| help_text=_("Rating from 1 to 10") | ||
| help_text="Rating from 1 to 10" |
Copilot
AI
Sep 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This field should use the internationalization function. Change to help_text=_('Rating from 1 to 10').
| help_text="Rating from 1 to 10" | |
| help_text=_("Rating from 1 to 10") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
¿Hay alguna razón especial para eliminar la traducción _()?
model_ai/tasks.py
Outdated
| instance = LlamaModel.objects.get(id=id) | ||
| get_model(instance.hf_token, instance.name_model, instance.name_file) | ||
| instance.download_status = DownloadStatus.DOWNLOADED | ||
| except: |
Copilot
AI
Sep 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using bare except: clause catches all exceptions including system exits. Use except Exception: instead to catch only recoverable exceptions.
| except: | |
| except Exception: |
reference/config_gemini.py
Outdated
| "source": "The role of physical activity on the need for revision total knee arthroplasty in individuals with osteoarthritis of the knee [dissertation]", | ||
| "organization": "University of Pittsburgh", | ||
| "location": "[Pittsburgh (PA)]", | ||
| "num_pages": 436 p |
Copilot
AI
Sep 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Invalid syntax - missing quotes around the value. Should be 'num_pages': '436 p' or 'num_pages': 436 if it's meant to be numeric.
| "num_pages": 436 p | |
| "num_pages": "436 p" |
reference/config_gemini.py
Outdated
| "date": 1903, | ||
| "source": "Plant geography upon a physiological basis", | ||
| "organization": "Clarendon Press", | ||
| "location": "Oxford, UK" |
Copilot
AI
Sep 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing comma after this field, which will cause a syntax error.
| "location": "Oxford, UK" | |
| "location": "Oxford, UK", |
reference/config_gemini.py
Outdated
| ], | ||
| "date": c2005, | ||
| "source": "Advanced practice nursing: an integrative approach", | ||
| "edition: "3rd ed" |
Copilot
AI
Sep 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing closing quote and comma. Should be 'edition': '3rd ed',.
| "edition: "3rd ed" | |
| "edition": "3rd ed", |
reference/config_gemini.py
Outdated
| "edition: "3rd ed" | ||
| "organization": "Elsevier Saunders", | ||
| "location": "St. Louis (MO)", | ||
| "num_pages": 979 p |
Copilot
AI
Sep 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Invalid syntax - missing quotes around the value. Should be 'num_pages': '979 p' or 'num_pages': 979 if numeric.
| "num_pages": 979 p | |
| "num_pages": "979 p" |
reference/config_gemini.py
Outdated
| "date": 1995, | ||
| "source": "Inflammatory bowel disease", | ||
| "chapter": "The epidemiology of idiopathic inflammatory bowel disease.", | ||
| "edition: "4th" |
Copilot
AI
Sep 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing closing quote and comma. Should be 'edition': '4th',.
| "edition: "4th" | |
| "edition": "4th", |
markup_doc/views.py
Outdated
| response['Content-Disposition'] = f'attachment; filename="{nombre_archivo}"' | ||
|
|
||
| return response | ||
| except TuModelo.DoesNotExist: |
Copilot
AI
Sep 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TuModelo is not defined. Should be ArticleDocxMarkup.DoesNotExist based on the context.
| except TuModelo.DoesNotExist: | |
| except ArticleDocxMarkup.DoesNotExist: |
markuplib/function_docx.py
Outdated
| print("*********************************") | ||
| print(links) |
Copilot
AI
Sep 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug print statements should be removed from production code or replaced with proper logging.
|
@eduranm he contratado el servicio copilot. Por favor, evalue las recomendaciones de correcciones si se aplican. Cambia en tu código local y después suba como nuevos commits. |
Corrige el tipo de excepción para responder 404 cuando el registro no existe.
…nlaces Reduce ruido en logs y mantiene la función enfocada a su retorno.
Mejora legibilidad y buenas prácticas de manejo de errores.
…a prompt de referencias Se agregan comillas a campos textuales y se corrigen comas/keys para evitar errores de parseo del prompt.
Permite traducción de 'Mixed Citation' y 'Rating from 1 to 10'.
| if mission: | ||
| return mission | ||
|
|
||
| language_order = ['pt', 'es', 'en'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eduranm verificar que hay otros tipos de idiomas
| if ' e ' in surname_buscado: | ||
| surname1 = surname_buscado.split(' e ')[0].strip().lower() | ||
| surname2 = surname_buscado.split(' e ')[1].strip().lower() | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Código original con problemas identificados:
"""
PROBLEMAS DETECTADOS:
1. Código repetitivo - viola el principio DRY (Don't Repeat Yourself)
2. Las variables surname1 y surname2 se sobrescriben múltiples veces si hay varios separadores
3. No maneja casos donde el split() podría fallar (ej: separador al final)
4. No valida que existan exactamente 2 apellidos después del split
5. Lógica difícil de mantener y extender
"""
# VERSIÓN MEJORADA 1: Usando diccionario y break
def process_surnames_v1(surname_buscados):
"""
Procesa una lista de apellidos buscando separadores comunes.
Retorna una lista de tuplas (surname1, surname2) o None si no hay separador.
"""
# Definir separadores en orden de prioridad
SEPARATORS = [' y ', ' and ', ' & ', ' e ']
results = []
for surname_buscado in surname_buscados:
surname_pair = None
# Buscar el primer separador que aplique
for separator in SEPARATORS:
if separator in surname_buscado:
parts = surname_buscado.split(separator)
if len(parts) == 2: # Validar que tengamos exactamente 2 partes
surname1 = parts[0].strip().lower()
surname2 = parts[1].strip().lower()
surname_pair = (surname1, surname2)
break # Usar el primer separador encontrado
results.append(surname_pair)
return results
# VERSIÓN MEJORADA 2: Usando expresiones regulares (más flexible)
import re
def process_surnames_v2(surname_buscados):
"""
Procesa apellidos usando regex para múltiples separadores.
Más eficiente y flexible que múltiples if statements.
"""
# Patrón que captura los separadores comunes
pattern = re.compile(r'\s+(y|and|&|e)\s+', re.IGNORECASE)
results = []
for surname_buscado in surname_buscados:
match = pattern.split(surname_buscado)
if len(match) >= 3: # [apellido1, separador, apellido2, ...]
surname1 = match[0].strip().lower()
surname2 = match[2].strip().lower()
results.append((surname1, surname2))
else:
results.append(None)
return results
# VERSIÓN MEJORADA 3: Generador con manejo robusto de errores
def process_surnames_v3(surname_buscados):
"""
Versión más robusta con validación y logging opcional.
Usa yield para eficiencia de memoria con listas grandes.
"""
SEPARATORS = {
' y ': 'spanish_and',
' and ': 'english_and',
' & ': 'ampersand',
' e ': 'portuguese_and'
}
for surname_buscado in surname_buscados:
if not surname_buscado: # Validar entrada vacía
yield None
continue
found = False
for separator, sep_type in SEPARATORS.items():
if separator in surname_buscado:
parts = surname_buscado.split(separator)
# Validación más estricta
if len(parts) == 2 and all(p.strip() for p in parts):
surname1 = parts[0].strip().lower()
surname2 = parts[1].strip().lower()
yield {
'surname1': surname1,
'surname2': surname2,
'separator_type': sep_type,
'original': surname_buscado
}
found = True
break
if not found:
yield None
# EJEMPLO DE USO:
if __name__ == "__main__":
# Datos de prueba
test_surnames = [
"García y López",
"Smith and Jones",
"Brown & Williams",
"Silva e Santos",
"Martinez", # Sin separador
" y Solo", # Caso edge: separador al inicio
"Multiple y and separators" # Múltiples separadores
]
print("=== Versión 1 (simple) ===")
for result in process_surnames_v1(test_surnames):
print(result)
print("\n=== Versión 2 (regex) ===")
for result in process_surnames_v2(test_surnames):
print(result)
print("\n=== Versión 3 (robusta) ===")
for result in process_surnames_v3(test_surnames):
if result:
print(f"{result['original']} -> {result['surname1']}, {result['surname2']} [{result['separator_type']}]")
else:
print("No separator found")
# RECOMENDACIÓN FINAL:
"""
Para tu caso en SciELO, recomendaría la versión 3 porque:
1. Es más mantenible y extensible
2. Proporciona metadata útil (tipo de separador usado)
3. Maneja casos edge de forma robusta
4. Usa generadores para eficiencia con grandes volúmenes de datos
5. Facilita el debugging y logging
6. Se integra bien con pipelines de procesamiento de artículos
Si necesitas procesar nombres de autores de artículos científicos,
considera también normalizar acentos y caracteres especiales:
from unicodedata import normalize
def normalize_name(name):
return normalize('NFKD', name).encode('ASCII', 'ignore').decode('utf-8')
"""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 53 out of 64 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
markup_doc/xml.py:1
- Type mismatch: 'full_text' field should be 'string' type, not 'integer' based on its usage context.
from lxml import etree
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| CreateView, | ||
| EditView, |
Copilot
AI
Oct 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate import of CreateView and EditView. These are already imported at line 8 from wagtail_modeladmin.views.
| CreateView, | |
| EditView, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
¿Vamos a usar CreateView a partir de modeladmin o de snippets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sugiero borrar la línea 8
reference/models.py
Outdated
| estatus = models.IntegerField( | ||
| _("Reference estatus"), | ||
| choices=ReferenceStatus.choices, | ||
| blank=True, | ||
| default=ReferenceStatus.NO_REFERENCE | ||
| ) |
Copilot
AI
Oct 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected spelling of 'estatus' to 'status' in field name and verbose name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vamos a estandarizar los nombres de los campos al idioma inglés (status)
| download_status = models.IntegerField( | ||
| _("Local model estatus"), | ||
| choices=DownloadStatus.choices, | ||
| blank=True, | ||
| default=DownloadStatus.NO_MODEL | ||
| ) |
Copilot
AI
Oct 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected spelling of 'estatus' to 'status' in verbose name.
| def clean(self): | ||
| if not self.pk and LlamaModel.objects.exists(): | ||
| raise ValidationError(_("Only one instance of LlamaModel is allowed.")) | ||
|
|
Copilot
AI
Oct 13, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Singleton pattern validation in clean() method will prevent creating multiple instances via admin, but won't prevent programmatic creation. Consider using a custom manager or database constraint for stronger enforcement.
| def save(self, *args, **kwargs): | |
| if not self.pk and LlamaModel.objects.exists(): | |
| raise ValidationError(_("Only one instance of LlamaModel is allowed.")) | |
| super().save(*args, **kwargs) |
…eference status' (incluye migraciones)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hay algunos métodos muy grandes que podrían reducirse a métodos pequeños. Esto facilitaría el mantenimiento del código, sobre todo en la parte que manipula datos en Docx. Llegué a implementar alguna forma de instanciar el GenericLlama usando el patrón Singleton y con algo de manejo de errores. Podría ser interesante aprovechar ese código en el GenericLlama de este PR.
| url = url.lower() | ||
| url_parts = url.split("/") | ||
| if not url_parts: | ||
| return {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
¿No sería mejor usar urllib.parse para convertir la cadena URL?
from urllib.parse import urlparse, parse_qs
url = "https://example.com/path/page?arg1=val1&arg2=val2#section"
# quebra URL em componentes
parts = urlparse(url)
print(parts.scheme) # https
print(parts.netloc) # example.com
print(parts.path) # /path/page
print(parts.query) # arg1=val1&arg2=val2
print(parts.fragment) # section
# parse de query string
query_params = parse_qs(parts.query)
print(query_params) # {'arg1': ['val1'], 'arg2': ['val2']}| return dict( | ||
| license_type=license_type, | ||
| license_version=license_version, | ||
| license_language=license_language, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
El método devuelve esto:
return dict(
license_type=license_type,
license_version=license_version,
license_language=license_language,
)¿No sería mejor darle otro nombre al método? Por ejemplo, extract_license_data_from_url?
reference/models.py
Outdated
| # Create your models here | ||
| class Reference(CommonControlField, ClusterableModel): | ||
| mixed_citation = models.TextField(_("Mixed Citation"), null=False, blank=True) | ||
| mixed_citation = models.TextField(("Mixed Citation"), null=False, blank=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Necesita incluir el _ antes de (
mixed_citation = models.TextField(_("Mixed Citation"), null=False, blank=True)
reference/models.py
Outdated
| estatus = models.IntegerField( | ||
| _("Reference estatus"), | ||
| choices=ReferenceStatus.choices, | ||
| blank=True, | ||
| default=ReferenceStatus.NO_REFERENCE | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vamos a estandarizar los nombres de los campos al idioma inglés (status)
reference/models.py
Outdated
| MaxValueValidator(10) # Máximo 10 | ||
| ], | ||
| help_text=_("Rating from 1 to 10") | ||
| help_text="Rating from 1 to 10" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
¿Hay alguna razón especial para eliminar la traducción _()?
| return ' '.join(links) if links else None | ||
|
|
||
|
|
||
| def extractContent(self, doc, doc_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract_content
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Después vamos a necesitar reducir el tamaño de la mayoría de los métodos siguientes. Algo mucho más pequeño, cada método haciendo solo una cosa.
|
|
||
| return clean_text.strip() | ||
|
|
||
| def extrae_Tabla(element, rels_map, namespaces): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extrae_table (o extract_table)
| from config.settings.base import LLAMA_MODEL_DIR | ||
| from llama_cpp import Llama | ||
| import os | ||
| import google.generativeai as genai | ||
| from model_ai.models import LlamaModel | ||
|
|
||
| class GenericLlama: | ||
|
|
||
| def __init__(self, messages=None, response_format=None, max_tokens=4000, temperature=0.1, top_p=0.1, type='chat'): | ||
| model_ai = LlamaModel.objects.first() | ||
| self.llm = Llama(model_path = os.path.join(LLAMA_MODEL_DIR, model_ai.name_file), n_threads=2, n_ctx=5000) | ||
| self.messages = messages | ||
| self.response_format = response_format | ||
| self.max_tokens = max_tokens | ||
| self.temperature = temperature | ||
| self.top_p = top_p | ||
| self.type = type | ||
|
|
||
| def run(self, user_input): | ||
| if self.type == 'chat': | ||
| input = self.messages.copy() | ||
| input.append({ | ||
| 'role': 'user', | ||
| 'content': user_input | ||
| }) | ||
| return self.llm.create_chat_completion(messages=input, response_format=self.response_format, max_tokens=self.max_tokens, temperature=self.temperature, top_p=self.top_p) | ||
| if self.type == 'prompt': | ||
| #USO DE GEMINI | ||
| model_ai = LlamaModel.objects.first() | ||
| if model_ai and model_ai.api_key_gemini: | ||
| genai.configure(api_key=model_ai.api_key_gemini) | ||
| model = genai.GenerativeModel('models/gemini-2.0-flash') | ||
| return model.generate_content(user_input).text | ||
| else: | ||
| return self.llm(user_input, max_tokens=2000, temperature=self.temperature, stop=["\n\n"]) No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aquí hay una versión actualizada del GenericLlama, pensando en el manejo de errores y en la optimización usando Singleton: https://github.com/pitangainnovare/markapi/blob/main/llama3/generic_llama.py
| self.attrs.update({'disabled': 'disabled'}) | ||
|
|
||
|
|
||
| class LlamaModel(CommonControlField): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LLMModel ?
| from model_ai.tasks import download_model | ||
|
|
||
|
|
||
| class LlamaModelCreateView(CreateView): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Llama es solo uno de los muchos posibles modelos. ¿Qué tal si llamamos al modelo LLMModel o AIModel?
LLMModelCreateView
Creación y ampliación de la app core (modelos comunes, formularios, utilidades, vistas y hooks).
Inclusión de app markup_doc para el procesamiento y marcado de artículos DOCX (modelos, API REST, tareas y hooks).
Inclusión de la librería markuplib con funciones para DOCX y transformaciones OMML → MathML.
Creación de la app model_ai para integración con modelos LLaMA, utilidades de inferencia y tareas de IA.
Módulos en reference (configuración de Gemini y utilidades de datos).
Consolidación y limpieza de migraciones en reference, core y core_settings.