Skip to content

Conversation

@eduranm
Copy link
Contributor

@eduranm eduranm commented Sep 26, 2025

Creación y ampliación de la app core (modelos comunes, formularios, utilidades, vistas y hooks).

Inclusión de app markup_doc para el procesamiento y marcado de artículos DOCX (modelos, API REST, tareas y hooks).

Inclusión de la librería markuplib con funciones para DOCX y transformaciones OMML → MathML.

Creación de la app model_ai para integración con modelos LLaMA, utilidades de inferencia y tareas de IA.

Módulos en reference (configuración de Gemini y utilidades de datos).

Consolidación y limpieza de migraciones en reference, core y core_settings.

… de nuevas apps y aumento del límite de campos
…s, textos con idioma y manejo flexible de fechas
…s y ampliación de tipos soportados (confproc, full_text, etc.)
…istas de búsqueda, utilidades y hooks de Wagtail
…ial.py y eliminación de migraciones intermedias
…n de Django y traducción de verbose_name a inglés
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR incorporates significant new functionality into the codebase by adding markup document processing, AI model integration, and utilities for handling DOCX files and references. The changes focus on creating a comprehensive document processing pipeline with AI-powered metadata extraction and XML generation capabilities.

Key Changes

  • Creation of new apps markup_doc and model_ai for document processing and AI model management
  • Integration of DOCX processing capabilities with OMML to MathML transformation
  • Implementation of reference processing with LLaMA model integration
  • Consolidation and cleanup of database migrations

Reviewed Changes

Copilot reviewed 53 out of 64 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
requirements/base.txt Added dependencies for Google Generative AI, langid, and python-docx
reference/wagtail_hooks.py Updated admin interface to use new snippet system and model AI tasks
reference/models.py Enhanced Reference model with status choices and improved field definitions
reference/migrations/ Consolidated migration files for cleaner database schema
reference/marker.py Updated import path for GenericLlama class
reference/data_utils.py Added ReferenceStatus usage and updated task imports
reference/config_gemini.py New configuration for Gemini API integration with reference processing
reference/config.py Enhanced configuration with JSON formatting and additional reference types
reference/api/v1/views.py Updated to use ReferenceStatus enum
model_ai/ Complete new app for AI model management and inference utilities
markuplib/ New library with DOCX processing functions and OMML to MathML transformation
markup_doc/ Complete new app for document markup processing with XML generation

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

# Create your models here
class Reference(CommonControlField, ClusterableModel):
mixed_citation = models.TextField(_("Mixed Citation"), null=False, blank=True)
mixed_citation = models.TextField(("Mixed Citation"), null=False, blank=True)
Copy link

Copilot AI Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing underscore function call. Should be _('Mixed Citation') to enable internationalization.

Suggested change
mixed_citation = models.TextField(("Mixed Citation"), null=False, blank=True)
mixed_citation = models.TextField(_("Mixed Citation"), null=False, blank=True)

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eduranm

Necesita incluir el _ antes de (

mixed_citation = models.TextField(_("Mixed Citation"), null=False, blank=True)

MaxValueValidator(10) # Máximo 10
],
help_text=_("Rating from 1 to 10")
help_text="Rating from 1 to 10"
Copy link

Copilot AI Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field should use the internationalization function. Change to help_text=_('Rating from 1 to 10').

Suggested change
help_text="Rating from 1 to 10"
help_text=_("Rating from 1 to 10")

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eduranm

¿Hay alguna razón especial para eliminar la traducción _()?

instance = LlamaModel.objects.get(id=id)
get_model(instance.hf_token, instance.name_model, instance.name_file)
instance.download_status = DownloadStatus.DOWNLOADED
except:
Copy link

Copilot AI Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using bare except: clause catches all exceptions including system exits. Use except Exception: instead to catch only recoverable exceptions.

Suggested change
except:
except Exception:

Copilot uses AI. Check for mistakes.
"source": "The role of physical activity on the need for revision total knee arthroplasty in individuals with osteoarthritis of the knee [dissertation]",
"organization": "University of Pittsburgh",
"location": "[Pittsburgh (PA)]",
"num_pages": 436 p
Copy link

Copilot AI Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invalid syntax - missing quotes around the value. Should be 'num_pages': '436 p' or 'num_pages': 436 if it's meant to be numeric.

Suggested change
"num_pages": 436 p
"num_pages": "436 p"

Copilot uses AI. Check for mistakes.
"date": 1903,
"source": "Plant geography upon a physiological basis",
"organization": "Clarendon Press",
"location": "Oxford, UK"
Copy link

Copilot AI Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing comma after this field, which will cause a syntax error.

Suggested change
"location": "Oxford, UK"
"location": "Oxford, UK",

Copilot uses AI. Check for mistakes.
],
"date": c2005,
"source": "Advanced practice nursing: an integrative approach",
"edition: "3rd ed"
Copy link

Copilot AI Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing closing quote and comma. Should be 'edition': '3rd ed',.

Suggested change
"edition: "3rd ed"
"edition": "3rd ed",

Copilot uses AI. Check for mistakes.
"edition: "3rd ed"
"organization": "Elsevier Saunders",
"location": "St. Louis (MO)",
"num_pages": 979 p
Copy link

Copilot AI Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Invalid syntax - missing quotes around the value. Should be 'num_pages': '979 p' or 'num_pages': 979 if numeric.

Suggested change
"num_pages": 979 p
"num_pages": "979 p"

Copilot uses AI. Check for mistakes.
"date": 1995,
"source": "Inflammatory bowel disease",
"chapter": "The epidemiology of idiopathic inflammatory bowel disease.",
"edition: "4th"
Copy link

Copilot AI Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing closing quote and comma. Should be 'edition': '4th',.

Suggested change
"edition: "4th"
"edition": "4th",

Copilot uses AI. Check for mistakes.
response['Content-Disposition'] = f'attachment; filename="{nombre_archivo}"'

return response
except TuModelo.DoesNotExist:
Copy link

Copilot AI Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TuModelo is not defined. Should be ArticleDocxMarkup.DoesNotExist based on the context.

Suggested change
except TuModelo.DoesNotExist:
except ArticleDocxMarkup.DoesNotExist:

Copilot uses AI. Check for mistakes.
Comment on lines 122 to 123
print("*********************************")
print(links)
Copy link

Copilot AI Sep 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug print statements should be removed from production code or replaced with proper logging.

Copilot uses AI. Check for mistakes.
@robertatakenaka
Copy link
Member

@eduranm he contratado el servicio copilot. Por favor, evalue las recomendaciones de correcciones si se aplican. Cambia en tu código local y después suba como nuevos commits.

Corrige el tipo de excepción para responder 404 cuando el registro no existe.
…nlaces

Reduce ruido en logs y mantiene la función enfocada a su retorno.
Mejora legibilidad y buenas prácticas de manejo de errores.
…a prompt de referencias

Se agregan comillas a campos textuales y se corrigen comas/keys para evitar errores de parseo del prompt.
Permite traducción de 'Mixed Citation' y 'Rating from 1 to 10'.
if mission:
return mission

language_order = ['pt', 'es', 'en']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eduranm verificar que hay otros tipos de idiomas

if ' e ' in surname_buscado:
surname1 = surname_buscado.split(' e ')[0].strip().lower()
surname2 = surname_buscado.split(' e ')[1].strip().lower()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# Código original con problemas identificados:
"""
PROBLEMAS DETECTADOS:
1. Código repetitivo - viola el principio DRY (Don't Repeat Yourself)
2. Las variables surname1 y surname2 se sobrescriben múltiples veces si hay varios separadores
3. No maneja casos donde el split() podría fallar (ej: separador al final)
4. No valida que existan exactamente 2 apellidos después del split
5. Lógica difícil de mantener y extender
"""

# VERSIÓN MEJORADA 1: Usando diccionario y break
def process_surnames_v1(surname_buscados):
    """
    Procesa una lista de apellidos buscando separadores comunes.
    Retorna una lista de tuplas (surname1, surname2) o None si no hay separador.
    """
    # Definir separadores en orden de prioridad
    SEPARATORS = [' y ', ' and ', ' & ', ' e ']
    
    results = []
    for surname_buscado in surname_buscados:
        surname_pair = None
        
        # Buscar el primer separador que aplique
        for separator in SEPARATORS:
            if separator in surname_buscado:
                parts = surname_buscado.split(separator)
                if len(parts) == 2:  # Validar que tengamos exactamente 2 partes
                    surname1 = parts[0].strip().lower()
                    surname2 = parts[1].strip().lower()
                    surname_pair = (surname1, surname2)
                    break  # Usar el primer separador encontrado
        
        results.append(surname_pair)
    
    return results


# VERSIÓN MEJORADA 2: Usando expresiones regulares (más flexible)
import re

def process_surnames_v2(surname_buscados):
    """
    Procesa apellidos usando regex para múltiples separadores.
    Más eficiente y flexible que múltiples if statements.
    """
    # Patrón que captura los separadores comunes
    pattern = re.compile(r'\s+(y|and|&|e)\s+', re.IGNORECASE)
    
    results = []
    for surname_buscado in surname_buscados:
        match = pattern.split(surname_buscado)
        if len(match) >= 3:  # [apellido1, separador, apellido2, ...]
            surname1 = match[0].strip().lower()
            surname2 = match[2].strip().lower()
            results.append((surname1, surname2))
        else:
            results.append(None)
    
    return results


# VERSIÓN MEJORADA 3: Generador con manejo robusto de errores
def process_surnames_v3(surname_buscados):
    """
    Versión más robusta con validación y logging opcional.
    Usa yield para eficiencia de memoria con listas grandes.
    """
    SEPARATORS = {
        ' y ': 'spanish_and',
        ' and ': 'english_and', 
        ' & ': 'ampersand',
        ' e ': 'portuguese_and'
    }
    
    for surname_buscado in surname_buscados:
        if not surname_buscado:  # Validar entrada vacía
            yield None
            continue
            
        found = False
        for separator, sep_type in SEPARATORS.items():
            if separator in surname_buscado:
                parts = surname_buscado.split(separator)
                
                # Validación más estricta
                if len(parts) == 2 and all(p.strip() for p in parts):
                    surname1 = parts[0].strip().lower()
                    surname2 = parts[1].strip().lower()
                    yield {
                        'surname1': surname1,
                        'surname2': surname2,
                        'separator_type': sep_type,
                        'original': surname_buscado
                    }
                    found = True
                    break
        
        if not found:
            yield None


# EJEMPLO DE USO:
if __name__ == "__main__":
    # Datos de prueba
    test_surnames = [
        "García y López",
        "Smith and Jones",
        "Brown & Williams",
        "Silva e Santos",
        "Martinez",  # Sin separador
        " y Solo",    # Caso edge: separador al inicio
        "Multiple y and separators"  # Múltiples separadores
    ]
    
    print("=== Versión 1 (simple) ===")
    for result in process_surnames_v1(test_surnames):
        print(result)
    
    print("\n=== Versión 2 (regex) ===")
    for result in process_surnames_v2(test_surnames):
        print(result)
    
    print("\n=== Versión 3 (robusta) ===")
    for result in process_surnames_v3(test_surnames):
        if result:
            print(f"{result['original']} -> {result['surname1']}, {result['surname2']} [{result['separator_type']}]")
        else:
            print("No separator found")


# RECOMENDACIÓN FINAL:
"""
Para tu caso en SciELO, recomendaría la versión 3 porque:

1. Es más mantenible y extensible
2. Proporciona metadata útil (tipo de separador usado)
3. Maneja casos edge de forma robusta
4. Usa generadores para eficiencia con grandes volúmenes de datos
5. Facilita el debugging y logging
6. Se integra bien con pipelines de procesamiento de artículos

Si necesitas procesar nombres de autores de artículos científicos,
considera también normalizar acentos y caracteres especiales:

from unicodedata import normalize

def normalize_name(name):
    return normalize('NFKD', name).encode('ASCII', 'ignore').decode('utf-8')
"""

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 53 out of 64 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

markup_doc/xml.py:1

  • Type mismatch: 'full_text' field should be 'string' type, not 'integer' based on its usage context.
from lxml import etree

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +18 to +19
CreateView,
EditView,
Copy link

Copilot AI Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate import of CreateView and EditView. These are already imported at line 8 from wagtail_modeladmin.views.

Suggested change
CreateView,
EditView,

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eduranm

¿Vamos a usar CreateView a partir de modeladmin o de snippets?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sugiero borrar la línea 8

Comment on lines 24 to 29
estatus = models.IntegerField(
_("Reference estatus"),
choices=ReferenceStatus.choices,
blank=True,
default=ReferenceStatus.NO_REFERENCE
)
Copy link

Copilot AI Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'estatus' to 'status' in field name and verbose name.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eduranm

Vamos a estandarizar los nombres de los campos al idioma inglés (status)

Comment on lines 38 to 43
download_status = models.IntegerField(
_("Local model estatus"),
choices=DownloadStatus.choices,
blank=True,
default=DownloadStatus.NO_MODEL
)
Copy link

Copilot AI Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected spelling of 'estatus' to 'status' in verbose name.

Copilot uses AI. Check for mistakes.
def clean(self):
if not self.pk and LlamaModel.objects.exists():
raise ValidationError(_("Only one instance of LlamaModel is allowed."))

Copy link

Copilot AI Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Singleton pattern validation in clean() method will prevent creating multiple instances via admin, but won't prevent programmatic creation. Consider using a custom manager or database constraint for stronger enforcement.

Suggested change
def save(self, *args, **kwargs):
if not self.pk and LlamaModel.objects.exists():
raise ValidationError(_("Only one instance of LlamaModel is allowed."))
super().save(*args, **kwargs)

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@pitangainnovare pitangainnovare left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eduranm
@robertatakenaka

Hay algunos métodos muy grandes que podrían reducirse a métodos pequeños. Esto facilitaría el mantenimiento del código, sobre todo en la parte que manipula datos en Docx. Llegué a implementar alguna forma de instanciar el GenericLlama usando el patrón Singleton y con algo de manejo de errores. Podría ser interesante aprovechar ese código en el GenericLlama de este PR.

Comment on lines +475 to +478
url = url.lower()
url_parts = url.split("/")
if not url_parts:
return {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eduranm

¿No sería mejor usar urllib.parse para convertir la cadena URL?

from urllib.parse import urlparse, parse_qs

url = "https://example.com/path/page?arg1=val1&arg2=val2#section"

# quebra URL em componentes
parts = urlparse(url)
print(parts.scheme)   # https
print(parts.netloc)   # example.com
print(parts.path)     # /path/page
print(parts.query)    # arg1=val1&arg2=val2
print(parts.fragment) # section

# parse de query string
query_params = parse_qs(parts.query)
print(query_params)  # {'arg1': ['val1'], 'arg2': ['val2']}

Comment on lines +502 to +505
return dict(
license_type=license_type,
license_version=license_version,
license_language=license_language,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eduranm

El método devuelve esto:

return dict(
    license_type=license_type,
    license_version=license_version,
    license_language=license_language,
)

¿No sería mejor darle otro nombre al método? Por ejemplo, extract_license_data_from_url?

# Create your models here
class Reference(CommonControlField, ClusterableModel):
mixed_citation = models.TextField(_("Mixed Citation"), null=False, blank=True)
mixed_citation = models.TextField(("Mixed Citation"), null=False, blank=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eduranm

Necesita incluir el _ antes de (

mixed_citation = models.TextField(_("Mixed Citation"), null=False, blank=True)

Comment on lines 24 to 29
estatus = models.IntegerField(
_("Reference estatus"),
choices=ReferenceStatus.choices,
blank=True,
default=ReferenceStatus.NO_REFERENCE
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eduranm

Vamos a estandarizar los nombres de los campos al idioma inglés (status)

MaxValueValidator(10) # Máximo 10
],
help_text=_("Rating from 1 to 10")
help_text="Rating from 1 to 10"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eduranm

¿Hay alguna razón especial para eliminar la traducción _()?

return ' '.join(links) if links else None


def extractContent(self, doc, doc_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract_content

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eduranm

Después vamos a necesitar reducir el tamaño de la mayoría de los métodos siguientes. Algo mucho más pequeño, cada método haciendo solo una cosa.


return clean_text.strip()

def extrae_Tabla(element, rels_map, namespaces):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extrae_table (o extract_table)

Comment on lines +1 to +35
from config.settings.base import LLAMA_MODEL_DIR
from llama_cpp import Llama
import os
import google.generativeai as genai
from model_ai.models import LlamaModel

class GenericLlama:

def __init__(self, messages=None, response_format=None, max_tokens=4000, temperature=0.1, top_p=0.1, type='chat'):
model_ai = LlamaModel.objects.first()
self.llm = Llama(model_path = os.path.join(LLAMA_MODEL_DIR, model_ai.name_file), n_threads=2, n_ctx=5000)
self.messages = messages
self.response_format = response_format
self.max_tokens = max_tokens
self.temperature = temperature
self.top_p = top_p
self.type = type

def run(self, user_input):
if self.type == 'chat':
input = self.messages.copy()
input.append({
'role': 'user',
'content': user_input
})
return self.llm.create_chat_completion(messages=input, response_format=self.response_format, max_tokens=self.max_tokens, temperature=self.temperature, top_p=self.top_p)
if self.type == 'prompt':
#USO DE GEMINI
model_ai = LlamaModel.objects.first()
if model_ai and model_ai.api_key_gemini:
genai.configure(api_key=model_ai.api_key_gemini)
model = genai.GenerativeModel('models/gemini-2.0-flash')
return model.generate_content(user_input).text
else:
return self.llm(user_input, max_tokens=2000, temperature=self.temperature, stop=["\n\n"]) No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eduranm

Aquí hay una versión actualizada del GenericLlama, pensando en el manejo de errores y en la optimización usando Singleton: https://github.com/pitangainnovare/markapi/blob/main/llama3/generic_llama.py

self.attrs.update({'disabled': 'disabled'})


class LlamaModel(CommonControlField):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LLMModel ?

from model_ai.tasks import download_model


class LlamaModelCreateView(CreateView):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Llama es solo uno de los muchos posibles modelos. ¿Qué tal si llamamos al modelo LLMModel o AIModel?

LLMModelCreateView

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants