non-boundary html tags are being mixed with sibling boundary tags #35

@hassannteifeh

Hey!

Context:

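// Assumption: `tokenizer` is the sbd package; the original report does not show the import.
const tokenizer = require('sbd')

const options = {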
    newline_boundaries: true,
    html_boundaries: true,
    html_boundaries_tags: [
        'br',
        'p',
        'h1',
        'h2',
        'h3',
        'h4',
        'h5',
        'h6',
        'ul',
        'div',
        'figcaption',
    ],
    sanitize: true,
    preserve_whitespace: true,
}

const html = `<article> <span>a span here</span><h1>This is a a very cool title.</h1></article>`

console.log(tokenizer.sentences(html, options))

Expected Result:

[ 'a span here', 'This is a a very cool title.' ]

Actual Result:

[ 'a span hereThis is a a very cool title.' ]

I realise that <span> is not listed as a boundary HTML tag, but in my opinion its content still shouldn't leak into the text of a sibling boundary tag. The opening <h1> should start a new sentence even though the preceding <span> is not a boundary itself.
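
To make the expected behaviour concrete, here is a minimal sketch, assuming a simple regex-based split is acceptable for illustration. `splitAtBoundaryTags` is a hypothetical helper, not part of the tokenizer's API, and it is not a real HTML parser: it only shows that treating both the opening and closing boundary tags as split points keeps the <span> text separate from the <h1> text.

```javascript
// Hypothetical helper (not part of the tokenizer): split HTML into text chunks
// at every opening or closing boundary tag, then strip any remaining tags.
function splitAtBoundaryTags(html, boundaryTags) {
    // Matches <tag>, <tag attr="...">, <tag/> and </tag> for the listed tags only.
    // The \b keeps 'p' from matching longer tag names such as <pre>.
    const boundary = new RegExp(`</?(?:${boundaryTags.join('|')})\\b[^>]*>`, 'gi')

    return html
        .split(boundary)
        // Drop non-boundary tags such as <span> and <article>, keeping their text.
        .map((chunk) => chunk.replace(/<[^>]*>/g, '').trim())
        // Discard chunks that contained only markup or whitespace.
        .filter((chunk) => chunk.length > 0)
}

const boundaryTags = ['br', 'p', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ul', 'div', 'figcaption']
const sample = `<article> <span>a span here</span><h1>This is a a very cool title.</h1></article>`

console.log(splitAtBoundaryTags(sample, boundaryTags))
```

With the sample input from this report, the sketch yields the two separate strings listed under "Expected Result". It does not do any sentence detection inside a chunk; that part would still be left to the tokenizer.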
