Ignore Scopes in word count [including proof of concept] #99

@mbroedl

Following a discussion with @X-Raym in the recently accepted PR #94, I wondered why we could not simply ignore scopes instead of re-inventing, as regexes, what the grammars already implement.

Reading the tokenised lines is possible in principle. Combined with a scope selector, this makes it possible to filter (positively or negatively) all text elements that match certain scopes.
This would also make it possible to ignore punctuation (much requested, although now handled by a new word count regex), different grammars, etc. (see e.g. #55 and #65).

The following minimal example filters out all scopes that are comments, quotes, or punctuation in their respective grammar. It should be possible to paste it into Atom's console as-is.

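// Scope selectors whose text should be excluded from the word count.
let scopes = ['comment.*', 'quote.*', 'punctuation.*'];
let editor = atom.workspace.getActiveTextEditor();
let ScopeSelector = require('first-mate').ScopeSelector;

// Flatten the tokens of every tokenised line into one list (undocumented API).
let buffer = [];
editor.displayBuffer.tokenizedBuffer.tokenizedLines.forEach(line => line.tokens.forEach(token => buffer.push(token)));

// Drop every token that matches one of the ignored selectors.
scopes.forEach(scope => {
    let selector = new ScopeSelector(scope);
    buffer = buffer.filter(token => !selector.matches(token.scopes));
});

let text = buffer.map(token => token.value).join('');
// match() returns null when nothing matches, so fall back to an empty array.
console.log((text.match(/\S+/g) || []).length);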

At the moment it flattens all lines into a single buffer to make looping easier. This discards line breaks and may therefore not be desirable, but I thought it would suffice as a proof of concept. It does change the final word count slightly, presumably because words on either side of a dropped line break run together. It should not be difficult to loop through the tokens line by line instead and keep only those that do not match the ignored selectors.
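The per-line variant could look roughly like the sketch below. It is written as a plain function so it can be tried outside Atom: `countWords` and `makeMatcher` are hypothetical names, the token shape `{ value, scopes }` is assumed from the example above, and `makeMatcher` is only a simplified stand-in for first-mate's `ScopeSelector`, not its real matching logic.

```javascript
// Hypothetical stand-in for first-mate's ScopeSelector (assumption, simplified):
// 'comment.*' matches any scope equal to 'comment' or nested under it.
function makeMatcher(pattern) {
  const prefix = pattern.replace(/\.\*$/, '');
  return scopes => scopes.some(s => s === prefix || s.startsWith(prefix + '.'));
}

// Count words line by line, skipping tokens that match any ignored pattern.
// `lines` is assumed to be an array of token arrays, each token { value, scopes }.
function countWords(lines, ignoredPatterns) {
  const matchers = ignoredPatterns.map(makeMatcher);
  let count = 0;
  for (const tokens of lines) {
    const kept = tokens
      .filter(token => !matchers.some(matches => matches(token.scopes)))
      .map(token => token.value)
      .join('');
    // Counting per line means words never merge across a line break.
    count += (kept.match(/\S+/g) || []).length;
  }
  return count;
}

// Hand-written tokens for illustration (not real Atom output).
const lines = [
  [{ value: 'let x = 1; ', scopes: ['source.js'] },
   { value: '// set x', scopes: ['source.js', 'comment.line.double-slash.js'] }],
  [{ value: 'foo', scopes: ['source.js'] }],
];
console.log(countWords(lines, ['comment.*'])); // prints 5
```

In Atom itself, the matcher would presumably be replaced by `new ScopeSelector(scope).matches(token.scopes)` and `lines` by the tokens of each tokenised line.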

Maybe, instead of the many current settings, there could be a single text box listing all scopes to be ignored, and a small pop-up menu on right-click on the word count (similar to the one for the minimap) to activate or deactivate individual scopes.
This would reduce bulk in the settings (see discussions elsewhere in this package).
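As a rough sketch of what that single setting might look like: the shape below follows Atom's package config schema (an `array` of `string` items), but the setting name `ignoredScopes`, its defaults, and the `toggleScope` helper are assumptions for illustration, not anything the package currently defines.

```javascript
// Hypothetical config schema for a single "ignored scopes" setting.
const config = {
  ignoredScopes: {
    title: 'Ignored scopes',
    description: 'Scope selectors whose text is excluded from the word count.',
    type: 'array',
    default: ['comment.*', 'quote.*', 'punctuation.*'],
    items: { type: 'string' },
  },
};

// Hypothetical helper for the right-click menu: returns a new list of active
// scopes with `scope` removed if it was active, or added if it was not.
function toggleScope(activeScopes, scope) {
  return activeScopes.includes(scope)
    ? activeScopes.filter(s => s !== scope)
    : [...activeScopes, scope];
}

let active = config.ignoredScopes.default;
active = toggleScope(active, 'quote.*');   // deactivate quotes
active = toggleScope(active, 'markup.*');  // activate an extra scope
console.log(active); // [ 'comment.*', 'punctuation.*', 'markup.*' ]
```

Keeping the full list in one setting and the active subset in the menu state would let users disable a scope temporarily without editing the settings text box.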

Disclaimer:
(A) The tokenised lines are not documented and thus subject to change without further notice, although the current selector seems to have been stable for at least two years.
(B) I do not know how much the speed of calculation would suffer from this way of filtering (compared to regex, and compared to not filtering).

Looking forward to any discussions!
