Enhance `#scan_integer` to check for valid character following it

Recently the new method `#scan_integer` was introduced (see #113) to optimize scanning integer values.

The current implementation works regardless of what follows the integer, i.e. scanning `123`, `123 something`, `123,something`, `123.32` and `123something` all work and would return 123.
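The behaviour described above can be reproduced with a short script. This is only a sketch: `scan_int` is a hypothetical wrapper (not part of strscan) that calls `#scan_integer` when the installed strscan provides it and otherwise falls back to a regexp scan that is assumed to be equivalent for base-10 input.

~~~ ruby
require "strscan"

# Hypothetical wrapper: returns the integer at the current scan position or nil.
# Falls back to a regexp scan on strscan versions without #scan_integer
# (assumption: base 10 with an optional sign).
def scan_int(scanner)
  if scanner.respond_to?(:scan_integer)
    scanner.scan_integer
  else
    scanner.scan(/[+-]?\d+/)&.to_i
  end
end

inputs = ["123", "123 something", "123,something", "123.32", "123something"]
inputs.each do |input|
  scanner = StringScanner.new(input)
  # Prints the scanned value and whatever was left unscanned.
  p [scan_int(scanner), scanner.rest]
end
~~~

Every input yields 123, and the scanner simply stops in front of whatever follows the digits.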

However, I suspect that in many cases a run of digits is only a valid integer if it is followed (or not followed) by certain characters. One example is the input `123d`, which leads to an error when interpreted as Ruby code.

My use case is PDF syntax. In PDF, a token is an integer only when it is followed by a whitespace character (ASCII decimal 0, 9, 10, 12, 13 or 32) or a delimiter character (`( ) < > [ ] / %`); otherwise it is a generic token. To handle this, the implementation using `#scan_integer` looks like this:

~~~ ruby
    # Parses the number (integer or real) at the current position.
    #
    # See: PDF2.0 s7.3.3
    def parse_number
      prepare_string_scanner(20)
      pos = self.pos
      if (tmp = @ss.scan_integer)
        if @ss.eos? || @ss.match?(WHITESPACE_OR_DELIMITER_RE)
          # Handle object references, see PDF2.0 s7.3.10
          prepare_string_scanner(10)
          if @ss.scan(REFERENCE_RE)
            tmp = if tmp > 0
                    Reference.new(tmp, @ss[1].to_i)
                  else
                    maybe_raise("Invalid indirect object reference (#{tmp},#{@ss[1].to_i})")
                    nil
                  end
          end
          return tmp
        else
          self.pos = pos
        end
      end

      val = scan_until(WHITESPACE_OR_DELIMITER_RE) || @ss.scan(/.*/)
      if val.match?(/\A[+-]?(?:\d+\.\d*|\.\d+)\z/)
        val << '0' if val.getbyte(-1) == 46 # dot '.'
        Float(val)
      else
        TOKEN_CACHE[val] # val is keyword
      end
    end
~~~

As you can see we

1. need to store the current scan position,
2. check if scanning an integer works at the current position,
3. scan the content after the integer to verify that it is indeed an integer and work with it, or
4. if the previous step didn't work, reset the scan position.

This could be simplified to a single call of `#scan_integer` if the method could optionally check the content that follows the integer. Something like `#scan_integer(separator: SEPARATOR_PATTERN)` or maybe `#scan_integer(separator_chars: STRING)` (where `STRING` contains the separator characters, similar to how `String#tr` works).
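To make the intended semantics concrete, here is a rough sketch of what such an option might do, written as a plain helper. The name `scan_integer_with_separator` and the separator pattern are assumptions for illustration only; neither exists in strscan, and the pattern is just my reading of the PDF whitespace and delimiter characters listed above.

~~~ ruby
require "strscan"

# Hypothetical helper, not part of strscan: returns the integer at the current
# position only if it is followed by end of input or by text matching
# +separator+. Otherwise the scan position is restored and nil is returned.
def scan_integer_with_separator(scanner, separator)
  start = scanner.pos
  value = if scanner.respond_to?(:scan_integer)
            scanner.scan_integer
          else
            scanner.scan(/[+-]?\d+/)&.to_i # stand-in for older strscan versions
          end
  return nil if value.nil?
  return value if scanner.eos? || scanner.match?(separator)

  scanner.pos = start # digits were part of a larger token, undo the scan
  nil
end

# Assumed pattern: PDF whitespace (0, 9, 10, 12, 13, 32) and delimiters.
pdf_separator = /[\x00\t\n\f\r ()<>\[\]\/%]/

scanner = StringScanner.new("123 456abc")
p scan_integer_with_separator(scanner, pdf_separator) # 123
scanner.skip(/ /)
p scan_integer_with_separator(scanner, pdf_separator) # nil
p scanner.rest                                        # "456abc"
~~~

With built-in support, the position bookkeeping and the follow-up `match?` would move into `#scan_integer` itself, so callers would not have to save and restore `pos`.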

Would it make sense to include such functionality?

Enhance `#scan_integer` to check for valid character following it #119
