Commit 19f7d9e

schema: Specify the encoding for character offsets (#224)
This PR adds a new `PositionEncoding` field so that indexers can specify which kind of character offsets they use. This lets consumers of SCIP interpret the offsets unambiguously for non-ASCII data.

I have kept it as a field on `Document` rather than `Index` because:

1. There is no additional benefit from having it on `Index`, because occurrences only appear inside Documents, not outside them.
2. It allows one to concatenate indexes from different sources that use different kinds of offsets.
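To illustrate why the offsets are ambiguous for non-ASCII data, here is a minimal Python sketch that computes the offset of `'W'` in the string "🚀 Woo", the same string used in the schema comments of this diff. The helper name `offsets_for` and the dictionary keys are hypothetical and are not part of SCIP. The `code_points` entry assumes that the example given for `UTF8CodeUnitOffsetFromLineStart`, which treats '🚀' as a single unit, corresponds to counting Unicode code points.

```python
def offsets_for(line: str, index: int) -> dict:
    """Offset of line[index] from the line start under three counting schemes.

    `index` is a Python string index, i.e. a count of Unicode code points.
    """
    prefix = line[:index]
    return {
        # Number of bytes in the UTF-8 encoding of the prefix.
        "utf8_bytes": len(prefix.encode("utf-8")),
        # Number of code points in the prefix (Python's native indexing).
        "code_points": len(prefix),
        # Number of 16-bit units in the UTF-16 encoding of the prefix.
        "utf16_code_units": len(prefix.encode("utf-16-le")) // 2,
    }


line = "🚀 Woo"
print(offsets_for(line, line.index("W")))
# prints {'utf8_bytes': 5, 'code_points': 2, 'utf16_code_units': 3}
```

The three results (5, 2 and 3) match the offsets quoted in the enum comments, so a consumer that guesses the wrong scheme would point at the wrong character.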
1 parent beb6593 commit 19f7d9e

7 files changed: +3384 -2777 lines changed

bindings/go/scip/scip.pb.go

Lines changed: 513 additions & 402 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

bindings/haskell/src/Proto/Scip.hs

Lines changed: 1259 additions & 1042 deletions

bindings/haskell/src/Proto/Scip_Fields.hs

Lines changed: 6 additions & 0 deletions

bindings/rust/src/generated/scip.rs

Lines changed: 1473 additions & 1326 deletions
Large diffs are not rendered by default.

bindings/typescript/scip.ts

Lines changed: 29 additions & 0 deletions

docs/scip.md

Lines changed: 61 additions & 6 deletions

scip.proto

Lines changed: 43 additions & 1 deletion
@@ -45,7 +45,8 @@ message Metadata {
   // directory.
   string project_root = 3;
   // Text encoding of the source files on disk that are referenced from
-  // `Document.relative_path`.
+  // `Document.relative_path`. This value is unrelated to the `Document.text`
+  // field, which is a Protobuf string and hence must be UTF-8 encoded.
   TextEncoding text_document_encoding = 4;
 }
 
@@ -102,8 +103,46 @@ message Document {
   // can be used for other purposes as well, for example testing or when working
   // with virtual/in-memory documents.
   string text = 5;
+
+  // Specifies the encoding used for source ranges in this Document.
+  //
+  // Usually, this will match the type used to index the string type
+  // in the indexer's implementation language in O(1) time.
+  // - For an indexer implemented in JVM/.NET language or JavaScript/TypeScript,
+  //   use UTF16CodeUnitOffsetFromLineStart.
+  // - For an indexer implemented in Python,
+  //   use UTF8CodeUnitOffsetFromLineStart.
+  // - For an indexer implemented in Go, Rust or C++,
+  //   use UTF8ByteOffsetFromLineStart.
+  PositionEncoding position_encoding = 6;
 }
 
+// Encoding used to interpret the 'character' value in source ranges.
+enum PositionEncoding {
+  // Default value. This value should not be used by new SCIP indexers
+  // so that a consumer can process the SCIP index without ambiguity.
+  UnspecifiedPositionEncoding = 0;
+  // The 'character' value is interpreted as a byte offset,
+  // assuming that the text for the line is encoded as UTF-8.
+  //
+  // Example: For the string "🚀 Woo" in UTF-8, the bytes are
+  // [240, 159, 154, 128, 32, 87, 111, 111], so the offset for 'W'
+  // would be 5.
+  UTF8ByteOffsetFromLineStart = 1;
+  // The 'character' value is interpreted as an offset in terms
+  // of UTF-8 code units.
+  //
+  // Example: For the string "🚀 Woo", the UTF-8 code units are
+  // ['🚀', ' ', 'W', 'o', 'o'], so the offset for 'W' would be 2.
+  UTF8CodeUnitOffsetFromLineStart = 2;
+  // The 'character' value is interpreted as an offset in terms
+  // of UTF-16 code units.
+  //
+  // Example: For the string "🚀 Woo", the UTF-16 code units are
+  // ['\ud83d', '\ude80', ' ', 'W', 'o', 'o'], so the offset for 'W'
+  // would be 3.
+  UTF16CodeUnitOffsetFromLineStart = 3;
+}
 
 // Symbol is similar to a URI, it identifies a class, method, or a local
 // variable. `SymbolInformation` contains rich metadata about symbols such as
@@ -594,6 +633,9 @@ message Occurrence {
   // line/character values before displaying them in an editor-like UI because
   // editors conventionally use 1-based numbers.
   //
+  // The 'character' value is interpreted based on the PositionEncoding for
+  // the Document.
+  //
   // Historical note: the original draft of this schema had a `Range` message
   // type with `start` and `end` fields of type `Position`, mirroring LSP.
   // Benchmarks revealed that this encoding was inefficient and that we could
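
A consumer written in Python might apply the `PositionEncoding` of a Document roughly as follows. This is a minimal sketch, not the generated bindings: the `PositionEncoding` class below is a hypothetical stand-in that only mirrors the enum values in this diff, and `to_str_index` is an illustrative helper name. It also assumes that `UTF8CodeUnitOffsetFromLineStart` counts whole characters, as the '🚀' example in the schema comment suggests, and that a consumer chooses to reject `UnspecifiedPositionEncoding` rather than guess.

```python
from enum import IntEnum


class PositionEncoding(IntEnum):
    # Hypothetical stand-in mirroring the enum added in scip.proto.
    UnspecifiedPositionEncoding = 0
    UTF8ByteOffsetFromLineStart = 1
    UTF8CodeUnitOffsetFromLineStart = 2
    UTF16CodeUnitOffsetFromLineStart = 3


def to_str_index(line: str, character: int, encoding: PositionEncoding) -> int:
    """Convert a 'character' value into a Python string index for `line`."""
    if encoding == PositionEncoding.UTF8ByteOffsetFromLineStart:
        # Decode the first `character` bytes and count the resulting characters.
        return len(line.encode("utf-8")[:character].decode("utf-8"))
    if encoding == PositionEncoding.UTF16CodeUnitOffsetFromLineStart:
        # Each UTF-16 code unit is two bytes in the little-endian encoding.
        return len(line.encode("utf-16-le")[: 2 * character].decode("utf-16-le"))
    if encoding == PositionEncoding.UTF8CodeUnitOffsetFromLineStart:
        # Assumed to match Python's own code point indexing.
        return character
    # Assumption of this sketch: refuse to guess when nothing is specified.
    raise ValueError("position encoding is unspecified; offsets are ambiguous")


line = "🚀 Woo"
index = to_str_index(line, 5, PositionEncoding.UTF8ByteOffsetFromLineStart)
print(line[index])  # prints W
```

An offset that falls in the middle of a multi-byte character or a surrogate pair makes `decode` raise `UnicodeDecodeError`, which surfaces a malformed range instead of silently rounding it.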
