-
-
Notifications
You must be signed in to change notification settings - Fork 768
Add support for the FTS5 trigram tokenizer #1655
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: development
Are you sure you want to change the base?
Changes from 1 commit
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -74,6 +74,48 @@ public struct FTS5 { | |
#endif | ||
} | ||
|
||
#if GRDBCUSTOMSQLITE || GRDBCIPHER | ||
/// Options for trigram tokenizer character matching. Matches the raw | ||
/// "case_sensitive" and "remove_diacritics" tokenizer arguments. | ||
/// | ||
/// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer> | ||
public enum TrigramTokenizerMatching: Sendable { | ||
/// Case insensitive matching without removing diacritics. This | ||
/// option matches the raw "case_sensitive=0 remove_diacritics=0" | ||
/// tokenizer argument. | ||
case caseInsensitive | ||
/// Case insensitive matching that removes diacritics before | ||
/// matching. This option matches the raw | ||
/// "case_sensitive=0 remove_diacritics=1" tokenizer argument. | ||
case caseInsensitiveRemovingDiacritics | ||
/// Case sensitive matching. Diacritics are not removed when | ||
/// performing case sensitive matching. This option matches the raw | ||
/// "case_sensitive=1 remove_diacritics=0" tokenizer argument. | ||
case caseSensitive | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have read the SQLite Documentation which says:
Still, I would suggest exposing two distinct options for The reasons for this suggestion are:
Some GRDB APIs, such as Maybe we could aim at: // Public usage
t.tokenizer = trigram(
caseSensitive: true,
diacritics: .remove) API: public struct FTS5TokenizerDescriptor {
/// The "trigram" tokenizer.
///
/// For example:
///
/// ```swift
/// try db.create(virtualTable: "book", using: FTS5()) { t in
/// t.tokenizer = .trigram()
/// }
/// ```
///
/// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer>
///
/// - parameters:
/// - caseSensitive: Unless specified, performs a case insensitive matching.
/// - diacritics: Unless specified, diacritics are not removed before matching.
public static func trigram(
caseSensitive: FTS5.TrigramCaseSensitiveOption? = nil,
diacritics: FTS5.TrigramDiacriticsOption? = nil,
) -> FTS5TokenizerDescriptor { ... }
}
extension FTS5 {
/// Case sensitivity options for the Trigram FTS5 tokenizer.
/// Matches the raw "case_sensitive" tokenizer argument.
///
/// Related SQLite documentation: <https://www.sqlite.org/fts5.html#the_trigram_tokenizer>
public struct TrigramCaseSensitiveOption: RawRepresentable {
var rawValue: Int
}
extension TrigramCaseSensitiveOption: ExpressibleByBooleanLiteral {
/// When true, matches the "case_sensitive=1" trigram tokenizer argument.
/// When false, it is "case_sensitive=0".
public init(booleanLiteral value: Bool) {
self = value ? .init(rawValue: 1) : .init(rawValue: 0)
}
}
/// Diacritics options for the Trigram FTS5 tokenizer.
/// Matches the raw "remove_diacritics" tokenizer argument.
///
/// Related SQLite documentation: <https://www.sqlite.org/fts5.html#the_trigram_tokenizer>
public struct TrigramDiacriticsOption: RawRepresentable {
var rawValue: Int
/// Do not remove diacritics. This option matches the raw
/// "remove_diacritics=0" trigram tokenizer argument.
public static let keep = Self(rawValue: 0)
/// Remove diacritics. This option matches the raw
/// "remove_diacritics=1" trigram tokenizer argument.
public static let remove = Self(rawValue: 1)
}
} The There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Re-reading myself with a critic eye, I stick to my initial arguments about the "SQLite smell", but I really start to wonder if t.tokenizer = trigram()
t.tokenizer = trigram(caseSensitive: 1)
t.tokenizer = trigram(caseSensitive: 0, removeDiacritics: 1)
public struct FTS5TokenizerDescriptor {
public static func trigram(
caseSensitive: Int? = nil,
removeDiacritics: Int? = nil
) -> FTS5TokenizerDescriptor { ... }
} This would pass my review like a breeze :-) Yes, this would make the Swift FTS5 apis not very consistent. Keeping the old existing apis is very important, because all code that users can put in a migration must basically compile forever (modifying a migration is a cardinal sin). This should not prevent new apis from being shaped by the experience that was acquired over the years. I leave it up to your good taste 😄 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for the detailed feedback! When using the raw (In theory we could also have two There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Ah. I see. Yes it wouldn't be great if people could write So are we back to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think that makes sense, that makes it possible to add availability checks while allowing the user to use raw values to bypass restrictions if needed. A couple of minor questions:
public static func trigram(
caseSensitive: FTS5.TrigramCaseSensitiveOption? = nil,
diacritics: FTS5.TrigramDiacriticsOption? = nil,
) -> FTS5TokenizerDescriptor { ... }
|
||
#else | ||
/// Options for trigram tokenizer character matching. Matches the raw | ||
/// "case_sensitive" and "remove_diacritics" tokenizer arguments. | ||
/// | ||
/// Related SQLite documentation: <https://sqlite.org/fts5.html#the_trigram_tokenizer> | ||
@available(iOS 15, macOS 12, tvOS 15, watchOS 8, *) // SQLite 3.35.0+ (3.34 actually) | ||
public enum TrigramTokenizerMatching: Sendable { | ||
/// Case insensitive matching without removing diacritics. This | ||
/// option matches the raw "case_sensitive=0 remove_diacritics=0" | ||
/// tokenizer argument. | ||
case caseInsensitive | ||
/// Case insensitive matching that removes diacritics before | ||
/// matching. This option matches the raw | ||
/// "case_sensitive=0 remove_diacritics=1" tokenizer argument. | ||
@available(*, unavailable, message: "Requires a future OS release that includes SQLite >=3.45") | ||
case caseInsensitiveRemovingDiacritics | ||
/// Case sensitive matching. Diacritics are not removed when | ||
/// performing case sensitive matching. This option matches the raw | ||
/// "case_sensitive=1 remove_diacritics=0" tokenizer argument. | ||
case caseSensitive | ||
} | ||
#endif | ||
|
||
/// Creates an FTS5 module. | ||
/// | ||
/// For example: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -148,11 +148,11 @@ extension FTS5Tokenizer { | |
private func tokenize(_ string: String, for tokenization: FTS5Tokenization) | ||
throws -> [(token: String, flags: FTS5TokenFlags)] | ||
{ | ||
try ContiguousArray(string.utf8).withUnsafeBufferPointer { buffer -> [(String, FTS5TokenFlags)] in | ||
try string.utf8CString.withUnsafeBufferPointer { buffer -> [(String, FTS5TokenFlags)] in | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ❤️ |
||
guard let addr = buffer.baseAddress else { | ||
return [] | ||
} | ||
let pText = UnsafeMutableRawPointer(mutating: addr).assumingMemoryBound(to: CChar.self) | ||
let pText = addr | ||
let nText = CInt(buffer.count) | ||
|
||
var context = TokenizeContext() | ||
|
Uh oh!
There was an error while loading. Please reload this page.