What behavior of the library made you think about the improvement?
In the current Index abstraction over the outlines-core regex engine, the concept of a final state is misleading: it does not actually signify that the regex has been fully matched or that a client can safely stop generating tokens.
While final_states includes states that can transition to themselves on the eos_token_id, this transition is not enforced or surfaced in a way that makes it obvious when a valid, complete match has occurred.
This creates a problem for client code: even after reaching a state marked as final, the match may not be truly complete unless an additional eos_token_id transition is made explicitly. Yet judging only by Index::is_final_state, the client may incorrectly assume it is safe to exit.
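To make the pitfall concrete, here is a minimal self-contained sketch. The transition table, token ids, and the `can_finish` helper are all invented for illustration (a hand-rolled model of a DFA for `a+`), not outlines-core API: a state can sit in `final_states` while both an EOS transition and further continuations are available, so a "final" flag alone cannot tell the client when to stop.

```rust
use std::collections::{HashMap, HashSet};

type StateId = u32;
type TokenId = u32;

const A: TokenId = 0; // hypothetical token id for "a"
const EOS: TokenId = 9; // hypothetical eos_token_id

/// A toy transition table modeling the shape described above: a DFA for `a+`
/// where state 1 is "final" (it has an EOS self-loop) yet still allows more
/// `a` tokens. This is NOT the outlines-core Index, only a hand-rolled model.
fn build_toy_index() -> (
    HashMap<StateId, HashMap<TokenId, StateId>>,
    HashSet<StateId>,
) {
    let mut transitions = HashMap::new();
    transitions.insert(0, HashMap::from([(A, 1)])); // start --a--> 1
    transitions.insert(1, HashMap::from([(A, 1), (EOS, 1)])); // 1 --a--> 1, 1 --eos--> 1
    (transitions, HashSet::from([1]))
}

/// The completion check the issue asks for: a match can terminate in `state`
/// only if an explicit EOS transition exists from it.
fn can_finish(
    transitions: &HashMap<StateId, HashMap<TokenId, StateId>>,
    state: StateId,
) -> bool {
    transitions
        .get(&state)
        .map_or(false, |t| t.contains_key(&EOS))
}

fn main() {
    let (transitions, final_states) = build_toy_index();
    // State 1 is marked final, but the walk is only complete once the EOS
    // transition is actually taken; meanwhile, longer matches remain possible.
    assert!(final_states.contains(&1));
    assert!(transitions[&1].contains_key(&A)); // continuation still allowed
    assert!(can_finish(&transitions, 1)); // may terminate via EOS
    assert!(!can_finish(&transitions, 0)); // start state cannot terminate
    println!("state 1: final, can continue, and can finish via EOS");
}
```

The point of the sketch: "final" answers "could this be an end?", while the client actually needs "may I stop here, and have I taken the terminating transition?".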
Why This Matters:
Consumers using the Index to validate token sequences—especially in contexts like constrained decoding—need a reliable signal that a match is valid and complete. The current API makes it easy to:
Reach a "final" state,
Exit early,
And unknowingly produce an incomplete match.
This is especially problematic when using Index to validate partial decoding paths during generation.
How would you like it to behave?
Potential Fixes / Suggestions:
Clarify Semantics in Docs:
Clearly state that final_states does not guarantee regex completion unless followed by eos_token_id.
Introduce is_accepting_state(&self, state: &StateId) -> bool:
Add a method that checks whether a state is both in final_states and has a valid eos_token_id transition—i.e., a true "accepting" state.
Strengthen Invariants in next_state:
Consider enforcing or guiding the use of eos_token_id transitions when appropriate. Or, make these transitions more visible through a specialized method.
Provide Utility to Check for Completion:
A method like can_finish(&self, state: &StateId) -> bool would be extremely helpful to validate match completion.
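Taken together, the two proposed helpers might look like the following sketch. The `IndexStub` below is a minimal stand-in carrying only the fields needed to express the checks; it is not the real outlines-core `Index`, and the method names `is_accepting_state` / `can_finish` are the suggestion, not existing API.

```rust
use std::collections::{HashMap, HashSet};

type StateId = u32;
type TokenId = u32;

/// Minimal stand-in for the Index, holding just enough state to express the
/// proposed checks (NOT the real outlines-core type).
struct IndexStub {
    eos_token_id: TokenId,
    final_states: HashSet<StateId>,
    transitions: HashMap<StateId, HashMap<TokenId, StateId>>,
}

impl IndexStub {
    /// Proposed: true when the match can terminate in `state` via an
    /// explicit eos_token_id transition.
    fn can_finish(&self, state: &StateId) -> bool {
        self.transitions
            .get(state)
            .map_or(false, |t| t.contains_key(&self.eos_token_id))
    }

    /// Proposed: a true "accepting" state is both final and terminable.
    fn is_accepting_state(&self, state: &StateId) -> bool {
        self.final_states.contains(state) && self.can_finish(state)
    }
}

fn main() {
    // A two-state example: state 1 is final and has the eos self-loop.
    let index = IndexStub {
        eos_token_id: 9,
        final_states: HashSet::from([1]),
        transitions: HashMap::from([
            (0, HashMap::from([(0, 1)])),
            (1, HashMap::from([(0, 1), (9, 1)])),
        ]),
    };
    assert!(index.is_accepting_state(&1));
    assert!(index.can_finish(&1));
    assert!(!index.is_accepting_state(&0));
    println!("accepting = final + eos transition");
}
```

With `final_states` built from eos self-loops (as the issue describes), the two checks coincide in practice, but naming them separately makes the completion contract explicit at the call site.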
A related abstraction that tries to address this:
```rust
use crate::errors::ConstrainedDecodingError;
use outlines_core::index::Index;
use outlines_core::json_schema;
use outlines_core::prelude::StateId;
use outlines_core::primitives::TokenId;
use outlines_core::vocabulary::Vocabulary;
use rustc_hash::FxBuildHasher;
use std::collections::HashMap;

/// A trait to guide the decoding process.
///
/// Based on PyGuide from outlines-core.
pub trait Guide {
    /// Returns true if it is generally possible to advance the guide.
    ///
    /// When it returns false, the client shall stop the generation.
    fn can_advance(&self) -> bool;

    /// Returns true if the given token is allowed for the next transition from the current state.
    fn is_allowed(&self, token_id: &u32) -> bool;

    /// Attempts to transition to the next state using the given token.
    ///
    /// The client can invoke `is_allowed` to verify a safe transition.
    ///
    /// # Errors
    ///
    /// * `ConstrainedDecodingError::TokenNotAllowed` if the token is not allowed for the current state.
    /// * `ConstrainedDecodingError::EndOfIndex` if the guide has already reached the end of the index.
    fn advance(&mut self, token_id: &u32) -> Result<(), ConstrainedDecodingError>;

    /// Returns true if the current state is a final state.
    ///
    /// This means the guide has seen a complete sequence, though the client needs to invoke
    /// `can_advance` to know whether this is a fully terminal state or whether other
    /// continuations are still possible. An example of such behavior is a greedy pattern
    /// that reports completion early but still allows a longer sequence.
    fn is_finished(&self) -> bool;
}

/// A guide that uses a regex to guide the decoding process.
///
/// Based on PyGuide from outlines-core.
#[derive(Debug, Clone)]
pub struct RegexGuide {
    index: Index,
    // (state, allowed_tokens)
    state: Option<StateId>,
    // optimized structure for the tokens lookup
    allowed_tokens: ArrayBasedLookup,
}

impl RegexGuide {
    /// Creates a new regex guide.
    ///
    /// # Arguments
    ///
    /// * `regex` - The regex to guide the decoding process.
    /// * `eos_token_id` - The token id of the end-of-sequence token.
    /// * `tokens` - A map of tokens to their ids.
    pub fn new(
        regex: &str,
        eos_token_id: u32,
        tokens: &HashMap<&str, u32>,
    ) -> Result<Self, ConstrainedDecodingError> {
        let tokens_map: HashMap<String, Vec<TokenId>, FxBuildHasher> = tokens
            .iter()
            .map(|(k, v)| (k.to_string(), vec![*v]))
            // strip the eos token id from the allowed tokens
            .filter(|(_, v)| v.iter().any(|id| id != &eos_token_id))
            .collect::<HashMap<String, Vec<TokenId>, FxBuildHasher>>();
        let vocabulary = Vocabulary::try_from((eos_token_id, tokens_map))
            .map_err(ConstrainedDecodingError::BackendError)?;
        let index =
            Index::new(regex, &vocabulary).map_err(ConstrainedDecodingError::BackendError)?;
        let initial_state = index.initial_state();
        // The lookups happen against the logits array, whose size may be larger
        // than the vocabulary. Prepare the lookup table based on the max token id.
        // +1 for the eos token.
        let max_token_id: usize = tokens
            .values()
            .chain(std::iter::once(&eos_token_id))
            .max()
            .unwrap_or(&eos_token_id)
            .to_owned() as usize
            + 1;
        let allowed_tokens = ArrayBasedLookup::new(max_token_id);
        let mut guide = Self {
            index,
            state: None,
            allowed_tokens,
        };
        // set at the beginning
        guide.set_state(Some(initial_state));
        Ok(guide)
    }

    fn set_state(&mut self, new_state: Option<StateId>) {
        match new_state {
            Some(state) => match self.index.transitions().get(&state) {
                Some(transitions) => {
                    self.allowed_tokens.reset_with(transitions);
                    self.state = Some(state);
                }
                None => {
                    self.state = None;
                }
            },
            None => {
                self.state = None;
            }
        }
    }

    /// Creates a new regex guide from a json schema.
    ///
    /// # Arguments
    ///
    /// * `schema` - The json schema to guide the decoding process.
    /// * `eos_token_id` - The token id of the end-of-sequence token.
    /// * `tokens` - A map of tokens to their ids.
    pub fn from_json_schema(
        schema: &str,
        eos_token_id: u32,
        tokens: HashMap<&str, u32>,
    ) -> Result<Self, ConstrainedDecodingError> {
        let regex = json_schema::regex_from_str(schema, None)
            .map_err(ConstrainedDecodingError::JsonSchemaError)?;
        Self::new(&regex, eos_token_id, &tokens)
    }
}

impl Guide for RegexGuide {
    fn can_advance(&self) -> bool {
        match &self.state {
            Some(_) => !self.allowed_tokens.is_empty(),
            None => false,
        }
    }

    fn is_allowed(&self, token_id: &u32) -> bool {
        match &self.state {
            Some(_) => self.allowed_tokens.contains(token_id),
            None => false,
        }
    }

    fn advance(&mut self, token_id: &u32) -> Result<(), ConstrainedDecodingError> {
        match &self.state {
            Some(state) => {
                if !self.is_allowed(token_id) {
                    // user didn't check it
                    return Err(ConstrainedDecodingError::TokenNotAllowed);
                }
                match self.index.next_state(state, token_id) {
                    Some(next_state) => {
                        self.set_state(Some(next_state));
                        Ok(())
                    }
                    None => {
                        // should not be reached; consider it the end of the index
                        self.set_state(None);
                        Ok(())
                    }
                }
            }
            None => Err(ConstrainedDecodingError::EndOfIndex),
        }
    }

    fn is_finished(&self) -> bool {
        match &self.state {
            Some(state) => self.index.is_final_state(state),
            None => true,
        }
    }
}

#[derive(Debug, Clone)]
pub struct ArrayBasedLookup {
    tokens: Box<[bool]>,
    is_empty: bool,
}

impl ArrayBasedLookup {
    fn new(max_size: usize) -> Self {
        Self {
            tokens: vec![false; max_size].into_boxed_slice(),
            is_empty: true,
        }
    }

    fn reset_with(&mut self, state_map: &HashMap<TokenId, StateId, FxBuildHasher>) -> &Self {
        self.tokens.iter_mut().for_each(|m| *m = false);
        assert!(state_map.len() <= self.tokens.len());
        for token_id in state_map.keys() {
            self.tokens[*token_id as usize] = true;
        }
        self.is_empty = state_map.is_empty();
        self
    }

    fn contains(&self, token_id: &u32) -> bool {
        // the lookup may happen on a value outside of the vocabulary
        if *token_id >= self.tokens.len() as u32 {
            return false;
        }
        self.tokens[*token_id as usize]
    }

    fn is_empty(&self) -> bool {
        self.is_empty
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    fn create_guide_1() -> (RegexGuide, u32, HashMap<String, u32>) {
        let regex = "0|[1-9][0-9]{1,2}";
        let eos_token_id = 4;
        let tokens = HashMap::from_iter([("blah", 0), ("1a", 1), ("2", 2), ("0", 3)]);
        (
            RegexGuide::new(regex, eos_token_id, &tokens).expect("State tracker failed"),
            eos_token_id,
            tokens
                .into_iter()
                .map(|(k, v)| (k.to_string(), v))
                .collect(),
        )
    }

    #[test]
    fn guide_1_initial() {
        let (mut guide, eos_token_id, tokens) = create_guide_1();
        assert!(!guide.is_finished());
        assert!(guide.can_advance());
        assert!(!guide.is_allowed(&tokens["blah"]));
        assert!(!guide.is_allowed(&tokens["1a"]));
        assert!(guide.is_allowed(&tokens["2"]));
        assert!(guide.is_allowed(&tokens["0"]));
        assert!(!guide.is_allowed(&eos_token_id));
        assert!(matches!(
            guide.clone().advance(&tokens["blah"]),
            Err(ConstrainedDecodingError::TokenNotAllowed)
        ));
        assert!(matches!(
            guide.clone().advance(&tokens["1a"]),
            Err(ConstrainedDecodingError::TokenNotAllowed)
        ));
        // try 2nd branch
        assert!(matches!(guide.clone().advance(&tokens["2"]), Ok(())));
        // go 1st branch
        assert!(matches!(guide.advance(&tokens["0"]), Ok(())));
        assert!(guide.can_advance());
        assert!(!guide.is_allowed(&tokens["blah"]));
        assert!(!guide.is_allowed(&tokens["1a"]));
        assert!(!guide.is_allowed(&tokens["2"]));
        assert!(!guide.is_allowed(&tokens["0"]));
        assert!(guide.is_allowed(&eos_token_id));
    }

    #[test]
    fn guide_1_short_branch() {
        let (mut guide, eos_token_id, tokens) = create_guide_1();
        // first
        assert!(guide.advance(&tokens["0"]).is_ok());
        assert!(guide.is_finished());
        assert!(guide.can_advance());
        assert!(!guide.is_allowed(&tokens["blah"]));
        assert!(!guide.is_allowed(&tokens["1a"]));
        assert!(!guide.is_allowed(&tokens["2"]));
        assert!(!guide.is_allowed(&tokens["0"]));
        assert!(guide.is_allowed(&eos_token_id));
        assert!(matches!(
            guide.advance(&tokens["blah"]),
            Err(ConstrainedDecodingError::TokenNotAllowed)
        ));
        assert!(matches!(
            guide.advance(&tokens["1a"]),
            Err(ConstrainedDecodingError::TokenNotAllowed)
        ));
        assert!(matches!(
            guide.advance(&tokens["2"]),
            Err(ConstrainedDecodingError::TokenNotAllowed)
        ));
        assert!(matches!(
            guide.advance(&tokens["0"]),
            Err(ConstrainedDecodingError::TokenNotAllowed)
        ));
        // end
        assert!(matches!(guide.advance(&eos_token_id), Ok(())));
        assert!(!guide.can_advance());
        assert!(guide.is_finished());
        assert!(matches!(
            guide.advance(&tokens["blah"]),
            Err(ConstrainedDecodingError::EndOfIndex)
        ));
        assert!(matches!(
            guide.advance(&tokens["1a"]),
            Err(ConstrainedDecodingError::EndOfIndex)
        ));
        assert!(matches!(
            guide.advance(&tokens["2"]),
            Err(ConstrainedDecodingError::EndOfIndex)
        ));
        assert!(matches!(
            guide.advance(&tokens["0"]),
            Err(ConstrainedDecodingError::EndOfIndex)
        ));
    }

    #[test]
    fn guide_1_long_branch() {
        let (mut guide, eos_token_id, tokens) = create_guide_1();
        // first
        assert!(guide.advance(&tokens["2"]).is_ok());
        assert!(!guide.is_finished());
        assert!(guide.can_advance());
        assert!(!guide.is_allowed(&tokens["blah"]));
        assert!(!guide.is_allowed(&tokens["1a"]));
        assert!(guide.is_allowed(&tokens["2"]));
        assert!(guide.is_allowed(&tokens["0"]));
        assert!(!guide.is_allowed(&eos_token_id));
        // second
        assert!(guide.advance(&tokens["0"]).is_ok());
        assert!(guide.can_advance());
        assert!(guide.is_finished());
        assert!(!guide.is_allowed(&tokens["blah"]));
        assert!(!guide.is_allowed(&tokens["1a"]));
        assert!(guide.is_allowed(&tokens["2"]));
        assert!(guide.is_allowed(&tokens["0"]));
        assert!(guide.is_allowed(&eos_token_id));
        // third
        assert!(guide.advance(&tokens["2"]).is_ok());
        assert!(guide.can_advance());
        assert!(guide.is_finished());
        assert!(!guide.is_allowed(&tokens["blah"]));
        assert!(!guide.is_allowed(&tokens["1a"]));
        assert!(!guide.is_allowed(&tokens["2"]));
        assert!(!guide.is_allowed(&tokens["0"]));
        assert!(guide.is_allowed(&eos_token_id));
        // eos
        assert!(guide.advance(&eos_token_id).is_ok());
        assert!(!guide.can_advance());
        assert!(guide.is_finished());
        assert!(!guide.is_allowed(&tokens["blah"]));
        assert!(!guide.is_allowed(&tokens["1a"]));
        assert!(!guide.is_allowed(&tokens["2"]));
        assert!(!guide.is_allowed(&tokens["0"]));
        assert!(!guide.is_allowed(&eos_token_id));
    }
}
```
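For reference, a client-side decode loop over a guide like the one above might look like the following sketch. The `Guide` trait is re-declared here in minimal form to keep the snippet self-contained, and `ScriptedGuide` / `run_to_completion` are invented stand-ins for the example, not outlines-core code. The key point is the exit condition: a match counts as complete only when `is_finished()` holds and `can_advance()` no longer does, i.e. after the terminating transition has been consumed.

```rust
/// Minimal copy of the Guide surface needed for the loop sketch.
trait Guide {
    fn can_advance(&self) -> bool;
    fn is_allowed(&self, token_id: &u32) -> bool;
    fn advance(&mut self, token_id: &u32) -> Result<(), String>;
    fn is_finished(&self) -> bool;
}

/// A scripted guide that accepts exactly the fixed token sequence in `expected`
/// (the last entry playing the role of the eos token).
struct ScriptedGuide {
    expected: Vec<u32>,
    pos: usize,
}

impl Guide for ScriptedGuide {
    fn can_advance(&self) -> bool {
        self.pos < self.expected.len()
    }
    fn is_allowed(&self, token_id: &u32) -> bool {
        self.expected.get(self.pos) == Some(token_id)
    }
    fn advance(&mut self, token_id: &u32) -> Result<(), String> {
        if !self.is_allowed(token_id) {
            return Err("token not allowed".into());
        }
        self.pos += 1;
        Ok(())
    }
    fn is_finished(&self) -> bool {
        self.pos == self.expected.len()
    }
}

/// Feed tokens while the guide can advance; the match is complete only when
/// the guide is finished AND no further advance is possible.
fn run_to_completion(guide: &mut dyn Guide, stream: &[u32]) -> bool {
    for token in stream {
        if !guide.can_advance() {
            break;
        }
        if guide.advance(token).is_err() {
            return false;
        }
    }
    guide.is_finished() && !guide.can_advance()
}

fn main() {
    // Full sequence including the terminating token: complete.
    let mut g = ScriptedGuide { expected: vec![1, 2, 9], pos: 0 };
    assert!(run_to_completion(&mut g, &[1, 2, 9]));
    // Stopping before the terminating token: not complete.
    let mut g2 = ScriptedGuide { expected: vec![1, 2, 9], pos: 0 };
    assert!(!run_to_completion(&mut g2, &[1, 2]));
    println!("completion requires consuming the terminating transition");
}
```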