Skip to content

dogsteven/search_utilities

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 

Repository files navigation

My company has an internal search utilities library that is used in many projects. The library is about providing the frontend an easy-to-use elasticsearch query building syntax so they can query whatever they want.

The company's implementation is bad for the following reasons:

  1. it treats "query" and "search request" as the same concept.
  2. it claimed to be able to handle tree-based syntax, in fact it's not, and the parsing implementation is handwritten, with no lexical analyzer, and a weird parsing algorithm that mixes many level of abstractions.
  3. it's SearchEngine classes are weirdly implemented, they do many things in the search method that can't be customized.
  4. the performance is acceptable for their cases, but the code looks crazy: after getting the identifier list from Elasticsearch, it loads entities one-by-one, making multiple network round trips.

This repository is my attempt to solve this mess.

The entity_loader module provides functionalities for loading entities from an identifier list. The base EntityLoader interface doesn't guarantee that the returned entities will be in order. To solve the ordering problem, the OrderingEntityLoaderDecorator uses the following strategy:

  1. it segments the identifier list into segments of identifiers via a ListSegmenter provided based on the entity type.
  2. for each segment, it loads the corresponding entities via a wrapped EntityLoader, then sorts the entity list based on their positions in the identifier list via a PositionBasedListSorter, and appends the sorted entity list into the result entity list.

By providing a ListSegmenterFactory implementation, you can tune performance for this decorator, segment size should not be too small to reduce the number of network round trips, and it should not be too large either because of sorting overhead.

Currently, there are two provided implementations for PositionBasedListSorter, the first one and the most basic is using the standard Java sort method, which is efficient enough for most use cases. The other is using transposition-based algorithm from Group Theory, which is extremely optimized and memory efficient, but is extremely dangerous if the assumptions of the algorithm aren't met. The most common scenario is that the application gets an identifier list from the search engine right before some entities are removed from the main database, when the application query the database, removed entities will be absent from the list, and the algorithm will fail. Thus, I provide a FallbackListSorterDecorator which fallbacks to a safe algorithm if the fast algorithm fails, which is I think better than sanitizing the identifier list before sorting in terms of optimization.

I also provide two JPA implementations for EntityLoader. The first one is using SELECT ... IN ... statement, which is extremely fast, and supported by most databases. The second one is using inline table expression, which is efficient, and ordering-guaranteed, which means you don't need the ordering decorator for this implementation, the only downside is the syntax isn't widely supported. Note that the inline table expression implementation may put pressure on the database due to sorting, so if you want to share the pressure to your application instances, the first implementation is preferred.

In the searching module, I maintain a boolean expression hierarchy named Expression, which covers my use cases. The hierarchy doesn't depend on Elasticsearch (although it's inspired by Elasticsearch) so you can use this hierarchy for other search engines as well. The hierarchy is shipped with a partial boolean expression optimizer which covers common scenarios including "nested not", "nested and", "nested or", "and of many not", "or of many not". The hierarchy is also shipped with a parser for a simple tree-based syntax called FoxQueryLanguage, which is generated by Antlr4. The hierarchy is of course shipped with double dispatcher/visitor, so you can extend the polymorphic behaviour of the hierarchy to whatever you want. Some applications of this double dispatcher are the default expression optimizer and the Expression to Elasticsearch Query translator.

The FqlQueryBuilder translates a FoxQueryLanguage expression into an Elasticsearch Query object. It's not actually a builder class in OOP sense, I just didn't know that to name it.

I wrote this repository in 1 day, so it hasn't had unit testing yet. Some parts are algorithmic, they need formal proofs, which are easy to write.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published